English translation of the Slovene Natural Language Inference Dataset SI-NLI-en 1.0

PID

SI-NLI-en is an English translation of the SI-NLI Slovene Natural Language Inference Dataset (http://hdl.handle.net/11356/1707). The English version was compiled by first using machine translation (DeepL) to translate all the premises and hypotheses from SI-NLI into English. The machine translations were then manually checked and corrected by a group of 7 students of translation at the University of Ljubljana. Each translator was given both the Slovene premise and all its hypotheses as well as the translations of both the premise and the hypotheses, so the translations were not checked in isolation, but as units to ensure maximum semantic coherence.

Just like SI-NLI, SI-NLI-en contains 5,937 sentence pairs (premise and hypothesis) that are manually labeled with the labels "entailment", "contradiction", and "neutral". The dataset is split into train, validation, and test sets, with sizes of 4,392, 547, and 998.

The dataset is released in a tabular TSV format. The 00README.txt file contains a description of the attributes. Only the hypothesis and premise are provided in the test set (with no annotations) since SI-NLI-en is integrated into the Slovene evaluation framework SloBENCH (https://slobench.cjvt.si/). If you use the dataset to train your models, please consider submitting the test set predictions to SloBENCH to get the evaluation score and see how it compares to others.

Identifier
PID http://hdl.handle.net/11356/1934
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1934
Provenance
Creator Klemen, Matej; Žagar, Aleš; Čibej, Jaka; Robnik-Šikonja, Marko
Publisher Faculty of Computer and Information Science, University of Ljubljana
Publication Year 2024
Rights Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0); https://creativecommons.org/licenses/by-nc-sa/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language English
Resource Type corpus
Format text/plain; charset=utf-8; application/zip; downloadable_files_count: 1
Discipline Linguistics