SloWIC is a Slovenian dataset for the Word in Context task. Each example contains a target word with multiple meanings and two sentences that both contain that word. Each example is also annotated with a label indicating whether both sentences use the target word in the same meaning. The dataset contains 1808 manually annotated sentence pairs and an additional 13150 automatically annotated pairs to help with training larger models.
The dataset is stored in JSON, following the format used in the SuperGLUE version of the Word in Context task (https://super.gluebenchmark.com/).
Each example contains the following data fields:
- word: The target word with multiple meanings
- sentence1: The first sentence containing the target word
- sentence2: The second sentence containing the target word
- idx: The index of the example in the dataset
- label: Label indicating whether both sentences use the same meaning of the target word
- start1: Start of the target word in the first sentence
- start2: Start of the target word in the second sentence
- end1: End of the target word in the first sentence
- end2: End of the target word in the second sentence
- version: The version of the annotation
- manual_annotation: Boolean indicating whether the label was manually annotated
- group: The group of annotators that labelled the example
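
The fields above can be read with a few lines of Python. The following is a minimal sketch, not an official loader. It assumes one JSON object per line (as in the SuperGLUE WiC files), that start1/end1 and start2/end2 are character offsets with an exclusive end, and that label is a boolean. The two records, including their version and group values, are invented English placeholders and are not taken from SloWIC; the function names are hypothetical.

```python
import json

# Invented placeholder records (NOT from SloWIC); field names follow the list above.
# Assumption: offsets are character positions and the end offset is exclusive.
SAMPLE_LINES = [
    json.dumps({
        "word": "bank",
        "sentence1": "She sat on the bank of the river.",
        "sentence2": "He opened an account at the bank.",
        "idx": 0, "label": False,
        "start1": 15, "end1": 19, "start2": 28, "end2": 32,
        "version": "v1", "manual_annotation": True, "group": "A",
    }),
    json.dumps({
        "word": "bat",
        "sentence1": "The bat flew out of the cave.",
        "sentence2": "A bat hung from the ceiling.",
        "idx": 1, "label": True,
        "start1": 4, "end1": 7, "start2": 2, "end2": 5,
        "version": "v1", "manual_annotation": False, "group": "A",
    }),
]


def parse_examples(lines):
    """Parse an iterable of JSON strings, one example per non-empty string."""
    return [json.loads(line) for line in lines if line.strip()]


def target_spans(example):
    """Return the target word as it appears in each of the two sentences."""
    span1 = example["sentence1"][example["start1"]:example["end1"]]
    span2 = example["sentence2"][example["start2"]:example["end2"]]
    return span1, span2


def split_by_annotation(examples):
    """Separate manually annotated examples from automatically annotated ones."""
    manual = [e for e in examples if e["manual_annotation"]]
    automatic = [e for e in examples if not e["manual_annotation"]]
    return manual, automatic


examples = parse_examples(SAMPLE_LINES)
manual, automatic = split_by_annotation(examples)
for example in examples:
    print(example["idx"], example["word"], target_spans(example), example["label"])
```

To read a real file under the same assumptions, pass an open file handle to parse_examples instead of SAMPLE_LINES. Slicing with the offsets rather than searching for the word is useful when the target appears in an inflected form, which a plain string search for the word field would miss.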