This dataset is meant for evaluation of systems for semantic change detection in Slovenian.
The "semantic_shift_gs_dataset" folder contains 3 files:
1) "gigafida_to_1997_vs_2018.tsv" - contains texts from the Gigafida 2.0 reference corpus, dating either from 1997 (or earlier) or from 2018. The corpus, which can be used for training, domain adaptation or word representation extraction, is in .tsv format with 5 columns:
- 'title': Title of the text
- 'publisher': Text's publisher name
- 'date': Year of the text's publication
- 'type': Text's type (e.g., whether text was scraped from the internet or it appeared in print)
- 'text': Text in non-processed form
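
A minimal Python sketch of how the corpus file could be split into the two time periods. It assumes that the first row of the file is a header naming the five columns above and that 'date' holds a plain year; neither is confirmed here, so check the file first. The function name split_by_period and the sample rows are made up for illustration.

```python
import csv
import io


def split_by_period(handle):
    """Group texts into an early (1997 or earlier) and a late (2018) slice."""
    # Assumption: the first row is a header with the five column names.
    # QUOTE_NONE keeps quotation marks inside the raw text untouched.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    slices = {"early": [], "late": []}
    for row in reader:
        year = int(row["date"])  # assumption: 'date' is a plain year
        key = "early" if year <= 1997 else "late"
        slices[key].append(row["text"])
    return slices


# Tiny made-up sample standing in for "gigafida_to_1997_vs_2018.tsv".
sample = io.StringIO(
    "title\tpublisher\tdate\ttype\ttext\n"
    "A\tP1\t1995\tprint\tprvo besedilo\n"
    "B\tP2\t2018\tinternet\tdrugo besedilo\n"
)
slices = split_by_period(sample)
print(len(slices["early"]), len(slices["late"]))
```

For the real file, pass an open file handle (opened with encoding="utf-8" and newline="") instead of the sample. If some texts are very long, the field size limit may need to be raised with csv.field_size_limit.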
2) "word_usage_annotations_1997_2018.tsv" - contains example word usages for 105 predefined words. For each word, we extracted from the Gigafida 2.0 corpus 30 usage examples (sentences) from 1997 and 30 usage examples from 2018. The sentences from the two time periods were randomly matched (i.e. each pair contains a random sentence from 1997 and a random sentence from 2018, both containing the same target word), resulting in 3150 sentence pairs. These pairs were annotated by three human annotators on a scale from 1 to 4:
1: usages in the sentences are unrelated
2: usages in the sentences are distantly related
3: usages in the sentences are closely related
4: usages are identical, i.e. they have the same sense
Label 0 was also allowed, meaning "I can't decide", e.g. due to insufficient context.
The file is in .tsv format and contains the following columns:
- 'id': id of the sentence pair
- 'word': target word
- 'sentence 1997': sentence from year 1997
- 'sentence 2018': sentence from year 2018
- 'score_anno1': score given by annotator 1
- 'score_anno2': score given by annotator 2
- 'score_anno3': score given by annotator 3
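
A minimal Python sketch of reading the annotation file and flagging sentence pairs in which at least one annotator used label 0. It assumes a header row with the column names listed above and scores stored as numbers; the function name read_annotations, the 'undecided' field and the sample rows are made up for illustration.

```python
import csv
import io

SCORE_COLUMNS = ["score_anno1", "score_anno2", "score_anno3"]


def read_annotations(handle):
    """Return one dict per sentence pair, with the three scores as integers."""
    # Assumption: the first row is a header with the column names above.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    pairs = []
    for row in reader:
        # int(float(...)) accepts both "4" and "4.0".
        scores = [int(float(row[col])) for col in SCORE_COLUMNS]
        pairs.append({
            "id": row["id"],
            "word": row["word"],
            "scores": scores,
            # True if any annotator chose label 0 ("I can't decide").
            "undecided": 0 in scores,
        })
    return pairs


# Tiny made-up sample standing in for "word_usage_annotations_1997_2018.tsv".
sample = io.StringIO(
    "id\tword\tsentence 1997\tsentence 2018\t"
    "score_anno1\tscore_anno2\tscore_anno3\n"
    "1\tbeseda\tstavek a\tstavek b\t4\t3\t4\n"
    "2\tbeseda\tstavek c\tstavek d\t2\t0\t3\n"
)
pairs = read_annotations(sample)
print(sum(p["undecided"] for p in pairs), "of", len(pairs), "pairs contain a 0")
```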
3) "semantic_shift_scores.tsv" - contains the final "gold standard" scores for each word, obtained by averaging scores across sentence pairs and across all three annotators in order to obtain a single numerical value for each word in the list. Examples (sentence pairs) containing zeros were excluded from the averaging, and the word 'zenit' was removed from the list because too many of its sentence pairs contained zeros.
The file is in .tsv format and contains the following columns:
- 'word': target word
- 'score': semantic change score
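
A minimal Python sketch of the averaging described above. It is one reading of the description, not necessarily the exact procedure used to produce the file: a sentence pair is dropped as a whole if any of its three scores is 0, and all remaining scores for a word are then averaged. The function name word_scores and the sample data are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def word_scores(pairs):
    """Average the annotator scores per word, skipping pairs with a 0."""
    by_word = defaultdict(list)
    for word, scores in pairs:
        # Assumption: a pair is excluded entirely if any annotator gave 0.
        if 0 in scores:
            continue
        by_word[word].extend(scores)
    # Words whose pairs were all excluded do not appear in the result.
    return {word: mean(scores) for word, scores in by_word.items()}


# Made-up (word, scores) tuples; real input would come from the annotation file.
pairs = [
    ("beseda", (4, 4, 3)),
    ("beseda", (2, 0, 3)),  # dropped: contains a 0
    ("beseda", (3, 3, 4)),
    ("miza", (1, 2, 1)),
]
result = word_scores(pairs)
for word, score in sorted(result.items()):
    print(word, round(score, 3))
```

Because every kept pair has exactly three scores, pooling all scores for a word gives the same value as averaging the per-pair means.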