- Error-annotated developmental corpus Šolar 2.0 Error
  The corpus contains 2094 texts from the corpus Šolar 2.0 (http://hdl.handle.net/11356/1214), i.e. only those in which error annotations can be found. For each text, the...
- Frequency lists of collocations from the Gigafida 2.1 corpus
  Frequency lists of collocations were extracted from the Gigafida 2.1 Corpus of Written Standard Slovene (https://www.clarin.si/noske/run.cgi/corp_info?corpname=gfida21) using...
- CMC training corpus Janes-Tag 3.0
  Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC) consisting of about 15,000 short texts (190,000 words), mostly tweets but also blogs,...
- CMC training corpus Janes-Syn 1.0
  Janes-Syn is a syntactically annotated corpus of Slovene tweets and is meant as a gold-standard training and testing dataset for syntactic annotation of Slovene...
- Slovene ontology of semantic types for nouns SLONEST-noun 1.0
  SLONEST stands for Slovene Ontologies of Semantic Types. The first subset – SLONEST-noun 1.0 – represents an ontology developed for nouns. SLONEST-noun contains an XML file with...
- Slovene learner corpus KOST 1.0
  The corpus of Slovene as a foreign language KOST (Korpus slovenščine kot tujega jezika) contains 6,311 texts (just over 1 million words) written by adult speakers for whom...
- List of word relations from the Sloleks 2.0 lexicon 1.0
  This entry consists of a TSV file containing a list of 66,347 Slovene word pairs from the Sloleks Morphological Lexicon of Slovene (v2.0; http://hdl.handle.net/11356/1230) that...
- CMC training corpus Janes-Tag 2.0
  Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...
- Frequency lists of word-level n-grams from the Gigafida 2.0 corpus
  Frequency lists of word-level n-grams (or word sets) were extracted from the Gigafida 2.0 Corpus of Written Standard Slovene (https://viri.cjvt.si/gigafida/) using the LIST...
- Frequency lists of word-level n-grams from the GOS 1.0 corpus 1.1
  Frequency lists of word-level n-grams (or word sets) were extracted from the GOS 1.0 Corpus of Spoken Slovene (http://hdl.handle.net/11356/1040) using the LIST corpus extraction...
- Consonant-vowel structures in the GOS 1.0 corpus 1.1
  The lists contain consonant-vowel structures of all lemmas, word forms, and standardized word forms in the GOS 1.0 Corpus of Spoken Slovene (http://hdl.handle.net/11356/1040)....
- Reference List of Slovene Frequent Common Words
  The reference list of Slovene most frequent common words was prepared by selecting vocabulary at the intersection of the most frequent 10,000 lemmas of four Slovene text...
- Frequency lists of word parts from the GOS 1.0 corpus 1.1
  Frequency lists of words split into word parts were extracted from the GOS 1.0 Corpus of Spoken Slovene (http://hdl.handle.net/11356/1040) using the LIST corpus extraction tool...
- Consonant-vowel structures in the Gigafida 2.0 corpus
  The lists contain consonant-vowel structures of all lemmas and word forms in the Gigafida 2.0 corpus. In each unit, its characters were converted as follows: C - consonant (in...
- CMC training corpus Janes-Norm 1.0
  Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation,...
- CMC training corpus Janes-Tag 1.0
  Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...
- Developmental corpus (without language corrections) Šolar 2.0 Clear
  Šolar 2.0 Clear is an adapted version of the Šolar 2.0 corpus, cf. http://hdl.handle.net/11356/1214. The Šolar 2.0 Clear corpus consists of texts written by students in Slovene...
- List of word relations from the Sloleks 2.0 lexicon 1.1
  This entry consists of a TSV file containing a list of 66,347 Slovene word pairs from the Sloleks Morphological Lexicon of Slovene (v2.0; http://hdl.handle.net/11356/1230) that...
- Annotated corpora and tools of the PARSEME Shared Task on Automatic Identific...
  This multilingual resource contains corpora in which verbal MWEs have been manually annotated. VMWEs include idioms (let the cat out of the bag), light-verb constructions (make...
- PARSEME corpora annotated for verbal multiword expressions (version 1.3)
  This multilingual resource contains corpora in which verbal MWEs have been manually annotated. VMWEs include idioms (let the cat out of the bag), light-verb constructions (make...