-
Valency lexicon extracted from the Gigafida 2.1 corpus
The valency lexicon was extracted from the Gigafida 2.1 Corpus of Written Standard Slovene (https://www.clarin.si/noske/run.cgi/corp_info?corpname=gfida21) using specialized... -
The Orange workflow for observing collocation trends ColTrend 1.0
ColTrend is a workflow (.OWS file) for Orange Data Mining (an open-source machine learning and data... -
Corpus of Written Standard Slovene Gigafida 2.0
Gigafida 2.0, with about 1.1 billion words, is a reference corpus of written Slovene text published in the period 1990-2018. It comprises daily news, magazines, a... -
Corpus of textbooks for learning Slovenian as L2 KUUS 1.0
The KUUS corpus comprises 17 textbooks for Slovenian as a second and foreign language published between 2002 and 2022 at the Centre for Slovene as a Second and Foreign Language... -
Multiword Expressions lexicon extracted from the Gigafida 2.1 corpus
The MWE lexicon was extracted from the Gigafida 2.1 Corpus of Written Standard Slovene (https://www.clarin.si/noske/run.cgi/corp_info?corpname=gfida21) using specialized scripts... -
Consonant-vowel structures in the GOS 1.0 corpus
The lists contain consonant-vowel structures of all lemmas, word forms, and normalized word forms in the GOS 1.0 Corpus of Spoken Slovene (http://hdl.handle.net/11356/1040). In... -
Corpus of textbooks for learning Slovenian as L2 ccKUUS 2.0
The ccKUUS 2.0 corpus consists of a set of two textbooks and two workbooks for learning Slovenian as a second and foreign language, aimed at adolescents. Published by the Centre... -
Corpus extraction tool LIST 1.3
The LIST corpus extraction tool is a Java program for extracting lists from text corpora on the levels of characters, word parts, words, and word sets. It supports VERT and TEI... -
Training corpus ssj500k 2.2
The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation.... -
Morphological lexicon Sloleks 2.0
Sloleks is the reference morphological lexicon for the Slovenian language, developed to be used in NLP applications and language manuals. Encoded in LMF XML, the lexicon contains... -
Developmental corpus ccŠolar 1.0
The ccŠolar corpus contains 1693 texts collected during 2016-2018 as part of the upgrade of the corpus Šolar project. The project's aims were to increase the size of the Šolar... -
Training corpus ssj500k 2.1
The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation.... -
Dataset for evaluation of Slovene spell- and grammar-checking tools Šolar-Eva...
Šolar-Eval is a specialized dataset designed for the evaluation of Slovene spell- and grammar-checking tools and methodologies. It encompasses 109 essays authored by Slovene... -
CMC training corpus Janes-Norm 1.2
Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation,... -
Developmental corpus Šolar 3.0
The Developmental corpus Šolar consists of 5,485 texts written by students in Slovenian secondary schools (age 15-19) and pupils in the 7th-9th grade of primary school (13-15),... -
Comprehensive Slovenian-Hungarian Dictionary 2.0
The Comprehensive Slovenian-Hungarian dictionary is a general bilingual dictionary that is being compiled at the Centre for Language Resources and Technologies of the University... -
Thesaurus of Modern Slovene 2.0
Thesaurus of Modern Slovene is the largest automatically generated open-access collection of Slovene synonyms. It is sourced from the data in two principal language resources:... -
Frequency lists of word-level n-grams from the GOS 1.0 corpus
Frequency lists of word-level n-grams (or word sets) were extracted from the GOS 1.0 Corpus of Spoken Slovene (http://hdl.handle.net/11356/1040) using the LIST corpus extraction... -
Comprehensive Slovenian-Hungarian Dictionary 1.0
The Comprehensive Slovenian-Hungarian dictionary is a general bilingual dictionary that is being compiled at the Centre for Language Resources and Technologies of the University... -
CMC training corpus Janes-Tag 1.1
Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...