-
English translation of the Slovene Natural Language Inference Dataset SI-NLI-...
SI-NLI-en is an English translation of the SI-NLI Slovene Natural Language Inference Dataset (http://hdl.handle.net/11356/1707). The English version was compiled by first using... -
CMC training corpus Janes-Tag 2.1
Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence... -
Training corpus SUK 1.1
The SUK training corpus contains about 1 million tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation, with... -
Monitor corpus of Slovene Trendi 2023-02
The Trendi corpus is a monitor corpus of Slovene. It contains news from 107 different media websites, published by 72 different publishers. Trendi 2023-02 covers the period from... -
Monitor corpus of Slovene Trendi 2024-06
The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 74 publishers. Trendi 2024-06 covers the period from January... -
Frequency lists of words from the Gigafida 2.0 corpus
Frequency lists of words were extracted from the Gigafida 2.0 Corpus of Written Standard Slovene (https://viri.cjvt.si/gigafida/) using the LIST corpus extraction tool... -
Frequency list of words by source from the Trendi corpus 2022-07
The frequency list of words by source was prepared in the following manner: words (i.e. lemmas with their lexical features) were extracted from the 15 most frequent sources in the... -
Morphological lexicon Sloleks 3.0
Sloleks is a reference morphological lexicon of Slovene that was developed to be used in various NLP applications and language manuals. It contains Slovene lemmas, their... -
Open Slovene WordNet OSWN 1.0
Open Slovene WordNet (OSWN) is derived from Open English WordNet (https://en-word.net/), which itself is derived from Princeton WordNet by the Open English WordNet Community.... -
Morphological patterns from the Sloleks 2.0 lexicon 1.0
This entry consists of XML files with 96,290 lexical units (nouns, verbs, adjectives, and adverbs) from the Sloleks Morphological Lexicon of Slovene 2.0... -
Monitor corpus of Slovene Trendi 2024-01
The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 70 publishers. Trendi 2024-01 covers the period from January... -
Parallel sense-annotated corpus ELEXIS-WSD 1.1
ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.1 contains sentences for 10... -
Thesaurus of Modern Slovene 1.0
This is an automatically created Slovene thesaurus from Slovene data available in a comprehensive English–Slovenian dictionary, a monolingual dictionary, and a corpus. A network... -
Monitor corpus of Slovene Trendi 2023-09
The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 70 publishers. Trendi 2023-09 covers the period from January... -
Frequency lists of word-level n-grams from the Trendi corpus 2020
Frequency lists of word-level n-grams (or word sets) were extracted from the Trendi Monitor Corpus of Slovene (version 2022-05: http://hdl.handle.net/11356/1590) using the LIST... -
Thesaurus of Modern Slovene 1.0 (ELEXIS)
Slovar sopomenk sodobne slovenščine 1.0. This is an automatically created Slovene thesaurus from Slovene data available in a comprehensive English–Slovenian dictionary, a... -
Valency lexicon extracted from the Gigafida 2.1 corpus
The valency lexicon was extracted from the Gigafida 2.1 Corpus of Written Standard Slovene (https://www.clarin.si/noske/run.cgi/corp_info?corpname=gfida21) using specialized... -
The Orange workflow for observing collocation trends ColTrend 1.0
ColTrend is a workflow (.OWS file) for Orange Data Mining (an open-source machine learning and data... -
Corpus of Written Standard Slovene Gigafida 2.0
Gigafida 2.0, with about 1.1 billion words, is a reference corpus of written Slovene text published in the period 1990-2018. It comprises daily news, magazines, a... -
Multiword Expressions lexicon extracted from the Gigafida 2.1 corpus
The MWE lexicon was extracted from the Gigafida 2.1 Corpus of Written Standard Slovene (https://www.clarin.si/noske/run.cgi/corp_info?corpname=gfida21) using specialized scripts...