CLARIN - Repositories

Training corpus SUK 1.1

The SUK training corpus contains about 1 million tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation, with...

Monitor corpus of Slovene Trendi 2023-02

The Trendi corpus is a monitor corpus of Slovene. It contains news from 107 different media websites, published by 72 different publishers. Trendi 2023-02 covers the period from...

Monitor corpus of Slovene Trendi 2024-06

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 74 publishers. Trendi 2024-06 covers the period from January...

Frequency lists of words from the Gigafida 2.0 corpus

Frequency lists of words were extracted from the Gigafida 2.0 Corpus of Written Standard Slovene (https://viri.cjvt.si/gigafida/) using the LIST corpus extraction tool...

Training corpus jos1M 1.2

The jos1M corpus contains 1 million words of sampled paragraphs from the Gigafida corpus. It is meant to serve as a training corpus for word-level tagging of Slovene. This...

Morphological lexicon Sloleks 3.0

Sloleks is a reference morphological lexicon of Slovene that was developed to be used in various NLP applications and language manuals. It contains Slovene lemmas, their...

xLiMe Twitter Corpus XTC 1.0.1

The xLiMe Twitter Corpus contains tweets in German, Italian and Spanish manually annotated with part-of-speech, named entities, and message-level sentiment polarity. In total,...

Written corpus ccKres 1.0

The ccKres corpus consists of 9,376 documents, each containing information about the source (e.g. newspapers, magazines), year of publication, text type (fiction, newspaper), the...

Open Slovene WordNet OSWN 1.0

Open Slovene WordNet (OSWN) is derived from Open English WordNet (https://en-word.net/), which itself is derived from Princeton WordNet by the Open English WordNet Community....

Learners' corpus Šolar 1.0

Šolar consists of 2,703 texts written by students in Slovene secondary schools (age 15-19) and pupils in the 7th-9th grade of primary school (13-15), with a small percentage...

Morphological patterns from the Sloleks 2.0 lexicon 1.0

This entry consists of XML files with 96,290 lexical units (nouns, verbs, adjectives, and adverbs) from the Sloleks Morphological Lexicon of Slovene 2.0...

Character-level part-of-speech tagger of Slovene language

A part-of-speech tagger for the Slovene language, implemented using convolutional and LSTM neural networks. The tagger uses a character-level representation of sentences. The tagger has been...

Dialogue act annotated spoken corpus GORDAN 1.0 (audio/video)

The GORDAN 1.0 corpus contains authentic data of spoken communication, annotated for dialogue acts. This entry contains the complete audio files of the corpus (seven wav files,...

Monitor corpus of Slovene Trendi 2024-01

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 70 publishers. Trendi 2024-01 covers the period from January...

Parallel sense-annotated corpus ELEXIS-WSD 1.1

ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.1 contains sentences for 10...

Thesaurus of Modern Slovene 1.0

This is an automatically created Slovene thesaurus from Slovene data available in a comprehensive English–Slovenian dictionary, a monolingual dictionary, and a corpus. A network...

Training corpus ssj500k 1.4

The ssj500k training corpus contains 500,000 words, manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, named...

Automatically stress labelled morphological lexicon Sloleks 1.2, version 1.1

This lexicon is an extended version of Sloleks 1.2, http://hdl.handle.net/11356/1039. It contains all the original data from Sloleks with added information about the stress of...

Monitor corpus of Slovene Trendi 2023-09

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 70 publishers. Trendi 2023-09 covers the period from January...

Thesaurus of Modern Slovene 1.0 (ELEXIS)

Slovar sopomenk sodobne slovenščine 1.0. This is an automatically created Slovene thesaurus from Slovene data available in a comprehensive English–Slovenian dictionary, a...

121 datasets found