CLARIN - Repositories

Corpus of texts by Hijacint Repič in "Cvetje z vertov sv. Frančiška" CVET 1.0

The CVET corpus contains 230 texts (around 175 thousand words) of varying length, published in the religious journal "Cvetje z vertov sv. Frančiška" between 1887 and 1916, when...

CMC training corpus Janes-Tag 1.0

Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...

Serbian Twitter training corpus ReLDI-NormTag-sr 1.0

ReLDI-NormTag-sr 1.0 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word...

Multilingual comparable corpora of parliamentary debates ParlaMint 2.0

ParlaMint is a multilingual set of comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2020, with each corpus being about 20 million...

Corpus of scientific texts from the Open Science Slovenia portal OSS 1.0

OSS is a large collection of scientific writing in the Slovenian language gathered from the Open Science Slovenia portal (https://openscience.si). It consists of over 150...

Spoken corpus Gos 2.1 (transcriptions)

The spoken corpus Gos 2.1 is the reference speech corpus of the Slovenian language. This second edition contains about 300 hours of speech, or 2.4 million words, 127 thousand...

Serbian Twitter training corpus ReLDI-NormTagNER-sr 2.0

ReLDI-NormTagNER-sr 2.0 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...

Croatian Twitter training corpus ReLDI-NormTagNER-hr 3.0

ReLDI-NormTagNER-hr 3.0 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...

Reference corpus of historical Slovene goo300k 1.2

goo300k is a manually annotated reference corpus of historical Slovene. It contains 1,100 pages (about 300,000 tokens) sampled from 89 texts from the period 1584–1899. Each text...