-
CMC shortening corpus Janes-Kratko 1.0
Janes-Kratko is a corpus of Slovene tweets manually annotated with shortening phenomena according to the supplied typology covering different types of spelling, lexical and... -
Ukrainian parliamentary corpus ParlaMint-UA 4.0.1
The Ukrainian parliamentary corpus ParlaMint-UA 4.0.1 is an extended version of the ParlaMint-UA 4.0 corpus (available as a collection of plain texts along with TSV metadata of... -
Training corpus ssj500k 2.2
The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation.... -
Multilingual comparable corpora of parliamentary debates ParlaMint 1.0
ParlaMint is a multilingual set of comparable corpora containing parliamentary debates mostly starting at the end of 2015 and extending to mid-2020, with each corpus being about... -
Lexicon of historical Slovene imp25k 1.1
The imp25k lexicon of historical Slovene was created automatically from the goo300k and foo3M annotated corpora and contains attested and manually verified word forms and their... -
Training corpus ssj500k 2.1
The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation.... -
Training corpus ssj500k 2.0
The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation.... -
Parallel corpus of idiomatic text ParaDiom 1.0
ParaDiom is a parallel corpus with sentences sampled from existing corpora. The corpus contains 1,000 Slovene sentences with their English translation and 1,000 English... -
Serbian Twitter training corpus ReLDI-NormTag-sr 1.1
ReLDI-NormTag-sr 1.1 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word... -
CMC training corpus Janes-Norm 1.2
Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation,... -
Corpus of Academic Slovene (PhD theses) KAS-dr 1.0
The KAS-dr corpus of Slovene PhD theses consists of almost 1,600 texts (266 thousand pages or 100 million tokens) written between 2000 and 2018 and gathered from the digital libraries of... -
CMC training corpus Janes-Tag 1.1
Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence... -
Corpus of longer narrative Slovenian prose KDSP 1.0
The KDSP corpus contains 262 texts of longer older Slovenian narrative prose. The texts were published between 1836 and 1918 and are at least 20,000 words long. The texts have... -
Epigraphic corpus of Medieval and Early Modern inscriptions in Slovenia MEMIS...
The Epigraphic corpus of Mediaeval and Early Modern inscriptions in Slovenia collects carefully made transcriptions of Latin inscriptions that are found or have been discovered... -
Speech Database of Spoken Flight Information Enquiries SOFES 1.0
The SOFES speech database (Spoken Flight Enquiries in Slovene) is a collection of transcribed and segmented audio recordings of spoken flight-information enquiries in Slovene.... -
Slovenian parliamentary corpus (1990-2018) siParl 1.0
The siParl corpus contains minutes of the Assembly of the Republic of Slovenia for the 11th legislative period (1990-1992), minutes of the National Assembly of the Republic of... -
CMC training corpus Janes-Tag 3.0
Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC) consisting of about 15,000 short texts (190,000 words), mostly tweets but also blogs,... -
Spoken corpus Gos 2.0 (transcriptions)
The spoken corpus Gos 2.0 is the reference speech corpus of the Slovenian language. This second edition contains about 300 hours of speech, or 2.4 million words, 127 thousand... -
Croatian Twitter training corpus ReLDI-NormTagNER-hr 2.0
ReLDI-NormTagNER-hr 2.0 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,... -
Corpus of questions and answers of the Terminologišče terminological counsell...
Terminological counselling at the Terminologišče website is a service intended for the expert public facing specific terminological naming problems. When elaborating...