-
Corpus of Academic Slovene (PhD theses) KAS-dr 1.0
The KAS-dr corpus of Slovene PhD theses consists of almost 1,600 texts (266 thousand pages or 100 million tokens) written 2000 - 2018 and gathered from the digital libraries of... -
Corpus of academic Slovene KAS 1.0
The KAS corpus of Slovene academic writing consists of almost 65,000 BSc/BA, 16,000 MSc/MA and 1,600 PhD theses (82 thousand texts, 5 million pages or 1,7 billion tokens)... -
ŠUSS archive of questions and answers about the Slovenian language (1998-2010)
This corpus contains the Q&A archive of the ŠUSS language consultancy service. The ŠUSS internet forum was active 1998-2010. Questions posted by users were answered by a... -
Croatian Twitter training corpus ReLDI-NormTagNER-hr 2.1
ReLDI-NormTagNER-hr 2.1 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,... -
Serbian Twitter training corpus ReLDI-NormTagNER-sr 2.1
ReLDI-NormTagNER-sr 2.1 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,... -
Corpus of Informatics DSI 5.0
The DSI corpus is meant as a terminological resource for the field of informatics, esp. for the development of the on-line terminological dictionary of informatics, Islovar... -
CMC training corpus Janes-Tag 2.1
Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence... -
Corpus of "Attacks on the Yugoslav National Army" (1989) VAYNA 1.1
The corpus of the "Attacks on the Yugoslav National Army" is one of the oldest corpora (1989) compliled at the Jožef Stefan Institute, made with the purpose of analysing the... -
Slovenian parliamentary corpus siParl 1.0 (1990-2018)
The siParl corpus contains minutes of the Assembly of the Republic of Slovenia for 11th legislative period 1990-1992, minutes of the National Assembly of the Republic of... -
Spoken corpus Gos VideoLectures 4.0 (audio)
Gos VideoLectures is an add-on to the Gos reference corpus of spoken Slovene (http://hdl.handle.net/11356/1040), and covers public academic speech. The Gos VideoLectures corpus... -
Training corpus jos1M 1.2
The jos1M corpus contains 1 million words of sampled paragraphs from the Gigafida corpus. It is meant to serve as a training corpus for word-level tagging of Slovene. This... -
Training corpus ssj500k 2.2
The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation.... -
Training corpus SETimes.SR 1.0
The SETimes.SR training corpus contains 86 726 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic... -
JRC EU DGT Translation Memory Parsebank DGT-UD 1.0
DGT-UD is a 2 billion word 23-language parallel syntactically parsed corpus, which consists of the JRC DGT translation memory of European law, automatically annotated with... -
Training corpus hr500k 1.0
The hr500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and... -
Training corpus ssj500k 2.1
The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation.... -
Croatian language corpus Riznica 0.1
The Croatian Language Corpus was built between 2007 and 2011 at the Institute of Croatian Language and Linguistics in the scope of the research programme "Hrvatska jezična... -
English-Montenegrin parallel corpus of subtitles Opus-MontenegrinSubs 1.0
This corpus contains parallel English-Montenegrin subtitles collected in the scope of conducting a linguistic and translatological research by Petar Božović for his PhD thesis... -
Serbian Twitter training corpus ReLDI-NormTagNER-sr 2.0
ReLDI-NormTagNER-sr 2.0 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,... -
Croatian Twitter training corpus ReLDI-NormTagNER-hr 2.0
ReLDI-NormTagNER-hr 2.0 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...