-
Monitor corpus of Slovene Trendi 2022-05
The Trendi corpus is a monitor corpus of Slovene. It contains news from 107 different media websites, published by 48 different publishers. Trendi 2022-05 covers the period from... -
Slovenian parliamentary corpus (1990-2018) siParl 2.0
The siParl corpus contains minutes of the Assembly of the Republic of Slovenia for the 11th legislative period 1990-1992, minutes of the National Assembly of the Republic of... -
Training corpus hr500k 1.0
The hr500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and... -
Monitor corpus of Slovene Trendi 2022-10
The Trendi corpus is a monitor corpus of Slovene. It contains news from 106 different media websites, published by 48 different publishers. Trendi 2022-10 covers the period from... -
CMC training corpus Janes-Norm 3.0
Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC) consisting of about 20,000 short texts (280,000 words), mostly tweets but also blogs,... -
Slovenian parliamentary corpus (1990-2022) siParl 3.0
The siParl corpus contains minutes of the Assembly of the Republic of Slovenia for the 11th legislative period 1990-1992, minutes of the National Assembly of the Republic of... -
Corpus of academic Slovene KAS 2.0
The KAS corpus of Slovene academic writing consists of almost 65,000 BSc/BA, 16,000 MSc/MA and 1,600 PhD theses (82 thousand texts, 5 million pages or 1.5 billion tokens)... -
English-Montenegrin parallel corpus of subtitles Opus-MontenegrinSubs 1.0
This corpus contains parallel English-Montenegrin subtitles collected as part of linguistic and translatological research conducted by Petar Božović for his PhD thesis... -
CMC training corpus Janes-Tag 2.1
Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence... -
Summarization datasets from the KAS corpus KAS-Sum 1.0
Summarization datasets were created from the text bodies in the KAS 2.0 corpus (http://hdl.handle.net/11356/1448) and the abstracts from the KAS-Abs 2.0 corpus... -
Multilingual comparable corpora of parliamentary debates ParlaMint 4.0
ParlaMint 4.0 is a set of comparable corpora containing transcriptions of parliamentary debates of 29 European countries and autonomous regions, mostly starting in 2015 and... -
Linguistically annotated multilingual comparable corpora of parliamentary deb...
ParlaMint 3.0 is a multilingual set of 26 comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2022, with the individual corpora... -
Training corpus SUK 1.1
The SUK training corpus contains about 1 million tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation, with... -
MULTEXT-East "1984" annotated corpus 4.0
The novel "1984" by George Orwell is the central component of the MULTEXT-East corpus. This parallel and sentence aligned corpus contains the novel in the English original... -
Corpus of combined Slovenian corpora metaFida 1.0
Slovenia has a large number of diverse corpora available for online analysis via the CLARIN.SI concordancers. However, if users are interested in the same queries across... -
Monitor corpus of Slovene Trendi 2023-02
The Trendi corpus is a monitor corpus of Slovene. It contains news from 107 different media websites, published by 72 different publishers. Trendi 2023-02 covers the period from... -
Blog post and comment corpus Janes-Blog 1.0
Janes-Blog is an annotated corpus of Slovene blogs from websites rtvslo.si and publishwall.si from the period 2006-10 to 2016-01. The corpus is structured into individual texts... -
ReLDI token+tag+lemma+NER web service for WebLicht
WebLicht (https://weblicht.sfs.uni-tuebingen.de/) registry entry for a web service comprising tokenisation, PoS tagging and Named Entity Recognition. Tool source files are... -
Linguistically annotated multilingual comparable corpora of parliamentary deb...
ParlaMint-en.ana 4.0 is the English machine translation of the ParlaMint.ana 4.0 (http://hdl.handle.net/11356/1860) set of corpora of parliamentary debates across Europe. The... -
Croatian language corpus Riznica 0.1
The Croatian Language Corpus was built between 2007 and 2011 at the Institute of Croatian Language and Linguistics in the scope of the research programme "Hrvatska jezična...