Linguistically annotated multilingual comparable corpora of parliamentary deb...
ParlaMint 3.0 is a multilingual set of 26 comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2022, with the individual corpora... -
Multilingual comparable corpora of parliamentary debates ParlaMint 3.0
ParlaMint 3.0 is a multilingual set of 26 comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2022, with the individual corpora... -
Corpus of term-annotated texts RSDO5 1.1
The RSDO5 corpus was compiled in order to serve as a training set for automatic term identification. It consists of 12 texts with 250,000 words and almost 38,000 manually... -
Collection of Slovenian paremiological units Pregovori 1.0
This corpus collects and annotates the extensive and highly valuable diachronic collection of Slovenian proverbs, 50 years and more in the making at the ZRC SAZU Institute of... -
Spoken corpus Gos VideoLectures 4.2 (transcription)
Gos VideoLectures is an add-on to the Gos reference corpus of spoken Slovene (http://hdl.handle.net/11356/1040), and covers public academic speech. It can be used for training... -
Spoken corpus Gos 1.1
Gos is a corpus of spoken Slovene that includes the transcripts of approximately 120 hours of speech recorded in various situations: radio and TV shows, school lessons and... -
Training corpus ssj500k 2.3
The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation.... -
Multilingual comparable corpora of parliamentary debates ParlaMint 2.1
ParlaMint 2.1 is a multilingual set of 17 comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2020, with each corpus being about 20... -
Linguistically annotated multilingual comparable corpora of parliamentary deb...
ParlaMint 2.1 is a multilingual set of 17 comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2020, with each corpus being about 20... -
Montenegrin web corpus meWaC 1.0
The Montenegrin web corpus meWaC was built by crawling the .me top-level domain in 2019. The corpus was near-deduplicated on paragraph level, normalised via transliteration into... -
Comparable corpora of South-Slavic Wikipedias CLASSLA-Wikipedia 1.0
This comparable corpus collection consists of Wikipedia dumps of the Bosnian, Croatian, Macedonian, Montenegrin, Serbian, Serbo-Croatian and Slovenian Wikipedia, harvested on... -
Corpus of Croatian news portals ENGRI (2014-2018)
The corpus consists of texts collected from the most popular (based on the Reuters Institute Digital News Report for 2018, retrieved from http://www.digitalnewsreport.org in... -
Corpus of Slovenian school texts SBSJ 1.0
The Corpus of Slovenian school texts is a lemmatized and POS-tagged specialized corpus that includes 428 short school texts written primarily by primary-school students from 1st... -
Linguistically annotated multilingual comparable corpora of parliamentary deb...
ParlaMint is a multilingual set of comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2020, with each corpus being about 20 million... -
Corpus of Written Standard Slovene Gigafida 2.0
Gigafida 2.0, with about 1.1 billion words, is a reference corpus of written Slovene text published in the period 1990-2018. It is comprised of daily news, magazines, a... -
The corpus of older Slovenian narrative prose PriLit 1.0
The PriLit corpus contains 37 texts of older Slovenian narrative prose by 12 authors. One text, Sreča v nesreči (Fortune in Misfortune) by Janez Cigler (first published in... -
Slovenian parliamentary corpus (1990-2018) siParl 2.0
The siParl corpus contains minutes of the Assembly of the Republic of Slovenia for the 11th legislative period 1990-1992, minutes of the National Assembly of the Republic of... -
Spoken Torlak dialect corpus 1.0 (transcription)
The Torlak corpus represents a spoken variety of the endangered Torlak dialect from the Timok area in Southeast Serbia. It comprises transcripts of interviews with the local... -
Corpus of Academic Slovene (BSc/BA theses) KAS-dipl 1.0
The KAS-dipl corpus of Slovene BSc/BA theses consists of almost 65,000 texts (3.5 million pages or 1.1 billion tokens) written 2000-2018 and gathered from the digital... -
Corpus of Academic Slovene (MSc/MA theses) KAS-mag 1.0
The KAS-mag corpus of Slovene MSc/MA theses consists of almost 16,000 texts (1.36 million pages or 500 million tokens) written 2000-2018 and gathered from the digital...