CLARIN - Repositories

Croatian Twitter training corpus ReLDI-NormTag-hr 1.1

ReLDI-NormTag-hr 1.1 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word...

Abstracts from the KAS corpus KAS-Abs 2.0

The KAS-Abs 2.0 corpus contains 125,202 automatically identified Slovenian and/or English abstracts from BSc/BA, MSc/MA, and PhD theses included in the KAS Corpus of Academic...

Linguistically annotated multilingual comparable corpora of parliamentary deb...

ParlaMint 2.1 is a multilingual set of 17 comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2020, with each corpus being about 20...

Machine Translation datasets from the KAS corpus KAS-MT 1.0

The Machine Translation datasets KAS-MT 1.0 contain automatically sentence-aligned Slovene and English plain-text abstracts from KAS-Abs 2.0 (http://hdl.handle.net/11356/1449)...

Corpus of term-annotated texts RSDO5 1.1

The RSDO5 corpus was compiled in order to serve as a training set for automatic term identification. It consists of 12 texts with 250,000 words and almost 38,000 manually...

News comment corpus Janes-News 1.0

Janes-News is an annotated corpus of comments on online news articles from websites rtvslo.si, mladina.si, and reporter.si from the period 2007-03 to 2015-01. The corpus is...

Forum corpus Janes-Forum 1.0

Janes-Forum is an annotated corpus of Slovene forums from websites med.over.net, avtomobilizem.com, and kvarkadabra.net from the period 2001-02 to 2015-01. The corpus is...

ŠUSS archive of questions and answers about the Slovenian language (1998-2010)

This corpus contains the Q&A archive of the ŠUSS language consultancy service. The ŠUSS internet forum was active 1998-2010. Questions posted by users were answered by a...

The CLASSLA-StanfordNLP model for morphosyntactic annotation of standard Mace...

This model for morphosyntactic annotation of standard Macedonian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the...

Offensive language dataset of Croatian, English and Slovenian comments FRENK 1.0

The FRENK dataset consists of comments on Facebook posts (news articles) of mainstream media outlets from Croatia, Great Britain, and Slovenia, on the topics of migrants and...

Serbian Twitter training corpus ReLDI-NormTagNER-sr 3.0

ReLDI-NormTagNER-sr 3.0 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...

MULTEXT-East "1984" document corpus 4.0

The novel "1984" by George Orwell is the central component of the MULTEXT-East corpus. This parallel and sentence aligned corpus contains the novel in the English original...

Spoken corpus Gos VideoLectures 2.0 (transcription)

Gos VideoLectures is an add-on to the Gos reference corpus of spoken Slovene (http://hdl.handle.net/11356/1040), and covers public academic speech. The Gos VideoLectures corpus...

Word embeddings CLARIN.SI-embed.sl 2.0

CLARIN.SI-embed.sl contains word embeddings induced from a large collection of Slovene texts composed of existing corpora of Slovene, e.g. GigaFida, Janes, KAS, slWaC, MaCoCu-sl,...

Croatian parliamentary corpus ParlaMeter-hr 1.0

The ParlaMeter-hr corpus contains minutes of the National Assembly of the Republic of Croatia and currently covers its VIth mandate (2016-11-15 - 2018-11-21). The corpus...

Monitor corpus of Slovene Trendi 2024-07

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 74 publishers. Trendi 2024-07 covers the period from January...

Training corpus SUK 1.0

The SUK training corpus contains about 1 million tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation, with...

Dictionary of Twitterese Janes-Dict 1.0

The Dictionary of Twitterese 1.0 is the first attempt at a lexicographic description of non-standard Slovene as found on Twitter. Version 1.0 contains 1,002 entries, of which...

Automatically constructed multiword lexicon slMWELex v0.5

The slMWELex lexicon is an automatically constructed lexicon of Slovene multiword expression candidates (mostly collocations) from the parsed KRES corpus by using the DepMWEx...

191 datasets found