CLARIN - Repositories

Prague Dependency Treebank - Consolidated 2.0 (PDT-C 2.0)

A manually annotated and genre-diversified language resource with rich linguistic information from morphology and syntax to semantics, the Prague Dependency Treebank –...

Prague Dependency Treebank - Consolidated 1.0 (PDT-C 1.0)

A richly annotated and genre-diversified language resource, the Prague Dependency Treebank – Consolidated 1.0 (PDT-C 1.0, or PDT-C for short hereafter) is a consolidated...

The image of Andrej Babiš and Mateusz Morawiecki in the context of a crisis situation...

A collection of articles from the Czech press about Mateusz Morawiecki (iDnes) and from the Polish press about Andrej Babiš (Rzeczpospolita)

[MCSQ]: The Multilingual Corpus of Survey Questionnaires

The Multilingual Corpus of Survey Questionnaires (MCSQ) is the first publicly available multilingual database composed of international survey texts. Its latest version...

Multilingual Constructicon (2017-10-16)

A multilingual constructicon.

ASPAC – Swedish-Czech (2017-10-16)

Part of the Amsterdam Slavic Parallel Aligned Corpus. The material is sentence-scrambled.

JRC EU DGT Translation Memory Parsebank DGT-UD 1.0

DGT-UD is a 2-billion-word, 23-language, syntactically parsed parallel corpus consisting of the JRC DGT translation memory of European law, automatically annotated with...

Multilingual comparable corpora of parliamentary debates ParlaMint 3.0

ParlaMint 3.0 is a multilingual set of 26 comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2022, with the individual corpora...

Linguistically annotated multilingual comparable corpora of parliamentary deb...

ParlaMint 2.1 is a multilingual set of 17 comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2020, with each corpus being about 20...

MULTEXT-East "1984" document corpus 4.0

The novel "1984" by George Orwell is the central component of the MULTEXT-East corpus. This parallel, sentence-aligned corpus contains the novel in the English original...

Concreteness and imageability lexicon MEGA.HR-Crossling

The lexicon contains concreteness and imageability predictions of words in 77 languages. The resource is built via supervised machine learning, using average human responses...

The multilingual sentiment dataset of parliamentary debates ParlaSent 1.0

The dataset consists of mid-length sentences from the parliamentary proceedings of Bosnia and Herzegovina, Croatia, Czechia, Serbia, Slovakia, Slovenia, and the United Kingdom,...

Linguistically annotated multilingual comparable corpora of parliamentary deb...

ParlaMint is a multilingual set of comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2020, with each corpus being about 20 million...

Multilingual comparable corpora of parliamentary debates ParlaMint 4.0

ParlaMint 4.0 is a set of comparable corpora containing transcriptions of parliamentary debates of 29 European countries and autonomous regions, mostly starting in 2015 and...

Linguistically annotated multilingual comparable corpora of parliamentary deb...

ParlaMint 3.0 is a multilingual set of 26 comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2022, with the individual corpora...

MULTEXT-East "1984" annotated corpus 4.0

The novel "1984" by George Orwell is the central component of the MULTEXT-East corpus. This parallel, sentence-aligned corpus contains the novel in the English original...

Parliamentary spoken corpus of Czech ParlaSpeech-CZ 1.0

The ParlaSpeech-CZ dataset is built from the transcripts of parliamentary proceedings available in the Czech part of the ParlaMint corpus, and the parliamentary recordings...

Linguistically annotated multilingual comparable corpora of parliamentary deb...

ParlaMint 4.0 is a set of comparable corpora containing transcriptions of parliamentary debates of 29 European countries and autonomous regions, mostly starting in 2015 and...

Linguistically annotated multilingual comparable corpora of parliamentary deb...

ParlaMint 4.1 is a set of comparable corpora containing transcriptions of parliamentary debates of 29 European countries and autonomous regions, mostly starting in 2015 and...

MULTEXT-East free lexicons 4.0

The MULTEXT-East morphosyntactic lexicons have a simple structure, where each line is a lexical entry with three tab-separated fields: (1) the word-form, the inflected form of...
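
The line-per-entry layout described above could be read along the following lines. This is only a minimal sketch in Python: the listing truncates the description after the first field, so the second and third fields are kept as opaque strings, and the names `LexiconEntry`, `field_2`, `field_3`, and `parse_lexicon` are hypothetical, as is the sample data.

```python
from typing import Iterable, Iterator, NamedTuple


class LexiconEntry(NamedTuple):
    word_form: str  # field 1: the inflected form, as named in the description
    field_2: str    # not described in the truncated text; kept as an opaque string
    field_3: str    # likewise kept as an opaque string


def parse_lexicon(lines: Iterable[str]) -> Iterator[LexiconEntry]:
    """Yield one entry per non-empty line of three tab-separated fields."""
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"line {line_no}: expected 3 tab-separated fields, got {len(fields)}"
            )
        yield LexiconEntry(*fields)


# Made-up sample lines; the placeholder values are not taken from the lexicons.
sample = ["walked\tAAA\tBBB\n", "\n", "cats\tCCC\tDDD\n"]
entries = list(parse_lexicon(sample))
```

Rejecting lines that do not split into exactly three fields is a design choice made for this sketch; a more lenient reader might log and skip them instead.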

434 datasets found