CLARIN - Repositories

W2C – Web to Corpus – Corpora

A set of corpora for 120 languages automatically collected from Wikipedia and the web. Collected using the W2C toolset: http://hdl.handle.net/11858/00-097C-0000-0022-60D6-1

Annotated corpora and tools of the PARSEME Shared Task on Semi-Supervised Ide...

This multilingual resource contains corpora in which verbal MWEs have been manually annotated, gathered on the occasion of the 1.2 edition of the PARSEME Shared Task on...

L1 Acquisition Gaby Cablitz

Language Acquisition corpus

UFAL Parallel Corpus of North Levantine 1.0

This is the first release of the UFAL Parallel Corpus of North Levantine, compiled by the Institute of Formal and Applied Linguistics (ÚFAL) at Charles University within the...

Extended CLEF eHealth 2013-2015 IR Test Collection

This package contains an extended version of the test collection used in the CLEF eHealth Information Retrieval tasks in 2013-2015. Compared to the original version, it...

Czech Malach Cross-lingual Speech Retrieval Test Collection

The package contains Czech recordings of the Visual History Archive, which consists of interviews with Holocaust survivors. The archive consists of audio recordings, four...

DiscoMT 2015 Shared Task on Pronoun Translation

The data set includes training, development and test data from the shared tasks on pronoun-focused machine translation and cross-lingual pronoun prediction from the EMNLP 2015...

DiscoMT 2017 Shared Task on Cross-lingual Pronoun Prediction

Data used in the 2017 shared task on cross-lingual pronoun prediction.

Universal Derivations v0.5

Universal Derivations (UDer) is a collection of harmonized lexical networks capturing word-formation, especially derivational relations, in a cross-linguistically consistent...

C4Corpus (CC BY-NC-ND part)

A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family, that has been extracted from CommonCrawl, the largest publicly...

Universal Dependencies 2.1

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

UDify Pretrained Model

Pretrained model weights for the UDify model, and extracted BERT weights in pytorch-transformers format. Note that these weights slightly differ from those used in the paper.

CorPipe 24 Multilingual CorefUD 1.2 Model (corpipe24-corefud1.2-240906)

The corpipe24-corefud1.2-240906 is an mT5-large-based multilingual model for coreference resolution usable in CorPipe 24 (https://github.com/ufal/crac2024-corpipe). It is...

Universal Dependencies 2.0 – CoNLL 2017 Shared Task Development and Test Data

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

CorPipe 23 multilingual CorefUD 1.1 model (corpipe23-corefud1.1-231206)

The corpipe23-corefud1.1-231206 is an mT5-large-based multilingual model for coreference resolution usable in CorPipe 23 (https://github.com/ufal/crac2023-corpipe). It is...

Deep Universal Dependencies 2.5

Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-3105). It contains additional...

Universal Dependencies 2.15

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

Deep Universal Dependencies 2.6

Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-3226). It contains additional...

Universal Dependencies 1.0

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

WMT 13 Test Set

We provide the Vietnamese version of the multilingual test set from the WMT 2013 [1] competition. The Vietnamese version was manually translated from English. For completeness,...

216 datasets found