CLARIN - Repositories

LatinISE corpus (version 4)

The LatinISE corpus is a text corpus collected from the LacusCurtius, Intratext and Musisque Deoque websites. Corpus texts have rich metadata containing information such as genre,...

DaMuEL 1.0: A Large Multilingual Dataset for Entity Linking

We present DaMuEL, a large Multilingual Dataset for Entity Linking containing data in 53 languages. DaMuEL consists of two components: a knowledge base that contains...

Word representations for multiple languages

Dictionaries with different representations for various languages. Representations include Brown clusters of different sizes and morphological dictionaries extracted using...

Universal Dependencies 2.6 models for UDPipe 2 (2020-08-31)

Tokenizer, POS Tagger, Lemmatizer and Parser models for 99 treebanks of 63 languages of Universal Dependencies 2.6 Treebanks, created solely using UD 2.6 data...

W2C – Web to Corpus – Corpora

A set of corpora for 120 languages automatically collected from Wikipedia and the web. Collected using the W2C toolset: http://hdl.handle.net/11858/00-097C-0000-0022-60D6-1

A Human-Annotated Dataset of Scanned Images and OCR Texts from Medieval Docum...

This is an open dataset of scanned images and OCR texts from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains human annotations...

Thesaurus linguae Latinae

The Thesaurus linguae Latinae is the first comprehensive dictionary of ancient Latin; it is compiled on the basis of all Latin texts surviving from antiquity (until AD...

Universal Derivations v0.5

Universal Derivations (UDer) is a collection of harmonized lexical networks capturing word-formation, especially derivational relations, in a cross-linguistically consistent...

A Human-Annotated Dataset for Language Modeling and Named Entity Recognition ...

This is an open dataset of sentences from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains a corpus for language modeling and...

On-line Dictionary of Medieval Latin in the Czech Lands

The Dictionary of Medieval Latin in the Czech Lands registers and explains the vocabulary of Medieval Latin as used in the Czech lands since the beginnings of Latin writing in...

Universal Dependencies 2.1

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

UDify Pretrained Model

Pretrained model weights for the UDify model, and extracted BERT weights in pytorch-transformers format. Note that these weights slightly differ from those used in the paper.

Universal Dependencies 2.0 – CoNLL 2017 Shared Task Development and Test Data

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

Deep Universal Dependencies 2.5

Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-3105). It contains additional...

Mannheimer Texte Online (MATEO)

As a sub-section of MATEO, MARABU (Mannheimer Reihe Altes Buch) includes illustrated books, (manu)scripts and texts on the history of the Electoral Palatinate. Als...

Universal Dependencies 2.15

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

Deep Universal Dependencies 2.6

Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-3226). It contains additional...

Universal Dependencies 1.2

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

Universal Dependencies 2.5

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

906 datasets found