CLARIN - Repositories

Khresmoi Summary Translation Test Data 2.0

This package contains data sets for development (Section dev) and testing (Section test) of machine translation of sentences from summaries of medical articles between Czech,...

Coreference in Universal Dependencies 1.0 (CorefUD 1.0)

CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In total, CorefUD in its current version...

French emblems at Glasgow

French emblem books (27 in total) of the 16th century, together with Latin versions where appropriate. Transcribed and facsimile versions, and extensive search functionality.

CoNLL 2017 and 2018 Shared Task Blind and Preprocessed Test Data

CoNLL 2017 and 2018 shared tasks: Multilingual Parsing from Raw Text to Universal Dependencies. This package contains the test data in the form in which they were presented to...

Khresmoi Summary Translation Test Data 1.1

This package contains data sets for development and testing of machine translation of sentences from summaries of medical articles between Czech, English, French, and German.

Universal Dependencies 2.4 Models for UDPipe (2019-05-31)

Tokenizer, POS Tagger, Lemmatizer and Parser models for 90 treebanks of 60 languages of Universal Dependencies 2.4 Treebanks, created solely using UD 2.4 data...

Botanicus Digital Library

Digital copies of historical botanic papers from the Missouri Botanical Garden Library; digitized images of historical botanical writings; German-language texts...

Deltacorpus 1.1

Texts in 107 languages from the W2C corpus (http://hdl.handle.net/11858/00-097C-0000-0022-6133-9), first 1,000,000 tokens per language, tagged by the delexicalized tagger...

TermSciences

500,000 terms (fr, en, de, es), RDB / XML

Khresmoi Query Translation Test Data 2.0

This package contains data sets for development and testing of machine translation of medical queries between Czech, English, French, German, Hungarian, Polish, Spanish and...

Frantext

Mainly literature (17th to 20th century)

OmegaWiki

This dataset has no description

Universal Dependencies 2.13

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

Universal Segmentations 1.0 (UniSegments 1.0)

Universal Segmentations (UniSegments) is a collection of lexical resources capturing morphological segmentations harmonised into a cross-linguistically consistent annotation...

Universal Dependencies 2.15 models for UDPipe 2 (2024-11-21)

Tokenizer, POS Tagger, Lemmatizer and Parser models for 147 treebanks of 78 languages of Universal Dependencies 2.15 Treebanks, created solely using UD 2.15 data...

Universal Dependencies 2.10

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

L1 & L2 Acquisition Marzena Watorek French Project

Language Acquisition corpus

Termoteca

Galician terminology databank, 6,000 terms

DaMuEL 1.0: A Large Multilingual Dataset for Entity Linking

We present DaMuEL, a large Multilingual Dataset for Entity Linking containing data in 53 languages. DaMuEL consists of two components: a knowledge base that contains...

Universal Dependencies 2.6 models for UDPipe 2 (2020-08-31)

Tokenizer, POS Tagger, Lemmatizer and Parser models for 99 treebanks of 63 languages of Universal Dependencies 2.6 Treebanks, created solely using UD 2.6 data...

216 datasets found