Dataset - B2FIND

Czech-English Manual Word Alignment

Corpus of manually aligned Czech-English parallel sentences. It comprises 2500 parallel sentences from 7 different sources.

English-Slovak Parallel Corpus

English-Slovak parallel corpus consisting of several freely available corpora (Acquis [1], Europarl [2], Official Journal of the European Union [3] and part of the OPUS corpus [4] –...

CzEng 0.7

CzEng 0.7 is a Czech-English parallel corpus compiled at the Institute of Formal and Applied Linguistics (ÚFAL), Charles University, Prague. The corpus contains no manual...

Additional German-Czech reference translations of the WMT'11 test set

Three additional Czech reference translations of the whole WMT 2011 data set (http://www.statmt.org/wmt11/test.tgz), translated from the German originals. Original segmentation...

ParaCrawl Corpus version 1.0

The January 2018 release of ParaCrawl is the first version of the corpus. It contains parallel corpora for 11 languages paired with English, crawled from a large number of...

IDENTICv1.0-raw

Raw Text

WMT 13 Test Set

We provide the Vietnamese version of the multilingual test set from the WMT 2013 [1] competition. The Vietnamese version was manually translated from English. For completeness,...

HindEnCorp 0.5

HindEnCorp parallel texts (sentence-aligned) come from the following sources: Tides, which contains 50K sentence pairs taken mainly from news articles. This dataset was...

FAUST 0.5

Syntactic (including deep-syntactic - tectogrammatical) annotation of user-generated noisy sentences. The annotation was made on Czech-English and English-Czech Faust Dev/Test...

Czech and English abstracts of ÚFAL papers

This is a document-aligned parallel corpus of English and Czech abstracts of scientific papers published by authors from the Institute of Formal and Applied Linguistics, Charles...

Synthetic part of CzEng 2.0

CzEng is a sentence-parallel Czech-English corpus compiled at the Institute of Formal and Applied Linguistics (ÚFAL). While the full CzEng 2.0 is freely available for...

Hindi Visual Genome 1.0

Hindi Visual Genome 1.0 is a multimodal dataset consisting of text and images, suitable for the English-to-Hindi multimodal machine translation task and multimodal research. We...

Multilingual corpus of juridical texts

International conventions and treaties arranged as a parallel corpus aligned at the paragraph level

Hunglish Corpus

Bilingual, written, general; 2 million sentences

LongEval Test Collection

The collection consists of queries and documents provided by the Qwant search engine (https://www.qwant.com). The queries, which were issued by the users of Qwant, are based on...

Prague Czech-English Dependency Treebank 2.0

The Prague Czech-English Dependency Treebank 2.0 (PCEDT 2.0) is a major update of the Prague Czech-English Dependency Treebank 1.0 (LDC2004T25). It is a manually parsed...

ParCorFull: A Parallel Corpus Annotated with Full Coreference

ParCorFull is a parallel corpus annotated with full coreference chains that has been created to address an important problem that machine translation and other multilingual...

IDENTICv1.0

IDENTIC is an Indonesian-English parallel corpus for research purposes. The corpus is a bilingual corpus paired with English. The aim of this work is to build and provide...

Covert translation: Business Communication (new)

Translation corpora of original texts with translations and comparable texts from the genre of external business communication. Translation and comparison corpus with authentic...

84 datasets found