84 datasets found

Keywords: parallel corpus

  • Czech-English Manual Word Alignment

    Corpus of manually aligned Czech-English parallel sentences. It comprises 2500 parallel sentences from 7 different sources.
  • English-Slovak Parallel Corpus

    English-Slovak parallel corpus consisting of several freely available corpora (Acquis [1], Europarl [2], Official Journal of the European Union [3] and part of OPUS corpus [4] –...
  • CzEng 0.7

    CzEng 0.7 is a Czech-English parallel corpus compiled at the Institute of Formal and Applied Linguistics (ÚFAL), Charles University, Prague. The corpus contains no manual...
  • Additional German-Czech reference translations of the WMT'11 test set

    Three additional Czech reference translations of the whole WMT 2011 data set (http://www.statmt.org/wmt11/test.tgz), translated from the German originals. Original segmentation...
  • ParaCrawl Corpus version 1.0

    The January 2018 release of the ParaCrawl is the first version of the corpus. It contains parallel corpora for 11 languages paired with English, crawled from a large number of...
  • IDENTICv1.0-raw

    Raw Text
  • WMT 13 Test Set

    We provide the Vietnamese version of the multi-lingual test set from WMT 2013 [1] competition. The Vietnamese version was manually translated from English. For completeness,...
  • HindEnCorp 0.5

    HindEnCorp parallel texts (sentence-aligned) come from the following sources: Tides, which contains 50K sentence pairs taken mainly from news articles. This dataset was...
  • FAUST 0.5

    Syntactic (including deep-syntactic - tectogrammatical) annotation of user-generated noisy sentences. The annotation was made on Czech-English and English-Czech Faust Dev/Test...
  • Czech and English abstracts of ÚFAL papers

    This is a document-aligned parallel corpus of English and Czech abstracts of scientific papers published by authors from the Institute of Formal and Applied Linguistics, Charles...
  • Synthetic part of CzEng 2.0

    CzEng is a sentence-parallel Czech-English corpus compiled at the Institute of Formal and Applied Linguistics (ÚFAL). While the full CzEng 2.0 is freely available for...
  • Hindi Visual Genome 1.0

    Hindi Visual Genome 1.0 is a multimodal dataset consisting of text and images, suitable for the English-to-Hindi multimodal machine translation task and multimodal research. We...
  • Multilingual corpus of juridical texts

    International conventions and treaties arranged as a parallel corpus aligned at the paragraph level
  • Hunglish Corpus

    Bilingual written general; 2 million sentences
  • LongEval Test Collection

    The collection consists of queries and documents provided by the Qwant search engine (https://www.qwant.com). The queries, which were issued by the users of Qwant, are based on...
  • Prague Czech-English Dependency Treebank 2.0

    The Prague Czech-English Dependency Treebank 2.0 (PCEDT 2.0) is a major update of the Prague Czech-English Dependency Treebank 1.0 (LDC2004T25). It is a manually parsed...
  • ParCorFull: A Parallel Corpus Annotated with Full Coreference

    ParCorFull is a parallel corpus annotated with full coreference chains that has been created to address an important problem that machine translation and other multilingual...
  • IDENTICv1.0

    IDENTIC is an Indonesian-English parallel corpus for research purposes. The corpus is a bilingual corpus paired with English. The aim of this work is to build and provide...
  • Covert translation: Business Communication (new)

    Translation corpora of original texts with translations and comparable texts from the genre external business communication. Translation and comparison corpus with authentic...
  • Covert translation: Business Communication (new)

    Translation corpora of original texts with translations and comparable texts from the genre external business communication. Translation and comparison corpus with...
You can also access this registry using the API (see API Docs).