-
Eye-Tracking Recordings from a Pilot Study of WMT-style MT Outputs Ranking
This package contains the eye-tracker recordings of 8 subjects evaluating English-to-Czech machine translation quality using the WMT-style ranking of sentences. We provide the... -
Coreference in Universal Dependencies 0.1 (CorefUD 0.1)
CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In total, CorefUD in its current version... -
Universal Derivations v1.0
Universal Derivations (UDer) is a collection of harmonized lexical networks capturing word-formation, especially derivational relations, in a cross-linguistically consistent... -
NameTag 3 Multilingual CoNLL Model
This is a trained model for the supervised machine learning tool NameTag 3 (https://ufal.mff.cuni.cz/nametag/3/), trained jointly on several NE corpora: English CoNLL-2003,... -
Prague Discourse Treebank 2.0
PDiT 2.0 is a new version of the Prague Discourse Treebank. It contains a complex annotation of discourse phenomena enriched by the annotation of secondary connectives. -
Universal Dependencies 2.0 alpha (obsolete)
This release contains errors in several files. Please use http://hdl.handle.net/11234/1-1983 instead. -
HamleDT 3.0
HamleDT (HArmonized Multi-LanguagE Dependency Treebank) is a compilation of existing dependency treebanks (or dependency conversions of other treebanks), transformed so that... -
Parallel Global Voices, Czech-English NER+NEL
Named entity annotations added to the existing Parallel Global Voices resource for the ces-eng language pair. The named entity annotations distinguish four classes: Person, Organization,... -
ParCzech 3.0
The ParCzech 3.0 corpus is the third version of ParCzech consisting of stenographic protocols that record the Chamber of Deputies’ meetings held in the 7th term (2013-2017) and... -
Retrograde Morphemic Dictionary of Czech
The data contains the morphemic dictionary scanned in the PDF format. It is divided into 3 parts: introductions.pdf (pp. 11-102), main_dictionary.pdf (pp. 113-506), appendices.pdf... -
SYN v9: large corpus of written Czech
Corpus of contemporary written (printed) Czech sized 4.7 GW (i.e. 5.7 billion tokens). It covers mostly the 1990-2019 period and features rich metadata including detailed... -
GECCC Grammar Error Correction Corpus for Czech (2022-09-28)
Grammar Error Correction Corpus for Czech (GECCC) consists of 83 058 sentences and covers four diverse domains, including essays written by native students, informal website... -
HamleDT 2.0
HamleDT 2.0 is a collection of 30 existing treebanks harmonized into a common annotation style, the Prague Dependencies, and further transformed into Stanford Dependencies, a... -
Annotated corpora and tools of the PARSEME Shared Task on Automatic Identific...
The PARSEME shared task aims at identifying verbal MWEs in running texts. Verbal MWEs include idioms (let the cat out of the bag), light verb constructions (make a decision),... -
Czech Natural Language Inference Dataset with Explanations
The dataset contains two parts: the original Stanford Natural Language Inference (SNLI) dataset with automatic translations to Czech; for some items from the SNLI, it contains... -
C4Corpus (CC BY-NC-SA part)
A large web corpus (over 10 billion tokens), licensed under the Creative Commons license family, in 50+ languages, extracted from CommonCrawl, the largest publicly... -
Speech databases of typical children and children with SLI
Our Laboratory of Artificial Neural Network Applications (LANNA) at the Czech Technical University in Prague (the head of the laboratory is Professor Jana Tučková) collaborates on a... -
SQAD v2
Simple question answering database (SQAD) created from Czech Wikipedia. Each record of SQAD consists of four files (in vertical form provided with lemmatization and POS tagging)... -
Manually Ranked Translation Outputs
Manually ranked outputs of Czech-Slovak translations. Three annotators manually ranked outputs of five MT systems (Česílko, Česílko2, Google Translate and two Moses setups) on... -
Universal Dependencies 2.14
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...