- VIADAT-GIS (2019-12-31)
  A VIADAT module; VIADAT-GIS connects the platform with maps. Developed in cooperation with ÚSD AV ČR and NFA.
- Czech Grammar Agreement Dataset for Evaluation of Language Models
  AGREE is a dataset and task for evaluation of language models based on grammar agreement in Czech. The dataset consists of sentences with marked suffixes of past tense verbs....
- Czech Named Entity Corpus 1.0
  The presented Czech Named Entity Corpus 1.0 is the first publicly available corpus providing a large body of manually annotated named entities in Czech sentences, including a...
- Czech PDT-C 1.0 Model for UDPipe 2 (2023-11-16)
  Tokenizer, POS Tagger, Lemmatizer, and Parser model based on the PDT-C 1.0 treebank (https://hdl.handle.net/11234/1-3185). The model documentation including performance can be...
- Preamble 1.0
  Preamble 1.0 is a multilingual annotated corpus of the preamble of the EU REGULATION 2020/2092 OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL. The corpus consists of four...
- Universal Dependencies 2.11
  Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...
- Vystadial 2013 – scripts
  Vystadial 2013 is a dataset of telephone conversations in English and Czech, developed for training acoustic models for automatic speech recognition in spoken dialogue systems....
- ORTOFON v3: corpus of informal spoken Czech with multi-tier transcription (tr...
  ORTOFON v3 is a corpus of authentic spoken Czech used in informal situations (private environment, spontaneity, unpreparedness etc.) that covers the area of the whole Czech...
- MorfFlex CZ 2.1 (2024-12-23)
  MorfFlex CZ 2.1 is the Czech morphological dictionary developed originally by Jan Hajič as a spelling checker and lemmatization dictionary. MorfFlex CZ 2.1 is a part of the...
- Corpus for training and evaluating diacritics restoration systems
  Corpus of texts in 12 languages. For each language, we provide one training, one development and one testing set acquired from Wikipedia articles. Moreover, each language...
- Extended Textual Coreference and Bridging Relations in PDT 2.0
  Annotation of extended textual coreference and bridging relations in the Prague Dependency Treebank 2.0
- Deep Universal Dependencies 2.8
  Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-3687). It contains additional...
- VALLEX 3.0
  VALLEX 3.0 provides information on the valency structure (combinatorial potential) of verbs in their particular senses, which are characterized by glosses and examples. VALLEX...
- Testimonies of Roma and Sinti
  The key idea of our project is to convey to the widest possible readership detailed abstracts of the testimonies of Roma and Sinti and thus their personal and irreplaceable...
- CorPipe 23 multilingual CorefUD 1.2 model (corpipe23-corefud1.2-240906)
  The corpipe23-corefud1.2-240906 is an mT5-large-based multilingual model for coreference resolution usable in CorPipe 23 (https://github.com/ufal/crac2023-corpipe). It is released...
- Czech Legal Text Treebank 2.0
  The Czech Legal Text Treebank 2.0 (CLTT 2.0) annotates the same texts as the CLTT 1.0. These texts come from the legal domain and they are manually syntactically annotated. The...
- Universal Derivations v1.1
  Universal Derivations (UDer) is a collection of harmonized lexical networks capturing word-formation, especially derivational relations, in a cross-linguistically consistent...
- FAUST cs-en 0.5
  This machine translation test set contains 2223 Czech sentences collected within the FAUST project (https://ufal.mff.cuni.cz/grants/faust, http://hdl.handle.net/11234/1-3308)....
- Khresmoi Summary Translation Test Data 2.0
  This package contains data sets for development (Section dev) and testing (Section test) of machine translation of sentences from summaries of medical articles between Czech,...
- Czech Named Entity Corpus 1.1
  Czech Named Entity Corpus 1.1 fixes some issues of the Czech Named Entity Corpus 1.0: misannotated entities are fixed, all formats contain the same data, tmt format is replaced...