CLARIN - Repositories

AKCES 5 (CzeSL-SGT)

Essays written by non-native learners of Czech, a part of AKCES/CLAC – Czech Language Acquisition Corpora. CzeSL-SGT stands for Czech as a Second Language with Spelling, Grammar...

Universal Dependencies 2.1

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

UDify Pretrained Model

Pretrained model weights for the UDify model, and extracted BERT weights in pytorch-transformers format. Note that these weights slightly differ from those used in the paper.

WordSim353-cs: Evaluation Dataset for Lexical Similarity and Relatedness, bas...

Czech translation of WordSim353. The Czech translations of the English WordSim353 word pairs were obtained from four translators. All translation variants were scored according to...

POS Tagging and Lemmatization (Czech model)

A model trained for Czech POS tagging and lemmatization using RobeCzech, a Czech version of the BERT model. The model is trained on data from Prague Dependency Treebank 3.5. The model is a part...

AKCES-GEC Grammatical Error Correction Dataset for Czech

AKCES-GEC is a grammatical error correction corpus for Czech generated from a subset of AKCES. It contains train, dev and test files annotated in M2 format. Note that in comparison...

DeriNet 1.5

DeriNet is a lexical network which models derivational relations in the lexicon of Czech. Nodes of the network correspond to Czech lexemes, while edges represent derivational...

CorPipe 24 Multilingual CorefUD 1.2 Model (corpipe24-corefud1.2-240906)

The corpipe24-corefud1.2-240906 is an mT5-large-based multilingual model for coreference resolution usable in CorPipe 24 (https://github.com/ufal/crac2024-corpipe). It is...

CorPipe 23 multilingual CorefUD 1.1 model (corpipe23-corefud1.1-231206)

The corpipe23-corefud1.1-231206 is an mT5-large-based multilingual model for coreference resolution usable in CorPipe 23 (https://github.com/ufal/crac2023-corpipe). It is...

Universal Dependencies 1.2 Models for UDPipe

Tokenizer, POS Tagger, Lemmatizer and Parser models for all Universal Dependencies 1.2 Treebanks, created solely using UD 1.2 data (http://hdl.handle.net/11234/1-1548). To use...

Universal Dependencies 2.15

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

MorphoDiTa: Morphological Dictionary and Tagger

MorphoDiTa: Morphological Dictionary and Tagger is an open-source tool for morphological analysis of natural language texts. It performs morphological analysis, morphological...

CoNLL 2017 Shared Task - UDPipe Baseline Models and Supplementary Materials

Baseline UDPipe models for the CoNLL 2017 Shared Task in UD Parsing, and supplementary material. The models require UDPipe version 1.1 or later and are evaluated using the official...

Czech Models (MorfFlex CZ 160310 + PDT 3.0) for MorphoDiTa 160310

Czech models for MorphoDiTa, providing morphological analysis, morphological generation and part-of-speech tagging. The morphological dictionary is created from MorfFlex CZ...

GrandStaff-LMX: Linearized MusicXML Encoding of the GrandStaff Dataset

The GrandStaff-LMX dataset is based on the GrandStaff dataset described in the "End-to-end optical music recognition for pianoform sheet music" paper by Antonio Ríos-Vila et...

Universal Dependencies 2.5

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual...

CoNLL 2017 Shared Task - Automatically Annotated Raw Texts and Word Embeddings

Automatic segmentation, tokenization and morphological and syntactic annotations of raw texts in 45 languages, generated by UDPipe (http://ufal.mff.cuni.cz/udpipe), together...

EvaLatin 2020 models for UDPipe 2 (2020-08-31)

POS Tagger and Lemmatizer models for EvaLatin2020 data (https://github.com/CIRCSE/LT4HALA). The model documentation including performance can be found at...

Universal Dependencies 2.12 models for UDPipe 2 (2023-07-17)

Tokenizer, POS Tagger, Lemmatizer and Parser models for 131 treebanks of 72 languages of Universal Dependencies 2.12 Treebanks, created solely using UD 2.12 data...

Czech Relationship Extraction Dataset

CERED (Czech Relationship Dataset) is a family of datasets created via distant supervision on Czech Wikipedia and Wikidata. It was created as part of a thesis on Relationship...

97 datasets found