Dataset - B2FIND

Dependency tree extraction tool STARK 3.0

STARK is a highly customizable tool designed for extracting different types of syntactic structures (trees) from parsed corpora (treebanks), aimed at corpus-driven linguistic...

Croatian linguistic training corpus hr500k 2.0

The hr500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and...

Deep Sequoia corpus - PARSEME-FR corpus - FrSemCor

The Sequoia corpus is a set of 3,099 linguistically-annotated French sentences, originating from four sources (Europarl, European Agency Reports, French regional journal L'Est...

Multiword expressions in the Prague Dependency Treebank 2.0

This dataset adds annotation of multiword expressions and multiword named entities to the original PDT 2.0 data. The annotation is stand-off, stored in the same PML format as...

Czech Verbal MWEs

Lexicon of Czech verbal multiword expressions (VMWEs) used in Parseme Shared Task 2017....

Gold Standard Reference Data for Multiword Expression Extraction: Czech Depen...

Annotated list of dependency bigrams occurring in the PDT more than five times and having part-of-speech patterns that can possibly form a collocation. Each bigram is assigned...

ParaDi 2.0 (2018-01-24)

ParaDi 2.0. is a dictionary of single verb paraphrases of Czech verbal multiword expressions - light verb constructions and idiomatic verb constructions. Moreover, it provides...

ParaDi 2.0

ParaDi 2.0. is a dictionary of single verb paraphrases of Czech verbal multiword expressions - light verb constructions and idiomatic verb constructions. Moreover, it provides...

Annotated corpora and tools of the PARSEME Shared Task on Semi-Supervised Ide...

This multilingual resource contains corpora in which verbal MWEs have been manually annotated, gathered at the occasion of the 1.2 edition of the PARSEME Shared Task on...

PARSEME corpora annotated for verbal multiword expressions (version 1.3)

This multilingual resource contains corpora in which verbal MWEs have been manually annotated. VMWEs include idioms (let the cat out of the bag), light-verb constructions (make...

Czech Multiword Expressions

The dataset contains 4731 frozen continuous Czech multiword expressions. Inflectional word forms are generated for those MWEs where applicable. In total, the dataset contains...

Prague Dependency Treebank 2.5

The Prague Dependency Treebank 2.5 annotates the same texts as the PDT 2.0. The annotation on the original four layers was fixed or improved in various aspects (see...

Prague Dependency Treebank 3.5

The Prague Dependency Treebank 3.5 is the 2018 edition of the core Prague Dependency Treebank (PDT). It contains all PDT annotation made at the Institute of Formal and Applied...

33 datasets found