Dataset - B2FIND

Prague Dependency Treebank - Consolidated 2.0 (PDT-C 2.0)

A manually annotated and genre-diversified language resource with rich linguistic information from morphology and syntax to semantics, the Prague Dependency Treebank –...

Prague Dependency Treebank - Consolidated 1.0 (PDT-C 1.0)

A richly annotated and genre-diversified language resource, the Prague Dependency Treebank – Consolidated 1.0 (PDT-C 1.0, hereafter PDT-C for short) is a consolidated...

Colloc -- A Tool for Automatic Identification of Multiword Expressions

Colloc -- a tool for automatic identification of multiword expressions (MWE) is freely available for online use at http://resursai.mwe.lt/atpazintuvas. As material for training...

The Database of Lithuanian multiword expressions

The Database of Lithuanian multiword expressions (MWEs) has been freely accessible for online search since 2019 at https://resursai.pastovu.vdu.lt/paieska/paprastoji. It contains...

Database of Lithuanian Multiword Expressions

The Database of Lithuanian multiword expressions (MWEs) contains bi-gram and tri-gram MWEs that occurred in the DELFI.lt corpus (http://tekstynas.mwe.lt/) at least 10 times. In the...

Gos corpus n-grams 1.0

This is a collection of n-grams extracted from the Gos corpus of spoken Slovene (http://hdl.handle.net/11356/1040). In addition to the separate lists of n-grams for tokens and...

Terminological multiword expressions lexicon

The Terminological Multiword Expressions Lexicon contains multiword terms extracted from various terminological sources. The entries were lemmatized and tagged according to the...

Automatically constructed multiword lexicon srMWELex v0.5

The srMWELex lexicon is an automatically constructed lexicon of Serbian multiword expression candidates (mostly collocations) from the parsed srWaC 1.0 corpus by using the...

List of formulaic sequences in standard written Slovenian

This document contains 1,891 formulaic sequences in standard written Slovenian, i.e. frequently recurring strings of two to five words, manually annotated for syntactic...

Gos corpus n-grams 2.0

A collection of n-grams extracted from the Gos corpus of spoken Slovene (cf. http://eng.slovenscina.eu/korpusi/gos). Three sets of n-gram lists are provided for lowercased word...

Automatically constructed multiword lexicon slMWELex v0.5

The slMWELex lexicon is an automatically constructed lexicon of Slovene multiword expression candidates (mostly collocations) from the parsed KRES corpus by using the DepMWEx...

List of formulaic sequences in spoken Slovenian

This document contains 2,374 formulaic sequences in spoken Slovenian, i.e. frequently recurring strings of two to five words, manually annotated for syntactic structure,...

Kres corpus n-grams 2.0

A collection of n-grams extracted from the Kres corpus of written Slovene (cf. http://eng.slovenscina.eu/korpusi/kres). Three sets of n-gram lists are provided for lowercased...

Janes corpus n-grams 1.0

A collection of n-grams extracted from the Janes corpus of Slovenian user-generated content version 1.0 (cf. http://nl.ijs.si/janes/). Three sets of n-gram lists are provided...

Dataset of Slovene idiomatic expressions SloIE

SloIE is a manually labelled dataset of Slovene idiomatic expressions. It contains 29,400 sentences with 75 different expressions that can occur with either a literal or an...

Automatically constructed multiword lexicon hrMWELex v0.5

The hrMWELex lexicon is an automatically constructed lexicon of Croatian multiword expression candidates (mostly collocations) from the parsed hrWaC 2.0 corpus by using the...

IMP corpus n-grams 1.0

This is a collection of n-grams extracted from the IMP corpus of historical Slovene (http://hdl.handle.net/11356/1031). In addition to the separate lists of n-grams for tokens...

Multiword Expressions lexicon extracted from the Gigafida 2.1 corpus

The MWE lexicon was extracted from the Gigafida 2.1 Corpus of Written Standard Slovene (https://www.clarin.si/noske/run.cgi/corp_info?corpname=gfida21) using specialized scripts...

KRES corpus n-grams 1.0

This is a collection of n-grams extracted from the KRES corpus of written Slovene. In addition to the separate lists of n-grams for tokens and their attributes (morphosyntactic...

IMP corpus n-grams 2.0

A collection of n-grams extracted from the IMP corpus of historical Slovene (cf. https://nl.ijs.si/imp/). Three sets of n-gram lists are provided for lowercased word n-grams of...

33 datasets found