Dataset - B2FIND

Beserman multimedia corpus

This deposit contains transcriptions of monologues and conversations in spoken Beserman (formerly classified as a dialect of Udmurt, ISO 639-2 code...

Prague Dependency Treebank - Consolidated 2.0 (PDT-C 2.0)

A manually annotated and genre-diversified language resource with rich linguistic information from morphology and syntax to semantics, the Prague Dependency Treebank –...

Prague Dependency Treebank - Consolidated 1.0 (PDT-C 1.0)

A richly annotated and genre-diversified language resource, the Prague Dependency Treebank – Consolidated 1.0 (PDT-C 1.0, or PDT-C for short in what follows) is a consolidated...

ENIAMtoolkit

ENIAMtoolkit is a collection of libraries that: - perform tokenization, lemmatization, part of speech tagging; - detect MWE and abbreviations; - split text into sentences.

PolEval 2019 Task 2: Lemmatization of proper names and multi-word phrases — t...

The task consists in developing a tool for lemmatization of proper names and multi-word phrases. The generated lemmas should follow the KPWr guidelines...

KPWr dump r240

Dump of the Polish Corpus of Wrocław University of Technology (KPWr) containing a set of documents annotated with named entities and keywords.

ENIAMtoolkit (2017-03-06)

ENIAMtoolkit is a collection of libraries that: - perform tokenization, lemmatization, part of speech tagging; - detect MWE and abbreviations; - split text into sentences; - LCG...

KPWr annotation guidelines - named entity and phrase lemmatization 2.0

Guidelines for named entity and multi-word phrase lemmatization used in KPWr (Polish Corpus of Wrocław University of Technology).

PolEval 2019 Task 2: Lemmatization of proper names and multi-word phrases — t...

The task consists in developing a tool for the lemmatization of proper names and multi-word phrases. The generated lemmas should follow the KPWr guidelines...

KPWr annotation guidelines - phrase lemmatization

Annotation guidelines for manual phrase lemmatisation in KPWr (Polish Corpus of Wrocław University of Technology).

Indonesian web corpus (idWac)

Indonesian text corpus from the web. Crawled by SpiderLing in 2017. Filtered by JusText and Onion (see http://corpus.tools/ for details). Tagged and lemmatized by MorphInd...

Czech Verbal MWEs

Lexicon of Czech verbal multiword expressions (VMWEs) used in Parseme Shared Task 2017....

Czech PDT-C 1.0 Model for UDPipe 2 (2023-11-16)

Tokenizer, POS Tagger, Lemmatizer, and Parser model based on the PDT-C 1.0 treebank (https://hdl.handle.net/11234/1-3185). The model documentation including performance can be...

CorpusExplorer

Software for corpus linguists and text/data mining enthusiasts. The CorpusExplorer combines over 45 interactive visualizations under a user-friendly interface. Routine tasks...

Universal Dependencies 2.4 Models for UDPipe (2019-05-31)

Tokenizer, POS Tagger, Lemmatizer and Parser models for 90 treebanks of 60 languages of Universal Dependencies 2.4 Treebanks, created solely using UD 2.4 data...

Universal Dependencies 2.15 models for UDPipe 2 (2024-11-21)

Tokenizer, POS Tagger, Lemmatizer and Parser models for 147 treebanks of 78 languages of Universal Dependencies 2.15 Treebanks, created solely using UD 2.15 data...

UDPipe 2

UDPipe 2 is a POS tagger, lemmatizer and dependency parser. Compared to UDPipe 1: UDPipe 2 is Python-only and tested only on Linux, UDPipe 2 is meant as a research tool,...

Universal Dependencies 2.6 models for UDPipe 2 (2020-08-31)

Tokenizer, POS Tagger, Lemmatizer and Parser models for 99 treebanks of 63 languages of Universal Dependencies 2.6 Treebanks, created solely using UD 2.6 data...

Czech Morphological Analyzer v1

One of the very first steps in automatic processing of Czech text is morphological analysis and lemmatization.

POS Tagging and Lemmatization (Czech model)

Model trained for Czech POS tagging and lemmatization using RobeCzech, a Czech version of the BERT model. The model is trained on data from the Prague Dependency Treebank 3.5. Model is a part...

35 datasets found