KPWr annotation guidelines - phrase lemmatization
Annotation guidelines for manual phrase lemmatisation in KPWr (Polish Corpus of Wrocław University of Technology). -
KPWr dump r240
Dump of the Polish Corpus of Wrocław University of Technology (KPWr) containing a set of documents annotated with named entities and keywords. -
PolEval 2019 Task 2: Lemmatization of proper names and multi-word phrases — t...
The task consists in developing a tool for lemmatization of proper names and multi-word phrases. The generated lemmas should follow the KPWr guidelines... -
PolEval 2019 Task 2: Lemmatization of proper names and multi-word phrases — t...
The task consists in developing a tool for the lemmatization of proper names and multi-word phrases. The generated lemmas should follow the KPWr guidelines... -
ENIAMtoolkit
ENIAMtoolkit is a collection of libraries that: - perform tokenization, lemmatization, part of speech tagging; - detect MWE and abbreviations; - split text into sentences. -
ENIAMtoolkit (2017-03-06)
ENIAMtoolkit is a collection of libraries that: - perform tokenization, lemmatization, part of speech tagging; - detect MWE and abbreviations; - split text into sentences; - LCG... -
KPWr annotation guidelines - named entity and phrase lemmatization 2.0
Guidelines for named entity and multi-word phrase lemmatization used in KPWr (Polish Corpus of Wrocław University of Technology). -
Universal Dependencies 2.6 models for UDPipe 2 (2020-08-31)
Tokenizer, POS Tagger, Lemmatizer and Parser models for 99 treebanks of 63 languages of Universal Dependencies 2.6 Treebanks, created solely using UD 2.6 data... -
Persian Morphologically Segmented Lexicon 0.5
This dataset includes 45300 Persian word forms which are manually segmented into sequences of morphemes. -
CorpusExplorer
Software for corpus linguists and text/data mining enthusiasts. The CorpusExplorer combines over 45 interactive visualizations under a user-friendly interface. Routine tasks... -
Indonesian web corpus (idWac)
Indonesian text corpus from web. Crawling done by SpiderLing in 2017. Filtering by JusText and Onion (see http://corpus.tools/ for details). Tagged and lemmatized by MorphInd... -
Universal Dependencies 2.10 models for UDPipe 2 (2022-07-11)
Tokenizer, POS Tagger, Lemmatizer and Parser models for 123 treebanks of 69 languages of Universal Dependencies 2.10 Treebanks, created solely using UD 2.10 data... -
Universal Dependencies 2.0 Models for UDPipe (2017-08-01)
Tokenizer, POS Tagger, Lemmatizer and Parser models for all 50 languages of Universal Dependencies 2.0 Treebanks, created solely using UD 2.0 data... -
Universal Dependencies 2.5 Models for UDPipe (2019-12-06)
Tokenizer, POS Tagger, Lemmatizer and Parser models for 94 treebanks of 61 languages of Universal Dependencies 2.5 Treebanks, created solely using UD 2.5 data... -
Universal Dependencies 2.4 Models for UDPipe (2019-05-31)
Tokenizer, POS Tagger, Lemmatizer and Parser models for 90 treebanks of 60 languages of Universal Dependencies 2.4 Treebanks, created solely using UD 2.4 data... -
Prague Dependency Treebank 3.5
The Prague Dependency Treebank 3.5 is the 2018 edition of the core Prague Dependency Treebank (PDT). It contains all PDT annotation made at the Institute of Formal and Applied... -
UDPipe
UDPipe is a trainable pipeline for tokenization, tagging, lemmatization and dependency parsing of CoNLL-U files. UDPipe is language-agnostic and can be trained given only... -
EvaLatin 2020 models for UDPipe 2 (2020-08-31)
POS Tagger and Lemmatizer models for EvaLatin2020 data (https://github.com/CIRCSE/LT4HALA). The model documentation including performance can be found at... -
CoNLL 2017 Shared Task - UDPipe Baseline Models and Supplementary Materials
Baseline UDPipe models for CoNLL 2017 Shared Task in UD Parsing, and supplementary material. The models require UDPipe version at least 1.1 and are evaluated using the official... -
Czech Morphological Analyzer v1
One of the very first steps in automatic processing of Czech text is morphological analysis and lemmatization.