CLARIN - Repositories

JABLONSKIS tagset v2

JABLONSKIS VERSION 2 is a Lithuanian standard morphologiclal tagset that is based on the abbreviations of parts of speech and other grammatical categories commonly used in...

Language Technology Research Bibliography for Lithuanian 2016-2020

The language technology bibliography for Lithuanian language in the period 2016-2020. The resource is in BibTex format and it contains: 1) 91 references of research...

ORVELIT v3

ORVELIT v3 (Lith.Originalios ir Vertimų Lietuvių Kalbos Tekstynas) is a comparable monolingual corpus of original and translated Lithuanian consisting of four sub-corpora of...

Lithuanian morphologically annotated corpus - MATAS

MATAS v0.2 - Morphologically Annotated Lithuanian Corpus (manually checked) Contains 4 parts: Documents (21%), Fiction (19%), Periodicals (36%), Scientific texts (24%) Wordform...

Lithuanian keyboard for macOS users

This keyboard driver allows easy access of the Lithuanian letters via conventional keyboard layout a.k.a. „Lithuanian letters instead of numbers“. Essential new feature of this...

Corpus of the Contemporary Lithuanian Language

Corpus of the Contemporary Lithuanian Language, which comprises 208 million words, is a collection of texts designed to represent the current Lithuanian. The corpus has been...

Colloc -- A Tool for Automatic Identification of Multiword Expressions

Colloc -- a tool for automatic identification of multiword expressions (MWE) is freely available for online use at http://resursai.mwe.lt/atpazintuvas. As material for training...

Pedagogic Corpus of Lithuanian

The Pedagogic Corpus of Lithuanian is a monolingual specialized corpus, prepared for learning and teaching Lithuanian in a foreign language classroom. The pedagogic corpus...

English-Lithuanian Parallel Cybersecurity Corpus - DVITAS v2.0

English-Lithuanian parallel corpus DVITAS v2 includes original English texts on cybersecurity and their Lithuanian translations aligned on the sentence level. Version 1 of the...

DIGIRES COVID-19 ML Dataset v.1

DIGIRES COVID-19 ML dataset v.1 is a tab-separated (.tsv) file prepared for training machine learning algorithms. The training dataset was compiled from various internet public...

Lemmatised Wordlist of 1 m. Corpus of Contemporary Lithuanian

The lemmatised wordlist of 1 m. word Lithuanian corpus. The structure of the tab delimited text file (dazninis.txt): HeadwordPart of SpeechWordformFrequency of Occurrence. The...

Lithuanian speech-to-text Transcriber

Speech to text automatic transcriber for Lithuanian is a containerized application implemented into 17 containers. It covers four areas: administrative, legal, medical and...

English-Lithuanian Parallel Cybersecurity Corpus - DVITAS

English-Lithuanian parallel corpus DVITAS includes original English texts on cybersecurity and their Lithuanian translations aligned on the sentence level. The corpus was...

Read Speech Corpus (7G)

The corpus of read Lithuanian speech „7G“ was compiled in 2015-2016. The corpus consists of 352 audio recordings with a total duration of over 7 hours. Seven different speakers...

Corpus of Discourse on Crime

Specialised "Corpus of Discourse on Crime" is synchronic, monolingual, unannotated, consists of two subcorpora. Subcorpus 1: all texts on crime, published in criminal columns on...

The Scottish Gaelic Linguistic Toolkit

A linguistic analyser for tagging, lemmatisation and parsing of Scottish Gaelic texts. Morphological and syntactic analyses are available directly from the webpage (through the...

Lithuanian Word embeddings

GloVe type word vectors (embeddings) for Lithuanian. Delfi.lt corpus (~70 million words) and StanfordNLP were used for training. The training consisted of several stages: 1)...

The Database of Lithuanian multiword expressions

The Database of Lithuanian multiword expressions (MWEs) is freely accessible for online search at: https://resursai.pastovu.vdu.lt/paieska/paprastoji from 2019. It contains...

Assessment Data of the Dictionary of Modern Lithuanian versus Joint Corpora

The resource is the assessment data of The Dictionary of Modern Lithuanian, 6th edition (DML6) [1], from the point of view of its coverage in the Joint Corpus of Lithuanian...

Lithuanian Coreference Corpus

Lithuanian Coreference Corpus The corpus is made out of 100 articles from news portals focusing on political news, as such texts are rich in quotations and named entity...

4,731 datasets found