Dataset - B2FIND

Blog post and comment corpus Janes-Blog 1.0

Janes-Blog is an annotated corpus of Slovene blogs from the websites rtvslo.si and publishwall.si from the period 2006-10 to 2016-01. The corpus is structured into individual texts...

Wikipedia talk corpus Janes-Wiki 1.0

Janes-Wiki is an annotated corpus of discussion pages from the Slovene Wikipedia from the period 2003-08 to 2017-06. The corpus contains page and user talks and is structured...

CMC training corpus Janes-Tag 2.0

Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...

Croatian Twitter training corpus ReLDI-NormTag-hr 1.1

ReLDI-NormTag-hr 1.1 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word...

Serbian Twitter training corpus ReLDI-NormTag-sr 1.1

ReLDI-NormTag-sr 1.1 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word...

Tweet comma corpus Janes-Vejica 1.0

Janes-Vejica is a corpus of Slovene tweets where commas are annotated with the reason for their (in)correct use, according to the supplied typology. The corpus was sampled from...

CMC training corpus Janes-Tag 1.2

Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...

CMC training corpus Janes-Norm 1.2

Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation,...

CMC shortening corpus Janes-Kratko 1.0

Janes-Kratko is a corpus of Slovene tweets manually annotated with shortening phenomena according to the supplied typology covering different types of spelling, lexical and...

CMC training corpus Janes-Syn 1.0

Janes-Syn is a syntactically annotated corpus of Slovene tweets and is meant as a gold-standard training and testing dataset for syntactic annotation of Slovene...

Japanese web corpus with difficulty levels jpWaC-L 1.0

The corpus contains over 300 million words, with annotations of words and sentences describing their difficulty levels. Words are assigned levels of difficulty according to the...

Digital library and corpus of historical Slovene IMP 1.1

The IMP digital library contains historical Slovene books and other publications, 658 texts in total with over 45,000 pages, from the period 1584-1919. Each text contains...

Reference corpus of historical Slovene goo300k 1.2

goo300k is a manually annotated reference corpus of historical Slovene. It contains 1,100 pages (about 300,000 tokens) sampled from 89 texts from the period 1584-1899. Each text...

XML-TEI Bible

This project contains the biblical text in XML-TEI (66 books). The encoded text is in Spanish, but the markup (elements, attributes, values, ids) is in English. It makes...

Slovenian parliamentary corpus SlovParl 2.0

The SlovParl corpus contains minutes of the Assembly of the Republic of Slovenia for the legislative period 1990-1992, i.e. it covers the period before, during, and after...

Training corpus ssj500k 2.0

The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation....

Training corpus jos1M 1.1

The jos1M corpus contains 1 million words of sampled paragraphs from the FidaPLUS corpus. It is meant to serve as a training corpus for word-level tagging of Slovene. This...

Lexicon of historical Slovene imp25k 1.1

The imp25k lexicon of historical Slovene was created automatically from the goo300k and foo3M annotated corpora and contains attested and manually verified word forms and their...

Training corpus ssj500k 1.3

The ssj500k training corpus is based on two training corpora built within the JOS project (https://nl.ijs.si/jos/). It contains the jos100k corpus and additional material from...

Slovenian parliamentary corpus ParlaMeter-sl 1.0

The ParlaMeter-sl corpus contains minutes of the National Assembly of the Republic of Slovenia and currently covers its VIIth mandate (2014-08-01 to 2018-06-22). The corpus...

111 datasets found