Dataset - B2FIND

The Level Stress recordings: Våmhus_08

Recording equipment: The recordings were made with a digital recorder (Fostex FR-2LE) and two AKG C451 B microphones placed on the table in front of the speakers. The...

ELMCIP Electronic Literature Knowledge Base: Critical Writing

The ELMCIP Critical Writing database includes monographs, book chapters, journal articles, reviews, etc. written about electronic literature or referenced in electronic...

ELMCIP Electronic Literature Knowledge Base: Creative Works

The ELMCIP Creative Works database contains works of electronic literature, digital literary art, and print antecedents. Column titles in the data correspond to the data fields...

Swe-NERC

A resource for training and evaluation of Named Entity Recognition for Swedish

Slovene corpus for aspect-based sentiment analysis - SentiCoref 1.0

The SentiCoref 1.0 corpus consists of 837 documents selected from the SentiNews 1.0 corpus (http://hdl.handle.net/11356/1110). The documents were selected based on the number of...

CMC training corpus Janes-Norm 1.1

Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation,...

Croatian Twitter training corpus ReLDI-NormTag-hr 1.1

ReLDI-NormTag-hr 1.1 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word...

Developmental corpus Šolar 2.0

The Developmental corpus Šolar 2.0 consists of 5,485 texts written by students in Slovene secondary schools (age 15-19) and pupils in the 7th-9th grade of primary school...

Frequency list of language problems from Šolar 3.0

The dataset comprises 36,570 examples of student writing from Slovenian primary and secondary schools, together with authentic (teacher-provided) corrections of language problems...

Multimodal corpus EVA 1.0

EVA Corpus 1.0 consists of one episode of an audio/video session plus corresponding orthographic transcriptions with a duration of 57 minutes. The multi-party spontaneous...

Annotated Corpus of Pre-Standardized Balkan Slavic Literature 1.1

The corpus contains 23 linguistically annotated samples of "damaskini" and other Balkan Slavic manuscripts and print editions from the 15th-19th century, together with over 50...

Terminology identification dataset KAS-term 1.0

The dataset contains 22,950 term candidates extracted from 15 Slovenian PhD theses. The term candidates are of length 1 to 4, extracted via morphosyntactic patterns and the...

Knowledge-Enhanced Winograd Schema Challenge KE-WSC 1.0

Knowledge-Enhanced Winograd Schema Challenge KE-WSC is an upgraded version of the original WSC dataset. It includes the following extensions: Annotation of semantically or...

CMC training corpus Janes-Tag 1.2

Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...

Tweet code-switching corpus Janes-Preklop 1.0

Janes-Preklop is a corpus of Slovene tweets that is manually annotated for code-switching (the use of words from two or more languages within one sentence or utterance),...

Font ZRCalo 1.0

ZRCalo is an open font meant to gradually phase out the ZRCola font as one of the components of the ZRCola 2 input system (http://hdl.handle.net/11356/1090). The current version...

Opinion corpus of Slovene web commentaries KKS 1.001

The corpus of web commentaries with sentiment categorizations was developed as part of a BSc thesis (Kadunc, 2016) and was used to evaluate the Slovene Sentiment Lexicon KSS...

CMC training corpus Janes-Tag 2.1

Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence...

Serbian Twitter training corpus ReLDI-NormTagNER-sr 2.1

ReLDI-NormTagNER-sr 2.1 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation,...

xLiMe Twitter Corpus XTC 1.0.1

The xLiMe Twitter Corpus contains tweets in German, Italian and Spanish manually annotated with part-of-speech, named entities, and message-level sentiment polarity. In total,...

77,715 datasets found