-
Slovene instruction-following dataset for large language models GaMS-Instruct...
GaMS-Instruct-MED is an instruction-following dataset designed to fine-tune Slovene large language models to follow instructions in the medical domain. It consists of pairs of... -
The CLASSLA-Stanza model for JOS dependency parsing of standard Slovenian 2.0
This model for JOS dependency parsing of standard Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SUK training corpus... -
Linguistically annotated multilingual comparable corpora of parliamentary deb...
ParlaMint 2.1 is a multilingual set of 17 comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2020, with each corpus being about 20... -
Slovenian manuscript sermons by Ignacij Holzapfel 1.0
This corpus consists of editions of three volumes of sermons written by Ignatius Holzapfel (1799-1866) when he was active as parish priest in Črnomelj and Ribnica. The bulk of... -
Slovenian Emotion Dimension and Emotion Association Lexicon SloEmoLex 1.0
SloEmoLex is a lexicon of emotion, valence, arousal and dominance for 19,998 Slovenian entries. It includes and extends the Slovenian part of the LiLaH lexicon (Ljubešić et... -
Military dictionary by students of defence studies
The military dictionary by students of defence studies was created during the course Slovenian language and Slovenian military idioms (2012/13 and 2013/14) at the Faculty of... -
Text classification model SloBERTa-Trendi-Topics 1.0
The SloBERTa-Trendi-Topics model is a text classification model that assigns one of 13 topic labels to news texts. It was trained on a set of approx. 36,000 Slovene texts... -
Machine Translation datasets from the KAS corpus KAS-MT 1.0
The Machine Translation datasets KAS-MT 1.0 contain automatically sentence-aligned Slovene and English plain-text abstracts from KAS-Abs 2.0 (http://hdl.handle.net/11356/1449)... -
Frequency list of language problems from Šolar 3.0
The dataset comprises 36,570 examples of student writing from Slovenian primary and secondary schools, together with authentic (teacher-provided) corrections of language problems... -
Trankit model for linguistic processing of spoken Slovenian
This is a retrained Slovenian spoken language model for the Trankit v1.1.1 library (https://pypi.org/project/trankit/). It can predict sentence segmentation, tokenization,... -
List of Slovenian headwords 1.1
A list of headwords from the collection "Besede slovenskega jezika" (Words of Slovenian Language). -
Terminological dictionary of electronic smoking
The terminological dictionary of electronic smoking is the result of the 2019 master's thesis Terminology in the Field of Electronic Smoking. The collection consists of... -
Corpus of term-annotated texts RSDO5 1.1
The RSDO5 corpus was compiled in order to serve as a training set for automatic term identification. It consists of 12 texts with 250,000 words and almost 38,000 manually... -
Multilingual Culture-Independent Word Analogy Datasets
The word analogy task evaluates word embeddings based on analogous word pairs (e.g. "Paris - France" should be equivalent to "Rome - Italy", "son - daughter" should be equivalent to... -
Terminological multiword expressions lexicon
The Terminological Multiword Expressions Lexicon contains multiword terms extracted from various terminological sources. The entries were lemmatized and tagged according to the... -
News comment corpus Janes-News 1.0
Janes-News is an annotated corpus of comments on online news articles from websites rtvslo.si, mladina.si, and reporter.si from the period 2007-03 to 2015-01. The corpus is... -
Forum corpus Janes-Forum 1.0
Janes-Forum is an annotated corpus of Slovene forums from websites med.over.net, avtomobilizem.com, and kvarkadabra.net from the period 2001-02 to 2015-01. The corpus is... -
ŠUSS archive of questions and answers about the Slovenian language (1998-2010)
This corpus contains the Q&A archive of the ŠUSS language consultancy service. The ŠUSS internet forum was active from 1998 to 2010. Questions posted by users were answered by a... -
Collocations Dictionary of Modern Slovene KSSS 1.0
The database of the Collocations Dictionary of Modern Slovene 1.0 contains entries for 35,862 headwords (18,043 nouns, 5,148 verbs, 10,259 adjectives and 2,412 adverbs) and... -
Spoken corpus Gos VideoLectures 4.0 (audio)
Gos VideoLectures is an add-on to the Gos reference corpus of spoken Slovene (http://hdl.handle.net/11356/1040), and covers public academic speech. The Gos VideoLectures corpus...