CLARIN - Repositories

Core vocabulary for Slovenian as L2 1.0

The Core vocabulary for Slovenian as L2 is based on an analysis of the vocabulary appearing in the KUUS corpus (http://hdl.handle.net/11356/1696), which includes textbooks for...

Dataset of Slovene word formation trees ArboSloleks 1.0

ArboSloleks is a dataset containing Slovene word formation trees that have been automatically constructed from word relations (http://hdl.handle.net/11356/1986) extracted from...

Treq Translation Equivalents (ELEXIS)

Data for Treq interface 2.0 derived from the InterCorp parallel corpus release 12.

The CLASSLA-StanfordNLP model for morphosyntactic annotation of standard Slov...

The model for morphosyntactic annotation of standard Slovenian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the...

Twitter corpus Janes-Tweet 1.0

Janes-Tweet is an annotated corpus of almost 10 million tweets posted from 2013-06 to 2017-06 by approximately 9,000 users who tweet mostly in Slovene. The corpus is structured into...

The Trankit model for linguistic processing of written and spoken Slovenian 1.2

This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/), trained on the concatenation...

Slovene Translation of the Atomic 2020 data set SloATOMIC 2020

The SloATOMIC 2020 corpus contains the Slovene translations of the ATOMIC 2020 data set, a commonsense knowledge graph with 1.33M everyday inferential knowledge tuples about...

Parallel sense-annotated corpus ELEXIS-WSD 1.0

ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.0 contains sentences for 10...

Spoken corpus Gos 1.0

GOS is a corpus of spoken Slovene that includes the transcripts of approximately 120 hours of speech recorded in various situations: radio and TV shows, school lessons and...

List of single-word male and female occupations in Slovenian

The list of single-word occupations in Slovene is based on the Slovene Standard Classification of Occupations...

Slovene corpus for aspect-based sentiment analysis - SentiCoref 1.0

The SentiCoref 1.0 corpus consists of 837 documents selected from the SentiNews 1.0 corpus (http://hdl.handle.net/11356/1110). The documents were selected based on the number of...

Corpus of scientific texts of contemporary Slovenian KZB 1.0

The Corpus of scientific texts of contemporary Slovenian consists of 25 million words from scientific monographs and scientific papers written mainly between 2000 and 2023. It...

Frequency lists of character-level n-grams from the Gigafida 2.0 corpus

Frequency lists of character-level n-grams were extracted from the Gigafida 2.0 Corpus of Written Standard Slovene (https://viri.cjvt.si/gigafida/) using the LIST corpus...

Spoken corpus Berta

The Berta Spoken Corpus contains six hours of recorded speech across a variety of interactional settings. These settings include 57 different speech events, with some captured...

CMC training corpus Janes-Norm 1.1

Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation,...

Abstracts from the KAS corpus KAS-Abs 2.0

The KAS-Abs 2.0 corpus contains 125,202 automatically identified Slovenian and/or English abstracts from BSc/BA, MSc/MA, and PhD theses included in the KAS Corpus of Academic...

Slovene instruction-following dataset for large language models GaMS-Instruct...

GaMS-Instruct-DH is an instruction-following dataset designed to fine-tune Slovene large language models to follow instructions. It consists of pairs of prompts and responses,...

Developmental corpus Šolar 2.0

The Developmental corpus Šolar 2.0 consists of 5,485 texts written by students in Slovene secondary schools (age 15-19) and pupils in the 7th-9th grade of primary school...

Trankit model for SST 2.15 1.1

This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/), trained on the SST treebank...

Corpus of Slovene linguistic scientific writing JezKor

JezKor is a collection of linguistic scientific writing in the Slovenian language. It consists of 43 monographs published between 2009 and 2022 by the Fran Ramovš Institute of...

503 datasets found