CLARIN - Repositories

The Dictionary of the Clothing Terminology of the Zilja Dialect in Canale Val...

The collection of sound clips for The Dictionary of the Clothing Terminology of the Zilja Dialect in Canale Valley (Kanalska dolina – Val Canale – Kanaltal – Valcjanâl)...

Offensive language dataset of Croatian, English and Slovenian comments FRENK 1.0

The FRENK dataset consists of comments to Facebook posts (news articles) of mainstream media outlets from Croatia, Great Britain, and Slovenia, on the topics of migrants and...

School dictionary of Slovenian language (human audio recordings)

2,060 recordings in mp3 format were made for the School Dictionary of the Slovenian Language based on the original recordings in wav format (48 kHz, 24-bit). Around 600...

Word list of the collection Words of Slovenian Language - SBSJ (ELEXIS)

Seznam besed iz zbirke Besede slovenskega jezika. A list of 354,205 different words from the headwords of the collection Besede slovenskega jezika / Words of Slovenian Language....

Slovenian Twitter dataset 2018-2020 1.0

The dataset represents the Twitter production in Slovenian in the period from 2018 until 2020. It consists of tweet IDs, retweet IDs, pseudo-anonymized user IDs, publication...

List of formulaic sequences in standard written Slovenian

This document contains 1,891 formulaic sequences in standard written Slovenian, i.e. frequently recurring strings of two to five words, manually annotated for syntactic...

Dialogue act annotated spoken corpus GORDAN 1.0 (transcription)

The GORDAN 1.0 corpus contains authentic data of spoken communication, annotated for dialogue acts according to the GORDAN 1.0 dialogue act annotation scheme, included in the...

ELMo embeddings models for seven languages

ELMo language model (https://github.com/allenai/bilm-tf) used to produce contextual word embeddings, trained on large monolingual corpora for 7 languages: Slovenian, Croatian,...

The CLASSLA-StanfordNLP model for lemmatisation of standard Slovenian

The model for lemmatisation of standard Slovenian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the ssj500k...

Slovene Lexical Database - SLB (ELEXIS)

Leksikalna baza za slovenščino. Slovene Lexical Database was created between 2008 and 2012 and represents a comprehensive syntactic and semantic description of a selected set...

The Trankit model for linguistic processing of standard written Slovenian 1.1

This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/), trained on the reference SSJ...

MULTEXT-East "1984" document corpus 4.0

The novel "1984" by George Orwell is the central component of the MULTEXT-East corpus. This parallel and sentence aligned corpus contains the novel in the English original...

Spoken corpus Gos VideoLectures 2.0 (transcription)

Gos VideoLectures is an add-on to the Gos reference corpus of spoken Slovene (http://hdl.handle.net/11356/1040), and covers public academic speech. The Gos VideoLectures corpus...

TermFrame: Terms, definitions and semantic annotations for karstology

The resource comprises several datasets containing domain-specific data in three languages, English, Slovenian and Croatian, which can be used for various knowledge extraction or...

The CLASSLA-Stanza model for UD dependency parsing of standard Slovenian 2.0

This model for UD dependency parsing of standard Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SUK training corpus...

Dictionary of the Slovenian Language in the Works of Janez Svetokriški - JSV ...

Slovar jezika Janeza Svetokriškega. The Dictionary of the Slovenian Language in the Works of Janez Svetokriški presents and explains the lexis, including proper nouns, from 233...

Multimodal corpus EVA 1.0

EVA Corpus 1.0 consists of one episode of an audio/video session plus corresponding orthographic transcriptions with a duration of 57 minutes. The multi-party spontaneous...

Word embeddings CLARIN.SI-embed.sl 2.0

CLARIN.SI-embed.sl contains word embeddings induced from a large collection of Slovene texts composed of existing corpora of Slovene, e.g. GigaFida, Janes, KAS, slWaC, MaCoCu-sl,...

Developmental corpus of Slovene (without language corrections) Šolar-Clear

Šolar-Clear is an adapted version of the Šolar 1.0 corpus, cf. http://hdl.handle.net/11356/1036. The Šolar(-Clear) corpus consists of texts written by students in Slovene...

The CLASSLA-StanfordNLP model for lemmatisation of standard Slovenian 1.2

The model for lemmatisation of standard Slovenian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the ssj500k...

503 datasets found