The Orange workflow for observing collocation clusters ColEmbed 1.0
ColEmbed is a workflow (.OWS file) for Orange Data Mining (an open-source machine learning and data...
Ekspress news article archive (in Estonian and Russian) 1.0
The dataset is an archive of articles from the Ekspress Meedia news site from 2009-2019, containing over 1.4M articles, mostly in Estonian (1,115,120 articles) with...
Word embeddings CLARIN.SI-embed.mk 2.0
CLARIN.SI-embed.mk contains word embeddings induced from a large collection of Macedonian texts crawled from the .mk top-level domain. The embeddings are based on the skip-gram...
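Vector spaces like this are typically distributed in the plain word2vec text format; under that assumption, a minimal gensim sketch for loading and querying them could look as follows (the file name and query token are hypothetical). The same pattern applies to the CLARIN.SI-embed.sr/.sl/.hr entries below.

```python
# Minimal sketch: load skip-gram fastText vectors such as CLARIN.SI-embed.mk
# with gensim, assuming word2vec text format (file name is hypothetical).
from gensim.models import KeyedVectors

# limit= caps the vocabulary read from disk to keep memory use modest.
vectors = KeyedVectors.load_word2vec_format(
    "embed.mk.vec", binary=False, limit=200_000)

# Nearest neighbours by cosine similarity (assumes the token is in vocabulary).
for word, score in vectors.most_similar("скопје", topn=5):
    print(f"{word}\t{score:.3f}")
```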
Word embeddings CLARIN.SI-embed.mk 0.1
CLARIN.SI-embed.mk contains word embeddings induced from a large collection of Macedonian texts crawled from the .mk top-level domain. The embeddings are based on the skip-gram...
Slovenian RoBERTa contextual embeddings model: SloBERTa 2.0
The monolingual Slovene RoBERTa (A Robustly Optimized Bidirectional Encoder Representations from Transformers) model is a state-of-the-art model representing words/tokens as...
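Assuming the model is published on the Hugging Face hub under the id EMBEDDIA/sloberta, a minimal sketch for extracting contextual token embeddings with the transformers library:

```python
# Sketch: contextual token embeddings from SloBERTa via Hugging Face
# transformers; the hub id EMBEDDIA/sloberta is an assumption about
# where the model is published.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EMBEDDIA/sloberta")
model = AutoModel.from_pretrained("EMBEDDIA/sloberta")

inputs = tokenizer("Ljubljana je glavno mesto Slovenije.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector per subword token from the final layer.
token_embeddings = outputs.last_hidden_state[0]
print(token_embeddings.shape)  # (sequence_length, hidden_size)
```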
SimLex-999 Slovenian translation SimLex-999-sl 1.0
The resource contains the English SimLex-999 word pairs (Hill et al. 2015) and their Slovene translations. In the translation process, the word pairs were first translated by two translators...
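The standard use of SimLex-999 is intrinsic evaluation: the Spearman correlation between human similarity ratings and embedding cosine similarities. A sketch, assuming a tab-separated file with word1/word2/rating columns (the column names and file paths are hypothetical):

```python
# Sketch of a SimLex-style evaluation: correlate human ratings with
# embedding similarities. TSV column layout is an assumption.
import csv
from scipy.stats import spearmanr
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("embed.sl.vec")  # hypothetical path

gold, predicted = [], []
with open("simlex999-sl.tsv", encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        w1, w2 = row["word1"], row["word2"]
        if w1 in vectors and w2 in vectors:  # skip out-of-vocabulary pairs
            gold.append(float(row["rating"]))
            predicted.append(vectors.similarity(w1, w2))

print(spearmanr(gold, predicted))
```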
ELMo embeddings model, Slovenian
ELMo language model (https://github.com/allenai/bilm-tf) used to produce contextual word embeddings, trained on the entire Gigafida 2.0 corpus...
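A sketch of producing embeddings from such a bilm-tf model via the pre-1.0 allennlp ElmoEmbedder; the options and weights file names are hypothetical and depend on how the Slovenian model is packaged:

```python
# Sketch: contextual embeddings from a bilm-tf ELMo model using the
# (pre-1.0) allennlp API; file names below are hypothetical.
from allennlp.commands.elmo import ElmoEmbedder

elmo = ElmoEmbedder(
    options_file="slovenian-elmo-options.json",
    weight_file="slovenian-elmo-weights.hdf5")

# Returns a (3, num_tokens, 1024) array: one row of vectors per biLM layer.
layers = elmo.embed_sentence(["Gigafida", "je", "korpus", "slovenščine", "."])
print(layers.shape)
```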
Word embeddings CLARIN.SI-embed.sr 1.0
CLARIN.SI-embed.sr contains word embeddings induced from the srWaC web corpus. The embeddings are based on the skip-gram model of fastText trained on 554,606,544 tokens of...
CroSloEngual BERT
Trilingual BERT (Bidirectional Encoder Representations from Transformers) model, trained on Croatian, Slovenian, and English data. State-of-the-art tool representing...
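Assuming the model is published on the Hugging Face hub as EMBEDDIA/crosloengual-bert, a quick way to probe it is the transformers fill-mask pipeline; the same call works for Croatian, Slovenian, and English input:

```python
# Sketch: masked-token prediction with a trilingual BERT model; the hub id
# EMBEDDIA/crosloengual-bert is an assumption about where it is published.
from transformers import pipeline

fill = pipeline("fill-mask", model="EMBEDDIA/crosloengual-bert")

# BERT models use the [MASK] placeholder token.
for pred in fill("Zagreb is the capital of [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))
```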
Slovenian RoBERTa contextual embeddings model: SloBERTa 1.0
The monolingual Slovene RoBERTa (A Robustly Optimized Bidirectional Encoder Representations from Transformers) model is a state-of-the-art model representing words/tokens as...
Word embeddings CLARIN.SI-embed.sl 1.0
CLARIN.SI-embed.sl contains word embeddings induced from a large collection of Slovene texts composed of existing corpora of Slovene, e.g. GigaFida, Janes, KAS, slWaC, etc. The...
Word embeddings CLARIN.SI-embed.hr 2.0
CLARIN.SI-embed.hr contains word embeddings induced from a large collection of Croatian texts composed of the Croatian web corpus hrWaC, a 400-million-token collection of...
Package of word embeddings of Czech from a large corpus
This package comprises eight models of Czech word embeddings trained by applying word2vec (Mikolov et al. 2013) to the most extensive corpus of Czech currently available, namely SYN v9...
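For readers unfamiliar with the cited technique, a toy sketch of skip-gram word2vec training with gensim; the corpus and hyperparameters are illustrative, not those used for the SYN v9 models:

```python
# Toy sketch of word2vec training (Mikolov et al. 2013) with gensim;
# settings below are illustrative only.
from gensim.models import Word2Vec

sentences = [
    ["praha", "je", "hlavní", "město", "české", "republiky"],
    ["brno", "je", "druhé", "největší", "město"],
]

model = Word2Vec(
    sentences,
    vector_size=100,  # embedding dimensionality
    window=5,         # context window size
    sg=1,             # 1 = skip-gram, 0 = CBOW
    min_count=1,      # keep every token in this toy corpus
)
print(model.wv["praha"][:5])
```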
CoNLL 2017 Shared Task - Automatically Annotated Raw Texts and Word Embeddings
Automatic segmentation, tokenization and morphological and syntactic annotations of raw texts in 45 languages, generated by UDPipe (http://ufal.mff.cuni.cz/udpipe), together...
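UDPipe annotations of this kind follow the CoNLL-U format, which can be streamed with the conllu library; the file name below is hypothetical:

```python
# Sketch: stream CoNLL-U annotations without loading the whole file;
# the file name is hypothetical.
from conllu import parse_incr

with open("cs-annotated.conllu", encoding="utf-8") as f:
    for sentence in parse_incr(f):
        for token in sentence:
            print(token["form"], token["lemma"], token["upos"], sep="\t")
        break  # just the first sentence for demonstration
```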
Multilingual static embeddings for Verbal Multiword Expressions trained on PA...
This resource is a set of 14 vector spaces for single words and Verbal Multiword Expressions (VMWEs) in different languages (German, Greek, Basque, French, Irish, Hebrew, Hindi,...
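One common use of such joint spaces is checking how compositional a VMWE is, i.e. how close its own vector lies to the mean of its component word vectors. A sketch, assuming (hypothetically) that VMWEs are stored as underscore-joined tokens:

```python
# Sketch of a compositionality check in a joint word/VMWE space; the
# underscore-joined token convention and file name are assumptions.
import numpy as np
from gensim.models import KeyedVectors

space = KeyedVectors.load_word2vec_format("vmwe-fr.vec")  # hypothetical file

vmwe = "prendre_part"  # hypothetical VMWE token (French "to take part")
parts = vmwe.split("_")

if vmwe in space and all(p in space for p in parts):
    composed = np.mean([space[p] for p in parts], axis=0)
    direct = space[vmwe]
    cosine = float(np.dot(composed, direct)
                   / (np.linalg.norm(composed) * np.linalg.norm(direct)))
    print(f"compositionality (cosine): {cosine:.3f}")
```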
Digital humanities: Introduction. A 10-week course with practical sessions.
The aim of the course is to introduce digital humanities and to describe various aspects of digital content processing. The course consists of 10 lessons with video material and...
Klassifikation von Tragödien und Komödien bei Calderón de la Barca (Classification of Tragedies and Comedies in Calderón de la Barca)
Data publication accompanying the article "Klassifikation von Tragödien und Komödien bei Calderón de la Barca": spoken text from 64 dramas by Pedro Calderón de la Barca,...
Pretrained word and multi-sense embeddings for Estonian
Word and multi-sense embeddings for Estonian, trained on the lemmatized etTenTen: Corpus of the Estonian Web. Word embeddings are trained with word2vec. Sense embeddings are trained...
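A sketch of how multi-sense vectors are typically used: pick the sense whose vector is closest to the averaged context. The "word#index" sense-token convention, file name, and example words are all hypothetical:

```python
# Sketch of context-based sense selection, assuming (hypothetically) that
# senses are stored as indexed tokens like "tee#0", "tee#1" alongside
# ordinary word vectors in a single space.
import numpy as np
from gensim.models import KeyedVectors

space = KeyedVectors.load_word2vec_format("et-sense.vec")  # hypothetical file

def best_sense(word, context_words, max_senses=5):
    """Return the sense token whose vector has the highest cosine
    similarity to the mean of the context word vectors."""
    context = [space[w] for w in context_words if w in space]
    if not context:
        return None
    ctx = np.mean(context, axis=0)
    senses = [f"{word}#{i}" for i in range(max_senses)
              if f"{word}#{i}" in space]
    return max(senses,
               key=lambda s: np.dot(space[s], ctx)
               / (np.linalg.norm(space[s]) * np.linalg.norm(ctx)),
               default=None)

# "tee" is ambiguous in Estonian ("road" vs. "tea").
print(best_sense("tee", ["jook", "kuum", "tass"]))
```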