CLARIN - Repositories

Cases of Complements of Finnish Verbs

Context: Cases of the complements of Finnish verbs. The data is useful for natural language generation (NLG). The data is described in the following paper, which should also be...

Model for Normalizing Historical English

This is an OpenNMT-py model for normalizing historical English into modern spelling. For usage, please see: https://github.com/mikahama/natas. This has been described in the...

Annotated Route Description

This file set, consisting of a video stream, an audio stream and a multimodal annotation file, is frequently used as a showcase of how to do complex multimodal annotations with the...

Creative Dialog Generation for Fallout 4

Mika Hämäläinen and Khalid Alnajjar. 2019. Creative contextual dialog adaptation in an open world RPG. In Proceedings of the 14th International Conference on the Foundations of...

Sign Language Recording

This is a Sign Language Recording made for scientific purposes.

Murre - Normalize non-standard Finnish and dialectalize standard Finnish

A Python library for normalizing dialectal Finnish and dialectalizing standard Finnish. Normalization: Niko Partanen, Mika Hämäläinen, and Khalid Alnajjar. 2019. Dialect Text...

Wikipedia paths

Wikipedia category embedding starting at the top category Biology for English, French and Czech. English data are not complete.

Gustav Vasa's letter production (2015-05-26) Gustav Vasas brevproduktion (20...

King Gustav I's registry Konung Gustaf den förstes registratur

Parole+ (2017-10-16)

The Swedish PAROLE Lexicon - A language technology resource with access to syntactic information, connected to SALDO senses. Svenskt PAROLE-lexikon - En språkteknologisk resurs...

Monitor corpus of Slovene Trendi 2024-12

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 76 publishers. Trendi 2024-11 covers the period from January...

Monitor corpus of Slovene Trendi 2024-11

The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 76 publishers. Trendi 2024-11 covers the period from January...

SELEXINI corpus

We present here a large automatically annotated corpus for French. This corpus is divided into two parts: the first from BigScience, and the second from HPLT. The annotated...

SELEXINI corpus

We present here a large automatically annotated corpus for French. This corpus is divided into two parts: the first from BigScience, and the second from HPLT. The annotated...

Prague Dependency Treebank - Consolidated 2.0 (PDT-C 2.0)

A manually annotated and genre-diversified language resource with rich linguistic information from morphology and syntax to semantics, the Prague Dependency Treebank –...

Prague Dependency Treebank - Consolidated 1.0 (PDT-C 1.0)

A richly annotated and genre-diversified language resource, The Prague Dependency Treebank – Consolidated 1.0 (PDT-C 1.0, or PDT-C in short in the sequel) is a consolidated...

MariTerm v.1.2

This is an enriched version of the MariTerm maritime ontology, containing plug-ins to corresponding synsets inside IWN. The resource was created within the collaboration of the...

HELLO CAMPANIA! Philippines Collection

The Philippines collection contains data for 66 speakers: 32 first generation (G1), 28 second generation (G2), 6 homeland (G0). The collection contains three folders for each...

Survey Data on Preferences of Lithuanian Cybersecurity Terminology

The data is provided in two files: one containing the questionnaire data and the other containing the respondents' data. The questionnaire data is in a TXT file, which includes...

TED-ELH Parallel Corpus

The corpus contains parallel, aligned scripts of TED Talks in English, Lithuanian, and Hebrew. It contains spoken language data.

Lithuanian Treebank ALKSNIS (2019-10-24)

ALKSNIS v3.0. ALKSNIS v3.0 consists of 3,643 syntactically annotated sentences in the PML (Prague Mark-up Language) format. The format allows researchers to visualise and edit...

4,731 datasets found