CLARIN - Repositories

Exploring genealogical blends_Online Corpus

The online corpus supplement to the paper "Exploring genealogical blends: the Surinamese Creole Cluster and the Virgin Islands Dutch Creole Cluster", published in the CLARIN...

s.morfcorpus.6ec19594.20131227-2309

WMT 2013 Crawled News monolingual corpus, Czech, segmented by Morfessor

Psycholinguistic Experiment Video

This is a video recording that is being used in psycholinguistic experiments.

Prague Dependency Treebank 2.0 Sample Data

This is a small sample dataset from PDT 2.0. As such it can be released under a very permissive CC-BY license.

Interaction and dialogue with large-scale textual data: Parliamentary speeche...

Prof. Dr. Andreas Blätte's keynote talk at the CLARIN Annual Conference 2015. Additional material, including the presented 3D visualisations, is available via...

Sign Language Interaction

This is a sign language interaction recording made for scientific purposes.

Replication of part of the IFA corpus

The IFA Spoken Language corpus is a free (GPL) database of hand-segmented Dutch speech. It was constructed with off-the-shelf software using speech from 8 speakers in a variety...

TXM_0.7.7_Win64.exe

Setup file for TXM 0.7.7 for 64-bit Windows. TXM is a free and open-source (GPL v3) textual corpora analysis platform. It combines five key components: a) the ability to import and...

Časování sloves v bengálštině (Verb conjugation in Bengali)

Description of verbal paradigms in Bengali. The description is written in Czech.

Language Learning Stimulus Video

This is a video recording that is used for studying language learning by young children.

Syntactically annotated Czech legal texts

Two legal texts with manual syntactic annotation according to the Prague Dependency Treebank framework. Dependency trees are presented as images. The annotation editor TrEd was...

Orthography-based dating and localisation of Middle Dutch charters

In this study we build models for the localisation and dating of Middle Dutch charters. First, we extract character trigrams and use these to train a machine learner (K Nearest...

Annotated Route Description

This file set, consisting of a video stream, an audio stream and a multimodal annotation file, is frequently used as a showcase of how to do complex multimodal annotations with the...

Sign Language Recording

This is a sign language recording made for scientific purposes.

Wikipedia paths

Wikipedia category embedding starting at the top category Biology for English, French and Czech. English data are not complete.

HELLO CAMPANIA! Philippines Collection

The Philippines collection contains data for 66 speakers: 32 first generation (G1), 28 second generation (G2), 6 homeland (G0). The collection contains three folders for each...

TED-ELH Parallel Corpus

The corpus contains aligned parallel scripts of TED Talks in English, Lithuanian, and Hebrew. It contains spoken language data.

English-Lithuanian Parallel Cybersecurity Corpus - DVITAS v2.0

English-Lithuanian parallel corpus DVITAS v2 includes original English texts on cybersecurity and their Lithuanian translations aligned on the sentence level. Version 1 of the...

English-Lithuanian Parallel Cybersecurity Corpus - DVITAS

English-Lithuanian parallel corpus DVITAS includes original English texts on cybersecurity and their Lithuanian translations aligned on the sentence level. The corpus was...

Lithuanian-English Cybersecurity Termbase v.0.1

The bilingual termbase is a TBX export of the online termbase https://www.terminologue.org/csterms/. The termbase includes terms for 233 cybersecurity concepts.

678 datasets found