Dataset - B2FIND

TED-ELH Parallel Corpus

The corpus contains aligned parallel scripts of TED Talks in English, Lithuanian, and Hebrew. It consists of spoken language data.

A Digital Dictionary of Tunis Arabic - TUNICO (ELEXIS)

A corpus-based dictionary, enriched with historical data. The dictionary was built not only on data from the corpus of spoken language that was compiled in the same project, but...

Corpus of metaphorical expressions in spoken Slovene language G-KOMET 1.0

G-KOMET (a corpus of metaphorical expressions in spoken Slovene language) is an upgrade of the hand-annotated written corpus for metaphorical expressions KOMET...

ASR database ARTUR 1.0 (transcriptions)

Artur 1.0 is a speech database designed for the needs of developing automatic speech recognition for the Slovenian language. The complete database includes 1,067 hours of...

List of formulaic sequences in spoken Slovenian

This document contains 2,374 formulaic sequences in spoken Slovenian, i.e. frequently recurring strings of two to five words, manually annotated for syntactic structure,...

ASR database ARTUR 1.0 (audio)

Artur 1.0 is a speech database designed for the needs of automatic speech recognition for the Slovenian language. The database includes 1,067 hours of speech. 884 hours are...

ASR database ARTUR 0.1 (audio)

ARTUR is a speech database designed for the needs of automatic speech recognition for the Slovenian language. The database includes 1,035 hours of speech, although only 840...

ASR database ARTUR 0.1 (transcriptions)

ARTUR is a speech database designed for the needs of automatic speech recognition for the Slovenian language. The database includes 1,035 hours of speech, although only 840...

Albanian Spoken Corpus in Kosovo 1.0

This is the third version of a spoken corpus of Albanian in Kosovo. The corpus data is based on short life stories of 212 informants out of a sample of 1,800 speakers...

Albanian Spoken Corpus in Kosovo 0.2

This is the second version of a spoken corpus of Albanian in Kosovo. The corpus data is based on short life stories of 212 informants out of a sample of 1,800 speakers...

The "Mići Princ" text and speech dataset of Chakavian micro-dialects

The Mići Princ "text and speech" dialectal dataset is a word-aligned version of the translation of The Little Prince into various Chakavian micro-dialects, released by the...

ORTOFON v1: balanced corpus of informal spoken Czech with multi-tier transcri...

ORTOFON v1 is designed as a representation of authentic spoken Czech used in informal situations (private environment, spontaneity, unpreparedness etc.) in the area of the whole...

ORTOFON v3: corpus of informal spoken Czech with multi-tier transcription (tr...

ORTOFON v3 is a corpus of authentic spoken Czech used in informal situations (private environment, spontaneity, unpreparedness etc.) that covers the area of the whole Czech...

ORTOFON v3: corpus of informal spoken Czech with multi-tier transcription (tr...

ORTOFON v3 is a corpus of authentic spoken Czech used in informal situations (private environment, spontaneity, unpreparedness etc.) that covers the area of the whole Czech...

ORAL2013: balanced corpus of informal spoken Czech (transcriptions & audio)

ORAL2013 is designed as a representation of authentic spoken Czech used in informal situations (private environment, spontaneity, unpreparedness etc.) in the area of the whole...

ORTOFON v1: balanced corpus of informal spoken Czech with multi-tier transcri...

ORTOFON v1 is designed as a representation of authentic spoken Czech used in informal situations (private environment, spontaneity, unpreparedness etc.) in the area of the whole...

Languages in Migration

LANGUAGES IN MIGRATION is designed as a representation of authentic spoken Czech and German that is used in informal speech (private environment, spontaneity, unpreparedness...

Large-Scale Colloquial Persian 0.5

"Large Scale Colloquial Persian Dataset" (LSCP) is hierarchically organized in a semantic taxonomy that focuses on multi-task informal Persian language understanding as a...

ORAL2013: balanced corpus of informal spoken Czech (transcriptions)

ORAL2013 is designed as a representation of authentic spoken Czech used in informal situations (private environment, spontaneity, unpreparedness etc.) in the area of the whole...

Prague Dependency Treebank of Spoken Language (PDTSL) 0.5

The first edition of a speech corpus with a speech reconstruction layer (edited transcript). The project of speech reconstruction of Czech and English was started at UFAL...

25 datasets found