Dataset - B2FIND

Read Speech Corpus (7G)

The corpus of read Lithuanian speech „7G“ was compiled in 2015-2016. The corpus consists of 352 audio recordings with a total duration of over 7 hours. Seven different speakers...

Clarin-PL Studio Corpus (EMU)

Polish speech corpus of read speech recorded in a studio. Contains many speakers, each reading a few dozen different sentences and a list of words with rare phonemes. Useful for...

Speech tools plugin for Annotation Pro

This resource describes the Annotation Pro plugin containing various tools for automatic processing of speech data. The initial tool provides only a speech aligner, but more are...

Clarin-PL Studio Corpus (EMU;updated phonetics)

Polish speech corpus of read speech recorded in a studio. Contains many speakers, each reading a few dozen different sentences and a list of words with rare phonemes. Useful for...

Cyfry

A small spoken digits corpus in Polish. Contains 488 recordings of 25 speakers reading 20 digits (0-9) each. Amounts to around 76 minutes of recordings. Split into train (~72%),...

EU Parliament Speech corpus

A collection of 1040 EU parliament speeches with transcription and annotations. Includes original speeches and PL/EN translations.

Clarin-PL Mobile Corpus (EMU)

Polish speech corpus of read speech recorded over the phone. Contains many speakers, each reading a few dozen different sentences and a list of words with rare phonemes. Useful...

Business English learner speech corpus SAPS

SAPS is a specialized speech corpus which contains business meeting simulations in English between undergraduate students of Languages for Business and Economics at the School...

ParCzech 3.0

The ParCzech 3.0 corpus is the third version of ParCzech consisting of stenographic protocols that record the Chamber of Deputies’ meetings held in the 7th term (2013-2017) and...

Corpus bilingüe d’alternança de llengües (codeswitching)

8 interactive recordings of group dynamics. Bilingual speakers (L1 -> English; L1 -> Catalan/Spanish).

Czech Malach Cross-lingual Speech Retrieval Test Collection

The package contains Czech recordings from the Visual History Archive, which consists of interviews with Holocaust survivors. The archive consists of audio recordings, four...

Oasis Numbers

Spoken, monolingual, manually segmented, domain-specific corpus of numbers; 5857 recorded words.

English TTS speech corpus of air traffic (pilot) messages - Czech accent

The corpus contains recordings of a male speaker, a native speaker of Czech, talking in English. The sentences read by the speaker originate in the domain of air traffic control...

Vystadial 2016 – Czech data

This is the Czech data collected during the VYSTADIAL project. It is an extension of the 'Vystadial 2013' Czech part data release. The dataset comprises telephone...

English TTS speech corpus of air traffic (pilot) messages - Taiwanese accent

The corpus contains recordings of a male speaker, a native speaker of Taiwanese, talking in English. The sentences read by the speaker originate in the domain of air traffic...

Phonetic Corpus of Estonian Spontaneous Speech (online search engine)

Studio recordings of spontaneous Estonian segmented phonetically on word, sound, and other linguistic levels. Current size about 22 hours of speech, 155 000 words. Online search...

STAZKA – Speech recordings from vehicles

The database contains two sets of recordings, both recorded in moving or stationary vehicles (passenger cars or trucks). All data were recorded within the project...

UFAL Speech Corpus of North Levantine Arabic 1.0 - Part 2

The corpus contains recordings by native speakers of North Levantine Arabic (apc) acquired during 2020, 2021, and 2023 in Prague, Paris, Kabardia, and St. Petersburg....

ORAL2013: balanced corpus of informal spoken Czech (transcriptions & audio)

ORAL2013 is designed as a representation of authentic spoken Czech used in informal situations (private environment, spontaneity, unpreparedness etc.) in the area of the whole...

Database of speech corpora of Czech laryngectomy patients

The corpus contains Czech speech of laryngectomy patients, recorded before the surgery that causes the loss of their voice, in order to preserve the voice, which can later be used for...

39 datasets found