Dataset - B2FIND

Spoken corpus Gos 2.0 (transcriptions)

The spoken corpus Gos 2.0 is the reference speech corpus of the Slovenian language. This second edition contains about 300 hours of speech, or 2.4 million words, 127 thousand...

Spoken corpus Gos VideoLectures 4.2 (transcription)

Gos VideoLectures is an add-on to the Gos reference corpus of spoken Slovene (http://hdl.handle.net/11356/1040), and covers public academic speech. It can be used for training...

Spoken corpus Gos 1.1

Gos is a corpus of spoken Slovene that includes the transcripts of approximately 120 hours of speech recorded in various situations: radio and TV shows, school lessons and...

Spoken Torlak dialect corpus 1.0 (transcription)

The Torlak corpus represents a spoken variety of the endangered Torlak dialect from the Timok area in Southeast Serbia. It comprises transcripts of interviews with the local...

Spoken corpus Gos VideoLectures 4.0 (audio)

Gos VideoLectures is an add-on to the Gos reference corpus of spoken Slovene (http://hdl.handle.net/11356/1040), and covers public academic speech. The Gos VideoLectures corpus...

ASR database ARTUR 0.1 (transcriptions)

ARTUR is a speech database designed for the needs of automatic speech recognition for the Slovenian language. The database includes 1,035 hours of speech, although only 840...

ASR database ARTUR 0.1 (audio)

ARTUR is a speech database designed for the needs of automatic speech recognition for the Slovenian language. The database includes 1,035 hours of speech, although only 840...

ASR database ARTUR 1.0 (transcriptions)

ARTUR 1.0 is a speech database designed for the needs of developing automatic speech recognition for the Slovenian language. The complete database includes 1,067 hours of...

TED-ELH Parallel Corpus (ELEXIS)

The corpus contains aligned parallel scripts of TED Talks in English, Lithuanian, and Hebrew. It consists of spoken-language data. See also: http://hdl.handle.net/20.500.11821/34

Frequency lists of character-level n-grams from the GOS 1.0 corpus 1.1

Frequency lists of character-level n-grams were extracted from the GOS 1.0 Corpus of Spoken Slovene (http://hdl.handle.net/11356/1040) using the LIST corpus extraction tool...

Spoken corpus Gos VideoLectures 4.1 (transcription)

Gos VideoLectures is an add-on to the Gos reference corpus of spoken Slovene (http://hdl.handle.net/11356/1040), and covers public academic speech. It can be used for training...

Spoken corpus Gos VideoLectures 4.0 (transcription)

Gos VideoLectures is an add-on to the Gos reference corpus of spoken Slovene (http://hdl.handle.net/11356/1040), and covers public academic speech. The Gos VideoLectures corpus...

Spoken corpus Gos VideoLectures 3.0 (transcription)

Gos VideoLectures is an add-on to the Gos reference corpus of spoken Slovene (http://hdl.handle.net/11356/1040), and covers public academic speech. The Gos VideoLectures corpus...

Spoken corpus Gos VideoLectures 2.0 (audio)

Gos VideoLectures is an add-on to the Gos reference corpus of spoken Slovene (http://hdl.handle.net/11356/1040), and covers public academic speech. The Gos VideoLectures corpus...

SNABI database for continuous speech recognition 1.2

The SNABI speech database can be used to train continuous speech recognition for the Slovene language. The database comprises 1530 sentences, 150 words and the alphabet. 132...

Spoken corpus Gos 1.0

GOS is a corpus of spoken Slovene that includes the transcripts of approximately 120 hours of speech recorded in various situations: radio and TV shows, school lessons and...

Frequency lists of word-level n-grams from the GOS 1.0 corpus

Frequency lists of word-level n-grams (or word sets) were extracted from the GOS 1.0 Corpus of Spoken Slovene (http://hdl.handle.net/11356/1040) using the LIST corpus extraction...

Spoken corpus Gos VideoLectures 3.0 (audio)

Gos VideoLectures is an add-on to the Gos reference corpus of spoken Slovene (http://hdl.handle.net/11356/1040), and covers public academic speech. The Gos VideoLectures corpus...

Speech Database of Spoken Flight Information Enquiries SOFES 1.0

The SOFES speech database (Spoken Flight Enquiries in Slovene) is a collection of transcribed and segmented audio recordings of spoken flight-information enquiries in Slovene....

Dialogue act annotated spoken corpus GORDAN 1.0 (transcription)

The GORDAN 1.0 corpus contains authentic data of spoken communication, annotated for dialogue acts according to the GORDAN 1.0 dialogue act annotation scheme, included in the...

39 datasets found