ASR database ARTUR 0.1 (audio)

Dataset

ARTUR is a speech database designed for the needs of automatic speech recognition (ASR) for the Slovenian language. The database includes 1,035 hours of speech, of which 840 hours are transcribed and the remaining 195 hours have no transcription. The data is divided into 4 parts:
(1) approx. 520 hours of read speech: readings of pre-defined sentences selected from the Gigafida 2.0 corpus (http://hdl.handle.net/11356/1320); each sentence is contained in one file; the speakers are demographically balanced; spelling is included in special files; all recordings are manually transcribed;
(2) approx. 204 hours of public speech: media recordings, online recordings of conferences, workshops, educational videos, etc.; 56 hours are manually transcribed;
(3) approx. 110 hours of private speech: monologues and dialogues between two persons, recorded for the purposes of the speech database; the speakers are demographically balanced; two subsets for domain-specific ASR (smart-home and face-description) are included; 63 hours are manually transcribed;
(4) approx. 201 hours of parliamentary speech: recordings from the Slovene National Assembly, all manually transcribed.
Audio files are WAV, 44.1 kHz, 16-bit PCM, mono.

This entry includes the recordings only; transcriptions are available at http://hdl.handle.net/11356/1718.
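
The stated audio format (WAV, 44.1 kHz, 16-bit PCM, mono) can be checked on a downloaded file before it is fed into an ASR pipeline. The following is a minimal sketch using Python's standard wave module; the function names and the EXPECTED dictionary are illustrative assumptions and are not part of the ARTUR distribution.

```python
import sys
import wave

# Format stated in the dataset description: 44.1 kHz, 16-bit PCM, mono.
# (16-bit samples are 2 bytes wide.)
EXPECTED = {"sample_rate_hz": 44100, "sample_width_bytes": 2, "channels": 1}


def describe_wav(path):
    """Read the header fields of a PCM WAV file (hypothetical helper)."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "sample_width_bytes": wav.getsampwidth(),
            "channels": wav.getnchannels(),
            "duration_s": wav.getnframes() / rate,
        }


def matches_stated_format(info):
    """Return True if the header fields agree with EXPECTED."""
    return all(info[key] == value for key, value in EXPECTED.items())


if __name__ == "__main__":
    # Usage: python check_wav.py file1.wav file2.wav ...
    for wav_path in sys.argv[1:]:
        info = describe_wav(wav_path)
        status = "ok" if matches_stated_format(info) else "unexpected format"
        print(f"{wav_path}: {status}, {info['duration_s']:.2f} s")
```

Summing duration_s over all files of one part would also give a rough check against the approximate hour counts listed above.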

Identifier
PID	http://hdl.handle.net/11356/1717
Related Identifier	http://hdl.handle.net/11356/1776
Related Identifier	https://slovenscina.eu/
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1717
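
The Metadata Access URL above is an OAI-PMH GetRecord request. As a minimal sketch, it can be assembled from the handle suffix as shown below; the function name oai_get_record_url is a hypothetical helper, and the endpoint and identifier pattern are simply copied from the URL in this record, not taken from repository documentation.

```python
from urllib.parse import urlencode

# Endpoint as it appears in the Metadata Access field of this record.
OAI_ENDPOINT = "http://www.clarin.si/repository/oai/request"


def oai_get_record_url(handle, metadata_prefix="oai_dc"):
    """Build an OAI-PMH GetRecord URL for a handle such as '11356/1717'."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        # Identifier pattern assumed from the URL in this record.
        "identifier": f"oai:www.clarin.si:{handle}",
    }
    # Keep ':' and '/' unescaped so the result matches the URL as listed.
    return f"{OAI_ENDPOINT}?{urlencode(params, safe=':/')}"


if __name__ == "__main__":
    print(oai_get_record_url("11356/1717"))
```

The response is an XML document with Dublin Core fields, which a standard XML parser can read.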

Provenance
Creator	Verdonik, Darinka; Bizjak, Andreja; Žgank, Andrej; Bernjak, Mitja; Antloga, Špela; Majhenič, Simona; Čakš, Peter; Pucer, Matevž; Cvetko, Mitja; Zelenik, Marijana; Pavlič, Jani; Dobrišek, Simon; Križaj, Janez; Strle, Gregor; Ivanovska, Marija; Grm, Klemen; Bajec, Marko; Lebar Bajec, Iztok; Jelovšek, Tjaša; Lokovšek, Jure; Longyka, Jure; Trojar, Mitja; Žganec Gros, Jerneja; Mihelič, Aleš; Vesnicer, Boštjan; Dretnik, Naum; Bordon, David
Publisher	Faculty of Electrical Engineering and Computer Science, University of Maribor; Faculty of Electrical Engineering, University of Ljubljana; Faculty of Computer and Information Science, University of Ljubljana; Alpineon d.o.o.; STA
Publication Year	2022
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	corpus
Format	text/plain; charset=utf-8; text/plain; application/octet-stream; downloadable_files_count: 36
Discipline	Linguistics