Parliamentary spoken corpus of Croatian ParlaSpeech-HR 2.0

The ParlaSpeech-HR dataset is built from the transcripts of parliamentary proceedings available in the Croatian part of the ParlaMint corpus and from the parliamentary recordings available on the Croatian Parliament's YouTube channel. The corpus consists of audio segments that correspond to specific sentences in the transcripts. The transcripts contain word-level alignments to the recordings, which makes it simple to split long sentences into shorter segments for ASR and other memory-sensitive applications. Each segment references the ParlaMint 4.0 corpus (http://hdl.handle.net/11356/1859) via utterance IDs and character offsets. All the speaker information from the ParlaMint corpus is available via the "speaker_info" key.
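
As an illustration of how word-level alignments might be used to shorten long segments, here is a minimal sketch in Python. It assumes each word is available as a (token, start, end) tuple with times in seconds. That layout, the function name split_by_duration, and the toy data are hypothetical and are not taken from the dataset's actual schema, which should be checked in the distributed files.

```python
from typing import List, Tuple

# Hypothetical layout: (token, start_seconds, end_seconds).
# The real field names in the dataset may differ.
Word = Tuple[str, float, float]


def split_by_duration(words: List[Word], max_seconds: float) -> List[List[Word]]:
    """Greedily group consecutive words into chunks spanning at most max_seconds.

    A single word longer than max_seconds is kept as its own chunk.
    """
    chunks: List[List[Word]] = []
    current: List[Word] = []
    for word in words:
        # Close the current chunk if adding this word would exceed the limit,
        # measured from the start of the chunk's first word.
        if current and word[2] - current[0][1] > max_seconds:
            chunks.append(current)
            current = []
        current.append(word)
    if current:
        chunks.append(current)
    return chunks


# Toy alignment invented for illustration only.
words: List[Word] = [
    ("dobar", 0.0, 0.6),
    ("dan", 0.7, 1.2),
    ("svima", 1.5, 2.0),
    ("ovdje", 2.1, 2.8),
]
chunks = split_by_duration(words, max_seconds=1.5)
for chunk in chunks:
    print(chunk[0][1], chunk[-1][2], " ".join(token for token, _, _ in chunk))
```

The start time of the first word and the end time of the last word in each chunk can then be used to cut the corresponding audio. A greedy split is only one option; splitting at the longest pause between words is a reasonable alternative.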

The main differences from version 1.0 of the dataset are:
- larger size (ParlaMint 4.0 is used here, while previously ParlaMint 2.1 was used)
- improved matching pipeline
- segments based on linguistically sound sentences from the ParlaMint transcripts, while previously segments surrounded by silence were used

Identifier
PID http://hdl.handle.net/11356/1914
Related Identifier https://aclanthology.org/2022.parlaclarin-1.16
Related Identifier http://hdl.handle.net/11356/1494
Related Identifier https://www.clarin.eu/content/parlamint-towards-comparable-parliamentary-corpora
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1914
Provenance
Creator Ljubešić, Nikola; Koržinek, Danijel; Rupnik, Peter
Publisher Jožef Stefan Institute
Publication Year 2024
Rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Croatian
Resource Type corpus
Format text/plain; charset=utf-8; application/gzip; application/octet-stream; text/plain; downloadable_files_count: 8
Discipline Linguistics