Parliamentary spoken corpus of Croatian ParlaSpeech-HR 2.0

Dataset

The ParlaSpeech-HR dataset is built from the transcripts of parliamentary proceedings available in the Croatian part of the ParlaMint corpus and from the parliamentary recordings available on the Croatian Parliament's YouTube channel. The corpus consists of audio segments that correspond to specific sentences in the transcripts. The transcripts contain word-level alignments to the recordings, which makes it simple to split long sentences into shorter segments for ASR and other memory-sensitive applications. Each segment references the ParlaMint 4.0 corpus (http://hdl.handle.net/11356/1859) via utterance IDs and character offsets. All the speaker information from the ParlaMint corpus is available via the "speaker_info" key.
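As an illustration of how word-level alignments could be used to cut a long sentence into shorter segments, the following is a minimal Python sketch. It does not reflect the actual file layout of the dataset: the `(token, start, end)` tuple representation, the function name `split_by_duration`, and the example words and timestamps are all hypothetical and chosen only for demonstration.

```python
from typing import List, Tuple

# Hypothetical layout: (token, start_sec, end_sec). The real dataset may
# store word alignments differently.
Word = Tuple[str, float, float]


def split_by_duration(words: List[Word], max_seconds: float) -> List[List[Word]]:
    """Greedily group consecutive words into chunks no longer than max_seconds.

    A single word longer than max_seconds is kept as its own chunk.
    """
    chunks: List[List[Word]] = []
    current: List[Word] = []
    for word in words:
        # Close the current chunk if adding this word would exceed the limit,
        # measured from the start of the chunk's first word.
        if current and word[2] - current[0][1] > max_seconds:
            chunks.append(current)
            current = []
        current.append(word)
    if current:
        chunks.append(current)
    return chunks


# Made-up alignment data, for illustration only.
example: List[Word] = [
    ("Poštovani", 0.0, 0.6),
    ("zastupnici", 0.7, 1.4),
    ("otvaram", 1.6, 2.1),
    ("sjednicu", 2.2, 2.9),
    ("Sabora", 3.0, 3.5),
]

chunks = split_by_duration(example, max_seconds=1.5)
for chunk in chunks:
    start, end = chunk[0][1], chunk[-1][2]
    print(f"{start:.1f}-{end:.1f}s:", " ".join(w[0] for w in chunk))
```

The start and end times of each chunk can then be used to cut the corresponding audio. A greedy split is the simplest choice; splitting at the longest inter-word pause would be a reasonable alternative.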

The main differences from version 1.0 of the dataset are:
- larger size (ParlaMint 4.0 is used here, whereas version 1.0 used ParlaMint 2.1)
- an improved matching pipeline
- segments based on linguistically sound sentences from the ParlaMint transcripts, whereas version 1.0 used segments delimited by silence

Identifier
PID	http://hdl.handle.net/11356/1914
Related Identifier	https://aclanthology.org/2022.parlaclarin-1.16
Related Identifier	http://hdl.handle.net/11356/1494
Related Identifier	https://www.clarin.eu/content/parlamint-towards-comparable-parliamentary-corpora
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1914

Provenance
Creator	Ljubešić, Nikola; Koržinek, Danijel; Rupnik, Peter
Publisher	Jožef Stefan Institute
Publication Year	2024
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Croatian
Resource Type	corpus
Format	text/plain; charset=utf-8; application/gzip; application/octet-stream; text/plain; downloadable_files_count: 8
Discipline	Linguistics