The ParlaSpeech-HR dataset is built from the transcripts of parliamentary proceedings in the Croatian part of the ParlaMint corpus and from the parliamentary recordings published on the Croatian Parliament's YouTube channel. The corpus consists of audio segments, each corresponding to a specific sentence in the transcripts. The transcripts contain word-level alignments to the recordings, so long sentences can easily be split further into shorter segments for ASR and other memory-sensitive applications. Each segment is linked to the ParlaMint 4.0 corpus (http://hdl.handle.net/11356/1859) via utterance IDs and character offsets. All speaker information from the ParlaMint corpus is available under the "speaker_info" key.
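The further segmentation mentioned above could look roughly like the following minimal sketch. It assumes each aligned word is a dictionary with the keys "word", "start" and "end" (times in seconds); these key names, the function names and the sample values are hypothetical illustrations, not the dataset's documented schema. Words are grouped greedily so that no chunk exceeds a chosen maximum duration.

```python
from typing import Dict, List, Union

# Hypothetical word record: {"word": str, "start": float, "end": float}
Word = Dict[str, Union[str, float]]


def split_by_duration(words: List[Word], max_seconds: float) -> List[List[Word]]:
    """Greedily group consecutive words into chunks of at most max_seconds.

    A single word longer than max_seconds still forms its own chunk.
    """
    chunks: List[List[Word]] = []
    current: List[Word] = []
    for w in words:
        # Close the current chunk if adding this word would make it too long.
        if current and w["end"] - current[0]["start"] > max_seconds:
            chunks.append(current)
            current = []
        current.append(w)
    if current:
        chunks.append(current)
    return chunks


def summarize(chunk: List[Word]) -> Dict[str, Union[str, float]]:
    """Return the text and the time span covered by one chunk."""
    return {
        "text": " ".join(str(w["word"]) for w in chunk),
        "start": chunk[0]["start"],
        "end": chunk[-1]["end"],
    }


# Made-up sample alignment, for illustration only.
sample = [
    {"word": "a", "start": 0.0, "end": 1.0},
    {"word": "b", "start": 1.0, "end": 2.0},
    {"word": "c", "start": 2.0, "end": 3.5},
    {"word": "d", "start": 3.5, "end": 4.0},
]
segments = [summarize(c) for c in split_by_duration(sample, max_seconds=2.0)]
```

The resulting start and end times can then be used to cut the corresponding audio. Splitting at pauses or punctuation rather than at a fixed duration may give more natural boundaries, at the cost of less uniform segment lengths.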
The main differences from version 1.0 of the dataset are:
- larger size (this version is built on ParlaMint 4.0, while version 1.0 used ParlaMint 2.1)
- improved matching pipeline
- segments based on linguistically sound sentences from the ParlaMint transcripts, whereas version 1.0 used segments delimited by silence