Parliamentary spoken corpus of Czech ParlaSpeech-CZ 1.0

Dataset

The ParlaSpeech-CZ dataset is built from the transcripts of parliamentary proceedings available in the Czech part of the ParlaMint corpus and the parliamentary recordings available from the AudioPSP dataset (http://hdl.handle.net/11234/1-5404). The corpus consists of audio segments that correspond to specific sentences in the transcripts. The transcripts contain word-level alignments to the recordings, which makes it simple to split long sentences into shorter segments for ASR and other memory-sensitive applications. Each segment references the ParlaMint 4.0 corpus (http://hdl.handle.net/11356/1859) via utterance IDs and character offsets. All speaker information from the ParlaMint corpus is available via the "speaker_info" key.
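
As a rough illustration of how the word-level alignments could be used to shorten long sentences, the sketch below greedily groups consecutive words into chunks of bounded duration. The field names "text", "time_s" and "time_e" (start and end time in seconds) and the sample timings are assumptions made for this example; they are not taken from the dataset documentation, so check the actual files before relying on them.

```python
from typing import Dict, List


def split_by_duration(words: List[Dict], max_seconds: float) -> List[List[Dict]]:
    """Greedily group consecutive words so that each chunk spans at most
    max_seconds, measured from the start of its first word to the end of
    its last word. A single word longer than the limit forms its own chunk."""
    chunks: List[List[Dict]] = []
    current: List[Dict] = []
    for word in words:
        # Close the current chunk if adding this word would exceed the limit.
        if current and word["time_e"] - current[0]["time_s"] > max_seconds:
            chunks.append(current)
            current = []
        current.append(word)
    if current:
        chunks.append(current)
    return chunks


# Hypothetical word alignments; key names and timings are illustrative only.
words = [
    {"text": "Vážené", "time_s": 0.0, "time_e": 0.5},
    {"text": "paní", "time_s": 0.5, "time_e": 0.9},
    {"text": "poslankyně", "time_s": 0.9, "time_e": 1.6},
    {"text": "a", "time_s": 1.7, "time_e": 1.8},
    {"text": "páni", "time_s": 1.8, "time_e": 2.2},
    {"text": "poslanci", "time_s": 2.2, "time_e": 2.9},
]

chunk_texts = [[w["text"] for w in chunk] for chunk in split_by_duration(words, 1.5)]
print(chunk_texts)
```

Each chunk's first "time_s" and last "time_e" can then serve as cut points for the corresponding audio segment. Greedy grouping is only one option; splitting at the longest pause between words would be an equally reasonable choice.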

Unlike other ParlaSpeech datasets, each instance in this dataset has an additional "sentence_id" key referring to the ParlaMint sentence ID, and the description of each word has an additional "id" key referring to the ParlaMint word ID. This is because the original ParlaMint sentence and word segmentation was kept in this dataset, owing to a different, centralised processing approach. In addition, an "audio_source" key points to the original audio recording in the AudioPSP dataset.
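
The following sketch shows one way these keys might be used to link instances back to ParlaMint and AudioPSP. The keys "sentence_id", "audio_source" and the per-word "id" come from the description above; the "words" list key, the overall record layout, and all identifier values are placeholders invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def word_ids(record: Dict) -> List[str]:
    """Return the ParlaMint word IDs of one instance, in order.
    Assumes the words are stored under a hypothetical "words" key."""
    return [word["id"] for word in record["words"]]


def sentences_by_audio(records: List[Dict]) -> Dict[str, List[str]]:
    """Group ParlaMint sentence IDs by the original recording they come from."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for record in records:
        grouped[record["audio_source"]].append(record["sentence_id"])
    return dict(grouped)


# Placeholder records; the real identifiers and layout will differ.
records = [
    {"sentence_id": "sent-1", "audio_source": "recording-A",
     "words": [{"id": "sent-1.w1"}, {"id": "sent-1.w2"}]},
    {"sentence_id": "sent-2", "audio_source": "recording-A",
     "words": [{"id": "sent-2.w1"}]},
    {"sentence_id": "sent-3", "audio_source": "recording-B",
     "words": [{"id": "sent-3.w1"}]},
]

first_ids = word_ids(records[0])
grouped = sentences_by_audio(records)
print(first_ids)
print(grouped)
```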

Identifier
PID	http://hdl.handle.net/11356/1785
Related Identifier	https://aclanthology.org/2022.parlaclarin-1.16
Related Identifier	https://link.springer.com/chapter/10.1007/978-3-030-83527-9_25
Related Identifier	http://hdl.handle.net/11356/1337
Related Identifier	https://www.clarin.eu/parlamint
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1785

Provenance
Creator	Kopp, Matyáš; Ljubešić, Nikola
Publisher	Jožef Stefan Institute
Publication Year	2024
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Czech
Resource Type	corpus
Format	text/plain; charset=utf-8; application/gzip; application/octet-stream; downloadable_files_count: 5
Discipline	Linguistics