Parliamentary spoken corpus of Czech ParlaSpeech-CZ 1.0

The ParlaSpeech-CZ dataset is built from the transcripts of parliamentary proceedings available in the Czech part of the ParlaMint corpus and from the parliamentary recordings available in the AudioPSP dataset (http://hdl.handle.net/11234/1-5404). The corpus consists of audio segments that correspond to specific sentences in the transcripts. The transcripts contain word-level alignments to the recordings, so long sentences can easily be split further into shorter segments for ASR and other memory-sensitive applications. Each segment references the ParlaMint 4.0 corpus (http://hdl.handle.net/11356/1859) via utterance IDs and character offsets. All speaker information from the ParlaMint corpus is available via the "speaker_info" key.
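
The further segmentation enabled by the word-level alignments could look roughly like the Python sketch below. It is only an illustration under stated assumptions: the "time_s" and "time_e" field names, the word IDs, and the timing values are hypothetical and should be checked against the actual files before use.

```python
from typing import Dict, List


def split_by_duration(words: List[Dict], max_seconds: float) -> List[List[Dict]]:
    """Greedily group consecutive aligned words into chunks of at most max_seconds."""
    chunks: List[List[Dict]] = []
    current: List[Dict] = []
    for word in words:
        # Close the current chunk if adding this word would exceed the limit.
        if current and word["time_e"] - current[0]["time_s"] > max_seconds:
            chunks.append(current)
            current = []
        current.append(word)
    if current:
        chunks.append(current)
    return chunks


# Hypothetical word alignments for one long sentence (times in seconds).
# "time_s" / "time_e" are assumed field names, not confirmed by the record.
words = [
    {"id": "w1", "time_s": 0.0, "time_e": 0.4},
    {"id": "w2", "time_s": 0.5, "time_e": 1.1},
    {"id": "w3", "time_s": 1.2, "time_e": 2.0},
    {"id": "w4", "time_s": 2.1, "time_e": 2.6},
]

chunks = split_by_duration(words, max_seconds=1.5)
# Start and end time of each shorter segment, usable for cutting the audio.
spans = [(chunk[0]["time_s"], chunk[-1]["time_e"]) for chunk in chunks]
```

The greedy strategy keeps every word whole and never reorders them; a single word longer than the limit simply ends up in a chunk of its own.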

Unlike other ParlaSpeech datasets, each instance in this dataset has an additional "sentence_id" key referring to the ParlaMint sentence ID, and each word description has an additional "id" key referring to the ParlaMint word ID. This is because the original ParlaMint sentence and word segmentation was kept in this dataset, owing to a different, centralised processing approach. The "audio_source" key is also available; it points to the original audio recording in the AudioPSP dataset.
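
As a rough illustration of how these keys might be used, the Python sketch below groups sentence IDs by their source recording. It assumes that each instance is stored as one JSON object per line and that the word descriptions sit under a "words" key; both are assumptions, and all IDs and file names shown are made-up placeholders rather than real corpus values.

```python
import json
from collections import defaultdict

# Hypothetical instances, one JSON object per line (assumed storage layout).
# Only "sentence_id", "audio_source" and the per-word "id" key come from the
# dataset description; "words" and every value below are placeholders.
lines = [
    '{"sentence_id": "s1", "audio_source": "rec_a", "words": [{"id": "s1.w1"}, {"id": "s1.w2"}]}',
    '{"sentence_id": "s2", "audio_source": "rec_a", "words": [{"id": "s2.w1"}]}',
    '{"sentence_id": "s3", "audio_source": "rec_b", "words": [{"id": "s3.w1"}]}',
]

sentences_by_source = defaultdict(list)
word_ids = []
for line in lines:
    instance = json.loads(line)
    # Group ParlaMint sentence IDs by the original recording they come from.
    sentences_by_source[instance["audio_source"]].append(instance["sentence_id"])
    # Collect ParlaMint word IDs from the per-word descriptions.
    word_ids.extend(word["id"] for word in instance["words"])
```

Grouping by "audio_source" in this way would let the segments be mapped back to the full recordings they were cut from.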

Identifier
PID http://hdl.handle.net/11356/1785
Related Identifier https://aclanthology.org/2022.parlaclarin-1.16
Related Identifier https://link.springer.com/chapter/10.1007/978-3-030-83527-9_25
Related Identifier http://hdl.handle.net/11356/1337
Related Identifier https://www.clarin.eu/parlamint
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1785
Provenance
Creator Kopp, Matyáš; Ljubešić, Nikola
Publisher Jožef Stefan Institute
Publication Year 2024
Rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Czech
Resource Type corpus
Format text/plain; charset=utf-8; application/gzip; application/octet-stream; downloadable_files_count: 5
Discipline Linguistics