Parliamentary corpus of first Yugoslavia (1919-1939) yu1Parl 1.0

Dataset

The corpus contains meeting proceedings of the National Representation of the Kingdom of Yugoslavia from 1919 to 1939 (Zbirka stenografskih beležk, zapisnikov sej predstavništev, senata in skupščine Kraljevine Jugoslavije 1919-1939), in particular:
- Temporary National Representation of the Kingdom of Serbs, Croats, and Slovenes (1919-1920)
- Legislative Committee of National Assembly of the Kingdom of Serbs, Croats, and Slovenes (1921-1922)
- National Representation (National Assembly and Senate) of the Kingdom of Yugoslavia (1931-1939)

The meeting proceedings of the National Assembly of the Kingdom of Serbs, Croats, and Slovenes from 1923 to 1928 are not available and are therefore not included in the corpus.

The corpus comprises 714 sessions (15403 pages, approximately 13 million words).

The source data (scanned images of printed Stenographic Minutes) come from the History of Slovenia - SIstory (https://www.sistory.si) portal. The images were processed with OCR and the results saved as PDF, DOCX and TXT files. The documents are multilingual, in Serbo-Croatian and Slovenian, depending on the speaker. Serbo-Croatian is typeset in either the Cyrillic (Serbian) or the Latin (Croatian) alphabet.

The documents were processed automatically and the following data were extracted: titles, agenda, attendees, start and end of the session, speakers, and comments. Lingua (https://github.com/pemistahl/lingua-py) was used for language detection at the sentence level. Roughly 59% of sentences are in Serbian (Cyrillic script), 38% in Croatian (Latin script) and 3% in Slovenian. Some sentences in German and French were also detected. Linguistic annotation (tokenisation, MSD tagging and lemmatisation) was added using CLASSLA (https://github.com/clarinsi/classla) for Serbian, Croatian and Slovenian. Words in Serbian (Cyrillic script) have lemmas in Latin script.
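
The Cyrillic/Latin split reported above can be illustrated with a small script check. This is only a sketch using the Python standard library, not the Lingua-based detection used for the corpus: it classifies a sentence by the script of most of its letters and cannot tell Croatian from Slovenian. The function names and the sample sentences are invented for illustration and are not taken from the corpus.

```python
import unicodedata
from collections import Counter


def dominant_script(sentence: str) -> str:
    """Return 'cyrillic', 'latin' or 'unknown' by majority of letters."""
    counts = Counter()
    for ch in sentence:
        if not ch.isalpha():
            continue  # skip digits, punctuation and whitespace
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            counts["cyrillic"] += 1
        elif name.startswith("LATIN"):
            counts["latin"] += 1
    if not counts:
        return "unknown"
    return counts.most_common(1)[0][0]


def script_shares(sentences):
    """Return the fraction of sentences per dominant script."""
    tally = Counter(dominant_script(s) for s in sentences)
    total = sum(tally.values())
    if total == 0:
        return {}
    return {script: n / total for script, n in tally.items()}


# Invented sample sentences (assumption: not quoted from the corpus).
sample = [
    "Господо народни посланици, отварам седницу.",
    "Gospodo narodni zastupnici, otvaram sjednicu.",
    "Gospodje poslanci, seja je odprta.",
    "1919-1939",
]

shares = script_shares(sample)
```

A script check of this kind is cheap and deterministic, but distinguishing Croatian from Slovenian (both in Latin script) needs a statistical detector such as Lingua.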

The documents are in the Parla-CLARIN (https://github.com/clarin-eric/parla-clarin) compliant TEI XML format. Each session is stored in its own file.
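
As a rough illustration of working with such files, the sketch below reads speaker references, word forms and lemmas from a TEI fragment using only the Python standard library. The element layout (`u` with a `who` attribute, containing `seg`, `s`, `w` and `pc`) follows common Parla-CLARIN practice and is an assumption about this corpus; the inline sample, the speaker identifier and the MSD values are made up for illustration.

```python
import xml.etree.ElementTree as ET

# TEI namespace used by Parla-CLARIN documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical fragment; real session files are much larger and richer.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><div>
    <u who="#speaker1" xml:lang="sl">
      <seg><s>
        <w lemma="seja" msd="Ncfsn">Seja</w>
        <w lemma="biti" msd="Va-r3s-n">je</w>
        <w lemma="odprt" msd="Agpfsn">odprta</w>
        <pc>.</pc>
      </s></seg>
    </u>
  </div></body></text>
</TEI>"""


def utterances(xml_text):
    """Yield (speaker reference, [(word form, lemma), ...]) per utterance."""
    root = ET.fromstring(xml_text)
    for u in root.iterfind(".//tei:u", TEI_NS):
        tokens = [
            (w.text, w.get("lemma"))
            for w in u.iterfind(".//tei:w", TEI_NS)
        ]
        yield u.get("who"), tokens


parsed = list(utterances(SAMPLE))
```

For a real session file, `ET.parse(path).getroot()` would replace `ET.fromstring`; the element and attribute names should be checked against the actual files before relying on them.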

Identifier
PID	http://hdl.handle.net/11356/1845
Related Identifier	https://www.inz.si/en/dihur/
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1845
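
The Metadata Access URL is an OAI-PMH request. A minimal sketch, using only the Python standard library, that splits it into its request parameters and checks that the record identifier ends in the same handle suffix as the PID. The helper name `oai_params` is hypothetical, and no network request is made.

```python
from urllib.parse import urlsplit, parse_qs

METADATA_URL = (
    "http://www.clarin.si/repository/oai/request"
    "?verb=GetRecord&metadataPrefix=oai_dc"
    "&identifier=oai:www.clarin.si:11356/1845"
)
PID = "http://hdl.handle.net/11356/1845"


def oai_params(url: str) -> dict:
    """Return the query parameters of an OAI-PMH request URL."""
    query = parse_qs(urlsplit(url).query)
    # Each parameter appears once, so keep the first value only.
    return {key: values[0] for key, values in query.items()}


params = oai_params(METADATA_URL)

# The part after the last colon of the OAI identifier is the handle suffix.
handle_suffix = params["identifier"].rsplit(":", 1)[1]
same_record = PID.endswith(handle_suffix)
```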

Provenance
Creator	Kavčič, Alenka; Mundjar, Aleksander; Marolt, Matija
Publisher	Faculty of Computer and Information Science, University of Ljubljana
Publication Year	2023
Rights	Creative Commons - Attribution 4.0 International (CC BY 4.0); https://creativecommons.org/licenses/by/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Serbian; Croatian; Slovenian; Slovene
Resource Type	corpus
Format	text/plain; charset=utf-8; application/zip; downloadable_files_count: 3
Discipline	Linguistics