Parliamentary corpus of first Yugoslavia (1919-1939) yu1Parl 1.0

The corpus contains meeting proceedings of the National Representation of the Kingdom of Yugoslavia from 1919 to 1939 (Zbirka stenografskih beležk, zapisnikov sej predstavništev, senata in skupščine Kraljevine Jugoslavije 1919-1939), in particular:
- Temporary National Representation of the Kingdom of Serbs, Croats, and Slovenes (1919-1920)
- Legislative Committee of the National Assembly of the Kingdom of Serbs, Croats, and Slovenes (1921-1922)
- National Representation (National Assembly and Senate) of the Kingdom of Yugoslavia (1931-1939)

The meeting proceedings of the National Assembly of the Kingdom of Serbs, Croats, and Slovenes for the years 1923 to 1928 are not available and are therefore not included in the corpus.

The corpus comprises 714 sessions (15403 pages, approximately 13 million words).

The source data (scanned images of printed Stenographic Minutes) come from the History of Slovenia - SIstory (https://www.sistory.si) portal. The images were OCR-processed and the results saved as PDF, DOCX and TXT files. The documents are multilingual, in Serbo-Croatian and Slovenian, depending on the speaker. Serbo-Croatian is typeset in either the Cyrillic (Serbian) or the Latin (Croatian) alphabet.

The documents were automatically processed and the following data were extracted: titles, agenda, attendees, start and end of the session, speakers, and comments. Lingua (https://github.com/pemistahl/lingua-py) was used for language detection at the sentence level. Roughly 59% of sentences are in Serbian (Cyrillic script), 38% in Croatian (Latin script) and 3% in Slovenian. Some sentences in German and French were also detected. Linguistic annotation (tokenisation, MSD tagging and lemmatisation) was added using CLASSLA (https://github.com/clarinsi/classla) for Serbian, Croatian and Slovenian. Words in Serbian (Cyrillic script) have lemmas in Latin script.
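
The script distinction described above can be illustrated with a minimal standard-library sketch. It is not the Lingua or CLASSLA pipeline used for the corpus: the function names dominant_script and to_latin are hypothetical, the mapping is the standard Serbian Cyrillic to Latin correspondence, and script alone cannot separate Croatian from Slovenian, so this is only a rough first pass.

```python
import unicodedata

# Standard Serbian Cyrillic -> Latin correspondence (lower-case letters).
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}


def dominant_script(sentence):
    """Return 'cyrillic' or 'latin' by letter count, 'other' if no majority."""
    counts = {"cyrillic": 0, "latin": 0}
    for ch in sentence:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name.startswith("CYRILLIC"):
                counts["cyrillic"] += 1
            elif name.startswith("LATIN"):
                counts["latin"] += 1
    if counts["cyrillic"] == counts["latin"]:
        return "other"  # no letters at all, or an exact tie
    return max(counts, key=counts.get)


def to_latin(text):
    """Transliterate Serbian Cyrillic letters, leaving everything else as is."""
    out = []
    for ch in text:
        lower = ch.lower()
        if lower in CYR_TO_LAT:
            lat = CYR_TO_LAT[lower]
            # Upper-case input gets an initial capital only (Љ -> Lj).
            out.append(lat.capitalize() if ch != lower else lat)
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    for s in ["Народна скупштина", "Narodna skupština", "1919."]:
        print(dominant_script(s), "|", to_latin(s))
```

Words written entirely in capitals would come out with mixed case for the digraphs (Lj, Nj, Dž), which a production transliterator would need to handle separately.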

The documents are in the Parla-CLARIN (https://github.com/clarin-eric/parla-clarin) compliant TEI XML format. Each session is stored in its own file.
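
As a rough illustration of reading such a file, the sketch below extracts speaker, language and text from a small inline sample using only the Python standard library. The sample is hypothetical: the u/seg layout, the who values and the placement of xml:lang on seg are assumptions based on general Parla-CLARIN practice and have not been checked against the corpus files. The TEI namespace is the standard one.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical fragment; real session files carry a full teiHeader
# and token-level annotation.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div>
        <u who="#SpeakerA">
          <seg xml:lang="sr">Отварам седницу.</seg>
        </u>
        <u who="#SpeakerB">
          <seg xml:lang="sl">Prosim za besedo.</seg>
          <seg xml:lang="hr">Hvala.</seg>
        </u>
      </div>
    </body>
  </text>
</TEI>"""


def utterances(root):
    """Return (speaker, language, text) tuples, one per seg inside each u."""
    rows = []
    for u in root.iterfind(".//tei:u", TEI_NS):
        for seg in u.iterfind("tei:seg", TEI_NS):
            # itertext() also collects text nested in child elements.
            text = "".join(seg.itertext()).strip()
            rows.append((u.get("who"), seg.get(XML_LANG), text))
    return rows


rows = utterances(ET.fromstring(SAMPLE))

if __name__ == "__main__":
    for who, lang, text in rows:
        print(who, lang, text)
```

For a real session file, ET.parse(path).getroot() would replace ET.fromstring(SAMPLE), and the element paths may need adjusting to the actual markup.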

Identifier
PID http://hdl.handle.net/11356/1845
Related Identifier https://www.inz.si/en/dihur/
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1845
Provenance
Creator Kavčič, Alenka; Mundjar, Aleksander; Marolt, Matija
Publisher Faculty of Computer and Information Science, University of Ljubljana
Publication Year 2023
Rights Creative Commons - Attribution 4.0 International (CC BY 4.0); https://creativecommons.org/licenses/by/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Serbian; Croatian; Slovenian; Slovene
Resource Type corpus
Format text/plain; charset=utf-8; application/zip; downloadable_files_count: 3
Discipline Linguistics
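
The Metadata Access URL listed above is an OAI-PMH GetRecord request for Dublin Core (oai_dc) metadata. The following minimal sketch shows how that URL is composed and how Dublin Core fields could be read from a response. The helper names get_record_url and dc_fields are hypothetical, and the sample response is a hand-written fragment built from values in this record, not actual server output; no network request is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_BASE = "http://www.clarin.si/repository/oai/request"
DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}


def get_record_url(handle):
    """Build the GetRecord URL for a handle such as '11356/1845'."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": "oai_dc",
        "identifier": f"oai:www.clarin.si:{handle}",
    }
    # Keep ':' and '/' unescaped so the result matches the listed URL.
    return f"{OAI_BASE}?{urlencode(params, safe=':/')}"


def dc_fields(xml_text, field):
    """Return the text of every dc:<field> element in an XML response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iterfind(f".//dc:{field}", DC_NS)]


# Hand-written fragment for illustration only.
SAMPLE_RESPONSE = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:creator>Kavčič, Alenka</dc:creator>
  <dc:creator>Mundjar, Aleksander</dc:creator>
  <dc:creator>Marolt, Matija</dc:creator>
  <dc:identifier>http://hdl.handle.net/11356/1845</dc:identifier>
</oai_dc:dc>"""

if __name__ == "__main__":
    print(get_record_url("11356/1845"))
    print(dc_fields(SAMPLE_RESPONSE, "creator"))
```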