The siParl 4.0 corpus contains minutes of the Assembly of the Republic of Slovenia for the 11th legislative period (1990-1992), minutes of the National Assembly of the Republic of Slovenia from the 1st to the 8th legislative period (1992-2022), minutes of the working bodies of the National Assembly of the Republic of Slovenia from the 2nd to the 8th legislative period (1996-2022), and minutes of the Council of the President of the National Assembly from the 2nd to the 8th legislative period (1996-2022). The corpus comprises over 13 thousand sessions, one million speeches and 230 million words. The corpus is encoded according to the Parla-CLARIN schema (https://github.com/clarin-eric/parla-clarin). Each mandate (legislative period) is stored in its own directory, and each session in its own file.
Compared with the previous version 3.0, this version adds new data (minutes of the National Assembly of the Republic of Slovenia for the 8th legislative period) and corrects many errors.
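Because each mandate is in one directory and each session in one file, a session inventory can be built with a simple directory walk. The following is a minimal sketch, not part of the distribution: the function name `sessions_by_mandate`, the corpus root path, and the `.xml` file extension are assumptions made for illustration, and the actual directory and file names should be checked against the unpacked dataset.

```python
from pathlib import Path


def sessions_by_mandate(corpus_root, suffix=".xml"):
    """Map each mandate directory name to a sorted list of its session file names.

    Assumes (hypothetically) that session files sit directly inside the
    mandate directories and share the given suffix.
    """
    root = Path(corpus_root)
    inventory = {}
    # Only sub-directories are treated as mandates; files at the root are skipped.
    for mandate_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        inventory[mandate_dir.name] = sorted(
            f.name
            for f in mandate_dir.iterdir()
            if f.is_file() and f.suffix == suffix
        )
    return inventory
```

Calling `sessions_by_mandate("path/to/corpus")` returns a dictionary whose keys are the mandate directory names, so `sum(len(v) for v in result.values())` gives the number of session files found.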
This item comprises the following datasets:
1. source DARIAH-SI Parla-CLARIN encoded corpus in TEI format;
2. linguistically annotated Parla-CLARIN encoded corpus: tokenisation, MSD tagging, lemmatisation, Universal Dependencies features and syntactic parses, named entities;
3. automatically derived corpus in plain text with metadata on speeches;
4. automatically derived linguistically annotated corpus in CoNLL-U (Universal Dependencies) format with metadata on speeches;
5. automatically derived linguistically annotated corpus in the vertical format used by the CWB and Sketch Engine concordancers, together with the registry file as used on the CLARIN.SI concordancers.
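The CoNLL-U dataset (item 4) can be read without special tooling, since the format is a plain-text table of ten tab-separated columns with `#` comment lines and blank lines between sentences. The sketch below relies only on that general CoNLL-U layout; the function name `read_conllu` is hypothetical, and how the speech metadata are attached to the files is not assumed here and should be checked in the files themselves.

```python
# Column names as defined by the CoNLL-U format (Universal Dependencies).
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def read_conllu(lines):
    """Yield (comments, tokens) for each sentence in an iterable of CoNLL-U lines.

    comments is a list of comment strings without the leading '#';
    tokens is a list of dicts keyed by CONLLU_FIELDS, with string values.
    """
    comments, tokens = [], []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                yield comments, tokens
            comments, tokens = [], []
        elif line.startswith("#"):
            comments.append(line[1:].strip())
        else:
            cols = line.split("\t")
            if len(cols) != len(CONLLU_FIELDS):
                raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
            tokens.append(dict(zip(CONLLU_FIELDS, cols)))
    # Emit the last sentence if the input does not end with a blank line.
    if tokens:
        yield comments, tokens
```

Passing an open file object works directly, for example `for comments, tokens in read_conllu(open(path, encoding="utf-8")): ...`, and the lemmas of a sentence are then `[t["lemma"] for t in tokens]`.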