The Ukrainian parliamentary corpus ParlaMint-UA 4.0.1 is an extended version of the ParlaMint-UA 4.0 corpus, which is available as a collection of plain texts with TSV metadata of the speeches (http://hdl.handle.net/11356/1859) and as a collection of speeches with added automatic linguistic annotations (http://hdl.handle.net/11356/1860). Both are part of the “ParlaMint: Towards Comparable Parliamentary Corpora” project by CLARIN ERIC (https://www.clarin.eu/parlamint).
ParlaMint-UA 4.0.1 contains plenary proceedings of the 4th, 5th, 6th, 7th, 8th and 9th terms of the Rada, held between 14 May 2002 and 10 November 2023. Ukrainian accounts for 94% of the tokens and Russian for 6%.
The transcripts are grouped by date, with information on the term, session and meeting, and contain speeches marked up with the speaker and their role (chair, regular speaker or guest). The speeches also contain marked-up transcriber comments, such as noise, applause, shouting, etc. The corpus has extensive metadata on speakers, including their name, year of birth (when available in open sources), gender, MP and minister status, and party affiliation (when known from open sources). It also has metadata on political parties, parliamentary factions and groups, including their name, left-to-right political orientation (sourced from Wikipedia, or manually encoded when absent from Wikipedia) and coalition/opposition status.
The corpus is encoded according to the Parla-CLARIN TEI recommendation (https://clarin-eric.github.io/parla-clarin/) and also follows the much stricter ParlaMint encoding guidelines (https://clarin-eric.github.io/ParlaMint/) and schemas.
The corpus comes in two versions. One version contains plain texts of plenary speeches. The other version contains the same plenary speeches with linguistic annotation: tokenisation; sentence segmentation; lemmatisation; Universal Dependencies part-of-speech tags, morphological features, and syntactic dependencies; and 4-class CoNLL-2003 named entities.
Compared to ParlaMint-UA 4.0, ParlaMint-UA 4.0.1 doubles the time span: it now includes older data from 2002 to 2012 and more recent data from September to November 2023. It refines language identification between Ukrainian and Russian from the paragraph level to the sentence level to advance research on code-switching in public discourse. In addition, the errors found in ParlaMint 4.0 have been corrected.
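As an illustration of how the sentence-level language identification might be used, the sketch below counts tokens per language in a small TEI fragment. It rests on assumptions not stated in this description: that each sentence is an <s> element in the TEI namespace carrying an xml:lang attribute, that tokens are <w> and <pc> elements, and that punctuation counts as a token. The fragment itself and the function name token_share_by_language are hypothetical and are not taken from the corpus.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical fragment for illustration only; the real corpus files
# may be structured differently.
SAMPLE = """<seg xmlns="http://www.tei-c.org/ns/1.0" xml:lang="uk">
  <s xml:lang="uk"><w>Шановні</w><w>колеги</w><pc>!</pc></s>
  <s xml:lang="ru"><w>Спасибо</w><pc>.</pc></s>
</seg>"""


def token_share_by_language(xml_text):
    """Return the share of tokens per sentence-level language tag."""
    root = ET.fromstring(xml_text)
    token_tags = (f"{{{TEI_NS}}}w", f"{{{TEI_NS}}}pc")
    counts = Counter()
    for sentence in root.iter(f"{{{TEI_NS}}}s"):
        # Assumption: the language is marked on the sentence element;
        # "und" (undetermined) is used when the attribute is missing.
        lang = sentence.get(XML_LANG, "und")
        counts[lang] += sum(1 for el in sentence.iter() if el.tag in token_tags)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()} if total else {}


shares = token_share_by_language(SAMPLE)
print(shares)
```

Applied to a whole annotated file rather than a fragment, the same counting would give per-document language shares, which could then be compared across speakers or terms when studying code-switching.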