Multilingual comparable corpora of parliamentary debates ParlaMint 4.0

ParlaMint 4.0 is a set of comparable corpora containing transcriptions of parliamentary debates of 29 European countries and autonomous regions, mostly starting in 2015 and extending to mid-2022. The individual corpora comprise between 9 and 126 million words and the complete set contains over 1.1 billion words.

The transcriptions are divided by days with information on the term, session and meeting, and contain speeches marked by the speaker and their role (e.g. chair, regular speaker). The speeches also contain marked-up transcriber comments, such as gaps in the transcription, interruptions, applause, etc. The corpora have extensive metadata, most importantly on speakers (name, gender, MP and minister status, party affiliation) and on the political parties and parliamentary groups (name, coalition/opposition status, Wikipedia-sourced left-to-right political orientation, and CHES variables, see https://www.chesdata.eu/). Some corpora have further metadata, e.g. the year of birth of the speakers, links to their Wikipedia articles, their membership in various committees, etc. The transcriptions are also marked with the subcorpus they belong to: "reference" (until 2020-01-30), "covid" (from 2020-01-31), and "war" (from 2022-02-24).
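
The subcorpus boundaries given above can be illustrated with a minimal Python sketch. The function name subcorpus_labels is hypothetical and not part of the ParlaMint distribution. The description states no end date for "covid", so the sketch assumes that a day on or after 2022-02-24 falls under both "covid" and "war"; the corpora themselves may mark this differently.

```python
from datetime import date

# Start dates as stated in the description.
COVID_START = date(2020, 1, 31)
WAR_START = date(2022, 2, 24)


def subcorpus_labels(day: date) -> list[str]:
    """Return the subcorpus labels whose stated date range includes `day`.

    Assumption: no end date is given for "covid", so days on or after
    WAR_START are reported here as belonging to both "covid" and "war".
    """
    if day < COVID_START:
        return ["reference"]
    labels = ["covid"]
    if day >= WAR_START:
        labels.append("war")
    return labels
```

For example, subcorpus_labels(date(2020, 1, 30)) yields only "reference", since that is the last day before the "covid" start date.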

The corpora follow the Parla-CLARIN TEI recommendation (https://clarin-eric.github.io/parla-clarin/), and are encoded against the compatible but much stricter ParlaMint encoding guidelines (https://clarin-eric.github.io/ParlaMint/) and schemas (included in the distribution).

This entry contains the ParlaMint TEI-encoded corpora and their derived plain text versions along with TSV metadata of the speeches. Also included is the 4.0 release of the sample data and scripts available at the GitHub repository of the ParlaMint project at https://github.com/clarin-eric/ParlaMint.
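
As a rough illustration of working with the TSV speech metadata, the following Python sketch counts rows by the value of one column. The helper count_by_column, the column names (ID, Speaker_role, Speaker_gender) and the sample rows are assumptions made up for this example; the actual column names and values should be checked against the header of the distributed files.

```python
import csv
import io
from collections import Counter


def count_by_column(tsv_file, column):
    """Count the rows of a tab-separated file by the value in `column`."""
    # QUOTE_NONE treats quote characters in the text as ordinary characters.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row[column] for row in reader)


# Made-up sample rows; real column names and values may differ.
SAMPLE = (
    "ID\tSpeaker_role\tSpeaker_gender\n"
    "u1\tChairperson\tF\n"
    "u2\tRegular\tM\n"
    "u3\tRegular\tF\n"
)

counts = count_by_column(io.StringIO(SAMPLE), "Speaker_role")
```

For a real file, the in-memory sample would be replaced by a file object opened with open(path, encoding="utf-8", newline="").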

A linguistically marked-up version of the ParlaMint 4.0 corpus, also linked with concordancers, is available at http://hdl.handle.net/11356/1860. Another related resource is the Linguistically annotated multilingual comparable corpora of parliamentary debates in English ParlaMint-en.ana 4.0 (http://hdl.handle.net/11356/1864).

Compared to the previous version, 3.0, this version adds corpora for Spain (ES), Finland (FI) and the Basque Country (ES-PV); extends the corpora for Austria (AT), Czechia (CZ), Hungary (HU), and Ukraine (UA) with more recent data; adds metadata on political parties and parliamentary groups, namely their left-to-right political orientation from Wikipedia and CHES variables; and adds information on whether and when a speaker was a minister, for the corpora that previously lacked it. The TEI encoding of some details has also changed, and many errors found in the 3.0 corpora have been corrected.

Identifier
PID http://hdl.handle.net/11356/1859
Related Identifier https://github.com/clarin-eric/ParlaMint/
Related Identifier http://hdl.handle.net/11356/1486
Related Identifier http://hdl.handle.net/11356/1912
Related Identifier https://www.clarin.eu/content/parlamint
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1859
Provenance
Creator Erjavec, Tomaž; Kopp, Matyáš; Ogrodniczuk, Maciej; Osenova, Petya; Agirrezabal, Manex; Agnoloni, Tommaso; Aires, José; Albini, Monica; Alkorta, Jon; Antiba-Cartazo, Iván; Arrieta, Ekain; Barcala, Mario; Bardanca, Daniel; Barkarson, Starkaður; Bartolini, Roberto; Battistoni, Roberto; Bel, Nuria; Bonet Ramos, Maria del Mar; Calzada Pérez, María; Cardoso, Aida; Çöltekin, Çağrı; Coole, Matthew; Darģis, Roberts; de Libano, Ruben; Depoorter, Griet; Diwersy, Sascha; Dodé, Réka; Fernandez, Kike; Fernández Rei, Elisa; Frontini, Francesca; Garcia, Marcos; García Díaz, Noelia; García Louzao, Pedro; Gavriilidou, Maria; Gkoumas, Dimitris; Grigorov, Ilko; Grigorova, Vladislava; Haltrup Hansen, Dorte; Iruskieta, Mikel; Jarlbrink, Johan; Jelencsik-Mátyus, Kinga; Jongejan, Bart; Kahusk, Neeme; Kirnbauer, Martin; Kryvenko, Anna; Ligeti-Nagy, Noémi; Ljubešić, Nikola; Luxardo, Giancarlo; Magariños, Carmen; Magnusson, Måns; Marchetti, Carlo; Marx, Maarten; Meden, Katja; Mendes, Amália; Mochtak, Michal; Mölder, Martin; Montemagni, Simonetta; Navarretta, Costanza; Nitoń, Bartłomiej; Norén, Fredrik Mohammadi; Nwadukwe, Amanda; Ojsteršek, Mihael; Pančur, Andrej; Papavassiliou, Vassilis; Pereira, Rui; Pérez Lago, María; Piperidis, Stelios; Pirker, Hannes; Pisani, Marilina; Pol, Henk van der; Prokopidis, Prokopis; Quochi, Valeria; Rayson, Paul; Regueira, Xosé Luís; Rudolf, Michał; Ruisi, Manuela; Rupnik, Peter; Schopper, Daniel; Simov, Kiril; Sinikallio, Laura; Skubic, Jure; Tungland, Lars Magne; Tuominen, Jouni; van Heusden, Ruben; Varga, Zsófia; Vázquez Abuín, Marta; Venturi, Giulia; Vidal Miguéns, Adrián; Vider, Kadri; Vivel Couso, Ainhoa; Vladu, Adina Ioana; Wissik, Tanja; Yrjänäinen, Väinö; Zevallos, Rodolfo; Fišer, Darja
Publisher CLARIN ERIC
Publication Year 2023
Rights Creative Commons - Attribution 4.0 International (CC BY 4.0); https://creativecommons.org/licenses/by/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Bulgarian; Croatian; Polish; Slovenian; Slovene; Czech; Icelandic; French; Dutch; Flemish; Danish; Spanish; Castilian; Turkish; English; Italian; Hungarian; Latvian; Bosnian; Catalan; Valencian; German; Greek, Modern (1453-); Greek; Estonian; Portuguese; Serbian; Swedish; Ukrainian; Norwegian; Galician; Russian; Finnish; Basque
Resource Type corpus
Format text/plain; charset=utf-8; application/octet-stream; application/gzip; downloadable_files_count: 30
Discipline Linguistics