Corpus of combined Slovenian corpora metaFida 1.0

Dataset

Slovenia has a large number of diverse corpora available for online analysis via the CLARIN.SI concordancers. However, users who want to run the same query across several corpora have to search each corpus separately and then combine the results manually, which is time-consuming and error-prone. An additional problem is that corpora typically have different metadata and may be annotated at different linguistic levels, which further complicates running identical searches across corpora.

For these reasons we combined a number of existing Slovenian corpora available through the CLARIN.SI concordancers into the metaFida corpus. This first required unifying the metadata and harmonizing the linguistic and structural annotations between the corpora, and then converting the individual corpora from their vertical formats, which are used as input by the CLARIN.SI concordancers, into the metaFida vertical format. As the source corpora are not completely distinct, metaFida is also deduplicated on the paragraph level.
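Paragraph-level deduplication can be illustrated with a minimal Python sketch. The record does not describe the actual procedure, so everything here is an assumption: the function name `dedup_paragraphs` is hypothetical, duplicates are detected by exact match after whitespace normalization, the first occurrence is kept, and a text that loses all of its paragraphs is dropped.

```python
import hashlib


def dedup_paragraphs(texts):
    """Drop repeated paragraphs across texts, keeping the first occurrence.

    `texts` is a list of (text_id, [paragraph, ...]) pairs. This is an
    illustrative sketch, not the procedure used to build metaFida.
    """
    seen = set()      # hashes of paragraphs already kept
    result = []
    for text_id, paragraphs in texts:
        kept = []
        for par in paragraphs:
            # Assumption: paragraphs count as duplicates when they are
            # identical after collapsing runs of whitespace.
            normalized = " ".join(par.split())
            key = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
            if key not in seen:
                seen.add(key)
                kept.append(par)
        # Assumption: a text with no remaining paragraphs is removed.
        if kept:
            result.append((text_id, kept))
    return result


sample = [
    ("a", ["Dober dan.", "Hvala."]),
    ("b", ["Dober  dan.", "Nasvidenje."]),
    ("c", ["Hvala."]),
]
deduplicated = dedup_paragraphs(sample)
```

Under these assumptions, removing paragraphs can also remove whole texts, which is consistent with deduplication affecting the text count as well as the paragraph count.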

In the metaFida corpus we keep only information that is common to most of the selected corpora. The structure is kept shallow (text and paragraph only), which makes it easier to create subcorpora or to limit a search to individual text types. All metaFida positional attributes (word, normalized form, lemma, MULTEXT-East MSD in Slovenian and English) are considered to have multiple values, separated by a space. Multiple values are needed because some corpora have normalized words (older Slovenian, user-generated content), where one original word can be mapped to several normalized ones or vice versa.
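The multi-valued attributes can be read as in the following Python sketch. It assumes a tab-separated token line with exactly five columns in the order listed above (word, normalized form, lemma, Slovenian MSD, English MSD); the actual column order and count in the metaFida vertical files may differ, and the names `Token` and `parse_token_line` as well as the sample line are hypothetical.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    word: List[str]
    norm: List[str]
    lemma: List[str]
    msd_sl: List[str]
    msd_en: List[str]


def parse_token_line(line: str) -> Token:
    """Split one vertical-format token line into multi-valued attributes."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 5:
        # Assumed layout: five positional attributes per token line.
        raise ValueError(f"expected 5 columns, got {len(columns)}")
    # Each attribute may hold several values separated by a single space.
    return Token(*(column.split(" ") for column in columns))


# Illustrative line: one original word normalized into two words.
token = parse_token_line("nebom\tne bom\tne biti\tL Gp-pe-n\tQ Va-f1s-n")
```

Here `token.word` holds a single value while `token.norm` and `token.lemma` each hold two, which mirrors the one-to-many mapping between original and normalized words.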

metaFida contains over 4.7 billion words, or 6 billion tokens, from 15 million texts published between 1584 and 2022. It draws on the following 34 corpora; many, but not all, are also available for download, as indicated by their handle:

* eltec_slv: ELTeC-slv (100 novels), https://doi.org/10.5281/zenodo.4662600; 5,596,656 words
* prilit: PriLit (older narrative prose), http://hdl.handle.net/11356/1319; 1,060,538 words
* imp: IMP (older texts), http://hdl.handle.net/11356/1031; 14,348,452 words
* maj68: Maj68 (May 1968 in literature), http://hdl.handle.net/11356/1491; 1,033,971 words
* vayna: VAYNA (attacks on the JNA), http://hdl.handle.net/11356/1237; 256,429 words
* gos20: Gos 2.0 (reference, spoken), http://hdl.handle.net/11356/1771; 2,436,386 words
* janes_norm30: Janes Norm 3.0 (manually normalized), http://hdl.handle.net/11356/1733; 249,576 words
* janes_tweet: Janes Tweet (tweets 2013-2017), http://hdl.handle.net/11356/1142; 108,769,902 words
* janes_wiki: Janes Wiki (Wikipedia comments), http://hdl.handle.net/11356/1137; 3,917,428 words
* janes_blog: Janes Blog (blogs with comments), http://hdl.handle.net/11356/1138; 27,596,463 words
* janes_forum: Janes Forum (web forums), http://hdl.handle.net/11356/1139; 37,654,809 words
* janes_news: Janes News (comments on news), http://hdl.handle.net/11356/1140; 11,908,481 words
* lemonde_sl: LeMonde: Slovenian; 506,358 words
* konji: Konji (equestrianism); 395,718 words
* filmi: FILMI (film reviews); 764,764 words
* maks: MAKS (youth literature); 9,881,294 words
* ispac_sl: ISPAC: Slovenian; 1,169,486 words
* jaslo_sl: jaSlo: Slovenian; 425,434 words
* siparl30: siParl 3.0 (parliament 1990-2022), http://hdl.handle.net/11356/1748; 205,441,411 words
* kost10_orig: KOST: original (L2), https://hdl.handle.net/11356/1753; 1,020,509 words
* jezkor: JezKor (linguistics), http://hdl.handle.net/11356/1755; 6,243,898 words
* solar30_orig: Šolar: students (developmental), https://hdl.handle.net/11356/1589; 1,621,527 words
* sbsj: SBSJ (school texts), http://hdl.handle.net/11356/1413; 1,424,887 words
* rsdo5: RSDO5 (term-annotated texts), http://hdl.handle.net/11356/1470; 241,797 words
* dsi: DSI (informatics), http://hdl.handle.net/11356/1239; 4,254,177 words
* korp: KoRP (public relations); 1,756,731 words
* suss: ŠUSS (language questions), http://hdl.handle.net/11356/1242; 272,541 words
* trans5_sl: TRANS5: Slovenian; 1,297,269 words
* dgt15_sl: EU DGT 2015: Slovene; 48,454,851 words
* gfida20_dedup: Gigafida v2.0 (reference, deduplicated), http://hdl.handle.net/11356/1320; 1,105,200,611 words
* oss10: OSS (scientific works), http://hdl.handle.net/11356/1774; 2,342,855,598 words
* classlawiki_sl: CLASSLAWiki-sl (Slovenian Wikipedia), http://hdl.handle.net/11356/1427; 41,543,793 words
* slwac: slWaC (Slovene Web); 749,372,269 words
* tweet_sl: Tweet-sl (old tweets); 4,854,229 words

Total: 34 corpora, 4,743,828,243 words before deduplication. Deduplication removes about 0.3% of words, 1.3% of tokens, 7% of texts and 11% of paragraphs.

Identifier
PID	http://hdl.handle.net/11356/1775
Related Identifier	https://www.cjvt.si/rsdo/strategija-razvoja/#tab-id-14
Related Identifier	http://hdl.handle.net/11356/1746
Related Identifier	https://rsdo.slovenscina.eu/en/language-resources
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1775

Provenance
Creator	Erjavec, Tomaž
Publisher	Jožef Stefan Institute
Publication Year	2023
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	corpus
Format	downloadable_files_count: 0
Discipline	Linguistics