Monitor corpus of Slovene Trendi 2023-09

Dataset


The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 70 publishers. Trendi 2023-09 covers the period from January 2019 to September 2023, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320).

The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to Universal Dependencies (https://universaldependencies.org/sl/) and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf).

An important addition is the set of topics, or thematic categories, which have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics).

The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through the CLARIN.SI concordancers.

Compared to the previous version of the corpus, this version adds texts from March to September 2023, adds topic classification to files that previously lacked it by mistake, and corrects some other minor errors.

Identifier
PID	http://hdl.handle.net/11356/1879
Related Identifier	http://euralex.org/wp-content/themes/euralex/proceedings/Euralex%202022/EURALEX2022_Pr_p230-239_Kosem.pdf
Related Identifier	https://nl.ijs.si/jtdh22/pdf/JTDH2022_Kosem-et-al_Spremljevalni-korpus-Trendi.pdf
Related Identifier	https://doi.org/10.4312/slo2.0.2023.1.161-188
Related Identifier	http://hdl.handle.net/11356/1904
Related Identifier	http://hdl.handle.net/11356/1782
Related Identifier	https://sled.ijs.si/
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1879
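
The Metadata Access URL above follows the OAI-PMH GetRecord pattern, so the record can in principle be retrieved and read programmatically. The following is a minimal sketch in Python, not an official client: the function names (build_get_record_url, fetch_record, dc_fields) are hypothetical, and SAMPLE is a hand-written illustration of the oai_dc layout, not the actual response of the repository.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

# Endpoint taken from the Metadata Access URL of this record.
OAI_ENDPOINT = "http://www.clarin.si/repository/oai/request"
# Standard Dublin Core element namespace used by the oai_dc format.
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def build_get_record_url(identifier, metadata_prefix="oai_dc"):
    """Build an OAI-PMH GetRecord request URL (hypothetical helper)."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "metadataPrefix": metadata_prefix,
            "identifier": identifier,
        },
        safe=":/",  # keep the OAI identifier readable, as in the record above
    )
    return f"{OAI_ENDPOINT}?{query}"


def fetch_record(identifier):
    """Download the raw XML of a record; requires network access."""
    with urlopen(build_get_record_url(identifier), timeout=30) as response:
        return response.read().decode("utf-8")


def dc_fields(xml_text):
    """Collect Dublin Core elements from an OAI-PMH response into a dict of lists."""
    fields = {}
    for elem in ET.fromstring(xml_text).iter():
        if elem.tag.startswith(DC_NS):
            name = elem.tag[len(DC_NS):]
            fields.setdefault(name, []).append((elem.text or "").strip())
    return fields


# Hand-written sample for illustration only; the real response has more fields.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord><record><metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Monitor corpus of Slovene Trendi 2023-09</dc:title>
      <dc:identifier>http://hdl.handle.net/11356/1879</dc:identifier>
    </oai_dc:dc>
  </metadata></record></GetRecord>
</OAI-PMH>"""

if __name__ == "__main__":
    print(build_get_record_url("oai:www.clarin.si:11356/1879"))
    print(dc_fields(SAMPLE))
```

To work with the live record instead of the sample, pass the result of fetch_record("oai:www.clarin.si:11356/1879") to dc_fields; which fields the repository actually returns is not stated in this record.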

Provenance
Creator	Kosem, Iztok; Čibej, Jaka; Dobrovoljc, Kaja; Erjavec, Tomaž; Ljubešić, Nikola; Ponikvar, Primož; Šinkec, Mihael; Krek, Simon
Publisher	Jožef Stefan Institute; Centre for Language Resources and Technologies, University of Ljubljana
Publication Year	2023
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	corpus
Format	downloadable_files_count: 0
Discipline	Linguistics