EMBEDDIA tools output example corpus of Estonian, Croatian and Latvian news articles 1.0

This dataset contains articles from EMBEDDIA Media partners with various information added by the tools developed within the EMBEDDIA project:

  • 12,390 Estonian articles from 2019 with tags given by Ekspress Meedia. The complete dataset without the output of EMBEDDIA tools is available at http://hdl.handle.net/11356/1408

  • 5,000 Croatian articles from autumn of 2010 with tags given by 24sata. The complete dataset without the output of EMBEDDIA tools is available at http://hdl.handle.net/11356/1410

  • 15,264 Latvian articles from 2019 with tags given by Ekspress Meedia. The complete dataset without the output of EMBEDDIA tools is available at http://hdl.handle.net/11356/1409

All the articles in the dataset have been analysed with the texta-mlp Python package (https://pypi.org/project/texta-mlp/) via the EMBEDDIA Media assistant's Texta Toolkit (https://docs.texta.ee/). The following tools were used to analyse the articles:

  • Latin1 and Latin2 Named Entity Recognition Tool modules (Cabrera-Diego et al., 2021), both described in https://aclanthology.org/2021.bsnlp-1.12/. The Latin1 results can be found in the folders annotated_articles_ner_latin1/ and annotated_articles_all_tools/, while the Latin2 results are in annotated_articles_nerlatin2/ and annotated_articles_all_tools/.

  • RAKUN keyword extractor. RAKUN (Škrlj et al. 2019) is an unsupervised system for keyword extraction, so it can be used for any language. It detects keywords by turning the text into a graph; the most important nodes in that graph mostly turn out to be the keywords. It is described in https://link.springer.com/chapter/10.1007/978-3-030-31372-2_26. The keyword annotation results can be found in the folders annotated_articles_rakun/ and annotated_articles_all_tools/.

  • TNT-KID keyword extractor. TNT-KID (Martinc et al. 2021) is a supervised system for automatic keyword extraction. It was trained on a corpus of articles with human-assigned keywords. For Croatian, the annotators were 24sata editors, for Estonian the Ekspress Meedia staff, and for Latvian the Latvian Delfi staff. The system is further documented at https://doi.org/10.1017/S1351324921000127. For Croatian, only TNT-KID was applied, while for Estonian and Latvian, TNT-KID with TF-IDF, an extension by Koloski et al. (https://aclanthology.org/2021.hackashop-1.4.pdf), was used. The results of applying this tool can be found in the folders annotated_articles_tnt_kid/ and annotated_articles_all_tools/.

  • Sentiment analysis. Our news sentiment analyser (Pelicon et al. 2020) labels a news article as being of positive, negative, or neutral sentiment, using a fine-tuned multilingual BERT model that was trained on Slovene sentiment-annotated news articles. The system is further documented at https://doi.org/10.3390/app10175993. The results of this tool can be found in the folders annotated_articles_sentiment/ and annotated_articles_all_tools/.
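
The graph-based idea behind the RAKUN description above can be illustrated with a toy sketch. This is not the RAKUN algorithm, which uses its own graph construction and node ranking as described in the cited paper. As a simplified stand-in, the code below links neighbouring words in a co-occurrence graph and ranks words by their weighted degree. The function name `rank_keywords` and its parameters `window` and `top_k` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def rank_keywords(tokens: list[str], window: int = 2, top_k: int = 3) -> list[str]:
    """Rank words by weighted degree in a word co-occurrence graph.

    Simplified stand-in for graph-based keyword extraction; NOT RAKUN itself.
    """
    # weights[word] is the total weight of edges touching that word
    weights: dict[str, float] = defaultdict(float)
    for i, word in enumerate(tokens):
        # connect each word to the words that follow it inside the window
        for other in tokens[i + 1 : i + window]:
            if other != word:
                weights[word] += 1
                weights[other] += 1
    # highest weighted degree first; ties are broken alphabetically
    return sorted(weights, key=lambda w: (-weights[w], w))[:top_k]
```

Called on a tokenised and lower-cased article, the function returns the `top_k` words that are best connected in the graph. A real system would also need stop-word handling and a more informative centrality measure.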

All the data is encoded in "JSON Lines" format. Each folder has its own README file which explains the structure of the files.
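
Since JSON Lines stores one JSON object per line, the files can be read with the Python standard library alone. The following is a minimal sketch. The helper name `read_json_lines` is hypothetical, and the fields of each record are not assumed here, because they are documented in the README file of each folder.

```python
import json
from pathlib import Path
from typing import Iterator


def read_json_lines(path: Path) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSON Lines file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Because the function is a generator, articles are parsed one at a time, so a large file does not need to fit into memory at once.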

Identifier
PID http://hdl.handle.net/11356/1485
Related Identifier http://embeddia.eu/
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1485
Provenance
Creator Freienthal, Linda; Pelicon, Andraž; Martinc, Matej; Škrlj, Blaž; Krustok, Ivar; Pranjić, Marko; Cabrera-Diego, Luis Adrián; Purver, Matthew; Pollak, Senja; Kuulmets, Hele-Andra; Shekhar, Ravi; Koloski, Boshko
Publisher Ekspress Meedia Group; Styria Media Group
Publication Year 2022
Funding Reference info:eu-repo/grantAgreement/EC/H2020/825153
Rights Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0); https://creativecommons.org/licenses/by-nc-nd/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Estonian; Latvian; Croatian
Resource Type corpus
Format text/plain; charset=utf-8; application/zip; downloadable_files_count: 1
Discipline Linguistics