Corpus of Croatian news portals ENGRI (2014-2018)

The corpus consists of texts collected from the most popular news portals in Croatia, based on the Reuters Institute Digital News Report for 2018 (retrieved from http://www.digitalnewsreport.org in April 2019), and covers the period from 2014 to 2018. The portals are Direktno, Dnevno, Net Hr, Hrt, Index_Hr, Jutarnji, Novilist, Rtl, SlobodnaDalmacija, Večernji, Tportal, and Dnevnik. Web browsing and web crawling were used to select and store the texts together with the relevant metadata from their HTML (the article's publication date, URL, and title). The corpus was linguistically processed with the CLASSLA package (https://pypi.org/project/classla/) at the levels of tokenization, sentence splitting, morphosyntactic tagging, lemmatization, dependency parsing, and named entity recognition.

This corpus is a linguistically processed version of the original corpus published at https://repository.pfri.uniri.hr/islandora/object/pfri%3A2156. It is distributed in the CoNLL-U format (https://universaldependencies.org/format.html).
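
Since the corpus is distributed in CoNLL-U, a reader may find a minimal sketch of how such files could be read useful. The Python example below uses only the standard library and assumes that each downloadable file is a gzip-compressed, UTF-8 CoNLL-U text file. The helper names (`read_conllu`, `read_conllu_gz`) are hypothetical, and the sample sentence is hand-made for illustration; it is not taken from the corpus, and its annotations are not CLASSLA output.

```python
import gzip
import io

# The ten tab-separated CoNLL-U columns, in the order defined by Universal Dependencies.
FIELDS = ["id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel", "deps", "misc"]


def read_conllu(lines):
    """Yield (comments, tokens) for each sentence in an iterable of CoNLL-U lines."""
    comments, tokens = [], []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            # A blank line ends the current sentence.
            if tokens:
                yield comments, tokens
            comments, tokens = [], []
        elif line.startswith("#"):
            # Sentence-level comment, e.g. "# text = ...".
            comments.append(line[1:].strip())
        else:
            # Token line: map the ten columns to their field names.
            tokens.append(dict(zip(FIELDS, line.split("\t"))))
    if tokens:
        # Last sentence when the input lacks a trailing blank line.
        yield comments, tokens


def read_conllu_gz(path):
    """Read sentences from a gzip-compressed CoNLL-U file (path is an assumption)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from read_conllu(handle)


# Hand-made illustrative sentence ("This is an example."), NOT taken from the corpus.
SAMPLE = (
    "# sent_id = example.1\n"
    "# text = Ovo je primjer.\n"
    "1\tOvo\tovaj\tDET\t_\t_\t3\tnsubj\t_\t_\n"
    "2\tje\tbiti\tAUX\t_\t_\t3\tcop\t_\t_\n"
    "3\tprimjer\tprimjer\tNOUN\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    for comments, tokens in read_conllu(io.StringIO(SAMPLE)):
        print(comments)
        print([(tok["form"], tok["lemma"], tok["upos"]) for tok in tokens])
```

For a downloaded file, `read_conllu_gz("some_portal.conllu.gz")` would be used in the same way; the file name here is a placeholder, not an actual file name from the repository. A dedicated CoNLL-U library may be preferable for anything beyond simple iteration, for example when multiword tokens or the `feats` column need to be interpreted.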

Identifier
PID http://hdl.handle.net/11356/1416
Related Identifier https://www.laconlab.com/projects/engri
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1416
Provenance
Creator Bogunović, Irena; Kučić, Mario; Ljubešić, Nikola; Erjavec, Tomaž
Publisher University of Rijeka, Faculty of Maritime Studies
Publication Year 2021
Rights Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0); https://creativecommons.org/licenses/by-nc-sa/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Croatian
Resource Type corpus
Format application/gzip; text/plain; charset=utf-8; downloadable_files_count: 12
Discipline Linguistics