Latvian Delfi article archive (in Latvian and Russian) 1.0

This dataset is an archive of articles published on the Delfi news site from 2015 to 2019. It contains over 180,000 articles, roughly 50% in Latvian and 50% in Russian. Keywords for the articles are included.

There are 5 JSON files:
    lv_2015.json contains 42 001 articles from the year 2015
    lv_2016_.json contains 40 342 articles from the year 2016
    lv_2017_.json contains 37 256 articles from the year 2017
    lv_2018_.json contains 31 732 articles from the year 2018
    lv_2019_.json contains 29 070 articles from the year 2019

In sum: 180 401 articles
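
The per-file counts above can be checked after download with a short script. This is a minimal sketch, not part of the dataset: the file names are copied as listed in this record (the names inside the downloaded archive may differ), and `count_articles` and its `data_dir` argument are hypothetical names introduced here for illustration.

```python
import json
from pathlib import Path

# File names as listed in this record; adjust if the archive uses other names.
FILE_NAMES = [
    "lv_2015.json",
    "lv_2016_.json",
    "lv_2017_.json",
    "lv_2018_.json",
    "lv_2019_.json",
]


def count_articles(data_dir, file_names=FILE_NAMES):
    """Return a mapping of file name to the number of articles it holds.

    Assumes each file is a JSON list with one dictionary per article,
    as stated in the dataset description.
    """
    counts = {}
    for name in file_names:
        path = Path(data_dir) / name
        with path.open(encoding="utf-8") as handle:
            articles = json.load(handle)
        counts[name] = len(articles)
    return counts
```

Summing the values, for example `sum(count_articles("delfi_lv").values())` with a hypothetical directory `delfi_lv`, should give the total stated above if the files match this description.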

Description of the dataset

Each JSON file is a list of dictionaries, i.e. each article is represented as a dictionary. Each dictionary contains the following:
    id (integer) - the ID of the article
    title (string) - the title of the article
    lead (string) - the lead of the article
    tags [1] (list of dictionaries or None) - each dictionary represents one tag. The tag dictionary contains the following:
        domain_id (string) - the ID of the domain
        id (string) - the ID of the tag
        lang (string) - the language of the tag
        tag (string) - the tag itself, e.g. Šokolāde
        translitted_name (string) - a transliterated version of the tag, e.g. sokolade
    rawBody (string) - the raw text of the article (contains HTML)
    bodyText (string) - the clean article text (stripped of HTML)
    publishDate (string) - the publication date and time of the article
    categoryPrimary (dictionary or empty list) - the dictionary contains the following information:
        categoryId (integer) - the ID of the category
        categoryName (string) - the name of the category (e.g. Futbols)
        channelId (integer) - the ID of the channel
        groupId - None
        channelLanguage (string) - the language of the channel (nat - Latvian, rus - Russian)
        categoryLanguage (integer) - the ID of the channel language
    relatedArticles (list of integers or None) - a list of the IDs of related articles
    relatedTags (string or None) - related tags, comma-separated
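
Several fields can take more than one shape (tags may be None, categoryPrimary may be an empty list, relatedTags is a comma-separated string or None), so code reading the records may want to normalise them first. The following is a minimal sketch under those stated assumptions; the helper names `tag_names`, `primary_category` and `related_tag_list` are hypothetical and not part of the dataset.

```python
from typing import Any, Optional


def tag_names(article: dict) -> list:
    """Return the tag strings of an article, or [] when tags is None."""
    tags = article.get("tags") or []
    return [t["tag"] for t in tags if isinstance(t, dict) and "tag" in t]


def primary_category(article: dict) -> Optional[str]:
    """Return categoryName, or None when categoryPrimary is an empty list."""
    category: Any = article.get("categoryPrimary")
    if isinstance(category, dict):
        return category.get("categoryName")
    return None


def related_tag_list(article: dict) -> list:
    """Split the comma-separated relatedTags string into a clean list."""
    raw = article.get("relatedTags")
    if not raw:
        return []
    return [part.strip() for part in raw.split(",") if part.strip()]
```

For example, an article whose relatedTags value is "a, b,c" would yield the list ["a", "b", "c"], while an article with relatedTags set to None would yield an empty list.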

Identifier
PID http://hdl.handle.net/11356/1409
Related Identifier https://www.aclweb.org/anthology/2021.hackashop-1.14.pdf
Related Identifier http://embeddia.eu/
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1409
Provenance
Creator Pollak, Senja; Purver, Matthew; Shekhar, Ravi; Freienthal, Linda; Kuulmets, Hele-Andra; Krustok, Ivar
Publisher Ekspress Meedia Group
Publication Year 2021
Funding Reference info:eu-repo/grantAgreement/EC/H2020/825153
Rights Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0); https://creativecommons.org/licenses/by-nc-nd/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Latvian; Russian
Resource Type corpus
Format text/plain; charset=utf-8; application/octet-stream; application/zip; downloadable_files_count: 3
Discipline Linguistics