This dataset is an archive of articles published on the Delfi news site between 2015 and 2019. It contains over 180,000 articles, roughly 50% in Latvian and 50% in Russian. Keywords for the articles are included.
There are 5 JSON files:
lv_2015.json contains 42 001 articles from the year 2015
lv_2016_.json contains 40 342 articles from the year 2016
lv_2017_.json contains 37 256 articles from the year 2017
lv_2018_.json contains 31 732 articles from the year 2018
lv_2019_.json contains 29 070 articles from the year 2019
In total: 180 401 articles
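As a minimal sketch of how the files listed above could be loaded and checked against the stated counts, the following Python snippet may help. The helper names `load_articles` and `check_counts`, the UTF-8 encoding, and the idea that the files sit together in one local directory are assumptions made for illustration, not part of the dataset.

```python
import json
from pathlib import Path

# Article counts per file, copied from the list above.
EXPECTED_COUNTS = {
    "lv_2015.json": 42_001,
    "lv_2016_.json": 40_342,
    "lv_2017_.json": 37_256,
    "lv_2018_.json": 31_732,
    "lv_2019_.json": 29_070,
}


def load_articles(path):
    """Load one yearly file and return its list of article dictionaries.

    Assumes the file is UTF-8 encoded (hypothetical, not stated in the description).
    """
    with open(path, encoding="utf-8") as f:
        articles = json.load(f)
    if not isinstance(articles, list):
        raise ValueError(f"{path}: expected a JSON list of articles")
    return articles


def check_counts(data_dir):
    """Return {file name: (actual count, stated count)} for each file present in data_dir."""
    results = {}
    for name, expected in EXPECTED_COUNTS.items():
        path = Path(data_dir) / name
        if path.exists():
            results[name] = (len(load_articles(path)), expected)
    return results


# The per-file counts add up to the stated total.
assert sum(EXPECTED_COUNTS.values()) == 180_401
```

Calling `check_counts("path/to/data")` (a placeholder path) returns only the files actually found there, so a partial download can still be checked.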
Description of the dataset
Each JSON file is a list of dictionaries, i.e. each article is represented as one dictionary. Each dictionary contains the following fields:
id (integer) - the ID of the article
title (string) - the title of the article
lead (string) - the lead of the article
tags [1] (list of dictionaries or None): each dictionary represents one tag. The tag dictionary contains the following:
    domain_id (string) - the ID of the domain
    id (string) - the ID of the tag
    lang (string) - the language of the tag
    tag (string) - the tag itself, e.g. Šokolāde
    translitted_name (string) - a transliterated version of the tag, e.g. sokolade
rawBody (string) - the raw text of the article (contains HTML)
bodyText (string) - the clean text of the article (stripped of HTML)
publishDate (string) - the publication date and time of the article
categoryPrimary (dictionary or empty list) - the dictionary contains the following information:
    categoryId (integer) - the ID of the category
    categoryName (string) - the name of the category (e.g. Futbols)
    channelId (integer) - the ID of the channel
    groupId - None
    channelLanguage (string) - the language of the channel (nat - Latvian, rus - Russian)
    categoryLanguage (integer) - ID of the channel language
relatedArticles (list of integers or None) - a list of the IDs of related articles
relatedTags (string or None) - the related tags as a single comma-separated string
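Several fields can take more than one type (tags and relatedArticles may be None, categoryPrimary may be an empty list, relatedTags is a comma-separated string or None), so code reading the articles has to handle each case. The sketch below shows one way to flatten a single article dictionary. The function name `summarize_article` and the output keys are hypothetical, and stripping whitespace around the comma-separated tags is an assumption about the data.

```python
def summarize_article(article):
    """Return a flat summary of one article dictionary (illustrative helper)."""
    # tags is a list of tag dictionaries or None.
    tags = [t["tag"] for t in (article.get("tags") or [])]

    # categoryPrimary is a dictionary or an empty list.
    category = article.get("categoryPrimary")
    if not isinstance(category, dict):
        category = {}

    # relatedTags is a comma-separated string or None.
    # Assumption: surrounding whitespace is not significant.
    raw_related = article.get("relatedTags") or ""
    related_tags = [s.strip() for s in raw_related.split(",") if s.strip()]

    return {
        "id": article["id"],
        "title": article["title"],
        "tags": tags,
        "category": category.get("categoryName"),
        "language": category.get("channelLanguage"),
        # relatedArticles is a list of integers or None.
        "related_articles": article.get("relatedArticles") or [],
        "related_tags": related_tags,
    }


# Made-up record for illustration only; it is not taken from the dataset.
example = {
    "id": 1,
    "title": "Example title",
    "tags": [{"tag": "Šokolāde", "translitted_name": "sokolade"}],
    "categoryPrimary": {"categoryName": "Futbols", "channelLanguage": "nat"},
    "relatedArticles": None,
    "relatedTags": "first, second",
}
summary = summarize_article(example)
```

Normalizing the optional fields to empty lists and None in one place keeps later filtering code, for example selecting only articles whose language is nat, free of repeated type checks.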