Ekspress news article archive (in Estonian and Russian) 1.0

Dataset

The dataset is an archive of articles published on the Ekspress Meedia news site from 2009 to 2019. It contains over 1.4M articles, mostly in Estonian (1,115,120 articles) with some in Russian (325,952 articles). Keywords are included for articles after 2015.

The main archive contains JSON files of all the Estonian articles from 2009 to May 2019. These datasets are intended for use in EMBEDDIA, an H2020 project. Most articles are in Estonian, with some in Russian.

The main archive is in file ee_articles_2009_2019. Other files contain derived versions and subsets (please see README files inside those zip files), in short:

eearticles2015-2019: Estonian and Russian articles covering 5 years, with the tags that were missing in the previous versions.
eearticles20152019lemmatized and eearticles20092014lemmatized: the files preprocessed by TEXTA (contact linda@texta.ee).
eeandsttarticlelemmasembeddingsand_usage: w2v embeddings trained by TEXTA (contact linda@texta.ee).

Description of the Main Dataset (ee_articles_2009_2019)

There are 12 JSON files:

articles_2009_ver2.json contains 161394 articles from the year 2009

articles_2010_ver2.json contains 151033 articles from the year 2010

articles_2011_ver2.json contains 168273 articles from the year 2011

articles_2012_ver2.json contains 152772 articles from the year 2012

articles_2013_ver2.json contains 141012 articles from the year 2013

articles_2014_ver2.json contains 128388 articles from the year 2014

articles_2015_ver2.json contains 127425 articles from the year 2015

articles_2016_ver2.json contains 130704 articles from the year 2016

articles_2017_ver2.json contains 119318 articles from the year 2017

articles_2018_ver2.json contains 117388 articles from the year 2018

articles_2019_Jan-Apr_ver2.json contains 35076 articles from the year 2019 January to April

articles_2019_May_ver2.json contains 8329 articles from the year 2019 May

In total: 1,441,112 articles
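The total can be checked by summing the per-file counts listed above. A minimal Python sketch; the name `ARTICLE_COUNTS` is introduced here for illustration only and is not part of the dataset:

```python
# Per-file article counts, copied from the file list above.
ARTICLE_COUNTS = {
    "articles_2009_ver2.json": 161394,
    "articles_2010_ver2.json": 151033,
    "articles_2011_ver2.json": 168273,
    "articles_2012_ver2.json": 152772,
    "articles_2013_ver2.json": 141012,
    "articles_2014_ver2.json": 128388,
    "articles_2015_ver2.json": 127425,
    "articles_2016_ver2.json": 130704,
    "articles_2017_ver2.json": 119318,
    "articles_2018_ver2.json": 117388,
    "articles_2019_Jan-Apr_ver2.json": 35076,
    "articles_2019_May_ver2.json": 8329,
}

# Number of files and the overall article count.
file_count = len(ARTICLE_COUNTS)
total = sum(ARTICLE_COUNTS.values())
print(file_count, total)
```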

Each JSON file is a list of dictionaries, i.e. each article is represented as a dictionary. Each dictionary contains the following:

id (integer) - the ID of the article

title (string) - the title of the article

lead (string) - the lead of the article (can contain HTML tags)

url (string) - the URL of the article

tags (list of dictionaries or None) [1]: each dictionary represents one tag. The tag dictionary contains the following:

domain_id (string) [2] - the ID of the domain

id (string) - the ID of the tag

lang (string) - the language of the tag

tag (string) - the tag itself, e.g. Kert Kingo (a name)

translitted_name (string) - a modified version of the tag, e.g. kert-kingo

rawBody (string) - the raw text of the article (contains HTML)

bodyText (string) - the clean text of the article (with HTML stripped)

publishDate (string) - the publication date and time of the article

categoryPrimary (dictionary or empty list) - the dictionary contains the following information:

categoryId (integer) - the ID of the category

categoryName (string) - the name of the category (e.g. World)

channelId (integer) - the ID of the channel

articleId (integer) - the ID of the article

categoryId (integer) - the ID of the category

categoryName (string) - the name of the category (e.g. World)

categoryPrimary (boolean) - unknown

categorySort (integer) - unknown

categoryUrl (string) - the URL of the category

categoryVisible (boolean) - unknown

channelId (integer) - the ID of the channel

channelUrl (string) - the URL of the channel (e.g. 'https://sport.delfi.ee')

directoryName (string) - unknown

parentId (integer) - unknown

channelLanguage (string or None) [3] - the language of the channel

categoryLanguage (integer or None) [4] - unknown

commentCount (integer) [5] - the number of comments

relatedArticles (list of integers) - a list of the IDs of related articles
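As a minimal sketch of how one record might be read in Python, assuming the layout described above: `tags` may be None and `categoryPrimary` may be an empty list rather than a dictionary, so both are normalized before use. The helper names `load_articles` and `summarize_article` and the sample record are hypothetical and not part of the dataset.

```python
import json
from typing import Any, Dict, List


def load_articles(path: str) -> List[Dict[str, Any]]:
    """Load one yearly file (a JSON list of article dictionaries)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize_article(article: Dict[str, Any]) -> Dict[str, Any]:
    """Return a small, uniformly typed view of one article dictionary."""
    # tags is a list of dictionaries or None; treat None as "no tags".
    tags = article.get("tags") or []
    # categoryPrimary is a dictionary or an empty list; treat the list as "no category".
    category = article.get("categoryPrimary")
    category_name = category.get("categoryName") if isinstance(category, dict) else None
    return {
        "id": article["id"],
        "title": article["title"],
        "tag_names": [tag["tag"] for tag in tags],
        "category_name": category_name,
        "n_related": len(article.get("relatedArticles") or []),
    }


# Hypothetical records, invented for illustration; values are not taken from the dataset.
sample_without_tags = {
    "id": 1,
    "title": "Example title",
    "tags": None,
    "categoryPrimary": [],
    "relatedArticles": [2, 3],
}
sample_with_tags = {
    "id": 2,
    "title": "Another example",
    "tags": [{"tag": "Kert Kingo", "translitted_name": "kert-kingo"}],
    "categoryPrimary": {"categoryId": 10, "categoryName": "World", "channelId": 1},
    "relatedArticles": [],
}

summary_a = summarize_article(sample_without_tags)
summary_b = summarize_article(sample_with_tags)
print(summary_a)
print(summary_b)
```

Note that `json.load` reads a whole file into memory; since each yearly file holds well over 100,000 articles, this may require a substantial amount of RAM.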

Identifier
PID	http://hdl.handle.net/11356/1408
Related Identifier	https://www.aclweb.org/anthology/2021.hackashop-1.14.pdf
Related Identifier	http://embeddia.eu/
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1408

Provenance
Creator	Purver, Matthew; Pollak, Senja; Freienthal, Linda; Kuulmets, Hele-Andra; Krustok, Ivar; Shekhar, Ravi
Publisher	Ekspress Meedia Group
Publication Year	2021
Funding Reference	info:eu-repo/grantAgreement/EC/H2020/825153
Rights	Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0); https://creativecommons.org/licenses/by-nc-nd/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Estonian; Russian
Resource Type	corpus
Format	text/plain; charset=utf-8; application/octet-stream; application/zip; downloadable_files_count: 6
Discipline	Linguistics