The news articles reporting on the 2021 Tokyo Olympics data set OG2021 (public)

The OG2021 corpus contains multilingual news articles reporting on events that took place during the 2021 Tokyo Olympics. The data set was created to evaluate a news clustering algorithm. The articles were first acquired via the EventRegistry service, then clustered using an online news clustering algorithm, and finally manually inspected and annotated by a single evaluator, who used translation services to understand the articles' content.

The corpus consists of a single file, og2021.csv, which contains data on 10,940 news articles grouped into 1,350 clusters. Each article has the following attributes:

  • id: The ID of the news article.
  • title: The title of the article.
  • lang: The language in which the article is written; one of nine values.
  • source: The news publisher's name.
  • published_at: The date and time when the article was published. Publication dates range from 2021-07-01 to 2021-08-14.
  • URL: The URL of the news article.
  • cluster_id: The ID of the cluster the article is a member of.
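As a quick sanity check after downloading, the attributes above can be used to count articles, clusters, and languages. The following is a minimal sketch, not part of the data set's official tooling. It assumes that og2021.csv is comma-delimited and UTF-8 encoded, and that its header row uses the attribute names exactly as listed (lang, cluster_id). The function name summarize is hypothetical.

```python
import csv
from collections import Counter


def summarize(csv_file):
    """Count articles, clusters, and articles per language.

    `csv_file` is an open text-mode file object. The column names
    `lang` and `cluster_id` are assumed to match the attribute list.
    """
    reader = csv.DictReader(csv_file)
    cluster_sizes = Counter()  # cluster_id -> number of member articles
    languages = Counter()      # lang value -> number of articles
    articles = 0
    for row in reader:
        articles += 1
        cluster_sizes[row["cluster_id"]] += 1
        languages[row["lang"]] += 1
    return {
        "articles": articles,
        "clusters": len(cluster_sizes),
        "languages": dict(languages),
    }
```

To use it, open the file with open("og2021.csv", newline="", encoding="utf-8") and pass the handle to summarize. If the assumptions hold, the result should report 10,940 articles, 1,350 clusters, and nine distinct lang values, in line with the description above.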

A version of the data set that also includes the body attribute is published under a more restrictive licence. It can be found at http://hdl.handle.net/11356/1921.

Identifier
PID http://hdl.handle.net/11356/1922
Related Identifier https://www.humane-ai.eu/
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1922
Provenance
Creator Novak, Erik; Calcina, Erik; Mladenić, Dunja; Grobelnik, Marko
Publisher Jožef Stefan Institute
Publication Year 2024
Funding Reference info:eu-repo/grantAgreement/EC/H2020/952026
Rights Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0); PUB; https://creativecommons.org/licenses/by-nc-nd/4.0/
OpenAccess true
Contact info(at)clarin.si
Representation
Language English; Portuguese; Spanish (Castilian); French; Russian; German; Slovenian (Slovene); Arabic; Chinese
Resource Type corpus
Format text/plain; charset=utf-8; application/zip; downloadable_files_count: 1
Discipline Linguistics