CACAPO dataset

The Combinations of Aligned data-sentenCes from nAturally PrOduced texts (hereafter: CACAPO) dataset is a resource for data-to-text generation. It contains over 20,000 sentences from automatically scraped news reports in the sports, weather, stock, and incidents domains, in English and Dutch, aligned with relevant attribute-value paired data. To our knowledge, this is the first dataset based on “naturally occurring” human-written texts (i.e., texts that were not collected in a task-based setting) that covers multiple domains as well as multiple languages.

Method: The texts were collected using automatic scrapers or an interface that allowed quick collection of articles. The aligned data was manually annotated by two annotators.

Universe: News reports on traffic and gun violence incidents, soccer and baseball matches, stocks, and weather, published between 2016 and 2019.

Country / Nation: The reports come from Dutch- and English-speaking countries, mostly the Netherlands, the United Kingdom, and the United States of America.

Additional metadata and information can be found in the file "Data Report.pdf".

Data files:
1. Full_Dict_NL.json, Full_Dict_EN.json, Phrase_Dict.json, PhraseTable.json: JSON files that contain information on verbs and determiners, which a realization module can use to apply the correct word form in a given situation.
2. WebNLGFormatTrain.xml, WebNLGFormatDev.xml, WebNLGFormatTest.xml: The corpus files in XML format. Their structure is the same as that of the (enriched) WebNLG corpus, version 1.4 (see GitHub).
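Because the corpus files follow the WebNLG XML layout, they can probably be read with a standard XML parser. The sketch below is a minimal illustration, not the authors' tooling. It assumes the usual WebNLG element names (`entry`, `mtriple`, `lex`, `text`) and pipe-separated data fields; the embedded sample and the `read_entries` helper are hypothetical and not taken from the CACAPO files, so check the names against the actual XML before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the WebNLG layout. It is NOT taken from the CACAPO
# files; element and attribute names are assumptions based on WebNLG.
SAMPLE = """<benchmark>
  <entries>
    <entry category="Weather" eid="Id1" size="2">
      <modifiedtripleset>
        <mtriple>weatherType | value | rain</mtriple>
        <mtriple>location | value | Tilburg</mtriple>
      </modifiedtripleset>
      <lex comment="" lid="Id1">
        <text>Rain is expected in Tilburg.</text>
      </lex>
    </entry>
  </entries>
</benchmark>"""


def read_entries(root):
    """Yield (category, data, sentences) for every <entry> under root."""
    for entry in root.iter("entry"):
        # Split each data line on "|" and trim whitespace around the parts.
        data = [
            tuple(part.strip() for part in triple.text.split("|"))
            for triple in entry.iter("mtriple")
            if triple.text
        ]
        # Collect the sentence text of each lexicalization.
        sentences = [
            lex.findtext("text", default="").strip() for lex in entry.iter("lex")
        ]
        yield entry.get("category"), data, sentences


entries = list(read_entries(ET.fromstring(SAMPLE)))
for category, data, sentences in entries:
    print(category, data, sentences)
```

To try this on a downloaded file, replace `ET.fromstring(SAMPLE)` with `ET.parse("WebNLGFormatTrain.xml").getroot()`.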

Identifier
DOI	https://doi.org/10.34894/LIBYHP
Related Identifier	https://aclanthology.org/2020.inlg-1.10
Metadata Access	https://dataverse.nl/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.34894/LIBYHP
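
The Metadata Access link is an OAI-PMH `GetRecord` request. As a rough sketch, the same URL can be assembled and fetched with the Python standard library. The helper names `record_url` and `fetch_record` are illustrative assumptions, and the response is returned as raw XML text because its exact layout is not described in this record.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint and parameter values are copied from the Metadata Access link.
OAI_ENDPOINT = "https://dataverse.nl/oai"


def record_url(doi, prefix="oai_datacite"):
    """Build the OAI-PMH GetRecord URL for a DOI (hypothetical helper)."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": prefix,
        "identifier": f"doi:{doi}",
    }
    # Keep ":" and "/" unescaped so the URL matches the form in the record.
    return f"{OAI_ENDPOINT}?{urlencode(params, safe=':/')}"


def fetch_record(doi):
    """Download the metadata record as XML text (requires network access)."""
    with urlopen(record_url(doi)) as response:
        return response.read().decode("utf-8")


url = record_url("10.34894/LIBYHP")
print(url)
```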

Provenance
Creator	van der Lee, Chris (ORCID: 0000-0003-3454-026X); Emmery, Chris (ORCID: 0000-0002-2179-559X); Wubben, Sander; Krahmer, Emiel
Publisher	DataverseNL
Contributor	van der Lee, Chris; DataverseNL
Publication Year	2022
Rights	CC-BY-4.0; info:eu-repo/semantics/openAccess; http://creativecommons.org/licenses/by/4.0
OpenAccess	true
Contact	van der Lee, Chris (Tilburg University)

Representation
Resource Type	Annotated corpora; Dataset
Format	application/pdf; application/json; text/xml
Size	174295; 384823; 162245; 37974; 74353; 12207; 86393; 162492; 85812; 192916; 81692; 449085; 881993; 237536; 484989; 221912; 507726; 448953; 417342; 971386; 700598; 678992; 382056; 805017; 712189; 425377; 1557412; 8111975; 1864375; 3609645; 3664792; 3590327; 4530519; 2079327; 4128967
Version	1.0
Discipline	Humanities