Combinations of Aligned data-sentenCes from nAturally PrOduced texts (hereafter: CACAPO) is a dataset for data-to-text generation. It contains over 20,000 sentences from automatically scraped news reports in the sports, weather, stock, and incidents domains, in English and Dutch, aligned with relevant attribute-value paired data. To our knowledge, this is the first dataset based on “naturally occurring” human-written texts (i.e., texts that were not collected in a task-based setting) that covers various domains as well as multiple languages.
Method: The texts were collected using automatic scrapers or an interface that allowed quick collection of articles. The aligned data were manually annotated by two annotators.
Universe: News reports on traffic/gun violence incidents, soccer/baseball matches, stocks, and weather, published between 2016 and 2019.
Country / Nation: The reports come from Dutch- and English-speaking countries, mostly the Netherlands, the United Kingdom, and the United States of America.
Additional metadata and information can be found in the file "Data Report.pdf".
Data files:
1. Full_Dict_NL.json, Full_Dict_EN.json, Phrase_Dict.json, PhraseTable.json: JSON files that contain information on verbs and determiners. This information is useful for a realization module that needs to apply the correct word form in a given situation.
2. WebNLGFormatTrain.xml, WebNLGFormatDev.xml, WebNLGFormatTest.xml: The corpus files in XML format. Their structure is the same as that of (enriched) WebNLG version 1.4 (see GitHub).
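
As a rough illustration of how corpus files in a WebNLG-style layout might be read, the Python sketch below parses a small inline sample. The element names (entry, mtriple, lex, text), the attribute names, and all sample values are assumptions based on the general enriched WebNLG layout and are not copied from the CACAPO files; the function name read_entries is hypothetical. Please check the actual files and "Data Report.pdf" before relying on it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in a WebNLG-style layout. Element names, attribute
# names, and values are assumptions for illustration only; they are not
# taken from the CACAPO corpus files.
SAMPLE = """<benchmark>
  <entries>
    <entry category="Weather" eid="Id1" size="2">
      <modifiedtripleset>
        <mtriple>forecast | weather_type | rain</mtriple>
        <mtriple>forecast | location | Amsterdam</mtriple>
      </modifiedtripleset>
      <lex comment="" lid="Id1">
        <text>Rain is expected in Amsterdam.</text>
      </lex>
    </entry>
  </entries>
</benchmark>"""


def read_entries(xml_text):
    """Return one record per <entry>: its category, data triples, and texts."""
    root = ET.fromstring(xml_text)
    records = []
    for entry in root.iter("entry"):
        # Each <mtriple> is assumed to hold three fields separated by "|".
        triples = [
            tuple(part.strip() for part in (t.text or "").split("|"))
            for t in entry.iter("mtriple")
        ]
        texts = []
        for lex in entry.iter("lex"):
            # Enriched layout: sentence sits in a <text> child.
            # Plain layout: sentence is the direct content of <lex>.
            sentence = lex.findtext("text")
            if sentence is None:
                sentence = lex.text or ""
            texts.append(sentence.strip())
        records.append(
            {"category": entry.get("category"), "triples": triples, "texts": texts}
        )
    return records


if __name__ == "__main__":
    for record in read_entries(SAMPLE):
        print(record["category"], record["triples"], record["texts"])
```

To read one of the corpus files instead of the inline sample, the file contents could be passed in, for example read_entries(open("WebNLGFormatTrain.xml", encoding="utf-8").read()), provided the file indeed uses these element names.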