OAGK Keyword Generation Dataset

Dataset

OAGK is a keyword extraction/generation dataset consisting of 2.2 million abstracts, titles and keyword strings from scientific articles. Texts were lowercased and tokenized with the Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file.
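
Since the records are stored as JSON lines, a file can be read one record at a time without loading it fully into memory. The following is a minimal sketch, not part of the release: the function names `read_jsonl` and `peek_keys` are hypothetical, and because the field names of the records are not listed here, the sketch inspects them instead of assuming them.

```python
import json
from typing import Iterator


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one record (a dict) per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, if any
                yield json.loads(line)


def peek_keys(path: str, limit: int = 1000) -> set:
    """Collect the field names seen in the first `limit` records."""
    keys = set()
    for index, record in enumerate(read_jsonl(path)):
        if index >= limit:
            break
        keys.update(record)  # adds the record's top-level keys
    return keys
```

Calling `peek_keys` on one of the extracted text files shows which fields each record actually carries before any further processing is written against them.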

This data is derived from the OAG data collection (https://aminer.org/open-academic-graph), which was released under the ODC-BY licence.

This data (OAGK Keyword Generation Dataset) is released under the CC-BY licence (https://creativecommons.org/licenses/by/4.0/).

If you use it, please cite the following paper: Çano, Erion and Bojar, Ondřej, 2019, Keyphrase Generation: A Text Summarization Struggle, 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics, June 2019, Minneapolis, USA.

Identifier
PID	http://hdl.handle.net/11234/1-2943
Related Identifier	https://www.aclweb.org/anthology/N19-1070
Related Identifier	http://hdl.handle.net/11234/1-3062
Metadata Access	http://lindat.mff.cuni.cz/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lindat.mff.cuni.cz:11234/1-2943

Provenance
Creator	Çano, Erion
Publisher	Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication Year	2019
Funding Reference	info:eu-repo/grantAgreement/EC/H2020/825460
Rights	Creative Commons - Attribution 4.0 International (CC BY 4.0); http://creativecommons.org/licenses/by/4.0/; PUB
OpenAccess	true
Contact	lindat-help(at)ufal.mff.cuni.cz

Representation
Language	English
Resource Type	corpus
Format	text/plain; charset=utf-8; text/plain; application/zip; downloadable_files_count: 2
Discipline	Linguistics