OAGKX Keyword Generation Dataset


OAGKX is a keyword extraction/generation dataset consisting of 22674436 abstracts, titles and keyword strings from scientific articles. The texts were lowercased and tokenized with the Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines, one record per line, in each text file.
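
Because the records are stored as JSON lines, a file can be read one record at a time without loading it whole. The following is a minimal sketch, not part of the dataset release: the helper name iter_records is hypothetical, and the code makes no assumption about the field names inside each record.

```python
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty JSON line.

    `lines` can be an open text file or any iterable of strings.
    The record keys are not assumed here; inspect a record to see them.
    """
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines, if any
            yield json.loads(line)


# Typical use (assumes part_0_0.txt has been unpacked locally):
#
#     with open("part_0_0.txt", encoding="utf-8") as f:
#         for record in iter_records(f):
#             print(sorted(record.keys()))
#             break
```

Streaming line by line keeps memory use flat, which matters for a collection of this size.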

The data is derived from the OAG data collection (https://aminer.org/open-academic-graph), which was released under the ODC-BY license.

This data (OAGKX Keyword Generation Dataset) is released under the CC-BY license (https://creativecommons.org/licenses/by/4.0/).

If using it, please cite the following paper:

Çano Erion, Bojar Ondřej. Keyphrase Generation: A Multi-Aspect Survey. FRUCT 2019, Proceedings of the 25th Conference of the Open Innovations Association FRUCT, Helsinki, Finland, Nov. 2019

To reproduce the experiments in the above paper, you can use the first 100000 lines of the part_0_0.txt file.
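
Selecting that subset can be sketched as follows. This is an illustration only: the helper name first_records is hypothetical, and the code covers just the "first 100000 lines" selection, not any further splitting or processing done in the paper.

```python
import json
from itertools import islice
from typing import Iterable, List


def first_records(lines: Iterable[str], limit: int = 100_000) -> List[dict]:
    """Parse at most the first `limit` lines as JSON records.

    Assumes one JSON record per line; blank lines within the first
    `limit` lines are skipped rather than parsed.
    """
    return [json.loads(line) for line in islice(lines, limit) if line.strip()]


# Typical use (assumes part_0_0.txt has been unpacked locally):
#
#     with open("part_0_0.txt", encoding="utf-8") as f:
#         subset = first_records(f)
```

Using islice stops reading after the requested number of lines, so the rest of the file is never touched.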

Identifier
PID http://hdl.handle.net/11234/1-3062
Related Identifier https://ieeexplore.ieee.org/document/8981519
Related Identifier http://hdl.handle.net/11234/1-2943
Metadata Access http://lindat.mff.cuni.cz/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lindat.mff.cuni.cz:11234/1-3062
Provenance
Creator Çano, Erion
Publisher Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication Year 2019
Funding Reference info:eu-repo/grantAgreement/EC/H2020/825460
Rights Creative Commons - Attribution 4.0 International (CC BY 4.0); http://creativecommons.org/licenses/by/4.0/; PUB
OpenAccess true
Contact lindat-help(at)ufal.mff.cuni.cz
Representation
Language English
Resource Type corpus
Format text/plain; charset=utf-8; application/zip; text/plain; downloadable_files_count: 2
Discipline Linguistics