Slovene corpus for aspect-based sentiment analysis - SentiCoref 1.0 - Dataset

The SentiCoref 1.0 corpus consists of 837 documents selected from the SentiNews 1.0 corpus (http://hdl.handle.net/11356/1110). Documents were selected by the number of named entities detected automatically with Polyglot (https://polyglot.readthedocs.io/): each selected document contains between 50 and 73 named entities.

The corpus provides an initial dataset for aspect-based sentiment analysis. The annotations consist of named entities (persons, organizations and locations), coreferences to the named entities, and a 5-level sentiment annotation for each entity (coreference chain). In total there are 31,419 manually tagged named entities: 15,285 organizations, 8,606 persons and 7,528 locations. The dataset contains 14,572 coreference chains. The sentiment distribution for entities is as follows: 30 Very negative, 1,801 Negative, 10,869 Neutral, 1,705 Positive and 24 Very positive.
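The figures above can be cross-checked with a few lines of Python. This is only a sketch that copies the counts from the description; the variable names are illustrative and not part of the dataset. The sentiment counts as listed sum to 14,429; the description does not say how this relates to the 14,572 coreference chains.

```python
# Counts copied from the corpus description; variable names are illustrative.
entity_counts = {"organizations": 15285, "persons": 8606, "locations": 7528}
sentiment_counts = {
    "Very negative": 30,
    "Negative": 1801,
    "Neutral": 10869,
    "Positive": 1705,
    "Very positive": 24,
}

# Total number of tagged named entities (reported as 31,419).
total_entities = sum(entity_counts.values())

# Total number of entities that carry one of the five sentiment labels.
total_labelled = sum(sentiment_counts.values())

# Share of each sentiment label among the labelled entities.
sentiment_share = {
    label: count / total_labelled for label, count in sentiment_counts.items()
}

if __name__ == "__main__":
    print("named entities:", total_entities)
    print("sentiment-labelled entities:", total_labelled)
    for label, share in sentiment_share.items():
        print(f"{label}: {share:.1%}")
```

The shares make the class imbalance explicit: roughly three quarters of the labelled entities are Neutral, while the two extreme classes together account for well under one percent.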

Each document was annotated by two linguistics students. Eight students participated in the preparation of the dataset: Rednak Pia, Roblek Rebeka, Jelovšek Tjaša, Agović Haris, Vaupotič Jana, Grego Annamaria, Vidic Zala, Žvanut Kaja. The final curation was done by Neli Blagus and Slavko Žitnik.

The data is in the WebAnno TSV 3 format (similar to the CoNLL format), which is compatible with the WebAnno tool (https://webanno.github.io/webanno/).
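A minimal sketch of reading such a file in Python follows. It assumes only the general WebAnno TSV 3 layout: lines starting with `#` are headers or sentence text, and each token line holds a `sentence-token` id, a `begin-end` character span, the token, and then the annotation columns. The function name `read_webanno_tsv`, the sample lines, and the single `LOC` annotation column are hypothetical; the actual number and order of annotation columns in SentiCoref 1.0 should be read from the header lines of the distributed files.

```python
from typing import Dict, Iterable, Iterator


def read_webanno_tsv(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one dict per token line of a WebAnno TSV 3 file (sketch)."""
    for raw in lines:
        line = raw.rstrip("\n")
        # Skip blank lines and header/sentence-text lines ("#FORMAT=", "#Text=", ...).
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue
        sentence_id, token_id = fields[0].split("-", 1)
        begin, end = fields[1].split("-", 1)
        # Remaining columns are annotation layers; drop empty trailing fields
        # left by a tab at the end of the line. "_" marks "no annotation".
        annotations = fields[3:]
        while annotations and annotations[-1] == "":
            annotations.pop()
        yield {
            "sentence": int(sentence_id),
            "token_id": token_id,  # kept as text; sub-token ids such as "1.1" exist
            "begin": int(begin),
            "end": int(end),
            "token": fields[2],
            "annotations": annotations,
        }


# Made-up sample for illustration only; not taken from the corpus.
sample = [
    "#FORMAT=WebAnno TSV 3.2\n",
    "\n",
    "#Text=Ljubljana je mesto.\n",
    "1-1\t0-9\tLjubljana\tLOC\t\n",
    "1-2\t10-12\tje\t_\t\n",
    "1-3\t13-18\tmesto\t_\t\n",
    "1-4\t18-19\t.\t_\t\n",
]

tokens = list(read_webanno_tsv(sample))

if __name__ == "__main__":
    for tok in tokens:
        print(tok["token"], tok["begin"], tok["end"], tok["annotations"])
```

For a real file, the same function can be fed an open file handle, for example `read_webanno_tsv(open(path, encoding="utf-8"))`, where `path` points to one of the extracted corpus files.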

Identifier
PID	http://hdl.handle.net/11356/1285
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1285

Provenance
Creator	Žitnik, Slavko
Publisher	Faculty of Computer and Information Science, University of Ljubljana
Publication Year	2019
Rights	Creative Commons - Attribution 4.0 International (CC BY 4.0); https://creativecommons.org/licenses/by/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	corpus
Format	application/zip; application/pdf; text/plain; charset=utf-8; downloadable_files_count: 2
Discipline	Linguistics