The SentiCoref 1.0 corpus consists of 837 documents selected from the SentiNews 1.0 corpus (http://hdl.handle.net/11356/1110). Documents were selected by the number of named entities detected automatically with Polyglot (https://polyglot.readthedocs.io/): each selected document contains between 50 and 73 detected named entities.
The corpus provides an initial dataset for aspect-based sentiment analysis. The annotations consist of named entities (persons, organizations and locations), coreferences to the named entities, and a 5-level sentiment annotation for each entity (coreference chain). In total there are 31,419 manually tagged named entities: 15,285 organizations, 8,606 persons and 7,528 locations. The dataset contains 14,572 coreference chains. The sentiment distribution for entities is as follows: 30 Very negative, 1,801 Negative, 10,869 Neutral, 1,705 Positive and 24 Very positive.
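As a rough illustration of how skewed these distributions are, the counts above can be turned into percentage shares. The following is only a sketch: the dictionary and function names are made up for this example, and the numbers are copied from the description rather than computed from the corpus files. Note that the listed sentiment counts add up to 14,429, so the sentiment shares are computed against that sum.

```python
# Counts copied from the corpus description (not recomputed from the data).
entity_counts = {"organizations": 15285, "persons": 8606, "locations": 7528}
sentiment_counts = {
    "Very negative": 30,
    "Negative": 1801,
    "Neutral": 10869,
    "Positive": 1705,
    "Very positive": 24,
}


def shares(counts):
    """Return each label's share of the total, in percent (2 decimals)."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}


if __name__ == "__main__":
    for name, counts in (("entities", entity_counts), ("sentiment", sentiment_counts)):
        print(name, "total:", sum(counts.values()))
        for label, pct in shares(counts).items():
            print(f"  {label}: {pct}%")
```

Since the Neutral class dominates, any classifier trained on this data will probably need some form of class balancing or a metric other than plain accuracy.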
Each document was annotated by two linguistics students. Eight students participated in the preparation of the dataset: Rednak Pia, Roblek Rebeka, Jelovšek Tjaša, Agović Haris, Vaupotič Jana, Grego Annamaria, Vidic Zala, Žvanut Kaja. The final curation was done by Neli Blagus and Slavko Žitnik.
The data is in the WebAnno TSV 3 format (similar to the CoNLL format), which is compatible with the WebAnno tool (https://webanno.github.io/webanno/).
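For readers who want to load the files without WebAnno, a minimal reading sketch in Python follows. It assumes the general WebAnno TSV 3 layout: header and sentence-text lines start with "#", and each token line is tab-separated with a sentence-token id, a begin-end character offset, the token text, and then one column per annotation feature. The layer name and the single annotation column in the sample are hypothetical; the actual columns used in SentiCoref should be read from the "#T_SP=" and "#T_CH=" header lines of the distributed files.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Token:
    sentence: int           # sentence number (part before "-" in the id)
    token_id: str           # token id within the sentence; kept as text because sub-token ids such as "1.1" can occur
    begin: int              # character offset where the token starts
    end: int                # character offset where the token ends
    text: str
    annotations: List[str]  # raw annotation columns, "_" meaning no value


def parse_webanno_tsv(content: str) -> Iterator[Token]:
    """Yield tokens from the text of a WebAnno TSV 3 file."""
    for line in content.splitlines():
        # Skip blank lines, the format header, layer definitions and #Text= lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue
        sentence, token_id = fields[0].split("-", 1)
        begin, end = fields[1].split("-", 1)
        # Drop empty strings produced by a trailing tab.
        annotations = [f for f in fields[3:] if f != ""]
        yield Token(int(sentence), token_id, int(begin), int(end), fields[2], annotations)


# Tiny hand-made sample; the layer name and labels are hypothetical.
SAMPLE = "\n".join([
    "#FORMAT=WebAnno TSV 3.2",
    "#T_SP=webanno.custom.Example|value",
    "",
    "",
    "#Text=Ana lives in Ljubljana.",
    "1-1\t0-3\tAna\tPER\t",
    "1-2\t4-9\tlives\t_\t",
    "1-3\t10-12\tin\t_\t",
    "1-4\t13-22\tLjubljana\tLOC\t",
    "1-5\t22-23\t.\t_\t",
])

if __name__ == "__main__":
    for tok in parse_webanno_tsv(SAMPLE):
        print(tok.sentence, tok.token_id, tok.text, tok.annotations)
```

The sketch keeps annotation columns as raw strings on purpose: multi-token spans and coreference links are encoded with bracketed ids and pointers in this format, and decoding them depends on the layers declared in each file's header.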