The corpus of web commentaries with sentiment categorizations was developed as part of a BSc thesis (Kadunc, 2016) and was used to evaluate the Slovene Sentiment Lexicon KSS (http://hdl.handle.net/11356/1097). It contains web commentaries on various topics (business, politics, sport, and other) from four Slovene web portals (RtvSlo, 24ur, Finance, Reporter). The corpus is in XML format and is available in two forms:
- original corpus, containing 4,777 commentaries: 898 positive, 3,291 negative, and 588 neutral.
- balanced corpus, a subset of the original corpus, containing 1,740 commentaries: 580 for each sentiment type (positive, negative, and neutral).
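The relationship between the two forms can be illustrated with a minimal Python sketch that draws 580 items per class from a collection with the class sizes of the original corpus. The function name `balanced_subset`, the `(text, label)` pair representation, and the placeholder commentaries are hypothetical and used only for illustration. Random downsampling is also an assumption: the description does not state how the balanced subset was selected, and the sketch does not read the actual XML files.

```python
import random
from collections import Counter

# Class sizes of the original corpus, as stated in the description above.
ORIGINAL_COUNTS = {"positive": 898, "negative": 3291, "neutral": 588}


def balanced_subset(items, per_class, seed=0):
    """Return a subset with exactly `per_class` items of each label.

    `items` is a list of (text, label) pairs. Random downsampling is an
    assumption; the actual selection procedure is not documented here.
    """
    rng = random.Random(seed)
    by_label = {}
    for text, label in items:
        by_label.setdefault(label, []).append((text, label))

    subset = []
    for label in sorted(by_label):
        group = by_label[label]
        if len(group) < per_class:
            raise ValueError(f"only {len(group)} items labelled {label!r}")
        subset.extend(rng.sample(group, per_class))
    return subset


# Placeholder commentaries standing in for the real corpus content.
corpus = [
    (f"{label} commentary {i}", label)
    for label, n in ORIGINAL_COUNTS.items()
    for i in range(n)
]

balanced = balanced_subset(corpus, per_class=580)
label_counts = Counter(label for _, label in balanced)
print(len(corpus), len(balanced), dict(label_counts))
```

With real data, the placeholder list would be replaced by pairs read from the XML files, whose element and attribute names are not given in this description.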
References:
Klemen Kadunc (2016). Določanje sentimenta slovenskim spletnim komentarjem s pomočjo strojnega učenja. Diplomsko delo. Univerza v Ljubljani, Fakulteta za računalništvo in informatiko (in Slovene). http://eprints.fri.uni-lj.si/3317/
Klemen Kadunc, Marko Robnik-Šikonja (2016). Analiza mnenj s pomočjo strojnega učenja in slovenskega leksikona sentimenta. Conference on Language Technologies & Digital Humanities, Ljubljana (in Slovene). http://www.sdjt.si/wp/dogodki/konference/jtdh-2016/zbornik/