The Twitter user dataset for discriminating between Bosnian, Croatian, Montenegrin and Serbian Twitter-HBS 1.0

Dataset

The Twitter-HBS dataset consists of Twitter users, their tweets, and a label for each user's predominantly used language: Bosnian, Croatian, Montenegrin, or Serbian. Because the label encodes only a user's predominant language, the tweets also include some in other languages (mainly English). The main intended use of the dataset is discriminating between closely related languages at the level of a Twitter user, not a single tweet. The only pre-processing applied to the tweet texts is transliteration from the Cyrillic to the Latin script, so that the dataset measures the quality of user classification regardless of the script used.
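
The record does not say which tool performed the transliteration or how the files are laid out, so the following is only a minimal sketch of the idea, assuming the standard 30-letter Serbian Cyrillic alphabet. The names `CYR_TO_LAT`, `transliterate`, and `user_document` are hypothetical and not part of the dataset. The sketch maps each Cyrillic letter to its Latin counterpart and then joins a user's tweets into one text, which is the unit a user-level classifier would work on.

```python
# Hypothetical sketch: not the tool used to build Twitter-HBS.
# Lowercase Serbian Cyrillic letters and their Latin equivalents.
# Three letters (lj, nj, dž) become two-character Latin digraphs.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}

# Add uppercase letters. Digraphs are title-cased ("Љ" -> "Lj"), which is a
# simplification: an all-caps Cyrillic word would come out as "LjUBAV".
CYR_TO_LAT.update(
    {cyr.upper(): lat.capitalize() for cyr, lat in list(CYR_TO_LAT.items())}
)


def transliterate(text: str) -> str:
    """Replace Cyrillic letters with Latin ones; leave everything else as is."""
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in text)


def user_document(tweets: list[str]) -> str:
    """Join one user's transliterated tweets into a single text.

    Assumption: the classifier sees all of a user's tweets at once,
    since the label applies to the user and not to individual tweets.
    """
    return " ".join(transliterate(tweet) for tweet in tweets)


if __name__ == "__main__":
    print(user_document(["Београд је леп", "hello Њујорк"]))
```

The Montenegrin letters ś and ź are not covered by this mapping; Latin text, digits, punctuation, and tweets in other languages pass through unchanged.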

Identifier
PID	http://hdl.handle.net/11356/1482
Related Identifier	https://www.informatica.si/index.php/informatica/article/view/746
Related Identifier	https://www.clarin.si/info/k-centre/
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1482

Provenance
Creator	Ljubešić, Nikola; Rupnik, Peter
Publisher	Jožef Stefan Institute
Publication Year	2022
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Bosnian; Croatian; Serbian
Resource Type	corpus
Format	application/zip; text/plain; charset=utf-8; downloadable_files_count: 1
Discipline	Linguistics