The Twitter user dataset for discriminating between Bosnian, Croatian, Montenegrin and Serbian Twitter-HBS 1.0

PID

The Twitter-HBS dataset consists of Twitter users, their tweets, and the label of their predominantly used language - Bosnian, Croatian, Montenegrin, or Serbian. Among the tweets, there are also tweets in other languages (mainly English) as the label encodes the predominantly used language of a user only. The main intended usage of this dataset is discrimination between closely-related languages on the level of a Twitter user (not a single tweet). The only pre-processing performed on the texts of the tweets is the transliteration from the Cyrillic into the Latin script so that the dataset measures the quality of the user classifications regardless of the script used.

Identifier
PID http://hdl.handle.net/11356/1482
Related Identifier https://www.informatica.si/index.php/informatica/article/view/746
Related Identifier https://www.clarin.si/info/k-centre/
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1482
Provenance
Creator Ljubešić, Nikola; Rupnik, Peter
Publisher Jožef Stefan Institute
Publication Year 2022
Rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Bosnian; Croatian; Serbian
Resource Type corpus
Format application/zip; text/plain; charset=utf-8; downloadable_files_count: 1
Discipline Linguistics