The news dataset for discriminating between Bosnian, Croatian and Serbian SETimes.HBS 1.0

Dataset

The SETimes.HBS dataset consists of parallel documents written in Bosnian, Croatian and Serbian, harvested from the now-inactive setimes.com website, which published news in the languages of South-Eastern Europe. While the writing process of the documents is not known, they are quite likely independent translations from English. The main intended use of this dataset is discrimination between closely related languages. It is not a traditional parallel dataset, as there are no explicit links between parallel documents. Special care was taken to ensure that the training, development and test bins of the dataset each contain the same documents in all three languages, because data leakage between the bins, given the similarity of the three languages, could be problematic for benchmarking.
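
The leakage concern above can be illustrated with a minimal sketch of a grouped split, in which every language version of a document is assigned to the same bin. This is not the procedure used to build SETimes.HBS, which is not described here; the record layout, the shared `doc_id` field, the split shares and the function names are all assumptions made for illustration only.

```python
import hashlib
from collections import defaultdict


def assign_split(doc_id: str, dev_share: float = 0.1, test_share: float = 0.1) -> str:
    """Map a document identifier to 'train', 'dev' or 'test' deterministically.

    Hypothetical helper: the bin depends only on doc_id, never on the language,
    so all language versions of one document end up in the same bin.
    """
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it to [0, 1).
    position = int.from_bytes(digest[:8], "big") / 2**64
    if position < test_share:
        return "test"
    if position < test_share + dev_share:
        return "dev"
    return "train"


def split_records(records):
    """Group (doc_id, lang, text) records into bins keyed by split name."""
    splits = defaultdict(list)
    for doc_id, lang, text in records:
        splits[assign_split(doc_id)].append((doc_id, lang, text))
    return splits


# Hypothetical toy records: (doc_id, language code, text).
records = [
    (f"doc-{i:03d}", lang, f"placeholder text {i} ({lang})")
    for i in range(50)
    for lang in ("bs", "hr", "sr")
]

splits = split_records(records)

# Collect the bins in which each document appears; a leakage-free split
# yields exactly one bin per document.
bins_per_doc = defaultdict(set)
for name, items in splits.items():
    for doc_id, _lang, _text in items:
        bins_per_doc[doc_id].add(name)
```

Hashing the identifier rather than shuffling keeps the assignment reproducible across runs and independent of record order, at the cost of bin sizes that only approximate the requested shares.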

Identifier
PID	http://hdl.handle.net/11356/1461
Related Identifier	https://aclanthology.org/C12-1160/
Related Identifier	https://www.clarin.si/info/k-centre/
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1461

Provenance
Creator	Ljubešić, Nikola; Rupnik, Peter
Publisher	Jožef Stefan Institute
Publication Year	2022
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Bosnian; Croatian; Serbian
Resource Type	corpus
Format	application/zip; text/plain; charset=utf-8; downloadable_files_count: 1
Discipline	Linguistics