The news dataset for discriminating between Bosnian, Croatian and Serbian SETimes.HBS 1.0

PID

The SETimes.HBS dataset consists of parallel documents written in Bosnian, Croatian and Serbian, harvested from the already inactive setimes.com website publishing news in the languages of South-Eastern Europe. While the writing process of the documents is not known, they are quite likely independent translations from English. The main intended usage of this dataset is closely-related language discrimination. This dataset is not a traditional parallel dataset as there are no explicit links between parallel documents. Special care was taken that the training, development and testing bins of the dataset contain the same documents in all three languages as data leakage between the three bins, given the similarity of the three languages, could be problematic for benchmarking.

Identifier
PID http://hdl.handle.net/11356/1461
Related Identifier https://aclanthology.org/C12-1160/
Related Identifier https://www.clarin.si/info/k-centre/
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1461
Provenance
Creator Ljubešić, Nikola; Rupnik, Peter
Publisher Jožef Stefan Institute
Publication Year 2022
Rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Bosnian; Croatian; Serbian
Resource Type corpus
Format application/zip; text/plain; charset=utf-8; downloadable_files_count: 1
Discipline Linguistics