Comparable corpora of South-Slavic Wikipedias CLASSLA-Wikipedia 1.0

Dataset

This collection of comparable corpora consists of dumps of the Bosnian, Croatian, Macedonian, Montenegrin, Serbian, Serbo-Croatian and Slovenian Wikipedias, harvested on October 17th, 2020. The text was extracted from the dumps with the process documented at https://github.com/clarinsi/classla-wikipedia. Linguistic annotation was performed with the classla package (https://pypi.org/project/classla/) at all annotation levels available for each language; the Bosnian and Serbo-Croatian Wikipedias were processed with the standard Croatian models.
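The model choice described above can be sketched as a small lookup table. This is an illustrative sketch only, not the authors' pipeline: the Wikipedia language codes, the helper names `model_language_for` and `annotate`, and the assumption that Croatian, Macedonian, Serbian and Slovenian each use their own classla model are not taken from this record. Montenegrin is left out because the record does not state which model was applied to it.

```python
# Hypothetical mapping from Wikipedia language code to classla model language.
# Only the bs -> hr and sh -> hr entries are stated in the description;
# the remaining entries are assumptions made for this sketch.
MODEL_FOR_WIKI = {
    "bs": "hr",  # Bosnian: processed with the standard Croatian models
    "sh": "hr",  # Serbo-Croatian: processed with the standard Croatian models
    "hr": "hr",  # assumption: own model
    "mk": "mk",  # assumption: own model
    "sr": "sr",  # assumption: own model
    "sl": "sl",  # assumption: own model
}


def model_language_for(wiki_code):
    """Return the classla model language assumed for a Wikipedia edition."""
    return MODEL_FOR_WIKI[wiki_code]


def annotate(text, wiki_code):
    """Annotate text with the classla pipeline chosen for wiki_code.

    Requires the classla package and downloads models on first use,
    so the import is kept local to this function.
    """
    import classla

    lang = model_language_for(wiki_code)
    classla.download(lang)          # fetch the models for that language
    nlp = classla.Pipeline(lang)    # default processors for the language
    return nlp(text)
```

Calling `annotate("...", "bs")` would therefore run the Croatian pipeline over Bosnian text, mirroring the substitution the description mentions.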

Identifier
PID	http://hdl.handle.net/11356/1427
Related Identifier	https://aclanthology.org/2021.ranlp-1.104.pdf
Related Identifier	https://github.com/clarinsi/classla-wikipedia
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1427
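The Metadata Access URL is an OAI-PMH request. As a rough sketch, it can be split into its query parameters with the Python standard library; the helper name `oai_params` is hypothetical, and the code only parses the URL without contacting the repository.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Metadata Access URL copied from this record.
METADATA_URL = (
    "http://www.clarin.si/repository/oai/request"
    "?verb=GetRecord&metadataPrefix=oai_dc"
    "&identifier=oai:www.clarin.si:11356/1427"
)


def oai_params(url):
    """Return the query parameters of an OAI-PMH request URL as a dict."""
    query = urlsplit(url).query
    # parse_qs returns a list per key; each key occurs once here.
    return {key: values[0] for key, values in parse_qs(query).items()}


params = oai_params(METADATA_URL)

# The handle suffix is the part of the OAI identifier after the last colon.
handle = params["identifier"].rsplit(":", 1)[-1]

# Rebuild the request from its parts, keeping ':' and '/' unescaped.
base = METADATA_URL.split("?", 1)[0]
rebuilt = base + "?" + urlencode(params, safe=":/")

print(params["verb"], params["metadataPrefix"], handle)
```

The handle recovered from the identifier matches the suffix of the PID listed above, so the same record can be addressed either through the handle or through the OAI-PMH endpoint.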

Provenance
Creator	Ljubešić, Nikola; Markoski, Filip; Markoska, Elena; Erjavec, Tomaž
Publisher	Jožef Stefan Institute
Publication Year	2021
Rights	Creative Commons - Attribution 4.0 International (CC BY 4.0); https://creativecommons.org/licenses/by/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Bosnian; Croatian; Macedonian; Serbian; Slovenian; Slovene
Resource Type	corpus
Format	application/gzip; text/plain; charset=utf-8; downloadable_files_count: 7
Discipline	Linguistics