Comparable corpora of South-Slavic Wikipedias CLASSLA-Wikipedia 1.0

This collection of comparable corpora consists of dumps of the Bosnian, Croatian, Macedonian, Montenegrin, Serbian, Serbo-Croatian and Slovenian Wikipedias, harvested on October 17th, 2020. The text was extracted from the dumps using the process documented at https://github.com/clarinsi/classla-wikipedia. Linguistic annotation was performed with the classla package (https://pypi.org/project/classla/) at all annotation levels available for each language. The Bosnian and Serbo-Croatian Wikipedias were processed with the standard Croatian models.
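
The annotation step described above can be illustrated with a minimal sketch. It is not the pipeline the authors ran (that one is documented in the linked GitHub repository). The Wikipedia language codes, the mapping table and the helper names `model_for` and `annotate` are assumptions made for illustration. Only the fallback of Bosnian and Serbo-Croatian to the Croatian models comes from the description; Montenegrin is left out because the description does not say which models were used for it.

```python
# Hypothetical mapping from Wikipedia language code to classla model code.
# Only the "bs" -> "hr" and "sh" -> "hr" fallbacks are stated in the corpus
# description; the remaining entries are assumed to map to themselves.
MODEL_FOR_WIKI = {
    "bs": "hr",  # Bosnian, processed with the standard Croatian models
    "sh": "hr",  # Serbo-Croatian, processed with the standard Croatian models
    "hr": "hr",
    "mk": "mk",
    "sr": "sr",
    "sl": "sl",
}


def model_for(wiki_lang: str) -> str:
    """Return the classla model code assumed for a Wikipedia language code."""
    return MODEL_FOR_WIKI[wiki_lang]


def annotate(text: str, wiki_lang: str) -> str:
    """Annotate text with classla and return the result in CoNLL-U format.

    Requires `pip install classla`; models are downloaded on first use.
    """
    import classla  # imported lazily so the mapping works without classla

    lang = model_for(wiki_lang)
    classla.download(lang)        # fetch the standard models for this language
    nlp = classla.Pipeline(lang)  # default processors available for the language
    return nlp(text).to_conll()
```

Calling `annotate("Sarajevo je glavni grad Bosne i Hercegovine.", "bs")` would, under these assumptions, run the Croatian pipeline over a Bosnian sentence, mirroring the fallback described above.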

Identifier
PID http://hdl.handle.net/11356/1427
Related Identifier https://aclanthology.org/2021.ranlp-1.104.pdf
Related Identifier https://github.com/clarinsi/classla-wikipedia
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1427
Provenance
Creator Ljubešić, Nikola; Markoski, Filip; Markoska, Elena; Erjavec, Tomaž
Publisher Jožef Stefan Institute
Publication Year 2021
Rights Creative Commons - Attribution 4.0 International (CC BY 4.0); https://creativecommons.org/licenses/by/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Bosnian; Croatian; Macedonian; Serbian; Slovenian; Slovene
Resource Type corpus
Format application/gzip; text/plain; charset=utf-8; downloadable_files_count: 7
Discipline Linguistics