CMC training corpus Janes-Tag 2.0

Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity annotation of non-standard Slovene. Because the corpus has been carefully annotated by hand, it is also suitable for detailed linguistic explorations that require highly accurate and reliable annotations. Version 2.0 updates version 1.2: it corrects some minor errors and adds named entity annotation.

A slightly older version of this corpus is described in: ERJAVEC, Tomaž, ČIBEJ, Jaka, ARHAR HOLDT, Špela, LJUBEŠIĆ, Nikola, FIŠER, Darja. Gold-standard datasets for annotation of Slovene computer-mediated communication. In Proceedings of RASLAN 2016: Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2016, pp. 29-40, https://nlp.fi.muni.cz/raslan/raslan16.pdf

Note that a related corpus, Janes-Norm, is also available; cf. http://hdl.handle.net/11356/1084.

Identifier
PID http://hdl.handle.net/11356/1123
Related Identifier http://nl.ijs.si/janes/viri/rocno-oznaceni-korpusi/#Janes-Tag
Related Identifier https://doi.org/10.1007/s10579-018-9425-z
Related Identifier http://hdl.handle.net/11356/1085
Related Identifier http://hdl.handle.net/11356/1238
Related Identifier http://nl.ijs.si/janes/
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1123
Provenance
Creator Erjavec, Tomaž; Fišer, Darja; Čibej, Jaka; Arhar Holdt, Špela; Ljubešić, Nikola; Zupan, Katja
Publisher Jožef Stefan Institute
Publication Year 2017
Rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene
Resource Type corpus
Format application/zip; application/pdf; text/plain; charset=utf-8; downloadable_files_count: 7
Discipline Linguistics
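
The "Metadata Access" line above is an OAI-PMH GetRecord request. As a minimal sketch, the same URL can be assembled from its parts in Python; the function name build_getrecord_url is hypothetical and chosen for illustration only, while the endpoint, identifier and metadata prefix are copied from this record. The sketch only builds the URL and does not contact the repository.

```python
from urllib.parse import urlencode

# Values copied from the "Metadata Access" line of this record.
OAI_ENDPOINT = "http://www.clarin.si/repository/oai/request"
RECORD_ID = "oai:www.clarin.si:11356/1123"


def build_getrecord_url(endpoint, identifier, metadata_prefix="oai_dc"):
    """Assemble an OAI-PMH GetRecord URL (hypothetical helper, not part of any library)."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "metadataPrefix": metadata_prefix,
            "identifier": identifier,
        },
        safe=":/",  # keep ':' and '/' readable, as in the record's own URL
    )
    return f"{endpoint}?{query}"


url = build_getrecord_url(OAI_ENDPOINT, RECORD_ID)
print(url)
```

The resulting URL can be passed to any HTTP client to retrieve the Dublin Core metadata for this corpus.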