CMC training corpus Janes-Tag 2.1

Dataset

Janes-Tag is a manually annotated corpus of Slovene computer-mediated communication (CMC). It is intended as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity annotation of non-standard Slovene. Because the corpus has been carefully annotated by hand, it is also suitable for detailed linguistic studies that require highly accurate and reliable annotations. Compared with version 2.0, this version corrects some minor errors in the named entity (NER) annotation and adds Universal Dependencies (UD) morphological features alongside the MULTEXT-East morphosyntactic descriptions. The corpus is now also distributed in CoNLL-U format, and the UD features are included in the vert file as well.
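Since the corpus is distributed in CoNLL-U format, a minimal reading sketch may be useful. It assumes only the standard ten tab-separated CoNLL-U columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), with the MULTEXT-East description in XPOS and the UD features in FEATS. The two-token sample is invented for illustration and is not taken from the corpus; the function names are hypothetical, and the exact contents of each column in the released files should be checked against the corpus documentation.

```python
from typing import Dict, Iterable, Iterator, List

# Standard CoNLL-U column names, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_feats(value: str) -> Dict[str, str]:
    """Turn 'Case=Nom|Number=Sing' into a dict; '_' means no features."""
    if value == "_":
        return {}
    return dict(pair.split("=", 1) for pair in value.split("|"))


def read_conllu(lines: Iterable[str]) -> Iterator[List[dict]]:
    """Yield sentences as lists of token dicts; blank lines end a sentence."""
    sentence: List[dict] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):  # sentence-level comment / metadata line
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        token = dict(zip(FIELDS, cols))
        token["feats"] = parse_feats(token["feats"])
        sentence.append(token)
    if sentence:  # file did not end with a blank line
        yield sentence


# Invented sample (NOT from Janes-Tag): a non-standard form with its lemma.
SAMPLE = (
    "# sent_id = example.1\n"
    "1\tjst\tjaz\tPRON\tPp1-sn\tCase=Nom|Number=Sing|Person=1\t_\t_\t_\t_\n"
    "2\tgrem\titi\tVERB\tVmbr1s\tNumber=Sing|Person=1|Tense=Pres\t_\t_\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE.splitlines()))
for tok in sentences[0]:
    print(tok["form"], tok["lemma"], tok["xpos"], tok["feats"])
```

To read a real file, the same function can be given an open file handle, for example `read_conllu(open(path, encoding="utf-8"))`, where `path` points to a CoNLL-U file extracted from the downloaded archive.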

The first version of this corpus and the Janes project are described in:

ERJAVEC, Tomaž, ČIBEJ, Jaka, ARHAR HOLDT, Špela, LJUBEŠIĆ, Nikola, FIŠER, Darja. 2016. Gold-standard datasets for annotation of Slovene computer-mediated communication. In Proceedings of RASLAN 2016: Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2016, pp. 29-40, https://nlp.fi.muni.cz/raslan/raslan16.pdf

FIŠER, Darja, LJUBEŠIĆ, Nikola, ERJAVEC, Tomaž. 2018. The Janes project: language resources and tools for Slovene user generated content. Language Resources & Evaluation. https://rdcu.be/7RX4

Note that a related corpus, Janes-Norm, is also available; see http://hdl.handle.net/11356/1084.

Identifier
PID	http://hdl.handle.net/11356/1238
Related Identifier	http://nl.ijs.si/janes/viri/rocno-oznaceni-korpusi/#Janes-Tag
Related Identifier	https://rdcu.be/7RX4
Related Identifier	http://hdl.handle.net/11356/1123
Related Identifier	http://hdl.handle.net/11356/1732
Related Identifier	http://nl.ijs.si/janes/
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1238

Provenance
Creator	Erjavec, Tomaž; Fišer, Darja; Čibej, Jaka; Arhar Holdt, Špela; Ljubešić, Nikola; Zupan, Katja; Dobrovoljc, Kaja
Publisher	Jožef Stefan Institute
Publication Year	2019
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	corpus
Format	application/zip; application/pdf; text/plain; charset=utf-8; downloadable_files_count: 7
Discipline	Linguistics