Croatian Twitter training corpus ReLDI-NormTagNER-hr 3.0

Dataset

ReLDI-NormTagNER-hr 3.0 is a manually annotated corpus of Croatian tweets. It is intended as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Croatian. Each tweet is also labelled with automatically assigned standardness levels (T = technical standardness, L = linguistic standardness).

This version corrects various annotation errors and encodes the dataset in the CoNLL-U-Plus format, in line with other manually annotated linguistic datasets for Croatian and Serbian.
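
As an illustration of how a CoNLL-U-Plus file can be read, the sketch below parses the `# global.columns` header that the format uses to name its columns and returns each sentence as a list of token dictionaries. It is a minimal sketch, not the tooling used by the dataset authors. The function name `read_conllup`, the column set (ID, FORM, LEMMA, UPOS) and the sample text are assumptions made for the example; the actual column names should be taken from the header of the downloaded files.

```python
import io


def read_conllup(lines):
    """Yield (comments, tokens) for each sentence in a CoNLL-U-Plus stream.

    Column names are taken from the '# global.columns = ...' header line,
    so the reader does not hard-code any particular column layout.
    """
    columns = None
    comments, tokens = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns"):
            # Header: column names separated by whitespace after '='.
            columns = line.split("=", 1)[1].split()
            continue
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                yield comments, tokens
            comments, tokens = [], []
            continue
        if line.startswith("#"):
            # Sentence-level comment (metadata such as the raw text).
            comments.append(line[1:].strip())
            continue
        if columns is None:
            raise ValueError("missing '# global.columns' header")
        # Token line: tab-separated values, one per declared column.
        tokens.append(dict(zip(columns, line.split("\t"))))
    if tokens:
        # Last sentence when the file does not end with a blank line.
        yield comments, tokens


# Invented sample for illustration only; NOT taken from the corpus.
# The column set here is an assumption; real files declare their own.
SAMPLE = (
    "# global.columns = ID FORM LEMMA UPOS\n"
    "# text = Dobar dan\n"
    "1\tDobar\tdobar\tADJ\n"
    "2\tdan\tdan\tNOUN\n"
    "\n"
)

sentences = list(read_conllup(io.StringIO(SAMPLE)))
```

To read a real file, an open file handle can be passed in place of the `io.StringIO` object, for example `read_conllup(open(path, encoding="utf-8"))`, where `path` points to one of the downloaded files.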

The continuous improvement of this dataset is led by the CLASSLA knowledge centre for South Slavic languages (https://www.clarin.si/info/k-centre/) and the ReLDI Centre Belgrade (https://reldi.spur.uzh.ch).

Identifier
PID	http://hdl.handle.net/11356/1793
Related Identifier	http://dx.doi.org/10.4312/slo2.0.2016.2.156-188
Related Identifier	http://hdl.handle.net/11356/1241
Related Identifier	https://github.com/reldi-data/reldi-normtagner-hr
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1793

Provenance
Creator	Ljubešić, Nikola; Erjavec, Tomaž; Batanović, Vuk; Miličević, Maja; Samardžić, Tanja
Publisher	Jožef Stefan Institute
Publication Year	2023
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Croatian
Resource Type	corpus
Format	text/plain; charset=utf-8; application/octet-stream; application/gzip; downloadable_files_count: 4
Discipline	Linguistics