Training corpus SETimes.SR 1.0

Dataset

The SETimes.SR training corpus contains 86 726 tokens, manually annotated at the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic dependencies, and named entities.

The annotations (and other aspects) of the corpus are documented in the teiHeader and back elements of the TEI-encoded corpus. In short, the annotations follow (1) the MULTEXT-East V5 morphosyntactic specifications, http://nl.ijs.si/ME/V5/msd/, (2) the UDv2 Guidelines, http://universaldependencies.org/guidelines.html, and (3) the Janes annotation guidelines for named entities, http://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf.
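
As a rough illustration of where that documentation sits, the following minimal Python sketch locates the teiHeader and back elements in a TEI document. It assumes the corpus uses the standard TEI namespace; the helper names (documentation_parts, plain_text) and the SAMPLE string are hypothetical and hand-written for this example, not taken from the corpus files.

```python
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace (assumed to be the one used by the corpus).
TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def documentation_parts(tei_xml):
    """Return the (teiHeader, back) elements of a TEI document, or None if absent."""
    root = ET.fromstring(tei_xml)
    # teiHeader is a direct child of the root element.
    header = root.find(f"{TEI_NS}teiHeader")
    # back is nested inside text, so search the whole tree.
    back = next(root.iter(f"{TEI_NS}back"), None)
    return header, back


def plain_text(element):
    """Collapse all text below an element into one whitespace-normalised string."""
    return " ".join("".join(element.itertext()).split())


# Hypothetical, hand-written stand-in for a TEI file (not real corpus content).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Hypothetical sample</title></titleStmt></fileDesc></teiHeader>
  <text>
    <body><p>Corpus text would appear here.</p></body>
    <back><div><p>Annotation notes would appear here.</p></div></back>
  </text>
</TEI>"""

header, back = documentation_parts(SAMPLE)
print(plain_text(header))
print(plain_text(back))
```

For the real corpus, the same functions would be applied to the contents of the TEI file from the downloaded archive, read as UTF-8 text.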

Identifier
PID	http://hdl.handle.net/11356/1200
Related Identifier	http://www.aclweb.org/anthology/W17-1407
Related Identifier	https://github.com/vukbatanovic/SETimes.SR
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1200
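
The Metadata Access link above is an OAI-PMH GetRecord request. The sketch below shows, under stated assumptions, how such a request URL can be assembled and how Dublin Core fields can be read from a response. The function names and the SAMPLE_RESPONSE string are hypothetical: the sample is a trimmed, hand-written stand-in that reuses values from this record, not actual server output.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Endpoint taken from the Metadata Access link of this record.
OAI_ENDPOINT = "http://www.clarin.si/repository/oai/request"
# Dublin Core element namespace used by the oai_dc metadata format.
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def build_get_record_url(identifier, metadata_prefix="oai_dc"):
    """Assemble an OAI-PMH GetRecord URL for one record identifier."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "metadataPrefix": metadata_prefix,
            "identifier": identifier,
        },
        safe=":/",  # keep ':' and '/' readable inside the identifier
    )
    return f"{OAI_ENDPOINT}?{query}"


def dublin_core_fields(xml_text):
    """Collect Dublin Core elements from a response as {name: [values]}."""
    fields = {}
    for element in ET.fromstring(xml_text).iter():
        if element.tag.startswith(DC_NS) and element.text:
            name = element.tag[len(DC_NS):]
            fields.setdefault(name, []).append(element.text.strip())
    return fields


# Hand-written illustrative response (assumption, not real server output).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord><record><metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Training corpus SETimes.SR 1.0</dc:title>
      <dc:language>Serbian</dc:language>
    </oai_dc:dc>
  </metadata></record></GetRecord>
</OAI-PMH>"""

print(build_get_record_url("oai:www.clarin.si:11356/1200"))
print(dublin_core_fields(SAMPLE_RESPONSE))
```

To query the live repository, the URL returned by build_get_record_url could be fetched with urllib.request.urlopen and the decoded body passed to dublin_core_fields; the fields actually returned by the server may differ from the sample.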

Provenance
Creator	Batanović, Vuk; Ljubešić, Nikola; Samardžić, Tanja; Erjavec, Tomaž
Publisher	Regional Linguistic Data Initiative Centre ReLDI
Publication Year	2018
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Serbian
Resource Type	corpus
Format	application/zip; text/plain; charset=utf-8; downloadable_files_count: 3
Discipline	Linguistics