Blog post and comment corpus Janes-Blog 1.0

Janes-Blog is an annotated corpus of Slovene blogs from the websites rtvslo.si and publishwall.si, covering the period from 2006-10 to 2016-01. The corpus is structured into individual texts, each containing a blog post and the comments on it, together with their metadata. The texts are tokenised, sentence-segmented, word-normalised, morphosyntactically tagged, lemmatised, and annotated with named entities. To protect privacy, usernames are not included in the metadata, and 'person' and 'person derivative' named entities have been removed from the texts.

Identifier
PID http://hdl.handle.net/11356/1138
Related Identifier https://doi.org/10.4312/slo2.0.2016.2.67-99
Related Identifier https://nl.ijs.si/janes/viri/avtomatsko-oznaceni-korpusi/#Janes-Blog
Related Identifier https://doi.org/10.1007/s10579-018-9425-z
Related Identifier https://nl.ijs.si/janes/
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1138
Provenance
Creator Erjavec, Tomaž; Ljubešić, Nikola; Fišer, Darja
Publisher Jožef Stefan Institute
Publication Year 2017
Rights Creative Commons - Attribution 4.0 International (CC BY 4.0); https://creativecommons.org/licenses/by/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene
Resource Type corpus
Format application/zip; text/plain; charset=utf-8; downloadable_files_count: 2
Discipline Linguistics
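
The Metadata Access URL listed under Identifier is an OAI-PMH GetRecord request for the record in Dublin Core (oai_dc) form. The following is a minimal sketch of how such a request could be built and its response read in Python, using only the standard library. The function names (build_getrecord_url, dc_fields, fetch_record) are hypothetical helpers introduced for illustration, and the sketch assumes the endpoint returns standard oai_dc XML; it does not reflect any official CLARIN.SI client.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

# Endpoint and identifier are taken from the Metadata Access line of this record.
OAI_ENDPOINT = "http://www.clarin.si/repository/oai/request"
RECORD_ID = "oai:www.clarin.si:11356/1138"

# Standard Dublin Core element namespace, in ElementTree's {uri} notation.
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def build_getrecord_url(identifier, metadata_prefix="oai_dc"):
    """Build an OAI-PMH GetRecord URL (hypothetical helper)."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": identifier,
    }
    # Keep ':' and '/' unescaped so the URL matches the form shown in the record.
    return OAI_ENDPOINT + "?" + urlencode(params, safe=":/")


def dc_fields(xml_text):
    """Collect non-empty Dublin Core elements as {name: [values, ...]}."""
    root = ET.fromstring(xml_text)
    fields = {}
    for elem in root.iter():
        if elem.tag.startswith(DC_NS) and elem.text and elem.text.strip():
            name = elem.tag[len(DC_NS):]
            fields.setdefault(name, []).append(elem.text.strip())
    return fields


def fetch_record(identifier=RECORD_ID):
    """Download the record and return its Dublin Core fields (needs network)."""
    with urlopen(build_getrecord_url(identifier), timeout=30) as response:
        return dc_fields(response.read())
```

Calling fetch_record() would, under these assumptions, return a dictionary with keys such as "title" and "creator"; the exact set of fields depends on what the repository exposes.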