Data used in the experiments described in:
Nikola Ljubešić, Katja Zupan, Darja Fišer and Tomaž Erjavec: Normalising Slovene data: historical texts vs. user-generated content. Proceedings of KONVENS 2016, September 19–21, 2016, Bochum, Germany.
https://www.linguistics.rub.de/konvens16/pub/19_konvensproc.pdf
(https://www.linguistics.rub.de/konvens16/)
The data are split into a "token" folder (experiment on normalising individual tokens) and a "segment" folder (experiment on normalising whole segments of text, i.e. sentences or tweets). Each experiment folder contains "train", "dev" and "test" subfolders. Each subfolder contains two files per dataset: the original data (.orig.txt) and the data with hand-normalised words (.norm.txt). The two files are aligned line by line.
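
For illustration, a pair of aligned files can be read in parallel, e.g. in Python (a minimal sketch; the path and dataset name below only follow the layout described above and are not prescribed):

    # Minimal sketch: read one aligned .orig.txt/.norm.txt pair line by line.
    # The path "token/train/goo300k-bohoric" is an illustrative example.
    base = "token/train/goo300k-bohoric"
    with open(base + ".orig.txt", encoding="utf-8") as f_orig, \
         open(base + ".norm.txt", encoding="utf-8") as f_norm:
        for orig_line, norm_line in zip(f_orig, f_norm):
            # Corresponding lines form an original/normalised pair.
            print(orig_line.rstrip("\n"), "->", norm_line.rstrip("\n"))
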
There are four datasets:
- goo300k-bohoric: historical Slovene, hard case (before 1850)
- goo300k-gaj: historical Slovene, easy case (1850–1900)
- tweet-L3: Slovene tweets, hard case (non-standard language)
- tweet-L1: Slovene tweets, easy case (mostly standard language)
The goo300k data come from http://hdl.handle.net/11356/1025, while the tweet data originate from the JANES project (https://nl.ijs.si/janes/english/).
The text in the files has been split by inserting spaces between characters, with the underscore character (_) standing in for the original space. Tokens not relevant for normalisation (e.g. URLs, hashtags) have been replaced with the inverted question mark character ('¿').
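
For illustration, this encoding can be reversed as follows (a minimal Python sketch; the example input is invented):

    # Decode a line of the character-split format: characters are separated by
    # spaces, '_' stands for the original space, and '¿' marks a token that was
    # excluded from normalisation (e.g. a URL or hashtag).
    def decode(line):
        return "".join(" " if c == "_" else c for c in line.split(" "))

    print(decode("d a n e s _ ¿"))  # -> "danes ¿" (invented example)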