Šolar-Eval is a dataset for evaluating Slovene spell- and grammar-checking tools and methods. It comprises 109 essays written by Slovene primary and secondary school students, with 9,808 language corrections annotated according to the type of language problem identified.
The essays are taken from the Šolar 3.0 corpus, which contains authentic corrections made by language teachers (http://hdl.handle.net/11356/1589). However, teacher corrections are often inconsistent and heterogeneous, particularly where style improvements are concerned, which makes that corpus suboptimal for evaluation tasks. For Šolar-Eval, the corrections were made by researchers, with the aim of ensuring consistency, homogeneity, and minimal language intervention.
The corrections are annotated according to the reference guidelines found in the attached document. The codes for language errors are structured hierarchically, which allows evaluation at either a coarse or a fine-grained level.
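Because the error codes are hierarchical, the same annotations can be scored at different depths by truncating each code to its first few levels. The following is a minimal sketch of that idea, not the procedure used by the dataset authors. It assumes that the levels of a code are joined with a "/" separator; the codes in `gold` are invented placeholders, and the real code inventory is defined in the attached guidelines.

```python
from collections import Counter


def truncate_code(code: str, depth: int, sep: str = "/") -> str:
    """Keep only the first `depth` levels of a hierarchical error code.

    The "/" separator is an assumption; adjust `sep` to match the guidelines.
    """
    return sep.join(code.split(sep)[:depth])


def label_counts(codes, depth: int) -> Counter:
    """Count error codes after collapsing them to the given depth."""
    return Counter(truncate_code(code, depth) for code in codes)


# Hypothetical codes, used only to illustrate the roll-up.
gold = ["A/X/one", "A/X/two", "A/Y/one", "B/Z/one"]

coarse = label_counts(gold, depth=1)  # top-level categories only
fine = label_counts(gold, depth=3)    # full codes
```

Comparing a tool's output against `coarse` rewards it for finding the right category of problem, while comparing against `fine` also requires the exact subtype.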
The dataset is available in the JSON format generated by the CJVT Svala 1.1 annotation tool (https://orodja.cjvt.si/svala/). The source and target texts are also available in the CoNLL-U format (https://universaldependencies.org/format.html). In addition, the texts were linguistically annotated with the CLASSLA pipeline (https://github.com/clarinsi/classla/) at several levels: tokenization, sentence segmentation, lemmatization, MULTEXT-East v6 MSD tags, JOS-SYN dependency syntax, Universal Dependencies, and named entities (more about specific annotation layers: https://wiki.cjvt.si/shelves/linguistic-annotation-of-slovene-corpora). For better accessibility and wider usability, we provide versions with JOS-SYN as well as Universal Dependencies, and with English as well as Slovene tags.
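For readers who want to inspect the CoNLL-U files without extra dependencies, a small reader along the following lines may be enough. This is a sketch that relies only on the general CoNLL-U layout (comment lines starting with "#", ten tab-separated columns per token, a blank line between sentences). The sample sentence is made up for illustration and is not taken from the dataset, and its XPOS and FEATS columns are left empty on purpose.

```python
# Column names defined by the CoNLL-U format, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def read_conllu(text: str):
    """Parse CoNLL-U text into a list of sentences (lists of token dicts)."""
    sentences, tokens = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                sentences.append(tokens)
                tokens = []
        elif line.startswith("#"):
            # Sentence-level comments (sent_id, text, ...) are skipped here.
            continue
        else:
            tokens.append(dict(zip(FIELDS, line.split("\t"))))
    if tokens:
        sentences.append(tokens)
    return sentences


# Made-up sample sentence, for illustration only.
sample = "\n".join([
    "# text = To je test.",
    "\t".join(["1", "To", "ta", "PRON", "_", "_", "3", "nsubj", "_", "_"]),
    "\t".join(["2", "je", "biti", "AUX", "_", "_", "3", "cop", "_", "_"]),
    "\t".join(["3", "test", "test", "NOUN", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["4", ".", ".", "PUNCT", "_", "_", "3", "punct", "_", "_"]),
    "",
])

sentences = read_conllu(sample)
forms = [token["form"] for token in sentences[0]]
```

All values are kept as strings, so `head` and `id` need an explicit conversion before any arithmetic. For larger-scale processing, an established CoNLL-U library is likely a better choice than this reader.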