MAKS (MlAdinski KorpuS, i.e. the Youth Corpus) includes texts from literature, newspapers, and, to a lesser extent, the web. The corpus was designed for the needs of the e-learning environment "Slovenščina na dlani", where it served as a source of grammar and spelling exercises. The texts were therefore selected to be as style-neutral as possible, proofread, and thematically interesting for the learner population. Some texts originate from the Slovenian reference corpus Gigafida, while many others (primarily literary texts) were newly gathered.
The corpus as a whole is available in the CLARIN.SI concordancers, while the openly available ccMAKS dataset includes 10% of the texts, sampled in accordance with the authorship agreements. In the project "Empirical foundations for digitally-supported development of writing skills", the corpus was linguistically annotated with the CLASSLA v1.1.1 pipeline (https://github.com/clarinsi/classla/) at the levels of tokenization, sentence segmentation, lemmatization, MULTEXT-East v6 MSD tags (https://nl.ijs.si/ME/V6/msd/html/msd-sl.html), JOS dependency syntax (https://nl.ijs.si/jos/bib/jos-skladnja-navodila.pdf), and named entities (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). The aim is to provide comparably annotated, pedagogically relevant corpora that can be used for different tasks in language didactics and NLP.
The corpus is available in CoNLL-U and vertical formats. The CoNLL-U format contains one document per file, with the text metadata provided separately as a TSV file, while the vertical format contains all documents concatenated in one large file. The registry file ccmaks.regi for the vertical format is compatible with the LIST 1.2 corpus extraction tool (http://hdl.handle.net/11356/1276), and the ccmaks.noske.regi file is needed for SketchEngine-type concordancers.
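
As a rough illustration of working with the CoNLL-U files, the following Python sketch reads sentences and counts lemmas using only the standard library. It assumes the standard ten-column CoNLL-U layout (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC); the function names `read_conllu` and `lemma_counts` are hypothetical helpers, and the sample sentence with its annotation values is made up for demonstration rather than taken from the corpus.

```python
from collections import Counter

# Standard CoNLL-U column names, in order.
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def read_conllu(lines):
    """Yield sentences (lists of token dicts) from an iterable of CoNLL-U lines."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment lines carry sentence-level metadata; ignored here.
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            # Skip lines that do not have exactly ten columns.
            continue
        sentence.append(dict(zip(CONLLU_FIELDS, cols)))
    if sentence:
        yield sentence


def lemma_counts(sentences):
    """Count lemmas over regular tokens only."""
    counts = Counter()
    for sent in sentences:
        for tok in sent:
            # Integer IDs are regular tokens; ranges such as "1-2" and
            # empty nodes such as "1.1" are skipped.
            if tok["id"].isdigit():
                counts[tok["lemma"]] += 1
    return counts


# Illustrative sample only (assumed annotation values, not corpus data).
SAMPLE = (
    "# sent_id = demo.1\n"
    "# text = Berem knjigo.\n"
    "1\tBerem\tbrati\tVERB\tVmpr1s\t_\t0\troot\t_\t_\n"
    "2\tknjigo\tknjiga\tNOUN\tNcfsa\t_\t1\tobj\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\tZ\t_\t1\tpunct\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    sentences = list(read_conllu(SAMPLE.splitlines()))
    for lemma, n in lemma_counts(sentences).most_common():
        print(lemma, n)
```

To process an actual corpus file, an open file handle (opened with `encoding="utf-8"`) can be passed to `read_conllu` in place of `SAMPLE.splitlines()`, since the function accepts any iterable of lines.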