Frequency list of textbook vocabulary by level of education in elementary and secondary schools


The dataset contains a list of 11,906 words (lemmas with part-of-speech information) and their frequency of occurrence in a corpus of Slovenian textbooks, covering elementary school (Grades 1 to 9) and secondary school (Years 1 to 4). The corpus contains 4,302,857 words (5,373,268 tokens) and consists of 127 textbooks from 16 different subjects. The distribution per school level is as follows:
- Grade 1: 17,949 tokens
- Grade 2: 46,317 tokens
- Grade 3: 84,222 tokens
- Grade 4: 305,454 tokens
- Grade 5: 357,400 tokens
- Grade 6: 351,463 tokens
- Grade 7: 537,359 tokens
- Grade 8: 592,068 tokens
- Grade 9: 765,574 tokens
- Year 1: 665,093 tokens
- Year 2: 200,267 tokens
- Year 3: 149,442 tokens
- Year 4: 23,406 tokens
- Year 1-4: 206,843 tokens (textbooks that are used in all years of secondary school and were not divided by year)
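
Because the sub-corpora differ greatly in size, raw frequencies are hard to compare across levels without knowing each level's share of the corpus. The following is a minimal Python sketch, not part of the dataset, that sums the per-level counts listed above and prints each level's share. The names `counts` and `shares` are illustrative assumptions. The counts add up to 4,302,857, which matches the word count given above rather than the token count.

```python
# Per-level counts copied from the dataset description above.
counts = {
    "Grade 1": 17949,
    "Grade 2": 46317,
    "Grade 3": 84222,
    "Grade 4": 305454,
    "Grade 5": 357400,
    "Grade 6": 351463,
    "Grade 7": 537359,
    "Grade 8": 592068,
    "Grade 9": 765574,
    "Year 1": 665093,
    "Year 2": 200267,
    "Year 3": 149442,
    "Year 4": 23406,
    "Year 1-4": 206843,  # textbooks not divided by year
}

# Total size of all sub-corpora combined.
total = sum(counts.values())

# Share of the whole corpus contributed by each level.
shares = {level: n / total for level, n in counts.items()}

for level, n in counts.items():
    print(f"{level:<9}{n:>9,}{shares[level]:>8.1%}")
print(f"{'Total':<9}{total:>9,}")
```

The same shares could serve as denominators when normalising a lemma's raw frequency per level, assuming the frequency list reports counts separately for each level.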

The purpose of the dataset is to facilitate research into vocabulary use at different levels of education, and to enable comparative studies of student language reception and production in Slovene.

Creator Kosem, Iztok; Pori, Eva; Arhar Holdt, Špela
Publisher Centre for Language Resources and Technologies, University of Ljubljana
Publication Year 2023
Rights Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0); PUB
OpenAccess true
Contact info(at)
Language Slovenian; Slovene
Resource Type lexicalConceptualResource
Format text/plain; charset=utf-8; downloadable_files_count: 2
Discipline Linguistics