The dataset contains a list of 11,906 words (lemmas with part-of-speech information) and their frequency of occurrence in a corpus of Slovenian textbooks, covering elementary school (Grades 1 to 9) and secondary school (Years 1 to 4). The corpus contains 4,302,857 words (5,373,268 tokens) and consists of 127 textbooks from 16 different subjects. The distribution per school level is as follows:
- Grade 1: 17949 tokens
- Grade 2: 46317 tokens
- Grade 3: 84222 tokens
- Grade 4: 305454 tokens
- Grade 5: 357400 tokens
- Grade 6: 351463 tokens
- Grade 7: 537359 tokens
- Grade 8: 592068 tokens
- Grade 9: 765574 tokens
- Year 1: 665093 tokens
- Year 2: 200267 tokens
- Year 3: 149442 tokens
- Year 4: 23406 tokens
- Year 1-4: 206843 tokens (textbooks that are used throughout secondary school and were not assigned to a specific year)
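
As a quick sanity check on the distribution above, the following minimal Python sketch sums the per-level counts and prints each level's share of the total. The counts are copied from the list; the variable and function names are illustrative assumptions and are not part of the dataset. The per-level figures add up to 4,302,857, which matches the word count given for the corpus.

```python
# Counts per school level, copied from the list above.
# All names below are illustrative; they do not come from the dataset itself.
counts_per_level = {
    "Grade 1": 17949,
    "Grade 2": 46317,
    "Grade 3": 84222,
    "Grade 4": 305454,
    "Grade 5": 357400,
    "Grade 6": 351463,
    "Grade 7": 537359,
    "Grade 8": 592068,
    "Grade 9": 765574,
    "Year 1": 665093,
    "Year 2": 200267,
    "Year 3": 149442,
    "Year 4": 23406,
    "Year 1-4": 206843,  # textbooks not assigned to a specific year
}


def subtotal(prefix: str) -> int:
    """Sum the counts of all levels whose label starts with `prefix`."""
    return sum(n for level, n in counts_per_level.items() if level.startswith(prefix))


total = sum(counts_per_level.values())
elementary = subtotal("Grade")  # Grades 1 to 9
secondary = subtotal("Year")    # Years 1 to 4, plus the undivided "Year 1-4" group

# Print each level with its count and its share of the whole corpus.
for level, n in counts_per_level.items():
    print(f"{level:<9} {n:>9,} {n / total:6.1%}")

print(f"{'Total':<9} {total:>9,}")
print(f"Elementary school: {elementary:,} ({elementary / total:.1%})")
print(f"Secondary school:  {secondary:,} ({secondary / total:.1%})")
```

Because the levels differ greatly in size, raw frequencies from different levels are not directly comparable; dividing a frequency by the size of its level (for example, expressing it per million) is one common way to normalise before comparing.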
The purpose of the dataset is to facilitate research into vocabulary use at different levels of education, and to enable comparative studies of student language reception and production in Slovene.