Frequency list of textbook vocabulary by level of education in elementary and secondary schools

Dataset

The dataset contains a list of 11,906 words (lemmas with part-of-speech information) and their frequency of occurrence in a corpus of Slovenian textbooks, covering elementary school (Grades 1 to 9) and secondary school (Years 1 to 4). The corpus contains 4,302,857 words (5,373,268 tokens) and consists of 127 textbooks from 16 different subjects. The distribution per school level is as follows:
- Grade 1: 17,949 tokens
- Grade 2: 46,317 tokens
- Grade 3: 84,222 tokens
- Grade 4: 305,454 tokens
- Grade 5: 357,400 tokens
- Grade 6: 351,463 tokens
- Grade 7: 537,359 tokens
- Grade 8: 592,068 tokens
- Grade 9: 765,574 tokens
- Year 1: 665,093 tokens
- Year 2: 200,267 tokens
- Year 3: 149,442 tokens
- Year 4: 23,406 tokens
- Year 1-4: 206,843 tokens (textbooks that are used in all years of secondary school and were not divided by year)
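The per-level figures above can be checked and turned into relative shares with a few lines of Python. This is a minimal sketch, not part of the dataset: the names `LEVEL_COUNTS` and `level_shares` are illustrative, and the numbers are copied from the description. Note that the per-level counts sum to 4,302,857, which equals the figure the description gives for words rather than the one given for tokens.

```python
# Per-level counts copied from the dataset description.
# The dictionary and function names are illustrative, not part of the dataset.
LEVEL_COUNTS = {
    "Grade 1": 17949,
    "Grade 2": 46317,
    "Grade 3": 84222,
    "Grade 4": 305454,
    "Grade 5": 357400,
    "Grade 6": 351463,
    "Grade 7": 537359,
    "Grade 8": 592068,
    "Grade 9": 765574,
    "Year 1": 665093,
    "Year 2": 200267,
    "Year 3": 149442,
    "Year 4": 23406,
    "Year 1-4": 206843,  # textbooks used across all four secondary years
}


def level_shares(counts):
    """Return each level's fraction of the summed counts."""
    total = sum(counts.values())
    return {level: n / total for level, n in counts.items()}


total = sum(LEVEL_COUNTS.values())
shares = level_shares(LEVEL_COUNTS)
largest = max(LEVEL_COUNTS, key=LEVEL_COUNTS.get)

print(f"sum of per-level counts: {total:,}")
for level, share in shares.items():
    print(f"{level:>8}: {share:6.2%}")
print(f"largest level: {largest}")
```

Because the levels differ so much in size, raw frequencies from different levels are easier to compare after normalizing them by the size of the level they come from.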

The purpose of the dataset is to facilitate research into vocabulary use at different levels of education and to enable comparative studies of student language reception and production in Slovene.

Identifier
PID	http://hdl.handle.net/11356/1719
Related Identifier	https://www.cjvt.si/prop/en/
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1719
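The Metadata Access link above is an OAI-PMH `GetRecord` request that returns the record as Dublin Core (`oai_dc`). The sketch below shows how such a URL can be assembled and how Dublin Core fields can be read from a response. The helper names `get_record_url` and `dc_values` are hypothetical, and `SAMPLE` is a hand-written fragment built from the fields on this page, not the server's actual response.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_ENDPOINT = "http://www.clarin.si/repository/oai/request"
DC_NS = "{http://purl.org/dc/elements/1.1/}"  # Dublin Core element namespace


def get_record_url(handle, prefix="oai_dc"):
    """Build an OAI-PMH GetRecord URL for a repository handle such as '11356/1719'."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": prefix,
        "identifier": f"oai:www.clarin.si:{handle}",
    }
    # Keep ':' and '/' unescaped so the URL matches the form shown on this page.
    return OAI_ENDPOINT + "?" + urlencode(params, safe=":/")


def dc_values(xml_text, field):
    """Return the text of every Dublin Core element named `field`."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter(DC_NS + field)]


# Hand-written sample for illustration only (assumed shape of an oai_dc record).
SAMPLE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord><record><metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:creator>Kosem, Iztok</dc:creator>
      <dc:creator>Pori, Eva</dc:creator>
      <dc:creator>Arhar Holdt, Špela</dc:creator>
      <dc:identifier>http://hdl.handle.net/11356/1719</dc:identifier>
    </oai_dc:dc>
  </metadata></record></GetRecord>
</OAI-PMH>
"""

url = get_record_url("11356/1719")
creators = dc_values(SAMPLE, "creator")
print(url)
print(creators)
```

To work with the live record, the text returned by `urllib.request.urlopen(url).read().decode("utf-8")` could be passed to `dc_values` in place of `SAMPLE`.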

Provenance
Creator	Kosem, Iztok; Pori, Eva; Arhar Holdt, Špela
Publisher	Centre for Language Resources and Technologies, University of Ljubljana
Publication Year	2023
Rights	Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0); https://creativecommons.org/licenses/by-nc-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	lexicalConceptualResource
Format	text/plain; charset=utf-8; text/plain; downloadable_files_count: 2
Discipline	Linguistics