Frequency list of words by source from the Trendi corpus 2022-07


The frequency list of words by source was prepared as follows: words (i.e. lemmas with their lexical features) were extracted from the 15 most frequent sources in the Trendi Monitor Corpus of Slovene (http://hdl.handle.net/11356/1590), covering the period from 1 January 2019 to 31 July 2022. The extracted sources are the following:

  • STA (sta.si)
  • RTV (rtvslo.si)
  • Delo (delo.si)
  • Siol (siol.net)
  • Vestnik (vestnik.si)
  • Večer (vecer.com)
  • Svet24 – Novice (novice.svet24.si)
  • 24ur (24ur.com)
  • Dnevnik (dnevnik.si)
  • Žurnal24 (zurnal24.si)
  • Demokracija (demokracija.si)
  • Nova24TV (nova24tv.si)
  • Slovenske novice (slovenskenovice.si)
  • Gorenjski glas (gorenjskiglas.si)
  • Svet24 – Ekipa (ekipa.svet24.si)

The frequency lists obtained from Trendi were then compared to the frequency list of words from Gigafida 2.0 (http://hdl.handle.net/11356/1320), which covers the period 1991–2018. The final frequency list contains lemmas, their lexical features, and, for each source (including Gigafida 2.0), their absolute and relative frequencies from the first period (1991–2018) and the second period (2019 to 2022-07). It also contains the simple maths value, which indicates whether the word is more frequent in the second period (simple maths > 1.00) or in the first period (simple maths < 1.00).
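As an illustration of how a value above or below 1.00 arises, the following is a minimal sketch of the simple maths ratio as it is commonly defined: the smoothed relative frequency in the newer corpus divided by the smoothed relative frequency in the older one. The per-million normalisation, the smoothing constant of 1, the function names, and the example counts are all assumptions made for this sketch; the record does not state which parameters were used to compute the published values.

```python
def per_million(absolute_freq: int, corpus_size: int) -> float:
    """Relative frequency per million tokens (assumed normalisation)."""
    return absolute_freq * 1_000_000 / corpus_size


def simple_maths(focus_pm: float, reference_pm: float, smoothing: float = 1.0) -> float:
    """Ratio of smoothed relative frequencies.

    focus_pm     -- relative frequency in the second period (2019 to 2022-07)
    reference_pm -- relative frequency in the first period (1991-2018)
    smoothing    -- constant added to both values (assumed to be 1 here);
                    it also avoids division by zero for unseen words
    """
    return (focus_pm + smoothing) / (reference_pm + smoothing)


# Hypothetical counts, not taken from the dataset.
newer = per_million(500, 10_000_000)     # 50.0 per million
older = per_million(1_000, 100_000_000)  # 10.0 per million
print(f"{simple_maths(newer, older):.2f}")  # 4.64, i.e. more frequent in the newer period
```

A word with identical relative frequencies in both periods gets exactly 1.00, and a word absent from the newer period gets a value below 1.00.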

Because the entire frequency list is quite large, a shorter version with the first 150,000 entries is also provided for easier use in data processing software (such as MS Excel). The lists are sorted by their total absolute frequencies. Words with a total frequency of 1 (hapax legomena, counted by adding the absolute frequencies from both compared corpora) were removed.
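The filtering and truncation described above can be sketched as follows. This is only an illustration: the `Entry` type, its field names, and the `shorten` function are hypothetical and do not reflect the actual file layout, and the descending sort order is an assumption (the record says only that the lists are sorted by total absolute frequency).

```python
from typing import Iterable, List, NamedTuple


class Entry(NamedTuple):
    """Hypothetical row: a lemma with its absolute frequency in each corpus."""
    lemma: str
    freq_gigafida: int  # absolute frequency, 1991-2018
    freq_trendi: int    # absolute frequency, 2019 to 2022-07


def total(entry: Entry) -> int:
    """Total absolute frequency across both compared corpora."""
    return entry.freq_gigafida + entry.freq_trendi


def shorten(entries: Iterable[Entry], limit: int = 150_000) -> List[Entry]:
    """Drop hapax legomena, sort by total frequency (descending), keep the top `limit`."""
    kept = [e for e in entries if total(e) > 1]
    kept.sort(key=total, reverse=True)
    return kept[:limit]


# Hypothetical entries, not taken from the dataset.
sample = [
    Entry("a", 0, 1),    # total 1: removed as a hapax legomenon
    Entry("b", 5, 5),    # total 10
    Entry("c", 1, 1),    # total 2
    Entry("d", 100, 0),  # total 100
]
print([e.lemma for e in shorten(sample, limit=2)])  # ['d', 'b']
```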

Identifier
PID http://hdl.handle.net/11356/1702
Related Identifier https://sled.ijs.si/
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1702
Provenance
Creator Čibej, Jaka; Kosem, Iztok
Publisher Jožef Stefan Institute
Publication Year 2022
Rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene
Resource Type lexicalConceptualResource
Format text/plain; charset=utf-8; application/zip; downloadable_files_count: 1
Discipline Linguistics