The valency lexicon was extracted from the Gigafida 2.1 Corpus of Written Standard Slovene (https://www.clarin.si/noske/run.cgi/corp_info?corpname=gfida21) using specialized scripts for extracting data from corpora containing syntactic and semantic role annotations. The lexicon contains valency patterns for 14,595 Slovene verbs based on the JOS syntactic dependency system (http://nl.ijs.si/jos/bib/jos-skladnja-navodila.pdf) and the semantic role labelling system for Slovene with 25 semantic role labels.
The lexicon consists of one XML file per verb. Each file contains the verb's lemma, its aspect, and its frequency in the Gigafida 2.1 corpus, followed by a list of all the semantic role labels that occur in any of the verb's valency patterns, along with two measures for each semantic role label (listed in ):
(1) "valency_pattern_ratio", which indicates the percentage of the verb's valency patterns where the semantic role label is present; and
(2) "valency_sentence_ratio", which indicates the percentage of the corpus sentences containing the verb in which the semantic role label also occurs.
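As an illustration of how the two measures relate to each other, the following minimal Python sketch computes both for a toy verb. It is not the extraction script used to build the lexicon. The `Pattern` class, the `role_ratios` function, the role names `ROLE_A` to `ROLE_C`, and the sentence counts are all hypothetical. The sketch also assumes that every corpus sentence containing the verb is counted under exactly one valency pattern, so that the total number of sentences equals the sum of the per-pattern counts, and that the measures are expressed on a 0-100 scale; the lexicon itself may differ on both points.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    """Hypothetical stand-in for one valency pattern of a verb."""
    roles: frozenset[str]  # semantic role labels occurring in the pattern
    sentences: int         # corpus sentences in which the verb follows the pattern


def role_ratios(patterns: list[Pattern]) -> dict[str, tuple[float, float]]:
    """Return {role: (valency_pattern_ratio, valency_sentence_ratio)} in percent.

    Assumption: each sentence with the verb belongs to exactly one pattern,
    so the denominator of the sentence ratio is the sum of pattern counts.
    """
    total_patterns = len(patterns)
    total_sentences = sum(p.sentences for p in patterns)
    all_roles = set().union(*(p.roles for p in patterns))

    ratios = {}
    for role in sorted(all_roles):
        with_role = [p for p in patterns if role in p.roles]
        pattern_ratio = 100 * len(with_role) / total_patterns
        sentence_ratio = 100 * sum(p.sentences for p in with_role) / total_sentences
        ratios[role] = (pattern_ratio, sentence_ratio)
    return ratios


# Toy data: four patterns covering 100 sentences in total (made-up numbers).
patterns = [
    Pattern(frozenset({"ROLE_A"}), 50),
    Pattern(frozenset({"ROLE_A", "ROLE_B"}), 30),
    Pattern(frozenset({"ROLE_A", "ROLE_B", "ROLE_C"}), 15),
    Pattern(frozenset({"ROLE_B"}), 5),
]

for role, (pattern_ratio, sentence_ratio) in role_ratios(patterns).items():
    print(f"{role}: pattern ratio {pattern_ratio:.1f}%, sentence ratio {sentence_ratio:.1f}%")
```

In this toy data, ROLE_A and ROLE_B each appear in three of the four patterns, so they share the same pattern ratio, but ROLE_A occurs in far more sentences. The sentence ratio therefore separates roles that are typical of the verb in actual usage from roles that appear in many patterns that are themselves rare.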
The valency patterns (listed in ) contain the following:
- the valency pattern's ID-number ();
- the number of corpus sentences in which the verb follows the valency pattern ();
- the semantic roles occurring in the pattern ();
- the syntactic structures occurring in the pattern (; if the syntactic structures contain prepositions, these are also included as additional information);
- a human-readable representation of the valency pattern in Slovene (, e.g. KDO/KAJ abdicira);
- the corpus examples in which the verb occurs with the valency pattern ().
In the examples, the components of the valency patterns are annotated with their semantic role labels and syntactic structures. Each valency pattern contains at least one example from the Gigafida 2.1 corpus and all the relevant examples from the ssj500k 2.2 corpus (http://hdl.handle.net/11356/1210).