The spoken corpus Gos 2.0 is the reference speech corpus of the Slovenian language. This second edition contains about 300 hours of speech, or 2.4 million words, 127 thousand utterances and 1,500 texts.
Gos 2.0 is composed from three different sources:
(1) Spoken corpus Gos 1.1 (http://hdl.handle.net/11356/1438), 112 hours, 1 million words
(2) Spoken corpus Gos VideoLectures 4.2 (http://hdl.handle.net/11356/1444), 22 hours, 179,000 words
(3) A selection from the ASR database ARTUR 1.0 (http://hdl.handle.net/11356/1772), 185 hours, 1.2 mllion words, including:
(3a) Artur-J-Splosni, 62 hours, 422,000 words: transcriptions of media recordings, online recordings of conferences, workshops, education videos, etc.
(3b) Artur-N-Prosti, 61 hours, 324,000 words: transcriptions of monologues and dialogues between two persons, recorded for the purposes of the Artur database. Speakers were asked to freely conversate or freely explain on casual topics.
(3c) Artur-P-SejeDZ, 62 hours, 450,000 words: a selection of transcriptions of speech from the Slovene National Assembly. The maximum length of single speaker speech is 4,000 words.
Note that various encoding changes have been made to the original Gos and Gos VideloLectures corpora so that the encoding of Gos 2.0 is uniform across the three sources.
All transcriptions are manual and made in two modes:
- pronunciation-based or citation-phonemic transcriptions (containing the output phoneme string derived from the orthographic form by letter-to-sound rules)
- standardised or expanded orthographic transcriptions (the standard Slovene spelling is used to indicate the spoken words, but there are additional rules and word-lists for non-standard lexis).
Part-of-speech tagging with MULTEXT-East morphosyntactic descriptions and lemmatisation was performed automatically with CLASSLA (https://github.com/clarinsi/classla).
The corpus is distributed in TEI (XML) format and in vertical file format, the latter used by the CQP familiy of concordancers, such as (no)Sketch Engine.