Spoken corpus Gos 2.0 (transcriptions)


The spoken corpus Gos 2.0 is the reference speech corpus of the Slovenian language. This second edition contains about 300 hours of speech, or 2.4 million words, 127 thousand utterances and 1,500 texts.

Gos 2.0 is composed of three sources:
(1) Spoken corpus Gos 1.1 (http://hdl.handle.net/11356/1438), 112 hours, 1 million words
(2) Spoken corpus Gos VideoLectures 4.2 (http://hdl.handle.net/11356/1444), 22 hours, 179,000 words
(3) A selection from the ASR database ARTUR 1.0 (http://hdl.handle.net/11356/1772), 185 hours, 1.2 million words, including:
(3a) Artur-J-Splosni, 62 hours, 422,000 words: transcriptions of media recordings, online recordings of conferences, workshops, educational videos, etc.
(3b) Artur-N-Prosti, 61 hours, 324,000 words: transcriptions of monologues and of dialogues between two persons, recorded for the purposes of the Artur database. Speakers were asked to converse freely or to talk freely about casual topics.
(3c) Artur-P-SejeDZ, 62 hours, 450,000 words: a selection of transcriptions of speech from the Slovene National Assembly. The maximum length of a single speaker's speech is 4,000 words.
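
As a quick consistency check, the per-source figures listed above can be summed and compared with the headline totals. The sketch below uses only the (rounded) numbers given in this description; the variable and function names are illustrative.

```python
# Figures copied from the source list above as (hours, words).
# The word counts are the rounded values quoted in the description.
sources = {
    "Gos 1.1": (112, 1_000_000),
    "Gos VideoLectures 4.2": (22, 179_000),
    "ARTUR 1.0 selection": (185, 1_200_000),
}
artur_parts = {
    "Artur-J-Splosni": (62, 422_000),
    "Artur-N-Prosti": (61, 324_000),
    "Artur-P-SejeDZ": (62, 450_000),
}


def totals(parts):
    """Return (total hours, total words) over a dict of (hours, words) pairs."""
    hours = sum(h for h, _ in parts.values())
    words = sum(w for _, w in parts.values())
    return hours, words


total_hours, total_words = totals(sources)
artur_hours, artur_words = totals(artur_parts)

print(total_hours, total_words)  # 319 2379000
print(artur_hours, artur_words)  # 185 1196000
```

The sums come to 319 hours and roughly 2.38 million words, in line with the rounded "about 300 hours" and "2.4 million words" quoted for the whole corpus, and the three ARTUR subsets add up to 185 hours and roughly 1.2 million words.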

Note that various encoding changes have been made to the original Gos and Gos VideoLectures corpora so that the encoding of Gos 2.0 is uniform across the three sources.

All transcriptions are manual and made in two modes:
- pronunciation-based or citation-phonemic transcriptions (containing the output phoneme string derived from the orthographic form by letter-to-sound rules)
- standardised or expanded orthographic transcriptions (the standard Slovene spelling is used to indicate the spoken words, but there are additional rules and word-lists for non-standard lexis).

Part-of-speech tagging with MULTEXT-East morphosyntactic descriptions and lemmatisation were performed automatically with CLASSLA (https://github.com/clarinsi/classla).

The corpus is distributed in TEI (XML) format and in vertical file format, the latter used by the CQP family of concordancers, such as (no)Sketch Engine.
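
For readers unfamiliar with the vertical format: it generally holds one token per line with tab-separated attributes, while structures such as texts or utterances are marked by XML-like tags on lines of their own. The following minimal sketch reads such a file into utterances. The column order (word, lemma, tag), the structure name `u`, and the sample lines are assumptions made for illustration; the actual Gos 2.0 files may use different attributes and structure names, so check the distributed files before relying on this.

```python
from typing import Iterable, Iterator, List, Tuple

Token = Tuple[str, ...]


def read_utterances(lines: Iterable[str], struct: str = "u") -> Iterator[List[Token]]:
    """Yield one list of token tuples per <struct>...</struct> block.

    `struct` is a hypothetical structure name; adjust it to the real files.
    """
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        # Lines that look like <tag ...>, </tag> or <tag/> are structural markup.
        if line.startswith("<") and line.endswith(">"):
            inner = line.strip("</>")
            name = inner.split()[0] if inner else ""
            if line.startswith("</"):
                if name == struct and current is not None:
                    yield current
                    current = None
            elif name == struct:
                current = []
            continue
        # Any other line is a token: attributes separated by tabs.
        if current is not None:
            current.append(tuple(line.split("\t")))


# Made-up sample in the assumed layout (word, lemma, MSD tag).
sample = [
    '<text id="demo">',
    '<u who="A">',
    "dober\tdober\tAgpmsnn",
    "dan\tdan\tNcmsn",
    "</u>",
    "</text>",
]
utterances = list(read_utterances(sample))
print(utterances)
```

Reading line by line keeps memory use flat, which matters for a corpus of a few million tokens; for a real file, pass an open file handle instead of the sample list.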

PID http://hdl.handle.net/11356/1771
Related Identifier http://hdl.handle.net/11356/1438
Related Identifier http://eng.slovenscina.eu/korpusi/gos
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1771
Creator Zwitter Vitez, Ana; Zemljarič Miklavčič, Jana; Krek, Simon; Stabej, Marko; Erjavec, Tomaž; Verdonik, Darinka; Potočnik, Tomaž; Sepesy Maučec, Mirjam; Majhenič, Simona; Žgank, Andrej; Bizjak, Andreja; Gril, Lucija; Dobrišek, Simon; Križaj, Janez; Bajec, Marko; Lebar Bajec, Iztok; Jelovšek, Tjaša; Trojar, Mitja; Bernjak, Mitja; Dretnik, Naum; Strle, Gregor; Dobrovoljc, Kaja
Publisher Centre for Language Resources and Technologies, University of Ljubljana; Faculty of Electrical Engineering and Computer Science, University of Maribor; Faculty of Electrical Engineering, University of Ljubljana; Faculty of Computer and Information Science, University of Ljubljana; ZRC SAZU; Jožef Stefan Institute
Publication Year 2023
Rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Language Slovenian; Slovene
Resource Type corpus
Format text/plain; charset=utf-8; application/zip; downloadable_files_count: 2
Discipline Linguistics
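
The Metadata Access URL above is an OAI-PMH `GetRecord` request. As a minimal sketch, it can be assembled from its parts with the Python standard library; the base URL, metadata prefix and identifier are taken from that field, while `fetch_record` is an illustrative helper that is defined here but not called.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "http://www.clarin.si/repository/oai/request"
params = {
    "verb": "GetRecord",
    "metadataPrefix": "oai_dc",
    "identifier": "oai:www.clarin.si:11356/1771",
}

# Keep ':' and '/' unescaped so the result matches the URL listed above.
url = BASE + "?" + urlencode(params, safe=":/")
print(url)


def fetch_record(request_url: str, timeout: float = 30.0) -> bytes:
    """Illustrative helper: download the raw XML response (needs network access)."""
    with urlopen(request_url, timeout=timeout) as resp:
        return resp.read()
```

Calling `fetch_record(url)` would return the record as XML bytes, which could then be parsed with any XML library.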