Spoken corpus Gos 2.0 (transcriptions)

Dataset


The spoken corpus Gos 2.0 is the reference speech corpus of the Slovenian language. This second edition contains about 300 hours of speech, or 2.4 million words, 127 thousand utterances and 1,500 texts.

Gos 2.0 is compiled from three sources:
(1) Spoken corpus Gos 1.1 (http://hdl.handle.net/11356/1438), 112 hours, 1 million words
(2) Spoken corpus Gos VideoLectures 4.2 (http://hdl.handle.net/11356/1444), 22 hours, 179,000 words
(3) A selection from the ASR database ARTUR 1.0 (http://hdl.handle.net/11356/1772), 185 hours, 1.2 million words, including:
    (3a) Artur-J-Splosni, 62 hours, 422,000 words: transcriptions of media recordings, online recordings of conferences, workshops, educational videos, etc.
    (3b) Artur-N-Prosti, 61 hours, 324,000 words: transcriptions of monologues and dialogues between two persons, recorded for the purposes of the Artur database. Speakers were asked to converse freely or to talk freely about casual topics.
    (3c) Artur-P-SejeDZ, 62 hours, 450,000 words: a selection of transcriptions of speech from the Slovene National Assembly. The maximum length of a single speaker's speech is 4,000 words.
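As a quick consistency check, the component sizes listed above can be summed and compared with the headline figures. This is only a sketch: the numbers are the rounded values from this description, not counts taken from the corpus files, and the variable names are illustrative.

```python
# Sizes as stated in the description (hours, words); values are rounded in the source.
components = {
    "Gos 1.1": (112, 1_000_000),
    "Gos VideoLectures 4.2": (22, 179_000),
    "Artur-J-Splosni": (62, 422_000),
    "Artur-N-Prosti": (61, 324_000),
    "Artur-P-SejeDZ": (62, 450_000),
}

total_hours = sum(hours for hours, _ in components.values())
total_words = sum(words for _, words in components.values())

# Subtotals for the three ARTUR 1.0 subsets only.
artur = {name: size for name, size in components.items() if name.startswith("Artur")}
artur_hours = sum(hours for hours, _ in artur.values())
artur_words = sum(words for _, words in artur.values())

print(total_hours, total_words)
print(artur_hours, artur_words)
```

The ARTUR subsets add up to the stated 185 hours and roughly 1.2 million words, and the overall sums are in line with the rounded totals of about 300 hours and 2.4 million words.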

Note that various encoding changes have been made to the original Gos and Gos VideoLectures corpora so that the encoding of Gos 2.0 is uniform across the three sources.

All transcriptions are manual and made in two modes:
- pronunciation-based or citation-phonemic transcriptions (containing the output phoneme string derived from the orthographic form by letter-to-sound rules)
- standardised or expanded orthographic transcriptions (the standard Slovene spelling is used to indicate the spoken words, but there are additional rules and word-lists for non-standard lexis).

Part-of-speech tagging with MULTEXT-East morphosyntactic descriptions and lemmatisation was performed automatically with CLASSLA (https://github.com/clarinsi/classla).

The corpus is distributed in TEI (XML) format and in vertical file format, the latter used by the CQP family of concordancers, such as (no)Sketch Engine.
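A vertical file generally holds one token per line, with tab-separated attributes, and structural markup on lines of its own. The sketch below reads such a file under stated assumptions: the column order (word, lemma, tag), the `<u>` structure and the sample lines are illustrative and are not taken from the Gos 2.0 distribution, so the actual attribute list should be checked against the corpus documentation.

```python
from typing import Dict, Iterable, Iterator, Tuple, Union

Event = Tuple[str, Union[str, Dict[str, str]]]


def parse_vertical(
    lines: Iterable[str],
    attributes: Tuple[str, ...] = ("word", "lemma", "tag"),  # assumed column order
) -> Iterator[Event]:
    """Yield ("struct", tag_line) or ("token", {attribute: value}) events."""
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        # Simplification: any line of the form <...> is treated as structural markup.
        if line.startswith("<") and line.endswith(">"):
            yield ("struct", line)
        else:
            yield ("token", dict(zip(attributes, line.split("\t"))))


# Made-up sample in the general vertical layout (not real corpus content).
sample = [
    "<u>\n",
    "dober\tdober\tAgpmsnn\n",
    "dan\tdan\tNcmsn\n",
    "</u>\n",
]

events = list(parse_vertical(sample))
tokens = [data["word"] for kind, data in events if kind == "token"]
print(tokens)
```

For a real file, the same function can be fed an open file handle, for example `parse_vertical(open(path, encoding="utf-8"))`, where `path` is a placeholder for the extracted vertical file.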

Identifier
PID	http://hdl.handle.net/11356/1771
Related Identifier	http://hdl.handle.net/11356/1438
Related Identifier	http://eng.slovenscina.eu/korpusi/gos
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1771
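The Metadata Access URL above is an OAI-PMH GetRecord request. A minimal sketch of how such a URL can be assembled from the handle follows; the helper name `oai_get_record_url` is hypothetical, and the rule that the OAI identifier is `oai:www.clarin.si:` plus the handle is inferred from this one record rather than from repository documentation. No network request is made.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Base URL taken from the Metadata Access field of this record.
BASE_URL = "http://www.clarin.si/repository/oai/request"


def oai_get_record_url(handle: str, metadata_prefix: str = "oai_dc") -> str:
    """Build a GetRecord URL for a handle such as '11356/1771' (assumed ID scheme)."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": f"oai:www.clarin.si:{handle}",
    }
    # Keep ':' and '/' unescaped so the result matches the URL as listed in the record.
    return f"{BASE_URL}?{urlencode(params, safe=':/')}"


url = oai_get_record_url("11356/1771")
query = parse_qs(urlsplit(url).query)
print(url)
```

The response to such a request is an XML document; `oai_dc` selects the Dublin Core rendering of the metadata.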

Provenance
Creator	Zwitter Vitez, Ana; Zemljarič Miklavčič, Jana; Krek, Simon; Stabej, Marko; Erjavec, Tomaž; Verdonik, Darinka; Potočnik, Tomaž; Sepesy Maučec, Mirjam; Majhenič, Simona; Žgank, Andrej; Bizjak, Andreja; Gril, Lucija; Dobrišek, Simon; Križaj, Janez; Bajec, Marko; Lebar Bajec, Iztok; Jelovšek, Tjaša; Trojar, Mitja; Bernjak, Mitja; Dretnik, Naum; Strle, Gregor; Dobrovoljc, Kaja
Publisher	Centre for Language Resources and Technologies, University of Ljubljana; Faculty of Electrical Engineering and Computer Science, University of Maribor; Faculty of Electrical Engineering, University of Ljubljana; Faculty of Computer and Information Science, University of Ljubljana; ZRC SAZU; Jožef Stefan Institute
Publication Year	2023
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	corpus
Format	text/plain; charset=utf-8; application/zip; downloadable_files_count: 2
Discipline	Linguistics