Albanian Spoken Corpus in Kosovo 1.0

This is the third version of a spoken corpus of Albanian in Kosovo. The corpus is based on short life stories of 212 informants drawn from a sample of 1,800 speakers balanced across all regions of Kosovo and across the categories of gender, age and education. In addition, metadata such as place of birth, place of residence, L1, L2, age group and occupation were collected. The audio data was recorded in 2019 by students from the University of Prishtina. The speech files can be made available on request from one of the authors and will be made publicly available in the next version, once the transcription has been finalised. The transcription was carried out partly at Humboldt-Universität zu Berlin and partly at the University of Prishtina. It is diplomatic: it uses the standard alphabet but records relevant phonological realisations. It partly follows typical renderings of Gheg dialectal words and uses the HIAT system. The data was annotated using Timofey Arkhangelsky's Uniparser-albanian-grammar (https://bitbucket.org/timarkh/uniparser-albanian-grammar), keeping only unambiguous values. A list of the tags used by the parser can be found at http://albanian.web-corpora.net. The data is in CoNLL-U format. This version of the corpus contains the data of 212 speakers aged between 11 and 80, mainly from the regions of Ferizaj, Gjilan, Kaçanik, Mitrovicë, Podujevë, Rahovec and Shtërpcë. Compared with the previous version, this version corrects several errors in the metadata.
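
Since the data is distributed in CoNLL-U format, a reader may want a quick way to load it. The following is a minimal sketch in Python, assuming the files follow the standard ten-column, tab-separated CoNLL-U layout with `#` comment lines and blank lines between sentences. The function name `read_conllu`, the `SAMPLE` text and its tags are invented for illustration and are not taken from the corpus. The tag set and file layout actually used in the archive should be checked against the download.

```python
from typing import Dict, Iterable, Iterator, List

# Column names of the standard CoNLL-U format, in order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def read_conllu(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one dict per sentence with its comments and token rows."""
    comments: Dict[str, str] = {}
    tokens: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                yield {"comments": comments, "tokens": tokens}
            comments, tokens = {}, []
        elif line.startswith("#"):
            # Comment lines usually look like "# key = value".
            key, _, value = line[1:].partition("=")
            comments[key.strip()] = value.strip()
        else:
            # Token line: map the tab-separated columns to field names.
            tokens.append(dict(zip(CONLLU_FIELDS, line.split("\t"))))
    if tokens:
        # The last sentence may not be followed by a blank line.
        yield {"comments": comments, "tokens": tokens}


# Hypothetical sample, NOT taken from the corpus; "_" marks an empty field.
SAMPLE = (
    "# sent_id = demo-1\n"
    "1\tUnë\tunë\tPRON\t_\t_\t_\t_\t_\t_\n"
    "2\tjam\tjam\tAUX\t_\t_\t_\t_\t_\t_\n"
    "3\tstudent\tstudent\tNOUN\t_\t_\t_\t_\t_\t_\n"
)

sentences = list(read_conllu(SAMPLE.splitlines()))
for sentence in sentences:
    print([token["form"] for token in sentence["tokens"]])
```

To read a real file, the same function can be given an open file handle, for example `read_conllu(open(path, encoding="utf-8"))`, where `path` points to a file extracted from the downloaded archive. Because the annotation keeps only unambiguous values, some fields may be left as `_`, so downstream code should probably treat that value as "unspecified".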

Identifier
PID http://hdl.handle.net/11356/1955
Related Identifier http://hdl.handle.net/11356/1871
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1955
Provenance
Creator Wasserscheidt, Philipp; Rugova, Bardh; Baftiu, Adelajda
Publisher University of Prishtina "Hasan Prishtina"
Publication Year 2024
Rights Creative Commons - Attribution 4.0 International (CC BY 4.0); https://creativecommons.org/licenses/by/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Albanian
Resource Type corpus
Format text/plain; charset=utf-8; application/zip; downloadable_files_count: 1
Discipline Linguistics