Albanian Spoken Corpus in Kosovo 0.2

This is the second version of a spoken corpus of Albanian in Kosovo. The corpus is based on short life stories of 212 informants drawn from a sample of 1,800 speakers balanced across all regions of Kosovo and across the categories of gender, age and education. In addition, metadata such as place of birth, place of residence, L1, L2, age group and occupation were collected. The audio data was recorded in 2019 by students from the University of Prishtina. The speech files are currently available on request from one of the authors and will be made publicly available in the next version of the publication, once the transcription has been finalised. The transcription was carried out partly at Humboldt-Universität zu Berlin and partly at the University of Prishtina. The transcription is diplomatic (it uses the standard alphabet but transcribes relevant phonological realisations). It partly follows the typical rendering of Gheg dialectal words and uses the HIAT system. The data was annotated using Timofey Arkhangelsky's Uniparser-albanian-grammar (https://bitbucket.org/timarkh/uniparser-albanian-grammar), keeping only non-ambiguous values. A list of the tags used in the parser can be found at http://albanian.web-corpora.net. The data is in CoNLL-U format. This version of the corpus contains the data of 212 speakers aged between 11 and 80, mainly from the regions of Ferizaj, Gjilan, Kaçanik, Mitrovicë, Podujevë, Rahovec and Shtërpcë.
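
Since the data is distributed in CoNLL-U format, a minimal reading sketch in Python may help orient users. It assumes the standard ten tab-separated CoNLL-U columns; the function names and the two-token sample are made up for illustration and are not taken from the corpus. Because only non-ambiguous values were kept, many fields can be expected to hold the CoNLL-U placeholder `_`, which the helper below filters out.

```python
from typing import Dict, Iterator, List

# The ten standard CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text: str) -> Iterator[Dict[str, object]]:
    """Yield one dict per sentence with its comment lines and tokens."""
    comments: Dict[str, str] = {}
    tokens: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                yield {"comments": comments, "tokens": tokens}
            comments, tokens = {}, []
        elif line.startswith("#"):
            # Comment lines usually look like "# key = value".
            key, _, value = line[1:].partition("=")
            comments[key.strip()] = value.strip()
        else:
            columns = line.split("\t")
            if len(columns) != len(FIELDS):
                raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
            tokens.append(dict(zip(FIELDS, columns)))
    if tokens:
        # The file may end without a trailing blank line.
        yield {"comments": comments, "tokens": tokens}


def annotated_fields(token: Dict[str, str]) -> Dict[str, str]:
    """Drop fields left unspecified ("_") by the annotation."""
    return {name: value for name, value in token.items() if value != "_"}


# Invented sample for illustration only; not an excerpt from the corpus.
SAMPLE = (
    "# sent_id = demo-1\n"
    "1\tunë\tunë\t_\t_\t_\t_\t_\t_\t_\n"
    "2\tjam\tjam\t_\t_\t_\t_\t_\t_\t_\n"
    "\n"
)

sentences = list(parse_conllu(SAMPLE))
for token in sentences[0]["tokens"]:
    print(annotated_fields(token))
```

To read a real file, the same function can be applied to the text of a file opened with `encoding="utf-8"`; the actual file names inside the downloadable archive are not stated in this record.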

Identifier
PID http://hdl.handle.net/11356/1871
Related Identifier http://hdl.handle.net/11356/1955
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1871
Provenance
Creator Wasserscheidt, Philipp; Rugova, Bardh; Baftiu, Adelajda
Publisher University of Prishtina "Hasan Prishtina"
Publication Year 2024
Rights Creative Commons - Attribution 4.0 International (CC BY 4.0); https://creativecommons.org/licenses/by/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Albanian
Resource Type corpus
Format text/plain; charset=utf-8; application/zip; downloadable_files_count: 1
Discipline Linguistics
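
The Metadata Access entry above is an OAI-PMH `GetRecord` request. As a minimal sketch, the same URL can be assembled from its parts with Python's standard library; the function name `build_get_record_url` is hypothetical, and the endpoint, metadata prefix and identifier are copied from this record. The sketch only builds the URL and does not contact the server.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint as given in the Metadata Access field of this record.
OAI_ENDPOINT = "http://www.clarin.si/repository/oai/request"


def build_get_record_url(identifier: str, metadata_prefix: str = "oai_dc") -> str:
    """Assemble an OAI-PMH GetRecord URL (hypothetical helper)."""
    query = urlencode({
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": identifier,
    })
    return f"{OAI_ENDPOINT}?{query}"


url = build_get_record_url("oai:www.clarin.si:11356/1871")
print(url)
```

Note that `urlencode` percent-encodes the `:` and `/` characters in the identifier, so the resulting string differs in form from the URL listed above while carrying the same parameters.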