This dataset contains sentences in various dialects of Udmurt (Permic < Uralic; ISO 639-3 code udm). It consists mainly of questionnaire responses collected for a study of variation in the relative order of Udmurt clitics within clitic clusters, and it is annotated for clitic order.
This data was collected in 2021-2023 by Timofey Arkhangelskiy. Most responses were collected in the Estonian Udmurt community (Tallinn and Tartu) in 2022. Some were collected in Tatarstan and Bashkortostan (Russia) in 2021 or in Estonia in 2023. Several sentences were taken from transcripts of dialectal texts or produced by consultants without any prompt. Data collection, annotation and publishing were supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) grant — project no. 428175960.
Before proceeding to the dataset, please keep in mind:
Although I provide the questionnaire stimuli, the responses do not always contain their exact translations. Sometimes consultants forgot what exactly they were supposed to translate, added something to their translation, or translated only a part of the stimulus. It was not my goal for the translations to be close to the original. Therefore the stimulus and the response should not be treated as translation pairs.
The speakers were instructed to translate into their own dialect rather than into the standard language. My transcriptions of their oral responses reflect all dialectal features and therefore deviate from the standard language (sometimes significantly).
As a consequence:
If you want to use this dataset for its original purpose, you can just take the annotation and not look at the actual examples. If you need it for anything beyond that purpose, you will need a reasonable command of Udmurt. There are English translations of the stimuli, but you should not rely on them alone.
DO NOT USE THIS DATASET FOR TRAINING MACHINE TRANSLATION OR UDMURT LANGUAGE MODELS!
The dataset is in TSV (tab-separated values) format. Please refer to readme.txt for further information.
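As a rough illustration of reading such a file, here is a minimal Python sketch. It assumes that the first row of the file holds column names and that the file is UTF-8 encoded. The function names read_tsv and count_values are made up for this example, and any column name you pass in must be taken from readme.txt, since none are listed here.

```python
import csv
from collections import Counter
from typing import Iterable


def read_tsv(lines: Iterable[str]) -> list:
    """Parse tab-delimited text with a header row into a list of dicts.

    Assumption: the first line contains the column names.
    """
    # QUOTE_NONE keeps quotation marks inside sentences as ordinary characters
    # instead of treating them as CSV field quoting.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(row) for row in reader]


def count_values(rows: list, column: str) -> Counter:
    """Count how often each value occurs in one column.

    `column` is whatever annotation column you are interested in;
    the real column names are documented in readme.txt.
    """
    return Counter(row[column] for row in rows)
```

To use it, open the TSV file with open(path, encoding="utf-8", newline="") and pass the file object to read_tsv, where path stands for the location of the file on your machine. Calling count_values on the resulting rows with the name of an annotation column then gives a frequency table of the annotated values without any need to read the Udmurt sentences themselves.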
If you have any questions or require help with processing the data, please feel free to contact Timofey Arkhangelskiy: timarkh@gmail.com.