This is a dataset that contains sentences in various dialects of Udmurt (Permic < Uralic; ISO 639-3 code udm). It mainly contains questionnaire responses collected for various purposes.
This data was collected in 2023-2024 by Timofey Arkhangelskiy in the Estonian Udmurt community (Tallinn and Tartu). Several sentences were taken from transcripts of dialectal texts or produced by consultants without any prompt. Data collection and publishing were supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) grant — project no. 428175960.
Before proceeding to the dataset, please keep in mind:
Although I provide the questionnaire stimuli, the responses do not always contain their exact translations. Sometimes consultants forgot what exactly they were supposed to translate, added something to their translation, or translated only a part of the stimulus. It was not my goal for the translations to be close to the original. Therefore the stimulus and the response should not be treated as translation pairs.
The speakers were instructed to make translations in their own dialect rather than in the standard language. My transcriptions of their oral responses reflect all dialectal features and deviate from the standard language (sometimes significantly).
As a consequence:
You will only be able to use this dataset for your research if you have a good command of Udmurt. In any case, you should not rely on the Russian stimuli alone.
DO NOT USE THIS DATASET FOR TRAINING MACHINE TRANSLATION OR UDMURT LANGUAGE MODELS!
The dataset is in TSV (tab-separated values) format. Please refer to readme.txt for further information.
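For readers who want to inspect the file programmatically, the following is a minimal sketch of reading a tab-delimited file with Python's standard `csv` module. It assumes the file is UTF-8 encoded and that its first row contains column names; neither assumption is confirmed here, so check readme.txt for the actual layout. The function name `load_tsv` and the file name in the usage comment are hypothetical.

```python
import csv


def load_tsv(path):
    """Read a tab-delimited file into a list of dicts, one per row.

    Assumption: the first row holds the column names (see readme.txt).
    """
    with open(path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE treats quotation marks inside sentences as ordinary
        # characters instead of as field delimiters.
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(reader)


# Usage (the file name is a placeholder, not the dataset's real file name):
# rows = load_tsv("sentences.tsv")
# print(len(rows), list(rows[0].keys()))
```

Disabling quote handling is a deliberate choice for sentence data: a transcribed response may contain an unbalanced quotation mark, which the default CSV dialect would interpret as the start of a quoted field spanning several lines.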
If you have any questions or require help with processing the data, please feel free to contact Timofey Arkhangelskiy: timarkh@gmail.com.