Beserman multimedia corpus

Dataset


This deposit contains transcriptions of monologues and conversations in spoken Beserman (formerly classified as a dialect of Udmurt, ISO 639-2 code udm). It contains 276 transcripts with a total of 289 thousand words (how this count is obtained is explained below). The online version of this corpus, which is updated regularly, can be found at https://beserman.web-corpora.net/index_en.html.

Description of the contents

The contents are as follows:

eaf (directory as ZIP archive): transcripts in ELAN with clause-level translations into Russian and English, arranged by the year of recording
json (directory as ZIP archive): the same transcriptions with automatic rule-based morphological annotation, stored in tsakorpus format, arranged by the year of recording
metadata.csv: tab-delimited metadata for the transcriptions
metadata_participants.json: metadata for speakers in JSON
README.md: documentation

The associated recordings (audio or video) are stored in a separate repository with a stricter access policy due to privacy concerns. You can download them separately and extract them into a directory called sound, placed in the same parent directory as eaf. This way, ELAN will be able to open the media files.

Beserman language

The language spoken by the Besermans belongs to the Permic branch of Uralic languages. It is spoken by about 2000 people, who live mainly in the northwest of Udmurtia. Unfortunately, the number of speakers is rapidly decreasing, as the transmission of the language to the younger generation stopped completely between 2000 and 2005.

Beserman has traditionally been regarded as a supradialect (dialectal group, narechiye) of the Udmurt language, and as the only dialect within that group. The linguistic difference between Beserman and Udmurt is small, especially if Beserman is compared to the Northern Udmurt dialects. Nevertheless, the Besermans distinguish their language from Udmurt and consider it an important factor of national identity. Beserman is de facto recognized in Udmurtia as a language different from Udmurt. The Day of Beserman language and writing is celebrated in Udmurtia on October 21. There is no official Beserman orthography at the moment. Those who write in Beserman use slightly different spellings, generally based on the Udmurt Cyrillic script. So far, two books have been published in Beserman: Vortčʼa madʼjos (by Vyacheslav Ar-Sergi and Rafail Dyukin) and Pičʼi princ (The Little Prince by Antoine de Saint-Exupéry, translated by Rafail Dyukin).

All morphological grammatical categories are expressed suffixally and agglutinatively; only indefinite and negative pronouns have prefixes. Beserman shows no traces of vowel harmony, a feature presumed to have existed in Proto-Uralic. Nominal grammatical categories include number, case and possessiveness. Verbs distinguish four morphological tenses (direct and evidential past, present and future) and index the person and number of the subject. The direct object is marked by the nominative or the accusative, depending on animateness, referential status, and other factors (differential object marking). The word order in the clause is relatively free, with SOV (subject – direct object – verb) as the default.

Corpus characteristics

Language: Beserman (previously classified as a dialect of Udmurt); Russian (code-switching and some utterances by linguists)

Size: The corpus contains full transcripts of recordings, including fragments in Russian. The volume of the corpus is:

only words in Beserman by native speakers, not counting code-switching: 235 thousand words;
all words by native speakers: 256 thousand words;
total size, including utterances of Udmurt speakers, Besermans who are not native speakers, and linguists: 289 thousand words

Texts: Aligned transcripts of audio and video recordings. These were mostly recorded during field trips to the village of Shamardan (Yukamenskoye district, Udmurtia, Russia), which began in 2003. A few recordings made in several villages in the first half of the 2000s were provided by Nadezhda Lyukina.

40% of the texts (in terms of word count) are free dialogues, 35.9% are dialogues recorded during experiments on referential communication, 24% are monologues (mainly interviews in which the linguist acts as a listener, but also narratives about events or oral translations from Russian), and 0.1% are songs.

94% of the texts were recorded in Shamardan; the rest were recorded in Vorcha, Pyshkizh, Ozhyar, Yunda, Bagurt and Yezhgurt Pichinka.

Annotation:

Translations of sentences into Russian, including comments necessary to understand the context.
Translations of sentences into English. These are produced from the Russian translations with the help of the automatic translator DeepL. At the moment, only a small part of the translations has been manually verified.
Automatic morphological annotation (lemmatization, part of speech, all inflectional categories) with uniparser_beserman_lat. 97% of word forms have at least one analysis. (Only words that do not contain digits or Latin characters are counted.) Since the analyzer is rule-based, there is ambiguity, i.e. one word form can have several different analyses.
Partial disambiguation using Constraint Grammar rules.
Annotation of Russian loanwords.
Annotation of several lexical/semantic classes: animateness/humanness, body parts, means of transport, different classes of proper names.
Annotation of the transitivity of verbs and (partially) their subcategorization frames.
Glossing.
Translations of lemmas into Russian and English.
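
The coverage and ambiguity figures mentioned for the morphological annotation can be recomputed from the JSON files. The following is a minimal sketch, assuming the usual tsakorpus layout in which each sentence has a words list, each token has a wtype field and an optional ana list of analyses. These field names, the function name and the sample data are assumptions for illustration and should be checked against the actual files. The filter on digits and Latin characters is not reproduced here.

```python
def analysis_stats(sentences):
    """Return (total, analyzed, ambiguous) counts over word tokens."""
    total = analyzed = ambiguous = 0
    for sentence in sentences:
        for token in sentence.get("words", []):
            if token.get("wtype") != "word":
                continue  # skip punctuation and other non-word tokens
            total += 1
            n_analyses = len(token.get("ana", []))
            if n_analyses >= 1:
                analyzed += 1
            if n_analyses > 1:
                ambiguous += 1  # several competing analyses remain
    return total, analyzed, ambiguous

# Hypothetical sample with placeholder word forms and lemmas.
SAMPLE_SENTENCES = [
    {
        "text": "w1 w2 w3.",
        "words": [
            {"wf": "w1", "wtype": "word",
             "ana": [{"lex": "lemma_a"}, {"lex": "lemma_b"}]},
            {"wf": "w2", "wtype": "word", "ana": [{"lex": "lemma_c"}]},
            {"wf": "w3", "wtype": "word"},
            {"wf": ".", "wtype": "punct"},
        ],
    }
]

total, analyzed, ambiguous = analysis_stats(SAMPLE_SENTENCES)
print(f"coverage: {analyzed / total:.0%}, ambiguous tokens: {ambiguous}")
```

For a real file, the sentence list would first have to be loaded with json.load; where exactly it sits inside the document object is likewise an assumption to verify.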

Metadata:

title (in English and Russian)
date (at least the year) of recording
place of recording
genre and subgenre
speaker codes
codes of the linguists who participated in recording and transcribing
sex of the speaker
birth place of the speaker
birth year of the speaker

The Latin-based transcription system used in the transcripts (and enabled by default in the online search interface) follows a tradition established in our field trips and differs somewhat from the standard systems. However, there is a one-to-one correspondence between the characters or character combinations used here and those of the standard transcription systems.

Utterances in Russian, as well as fragments of utterances that the corpus authors considered code-switching, are transcribed in standard Russian orthography.

The correspondence between the transcription system used in this corpus, UPA (Uralic Phonetic Alphabet / Finno-Ugric Transcription, in the variant traditionally used in Udmurt studies), IPA (International Phonetic Alphabet) and Cyrillic-based phonetic transcription (also in the variant traditionally used in Udmurt studies) can be found in README.md.

Format

All speakers (participants) have unique ID codes. Episodic speakers whose identity is not known all get the code other. The ELAN files have three tiers per participant. Participant ID is specified as the value of the PARTICIPANT attribute and as the suffix of the tier name, following @. Tier type is specified as the prefix of the tier name, preceding @. Tier types are tx (transcription), ft_ru (free translation into Russian, including comments) and ft_en (free translation into English, including comments). For example, tx@IM is the transcription tier for the participant with the code IM. The tx tiers are time-aligned, each segment approximately corresponding to one clause or one smaller intonational unit, if there are pauses within the clause. The other two tier types are symbolically associated with the tx tier for the corresponding participant. Additionally, some files have a time-aligned tier called privacy. Segments on this tier should be beeped out in publicly accessible versions of the corpus due to privacy issues.
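
The tier naming convention described above can be illustrated with a short Python sketch. It splits tier names such as tx@IM into tier type and participant code and lists the tiers of an ELAN document, relying on the standard EAF XML layout (TIER elements with TIER_ID and PARTICIPANT attributes). The function names, the sample document and its LINGUISTIC_TYPE_REF values are illustrative assumptions and not part of the corpus tooling.

```python
import xml.etree.ElementTree as ET

def split_tier_name(tier_id):
    """Split a tier name like 'tx@IM' into (tier_type, participant_id)."""
    tier_type, sep, participant = tier_id.partition("@")
    if not sep:
        # Tiers without '@' (e.g. 'privacy') have no participant suffix.
        return tier_id, None
    return tier_type, participant

def list_tiers(eaf_xml):
    """Return (tier_type, participant_id, PARTICIPANT attribute) per tier."""
    root = ET.fromstring(eaf_xml)
    tiers = []
    for tier in root.iter("TIER"):
        tier_type, participant = split_tier_name(tier.get("TIER_ID", ""))
        tiers.append((tier_type, participant, tier.get("PARTICIPANT")))
    return tiers

# Hypothetical, heavily reduced EAF document for illustration only.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="tx@IM" PARTICIPANT="IM" LINGUISTIC_TYPE_REF="tx"/>
  <TIER TIER_ID="ft_ru@IM" PARTICIPANT="IM" PARENT_REF="tx@IM" LINGUISTIC_TYPE_REF="ft"/>
  <TIER TIER_ID="ft_en@IM" PARTICIPANT="IM" PARENT_REF="tx@IM" LINGUISTIC_TYPE_REF="ft"/>
  <TIER TIER_ID="privacy" LINGUISTIC_TYPE_REF="privacy"/>
</ANNOTATION_DOCUMENT>"""

for row in list_tiers(SAMPLE_EAF):
    print(row)
```

For a file on disk, ET.parse(path).getroot() can take the place of ET.fromstring.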

The file metadata.csv contains metadata for all recordings in a tab-delimited tabular format. The first line contains column headers; each of the other lines corresponds to one text (transcript). The first column contains the file name of the text without the extension.
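
Despite its .csv extension, the file is tab-delimited, so the delimiter has to be set explicitly when reading it. A minimal sketch with Python's csv module follows; the column names in the sample are hypothetical, and the only structural assumption taken from the description above is that the first column holds the file name.

```python
import csv
import io

def read_metadata(handle):
    """Read tab-delimited metadata; map file name (first column) to its row."""
    reader = csv.DictReader(handle, delimiter="\t")
    key = reader.fieldnames[0]  # first column: file name without extension
    return {row[key]: row for row in reader}

# Hypothetical sample: these column names and values are assumptions.
SAMPLE = (
    "filename\tyear\tgenre\n"
    "recording_01\t2017\tdialogue\n"
    "recording_02\t2019\tmonologue\n"
)

texts = read_metadata(io.StringIO(SAMPLE))
print(texts["recording_01"]["year"])
```

With the real file, the handle would come from something like open("metadata.csv", encoding="utf-8", newline=""); the UTF-8 encoding is an assumption.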

The file metadata_participants.json contains metadata for all participants, except those that are marked as other. It contains a dictionary where the keys are IDs of the participants and the values are dictionaries with their metadata. The attribute speaker_type takes one of the following values: native (native speaker of Beserman), native_udmurt (native speaker of Udmurt, but not Beserman), linguist (linguist who is not a native speaker of either Beserman or Udmurt), or russian (native speaker of Russian, but not Beserman or Udmurt, who lives in the village). The other metadata attributes and values are self-explanatory.
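
A short sketch of how the participant metadata might be used, for example to count participants per speaker_type before restricting an analysis to native speakers. The participant codes in the sample are hypothetical; only the speaker_type attribute and its four values come from the description above.

```python
import json
from collections import Counter

SPEAKER_TYPES = {"native", "native_udmurt", "linguist", "russian"}

def count_speaker_types(participants):
    """Count participants per speaker_type, rejecting unexpected values."""
    counts = Counter()
    for participant_id, meta in participants.items():
        speaker_type = meta.get("speaker_type")
        if speaker_type not in SPEAKER_TYPES:
            raise ValueError(f"{participant_id}: unexpected {speaker_type!r}")
        counts[speaker_type] += 1
    return counts

# Hypothetical sample; real participant codes and attributes will differ.
SAMPLE_JSON = """{
  "AA": {"speaker_type": "native"},
  "BB": {"speaker_type": "native"},
  "Interviewer_XX": {"speaker_type": "linguist"}
}"""

participants = json.loads(SAMPLE_JSON)
counts = count_speaker_types(participants)
native_ids = sorted(p for p, m in participants.items()
                    if m["speaker_type"] == "native")
print(counts, native_ids)
```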

Annotation

Lemmatization

The lemma for nouns, relational nouns, pronouns and adjectives is the morphologically unmarked form, i.e. the non-possessive singular nominative form. The lemma for verbs is the infinitive.

Word forms containing productive derivations are lemmatized without these derivations if the corresponding lemma exists. For nouns, these are the proprietives on -o and on -em and the caritive attributivizer on -tem. For example, šʼašʼkajo 'with flower / flowers' is considered a form of the lexeme šʼašʼka 'flower' and is marked as a noun. For verbs, these are the iterative (-əl/-lʼlʼa), the detransitive (-(i)šʼk) and the productive causative (on -(ə)t, but not on -et and not on -t in verbs of the non-a conjugation), as well as the multiplicative (-ja) when it follows a causative.

Tagset

Grammatical values expressed in each word are indicated with tags. A complete list of tags used for annotating words in Beserman can be found in README.md.

Authors

Starting in 2003, the corpus texts were recorded and transcribed in the field by numerous participants of the field trips. The overwhelming majority of the corpus texts (about 80%) were recorded by Maria Usacheva (code Interviewer_MU in the transcripts) and/or Timofey Arkhangelskiy (Interviewer_TA), in some cases, together with other linguists. They, as well as Maria Berseneva, a native speaker of Beserman, prepared the vast majority of transcriptions and translations of the texts into Russian. Olga Biryuk (Interviewer_OB), Ruslan Idrisov (Interviewer_RI), Maria Cheremisinova (Interviewer_MCh), Nikolai Filippov (Interviewer_NF) and Iuliia Zubova (Interviewer_YZ) have also significantly contributed to the recording and transcription of the texts. Timofey Arkhangelskiy provides technical support for the corpus and is responsible for correcting earlier transcriptions. Sound-alignment (ELAN) of texts that were transcribed before 2015 and did not have any alignment was performed by Marina Pankova. Most of the alignment of the remaining texts with sound was done by Timofey Arkhangelskiy.

Funding

The previous publicly accessible version of the corpus (BeserCorp 1.0) was archived in the Language Bank of Finland (FIN-CLARIN): http://urn.fi/urn:nbn:fi:lb-2021052406. It was much smaller, did not have any sound alignment or English translations and had a different annotation (manual annotation in FLEX).

The preparation of this version of the corpus was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) grant — project no. 428175960 (Timofey Arkhangelskiy).

Contact

If you have any questions, would like to propose a collaboration, or have noticed an error in the corpus, please email Timofey Arkhangelskiy at timarkh@gmail.com.

References

ELAN (Version 6.9) [Computer software]. (2024). Nijmegen: Max Planck Institute for Psycholinguistics, The Language Archive. Retrieved from https://archive.mpi.nl/tla/elan


Identifier
DOI	https://doi.org/10.25592/uhhfdm.16991
Related Identifier	IsPartOf https://doi.org/10.25592/uhhfdm.16990
Metadata Access	https://www.fdr.uni-hamburg.de/oai2d?verb=GetRecord&metadataPrefix=oai_datacite&identifier=oai:fdr.uni-hamburg.de:16991

Provenance
Creator	Arkhangelskiy, Timofey ; Usacheva, Maria
Publisher	Universität Hamburg
Contributor	Biryuk, Olga; Idrisov, Ruslan; Cheremisinova, Maria; Filippov, Nikolai; Zubova, Iuliia
Publication Year	2025
Rights	Creative Commons Attribution 4.0 International; Open Access; https://creativecommons.org/licenses/by/4.0/legalcode; info:eu-repo/semantics/openAccess
OpenAccess	true

Representation
Resource Type	Dataset
Version	2.0
Discipline	Other