ChatgaiyyaAlap: A Dataset for Conversion from Chittagonian Dialect to Standard Bangla

A large body of recent research has addressed conversion from standard Bangla into other languages, but only a limited amount of effective work has been done on Bangla dialect conversion. We developed the “ChatgaiyyaAlap” dataset to convert the Chittagonian dialect into standard Bangla. The dataset consists of two comma-separated values (.csv) files. The first file holds parallel sentences in two columns, one for standard Bangla and one for Chittagonian, so each row pairs a standard Bangla sentence with its translation in the Chittagonian dialect. The second file is our dictionary file, which maps words between the Chittagonian dialect and standard Bangla.

The Chittagonian sentences in the first file were collected from diverse sources such as YouTube and Facebook posts, comments, videos, short films, and dramas in the Chittagonian dialect. After data collection and preprocessing, the collected data were evaluated by five professional human evaluators who are native speakers of the Chittagonian dialect and also know standard Bangla.

Assembling sentences in the Chittagonian dialect was a slow process, and limited resources were our main obstacle. To speed up data collection, we also gathered Bangla sentences from different social media sites and translated them into the Chittagonian dialect with the assistance of five native speakers. Because the data were verified and translated by five different speakers, more than one synonym could end up being used for the same Bangla word. To avoid misunderstandings, we tried to use the more widely recognized term rather than alternating between synonyms for the same phrase. To keep the system simple and improve the translation process, we maintained a dictionary file that helps select the appropriate Chittagonian word for a given standard Bangla word.

In total, the dataset consists of two files: one with Chittagonian and standard Bangla sentence pairs, and one dictionary file.
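
The two-file layout described above could be read along the following lines. This is only a minimal sketch, not part of the dataset: the function names are hypothetical, and the column order, the presence of a header row, and the file encoding are all assumptions, because the description does not state them. The word-by-word lookup at the end is a naive illustration of how a word-mapping file can be applied. It is not the authors' conversion method.

```python
import csv
from typing import Dict, List, TextIO, Tuple


def read_sentence_pairs(
    handle: TextIO,
    bangla_col: int = 0,        # assumption: column order is not documented
    chittagonian_col: int = 1,  # assumption: column order is not documented
    has_header: bool = True,    # assumption: header row may or may not exist
) -> List[Tuple[str, str]]:
    """Return (chittagonian, bangla) sentence pairs from a two-column CSV."""
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # discard the header row if there is one
    pairs = []
    for row in reader:
        if len(row) <= max(bangla_col, chittagonian_col):
            continue  # skip blank or short rows
        chittagonian = row[chittagonian_col].strip()
        bangla = row[bangla_col].strip()
        if chittagonian and bangla:
            pairs.append((chittagonian, bangla))
    return pairs


def read_word_map(
    handle: TextIO,
    source_col: int = 0,      # assumption: which column holds which language
    target_col: int = 1,
    has_header: bool = True,
) -> Dict[str, str]:
    """Return a source-word -> target-word mapping from the dictionary CSV."""
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)
    word_map: Dict[str, str] = {}
    for row in reader:
        if len(row) <= max(source_col, target_col):
            continue
        source = row[source_col].strip()
        target = row[target_col].strip()
        if source and target:
            # Keep the first mapping seen when a word appears more than once.
            word_map.setdefault(source, target)
    return word_map


def word_by_word(sentence: str, word_map: Dict[str, str]) -> str:
    """Replace each whitespace-separated token found in the map.

    Tokens without a dictionary entry are left unchanged.
    """
    return " ".join(word_map.get(token, token) for token in sentence.split())
```

When reading the real files, opening them with `open(path, encoding="utf-8", newline="")` is a reasonable starting point for Bangla text, though the actual encoding should be checked against the downloaded files. Swapping `source_col` and `target_col` reverses the direction of the lookup.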

THIS DATASET IS ARCHIVED AT DANS/EASY, BUT NOT ACCESSIBLE HERE. TO VIEW A LIST OF FILES AND ACCESS THE FILES IN THIS DATASET CLICK ON THE DOI-LINK ABOVE

Identifier
DOI https://doi.org/10.17632/wtms9xbkkw.1
PID https://nbn-resolving.org/urn:nbn:nl:ui:13-ai-uuho
Metadata Access https://easy.dans.knaw.nl/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=oai:easy.dans.knaw.nl:easy-dataset:350828
Provenance
Creator Remal, D
Publisher Data Archiving and Networked Services (DANS)
Contributor Deawan Rakin Ahamed Remal
Publication Year 2024
Rights info:eu-repo/semantics/openAccess; License: http://creativecommons.org/licenses/by/4.0
OpenAccess true
Representation
Resource Type Dataset
Discipline Other