ArDia: Improving Arabic Dialectal Language Classification Using a Novel Dataset

Elsafty, Hossam; Abdou, Bouthaina Soulef; Deußer, Tobias; Pielka, Maren; Bauckhage, Christian; Sifa, Rafet

doi:10.1609/icwsm.v19i1.35944

June 7, 2025

Conference Paper

Abstract

Although Arabic is one of the most widely spoken languages, dialectal Arabic data remains scarce. In this paper, we address this challenge by proposing a novel approach to data collection that draws primarily on video captions from TikTok, supplemented by other resources such as dictionaries and articles, resulting in the ArDia dataset. To the best of our knowledge, ArDia is the largest labeled dialectal Arabic dataset, containing over 900,000 examples, each labeled with its respective dialect. We further leverage this dataset to pretrain two transformer-based models, ArDiaBERT and ArDiaGPT. Given the lack of research on Arabic models, we also present a comprehensive study of the Arabic dialect identification task using the ArDia dataset.