Fraunhofer-Gesellschaft
June 7, 2025
Conference Paper
Title

ArDia: Improving Arabic Dialectal Language Classification Using a Novel Dataset

Abstract
Although Arabic is one of the most widely spoken languages, dialectal Arabic data remains scarce. In this paper, we address this challenge with a novel approach to data collection that relies primarily on video captions from TikTok, supplemented by other resources such as dictionaries and articles, resulting in the ArDia dataset. To the best of our knowledge, ArDia is the largest labeled dialectal Arabic dataset, containing over 900,000 examples, each labeled with its respective dialect. We further use this dataset to pretrain two transformer-based models, ArDiaBERT and ArDiaGPT. Because research on Arabic models is lacking, we also present a comprehensive study of Arabic dialect identification using the ArDia dataset.
Author(s)
Elsafty, Hossam
Universität Bonn  
Abdou, Bouthaina Soulef
Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS  
Deußer, Tobias  
Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS  
Pielka, Maren  
Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS  
Bauckhage, Christian  
Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS  
Sifa, Rafet  
Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS  
Mainwork
Nineteenth International AAAI Conference on Web and Social Media, ICWSM 2025. Proceedings  
Project(s)
The Lamarr Institute for Machine Learning and Artificial Intelligence  
Funder
Bundesministerium für Bildung und Forschung -BMBF-  
Conference
International Conference on Web and Social Media 2025  
DOI
10.1609/icwsm.v19i1.35944
Language
English
Institute(s)
Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS  
Fraunhofer Group
Fraunhofer-Verbund IUK-Technologie  
Keyword(s)
  • Natural Language Processing
  • Arabic NLP
  • Arabic Dialect Classification