November 2025
Conference Paper
Title
Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models
Abstract
High-quality multilingual training data is essential for effectively pretraining large language models (LLMs). Yet, the availability of suitable open-source multilingual datasets remains limited. Existing state-of-the-art datasets mostly rely on heuristic filtering methods, restricting both their cross-lingual transferability and scalability. Here, we introduce JQL, a systematic approach that efficiently curates diverse and high-quality multilingual data at scale while significantly reducing computational demands. JQL distills LLMs’ annotation capabilities into lightweight annotators based on pretrained multilingual embeddings. These models exhibit robust multilingual and cross-lingual performance, even for languages and scripts unseen during training. Evaluated empirically across 35 languages, the resulting annotation pipeline substantially outperforms current heuristic filtering methods such as those used in FineWeb2. JQL notably enhances downstream model training quality and increases data retention rates. Our research provides practical insights and valuable resources for multilingual data curation, raising the standards of multilingual dataset development.
Author(s)
Ali, Mehdi  
Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS  
Brack, Manuel
Hessian AI
Lübbering, Max  
Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS  
Wendt, Elias
TU Darmstadt  
Khan, Abbas Goher
Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS  
Rutmann, Richard
Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS  
Jude, Alex
Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS  
Kraus, Maurice
DFKI  
Weber, Alexander Arno
Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS  
Stollenwerk, Felix
AI Sweden
Kaczér, David
Lamarr Institut
Mai, Florian
Lamarr Institut
Flek, Lucie
Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS  
Sifa, Rafet  
Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS  
Flores-Herr, Nicolas  
Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS  
Köhler, Joachim  
Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS  
Schramowski, Patrick
DFKI  
Fromm, Michael  
Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS  
Kersting, Kristian
DFKI  
Mainwork
EMNLP 2025, Conference on Empirical Methods in Natural Language Processing. Proceedings  
Conference
Conference on Empirical Methods in Natural Language Processing 2025  
Open Access
DOI
10.18653/v1/2025.emnlp-main.449
Language
English
Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS  
Keyword(s)
  • large language models
  • LLMs
