June 2024
Conference Paper
Title

Tokenizer Choice For LLM Training: Negligible or Crucial?

Abstract
The recent success of Large Language Models (LLMs) has been driven predominantly by curating the training dataset composition, scaling model architectures and dataset sizes, and advancing pretraining objectives, leaving tokenizer influence as a blind spot. Shedding light on this underexplored area, we conduct a comprehensive study of the influence of tokenizer choice on LLM downstream performance by training 24 mono- and multilingual LLMs at a 2.6B parameter scale, ablating different tokenizer algorithms and parameterizations. Our studies highlight that the tokenizer choice can significantly impact the model's downstream performance and training costs. In particular, we find that the common tokenizer evaluation metrics fertility and parity are not always predictive of model downstream performance, rendering them a questionable proxy. Furthermore, we show that multilingual tokenizers trained on the five most frequent European languages require a vocabulary roughly three times larger than English-only tokenizers. While English-centric tokenizers have been applied to the training of multilingual LLMs in the past, we find that this approach results in severe downstream performance degradation and additional training costs of up to 68%, due to an inefficient tokenization vocabulary.
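
The two metrics named in the abstract can be made concrete with a short sketch. The following Python snippet is a minimal illustration, not the authors' evaluation code: it computes fertility as the average number of subword tokens per whitespace-separated word, and parity as the token-count ratio over parallel sentence pairs, following the common definitions of these metrics. The "gpt2" tokenizer and the example sentences are illustrative assumptions; the paper's actual corpora and preprocessing may differ.

from transformers import AutoTokenizer

def fertility(tokenizer, texts):
    # Average number of subword tokens produced per whitespace-separated word.
    n_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

def parity(tokenizer, parallel_pairs):
    # Mean token-count ratio over (lang_a, lang_b) parallel sentence pairs;
    # a value close to 1.0 means the tokenizer treats both languages comparably.
    ratios = [len(tokenizer.tokenize(a)) / len(tokenizer.tokenize(b))
              for a, b in parallel_pairs]
    return sum(ratios) / len(ratios)

tok = AutoTokenizer.from_pretrained("gpt2")  # English-centric BPE tokenizer, chosen for illustration
print("fertility (en):", fertility(tok, ["The quick brown fox jumps over the lazy dog."]))

# Hypothetical German/English parallel pair, for illustration only.
pairs = [("Der schnelle braune Fuchs springt über den faulen Hund.",
          "The quick brown fox jumps over the lazy dog.")]
print("parity (de/en):", parity(tok, pairs))

With an English-centric tokenizer such as GPT-2's, the non-English side of a parallel pair typically yields more tokens, pushing the parity ratio above 1.0, which mirrors the tokenization inefficiency the abstract identifies when English-centric tokenizers are reused for multilingual training.
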
Author(s)
Ali, Mehdi (Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS)
Fromm, Michael (Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS)
Thellmann, Klaudia (Technische Universität Dresden)
Rutmann, Richard (Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS)
Lübbering, Max (Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS)
Leveling, Johannes (Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS)
Ebert, Jan (Forschungszentrum Jülich GmbH)
Klug, Katrin (Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS)
Jurkschat, Lena (Technische Universität Dresden)
John, Chelsea (Forschungszentrum Jülich GmbH)
Jain, Charvi (Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS)
Schulze Buschhoff, Johann Jasper (Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS)
Weber, Alexander Arno (Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS)
Abdelwahab, Hammam (Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS)
Doll, Niclas (Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS)
Suarez, Pedro Ortiz (German Research Center for Artificial Intelligence, DFKI)
Ostendorff, Malte (German Research Center for Artificial Intelligence, DFKI)
Weinbach, Samuel (Aleph Alpha GmbH)
Sifa, Rafet (Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS)
Kesselheim, Stefan (Forschungszentrum Jülich GmbH)
Flores-Herr, Nicolas (Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS)
Mainwork
Findings of the Association for Computational Linguistics: NAACL 2024  
Conference
Association for Computational Linguistics, North American Chapter (NAACL Annual Conference) 2024  
DOI
10.18653/v1/2024.findings-naacl.247
Language
English
Institute(s)
Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS
Keyword(s)
  • Computational linguistics
  • Data set size
  • Down-stream
  • Language model
  • Model training
  • Modeling architecture
  • Performance
  • Scalings
  • Tokenizer
  • Training costs
  • Training dataset
  • Large datasets