October 21, 2025
Conference Paper
Title
Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs
Abstract
We present two multilingual large language models (LLMs), Teuken-7B-Base and Teuken-7B-Instruct, designed to embrace Europe’s linguistic diversity by supporting all 24 official languages of the European Union. Trained on a dataset comprising around 60% non-English data and using a custom multilingual tokenizer, our models address the limitations of existing LLMs, which predominantly focus on English or a few high-resource languages. We detail the models’ development principles, i.e., data composition, tokenizer optimization, and training methodologies. The models demonstrate strong performance across multilingual benchmarks, as evidenced by their results on European versions of ARC, HellaSwag, and TruthfulQA.
Author(s)
Schulze Buschhoff, Johann Jasper
Open Access
Rights
CC BY-NC 4.0: Creative Commons Attribution-NonCommercial
Language
English