Overview and Joint Report of the Robustness and Consistency Task of the ELOQUENT 2025 Lab for Evaluating Generative Language Model Quality

License: CC BY 4.0
Authors: Karlgren, Jussi; Engels, Marie Isabel; Barrett, Maria; Gunti, Rohit Raj; Hoveyda, Mohanna; Sotic, Bruno Nadalic; Kamps, Jaap; Koistinen, Mika; Zosa, Elaine
Dates: 2025-10-30; 2025-12-04; 2025
DOI: https://doi.org/10.24406/publica-5955
Handle: https://publica.fraunhofer.de/handle/publica/497952
Scopus ID: 2-s2.0-105019040432
Language: en
Type: conference paper

Abstract

Generative language models are intended to be creative and responsive to the style of the conversation they engage in. The experimental Robustness and Consistency task is designed to explore how variation between content-wise equivalent inputs influences the output of a generative language model, and in this year's edition the task focuses on how linguistic variation makes a difference for value-oriented questions. This paper is a joint report by all participants in the task.