Fraunhofer-Gesellschaft
2025
Paper (Preprint, Research Paper, Review Paper, White Paper, etc.)
Title

Which Method(s) to Pick when Evaluating Large Language Models with Humans? - A comparison of 6 methods

Abstract
Human evaluations are considered the gold standard for assessing the quality of NLP systems, including large language models (LLMs), yet there is little research on how different evaluation methods impact results. This study compares six commonly used evaluation methods, four quantitative (Direct Quality Estimation, Best-Worst Scaling, AB Testing, Agreement with Quality Criterion) and two qualitative (spoken and written feedback), to examine their influence on ranking texts generated by four LLMs. We found that while GPT-4 was consistently ranked as the top-performing model across methods, the rankings of the other models varied considerably. In addition, the methods differed in their cost-effectiveness, with Direct Quality Estimation emerging as the most efficient. Qualitative methods provided insights beyond the quantitative methods, especially spoken feedback in moderated sessions. Participants reported challenges with task comprehension and evaluation interfaces. Our findings highlight that both the choice of evaluation method and its implementation can influence results, affecting the validity and interpretability of human assessments. These findings suggest a need for methodological guidelines in human-centered evaluations of LLMs to improve reliability and reproducibility in NLP research.
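Of the quantitative methods compared in the abstract, Best-Worst Scaling has the most distinctive scoring step: each text is scored by how often annotators pick it as the best versus the worst item in a trial. The following is a minimal Python sketch of that standard count-based aggregation; it illustrates the general technique only, not the authors' implementation, and the trial data and item labels are hypothetical.

    from collections import Counter

    def bws_scores(trials):
        # Standard count-based Best-Worst Scaling:
        # score(item) = (#times chosen best - #times chosen worst) / #times shown.
        best, worst, shown = Counter(), Counter(), Counter()
        for trial in trials:
            for item in trial["items"]:
                shown[item] += 1
            best[trial["best"]] += 1
            worst[trial["worst"]] += 1
        return {item: (best[item] - worst[item]) / shown[item] for item in shown}

    # Hypothetical judgments over outputs from four models A-D.
    trials = [
        {"items": ["A", "B", "C", "D"], "best": "A", "worst": "C"},
        {"items": ["A", "B", "C", "D"], "best": "A", "worst": "D"},
        {"items": ["A", "B", "C", "D"], "best": "B", "worst": "C"},
    ]
    for item, score in sorted(bws_scores(trials).items(), key=lambda kv: -kv[1]):
        print(item, round(score, 2))  # A 0.67, B 0.33, D -0.33, C -0.67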
Author(s)
Popp, Birgit
Fraunhofer-Institut für Integrierte Schaltungen IIS  
Keck, Sarah
Fraunhofer-Institut für Integrierte Schaltungen IIS  
Mertsiotaki, Androniki
Fraunhofer-Institut für Integrierte Schaltungen IIS  
Kratsch, Emily
Fraunhofer-Institut für Integrierte Schaltungen IIS  
Daum, Alexander
Fraunhofer-Institut für Integrierte Schaltungen IIS  
Conference
Association for Computational Linguistics (ACL Annual Meeting) 2025  
Open Access
File(s)
Download (1.57 MB)
Rights
CC BY-NC-ND 4.0: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International
DOI
10.24406/publica-4221
Language
English
Keyword(s)
  • NLP
  • Evaluation
  • Human
  • Natural Language Processing
  • Large Language Models
  • LLM
  • ChatGPT
  • Method