Fraunhofer-Gesellschaft
2025
Paper (Preprint, Research Paper, Review Paper, White Paper, etc.)
Title

Which Method(s) to Pick when Evaluating Large Language Models with Humans? - A comparison of 6 methods

Abstract
Human evaluations are considered the gold standard for assessing the quality of NLP systems, including large language models (LLMs), yet there is little research on how different evaluation methods impact results. This study compares six commonly used evaluation methods, four quantitative (Direct Quality Estimation, Best-Worst Scaling, AB Testing, Agreement with Quality Criterion) and two qualitative (spoken and written feedback), to examine their influence on ranking texts generated by four LLMs. We found that while GPT-4 was consistently ranked as the top-performing model across methods, the rankings of the other models varied considerably. In addition, the methods differed in their cost-effectiveness, with Direct Quality Estimation emerging as the most efficient. Qualitative methods provided insights beyond the quantitative methods, especially spoken feedback in moderated sessions. Participants reported challenges with task comprehension and evaluation interfaces. Our findings highlight that both the choice of evaluation method and its implementation can influence results, affecting the validity and interpretability of human assessments. These findings suggest a need for methodological guidelines in human-centered evaluations of LLMs to improve reliability and reproducibility in NLP research.
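Of the quantitative methods compared in the abstract, Best-Worst Scaling has the most distinctive scoring step: each text is scored by how often annotators pick it as the best versus the worst item in a trial. The following is a minimal Python sketch of that standard count-based aggregation; it illustrates the general technique only, not the authors' implementation, and the trial data and item labels are hypothetical.

    from collections import Counter

    def bws_scores(trials):
        # Standard count-based Best-Worst Scaling:
        # score(item) = (#times chosen best - #times chosen worst) / #times shown.
        best, worst, shown = Counter(), Counter(), Counter()
        for trial in trials:
            for item in trial["items"]:
                shown[item] += 1
            best[trial["best"]] += 1
            worst[trial["worst"]] += 1
        return {item: (best[item] - worst[item]) / shown[item] for item in shown}

    # Hypothetical judgments over outputs from four models A-D.
    trials = [
        {"items": ["A", "B", "C", "D"], "best": "A", "worst": "C"},
        {"items": ["A", "B", "C", "D"], "best": "A", "worst": "D"},
        {"items": ["A", "B", "C", "D"], "best": "B", "worst": "C"},
    ]
    for item, score in sorted(bws_scores(trials).items(), key=lambda kv: -kv[1]):
        print(item, round(score, 2))  # A 0.67, B 0.33, D -0.33, C -0.67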
Author(s)
Popp, Birgit
Fraunhofer-Institut für Integrierte Schaltungen IIS  
Keck, Sarah
Fraunhofer-Institut für Integrierte Schaltungen IIS  
Mertsiotaki, Androniki
Fraunhofer-Institut für Integrierte Schaltungen IIS  
Kratsch, Emily
Fraunhofer-Institut für Integrierte Schaltungen IIS  
Daum, Alexander
Fraunhofer-Institut für Integrierte Schaltungen IIS  
Conference
Association for Computational Linguistics (ACL Annual Meeting) 2025  
Open Access
File(s)
Download (1.57 MB)
Rights
CC BY-NC-ND 4.0: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International
DOI
10.24406/publica-4221
Language
English
Keyword(s)
  • NLP
  • Evaluation
  • Human
  • Natural Language Processing
  • Large Language Models
  • LLM
  • ChatGPT
  • Method