2025
Paper (Preprint, Research Paper, Review Paper, White Paper, etc.)
Title
Designing Usable Interfaces for Human Evaluation of LLM-Generated Texts: UX Challenges and Solutions
Abstract
Human evaluations remain important for assessing large language models (LLMs) due to the limitations of automated metrics. However, flawed methodologies and poor user interface (UI) design can compromise the validity and reliability of such evaluations. This pre-registered study investigates usability challenges and proposes UI design solutions for evaluating LLM-generated texts. A comparison of common evaluation methods, including Direct Quality Estimation, AB-Testing, Agreement with Quality Criterion, and Best-Worst Scaling, revealed user experience challenges such as inefficient information transfer and poor visibility of evaluation materials. Iterative redesigns improved discoverability, accessibility, and user interaction through changes to page layout and content presentation. Testing these enhancements showed increased clarity and usability, with higher response rates and more consistent ratings. This work highlights the importance of UI design in enabling reliable and meaningful human evaluation and provides actionable recommendations for improving the integrity and usability of NLP evaluation frameworks.
Author(s)
Open Access
Rights
CC BY-NC-ND 4.0: Creative Commons Attribution-NonCommercial-NoDerivatives
Language
English