Title: Designing Usable Interfaces for Human Evaluation of LLM-Generated Texts: UX Challenges and Solutions
Authors: Mertsiotaki, Androniki; Hofmann, Stephanie; Keck, Sarah; Kratsch, Emily; Daum, Alexander; Popp, Birgit
Date: 2025-02-05 (issued 2025)
Type: Paper
Language: English (en)
License: CC BY-NC-ND 4.0
DOI: 10.24406/publica-4219 (https://doi.org/10.24406/publica-4219)
Handle: https://publica.fraunhofer.de/handle/publica/483662
Keywords: Evaluation; LLM; Large Language Models; NLP; Design; Usability; User Interface; Natural Language Processing; Human

Abstract: Human evaluations remain important for assessing large language models (LLMs) because of the limitations of automated metrics. However, flawed methodologies and poor user interface (UI) design can compromise the validity and reliability of such evaluations. This pre-registered study investigates usability challenges and proposes solutions for UI design in the evaluation of LLM-generated texts. A comparison of common evaluation methods, including Direct Quality Estimation, AB-Testing, Agreement with Quality Criterion, and Best-Worst Scaling, revealed user experience challenges such as inefficient information transfer and poor visibility of evaluation materials. Iterative redesigns improved discoverability, accessibility, and user interaction through modifications to page layout and content presentation. Testing these enhancements showed increased clarity and usability, with higher response rates and more consistent ratings. This work highlights the importance of UI design in enabling reliable and meaningful human evaluation and provides actionable recommendations to enhance the integrity and usability of NLP evaluation frameworks.