2025
Conference Paper
Title
Can Large Language Models (LLMs) Compete with Human Requirements Reviewers? - Replication of an Inspection Experiment on Requirements Documents
Abstract
The use of large language models (LLMs) in software engineering is growing, especially for code, typically to generate it or to detect and fix quality problems. Because requirements are usually written in natural language, it seems promising to exploit the capabilities of LLMs to detect requirements problems. We replicated an inspection experiment in which computer science students searched for defects in requirements documents using different reading techniques. In our replication, we used the LLM GPT-4-Turbo instead of students to determine how the model compares to human reviewers. Additionally, we considered GPT-3.5-Turbo, Nous-Hermes-2-Mixtral-8x7B-DPO, and Phi-3-medium-128k-instruct for one research question. We focused on single-prompt approaches and avoided more complex ones in order to mimic the original study design, in which students received all the material at once. The study had two phases. First, we explored the general feasibility of using LLMs for requirements inspection on a practice document and examined different prompts. Second, we applied selected approaches to two requirements documents and compared them to each other and to the human reviewers. The approaches vary in the reading technique (ad-hoc, perspective-based, checklist-based), the LLM, the instructions, and the material provided. We found that the LLMs (a) report only a limited number of deficits despite having enough tokens available, (b) produce findings that vary little across prompts, and (c) rarely match the sample solution.
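For illustration, a single-prompt, checklist-based inspection along the lines the abstract describes could look like the sketch below. It assumes the OpenAI Python client (openai v1); the checklist items, prompt wording, and file name are hypothetical placeholders, not the study's actual materials.

```python
# Hypothetical sketch of a single-prompt, checklist-based requirements
# inspection in the spirit of the study design (all material in one prompt,
# no multi-turn refinement). Checklist and prompt wording are illustrative
# assumptions, not the original study materials.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CHECKLIST = """\
1. Is every requirement unambiguous?
2. Are the requirements free of contradictions?
3. Is any required functionality missing or incomplete?
"""

def inspect_requirements(document: str, model: str = "gpt-4-turbo") -> str:
    """Ask the model to report all defects it finds in a single shot."""
    prompt = (
        "You are a requirements reviewer. Using the checklist below, "
        "list every defect you find in the requirements document. "
        "For each defect, give the requirement ID and a short rationale.\n\n"
        f"Checklist:\n{CHECKLIST}\n"
        f"Requirements document:\n{document}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # near-deterministic output, for comparability across runs
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    with open("requirements.txt", encoding="utf-8") as f:  # hypothetical input file
        print(inspect_requirements(f.read()))
```

Swapping the model parameter (e.g., to "gpt-3.5-turbo") or the checklist text corresponds to the prompt variations the abstract mentions.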
Author(s)
  • Seifert, Daniel (Fraunhofer-Institut für Experimentelles Software Engineering IESE)
  • Jöckel, Lisa (Fraunhofer-Institut für Experimentelles Software Engineering IESE)
  • Trendowicz, Adam (Fraunhofer-Institut für Experimentelles Software Engineering IESE)
  • Ciolkowski, Marcus
  • Honroth, Thorsten (Fraunhofer-Institut für Experimentelles Software Engineering IESE)
  • Jedlitschka, Andreas (Fraunhofer-Institut für Experimentelles Software Engineering IESE)
Mainwork
Product-Focused Software Process Improvement. 25th International Conference, PROFES 2024. Proceedings  
Conference
International Conference on Product-Focused Software Process Improvement 2024  
DOI
10.1007/978-3-031-78386-9_3
Language
English
Institute
Fraunhofer-Institut für Experimentelles Software Engineering IESE
Keyword(s)
  • Artificial Intelligence
  • Machine Learning
  • Quality Assurance
  • Requirements Engineering