Q&A Eval: Benchmarking Secure Coding Ability of LLMs on Real-World Tasks

Toran, Markus; Ballin, Bettina; Miltenberger, Marc; Arzt, Steven

doi:10.1145/3786165.3788437

April 12, 2026

Conference Paper

Abstract

Conversational models are revolutionizing the way we think, communicate, and code. Large Language Models (LLMs) such as GPT-o4 can generate thousands of lines of code in seconds, ranging from simple boilerplate functions to large and complex applications. In this study, we evaluate the security and quality of the code produced by LLMs, comparing it to a human baseline derived from a vast corpus of StackOverflow questions and answers. We queried five LLMs with over 10,000 cybersecurity-related questions from StackOverflow. Using three static code scanners, we automatically identified software vulnerabilities in the AI-generated Java and Python code as well as in the human-provided code snippets from the StackOverflow answers. Based on this data, we analyze what developers can expect from LLM-generated code and how its security level compares to that of code provided by humans. We find that popular LLMs generate code that is less secure than code written by human programmers. LLMs often replicate common vulnerability patterns and, in some cases, introduce additional security issues. Our results contradict a previous study conducted on a similar, albeit smaller, dataset.