Using LLMs to Identify Personal Data Processing in Source Code

Kunz, Immanuel; Kao, Ching-Yu Franziska; Kowatsch, Daniel; Hiller, Jens; Schütte, Julian; Prokhorenkov, Dmitry; Bettinger, Konstantin

doi:10.1109/SPW67851.2025.00018

2025

Conference Paper

Abstract

Assessing the privacy impact of software products is essential for adhering to regulatory requirements, but it is also highly challenging. The task requires expertise in both software engineering and data protection, and it is time-intensive and error-prone, particularly for large and frequently changing applications. In this study, we present a Large Language Model (LLM)-based approach to automatically classify source code by its privacy impact. Our contributions are (1) a dataset of code snippets labeled with personal data categories from a W3C personal data taxonomy, (2) an extensible approach and framework to auto-classify source code using the taxonomy and different prompting strategies, and (3) a set of experiments that give insight into how to use such a framework effectively. Our results demonstrate that LLM-based detection of personal data processing in source code is feasible, with levels of accuracy that can effectively support human reviewers in assessing software at scale.
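To make the classification idea concrete, the following is a minimal sketch of how a taxonomy-driven prompt could be built and how a model's answer could be mapped back to known labels. It is not the framework from the paper. The category names in TAXONOMY, the prompt wording, and the functions build_prompt, parse_labels, classify, and fake_model are all assumptions made for illustration, and fake_model stands in for a real LLM call.

```python
from typing import Callable, List

# Hypothetical category names for illustration only; the abstract does not
# list the labels actually used in the dataset.
TAXONOMY = ["Email Address", "Location", "Name", "Telephone Number"]


def build_prompt(snippet: str, categories: List[str]) -> str:
    """Assemble a simple zero-shot prompt (an assumed prompting strategy)."""
    lines = [
        "You are reviewing source code for personal data processing.",
        "From the categories below, list every one the code processes,",
        "one per line. Answer 'None' if no category applies.",
        "",
        "Categories:",
    ]
    lines += [f"- {c}" for c in categories]
    lines += ["", "Code:", snippet]
    return "\n".join(lines)


def parse_labels(response: str, categories: List[str]) -> List[str]:
    """Keep only response lines that match a known category exactly
    (case-insensitive), so labels outside the taxonomy are dropped."""
    known = {c.lower(): c for c in categories}
    found: List[str] = []
    for line in response.splitlines():
        key = line.strip().lstrip("-* ").strip().lower()
        if key in known and known[key] not in found:
            found.append(known[key])
    return found


def classify(
    snippet: str,
    model: Callable[[str], str],
    categories: List[str] = TAXONOMY,
) -> List[str]:
    """Send the prompt to `model` and return the validated labels."""
    return parse_labels(model(build_prompt(snippet, categories)), categories)


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM; returns one valid and one unknown label."""
    return "- Email Address\n- Credit Score\n"


if __name__ == "__main__":
    print(classify("send_mail(user.email, body)", fake_model))
```

Validating the answer against the fixed label list is one possible design choice: it keeps the output comparable to a labeled dataset and discards categories the model invents, which leaves the remaining decisions to a human reviewer.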