Title: Towards Scalable Evaluation of Software Understanding: A Methodology Proposal
Authors: Magin, Florian; Wache, Magdalena; Scherf, Fabian William; Fischer, Cléo; Zabel, Jonas
Document type: Conference paper
Language: English
License: CC BY 4.0
Date issued: 2025-10-13
Date available: 2025-11-10
URI: https://publica.fraunhofer.de/handle/publica/498893
DOI: 10.1145/3733822.3764672; 10.24406/publica-6136 (https://doi.org/10.24406/publica-6136)
Keywords: Decompilation; Evaluation; Understanding; Large Language Models

Abstract: In reverse engineering, our goal is to build systems that help people understand software. However, the field has not converged on a way to measure software understanding. In this paper, we make the case that understanding should be measured via performance on understanding-questions. We propose a method for constructing understanding-questions and evaluating answers at scale. We conduct a case study in which we apply our method and compare Ghidra’s default auto analysis with an analysis that supports binary constructs specific to Objective-C.