Mechanistic understanding and validation of large AI models with SemanticLens

Dreyer, Maximilian; Berend, Jim; Labarta, Tobias; Vielhaben, Johanna; Wiegand, Thomas; Lapuschkin, Sebastian Roland; Samek, Wojciech

doi:10.1038/s42256-025-01084-w

2025

Journal Article

Abstract

Unlike human-engineered systems, such as aeroplanes, for which the role and dependencies of each component are well understood, the inner workings of artificial intelligence models remain largely opaque, which hinders verifiability and undermines trust. Current approaches to neural network interpretability, including input attribution methods, probe-based analysis and activation visualization techniques, typically provide limited insights about the role of individual components or require extensive manual interpretation that cannot scale with model complexity. This paper introduces SemanticLens, a universal explanation method for neural networks that maps hidden knowledge encoded by components (for example, individual neurons) into the semantically structured, multimodal space of a foundation model such as CLIP. In this space, unique operations become possible, including (1) textual searches to identify neurons encoding specific concepts, (2) systematic analysis and comparison of model representations, (3) automated labelling of neurons and explanation of their functional roles, and (4) audits to validate decision-making against requirements. Fully scalable and operating without human input, SemanticLens is shown to be effective for debugging and validation, summarizing model knowledge, aligning reasoning with expectations (for example, adherence to the ABCDE rule in melanoma classification) and detecting components tied to spurious correlations and their associated training data. By enabling component-level understanding and validation, the proposed approach helps mitigate the opacity that limits confidence in artificial intelligence systems compared to traditional engineered systems, enabling more reliable deployment in critical applications.
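To make operation (1), the textual search for neurons that encode a given concept, more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each neuron is represented by the mean embedding of its most strongly activating examples and that a text query is embedded in the same space; the abstract does not specify either detail. The three-dimensional vectors are toy stand-ins for real foundation-model (for example, CLIP) embeddings, and the names `embed_neuron` and `search` are hypothetical.

```python
import numpy as np


def normalize(v):
    """Scale vectors to unit length along the last axis."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def embed_neuron(example_embeddings):
    """Assumed neuron representation: the mean of the unit-normalized
    embeddings of its most activating examples, re-normalized."""
    return normalize(normalize(example_embeddings).mean(axis=0))


def search(query_embedding, neuron_embeddings, top_k=2):
    """Rank neurons by cosine similarity to the query embedding."""
    sims = normalize(neuron_embeddings) @ normalize(query_embedding)
    order = np.argsort(-sims)[:top_k]
    return [(int(i), float(sims[i])) for i in order]


# Toy stand-ins for embeddings of each neuron's top activating examples.
# A real pipeline would obtain these from a foundation-model image encoder.
examples_per_neuron = [
    [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],  # neuron 0
    [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]],  # neuron 1
    [[0.0, 0.0, 1.0], [0.0, 0.1, 0.9]],  # neuron 2
]
neurons = np.stack([embed_neuron(e) for e in examples_per_neuron])

# Toy stand-in for the embedding of a text query from a text encoder.
query = [0.8, 0.2, 0.0]

results = search(query, neurons, top_k=2)
for index, score in results:
    print(f"neuron {index}: cosine similarity {score:.3f}")
```

Because neurons and queries share one embedding space in this sketch, the same similarity matrix could in principle also support comparing neurons with each other or matching them against a vocabulary of candidate labels.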