April 2025
Conference Paper
Title
Visual Latent Captioning - Towards Verbalizing Vision Transformer Encoders
Abstract
Efficiently adapting large multimodal models to downstream tasks requires understanding how these models have learned their knowledge and how that information can be accessed and manipulated to achieve the desired performance. Beyond performance, transparency is also important for understanding and evaluating a model's behavior. In this paper, we propose a novel method for analyzing each layer of the vision encoder within vision-language models in the form of natural language. We leverage the model's multimodal text decoder to generate captions for the visual features at each layer of its transformer-based vision encoder; in essence, the model interprets its own components by translating information from one modality into another. We then use a large language model to interpret the generated captions and provide insight into the type of information represented at each layer of the vision encoder. Specifically, we track detectable visual classes, such as actions, objects, and colors, to determine the layer at which sufficient visual information has accumulated to form more complex descriptions. We find that the textual representations develop progressively across the layers, starting from simple visual characteristics and evolving into complex scene descriptions featuring multiple objects. Action detection begins with a prototypical action generated at layer 18, which is then refined in later layers. Our code is available at https://github.com/SogolHaghighat/latent_verbalizer.
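As a rough illustration of the layer-wise captioning idea described in the abstract, the sketch below feeds the hidden states of each vision-encoder layer of an off-the-shelf BLIP captioner into that model's own text decoder. The choice of model (Salesforce/blip-image-captioning-base), the prompt, the example image path, and the direct reuse of raw intermediate states without any extra normalization or projection are assumptions made for illustration only; the paper's actual model and decoding pipeline may differ.

```python
# Illustrative sketch (not the paper's released code): caption the features of
# every vision-encoder layer with the model's own text decoder, using an
# off-the-shelf BLIP captioner from Hugging Face Transformers.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

name = "Salesforce/blip-image-captioning-base"  # assumed model for illustration
processor = BlipProcessor.from_pretrained(name)
model = BlipForConditionalGeneration.from_pretrained(name)
model.eval()

image = Image.open("example.jpg").convert("RGB")  # any test image
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# BLIP's generate() replaces the tokenizer's [CLS] with a BOS token and drops
# the trailing [SEP]; mirror that here for the text prompt.
prompt = processor(text="a photo of", return_tensors="pt").input_ids
prompt[:, 0] = model.config.text_config.bos_token_id
prompt = prompt[:, :-1]

with torch.no_grad():
    # Hidden states of the patch embeddings plus every encoder layer.
    vision_out = model.vision_model(pixel_values=pixel_values, output_hidden_states=True)
    for idx, feats in enumerate(vision_out.hidden_states):
        # Naive variant: the raw intermediate states are passed to the decoder's
        # cross-attention unchanged (no extra normalization or projection).
        caption_ids = model.text_decoder.generate(
            input_ids=prompt,
            encoder_hidden_states=feats,
            encoder_attention_mask=torch.ones(feats.shape[:-1], dtype=torch.long),
            eos_token_id=model.config.text_config.sep_token_id,
            pad_token_id=model.config.text_config.pad_token_id,
            max_new_tokens=30,
        )
        print(f"layer {idx:2d}: {processor.decode(caption_ids[0], skip_special_tokens=True)}")
```

In this sketch, index 0 corresponds to the patch-embedding output and subsequent indices to the transformer layers; the per-layer captions would then be handed to a large language model for the class-level analysis (actions, objects, colors) described above.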
Author(s)