April 2025
Conference Paper
Title

Visual Latent Captioning - Towards Verbalizing Vision Transformer Encoders

Abstract
The efficient adaptability of large multimodal models to downstream tasks depends on understanding how these models have learned their knowledge and how the information can be accessed and manipulated to achieve the desired performance. Beyond performance, transparency is also important for understanding and evaluating a model’s behavior. In this paper, we propose a novel method for analyzing each layer of the vision encoder within vision-language models in the form of natural language. We leverage the model’s multimodal text decoder to generate captions for visual features at each layer of its transformer-based vision encoder. In essence, we use the model to interpret its own components by translating information from one modality to another. Subsequently, we use a large language model to interpret the generated captions to provide insight into the type of information represented at each layer of the vision encoder. We specifically track detectable visual classes, such as actions, objects, and colors, to determine in which layer sufficient visual information has been accumulated to form more complex descriptions. We find that the textual representations develop while progressing through the layers, starting from simple visual characteristics and ending with complex scene descriptions featuring multiple objects. The detection of actions begins with a prototypical action generated in layer 18, which is then refined in later layers. Our code is available online (https://github.com/SogolHaghighat/latent_verbalizer).
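The tracking step described in the abstract can be illustrated with a minimal sketch. It assumes that per-layer captions have already been generated, and it substitutes simple keyword matching for the large language model that the authors use to interpret captions. The vocabularies, function names, and toy captions below are hypothetical and are not taken from the paper or its repository.

```python
import re

# Hypothetical category vocabularies. The paper uses an LLM for this step;
# keyword matching is swapped in here purely for illustration.
CATEGORY_VOCAB = {
    "color": {"red", "green", "blue", "white", "black"},
    "object": {"dog", "ball", "grass", "car"},
    "action": {"running", "playing", "jumping", "sitting"},
}


def detect_categories(caption):
    """Return the set of categories whose vocabulary appears in the caption."""
    tokens = set(re.findall(r"[a-z]+", caption.lower()))
    return {cat for cat, words in CATEGORY_VOCAB.items() if tokens & words}


def first_layer_per_category(captions_by_layer):
    """Map each category to the earliest layer whose caption mentions it."""
    first_seen = {}
    for layer in sorted(captions_by_layer):
        for cat in detect_categories(captions_by_layer[layer]):
            # setdefault keeps the earliest layer and ignores later hits.
            first_seen.setdefault(cat, layer)
    return first_seen


# Invented toy captions keyed by layer index, NOT results from the paper.
toy_captions = {
    2: "a green and white pattern",
    8: "a dog on green grass",
    14: "a dog running on the grass with a red ball",
}

if __name__ == "__main__":
    print(first_layer_per_category(toy_captions))
```

With real per-layer captions in place of the toy dictionary, the same loop would report the first layer at which each visual class becomes detectable, which is the kind of per-layer summary the abstract describes.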
Author(s)
Haghighat, Sogol
Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS  
Metzler, Tim Daniel
Fachhochschule Bonn-Rhein-Sieg
Thoduka, Santosh
Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS  
Houben, Sebastian
Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS  
Mainwork
Advances in Information Retrieval. 47th European Conference on Information Retrieval, ECIR 2025. Proceedings. Part II  
Conference
European Conference on Information Retrieval 2025  
DOI
10.1007/978-3-031-88711-6_25
Language
English
Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS  
Keyword(s)
  • Interpretability
  • Large Language Models
  • Multimodal Models
  • Transformer Vision Encoder
  • Vision-Language Models
