Date
May 2025
Type
Master Thesis
Title
Knowledge Distillation for Sensory Substitution in Multimodal Models
Abstract
In recent years, multimodal models have achieved strong performance on tasks such as visual question answering (VQA), largely relying on RGB images. However, RGB is not always available, nor is it the most informative modality for every task. This work introduces a multimodal knowledge distillation framework for VQA in which knowledge is transferred from a large teacher model, pre-trained on RGB images, to a smaller student model trained only on depth images. We propose a two-phase training pipeline to bridge the modality gap. In the first phase, contrastive distillation aligns the depth features of the student’s vision encoder with the teacher’s RGB features. In the second phase, we apply the Logit Calibration (LoCa) loss to refine the student’s output, correcting overconfident teacher predictions while preserving inter-class relationships. Combined with standard cross-entropy supervision, this hybrid objective improves robustness and accuracy on category-specific question types. Experiments on a recreated version of the VQA-SUNRGBD dataset show that our student model, using only depth data, achieves 47.5% accuracy on the validation set and 45.5% on the test set. To evaluate semantic understanding, we additionally report a neural similarity metric, under which the model scores 69.5% and 67.9% on the validation and test sets, respectively. Notably, our depth-only model outperforms RGB-trained models such as Pixtral-12B, OneVision-7B (pre-trained and fine-tuned), and OneVision-0.5B (both RGB-pre-trained and depth-pre-trained), particularly on directional, object-identification, and counting questions. This work shows that the key to effective multimodal knowledge distillation lies in the structured integration of feature alignment and calibrated supervision, achieved through modality-aware loss functions that are carefully selected and tuned for each phase of learning. Our framework provides a scalable method for transferring pre-trained multimodal knowledge to depth-only models, enabling depth-based inference with RGB-pre-trained vision-language models (VLMs).
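As a rough illustration of the two-phase objective described in the abstract, the sketch below shows how contrastive feature alignment (phase one) and a calibrated distillation loss combined with cross-entropy (phase two) could be expressed in PyTorch. All function names and hyperparameters (temperature, tau, alpha, margin) are assumptions made for illustration, and the margin-based teacher correction is a simplified stand-in for the LoCa loss, not the thesis's actual implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(depth_feats, rgb_feats, temperature=0.07):
    """Phase 1 (sketch): InfoNCE alignment of student depth features
    with frozen teacher RGB features of the same scenes."""
    z_s = F.normalize(depth_feats, dim=-1)   # (B, D) student depth features
    z_t = F.normalize(rgb_feats, dim=-1)     # (B, D) teacher RGB features
    logits = z_s @ z_t.t() / temperature     # (B, B) pairwise similarities
    targets = torch.arange(z_s.size(0), device=z_s.device)  # matching pairs on diagonal
    return F.cross_entropy(logits, targets)

def calibrated_distillation_loss(student_logits, teacher_logits, labels,
                                 tau=2.0, margin=0.05):
    """Phase 2 (sketch): simplified logit calibration. When the teacher's
    argmax misses the ground truth, rescale its non-target probabilities so
    the true class wins by `margin` while the relative order of the other
    classes is preserved, then distill with KL at temperature tau."""
    with torch.no_grad():
        p_t = F.softmax(teacher_logits / tau, dim=-1)
        p_target = p_t.gather(1, labels.unsqueeze(1))        # (B, 1) true-class prob
        p_max = p_t.max(dim=-1, keepdim=True).values         # (B, 1) teacher's top prob
        wrong = (p_max - p_target) > 0                       # teacher is overconfidently wrong
        new_target = (p_max + margin).clamp(max=1.0 - 1e-4)  # corrected true-class prob
        # rescale the remaining mass so each corrected row still sums to 1
        scale = (1.0 - new_target) / (1.0 - p_target).clamp_min(1e-8)
        p_cal = torch.where(wrong, p_t * scale, p_t)
        p_cal.scatter_(1, labels.unsqueeze(1),
                       torch.where(wrong, new_target, p_target))
    log_p_s = F.log_softmax(student_logits / tau, dim=-1)
    return F.kl_div(log_p_s, p_cal, reduction="batchmean") * tau ** 2

def hybrid_objective(student_logits, teacher_logits, labels, alpha=0.5):
    """Phase 2 total loss: calibrated distillation mixed with
    standard cross-entropy supervision (alpha is illustrative)."""
    kd = calibrated_distillation_loss(student_logits, teacher_logits, labels)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

# Illustrative usage, assuming hypothetical encoders and heads:
#   phase 1: loss = contrastive_alignment_loss(student_enc(depth), teacher_enc(rgb))
#   phase 2: loss = hybrid_objective(student_head(depth), teacher_head(rgb), answers)
```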
Thesis Note
Koblenz, TU, Master Thesis, 2025
Author(s)
Ahmed Navid, Shayekh Mohiuddin
Advisor(s)
Open Access
File(s)
Rights
CC BY-SA 4.0: Creative Commons Attribution-ShareAlike 4.0 International
Language
English