Fraunhofer-Gesellschaft
2025
Conference Paper
Title

Learning Robust Aligned Representations Across Multiple Visual Modalities in Human Action Recognition

Abstract
We propose Cross-Modal Video Representation Alignment (CMVRA), a framework for human action recognition that uses multiple visual modalities, namely RGB, infrared (IR), depth, and skeleton data, to learn robust, generalizable representations with reduced reliance on annotated data. CMVRA uses contrastive learning to align these modalities in a unified multi-modal embedding, which lets the model integrate complementary information, increases robustness and feature diversity, and improves recognition performance in diverse and challenging contexts. We also show that fully integrated multi-modal alignment outperforms traditional pairwise alignment strategies. Extensive experiments on the NTU and Drive&Act datasets confirm the effectiveness of our approach. CMVRA achieves a 3.01% improvement in 3D skeleton-based activity recognition on Drive&Act, outperforming state-of-the-art methods. Experiments on NTU show that CMVRA closes the gap between self-supervised and supervised learning methods. These results highlight the potential of self-supervised multi-modal activity recognition and the benefits of contrastive learning across diverse modalities. Our findings suggest promising directions for future research, particularly in multi-modal contrastive learning and the integration of vision-language models for action recognition. Our code and the generated video captions will be made available on GitHub.
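The abstract does not state the exact training objective, so the following is only a minimal sketch of what contrastive alignment across all four modalities could look like. It assumes a generic symmetric InfoNCE loss averaged over every modality pair, and it is not the authors' implementation. The function names (`info_nce`, `joint_alignment_loss`), the temperature value, and the toy embeddings are all hypothetical.

```python
import math
from itertools import combinations


def l2_normalize(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def info_nce(anchors, targets, temperature=0.1):
    """Generic InfoNCE: anchors[i] should match targets[i] within the batch."""
    a = [l2_normalize(v) for v in anchors]
    t = [l2_normalize(v) for v in targets]
    loss = 0.0
    for i, anchor in enumerate(a):
        # Cosine similarity of this anchor to every target, scaled by temperature.
        logits = [sum(x * y for x, y in zip(anchor, target)) / temperature
                  for target in t]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
        loss += log_denominator - logits[i]
    return loss / len(a)


def joint_alignment_loss(embeddings, temperature=0.1):
    """Average the symmetric InfoNCE loss over every pair of modalities.

    `embeddings` maps a modality name to a list of per-clip vectors, where
    index i refers to the same clip in every modality (an assumption).
    """
    pairs = list(combinations(sorted(embeddings), 2))
    total = 0.0
    for first, second in pairs:
        forward = info_nce(embeddings[first], embeddings[second], temperature)
        backward = info_nce(embeddings[second], embeddings[first], temperature)
        total += 0.5 * (forward + backward)
    return total / len(pairs)


# Toy batch of two clips with 2-D embeddings (illustrative values only).
MODALITIES = ("rgb", "ir", "depth", "skeleton")
aligned = {name: [[1.0, 0.0], [0.0, 1.0]] for name in MODALITIES}
# Same batch, but the skeleton embeddings are swapped between the two clips.
misaligned = dict(aligned, skeleton=[[0.0, 1.0], [1.0, 0.0]])

aligned_loss = joint_alignment_loss(aligned)
misaligned_loss = joint_alignment_loss(misaligned)
```

In this toy batch the loss is lower when all four modalities agree on which clip is which than when the skeleton embeddings are swapped. A pairwise strategy would instead sum the loss only over pairs that include a single reference modality. Whether that matches the pairwise baselines in the paper is not stated in the abstract.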
Author(s)
Lerch, David
Fraunhofer-Institut für Optronik, Systemtechnik und Bildauswertung IOSB  
Rothenburger, Bastian
Goethe University Frankfurt
Zhong, Zeyun
Fraunhofer-Institut für Optronik, Systemtechnik und Bildauswertung IOSB  
Martin, Manuel  
Fraunhofer-Institut für Optronik, Systemtechnik und Bildauswertung IOSB  
Diederichs, Frederik  
Fraunhofer-Institut für Optronik, Systemtechnik und Bildauswertung IOSB  
Stiefelhagen, Rainer  
Karlsruhe Institute of Technology (KIT)
Mainwork
IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025. Proceedings  
Conference
International Conference on Computer Vision Workshops 2025  
DOI
10.1109/ICCVW69036.2025.00281
Language
English
Fraunhofer-Institut für Optronik, Systemtechnik und Bildauswertung IOSB  
Keyword(s)
  • Visualization
  • Three-dimensional displays
  • Conferences
  • Supervised learning
  • Contrastive learning
  • Skeleton
  • Robustness
  • Human activity recognition
  • Videos
  • Software development management
  • self-supervised learning
  • multi-modal learning
  • representation learning
  • activity recognition