2025
Conference Paper
Title
Learning Robust Aligned Representations Across Multiple Visual Modalities in Human Action Recognition
Abstract
We propose Cross-Modal Video Representation Alignment (CMVRA), a framework for human action recognition that leverages four visual modalities, namely RGB, infrared (IR), depth, and skeleton data, to learn robust, generalizable representations with reduced reliance on annotated data. CMVRA uses contrastive learning to align these modalities in a unified multi-modal embedding space, which lets the model integrate complementary information, increases robustness and feature diversity, and improves recognition performance in diverse and challenging contexts. We further show that fully integrated multi-modal alignment outperforms traditional pairwise alignment strategies. Extensive experiments on the NTU and Drive&Act datasets confirm the effectiveness of our approach. CMVRA achieves a 3.01% improvement in 3D skeleton-based activity recognition on Drive&Act, outperforming state-of-the-art methods. Experiments on NTU show that CMVRA closes the gap between self-supervised and supervised learning methods. These results highlight the potential of self-supervised multi-modal activity recognition and the benefits of contrastive learning across diverse modalities. Our findings suggest promising directions for future research, particularly in multi-modal contrastive learning and the integration of vision-language models for action recognition. Our code and the generated video captions will be made available on GitHub.
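The abstract does not specify the training objective, so the following is only a minimal sketch of what contrastive alignment across several modalities could look like, not the CMVRA implementation. It assumes an InfoNCE-style loss in which clip i in one modality should match clip i in another. The function names, the `anchor` parameter, and the temperature value are hypothetical. With `anchor=None` every pair of modalities is aligned; with an anchor, each modality is aligned only to that one, which is one possible reading of a "pairwise" strategy.

```python
import math
from itertools import combinations


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(anchors, targets, temperature=0.1):
    """One-directional InfoNCE: anchors[i] should be closest to targets[i]."""
    anchors = [l2_normalize(v) for v in anchors]
    targets = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, a in enumerate(anchors):
        # Cosine similarity to every target clip, scaled by the temperature.
        logits = [sum(x * y for x, y in zip(a, t)) / temperature for t in targets]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(anchors)


def alignment_loss(embeddings, anchor=None, temperature=0.1):
    """Average symmetric InfoNCE over modality pairs.

    embeddings: dict mapping a modality name to a list of per-clip vectors,
                with clip i referring to the same video in every modality.
    anchor:     None aligns all pairs of modalities (assumed "joint" variant);
                a modality name aligns every other modality to it only.
    """
    names = sorted(embeddings)
    if anchor is None:
        pairs = list(combinations(names, 2))
    else:
        pairs = [(anchor, n) for n in names if n != anchor]
    loss = 0.0
    for m1, m2 in pairs:
        forward = info_nce(embeddings[m1], embeddings[m2], temperature)
        backward = info_nce(embeddings[m2], embeddings[m1], temperature)
        loss += 0.5 * (forward + backward)
    return loss / len(pairs)


# Toy data: two clips, four modalities, 2-D embeddings (illustrative only).
aligned = {name: [[1.0, 0.0], [0.0, 1.0]]
           for name in ("rgb", "ir", "depth", "skeleton")}
shuffled = dict(aligned, depth=[[0.0, 1.0], [1.0, 0.0]])  # depth clips swapped

print(alignment_loss(aligned), alignment_loss(shuffled))
print(alignment_loss(shuffled, anchor="rgb"))
```

In this toy setup the loss is close to zero when all modalities agree on which clip is which and grows when one modality is mismatched. In practice the vectors would come from per-modality encoders trained by minimizing such a loss.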
Author(s)