
Analysis of Deep Fusion Strategies for Multi-modal Gesture Recognition

 
Authors: Roitberg, Alina; Pollert, Tim; Haurilet, Monica; Martin, Manuel; Stiefelhagen, Rainer

Fulltext (PDF)

Institute of Electrical and Electronics Engineers -IEEE-; IEEE Computer Society:
IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2019. Proceedings : 16-20 June 2019, Long Beach, California
Los Alamitos, Calif.: IEEE Computer Society Conference Publishing Services (CPS), 2019
ISBN: 978-1-7281-2507-7
ISBN: 978-1-7281-2506-0
pp. 198-206
32nd Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019, Long Beach, Calif.
Funding: Bundesministerium für Bildung und Forschung BMBF (German Federal Ministry of Education and Research)
PaKoS: Personalized, adaptive, cooperative systems for automated vehicles
English
Conference Paper, Electronic Publication
Fraunhofer IOSB

Abstract
Video-based gesture recognition has a wide spectrum of applications, ranging from sign language understanding to driver monitoring in autonomous cars. As each sensor suffers from its individual limitations, combining multiple sources has strong potential to improve the results. A number of deep architectures have been proposed to recognize gestures from, e.g., both color and depth data. Conventionally, however, these models comprise a separate network for each modality, and the networks are combined only in the final layer (e.g., via simple score averaging). In this work, we take a closer look at different fusion strategies for gesture recognition, focusing especially on the information exchange in the intermediate layers. We compare three fusion strategies on the widely used C3D architecture: 1) late fusion, combining the streams in the final layer; 2) information exchange in an intermediate layer using an additional convolution layer; and 3) linking information at multiple layers simultaneously using cross-stitch units, originally designed for multi-task learning. Our proposed C3D-Stitch model achieves the best recognition rate, demonstrating the effectiveness of sharing information at earlier stages.
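The cross-stitch mechanism referenced in the abstract can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the variable names, toy activations, and mixing weights below are illustrative assumptions. The core idea (from cross-stitch units for multi-task learning) is that each stream's output is a learned linear combination of both streams' activations, so information can flow between modalities at intermediate layers.

```python
import numpy as np

def cross_stitch(x_a, x_b, alpha):
    """Mix two streams' activations with a 2x2 matrix of learned weights.

    alpha[i, j] weights the contribution of input stream j to output
    stream i; the combination is applied element-wise.
    """
    out_a = alpha[0, 0] * x_a + alpha[0, 1] * x_b
    out_b = alpha[1, 0] * x_a + alpha[1, 1] * x_b
    return out_a, out_b

# Toy activations standing in for intermediate feature maps of a
# color stream and a depth stream (shapes and values are made up).
x_rgb = np.ones((2, 3))
x_depth = 2 * np.ones((2, 3))

# Mostly-diagonal mixing: each stream keeps 90% of its own signal and
# receives 10% from the other modality. With alpha = identity, the two
# streams stay fully independent (plain two-stream late fusion).
alpha = np.array([[0.9, 0.1],
                  [0.1, 0.9]])

y_rgb, y_depth = cross_stitch(x_rgb, x_depth, alpha)
# y_rgb[0, 0]   = 0.9 * 1 + 0.1 * 2 ≈ 1.1
# y_depth[0, 0] = 0.1 * 1 + 0.9 * 2 ≈ 1.9
```

In the paper's C3D-Stitch variant, such mixing is applied at several layers simultaneously rather than at a single fusion point, which is what lets earlier stages share information.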

URL: http://publica.fraunhofer.de/documents/N-583060.html