Title: A Transformer-based Late-Fusion Mechanism for Fine-Grained Object Recognition in Videos
Authors: Koch, Jannik; Wolf, Stefan; Beyerer, Jürgen
Date: 2023-02-16
Year: 2023
URL: https://publica.fraunhofer.de/handle/publica/436069
DOI: 10.1109/wacvw58289.2023.00015
Language: en
Type: conference paper

Abstract: Fine-grained image classification is limited by only considering a single view while in many cases, like surveillance, a whole video exists which provides multiple perspectives. However, the potential of videos is mostly considered in the context of action recognition while fine-grained object recognition is rarely considered as an application for video classification. This leads to recent video classification architectures being inappropriate for the task of fine-grained object recognition. We propose a novel, Transformer-based late-fusion mechanism for fine-grained video classification. Our approach achieves superior results to both early-fusion mechanisms, like the Video Swin Transformer, and a simple consensus-based late-fusion baseline with a modern Swin Transformer backbone. Additionally, we achieve improved efficiency, as our results show a high increase in accuracy with only a slight increase in computational complexity. Code is available at: https://github.com/wolfstefan/tlf.
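
Illustrative note: the abstract contrasts a consensus-based late-fusion baseline with a Transformer-based late-fusion mechanism. The sketch below is not taken from the paper or its repository. It is a minimal toy, under stated assumptions, of what the two ideas could look like in general: `consensus_fusion` averages per-frame class probabilities, while `attention_fusion` lets per-frame feature vectors attend to one another before pooling. All function names are hypothetical, and the attention step uses identity query/key/value projections and a single head purely for brevity. A real model would learn these projections and add a classification head.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def consensus_fusion(frame_logits):
    """Consensus-style late fusion (assumed form): classify each frame
    independently, then average the per-frame class probabilities."""
    probs = [softmax(logits) for logits in frame_logits]
    n_frames = len(probs)
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / n_frames for c in range(n_classes)]


def attention_fusion(frame_features):
    """Toy attention-based late fusion: single-head self-attention across
    frame features (identity projections, an assumption), then mean pooling
    into one video-level feature vector."""
    dim = len(frame_features[0])
    attended = []
    for query in frame_features:
        # Scaled dot-product score of this frame against every frame.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in frame_features
        ]
        weights = softmax(scores)
        # Weighted sum of all frame features for this query frame.
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, frame_features)) for i in range(dim)]
        )
    n_frames = len(attended)
    return [sum(a[i] for a in attended) / n_frames for i in range(dim)]


if __name__ == "__main__":
    # Two frames, two classes: the frames disagree, so the consensus is split.
    print(consensus_fusion([[2.0, 0.0], [0.0, 2.0]]))
    # Three frames with 2-dimensional features fused into one vector.
    print(attention_fusion([[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]))
```

In both functions the frames are processed individually first and combined only afterwards, which is what distinguishes late fusion from early-fusion architectures that mix frames inside the backbone.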