Exploring CrossViT for Robust Face Recognition in Low-Resolution Scenarios

Vu, Thien Nam

2026

Bachelor Thesis

Abstract

Recent progress in face recognition (FR) has been driven largely by deep convolutional neural networks and, more recently, Vision Transformers (ViTs). While ViTs provide strong global feature modeling and have achieved state-of-the-art performance on high-resolution (HR) datasets, their effectiveness on low-resolution (LR) inputs remains limited. LR face images appear frequently in real-world settings, such as surveillance, mobile capture,... Standard ViTs, which split each image into patches of a single fixed size, typically struggle in these settings. To address this limitation, this work investigates the CrossViT architecture, an efficient multi-scale ViT originally proposed for generic image classification, and evaluates its suitability for robust LR FR. CrossViT processes images in parallel branches with different patch sizes and fuses the branches through cross-attention, enabling the model to capture both global structure and local detail simultaneously. In this thesis, we systematically compare CrossViT variants against standard ViT baselines using large-scale FR training data and multiple evaluation benchmarks, including challenging LR datasets such as TinyFace. We further analyze performance trade-offs, discuss limitations, and outline potential extensions for future research.
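
The two-branch idea described above can be illustrated with a minimal sketch in plain Python. It is not the thesis implementation: the image size (112), the patch sizes (8 and 16), and the function names `num_patches` and `cross_attend` are assumptions chosen for illustration, and the attention step omits the learned projections, multiple heads, and normalization layers a real model would use. It only shows how two patch sizes yield different token counts for the same image, and how a class token from one branch could attend over the tokens of the other.

```python
import math


def num_patches(image_size: int, patch_size: int) -> int:
    """Count non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    return (image_size // patch_size) ** 2


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(cls_token, other_tokens):
    """Simplified single-head cross-attention (no learned projections).

    The class token of one branch is the query; the tokens of the other
    branch serve as both keys and values. Returns the updated class token
    (with a residual connection) and the attention weights.
    """
    dim = len(cls_token)
    scale = 1.0 / math.sqrt(dim)
    scores = [
        scale * sum(q * k for q, k in zip(cls_token, token))
        for token in other_tokens
    ]
    weights = softmax(scores)
    attended = [
        sum(w * token[i] for w, token in zip(weights, other_tokens))
        for i in range(dim)
    ]
    fused = [c + a for c, a in zip(cls_token, attended)]
    return fused, weights


# Assumed example values: a 112x112 input with patch sizes 8 and 16.
fine_count = num_patches(112, 8)     # 14 * 14 = 196 tokens
coarse_count = num_patches(112, 16)  # 7 * 7 = 49 tokens

# Toy 2-dimensional tokens, purely illustrative.
coarse_cls = [1.0, 0.0]
fine_tokens = [[1.0, 0.0], [0.0, 1.0]]
fused_cls, attn = cross_attend(coarse_cls, fine_tokens)
print(fine_count, coarse_count, fused_cls, attn)
```

In this toy setting the smaller patch size produces four times as many tokens as the larger one, which is the sense in which one branch carries finer local detail while the other summarizes coarser structure.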

Thesis Note

Darmstadt, TU, Bachelor Thesis, 2026

Author(s)

Vu, Thien Nam

Fraunhofer-Institut für Graphische Datenverarbeitung IGD

Advisor(s)

Damer, Naser

Fraunhofer-Institut für Graphische Datenverarbeitung IGD

Chettaoui, Tahar

Fraunhofer-Institut für Graphische Datenverarbeitung IGD
