Title: Can Masked Autoencoders Also Listen to Birds?
Authors: Rauch, Lukas; Heinrich, René Patrick Gerald; Moummad, Ilyass; Joly, Alexis; Sick, Bernhard; Scholz, Christoph
Type: journal article
Language: en
Published: 2025-08
Record dates: 2025-12-04; 2025-12-08; 2026-01-27
URL: https://publica.fraunhofer.de/handle/publica/500184

Abstract: Masked Autoencoders (MAEs) learn rich representations in audio classification through an efficient self-supervised reconstruction task. Yet, general-purpose models struggle in fine-grained audio domains such as bird sound classification, which demands distinguishing subtle inter-species differences under high intra-species variability. We show that bridging this domain gap requires full-pipeline adaptation beyond domain-specific pretraining data. Using BirdSet, a large-scale bioacoustic benchmark, we systematically adapt pretraining, fine-tuning, and frozen feature utilization. Our Bird-MAE sets new state-of-the-art results on BirdSet's multi-label classification benchmark. Additionally, we introduce parameter-efficient prototypical probing, which boosts the utility of frozen MAE features by achieving up to 37 mAP points over linear probes and narrowing the gap to fine-tuning in low-resource settings. Bird-MAE also exhibits strong few-shot generalization with prototypical probes on our newly established few-shot benchmark on BirdSet, underscoring the importance of tailored self-supervised learning pipelines for fine-grained audio domains.

Keywords: self-supervised learning; masked autoencoders; bird-sound classification; bioacoustics; prototypical probing; few-shot learning; multi-label classification