A Comparative Evaluation of Vision Language Models for Waste Classification in Few-Shot Settings

Funk, Jonas; Bäcker, Paul; Roming, Lukas; Josekutty, Jerardh; Maier, Georg; Längle, Thomas

doi:10.11159/mvml25.113

2025

Conference Paper

Abstract

Efficient waste classification is essential for sustainable waste management systems. Accurate sorting can significantly enhance recycling efforts and reduce pollution. However, traditional computer vision methods often require large annotated datasets and extensive retraining, which limits their adaptability to varying waste types and challenging real-world conditions. In this study, we evaluate the potential of Multimodal Large Language Models (MLLMs) and Vision-Language Models (VLMs) for adaptive waste classification, focusing on zero-shot and few-shot learning scenarios. Using datasets such as TrashNet and our custom MultiWaste dataset, we test a method, VLM-NN, that uses a CLIP VLM for feature extraction and a simple Nearest Neighbour approach for classification. This method shows robust few-shot capabilities and excellent scalability, achieving an accuracy of 97.74% on TrashNet. While MLLMs exhibit strong zero-shot capabilities, their utility diminishes as the number of labelled samples increases, owing to their high computational costs. In contrast, VLM-NN is computationally efficient but struggles with extremely limited training data. Our results show the potential of Large Pretrained Models for the task of waste classification and provide guidance on which model architectures to consider for different amounts of training data.
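
The nearest-neighbour step of a VLM-NN style classifier can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional vectors are toy stand-ins for embeddings that a CLIP image encoder would produce, the function names are hypothetical, and cosine similarity is an assumed metric, since the abstract does not state which distance is used.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two embeddings of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def nearest_neighbour_label(query: Vector,
                            support: List[Tuple[Vector, str]]) -> str:
    """Return the label of the labelled embedding most similar to the query."""
    if not support:
        raise ValueError("support set must contain at least one labelled sample")
    best_label, best_sim = support[0][1], -math.inf
    for embedding, label in support:
        sim = cosine_similarity(query, embedding)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label


# Toy few-shot support set: one labelled "embedding" per class.
# In practice these would come from a frozen CLIP image encoder (assumption).
SUPPORT = [
    ([0.9, 0.1, 0.0], "glass"),
    ([0.1, 0.9, 0.1], "paper"),
    ([0.0, 0.2, 0.9], "metal"),
]

prediction = nearest_neighbour_label([0.8, 0.2, 0.1], SUPPORT)
print(prediction)
```

Because the encoder stays frozen in this sketch, adding a new labelled sample only means appending one more pair to the support set, with no retraining. This is one plausible reading of why such an approach scales well as labelled data grows.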