Evaluation of vision-language models for waste classification: Zero- and few-shot with a training-free tip-adapter

Funk, Jonas; Bäcker, Paul; Roming, Lukas; Josekutty, Jerardh; Maier, Georg; Längle, Thomas

doi:10.1016/j.clwas.2026.100475

2026

Journal Article

Abstract

Accurate waste classification is essential for effective recycling. For household waste, changing recycling policies and the introduction of new materials in consumer products continually reshape sorting categories across regions and over time. These changes challenge existing sensor-based sorting systems trained for a specific classification task, and restoring accuracy with fully supervised retraining is costly because it requires new labeled data and full model updates. We investigate whether foundation models can deliver accurate image-based waste classification with no or few labeled examples. We analyze existing methods such as multimodal large language models (MLLMs), vision–language models (VLMs), Vision Transformers (ViTs), and a baseline CNN across four datasets, including a new food vs. non-food packaging dataset, with varying numbers of labeled examples. Further, we propose adaptations to the existing approaches, such as chain-of-thought prompting for MLLMs, and ensemble prompting and a Tip-Adapter for VLMs. We show that MLLMs perform well on zero-shot classification and that larger models like GPT-4o improve further with a few examples, but at a computational cost that is infeasible for industry-scale inference. As a comparatively faster route to zero-shot classification, VLMs reach an accuracy of 90.4% on TrashNet; by contrast, a CNN typically needs a few hundred labeled images to achieve similar performance. Using the training-free Tip-Adapter with only 10 labeled example images per class lifts macro-F1 by 8.1 points over the zero-shot VLM baseline. Overall, we propose a guideline for language-driven, training-free methods for waste classification.
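
For readers unfamiliar with the training-free Tip-Adapter mentioned in the abstract, the following is a minimal sketch of how such a cache-based classifier is commonly formulated: zero-shot logits from image-text similarity are combined with a key-value cache built from the few labeled examples. It is not taken from the article. The function names, the hyperparameter defaults (alpha, beta, logit_scale), and the use of plain NumPy arrays in place of real VLM embeddings are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def tip_adapter_logits(query, text_weights, cache_keys, cache_labels,
                       alpha=1.0, beta=5.5, logit_scale=100.0):
    """Training-free cache classifier (illustrative sketch, assumed defaults).

    query:        (Q, D) image embeddings to classify
    text_weights: (C, D) one text embedding per class (e.g. averaged over
                  several prompts when ensemble prompting is used)
    cache_keys:   (N, D) image embeddings of the few labeled examples
    cache_labels: (N,)   integer class ids of those examples
    """
    query = l2_normalize(query)
    text_weights = l2_normalize(text_weights)
    cache_keys = l2_normalize(cache_keys)

    # Zero-shot term: scaled cosine similarity between images and class texts.
    zero_shot = logit_scale * (query @ text_weights.T)            # (Q, C)

    # Cache term: similarity to each labeled example, sharpened by beta,
    # then spread onto that example's class through one-hot labels.
    affinity = np.exp(-beta * (1.0 - query @ cache_keys.T))       # (Q, N)
    one_hot = np.eye(text_weights.shape[0])[cache_labels]         # (N, C)
    cache_logits = affinity @ one_hot                             # (Q, C)

    return zero_shot + alpha * cache_logits
```

In this sketch, setting alpha to 0 recovers the plain zero-shot classifier, so the cache term can be read as a correction driven only by the labeled examples, with no gradient updates involved.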