2025
Conference Paper
Title
Zero-Shot Open-Vocabulary OOD Object Detection and Grounding using Vision Language Models
Abstract
Automated driving involves complex perception tasks that require a precise understanding of diverse traffic scenarios and confident navigation. Traditional data-driven algorithms trained on closed-set data often fail to generalize to out-of-distribution (OOD) and edge cases. Recently, Large Vision Language Models (LVLMs) have shown potential in integrating the reasoning capabilities of language models to understand and reason about complex driving scenes, aiding generalization to OOD scenarios. However, grounding such OOD objects remains a challenging task. In this work, we propose zPROD, an automated framework for zero-shot, promptable, open-vocabulary OOD object detection, segmentation, and grounding in autonomous driving. We leverage LVLMs with visual grounding capabilities, eliminating the need for lengthy text communication and providing precise indications of OOD objects in the scene or on the track of the ego vehicle. We evaluate our approach on OOD datasets from existing road anomaly segmentation benchmarks, such as SMIYC and Fishyscapes. Our zero-shot approach shows superior performance on RoadAnomaly and RoadObstacle, achieves comparable results on the Fishyscapes subset relative to supervised models, and serves as a baseline for future zero-shot methods based on open-vocabulary OOD detection.
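The zPROD implementation is not reproduced on this page, but the core idea the abstract describes, prompting a grounding-capable vision language model with open-vocabulary phrases and treating detections outside the known driving classes as OOD candidates, can be sketched as follows. This is a minimal illustration only: it assumes the Hugging Face transformers Grounding DINO checkpoint IDEA-Research/grounding-dino-tiny as a stand-in grounding model, a hypothetical input frame road_scene.jpg, and an illustrative known-class list; it is not the authors' pipeline.

# Minimal sketch of zero-shot open-vocabulary OOD grounding (not the zPROD code).
import torch
from PIL import Image
from transformers import AutoProcessor, GroundingDinoForObjectDetection

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-tiny")
model = GroundingDinoForObjectDetection.from_pretrained(
    "IDEA-Research/grounding-dino-tiny"
).to(device)

image = Image.open("road_scene.jpg").convert("RGB")  # hypothetical input frame

# Open-vocabulary prompt: generic phrases for unexpected objects on the road.
# Grounding DINO expects lower-case phrases separated by periods.
prompt = "unusual object on the road. animal. debris. lost cargo."

inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs to boxes with their grounded phrases.
results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    target_sizes=[image.size[::-1]],
)[0]

# Key name for the grounded phrases varies across transformers versions.
phrases = results.get("text_labels", results["labels"])

# Illustrative in-distribution classes; detections grounded to anything
# else are flagged as OOD candidates.
known_classes = {"car", "truck", "bus", "person", "bicycle", "traffic sign"}
for box, score, phrase in zip(results["boxes"], results["scores"], phrases):
    if phrase not in known_classes:
        print(f"OOD candidate '{phrase}' (score {score:.2f}) at {box.tolist()}")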
Open Access
Rights
CC BY 4.0: Creative Commons Attribution
Language
English