August 7, 2025
Conference Paper
Title
EdgeYOLO+Depth: Extending object detection for real-time depth estimation on smartphones
Abstract
Monocular cameras are widely used in consumer devices, robotics, and industrial systems. While object detection is well established for identifying relevant objects, many applications additionally require per-instance depth information. However, depth estimation methods typically rely on complex models that are unsuitable for edge devices and real-time applications. We investigate lightweight approaches that add instance depth estimation to object detection systems without increasing computational requirements and that are optimized for real-time deployment on edge devices such as smartphones. We improve the existing DisNet approach, which uses only bounding-box dimensions as input and is compatible with any object detection model. Additionally, we extend an edge-optimized YOLO model with a depth estimation component by modifying only the model’s head (EdgeYOLO+Depth). We evaluate our approaches on an internal dataset and the KITTI dataset. On KITTI, our DisNet-based method achieves a relative error of 6.55 %, while our EdgeYOLO+Depth approach reaches 3.59 %, outperforming comparable methods. We demonstrate that, with proper tuning of the depth component’s weight in the loss function, the additional depth estimation has no negative impact on object detection results. Our experiments show robust model performance under minor input transformations (horizontal shifts: +0.24 pp relative error) and acceptable degradation under more substantial geometric changes (rotation: +3.86 pp relative error). Regarding inference time, our DisNet variant adds only 0.15 ms for depth estimation, while EdgeYOLO+Depth requires no additional inference time, delivering complete object detection with depth estimation at 52 FPS on a Samsung Galaxy S23. Our code is publicly available at https://github.com/christoph-i/EdgeDepth.
Author(s)
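
The abstract states that the DisNet-based method predicts instance depth from bounding-box dimensions alone, which is what makes it compatible with any detector. Below is a minimal sketch of such a regressor, assuming a small PyTorch MLP; the input features (normalized box width, height, and diagonal), the layer sizes, and the class name DisNetLike are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a DisNet-style instance depth regressor: a small MLP
# over bounding-box geometry only. Feature choice and layer sizes are
# assumptions for illustration; see the paper's repository for the
# actual implementation (https://github.com/christoph-i/EdgeDepth).
import torch
import torch.nn as nn


class DisNetLike(nn.Module):
    def __init__(self, in_features: int = 3, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # predicted instance depth (e.g., meters)
        )

    def forward(self, boxes: torch.Tensor) -> torch.Tensor:
        # boxes: (N, 3) with normalized width, height, and diagonal of each
        # detected bounding box; any object detector can supply these values.
        return self.mlp(boxes)


if __name__ == "__main__":
    model = DisNetLike()
    # Two hypothetical detections: (norm. width, norm. height, norm. diagonal).
    boxes = torch.tensor([[0.12, 0.30, 0.32], [0.05, 0.10, 0.11]])
    print(model(boxes).shape)  # torch.Size([2, 1])
```

Because such a network sees only box geometry rather than image pixels, its forward pass is tiny compared to the detector itself, which is consistent with the sub-millisecond depth overhead reported in the abstract.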