Automation for camera-only 6D Object Detection

Rojtberg, Pavel

2021

Doctoral Thesis

Abstract

Today, widespread deployment of Augmented Reality (AR) systems is only possible by means of computer vision frameworks like ARKit and ARCore, which abstract from specific devices, yet restrict the set of supported devices to those of the respective vendor. This thesis therefore investigates how to deploy AR systems to any device with an attached camera. One crucial part of an AR system is the detection of arbitrary objects in the camera frame, accompanied by the estimation of their 6D-pose. This increases the degree of scene understanding that AR applications require for placing augmentations in the real world. Currently, this understanding is limited to a coarse segmentation of the scene into planes, as provided by the aforementioned frameworks. Being able to reliably detect individual objects allows attaching specific augmentations, as required by e.g. AR maintenance applications. For this, we employ convolutional neural networks (CNNs) to estimate the 6D-pose of all visible objects from a single RGB image. Here, the addressed challenge is the automated training of the respective CNN models, given only the CAD geometry of the target object. First, we look at reconstructing the missing surface data in real-time, before we turn to the more general problem of bridging the domain gap between the non-photorealistic representation and the real-world appearance. To this end, we build upon generative adversarial network (GAN) models to formulate bridging the domain gap as an unsupervised learning problem. Our evaluation shows an improvement in model performance, while providing simplified handling compared to alternative solutions. Furthermore, the calibration data of the camera in use must be known for precise pose estimation. This data, again, is only available for the restricted set of devices that the proprietary frameworks support.
To lift this restriction, we propose a web-based camera calibration service that not only aggregates calibration data, but also guides users in the calibration of new cameras. Here, we first present a novel calibration-pose selection framework that reduces the number of required calibration images by 30% compared to existing solutions, while ensuring a repeatable and reliable calibration outcome. Then, we present an evaluation of different user guidance strategies, which allows choosing a setting suitable for most users. This enables even novice users to perform a precise camera calibration in about 2 minutes. Finally, we propose an efficient client-server architecture to deploy the aforementioned guidance on the web, making it available to the widest possible range of devices. This service is not restricted to AR systems, but allows the general deployment on the web of computer vision algorithms that rely on camera calibration data, which was previously not possible. Combined, these elements allow a semi-automatic deployment of AR systems that can detect any object with any camera.