Localization and Mapping of Monocular Cameras in Large Urban Environments using Virtual City Models

Pandikow, Lars

2019

Master Thesis

Abstract

In recent years there has been considerable progress on simultaneous localization and mapping (SLAM) for image sequences from monocular cameras. Recent methods use advances in machine learning to improve both the camera tracking and the reconstruction of the environment. One of these methods, CNN-SLAM, uses a neural network to estimate depth maps for keyframes and fuses them with stereo observations from neighboring image frames. To use the images within a geo-referenced context, they first have to be localized globally. Using additional sensors to determine the position and orientation of the camera is not only more expensive but also error-prone. This thesis proposes a real-time system that combines the methods of CNN-SLAM with image-based localization within a simple city model. The SLAM algorithm tracks the camera movement, while a depth estimation network enables the recovery of the scale of the scene. A genetic algorithm is implemented to quickly refine estimated camera poses by aligning synthetic views of the city model with semantic segmentations of the images. This not only localizes the camera trajectories but also helps to compensate for tracking errors caused by the SLAM algorithm. The evaluation showed the system's ability to compute correctly scaled trajectories, to compensate for tracking drift, and to densely reconstruct the scene in the vicinity of the camera. It also revealed weaknesses: image localization is unreliable without a constrained search space, tracking drifts during rotational movement, and the semantic segmentations are inaccurate.
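The pose refinement step mentioned in the abstract can be illustrated with a toy sketch. It is not the implementation from the thesis: the pose is reduced to a 2D offset, the hypothetical `render_mask` function stands in for rendering a synthetic view of the city model, and intersection-over-union (`iou`) against an observed mask stands in for the alignment score with a semantic segmentation. All names, parameters, and the choice of fitness function are assumptions made for illustration.

```python
import random


def render_mask(pose, size=32, w=10, h=8):
    """Toy stand-in for a synthetic city-model view (assumption):
    a w x h 'building' rectangle whose corner sits at the rounded pose offset."""
    x0, y0 = round(pose[0]), round(pose[1])
    return {(x, y)
            for x in range(x0, x0 + w)
            for y in range(y0, y0 + h)
            if 0 <= x < size and 0 <= y < size}


def iou(a, b):
    """Intersection-over-union of two pixel sets; used here as the fitness."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def refine_pose(initial, observed, generations=40, pop_size=30, sigma=3.0, rng=None):
    """Refine a 2D pose estimate with a simple elitist genetic algorithm."""
    rng = rng or random.Random(0)

    def fitness(pose):
        return iou(render_mask(pose), observed)

    # Initial population: the estimate itself plus random perturbations of it.
    population = [initial] + [
        (initial[0] + rng.gauss(0, sigma), initial[1] + rng.gauss(0, sigma))
        for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        # Selection: keep the best quarter unchanged (elitism).
        population.sort(key=fitness, reverse=True)
        elite = population[: pop_size // 4]
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            # Crossover: take each coordinate from one of the two parents.
            child = (rng.choice((a[0], b[0])), rng.choice((a[1], b[1])))
            # Mutation: small Gaussian perturbation.
            child = (child[0] + rng.gauss(0, 1.0), child[1] + rng.gauss(0, 1.0))
            children.append(child)
        population = elite + children
    return max(population, key=fitness)


if __name__ == "__main__":
    observed = render_mask((12, 9))   # mask of the (unknown) true pose
    initial = (8.0, 13.0)             # drifted estimate, e.g. from tracking
    refined = refine_pose(initial, observed)
    print("initial score:", iou(render_mask(initial), observed))
    print("refined score:", iou(render_mask(refined), observed))
```

Because the best candidates are carried over unchanged in every generation, the refined pose never scores worse than the initial estimate. A real system would replace `render_mask` with a renderer for the city model and extend the pose to six degrees of freedom.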

Thesis Note

Darmstadt, TU, Master Thesis, 2019
