Here you will find scientific publications from the Fraunhofer Institutes.

Self-Supervised Learning for Monocular Depth Estimation from Aerial Imagery

Hermann, Max; Ruf, Boitumelo; Weinmann, Martin; Hinz, Stefan

Fulltext urn:nbn:de:0011-n-5998488 (8.0 MByte PDF)
MD5 Fingerprint: e8e16db39691c0e8718d07f512fa00bf
License: CC BY
Created on: 3.9.2020

Paparoditis, N. ; International Society for Photogrammetry and Remote Sensing -ISPRS-:
XXIV ISPRS Congress 2020. Commission II : 31 Aug - 2 Sep, on-line, Nice, France
Istanbul: ISPRS, 2020 (ISPRS Annals V-2-2020)
International Society for Photogrammetry and Remote Sensing (ISPRS Congress) <24, 2020, Online>
Conference Paper, Electronic Publication
Fraunhofer IOSB
monocular depth estimation; self-supervised learning; deep learning; Convolutional Neural Networks; self-improving; online processing; Oblique Aerial Imagery

Supervised-learning-based methods for monocular depth estimation usually require large amounts of extensively annotated training data. For aerial imagery, this ground truth is particularly difficult to acquire. In this paper, we therefore present a method for self-supervised learning of monocular depth estimation from aerial imagery that does not require annotated training data. We use only an image sequence from a single moving camera and learn to estimate depth and pose information simultaneously. By sharing weights between pose and depth estimation, we achieve a relatively small model, which favors real-time application. We evaluate our approach on three diverse datasets and compare the results to conventional methods that estimate depth maps based on multi-view geometry. We achieve an accuracy δ < 1.25 of up to 93.5 %. In addition, we pay particular attention to how well a trained model generalizes to unknown data and to the self-improving capabilities of our approach. We conclude that, even though the results of monocular depth estimation are inferior to those achieved by conventional methods, they are well suited to provide a good initialization for methods that rely on image matching, or to provide estimates in regions where image matching fails, e.g., occluded or texture-less regions.
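
The accuracy figure quoted above is a threshold accuracy: the share of pixels whose predicted depth lies within a factor of 1.25 of the reference depth. The following is a minimal sketch of that metric as it is commonly defined in the depth-estimation literature; the abstract does not describe the paper's exact evaluation protocol (masking, scaling, depth range), so the function name `threshold_accuracy`, the flat-list input format, and the rule of skipping non-positive depths are assumptions made for illustration only.

```python
from typing import Sequence


def threshold_accuracy(
    pred: Sequence[float],
    gt: Sequence[float],
    threshold: float = 1.25,
) -> float:
    """Fraction of valid pixels with max(pred/gt, gt/pred) < threshold.

    Hypothetical helper: pixels whose predicted or reference depth is
    not strictly positive are treated as invalid and skipped (an
    assumption, not taken from the paper).
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")

    valid = 0
    hits = 0
    for p, g in zip(pred, gt):
        if p <= 0 or g <= 0:
            continue  # no usable depth for this pixel
        valid += 1
        # The symmetric ratio penalizes over- and underestimation alike.
        if max(p / g, g / p) < threshold:
            hits += 1

    if valid == 0:
        raise ValueError("no valid pixels to evaluate")
    return hits / valid


# Toy depths in metres; the last pixel has no reference depth.
predicted = [10.0, 12.0, 30.0, 8.0]
reference = [10.0, 10.0, 20.0, 0.0]
accuracy = threshold_accuracy(predicted, reference)
```

In this toy example the ratios of the three valid pixels are 1.0, 1.2, and 1.5, so two of three fall below the threshold. A reported value of 93.5 % would correspondingly mean that 93.5 % of the evaluated pixels satisfy the ratio test.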