2021
Master Thesis
Title
Neural Radiance Fields for Real-Time Object Pose Estimation on Unseen Object Appearances
Abstract
I present an extension to the iNeRF (inverting Neural Radiance Fields) framework that allows mesh-free 6DoF pose estimation from RGB-only inputs in near real-time, even on previously unseen appearance variations of known objects. The iterative analysis-by-synthesis approach of iNeRF uses Neural Radiance Fields (NeRFs) to render photorealistic views from freely chosen directions. An object pose is then estimated by matching the camera pose of a synthesized view with the input image, thus "inverting" the NeRF formulation. While NeRF allows the volumetric rendering of complex scenes without 3D models, it can only represent a single static scene. Variation in scenes, caused by illumination changes or different object textures, requires a model explicitly trained for those conditions. Since lighting can change quickly in any natural environment, there is no feasible way to prepare NeRF models for every expected object appearance beforehand, especially when no additional synthetic data can be generated because 3D meshes are unavailable. This, in turn, limits the use of iNeRF in realistic scenarios. Additionally, the iterative iNeRF algorithm is computationally expensive, which further narrows its range of applications. The goal of this thesis is to enhance the iNeRF method so that it can predict poses for various object appearances under changing illumination conditions, even when the specific looks of these objects were not seen before. Furthermore, performance enhancements are intended to reduce computation time and enable a wider variety of use cases. By introducing a variational autoencoder (VAE) into the NeRF formulation, my method allows NeRF to represent object appearances in a latent space. Instead of a single fixed scene, my NeRF model is trained to reproduce objects under a wide distribution of differing textures and illuminations, conditioned on a single input image.
A single trained NeRF-VAE model, capable of recreating a variety of known and unseen textures, thus replaces multiple individually trained NeRFs, which can only render already seen objects. With this NeRF-VAE upgrade to the rendering process, iNeRF-VAE can adapt the once static neural radiance field to reconstruct the appearance from any given input image of the object. In my experiments, I show the expressiveness of this latent vector in representing unseen object appearances on a custom dataset. I also establish my model's capability to still capture fine geometric details. Additionally, I demonstrate that the introduced changes allow pose estimation on various object appearances. My approach also proves proficient at predicting poses of partially occluded objects by inpainting the missing parts. Finally, to bring pose estimation closer to real-time, the iterative process is sped up by predicting a closer initial starting pose. I also reduce the number of rendered pixels by sampling more informative rays and introduce an early stopping regime that prioritizes convergence speed over minor accuracy gains.
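The abstract's core loop is analysis-by-synthesis: render a view at the current pose guess, compare it photometrically with the input image, and descend the error, stopping early once progress stalls. A minimal sketch of that idea, with a toy one-dimensional render function standing in for the NeRF-VAE and a hypothetical early-stopping threshold (all names and parameters here are illustrative assumptions, not the thesis' actual implementation):

```python
import numpy as np

def render(pose, appearance):
    # Toy stand-in for a NeRF render: a smooth "image" whose pixels
    # depend on a scalar camera pose and a latent appearance code.
    x = np.linspace(0.0, 1.0, 64)
    return appearance * np.sin(2.0 * np.pi * (x + pose))

def estimate_pose(observed, appearance, init_pose=0.0,
                  lr=0.01, eps=1e-4, steps=500, tol=1e-8):
    # Analysis-by-synthesis: gradient descent on the photometric loss
    # with respect to the pose, using a finite-difference gradient.
    pose = init_pose
    prev_loss = np.inf
    for _ in range(steps):
        loss = np.mean((render(pose, appearance) - observed) ** 2)
        if prev_loss - loss < tol:  # early stopping once progress stalls
            break
        prev_loss = loss
        loss_plus = np.mean((render(pose + eps, appearance) - observed) ** 2)
        loss_minus = np.mean((render(pose - eps, appearance) - observed) ** 2)
        pose -= lr * (loss_plus - loss_minus) / (2.0 * eps)
    return pose
```

For example, `estimate_pose(render(0.1, 1.0), 1.0)` recovers a pose close to 0.1 starting from 0.0. The real method differs in every particular (6DoF poses, learned latent codes, informative ray sampling, a pose initializer), but the loop structure is the same.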
Thesis Note
Darmstadt, TU, Master Thesis, 2021
Publishing Place
Darmstadt
Language
English
Keyword(s)
Analysis-by-synthesis
appearance estimation
Bayesian probabilities
deep learning
differential rendering
encoding
generative 3D models
image synthesis
image-based rendering
object pose estimation
real-time rendering
scene representation
volume rendering
Lead Topic: Digitized Work
Research Line: Computer graphics (CG)
Research Line: Computer vision (CV)
Research Line: Human computer interaction (HCI)
Research Line: Machine Learning (ML)