Publica
Estimating Position & Velocity in 3D Space from Monocular Video Sequences Using a Deep Neural Network
 Ikeuchi, K.; Institute of Electrical and Electronics Engineers (IEEE): IEEE International Conference on Computer Vision Workshops, ICCVW 2017: 22-29 October 2017, Venice, Italy. Proceedings. Piscataway, NJ: IEEE, 2017. ISBN: 9781538610343, ISBN: 9781538610350, pp. 1460-1469
 International Conference on Computer Vision (ICCV) <2017, Venice> 

 English 
 Conference paper 
 Fraunhofer HHI 
Abstract
This work describes a regression model based on Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks for tracking objects in monocular video sequences. The target application is Vision-Based Sensor Substitution (VBSS). In particular, the tooltip position and velocity in 3D space of a pair of surgical robotic instruments (SRI) are estimated for three surgical tasks: suturing, needle-passing and knot-tying. The CNN extracts features from individual video frames, and the LSTM network processes these features over time and continuously outputs a 12-dimensional vector with the estimated position and velocity values. A series of analyses and experiments is carried out on the regression model to reveal the benefits and drawbacks of different design choices. First, the impact of the loss function is investigated by appropriately weighting the Root Mean Squared Error (RMSE) and the Gradient Difference Loss (GDL), using the VGG16 neural network for feature extraction. Second, this analysis is extended to a Residual Neural Network designed for feature extraction, which has fewer parameters than the VGG16 model, resulting in a reduction of ~96.44 % in the neural network size. Third, the impact of the number of time steps used to model the temporal information processed by the LSTM network is investigated. Finally, the capability of the regression model to generalize to data from "unseen" surgical tasks (unavailable in the training set) is evaluated. These analyses are experimentally validated on the public JIGSAWS dataset. They provide guidelines for the design of a regression model in the context of VBSS, specifically when the objective is to estimate a set of 1D time series signals from video sequences.
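The weighted combination of RMSE and GDL mentioned in the abstract could look roughly like the following. This is only a minimal sketch, not the authors' implementation: the abstract does not give the exact form of the loss, so the 1D temporal version of the gradient difference term, the function names (`rmse`, `gdl_1d`, `combined_loss`) and the weights `w_rmse` and `w_gdl` are all assumptions made for illustration. Predictions and targets are represented as lists of per-time-step vectors (for example, 12 values per step).

```python
import math


def rmse(pred, target):
    """Root mean squared error over all time steps and output dimensions."""
    squared = [
        (p - t) ** 2
        for pred_step, target_step in zip(pred, target)
        for p, t in zip(pred_step, target_step)
    ]
    return math.sqrt(sum(squared) / len(squared))


def gdl_1d(pred, target, alpha=1.0):
    """Assumed 1D gradient difference term along the time axis.

    Compares the magnitude of consecutive-step differences in the
    prediction with those in the target, averaged over all terms.
    """
    terms = []
    for k in range(1, len(pred)):
        for d in range(len(pred[k])):
            grad_pred = abs(pred[k][d] - pred[k - 1][d])
            grad_target = abs(target[k][d] - target[k - 1][d])
            terms.append(abs(grad_target - grad_pred) ** alpha)
    return sum(terms) / len(terms)


def combined_loss(pred, target, w_rmse=1.0, w_gdl=1.0):
    """Weighted sum of the two terms; the weights are hypothetical."""
    return w_rmse * rmse(pred, target) + w_gdl * gdl_1d(pred, target)
```

In this sketch, a prediction that is offset from the target by a constant incurs only the RMSE term, whereas a prediction that misses the temporal changes of the signal is additionally penalized by the gradient term. Adjusting the two weights shifts the balance between absolute accuracy and fidelity of the signal's dynamics.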