Benchmarking Uncertainty Estimation Methods for Deep Learning with Safety-Related Metrics
Deep neural networks generally perform very well on giving accurate predictions, but they often lack in recognizing when these predictions may be wrong. This absence of awareness regarding the reliability of given outputs is a big obstacle in deploying such models in safety-critical applications. There are certain approaches that try to address this problem by designing the models to give more reliable values for their uncertainty. However, even though the performance of these models are compared to each other in various ways, there is no thorough evaluation comparing them in a safety-critical context using metrics that are designed to describe trade-offs between performance and safe system behavior. In this paper we attempt to fill this gap by evaluating and comparing several state-of-the-art methods for estimating uncertainty for image classifcation with respect to safety-related requirements and metrics that are suitable to describe the models performance in safety-critical domains. We show the relationship of remaining error for predictions with high confidence and its impact on the performance for three common datasets. In particular, Deep Ensembles and Learned Confidence show high potential to significantly reduce the remaining error with only moderate performance penalties.