Measuring Ensemble Diversity and its Effects on Model Robustness

Heidemann, Lena; Schwaiger, Adrian; Roscher, Karsten

doi:10.24406/publica-fhg-411897

2021

Conference Paper

Abstract

Deep ensembles have been shown to perform well on a variety of tasks in terms of accuracy, uncertainty estimation, and further robustness metrics. The diversity among ensemble members is often named as the main reason for this. Because diversity is a complex and loosely defined concept, it can be expressed by a multitude of metrics. In this paper, we explore how a selection of these diversity metrics relate to each other and to different measures of robustness. Specifically, we address two questions: To what extent can ensembles trained under the same conditions differ in their performance and robustness? And are diversity metrics suitable for selecting members to form a more robust ensemble? To this end, we independently train 20 models for each task and compare all possible ensembles of 5 members on several robustness metrics, including performance on corrupted images, out-of-distribution detection, and quality of uncertainty estimation. Our findings reveal that ensembles trained under the same conditions can differ significantly in their robustness, especially regarding out-of-distribution detection capabilities. Across all setups, which use different datasets and model architectures, we see that choosing ensemble members based on the considered diversity metrics seldom yields better robustness than a baseline selection based on accuracy. We conclude that there is significant potential to improve the formation of robust deep ensembles and that novel and more sophisticated diversity metrics could be beneficial in that regard.
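The evaluation protocol described in the abstract (20 trained models, every 5-member subset scored) can be illustrated with a minimal sketch. It is not the authors' implementation. The model outputs are synthetic random data, and pairwise prediction disagreement is used only as one simple example of a diversity metric; the abstract does not state which metrics the paper considers. The helper names `ensemble_accuracy`, `pairwise_disagreement`, and `mean_disagreement` are hypothetical.

```python
from itertools import combinations
from math import comb

import numpy as np


def ensemble_accuracy(probs, labels, members):
    """Accuracy of the ensemble that averages the members' class probabilities."""
    mean_probs = probs[list(members)].mean(axis=0)
    return float((mean_probs.argmax(axis=1) == labels).mean())


def pairwise_disagreement(preds):
    """D[i, j] = fraction of samples on which models i and j predict different classes."""
    return (preds[:, None, :] != preds[None, :, :]).mean(axis=2)


def mean_disagreement(dis, members):
    """Average pairwise disagreement among the chosen members (illustrative metric)."""
    return float(np.mean([dis[i, j] for i, j in combinations(members, 2)]))


# Assumption: synthetic stand-in for the outputs of 20 independently trained models.
rng = np.random.default_rng(0)
n_models, n_samples, n_classes, k = 20, 200, 10, 5
labels = rng.integers(0, n_classes, size=n_samples)
logits = rng.normal(size=(n_models, n_samples, n_classes))
logits[:, np.arange(n_samples), labels] += 1.5  # push models above chance level
probs = np.exp(logits) / np.exp(logits).sum(axis=2, keepdims=True)

dis = pairwise_disagreement(probs.argmax(axis=2))

# Score every possible 5-member ensemble drawn from the 20 models.
results = [
    (members, ensemble_accuracy(probs, labels, members), mean_disagreement(dis, members))
    for members in combinations(range(n_models), k)
]
assert len(results) == comb(n_models, k)  # 15504 candidate ensembles

accs = np.array([r[1] for r in results])
divs = np.array([r[2] for r in results])
print(f"{len(results)} ensembles, accuracy range {accs.min():.3f} to {accs.max():.3f}")
print(f"correlation(diversity, accuracy) = {np.corrcoef(divs, accs)[0, 1]:.3f}")
```

With real models, `probs` would hold each model's predicted class probabilities on a held-out set, and the accuracy column could be replaced by any of the robustness measures named in the abstract, such as accuracy on corrupted images or an out-of-distribution detection score. Exhaustive enumeration is feasible here because choosing 5 of 20 models gives only 15,504 subsets.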