
Measuring Ensemble Diversity and its Effects on Model Robustness

Authors: Heidemann, Lena; Schwaiger, Adrian; Roscher, Karsten

Fulltext urn:nbn:de:0011-n-6390342 (702 KByte PDF)
MD5 Fingerprint: 853e3c3a9ddf98dea07b4cc024cf4f0f
(CC) by
Created on: 27.8.2021

Espinoza, H.:
Workshop on Artificial Intelligence Safety, AISafety 2021. Proceedings. Online resource: Co-located with the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI 2021), Virtual, August 2021
Online im WWW, 2021 (CEUR Workshop Proceedings 2916)
Paper 8, 9 pp.
Workshop on Artificial Intelligence Safety (AISafety) <2021, Online>
International Joint Conference on Artificial Intelligence (IJCAI) <30, 2021, Online>
Bayerisches Staatsministerium für Wirtschaft, Landesentwicklung und Energie StMWi

IKS Aufbauprojekt
Conference Paper, Electronic Publication
Fraunhofer IKS
machine learning; ML; deep neural networks; DNN; ensemble learning; deep ensemble; ensemble diversity; uncertainty quantification; uncertainty estimation; safe artificial intelligence; Safe AI; safe machine learning; Safe ML; safe intelligence

Deep ensembles have been shown to perform well on a variety of tasks in terms of accuracy, uncertainty estimation, and other robustness metrics. The diversity among ensemble members is often cited as the main reason for this. Due to its complex and indefinite nature, diversity can be expressed by a multitude of metrics. In this paper, we explore how a selection of these diversity metrics relate to each other, as well as their link to different measures of robustness. Specifically, we address two questions: To what extent can ensembles with the same training conditions differ in their performance and robustness? And are diversity metrics suitable for selecting members to form a more robust ensemble? To this end, we independently train 20 models for each task and compare all possible ensembles of 5 members on several robustness metrics, including performance on corrupted images, out-of-distribution detection, and quality of uncertainty estimation. Our findings reveal that ensembles trained under the same conditions can differ significantly in their robustness, especially in their out-of-distribution detection capabilities. Across all setups, using different datasets and model architectures, we see that, in terms of robustness metrics, choosing ensemble members based on the considered diversity metrics seldom exceeds the baseline of selection based on accuracy. We conclude that there is significant potential to improve the formation of robust deep ensembles and that novel, more sophisticated diversity metrics could be beneficial in that regard.
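The experimental setup described in the abstract (exhaustively forming all 5-member ensembles from a pool of 20 independently trained models and scoring them against an accuracy-based selection baseline) can be sketched as follows. This is an illustrative sketch, not the paper's actual code: the synthetic softmax outputs, labels, and all function names are stand-ins, and the accuracy-based baseline shown here is only one of the robustness metrics the paper compares.

```python
# Illustrative sketch (not the paper's code): enumerate all 5-member
# ensembles drawn from 20 independently trained models and compare
# each ensemble's averaged prediction against a baseline selection
# of the 5 individually most accurate members.
from itertools import combinations

import numpy as np

rng = np.random.default_rng(0)
n_models, n_samples, n_classes = 20, 100, 10

# Synthetic stand-ins for per-model softmax outputs and ground truth.
probs = rng.dirichlet(np.ones(n_classes), size=(n_models, n_samples))
labels = rng.integers(0, n_classes, size=n_samples)


def accuracy(p: np.ndarray) -> float:
    """Top-1 accuracy of softmax outputs p of shape (n_samples, n_classes)."""
    return float((p.argmax(axis=1) == labels).mean())


member_acc = np.array([accuracy(probs[i]) for i in range(n_models)])

# All C(20, 5) = 15504 possible ensembles; each is scored by the
# accuracy of its mean-softmax (averaged) prediction.
ensembles = list(combinations(range(n_models), 5))
ens_acc = {e: accuracy(probs[list(e)].mean(axis=0)) for e in ensembles}

# Accuracy-based baseline: the 5 members with the highest individual
# accuracy (diversity-based selection would rank candidates by a
# diversity metric instead).
baseline = tuple(sorted(int(i) for i in np.argsort(member_acc)[-5:]))
print(f"number of candidate ensembles: {len(ensembles)}")
print(f"baseline ensemble accuracy: {ens_acc[baseline]:.3f}")
```

In the paper's actual study this enumeration is repeated per dataset and architecture, and the ensembles are additionally evaluated on corrupted images, out-of-distribution detection, and uncertainty-quality metrics rather than clean accuracy alone.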