September 25, 2024
Conference Paper
Title
On a Systematic Test of ML-Based Systems: Experiments on Test Statistics
Abstract
Machine learning (ML)-based systems are becoming increasingly ubiquitous, even in safety-critical environments. The strength of ML systems, namely solving complex problems with a stochastic model, leads to challenges in the testing domain. This motivates us to introduce a rigorous testing method for ML models and their application environment, akin to classical software testing, that is independent of the training process and accounts for the probabilistic nature of ML. The approach is based on the concept of the Probabilistically Extended ONtology (PEON). In brief, a PEON is an ontology modeling the designated Operational Design Domain (ODD), extended by assigning probability distributions to classes and their individual attributes, as well as probabilistic dependencies between these attributes. The relevant statistical key figures, such as accuracy, depend not only on the ML-based model but also strongly on the statistics of the test data set, which we refer to as the quality assurance (QA) data set to emphasize its independence from the test data set used in the training process. This implies that the statistical properties of the QA data must be considered when evaluating an ML-based system. In this paper we present first experimental results comparing established test selection methods, e.g. N-wise testing, with a new approach based on the PEON. Our findings strongly suggest that the underlying statistical properties of the QA data significantly influence the test results of ML-based systems. In this respect, careful attention must be paid to the statistical independence and balance of the QA data. The PEON provides a good basis for composing QA data sets that are not only independent of the development process but also statistically representative and balanced with respect to the modeled ODD.
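To give a rough intuition of the PEON concept as described in the abstract, the following is a minimal Python sketch, not taken from the paper: class attributes carry marginal probability distributions, a dependency between attributes is expressed as a conditional distribution, and sampling from the extended ontology yields scene descriptions for a QA data set. All attribute names, values, and probabilities below are illustrative assumptions.

```python
import random

# Hypothetical marginal distribution for a "weather" attribute of an
# ODD scene class (values and probabilities are illustrative only).
WEATHER = {"clear": 0.6, "rain": 0.3, "fog": 0.1}

# Hypothetical probabilistic dependency: road surface conditioned on weather.
ROAD_GIVEN_WEATHER = {
    "clear": {"dry": 0.9, "wet": 0.1},
    "rain":  {"dry": 0.1, "wet": 0.9},
    "fog":   {"dry": 0.5, "wet": 0.5},
}

def sample(dist):
    """Draw one value from a {value: probability} distribution."""
    values, weights = zip(*dist.items())
    return random.choices(values, weights=weights, k=1)[0]

def sample_scene():
    """Sample one ODD scene description from the extended ontology."""
    weather = sample(WEATHER)
    road = sample(ROAD_GIVEN_WEATHER[weather])
    return {"weather": weather, "road": road}

# Compose a QA data-set specification that follows the modeled
# ODD statistics rather than the statistics of the training data.
qa_specs = [sample_scene() for _ in range(1000)]
```

In this reading, a statistically representative QA set is obtained by sampling according to the modeled distributions, while a balanced one could be obtained by reweighting the same ontology; how the paper realizes this in detail is not specified in the abstract.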
Author(s)