Recommending data preprocessing pipelines for machine learning applications in production

Frye, Maik

doi:10.18154/RWTH-2023-02364

March 8, 2023

Doctoral Thesis

Abstract

The era of Industry 4.0 opens up the possibility of optimizing production systems in a data-driven way. To turn data into value, machine learning (ML) models are trained on production data to identify patterns that can be used to optimize processes. A crucial prerequisite for performant ML models is the availability of high-quality data. Since raw data generated in production exhibits multiple quality issues, targeted data preprocessing (DPP) is required to increase the quality of the data. One of the key design decisions in any ML project is the choice of suitable DPP methods. The search space grows further when several DPP methods are configured into DPP pipelines. Because of the large number of possible DPP pipelines, data scientists commonly select suitable pipelines manually and by trial and error. As a result, DPP nowadays accounts for approximately 80% of the time in ML projects. To guide data scientists, decision support systems (DSS) have been developed that assist in the selection of suitable DPP pipelines, but they do not cover production-specific requirements. This led to the main research question of the thesis: Can a DSS be developed that supports the recommendation of DPP pipelines for ML applications in production? To answer this question, a meta-learning-based decision support system, called Meta-DPP, was developed. Meta-DPP relies on three core components: the meta target selector, the meta features database, and the meta model. The meta target selector chooses between two preselected sets of overall well-performing pipelines, called pipeline pools, one for classification tasks and one for regression tasks. The meta features database stores learning-task-specific information about the data set, e.g., the number of instances, as well as past performances of ML algorithms and DPP pipelines. The meta model then recommends a pipeline from the pipeline pool based on the meta features from the database.

When applying Meta-DPP, the data scientist, production expert, or IT expert uses a user interface to enter their data set, learning task, ML algorithm, and information about explainability. Given these four inputs, Meta-DPP provides a ranked recommendation of the DPP pipelines from the pool. Probabilities provided by the meta model further indicate how certain Meta-DPP is about the recommendation. Verification and validation confirmed that Meta-DPP was developed and implemented correctly. The validation on 324 production use cases further shows that Meta-DPP outperforms essential pipelines on average, where essential pipelines are those that ensure the functioning of ML algorithms by performing only minimal DPP. The main research question was therefore answered in the affirmative.
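The interplay of the three components can be illustrated with a minimal Python sketch. It is not the thesis implementation: the pool contents, the three meta features, the toy database entries, and the distance-weighted nearest-neighbour vote that stands in for the meta model are all assumptions made for illustration. The ML algorithm and explainability inputs are left out to keep the example short.

```python
from collections import defaultdict
from math import log10, sqrt

# Hypothetical pipeline pools, one per learning task (names are illustrative).
PIPELINE_POOLS = {
    "classification": ["impute+scale", "impute+scale+select", "impute+encode"],
    "regression": ["impute+scale", "impute+scale+pca"],
}

# Toy meta features database: (task, meta features, best pipeline observed).
# The entries are invented for this sketch, not taken from the thesis.
META_DB = [
    ("classification",
     {"n_instances": 500, "n_features": 10, "missing_ratio": 0.05},
     "impute+scale"),
    ("classification",
     {"n_instances": 800, "n_features": 12, "missing_ratio": 0.10},
     "impute+scale"),
    ("classification",
     {"n_instances": 50000, "n_features": 300, "missing_ratio": 0.00},
     "impute+scale+select"),
    ("regression",
     {"n_instances": 2000, "n_features": 20, "missing_ratio": 0.20},
     "impute+scale+pca"),
]


def extract_meta_features(rows):
    """Compute simple meta features from a list of rows (None = missing)."""
    n_instances = len(rows)
    n_features = len(rows[0]) if rows else 0
    cells = n_instances * n_features
    missing = sum(value is None for row in rows for value in row)
    return {
        "n_instances": n_instances,
        "n_features": n_features,
        "missing_ratio": missing / cells if cells else 0.0,
    }


def _as_vector(mf):
    # Log-scale the counts so data set size does not dominate the distance.
    return (
        log10(mf["n_instances"] + 1),
        log10(mf["n_features"] + 1),
        mf["missing_ratio"],
    )


def recommend(meta_features, task):
    """Return (pipeline, probability) pairs for the task's pool, best first."""
    pool = PIPELINE_POOLS[task]  # stand-in for the meta target selector
    query = _as_vector(meta_features)
    scores = defaultdict(float)
    for past_task, past_mf, best_pipeline in META_DB:
        if past_task != task or best_pipeline not in pool:
            continue
        past = _as_vector(past_mf)
        distance = sqrt(sum((a - b) ** 2 for a, b in zip(query, past)))
        # Closer past data sets contribute a larger vote.
        scores[best_pipeline] += 1.0 / (1.0 + distance)
    total = sum(scores.values())
    if total == 0:
        # No matching history: fall back to a uniform ranking over the pool.
        return [(p, 1.0 / len(pool)) for p in pool]
    ranked = [(p, scores[p] / total) for p in pool]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Usage: a small data set with 5% missing values in a classification task.
rows = [[None if (i + j) % 20 == 0 else float(i) for j in range(10)]
        for i in range(600)]
ranking = recommend(extract_meta_features(rows), "classification")
for pipeline, probability in ranking:
    print(f"{pipeline}: {probability:.2f}")
```

The normalized votes play the role of the probabilities mentioned in the abstract: they order the pipelines in the pool and give a rough indication of how strongly the stored history supports the top recommendation.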

Thesis Note

Also published as: Aachen, TH, dissertation, 2022

Author(s)

Frye, Maik

Fraunhofer-Institut für Produktionstechnologie IPT

Advisor(s)

Schmitt, Robert

Fraunhofer-Institut für Produktionstechnologie IPT

Behr, Marek