September 2022
Conference Paper
Title
Preliminary analysis on data quality for ML applications
Abstract
Purpose: This publication investigates preliminary data quality analyses that estimate, as early as the data understanding phase of an implementation, the effort required and the results to be expected when using data sets for ML solutions. Knowing the necessary data cleaning effort and the achievable result quality allows the potential of a solution to be assessed early in the process.
Methodology: A literature review is used to analyse the characteristics of time series and the available data cleaning methods. Based on the results, a test environment is implemented in Python that evaluates individual methods on sample data sets from the process industry and compares them using different error analyses.
Findings: The publication gives a detailed overview of data cleaning procedures and provides a first indication of a connection between the final achievable forecast quality and the degree of error in the original data set. It also yields insights into how the choice of preprocessing method influences the achievable quality of the AI-based forecast.
Originality: The publication establishes the link between data characteristics in time series and preprocessing methods. This allows conclusions to be drawn in advance about the quality improvement to be expected from selected data cleaning methods and provides decision support for selecting a method.
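The kind of comparison described under Methodology can be illustrated with a minimal sketch. It is not the authors' test environment: the synthetic signal, the three gap-filling methods, the error rates, and all function names are assumptions chosen for illustration, and the RMSE measures how well the cleaned series matches the undamaged one rather than the quality of a downstream forecast.

```python
import math
import random


def make_series(n=200, seed=0):
    """Synthetic stand-in for a sensor signal: sine wave plus small Gaussian noise."""
    rng = random.Random(seed)
    return [math.sin(2 * math.pi * i / 50) + rng.gauss(0, 0.05) for i in range(n)]


def inject_gaps(series, error_rate, seed=1):
    """Replace a fraction of interior points with None to mimic missing readings."""
    rng = random.Random(seed)
    n_missing = int(error_rate * len(series))
    # First and last points are kept so every gap has a known neighbour on both sides.
    idx = set(rng.sample(range(1, len(series) - 1), n_missing))
    return [None if i in idx else v for i, v in enumerate(series)]


def forward_fill(series):
    """Fill each gap with the last known value."""
    out, last = [], None
    for v in series:
        if v is not None:
            last = v
        out.append(last)
    return out


def linear_interpolate(series):
    """Fill each gap on a straight line between its nearest known neighbours."""
    out = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            frac = (i - left) / (right - left)
            out[i] = series[left] + frac * (series[right] - series[left])
    return out


def mean_fill(series):
    """Fill each gap with the mean of all known values."""
    known = [v for v in series if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in series]


def rmse(a, b):
    """Root mean squared error between two equally long series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def evaluate(error_rates=(0.05, 0.2, 0.4)):
    """Return {error_rate: {method_name: RMSE against the undamaged series}}."""
    truth = make_series()
    methods = {
        "forward_fill": forward_fill,
        "linear": linear_interpolate,
        "mean": mean_fill,
    }
    results = {}
    for rate in error_rates:
        damaged = inject_gaps(truth, rate)
        results[rate] = {name: rmse(fn(damaged), truth) for name, fn in methods.items()}
    return results


if __name__ == "__main__":
    for rate, scores in evaluate().items():
        print(rate, {name: round(score, 4) for name, score in scores.items()})
```

Running the script prints one row of reconstruction errors per error rate, which makes it possible to see how each cleaning method degrades as the share of damaged points grows. A real study would replace the synthetic signal with process data and score the cleaned series by the forecast quality it enables.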
Author(s)