A provenance meta learning framework for missing data handling methods selection

Liu, Qian; Hauswirth, Manfred

2020

Conference Paper

Abstract

Missing data is a major problem in many real-world data sets and applications: it can lead to wrong or misleading analysis results and lowers both the quality of and the confidence in those results. A large number of missing data handling methods have been proposed in the research community, but no single method is universally best for all missing data problems. Selecting the right method for a specific missing data handling problem usually depends on multiple intertwined factors. To alleviate this method selection problem, we propose a Provenance Meta Learning Framework in this paper. We conducted an extensive literature review of 118 survey papers on missing data handling methods published from 2000 to 2019. Based on this review, we analyse 9 influential factors and 12 selection criteria for missing data handling methods and perform a detailed analysis of 6 popular missing data handling methods: 4 machine learning methods, i.e., KNN Imputation (KNNI), Weighted KNN Imputation (WKNNI), K-Means Imputation (KMI), and Fuzzy KMI (FKMI), and 2 ad-hoc methods, i.e., Median/Mode Imputation (MMI) and Group/Class MMI (CMMI). We focus on missing data handling method selection for 3 classification techniques, i.e., C4.5, KNN, and RIPPER. In our evaluation, we use 25 real-world data sets from the KEEL and UCI data set repositories. Our Provenance Meta Learning Framework suggests using KNNI to handle missing values when the missing data mechanism is Missing Completely At Random (MCAR), the missing data pattern is uni-attribute or monotone, the missing data rate is within [1%, 5%], the number of class labels is 2, and the sample size is no more than 10,000. Under these conditions, and when the subsequent classification method is KNN or RIPPER, KNNI preserves classification performance better and achieves higher imputation accuracy and imputation exhaustiveness than the other 5 missing data handling methods.
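To make the contrast between the two method families concrete, the following is a minimal Python sketch of a KNNI-style imputation next to an MMI-style median baseline for numeric attributes. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the function names (`knn_impute`, `median_impute`), the distance over jointly observed attributes, the mean aggregation of neighbour values, and the toy data are all hypothetical choices made here.

```python
import math
from statistics import mean, median

def median_impute(rows):
    """MMI-style baseline (numeric only): fill each gap with the column median."""
    n_cols = len(rows[0])
    medians = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        medians.append(median(observed))
    return [[medians[j] if v is None else v for j, v in enumerate(r)] for r in rows]

def _distance(a, b):
    """Root mean squared difference over attributes observed in both rows.

    Returns infinity when the two rows share no observed attribute.
    (Assumed distance; the paper's abstract does not specify one.)
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

def knn_impute(rows, k=2):
    """KNNI-style sketch: fill each gap with the mean of that attribute
    over the k nearest rows in which the attribute is observed."""
    result = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows with attribute j observed
            # and at least one attribute in common with the current row.
            donors = [
                (_distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            donors = [d for d in donors if d[0] < math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:  # with no usable donor, the gap stays unfilled
                result[i][j] = mean(nearest)
    return result

# Hypothetical toy data: only the second attribute has a gap
# (a uni-attribute missing data pattern).
data = [
    [1.0, 2.0],
    [1.1, 2.2],
    [5.0, 10.0],
    [5.2, None],
]

print(knn_impute(data, k=1)[3])   # gap filled from the nearest row, [5.0, 10.0]
print(median_impute(data)[3])     # gap filled with the column median
```

In this toy case the neighbour-based estimate follows the local structure of the data, while the median baseline ignores it. Which method is preferable in practice depends on the factors listed in the abstract, such as the missing data mechanism, pattern, and rate.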