Generic error identification in data sets

El Bekri, Nadia; Peinsipp, Byma

doi:10.24406/publica-fhg-394199

2016

Conference Paper

Abstract

Manual data acquisition is common in many areas; the United States Environmental Protection Agency (EPA) [1], for example, collects data this way. This type of acquisition can introduce many errors into a data set, and such errors can distort the rules and patterns that Data Mining algorithms extract from it. An erroneous entry could be, for example, a fuel consumption value that is far too high for a vehicle because of a missing decimal comma. If a customer who is considering this vehicle looks up its fuel consumption in the EPA database, the incorrect entry could influence their purchase decision. Manual inspection of a data set is very time consuming and impractical for large data sets, so automatic procedures are needed to keep the data accurate. This paper illustrates an approach that identifies errors using association rules. The rules are generated by combining various algorithms from the fields of clustering and association analysis. These association rules can also help prevent erroneous data entries in advance.
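The idea of flagging records that contradict association rules can be illustrated with a minimal sketch. It is not the authors' implementation: the record layout, the attribute names (`class`, `consumption`), the thresholds, and the helper functions `mine_rules` and `find_suspects` are all assumptions made for illustration. The sketch also assumes that numeric values have already been discretized into categories such as "low" and "high", for instance by a clustering step that is not shown here.

```python
from collections import Counter
from itertools import permutations


def mine_rules(records, min_support=0.3, min_confidence=0.9):
    """Mine single-antecedent rules (a=x) -> (b=y) from a list of dicts."""
    n = len(records)
    item_counts = Counter()   # how often each (attribute, value) occurs
    pair_counts = Counter()   # how often an ordered pair of items co-occurs
    for rec in records:
        items = sorted(rec.items())
        item_counts.update(items)
        pair_counts.update(permutations(items, 2))

    rules = []
    for (lhs, rhs), count in pair_counts.items():
        support = count / n                   # share of records with both items
        confidence = count / item_counts[lhs]  # share of lhs records that also have rhs
        if support >= min_support and confidence >= min_confidence:
            rules.append((lhs, rhs, confidence))
    return rules


def find_suspects(records, rules):
    """Return (index, antecedent, expected consequent) for rule violations."""
    suspects = []
    for idx, rec in enumerate(records):
        for (la, lv), (ra, rv), _conf in rules:
            # The antecedent matches, but the consequent attribute differs.
            if rec.get(la) == lv and ra in rec and rec[ra] != rv:
                suspects.append((idx, (la, lv), (ra, rv)))
    return suspects


# Hypothetical, hand-made records; index 10 mimics a mistyped consumption value.
records = (
    [{"class": "compact", "consumption": "low"}] * 10
    + [{"class": "compact", "consumption": "high"}]
    + [{"class": "truck", "consumption": "high"}] * 5
)

rules = mine_rules(records)
suspects = find_suspects(records, rules)
for idx, antecedent, expected in suspects:
    print(f"record {idx}: {antecedent} usually implies {expected}")
```

In this toy data the rule "class=compact implies consumption=low" holds for ten of eleven compact vehicles, so the single contradicting record is reported as a suspect. The same rules could in principle be checked at entry time to warn about implausible values before they are stored.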

Author(s)

El Bekri, Nadia

Peinsipp, Byma

Mainwork

25th International Conference on Software Engineering and Data Engineering, SEDE 2016

Conference

International Conference on Software Engineering and Data Engineering (SEDE) 2016

International Conference on Computer Applications in Industry and Engineering (CAINE) 2016
