2020
Conference Paper
Title
Random forests followed by computed ABC analysis as a feature selection method for machine learning in biomedical data
Abstract
Background: Data from biomedical measurements usually comprise many parameters (variables/features). To reduce the effort of data acquisition or to enhance comprehension, a feature selection method is proposed that combines the ranking of the relative importance of each parameter in random forest classifiers with an item categorization provided by computed ABC analysis.

Data: The input data space comprised an example subset of plasma concentrations of d = 23 lipid markers of various classes, acquired from Parkinson patients and healthy subjects (n = 100 each).

Methods: Random forest classifiers were constructed for various scenarios of the number of trees and the number of features in each tree. The relative importance of each feature calculated by the classifier was submitted to computed ABC analysis, a categorization technique for skewed distributions, to identify the most important feature subset "A," i.e., a reduced set containing the important few items.

Results: Across different parameter settings of the algorithms, the classification performance of all reduced-set random forest classifiers was almost as good as that of a random forest classifier using the full set of d = 23 lipid markers; all reached 95% or better classification accuracy. When additional "nonsense" features, consisting of concentration data permuted across the subject groups, were included, these features were never assigned to ABC set "A." The obtained feature sets provided better classifiers than those obtained using classical regression methods.

Conclusions: Random forests plus computed ABC analysis provided feature selection without the need to predefine the number of features. A substantial reduction of the number of features, following the "80/20 rule," was obtained. The classifiers using the A-class features performed better than those based on regression-based feature selection and were (nearly) as good as those using the complete feature set. The obtained small feature sets are also well suited for interpretation by domain experts.
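The pipeline described in the Methods (random-forest importance ranking followed by an ABC-style categorization, then refitting on the reduced set) can be sketched as below. This is a minimal illustration, not the authors' implementation: the data are synthetic stand-ins for the lipid-marker concentrations, and the set-"A" cutoff here is a simplified knee-point rule on the cumulative-importance curve (the point closest to the ideal point (0, 1)), used as a hedged proxy for computed ABC analysis.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the study data: 200 subjects, d = 23 features,
# of which only a few are informative (hypothetical parameters).
X, y = make_classification(n_samples=200, n_features=23, n_informative=5,
                           random_state=0)

# Step 1: rank features by random-forest relative importance.
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
imp = rf.feature_importances_
order = np.argsort(imp)[::-1]  # features sorted by descending importance

# Step 2: simplified ABC categorization of the skewed importance
# distribution. Set "A" = items up to the point on the normalized
# cumulative-importance curve closest to (0, 1); this is an assumed
# simplification of computed ABC analysis, not the published algorithm.
cum = np.cumsum(imp[order]) / imp.sum()
frac = np.arange(1, len(imp) + 1) / len(imp)
a_end = int(np.argmin(frac ** 2 + (1.0 - cum) ** 2)) + 1
set_a = order[:a_end]  # indices of the "important few" features

# Step 3: compare the reduced-set classifier against the full-set one.
acc_full = cross_val_score(RandomForestClassifier(random_state=0),
                           X, y, cv=5).mean()
acc_a = cross_val_score(RandomForestClassifier(random_state=0),
                        X[:, set_a], y, cv=5).mean()
print(len(set_a), round(acc_full, 3), round(acc_a, 3))
```

On data like this, the reduced set is typically much smaller than d = 23 while the cross-validated accuracy stays close to that of the full feature set, mirroring the "80/20" behavior reported in the abstract.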