Journal of Statistical Software | 2019

Fuzzy Forests: Extending Random Forest Feature Selection for Correlated, High-Dimensional Data

 
 
 
 

Abstract


In this paper we introduce fuzzy forests, a novel machine learning algorithm for ranking the importance of features in high-dimensional classification and regression problems. Fuzzy forests is specifically designed to provide relatively unbiased rankings of variable importance in the presence of highly correlated features, especially when the number of features, p, is much larger than the sample size, n (p ≫ n). We introduce our implementation of fuzzy forests in the R package fuzzyforest. Fuzzy forests works by taking advantage of the network structure among features. First, the features are partitioned into separate modules such that the correlation within modules is high and the correlation between modules is low. The package fuzzyforest allows for easy use of the package WGCNA (weighted gene coexpression network analysis, alternatively known as weighted correlation network analysis) to form modules of features such that the modules are roughly uncorrelated. Recursive feature elimination random forests (RFE-RFs) are then applied to each module separately. From the features that survive this screening step, a final group is selected and ranked using one last round of RFE-RFs. This procedure results in a ranked variable importance list whose size is pre-specified by the user. The selected features can then be used to construct a predictive model.
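The two-stage control flow described in the abstract (partition features into modules, eliminate recursively within each module, then eliminate once more across the survivors) can be illustrated with a minimal Python sketch. It is not the fuzzyforest package or its API. Every function name and parameter below is hypothetical, and two deliberate stand-ins are used to keep the example dependency-free: modules are formed by thresholding absolute pairwise correlations (in place of WGCNA), and feature importance is the absolute correlation with the response (in place of random forest variable importance).

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def partition_into_modules(X, threshold=0.7):
    """Stand-in for WGCNA: modules are the connected components of the graph
    that links two features when |correlation| > threshold.
    X is feature-major: X[j] holds the n observations of feature j."""
    p = len(X)
    parent = list(range(p))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(p):
        for j in range(i + 1, p):
            if abs(pearson(X[i], X[j])) > threshold:
                parent[find(i)] = find(j)
    modules = {}
    for i in range(p):
        modules.setdefault(find(i), []).append(i)
    return list(modules.values())


def toy_importance(X, y, features):
    """Stand-in for random forest variable importance: |corr(feature, y)|."""
    return {f: abs(pearson(X[f], y)) for f in features}


def recursive_elimination(X, y, features, keep, drop_fraction=0.25,
                          importance=toy_importance):
    """Score the features, drop the least important fraction, and repeat until
    `keep` features remain. Returns the survivors ranked by importance."""
    features = list(features)
    while len(features) > keep:
        scores = importance(X, y, features)
        features.sort(key=scores.get, reverse=True)
        n_drop = max(1, int(len(features) * drop_fraction))
        n_drop = min(n_drop, len(features) - keep)
        features = features[:len(features) - n_drop]
    scores = importance(X, y, features)
    return sorted(features, key=scores.get, reverse=True)


def fuzzy_forest_sketch(X, y, final_size, keep_per_module=2, threshold=0.7):
    """Screening step within each module, then a final selection step."""
    survivors = []
    for module in partition_into_modules(X, threshold):
        keep = min(keep_per_module, len(module))
        survivors.extend(recursive_elimination(X, y, module, keep))
    return recursive_elimination(X, y, survivors, min(final_size, len(survivors)))


# Synthetic data: two modules of three correlated features each.
rng = random.Random(0)
n = 200
z1 = [rng.gauss(0, 1) for _ in range(n)]
z2 = [rng.gauss(0, 1) for _ in range(n)]
X = [[z + 0.3 * rng.gauss(0, 1) for z in z1] for _ in range(3)]
X += [[z + 0.3 * rng.gauss(0, 1) for z in z2] for _ in range(3)]
y = [v + 0.1 * rng.gauss(0, 1) for v in X[0]]  # response driven by feature 0

selected = fuzzy_forest_sketch(X, y, final_size=2)
print(selected)
```

Passing a different `importance` callable (for example, one backed by an actual random forest) would bring the sketch closer to the procedure the abstract describes. The correlation-based score is used here only so the example runs with the standard library.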

Volume: 91
Pages: 1-25
DOI: 10.18637/jss.v091.i09
Language: English
Journal: Journal of Statistical Software
