One of the most important steps in building a statistical model is deciding which data to include. With very large datasets and computationally expensive models, substantial efficiency gains can be realized by identifying the most (and least) useful features of a dataset before running a model. Feature selection is the process of identifying the features in a dataset that actually have an influence on the dependent variable.
High dimensionality of the explanatory variables can cause both high computation times and a risk of overfitting the data. Moreover, it’s difficult to interpret models with a high number of features. Ideally we would be able to select the significant features before performing statistical modeling. This reduces training time and makes it easier to interpret the results.
Some techniques to address the “curse of dimensionality” take the approach of creating new variables in a lower-dimensional space, such as Principal Component Analysis (Pearson 1901) or Singular Value Decomposition (Eckart and Young 1936). While these may be easier to run and more predictive than an un-transformed set of predictors, they can be very hard to interpret.
We’d rather, if possible, select from the original predictors, keeping only those that have an impact. Several sophisticated feature selection algorithms are well known, such as Boruta (Kursa and Rudnicki 2010), genetic algorithms (Kuhn and Johnson 2013, Aziz et al. 2013), and simulated annealing (Khachaturyan, Semenovsovskaya, and Vainshtein 1981), but they still carry a very high computational cost, sometimes measured in days, and that cost grows as the dataset grows.
Out of curiosity, we wanted to explore how one of these methods, the Boruta algorithm, performed. Overall, we found that for small datasets it is a very intuitive and useful method for modeling high-dimensional data. A summary of our approach follows.
Why such a strange name?
Boruta comes from the mythological Slavic figure that embodies the spirit of the forest. In that spirit, the Boruta R package is based on ranger, which is a fast implementation of the random forests classification method.
How does it work?
We assume you have some knowledge of how Random Forests work—if not, this may be tough.
Let’s assume you have a target vector T (what you care about predicting) and a bunch of predictors P.
The Boruta algorithm starts by duplicating every variable in P, but instead of making a row-for-row copy, it permutes the order of the values in each column. So, in the copied columns (let’s call them P’), there should be no relationship between the values and the target vector.
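To make the permutation step concrete, here is a minimal base-R sketch of building the shadow columns. The data frame and its column names (`age`, `distance`) are invented for illustration and are not part of our dataset; the Boruta package does this internally.

```r
# Toy predictors P; the values and column names are illustrative only.
set.seed(42)
P <- data.frame(
  age      = c(23, 35, 47, 52, 61, 29),
  distance = c(5.1, 12.3, 0.8, 7.4, 3.3, 9.9)
)

# Permute each column independently to build the shadow copy P'.
P_shadow <- as.data.frame(lapply(P, sample))
names(P_shadow) <- paste0("shadow_", names(P))

# Each shadow column holds the same values as its original in a shuffled
# order, so its distribution is unchanged but any link to the target is broken.
P_extended <- cbind(P, P_shadow)
print(P_extended)
```

Because only the row order changes, each shadow column keeps the same mean, variance, and set of values as the original it was copied from.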
Boruta then trains a Random Forest to predict T based on P and P’.
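The algorithm then compares the variable importance score of each variable in P with the scores of the “shadow” variables in P’. Because a single forest is noisy, this comparison is repeated over many runs, each with freshly permuted shadows. If the importance of a variable in P is significantly and consistently greater than what the shadows in P’ achieve, then the Boruta algorithm considers that variable significant.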
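The comparison logic can be sketched in base R. This is a simplified illustration, not the package's implementation: it uses absolute correlation with the target as a stand-in for random forest importance, all variable names and data are simulated, and it only confirms variables (the real algorithm also rejects variables and drops them as it goes).

```r
set.seed(1)
n <- 200

# Simulated data: x1 and x2 drive the target, noise1 and noise2 do not.
P <- data.frame(
  x1     = rnorm(n),
  x2     = rnorm(n),
  noise1 = rnorm(n),
  noise2 = rnorm(n)
)
target <- 2 * P$x1 - 1.5 * P$x2 + rnorm(n)

# Stand-in importance measure (an assumption of this sketch):
# absolute correlation with the target. Boruta uses random forest importance.
importance <- function(x) abs(cor(x, target))

n_runs <- 50
hits <- setNames(integer(ncol(P)), names(P))

for (run in seq_len(n_runs)) {
  # Fresh shadow columns on every run.
  shadows    <- as.data.frame(lapply(P, sample))
  real_imp   <- sapply(P, importance)
  shadow_imp <- sapply(shadows, importance)
  # A "hit": the real variable beat the best shadow in this run.
  hits <- hits + (real_imp > max(shadow_imp))
}

# Does each variable beat the best shadow more often than a coin flip would?
p_values <- sapply(hits, function(h) {
  binom.test(h, n_runs, p = 0.5, alternative = "greater")$p.value
})

# Bonferroni adjustment for testing several variables at once.
confirmed <- names(P)[p.adjust(p_values, method = "bonferroni") < 0.01]

print(hits)
print(confirmed)
```

Comparing against the best shadow, rather than the average one, is what makes the bar hard to clear: a variable has to outperform the luckiest of the random columns.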
The dataset of interest here was a set of records of doctors’ appointments for insurance-related matters, and the target variable was whether or not the patient showed up for their appointment. Part of our task was to find the most significant interactions, and with fifty jurisdictions and thirty doctor specialties, we already had a space of 1,500 potential interactions to search through, not including many other variables.
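To see how quickly interactions multiply, here is a small sketch with invented data: three jurisdictions and two specialties give six interaction columns, and by the same arithmetic fifty by thirty gives 1,500. The data frame `appts` and its columns are hypothetical stand-ins, not our actual records.

```r
# Hypothetical toy factors standing in for jurisdiction and doctor specialty.
appts <- data.frame(
  jurisdiction = factor(c("CA", "NY", "TX", "CA", "NY", "TX")),
  specialty    = factor(c("Neurology", "Radiology", "Neurology",
                          "Radiology", "Neurology", "Radiology"))
)

# The number of pairwise combinations is the product of the level counts.
n_interactions <- nlevels(appts$jurisdiction) * nlevels(appts$specialty)

# One indicator column per jurisdiction-by-specialty combination.
interaction_mm <- model.matrix(~ 0 + jurisdiction:specialty, data = appts)

print(n_interactions)
print(colnames(interaction_mm))
```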
The set of features can be visualized by creating a set of boxplots for the variable importances for each potential feature.
The three red boxplots represent the distribution of the minimum, mean and maximum scores of the randomly permuted “shadow” variables. This is essentially the range of variable importances that can be achieved through chance.
The blue boxplots are features that performed worse than the best “shadow” variable and should not be included in the model. Purple boxplots are features that have about the same explanatory power as the best “shadow” variable; whether to use them in the model is up to the discretion of the analyst. The green boxplots are variables with importances higher than the maximum “shadow” variable, and are therefore good predictors to include in a future classification model.
    # Build a full indicator matrix; contrasts = FALSE keeps every factor level
    show_mm <- model.matrix(
      ~ 0 + `Doctor Specialty` + `Business Line` + Jurisdiction,
      data = show_df,
      contrasts.arg = lapply(
        show_df[, c('Doctor Specialty', 'Business Line', 'Jurisdiction')],
        contrasts,
        contrasts = FALSE
      )
    )

    # Attach the target (appointment status) and convert to a data frame
    show_mm_st <- cbind(status = show_df$`Appt Status`, show_mm)
    show_mdf <- as.data.frame(show_mm_st)

    library(Boruta)
    b_model <- Boruta(status ~ ., data = show_mdf)

    # List the features Boruta confirmed as important
    cat(getSelectedAttributes(b_model), sep = "\n")
    # Doctor SpecialtyChiropractic Medicine
    # Doctor SpecialtyNeurology
    # Doctor SpecialtyNurse
    # Doctor SpecialtyOrthopaedic Surgery
    # Doctor SpecialtyOther
    # Doctor SpecialtyRadiology
    # Business LineDisability
    # Business LineFirst Party Auto
    # Business LineLiability
    # Business LineOther
    # Business LineThird Party Auto
    # Business LineWorkers Comp
    # JurisdictionCA
    # JurisdictionFL
    # JurisdictionMA
    # JurisdictionNJ
    # JurisdictionNY
    # JurisdictionOR
    # JurisdictionOther
    # JurisdictionTX
    # JurisdictionWA

    ## Importance plot
    plot(b_model, las = 2, cex.axis = 0.75)
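For features whose use is left to the analyst's discretion, the Boruta package offers a few helpers for inspecting and resolving the decisions. The sketch below uses the built-in `iris` data purely as a stand-in, since our appointment data is not included here; the `maxRuns` value and the seed are arbitrary choices for the example.

```r
library(Boruta)

set.seed(123)
# iris is a stand-in dataset for this sketch, not the data discussed above.
b_iris <- Boruta(Species ~ ., data = iris, maxRuns = 50)

# Per-feature importance summary along with the final decision
# (Confirmed, Tentative, or Rejected).
stats <- attStats(b_iris)
print(stats[, c("meanImp", "normHits", "decision")])

# Confirmed features only, then confirmed plus tentative ones.
print(getSelectedAttributes(b_iris, withTentative = FALSE))
print(getSelectedAttributes(b_iris, withTentative = TRUE))

# Force a decision on any remaining tentative features with a simpler
# median-importance comparison. This warns, rather than fails, when
# there are no tentative features left to resolve.
b_fixed <- TentativeRoughFix(b_iris)
print(b_fixed)
```

`TentativeRoughFix` applies a weaker test than the main algorithm, so its decisions are best treated as a tie-breaker rather than as strong evidence.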