
# Why Is Variable Selection Important?

Removing a redundant variable helps to improve accuracy. Similarly, inclusion of a relevant variable has a positive effect on model accuracy.

Too many variables might result in overfitting, which means the model is not able to generalize the pattern to new data.

Too many variables lead to slow computation, which in turn requires more memory and hardware.

Why Boruta Package?

There are a lot of packages for feature selection in R. The question arises: "What makes the Boruta package so special?" Here are the reasons to use the Boruta package for feature selection.

It works well for both classification and regression problems.

It takes into account multi-variable relationships.

It is an improvement on the random forest variable importance measure, which is a very popular method for variable selection.

It follows an all-relevant variable selection method, in which it considers all features that are relevant to the outcome variable. Most other variable selection algorithms, by contrast, follow a minimal-optimal method, relying on a small subset of features that yields minimal error on a chosen classifier.

It can handle interactions between variables.

It can deal with the fluctuating nature of a random forest importance measure.

Basic Idea of Boruta Algorithm

Shuffle the values of the predictors, join the shuffled copies to the original predictors, and build a random forest on the merged dataset. Then compare the original variables with the randomised variables to measure variable importance. Only variables with higher importance than the randomised variables are considered important.

How Boruta Algorithm Works
Follow the steps below to understand the algorithm –

Create duplicate copies of all independent variables. When the number of independent variables in the original data is less than 5, create at least 5 copies using existing variables.

Shuffle the values of the added duplicate copies to remove their correlations with the target variable. These copies are called shadow features or permuted copies.

Combine the original variables with the shuffled copies.

Run a random forest classifier on the combined dataset and compute a variable importance measure (the default is Mean Decrease Accuracy) to evaluate the importance of each variable, where higher means more important.

Then the Z score is computed. It is the mean of the accuracy loss divided by the standard deviation of the accuracy loss.

Find the maximum Z score among shadow attributes (MZSA).

Tag the variables as ‘unimportant’ when their importance is significantly lower than the MZSA, and permanently remove them from the process.

Tag the variables as ‘important’ when their importance is significantly higher than the MZSA.

Repeat the above steps for a predefined number of iterations (random forest runs), or until all attributes are tagged either ‘unimportant’ or ‘important’, whichever comes first.
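
The steps above can be sketched as a single run in R. This is only an illustration, not the Boruta package's own code. It assumes the `randomForest` package is installed, uses the built-in `iris` data as a stand-in dataset, and prefixes the shuffled columns with `shadow_` (a naming choice made here). It marks a variable as a "hit" when its Z score beats the MZSA in this one run, whereas the real algorithm repeats the run many times and applies a statistical test before tagging anything.

```r
library(randomForest)  # assumption: the randomForest package is installed

set.seed(123)
x <- iris[, 1:4]   # predictors (stand-in data)
y <- iris$Species  # target

# Step 1: duplicate the predictors; with fewer than 5 columns,
# keep appending copies until there are at least 5
shadow <- x
while (ncol(shadow) < 5) shadow <- cbind(shadow, x)

# Step 2: shuffle each copy independently to break its link with y
shadow <- as.data.frame(lapply(shadow, sample))
names(shadow) <- paste0("shadow_", seq_along(shadow))

# Step 3: combine the originals with the shadow features
combined <- cbind(x, shadow)

# Steps 4-5: fit a random forest and extract Mean Decrease Accuracy.
# scale = TRUE divides each mean accuracy loss by its spread across trees
# (the randomForest documentation calls it a "standard error"),
# which gives a Z-score-like value.
rf <- randomForest(x = combined, y = y, ntree = 500, importance = TRUE)
z <- importance(rf, type = 1, scale = TRUE)[, 1]

# Step 6: maximum Z score among shadow attributes (MZSA)
is_shadow <- grepl("^shadow_", names(z))
mzsa <- max(z[is_shadow])

# Steps 7-8, simplified: originals that beat the MZSA in this single run
hits <- names(z)[!is_shadow & z > mzsa]
print(round(z, 2))
print(hits)
```

Running the block several times with different seeds shows why repetition matters: the shadow scores, and therefore the MZSA, change from run to run.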

Difference between Boruta and Random Forest Importance Measure

When I first learnt this algorithm, the question ‘RF importance measure vs. Boruta’ puzzled me for hours. After reading a lot about it, I figured out the exact difference between these two variable selection algorithms.

In random forest, the Z score is computed by dividing the average accuracy loss by its standard deviation, and it is used as the importance measure for all the variables. However, this Z score cannot be used directly to decide whether a variable is important, because it is not directly related to the statistical significance of the variable importance. To work around this problem, the Boruta package runs random forest on both the original and the randomised (shadow) attributes and computes the importance of all variables. Since the whole process depends on the permuted copies, the random permutation procedure is repeated to get statistically robust results.
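
As a small worked example of the Z score described here, the snippet below divides the mean accuracy loss by its standard deviation. The per-tree accuracy losses are made-up numbers chosen purely for illustration, and the variable name `acc_loss` is hypothetical.

```r
# Hypothetical accuracy losses for one variable across five trees
# (made-up values, for illustration only)
acc_loss <- c(0.04, 0.06, 0.05, 0.03, 0.07)

# Z score = mean accuracy loss / standard deviation of accuracy loss
z_score <- mean(acc_loss) / sd(acc_loss)
print(z_score)
```

A large value on its own does not say whether the variable is truly important, which is why Boruta compares it against the scores of the shadow attributes.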

Is Boruta a solution for all?

The answer is NO. You need to test other algorithms as well. It is not possible to judge the best algorithm without knowing the data and assumptions. Since Boruta is an improvement on the random forest variable importance measure, it should work well most of the time.

What are shuffled features or permuted copies?

It simply means changing the order of the values of a variable. See the practical example below –

set.seed(123)
mydata = data.frame(var1 = 1 : 6, var2=runif(6))
shuffle = data.frame(apply(mydata,2,sample))

```
Original         Shuffled
var1   var2    var1      var2
1    1 0.2875775    4 0.9404673
2    2 0.7883051    5 0.4089769
3    3 0.4089769    3 0.2875775
4    4 0.8830174    2 0.0455565
5    5 0.9404673    6 0.8830174
6    6 0.0455565    1 0.7883051
```
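
To see why shuffling is useful, the following sketch checks that a shuffled copy keeps exactly the same values while losing its correlation with the target. The data are simulated here for illustration, and the names `x`, `y` and `x_shuffled` are hypothetical.

```r
set.seed(123)

# Simulated predictor and a target that depends strongly on it
x <- 1:100
y <- 2 * x + rnorm(100)

# Shuffled (shadow) copy of the predictor
x_shuffled <- sample(x)

# Same set of values, only the order differs
print(identical(sort(x_shuffled), x))

# Correlation with the target before and after shuffling
print(cor(x, y))
print(cor(x_shuffled, y))
```

The original predictor is almost perfectly correlated with the target, while the shuffled copy should show a correlation close to zero.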

R: Feature Selection with the Boruta Package

1. Get Data into R

The read.csv() function is used to read data from a CSV file and import it into the R environment.
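
A minimal, self-contained sketch of `read.csv()` is shown below. Since no particular file is assumed here, it first writes a few rows of the built-in `iris` data to a temporary CSV file and then reads them back. The file and the object name `df` are placeholders for your own data.

```r
# Write a tiny CSV to a temporary file so the example runs anywhere
tmp <- tempfile(fileext = ".csv")
write.csv(head(iris), tmp, row.names = FALSE)

# Read the CSV into a data frame; stringsAsFactors = TRUE converts
# text columns to factors, which suits categorical variables
df <- read.csv(tmp, stringsAsFactors = TRUE)

# Inspect the structure of the imported data
str(df)
```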