In this post we will discuss two feature selection algorithms: Forward Feature Selection and Boruta.
In Part 1, we will cover the following topics:
Why feature selection?
What is forward feature selection?
What is the downside of forward feature selection?
Feature selection is an important procedure in machine learning model training. Its main purposes are to:
Improve prediction accuracy
Reduce model complexity
Increase model interpretability
Remove irrelevant features or noise in the model
In essence, forward feature selection is a stepwise subset selection method “in which we start with having no feature in the model…[and] we keep adding the feature which best improves our model till an addition of a new variable does not improve the performance of the model.”
The picture below illustrates the forward selection process:
The 'most significant variable' is usually defined as the candidate feature whose addition brings the largest incremental improvement in the chosen evaluation/training metric, while the stopping rule can be, for example, a minimum metric score the model aims to reach or a maximum number of features the subset is capped at.
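As a minimal sketch of this procedure (assuming a scikit-learn-style estimator and cross-validated accuracy as the metric; the function name forward_select and the stopping parameters max_features and min_improvement are illustrative, not part of any library API), forward selection can be written as a simple loop:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_select(X, y, estimator, max_features=10, min_improvement=1e-3):
    remaining = list(range(X.shape[1]))   # candidate feature indices
    selected, best_score = [], -np.inf
    while remaining and len(selected) < max_features:
        # Score every remaining candidate when added to the current subset.
        scores = {
            f: cross_val_score(estimator, X[:, selected + [f]], y, cv=5).mean()
            for f in remaining
        }
        candidate, score = max(scores.items(), key=lambda kv: kv[1])
        # Stopping rule: the best addition must improve the metric enough.
        if score - best_score < min_improvement:
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = score
    return selected, best_score

X, y = load_breast_cancer(return_X_y=True)
subset, score = forward_select(X, y, LogisticRegression(max_iter=5000))
print("selected feature indices:", subset, "CV accuracy:", round(score, 4))
```

For real projects, scikit-learn's SequentialFeatureSelector (with direction='forward') implements the same idea with more options.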
Compared to the ‘best subset’ selection method, which searches for the best subset by testing all possible feature combinations in the entire feature space, stepwise forward selection is a relatively computationally efficient method.
To understand this in concrete terms: if your model has 100 features to choose from, an exhaustive search for the best subset would require fitting 2^100 (2^p) models, an astronomical number of more than 1.26e+30.
In contrast, if stepwise forward selection is used to choose 10 features out of the 100, we only need to fit approximately 1045 models, as shown by the summation below, where k can be understood as the number of features already selected.
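Written out, the summation can be reconstructed as follows (this form is an assumption chosen to match the 1045 figure, with k running from 0, when no feature has been selected, up to 10, the final pass that confirms no 11th feature helps):

$$\sum_{k=0}^{10} (100 - k) = 100 + 99 + \dots + 90 = 1045$$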
However, forward selection has a few main downsides:
1. Given its greedy nature, every step in the additive process is based on the current state of the already-selected features. For example, with a set of three predictors (x1, x2, x3), if x1 is chosen at Step 1, the algorithm can only add x2 or x3 during Step 2. There is no way for the algorithm to choose the combination (x2, x3), even if it outperforms both (x1, x2) and (x1, x3); see the toy sketch after this list.
2. As features are selected based on the “greatest additional improvement” criterion, features compete among themselves to be selected. Even if x2 and x3 both have great predictive power on the response, only one of them can be chosen at each step. And once one is chosen, the other may never make it into the selection basket, due to the greedy nature mentioned in point 1.
3. The stopping threshold or criterion in the selection process blinds us from seeing the performance change or potential gain beyond the stopping point. For example, suppose 10 features were selected because no candidate 11th feature could improve the training metric by the desired minimum amount. Hypothetically speaking, however, the 12th or 13th feature might still bring a substantial improvement worth increasing the size of the subset for. Stepwise forward selection never gives us the chance to see and evaluate that possibility.
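To make point 1 concrete, here is a hypothetical toy setup (not from the original post): y is the XOR of x2 and x3, so that pair is highly predictive while each member is useless on its own, and x1 is only a weak, noisy copy of y. Step 1 of forward selection picks x1 as the best single feature, so the winning pair (x2, x3) is never evaluated on its own:

```python
import numpy as np
from itertools import combinations
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 2000
x2 = rng.integers(0, 2, n)
x3 = rng.integers(0, 2, n)
y = x2 ^ x3                                     # only the (x2, x3) pair determines y
x1 = np.where(rng.random(n) < 0.6, y, 1 - y)    # weak stand-alone predictor (~60% accurate)
X = np.column_stack([x1, x2, x3])

def cv_score(cols):
    model = DecisionTreeClassifier(max_depth=3, random_state=0)
    return cross_val_score(model, X[:, cols], y, cv=5).mean()

# Step 1: x1 wins because it is the best single feature (~0.6 vs ~0.5).
print({f"x{i + 1}": round(cv_score([i]), 3) for i in range(3)})
# Exhaustive pairs: (x2, x3) is by far the best subset of size 2,
# yet the greedy path can only ever try (x1, x2) or (x1, x3).
print({f"x{i + 1},x{j + 1}": round(cv_score([i, j]), 3)
       for i, j in combinations(range(3), 2)})
```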
REFERENCES:
sauravkaushik8. (2016, December 1). Introduction to Feature Selection methods with an example (or how to select the right variables?). Retrieved from AnalyticsVidhya: https://www.analyticsvidhya.com/blog/2016/12/introduction-to-feature-selection-methods-with-an-example-or-how-to-select-the-right-variables/
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2017). An Introduction to Statistical Learning with Applications in R. Springer.