To demonstrate the automated feature selection methods in Python we will use the diabetes dataset. Import the diabetes CSV file into a pandas data frame as below:
import pandas as pd
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
df_diabetes = pd.read_csv('diabetes.csv', names = names)
df_diabetes.head()
You should see the data frame as below:
10 of the most useful feature selection methods in Machine Learning with Python are described below, along with the code to automate each of them.
1) Remove features with low variance
The first feature elimination method we can use is to remove features with low variance. The thinking behind this is that features with low variance, i.e. features which mostly remain at the same level across different observations, are unlikely to explain differences in the response.
Select X as the set of all features, and Y as the response variable. Here, the response variable Y is class.
X = df_diabetes[['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']]
Y = df_diabetes['class']
We now create a VarianceThreshold object from sklearn, with a variance threshold of 0.3 (i.e. features whose variance is below 0.3 will be removed). Next we fit the VarianceThreshold object with the feature matrix X and the response variable Y.
from sklearn.feature_selection import VarianceThreshold
var = VarianceThreshold(threshold=0.3)
var = var.fit(X,Y)
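If you want to see why a particular column gets dropped, the fitted selector exposes the per-feature variances through its variances_ attribute; a minimal sketch:
# Inspect the per-feature variances compared against the 0.3 threshold
pd.Series(var.variances_, index=X.columns)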
With the .get_support method of sklearn feature selectors we see that the feature with index 6 in the feature matrix has been eliminated.
cols = var.get_support(indices=True)
cols
#Output
array([0, 1, 2, 3, 4, 5, 7], dtype=int64)
If we now look at the feature names, we see that the feature "pedi" has been eliminated from the feature matrix.
features = X.columns[cols]
features
#Output
Index(['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'age'], dtype='object')
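To obtain the reduced feature matrix itself, you can either subset X by these column names or call the selector's transform method; a minimal sketch:
# Build the reduced feature matrix containing only the retained columns
X_reduced = pd.DataFrame(var.transform(X), columns=features)
X_reduced.head()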
2) Remove features which are not correlated with the response variable
For this we can use the Pearson correlation to find the features which have some degree of correlation with the response variable. Note, however, that although a feature may not be correlated with the response variable on its own, it may still interact with other variables in a way that helps explain the response. Strictly speaking, eliminating uncorrelated variables only makes sense if the variable is independent of all other variables. Another caveat is that a variable may show low correlation with the response because it has been confounded by another variable. So removing variables based on univariate correlations should only be done when it has been shown that keeping the variable does not significantly improve model performance.
In our example, we can calculate the correlations and visualize them as below. Note that we are using the matplotlib and seaborn libraries to plot the correlation matrix as a heat map, so install these libraries beforehand:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(12,12))
cor = df_diabetes.corr()
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()
The heat map would look like below:
To automate this method we can create a filter that selects only those features whose correlation coefficient with the response is above a threshold (just like the variance threshold). In our case we use a threshold of 0.1. The code for the filter would look like below:
# Consider correlations only with the target variable
cor_target = abs(cor['class'])
# Select features whose absolute correlation with the target is above the threshold
features = cor_target[cor_target>0.1]
features
The selected features produced with the above code are shown with their correlation coefficients.
preg 0.221898
plas 0.466581
test 0.130548
mass 0.292695
pedi 0.173844
age 0.238356
class 1.000000
Name: class, dtype: float64
You can then use the features.index attribute to extract the selected feature column names, which you can later use on the data frame to create the feature matrix of the selected features.
features.index
#Output
Index(['preg', 'plas', 'test', 'mass', 'pedi', 'age', 'class'], dtype='object')
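Note that 'class' itself appears in this list (its correlation with itself is 1.0), so drop it before building the feature matrix; a minimal sketch:
# Drop the response variable itself, then subset the data frame
selected = features.index.drop('class')
X_corr = df_diabetes[selected]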
3) K-Best (SelectKBest)
As per sklearn, this method removes all but the k highest scoring features, where the score is based on univariate statistical tests. In the example below we use the chi-squared (chi2) scoring function.
As before, we first create an object of the SelectKBest class with k = 5, i.e. we want to select the 5 best scoring features, and with chi2 as the score function. Next we fit the KBest object with the feature matrix X and the response variable Y.
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
KBest = SelectKBest(score_func = chi2, k = 5)
KBest = KBest.fit(X,Y)
We can get the scores of all the features from the .scores_ attribute of the KBest object, and similarly the p-values from .pvalues_. We can combine these in a dataframe called df_scores.
df_scores = pd.DataFrame({'features': X.columns, 'Chi2Score': KBest.scores_, 'pValue': KBest.pvalues_ })
df_scores
The scores look like below:
#Output
features Chi2Score pValue
0 preg 111.519691 4.552610e-26
1 plas 1411.887041 5.487286e-309
2 pres 17.605373 2.718193e-05
3 skin 53.108040 3.156977e-13
4 test 2175.565273 0.000000e+00
5 mass 127.669343 1.325908e-29
6 pedi 5.392682 2.022137e-02
7 age 181.303689 2.516388e-41
Since we are interested in only 5 of the best scoring features, we can get the feature name list as below:
cols = KBest.get_support(indices=True)
cols
#Output
array([0, 1, 4, 5, 7], dtype=int64)
features = X.columns[cols]
features
#Output
Index(['preg', 'plas', 'test', 'mass', 'age'], dtype='object')
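These five names correspond to the five largest chi-squared scores in df_scores, which you can confirm by sorting the score table; a minimal sketch:
# Rank features by chi-squared score; the top five match the selected set
df_scores.sort_values(by='Chi2Score', ascending=False).head(5)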
4) SelectPercentile
This is a modification of the K-Best feature selection technique where we keep the top x percentile of the best scoring features. So in our example, if x is 80, we want to select the features whose scores fall in the top 80th percentile.
The code is very similar to that of SelectKBest, with the only change being that we provide a percentile when creating the SelectPercentile object.
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import chi2
SPercentile = SelectPercentile(score_func = chi2, percentile=80)
SPercentile = SPercentile.fit(X,Y)
We see that with the top 80th percentile of the best scoring features, we end up with one additional feature, 'skin', compared to the K-Best method.
cols = SPercentile.get_support(indices=True)
cols
features = X.columns[cols]
features
#Output
Index(['preg', 'plas', 'skin', 'test', 'mass', 'age'], dtype='object')
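To see exactly what the looser percentile cut-off adds over SelectKBest, you can compare the two selected sets directly; a minimal sketch:
# The only difference between the two selectors on this data is the 'skin' column
set(X.columns[SPercentile.get_support(indices=True)]) - set(X.columns[KBest.get_support(indices=True)])
#Output
{'skin'}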
5) Sequential Feature Selectors (Step-Wise Forward Selector)
To use the sequential step-wise forward feature selector we need a separate library called 'mlxtend'. Install this library from PyPI. As per the official package documentation, the goal of feature selection is two-fold: to improve computational efficiency, and to reduce the generalization error of the model by removing irrelevant features or noise. Step-wise (sequential) feature selectors achieve this by adding or removing one feature at a time, based on the classifier/estimator performance, until a feature subset of the desired size k is reached.
Here we first use the forward selector. The code is very similar to what we have used above. We first create an object of the SequentialFeatureSelector class from mlxtend.feature_selection. Because these feature selectors work by evaluating model performance for every added or removed feature, we need an estimator; here we use the RandomForestClassifier from sklearn. Next we fit the selector object with the feature matrix and the response variable. We want to choose the best combination of 6 variables, so we set k_features = 6 in the StepForward selector object. We also want to cross-validate model performance for every combination of variables, so we provide a cv value of 5 for 5-fold cross-validation (5 is also the default; set cv to 0 if you want to skip cross-validation). For the SequentialFeatureSelector object to work as a forward selector, we set forward to True.
from mlxtend.feature_selection import SequentialFeatureSelector as sfs
from sklearn.ensemble import RandomForestClassifier
# Estimator
estimator = RandomForestClassifier(n_estimators=10, n_jobs=-1)
# Step Forward Feature Selector
StepForward = sfs(estimator,k_features=6,forward=True,floating=False,verbose=2,scoring='accuracy',cv=5)
StepForward.fit(X,Y)
When the above code is run, the selector prints, while fitting the data, the best combination of features and the model performance for each number of features until the k_features value is reached.
Via the .subsets_ attribute, you can take a look at the selected feature indices at each step.
StepForward.subsets_
So you could look at the best features for each step of the iteration as below:
#Output
{1: {'feature_idx': (1,),
'cv_scores': array([0.70779221, 0.68181818, 0.67532468, 0.7254902 , 0.76470588]),
'avg_score': 0.7110262286732875,
'feature_names': ('plas',)},
2: {'feature_idx': (1, 5),
'cv_scores': array([0.72727273, 0.67532468, 0.77272727, 0.75163399, 0.70588235]),
'avg_score': 0.7265682030387912,
'feature_names': ('plas', 'mass')},
3: {'feature_idx': (0, 1, 5),
'cv_scores': array([0.75324675, 0.67532468, 0.76623377, 0.75816993, 0.73202614]),
'avg_score': 0.7370002546473134,
'feature_names': ('preg', 'plas', 'mass')},
4: {'feature_idx': (0, 1, 5, 6),
'cv_scores': array([0.75324675, 0.7012987 , 0.74675325, 0.79084967, 0.73856209]),
'avg_score': 0.7461420932009167,
'feature_names': ('preg', 'plas', 'mass', 'pedi')},
5: {'feature_idx': (0, 1, 2, 5, 6),
'cv_scores': array([0.72727273, 0.76623377, 0.76623377, 0.80392157, 0.74509804]),
'avg_score': 0.7617519735166794,
'feature_names': ('preg', 'plas', 'pres', 'mass', 'pedi')},
6: {'feature_idx': (0, 1, 2, 5, 6, 7),
'cv_scores': array([0.73376623, 0.72077922, 0.72727273, 0.81699346, 0.73202614]),
'avg_score': 0.7461675579322637,
'feature_names': ('preg', 'plas', 'pres', 'mass', 'pedi', 'age')}}
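If you prefer a visual summary of these per-step cross-validation scores, mlxtend also ships a plotting helper (this assumes matplotlib is installed, as in the heat map example earlier); a minimal sketch:
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import matplotlib.pyplot as plt
# Plot the average CV score (with standard-deviation bands) at each step
plot_sfs(StepForward.get_metric_dict(), kind='std_dev')
plt.title('Step-wise forward selection')
plt.show()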
You can look at the indices of the selected feature columns through the k_feature_idx_ attribute.
# Selected feature columns
cols = list(StepForward.k_feature_idx_)
cols
#Output
[0, 1, 2, 5, 6, 7]
You could use these indices to find the feature names as before.
features = X.columns[cols]
features
#Output
Index(['preg', 'plas', 'pres', 'mass', 'pedi', 'age'], dtype='object')
You could also use the built-in k_feature_names_ attribute to look at the feature names directly.
StepForward.k_feature_names_
#Output
('preg', 'plas', 'pres', 'mass', 'pedi', 'age')
Lastly, you can get the best score of the model with the selected features from the .k_score_ attribute.
StepForward.k_score_
The score in our case is below:
0.7474238180120534
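Once you are happy with the selected subset, the fitted selector can also reduce the feature matrix for downstream modelling; a minimal sketch:
# Reduce X to the six selected columns and refit the estimator on them
X_selected = StepForward.transform(X)
estimator.fit(X_selected, Y)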
6) Sequential Feature Selectors (Step-Wise Backward Selector) + Automatic best number of features
The backward method is exactly the same as above, except that we start with all the features and then remove features until we find the best combination of k_features. The advantage with this method is that if the k_number of ...
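Since the setup mirrors the forward case, a minimal sketch of the backward variant is shown below; it reuses the same estimator and data, sets forward to False, and (in line with the heading) passes k_features='best' so that mlxtend picks the best-scoring subset size automatically:
# Step-backward selection: start from all 8 features and remove one at a time;
# k_features='best' returns the feature subset with the best CV score
StepBackward = sfs(estimator, k_features='best', forward=False, floating=False, verbose=2, scoring='accuracy', cv=5)
StepBackward = StepBackward.fit(X, Y)
StepBackward.k_feature_names_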