Trimming Away Redundant Variables

Jeremy Pagirsky
4 min read · Mar 27, 2021
Photo by Claudio Schwarz | @purzlbaum on Unsplash

If you’ve been assigned a dataset for work, school, or a competition, or have come across data for a personal project, that first moment when Pandas reads the data in is as exciting as it is overwhelming. Perhaps one of the things creating this cacophony of emotions is the first glance at your variables. While all the columns probably exist for good reason, many of them can still be redundant. For machine learning models, this can make feature selection a challenge. Here are a few considerations that have helped me weed through (more tree stuff to come) redundant variables in order to select features for a classification model.

Research your Data

This dataset is from a DrivenData competition (active until 11/1/21) on predicting water well functionality in Tanzania. Datasets from competitions and/or Kaggle often have a section dedicated to column descriptions. If one is included on the competition page, or a thoughtful data science comrade has created a reliable list, please create a separate text or markdown file where you can easily access it. You can then sort the columns into separate categories. For this dataset, I compartmentalized the columns into geographic, infrastructural, political, and bureaucratic categories. This is not very systematic and can vary from person to person, but hey, whichever works best for you is all that matters!
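To make that concrete, here is a rough sketch of how I keep track of those categories in code. It assumes df is the dataframe you have already read in, and the column lists below are illustrative examples rather than the dataset’s full schema:

# Illustrative grouping: these column names are examples, not the full list
column_categories = {
    "geographic": ["longitude", "latitude", "gps_height", "basin", "region"],
    "infrastructural": ["extraction_type", "waterpoint_type", "construction_year"],
    "political": ["region_code", "district_code", "lga", "ward"],
    "bureaucratic": ["funder", "installer", "permit", "public_meeting"],
}

# Pull out one category at a time to review it against the project's scope
geo_df = df[column_categories["geographic"]]
print(geo_df.describe(include="all"))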

From here, I’d recommend going down the list by category and selecting the columns that best fit the scope of the project, based on their descriptions. This brings me to my next tip: having a thorough understanding of the business problem.

Understanding the Business Problem

In some cases, this can be very specific, shaped by particular research aims and by what has already been decided within the company; if so, much of the variable selection process may have been done for you. If the project is open-ended, deciding clearly what kind of data (i.e., which of your categories) a potential solution to the business problem can be elucidated from will help streamline the process. Doing research on the subject of interest (I had very little knowledge of Tanzania) and allowing your own personal interests to put wind in your sails will, most importantly, make the work enjoyable. Allow your interests to guide your decisions.

MissingNo

If and when you have missing data, you can certainly use df.describe() and df.isna().sum() for initial exploration. However, a neat trick that also tickled my love for Pokémon was using MissingNo. It helps you identify patterns in the missing data, so you have a clearer picture of how null values are distributed. Below are the matrix and heatmap methods. The matrix shows you where the missing data is located with respect to the other columns. The heatmap shows the strength of column “nullity” correlation, like a correlation matrix but strictly for null values.

import missingno as msno

# 59,400 is the size of the dataframe, so this samples every row
msno.matrix(df.sample(59400))
Looks (almost) like my childhood icon

A value at or just below 1 means that if one column has a null value at a particular observation, the corresponding column will almost always have a null value there too. A value at or just above -1 means the opposite: if a variable is null at one observation, the corresponding variable will practically never be null there. Based on the information you gather, you can decide how you want to handle your null values. All of that said, you do not necessarily have to drop columns if you believe they are salvageable and can add meaningful information to the project; imputation helps!

msno.heatmap(df, figsize=(10,6))
MissingNo, but make it heatmap
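If you do decide to keep a column with nulls, here is a minimal imputation sketch. The column names and fill strategies below are placeholders, so swap in whatever the matrix and heatmap suggested for your own data:

from sklearn.impute import SimpleImputer

# Placeholder examples: fill a categorical column with a label,
# and a couple of numeric columns with their medians
df["funder"] = df["funder"].fillna("unknown")

num_cols = ["construction_year", "population"]
imputer = SimpleImputer(strategy="median")
df[num_cols] = imputer.fit_transform(df[num_cols])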

Modeling Techniques

For machine learning models, it is imperative to check your columns for multicollinearity. Since this check is practically built into your workflow anyway, removing collinear variables will reduce the number of predictors you are handling while improving your model. Big win, imo.

import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(20, 12))
sns.heatmap(df2.corr(), vmin=-1, vmax=1, cmap='RdBu', annot=True)
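One way to act on that heatmap, rather than eyeballing it, is to flag pairs above a threshold and drop one column from each pair. The 0.8 cutoff below is just a common rule of thumb, not a hard rule:

import numpy as np

corr = df2.corr().abs()
# Keep only the upper triangle so each pair is counted once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
df2 = df2.drop(columns=to_drop)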

For many machine learning models, you can visualize the importance the model assigns to each of your features. You can do this with a tree-based or ensemble model, for example.

import numpy as np

def plot_feature_importances(model):
    # X is the feature dataframe the model was trained on
    n_features = X.shape[1]
    plt.figure(figsize=(8, 8))
    plt.barh(range(n_features), model.feature_importances_, align='center')
    plt.yticks(np.arange(n_features), X.columns.values)
    plt.xlabel("Feature importance")
    plt.ylabel("Feature")

Source code: Catherine Wolk
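For context, this is roughly how I call it. The RandomForestClassifier here is just one example of an ensemble model that exposes feature_importances_; any fitted tree-based estimator would work:

from sklearn.ensemble import RandomForestClassifier

# X and y are the feature matrix and target from earlier preprocessing
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)
plot_feature_importances(rf)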

One More Thing

If you are anything like me, you may have spent lots of time staring into the deep abyss of your raw data. A little birdie once told me, “all models are wrong, but some are useful.” Keeping this in mind has really helped me take the pressure off of picking the “perfect” features. I hope these tips, and staying focused on your interests, can do the same for you.
