Imputation with Machine Learning Algorithms

Sean Turner
5 min read · Sep 6, 2017

Imputation of null values with a machine learning algorithm is a rather interesting technique, and offers greater mileage compared to imputing null values with the mean or mode for continuous or categorical data, respectively. Imputation with machine learning involves the following:

  1. Fit and test a model on your target variable.
  2. Split your data on the feature you want to impute: rows where that feature is null, and rows where it is not. This feature is now your new target.
  3. Fit a second model on the subset of rows without null values, classifying or predicting the target feature (depending on whether the feature is categorical or continuous).
  4. Apply the second model to the subset of the data where the target feature consists only of null values, and predict replacements for the nulls. Note that you are not re-training the model, only testing it.
  5. Impute the null values in your original data with those predictions, then retrain and test a model on your original target variable (with imputed values).
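The five steps above can be sketched end-to-end on a toy DataFrame. The column names (`vmail_plan`, `feature_a`, and so on) are illustrative stand-ins, not the actual churn columns:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature_a": rng.normal(size=100),
    "feature_b": rng.normal(size=100),
    "vmail_plan": rng.integers(0, 2, size=100).astype(float),
})
# Knock out 20% of the target feature to simulate nulls
df.loc[df.sample(frac=0.2, random_state=0).index, "vmail_plan"] = np.nan

# Step 2: split into rows with and without nulls in the feature to impute
known = df[df["vmail_plan"].notna()]
unknown = df[df["vmail_plan"].isna()]

# Step 3: fit a classifier on the rows where the feature is known
features = ["feature_a", "feature_b"]
clf = RandomForestClassifier(random_state=0)
clf.fit(known[features], known["vmail_plan"])

# Step 4: predict (not re-train) on the rows with nulls
predicted = clf.predict(unknown[features])

# Step 5: impute the predictions back into the original frame
df.loc[unknown.index, "vmail_plan"] = predicted
```

After the last line, the frame has no remaining nulls and a model on the original target can be retrained on the full data.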

I will provide an example of this process using a dataset on cellphone churn, that is, the percentage of subscribers in a given time frame who cease to use the company's services for one reason or another.

Unfortunately, imputation didn’t really add anything to my model, and I will be cutting things short around step five. However, I still really wanted to write this post as I think the subject is very interesting, and something which I am interested in exploring further.

The dataset has 3333 rows, 20 columns, and 400 null values in the same rows across the voicemail plan and voicemail message columns.

In order to begin fitting a model, I first need to encode categorical variables as integers or as dummies, depending on how many outcomes there are. Once the categorical variables are properly encoded, I can go ahead and train a model predicting the churn column.
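In pandas, that encoding might be sketched as follows; the column names and values are assumptions for illustration, not necessarily the dataset's actual columns:

```python
import pandas as pd

df = pd.DataFrame({
    "intl_plan": ["yes", "no", "no", "yes"],  # two outcomes -> single 0/1 integer
    "state": ["KS", "OH", "NJ", "KS"],        # many outcomes -> dummy columns
})

# Binary categorical: map to an integer column
df["intl_plan"] = (df["intl_plan"] == "yes").astype(int)

# Multi-level categorical: expand into one dummy column per level
df = pd.get_dummies(df, columns=["state"])
print(df.columns.tolist())
# ['intl_plan', 'state_KS', 'state_NJ', 'state_OH']
```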

Initially, all of the null values are dropped. The goal is to classify if a customer churns or not on rows without null values. Then, once the null values are imputed, I will go back and retrain the model on the entirety of the data.

The first step is to standardise the data using z-scoring. Beyond that, because the goal is to predict churn, this is a classification problem.

As such, I will be utilising three classification algorithms: K-nearest neighbours, random forests, and gradient boosting. The optimal number of neighbours, k, is selected with a simple for loop.
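The k-selection loop might look like this sketch, here run on a synthetic dataset rather than the churn data, scoring each candidate k with cross-validated accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data; in the post this would be the scaled churn features
X, y = make_classification(n_samples=500, random_state=0)

# Score each candidate k with 5-fold cross-validated accuracy
scores = {}
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```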

Interpreting the results against the baseline, the cross-validated Gradient Boosting Classifier performed the best.

Note that this performance could perhaps be improved outside of the model by changing the classification threshold to reduce the number of false positives.

Now that I have a model I can go ahead and impute the two voicemail columns. The first step is to subset the data into rows with null values and rows without null values.

Fortunately, the null values are in the same rows across the voicemail columns. This means the data need only be subset by one of the variables to be imputed.
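That subsetting step can be sketched like so; the voicemail column names are assumptions about the dataset:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "vmail_plan": [1.0, np.nan, 0.0, np.nan],
    "vmail_messages": [25.0, np.nan, 0.0, np.nan],
})

# One mask covers both columns, since their nulls fall on the same rows
mask = df["vmail_plan"].isna()
no_nulls = df[~mask]
only_nulls = df[mask]

# The null rows in one voicemail column are null in the other too
assert only_nulls["vmail_messages"].isna().all()
```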

Regardless, the next step is to train a model on the subset of data that contains no null values, with the voicemail plan column as the target.

Considering that only one of the models is barely above the baseline, these models unfortunately do not perform as well as the previous ones. Either way, the next step is to apply the KNN model to the rows consisting only of null values. Additionally, when standardising the data, the null-only rows need to use the standardisation fit from the no-null rows. The thought process is to treat the no-null rows as training data, and the null-only rows as testing data (in an unsupervised manner of speaking).
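A sketch of that train/test treatment, with synthetic arrays standing in for the no-null and null-only subsets: the scaler is fit only on the "training" rows, and the null-only rows are transformed (never fit) with it before prediction.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_known = rng.normal(size=(80, 3))            # features of the no-null rows
y_known = rng.integers(0, 2, size=80)         # known voicemail-plan labels
X_unknown = rng.normal(size=(20, 3))          # features of the null-only rows

# Fit the scaler on the no-null rows only
scaler = StandardScaler().fit(X_known)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(scaler.transform(X_known), y_known)

# transform (not fit_transform) the null-only rows with the same scaler
preds = knn.predict(scaler.transform(X_unknown))
print(preds.shape)  # (20,)
```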

The model ended up predicting all zeros, or no voicemail plan in the context of the feature. However, this is not necessarily a bad thing. The main goal of imputation is to find replacements for null values that are better than the mean or mode. Additionally, imputation is especially helpful when dealing with smaller datasets, where the mean or mode would be less reliable.

Imputation of the voicemail message column would follow the exact same steps as above, with one minor change. Because the voicemail message column is a continuous variable, imputation relies on regression algorithms instead of classification algorithms. Unfortunately, the resulting R² value is very small, and shows that voicemail message imputation is mostly unreliable. Either way, while imputation didn't entirely end up paying off this time, I don't doubt that I will end up finding a good use for it in the near future.
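For completeness, the regression variant of the pipeline, scored with R², might be sketched on synthetic data like this:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Stand-in continuous target; in the post this would be voicemail messages
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

reg = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)
score = r2_score(y_te, reg.predict(X_te))
print(round(score, 3))
```

An R² near 1 would mean the predicted imputations track the held-out values well; a value near zero (as the post reports for voicemail messages) means the imputation is little better than predicting the mean.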
