How to impute missing values in Python?

Aude Sportisse

The problem of missing data is ubiquitous in the practice of data analysis. Imputation methods are among the main approaches for handling missing data. In this notebook, we first describe the main imputation methods available in Python packages, illustrated on synthetic data. Then, we compare them on synthetic data for different missing-data mechanisms and percentages of missing values. Finally, we propose a function that compares the methods in one particular setting (missing-data mechanism, percentage of missing values) for a list of (complete) real datasets.

Description of imputation methods on synthetic data

In this section we present some of the main Python classes and functions for imputing missing values (the list is of course not exhaustive), with links to tutorials where available, as well as a description of their main functionalities and reusable code. The goal is not to describe every method in detail, as many resources are already available, but rather to give an overview of several imputation options. The methods we focus on are gathered in the table below.

| Class (or function) | Data types | Underlying method | Imputation | Comments |
|---|---|---|---|---|
| SimpleImputer with strategy='mean' (default), sklearn.impute | quantitative | imputation by the mean | single | Easiest method |
| softimpute function (mimics R in Python) | quantitative | low-rank matrix completion | single | Strong theoretical guarantees, regularization parameter to tune |
| IterativeImputer with BayesianRidge (default), sklearn.impute | mixed | imputation by chained equations | single | Very flexible with respect to data types, no parameter to tune |
| IterativeImputer with ExtraTreesRegressor, sklearn.impute | mixed | random forests | single | Requires large sample sizes, no parameter to tune |
| Sinkhorn imputation | quantitative | optimal transport | single | |
| MIWAE | quantitative | deep latent variable model | single | |

Let us consider a Gaussian data matrix of size $n \times p$.

We introduce some missing values (here MCAR) in the data matrix using the function produce_NA given in the Python notebook How to generate missing values in Python?.
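As a minimal stand-in for produce_NA (whose full version handles several mechanisms), the following sketch draws a Gaussian matrix and masks entries uniformly at random, which corresponds to the MCAR mechanism:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5

# Gaussian data matrix of size n x p with correlated columns.
mean = np.zeros(p)
cov = 0.5 * np.ones((p, p)) + 0.5 * np.eye(p)  # unit variances, 0.5 correlations
X = rng.multivariate_normal(mean, cov, size=n)

# MCAR mask: each entry is missing independently with probability 0.2.
prop_missing = 0.2
mask = rng.uniform(size=(n, p)) < prop_missing
X_nan = X.copy()
X_nan[mask] = np.nan
```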

Imputation by the mean

The SimpleImputer class provides basic strategies for imputing missing values, such as imputation by the mean. This is a naive imputation, which serves as a benchmark in the sequel.
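A minimal usage example with scikit-learn; each missing entry is replaced by the mean of its column, computed over the observed entries only:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_nan = np.array([[1.0, 2.0],
                  [np.nan, 4.0],
                  [3.0, np.nan]])

# strategy='mean' is the default; fit_transform returns the completed matrix.
imp = SimpleImputer(strategy="mean")
X_imp = imp.fit_transform(X_nan)
# Observed column means: [2.0, 3.0]
```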

softimpute

The function softimpute (original article by Hastie et al.) can be used to impute quantitative data. The function coded here in Python mimics the function softImpute of the R package softImpute. It fits a low-rank matrix approximation to a matrix with missing values via nuclear-norm regularization. The main arguments are the following.

To calibrate the parameter lambda, one may perform cross-validation, implemented in the function cv_softimpute, which takes as arguments the dataset with missing values and the length of the grid on which cross-validation is performed.
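The notebook's softimpute and cv_softimpute functions are not reproduced here; as a rough sketch of the underlying algorithm of Hastie et al. (iterative SVD with soft-thresholding of the singular values), one could write:

```python
import numpy as np

def softimpute_sketch(X_nan, lam, n_iter=200):
    """Simplified sketch of softImpute, not the notebook's implementation:
    alternately compute a soft-thresholded SVD of the current completion and
    restore the observed entries, until (approximate) convergence."""
    mask = np.isnan(X_nan)
    X_hat = np.where(mask, 0.0, X_nan)  # initialize missing entries at 0
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X_hat, full_matrices=False)
        s_thr = np.maximum(s - lam, 0.0)  # soft-threshold the singular values
        Z = (U * s_thr) @ Vt              # low-rank reconstruction
        X_hat = np.where(mask, Z, X_nan)  # keep observed entries fixed
    return X_hat
```

The soft-thresholding level lam plays the role of the regularization parameter that cv_softimpute calibrates.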

Iterative chained equations

Iterative chained equations methods perform (iterative) imputation using conditional expectations. The IterativeImputer class provides such methods; it is inspired by the mice package in R but differs from it by returning a single imputation instead of multiple imputations.

The main arguments are

The fit_transform method fits the imputer on the incomplete matrix and returns the completed matrix.
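A minimal usage example; note that IterativeImputer is still experimental in scikit-learn and must be enabled explicitly:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=100)  # make the last column predictable
X_nan = X.copy()
X_nan[rng.uniform(size=X.shape) < 0.2] = np.nan

# Each incomplete feature is iteratively regressed on the other features
# (BayesianRidge by default); fit_transform returns the completed matrix.
imp = IterativeImputer(max_iter=10, random_state=0)
X_imp = imp.fit_transform(X_nan)
```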

Another estimator can be used: the ExtraTreesRegressor estimator, which iteratively trains random forests instead of performing iterative regressions, and mimics the missForest package in R. ExtraTreesRegressor fits a number of randomized extra-trees and averages the results. It comes from the module sklearn.ensemble. Its main arguments are the number of trees in the forest and the random state, which controls the sources of randomness.
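A short sketch of this missForest-style variant, obtained by plugging ExtraTreesRegressor into IterativeImputer:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor

X_nan = np.array([[1.0, 2.0, 3.0],
                  [4.0, np.nan, 6.0],
                  [7.0, 8.0, 9.0],
                  [10.0, 11.0, np.nan]])

# Iterative imputation where each conditional model is a forest of
# randomized extra-trees (n_estimators trees, averaged predictions).
imp = IterativeImputer(
    estimator=ExtraTreesRegressor(n_estimators=10, random_state=0),
    random_state=0,
)
X_imp = imp.fit_transform(X_nan)
```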

Sinkhorn imputation

Sinkhorn imputation can be used to impute quantitative data. It relies on the idea that two batches extracted randomly from the same dataset should share the same distribution, and consists in minimizing optimal transport (OT) distances between batches. More details can be found in the original article and the code is provided here.

The main arguments are

To set the regularization parameter, one uses the function pick_epsilon, which takes a multiple of the median distance. The fit_transform method fits the imputer on the incomplete matrix and returns the completed matrix.

MIWAE

MIWAE imputes missing values with a deep latent variable model, based on importance-weighted variational inference. The original article is here and its code is available here.

The main arguments are

Numerical experiments to compare the different methods

Synthetic data

We compare the methods presented above for different percentages of missing values and different missing-data mechanisms:

We compare the methods in terms of mean squared error (MSE), i.e.: $$MSE(X^{imp}) = \frac{1}{n_{NA}}\sum_{i}\sum_{j} 1_{X^{NA}_{ij}=NA}(X^{imp}_{ij} - X_{ij})^2$$ where $n_{NA} = \sum_{i}\sum_{j} 1_{X^{NA}_{ij}=NA}$ is the number of missing entries in $X^{NA}$.
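This criterion can be computed directly from the imputed matrix, the true matrix and the missing-data mask; a small helper (the name mse_on_missing is ours, not the notebook's):

```python
import numpy as np

def mse_on_missing(X_imp, X_true, mask):
    """MSE restricted to the imputed entries (mask is True where X was missing)."""
    return float(((X_imp - X_true)[mask] ** 2).mean())

X_true = np.array([[1.0, 2.0], [3.0, 4.0]])
X_imp = np.array([[1.0, 2.0], [5.0, 4.0]])
mask = np.array([[False, False], [True, False]])
# Only the masked entry contributes: (5 - 3)^2 / 1 = 4
```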

Note that in order to evaluate this error, we need to know the true values of the missing entries.

The function how_to_impute compares the methods above with the naive imputation by the mean in terms of MSE on a complete dataset. More precisely, the function introduces missing values into the complete dataset for different percentages of missing values and missing-data mechanisms, and returns the MSE of each method in each missing-value setting. The final MSE for one specific setting is computed by aggregating the MSEs obtained over several simulations, where the stochasticity comes from drawing the missing-data pattern several times.

The arguments are the following.

It returns a table containing the mean of the results over the simulations performed.
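The exact signature of how_to_impute is given in the notebook; a simplified sketch of its logic (MCAR only, hypothetical name compare_imputers) could look like this:

```python
import numpy as np
from sklearn.impute import SimpleImputer

def compare_imputers(X, imputers, prop_list, n_rep=5, seed=0):
    """Simplified, MCAR-only sketch of how_to_impute: for each proportion of
    missing values, draw n_rep missing-data patterns, impute with each method
    and average the MSE computed on the masked entries."""
    rng = np.random.default_rng(seed)
    results = {name: {} for name in imputers}
    for prop in prop_list:
        errs = {name: [] for name in imputers}
        for _ in range(n_rep):
            mask = rng.uniform(size=X.shape) < prop  # MCAR pattern
            X_nan = np.where(mask, np.nan, X)
            for name, imp in imputers.items():
                X_imp = imp.fit_transform(X_nan)
                errs[name].append(((X_imp - X)[mask] ** 2).mean())
        for name in imputers:
            results[name][prop] = float(np.mean(errs[name]))
    return results

# imputers maps a method name to any object exposing fit_transform.
res = compare_imputers(np.random.default_rng(1).normal(size=(100, 3)),
                       {"mean": SimpleImputer()}, prop_list=[0.1, 0.3], n_rep=2)
```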

Real datasets

We will now compare the methods on real complete datasets taken from the UCI repository, in which we will introduce missing values. In the present workflow, we propose a selection of several datasets (here, the datasets contain only quantitative variables):

But you can test the methods on any complete dataset you want.

You can choose to scale the data prior to running the experiments, so that the variables have the same weight in the analysis. Scaling may be performed on complete datasets but is more difficult for incomplete datasets. (For MCAR values, the estimation of the standard deviation can be unbiased; for MNAR values, however, the estimators will suffer from biases.)
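On complete data, scaling can be done with scikit-learn's StandardScaler, for instance:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

# Center each variable and scale it to unit variance, so that all variables
# carry the same weight in the imputation error.
X_scaled = StandardScaler().fit_transform(X)
```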

We can then apply the how_to_impute_real function. It compares, in terms of MSE, several imputation methods on different complete datasets in which missing values are introduced with a given percentage of missing values and a given missing-data mechanism.

The arguments are the following.

It returns a table containing the mean of the MSEs over the simulations performed.