How to predict with missing values in Python?

Aude Sportisse

Missing values occur in many applications of supervised learning. Methods to deal with missing values in supervised learning are very different from methods designed for an inferential framework. For instance, mean imputation, which is one of the worst choices when the aim is to estimate parameters, can be consistent when the aim is to predict as well as possible, as shown in this recent paper.

In this notebook, we will cover how to accurately predict the response y given X when X contains missing values (missing values in both training and testing data).

In this case, there are essentially two approaches:

1. Two-step strategy: impute the missing data, then apply classical prediction methods to the completed data sets;
2. One-step strategy: predict with methods adapted to missing data, without necessarily imputing them.

We first describe the methods on synthetic data and then apply them to real datasets.

# Description of the different strategies

We generate the covariates $X$ with 3 variables from a Gaussian distribution with a positive correlation structure, the Gaussian noise $\epsilon$ from the standard Gaussian distribution and the fixed regression parameter $\beta$ from a uniform distribution. The outcome variable is obtained with the following linear model: $$Y=X\beta+\epsilon.$$
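A minimal sketch of this data-generating process (the sample size and the correlation level 0.5 are arbitrary choices for illustration, not necessarily the notebook's exact values):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 3

# Covariance with 1 on the diagonal and 0.5 off-diagonal
# (positive correlation between the 3 covariates)
cov = np.full((d, d), 0.5) + 0.5 * np.eye(d)
X = rng.multivariate_normal(mean=np.zeros(d), cov=cov, size=n)

# Fixed regression parameter drawn uniformly, standard Gaussian noise
beta = rng.uniform(size=d)
epsilon = rng.standard_normal(n)

# Linear model: Y = X beta + epsilon
y = X @ beta + epsilon
```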

We introduce some missing (here MCAR) values in the data matrix using the function produce_NA given in the Python notebook How to generate missing values in Python?.
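The function produce_NA itself is defined in that companion notebook; as a self-contained stand-in, an MCAR mechanism can be sketched with a hypothetical helper mcar_mask:

```python
import numpy as np

def mcar_mask(X, p, rng):
    """Set each entry of X to NaN independently with probability p
    (MCAR: the missingness does not depend on the data).
    Hypothetical helper, not the notebook's produce_NA."""
    X_nan = X.copy()
    X_nan[rng.uniform(size=X.shape) < p] = np.nan
    return X_nan

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
X_nan = mcar_mask(X, p=0.2, rng=rng)
```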

## Two-step strategy

We will consider two imputation methods:

• Mean imputation: replace each missing value by the mean of its feature (the column);
• Iterative imputation: each feature is regressed on the others, which means the imputation can take advantage of the other features (this is the implementation proposed in sklearn.impute.IterativeImputer).

More details on these methods can be found in How to impute missing values in Python.
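Both imputers are available in scikit-learn; here is a minimal sketch on toy data (note that IterativeImputer is still experimental and must be enabled explicitly):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
X[rng.uniform(size=X.shape) < 0.2] = np.nan  # 20% MCAR values

# Mean imputation: each NaN is replaced by its column mean
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Iterative imputation: each feature is regressed on the others,
# cycling until the imputations stabilize
X_iter = IterativeImputer(random_state=0).fit_transform(X)
```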

Note that Josse et al. study the classical tools for handling missing values in the context of supervised learning. In particular, they give the following take-home messages:

• If we possess a good learning algorithm and enough samples in our dataset (e.g. more than $10^5$), mean imputation is a good default choice (since it is consistent).
• It is important to use the same imputation for the train and the test set, since in this case the learning algorithm can learn to use the imputed value to detect that the entry was initially missing.
• It might be fruitful to add the missing-values indicator (by concatenating it to X), in particular when missingness can be related to the prediction target (MNAR or MAR values).

To concatenate the missing indicator to X, we can use the argument add_indicator=True of SimpleImputer. Note that this concatenation is done after imputation and is only used for prediction.
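For instance (toy data; the 30% missing rate is an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
X[rng.uniform(size=X.shape) < 0.3] = np.nan

# add_indicator=True appends, after imputation, one binary column per
# feature that had missing values at fit time, flagging where they were
imp = SimpleImputer(strategy="mean", add_indicator=True)
X_imp = imp.fit_transform(X)  # 3 imputed columns + 3 indicator columns
```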

## One-step strategy

We compare these imputation methods to a learning algorithm which can perform predictions by directly accounting for missing values:

• Missing Incorporated in Attribute (MIA; Twala et al.): this method is dedicated to handling missing values in tree-based methods, such as random forests, XGBoost, etc. It is implemented in different packages (partykit and grf in R, and by default as the missing-values handling in the class HistGradientBoostingRegressor from the module sklearn.ensemble of scikit-learn). If you want to implement it yourself, you can duplicate each feature and replace its missing values once by $\infty$ and once by $-\infty$ (or extreme out-of-range values). Here, we choose to replace them once by the maximum of the variable plus 1000 and once by the minimum of the variable minus 1000. More information on this method can be found in Josse et al.

This method does not claim to impute the missing data: here, the step of duplicating features is internal to the tree-based learning algorithm.

## Pipeline

Let's evaluate the different strategies by combining our different imputers with different machine learning algorithms. The pipeline will be

1. Imputation.
2. Regression on the imputed dataset.

Here we decompose each step of the pipeline for clarity.

First, we can split the data into train and test datasets.

We can then choose a learning algorithm, for example random forests, using the class sklearn.ensemble.RandomForestRegressor of scikit-learn. Note that we cannot directly apply the learner, since it cannot deal with missing values.

We fit the imputer on the train dataset and then transform both the train and the test sets with the same imputer.

Finally, we can fit the learner.
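Putting the steps above together, a minimal end-to-end sketch (sample size, missing rate and hyperparameters are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((400, 3))
y = X @ rng.uniform(size=3) + rng.standard_normal(400)
X[rng.uniform(size=X.shape) < 0.2] = np.nan

# 1. Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# 2. Fit the imputer on the train set only, then transform both sets
imp = SimpleImputer(strategy="mean").fit(X_train)
X_train_imp = imp.transform(X_train)
X_test_imp = imp.transform(X_test)

# 3. Fit the learner on the imputed train set and score it on the test set
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train_imp, y_train)
score = rf.score(X_test_imp, y_test)  # R^2 on the held-out data
```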

# Method selection on synthetic data

The function score_pred compares the strategies above on synthetic data in terms of prediction performance by applying chosen learning algorithms.

More precisely, the function takes as input a complete data matrix (X) and an output variable (y). Then, missing values are introduced in the complete data matrix with a given percentage of missing values (p) and missing-data mechanism (mecha). Each method is performed with a learning algorithm (learner). The methods are detailed below:

• Mean imputation with or without adding the mask + Learning algorithm (two-step method),
• Iterative imputation with or without adding the mask + Learning algorithm (two-step method),
• Learning algorithm alone if it accounts for MIA (one-step method, if opt_learner="Learner_MIA"), or MIA + learning algorithm otherwise.

The introduction of the missing values is repeated several times (nbsim), which induces the stochasticity in the results (and boxplots).

The arguments are the following.

• X: complete data matrix (covariates).
• y: output variable.
• learner: learner to be used for comparing strategies (e.g. random forests, gradient boosting, linear regression).
• p: percentage of missing values to introduce.
• nbsim: number of simulations performed.
• opt_learner: indicates whether the learning algorithm accounts for MIA (opt_learner="Learner_MIA").
• mecha: missing-data mechanism to use for introducing the missing values.

It returns scores for each strategy.
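The actual score_pred is defined in the notebook's code cells; a simplified, self-contained sketch of such a comparison loop (compare_strategies is a hypothetical stand-in covering only the two imputation strategies, under MCAR) could look like:

```python
import numpy as np
from sklearn.base import clone
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def compare_strategies(X, y, learner, p=0.2, nbsim=5, seed=0):
    """Introduce MCAR values nbsim times and score each imputation
    strategy with the given learner (hypothetical stand-in for score_pred)."""
    rng = np.random.default_rng(seed)
    imputers = {"mean": SimpleImputer(strategy="mean"),
                "iterative": IterativeImputer(random_state=0)}
    scores = {name: [] for name in imputers}
    for _ in range(nbsim):
        X_nan = X.copy()
        X_nan[rng.uniform(size=X.shape) < p] = np.nan  # MCAR mask
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_nan, y, test_size=0.25, random_state=0)
        for name, imp in imputers.items():
            imp_fitted = clone(imp).fit(X_tr)  # same imputer for train and test
            model = clone(learner).fit(imp_fitted.transform(X_tr), y_tr)
            scores[name].append(model.score(imp_fitted.transform(X_te), y_te))
    return scores

rng = np.random.default_rng(1)
X = rng.standard_normal((150, 3))
y = X @ rng.uniform(size=3) + rng.standard_normal(150)
scores = compare_strategies(X, y, LinearRegression(), p=0.2, nbsim=3)
```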

We apply this function by introducing MCAR or MNAR values in X. MCAR means that the probability that an observation is missing is independent of the data. MNAR means that the missingness depends on the missing values and potentially also on the observed values. To introduce missing values, we use How to generate missing values.

# Method selection on real data

The function plot_score_realdatasets can be used for real datasets containing missing values. The arguments are the following.

• X: data matrix containing missing values (covariates).
• y: output variable (containing no missing values).
• learner: dictionary containing the learners to be used for comparing strategies.

It returns boxplots of the scores for each method (mean imputation, iterative imputation, MIA); the stochasticity comes from the repeated random splits of the dataset into a train set and a test set.

Here, we study a real dataset which does not contain real missing values, so we add some missing values (MCAR or MNAR) before applying the function plot_score_realdatasets. In this case, we can also compute the scores for the complete matrix, which are represented in the boxplots.