A resource website on missing values - Methods and references for managing missing data


When it comes to analyses with missing values, some questions come up regularly during classes or seminars. We list the most frequent ones here, with some elements of response. If you have another question related to the handling of missing values, feel free to contact us via the Contact form.


For prediction tasks, the same imputation model has to be used for the training and test sets. However, this is not always possible when imputing with a black-box imputation function that does not let you specify a given imputation model. (JJ)

See this recent article for a discussion of this topic and this video of a keynote at the useR! 2019 conference on the same subject.
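
As a minimal sketch of what "the same imputation model" means, assume the simplest possible model, column-mean imputation. The means are learned on the training set only and then reused unchanged on the test set. The function names `fit_mean_imputer` and `apply_imputer` and the toy arrays are hypothetical, chosen for illustration.

```python
import numpy as np

def fit_mean_imputer(X_train):
    """Learn one imputation value per column from the training set only."""
    return np.nanmean(X_train, axis=0)

def apply_imputer(X, col_means):
    """Fill the NaNs of X with the values learned on the training set."""
    X_filled = X.copy()
    rows, cols = np.where(np.isnan(X_filled))
    X_filled[rows, cols] = col_means[cols]
    return X_filled

# Toy data (illustrative values, NaN marks a missing cell)
X_train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])
X_test = np.array([[np.nan, 40.0], [5.0, np.nan]])

col_means = fit_mean_imputer(X_train)          # fitted once, on train
X_train_imp = apply_imputer(X_train, col_means)
X_test_imp = apply_imputer(X_test, col_means)  # reused as-is on test
```

A black-box function that only exposes a single "impute this table" call offers no equivalent of the second step, which is the difficulty described above.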

The question of the percentage of missing data is one of the most frequent questions from users. We are often asked: if I have 30% NA, is that too much? What about 40%?

It is not only the percentage of missing data that counts, but also the structure of the data. A simple example to understand this point is a data set with 100 variables that are all identical, so the correlation between these variables is 1. Even with 80% missing data, many imputation techniques will be able to predict the missing values perfectly, so the variability associated with the prediction will be zero. Conversely, in a data set where the information is very unstructured, even a very small percentage of missing data can completely destroy the links between the variables.
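
The identical-variables example can be simulated in a few lines. This is only a sketch under stated assumptions: the sample size, the seed, and the row-mean imputation rule are illustrative choices, and one cell per row is forced to stay observed in the (very unlikely) case where a whole row would be missing.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 100

# 100 identical variables: every column is a copy of x
x = rng.normal(size=n)
X_full = np.tile(x[:, None], (1, p))

# Remove about 80% of the cells completely at random
mask = rng.random((n, p)) < 0.8            # True = missing
all_missing = mask.all(axis=1)
mask[all_missing, 0] = False               # keep one observed cell per row
X_obs = np.where(mask, np.nan, X_full)

# Impute each missing cell from the observed copies in the same row
row_means = np.nanmean(X_obs, axis=1)
X_imp = np.where(np.isnan(X_obs), row_means[:, None], X_obs)

max_error = np.max(np.abs(X_imp - X_full))  # essentially zero
```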

Of course, we do not know the structure of the data a priori. This is why, with missing data, it is imperative to consider the variability of the results and the confidence we can place in them. Multiple imputation, for example, reflects the variance of the prediction of the missing values. A first way to assess the impact of missing data is to plot the different imputed values and see how much they vary. The width of the confidence intervals is then a good indicator. (JJ)
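
To make the link between multiple imputation and variability concrete, here is a minimal sketch of Rubin's rules for a scalar estimate: the total variance is the average within-imputation variance plus an inflated between-imputation variance. The function name `pool_rubin` and the five estimates are hypothetical, made up for illustration.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool M scalar estimates and their variances with Rubin's rules."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()            # pooled point estimate
    within = variances.mean()           # average within-imputation variance
    between = estimates.var(ddof=1)     # variance across imputed data sets
    total = within + (1 + 1 / m) * between
    return q_bar, total

# Illustrative values: one estimate and one variance per imputed data set
q_bar, total_var = pool_rubin(
    estimates=[1.0, 1.2, 0.8, 1.1, 0.9],
    variances=[0.04, 0.04, 0.04, 0.04, 0.04],
)
```

The larger the disagreement between imputed data sets, the larger the between term, and the wider the resulting confidence interval.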

The idea of imputing from the nearest neighbours is a sensible strategy. The problem here is not only the missing data but the difficulty of k-NN itself for high-dimensional data sets with heterogeneous variables (quantitative, categorical, etc.). You need an appropriate distance that takes the mixed nature of the data into account, and possibly a dimension reduction step before computing the distances, so for many data sets applying a k-NN algorithm for imputation is not straightforward. (JJ)

It always depends on the objective. If we only want to impute, and therefore to predict the missing values as well as possible, we can always do cross-validation: add missing cells to the data, predict them with different techniques, and select the method that gives the smallest prediction error. You can also be guided by theoretical arguments. I impute a lot of my data with dimension reduction techniques (low-rank approximation), because it is quite plausible that a lot of data can be well approximated by matrices of low rank. (JJ)

Here is an interesting reference on this topic: Udell, M. (2019). Big Data is Low Rank. SIAG/OPT Views and News
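
The cross-validation idea described above can be sketched as follows: hide some observed cells, impute them with each candidate method, and keep the method with the smallest error on the hidden cells. Everything here is an assumption made for illustration: the two candidates (column means and a basic iterative truncated SVD), the simulated rank-2 data, and the names `impute_mean`, `impute_low_rank` and `holdout_error`.

```python
import numpy as np

def impute_mean(X):
    """Replace each NaN with the mean of its column."""
    col_means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_means, X)

def impute_low_rank(X, rank=2, n_iter=100):
    """Alternate between a rank-`rank` SVD fit and refilling the NaNs."""
    missing = np.isnan(X)
    X_filled = impute_mean(X)                    # starting point
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X_filled, full_matrices=False)
        X_low = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        X_filled = np.where(missing, X_low, X)   # observed cells untouched
    return X_filled

def holdout_error(X, impute, frac=0.2, seed=0):
    """Hide a fraction of the observed cells and return the RMSE on them."""
    rng = np.random.default_rng(seed)
    hide = ~np.isnan(X) & (rng.random(X.shape) < frac)
    X_masked = X.copy()
    X_masked[hide] = np.nan
    X_imp = impute(X_masked)
    return np.sqrt(np.mean((X_imp[hide] - X[hide]) ** 2))

# Simulated data: a rank-2 signal plus a little noise (assumed setting)
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 10))
X += 0.1 * rng.normal(size=X.shape)

errors = {
    "mean": holdout_error(X, impute_mean),
    "low_rank": holdout_error(X, impute_low_rank),
}
best = min(errors, key=errors.get)   # method with the smallest error
```

On data that really are close to low rank, the low-rank candidate should win. On real data the ranking can differ, which is the reason for running the comparison.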

Yes, definitely! The consequences of not taking the missing data into account can become dramatic very quickly. Even without mentioning the underestimation of variance, there can be a significant bias! For example, at the moment I am working on estimating the effect of a treatment, and if we do not take the missing data into account, we would conclude that the treatment kills when in fact it saves. (JJ)

The first R packages that allow you to compare several imputation methods, such as missCompare, are starting to appear. There are still a lot of things to fix, because all methods have many default settings, etc. But on the R-miss-tastic platform we will try to put together some workflows that help the user to make this type of comparison easily. (JJ)

If we have good reason to believe that the data are missing completely at random (MCAR), then yes, with a lot of data, we can work on the complete cases, because they still form a sample from the joint distribution of the data. Otherwise, even if we have a lot of data, the complete cases are not representative of the population. The classic example is missing income data: if rich or poor people do not disclose their income, there is clearly a selection bias in the complete cases (MNAR data). But even if it is the young or the elderly who do not give their income, and income and age are strongly linked, we have the same problem of selection bias (MAR data). (JJ)
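
A small simulation illustrates the income example. It is a sketch under assumed numbers: the linear link between age and income, the missingness probabilities, and the sample size are all made up for illustration. Under MCAR the complete-case mean stays close to the true mean. Under MAR, where older people answer less often, it does not, however large the sample.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Assumed population: income increases with age
age = rng.uniform(20, 80, size=n)
income = 1000 + 30 * age + rng.normal(0, 300, size=n)

# MCAR: every income has the same 50% chance of being missing
mcar_missing = rng.random(n) < 0.5

# MAR: the chance of a missing income depends on age only
p_missing = np.where(age > 50, 0.8, 0.2)
mar_missing = rng.random(n) < p_missing

true_mean = income.mean()
mcar_mean = income[~mcar_missing].mean()  # complete cases, MCAR
mar_mean = income[~mar_missing].mean()    # complete cases, MAR

print(f"true {true_mean:.0f}, MCAR {mcar_mean:.0f}, MAR {mar_mean:.0f}")
```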

If having missing data is informative for prediction, we see that having an indicator in your dataset that codes for missing/not missing will help because it is seen as an explanatory variable. The MIA method (Twala et al. 2008) for regression trees/random forests allows this to be done. (JJ)

See also Josse et al. (2019).
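
The indicator idea can be sketched as follows. Note that this is the plain missing-indicator encoding, not the MIA method itself, which works inside the tree splits. The function name `add_missing_indicators` and the constant `fill_value` are hypothetical choices for illustration.

```python
import numpy as np

def add_missing_indicators(X, fill_value=0.0):
    """Return [filled X | mask], where mask is 1.0 for originally missing cells."""
    mask = np.isnan(X)
    X_filled = np.where(mask, fill_value, X)   # any constant; the mask keeps the information
    return np.hstack([X_filled, mask.astype(float)])

# Toy data (illustrative values)
X = np.array([[1.0, np.nan],
              [np.nan, 4.0]])
X_aug = add_missing_indicators(X)
# Columns of X_aug: x1 filled, x2 filled, x1 missing?, x2 missing?
```

A predictive model trained on `X_aug` can then use the fact that a value was missing as an explanatory variable in its own right.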

Yes, there are solutions that consist in modelling the missing-data mechanism, but this often requires a fairly strong prior on the parametric form of the distribution of the missing data, and the practical solutions are still quite limited. A series of new approaches based on graphical and causal models can address MNAR data without modelling the mechanism, but they are so far limited to simple models such as the linear model. See for instance Mohan and Pearl (2019). (JJ)

I would tend to say no, but that is to be checked. What is certain is that Rubin's aggregation rules are not suitable for many quantities and that there is still a lot of research to be done on the subject. (JJ)

You simply need to create a single variable with different categories, encoding the different series of possible answers.

For example,

(1) Do you have a bank account? Yes/No

(2) If yes to (1): How many bank accounts do you have, <5 or >5?

(3) If >5: what is the total value? If <5, what is the value of account 1 to 5?

will be coded in one variable with the following categories: Yes >5_1, Yes >5_2, Yes >5_3, Yes >5_4, Yes >5_5, Yes <5 and No. (JJ)

(Question related to Josse et al. 2019.)

Yes, that's the point. We do a recoding, just because the implementations of most methods stop when they see the NA symbol for missing. They don't take it as a code. (JJ)