How to handle missing values in practice?
How should missing values be handled when the goal is estimation, inference, or prediction? There is no single method that answers this question. We propose several workflows in both R and Python to compare the most common methods, in particular multiple imputation (missMDA in R, IterativeImputer in Python), low-rank methods (softImpute), and some recent methods implemented in Python that use optimal transport (Sinkhorn imputation) and autoencoders (MIWAE).
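To give a feel for what such a comparison looks like, here is a minimal Python sketch; it is not taken from the workflows themselves. It assumes toy Gaussian data, values removed completely at random (MCAR), and an error measured only on the entries that were removed. The helper names `make_mcar` and `imputation_rmse` are hypothetical and introduced only for illustration. Note that scikit-learn's IterativeImputer returns a single imputation by default; drawing several imputations would require `sample_posterior=True` with different random seeds.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer


def make_mcar(X, prop, rng):
    """Hypothetical helper: set a fraction `prop` of entries to NaN, completely at random."""
    mask = rng.random(X.shape) < prop
    X_miss = X.copy()
    X_miss[mask] = np.nan
    return X_miss, mask


def imputation_rmse(X_true, X_imputed, mask):
    """Hypothetical helper: RMSE computed on the artificially removed entries only."""
    diff = X_true[mask] - X_imputed[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


rng = np.random.default_rng(0)

# Toy data (an assumption for this sketch): three correlated Gaussian variables.
cov = np.array([[1.0, 0.8, 0.6],
                [0.8, 1.0, 0.7],
                [0.6, 0.7, 1.0]])
X = rng.multivariate_normal(np.zeros(3), cov, size=500)

# Remove about 20% of the entries under an MCAR mechanism.
X_miss, mask = make_mcar(X, prop=0.2, rng=rng)

# Compare a naive baseline with an iterative, regression-based imputer.
imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "iterative": IterativeImputer(max_iter=10, random_state=0),
}
scores = {
    name: imputation_rmse(X, imputer.fit_transform(X_miss), mask)
    for name, imputer in imputers.items()
}

for name, rmse in scores.items():
    print(f"{name}: RMSE on removed entries = {rmse:.3f}")
```

Because the variables are correlated, the iterative imputer can borrow information across columns, which mean imputation cannot. The ranking of methods may differ under other missing-data mechanisms (MAR, MNAR) or other data distributions.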
R - How to …
- … generate missing values? (PDF, Rmd, Resource R)
- … estimate some parameters with missing values? (PDF, Rmd)
- … impute missing values? (PDF, Rmd, Resource R)
Python - How to …
- … generate missing values? (Interactive notebook)
- … impute missing values? (Interactive notebook)
- … predict with missing values? (Interactive notebook)
External notebooks
The following notebooks have been developed by external contributors and reviewed by the R-miss-tastic committee. We recommend these notebooks as they provide useful and complementary guidance on how to predict with missing values, on how to impute incomplete data, and on how to handle incomplete data in treatment effect estimation.
- How to predict with missing values in R? (R notebook .Rmd) by Katarzyna Woźnica (Warsaw University of Technology)
- Machine learning with missing values (Python tutorial .ipynb) by Gaël Varoquaux (Inria)
- Comparison of classical and deep learning imputation methods (using Python and R) (Python notebook .ipynb) by François Husson (Agrocampus Ouest)
- Using "Imputation Scores" for assessing missing value imputations (using R package Iscores) by Meta-Lina Spohn and Jeffrey Näf (ETH Zurich)
- How to do causal inference with incomplete covariates/attributes in R? (IPW and doubly robust estimation, R notebook .Rmd) by Imke Mayer (EHESS). Additional information about this pipeline can be found here.
If you have any suggestions, please raise an issue on our GitHub repository.