Bibliography

Fri Apr 04, 2025

On this platform we attempt to give you an overview of the main references on missing values. We do not claim to gather all available references on the subject but rather to offer a peek into different fields of active research on handling missing values, allowing for an introductory reading as well as a starting point for further bibliographical research.

See here for a full (and uncommented) list of references.

Inspired by the CRAN Task View on Missing Data and a review by Imbert & Vialaneix on handling missing values (2018, written in French), we organized our selection of relevant references on missing values by topic.

Short introduction to missing values

In order to provide a more formal introduction to the problem of missing values and the existing methods to handle them (e.g. diagnose/describe the missingness or perform statistical analysis on the incomplete data), we introduce some fairly standard definitions and notations used in the remainder of this article.

Let \(X=(X_1,\dots, X_p)\) be a vector of \(p\) random variables which can be continuous or categorical.
We denote by \(x_{ij}\) the observation of variable \(X_j\) for an individual \(i\in\{1,\dots,n\}\) and by \(\mathbf{x}_i=(x_{i1},\dots,x_{ip})\) the vector of observations of all \(p\) variables \(X\) for the individual \(i\).
The observations of the \(n\) individuals are stacked by rows in a matrix \(\mathbf{X}\in\mathbb{R}^{n\times p}\).
The indicator matrix of missing values \(\mathbf{R}\) is defined such that its values \((r_{ij})_{\substack{i=1,\dots,n\\j=1,\dots,p}}\) are given by: \(r_{ij} = \left\{\begin{array}{ll}1 & \text{ if } x_{ij} \text{ is observed}\\0 & \text{ otherwise}\end{array}\right. = \mathbb{1}_{x_{ij}\, is\, observed}\). The associated random variable is denoted by \(R\).
The observed and missing parts of \(X\) are denoted respectively by \(X_{obs}\) and \(X_{mis}\).
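As a small illustration of this notation, the following Python sketch builds the indicator matrix \(\mathbf{R}\) from a toy data matrix. Encoding missing entries as `None`, the toy values and the helper name `missingness_indicator` are assumptions made for this example only.

```python
def missingness_indicator(X):
    """Return R with r_ij = 1 if x_ij is observed and 0 otherwise."""
    return [[0 if value is None else 1 for value in row] for row in X]


# Toy data: n = 3 individuals, p = 2 variables, None marks a missing entry.
X = [
    [1.2, None],
    [None, 0.7],
    [3.4, 2.1],
]
R = missingness_indicator(X)

# Observed part of each row x_i, i.e. the entries with r_ij = 1.
X_obs = [[value for value in row if value is not None] for row in X]
```

Row sums of `R` count the observed variables per individual, and column sums count the observed individuals per variable.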

General references and reviews

These general references and reviews are helpful to get started with the large field of missing values as they provide an introduction to the main concepts and methods or give an overview of the diversity of topics in statistical analysis related to missing values. They discuss different mechanisms that generated the missing values, necessary conditions for working consistently on the observed values alone and ways to impute, i.e. complete, the missing values to end up with complete datasets allowing the use of standard statistical analysis methods.

Little, R. J. and D. B. Rubin. Statistical analysis with missing data. John Wiley & Sons, 2019.
Allison, P. D. Missing Data. Quantitative Applications in the Social Sciences. Thousand Oaks, CA, USA: Sage Publications, 2001. ISBN: 9780761916727.
DOI
Carpenter, J. and M. Kenward. Multiple Imputation and its Application. Chichester, West Sussex, UK: Wiley, 2013. ISBN: 9780470740521.
DOI
Enders, C. K. Applied Missing Data Analysis. Guilford Press, 2010, p. 401. ISBN: 9781606236390.
Kim, J. K. and J. Shao. Statistical Methods for Handling Incomplete Data. Boca Raton, FL, USA: Chapman and Hall/CRC, 2013. ISBN: 9781482205077.
Molenberghs, G., G. Fitzmaurice, M. G. Kenward, et al. Handbook of Missing Data Methodology. Chapman & Hall/CRC Handbooks of Modern Statistical Methods. New York, NY, USA: Chapman and Hall/CRC, 2014. ISBN: 9781439854624.
Molenberghs, G. and M. G. Kenward. Missing Data in Clinical Studies. Chichester, West Sussex, UK: Wiley, 2007. ISBN: 9780470849811.
DOI
O’Kelly, M. and B. Ratitch. Clinical Trials with Missing Data: A Guide for Practitioners. John Wiley & Sons, Ltd, 2014.
DOI
Schafer, J. L. Analysis of Incomplete Multivariate Data. CRC Monographs on Statistics & Applied Probability. Boca Raton, FL, USA: Chapman and Hall/CRC, 1997. ISBN: 0412040611.
Buuren, S. van. Flexible Imputation of Missing Data. Boca Raton, FL: Chapman and Hall/CRC, 2018.
URL

Graham, J. W. Missing data analysis: making it work in the real world. In: Annual Review of Psychology 60 (2009), pp. 549-576.
DOI
Kaiser, J. Dealing with missing values in data. In: Journal of Systems Integration 5.1 (2014), pp. 42-51.
DOI
Pigott, T. D. A review of methods for missing data. In: Educational Research and Evaluation 7.4 (2001), pp. 353–383.
DOI
Schafer, J. L. and J. W. Graham. Missing data: our view of the state of the art. In: Psychological Methods 7.2 (2002), pp. 147-177.
DOI

Orchard, T. and M. A. Woodbury. A missing information principle: theory and applications. In: Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Theory of Statistics. Ed. by L. M. Le Cam, J. Neyman and E. L. Scott. Vol. 1. University of California Press, 1972, pp. 697–715.
URL

If you are rather new to the subject and wish to start with less formal and more application-based introductions, or if you are looking for general high-level advice on handling missing data, we suggest the following publications:

National Research Council, U. The Prevention and Treatment of Missing Data in Clinical Trials. Washington (DC), USA: National Academies Press, 2010. ISBN: 9780309158145.
DOI

Baraldi, A. N. and C. K. Enders. An introduction to modern missing data analysis. In: Journal of School Psychology 48.1 (2010), pp. 5-37.
DOI
Dax, A. Imputing Missing Entries of a Data Matrix: A review. In: Journal of Advanced Computing 3.3 (2014), pp. 98-222.
DOI
Dong, Y. and C. J. Peng. Principled missing data methods for researchers. In: SpringerPlus 2 (2013), p. 222.
DOI
Horton, N. J. and K. P. Kleinman. Much Ado About Nothing - A Comparison of Missing Data Methods and Software to Fit Incomplete Data Regression Models. In: The American Statistician 61.1 (2007), pp. 79-90.
DOI
Meng, X. L. You want me to analyze data I don’t have? Are you insane? In: Shanghai Archives of Psychiatry 24.5 (2012), pp. 287-301.
DOI URL
Peugh, J. L. and C. K. Enders. Missing data in educational research: a review of reporting practices and suggestions for improvement. In: Review of Educational Research 74.4 (2004), pp. 525–556.
DOI URL

Furthermore you can have a look at the following statistical journals which regularly contain recent results related to handling missing data:

Significance (bimonthly magazine)
Statistica Sinica (quarterly journal; Volume 28, Number 4, October 2018 on Data Missing Not at Random)
Statistical Science (quarterly journal; Volume 33, Number 2, May 2018 on Missing Data)
The American Statistician (quarterly journal)

Weighting methods

The first intuitive and probably most widely applied solution for dealing with missing values in data analysis is to delete the partial observations and to work exclusively on the individuals with complete information. This has several drawbacks; among others, it introduces an estimation bias in most cases (more precisely, in cases where the missingness is not independent of the data). In order to reduce this bias one can reweight the complete observations to compensate for the deletion of incomplete individuals in the dataset. The weights are defined by inverse probabilities, for instance the inverse of the probability for each individual of being fully observed. This method is known as inverse probability weighting and is described in detail in the publications below. We split the references into two parts: handling missing values in survey data and performing causal inference in the presence of missing values, both requiring the use of weighting methods.
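To make the reweighting idea concrete, here is a minimal Python sketch of an inverse probability weighted mean. It assumes that the probability of being observed depends only on a fully observed grouping variable and estimates it by the observed fraction within each group; the function name `ipw_mean`, the normalized (Hájek-type) form of the estimator and the toy data are choices made for this illustration, not a method taken from a specific reference below.

```python
from collections import defaultdict


def ipw_mean(y, group):
    """Weighted mean of the observed y, with weights 1 / P(observed | group).

    Missing outcomes are encoded as None. The response probability is
    estimated per group as (number observed) / (group size).
    """
    n_total = defaultdict(int)
    n_observed = defaultdict(int)
    for y_i, g in zip(y, group):
        n_total[g] += 1
        if y_i is not None:
            n_observed[g] += 1

    weighted_sum = 0.0
    weight_total = 0.0
    for y_i, g in zip(y, group):
        if y_i is None:
            continue  # incomplete individuals are deleted ...
        weight = n_total[g] / n_observed[g]  # ... and complete ones reweighted
        weighted_sum += weight * y_i
        weight_total += weight
    return weighted_sum / weight_total


# Toy data: group "b" has larger outcomes and is observed only half of the time.
y = [9.0, 10.0, 11.0, 10.0, 19.0, None, 21.0, None]
group = ["a", "a", "a", "a", "b", "b", "b", "b"]

observed = [v for v in y if v is not None]
complete_case_mean = sum(observed) / len(observed)
weighted_mean = ipw_mean(y, group)
```

In this toy example the complete-case mean is pulled towards group "a", while the weighted mean gives both groups the same total weight, as in the full sample. A group without any observed outcome receives no weight at all, which is one practical limitation of this simple sketch.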

For survey data analysis

Such weighting methods are widely used on survey data in order to correct for unbalanced sampling fractions by balancing the empirical distributions of the observed covariates to recover the structure of the target population.

Buck, S. F. A method of estimation of missing values in multivariate data suitable for use with an electronic computer. In: Journal of the Royal Statistical Society, Series B 22 (1960), pp. 302-306.
DOI
Carpenter, J. R., M. G. Kenward, and S. Vansteelandt. A comparison of multiple imputation and doubly robust estimation for analyses with missing data. In: Journal of the Royal Statistical Society: Series A (Statistics in Society) 169.3 (2006), pp. 571–584.
DOI
Fitzmaurice, G. M., G. Molenberghs, and S. R. Lipsitz. Regression Models for Longitudinal Binary Responses with Informative Drop-Outs. In: Journal of the Royal Statistical Society. Series B (Methodological) 57.4 (1995), pp. 691–704.
URL
Gelman, A., G. King, and C. Liu. Not asked and not answered: Multiple imputation for multiple surveys. In: Journal of the American Statistical Association 93.443 (1998), pp. 846–857.
DOI
Kalton, G. and D. Kasprzyk. The treatment of missing survey data. In: Survey Methodology 12.1 (1986), pp. 1-16.
URL
Preisser, J. S., K. K. Lohman, and P. J. Rathouz. Performance of weighted estimating equations for longitudinal binary data with drop-outs missing at random. In: Statistics in Medicine 21.20 (2002), pp. 3035–3054.
DOI
Robins, J. M., A. Rotnitzky, and L. P. Zhao. Estimation of Regression Coefficients When Some Regressors are not Always Observed. In: Journal of the American Statistical Association 89.427 (1994), pp. 846-866.
DOI
Rubin, D. B. Formalizing subjective notions about the effect of nonrespondents in sample surveys. In: Journal of the American Statistical Association 72.359 (1977), pp. 538-543.
DOI
Vansteelandt, S., J. Carpenter, and M. G. Kenward. Analysis of incomplete data using inverse probability weighting and doubly robust estimators. In: Methodology – European Journal of Research Methods for the Behavioral and Social Sciences 6.1 (2010), pp. 37–48.
DOI

Methods in common with causal inference

Inverse probability weighting is also considered in causal inference: A bias is induced by the presence of confounders, i.e. variables which interact with both covariates and outcome. Hence, if the goal is to estimate causal relationships between covariates and outcome it is necessary to account for the potential effect of confounders – a selection bias – on the result of causal inference.

Bang, H. and J. M. Robins. Doubly robust estimation in missing data and causal inference models. In: Biometrics 61.4 (2005), pp. 962-973.
DOI
Bartlett, J. W., O. Harel, and J. R. Carpenter. Asymptotically unbiased estimation of exposure odds ratios in complete records logistic regression. In: American journal of epidemiology 182.8 (2015), pp. 730–736.
DOI
Blake, H. A., C. Leyrat, K. Mansfield, et al. Propensity scores using missingness pattern information: a practical guide. In: arXiv preprint (2019). arXiv: 1901.03981 [stat.ME].
URL
Ding, P. and F. Li. Causal Inference: A Missing Data Perspective. In: Statistical Science 33.2 (2018), pp. 214–237.
DOI
Hogan, J. W. and T. Lancaster. Instrumental variables and inverse probability weighting for causal inference from longitudinal observational studies. In: Statistical Methods in Medical Research 13.1 (2004), pp. 17-48.
DOI
Seaman, S. R. and S. Vansteelandt. Introduction to Double Robust Methods for Incomplete Data. In: Statistical Science 33.2 (2018), p. 184.
DOI
Seaman, S. R. and I. R. White. Review of inverse probability weighting for dealing with missing data. In: Statistical Methods in Medical Research 22.3 (2011), pp. 278-295.
DOI
Wal, W. M. van der and R. B. Geskus. ipw: an R package for inverse probability weighting. In: Journal of Statistical Software 43.13 (2011).
DOI
Yang, S., L. Wang, and P. Ding. Identification and estimation of causal effects with confounders subject to instrumental missingness. In: Statistics Methodology Repository (2017).
URL
Zhu, Z., T. Wang, and R. J. Samworth. High-dimensional principal component analysis with heterogeneous missingness. In: arXiv preprint (2019).
URL

Kallus, N., X. Mao, and M. Udell. Causal Inference with Noisy and Missing Covariates via Matrix Factorization. In: Advances in Neural Information Processing Systems. 2018. eprint: 1806.00811.
URL

Inference with missing values

The most popular way to deal with missing values in statistical inference tasks is to use likelihood-based approaches that can handle incomplete data. More precisely, if the missingness mechanism is ignorable (in a certain sense that is explained in the Missing values mechanisms section) then one can attempt to infer the model parameters by maximizing the likelihood on the observed values. When the mechanism cannot be ignored, a specific model for it needs to be assumed. The main algorithm available for performing maximum likelihood (ML) estimation with missing values is the Expectation Maximization (EM) algorithm. This algorithm requires knowledge of the joint distribution of \(X = (X_{obs}, X_{mis})\) and its implementation is not straightforward since it involves integrals which cannot always be computed easily. Once the model parameters are estimated, one can impute the missing values using this estimated information on the data model.
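As a hedged illustration of the EM iterations, the Python sketch below treats one of the few cases where the required conditional expectations are available in closed form: a bivariate Gaussian model in which \(X_1\) is fully observed and \(X_2\) is missing for some individuals, with an ignorable mechanism. The function name `em_bivariate_normal`, the `None` encoding of missing values, the starting values and the fixed number of iterations are assumptions of this example.

```python
def em_bivariate_normal(x1, x2, n_iter=200):
    """EM for the mean and covariance of (X1, X2) when some x2 are None."""
    n = len(x1)
    # X1 is complete, so its ML estimates are available directly.
    mu1 = sum(x1) / n
    s11 = sum((a - mu1) ** 2 for a in x1) / n
    # Starting values for the remaining parameters from the observed x2.
    x2_observed = [b for b in x2 if b is not None]
    mu2 = sum(x2_observed) / len(x2_observed)
    s22 = sum((b - mu2) ** 2 for b in x2_observed) / len(x2_observed)
    s12 = 0.0

    for _ in range(n_iter):
        # E-step: expected x2 and x2^2 given x1 under the current parameters.
        slope = s12 / s11
        residual_variance = s22 - s12 ** 2 / s11
        e_x2, e_x2_squared = [], []
        for a, b in zip(x1, x2):
            if b is None:
                conditional_mean = mu2 + slope * (a - mu1)
                e_x2.append(conditional_mean)
                e_x2_squared.append(conditional_mean ** 2 + residual_variance)
            else:
                e_x2.append(b)
                e_x2_squared.append(b * b)
        # M-step: update the parameters from the expected sufficient statistics.
        mu2 = sum(e_x2) / n
        s12 = sum(a * m for a, m in zip(x1, e_x2)) / n - mu1 * mu2
        s22 = sum(e_x2_squared) / n - mu2 ** 2
    return mu1, mu2, s11, s12, s22


x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 1.9, 3.2, None, 5.1, None]
mu1, mu2, s11, s12, s22 = em_bivariate_normal(x1, x2)
```

Note that the E-step adds the conditional variance to the expected square of a missing value; simply plugging in the conditional mean and recomputing the covariance would underestimate the variance of \(X_2\).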

There also exist other methods that allow for statistical inference with missing values and that do not rely on likelihood maximization.

McLachlan, G. J. and T. Krishnan. The EM Algorithm and Extensions. Wiley series in probability and statistics. Hoboken, NJ, USA: Wiley, 2008. ISBN: 9780471201700.

Collins, L. M., J. L. Schafer, and C. M. Kam. A comparison of inclusive and restrictive strategies in modern missing data procedures. In: Psychological Methods 6.4 (2001), pp. 330-351.
DOI
Enders, C. K. A primer on maximum likelihood algorithms available for use with missing data. In: Structural Equation Modeling 8.1 (2001), pp. 128-141.
DOI
Finkbeiner, C. Estimation for the multiple factor model when data are missing. In: Psychometrika 44.4 (1979), pp. 409-420.
DOI
Golden, R. M., S. S. Henley, H. White, et al. Consequences of model misspecification for maximum likelihood estimation with missing data. In: Econometrics 7.3 (2019), p. 37.
DOI
Ibrahim, J. G., M. Chen, and S. R. Lipsitz. Missing responses in generalised linear mixed models when the missing data mechanism is nonignorable. In: Biometrika 88.2 (2001), pp. 551-564.
DOI
Ibrahim, J. G., S. R. Lipsitz, and M. Chen. Missing Covariates in Generalized Linear Models When the Missing Data Mechanism is Non-Ignorable. In: Journal of the Royal Statistical Society. Series B (Statistical Methodology) 61.1 (1999), pp. 173-190.
DOI URL
Jiang, W., J. Josse, and M. Lavielle. Logistic Regression with Missing Covariates–Parameter Estimation, Model Selection and Prediction. In: arXiv preprint (2018). arXiv: 1805.04602 [stat.ME].
Jones, M. P. Indicator and Stratification Methods for Missing Explanatory Variables in Multiple Linear Regression. In: Journal of the American Statistical Association 91.433 (1996), pp. 222-230.
DOI
Little, R. J. A. Regression with missing X’s: a review. In: Journal of the American Statistical Association 87.420 (1992), pp. 1227-1237.
DOI
Louis, T. A. Finding the Observed Information Matrix when Using the EM Algorithm. In: Journal of the Royal Statistical Society. Series B (Methodological) 44.2 (1982), pp. 226–233.
URL
Lüdtke, O., A. Robitzsch, and S. G. West. Regression models involving nonlinear effects with missing data: A sequential modeling approach using Bayesian estimation. In: Psychological methods (2019).
DOI
Meng, X. L. and D. B. Rubin. Maximum likelihood estimation via the ECM algorithm: a general framework. In: Biometrika 80.2 (1993), pp. 267-278.
DOI
Meng, X. L. and D. B. Rubin. Using EM to obtain asymptotic variance-covariance matrices: the SEM algorithm. In: Journal of the American Statistical Association 86.416 (1991), pp. 899-909.
DOI
Rosseel, Y. lavaan: an R package for structural equation modeling. In: Journal of Statistical Software 48.2 (2012).
DOI
Rubin, D. B. Inference and missing data. In: Biometrika 63.3 (1976), pp. 581-592.
DOI
Stubbendick, A. L. and J. G. Ibrahim. Maximum Likelihood Methods for Nonignorable Missing Responses and Covariates in Random Effects Models. In: Biometrics 59.4 (2003), pp. 1140–1150.
DOI
Stubbendick, A. L. and J. G. Ibrahim. Likelihood-based inference with nonignorable missing responses and covariates in models for discrete longitudinal data. In: Statistica Sinica 16.4 (2006), pp. 1143–1167.
URL
Tabouy, T., P. Barbillon, and J. Chiquet. Variational inference for stochastic block models from sampled data. In: Journal of the American Statistical Association 115.529 (2020), pp. 455–466.
DOI
Tchetgen Tchetgen, E. J., L. Wang, and B. Sun. Discrete choice models for nonmonotone nonignorable missing data: identification and inference. In: Statistica Sinica 28.4 (2018), pp. 2069–2088.
DOI
Xue, F. and A. Qu. Integrating multi-source block-wise missing data in model selection. In: Journal of the American Statistical Association (2020), pp. 1–36.
DOI
Zhao, Y. Statistical inference for missing data mechanisms. In: Statistics in Medicine 39.28 (2020), pp. 4325–4333.
DOI
Zhao, J. and Y. Ma. A versatile estimation procedure without estimating the nonignorable missingness mechanism. In: Journal of the American Statistical Association (2021), pp. 1–15.
DOI
Zhou, Y., R. J. A. Little, and J. D. Kalbfleisch. Block-conditional missing at random models for missing data. In: Statistical Science 25.4 (2010), pp. 517–532.
DOI

Londschien, M., S. Kovács, and P. Bühlmann. Change point detection for graphical models in presence of missing values. 2019. arXiv: 1907.05409 [stat.ML].

Regression

There is a vast literature on how to perform (linear) regression, possibly in a high-dimensional setting, in the presence of missing values in the covariates. This can be seen as a particular case of supervised learning, which is presented below, even if the focus is often more on estimating parameters or selecting relevant variables.

Jiang, W., M. Bogdan, J. Josse, et al. Adaptive Bayesian SLOPE–High-dimensional Model Selection with Missing Values. In: arXiv preprint (2019).
URL
Golden, R. M., S. S. Henley, H. White, et al. Consequences of model misspecification for maximum likelihood estimation with missing data. In: Econometrics 7.3 (2019), p. 37.
DOI
Jiang, W., J. Josse, and M. Lavielle. Logistic Regression with Missing Covariates–Parameter Estimation, Model Selection and Prediction. In: arXiv preprint (2018). arXiv: 1805.04602 [stat.ME].
Jones, M. P. Indicator and Stratification Methods for Missing Explanatory Variables in Multiple Linear Regression. In: Journal of the American Statistical Association 91.433 (1996), pp. 222-230.
DOI
Lüdtke, O., A. Robitzsch, and S. G. West. Regression models involving nonlinear effects with missing data: A sequential modeling approach using Bayesian estimation. In: Psychological methods (2019).
DOI

Loh, P. and M. J. Wainwright. High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. In: Advances in Neural Information Processing Systems. Ed. by J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira and K. Q. Weinberger. Vol. 24. Curran Associates, Inc., 2011, pp. 2726–2734.
URL

Single imputation

In the previously mentioned EM algorithm there is in fact an implicit step called imputation: imputing a missing value means replacing it with a plausible one. The definition of plausibility is not stated explicitly but can be deduced from the method used to fill in the gaps. For instance, one could choose to replace all missing values of a certain variable \(X_j\) by the average observed value \(\frac{1}{n_{obs,j}}\sum_{i} x_{ij}\mathbb{1}_{\{x_{ij} \, is\, observed\}}\), where \(n_{obs,j} = \sum_{i} \mathbb{1}_{\{x_{ij} \, is\, observed\}}\). The interest of imputation is manifold: (1) it allows all the information in the sample to be used (instead of deleting incomplete observations, which reduces the power of the statistical analysis), (2) if there is sufficient data, i.e. sufficient observations, then the imputation can be very accurate, which supports the quality of subsequent statistical analyses, and (3) the imputed dataset is a complete dataset to which one can apply standard statistical inference methods. The latter, however, has to be treated with caution since it implies that the statistical analysis no longer makes any distinction between observed and imputed values. We will come back to this issue in the section on multiple imputation below.
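The mean imputation formula above can be sketched in a few lines of Python. The `None` encoding of missing entries, the helper name `impute_column_means` and the toy matrix are assumptions of this example, which also assumes that every variable has at least one observed value.

```python
def impute_column_means(X):
    """Replace each missing entry of column j by the mean of the observed x_ij."""
    n_columns = len(X[0])
    means = []
    for j in range(n_columns):
        observed = [row[j] for row in X if row[j] is not None]
        means.append(sum(observed) / len(observed))  # (1 / n_obs,j) * sum of observed
    return [
        [means[j] if value is None else value for j, value in enumerate(row)]
        for row in X
    ]


X = [
    [1.0, None],
    [3.0, 4.0],
    [None, 8.0],
]
X_completed = impute_column_means(X)
```

Since every imputed entry of a column equals the same value, the completed column has a smaller empirical variance than the observed values alone, which is one way to see why the caution expressed above is needed.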

Audigier, V., F. Husson, and J. Josse. A principal component method to impute missing values for mixed data. In: Advances in Data Analysis and Classification 10.1 (2016), pp. 5-26.
DOI
Bertsimas, D., C. Pawlowski, and Y. D. Zhuo. From predictive methods to missing data imputation: an optimization approach. In: The Journal of Machine Learning Research 18.1 (2017), pp. 7133–7171.
Cranmer, S. J. and J. Gill. We have to be discrete about this: a non-parametric imputation technique for missing categorical data. In: British Journal of Political Science 43 (2012), pp. 425-449.
DOI
Crookston, N. L. and A. O. Finley. yaImpute: an R package for kNN imputation. In: Journal of Statistical Software 23 (2008), p. 10.
DOI
Dax, A. Imputing Missing Entries of a Data Matrix: A review. In: Journal of Advanced Computing 3.3 (2014), pp. 98-222.
DOI
Ding, Y. and J. S. Simonoff. An investigation of missing data methods for classification trees applied to binary response data. In: Journal of Machine Learning Research 11.1 (2010), pp. 131-170.
URL
Fellegi, I. P. and D. Holt. A systematic approach to automatic edit and imputation. In: Journal of the American Statistical Association 71.353 (1976), pp. 17-35.
DOI
Ferrari, P. A., P. Annoni, A. Barbiero, et al. An imputation method for categorical variables with application to nonlinear principal component analysis. In: Computational Statistics & Data Analysis 55.7 (2011), pp. 2410-2420.
DOI
Finkbeiner, C. Estimation for the multiple factor model when data are missing. In: Psychometrika 44.4 (1979), pp. 409-420.
DOI
Huisman, M. Imputation of missing item responses: some simple techniques. In: Quality & Quantity 34.4 (2000), pp. 331-351.
DOI
Husson, F. and J. Josse. Handling missing values in multiple factor analysis. In: Food Quality and Preference 30 (2013), pp. 77-85.
DOI
Ilin, A. and T. Raiko. Practical approaches to Principal Component Analysis in the presence of missing values. In: Journal of Machine Learning Research 11 (2010), pp. 1957-2000.
URL
Joenssen, D. W. and U. Bankhofer. Donor limited hot deck imputation: effect on parameter estimation. In: Journal of Theoretical and Applied Computer Science 6.3 (2012), pp. 58-70.
URL
Josse, J., M. Chavent, B. Liquet, et al. Handling missing values with regularized iterative multiple correspondence analysis. In: Journal of Classification 29.1 (2012), pp. 91-116.
DOI
Josse, J., F. Husson, and J. Pagès. Gestion des données manquantes en Analyse en Composantes Principales. In: Journal de la Société Française de Statistique 150.2 (2009), pp. 28-51.
URL
Kalton, G. and D. Kasprzyk. The treatment of missing survey data. In: Survey Methodology 12.1 (1986), pp. 1-16.
URL
Kohn, R. and C. F. Ansley. Estimation, prediction, and interpolation for ARIMA models with missing data. In: Journal of the American Statistical Association 81.395 (1986), pp. 751-761.
DOI
Kowarik, A. and M. Templ. Imputation with the R Package VIM. In: Journal of Statistical Software 74.7 (2016), pp. 1-16.
DOI
Moritz, S. and T. Bartz-Beielstein. imputeTS: time series missing value imputation in R. In: The R Journal 9.1 (2017), pp. 207-218.
URL
Tang, F. and H. Ishwaran. Random forest missing data algorithms. In: Statistical Analysis and Data Mining: The ASA Data Science Journal 10.6 (2017), pp. 363–377.
DOI
Stacklies, W., H. Redestig, M. Scholz, et al. pcaMethods – a bioconductor package providing PCA methods for incomplete data. In: Bioinformatics 23.9 (2007), pp. 1164-1167.
DOI
Troyanskaya, O., M. Cantor, G. Sherlock, et al. Missing value estimation methods for DNA microarrays. In: Bioinformatics 17.6 (2001), pp. 520-525.
DOI
Unnebrink, K. and J. Windeler. Intention-to-treat: methods for dealing with missing values in clinical trials of progressively deteriorating diseases. In: Statistics in Medicine 20.24 (2001), pp. 3931-3946.
DOI
Verbanck, M., J. Josse, and F. Husson. Regularised PCA to denoise and visualise data. In: Statistics and Computing 25.2 (2015), pp. 471-486.
DOI
Zhang, H., P. Xie, and E. Xing. Missing Value Imputation Based on Deep Generative Models. In: Computing Research Repository abs/1808.01684 (2018).
URL
Zhang, S. Nearest neighbor selection for iterative kNN imputation. In: Journal of Systems and Software 85.11 (2012), pp. 2541-2552.
DOI
Zhu, Z., T. Wang, and R. J. Samworth. High-dimensional principal component analysis with heterogeneous missingness. In: arXiv preprint (2019).
URL

Tran, L., X. Liu, J. Zhou, et al. Missing Modalities Imputation via Cascaded Residual Autoencoder. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (Jul. 21, 2017-Jul. 26, 2017). IEEE, 2017, pp. 4971-4980.
DOI
Zhao, Y. and M. Udell. Missing value imputation for mixed data via gaussian copula. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020, pp. 636–646.
DOI

Moritz, S., A. Sardá, T. Bartz-Beielstein, et al. Comparison of different methods for univariate time series imputation in R. Preprint arXiv 1510.03924. 2015.
URL

Hot-deck and KNN approaches

Let \(x_i\) be an observation with missing values, e.g. each entry of \(x_i\) could be the temperature on a given day at one given place, where unfortunately the temperature was not measured on some days. An intuitive idea to replace this missing information could be: take other observations \(\{x_j\}_j\) which are similar to \(x_i\) on the observed values and use this information to fill in the gaps. This idea of taking observed values from neighbours or donors based on some similarity measure is implemented in the so-called hot-deck and k-nearest-neighbors (kNN) approaches.
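The donor idea can be sketched as follows in Python. This is only one simple variant, under assumptions chosen for the example: similarity is the root mean squared difference over the coordinates observed in both rows, donors must have the target coordinate observed, and the imputed value is the average over the `k` closest donors. The names `partial_distance` and `knn_impute` are hypothetical.

```python
import math


def partial_distance(a, b):
    """Root mean squared difference over coordinates observed in both rows."""
    shared = [(u - v) ** 2 for u, v in zip(a, b) if u is not None and v is not None]
    return math.sqrt(sum(shared) / len(shared)) if shared else math.inf


def knn_impute(X, k=2):
    """Fill each missing x_ij with the average of x_mj over the k nearest donors m."""
    completed = [list(row) for row in X]
    for i, row in enumerate(X):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows with coordinate j observed and at
            # least one coordinate in common with row i.
            donors = [
                (partial_distance(row, other), other[j])
                for m, other in enumerate(X)
                if m != i and other[j] is not None
            ]
            donors = [d for d in donors if d[0] < math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            if nearest:  # leave the entry missing if no donor is available
                completed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return completed


# Toy data: rows are places, columns are daily temperatures, None is missing.
X = [
    [1.0, 2.0, None],
    [1.1, 2.1, 10.0],
    [0.9, 1.9, 12.0],
    [5.0, 6.0, 50.0],
]
X_completed = knn_impute(X, k=2)
```

Drawing a single donor at random among the nearest candidates, instead of averaging them, would give a hot-deck flavoured variant of the same idea.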

Andridge, R. and R. J. A. Little. A review of hot deck imputation for survey non-response. In: International Statistical Review 78.1 (2010), pp. 40-64.
DOI
Huisman, M. Imputation of missing item responses: some simple techniques. In: Quality & Quantity 34.4 (2000), pp. 331-351.
DOI
Imbert, A., A. Valsesia, C. Le Gall, et al. Multiple hot-deck imputation for network inference from RNA sequencing data. In: Bioinformatics 34.10 (2018), pp. 1726-1732.
DOI
Joenssen, D. W. and U. Bankhofer. Donor limited hot deck imputation: effect on parameter estimation. In: Journal of Theoretical and Applied Computer Science 6.3 (2012), pp. 58-70.
URL
Rao, J. N. K. and J. Shao. Jackknife variance estimation with survey data under hot deck imputation. In: Biometrika 79.4 (1992), pp. 811-822.
DOI
Reilly, M. and M. Pepe. The relationship between hot-deck multiple imputation and weighted likelihood. In: Statistics in Medicine 16.1-3 (1997), pp. 5-19.
DOI
Voillet, V., P. Besse, L. Liaubet, et al. Handling missing rows in multi-omics data integration: multiple imputation in multiple factor analysis framework. In: BMC Bioinformatics 17.402 (2016). Forthcoming.
DOI

Matrix factorization

A special case of imputation is matrix completion that exploits structural assumptions about the row and column spaces to impute the missing values.
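As a toy illustration of such a structural assumption, the Python sketch below completes a matrix that is assumed to be of rank one, \(x_{ij} \approx u_i v_j\), by alternating least squares on the observed entries only. The rank-one restriction, the function name `rank_one_completion`, the initialization and the fixed number of iterations are simplifications made for this example; the methods in the references below handle higher ranks, noise and regularization.

```python
def rank_one_completion(X, n_iter=500):
    """Fit x_ij ~ u_i * v_j on observed entries and fill the missing ones."""
    n, p = len(X), len(X[0])
    u = [1.0] * n
    v = [1.0] * p
    for _ in range(n_iter):
        # Update each row factor u_i by least squares with v held fixed.
        for i in range(n):
            num = sum(X[i][j] * v[j] for j in range(p) if X[i][j] is not None)
            den = sum(v[j] ** 2 for j in range(p) if X[i][j] is not None)
            if den > 0:
                u[i] = num / den
        # Update each column factor v_j by least squares with u held fixed.
        for j in range(p):
            num = sum(X[i][j] * u[i] for i in range(n) if X[i][j] is not None)
            den = sum(u[i] ** 2 for i in range(n) if X[i][j] is not None)
            if den > 0:
                v[j] = num / den
    return [
        [X[i][j] if X[i][j] is not None else u[i] * v[j] for j in range(p)]
        for i in range(n)
    ]


# The observed entries are consistent with the rank-one matrix (1, 2, 3)^T (1, 2, 3).
X = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, None],
    [3.0, None, 9.0],
]
X_completed = rank_one_completion(X)
```

On this noiseless example the two missing entries are recovered from the row and column structure alone, which is exactly the kind of information a column-wise method such as mean imputation ignores.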

Nguyen, L. T., J. Kim, and B. Shim. Low-Rank Matrix Completion: A Contemporary Survey. In: IEEE Access 7 (2019), pp. 94215–94237.
DOI
Robin, G., O. Klopp, J. Josse, et al. Main Effects and Interactions in Mixed and Incomplete Data Frames. In: Journal of the American Statistical Association 115.531 (2020), pp. 1292-1303. eprint: https://doi.org/10.1080/01621459.2019.1623041.
DOI URL
Sportisse, A., C. Boyer, and J. Josse. Imputation and low-rank estimation with Missing Not At Random data. In: Statistics and Computing 30.6 (2020), pp. 1629-1643.
DOI

Kallus, N., X. Mao, and M. Udell. Causal Inference with Noisy and Missing Covariates via Matrix Factorization. In: Advances in Neural Information Processing Systems. 2018. eprint: 1806.00811.
URL
Ma, W. and G. H. Chen. Missing Not at Random in Matrix Completion: The Effectiveness of Estimating Missingness Probabilities Under a Low Nuclear Norm Assumption. In: Advances in Neural Information Processing Systems 32. Ed. by H. Wallach, H. Larochelle, A. Beygelzimer, F. d. Alché-Buc, E. Fox and R. Garnett. Curran Associates, Inc., 2019, pp. 14900–14909.
URL
Zhao, Y. and M. Udell. Missing value imputation for mixed data via gaussian copula. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020, pp. 636–646.
DOI

Robin, G. Low-rank methods for heterogeneous and multi-source data. 2019.
DOI

Multiple imputation

A major drawback of single imputation, i.e. where every missing value is replaced by a single most plausible value, is that it underestimates the overall variance of the data and of the inferred parameters. Indeed, by replacing every missing value by a given plausible one and by applying generic statistical methods on the completed dataset, one no longer distinguishes between initially observed and unobserved data. Therefore the variability due to the uncertainty of the missing values is not reflected in subsequent statistical analyses, which treat the dataset as if it had been fully observed from the beginning. A nice and conceptually simple workaround for this problem is multiple imputation: instead of generating a single complete dataset by a given imputation method, one imputes every missing value by several possible values. The statistical analysis is then applied to each of the imputed datasets, and the resulting estimates are aggregated and used to estimate both the sampling variance and the variance due to the uncertainty in the missing values.
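The aggregation step is commonly carried out with the pooling formulas known as Rubin's rules. The Python sketch below applies them to a scalar parameter, assuming that each of the \(m\) imputed datasets has already been analysed and has produced an estimate together with its estimated variance; the function name `pool_rubin` and the numbers are hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, within_variances):
    """Pool m completed-data analyses of a scalar parameter.

    Returns the pooled estimate and its total variance
    T = W + (1 + 1/m) * B, where W is the average within-imputation
    variance and B the between-imputation variance of the estimates.
    """
    m = len(estimates)
    pooled_estimate = mean(estimates)
    within = mean(within_variances)
    between = variance(estimates)  # sample variance, divisor m - 1
    total_variance = within + (1 + 1 / m) * between
    return pooled_estimate, total_variance


# Hypothetical results from m = 3 imputed datasets.
estimates = [1.0, 2.0, 3.0]
within_variances = [0.5, 0.5, 0.5]
pooled_estimate, total_variance = pool_rubin(estimates, within_variances)
```

The between-imputation term is what a single imputation cannot provide: with only one completed dataset, the total variance would reduce to the within-imputation part.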

Carpenter, J. and M. Kenward. Multiple Imputation and its Application. Chichester, West Sussex, UK: Wiley, 2013. ISBN: 9780470740521.
DOI
Rubin, D. B. Multiple Imputation for Nonresponse in Surveys. Hoboken, NJ, USA: Wiley, 1987. ISBN: 9780471655740.

Abayomi, K., A. Gelman, and M. Levy. Diagnostics for multivariate imputations. In: Journal of the Royal Statistical Society, Series C (Applied Statistics) 57.3 (2008), pp. 273-291.
DOI
Audigier, V., F. Husson, and J. Josse. Multiple imputation for continuous variables using a Bayesian principal component analysis. In: Journal of Statistical Computation and Simulation 86.11 (2015), pp. 2140-2156.
DOI
Audigier, V., F. Husson, and J. Josse. MIMCA: multiple imputation for categorical variables with multiple correspondence analysis. In: Statistics and Computing 27.2 (2016), pp. 1-18. eprint: 1505.08116.
DOI
Carpenter, J. R., M. G. Kenward, and S. Vansteelandt. A comparison of multiple imputation and doubly robust estimation for analyses with missing data. In: Journal of the Royal Statistical Society: Series A (Statistics in Society) 169.3 (2006), pp. 571–584.
DOI
Collins, L. M., J. L. Schafer, and C. M. Kam. A comparison of inclusive and restrictive strategies in modern missing data procedures. In: Psychological Methods 6.4 (2001), pp. 330-351.
DOI
Erler, N. S., D. Rizopoulos, and E. M. Lesaffre. JointAI: joint analysis and imputation of incomplete data in R. In: arXiv preprint (2019).
URL
Fay, R. E. Alternative paradigms for the analysis of imputed survey data. In: Journal of the American Statistical Association 91.434 (1996), pp. 490-498.
DOI
Gelman, A., G. King, and C. Liu. Not asked and not answered: Multiple imputation for multiple surveys. In: Journal of the American Statistical Association 93.443 (1998), pp. 846–857.
DOI
Gelman, A., I. van Mechelen, G. Verbeke, et al. Multiple Imputation for Model Checking: Completed-Data Plots with Missing and Latent Data. In: Biometrics 61.1 (2005), pp. 74–85.
DOI
Graham, J. W., A. E. Olchowski, and T. E. Gilreath. How many imputations are really needed? Some practical clarifications of multiple imputation theory. In: Prevention Science 8.3 (2007), pp. 206-213.
DOI
Honaker, J., G. King, and M. Blackwell. Amelia II: a program for missing data. In: Journal of Statistical Software 45.7 (2011). eprint: arXiv:1501.0228.
DOI
Imbert, A., A. Valsesia, C. Le Gall, et al. Multiple hot-deck imputation for network inference from RNA sequencing data. In: Bioinformatics 34.10 (2018), pp. 1726-1732.
DOI
Josse, J., J. Pagès, and F. Husson. Multiple imputation in principal component analysis. In: Advances in Data Analysis and Classification 5.3 (2011), pp. 231-246.
DOI
Josse, J. and F. Husson. Handling missing values in exploratory multivariate data analysis methods. In: Journal de la Société Française de Statistique 153.2 (2012), pp. 79-99.
URL
Josse, J. and F. Husson. missMDA: a package for handling missing values in multivariate data analysis. In: Journal of Statistical Software 70.1 (2016), pp. 1-31.
DOI
Kropko, J., B. Goodrich, A. Gelman, et al. Multiple Imputation for Continuous and Categorical Data: Comparing Joint Multivariate Normal and Conditional Approaches. In: Political Analysis 22.4 (2014), pp. 497–519.
DOI
Larose, C., D. K. Dey, and O. Harel. The impact of missing values on different measures of uncertainty. In: Statistica Sinica 29.2 (2019), pp. 551–566.
DOI
Murray, J. S. and J. P. Reiter. Multiple Imputation of Missing Categorical and Continuous Values via Bayesian Mixture Models With Local Dependence. In: Journal of the American Statistical Association 111.516 (2016), pp. 1466-1479.
DOI
Quartagno, M. and J. R. Carpenter. Multiple imputation for discrete data: Evaluation of the joint latent normal model. In: Biometrical Journal 61.4 (2019), pp. 1003–1019.
DOI
Robins, J. M. and N. Wang. Inference for imputation estimators. In: Biometrika 87.1 (2000), pp. 113-124.
URL
Rubin, D. B. Multiple imputation after 18+ years. In: Journal of the American Statistical Association 91.434 (1996), pp. 473-489.
DOI
Schafer, J. L. and M. K. Olsen. Multiple Imputation for multivariate missing-data problems: a data analyst’s perspective. In: Multivariate Behavioral Research 33.4 (1998), pp. 545-571.
DOI
Schafer, J. L. Multiple imputation: a primer. In: Statistical Methods in Medical Research 8.1 (1999), pp. 3-15.
DOI
Stuart, E. A., M. Azur, C. Frangakis, et al. Multiple imputation with large data sets: a case study of the children’s mental health initiative. In: American Journal of Epidemiology 169.9 (2009), pp. 1133-1139.
DOI
Su, Y. S., A. Gelman, J. Hill, et al. Multiple imputation with diagnostics (mi) in R: opening windows into the black box. In: Journal of Statistical Software 45 (2011), p. 2.
DOI
Buuren, S. van, J. P. L. Brand, C. G. M. Groothuis-Oudshoorn, et al. Fully conditional specification in multivariate imputation. In: Journal of Statistical Computation and Simulation 76.12 (2006), pp. 1049-1064.
DOI
Buuren, S. van and K. Groothuis-Oudshoorn. MICE: multivariate imputation by chained equations in R. In: Journal of Statistical Software 45 (2011), p. 3.
DOI
Buuren, S. van. Multiple imputation of discrete and continuous data by fully conditional specification. In: Statistical Methods in Medical Research 16 (2007), pp. 219-242.
DOI
Voillet, V., P. Besse, L. Liaubet, et al. Handling missing rows in multi-omics data integration: multiple imputation in multiple factor analysis framework. In: BMC Bioinformatics 17.402 (2016).
DOI
Wang, N. and J. M. Robins. Large-sample theory for parametric multiple imputation procedures. In: Biometrika 85.4 (1998), pp. 935–948.
DOI
Xie, X. and X. L. Meng. Dissecting multiple imputation from a multi-phase inference perspective: what happens when God’s, imputer’s and analyst’s models are uncongenial? In: Statistica Sinica 27.4 (2017), pp. 1485–1594.
DOI

Gondara, L. and K. Wang. MIDA: Multiple Imputation using Denoising Autoencoders. In: Proceedings of the 22nd Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2018). (Jun. 03, 2018-Jun. 06, 2018). Ed. by D. Phung, V. Tseng, G. Webb, B. Ho, M. Ganji and L. Rashidi. Lecture Notes in Computer Science. Springer International Publishing, 2018, pp. 260-272. ISBN: 3319930404.
DOI URL
Muzellec, B., J. Josse, C. Boyer, et al. Missing Data Imputation using Optimal Transport. In: International Conference on Machine Learning. PMLR. 2020, pp. 7130–7140.

Machine Learning

Because machine learning depends on the availability of (good) training data, it necessarily faces the issue of missing data in most real-world applications. Hence there has been increasing attention to how to handle missing data, both in the features and in the output, in order to learn accurately from the data.

Supervised learning

Methods for supervised learning (predicting an outcome as accurately as possible) with missing values in the covariates differ substantially from methods for inference with missing values (estimating parameters).

Le Morvan, M. and G. Varoquaux. Imputation for prediction: beware of diminishing returns. In: arXiv preprint arXiv:2407.19804 (2024).
Josse, J., N. Prost, E. Scornet, et al. On the consistency of supervised learning with missing values. In: arXiv preprint (2019). arXiv: 1902.06931 [stat.ML].
URL
Ma, A. and D. Needell. Stochastic Gradient Descent for Linear Systems with Missing Data. In: Numerical Mathematics: Theory, Methods and Applications 12.1 (2017), pp. 1-20.
DOI

Ayme, A., C. Boyer, A. Dieuleveut, et al. Near-optimal rate of consistency for linear models with missing values. In: International Conference on Machine Learning. PMLR. 2022, pp. 1211–1243.
Ayme, A., C. Boyer, A. Dieuleveut, et al. Random features models: a way to study the success of naive imputation. In: Proceedings of the 41st International Conference on Machine Learning. ICML’24. Vienna, Austria: JMLR.org, 2024.
Ipsen, N. B., P. Mattei, and J. Frellsen. not-MIWAE: Deep generative modelling with missing not at random data. In: arXiv preprint (2020).
Ipsen, N., P. Mattei, and J. Frellsen. How to deal with missing data in supervised deep learning? In: ICML Workshop on the Art of Learning with Missing Values (Artemiss). 2020.
URL
Le Morvan, M., N. Prost, J. Josse, et al. Linear predictor on linearly-generated data with missing values: non consistency and solutions. In: Proceedings of Machine Learning Research. Vol. 108. 2020, pp. 3165–3174. eprint: 2002.00658v2.
URL
Le Morvan, M., J. Josse, T. Moreau, et al. NeuMiss networks: differentiable programming for supervised learning with missing values. In: Advances in Neural Information Processing Systems, 33. (Dec. 2020). IEEE, 2020. eprint: 2007.01627v4.
URL
Sportisse, A., C. Boyer, A. Dieuleveut, et al. Debiasing Averaged Stochastic Gradient Descent to handle missing values. In: Advances in Neural Information Processing Systems, 33. (Dec. 2020). IEEE, 2020. eprint: 2002.09338v2.
URL

Le Morvan, M., J. Josse, E. Scornet, et al. What’s a good imputation to predict with missing values? 2021.
URL

Unsupervised learning

Methods have been suggested to perform clustering with missing values (k-means, mixture models) as well as dimensionality reduction with missing values (PCA).
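
As an illustration of clustering without prior imputation, here is a minimal sketch of k-means in which both the distances and the centroid updates use only the observed entries. It is one simple variant, not the specific algorithm of any reference listed here; the function name `kmeans_observed`, the toy data and the initial centroids are illustrative assumptions.

```python
import numpy as np


def kmeans_observed(X, centroids, n_iter=20):
    """Lloyd iterations using only the observed (non-NaN) entries of X."""
    observed = ~np.isnan(X)
    X0 = np.where(observed, X, 0.0)  # placeholder zeros, always masked out below
    centroids = np.asarray(centroids, dtype=float).copy()
    for _ in range(n_iter):
        # Squared distance to each centroid, summed over observed coordinates.
        diff = (X0[:, None, :] - centroids[None, :, :]) * observed[:, None, :]
        labels = (diff ** 2).sum(axis=2).argmin(axis=1)
        # Update each centroid coordinate from the observed entries assigned to it.
        for k in range(len(centroids)):
            members = observed & (labels == k)[:, None]
            counts = members.sum(axis=0)
            sums = np.where(members, X0, 0.0).sum(axis=0)
            # Keep the previous value when no observed entry supports a coordinate.
            centroids[k] = np.where(counts > 0, sums / np.maximum(counts, 1), centroids[k])
    return labels, centroids


# Toy data: two well-separated groups, NaN marks a missing entry.
X = np.array([
    [0.0, 0.2],
    [0.1, np.nan],
    [np.nan, -0.1],
    [10.0, 9.8],
    [np.nan, 10.1],
    [9.9, np.nan],
])
labels, centroids = kmeans_observed(X, centroids=[[1.0, 1.0], [8.0, 8.0]])
print(labels)
print(centroids)
```

Rows with a missing coordinate are still assigned to a cluster, because the missing coordinate simply contributes nothing to the distance. A row with no observed entry at all would be at distance zero from every centroid, so such rows should be handled separately.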

Brinis, S., C. Traina, and A. J. Traina. Hollow-tree: a metric access method for data with missing values. In: Journal of Intelligent Information Systems (2019), pp. 1–28.
DOI
Hunt, L. and M. Jorgensen. Mixture model clustering for mixed data with missing information. In: Computational Statistics & Data Analysis 41.3-4 (2003), pp. 429–440.
DOI
Chi, J. T., E. C. Chi, and R. G. Baraniuk. k-pod: A method for k-means clustering of missing data. In: The American Statistician 70.1 (2016), pp. 91–99.
DOI
Josse, J., M. Chavent, B. Liquet, et al. Handling missing values with regularized iterative multiple correspondence analysis. In: Journal of Classification 29.1 (2012), pp. 91-116.
DOI
Miao, W. and E. J. Tchetgen Tchetgen. Identification and inference with nonignorable missing covariate data. In: Statistica Sinica 28.4 (2018), pp. 2049–2067.
DOI

Trees and forests

Decision trees are models based on the recursive application of elementary rules. This architecture gives them a variety of simple options for dealing with missing values without requiring prior imputation. Random forests, a popular class of models built from ensembles of decision trees, allow data analyses such as causal inference in the presence of missing values without having to impute them.
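
One such option is "missing incorporated in attribute" (MIA), discussed by Twala et al. (2008) in the list below: at each split, the missing values are sent as a block to whichever side gives the better fit, or the split is made on missingness itself. The following is a minimal sketch for a single feature and a squared-error criterion; the function names and the toy data are illustrative assumptions, not code from any of the cited papers.

```python
import numpy as np


def sse(y):
    """Sum of squared errors around the mean (0 for an empty node)."""
    return float(((y - y.mean()) ** 2).sum()) if y.size else 0.0


def best_mia_split(x, y):
    """Best split of one feature x (NaN = missing) for a regression stump.

    Returns (loss, threshold, rule), where rule says where missing values go.
    """
    miss = np.isnan(x)
    x_filled = np.where(miss, np.inf, x)  # missing values are never "below"
    best = (sse(y), None, "no split")
    # Candidate thresholds: midpoints between consecutive distinct observed values.
    values = np.unique(x[~miss])
    for t in (values[:-1] + values[1:]) / 2:
        below = x_filled <= t
        # MIA options 1 and 2: missing values join the left or the right child.
        for rule, left in (("missing left", below | miss), ("missing right", below)):
            loss = sse(y[left]) + sse(y[~left])
            if loss < best[0]:
                best = (loss, float(t), rule)
    # MIA option 3: split on missingness itself.
    if miss.any() and not miss.all():
        loss = sse(y[miss]) + sse(y[~miss])
        if loss < best[0]:
            best = (loss, None, "missing vs observed")
    return best


x = np.array([1.0, 2.0, 3.0, 4.0, np.nan, np.nan])
y = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
loss, threshold, rule = best_mia_split(x, y)
print(loss, threshold, rule)
```

In this toy example the rows with a missing `x` have the same outcome as the rows with large `x`, so the best split sends them to the right of the threshold.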

Beaulac, C. and J. S. Rosenthal. BEST: A decision tree algorithm that handles missing values. In: arXiv preprint (2018). eprint: 1804.10168.
URL
Bertsimas, D., C. Pawlowski, and Y. D. Zhuo. From predictive methods to missing data imputation: an optimization approach. In: The Journal of Machine Learning Research 18.1 (2017), pp. 7133–7171.
Ding, Y. and J. S. Simonoff. An investigation of missing data methods for classification trees applied to binary response data. In: Journal of Machine Learning Research 11.1 (2010), pp. 131-170.
URL
Hothorn, T., K. Hornik, and A. Zeileis. Unbiased Recursive Partitioning: A Conditional Inference Framework. In: Journal of Computational and Graphical Statistics 15.3 (2006), pp. 651-674.
DOI
Josse, J., N. Prost, E. Scornet, et al. On the consistency of supervised learning with missing values. In: arXiv preprint (2019). arXiv: 1902.06931 [stat.ML].
URL
Kapelner, A. and J. Bleich. Prediction with missing data via Bayesian additive regression trees. In: Canadian Journal of Statistics 43.2 (2015), pp. 224-239.
DOI URL
Khosravi, P., A. Vergari, Y. Choi, et al. Handling missing data in decision trees: A probabilistic approach. In: arXiv preprint arXiv:2006.16341 (2020).
Rahman, G. and Z. Islam. Missing value imputation using decision trees and decision forests by splitting and merging records: Two novel techniques. In: Knowledge-Based Systems 53 (2013), pp. 51–65.
DOI URL
Stekhoven, D. J. and P. Bühlmann. Missforest-non-parametric missing value imputation for mixed-type data. In: Bioinformatics 28.1 (2012), pp. 112-118. eprint: 1105.0828.
DOI
Strobl, C., A. L. Boulesteix, and T. Augustin. Unbiased split selection for classification trees based on the Gini Index. In: Computational Statistics & Data Analysis 52.1 (2007), pp. 483-501.
DOI
Tierney, N. J., F. A. Harden, M. J. Harden, et al. Using decision trees to understand structure in missing data. In: BMJ Open 5.6 (2015), p. e007450.
DOI
Twala, B. E. T. H., M. C. Jones, and D. J. Hand. Good methods for coping with missing data in decision trees. In: Pattern Recognition Letters 29.7 (2008), pp. 950-956.
DOI

Chen, T. and C. Guestrin. XGBoost: A Scalable Tree Boosting System. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (Aug. 13, 2016-Aug. 17, 2016). New York, NY, USA: ACM, 2016, pp. 785-794. ISBN: 0450342322.
DOI

Rieger, A., T. Hothorn, and C. Strobl. Random forests with missing values in the covariates. Tech. rep. 79. University of Munich, Department of Statistics, 2010.
URL

Logistic Regression

Logistic regression is a core method in supervised learning, widely used not only for classification but also for estimating class probabilities. This probabilistic output is especially important in applications like medicine, where decisions depend on risk estimates rather than hard class labels. However, the presence of missing values in the covariates poses significant challenges. Unlike linear regression, where missing data techniques are more mature, handling missingness in logistic regression is still a relatively new and active area of research.

Verchand, K. A. and A. Montanari. High-dimensional logistic regression with missing data: Imputation, regularization, and universality. In: arXiv preprint arXiv:2410.01093 (2024).
Lobo, A. D. R., A. Ayme, C. Boyer, et al. A primer on linear classification with missing data. In: arXiv preprint arXiv:2405.09196 (2024).

Deep Learning

The advance and success of (deep) neural networks in many research and application areas, such as computer vision and natural language processing, have also renewed interest in the problem of handling missing values. Indeed, the question of training neural networks on incomplete data was considered even before the latest rise of deep learning, and it is regarded as essential because missingness affects the feasibility and quality of various learning problems.

Deng, G., C. Han, and D. S. Matteson. Extended missing data imputation via GANs for ranking applications. In: Data Mining and Knowledge Discovery 36.4 (2022), pp. 1498–1520.
Fang, F. and S. Bao. FragmGAN: generative adversarial nets for fragmentary data imputation and prediction. In: Statistical Theory and Related Fields 8.1 (2024), pp. 15–28.
Bianchi, F. M., L. Livi, K. Ø. Mikalsen, et al. Learning representations of multivariate time series with missing data. In: Pattern Recognition 96 (2019), p. 106973.
DOI
Ipsen, N. B., P. Mattei, and J. Frellsen. not-MIWAE: Deep generative modelling with missing not at random data. In: arXiv preprint (2020).
URL
Sharpe, P. K. and R. J. Solly. Dealing with missing values in neural network-based diagnostic systems. In: Neural Computing & Applications 3.2 (1995), pp. 73-77.
DOI
Śmieja, M., Ł. Struski, J. Tabor, et al. Processing of missing data by neural networks. In: Computing Research Repository abs/1805.07405 (2018). eprint: 1805.07405.
URL
Sovilj, D., E. Eirola, Y. Miche, et al. Extreme learning machine for missing data using multiple imputations. In: Neurocomputing 174.A (2016), pp. 220-231.
DOI
Zhang, H., P. Xie, and E. Xing. Missing Value Imputation Based on Deep Generative Models. In: Computing Research Repository abs/1808.01684 (2018).
URL

Bengio, Y. and F. Gingras. Recurrent neural networks for missing or asynchronous data. In: Proceedings of the 8th International Conference on Neural Information Processing Systems. (Nov. 27, 1995-Dec. 02, 1995). Cambridge, MA, USA: MIT Press, 1995, pp. 395-401.
URL
Biessmann, F., D. Salinas, S. Schelter, et al. "Deep" Learning for Missing Value Imputation in Tables with Non-Numerical Data. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management. CIKM ’18. Torino, Italy: ACM, 2018, pp. 2017–2025. ISBN: 978-1-4503-6014-2.
DOI URL
Gondara, L. and K. Wang. MIDA: Multiple Imputation using Denoising Autoencoders. In: Proceedings of the 22nd Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2018). (Jun. 03, 2018-Jun. 06, 2018). Ed. by D. Phung, V. Tseng, G. Webb, B. Ho, M. Ganji and L. Rashidi. Lecture Notes in Computer Science. Springer International Publishing, 2018, pp. 260-272. ISBN: 3319930404.
DOI URL
Goodfellow, I., M. Mirza, A. Courville, et al. Multi-Prediction Deep Boltzmann Machines. In: Proceedings of the 26th International Conference on Neural Information Processing Systems. (Dec. 05, 2013-Dec. 10, 2013). Ed. by C. Burges, L. Bottou, M. Welling, Z. Ghahramani and K. Weinberger. Advances in Neural Information Processing Systems 26. Curran Associates, Inc., 2013, pp. 548–556.
URL
Ipsen, N., P. Mattei, and J. Frellsen. How to deal with missing data in supervised deep learning? In: ICML Workshop on the Art of Learning with Missing Values (Artemiss). 2020.
URL
Mattei, P. and J. Frellsen. MIWAE: Deep generative modelling and imputation of incomplete data sets. In: Proceedings of the 36th International Conference on Machine Learning. Vol. 97. Proceedings of Machine Learning Research. Kamalika Chaudhuri and Ruslan Salakhutdinov, 2019, pp. 4413–4423.
URL
Le Morvan, M., J. Josse, T. Moreau, et al. NeuMiss networks: differentiable programming for supervised learning with missing values. In: Advances in Neural Information Processing Systems, 33. (Dec. 2020). IEEE, 2020. eprint: 2007.01627v4.
URL
Nowicki, R. K., R. Scherer, and L. Rutkowski. Novel rough neural network for classification with missing data. In: 21st International Conference on Methods and Models in Automation and Robotics (MMAR). (Aug. 29, 2016-Sep. 01, 2016). IEEE, 2016, pp. 820–825.
DOI
Tran, L., X. Liu, J. Zhou, et al. Missing Modalities Imputation via Cascaded Residual Autoencoder. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (Jul. 21, 2017-Jul. 26, 2017). IEEE, 2017, pp. 4971-4980.
DOI
Yoon, J., J. Jordon, and M. van der Schaar. GAIN: Missing Data Imputation using Generative Adversarial Nets. In: Proceedings of the 35th International Conference on Machine Learning. (Jul. 10, 2018-Jul. 15, 2018). Ed. by J. Dy and A. Krause. Vol. 80. Proceedings of Machine Learning Research. Stockholmsmässan, Stockholm Sweden: PMLR, 2018, pp. 5689–5698.
URL
Yoon, S. and S. Sull. GAMIN: Generative Adversarial Multiple Imputation Network for Highly Missing Data. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 8456–8464.
URL

Londschien, M., S. Kovács, and P. Bühlmann. Change point detection for graphical models in presence of missing values. 2019. arXiv: 1907.05409 [stat.ML].

Graphical Models for Missing Data

Graphical models, particularly Directed Acyclic Graphs (DAGs), are used to represent causal relationships and conditional independencies in data. In the context of missing data, missingness graphs (m-graphs) extend these models to incorporate missingness mechanisms explicitly. This allows for the representation of how missingness arises in different variables and how these mechanisms may interact. M-graphs provide a structured framework for understanding and analyzing the dependencies between observed and missing data, offering a formal approach to diagnosing and addressing missing data issues in a variety of applications.
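
To make the idea concrete, the sketch below reads the missingness mechanism off the parents of the missingness indicators in a very small m-graph. It is a deliberately simplified reading: it assumes there are no latent variables, no edges between indicators, and that indicators have no children among the substantive variables, whereas the general criteria in the references below rely on d-separation. The function name, the dictionary encoding and the variable names (`age`, `income`) are illustrative assumptions.

```python
def classify_mechanism(parents_of_r, partially_observed):
    """Classify a mechanism from the parents of the missingness indicators.

    parents_of_r: dict mapping each partially observed variable to the set of
        variables with an edge into its missingness indicator.
    partially_observed: set of variables that have missing values.
    """
    causes = set().union(*parents_of_r.values())
    if not causes:
        return "MCAR"  # indicators are disconnected from the data
    if causes & set(partially_observed):
        return "MNAR"  # some cause of missingness is itself partially observed
    return "MAR"       # all causes of missingness are fully observed


partially_observed = {"income"}
print(classify_mechanism({"income": set()}, partially_observed))
print(classify_mechanism({"income": {"age"}}, partially_observed))
print(classify_mechanism({"income": {"income"}}, partially_observed))
```

The three calls correspond to an indicator with no parents, an indicator caused by the fully observed `age`, and an indicator caused by `income` itself.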

Mohan, K. and J. Pearl. Graphical models for recovering probabilistic and causal queries from missing data. In: Probabilistic and Causal Inference: the Works of Judea Pearl. 2022, pp. 413–432.

Mohan, K., J. Pearl, and J. Tian. Graphical models for inference with missing data. In: Advances in neural information processing systems 26 (2013).
Mohan, K. and J. Pearl. Graphical models for processing missing data. In: Journal of the American Statistical Association 116.534 (2021), pp. 1023–1037.

Missing values mechanisms and identifiability

As mentioned in the above sections, it is necessary to make assumptions on the mechanism generating the missing values or response mechanism in order to work with missing values. Broadly speaking, these assumptions indicate how much the missingness is related to the data itself. The assumptions made on the mechanism impact further steps in the data analysis (since some types of missingness can induce a bias on the analysis results) and are therefore crucial for valid analyses of data in the presence of missing values.

More formally, both \(X\) and \(R\) are modeled as random variables and the response mechanism is defined as the conditional distribution of \(R\) given \(X\), \(\mathbb{P}_R(R|X)\). This distribution can depend on some parameter \(\phi\) so that we have \(\mathbb{P}_R(R|X;\phi)\). Little and Rubin (2002) defined three main categories of missing values depending on the form of the conditional distribution \(\mathbb{P}_R\):

Missing completely at random (MCAR): The missingness does not depend on the variables \(X=(X_{obs},X_{mis})\), where \(X_{obs}\) and \(X_{mis}\) denote the observed and the missing variables respectively, i.e.
\[\mathbb{P}_R(R|X_{obs},X_{mis};\phi) = \mathbb{P}_R(R;\phi), \forall \phi\]
Missing at random (MAR): The missingness depends only on the observed variables \(X_{obs}\), i.e.
\[\mathbb{P}_R(R|X_{obs},X_{mis};\phi) = \mathbb{P}_R(R|X_{obs};\phi), \forall \phi,X_{mis}\]
Missing not at random (MNAR): The missingness is said to be MNAR in all other cases, i.e. the missingness depends on the missing values and potentially also on the observed values.

To illustrate the MNAR case, take the example of alcohol consumption: alcoholics are less inclined to reveal their alcohol consumption, so the probability that the consumption is missing depends on the amount consumed itself. Another simple example is information on income or wealth, which is missing more often for individuals with very high or very low income.
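
The three mechanisms can also be compared on simulated data. The sketch below is only an illustration under assumptions of our own: two Gaussian variables, a logistic response mechanism with an arbitrary coefficient, and a simple linear regression as the correction. None of these choices comes from the references.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# x1 is always observed; x2 is correlated with x1 and may go missing.
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)  # mean 0, variance 1


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Each indicator is True where x2 is observed (the convention used for R).
indicators = {
    "MCAR": rng.random(n) < 0.5,                 # ignores the data
    "MAR": rng.random(n) < sigmoid(-2.0 * x1),   # depends on observed x1 only
    "MNAR": rng.random(n) < sigmoid(-2.0 * x2),  # depends on x2 itself
}


def regression_adjusted_mean(x1, x2, r):
    """Fit x2 ~ x1 on rows where x2 is observed, then average the fit over all rows."""
    slope, intercept = np.polyfit(x1[r], x2[r], deg=1)
    return float(np.mean(intercept + slope * x1))


results = {}
for name, r in indicators.items():
    complete_case = float(x2[r].mean())
    adjusted = regression_adjusted_mean(x1, x2, r)
    results[name] = (complete_case, adjusted)
    print(f"{name}: complete-case mean {complete_case:+.3f}, "
          f"regression-adjusted mean {adjusted:+.3f} (true mean 0)")
```

In this setup the complete-case mean of `x2` is close to the truth only under MCAR. Adjusting with the always-observed `x1` removes the bias under MAR but not under MNAR, where the missingness depends on `x2` itself.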

Note that MCAR is a special case of MAR and that these three categories are of increasing complexity, with a large gap between the second and the third. Indeed, most of the generic methods proposed in the last few decades are suited to MAR data. The MNAR case requires different techniques and further assumptions.
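
The special role of MAR can be made explicit with the standard likelihood argument, sketched here under two assumptions that are not stated above: the data follow a model with density \(p(\cdot;\theta)\), and the parameters \(\theta\) and \(\phi\) are distinct. The joint distribution of what is actually observed, \((X_{obs},R)\), is obtained by integrating out the missing part:
\[p(X_{obs},R;\theta,\phi) = \int p(X_{obs},X_{mis};\theta)\,\mathbb{P}_R(R|X_{obs},X_{mis};\phi)\,dX_{mis}\]
Under MAR the second factor does not depend on \(X_{mis}\) and can be taken out of the integral:
\[p(X_{obs},R;\theta,\phi) = \mathbb{P}_R(R|X_{obs};\phi)\int p(X_{obs},X_{mis};\theta)\,dX_{mis} = \mathbb{P}_R(R|X_{obs};\phi)\,p(X_{obs};\theta)\]
As a function of \(\theta\), this is proportional to \(p(X_{obs};\theta)\), so likelihood-based inference on \(\theta\) can ignore the response mechanism. Under MNAR the factor stays inside the integral and the mechanism has to be modelled.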

Note that Little and Rubin (2002) treat these three categories as genuinely missing values, as opposed to values that are not really missing: in the case of categorical data, the missingness may rather constitute an additional category (for instance, in a questionnaire with multiple-choice answers, a participant can skip a question because the category they want to choose is not among the given choices).

Another, possibly complementary, approach to studying different missing values mechanisms and problems consists in using graphical models, for instance missingness graphs or m-graphs (Mohan et al., 2013). These make it possible to represent multivariate dependencies and to study identifiability or recoverability for different (estimation or prediction) problems.

Finally, another line of research addresses missing values before they occur, asking how to anticipate or control them in the study design.

Wainer, H., ed. Drawing Inferences from Self-Selected Samples. New York, NY, USA: Springer, 1986.

Berrett, T. B. and R. J. Samworth. Optimal nonparametric testing of missing completely at random and its connections to compatibility. In: The Annals of Statistics 51.5 (2023), pp. 2170–2193.
Molenberghs, G., C. Beunckens, C. Sotto, et al. Every missingness not at random model has a missingness at random counterpart with equal fit. In: Journal of the Royal Statistical Society Series B: Statistical Methodology 70.2 (2008), pp. 371–388.
Näf, J., E. Scornet, and J. Josse. What Is a Good Imputation Under MAR Missingness? In: arXiv preprint arXiv:2403.19196 (2024).
Spohn, M., J. Näf, L. Michel, et al. PKLM: A flexible MCAR test using Classification. In: Psychometrika (2025), pp. 1–24.
Albert, P. S. and D. A. Follmann. Modeling repeated count data subject to informative dropout. In: Biometrics 56.3 (2000), pp. 667-677.
DOI
Chen, Y. and M. Sadinle. Nonparametric Pattern-Mixture Models for Inference with Missing Data. In: arXiv preprint (2019). arXiv: 1904.11085 [stat.ME].
URL
Diggle, P. and M. G. Kenward. Informative drop-out in longitudinal data analysis. In: Journal of the Royal Statistical Society, Series C (Applied Statistics) 43.1 (1994), pp. 49-93.
DOI
Fang, F., J. Zhao, and J. Shao. Imputation-based adjusted score equations in generalized linear models with nonignorable missing covariate values. In: Statistica Sinica 28.4 (2018), pp. 1677–1701.
DOI
Follmann, D. and M. Wu. An approximate generalized linear model with random effects for informative missing data. In: Biometrics 51.1 (1995), pp. 151-168.
DOI
Gad, A. M. and N. M. M. Darwish. A shared parameter model for longitudinal data with missing values. In: American Journal of Applied Mathematics and Statistics 1.2 (2013), pp. 30-35.
URL
Heckman, J. J. The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models. In: Annals of Economic and Social Measurement 5.4 (1976), pp. 475-492.
URL
Ibrahim, J. G., M. Chen, and S. R. Lipsitz. Missing responses in generalised linear mixed models when the missing data mechanism is nonignorable. In: Biometrika 88.2 (2001), pp. 551-564.
DOI
Ibrahim, J. G., S. R. Lipsitz, and M. Chen. Missing Covariates in Generalized Linear Models When the Missing Data Mechanism is Non-Ignorable. In: Journal of the Royal Statistical Society. Series B (Statistical Methodology) 61.1 (1999), pp. 173-190.
DOI URL
Ipsen, N. B., P. Mattei, and J. Frellsen. not-MIWAE: Deep generative modelling with missing not at random data. In: arXiv preprint (2020).
URL
Jamshidian, M., S. Jalal, and C. Jansen. MissMech: an R package for testing homoscedasticity, multivariate normality, and missing completely at random (MCAR). In: Journal of Statistical Software 56.6 (2014), pp. 1-31.
DOI
Jamshidian, M. and S. Jalal. Tests of homoscedasticity, normality, and missing completely at random for incomplete multivariate data. In: Psychometrika 75.4 (2010), pp. 649-674.
DOI
Larose, C., D. K. Dey, and O. Harel. The impact of missing values on different measures of uncertainty. In: Statistica Sinica 29.2 (2019), pp. 551–566.
DOI
Lee, K. M., R. Mitra, and S. Biedermann. Optimal design when outcome values are not missing at random. In: Statistica Sinica 28.4 (2018), pp. 1821–1838.
DOI
Lee, K. J., K. Tilling, R. P. Cornish, et al. Framework for the Treatment And Reporting of Missing data in Observational Studies: The Treatment And Reporting of Missing data in Observational Studies framework. In: Journal of clinical epidemiology 134 (2021), pp. 79–88.
Little, R. J. A. A test of missing completely at random for multivariate data with missing values. In: Journal of the American Statistical Association 83.404 (1988), pp. 1198-1202.
DOI
Little, R. J. A. Pattern-mixture models for multivariate incomplete data. In: Journal of the American Statistical Association 88.421 (1993), pp. 125-134.
DOI
Little, R. J. A. Modeling the drop-out mechanism in repeated-measures studies. In: Journal of the American Statistical Association 90.431 (1995), pp. 1112-1121.
DOI
Miao, W. and E. J. Tchetgen Tchetgen. Identification and inference with nonignorable missing covariate data. In: Statistica Sinica 28.4 (2018), pp. 2049–2067.
DOI
Molenberghs, G., B. Michiels, M. G. Kenward, et al. Monotone missing data and pattern-mixture models. In: Statistica Neerlandica 52.2 (1998), pp. 153-161.
DOI
Nabi, R., R. Bhattacharya, and I. Shpitser. Full Law Identification In Graphical Models Of Missing Data: Completeness Results. In: arXiv preprint arXiv:2004.04872 (2020).
URL
Reiter, J. P. and M. Sadinle. Itemwise conditionally independent nonresponse modelling for incomplete multivariate data. In: Biometrika 104.1 (Jan. 2017), pp. 207-220.
DOI URL
Rioux, C., A. Lewin, O. A. Odejimi, et al. Reflection on modern methods: planned missing data designs for epidemiological research. In: International Journal of Epidemiology (2020).
DOI
Robins, J. M., A. Rotnitzky, and L. P. Zhao. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. In: Journal of the American Statistical Association 90.429 (1995), pp. 106-121.
DOI
Rotnitzky, A., J. M. Robins, and D. O. Scharfstein. Semiparametric regression for repeated outcomes with nonignorable nonresponse. In: Journal of the American Statistical Association 93.444 (1998), pp. 1321-1339.
DOI
Sadinle, M. and J. P. Reiter. Sequential Identification of Nonignorable Missing Data Mechanisms. In: Statistica Sinica 28.4 (2018), pp. 1741–1759.
DOI
Sadinle, M. and J. P. Reiter. Sequentially additive nonignorable missing data modeling using auxiliary marginal information. In: arXiv preprint (2019). arXiv: 1902.06043 [stat.ME].
URL
Santos, M. S., R. C. Pereira, A. F. Costa, et al. Generating Synthetic Missing Data: A Review by Missing Mechanism. In: IEEE Access 7 (2019), pp. 11651–11667.
DOI
Seaman, S., J. Galati, D. Jackson, et al. What Is Meant by "Missing at Random"? In: Statistical Science 28.2 (2013), pp. 257–268.
DOI URL
Shao, J. and J. Zhang. A transformation approach in linear mixed-effects models with informative missing responses. In: Biometrika 102.1 (2015), pp. 107-119.
DOI
Simon, G. A. and J. S. Simonoff. Diagnostic plots for missing data in least squares regression. In: Journal of the American Statistical Association 81.394 (1986), pp. 501-509.
DOI
Stubbendick, A. L. and J. G. Ibrahim. Maximum Likelihood Methods for Nonignorable Missing Responses and Covariates in Random Effects Models. In: Biometrics 59.4 (2003), pp. 1140–1150.
DOI
Stubbendick, A. L. and J. G. Ibrahim. Likelihood-based inference with nonignorable missing responses and covariates in models for discrete longitudinal data. In: Statistica Sinica 16.4 (2006), pp. 1143–1167.
URL
Tchetgen Tchetgen, E. J., L. Wang, and B. Sun. Discrete choice models for nonmonotone nonignorable missing data: identification and inference. In: Statistica Sinica 28.4 (2018), pp. 2069–2088.
DOI
Templ, M., A. Alfons, and P. Filzmoser. Exploring Incomplete data using visualization techniques. In: Advances in Data Analysis and Classification 6.1 (2012), pp. 29-47.
DOI
Thijs, H., G. Molenberghs, B. Michiels, et al. Strategies to fit pattern-mixture models. In: Biostatistics 3.2 (2002), pp. 245-265.
DOI
Tierney, N. J., F. A. Harden, M. J. Harden, et al. Using decision trees to understand structure in missing data. In: BMJ Open 5.6 (2015), p. e007450.
DOI
Vansteelandt, S., A. Rotnitzky, and J. Robins. Estimation of regression models for the mean of repeated outcomes under nonignorable nonmonotone nonresponse. In: Biometrika 94.4 (2007), pp. 841–860.
DOI
Verbeke, G., G. Molenberghs, H. Thijs, et al. Sensitivity analysis for nonrandom dropout: a local influence approach. In: Biometrics 57.1 (2001), pp. 7-14.
DOI
White, I. R., J. Carpenter, and N. J. Horton. A mean score method for sensitivity analysis to departures from the missing at random assumption in randomised trials. In: Statistica Sinica 28.4 (2018), pp. 1985–2003.
DOI
Wu, M. C. and R. J. Carroll. Estimation and comparison of changes in the presence of informative right censoring by modeling the censoring process. In: Biometrics 44.1 (1988), pp. 175-188.
DOI
Zhao, J. and Y. Ma. A versatile estimation procedure without estimating the nonignorable missingness mechanism. In: Journal of the American Statistical Association (2021), pp. 1–15.
DOI
Zhou, Y., R. J. A. Little, and J. D. Kalbfleisch. Block-conditional missing at random models for missing data. In: Statistical Science 25.4 (2010), pp. 517–532.
DOI

Gill, R. D., M. J. Van Der Laan, and J. M. Robins. Coarsening at random: Characterizations, conjectures, counter-examples. In: Proceedings of the First Seattle Symposium in Biostatistics. Springer. 1997, pp. 255–294.
DOI
Mohan, K., F. Thoemmes, and J. Pearl. Estimation with Incomplete Data: The Linear Case. In: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18. International Joint Conferences on Artificial Intelligence Organization, Jul. 2018, pp. 5082–5088.
DOI URL
Sportisse, A., C. Boyer, and J. Josse. Estimation with informative missing data in the low-rank model with random effects. In: Advances in Neural Information Processing Systems, 33. (Dec. 2020). IEEE, 2020. eprint: 1906.02493v3.
URL

Mohan, K. and J. Pearl. Graphical Models for Processing Missing Data. Tech. rep. R-473-L. Forthcoming, Journal of American Statistical Association (JASA). CA: Department of Computer Science, University of California, Los Angeles, 2019.
URL
Tierney, N. and D. Cook. Expanding tidy data principles to facilitate missing data exploration, visualization and assessment of imputations. Monash Econometrics and Business Statistics Working Papers 14/18. Monash University, Department of Econometrics and Business Statistics, 2018.
URL