R-miss-tastic

Tue Apr 22, 2025

Below you will find a selection of high-quality lectures, tutorials and labs on different aspects of missing values. Note that video recordings of some of these lectures are publicly available.

General lectures

Statistical Learning with Missing Values

(Julie Josse, Jeffrey Näf, StatML & Bocconi Spring School 2025)

This presentation provides an overview of methods for handling missing data in the context of statistical and distributional learning. It discusses, among other things, how imputation methods should be evaluated, and highlights the value of distribution-based metrics, which capture how well the underlying data structure is preserved, over traditional predictive measures such as RMSE. The slides also introduce the intuition behind multiple imputation and explore strategies for supervised learning with incomplete data. Developed as part of a course module, the presentation offers a foundation for principled and effective approaches to missing values in data analysis. Event page.

Slides
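
The contrast between RMSE and distributional fidelity mentioned for this presentation can be illustrated with a small sketch. It is not taken from the slides: the simulated Gaussian data, the every-fourth-value missingness pattern, and the use of the variance as a crude stand-in for a distribution-based metric are all assumptions made here for illustration.

```python
import random
import statistics

def rmse(truth, imputed):
    """Root mean squared error between true and imputed values."""
    return (sum((t - i) ** 2 for t, i in zip(truth, imputed)) / len(truth)) ** 0.5

random.seed(0)
# Hypothetical complete data: 2000 draws from a standard normal.
truth = [random.gauss(0.0, 1.0) for _ in range(2000)]
# Hide every fourth value; the pattern ignores the values themselves (MCAR-like).
missing = [i % 4 == 0 for i in range(len(truth))]
observed = [t for t, m in zip(truth, missing) if not m]
hidden = [t for t, m in zip(truth, missing) if m]

# Strategy 1: impute every missing entry with the observed mean.
mean_imputed = [statistics.fmean(observed)] * len(hidden)
# Strategy 2: impute each missing entry with a random observed value.
draw_imputed = [random.choice(observed) for _ in hidden]

rmse_mean = rmse(hidden, mean_imputed)
rmse_draw = rmse(hidden, draw_imputed)

# Variance of the completed data sets, compared with the truth.
var_true = statistics.pvariance(truth)
var_obs = statistics.pvariance(observed)
var_mean = statistics.pvariance(observed + mean_imputed)
var_draw = statistics.pvariance(observed + draw_imputed)

print(f"RMSE      mean: {rmse_mean:.3f}   random draw: {rmse_draw:.3f}")
print(f"variance  true: {var_true:.3f}   mean: {var_mean:.3f}   random draw: {var_draw:.3f}")
```

In this toy setting mean imputation wins on RMSE, yet it shrinks the variance of the completed data, while random draws lose on RMSE but keep the spread close to the truth. This is one way to see why an RMSE ranking alone can be misleading.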

Going beyond the fear of emptyness to gain consistency.

(Erwan Scornet, 2025)

This talk explores two common strategies for handling missing data in predictive modeling: “Impute-then-regress” and “Pattern-by-pattern.” The “Impute-then-regress” approach imputes missing values before learning from the imputed dataset, offering computational efficiency but potentially inconsistent results, though it can be consistent with non-parametric algorithms. In contrast, the “Pattern-by-pattern” strategy uses a separate predictor for each missing data pattern, ensuring consistency by design, but it is often computationally intractable for large datasets. The talk also examines the behavior of linear models under these strategies, noting their consistency rates and challenges, such as the zero imputation inconsistency and the different convergence rates in high-dimensional settings.

Slides
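
The two strategies described in this talk can be sketched on a toy linear problem. The code below is an illustration only, not the speaker's implementation: the simulated data, the helper names `ols`, `predict` and `mse`, and the choice of zero imputation are assumptions made here.

```python
import random

def ols(rows, targets):
    """Least-squares fit with intercept via the normal equations (tiny problems only)."""
    design = [[1.0] + list(r) for r in rows]
    p = len(design[0])
    # Augmented system [X'X | X'y].
    a = [[sum(d[i] * d[j] for d in design) for j in range(p)]
         + [sum(d[i] * t for d, t in zip(design, targets))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(p):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [u - factor * v for u, v in zip(a[r], a[col])]
    return [a[i][p] / a[i][i] for i in range(p)]

def predict(coef, row):
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))

def mse(preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

random.seed(1)
X, y = [], []
for i in range(300):
    x1 = random.gauss(0.0, 1.0)
    x2 = x1 + 2.0 + random.gauss(0.0, 1.0)       # correlated with x1, nonzero mean
    y.append(1.0 + 2.0 * x1 + 3.0 * x2 + random.gauss(0.0, 0.1))
    X.append((x1, None if i % 3 == 0 else x2))   # x2 is missing in every third row

# Impute-then-regress: zero-impute, then fit a single linear model.
X_zero = [tuple(0.0 if v is None else v for v in row) for row in X]
coef_global = ols(X_zero, y)
mse_impute = mse([predict(coef_global, r) for r in X_zero], y)

# Pattern-by-pattern: one linear model per missingness pattern,
# each fitted on the features observed in that pattern.
groups = {}
for row, target in zip(X, y):
    groups.setdefault(tuple(v is None for v in row), []).append((row, target))

squared_error = 0.0
for members in groups.values():
    obs_rows = [tuple(v for v in row if v is not None) for row, _ in members]
    targets = [t for _, t in members]
    coef = ols(obs_rows, targets)
    squared_error += sum((predict(coef, r) - t) ** 2 for r, t in zip(obs_rows, targets))
mse_pattern = squared_error / len(y)

print(f"training MSE  impute-then-regress: {mse_impute:.3f}  pattern-by-pattern: {mse_pattern:.3f}")
```

With only two features there are at most four patterns, so fitting one model per pattern is cheap. With d features there can be up to 2^d patterns, which is the tractability problem the talk points to.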

Statistical Methods for Analysis With Missing Data

(Marie Davidian, course at NC State University, spring 2017)

This course provides an overview of modern statistical frameworks and methods for analysis in the presence of missing data. Both methodological developments and applications are emphasized. The course provides a foundation in the fundamentals of this area that will prepare students to read the current literature and to have a broad appreciation of the implications of missing data for valid inference. Course page.

Dealing With Missing Values in R

(Julie Josse, course at ETH Zürich, winter 2020)

The ability to easily collect and gather a large amount of data from different sources can be seen as an opportunity to better understand many processes. It has already led to breakthroughs in several application areas. However, due to the wide heterogeneity of measurements and objectives, these large databases often exhibit an extraordinarily high number of missing values. Hence, in addition to scientific questions, such data also present some important methodological and technical challenges for data analysts. This tutorial gives an overview of the missing values literature as well as the recent improvements that caught the attention of the community due to their ability to handle large matrices with a large number of missing entries. The methods presented in this tutorial are illustrated on medical, environmental and survey data.

Overview of methods for handling missing values

(Julie Josse, course at the winter school on statistics and applied probability in Les Diablerets, winter 2022)

Missing values are ubiquitous in the practice of data analysis. In this series of lectures, we will start by presenting classical methods for handling missing data (simple imputation, multiple imputation, likelihood-based methods) developed in an inferential framework, where the objective is to best estimate parameters and their variance in the presence of missing data. We will emphasize very powerful methods of simple and multiple imputation based on low-rank approximations that can be applied to heterogeneous data (quantitative, categorical). We will then present recent results in a supervised learning framework. A striking result is that naive imputation strategies (such as mean imputation) can be optimal, as the supervised learning method does the hard work. The fact that such a simple approach can be relevant may have important consequences in practice. We will also discuss how missing value modeling can be easily incorporated into tree models, such as gradient boosted trees, resulting in a learner that has been shown to perform very well, including in challenging non-random missingness settings. Notebooks will be presented. Finally, we will briefly present how such results are useful in the context of causal inference with missing values in the covariates.

Slides

Analysis of missing values

(Jae-Kwang Kim, course at Iowa State University, fall 2015)

This course focuses on the theory and methods for missing data analysis. Topics include maximum likelihood estimation under missing data, EM algorithm, Monte Carlo computation techniques, imputation, Bayesian approach, propensity scores, semi-parametric approach, and non-ignorable missing data.

Statistical Methods for Analysis with Missing Data

(Mauricio Sadinle, course at University of Washington, winter 2019)

This course formally introduces methodologies for handling missing data in statistical analyses. It covers naive methods, missing-data assumptions, likelihood-based approaches, Bayesian and multiple imputation approaches, inverse-probability weighting, pattern-mixture models, sensitivity analysis and approaches under nonignorable missingness. Computational tools such as the Expectation-Maximization algorithm and the Gibbs sampler will be introduced. This course is intended for students who are interested in methodological research.
Course syllabus

Exercises/Homework

Statistical modeling and missing data (video)

(Rod Little, keynote talk at the Virtual Workshop on Missing Data Challenges in Computation, Statistics and Applications, fall 2020)

Multiple imputation

Missing Values in Clinical Research - Multiple Imputation

(Nicole Erler, NIHES course Missing Values in Clinical Research (EP16), May 2018)

This course is the second part of a NIHES course on Missing Values in Clinical Research and it focuses on multiple imputation (MI), specifically the fully conditional specification (FCS, MICE), which is often considered the gold standard to handle missing data. It discusses in detail what MI(CE) does, which assumptions need to be met for it to perform well, and which alternative imputation approaches suit settings where MICE is not optimal. The theoretical considerations will be accompanied by demonstrations and short practical sessions in R, and a workflow for doing MI using the R package mice will be proposed, illustrating how to perform (multiple) imputation for cross-sectional and longitudinal data in R.

Multiple Imputation: Methods and Applications

(Jerry Reiter, short course at the Odum Institute at UNC Chapel Hill, March 2018)

This short course on multiple imputation gives an overview of missing data problems, various solutions to tackle them, and their limitations. It introduces MI inference and provides details on the implementation and application of MI.
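
As a pointer to what MI inference involves, the sketch below pools results from several imputed data sets using the combining rules commonly known as Rubin's rules. The course description does not spell out its pooling rules, so treating these as the relevant ones is an assumption, and the function name `pool_rubin` and the numbers are hypothetical.

```python
import statistics

def pool_rubin(estimates, variances):
    """Pool m point estimates and their variances from m imputed data sets.

    Returns the pooled estimate and its total variance:
    total = within + (1 + 1/m) * between.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates and matching variances")
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)       # average sampling variance
    between = statistics.variance(estimates)   # spread across imputations (m - 1 divisor)
    total = within + (1 + 1 / m) * between
    return pooled, total

# Hypothetical results from m = 3 imputed data sets.
pooled, total = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(pooled, total)
```

The between-imputation term is what a single imputation leaves out: it carries the extra uncertainty due to the values being missing in the first place.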

Missing values and principal component methods

Imputation using principal components

(Julie Josse, course at École Polytechnique, fall 2018)

This tutorial is part of a master course on statistics with R. It discusses different missing values problems and illustrates them on medical, industrial and ecological data. It provides a detailed introduction to single and multiple imputation via principal component methods, both in theory and in practice. The practical part illustrates how to perform (multiple) imputation using the R package missMDA.

Special focus on principal component methods

Handling missing values in PCA and MCA

(François Husson, video tutorial accompanying the R-package missMDA, 2016)

These two videos can be viewed independently or as a complement to the above tutorial on Imputation using principal components, as they provide a detailed explanation of how to use the functions of the missMDA package to visualize and analyze missing values and how to perform (multiple) imputation.

Specific data or application types

Virtual Workshop on Missing Data Challenges in Computation, Statistics and Applications (video recordings)

(organized by Laura Balzano (IAS/University of Michigan), Bianca Dumitrascu (IAS/ SAMSI), and Boaz Nadler (IAS/Weizmann Institute of Science), fall 2020)

This keynote talk gives an overview of different approaches for inference and prediction tasks. A striking result for the latter is that the widely-used method of imputing with the mean prior to learning can be consistent.

Statistical modeling and missing data (video) by Rod Little
Supervised learning with missing values (video) by Julie Josse
Missing data in single cell studies: augmentation, integration, and discovery (video) by Barbara Engelhardt
Experimental Evaluation of Computer-Assisted Human Decision Making: A Missing Data Approach (video) by Kosuke Imai
Model-based clustering of high-dimensional data: Pitfalls & solutions (video) by David Dunson
Causal inference with binary outcomes subject to both missingness and misclassification (video) by Grace Yi
Statistical challenges with single cell RNA-Seq technologies (video) by Rafael Irizarry
Gene expression recovery in single cell transcriptomic data (video) by Nancy Zhang
Synthesizing medical images using generative adversarial networks; applications, promises, and pitfalls (video) by Sanmi Koyejo
High-dimensional omics data analysis with missing values (video) by Anru Zhang
Metric and manifold repair for missing data (video) by Anna Gilbert
Low-rank matrix recovery from quantized or count observations (video) by Mark Davenport
Low Algebraic Dimension Matrix Completion (video) by Laura Balzano

Supervised learning with missing values

(Julie Josse, video of Keynote at useR! conference in Toulouse, 2019)

Handling missing values in surveys

(Guillaume Chauvet, course at École Nationale de la Statistique et de l’Analyse de l’Information, spring 2015, slides in French)

This course recalls basic concepts of surveys and data collection before discussing how to handle unit non-response and item non-response in surveys.

Traitement des données manquantes dans les Enquêtes (Handling missing data in surveys)

Longitudinal data with missing values

(Dimitris Rizopoulos, talk at Joint Conference on Biometrics & Biopharmaceutical Statistics, August 2017)

In follow-up studies, different types of outcomes are typically collected for each subject. These include longitudinally measured responses (e.g., biomarkers) and the time until an event of interest occurs (e.g., death, dropout). Often these outcomes are analyzed separately, but on many occasions it is of scientific interest to study their association. This type of research question has given rise to the class of joint models for longitudinal and time-to-event data. These models constitute an attractive paradigm for the analysis of follow-up data that is mainly applicable in two settings: first, when the focus is on a survival outcome and we wish to account for the effect of endogenous time-dependent covariates measured with error, and second, when the focus is on the longitudinal outcome and we wish to correct for non-random dropout. This course is aimed at applied researchers and graduate students, and will provide a comprehensive introduction to this modeling framework. It explains when these models should be used in practice, what the key assumptions behind them are, and how they can be utilized to extract relevant information from the data. Emphasis is placed on applications, and by the end of the course participants will be able to define appropriate joint models to answer their questions of interest.

Joint Modelling of Longitudinal and Time to Event Data

Time Series Imputation

(Steffen Moritz, talk at useR! 2017, July 2017)

This tutorial gives a short overview of methods for handling missing data in time series in R and then introduces the imputeTS package. The imputeTS package is specifically made for handling missing data in time series and offers several functions for visualization and replacement (imputation) of missing data. Usage examples show how imputeTS can be used for time series imputation.

How to deal with Missing Data in Time Series and the imputeTS package
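
To give a flavor of what time series imputation does, here is a minimal sketch of one common technique, linear interpolation between neighboring observations. It does not reproduce the imputeTS API: the function name `interpolate_linear`, the use of `None` for missing values, and the edge handling (carrying the nearest observed value outward) are assumptions made for this example.

```python
def interpolate_linear(series):
    """Fill None gaps by linear interpolation between the nearest observed neighbors.

    Leading and trailing gaps are filled with the nearest observed value.
    """
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    filled = list(series)
    # Leading gap: carry the first observation backward.
    for i in range(known[0]):
        filled[i] = series[known[0]]
    # Trailing gap: carry the last observation forward.
    for i in range(known[-1] + 1, len(series)):
        filled[i] = series[known[-1]]
    # Interior gaps: straight line between the two surrounding observations.
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled

result = interpolate_linear([None, 1.0, None, None, 4.0, None])
print(result)  # [1.0, 1.0, 2.0, 3.0, 4.0, 4.0]
```

Unlike the cross-sectional methods listed elsewhere on this page, this approach relies only on the ordering of the observations in time, which is why time series get their own tools.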

Treatment Effect Estimation with Missing Attributes

(Julie Josse, talk at CIRM virtual conference on Mathematical Methods of Modern Statistics, 2020)

While the problem of missing values in the covariates has been considered very early in the causal inference literature, it remains difficult for practitioners to know which method to use, under which assumptions the different approaches are valid and whether the tools developed are also adapted to more complex data, e.g., for high-dimensional or mixed data. This talk provides a rigorous classification of existing methods according to the main underlying assumptions, which are based either on variants of the classical unconfoundedness assumption or relying on assumptions about the mechanism that generates the missing values. It also highlights two recent contributions on this topic: first an extension of classical doubly robust estimators that allows handling of missing attributes and second an approach to causal inference based on variational autoencoders in the case of latent confounding.

Analysis and imputation of missing count data

(Third year students from École Polytechnique, final project of Statistics with R course, December 2018)

The estimation of count data, such as bird abundance, is an important task in many disciplines and can be used, for instance, by ecologists for species conservation. Collecting count data is often subject to inaccuracies and missing data due to the nature of the counted object and to the multiplicity of actors/sensors collecting the data over more or less long periods of time. Methods such as Correspondence Analysis or Generalized Linear Models can be used to estimate these missing values and allow a more accurate analysis of the count data. The objective of this project is to investigate the abundance of the Eurasian Coot, which is mainly observed in the Mediterranean part of North Africa, and its relation to external geographical and meteorological factors. First, different methods are compared in terms of accuracy, using R packages glm, Rtrim, Lori and missMDA. Afterwards, external factors and their impact on bird abundance are examined, and finally the temporal trend is investigated to determine whether the Eurasian Coot is declining or not.
This project was carried out in collaboration with the Research Institute for the conservation of Mediterranean wetlands, the association Les Amis des Oiseaux (Friends of the birds) and the Office National de la Chasse et de la Faune Sauvage (National Agency for Hunting and Wildlife).