Below you will find a selection of high-quality lectures, tutorials and labs on different aspects of missing values. Some of these lectures come with publicly available video recordings.
(Marie Davidian, course at NC State University, spring 2017)
This course provides an overview of modern statistical frameworks and methods for analysis in the presence of missing data. Both methodological developments and applications are emphasized. The course provides a foundation in the fundamentals of this area that will prepare students to read the current literature and to have a broad appreciation of the implications of missing data for valid inference. Course page.
- Introduction and Motivation
- Naive Methods
- Likelihood-based Methods Under Missing At Random (MAR)
- Multiple Imputation Methods Under MAR
- Inverse Probability Weighted Methods Under MAR
- Pattern Mixture Models
- Sensitivity Analysis to Deviations from MAR
- Homework 1
- Homework 2
- Homework 3
- Homework 4
- Data for homeworks
The ability to easily collect and gather a large amount of data from different sources can be seen as an opportunity to better understand many processes. It has already led to breakthroughs in several application areas. However, due to the wide heterogeneity of measurements and objectives, these large databases often exhibit an extraordinarily high number of missing values. Hence, in addition to scientific questions, such data also present important methodological and technical challenges for data analysts. This tutorial gives an overview of the missing values literature as well as the recent improvements that caught the attention of the community due to their ability to handle large matrices with a large number of missing entries. The methods presented in this tutorial are illustrated on medical, environmental and survey data.
This course focuses on the theory and methods for missing data analysis. Topics include maximum likelihood estimation under missing data, EM algorithm, Monte Carlo computation techniques, imputation, Bayesian approach, propensity scores, semi-parametric approach, and non-ignorable missing data.
(Mauricio Sadinle, course at University of Washington, winter 2019)
This course formally introduces methodologies for handling missing data in statistical analyses. It covers naive methods, missing-data assumptions, likelihood-based approaches, Bayesian and multiple imputation approaches, inverse-probability weighting, pattern-mixture models, sensitivity analysis and approaches under nonignorable missingness. Computational tools such as the Expectation-Maximization algorithm and the Gibbs sampler will be introduced. This course is intended for students who are interested in methodological research.
- Lecture 1: syllabus, motivating examples
- Lecture 2: general setup, notation, missing-data mechanisms
- Lecture 3: naive methods: complete-case analysis and imputation
- Lecture 4: R session 1
- Lecture 5: likelihood-based methods
- Lecture 6: the EM algorithm
- Lecture 7: R session 2 (setup), R script
- Lecture 8: introduction to Bayesian inference
- Lecture 9: Gibbs sampling, ignorability under Bayesian inference, data augmentation
- Lecture 10: multiple imputation
- Lecture 11: R session 3 (setup), R script
- Lecture 12: inverse probability weighting
- Lecture 13: introduction to (weighted generalized) estimating equations
- Lecture 14: R session 4 (setup), R script
- Lecture 15: identifiability, nonignorability, pattern-mixture models
- Lecture 16: pattern-mixture models (continued), sensitivity analysis
(Nicole Erler, NIHES course Missing Values in Clinical Research (EP16), May 2018)
This course is the second part of a NIHES course on Missing Values in Clinical Research and it focuses on multiple imputation (MI), specifically the fully conditional specification (FCS, MICE), which is often considered the gold standard to handle missing data. The course discusses in detail what MI(CE) does, which assumptions need to be met for it to perform well, and which alternative imputation approaches suit settings where MICE is not optimal. The theoretical considerations are accompanied by demonstrations and short practical sessions in R, and a workflow for doing MI using the R package mice is proposed, illustrating how to perform (multiple) imputation for cross-sectional and longitudinal data in R.
(Jerry Reiter, short course at the Odum Institute at UNC Chapel Hill, March 2018)
This tutorial is part of a master's course on statistics with R. It discusses different missing values problems and illustrates them on medical, industrial and ecological data. It provides a detailed introduction to single and multiple imputation via principal component methods, both in theory and in practice. The practical part illustrates how to perform (multiple) imputation using the R package
(François Husson, video tutorial accompanying the R-package
These two videos can be viewed independently or as a complement to the above tutorial on Imputation using principal components. They explain in detail how to use the functions of the missMDA package to visualize and analyze missing values and how to perform (multiple) imputation.
(organized by Laura Balzano (IAS/University of Michigan), Bianca Dumitrascu (IAS/SAMSI), and Boaz Nadler (IAS/Weizmann Institute of Science), fall 2020)
This keynote talk gives an overview of different approaches for inference and prediction tasks. A striking result for the latter is that the widely-used method of imputing with the mean prior to learning can be consistent.
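The mean-imputation step mentioned above can be sketched as follows. This is only an illustration in plain Python, not code from the talk: the function name `mean_impute` and the toy data are hypothetical, missing entries are assumed to be encoded as `None`, and the consistency result itself concerns the downstream learner, which is not shown here.

```python
from statistics import mean


def mean_impute(rows):
    """Replace each None with the mean of the observed values in its column.

    Hypothetical helper for illustration; assumes every column has at
    least one observed value (statistics.mean raises otherwise).
    """
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        col_means.append(mean(observed))
    # Keep observed entries, fill missing ones with the column mean.
    return [
        [col_means[j] if row[j] is None else row[j] for j in range(n_cols)]
        for row in rows
    ]


# Toy feature matrix with two missing entries (made-up numbers).
X = [[1.0, 10.0], [None, 20.0], [3.0, None]]
X_imputed = mean_impute(X)
```

The completed matrix `X_imputed` would then be passed to any supervised learner in place of `X`.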
- Statistical modeling and missing data (video) by Rod Little
- Supervised learning with missing values (video) by Julie Josse
- Missing data in single cell studies: augmentation, integration, and discovery (video) by Barbara Engelhardt
- Experimental Evaluation of Computer-Assisted Human Decision Making: A Missing Data Approach (video) by Kosuke Imai
- Model-based clustering of high-dimensional data: Pitfalls & solutions (video) by David Dunson
- Causal inference with binary outcomes subject to both missingness and misclassification (video) by Grace Yi
- Statistical challenges with single cell RNA-Seq technologies (video) by Rafael Irizarry
- Gene expression recovery in single cell transcriptomic data (video) by Nancy Zhang
- Synthesizing medical images using generative adversarial networks; applications, promises, and pitfalls (video) by Sanmi Koyejo
- High-dimensional omics data analysis with missing values (video) by Anru Zhang
- Metric and manifold repair for missing data (video) by Anna Gilbert
- Low-rank matrix recovery from quantized or count observations (video) by Mark Davenport
- Low Algebraic Dimension Matrix Completion (video) by Laura Balzano
(Julie Josse, video of Keynote at useR! conference in Toulouse, 2019)
(Guillaume Chauvet, course at École Nationale de la Statistique et de l’Analyse de l’Information, spring 2015, slides in French)
This course recalls basic concepts of surveys and data collection before discussing how to handle unit non-response and item non-response in surveys.
(Dimitris Rizopoulos, talk at Joint Conference on Biometrics & Biopharmaceutical Statistics, August 2017)
In follow-up studies different types of outcomes are typically collected for each subject. These include longitudinally measured responses (e.g., biomarkers), and the time until an event of interest occurs (e.g., death, dropout). Often these outcomes are analyzed separately, but on many occasions it is of scientific interest to study their association. This type of research question has given rise to the class of joint models for longitudinal and time-to-event data. These models constitute an attractive paradigm for the analysis of follow-up data that is mainly applicable in two settings: first, when the focus is on a survival outcome and we wish to account for the effect of endogenous time-dependent covariates measured with error, and second, when the focus is on the longitudinal outcome and we wish to correct for non-random dropout. This course is aimed at applied researchers and graduate students, and will provide a comprehensive introduction to this modeling framework. It explains when these models should be used in practice, what the key assumptions behind them are, and how they can be used to extract relevant information from the data. Emphasis is placed on applications, and by the end of the course participants will be able to define appropriate joint models to answer their questions of interest.
This tutorial gives a short overview of methods for missing data in time series in R and then introduces the imputeTS package. The imputeTS package is specifically made for handling missing data in time series and offers several functions for visualization and replacement (imputation) of missing data. Usage examples show how imputeTS can be used for time series imputation.
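To give a rough idea of what replacing missing values in a time series can look like, here is a minimal sketch of gap filling by linear interpolation. It is written in plain Python and does not use or reproduce the imputeTS API: the function name `interpolate_linear` is hypothetical, missing values are assumed to be encoded as `None`, and carrying the nearest observed value into leading and trailing gaps is an assumption of this sketch.

```python
def interpolate_linear(series):
    """Fill None gaps by linear interpolation between observed neighbors.

    Hypothetical helper for illustration. Gaps at the start or end of the
    series have only one observed neighbor, so they take its value.
    """
    observed = [i for i, value in enumerate(series) if value is not None]
    if not observed:
        raise ValueError("series has no observed values")

    filled = list(series)
    first, last = observed[0], observed[-1]

    # Leading and trailing gaps: carry the nearest observed value.
    for i in range(first):
        filled[i] = series[first]
    for i in range(last + 1, len(series)):
        filled[i] = series[last]

    # Interior gaps: straight line between the two surrounding observations.
    for left, right in zip(observed, observed[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled


# Toy series with interior and trailing gaps (made-up numbers).
ts = [1.0, None, None, 4.0, None, 6.0, None]
ts_filled = interpolate_linear(ts)
```

Linear interpolation ignores seasonality and trend changes, which is one reason dedicated time series imputation tools offer several alternative methods.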
(Julie Josse, talk at CIRM virtual conference on Mathematical Methods of Modern Statistics, 2020)
While the problem of missing values in the covariates was considered very early in the causal inference literature, it remains difficult for practitioners to know which method to use, under which assumptions the different approaches are valid and whether the tools developed are also adapted to more complex data, e.g., high-dimensional or mixed data. This talk provides a rigorous classification of existing methods according to their main underlying assumptions, which are based either on variants of the classical unconfoundedness assumption or on assumptions about the mechanism that generates the missing values. It also highlights two recent contributions on this topic: first, an extension of classical doubly robust estimators that allows handling of missing attributes, and second, an approach to causal inference based on variational autoencoders in the case of latent confounding.
(Third year students from École Polytechnique, final project of Statistics with R course, December 2018)
The estimation of count data, such as bird abundance, is an important task in many disciplines and can be used for instance by ecologists for species conservation. Collecting count data is often subject to inaccuracies and missing data due to the nature of the counted object and due to the multiplicity of actors and sensors collecting the data over periods of varying length. Methods such as Correspondence Analysis or Generalized Linear Models can be used to estimate these missing values and allow a more accurate analysis of the count data. The objective of this project is to investigate the abundance of the Eurasian coot, which is mainly observed in the Mediterranean part of North Africa, and its relation to external geographical and meteorological factors. First, different methods are compared in terms of accuracy, using R packages
missMDA. Afterwards, external factors and their impact on bird abundance are examined and finally the temporal trend is investigated to determine whether the Eurasian coot is declining or not.
This project was carried out in collaboration with the Research Institute for the conservation of Mediterranean wetlands, the association Les Amis des Oiseaux (Friends of the birds) and the Office National de la Chasse et de la Faune Sauvage (National Agency for Hunting and Wildlife).
naniar vignette: Missing data visualizations
(Nicholas Tierney, 2018)
- useR! tutorial on handling missing values
(Julie Josse & Nicholas Tierney, 2018)
mice vignette: Ad hoc methods and
(Stef van Buuren and Gerko Vink, 2018)
(Stef van Buuren and Gerko Vink, 2018)
- Multiple imputation with the
(Nicole Erler, NIHES course on multiple imputation, 2018)
- Multiple imputation in complex settings (using
(Nicole Erler, NIHES course on multiple imputation, 2018)
- Example using
(Jerry Reiter, 2018)
If you wish to contribute some of your own material to this platform, please feel free to contact us.