Implementations
We list some of the most popular R packages and Python libraries for handling missing values.
R Packages
Here are short introductions to popular missing data packages, with small examples of how to use them. They give more extensive information than the CRAN Task View on Missing Data, which is recommended for a first overview of the CRAN missing data landscape.
You can also contribute to this page by providing a short introduction to a missing data package. Take a look at this short description of how to do this (Template). We are very happy about all contributions.
mice
Category: Multiple Imputation
Multiple imputation using Fully Conditional Specification (FCS) implemented by the MICE algorithm as described in Van Buuren and Groothuis-Oudshoorn (2011) doi:10.18637/jss.v045.i03. Each variable has its own imputation model. Built-in imputation models are provided for continuous data (predictive mean matching, normal), binary data (logistic regression), unordered categorical data (polytomous logistic regression) and ordered categorical data (proportional odds). MICE can also impute continuous two-level data (normal model, pan, second-level variables). Passive imputation can be used to maintain consistency between variables. Various diagnostic plots are available to inspect the quality of the imputations.
miceDRF
Category: Multiple Imputation
This package contains the miceDRF imputation method and tools for measuring imputation performance.
missForest
Category: Single Imputation
The function ‘missForest’ in this package is used to impute missing values particularly in the case of mixed-type data. It uses a random forest trained on the observed values of a data matrix to predict the missing values. It can be used to impute continuous and/or categorical data including complex interactions and non-linear relations. It yields an out-of-bag (OOB) imputation error estimate without the need of a test set or elaborate cross-validation. It can be run in parallel to save computation time.
missMDA
Category: Single and Multiple Imputation, Multivariate Data Analysis
Imputation of incomplete continuous or categorical datasets. Missing values are imputed with a principal component analysis (PCA), a multiple correspondence analysis (MCA) model or a multiple factor analysis (MFA) model. The package also performs multiple imputation with and in PCA or MCA.
Hmisc
Category: Single and Multiple Imputation, Data Processing
Contains many functions useful for data analysis, high-level graphics, utility operations, functions for computing sample size and power, simulation, importing and annotating datasets, imputing missing values, advanced table making, variable clustering, character string manipulation, conversion of R objects to LaTeX and html code, recoding variables, caching, simplified parallel computing, encrypting and decrypting data using a safe workflow, general moving window statistical estimation, and assistance in interpreting principal component analysis.
naniar
Category: Visualisations for Missing Data
Missing values are ubiquitous in data and need to be explored and handled in the initial stages of analysis. ‘naniar’ provides data structures and functions that facilitate the plotting of missing values and examination of imputations. This allows missing data dependencies to be explored with minimal deviation from the common work patterns of ‘ggplot2’ and tidy data. The work is fully discussed at Tierney & Cook (2023) doi:10.18637/jss.v105.i07.
simputation
Category: Single Imputation, Meta-Package
Easy-to-use interfaces to a number of imputation methods that fit in the not-a-pipe operator (%>%) of the ‘magrittr’ package.
VIM
Category: Single Imputation, Visualisations for Missing Data
New tools for the visualization of missing and/or imputed values are introduced, which can be used for exploring the data and the structure of the missing and/or imputed values. Depending on this structure, the corresponding methods may help to identify the mechanism generating the missing values and allow the data, including missing values, to be explored. In addition, the quality of imputation can be visually explored using various univariate, bivariate, multiple and multivariate plot methods. A graphical user interface available in the separate package VIMGUI allows easy handling of the implemented plot methods.
imputeLCMD
Category: Left-Censored Missing Data
A collection of functions for left-censored missing data imputation. Left-censoring is a special case of a missing not at random (MNAR) mechanism that generates non-responses in proteomics experiments. The package also contains functions to artificially generate peptide/protein expression data (log-transformed) as random draws from a multivariate Gaussian distribution, as well as a function to generate missing data (both randomly and non-randomly). For comparison, the package also contains several wrapper functions for the imputation of non-responses that are missing at random. New functionality has been added: a hybrid method that allows the imputation of missing values in a more complex scenario where the missing data are both MAR and MNAR.
missCompare
Category: Single and Multiple Imputation
Offers a convenient pipeline to test and compare various missing data imputation algorithms on simulated and real data. These include simpler methods, such as mean and median imputation and random replacement, but also include more sophisticated algorithms already implemented in popular R packages, such as ‘mi’, described by Su et al. (2011) doi:10.18637/jss.v045.i02; ‘mice’, described by van Buuren and Groothuis-Oudshoorn (2011) doi:10.18637/jss.v045.i03; ‘missForest’, described by Stekhoven and Buhlmann (2012) doi:10.1093/bioinformatics/btr597; ‘missMDA’, described by Josse and Husson (2016) doi:10.18637/jss.v070.i01; and ‘pcaMethods’, described by Stacklies et al. (2007) doi:10.1093/bioinformatics/btm069. The central assumption behind ‘missCompare’ is that structurally different datasets (e.g. larger datasets with a large number of correlated variables vs. smaller datasets with non correlated variables) will benefit differently from different missing data imputation algorithms. ‘missCompare’ takes measurements of your dataset and sets up a sandbox to try a curated list of standard and sophisticated missing data imputation algorithms and compares them assuming custom missingness patterns. ‘missCompare’ will also impute your real-life dataset for you after the selection of the best performing algorithm in the simulations. The package also provides various post-imputation diagnostics and visualizations to help you assess imputation performance.
Amelia
Category: Multiple Imputation
A tool that “multiply imputes” missing data in a single cross-section (such as a survey), from a time series (like variables collected for each year in a country), or from a time-series-cross-sectional data set (such as collected by years for each of several countries). Amelia II implements our bootstrapping-based algorithm that gives essentially the same answers as the standard IP or EMis approaches, is usually considerably faster than existing approaches and can handle many more variables. Unlike Amelia I and other statistically rigorous imputation software, it virtually never crashes (but please let us know if you find to the contrary!). The program also generalizes existing approaches by allowing for trends in time series across observations within a cross-sectional unit, as well as priors that allow experts to incorporate beliefs they have about the values of missing cells in their data. Amelia II also includes useful diagnostics of the fit of multiple imputation models. The program works from the R command line or via a graphical user interface that does not require users to know R.
mixgb
Category: Multiple Imputation
Multiple imputation using ‘XGBoost’, subsampling, and predictive mean matching as described in Deng and Lumley (2023) doi:10.1080/10618600.2023.2252501. The package supports various types of variables, offers flexible settings, and enables saving an imputation model to impute new data. Data processing and memory usage have been optimised to speed up the imputation process.
CALIBERrfimpute
Category: Multiple Imputation
Functions to impute using random forest under fully conditional specification (multivariate imputation by chained equations). The methods are described in Shah and others (2014) doi:10.1093/aje/kwt312.
imputeTS
Category: Time-Series Imputation, Visualisations for Missing Data
Imputation (replacement) of missing values in univariate time series. Offers several imputation functions and missing data plots. Available imputation algorithms include: ‘Mean’, ‘LOCF’, ‘Interpolation’, ‘Moving Average’, ‘Seasonal Decomposition’, ‘Kalman Smoothing on Structural Time Series models’, ‘Kalman Smoothing on ARIMA models’. Published in Moritz and Bartz-Beielstein (2017) doi:10.32614/RJ-2017-009.
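To make three of the simpler algorithms named above concrete, here is a minimal sketch of mean, LOCF and linear interpolation imputation on a univariate series. It is written in Python with pandas and is only an analogue of the ideas; it does not use or reproduce the imputeTS API, and the toy series is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy univariate time series with gaps (illustrative values only).
series = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])

# Mean imputation: every gap receives the mean of the observed values.
mean_filled = series.fillna(series.mean())

# LOCF (last observation carried forward): each gap repeats the
# most recent observed value.
locf_filled = series.ffill()        # 1, 1, 3, 3, 3, 6

# Linear interpolation: gaps lie on the straight line between the
# neighbouring observed values.
linear_filled = series.interpolate(method="linear")   # 1, 2, 3, 4, 5, 6
```

LOCF keeps the series piecewise constant, while interpolation assumes a smooth change between observations; which is more plausible depends on the data.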
pcaMethods
Category: Single Imputation
Provides Bayesian PCA, Probabilistic PCA, Nipals PCA, Inverse Non-Linear PCA and the conventional SVD PCA. A cluster based method for missing value estimation is included for comparison. BPCA, PPCA and NipalsPCA may be used to perform PCA on incomplete data as well as for accurate missing value estimation. A set of methods for printing and plotting the results is also provided. All PCA methods make use of the same data structure (pcaRes) to provide a common interface to the PCA results. Initiated at the Max-Planck Institute for Molecular Plant Physiology, Golm, Germany.
imputomics
Category: Single and Multiple Imputation, Metabolomics, Left-Censored Missing Data
A robust wrapper package containing a range of methods for simulating and imputing missing values in different types of omics data such as genomics, transcriptomics, proteomics, and metabolomics. Provides tools for comparing and evaluating the performance of imputation methods and a web server.
MetabImpute
Category: Single Imputation, Metabolomics, Imputation with Biological Replicates, Left-Censored Missing Data
A package to evaluate missing data, simulate data matrices and missingness, evaluate multiple imputation methods and return statistics on them, and finally impute using multiple standard imputation approaches. Novel imputation methodologies for data with biological or technical replication are also included. ICC evaluation methods are included specifically to suit researchers working with data with biological or technical replicates. Source code was written by the authors with code copied and modified from the following GitHub packages: https://github.com/Tirgit/missCompare; https://github.com/WandeRum/GSimp (Wei, R., Wang, J., Jia, E., Chen, T., Ni, Y., & Jia, W. (2017). GSimp: A Gibbs sampler based left-censored missing value imputation approach for metabolomics studies. PLOS Computational Biology); and https://github.com/juuussi/impute-metabo (Kokla, M., Virtanen, J., Kolehmainen, M. et al. Random forest-based imputation outperforms other methods for imputing LC-MS metabolomics data: a comparative study. BMC Bioinformatics 20, 492 (2019). https://doi.org/10.1186/s12859-019-3110-0).
Python modules
Here are some links to modules or methods in Python to handle missing values.
sklearn.impute: module from sklearn for missing value imputation (simple imputation, conditional iterative imputer, k-Nearest Neighbors imputer).
pandas: available methods in pandas to handle dataframes with missing values (fill the missing values by a constant, remove missing values).
statsmodels.imputation: module from statsmodels to handle missing values (multiple imputation, Bayesian imputation using a Gaussian model).
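As a minimal sketch of how the first two entries are typically used, the following fills or drops missing values with pandas and then imputes them with SimpleImputer and KNNImputer from sklearn.impute. The toy DataFrame and its column names are made up for illustration, and the parameter choices (mean strategy, two neighbours) are assumptions rather than recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Toy data with missing entries (illustrative values only).
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],
    "income": [50.0, 60.0, np.nan, 80.0],
})

# pandas: fill the missing values with a constant ...
filled_constant = df.fillna(0.0)
# ... or remove every row that contains a missing value.
complete_rows = df.dropna()

# scikit-learn, simple imputation: replace each missing value
# with the mean of the observed values in the same column.
mean_imputed = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(df),
    columns=df.columns,
)

# scikit-learn, k-Nearest Neighbors imputation: replace each missing
# value with the average of that column over the 2 closest rows.
knn_imputed = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df),
    columns=df.columns,
)
```

Both imputers return NumPy arrays, so the sketch wraps the results back into DataFrames to keep the column names.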