R-miss-tastic

A resource website on missing values - Methods and references for managing missing data

Here you will find a constantly growing list of interesting data sets which are frequently used in the R community working on missing values. These data sets can be useful to get familiar with different concepts in handling missing values and to assess the quality and performance of new methods.

If you have suggestions on other data sets which might be of interest to others, please feel free to contact us via the Contact form.


Complete data

If you wish to evaluate a certain missing data method on real (or simulated) data, it can be useful to first generate missing values in a complete data set. This lets you control the response mechanism and evaluate the method under different response mechanisms. Some useful tools for this:
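As a minimal illustration of this idea, the sketch below removes values completely at random (MCAR) from a complete data set using base R only. The helper name add_mcar and its arguments prop and seed are hypothetical and chosen for this example; they do not come from any package, and dedicated tools offer far more control over the mechanism.

```r
# Minimal sketch: generate MCAR missing values in a complete data set.
# `add_mcar`, `prop` and `seed` are hypothetical names used for illustration.
add_mcar <- function(data, prop = 0.2, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  data <- as.data.frame(data)
  n <- nrow(data)
  p <- ncol(data)
  # Each cell is masked independently with probability `prop`,
  # so missingness does not depend on any observed or unobserved value.
  mask <- matrix(runif(n * p) < prop, nrow = n, ncol = p)
  data[mask] <- NA
  data
}

complete   <- iris[, 1:4]                        # a complete data set shipped with R
incomplete <- add_mcar(complete, prop = 0.2, seed = 42)

mean(is.na(incomplete))       # overall share of missing entries (close to prop)
colMeans(is.na(incomplete))   # share of missing entries per variable
```

Because the complete data remain available, an imputed version of incomplete can be compared cell by cell with complete to assess a method.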


Incomplete data

The data sets listed below are either widely used in the missing data community or used in the Tutorials and R packages sections to illustrate different methods for handling missing values. This presentation scheme is inspired by the UCI Machine Learning Repository.

Click on a table entry to obtain further information about the data set.

Name | Data Types | Attribute Types | # Instances | # Attributes | % Missing entries | Complete data available | Year
This data set contains daily air quality measurements in New York (May to September 1973) and has missing values in some variables. It can be loaded in R by calling data(airquality).

More information on the dataset.

Tutorials illustrating methods on this data:
  • Nick Tierney's naniar vignette for missing data visualization.
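A short sketch of how the missing entries in this data set can be inspected with base R only. The variable names n_missing, pct_missing and n_complete are chosen for this example.

```r
# airquality ships with base R (package datasets), so nothing needs installing.
data(airquality)

dim(airquality)                                   # instances and attributes

n_missing   <- colSums(is.na(airquality))         # missing entries per variable
pct_missing <- 100 * mean(is.na(airquality))      # overall % of missing entries
n_complete  <- sum(complete.cases(airquality))    # rows without any missing value

n_missing
round(pct_missing, 1)
n_complete
```

Only the Ozone and Solar.R variables contain missing values; the remaining variables are fully observed.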

From the mvoutlier package description: "The Kola Data were collected in the Kola Project (1993-1998, Geological Surveys of Finland (GTK) and Norway (NGU) and Central Kola Expedition (CKE), Russia). More than 600 samples in five different layers were analysed, this dataset contains the C-horizon."

More information on the dataset.

In the VIM package version of the data, all outliers have been recoded as NA. It can be loaded by calling data(chorizonDL).
"Health Nutrition and Population Statistics database provides key health, nutrition and population statistics gathered from a variety of international and national sources. Themes include global surgery, health financing, HIV/AIDS, immunization, infectious diseases, medical resources and usage, noncommunicable diseases, nutrition, population dynamics, reproductive health, universal health coverage, and water and sanitation." (Data website of the World Bank Group, January 23rd, 2019)

The data have been gathered from 259 countries over the last 58 years.
More information on the dataset on the World Bank Group website.

R Datathon on this dataset organized by the useR! 2019 conference.
The R package NHANES contains data from the US National Health and Nutrition Examination Survey (NHANES). The data comprise body shape and related measurements collected in 1999-2004 and 2009-2012 (more details on the survey).

Tutorials illustrating methods on this data:
  • Stef van Buuren's vignette for ad hoc methods and mice.
  • Jerry Reiter's course on multiple imputation.

West Pacific Tropical Atmosphere Ocean Data. The data is collected by the Tropical Atmosphere Ocean project and contains real-time data from moored ocean buoys. It can be found in R in the naniar package and is loaded by calling data(oceanbuoys).

More information on the collected data on the website of the Pacific Marine Environmental Laboratory.
Los Angeles Ozone Pollution Data, 1976. This data set contains daily measurements of ozone concentration and meteorological quantities. It can be found in R in the mlbench package and is loaded by calling data(Ozone).

More information on the dataset.

Tutorials illustrating methods on this data:
  • Julie Josse's course on missing values imputation using PC methods.
  • Julie Josse's and Nick Tierney's tutorial on handling missing values. Download the data set from this tutorial: ozoneNA.csv
  • Nick Tierney's naniar vignette for missing data visualization.

This data set contains hourly counts of pedestrians from 4 sensors around Melbourne in 2016. It can be found in R in the naniar package and is loaded by calling data(pedestrian).

More information on the collected data on the public data website of the City of Melbourne.
The data is a subset of the 2009 survey from the Behavioral Risk Factor Surveillance System designed to measure behavioral risk factors for the adult population living in households. It can be found in R in the naniar package and is loaded by calling data(riskfactors).

More information on the survey on the website of the Centers for Disease Control and Prevention.
The data contains a synthetic subset of the Austrian structural business statistics (SBS) data; more specifically, it contains data on 9 variables of NACE 52.42 (retail sale of clothing). A non-confidential, close-to-reality synthetic data set was generated from the confidential raw data of the original Austrian SBS data set. It can be found in R in the VIM package and is loaded by calling data(SBS5242).

More information on the initial SBS data on the website of Statistik Austria.
The data contains sleep data on mammals. It can be found in R in the VIM package and is loaded by calling data(sleep).

More information about the collected data in Allison, T. and Cicchetti, D. (1976) Sleep in mammals: ecological and constitutional correlates. Science 194 (4266), 732-734.
The data contains monthly totals of international airline passengers between 1949 and 1960. It can be found in R in the imputeTS package and is loaded by calling data(tsAirgap).

More information on the data in the work from Box & Jenkins.
The data contains a time series of a heating system's supply temperature, measured from 18.11.2013 - 05:12:00 to 13.01.2015 - 15:08:00 in 1-minute steps. It can be found in R in the imputeTS package and is loaded by calling data(tsHeating). The data comes from the GECCO Industrial Challenge 2015.

More information about the challenge on the website of SPOTSeven Lab.
The data contains a time series of the NH4 concentration in a wastewater system, measured from 30.11.2010 - 16:10 to 01.01.2011 - 06:40 in 10-minute steps. It can be found in R in the imputeTS package and is loaded by calling data(tsNH4). The data comes from the GECCO Industrial Challenge 2014.

More information about the challenge on the website of SPOTSeven Lab.
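Most of the data sets above ship with an R package, so data(...) only works once that package is installed. The sketch below shows one possible way to load several of them defensively and compare their share of missing entries. The helpers load_if_available and prop_missing, and the catalogue list, are hypothetical names written for this example and are not part of any package.

```r
# Hypothetical helper (not part of any package): load a data set shipped with
# an R package, or return NULL when that package is not installed.
load_if_available <- function(name, pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    message("Package '", pkg, "' is not installed; skipping '", name, "'.")
    return(NULL)
  }
  env <- new.env()
  # Load the data set into a private environment to avoid name clashes,
  # e.g. between the `sleep` data sets of the datasets and VIM packages.
  data(list = name, package = pkg, envir = env)
  get(name, envir = env)
}

# Share of missing entries, or NA when the data set could not be loaded.
prop_missing <- function(x) if (is.null(x)) NA_real_ else mean(is.na(x))

# Data set name -> package that provides it (taken from the entries above).
catalogue <- list(
  airquality = "datasets",
  sleep      = "VIM",
  oceanbuoys = "naniar",
  tsNH4      = "imputeTS"
)

missing_share <- sapply(names(catalogue), function(nm) {
  prop_missing(load_if_available(nm, catalogue[[nm]]))
})
round(missing_share, 3)
```

Entries reported as NA correspond to packages that are not installed in the current R library.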


If you are looking for other publicly available data sets with missing values that have been used by researchers to assess the quality of different methods, have a look at:

