R-miss-tastic

Here you will find a constantly growing list of interesting data sets which are frequently used in the R community working on missing values. These data sets can be useful to get familiar with different concepts in handling missing values and to assess the quality and performance of new methods.

If you have suggestions on other data sets which might be of interest to others, please feel free to contact us via the Contact form.

Complete data

If you wish to evaluate a missing data method on real (or simulated) data, it can be useful to first generate missing values in a complete data set. This lets you control the response mechanism and evaluate the method under different response mechanisms. Some useful tools for this:

The ampute function of the mice R-package. Rianne Schouten and her colleagues wrote a self-contained tutorial on how to ampute data.
The R workflow "How to generate missing values?", which extends some functionalities of the ampute function (the related R source code is also available).
The missCompare R-package.
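
The idea of generating missing values in a complete data set can be sketched in a few lines of base R. This is only a minimal illustration, not the ampute function itself: it deletes cells completely at random (MCAR), whereas ampute also supports MAR and MNAR mechanisms. The choice of the iris data and the names prop_missing, mask, amputed and imputed are assumptions made for this example.

```r
# Minimal base-R sketch: delete entries completely at random (MCAR).
set.seed(42)                       # make the example reproducible

complete <- iris[, 1:4]            # a complete numeric data set shipped with R
prop_missing <- 0.2                # hypothetical target share of missing entries

# Draw one indicator per cell; TRUE means "delete this entry".
mask <- matrix(runif(nrow(complete) * ncol(complete)) < prop_missing,
               nrow = nrow(complete), ncol = ncol(complete))

amputed <- complete
amputed[mask] <- NA                # logical-matrix indexing sets the chosen cells to NA

# Realised share of missing entries (close to, but not exactly, prop_missing).
mean(is.na(amputed))

# Because the complete data are known, an imputation can be scored directly.
# Here: column-mean imputation, scored by RMSE on the deleted cells only.
imputed <- amputed
for (j in seq_along(imputed)) {
  imputed[[j]][is.na(imputed[[j]])] <- mean(amputed[[j]], na.rm = TRUE)
}
rmse <- sqrt(mean((as.matrix(imputed)[mask] - as.matrix(complete)[mask])^2))
rmse
```

Repeating the last step with a different imputation method, or with a mask that depends on the observed values, gives a simple way to compare methods under different response mechanisms.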

Further information about each data set is given below its table entry.

Name	Data Types	Attribute Types	# Instances	# Attributes	% Missing entries	Complete data available	Year
Airquality	Multivariate, Time Series	Real	153	6	7	No	1973
This data set contains daily air quality measurements in New York (May to September 1973) and has missing values in some variables. It can be loaded in R by calling <code>data(airquality)</code>. <br> <br><a href="https://stat.ethz.ch/R-manual/R-devel/RHOME/library/datasets/html/airquality.html" target="_blank">More information on the dataset</a>. <br> <br> Tutorials illustrating methods on this data: <ul> <li> Nick Tierney's <code>naniar</code> <a href="https://cran.r-project.org/web/packages/naniar/vignettes/naniar-visualisation.html" target="_blank">vignette</a> for missing data visualization.</li> </ul>
chorizonDL	Multivariate	Integer, Real	606	110	15	Yes	1998
From the <code>mvoutlier</code> package description: "The Kola Data were collected in the Kola Project (1993-1998, Geological Surveys of Finland (GTK) and Norway (NGU) and Central Kola Expedition (CKE), Russia). More than 600 samples in five different layers were analysed, this dataset contains the C-horizon." <br> <br><a href="https://cran.r-project.org/web/packages/mvoutlier/mvoutlier.pdf" target="_blank">More information on the dataset</a>. <br> <br> In the <a href="https://cran.r-project.org/web/packages/VIM/VIM.pdf" target="_blank">VIM</a> package version of this data set, values below the detection limit have been recoded as NA. It can be loaded by calling <code>data(chorizonDL)</code>.
Health Nutrition And Population Statistics	Multivariate, Time Series	Integer, Real	15,022	397	54	No	2017
"Health Nutrition and Population Statistics database provides key health, nutrition and population statistics gathered from a variety of international and national sources. Themes include global surgery, health financing, HIV/AIDS, immunization, infectious diseases, medical resources and usage, noncommunicable diseases, nutrition, population dynamics, reproductive health, universal health coverage, and water and sanitation." (Data website of the World Bank Group, January 23rd, 2019) <br> <br>The data have been gathered from 259 countries over the last 58 years. <br><a href="https://datacatalog.worldbank.org/dataset/health-nutrition-and-population-statistics" target="_blank">More information on the dataset</a> on the World Bank Group website. <br> <br><a href="http://user2019.r-project.org/datathon/">R Datathon</a> on this dataset organized by the useR! 2019 conference.
NHANES	Multivariate	Categorical, Integer, Real	10,000	75	37	No	2012
R-package <a href="https://cran.r-project.org/web/packages/NHANES/" target="_blank">NHANES</a> containing data from the US National Health and Nutrition Examination Study. The data comprises body shape and related measurements from the US National Health and Nutrition Examination Survey (NHANES, 1999-2004 and 2009-2012, <a href="http://www.cdc.gov/nchs/nhanes.htm" target="_blank">more details on the survey</a>). <br> <br> Tutorials illustrating methods on this data: <ul> <li> Stef van Buuren's <a href="https://www.gerkovink.com/miceVignettes/Ad_hoc_and_mice/Ad_hoc_methods.html" target="_blank">vignette</a> for ad hoc methods and <code>mice</code>.</li> <li> Jerry Reiter's <a href="/tutorials/Reiter_course_MultipleImputationOverview_2018/Reiter_script_MultipleImputationMICE_2018.html" target="_blank">course</a> on multiple imputation.</li> </ul>
oceanbuoys	Multivariate, Time Series	Real	736	8	3	No	1997
West Pacific Tropical Atmosphere Ocean Data. The data were collected by the Tropical Atmosphere Ocean project and contain real-time data from moored ocean buoys. It can be found in R in the <a href="https://cran.r-project.org/web/packages/naniar/index.html" target="_blank"><code>naniar</code></a> package and is loaded by calling <code>data(oceanbuoys)</code>. <br> <br><a href="https://www.pmel.noaa.gov/tao/drupal/disdel/" target="_blank">More information on the collected data</a> on the website of the Pacific Marine Environmental Laboratory.
Ozone	Multivariate	Categorical, Integer, Real	366	13	6	No	1976
Los Angeles Ozone Pollution Data, 1976. This data set contains daily measurements of ozone concentration and meteorological quantities. It can be found in R in the <a href="https://cran.r-project.org/web/packages/mlbench/index.html" target="_blank"><code>mlbench</code></a> package and is loaded by calling <code>data(Ozone)</code>. <br> <br><a href="https://www.rdocumentation.org/packages/mlbench/versions/2.1-1/topics/Ozone" target="_blank">More information on the dataset</a>. <br> <br> Tutorials illustrating methods on this data: <ul> <li> Julie Josse's <a href="/tutorials/Josse_slides_imputation_PCA_2018.pdf" target="_blank">course</a> on missing values imputation using PC methods.</li> <li> Julie Josse and Nick Tierney's tutorial on handling missing values. Download the data set from this tutorial: <a href="/tutorials/ozoneNA.csv">ozoneNA.csv</a></li> <li> Nick Tierney's <code>naniar</code> <a href="https://cran.r-project.org/web/packages/naniar/vignettes/naniar-visualisation.html" target="_blank">vignette</a> for missing data visualization.</li> </ul>
pedestrian	Multivariate, Time Series	Categorical, Integer	37,700	9	2	No	2016
This data set contains hourly counts of pedestrians from 4 sensors around Melbourne in 2016. It can be found in R in the <a href="https://cran.r-project.org/web/packages/naniar/index.html" target="_blank"><code>naniar</code></a> package and is loaded by calling <code>data(pedestrian)</code>. <br> <br><a href="https://data.melbourne.vic.gov.au/Transport-Movement/Pedestrian-volume-updated-monthly-/b2ak-trbp" target="_blank">More information on the collected data</a> on the public data website of the City of Melbourne.
riskfactors	Multivariate	Categorical, Integer, Real	245	34	14	No	2009
The data is a subset of the 2009 survey from the Behavioral Risk Factor Surveillance System, designed to measure behavioral risk factors for the adult population living in households. It can be found in R in the <a href="https://cran.r-project.org/web/packages/naniar/index.html" target="_blank"><code>naniar</code></a> package and is loaded by calling <code>data(riskfactors)</code>. <br> <br><a href="https://www.cdc.gov/brfss/data_documentation/index.htm" target="_blank">More information on the survey</a> on the website of the Centers for Disease Control and Prevention.
SBS5242	Multivariate	Real	262	9	2	No	2016
The data contains a synthetic subset of the Austrian structural business statistics (SBS) data; more specifically, it contains data on 9 variables of NACE 52.42 (retail sale of clothing). A non-confidential, close-to-reality synthetic data set was generated from the confidential raw data of the original Austrian SBS data set. It can be found in R in the <a href="https://cran.r-project.org/web/packages/VIM/index.html" target="_blank"><code>VIM</code></a> package and is loaded by calling <code>data(SBS5242)</code>. <br> <br><a href="http://statistik.at/web_en/statistics/Economy/enterprises/structural_business_statistics/index.html" target="_blank">More information on the initial SBS data</a> on the website of Statistik Austria.
sleep	Multivariate	Integer, Real	62	10	6	No	1976
The data contains sleep measurements for 62 mammal species. It can be found in R in the <a href="https://cran.r-project.org/web/packages/VIM/index.html" target="_blank"><code>VIM</code></a> package and is loaded by calling <code>data(sleep, package = "VIM")</code> (base R ships an unrelated data set that is also named <code>sleep</code>). <br> <br><a href="https://www.semanticscholar.org/paper/Sleep-in-mammals%3A-ecological-and-constitutional-Allison-Cicchetti/8d4f202354bf0fd1bd445792340e16acc042ec6d" target="_blank">More information about the collected data</a> in Allison, T. and Cicchetti, D. (1976) Sleep in mammals: ecological and constitutional correlates. <i>Science</i> <b>194 (4266)</b>, 732-734.
tsAirgap	Time Series	Integer	144	1	9	Yes	1960
The data contains monthly totals of international airline passengers between 1949 and 1960. It can be found in R in the <a href="https://cran.r-project.org/web/packages/imputeTS/index.html" target="_blank"><code>imputeTS</code></a> package and is loaded by calling <code>data(tsAirgap)</code>. <br> <br><a href="https://www.wiley.com/en-us/Time+Series+Analysis%3A+Forecasting+and+Control%2C+5th+Edition-p-9781118674918" target="_blank">More information on the data</a> in the work from Box & Jenkins.
tsHeating	Time Series	Real	606,837	1	9	Yes	2015
The data contains a time series of a heating system's supply temperature, measured from 18.11.2013 - 05:12:00 to 13.01.2015 - 15:08:00 in 1-minute steps. It can be found in R in the <a href="https://cran.r-project.org/web/packages/imputeTS/index.html" target="_blank"><code>imputeTS</code></a> package and is loaded by calling <code>data(tsHeating)</code>. The data comes from the GECCO Industrial Challenge 2015. <br> <br><a href="http://www.spotseven.de/gecco/gecco-challenge/gecco-challenge-2015/" target="_blank">More information about the challenge</a> on the website of SPOTSeven Lab.
tsNH4	Time Series	Real	4,552	1	9	Yes	2014
The data contains a time series of the NH4 concentration in a wastewater system, measured from 30.11.2010 - 16:10 to 01.01.2011 - 06:40 in 10-minute steps. It can be found in R in the <a href="https://cran.r-project.org/web/packages/imputeTS/index.html" target="_blank"><code>imputeTS</code></a> package and is loaded by calling <code>data(tsNH4)</code>. The data comes from the GECCO Industrial Challenge 2014. <br> <br><a href="http://www.spotseven.de/gecco/gecco-challenge/gecco-challenge-2014/" target="_blank">More information about the challenge</a> on the website of SPOTSeven Lab.
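
After loading one of the data sets above, a quick first step is to count the missing entries. The following is a minimal sketch using airquality, chosen only because it ships with base R and needs no extra package. The figures it prints may not match the "% Missing entries" column, since that column's exact definition is not stated here.

```r
# Load a data set that ships with base R (package "datasets").
data(airquality)

dim(airquality)                           # number of rows and columns
colSums(is.na(airquality))                # missing entries per variable
round(100 * mean(is.na(airquality)), 1)   # share of missing cells, in percent
mean(!complete.cases(airquality))         # share of rows with at least one NA
```

For data sets that live in a contributed package, the package argument of data() makes the source explicit, for example data(oceanbuoys, package = "naniar"), provided the package is installed.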

If you are looking for other publicly available data sets with missing values that have been used by researchers to assess the quality of different methods, have a look at:

Missing Values in Data Mining website of the Soft Computing and Intelligent Information Systems research group at University of Granada.
Missing values data sets in the Knowledge Extraction based on Evolutionary Learning (KEEL) data set repository.