The following real data example is adapted to a large extend from the guidance on Rmisstastic_descriptive_statistics_with_missing_values
Air pollution is currently one of the most serious public health worries worldwide. Many epidemiological studies have proved the influence that some chemical compounds, such as sulphur dioxide (\(SO_2\)), nitrogen dioxide (\(NO_2\)), ozone (\(O_3\)) can have on our health. Associations set up to monitor air quality are active all over the world to measure the concentration of these pollutants.
The data set we use here is a small subset of a cleaned version of air pollution measurements in the US. For more details, I refer to the Appendix C of the following paper. In this example, I actually induced missing values here, so that we have full control over the missing mechanism and access to the true data.
We first load a real (prepared) data set:
library(mice)
# Naniar provides principled, tidy ways to summarise, visualise, and manipulate
# missing data with minimal deviations from the workflows in ggplot2 and tidy
# data.
library(naniar)
library(VIM)
library(FactoMineR)
X <- read.csv("data.csv", header = T, row.names = 1)
Xstar<- read.csv("fulldata.csv", header = T, row.names = 1)
head(X)
## max_PM2.5 max_O3 max_PM10 Longitude Latitude Elevation Land.Use_AGRICULTURAL
## 1 4.75 0.027 11 -106.58520 35.13430 1591 0
## 2 7.15 0.055 25 -120.68028 38.20185 NA 1
## 3 1.45 0.044 8 -105.30353 42.76697 1508 1
## 4 7.00 0.058 20 -70.74802 43.07537 8 0
## 5 12.00 0.017 NA -112.09577 33.50383 NA 0
## 6 5.40 0.049 17 -85.89804 38.06091 135 0
## Land.Use_COMMERCIAL Land.Use_INDUSTRIAL Location.Setting_RURAL
## 1 0 0 0
## 2 0 0 1
## 3 0 0 1
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Location.Setting_SUBURBAN
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 1
head(Xstar)
## max_PM2.5 max_O3 max_PM10 Longitude Latitude Elevation Land.Use_AGRICULTURAL
## 1 4.75 0.027 11 -106.58520 35.13430 1591 0
## 2 7.15 0.055 25 -120.68028 38.20185 310 1
## 3 1.45 0.044 8 -105.30353 42.76697 1508 1
## 4 7.00 0.058 20 -70.74802 43.07537 8 0
## 5 12.00 0.017 15 -112.09577 33.50383 343 0
## 6 5.40 0.049 17 -85.89804 38.06091 135 0
## Land.Use_COMMERCIAL Land.Use_INDUSTRIAL Location.Setting_RURAL
## 1 0 0 0
## 2 0 0 1
## 3 0 0 1
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Location.Setting_SUBURBAN
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 1
summary(X)
## max_PM2.5 max_O3 max_PM10 Longitude
## Min. :-3.300 Min. :0.0010 Min. : 0.00 Min. :-124.18
## 1st Qu.: 4.100 1st Qu.:0.0330 1st Qu.: 9.00 1st Qu.:-117.33
## Median : 6.200 Median :0.0410 Median : 15.00 Median :-106.51
## Mean : 7.332 Mean :0.0413 Mean : 18.31 Mean :-102.33
## 3rd Qu.: 9.200 3rd Qu.:0.0490 3rd Qu.: 23.00 3rd Qu.: -88.22
## Max. :56.333 Max. :0.0910 Max. :143.00 Max. : -68.26
## NA's :159 NA's :162 NA's :338 NA's :159
## Latitude Elevation Land.Use_AGRICULTURAL Land.Use_COMMERCIAL
## Min. :26.05 Min. : -14 Min. :0.000 Min. :0.0000
## 1st Qu.:34.49 1st Qu.: 68 1st Qu.:0.000 1st Qu.:0.0000
## Median :38.20 Median : 269 Median :0.000 Median :0.0000
## Mean :38.35 Mean : 430 Mean :0.125 Mean :0.3115
## 3rd Qu.:41.30 3rd Qu.: 571 3rd Qu.:0.000 3rd Qu.:1.0000
## Max. :48.64 Max. :2195 Max. :1.000 Max. :1.0000
## NA's :353
## Land.Use_INDUSTRIAL Location.Setting_RURAL Location.Setting_SUBURBAN
## Min. :0.00 Min. :0.000 Min. :0.000
## 1st Qu.:0.00 1st Qu.:0.000 1st Qu.:0.000
## Median :0.00 Median :0.000 Median :0.000
## Mean :0.03 Mean :0.193 Mean :0.428
## 3rd Qu.:0.00 3rd Qu.:0.000 3rd Qu.:1.000
## Max. :1.00 Max. :1.000 Max. :1.000
##
We start by inspecting various plots for the missing values:
res <- summary(aggr(X, sortVar = TRUE))$combinations
##
## Variables sorted by number of missings:
## Variable Count
## Elevation 0.1765
## max_PM10 0.1690
## max_O3 0.0810
## max_PM2.5 0.0795
## Longitude 0.0795
## Latitude 0.0000
## Land.Use_AGRICULTURAL 0.0000
## Land.Use_COMMERCIAL 0.0000
## Land.Use_INDUSTRIAL 0.0000
## Location.Setting_RURAL 0.0000
## Location.Setting_SUBURBAN 0.0000
res[rev(order(res[, 2])), ]
## Combinations Count Percent
## 1 0:0:0:0:0:0:0:0:0:0:0 1008 50.40
## 5 0:0:1:0:0:1:0:0:0:0:0 179 8.95
## 2 0:0:0:0:0:1:0:0:0:0:0 174 8.70
## 6 0:1:0:0:0:0:0:0:0:0:0 162 8.10
## 7 1:0:0:0:0:0:0:0:0:0:0 159 7.95
## 4 0:0:1:0:0:0:0:0:0:0:0 159 7.95
## 3 0:0:0:1:0:0:0:0:0:0:0 159 7.95
Creating the res variable renders a nice plot, showing the percentage of missing values for each variable. Moreover the next command nicely shows the patterns (\(M\)), as well as their frequency of occurring in the data set. In particular, we can further visualize the pattern using the matrixplot function:
matrixplot(X, sortby = 3)
The VIM function marginplot
creates a
scatterplot with additional information on the missing values. If you
plot the variables \((x,y)\), the
points with no missing values are represented as in a standard
scatterplot. The points for which \(x\)
(resp. \(y\)) is missing are
represented in red along the \(y\)
(resp. \(x\)) axis. In addition,
boxplots of the x and y variables are represented along the axes with
and without missing values (in red all variables \(x\) where \(y\) is missing, in blue all variables x
where y is observed).
marginplot(X[,2:3])
This plot can be used to check whether MCAR might hold. Under MCAR, the distribution of a variable when another variable is missing should always be the same. Under MAR this can be violated as we have seen (distribution shifts!). This plotting is a convenient way to check this a bit.
There are many more plotting possibilities with VIM, as demonstrated e.g., in 2012ADAC.pdf (tuwien.ac.at).
We now finally use the mice package for imputation.
library(mice)
We consider several methods and then start by choosing the best one according to the new I-Score. I-Score is contained in miceDRF package. In can be installed with
devtools::install_github("KrystynaGrzesiak/miceDRF")
As the best version of the score not only scores one imputation but an imputation method itself for this dataset, we need to define a function for each:
library(miceDRF)
X <- as.matrix(X)
methods <- c("pmm", # mice-pmm
"cart", # mice-cart
"sample", # mice-sample
"norm.nob", # Gaussian Imputation
"DRF") # mice-DRF
# Creating a list of impuattion functions
imputation_list <- create_mice_imputations(methods)
# Calculatiung the scores
scores <- Iscores_compare(X = X, N = 30,
imputation_list = imputation_list,
methods = methods)
scores
The score considers mice-cart to to be the best method. As a side note however, mice-rf is deemed second best and might have better properties for uncertainty estimation and multiple imputation, thus both should be considered. Here, we go with mice-cart:
imp.mice <- mice(X, m = 10, method = "cart", printFlag = F)
Since we have the true data in this case, we analyze the imputation method a bit closer:
## This here is not possible without the fully observed data ###
Ximp <- mice::complete(imp.mice)
index1 <- 1
index2 <- 2
par(mfrow = c(1, 2))
plot(Xstar[is.na(X[, index1]) | is.na(X[, index2]), c(index1, index2)])
plot(Ximp[is.na(X[, index1]) | is.na(X[, index2]), c(index1, index2)])
# Replicating first and second moments
colMeans(Xstar) - colMeans(Ximp)
## max_PM2.5 max_O3 max_PM10
## -0.0003866667 0.0000702500 -0.1242500000
## Longitude Latitude Elevation
## -0.0126517775 0.0000000000 1.6121475993
## Land.Use_AGRICULTURAL Land.Use_COMMERCIAL Land.Use_INDUSTRIAL
## 0.0000000000 0.0000000000 0.0000000000
## Location.Setting_RURAL Location.Setting_SUBURBAN
## 0.0000000000 0.0000000000
norm(cov(Xstar) - cov(Ximp)) / norm(cov(Xstar))
## [1] 0.01344732
# Apply a regression to the multiple imputation
lm.mice.out <- with(imp.mice, lm(max_O3 ~ max_PM2.5 + Longitude + Latitude + Elevation + Land.Use_AGRICULTURAL + Land.Use_COMMERCIAL + Land.Use_INDUSTRIAL + Location.Setting_RURAL + Location.Setting_SUBURBAN))
# Use Rubins Rules to aggregate the estimates
res <- pool(lm.mice.out)
summary(res)
## term estimate std.error statistic df
## 1 (Intercept) 3.950883e-02 3.743029e-03 10.5553097 1107.9578
## 2 max_PM2.5 1.152739e-04 6.046336e-05 1.9065090 128.2361
## 3 Longitude -1.038535e-05 2.091214e-05 -0.4966179 681.3921
## 4 Latitude -1.209735e-05 7.315343e-05 -0.1653695 1382.0726
## 5 Elevation -1.116929e-06 7.169468e-07 -1.5578962 605.4050
## 6 Land.Use_AGRICULTURAL -1.576081e-04 1.468549e-03 -0.1073223 516.8873
## 7 Land.Use_COMMERCIAL 7.634395e-04 6.961190e-04 1.0967084 1393.2735
## 8 Land.Use_INDUSTRIAL -1.951021e-03 1.968846e-03 -0.9909466 190.4904
## 9 Location.Setting_RURAL 1.921670e-03 1.201294e-03 1.5996663 842.7499
## 10 Location.Setting_SUBURBAN 8.411120e-04 6.819850e-04 1.2333292 1258.4391
## p.value
## 1 6.959455e-25
## 2 5.882162e-02
## 3 6.196187e-01
## 4 8.686774e-01
## 5 1.197804e-01
## 6 9.145749e-01
## 7 2.729584e-01
## 8 3.229687e-01
## 9 1.100474e-01
## 10 2.176833e-01
Importantly, this works here because we have all the ingredients for
the pool
function, which are (according to
?pool
):
the estimates of the model;
the standard error of each estimate;
the residual degrees of freedom of the model.
Just to double check, we also perform the regression on \(X^*\):
## This here is not possible without the fully observed data ###
res.not.attainable <- lm(max_O3 ~ max_PM2.5 + Longitude + Latitude + Elevation + Land.Use_AGRICULTURAL + Land.Use_COMMERCIAL + Land.Use_INDUSTRIAL + Location.Setting_RURAL + Location.Setting_SUBURBAN, data = as.data.frame(Xstar))
summary(res.not.attainable)
##
## Call:
## lm(formula = max_O3 ~ max_PM2.5 + Longitude + Latitude + Elevation +
## Land.Use_AGRICULTURAL + Land.Use_COMMERCIAL + Land.Use_INDUSTRIAL +
## Location.Setting_RURAL + Location.Setting_SUBURBAN, data = as.data.frame(Xstar))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.043412 -0.008181 -0.000635 0.007901 0.049148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.092e-02 3.627e-03 11.282 <2e-16 ***
## max_PM2.5 1.305e-04 5.333e-05 2.447 0.0145 *
## Longitude 1.314e-08 1.988e-05 0.001 0.9995
## Latitude -2.495e-05 7.142e-05 -0.349 0.7269
## Elevation -9.333e-07 6.769e-07 -1.379 0.1681
## Land.Use_AGRICULTURAL 3.083e-04 1.382e-03 0.223 0.8235
## Land.Use_COMMERCIAL 5.986e-04 6.794e-04 0.881 0.3784
## Land.Use_INDUSTRIAL -1.777e-03 1.751e-03 -1.015 0.3103
## Location.Setting_RURAL 1.883e-03 1.152e-03 1.634 0.1023
## Location.Setting_SUBURBAN 8.635e-04 6.632e-04 1.302 0.1931
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01288 on 1990 degrees of freedom
## Multiple R-squared: 0.007379, Adjusted R-squared: 0.00289
## F-statistic: 1.644 on 9 and 1990 DF, p-value: 0.09762
cbind(round((res$pooled$estimate - res.not.attainable$coefficients) / res.not.attainable$coefficients, 3))
## [,1]
## (Intercept) -0.035
## max_PM2.5 -0.117
## Longitude -791.070
## Latitude -0.515
## Elevation 0.197
## Land.Use_AGRICULTURAL -1.511
## Land.Use_COMMERCIAL 0.275
## Land.Use_INDUSTRIAL 0.098
## Location.Setting_RURAL 0.021
## Location.Setting_SUBURBAN -0.026
Of course there are many more challenges, especially also for data that may be partly dependent (for instance repeat measurement or panel data). Most importantly, mice-cart is awesome, but it does not model the uncertainty of the missing imputation itself. As such it is technically not a proper imputation method, as one part of the uncertainty is missing. This could be an issue for confidence intervals and p-values especially in smaller samples. We also refer to the provided links for more information. In particular also the task view on missing data.