Example by Jerry Reiter for short course at Odum Institute, March 2018. This short example shows R commands for handling missing data using the mice package.

Preliminaries (load packages and data)

If you have not already, install the mice package with the following command (remove #):

#install.packages("mice")

Once you have mice installed, load it into memory using the command:

library(mice)
## Loading required package: lattice
## 
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
## 
##     cbind, rbind

We illustrate the package with a simple data set from the mice package:

data(nhanes2)

dim(nhanes2)
## [1] 25  4
summary(nhanes2)
##     age          bmi          hyp          chl       
##  20-39:12   Min.   :20.40   no  :13   Min.   :113.0  
##  40-59: 7   1st Qu.:22.65   yes : 4   1st Qu.:185.0  
##  60-99: 6   Median :26.75   NA's: 8   Median :187.0  
##             Mean   :26.56             Mean   :191.4  
##             3rd Qu.:28.93             3rd Qu.:212.0  
##             Max.   :35.30             Max.   :284.0  
##             NA's   :9                 NA's   :10

Note that the age and hyp variables are already defined as factor variables in this dataset. If you have a dataset with categorical variables without this pre-definition, you should make the variables factors before doing mice. For example, to make a variable called var1 into a factor in some dataset called thedata, type thedata$var1 = as.factor(thedata$var1).

Missing data

The NA values represent missing data.

Let’s look at the missing data pattern:

md.pattern(nhanes2)

##    age hyp bmi chl   
## 13   1   1   1   1  0
## 3    1   1   1   0  1
## 1    1   1   0   1  1
## 1    1   0   0   1  2
## 7    1   0   0   0  3
##      0   8   9  10 27

Exploratory data analyses

Let’s look at some exploratory data analyses based on the complete cases.

First make the complete cases

ccnhanes2data = cc(nhanes2)
ccnhanes2data
##      age  bmi hyp chl
## 2  40-59 22.7  no 187
## 5  20-39 20.4  no 113
## 7  20-39 22.5  no 118
## 8  20-39 30.1  no 187
## 9  40-59 22.0  no 238
## 13 60-99 21.7  no 206
## 14 40-59 28.7 yes 204
## 17 60-99 27.2 yes 284
## 18 40-59 26.3 yes 199
## 19 20-39 35.3  no 218
## 22 20-39 33.2  no 229
## 23 20-39 27.5  no 131
## 25 40-59 27.4  no 186

Note that there are only 3 cases with hypertension = yes. Relationships will be hard to model accurately due to small sample size…

We will use these data for illustration of the MI procedures. Really we need more data to say anything meaningful about relationships among these variables.

plot(ccnhanes2data$bmi, x=ccnhanes2data$chl, xlab = "Cholesterol", ylab = "BMI", main = "Complete cases: bmi versus chl")

boxplot(ccnhanes2data$bmi ~ ccnhanes2data$age, xlab = "Age", ylab = "BMI", main = "Complete cases: bmi versus age")

boxplot(ccnhanes2data$bmi ~ ccnhanes2data$hyp, xlab = "Hypertensive", ylab = "BMI", main = "Complete cases: bmi versus hyp")

Multiple imputation

How to use mice

Let’s create 5 multiple imputations for the missing values. We will use normal linear regressions for continuous variables, logistic regressions for binary variables, multinomial regressions for unordered categorical variables, and proportional odds models for ordered categorical variables. Note: We have not discussed the last two models in this class, but they are popular models for these kinds of data.

nhanes2MI5 = mice(nhanes2, m=5, defaultMethod = c("norm", "logreg", "polyreg", "polr"))
## 
##  iter imp variable
##   1   1  bmi  hyp  chl
##   1   2  bmi  hyp  chl
##   1   3  bmi  hyp  chl
##   1   4  bmi  hyp  chl
##   1   5  bmi  hyp  chl
##   2   1  bmi  hyp  chl
##   2   2  bmi  hyp  chl
##   2   3  bmi  hyp  chl
##   2   4  bmi  hyp  chl
##   2   5  bmi  hyp  chl
##   3   1  bmi  hyp  chl
##   3   2  bmi  hyp  chl
##   3   3  bmi  hyp  chl
##   3   4  bmi  hyp  chl
##   3   5  bmi  hyp  chl
##   4   1  bmi  hyp  chl
##   4   2  bmi  hyp  chl
##   4   3  bmi  hyp  chl
##   4   4  bmi  hyp  chl
##   4   5  bmi  hyp  chl
##   5   1  bmi  hyp  chl
##   5   2  bmi  hyp  chl
##   5   3  bmi  hyp  chl
##   5   4  bmi  hyp  chl
##   5   5  bmi  hyp  chl

Look at the first couple of completed datasets

d1 = complete(nhanes2MI5, 1)
d1
##      age      bmi hyp       chl
## 1  20-39 27.10510  no 178.72730
## 2  40-59 22.70000  no 187.00000
## 3  20-39 27.85599  no 187.00000
## 4  60-99 31.65444 yes 321.80262
## 5  20-39 20.40000  no 113.00000
## 6  60-99 18.55172  no 184.00000
## 7  20-39 22.50000  no 118.00000
## 8  20-39 30.10000  no 187.00000
## 9  40-59 22.00000  no 238.00000
## 10 40-59 23.79742  no 150.99083
## 11 20-39 18.36895  no  81.90879
## 12 40-59 19.17256  no 159.52815
## 13 60-99 21.70000  no 206.00000
## 14 40-59 28.70000 yes 204.00000
## 15 20-39 29.60000  no 187.43072
## 16 20-39 29.57671  no 166.30587
## 17 60-99 27.20000 yes 284.00000
## 18 40-59 26.30000 yes 199.00000
## 19 20-39 35.30000  no 218.00000
## 20 60-99 25.50000 yes 271.85660
## 21 20-39 26.56711  no 154.82231
## 22 20-39 33.20000  no 229.00000
## 23 20-39 27.50000  no 131.00000
## 24 60-99 24.90000  no 231.30378
## 25 40-59 27.40000  no 186.00000
d2 = complete(nhanes2MI5, 2)
d2
##      age      bmi hyp       chl
## 1  20-39 32.93049  no 183.58290
## 2  40-59 22.70000  no 187.00000
## 3  20-39 31.09278  no 187.00000
## 4  60-99 18.87489  no 195.42848
## 5  20-39 20.40000  no 113.00000
## 6  60-99 15.20205 yes 184.00000
## 7  20-39 22.50000  no 118.00000
## 8  20-39 30.10000  no 187.00000
## 9  40-59 22.00000  no 238.00000
## 10 40-59 28.12083  no 227.16156
## 11 20-39 26.44805  no 185.25915
## 12 40-59 29.50729  no 219.47739
## 13 60-99 21.70000  no 206.00000
## 14 40-59 28.70000 yes 204.00000
## 15 20-39 29.60000  no 185.81944
## 16 20-39 20.44892  no  84.21949
## 17 60-99 27.20000 yes 284.00000
## 18 40-59 26.30000 yes 199.00000
## 19 20-39 35.30000  no 218.00000
## 20 60-99 25.50000 yes 250.17108
## 21 20-39 23.73500  no 118.28273
## 22 20-39 33.20000  no 229.00000
## 23 20-39 27.50000  no 131.00000
## 24 60-99 24.90000  no 237.60771
## 25 40-59 27.40000  no 186.00000

Imputation diagnostics

Plot imputed and observed continuous variables

stripplot(nhanes2MI5, col=c("grey",mdc(2)),pch=c(1,20))