Example by Jerry Reiter for short course at Odum Institute, March 2018. This short example shows R commands for handling missing data using the mice
package.
If you have not already, install the mice package with the following command (remove #):
#install.packages("mice")
Once you have mice
installed, load it into memory using the command:
library(mice)
## Loading required package: lattice
##
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
##
## cbind, rbind
We illustrate the package with a simple data set from the mice package:
data(nhanes2)
dim(nhanes2)
## [1] 25 4
summary(nhanes2)
## age bmi hyp chl
## 20-39:12 Min. :20.40 no :13 Min. :113.0
## 40-59: 7 1st Qu.:22.65 yes : 4 1st Qu.:185.0
## 60-99: 6 Median :26.75 NA's: 8 Median :187.0
## Mean :26.56 Mean :191.4
## 3rd Qu.:28.93 3rd Qu.:212.0
## Max. :35.30 Max. :284.0
## NA's :9 NA's :10
Note that the age
and hyp
variables are already defined as factor variables in this dataset. If you have a dataset with categorical variables without this pre-definition, you should make the variables factors before doing mice. For example, to make a variable called var1 into a factor in some dataset called thedata, type thedata$var1 = as.factor(thedata$var1)
.
The NA
values represent missing data.
Let’s look at the missing data pattern:
md.pattern(nhanes2)
## age hyp bmi chl
## 13 1 1 1 1 0
## 3 1 1 1 0 1
## 1 1 1 0 1 1
## 1 1 0 0 1 2
## 7 1 0 0 0 3
## 0 8 9 10 27
Let’s look at some exploratory data analyses based on the complete cases.
First make the complete cases
ccnhanes2data = cc(nhanes2)
ccnhanes2data
## age bmi hyp chl
## 2 40-59 22.7 no 187
## 5 20-39 20.4 no 113
## 7 20-39 22.5 no 118
## 8 20-39 30.1 no 187
## 9 40-59 22.0 no 238
## 13 60-99 21.7 no 206
## 14 40-59 28.7 yes 204
## 17 60-99 27.2 yes 284
## 18 40-59 26.3 yes 199
## 19 20-39 35.3 no 218
## 22 20-39 33.2 no 229
## 23 20-39 27.5 no 131
## 25 40-59 27.4 no 186
Note that there are only 3 cases with hypertension = yes
. Relationships will be hard to model accurately due to small sample size…
We will use these data for illustration of the MI procedures. Really we need more data to say anything meaningful about relationships among these variables.
plot(ccnhanes2data$bmi, x=ccnhanes2data$chl, xlab = "Cholesterol", ylab = "BMI", main = "Complete cases: bmi versus chl")
boxplot(ccnhanes2data$bmi ~ ccnhanes2data$age, xlab = "Age", ylab = "BMI", main = "Complete cases: bmi versus age")
boxplot(ccnhanes2data$bmi ~ ccnhanes2data$hyp, xlab = "Hypertensive", ylab = "BMI", main = "Complete cases: bmi versus hyp")
mice
Let’s create 5 multiple imputations for the missing values. We will use normal linear regressions for continuous variables, logistic regressions for binary variables, multinomial regressions for unordered categorical variables, and proportional odds models for ordered categorical variables. Note: We have not discussed the last two models in this class, but they are popular models for these kinds of data.
nhanes2MI5 = mice(nhanes2, m=5, defaultMethod = c("norm", "logreg", "polyreg", "polr"))
##
## iter imp variable
## 1 1 bmi hyp chl
## 1 2 bmi hyp chl
## 1 3 bmi hyp chl
## 1 4 bmi hyp chl
## 1 5 bmi hyp chl
## 2 1 bmi hyp chl
## 2 2 bmi hyp chl
## 2 3 bmi hyp chl
## 2 4 bmi hyp chl
## 2 5 bmi hyp chl
## 3 1 bmi hyp chl
## 3 2 bmi hyp chl
## 3 3 bmi hyp chl
## 3 4 bmi hyp chl
## 3 5 bmi hyp chl
## 4 1 bmi hyp chl
## 4 2 bmi hyp chl
## 4 3 bmi hyp chl
## 4 4 bmi hyp chl
## 4 5 bmi hyp chl
## 5 1 bmi hyp chl
## 5 2 bmi hyp chl
## 5 3 bmi hyp chl
## 5 4 bmi hyp chl
## 5 5 bmi hyp chl
Look at the first couple of completed datasets
d1 = complete(nhanes2MI5, 1)
d1
## age bmi hyp chl
## 1 20-39 27.10510 no 178.72730
## 2 40-59 22.70000 no 187.00000
## 3 20-39 27.85599 no 187.00000
## 4 60-99 31.65444 yes 321.80262
## 5 20-39 20.40000 no 113.00000
## 6 60-99 18.55172 no 184.00000
## 7 20-39 22.50000 no 118.00000
## 8 20-39 30.10000 no 187.00000
## 9 40-59 22.00000 no 238.00000
## 10 40-59 23.79742 no 150.99083
## 11 20-39 18.36895 no 81.90879
## 12 40-59 19.17256 no 159.52815
## 13 60-99 21.70000 no 206.00000
## 14 40-59 28.70000 yes 204.00000
## 15 20-39 29.60000 no 187.43072
## 16 20-39 29.57671 no 166.30587
## 17 60-99 27.20000 yes 284.00000
## 18 40-59 26.30000 yes 199.00000
## 19 20-39 35.30000 no 218.00000
## 20 60-99 25.50000 yes 271.85660
## 21 20-39 26.56711 no 154.82231
## 22 20-39 33.20000 no 229.00000
## 23 20-39 27.50000 no 131.00000
## 24 60-99 24.90000 no 231.30378
## 25 40-59 27.40000 no 186.00000
d2 = complete(nhanes2MI5, 2)
d2
## age bmi hyp chl
## 1 20-39 32.93049 no 183.58290
## 2 40-59 22.70000 no 187.00000
## 3 20-39 31.09278 no 187.00000
## 4 60-99 18.87489 no 195.42848
## 5 20-39 20.40000 no 113.00000
## 6 60-99 15.20205 yes 184.00000
## 7 20-39 22.50000 no 118.00000
## 8 20-39 30.10000 no 187.00000
## 9 40-59 22.00000 no 238.00000
## 10 40-59 28.12083 no 227.16156
## 11 20-39 26.44805 no 185.25915
## 12 40-59 29.50729 no 219.47739
## 13 60-99 21.70000 no 206.00000
## 14 40-59 28.70000 yes 204.00000
## 15 20-39 29.60000 no 185.81944
## 16 20-39 20.44892 no 84.21949
## 17 60-99 27.20000 yes 284.00000
## 18 40-59 26.30000 yes 199.00000
## 19 20-39 35.30000 no 218.00000
## 20 60-99 25.50000 yes 250.17108
## 21 20-39 23.73500 no 118.28273
## 22 20-39 33.20000 no 229.00000
## 23 20-39 27.50000 no 131.00000
## 24 60-99 24.90000 no 237.60771
## 25 40-59 27.40000 no 186.00000
Plot imputed and observed continuous variables
stripplot(nhanes2MI5, col=c("grey",mdc(2)),pch=c(1,20))