The first steps are to load the package mice
and the data we want to impute. For this example, we will use the NHANES dataset.
To load this dataset, you can use the command file.choose()
which opens the explorer and allows you to navigate to the location of the file NHANES_1.RData
. If you know the path to the file, you can also use load("<path>/NHANES_1.RData")
.
Note:
To display the help file for any function, type a ?
followed by the name of the function in one of the code fields and click on “Run Code”.
Alternatively, you can look up the help pages online at https://www.rdocumentation.org/ or find the whole manual for a package at https://cran.r-project.org/web/packages/available_packages_by_name.html
The mice package needs to be loaded with the command library()
. Let’s take a first look at the data. Useful functions are dim()
, head()
, str()
and summary()
.
library(mice)
dim(NHANES)
## [1] 1000 23
head(NHANES)
str(NHANES)
## 'data.frame': 1000 obs. of 23 variables:
## $ height : num 1.65 1.8 1.65 1.68 1.57 ...
## $ BMI : num 22.5 24.4 18 22.6 30.2 ...
## $ hypten : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 NA 1 2 1 ...
## $ DM : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 2 1 ...
## $ cohort : chr "2011" "2011" "2011" "2011" ...
## $ DBP : num 74.7 73.3 75.3 68.7 82.7 ...
## $ bili : num NA NA 1 0.5 NA 0.6 1.1 0.6 0.4 0.9 ...
## $ SBP : num 131 123 105 123 125 ...
## $ hypchol : Factor w/ 2 levels "no","yes": NA NA 1 1 NA 1 1 1 1 1 ...
## $ creat : num NA NA 0.96 0.75 NA 0.7 0.72 0.86 0.6 0.94 ...
## $ WC : num 79.2 89.4 70.2 89.9 105.2 ...
## $ age : num 61 20 24 30 61 26 61 48 57 40 ...
## $ HyperMed: Ord.factor w/ 3 levels "no"<"previous"<..: NA NA NA NA NA NA NA NA 3 NA ...
## $ weight : num 61.2 79.4 49 63.5 74.8 ...
## $ chol : num NA NA 4.99 4.84 NA 4.78 4.09 4.84 4.68 4.86 ...
## $ educ : Factor w/ 5 levels "Less than 9th grade",..: 3 3 3 4 5 4 2 3 5 1 ...
## $ alc : Ord.factor w/ 5 levels "0"<"<=1"<"1-3"<..: 1 4 NA 5 3 5 3 5 2 4 ...
## $ albu : num NA NA 4 4.7 NA 4.1 4.8 4.3 3.5 4.3 ...
## $ race : Factor w/ 5 levels "Mexican American",..: 1 5 3 5 2 5 5 3 3 3 ...
## $ HDL : num NA NA 1.34 2.02 NA 1.22 1.32 1.09 0.8 1.5 ...
## $ uricacid: num NA NA 4.4 5 NA 7.3 5.8 5.2 5 4.9 ...
## $ smoke : Ord.factor w/ 3 levels "never"<"former"<..: 2 2 1 1 1 3 2 3 1 3 ...
## $ gender : Factor w/ 2 levels "male","female": 1 1 2 2 2 2 1 1 2 1 ...
summary(NHANES)
## height BMI hypten DM cohort
## Min. :1.397 Min. :16.27 no :637 no :843 Length:1000
## 1st Qu.:1.600 1st Qu.:23.33 yes :335 yes:157 Class :character
## Median :1.676 Median :26.61 NA's: 28 Mode :character
## Mean :1.679 Mean :27.71
## 3rd Qu.:1.753 3rd Qu.:30.67
## Max. :1.930 Max. :62.35
## NA's :20 NA's :30
## DBP bili SBP hypchol
## Min. : 12.67 Min. :0.2000 Min. : 81.33 no :815
## 1st Qu.: 64.00 1st Qu.:0.6000 1st Qu.:108.67 yes :100
## Median : 70.67 Median :0.7000 Median :118.00 NA's: 85
## Mean : 70.50 Mean :0.7492 Mean :120.25
## 3rd Qu.: 77.33 3rd Qu.:0.9000 3rd Qu.:128.67
## Max. :110.00 Max. :4.1000 Max. :207.33
## NA's :41 NA's :87 NA's :41
## creat WC age HyperMed
## Min. :0.3400 Min. : 64.50 Min. :20.00 no : 29
## 1st Qu.:0.6900 1st Qu.: 85.00 1st Qu.:32.00 previous: 28
## Median :0.8200 Median : 94.95 Median :45.00 yes :213
## Mean :0.8559 Mean : 96.59 Mean :46.54 NA's :730
## 3rd Qu.:0.9600 3rd Qu.:105.78 3rd Qu.:60.00
## Max. :7.4600 Max. :152.40 Max. :79.00
## NA's :86 NA's :50
## weight chol educ alc
## Min. : 38.56 Min. :2.07 Less than 9th grade : 78 0 :139
## 1st Qu.: 63.50 1st Qu.:4.34 9-11th grade :157 <=1 :276
## Median : 76.66 Median :4.99 High school graduate:216 1-3 :113
## Mean : 78.44 Mean :5.02 some college :289 3-7 : 82
## 3rd Qu.: 89.36 3rd Qu.:5.61 College or above :259 >7 :114
## Max. :167.83 Max. :9.93 NA's : 1 NA's:276
## NA's :17 NA's :85
## albu race HDL uricacid
## Min. :2.600 Mexican American : 96 Min. :0.360 Min. : 1.800
## 1st Qu.:4.100 Other Hispanic :110 1st Qu.:1.110 1st Qu.: 4.400
## Median :4.300 Non-Hispanic White:356 Median :1.340 Median : 5.400
## Mean :4.277 Non-Hispanic Black:243 Mean :1.388 Mean : 5.406
## 3rd Qu.:4.500 other :195 3rd Qu.:1.580 3rd Qu.: 6.300
## Max. :5.200 Max. :3.570 Max. :11.000
## NA's :86 NA's :85 NA's :86
## smoke gender
## never :574 male :476
## former :213 female:524
## current:212
## NA's : 1
##
##
##
It is important to check that all variables are coded correctly, i.e., have the correct variable class
specified, because mice()
automatically selects imputation methods based on each variable’s class
. When importing data from other software it can happen that factors become continuous variables or that ordered factors lose their ordering.
str()
showed that in the NHANES data smoke
and alc
are correctly specified as ordinal variables, but educ
is an unordered factor.
Using levels(NHANES$educ)
we can print the names of the categories of educ
. Convert the unordered factor to an ordered factor, for example using as.ordered()
. Afterwards, check if the conversion was successful.
Explore the missing data pattern of the NHANES data.
mdp <- md.pattern(NHANES)
mdp
## DM cohort age race gender educ smoke weight height hypten BMI DBP SBP
## 143 1 1 1 1 1 1 1 1 1 1 1 1 1
## 9 1 1 1 1 1 1 1 1 1 1 1 1 1
## 483 1 1 1 1 1 1 1 1 1 1 1 1 1
## 60 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1 0 1 0 1 1
## 4 1 1 1 1 1 1 1 1 1 1 1 0 0
## 2 1 1 1 1 1 1 1 1 1 0 1 1 1
## 6 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 0 1 1 0 1 1
## 5 1 1 1 1 1 1 1 1 1 1 1 1 1
## 146 1 1 1 1 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 0 1 0 1 1
## 3 1 1 1 1 1 1 1 1 1 1 1 0 0
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 0 1 1 0 1 1
## 8 1 1 1 1 1 1 1 1 1 0 1 0 0
## 5 1 1 1 1 1 1 1 0 0 1 0 1 1
## 3 1 1 1 1 1 1 1 1 1 1 1 0 0
## 4 1 1 1 1 1 1 1 1 0 1 0 1 1
## 2 1 1 1 1 1 1 1 0 1 1 0 1 1
## 1 1 1 1 1 1 1 1 1 1 0 1 0 0
## 8 1 1 1 1 1 1 1 1 1 0 1 0 0
## 1 1 1 1 1 1 1 1 0 0 1 0 1 1
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 0 1 0 0 0
## 3 1 1 1 1 1 1 1 1 1 0 1 0 0
## 2 1 1 1 1 1 1 1 1 0 0 0 0 0
## 17 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 0 0 0 0 0 0
## 2 1 1 1 1 1 1 1 1 1 1 1 1 1
## 31 1 1 1 1 1 1 1 1 1 1 1 1 1
## 6 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1 1 1 1 1 1
## 7 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 1 1 0 0
## 1 1 1 1 1 1 1 1 1 0 1 0 1 1
## 1 1 1 1 1 1 1 1 1 1 1 1 0 0
## 6 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 0 1 1 0 1 1
## 2 1 1 1 1 1 1 1 1 1 1 1 0 0
## 1 1 1 1 1 1 1 1 0 1 1 0 1 1
## 1 1 1 1 1 1 1 1 0 1 1 0 1 1
## 1 1 1 1 1 1 1 1 1 1 0 1 0 0
## 1 1 1 1 1 1 1 1 1 1 0 1 0 0
## 1 1 1 1 1 1 1 1 0 1 1 0 1 1
## 1 1 1 1 1 1 0 1 1 1 0 1 0 0
## 0 0 0 0 0 1 1 17 20 28 30 41 41
## WC hypchol chol HDL creat albu uricacid bili alc HyperMed
## 143 1 1 1 1 1 1 1 1 1 1 0
## 9 0 1 1 1 1 1 1 1 1 1 1
## 483 1 1 1 1 1 1 1 1 1 0 1
## 60 1 1 1 1 1 1 1 1 0 1 1
## 1 1 1 1 1 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1 1 1 2
## 4 1 1 1 1 1 1 1 1 1 1 2
## 2 1 1 1 1 1 1 1 1 1 0 2
## 6 0 1 1 1 1 1 1 1 1 0 2
## 1 1 1 1 1 1 1 1 1 1 1 2
## 5 0 1 1 1 1 1 1 1 0 1 2
## 146 1 1 1 1 1 1 1 1 0 0 2
## 2 1 1 1 1 1 1 1 1 1 0 3
## 3 1 1 1 1 1 1 1 1 0 1 3
## 1 1 1 1 1 1 1 1 0 0 0 3
## 3 0 1 1 1 1 1 1 1 0 0 3
## 3 1 1 1 1 1 1 1 1 0 1 3
## 8 1 1 1 1 1 1 1 1 1 0 4
## 5 1 1 1 1 1 1 1 1 1 0 4
## 3 0 1 1 1 1 1 1 1 0 1 4
## 4 1 1 1 1 1 1 1 1 0 0 4
## 2 1 1 1 1 1 1 1 1 0 0 4
## 1 0 1 1 1 1 1 1 1 1 0 5
## 8 1 1 1 1 1 1 1 1 0 0 5
## 1 1 1 1 1 1 1 1 1 0 0 5
## 1 1 1 1 1 0 0 0 0 1 0 5
## 1 0 1 1 1 1 1 1 1 0 1 6
## 3 0 1 1 1 1 1 1 1 0 0 6
## 2 1 1 1 1 1 1 1 1 0 0 7
## 17 1 0 0 0 0 0 0 0 1 1 7
## 1 1 1 1 1 1 1 1 1 0 0 8
## 2 0 0 0 0 0 0 0 0 1 1 8
## 31 1 0 0 0 0 0 0 0 1 0 8
## 6 1 0 0 0 0 0 0 0 0 1 8
## 1 0 0 0 0 0 0 0 0 1 0 9
## 3 0 0 0 0 0 0 0 0 0 1 9
## 7 1 0 0 0 0 0 0 0 0 0 9
## 1 0 0 0 0 0 0 0 0 1 1 10
## 1 1 0 0 0 0 0 0 0 1 0 10
## 1 1 0 0 0 0 0 0 0 0 1 10
## 6 0 0 0 0 0 0 0 0 0 0 10
## 1 1 0 0 0 0 0 0 0 0 1 10
## 2 0 0 0 0 0 0 0 0 0 1 11
## 1 0 0 0 0 0 0 0 0 0 1 11
## 1 1 0 0 0 0 0 0 0 0 0 11
## 1 0 0 0 0 0 0 0 0 1 0 12
## 1 1 0 0 0 0 0 0 0 0 0 12
## 1 0 0 0 0 0 0 0 0 0 0 12
## 1 0 0 0 0 0 0 0 0 1 0 13
## 50 85 85 85 86 86 86 87 276 730 1835
md.pattern()
gives us a matrix where each row represents one missing data pattern (1 = observed, 0 = missing). The rownames show how many rows of the data have the given missing data pattern. The last column gives the number of missing values in the pattern, the last row the number of missing values per variable.
Visualize and inspect the missing data pattern using one or more of the examples shown in the course slides.
Hint: Some functions you can try are JointAI::md_pattern()
, VIM::aggr()
, visdat::vis_dat()
, visdat::vis_miss()
.
For JointAI::md_pattern()
you can customize the y-axis by using the argument yaxis_pars
, which is a list of arguments. To hide the counts use yaxis_pars = list(yaxt = "n")
and printN = FALSE
. To prevent the printed output use print = FALSE
.
For more options check out the help pages of the functions.
Now get an overview of how much missingness there is in each variable and the proportion of (in)complete cases.
The function complete.cases()
will give you a vector that is TRUE if the the case is complete and FALSE if there are missing values.
is.na()
returns TRUE
if the value is missing, FALSE
if the value is observed
colSums()
calculates the sum of values in each column of a data.frame or matrix
colMeans()
calculates the mean of values in each column of a data.frame or matrix
# number and proportion of complete cases
Ncc <- cbind(
"#" = table(complete.cases(NHANES)),
"%" = round(100 * table(complete.cases(NHANES))/nrow(NHANES), 2)
)
rownames(Ncc) <- c("incompl.", "complete")
Ncc
## # %
## incompl. 857 85.7
## complete 143 14.3
# number and proportion of missing values per variable
cbind("# NA" = sort(colSums(is.na(NHANES))),
"% NA" = round(sort(colMeans(is.na(NHANES))) * 100, 2))
## # NA % NA
## DM 0 0.0
## cohort 0 0.0
## age 0 0.0
## race 0 0.0
## gender 0 0.0
## educ 1 0.1
## smoke 1 0.1
## weight 17 1.7
## height 20 2.0
## hypten 28 2.8
## BMI 30 3.0
## DBP 41 4.1
## SBP 41 4.1
## WC 50 5.0
## hypchol 85 8.5
## chol 85 8.5
## HDL 85 8.5
## creat 86 8.6
## albu 86 8.6
## uricacid 86 8.6
## bili 87 8.7
## alc 276 27.6
## HyperMed 730 73.0
Our data contains height
, weight
and BMI
. Check the missing data pattern specifically for those three variables.
# Three solutions to choose from:
md.pattern(NHANES[, c("height", "weight", "BMI")])
## weight height BMI
## 970 1 1 1 0
## 13 1 0 0 2
## 10 0 1 0 2
## 7 0 0 0 3
## 17 20 30 67
with(NHANES, table(is.na(height), is.na(weight), is.na(BMI)))
## , , = FALSE
##
##
## FALSE TRUE
## FALSE 970 0
## TRUE 0 0
##
## , , = TRUE
##
##
## FALSE TRUE
## FALSE 0 10
## TRUE 13 7
library(plyr)
ddply(NHANES, c(height = "ifelse(is.na(height), 'missing', 'observed')",
weight = "ifelse(is.na(weight), 'missing', 'observed')",
BMI = "ifelse(is.na(BMI), 'missing', 'observed')"),
summarize,
N = length(height)
)
As we have already seen in the lecture, there are quite a number of cases where only either height
or weight
is missing. BMI
is only observed when both components are observed. To use all available information, we want to impute height
and weight
separately and calculate BMI
from the imputed values.
Because we might expect that hypertensive medication is only prescribed for patients with hypertension, we will also look into the variables HyperMed
and hypten
. Make a table of the two variables to confirm our expectation. Make sure that missing values are also displayed in that table!
with(NHANES, table(HyperMed, hypten, exclude = NULL))
## hypten
## HyperMed no yes <NA>
## no 0 29 0
## previous 0 28 0
## yes 0 213 0
## <NA> 637 65 28
Furthermore, we can expect a systematic relationship between hypchol
and chol
. Find out how these two variables are related.
with(NHANES, plot(chol ~ hypchol))
with(NHANES, summary(chol[hypchol == "no"]))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.070 4.270 4.860 4.795 5.400 6.180 85
with(NHANES, summary(chol[hypchol == "yes"]))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 6.210 6.410 6.620 6.854 7.110 9.930 85
It seems that hypchol
is defined as having chol > 6.2
, which makes hypchol
dependent on chol
.
Before imputing the missing data it is important to take a look at how the incomplete variables are distributed so that we can choose appropriate imputation models.
Visualize the distributions of the incomplete continuous and categorical variables. Pay attention to
Hint: You may want to use the functions hist()
, table()
and barplot()
.
Of course, you could also use the ggplot2 package.
# One option that allows to quickly get an overview of all variables,
# categorical and continuous
nc <- max(5, ceiling(sqrt(ncol(NHANES))))
nr <- ceiling(ncol(NHANES) / nc)
par(mfrow = c(nr, nc), mgp = c(2, 0.6, 0), mar = c(2, 3, 3, 0.5))
for (i in 1:ncol(NHANES)) {
if (is.numeric(NHANES[, i])) {
hist(NHANES[, i], nclass = 50, xlab = "",
main = paste0(names(NHANES[i]), " (",
round(mean(is.na(NHANES[, i])) * 100, 2), "% NA)")
)
} else {
barplot(table(NHANES[, i]), ylab = "Frequency",
main = paste0(names(NHANES[i]), " (",
round(mean(is.na(NHANES[, i])) * 100, 2), "% NA)"))
}
}
Now that we know how our data and the missing values are distributed, we are ready to impute!
First, do a set-up run of mice()
, without any iterations (maxit = 0
). This is to get all the objects that we may need to adjust in order to customize the imputation to our data.
If you need help to figure out what you need to specify take a look at the help page ?mice
.
imp0 <- mice(NHANES, maxit = 0)
Note: This command will not produce any output.
There are many imputation methods available in mice. You can find the list in the help page of the mice()
function. We will focus here on the following ones:
name | description | variable type |
---|---|---|
pmm | Predictive mean matching | any |
norm | Bayesian linear regression | numeric |
logreg | Logistic regression | factor, 2 levels |
polyreg | Polytomous logistic regression | factor, >= 2 levels |
polr | Proportional odds model | ordered, >= 2 levels |
The default imputation methods that mice()
selects can be specified in the argument defaultMethod
.
If unspecified, mice
will use
pmm
for numerical columns,logreg
for factor columns with two categories,polyreg
for columns with unordered andpolr
for columns with ordered factors with more than two categories.When a normal imputation model seems to be appropriate for most of the continuous covariates, you may want to specify norm
as the default method in the setup run. Let’s do that:
The order of the types of variable is: continuous, binary, factor, ordered factor.
imp0 <- mice(NHANES, maxit = 0,
defaultMethod = c("norm", 'logreg', 'polyreg', 'polr'))
Looking at the histograms we made for the continuous variables, we can see that the variable creat
has a skewed distribution, so using a normal imputation model may not work well.
Let’s change the imputation method for creat
so that it is imputed with predictive mean matching:
meth <- imp0$meth
meth["creat"] <- "pmm"
We can now check out if we need to make any changes to the predictorMatrix
argument.
pred <- imp0$predictorMatrix
pred
## height BMI hypten DM cohort DBP bili SBP hypchol creat WC age
## height 0 1 1 1 0 1 1 1 1 1 1 1
## BMI 1 0 1 1 0 1 1 1 1 1 1 1
## hypten 1 1 0 1 0 1 1 1 1 1 1 1
## DM 0 0 0 0 0 0 0 0 0 0 0 0
## cohort 0 0 0 0 0 0 0 0 0 0 0 0
## DBP 1 1 1 1 0 0 1 1 1 1 1 1
## bili 1 1 1 1 0 1 0 1 1 1 1 1
## SBP 1 1 1 1 0 1 1 0 1 1 1 1
## hypchol 1 1 1 1 0 1 1 1 0 1 1 1
## creat 1 1 1 1 0 1 1 1 1 0 1 1
## WC 1 1 1 1 0 1 1 1 1 1 0 1
## age 0 0 0 0 0 0 0 0 0 0 0 0
## HyperMed 1 1 1 1 0 1 1 1 1 1 1 1
## weight 1 1 1 1 0 1 1 1 1 1 1 1
## chol 1 1 1 1 0 1 1 1 1 1 1 1
## educ 1 1 1 1 0 1 1 1 1 1 1 1
## alc 1 1 1 1 0 1 1 1 1 1 1 1
## albu 1 1 1 1 0 1 1 1 1 1 1 1
## race 0 0 0 0 0 0 0 0 0 0 0 0
## HDL 1 1 1 1 0 1 1 1 1 1 1 1
## uricacid 1 1 1 1 0 1 1 1 1 1 1 1
## smoke 1 1 1 1 0 1 1 1 1 1 1 1
## gender 0 0 0 0 0 0 0 0 0 0 0 0
## HyperMed weight chol educ alc albu race HDL uricacid smoke gender
## height 1 1 1 1 1 1 1 1 1 1 1
## BMI 1 1 1 1 1 1 1 1 1 1 1
## hypten 1 1 1 1 1 1 1 1 1 1 1
## DM 0 0 0 0 0 0 0 0 0 0 0
## cohort 0 0 0 0 0 0 0 0 0 0 0
## DBP 1 1 1 1 1 1 1 1 1 1 1
## bili 1 1 1 1 1 1 1 1 1 1 1
## SBP 1 1 1 1 1 1 1 1 1 1 1
## hypchol 1 1 1 1 1 1 1 1 1 1 1
## creat 1 1 1 1 1 1 1 1 1 1 1
## WC 1 1 1 1 1 1 1 1 1 1 1
## age 0 0 0 0 0 0 0 0 0 0 0
## HyperMed 0 1 1 1 1 1 1 1 1 1 1
## weight 1 0 1 1 1 1 1 1 1 1 1
## chol 1 1 0 1 1 1 1 1 1 1 1
## educ 1 1 1 0 1 1 1 1 1 1 1
## alc 1 1 1 1 0 1 1 1 1 1 1
## albu 1 1 1 1 1 0 1 1 1 1 1
## race 0 0 0 0 0 0 0 0 0 0 0
## HDL 1 1 1 1 1 1 1 0 1 1 1
## uricacid 1 1 1 1 1 1 1 1 0 1 1
## smoke 1 1 1 1 1 1 1 1 1 0 1
## gender 0 0 0 0 0 0 0 0 0 0 0
Remember that we wanted to impute height
and weight
separately? But let’s use BMI
to impute the other variables. And since HyperMed
does not give us a lot more information than hypten
, but has a lot more missing values, we do not want to use it as a predictor variable.
Go ahead and apply the necessary changes to pred
and meth
:
For passive imputation, you need to specify the formula used to calculate BMI in meth using "~I(...)"
.
pred[c("height", "weight"), "BMI"] <- 0
pred[, c("height", "weight")] <- 0
pred["height", "weight"] <- 1
pred["weight", "height"] <- 1
pred[, "HyperMed"] <- 0
pred["chol", "hypchol"] <- 0
meth["BMI"] <- "~I(weight/height^2)"
meth["HyperMed"] <- ""
To make sure that the imputed values of BMI
match the imputed values of height
and weight
,BMI
needs to be imputed after height
and weight
. Get the visitSequence
from imp0
(using imp0$visitSequence
) and change it if necessary.
visSeq <- imp0$visitSequence
# the lazy version:
visSeq <- c(visSeq[-2], visSeq[2])
# the save version:
# To make sure you still do the correct thing if the order of the columns in
# NHANES would change
which_BMI <- match("BMI", names(visSeq))
visSeq <- c(visSeq[-which_BMI], visSeq[which_BMI])
With the changes that we have made to the predictorMatrix
and method
, we can now perform the imputation. Use m = 5
and maxit = 10
.
imp <- mice(NHANES, method = meth, predictorMatrix = pred, visitSequence = visSeq,
maxit = 10, m = 5,
seed = 2018)
##
## iter imp variable
## 1 1 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 1 2 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 1 3 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 1 4 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 1 5 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 2 1 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 2 2 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 2 3 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 2 4 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 2 5 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 3 1 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 3 2 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 3 3 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 3 4 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 3 5 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 4 1 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 4 2 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 4 3 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 4 4 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 4 5 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 5 1 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 5 2 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 5 3 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 5 4 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 5 5 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 6 1 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 6 2 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 6 3 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 6 4 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 6 5 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 7 1 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 7 2 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 7 3 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 7 4 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 7 5 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 8 1 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 8 2 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 8 3 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 8 4 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 8 5 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 9 1 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 9 2 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 9 3 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 9 4 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 9 5 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 10 1 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 10 2 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 10 3 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 10 4 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
## 10 5 height hypten DBP bili SBP hypchol creat WC HyperMed weight chol educ alc albu HDL uricacid smoke BMI
mice()
prints the name of the variable being imputed for each iteration and imputation. If you run mice()
on your own computer the output will show up continuously. There, you may notice that imputation is slowest for categorical variables, especially when they have many categories.
You can hide the lengthy output by specifying printFlag = FALSE
.
mice()
does not return a data.frame
. Find out the class of the object returned by mice()
function using the function class()
, and take a look at the help file for this class:
class(imp)
## [1] "mids"
mids
objectWe see that imp
is an object of class mids
.
The help file tells us that a mids
object is a list with several elements:
call :
|
The call that created the mids object.
|
data :
|
A copy of the incomplete data set. |
where :
|
The missingness indicator matrix. |
m :
|
The number of imputations. |
nmis :
|
The number of missing observations per variable. |
imp :
|
The imputed values: A list of ncol(data) components,each list component is a matrix with nmis[j] rows and m columns.
|
method :
|
The vector imputation methods. |
predictorMatrix :
|
The predictor matrix. |
visitSequence :
|
The sequence in which columns are visited during imputation. |
post :
|
A vector of strings of length ncol(data) with commands for post-processing.
|
seed :
|
The seed value of the solution. |
iteration :
|
The number of iterations. |
lastSeedValue :
|
The most recent seed value. |
chainMean:
|
The mean of imputed values per variable and iteration: a list of m components. Each component is a matrix with maxit columns and length(visitSequence) rows.
|
chainVar :
|
The variances of imputed values per variable and iteration(same structure as chainMean ).
|
loggedEvents :
|
A matrix with the record of automatic removal actions (details below); (NULL if no action was made).
|
pad :
|
A list containing the internally used version of the data ,method , predictorMatrix , visitSequence , post and dummy coding.
|
Details of the loggedEvents
:
mice()
does some pre-processing of the data:
Furthermore, during each iteration
polr
imputation that does not converge is replaced by polyreg
.loggedEvents
has the following columns:
it
|
iteration number |
im
|
imputation number |
co
|
column number in the data |
dep
|
name of the name of the variable being imputed |
meth
|
imputation method used |
out
|
character vector with names of altered/removed predictors |
It is good practice to make sure that mice()
has not done any processing of the data that was not planned or that you are not aware of. This means checking the loggedEvents
, but also ensuring that the correct method
, predictorMatrix
and visitSequence
were used.
Do these checks for imp
.
imp$method
## height BMI hypten
## "norm" "~I(weight/height^2)" "logreg"
## DM cohort DBP
## "" "" "norm"
## bili SBP hypchol
## "norm" "norm" "logreg"
## creat WC age
## "pmm" "norm" ""
## HyperMed weight chol
## "" "norm" "norm"
## educ alc albu
## "polr" "polr" "norm"
## race HDL uricacid
## "" "norm" "norm"
## smoke gender
## "polr" ""
imp$predictorMatrix
## height BMI hypten DM cohort DBP bili SBP hypchol creat WC age
## height 0 0 1 1 0 1 1 1 1 1 1 1
## BMI 0 0 1 1 0 1 1 1 1 1 1 1
## hypten 0 1 0 1 0 1 1 1 1 1 1 1
## DM 0 0 0 0 0 0 0 0 0 0 0 0
## cohort 0 0 0 0 0 0 0 0 0 0 0 0
## DBP 0 1 1 1 0 0 1 1 1 1 1 1
## bili 0 1 1 1 0 1 0 1 1 1 1 1
## SBP 0 1 1 1 0 1 1 0 1 1 1 1
## hypchol 0 1 1 1 0 1 1 1 0 1 1 1
## creat 0 1 1 1 0 1 1 1 1 0 1 1
## WC 0 1 1 1 0 1 1 1 1 1 0 1
## age 0 0 0 0 0 0 0 0 0 0 0 0
## HyperMed 0 1 1 1 0 1 1 1 1 1 1 1
## weight 1 0 1 1 0 1 1 1 1 1 1 1
## chol 0 1 1 1 0 1 1 1 0 1 1 1
## educ 0 1 1 1 0 1 1 1 1 1 1 1
## alc 0 1 1 1 0 1 1 1 1 1 1 1
## albu 0 1 1 1 0 1 1 1 1 1 1 1
## race 0 0 0 0 0 0 0 0 0 0 0 0
## HDL 0 1 1 1 0 1 1 1 1 1 1 1
## uricacid 0 1 1 1 0 1 1 1 1 1 1 1
## smoke 0 1 1 1 0 1 1 1 1 1 1 1
## gender 0 0 0 0 0 0 0 0 0 0 0 0
## HyperMed weight chol educ alc albu race HDL uricacid smoke gender
## height 0 1 1 1 1 1 1 1 1 1 1
## BMI 0 0 1 1 1 1 1 1 1 1 1
## hypten 0 0 1 1 1 1 1 1 1 1 1
## DM 0 0 0 0 0 0 0 0 0 0 0
## cohort 0 0 0 0 0 0 0 0 0 0 0
## DBP 0 0 1 1 1 1 1 1 1 1 1
## bili 0 0 1 1 1 1 1 1 1 1 1
## SBP 0 0 1 1 1 1 1 1 1 1 1
## hypchol 0 0 1 1 1 1 1 1 1 1 1
## creat 0 0 1 1 1 1 1 1 1 1 1
## WC 0 0 1 1 1 1 1 1 1 1 1
## age 0 0 0 0 0 0 0 0 0 0 0
## HyperMed 0 0 1 1 1 1 1 1 1 1 1
## weight 0 0 1 1 1 1 1 1 1 1 1
## chol 0 0 0 1 1 1 1 1 1 1 1
## educ 0 0 1 0 1 1 1 1 1 1 1
## alc 0 0 1 1 0 1 1 1 1 1 1
## albu 0 0 1 1 1 0 1 1 1 1 1
## race 0 0 0 0 0 0 0 0 0 0 0
## HDL 0 0 1 1 1 1 1 0 1 1 1
## uricacid 0 0 1 1 1 1 1 1 0 1 1
## smoke 0 0 1 1 1 1 1 1 1 0 1
## gender 0 0 0 0 0 0 0 0 0 0 0
imp$visitSequence
## height hypten DBP bili SBP hypchol creat WC
## 1 3 6 7 8 9 10 11
## HyperMed weight chol educ alc albu HDL uricacid
## 13 14 15 16 17 18 20 21
## smoke BMI
## 22 2
imp$loggedEvents
Checking the loggedEvents
we see that for alc
, educ
and smoke
the method multinom
was used. That is a bit confusing: we specified polr
in our imputation method and imp$method
is indeed polr
for the three ordinal variables.
The mice
help page told us that in case polr
does not converge polyreg
will be used.
The help for mice.impute.polr()
tells us:
The call to
polr
might fail, usually because the data are very sparse. In that case,multinom
is tried as a fallback, and a record is written to theloggedEvents
component of themids
object.
and the help for mice.impute.polyreg()
says:
The algorithm of
mice.impute.polyreg
uses the functionmultinom()
from the nnet package.
So that explains it, everything seems to be in order, but unfortunately the ordinal models for alc
, educ
and smoke
did not converge and multinomial models had to be used instead.
Let’s see what else would have come up if we had not prepared the predictorMatrix
, method
and visitSequence
before imputation. The object impnaive
contains the result of
impnaive <- mice(NHANES, m = 5, maxit = 10)
Take a look at the loggedEvents
of impnaive
.
impnaive$loggedEvents
The loggedEvents
of the “naive” imputation show that the constant variable cohort
was excluded before the imputation (as it should be).
Furthermore, in the imputation model for HyperMed
, the variable hypten.1
was excluded. hypten.1
is the first dummy variable belonging to hypten
. We can check this by looking at impnaive$pad$categories
:
impnaive$pad$categories
Here, we see in row 3 that hypten
has 1 dummy variable and that hypten.1
(row 24) is a dummy variable, belonging to column 3 in the dataset. Hence, hypten
was not used in the imputation of HyperMed
.
We also did not change the visitSequence
in impnaive
. Find out what the effects of that are.
naiveDF1 <- complete(impnaive, 1)
naivecalcBMI <- with(naiveDF1, weight/height^2)
impDF1 <- complete(imp, 1)
impcalcBMI <- with(impDF1, weight/height^2)
cbind(naiveBMI = naiveDF1$BMI, naivecalcBMI,
impBMI = impDF1$BMI, impcalcBMI)[which(is.na(NHANES$BMI)), ]
## naiveBMI naivecalcBMI impBMI impcalcBMI
## [1,] 34.70152 36.52053 50.86348 50.86348
## [2,] 36.58007 35.61107 25.38624 25.38624
## [3,] 21.35093 21.63289 20.26659 20.26659
## [4,] 36.03137 37.19942 37.86607 37.86607
## [5,] 29.78683 29.28613 19.50081 19.50081
## [6,] 25.10961 23.02495 21.48701 21.48701
## [7,] 24.87445 25.74716 28.77091 28.77091
## [8,] 24.43877 24.32765 29.14430 29.14430
## [9,] 27.63528 26.36509 54.91769 54.91769
## [10,] 31.07062 33.10497 35.08130 35.08130
## [11,] 29.18011 29.28613 25.87766 25.87766
## [12,] 31.60193 32.50797 26.86994 26.86994
## [13,] 27.26186 26.45248 25.73757 25.73757
## [14,] 30.17188 30.89659 33.43548 33.43548
## [15,] 24.79961 22.63133 23.82928 23.82928
## [16,] 22.70996 22.80717 23.33221 23.33221
## [17,] 27.82475 27.34157 28.46065 28.46065
## [18,] 25.69608 25.94145 37.60368 37.60368
## [19,] 27.69863 28.71536 22.52371 22.52371
## [20,] 30.82238 31.79628 34.14806 34.14806
## [21,] 24.14285 23.96292 25.41106 25.41106
## [22,] 27.66707 26.83037 26.65740 26.65740
## [23,] 24.02152 25.06720 29.66248 29.66248
## [24,] 30.54098 30.18226 32.73194 32.73194
## [25,] 31.74297 31.01124 38.04242 38.04242
## [26,] 26.62547 24.99801 25.77261 25.77261
## [27,] 28.97478 29.05244 30.92338 30.92338
## [28,] 32.98316 33.96469 36.76925 36.76925
## [29,] 29.18011 28.84171 33.11822 33.11822
## [30,] 30.11381 30.27103 37.89824 37.89824
When we compare the imputed and calculated values of BMI
from impnaive
we can see that the imputed height
and weight
give a different BMI
than is imputed. This is because BMI
is imputed before weight
, which means that the most recent imputed value of weight
is from the previous iteration.
Changing the visitSequence
in imp
prevented this inconsistency.
The mean and variance of the imputed values per iteration and variable is stored in the elements chainMean
and chainVar
of the mids
object.
Let’s plot them to see if our imputation imp
has converged:
# implemented plotting function (use layout to change the number of rows and columns)
plot(imp, layout = c(6, 6))
# plot for a single variable
matplot(imp$chainMean["weight", , ], type = "l", ylab = "imputed value",
xlab = "iteration", main = "weight")
# or the ggplot version
library(ggplot2)
library(reshape2)
ggplot(melt(imp$chainMean), aes(x = Var2, y = value, color = Var3)) +
geom_line() +
facet_wrap("Var1", scales = 'free') +
theme(legend.position = 'none') +
xlab("iteration") +
ylab("imputed value")
The chains in imp
seem to have converged, but in practice, more iterations should be done to confirm this.
In comparison, impnaive
had some convergence problems:
plot(impnaive, layout = c(6, 6))
height
, and BMI
show a clear trend and the chains do not mix well, i.e., there is more variation between the chains than within each chain (the same is the case for weight
). Moreover, hyperchol
shows some undesired behaviour, where four chains move upwards and the 5th chain separates from them after half of the iterations. These are clear signs that there is correlation or identification problems between these variables and some other variables (which is why we made adjustments to the predictorMatrix
for imp
).
Now that we know that imp
has converged, we can compare the distribution of the imputed values against the distribution of the observed values.
Plot the distributions of the imputed variables (continuous and categorical), and make sure the imputed values are realistic (e.g., no height of 2.50m or weight of 10kg for adults). If there are differences in the distributions of observed and imputed values see if those differences can be explained by other variables.
You can use densityplot()
and probplot()
to get plots for all continuous and categorical variables.
You can obtain probplot()
from https://gist.github.com/NErler/0d00375da460dd33839b98faeee2fdab.
To check all imputed values you can either get a summary of the imp
element of the mids
object or create a complete dataset containing all imputations using the function complete()
and get the summary of that.
densityplot(imp)
# create a long dataset with all imputed data sets stacked onto each other
longDF <- complete(imp, "long")
summary(longDF)
## .imp .id height BMI hypten
## 1:1000 1 : 5 Min. :1.397 Min. :11.28 no :3286
## 2:1000 10 : 5 1st Qu.:1.600 1st Qu.:23.40 yes:1714
## 3:1000 100 : 5 Median :1.676 Median :26.63
## 4:1000 1006 : 5 Mean :1.678 Mean :27.80
## 5:1000 1008 : 5 3rd Qu.:1.753 3rd Qu.:30.73
## 1012 : 5 Max. :1.930 Max. :62.35
## (Other):4970
## DM cohort DBP bili
## no :4215 Length:5000 Min. : 12.67 Min. :-0.2129
## yes: 785 Class :character 1st Qu.: 64.00 1st Qu.: 0.6000
## Mode :character Median : 70.67 Median : 0.7000
## Mean : 70.51 Mean : 0.7492
## 3rd Qu.: 77.33 3rd Qu.: 0.9000
## Max. :110.00 Max. : 4.1000
##
## SBP hypchol creat WC
## Min. : 78.81 no :4452 Min. :0.3400 Min. : 54.53
## 1st Qu.:108.67 yes: 548 1st Qu.:0.6900 1st Qu.: 84.95
## Median :118.00 Median :0.8200 Median : 94.90
## Mean :120.38 Mean :0.8574 Mean : 96.65
## 3rd Qu.:129.33 3rd Qu.:0.9600 3rd Qu.:105.80
## Max. :207.33 Max. :7.4600 Max. :178.05
##
## age HyperMed weight chol
## Min. :20.00 no : 145 Min. : 26.21 Min. :2.070
## 1st Qu.:32.00 previous: 140 1st Qu.: 63.50 1st Qu.:4.340
## Median :45.00 yes :1065 Median : 77.11 Median :4.990
## Mean :46.54 NA's :3650 Mean : 78.51 Mean :5.023
## 3rd Qu.:60.00 3rd Qu.: 89.36 3rd Qu.:5.640
## Max. :79.00 Max. :167.83 Max. :9.930
##
## educ alc albu
## Less than 9th grade : 392 0 :1023 Min. :2.600
## 9-11th grade : 786 <=1:2009 1st Qu.:4.100
## High school graduate:1082 1-3: 746 Median :4.300
## some college :1445 3-7: 524 Mean :4.276
## College or above :1295 >7 : 698 3rd Qu.:4.500
## Max. :5.217
##
## race HDL uricacid
## Mexican American : 480 Min. :0.2401 Min. : 0.5986
## Other Hispanic : 550 1st Qu.:1.1100 1st Qu.: 4.4000
## Non-Hispanic White:1780 Median :1.3400 Median : 5.4000
## Non-Hispanic Black:1215 Mean :1.3879 Mean : 5.4044
## other : 975 3rd Qu.:1.6000 3rd Qu.: 6.3000
## Max. :3.5700 Max. :11.0000
##
## smoke gender
## never :2871 male :2380
## former :1066 female:2620
## current:1063
##
##
##
##
# get the summary of the "raw" imputed values
lapply(imp$imp, function(x)
summary(unlist(x))
)
## $height
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.409 1.540 1.590 1.600 1.671 1.814
##
## $BMI
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.28 25.78 29.41 30.77 34.56 54.92
##
## $hypten
## no yes
## 101 39
##
## $DM
## Length Class Mode
## 0 NULL NULL
##
## $cohort
## Length Class Mode
## 0 NULL NULL
##
## $DBP
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 37.74 62.53 71.54 70.67 79.66 103.95
##
## $bili
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.2129 0.5488 0.7627 0.7499 0.9493 1.8327
##
## $SBP
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 78.81 111.43 123.10 123.41 136.39 174.65
##
## $hypchol
## no yes
## 377 48
##
## $creat
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4300 0.7100 0.8400 0.8741 0.9700 1.9800
##
## $WC
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 54.53 84.40 94.05 97.87 107.30 178.05
##
## $age
## Length Class Mode
## 0 NULL NULL
##
## $HyperMed
## Mode NA's
## logical 3650
##
## $weight
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 26.21 68.05 80.96 82.80 97.27 127.55
##
## $chol
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.400 4.392 5.023 5.054 5.824 8.451
##
## $educ
## Less than 9th grade 9-11th grade High school graduate
## 2 1 2
## some college College or above
## 0 0
##
## $alc
## 0 <=1 1-3 3-7 >7
## 328 629 181 114 128
##
## $albu
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.931 4.048 4.273 4.268 4.501 5.217
##
## $race
## Length Class Mode
## 0 NULL NULL
##
## $HDL
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2401 1.1174 1.4177 1.3853 1.6470 2.4943
##
## $uricacid
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.5986 4.5398 5.3976 5.3840 6.4140 9.4719
##
## $smoke
## never former current
## 1 1 3
##
## $gender
## Length Class Mode
## 0 NULL NULL
# look into the distribution of SBP and height
densityplot(imp, ~SBP|hypten)
densityplot(imp, ~height|gender)
# plot for all categorical variables
probplot(imp)
## Warning: attributes are not identical across measure variables; they will
## be dropped
# look into the categorical variables where proportions of imputed and observed differ
probplot(imp, smoke ~ alc)
probplot(imp, alc ~ gender)
Unfortunately, we have some negative imputed values for bili
. Often, this would not result in bias in the analysis, but may be difficult to explain when providing a summary of the imputed data in a publication. In the present example we can see that the observed values have a slightly right-skewed distribution, compared to the imputed values. Re-doing the imputation with pmm
instead of norm
for bili
should fix this. (However, since the imputations seem fine overall, and there is little knowledge gain in re-doing the previous steps, we will skip this repetition in this practical.)
The distributions of the imputed values for height
and SBP
differ a bit from the distributions of the observed data, but these differences can be explained by adjusting for gender
and hypten
.
We also imputed a larger proportion than might have been expected in the highest category of alc
, and the distribution of values for educ
and smoke
looks a bit weird.
The difference in distribution of alc
could be explained by gender
. smoke
and educ
each have only one missing value, which makes it difficult to judge the distribution of imputed values. Adjusting the distributions of observed smoke
for alc
shows that the subject with missing smoking status was a drinker, and drinkers were more often in the current
group of smoking.
mids
objectsWhen we are confident that the imputation was successfull, the imputed data can be analysed with any complete data method.
Choose any analysis model of interest and fit it on the imputed data. The model could for instance be a linear regression model (using lm
) or a generalized linear regression model, e.g., logistic regression (using glm()
). Find out the class
and structure of the resulting object.
imp
is not a dataframe but an object of class mids
, so we cannot use the normal specification of a model in which we specify the argument data
to be a data.frame
. Instead, use with(imp, <model function without data>)
.
# for example:
models1 <- with(imp, lm(SBP ~ age + gender + bili + chol + BMI))
models2 <- with(imp, glm(hypchol ~ age + gender + bili + race + BMI,
family = "binomial"))
models3 <- with(imp, glm(creat ~ age + gender + uricacid + WC + BMI,
family = Gamma(link = 'log')))
# to determine the class
class(models1)
## [1] "mira" "matrix"
# to determine the structure
names(models1)
## [1] "call" "call1" "nmis" "analyses"
lapply(models1, class)
## $call
## [1] "call"
##
## $call1
## [1] "call"
##
## $nmis
## [1] "integer"
##
## $analyses
## [1] "list"
# explore further
models1$analyses
## [[1]]
##
## Call:
## lm(formula = SBP ~ age + gender + bili + chol + BMI)
##
## Coefficients:
## (Intercept) age gender2 bili chol
## 95.4107 0.4304 -6.5005 -1.3925 0.3092
## BMI
## 0.2778
##
##
## [[2]]
##
## Call:
## lm(formula = SBP ~ age + gender + bili + chol + BMI)
##
## Coefficients:
## (Intercept) age gender2 bili chol
## 95.1448 0.4258 -6.2289 -2.0012 0.4450
## BMI
## 0.2858
##
##
## [[3]]
##
## Call:
## lm(formula = SBP ~ age + gender + bili + chol + BMI)
##
## Coefficients:
## (Intercept) age gender2 bili chol
## 93.9945 0.4251 -6.4535 -0.6079 0.5089
## BMI
## 0.2893
##
##
## [[4]]
##
## Call:
## lm(formula = SBP ~ age + gender + bili + chol + BMI)
##
## Coefficients:
## (Intercept) age gender2 bili chol
## 94.7578 0.4290 -6.0788 -0.5295 0.4211
## BMI
## 0.2538
##
##
## [[5]]
##
## Call:
## lm(formula = SBP ~ age + gender + bili + chol + BMI)
##
## Coefficients:
## (Intercept) age gender2 bili chol
## 94.9890 0.4357 -6.4453 -1.0361 0.3775
## BMI
## 0.2665
summary(models1$analyses[[1]])
##
## Call:
## lm(formula = SBP ~ age + gender + bili + chol + BMI)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.168 -9.650 -0.956 8.350 73.810
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 95.41066 3.61149 26.419 < 2e-16 ***
## age 0.43038 0.02833 15.191 < 2e-16 ***
## gender2 -6.50051 0.95873 -6.780 2.05e-11 ***
## bili -1.39252 1.63938 -0.849 0.395851
## chol 0.30919 0.45987 0.672 0.501526
## BMI 0.27779 0.07553 3.678 0.000248 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.65 on 994 degrees of freedom
## Multiple R-squared: 0.2344, Adjusted R-squared: 0.2306
## F-statistic: 60.87 on 5 and 994 DF, p-value: < 2.2e-16
mira
objectsAnalyses performed on mids
objects will result in a mira
object (multiply imputed repeated analyses).
mira
object is a list with the following elements:
call :
|
The call that created the mira object.
|
call1 :
|
The call that created the mids object.
|
nmis :
|
The number of missing observations per column. |
analyses :
|
The models fitted on each of the imputed datasets: A list with imp$m components, where each component contains a fitted model object.
|
The class of each of the components of imp$analyses
will depend on the type of model that was fitted.
To pool the results from the repeated analyses contained in a mira
object, the function pool()
can be used.
Pool one of the provided mira
objects (models1
, models2
or models3
) and investigate its structure.
# pool the model
pooled1 <- pool(models1)
# investigate the structure
class(pooled1)
## [1] "mipo" "mira" "matrix"
names(pooled1)
## [1] "call" "call1" "call2" "nmis" "m" "qhat" "u"
## [8] "qbar" "ubar" "b" "t" "r" "dfcom" "df"
## [15] "fmi" "lambda"
str(pooled1)
## List of 16
## $ call : language pool(object = models1)
## $ call1 : language with.mids(data = imp, expr = lm(SBP ~ age + gender + bili + chol + BMI))
## $ call2 : language mice(data = NHANES, m = 5, method = meth, predictorMatrix = pred, visitSequence = visSeq, maxit = 10, seed = 2018)
## $ nmis : Named int [1:23] 20 30 28 0 0 41 87 41 85 86 ...
## ..- attr(*, "names")= chr [1:23] "height" "BMI" "hypten" "DM" ...
## $ m : int 5
## $ qhat : num [1:5, 1:6] 95.4 95.1 94 94.8 95 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:5] "1" "2" "3" "4" ...
## .. ..$ : chr [1:6] "(Intercept)" "age" "gender2" "bili" ...
## $ u : num [1:5, 1:6, 1:6] 13 13.4 13 13.1 13 ...
## ..- attr(*, "dimnames")=List of 3
## .. ..$ : chr [1:5] "1" "2" "3" "4" ...
## .. ..$ : chr [1:6] "(Intercept)" "age" "gender2" "bili" ...
## .. ..$ : chr [1:6] "(Intercept)" "age" "gender2" "bili" ...
## $ qbar : Named num [1:6] 94.859 0.429 -6.341 -1.113 0.412 ...
## ..- attr(*, "names")= chr [1:6] "(Intercept)" "age" "gender2" "bili" ...
## $ ubar : num [1:6, 1:6] 13.1216 -0.0259 -0.5455 -2.6786 -0.994 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:6] "(Intercept)" "age" "gender2" "bili" ...
## .. ..$ : chr [1:6] "(Intercept)" "age" "gender2" "bili" ...
## $ b : num [1:6, 1:6] 0.290113 0.001025 0.000302 -0.22342 -0.03413 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:6] "(Intercept)" "age" "gender2" "bili" ...
## .. ..$ : chr [1:6] "(Intercept)" "age" "gender2" "bili" ...
## $ t : num [1:6, 1:6] 13.4698 -0.0247 -0.5452 -2.9467 -1.0349 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:6] "(Intercept)" "age" "gender2" "bili" ...
## .. ..$ : chr [1:6] "(Intercept)" "age" "gender2" "bili" ...
## $ r : Named num [1:6] 0.0265 0.027 0.0424 0.1665 0.0313 ...
## ..- attr(*, "names")= chr [1:6] "(Intercept)" "age" "gender2" "bili" ...
## $ dfcom : int 994
## $ df : Named num [1:6] 832 828 683 159 788 ...
## ..- attr(*, "names")= chr [1:6] "(Intercept)" "age" "gender2" "bili" ...
## $ fmi : Named num [1:6] 0.0282 0.0287 0.0435 0.1533 0.0328 ...
## ..- attr(*, "names")= chr [1:6] "(Intercept)" "age" "gender2" "bili" ...
## $ lambda: Named num [1:6] 0.0258 0.0263 0.0407 0.1428 0.0303 ...
## ..- attr(*, "names")= chr [1:6] "(Intercept)" "age" "gender2" "bili" ...
## - attr(*, "class")= chr [1:3] "mipo" "mira" "matrix"
mipo
objectsPooling of a mira
object results in a mipo
object (multiply imputed pooled analysis).
mipo
object is a list that has the following elements:
call :
|
The call that created the mipo object.
|
call1 :
|
The call that created the mids object.
|
call2 :
|
The call that created the mira object.
|
data
|
The original, incomplete data. |
nmis :
|
The number of missing values per variable. |
m :
|
The number of imputations. |
qhat :
|
The coefficients from all analyses: a matrix with m rows and npar columns.
|
u :
|
The variance-covariance matrix of the coefficients from each analysis: an array of dimension c(m, npar, npar) .
|
qbar :
|
The pooled parameter estimates: a vector of length npar .
|
ubar :
|
The average within imputation variance: a matrix with npar rows and npar columns containing the averaged variance-covariance matrices.
|
b :
|
The between imputation variance-covariance matrix (\(u + b + b/m\)). |
t :
|
The total variance-covariance matrix. |
r :
|
The relative increase in variance due to the missing values. |
dfcom :
|
Degrees of freedom in the hypothetically complete data (N - # free parameters ).
|
df :
|
Degrees of freedom associated with the \(t\)-statistics. |
fmi :
|
Fraction of missing information. |
lambda :
|
Proportion of the variation due to the missing values: \((b+b/m)/t\). |
From the elements contained in the mipo
object, we could obtain the pooled coefficients (qbar
) and variances (t
), and calculate the pooled confidence intervals and p-values by hand, using the function provided in the course notes.
However, the function summary()
applied to the mipo
object will do this for us.
Note:
pool()
extracts the model coefficients and variance-covariance matrices using the functions coef()
and vcov()
. Hence, pooling using the pool()
function from mice only works for those types of models for which these functions will return the regression coefficients and corresponding variance-covariance matrix.
Moreover, pool()
only works for objects of class mira
. When data has been imputed in a different package/software, or analyses have not been performed in the way described above, a list of fitted models can be converted to a mira
object using as.mira()
.
Get the summary of the pooled results.
pooled1 <- pool(models1)
summary(pooled1)
## est se t df Pr(>|t|)
## (Intercept) 94.8593413 3.67011837 25.8463983 832.0829 0.000000e+00
## age 0.4291836 0.02880572 14.8992499 827.5060 0.000000e+00
## gender2 -6.3414040 0.97983585 -6.4719045 682.5690 1.848048e-10
## bili -1.1134519 1.75656645 -0.6338798 159.4478 5.270676e-01
## chol 0.4123433 0.46983785 0.8776290 787.5224 3.804127e-01
## BMI 0.2746369 0.07805927 3.5183129 670.5039 4.636353e-04
## lo 95 hi 95 nmis fmi lambda
## (Intercept) 87.6555629 102.0631196 NA 0.02817875 0.02584568
## age 0.3726428 0.4857245 0 0.02866178 0.02631698
## gender2 -8.2652584 -4.4175497 NA 0.04350338 0.04070484
## bili -4.5825893 2.3556854 87 0.15332223 0.14276830
## chol -0.5099394 1.3346260 85 0.03279790 0.03034470
## BMI 0.1213669 0.4279069 30 0.04476185 0.04191677