Getting to know the data

The first steps are to load the package mice and the data we want to impute. For this example, we will use the NHANES dataset.

To load this dataset, you can use the command file.choose() which opens the explorer and allows you to navigate to the location of the file NHANES_1.RData. If you know the path to the file, you can also use load("<path>/NHANES_1.RData").

Note:
To display the help file for any function, type a ? followed by the name of the function in one of the code fields and click on “Run Code”.

Alternatively, you can look up the help pages online at https://www.rdocumentation.org/ or find the whole manual for a package at https://cran.r-project.org/web/packages/available_packages_by_name.html

Dataset

Task

The mice package needs to be loaded with the command library(). Let’s take a first look at the data. Useful functions are dim(), head(), str() and summary().

Solution

library(mice)
dim(NHANES)
## [1] 1000   23
head(NHANES)
str(NHANES)
## 'data.frame':    1000 obs. of  23 variables:
##  $ height  : num  1.65 1.8 1.65 1.68 1.57 ...
##  $ BMI     : num  22.5 24.4 18 22.6 30.2 ...
##  $ hypten  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 NA 1 2 1 ...
##  $ DM      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 2 1 ...
##  $ cohort  : chr  "2011" "2011" "2011" "2011" ...
##  $ DBP     : num  74.7 73.3 75.3 68.7 82.7 ...
##  $ bili    : num  NA NA 1 0.5 NA 0.6 1.1 0.6 0.4 0.9 ...
##  $ SBP     : num  131 123 105 123 125 ...
##  $ hypchol : Factor w/ 2 levels "no","yes": NA NA 1 1 NA 1 1 1 1 1 ...
##  $ creat   : num  NA NA 0.96 0.75 NA 0.7 0.72 0.86 0.6 0.94 ...
##  $ WC      : num  79.2 89.4 70.2 89.9 105.2 ...
##  $ age     : num  61 20 24 30 61 26 61 48 57 40 ...
##  $ HyperMed: Ord.factor w/ 3 levels "no"<"previous"<..: NA NA NA NA NA NA NA NA 3 NA ...
##  $ weight  : num  61.2 79.4 49 63.5 74.8 ...
##  $ chol    : num  NA NA 4.99 4.84 NA 4.78 4.09 4.84 4.68 4.86 ...
##  $ educ    : Factor w/ 5 levels "Less than 9th grade",..: 3 3 3 4 5 4 2 3 5 1 ...
##  $ alc     : Ord.factor w/ 5 levels "0"<"<=1"<"1-3"<..: 1 4 NA 5 3 5 3 5 2 4 ...
##  $ albu    : num  NA NA 4 4.7 NA 4.1 4.8 4.3 3.5 4.3 ...
##  $ race    : Factor w/ 5 levels "Mexican American",..: 1 5 3 5 2 5 5 3 3 3 ...
##  $ HDL     : num  NA NA 1.34 2.02 NA 1.22 1.32 1.09 0.8 1.5 ...
##  $ uricacid: num  NA NA 4.4 5 NA 7.3 5.8 5.2 5 4.9 ...
##  $ smoke   : Ord.factor w/ 3 levels "never"<"former"<..: 2 2 1 1 1 3 2 3 1 3 ...
##  $ gender  : Factor w/ 2 levels "male","female": 1 1 2 2 2 2 1 1 2 1 ...
summary(NHANES)
##      height           BMI         hypten      DM         cohort         
##  Min.   :1.397   Min.   :16.27   no  :637   no :843   Length:1000       
##  1st Qu.:1.600   1st Qu.:23.33   yes :335   yes:157   Class :character  
##  Median :1.676   Median :26.61   NA's: 28             Mode  :character  
##  Mean   :1.679   Mean   :27.71                                          
##  3rd Qu.:1.753   3rd Qu.:30.67                                          
##  Max.   :1.930   Max.   :62.35                                          
##  NA's   :20      NA's   :30                                             
##       DBP              bili             SBP         hypchol   
##  Min.   : 12.67   Min.   :0.2000   Min.   : 81.33   no  :815  
##  1st Qu.: 64.00   1st Qu.:0.6000   1st Qu.:108.67   yes :100  
##  Median : 70.67   Median :0.7000   Median :118.00   NA's: 85  
##  Mean   : 70.50   Mean   :0.7492   Mean   :120.25             
##  3rd Qu.: 77.33   3rd Qu.:0.9000   3rd Qu.:128.67             
##  Max.   :110.00   Max.   :4.1000   Max.   :207.33             
##  NA's   :41       NA's   :87       NA's   :41                 
##      creat              WC              age            HyperMed  
##  Min.   :0.3400   Min.   : 64.50   Min.   :20.00   no      : 29  
##  1st Qu.:0.6900   1st Qu.: 85.00   1st Qu.:32.00   previous: 28  
##  Median :0.8200   Median : 94.95   Median :45.00   yes     :213  
##  Mean   :0.8559   Mean   : 96.59   Mean   :46.54   NA's    :730  
##  3rd Qu.:0.9600   3rd Qu.:105.78   3rd Qu.:60.00                 
##  Max.   :7.4600   Max.   :152.40   Max.   :79.00                 
##  NA's   :86       NA's   :50                                     
##      weight            chol                        educ       alc     
##  Min.   : 38.56   Min.   :2.07   Less than 9th grade : 78   0   :139  
##  1st Qu.: 63.50   1st Qu.:4.34   9-11th grade        :157   <=1 :276  
##  Median : 76.66   Median :4.99   High school graduate:216   1-3 :113  
##  Mean   : 78.44   Mean   :5.02   some college        :289   3-7 : 82  
##  3rd Qu.: 89.36   3rd Qu.:5.61   College or above    :259   >7  :114  
##  Max.   :167.83   Max.   :9.93   NA's                :  1   NA's:276  
##  NA's   :17       NA's   :85                                          
##       albu                       race          HDL           uricacid     
##  Min.   :2.600   Mexican American  : 96   Min.   :0.360   Min.   : 1.800  
##  1st Qu.:4.100   Other Hispanic    :110   1st Qu.:1.110   1st Qu.: 4.400  
##  Median :4.300   Non-Hispanic White:356   Median :1.340   Median : 5.400  
##  Mean   :4.277   Non-Hispanic Black:243   Mean   :1.388   Mean   : 5.406  
##  3rd Qu.:4.500   other             :195   3rd Qu.:1.580   3rd Qu.: 6.300  
##  Max.   :5.200                            Max.   :3.570   Max.   :11.000  
##  NA's   :86                               NA's   :85      NA's   :86      
##      smoke        gender   
##  never  :574   male  :476  
##  former :213   female:524  
##  current:212               
##  NA's   :  1               
##                            
##                            
## 

Variable coding

It is important to check that all variables are coded correctly, i.e., have the correct variable class specified, because mice() automatically selects imputation methods based on each variable’s class. When importing data from other software it can happen that factors become continuous variables or that ordered factors lose their ordering.

str() showed that in the NHANES data smoke and alc are correctly specified as ordinal variables, but educ is an unordered factor.

Task

Using levels(NHANES$educ) we can print the names of the categories of educ. Convert the unordered factor to an ordered factor, for example using as.ordered(). Afterwards, check if the conversion was successful.

Explore the missing data pattern of the NHANES data.

Solution 1

mdp <- md.pattern(NHANES)
mdp
##     DM cohort age race gender educ smoke weight height hypten BMI DBP SBP
## 143  1      1   1    1      1    1     1      1      1      1   1   1   1
##   9  1      1   1    1      1    1     1      1      1      1   1   1   1
## 483  1      1   1    1      1    1     1      1      1      1   1   1   1
##  60  1      1   1    1      1    1     1      1      1      1   1   1   1
##   1  1      1   1    1      1    1     0      1      1      1   1   1   1
##   3  1      1   1    1      1    1     1      1      0      1   0   1   1
##   4  1      1   1    1      1    1     1      1      1      1   1   0   0
##   2  1      1   1    1      1    1     1      1      1      0   1   1   1
##   6  1      1   1    1      1    1     1      1      1      1   1   1   1
##   1  1      1   1    1      1    1     1      0      1      1   0   1   1
##   5  1      1   1    1      1    1     1      1      1      1   1   1   1
## 146  1      1   1    1      1    1     1      1      1      1   1   1   1
##   2  1      1   1    1      1    1     1      1      0      1   0   1   1
##   3  1      1   1    1      1    1     1      1      1      1   1   0   0
##   1  1      1   1    1      1    1     1      1      1      1   1   1   1
##   3  1      1   1    1      1    1     1      1      1      1   1   1   1
##   3  1      1   1    1      1    1     1      0      1      1   0   1   1
##   8  1      1   1    1      1    1     1      1      1      0   1   0   0
##   5  1      1   1    1      1    1     1      0      0      1   0   1   1
##   3  1      1   1    1      1    1     1      1      1      1   1   0   0
##   4  1      1   1    1      1    1     1      1      0      1   0   1   1
##   2  1      1   1    1      1    1     1      0      1      1   0   1   1
##   1  1      1   1    1      1    1     1      1      1      0   1   0   0
##   8  1      1   1    1      1    1     1      1      1      0   1   0   0
##   1  1      1   1    1      1    1     1      0      0      1   0   1   1
##   1  1      1   1    1      1    1     1      1      1      1   1   1   1
##   1  1      1   1    1      1    1     1      1      0      1   0   0   0
##   3  1      1   1    1      1    1     1      1      1      0   1   0   0
##   2  1      1   1    1      1    1     1      1      0      0   0   0   0
##  17  1      1   1    1      1    1     1      1      1      1   1   1   1
##   1  1      1   1    1      1    1     1      0      0      0   0   0   0
##   2  1      1   1    1      1    1     1      1      1      1   1   1   1
##  31  1      1   1    1      1    1     1      1      1      1   1   1   1
##   6  1      1   1    1      1    1     1      1      1      1   1   1   1
##   1  1      1   1    1      1    1     1      1      1      1   1   1   1
##   3  1      1   1    1      1    1     1      1      1      1   1   1   1
##   7  1      1   1    1      1    1     1      1      1      1   1   1   1
##   1  1      1   1    1      1    1     1      1      1      1   1   0   0
##   1  1      1   1    1      1    1     1      1      0      1   0   1   1
##   1  1      1   1    1      1    1     1      1      1      1   1   0   0
##   6  1      1   1    1      1    1     1      1      1      1   1   1   1
##   1  1      1   1    1      1    1     1      0      1      1   0   1   1
##   2  1      1   1    1      1    1     1      1      1      1   1   0   0
##   1  1      1   1    1      1    1     1      0      1      1   0   1   1
##   1  1      1   1    1      1    1     1      0      1      1   0   1   1
##   1  1      1   1    1      1    1     1      1      1      0   1   0   0
##   1  1      1   1    1      1    1     1      1      1      0   1   0   0
##   1  1      1   1    1      1    1     1      0      1      1   0   1   1
##   1  1      1   1    1      1    0     1      1      1      0   1   0   0
##      0      0   0    0      0    1     1     17     20     28  30  41  41
##     WC hypchol chol HDL creat albu uricacid bili alc HyperMed     
## 143  1       1    1   1     1    1        1    1   1        1    0
##   9  0       1    1   1     1    1        1    1   1        1    1
## 483  1       1    1   1     1    1        1    1   1        0    1
##  60  1       1    1   1     1    1        1    1   0        1    1
##   1  1       1    1   1     1    1        1    1   1        1    1
##   3  1       1    1   1     1    1        1    1   1        1    2
##   4  1       1    1   1     1    1        1    1   1        1    2
##   2  1       1    1   1     1    1        1    1   1        0    2
##   6  0       1    1   1     1    1        1    1   1        0    2
##   1  1       1    1   1     1    1        1    1   1        1    2
##   5  0       1    1   1     1    1        1    1   0        1    2
## 146  1       1    1   1     1    1        1    1   0        0    2
##   2  1       1    1   1     1    1        1    1   1        0    3
##   3  1       1    1   1     1    1        1    1   0        1    3
##   1  1       1    1   1     1    1        1    0   0        0    3
##   3  0       1    1   1     1    1        1    1   0        0    3
##   3  1       1    1   1     1    1        1    1   0        1    3
##   8  1       1    1   1     1    1        1    1   1        0    4
##   5  1       1    1   1     1    1        1    1   1        0    4
##   3  0       1    1   1     1    1        1    1   0        1    4
##   4  1       1    1   1     1    1        1    1   0        0    4
##   2  1       1    1   1     1    1        1    1   0        0    4
##   1  0       1    1   1     1    1        1    1   1        0    5
##   8  1       1    1   1     1    1        1    1   0        0    5
##   1  1       1    1   1     1    1        1    1   0        0    5
##   1  1       1    1   1     0    0        0    0   1        0    5
##   1  0       1    1   1     1    1        1    1   0        1    6
##   3  0       1    1   1     1    1        1    1   0        0    6
##   2  1       1    1   1     1    1        1    1   0        0    7
##  17  1       0    0   0     0    0        0    0   1        1    7
##   1  1       1    1   1     1    1        1    1   0        0    8
##   2  0       0    0   0     0    0        0    0   1        1    8
##  31  1       0    0   0     0    0        0    0   1        0    8
##   6  1       0    0   0     0    0        0    0   0        1    8
##   1  0       0    0   0     0    0        0    0   1        0    9
##   3  0       0    0   0     0    0        0    0   0        1    9
##   7  1       0    0   0     0    0        0    0   0        0    9
##   1  0       0    0   0     0    0        0    0   1        1   10
##   1  1       0    0   0     0    0        0    0   1        0   10
##   1  1       0    0   0     0    0        0    0   0        1   10
##   6  0       0    0   0     0    0        0    0   0        0   10
##   1  1       0    0   0     0    0        0    0   0        1   10
##   2  0       0    0   0     0    0        0    0   0        1   11
##   1  0       0    0   0     0    0        0    0   0        1   11
##   1  1       0    0   0     0    0        0    0   0        0   11
##   1  0       0    0   0     0    0        0    0   1        0   12
##   1  1       0    0   0     0    0        0    0   0        0   12
##   1  0       0    0   0     0    0        0    0   0        0   12
##   1  0       0    0   0     0    0        0    0   1        0   13
##     50      85   85  85    86   86       86   87 276      730 1835

md.pattern() gives us a matrix where each row represents one missing data pattern (1 = observed, 0 = missing). The rownames show how many rows of the data have the given missing data pattern. The last column gives the number of missing values in the pattern, the last row the number of missing values per variable.

Task 2

Visualize and inspect the missing data pattern using one or more of the examples shown in the course slides.

Hint: Some functions you can try are JointAI::md_pattern(), VIM::aggr(), visdat::vis_dat(), visdat::vis_miss().

For JointAI::md_pattern() you can customize the y-axis by using the argument yaxis_pars, which is a list of arguments. To hide the counts use yaxis_pars = list(yaxt = "n") and printN = FALSE. To prevent the printed output use print = FALSE.

For more options check out the help pages of the functions.


Task 3

Now get an overview of how much missingness there is in each variable and the proportion of (in)complete cases.

The function complete.cases() will give you a vector that is TRUE if the the case is complete and FALSE if there are missing values.

is.na() returns TRUE if the value is missing, FALSE if the value is observed

colSums() calculates the sum of values in each column of a data.frame or matrix

colMeans() calculates the mean of values in each column of a data.frame or matrix

Solution 3

# number and proportion of complete cases
Ncc <- cbind(
  "#" = table(complete.cases(NHANES)),
  "%" = round(100 * table(complete.cases(NHANES))/nrow(NHANES), 2)
)
rownames(Ncc) <- c("incompl.", "complete")
Ncc
##            #    %
## incompl. 857 85.7
## complete 143 14.3
# number and proportion of missing values per variable
cbind("# NA" = sort(colSums(is.na(NHANES))),
      "% NA" = round(sort(colMeans(is.na(NHANES))) * 100, 2))
##          # NA % NA
## DM          0  0.0
## cohort      0  0.0
## age         0  0.0
## race        0  0.0
## gender      0  0.0
## educ        1  0.1
## smoke       1  0.1
## weight     17  1.7
## height     20  2.0
## hypten     28  2.8
## BMI        30  3.0
## DBP        41  4.1
## SBP        41  4.1
## WC         50  5.0
## hypchol    85  8.5
## chol       85  8.5
## HDL        85  8.5
## creat      86  8.6
## albu       86  8.6
## uricacid   86  8.6
## bili       87  8.7
## alc       276 27.6
## HyperMed  730 73.0

Task 4

Our data contains height, weight and BMI. Check the missing data pattern specifically for those three variables.

Solution 4

# Three solutions to choose from:
md.pattern(NHANES[, c("height", "weight", "BMI")])
##     weight height BMI   
## 970      1      1   1  0
##  13      1      0   0  2
##  10      0      1   0  2
##   7      0      0   0  3
##         17     20  30 67
with(NHANES, table(is.na(height), is.na(weight), is.na(BMI)))
## , ,  = FALSE
## 
##        
##         FALSE TRUE
##   FALSE   970    0
##   TRUE      0    0
## 
## , ,  = TRUE
## 
##        
##         FALSE TRUE
##   FALSE     0   10
##   TRUE     13    7
library(plyr)
ddply(NHANES, c(height = "ifelse(is.na(height), 'missing', 'observed')",
                weight = "ifelse(is.na(weight), 'missing', 'observed')",
                BMI = "ifelse(is.na(BMI), 'missing', 'observed')"),
      summarize,
      N = length(height)
)

As we have already seen in the lecture, there are quite a number of cases where only either height or weight is missing. BMI is only observed when both components are observed. To use all available information, we want to impute height and weight separately and calculate BMI from the imputed values.

Task 5

Because we might expect that hypertensive medication is only prescribed for patients with hypertension, we will also look into the variables HyperMed and hypten. Make a table of the two variables to confirm our expectation. Make sure that missing values are also displayed in that table!

Solution 5

with(NHANES, table(HyperMed, hypten, exclude = NULL))
##           hypten
## HyperMed    no yes <NA>
##   no         0  29    0
##   previous   0  28    0
##   yes        0 213    0
##   <NA>     637  65   28

Task 6

Furthermore, we can expect a systematic relationship between hypchol and chol. Find out how these two variables are related.

Solution 6

with(NHANES, plot(chol ~ hypchol))

with(NHANES, summary(chol[hypchol == "no"]))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   2.070   4.270   4.860   4.795   5.400   6.180      85
with(NHANES, summary(chol[hypchol == "yes"]))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   6.210   6.410   6.620   6.854   7.110   9.930      85

It seems that hypchol is defined as having chol > 6.2, which makes hypchol dependent on chol.

Distribution of the data

Before imputing the missing data it is important to take a look at how the incomplete variables are distributed so that we can choose appropriate imputation models.

Task

Visualize the distributions of the incomplete continuous and categorical variables. Pay attention to

  • whether continuous distributions deviate considerably from the normal distribution,
  • if variables have values close to the limits of the range they are defined in,
  • whether categorical variables are very unbalanced (i.e., some category very small).

Hint: You may want to use the functions hist(), table() and barplot().

Of course, you could also use the ggplot2 package.

Solution

# One option that allows to quickly get an overview of all variables,
# categorical and continuous

nc <- max(5, ceiling(sqrt(ncol(NHANES))))
nr <- ceiling(ncol(NHANES) / nc)
par(mfrow = c(nr, nc), mgp = c(2, 0.6, 0), mar = c(2, 3, 3, 0.5))
for (i in 1:ncol(NHANES)) {
  if (is.numeric(NHANES[, i])) {
    hist(NHANES[, i], nclass = 50, xlab = "",
         main = paste0(names(NHANES[i]), " (",
                       round(mean(is.na(NHANES[, i])) * 100, 2), "% NA)")
    )
  } else {
    barplot(table(NHANES[, i]), ylab = "Frequency",
            main = paste0(names(NHANES[i]), " (",
                          round(mean(is.na(NHANES[, i])) * 100, 2), "% NA)"))
  }
}

Imputation

Now that we know how our data and the missing values are distributed, we are ready to impute!

Set-up run

Task

First, do a set-up run of mice(), without any iterations (maxit = 0). This is to get all the objects that we may need to adjust in order to customize the imputation to our data.

If you need help to figure out what you need to specify take a look at the help page ?mice.

Solution

imp0 <- mice(NHANES, maxit = 0)

Note: This command will not produce any output.

Imputation method

There are many imputation methods available in mice. You can find the list in the help page of the mice() function. We will focus here on the following ones:

name description variable type
pmm Predictive mean matching any
norm Bayesian linear regression numeric
logreg Logistic regression factor, 2 levels
polyreg Polytomous logistic regression factor, >= 2 levels
polr Proportional odds model ordered, >= 2 levels

The default imputation methods that mice() selects can be specified in the argument defaultMethod.

If unspecified, mice will use

  • pmm for numerical columns,
  • logreg for factor columns with two categories,
  • polyreg for columns with unordered and
  • polr for columns with ordered factors with more than two categories.

Task 1

When a normal imputation model seems to be appropriate for most of the continuous covariates, you may want to specify norm as the default method in the setup run. Let’s do that:

The order of the types of variable is: continuous, binary, factor, ordered factor.

Solution 1

imp0 <- mice(NHANES, maxit = 0, 
             defaultMethod = c("norm", 'logreg', 'polyreg', 'polr'))

Task 2

Looking at the histograms we made for the continuous variables, we can see that the variable creat has a skewed distribution, so using a normal imputation model may not work well.

Let’s change the imputation method for creat so that it is imputed with predictive mean matching:

Solution 2

meth <- imp0$meth
meth["creat"] <- "pmm"

Predictor matrix

Task 1

We can now check out if we need to make any changes to the predictorMatrix argument.

Solution 1

pred <- imp0$predictorMatrix
pred
##          height BMI hypten DM cohort DBP bili SBP hypchol creat WC age
## height        0   1      1  1      0   1    1   1       1     1  1   1
## BMI           1   0      1  1      0   1    1   1       1     1  1   1
## hypten        1   1      0  1      0   1    1   1       1     1  1   1
## DM            0   0      0  0      0   0    0   0       0     0  0   0
## cohort        0   0      0  0      0   0    0   0       0     0  0   0
## DBP           1   1      1  1      0   0    1   1       1     1  1   1
## bili          1   1      1  1      0   1    0   1       1     1  1   1
## SBP           1   1      1  1      0   1    1   0       1     1  1   1
## hypchol       1   1      1  1      0   1    1   1       0     1  1   1
## creat         1   1      1  1      0   1    1   1       1     0  1   1
## WC            1   1      1  1      0   1    1   1       1     1  0   1
## age           0   0      0  0      0   0    0   0       0     0  0   0
## HyperMed      1   1      1  1      0   1    1   1       1     1  1   1
## weight        1   1      1  1      0   1    1   1       1     1  1   1
## chol          1   1      1  1      0   1    1   1       1     1  1   1
## educ          1   1      1  1      0   1    1   1       1     1  1   1
## alc           1   1      1  1      0   1    1   1       1     1  1   1
## albu          1   1      1  1      0   1    1   1       1     1  1   1
## race          0   0      0  0      0   0    0   0       0     0  0   0
## HDL           1   1      1  1      0   1    1   1       1     1  1   1
## uricacid      1   1      1  1      0   1    1   1       1     1  1   1
## smoke         1   1      1  1      0   1    1   1       1     1  1   1
## gender        0   0      0  0      0   0    0   0       0     0  0   0
##          HyperMed weight chol educ alc albu race HDL uricacid smoke gender
## height          1      1    1    1   1    1    1   1        1     1      1
## BMI             1      1    1    1   1    1    1   1        1     1      1
## hypten          1      1    1    1   1    1    1   1        1     1      1
## DM              0      0    0    0   0    0    0   0        0     0      0
## cohort          0      0    0    0   0    0    0   0        0     0      0
## DBP             1      1    1    1   1    1    1   1        1     1      1
## bili            1      1    1    1   1    1    1   1        1     1      1
## SBP             1      1    1    1   1    1    1   1        1     1      1
## hypchol         1      1    1    1   1    1    1   1        1     1      1
## creat           1      1    1    1   1    1    1   1        1     1      1
## WC              1      1    1    1   1    1    1   1        1     1      1
## age             0      0    0    0   0    0    0   0        0     0      0
## HyperMed        0      1    1    1   1    1    1   1        1     1      1
## weight          1      0    1    1   1    1    1   1        1     1      1
## chol            1      1    0    1   1    1    1   1        1     1      1
## educ            1      1    1    0   1    1    1   1        1     1      1
## alc             1      1    1    1   0    1    1   1        1     1      1
## albu            1      1    1    1   1    0    1   1        1     1      1
## race            0      0    0    0   0    0    0   0        0     0      0
## HDL             1      1    1    1   1    1    1   0        1     1      1
## uricacid        1      1    1    1   1    1    1   1        0     1      1
## smoke           1      1    1    1   1    1    1   1        1     0      1
## gender          0      0    0    0   0    0    0   0        0     0      0

Task 2

Remember that we wanted to impute height and weight separately? But let’s use BMI to impute the other variables. And since HyperMed does not give us a lot more information than hypten, but has a lot more missing values, we do not want to use it as a predictor variable.

Go ahead and apply the necessary changes to pred and meth:

For passive imputation, you need to specify the formula used to calculate BMI in meth using "~I(...)".

Solution 2

pred[c("height", "weight"), "BMI"] <- 0
pred[, c("height", "weight")] <- 0
pred["height", "weight"] <- 1
pred["weight", "height"] <- 1

pred[, "HyperMed"] <- 0

pred["chol", "hypchol"] <- 0


meth["BMI"] <- "~I(weight/height^2)"
meth["HyperMed"] <- ""

Visit sequence

Task

To make sure that the imputed values of BMI match the imputed values of height and weight,BMI needs to be imputed after height and weight. Get the visitSequence from imp0 (using imp0$visitSequence) and change it if necessary.

Solution

visSeq <- imp0$visitSequence

# the lazy version:
visSeq <- c(visSeq[-2], visSeq[2])

# the save version:
# To make sure you still do the correct thing if the order of the columns in 
# NHANES would change
which_BMI <- match("BMI", names(visSeq))
visSeq <- c(visSeq[-which_BMI], visSeq[which_BMI])

Run the imputation

Task

With the changes that we have made to the predictorMatrix and method, we can now perform the imputation. Use m = 5 and maxit = 10.

Solution

imp <- mice(NHANES, method = meth, predictorMatrix = pred, visitSequence = visSeq,
            maxit = 10, m = 5,
            seed = 2018)
## 
##  iter imp variable
##   1   1  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   1   2  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   1   3  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   1   4  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   1   5  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   2   1  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   2   2  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   2   3  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   2   4  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   2   5  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   3   1  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   3   2  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   3   3  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   3   4  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   3   5  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   4   1  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   4   2  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   4   3  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   4   4  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   4   5  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   5   1  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   5   2  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   5   3  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   5   4  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   5   5  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   6   1  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   6   2  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   6   3  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   6   4  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   6   5  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   7   1  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   7   2  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   7   3  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   7   4  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   7   5  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   8   1  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   8   2  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   8   3  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   8   4  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   8   5  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   9   1  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   9   2  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   9   3  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   9   4  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   9   5  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   10   1  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   10   2  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   10   3  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   10   4  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI
##   10   5  height  hypten  DBP  bili  SBP  hypchol  creat  WC  HyperMed  weight  chol  educ  alc  albu  HDL  uricacid  smoke  BMI

mice() prints the name of the variable being imputed for each iteration and imputation. If you run mice() on your own computer the output will show up continuously. There, you may notice that imputation is slowest for categorical variables, especially when they have many categories.

You can hide the lengthy output by specifying printFlag = FALSE.

Investigate the imputation

Type of class

Task

mice() does not return a data.frame. Find out the class of the object returned by mice() function using the function class(), and take a look at the help file for this class:

Solution

class(imp)
## [1] "mids"

Elements of the mids object

We see that imp is an object of class mids.

The help file tells us that a mids object is a list with several elements:

call: The call that created the mids object.
data: A copy of the incomplete data set.
where: The missingness indicator matrix.
m: The number of imputations.
nmis: The number of missing observations per variable.
imp: The imputed values: A list of ncol(data) components,each list component is a matrix with nmis[j] rows and m columns.
method: The vector imputation methods.
predictorMatrix: The predictor matrix.
visitSequence: The sequence in which columns are visited during imputation.
post: A vector of strings of length ncol(data) with commands for post-processing.
seed: The seed value of the solution.
iteration: The number of iterations.
lastSeedValue: The most recent seed value.
chainMean: The mean of imputed values per variable and iteration: a list of m components. Each component is a matrix with maxitcolumns and length(visitSequence) rows.
chainVar: The variances of imputed values per variable and iteration(same structure as chainMean).
loggedEvents: A matrix with the record of automatic removal actions (details below); (NULL if no action was made).
pad: A list containing the internally used version of the data,method, predictorMatrix, visitSequence, post and dummy coding.

Details of the loggedEvents:

mice() does some pre-processing of the data:

  • variables containing missing values, that are not imputed but used as predictor are removed
  • constant variables are removed
  • collinear variables are removed

Furthermore, during each iteration

  • variables that are linearly dependent are removed
  • polr imputation that does not converge is replaced by polyreg.
The matrix in loggedEvents has the following columns:
it iteration number
im imputation number
co column number in the data
dep name of the name of the variable being imputed
meth imputation method used
out character vector with names of altered/removed predictors

Task

It is good practice to make sure that mice() has not done any processing of the data that was not planned or that you are not aware of. This means checking the loggedEvents, but also ensuring that the correct method, predictorMatrix and visitSequence were used.

Do these checks for imp.

Solution

imp$method
##                height                   BMI                hypten 
##                "norm" "~I(weight/height^2)"              "logreg" 
##                    DM                cohort                   DBP 
##                    ""                    ""                "norm" 
##                  bili                   SBP               hypchol 
##                "norm"                "norm"              "logreg" 
##                 creat                    WC                   age 
##                 "pmm"                "norm"                    "" 
##              HyperMed                weight                  chol 
##                    ""                "norm"                "norm" 
##                  educ                   alc                  albu 
##                "polr"                "polr"                "norm" 
##                  race                   HDL              uricacid 
##                    ""                "norm"                "norm" 
##                 smoke                gender 
##                "polr"                    ""
imp$predictorMatrix
##          height BMI hypten DM cohort DBP bili SBP hypchol creat WC age
## height        0   0      1  1      0   1    1   1       1     1  1   1
## BMI           0   0      1  1      0   1    1   1       1     1  1   1
## hypten        0   1      0  1      0   1    1   1       1     1  1   1
## DM            0   0      0  0      0   0    0   0       0     0  0   0
## cohort        0   0      0  0      0   0    0   0       0     0  0   0
## DBP           0   1      1  1      0   0    1   1       1     1  1   1
## bili          0   1      1  1      0   1    0   1       1     1  1   1
## SBP           0   1      1  1      0   1    1   0       1     1  1   1
## hypchol       0   1      1  1      0   1    1   1       0     1  1   1
## creat         0   1      1  1      0   1    1   1       1     0  1   1
## WC            0   1      1  1      0   1    1   1       1     1  0   1
## age           0   0      0  0      0   0    0   0       0     0  0   0
## HyperMed      0   1      1  1      0   1    1   1       1     1  1   1
## weight        1   0      1  1      0   1    1   1       1     1  1   1
## chol          0   1      1  1      0   1    1   1       0     1  1   1
## educ          0   1      1  1      0   1    1   1       1     1  1   1
## alc           0   1      1  1      0   1    1   1       1     1  1   1
## albu          0   1      1  1      0   1    1   1       1     1  1   1
## race          0   0      0  0      0   0    0   0       0     0  0   0
## HDL           0   1      1  1      0   1    1   1       1     1  1   1
## uricacid      0   1      1  1      0   1    1   1       1     1  1   1
## smoke         0   1      1  1      0   1    1   1       1     1  1   1
## gender        0   0      0  0      0   0    0   0       0     0  0   0
##          HyperMed weight chol educ alc albu race HDL uricacid smoke gender
## height          0      1    1    1   1    1    1   1        1     1      1
## BMI             0      0    1    1   1    1    1   1        1     1      1
## hypten          0      0    1    1   1    1    1   1        1     1      1
## DM              0      0    0    0   0    0    0   0        0     0      0
## cohort          0      0    0    0   0    0    0   0        0     0      0
## DBP             0      0    1    1   1    1    1   1        1     1      1
## bili            0      0    1    1   1    1    1   1        1     1      1
## SBP             0      0    1    1   1    1    1   1        1     1      1
## hypchol         0      0    1    1   1    1    1   1        1     1      1
## creat           0      0    1    1   1    1    1   1        1     1      1
## WC              0      0    1    1   1    1    1   1        1     1      1
## age             0      0    0    0   0    0    0   0        0     0      0
## HyperMed        0      0    1    1   1    1    1   1        1     1      1
## weight          0      0    1    1   1    1    1   1        1     1      1
## chol            0      0    0    1   1    1    1   1        1     1      1
## educ            0      0    1    0   1    1    1   1        1     1      1
## alc             0      0    1    1   0    1    1   1        1     1      1
## albu            0      0    1    1   1    0    1   1        1     1      1
## race            0      0    0    0   0    0    0   0        0     0      0
## HDL             0      0    1    1   1    1    1   0        1     1      1
## uricacid        0      0    1    1   1    1    1   1        0     1      1
## smoke           0      0    1    1   1    1    1   1        1     0      1
## gender          0      0    0    0   0    0    0   0        0     0      0
imp$visitSequence
##   height   hypten      DBP     bili      SBP  hypchol    creat       WC 
##        1        3        6        7        8        9       10       11 
## HyperMed   weight     chol     educ      alc     albu      HDL uricacid 
##       13       14       15       16       17       18       20       21 
##    smoke      BMI 
##       22        2
imp$loggedEvents

Logged events

Checking the loggedEvents we see that for alc, educ and smoke the method multinom was used. That is a bit confusing: we specified polr in our imputation method and imp$method is indeed polr for the three ordinal variables.

The mice help page told us that in case polr does not converge polyreg will be used.

Task 1

Find out if there is more information in the help pages for those two functions.

Using polr and polyreg when specifying the method argument in mice actually calls the functions mice.impute.polr and mice.impute.polyreg when running the imputation.

Solution 1

The help for mice.impute.polr() tells us:

The call to polr might fail, usually because the data are very sparse. In that case, multinom is tried as a fallback, and a record is written to the loggedEvents component of the mids object.

and the help for mice.impute.polyreg() says:

The algorithm of mice.impute.polyreg uses the function multinom() from the nnet package.

So that explains it, everything seems to be in order, but unfortunately the ordinal models for alc, educ and smoke did not converge and multinomial models had to be used instead.

Task 2

Let’s see what else would have come up if we had not prepared the predictorMatrix, method and visitSequence before imputation. The object impnaive contains the result of

impnaive <- mice(NHANES, m = 5, maxit = 10)

Take a look at the loggedEvents of impnaive.

Solution 2

impnaive$loggedEvents

The loggedEvents of the “naive” imputation show that the constant variable cohort was excluded before the imputation (as it should be).

Furthermore, in the imputation model for HyperMed, the variable hypten.1 was excluded. hypten.1 is the first dummy variable belonging to hypten. We can check this by looking at impnaive$pad$categories:

impnaive$pad$categories

Here, we see in row 3 that hypten has 1 dummy variable and that hypten.1 (row 24) is a dummy variable, belonging to column 3 in the dataset. Hence, hypten was not used in the imputation of HyperMed.


Task 3

We also did not change the visitSequence in impnaive. Find out what the effects of that are.

Solution 3

naiveDF1 <- complete(impnaive, 1)
naivecalcBMI <- with(naiveDF1, weight/height^2)

impDF1 <- complete(imp, 1)
impcalcBMI <- with(impDF1, weight/height^2)

cbind(naiveBMI = naiveDF1$BMI, naivecalcBMI,
      impBMI = impDF1$BMI, impcalcBMI)[which(is.na(NHANES$BMI)), ]
##       naiveBMI naivecalcBMI   impBMI impcalcBMI
##  [1,] 34.70152     36.52053 50.86348   50.86348
##  [2,] 36.58007     35.61107 25.38624   25.38624
##  [3,] 21.35093     21.63289 20.26659   20.26659
##  [4,] 36.03137     37.19942 37.86607   37.86607
##  [5,] 29.78683     29.28613 19.50081   19.50081
##  [6,] 25.10961     23.02495 21.48701   21.48701
##  [7,] 24.87445     25.74716 28.77091   28.77091
##  [8,] 24.43877     24.32765 29.14430   29.14430
##  [9,] 27.63528     26.36509 54.91769   54.91769
## [10,] 31.07062     33.10497 35.08130   35.08130
## [11,] 29.18011     29.28613 25.87766   25.87766
## [12,] 31.60193     32.50797 26.86994   26.86994
## [13,] 27.26186     26.45248 25.73757   25.73757
## [14,] 30.17188     30.89659 33.43548   33.43548
## [15,] 24.79961     22.63133 23.82928   23.82928
## [16,] 22.70996     22.80717 23.33221   23.33221
## [17,] 27.82475     27.34157 28.46065   28.46065
## [18,] 25.69608     25.94145 37.60368   37.60368
## [19,] 27.69863     28.71536 22.52371   22.52371
## [20,] 30.82238     31.79628 34.14806   34.14806
## [21,] 24.14285     23.96292 25.41106   25.41106
## [22,] 27.66707     26.83037 26.65740   26.65740
## [23,] 24.02152     25.06720 29.66248   29.66248
## [24,] 30.54098     30.18226 32.73194   32.73194
## [25,] 31.74297     31.01124 38.04242   38.04242
## [26,] 26.62547     24.99801 25.77261   25.77261
## [27,] 28.97478     29.05244 30.92338   30.92338
## [28,] 32.98316     33.96469 36.76925   36.76925
## [29,] 29.18011     28.84171 33.11822   33.11822
## [30,] 30.11381     30.27103 37.89824   37.89824

When we compare the imputed and calculated values of BMI from impnaive we can see that the imputed height and weight give a different BMI than is imputed. This is because BMI is imputed before weight, which means that the most recent imputed value of weight is from the previous iteration.

Changing the visitSequence in imp prevented this inconsistency.

Convergence

The mean and variance of the imputed values per iteration and variable is stored in the elements chainMean and chainVar of the mids object.

Task

Let’s plot them to see if our imputation imp has converged:

Solution

# implemented plotting function (use layout to change the number of rows and columns)
plot(imp, layout = c(6, 6))

# plot for a single variable
matplot(imp$chainMean["weight", , ], type = "l", ylab = "imputed value",
        xlab = "iteration", main = "weight")

# or the ggplot version
library(ggplot2)
library(reshape2)
ggplot(melt(imp$chainMean), aes(x = Var2, y = value, color = Var3)) +
  geom_line() +
  facet_wrap("Var1", scales = 'free') +
  theme(legend.position = 'none') +
  xlab("iteration") +
  ylab("imputed value")

The chains in imp seem to have converged, but in practice, more iterations should be done to confirm this.

Continue

In comparison, impnaive had some convergence problems:

plot(impnaive, layout = c(6, 6))

height, and BMI show a clear trend and the chains do not mix well, i.e., there is more variation between the chains than within each chain (the same is the case for weight). Moreover, hyperchol shows some undesired behaviour, where four chains move upwards and the 5th chain separates from them after half of the iterations. These are clear signs that there is correlation or identification problems between these variables and some other variables (which is why we made adjustments to the predictorMatrix for imp).

Imputed values

Now that we know that imp has converged, we can compare the distribution of the imputed values against the distribution of the observed values.

Task

Plot the distributions of the imputed variables (continuous and categorical), and make sure the imputed values are realistic (e.g., no height of 2.50m or weight of 10kg for adults). If there are differences in the distributions of observed and imputed values see if those differences can be explained by other variables.

You can use densityplot() and probplot() to get plots for all continuous and categorical variables.

You can obtain probplot() from https://gist.github.com/NErler/0d00375da460dd33839b98faeee2fdab.

To check all imputed values you can either get a summary of the imp element of the mids object or create a complete dataset containing all imputations using the function complete() and get the summary of that.

Solution

densityplot(imp)

# create a long dataset with all imputed data sets stacked onto each other
longDF <- complete(imp, "long")
summary(longDF)
##  .imp          .id           height           BMI        hypten    
##  1:1000   1      :   5   Min.   :1.397   Min.   :11.28   no :3286  
##  2:1000   10     :   5   1st Qu.:1.600   1st Qu.:23.40   yes:1714  
##  3:1000   100    :   5   Median :1.676   Median :26.63             
##  4:1000   1006   :   5   Mean   :1.678   Mean   :27.80             
##  5:1000   1008   :   5   3rd Qu.:1.753   3rd Qu.:30.73             
##           1012   :   5   Max.   :1.930   Max.   :62.35             
##           (Other):4970                                             
##    DM          cohort               DBP              bili        
##  no :4215   Length:5000        Min.   : 12.67   Min.   :-0.2129  
##  yes: 785   Class :character   1st Qu.: 64.00   1st Qu.: 0.6000  
##             Mode  :character   Median : 70.67   Median : 0.7000  
##                                Mean   : 70.51   Mean   : 0.7492  
##                                3rd Qu.: 77.33   3rd Qu.: 0.9000  
##                                Max.   :110.00   Max.   : 4.1000  
##                                                                  
##       SBP         hypchol        creat              WC        
##  Min.   : 78.81   no :4452   Min.   :0.3400   Min.   : 54.53  
##  1st Qu.:108.67   yes: 548   1st Qu.:0.6900   1st Qu.: 84.95  
##  Median :118.00              Median :0.8200   Median : 94.90  
##  Mean   :120.38              Mean   :0.8574   Mean   : 96.65  
##  3rd Qu.:129.33              3rd Qu.:0.9600   3rd Qu.:105.80  
##  Max.   :207.33              Max.   :7.4600   Max.   :178.05  
##                                                               
##       age            HyperMed        weight            chol      
##  Min.   :20.00   no      : 145   Min.   : 26.21   Min.   :2.070  
##  1st Qu.:32.00   previous: 140   1st Qu.: 63.50   1st Qu.:4.340  
##  Median :45.00   yes     :1065   Median : 77.11   Median :4.990  
##  Mean   :46.54   NA's    :3650   Mean   : 78.51   Mean   :5.023  
##  3rd Qu.:60.00                   3rd Qu.: 89.36   3rd Qu.:5.640  
##  Max.   :79.00                   Max.   :167.83   Max.   :9.930  
##                                                                  
##                    educ       alc            albu      
##  Less than 9th grade : 392   0  :1023   Min.   :2.600  
##  9-11th grade        : 786   <=1:2009   1st Qu.:4.100  
##  High school graduate:1082   1-3: 746   Median :4.300  
##  some college        :1445   3-7: 524   Mean   :4.276  
##  College or above    :1295   >7 : 698   3rd Qu.:4.500  
##                                         Max.   :5.217  
##                                                        
##                  race           HDL            uricacid      
##  Mexican American  : 480   Min.   :0.2401   Min.   : 0.5986  
##  Other Hispanic    : 550   1st Qu.:1.1100   1st Qu.: 4.4000  
##  Non-Hispanic White:1780   Median :1.3400   Median : 5.4000  
##  Non-Hispanic Black:1215   Mean   :1.3879   Mean   : 5.4044  
##  other             : 975   3rd Qu.:1.6000   3rd Qu.: 6.3000  
##                            Max.   :3.5700   Max.   :11.0000  
##                                                              
##      smoke         gender    
##  never  :2871   male  :2380  
##  former :1066   female:2620  
##  current:1063                
##                              
##                              
##                              
## 
# get the summary of the "raw" imputed values
lapply(imp$imp, function(x) 
  summary(unlist(x))
)
## $height
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.409   1.540   1.590   1.600   1.671   1.814 
## 
## $BMI
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.28   25.78   29.41   30.77   34.56   54.92 
## 
## $hypten
##  no yes 
## 101  39 
## 
## $DM
## Length  Class   Mode 
##      0   NULL   NULL 
## 
## $cohort
## Length  Class   Mode 
##      0   NULL   NULL 
## 
## $DBP
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   37.74   62.53   71.54   70.67   79.66  103.95 
## 
## $bili
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.2129  0.5488  0.7627  0.7499  0.9493  1.8327 
## 
## $SBP
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   78.81  111.43  123.10  123.41  136.39  174.65 
## 
## $hypchol
##  no yes 
## 377  48 
## 
## $creat
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4300  0.7100  0.8400  0.8741  0.9700  1.9800 
## 
## $WC
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   54.53   84.40   94.05   97.87  107.30  178.05 
## 
## $age
## Length  Class   Mode 
##      0   NULL   NULL 
## 
## $HyperMed
##    Mode    NA's 
## logical    3650 
## 
## $weight
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   26.21   68.05   80.96   82.80   97.27  127.55 
## 
## $chol
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.400   4.392   5.023   5.054   5.824   8.451 
## 
## $educ
##  Less than 9th grade         9-11th grade High school graduate 
##                    2                    1                    2 
##         some college     College or above 
##                    0                    0 
## 
## $alc
##   0 <=1 1-3 3-7  >7 
## 328 629 181 114 128 
## 
## $albu
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.931   4.048   4.273   4.268   4.501   5.217 
## 
## $race
## Length  Class   Mode 
##      0   NULL   NULL 
## 
## $HDL
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2401  1.1174  1.4177  1.3853  1.6470  2.4943 
## 
## $uricacid
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.5986  4.5398  5.3976  5.3840  6.4140  9.4719 
## 
## $smoke
##   never  former current 
##       1       1       3 
## 
## $gender
## Length  Class   Mode 
##      0   NULL   NULL
# look into the distribution of SBP and height
densityplot(imp, ~SBP|hypten)

densityplot(imp, ~height|gender)

# plot for all categorical variables
probplot(imp)
## Warning: attributes are not identical across measure variables; they will
## be dropped

# look into the categorical variables where proportions of imputed and observed differ
probplot(imp, smoke ~ alc)

probplot(imp, alc ~ gender)

Unfortunately, we have some negative imputed values for bili. Often, this would not result in bias in the analysis, but may be difficult to explain when providing a summary of the imputed data in a publication. In the present example we can see that the observed values have a slightly right-skewed distribution, compared to the imputed values. Re-doing the imputation with pmm instead of norm for bili should fix this. (However, since the imputations seem fine overall, and there is little knowledge gain in re-doing the previous steps, we will skip this repetition in this practical.)

The distributions of the imputed values for height and SBP differ a bit from the distributions of the observed data, but these differences can be explained by adjusting for gender and hypten.

We also imputed a larger proportion than might have been expected in the highest category of alc, and the distribution of values for educ and smoke looks a bit weird.

The difference in distribution of alc could be explained by gender. smoke and educ each have only one missing value, which makes it difficult to judge the distribution of imputed values. Adjusting the distributions of observed smoke for alc shows that the subject with missing smoking status was a drinker, and drinkers were more often in the current group of smoking.

Analysis

Fitting models on mids objects

When we are confident that the imputation was successfull, the imputed data can be analysed with any complete data method.

Task

Choose any analysis model of interest and fit it on the imputed data. The model could for instance be a linear regression model (using lm) or a generalized linear regression model, e.g., logistic regression (using glm()). Find out the class and structure of the resulting object.

imp is not a dataframe but an object of class mids, so we cannot use the normal specification of a model in which we specify the argument data to be a data.frame. Instead, use with(imp, <model function without data>).

Solution

# for example:
models1 <- with(imp, lm(SBP ~ age + gender + bili + chol + BMI))
models2 <- with(imp, glm(hypchol ~ age + gender + bili + race + BMI,
                       family = "binomial"))
models3 <- with(imp, glm(creat ~ age + gender + uricacid + WC + BMI,
                       family = Gamma(link = 'log')))
# to determine the class
class(models1)
## [1] "mira"   "matrix"
# to determine the structure
names(models1)
## [1] "call"     "call1"    "nmis"     "analyses"
lapply(models1, class)
## $call
## [1] "call"
## 
## $call1
## [1] "call"
## 
## $nmis
## [1] "integer"
## 
## $analyses
## [1] "list"
# explore further
models1$analyses
## [[1]]
## 
## Call:
## lm(formula = SBP ~ age + gender + bili + chol + BMI)
## 
## Coefficients:
## (Intercept)          age      gender2         bili         chol  
##     95.4107       0.4304      -6.5005      -1.3925       0.3092  
##         BMI  
##      0.2778  
## 
## 
## [[2]]
## 
## Call:
## lm(formula = SBP ~ age + gender + bili + chol + BMI)
## 
## Coefficients:
## (Intercept)          age      gender2         bili         chol  
##     95.1448       0.4258      -6.2289      -2.0012       0.4450  
##         BMI  
##      0.2858  
## 
## 
## [[3]]
## 
## Call:
## lm(formula = SBP ~ age + gender + bili + chol + BMI)
## 
## Coefficients:
## (Intercept)          age      gender2         bili         chol  
##     93.9945       0.4251      -6.4535      -0.6079       0.5089  
##         BMI  
##      0.2893  
## 
## 
## [[4]]
## 
## Call:
## lm(formula = SBP ~ age + gender + bili + chol + BMI)
## 
## Coefficients:
## (Intercept)          age      gender2         bili         chol  
##     94.7578       0.4290      -6.0788      -0.5295       0.4211  
##         BMI  
##      0.2538  
## 
## 
## [[5]]
## 
## Call:
## lm(formula = SBP ~ age + gender + bili + chol + BMI)
## 
## Coefficients:
## (Intercept)          age      gender2         bili         chol  
##     94.9890       0.4357      -6.4453      -1.0361       0.3775  
##         BMI  
##      0.2665
summary(models1$analyses[[1]])
## 
## Call:
## lm(formula = SBP ~ age + gender + bili + chol + BMI)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.168  -9.650  -0.956   8.350  73.810 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 95.41066    3.61149  26.419  < 2e-16 ***
## age          0.43038    0.02833  15.191  < 2e-16 ***
## gender2     -6.50051    0.95873  -6.780 2.05e-11 ***
## bili        -1.39252    1.63938  -0.849 0.395851    
## chol         0.30919    0.45987   0.672 0.501526    
## BMI          0.27779    0.07553   3.678 0.000248 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.65 on 994 degrees of freedom
## Multiple R-squared:  0.2344, Adjusted R-squared:  0.2306 
## F-statistic: 60.87 on 5 and 994 DF,  p-value: < 2.2e-16

Structure of mira objects

Analyses performed on mids objects will result in a mira object (multiply imputed repeated analyses).

A mira object is a list with the following elements:
call: The call that created the mira object.
call1: The call that created the mids object.
nmis: The number of missing observations per column.
analyses: The models fitted on each of the imputed datasets: A list with imp$m components, where each component contains a fitted model object.

The class of each of the components of imp$analyses will depend on the type of model that was fitted.

Pooling

Combining results

To pool the results from the repeated analyses contained in a mira object, the function pool() can be used.

Task

Pool one of the provided mira objects (models1, models2 or models3) and investigate its structure.

Solution

# pool the model
pooled1 <- pool(models1)

# investigate the structure
class(pooled1)
## [1] "mipo"   "mira"   "matrix"
names(pooled1)
##  [1] "call"   "call1"  "call2"  "nmis"   "m"      "qhat"   "u"     
##  [8] "qbar"   "ubar"   "b"      "t"      "r"      "dfcom"  "df"    
## [15] "fmi"    "lambda"
str(pooled1)
## List of 16
##  $ call  : language pool(object = models1)
##  $ call1 : language with.mids(data = imp, expr = lm(SBP ~ age + gender + bili + chol +      BMI))
##  $ call2 : language mice(data = NHANES, m = 5, method = meth, predictorMatrix = pred,      visitSequence = visSeq, maxit = 10, seed = 2018)
##  $ nmis  : Named int [1:23] 20 30 28 0 0 41 87 41 85 86 ...
##   ..- attr(*, "names")= chr [1:23] "height" "BMI" "hypten" "DM" ...
##  $ m     : int 5
##  $ qhat  : num [1:5, 1:6] 95.4 95.1 94 94.8 95 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:5] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:6] "(Intercept)" "age" "gender2" "bili" ...
##  $ u     : num [1:5, 1:6, 1:6] 13 13.4 13 13.1 13 ...
##   ..- attr(*, "dimnames")=List of 3
##   .. ..$ : chr [1:5] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:6] "(Intercept)" "age" "gender2" "bili" ...
##   .. ..$ : chr [1:6] "(Intercept)" "age" "gender2" "bili" ...
##  $ qbar  : Named num [1:6] 94.859 0.429 -6.341 -1.113 0.412 ...
##   ..- attr(*, "names")= chr [1:6] "(Intercept)" "age" "gender2" "bili" ...
##  $ ubar  : num [1:6, 1:6] 13.1216 -0.0259 -0.5455 -2.6786 -0.994 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "(Intercept)" "age" "gender2" "bili" ...
##   .. ..$ : chr [1:6] "(Intercept)" "age" "gender2" "bili" ...
##  $ b     : num [1:6, 1:6] 0.290113 0.001025 0.000302 -0.22342 -0.03413 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "(Intercept)" "age" "gender2" "bili" ...
##   .. ..$ : chr [1:6] "(Intercept)" "age" "gender2" "bili" ...
##  $ t     : num [1:6, 1:6] 13.4698 -0.0247 -0.5452 -2.9467 -1.0349 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "(Intercept)" "age" "gender2" "bili" ...
##   .. ..$ : chr [1:6] "(Intercept)" "age" "gender2" "bili" ...
##  $ r     : Named num [1:6] 0.0265 0.027 0.0424 0.1665 0.0313 ...
##   ..- attr(*, "names")= chr [1:6] "(Intercept)" "age" "gender2" "bili" ...
##  $ dfcom : int 994
##  $ df    : Named num [1:6] 832 828 683 159 788 ...
##   ..- attr(*, "names")= chr [1:6] "(Intercept)" "age" "gender2" "bili" ...
##  $ fmi   : Named num [1:6] 0.0282 0.0287 0.0435 0.1533 0.0328 ...
##   ..- attr(*, "names")= chr [1:6] "(Intercept)" "age" "gender2" "bili" ...
##  $ lambda: Named num [1:6] 0.0258 0.0263 0.0407 0.1428 0.0303 ...
##   ..- attr(*, "names")= chr [1:6] "(Intercept)" "age" "gender2" "bili" ...
##  - attr(*, "class")= chr [1:3] "mipo" "mira" "matrix"

Structure of mipo objects

Pooling of a mira object results in a mipo object (multiply imputed pooled analysis).

The mipo object is a list that has the following elements:
call: The call that created the mipo object.
call1: The call that created the mids object.
call2: The call that created the mira object.
data The original, incomplete data.
nmis: The number of missing values per variable.
m: The number of imputations.
qhat: The coefficients from all analyses: a matrix with m rows and nparcolumns.
u: The variance-covariance matrix of the coefficients from each analysis: an array of dimension c(m, npar, npar).
qbar: The pooled parameter estimates: a vector of length npar.
ubar: The average within imputation variance: a matrix with npar rows and npar columns containing the averaged variance-covariance matrices.
b: The between imputation variance-covariance matrix (\(u + b + b/m\)).
t: The total variance-covariance matrix.
r: The relative increase in variance due to the missing values.
dfcom: Degrees of freedom in the hypothetically complete data (N - # free parameters).
df: Degrees of freedom associated with the \(t\)-statistics.
fmi: Fraction of missing information.
lambda: Proportion of the variation due to the missing values: \((b+b/m)/t\).

From the elements contained in the mipo object, we could obtain the pooled coefficients (qbar) and variances (t), and calculate the pooled confidence intervals and p-values by hand, using the function provided in the course notes.

However, the function summary() applied to the mipo object will do this for us.

Note:
pool() extracts the model coefficients and variance-covariance matrices using the functions coef() and vcov(). Hence, pooling using the pool() function from mice only works for those types of models for which these functions will return the regression coefficients and corresponding variance-covariance matrix.

Moreover, pool() only works for objects of class mira. When data has been imputed in a different package/software, or analyses have not been performed in the way described above, a list of fitted models can be converted to a mira object using as.mira().

Task

Get the summary of the pooled results.

Solution

pooled1 <- pool(models1)
summary(pooled1)
##                    est         se          t       df     Pr(>|t|)
## (Intercept) 94.8593413 3.67011837 25.8463983 832.0829 0.000000e+00
## age          0.4291836 0.02880572 14.8992499 827.5060 0.000000e+00
## gender2     -6.3414040 0.97983585 -6.4719045 682.5690 1.848048e-10
## bili        -1.1134519 1.75656645 -0.6338798 159.4478 5.270676e-01
## chol         0.4123433 0.46983785  0.8776290 787.5224 3.804127e-01
## BMI          0.2746369 0.07805927  3.5183129 670.5039 4.636353e-04
##                  lo 95       hi 95 nmis        fmi     lambda
## (Intercept) 87.6555629 102.0631196   NA 0.02817875 0.02584568
## age          0.3726428   0.4857245    0 0.02866178 0.02631698
## gender2     -8.2652584  -4.4175497   NA 0.04350338 0.04070484
## bili        -4.5825893   2.3556854   87 0.15332223 0.14276830
## chol        -0.5099394   1.3346260   85 0.03279790 0.03034470
## BMI          0.1213669   0.4279069   30 0.04476185 0.04191677
You have reached the end of this practical. Well done!