The focus of this practical is the imputation of data that has features that require special attention. In the interest of time, we will focus on these features and abbreviate steps that are the same as in any imputation setting (e.g., getting to know the data or checking that imputed values are realistic). Nevertheless, these steps are of course required when analysing data in practice.

Non-linear associations

Data & Model of interest

To practice imputation when non-linear functional forms or interaction terms are involved, we use a subset of the NHANES data.

To get the NHANES data, load the file NHANES_2.RData.

The variables in this subset have the following distributions and proportions of missing values:

The missing data pattern is

##     weight gender age educ height DBP chol HDL bili    
## 547      1      1   1    1      1   1    1   1    1   0
##   8      1      1   1    1      1   1    1   1    0   1
##  59      1      1   1    1      1   0    1   1    1   1
##  13      1      1   1    1      0   1    1   1    1   1
##   1      1      1   1    1      1   0    1   1    0   2
##   5      1      1   1    1      0   0    1   1    1   2
## 104      1      1   1    1      1   1    0   0    0   3
##  11      1      1   1    1      1   0    0   0    0   4
##   1      1      1   1    1      0   1    0   0    0   4
##   1      1      1   1    1      0   0    0   0    0   5
##          0      0   0    0     20  77  117 117  126 457

Our aim is to fit the following linear regression model for weight:

mod <- lm(weight ~ gender + bili + age * (chol + HDL) + height)

We expect that the effects of cholesterol and HDL may differ with age, and, hence, include interaction terms between age and chol and HDL, respectively. Additionally, we want to include educ and DBP as auxiliary variables.

Imputation using mice

When the analysis model of interest involves interaction terms between incomplete variables, mice has limited options to reduce the bias that may be introduced by naive handling of the missing values.

Use of the “Just Another Variable” approach can in some settings reduce bias. Alternatively, we can use passive imputation, i.e., calculate the interaction terms in each iteration of the MICE algorithm. Furthermore, predictive mean matching tends to lead to less bias than normal imputation models.

Task 1

Calculate the interaction terms in the incomplete data and perform the imputation with mice(), using the JAV approach and using passive imputation. Follow the usual procedure of doing a setup run and adjusting the imputation method and predictor matrix.

Solution 1

# calculate the interaction terms
NHANES$agechol <- NHANES$age * NHANES$chol
NHANES$ageHDL <- NHANES$age * NHANES$HDL

# setup run
miceint0 <- mice(NHANES, maxit = 0)

meth_miceint <- miceint0$method 
pred_miceint <- miceint0$predictorMatrix

# change the imputation method:
meth_miceint["DBP"] <- "norm"
# replace all "pmm" with "midastouch"
meth_miceint <- gsub("pmm", "midastouch", meth_miceint)

# changes in predictor matrix to prevent that the original variables are
# imputed based on the interaction terms
pred_miceint["chol", "agechol"] <- 0
pred_miceint["HDL", "ageHDL"] <- 0

# changes in imputation method for passive imputation only
meth_miceintPAS <- meth_miceint
meth_miceintPAS[c("agechol", "ageHDL")] <- c("~I(age*chol)", "~I(age*HDL)")

# run imputation with the JAV approach
miceintJAV <- mice(NHANES, method = meth_miceint, predictorMatrix = pred_miceint,
                maxit = 10, m = 5)
## 
##  iter imp variable
##   1   1  bili  chol  HDL  DBP  height  agechol  ageHDL
##   1   2  bili  chol  HDL  DBP  height  agechol  ageHDL
##   1   3  bili  chol  HDL  DBP  height  agechol  ageHDL
##   1   4  bili  chol  HDL  DBP  height  agechol  ageHDL
##   1   5  bili  chol  HDL  DBP  height  agechol  ageHDL
##   2   1  bili  chol  HDL  DBP  height  agechol  ageHDL
##   2   2  bili  chol  HDL  DBP  height  agechol  ageHDL
##   2   3  bili  chol  HDL  DBP  height  agechol  ageHDL
##   2   4  bili  chol  HDL  DBP  height  agechol  ageHDL
##   2   5  bili  chol  HDL  DBP  height  agechol  ageHDL
##   3   1  bili  chol  HDL  DBP  height  agechol  ageHDL
##   3   2  bili  chol  HDL  DBP  height  agechol  ageHDL
##   3   3  bili  chol  HDL  DBP  height  agechol  ageHDL
##   3   4  bili  chol  HDL  DBP  height  agechol  ageHDL
##   3   5  bili  chol  HDL  DBP  height  agechol  ageHDL
##   4   1  bili  chol  HDL  DBP  height  agechol  ageHDL
##   4   2  bili  chol  HDL  DBP  height  agechol  ageHDL
##   4   3  bili  chol  HDL  DBP  height  agechol  ageHDL
##   4   4  bili  chol  HDL  DBP  height  agechol  ageHDL
##   4   5  bili  chol  HDL  DBP  height  agechol  ageHDL
##   5   1  bili  chol  HDL  DBP  height  agechol  ageHDL
##   5   2  bili  chol  HDL  DBP  height  agechol  ageHDL
##   5   3  bili  chol  HDL  DBP  height  agechol  ageHDL
##   5   4  bili  chol  HDL  DBP  height  agechol  ageHDL
##   5   5  bili  chol  HDL  DBP  height  agechol  ageHDL
##   6   1  bili  chol  HDL  DBP  height  agechol  ageHDL
##   6   2  bili  chol  HDL  DBP  height  agechol  ageHDL
##   6   3  bili  chol  HDL  DBP  height  agechol  ageHDL
##   6   4  bili  chol  HDL  DBP  height  agechol  ageHDL
##   6   5  bili  chol  HDL  DBP  height  agechol  ageHDL
##   7   1  bili  chol  HDL  DBP  height  agechol  ageHDL
##   7   2  bili  chol  HDL  DBP  height  agechol  ageHDL
##   7   3  bili  chol  HDL  DBP  height  agechol  ageHDL
##   7   4  bili  chol  HDL  DBP  height  agechol  ageHDL
##   7   5  bili  chol  HDL  DBP  height  agechol  ageHDL
##   8   1  bili  chol  HDL  DBP  height  agechol  ageHDL
##   8   2  bili  chol  HDL  DBP  height  agechol  ageHDL
##   8   3  bili  chol  HDL  DBP  height  agechol  ageHDL
##   8   4  bili  chol  HDL  DBP  height  agechol  ageHDL
##   8   5  bili  chol  HDL  DBP  height  agechol  ageHDL
##   9   1  bili  chol  HDL  DBP  height  agechol  ageHDL
##   9   2  bili  chol  HDL  DBP  height  agechol  ageHDL
##   9   3  bili  chol  HDL  DBP  height  agechol  ageHDL
##   9   4  bili  chol  HDL  DBP  height  agechol  ageHDL
##   9   5  bili  chol  HDL  DBP  height  agechol  ageHDL
##   10   1  bili  chol  HDL  DBP  height  agechol  ageHDL
##   10   2  bili  chol  HDL  DBP  height  agechol  ageHDL
##   10   3  bili  chol  HDL  DBP  height  agechol  ageHDL
##   10   4  bili  chol  HDL  DBP  height  agechol  ageHDL
##   10   5  bili  chol  HDL  DBP  height  agechol  ageHDL
# run imputation with passive imputation
miceintPAS <- mice(NHANES, method = meth_miceintPAS, predictorMatrix = pred_miceint,
                maxit = 10, m = 5)
## 
##  iter imp variable
##   1   1  bili  chol  HDL  DBP  height  agechol  ageHDL
##   1   2  bili  chol  HDL  DBP  height  agechol  ageHDL
##   1   3  bili  chol  HDL  DBP  height  agechol  ageHDL
##   1   4  bili  chol  HDL  DBP  height  agechol  ageHDL
##   1   5  bili  chol  HDL  DBP  height  agechol  ageHDL
##   2   1  bili  chol  HDL  DBP  height  agechol  ageHDL
##   2   2  bili  chol  HDL  DBP  height  agechol  ageHDL
##   2   3  bili  chol  HDL  DBP  height  agechol  ageHDL
##   2   4  bili  chol  HDL  DBP  height  agechol  ageHDL
##   2   5  bili  chol  HDL  DBP  height  agechol  ageHDL
##   3   1  bili  chol  HDL  DBP  height  agechol  ageHDL
##   3   2  bili  chol  HDL  DBP  height  agechol  ageHDL
##   3   3  bili  chol  HDL  DBP  height  agechol  ageHDL
##   3   4  bili  chol  HDL  DBP  height  agechol  ageHDL
##   3   5  bili  chol  HDL  DBP  height  agechol  ageHDL
##   4   1  bili  chol  HDL  DBP  height  agechol  ageHDL
##   4   2  bili  chol  HDL  DBP  height  agechol  ageHDL
##   4   3  bili  chol  HDL  DBP  height  agechol  ageHDL
##   4   4  bili  chol  HDL  DBP  height  agechol  ageHDL
##   4   5  bili  chol  HDL  DBP  height  agechol  ageHDL
##   5   1  bili  chol  HDL  DBP  height  agechol  ageHDL
##   5   2  bili  chol  HDL  DBP  height  agechol  ageHDL
##   5   3  bili  chol  HDL  DBP  height  agechol  ageHDL
##   5   4  bili  chol  HDL  DBP  height  agechol  ageHDL
##   5   5  bili  chol  HDL  DBP  height  agechol  ageHDL
##   6   1  bili  chol  HDL  DBP  height  agechol  ageHDL
##   6   2  bili  chol  HDL  DBP  height  agechol  ageHDL
##   6   3  bili  chol  HDL  DBP  height  agechol  ageHDL
##   6   4  bili  chol  HDL  DBP  height  agechol  ageHDL
##   6   5  bili  chol  HDL  DBP  height  agechol  ageHDL
##   7   1  bili  chol  HDL  DBP  height  agechol  ageHDL
##   7   2  bili  chol  HDL  DBP  height  agechol  ageHDL
##   7   3  bili  chol  HDL  DBP  height  agechol  ageHDL
##   7   4  bili  chol  HDL  DBP  height  agechol  ageHDL
##   7   5  bili  chol  HDL  DBP  height  agechol  ageHDL
##   8   1  bili  chol  HDL  DBP  height  agechol  ageHDL
##   8   2  bili  chol  HDL  DBP  height  agechol  ageHDL
##   8   3  bili  chol  HDL  DBP  height  agechol  ageHDL
##   8   4  bili  chol  HDL  DBP  height  agechol  ageHDL
##   8   5  bili  chol  HDL  DBP  height  agechol  ageHDL
##   9   1  bili  chol  HDL  DBP  height  agechol  ageHDL
##   9   2  bili  chol  HDL  DBP  height  agechol  ageHDL
##   9   3  bili  chol  HDL  DBP  height  agechol  ageHDL
##   9   4  bili  chol  HDL  DBP  height  agechol  ageHDL
##   9   5  bili  chol  HDL  DBP  height  agechol  ageHDL
##   10   1  bili  chol  HDL  DBP  height  agechol  ageHDL
##   10   2  bili  chol  HDL  DBP  height  agechol  ageHDL
##   10   3  bili  chol  HDL  DBP  height  agechol  ageHDL
##   10   4  bili  chol  HDL  DBP  height  agechol  ageHDL
##   10   5  bili  chol  HDL  DBP  height  agechol  ageHDL

Task 2

We skip the evaluation of convergence and investigation of the imputed values. With the settings given in the solution the chains have converged and distributions of the imputed values match the distributions of the observed data close enough. Omitting the changes in the predictorMatrix would have resulted in convergence problems for HDL, ageHDL, chol and agechol.

Analyse the imputed data and pool the results.

Solution 2

miceint_miraJAV <- with(miceintJAV, 
                     lm(weight  ~ gender + bili + age + chol + HDL + agechol +
                          ageHDL + height))

miceint_miraPAS <- with(miceintPAS, 
                     lm(weight  ~ gender + bili + age + chol + HDL + agechol +
                          ageHDL + height))

summary(pool(miceint_miraJAV))
##                      est          se          t         df     Pr(>|t|)
## (Intercept) -76.63915455 19.39481388 -3.9515282  27.155545 0.0004990517
## gender2       4.64534644  1.65210959  2.8117665 527.128374 0.0051105490
## bili         -5.76016870  2.73320552 -2.1074773  14.565363 0.0528448141
## age           0.21290955  0.24100360  0.8834289   7.220343 0.4054571384
## chol          2.70748429  1.60998512  1.6816828  27.704040 0.1038710010
## HDL         -22.29193149  7.95680304 -2.8016191   7.203969 0.0256977968
## agechol      -0.05763036  0.02959673 -1.9471870  34.545334 0.0596782985
## ageHDL        0.16270072  0.15434401  1.0541434   7.008678 0.3267956761
## height       99.99894552  7.52485321 13.2891557 672.844617 0.0000000000
##                    lo 95        hi 95 nmis        fmi     lambda
## (Intercept) -116.4233626 -36.85494645   NA 0.41401610 0.37239134
## gender2        1.3998192   7.89087364   NA 0.04750585 0.04389880
## bili         -11.6010360   0.08069864  126 0.56873791 0.51332486
## age           -0.3534668   0.77928590    0 0.78338247 0.73067968
## chol          -0.5920087   6.00697726  117 0.40965945 0.36852652
## HDL          -40.9993270  -3.58453600  117 0.78412569 0.73149890
## agechol       -0.1177432   0.00248247  117 0.36401112 0.32822646
## ageHDL        -0.2021733   0.52757472  117 0.79313475 0.74147447
## height        85.2239266 114.77396440   20 0.02337114 0.02047246
summary(pool(miceint_miraPAS))
##                      est          se         t        df     Pr(>|t|)
## (Intercept) -84.41034507 17.41918620 -4.845826 482.19221 1.702527e-06
## gender2       4.55210592  1.63844712  2.778305 604.66134 5.633998e-03
## bili         -6.38473588  2.29994476 -2.776039  42.82735 8.126896e-03
## age           0.34631723  0.19904132  1.739926 188.54777 8.350368e-02
## chol          4.33360717  1.96548001  2.204859 126.55016 2.927101e-02
## HDL         -22.73238938  5.34296081 -4.254643 111.49202 4.377521e-05
## agechol      -0.08985437  0.03809106 -2.358936 121.92983 1.991703e-02
## ageHDL        0.17825559  0.10114365  1.762400 105.97395 8.088488e-02
## height      100.64330497  7.46032910 13.490465 721.24291 0.000000e+00
##                     lo 95        hi 95 nmis        fmi      lambda
## (Intercept) -118.63723278 -50.18345735   NA 0.05482793 0.050915736
## gender2        1.33436779   7.76984406   NA 0.03518144 0.031995435
## bili         -11.02355733  -1.74591442  126 0.32368415 0.292821416
## age           -0.04631678   0.73895124    0 0.13182017 0.122659636
## chol           0.44414395   8.22307039  117 0.17165361 0.158665055
## HDL          -33.31930808 -12.14547069  117 0.18578119 0.171305165
## agechol       -0.16525986  -0.01444887  117 0.17571999 0.162309414
## ageHDL        -0.02227210   0.37878328  117 0.19166842 0.176555735
## height        85.99675001 115.28985993   20 0.01187011 0.009133832

Because we want to perform the imputation with other methods on the same original dataset, we remove the two columns agechol and ageHDL that we had added to the data.

NHANES <- subset(NHANES, select = -c(agechol, ageHDL))

Imputation using JointAI

The package JointAI performs analysis and imputation jointly. There are three main functions, lm_imp(), glm_imp() and lme_imp() that perform linear regression, generalized linear regression and analysis with linear mixed models. The specification of these functions is very similar to the use of their complete data versions lm(), glm() and lme (from the nlme package).

Task 1

Check the help file for lm_imp() to find out which arguments you need to specify to fit the linear regression model for weight.

Solution 1

lm_imp(), glm_imp() and lme_imp() have many arguments, but not all of them are relevant for this practical. The most important arguments are:
formula model formula
data original, incomplete dataset
family for glm’s: the distribution family of the outcome
meth vector of imputation methods. Implemented are norm (normal model), logit (logistic model), lognorm (linear with log-transformed outcome), multilogit (multinomial logit model), cumlogit (cumulative logit model)
n.chains number of MCMC chains
n.adapt number of iterations used in the adaptive phase (details see below)
n.iter number of iterations per MCMC chain in the sampling phase
auxvars vector of names of variables that are not part of the analysis model but should be used to predict missing values (optional)
refcats allows to specify which category of categorical variables is used as reference (optional)

Additional details:
Contrary to MICE, the n.chains different MCMC chains are not used to create multiple imputed datasets. Multiple chains are necessary to evaluate if the model has converged. For the final result the chains are combined.

Depending on the type of variable, different types of samplers (within the Gibbs sampling) are used. JAGS (which is called from within lm_imp()) needs an adaptive phase to find a suitable sampler for each of the parameters. Often n.adapt = 100 iterations (the default value) are sufficient for the adaption phase. These iterations are not included in the final result.

The number of iterations n.iter that is required depends on the specific data and how complex the model is, hence, no default value is given.

Columns of the dataset that are not used in the model formula, and are not specified as auxiliary variables are ignored. In the current version of JointAI, passive imputation (calculating variables based on other variables) is not yet available.

Task 2

Analyse (and impute) the data using lm_imp() and the linear regression model specified above.

  • Start with a small number of iterations, say n.iter = 100, to get an idea about the computational time.
  • Check convergence of the MCMC chains using traceplot().
  • Increase n.iter if necessary.

You need to specify the

  • model formula
  • the dataset
  • auxiliary variables
  • the number of iterations for the sample
  • set inits = NULL


traceplot() shows by default the MCMC chains of the main model parameters. This is different to the traceplots we have seen when using mice(), but the principle of convergence is the same.

The result of the model can be displayed with the function summary(). To exclude parts of the MCMC chains from the result (if the ) part of the MCMC chains after convergence was achieved, the argument start can be used to specify the first iteration to be used. See an example in the help page of the summary() for JointAI objects.

Note:
Due to a bug in version 0.1.0 of JointAI you need to specify inits = NULL when using auxiliary variables. This bug is resolved in the development version on GitHub.

Solution 2

library(JointAI)
JointAIint <- lm_imp(weight ~ gender + bili + age * (chol + HDL) + height,
                  data = NHANES,
                  auxvars = c("educ", "DBP"),
                  n.iter = 500, inits = NULL)
## This is new software. Please report any bug to the package maintainer.
traceplot(JointAIint)

summary(JointAIint, start = 200)
## 
##  Linear model fitted with JointAI 
## 
## Call:
## lm_imp(formula = weight ~ gender + bili + age * (chol + HDL) + 
##     height, data = NHANES, n.iter = 500, inits = NULL, auxvars = c("educ", 
##     "DBP"))
## 
## Posterior summary:
##                  Mean       SD       2.5%    97.5% tail-prob.
## (Intercept)  -86.4904 17.18574 -120.69996 -52.9285   0.000000
## genderfemale   4.9002  1.58440    1.79233   7.9446   0.004430
## age            0.3879  0.19603    0.01859   0.7758   0.035437
## height       101.1454  7.40559   87.54574 115.1592   0.000000
## chol           5.3293  1.88690    1.84898   9.1587   0.004430
## HDL          -25.1377  5.20770  -34.99747 -15.4332   0.000000
## bili          -6.9972  2.05366  -10.99778  -2.9416   0.004430
## age:chol      -0.1092  0.03814   -0.18346  -0.0342   0.006645
## age:HDL        0.2174  0.09936    0.03328   0.4088   0.024363
## 
## Posterior summary of residual std. deviation:
##               Mean     SD  2.5% 97.5%
## sigma_weight 15.92 0.4263 15.13 16.83
## 
## 
## MCMC settings:
## Iterations = 200:500
## Sample size per chain = 301 
## Thinning interval = 1 
## Number of chains = 3 
## 
## Number of observations: 750

Imputation using smcfcs

The package smcfcs performs imputation using “substantive model compatible fully conditional specification”.

The main function is called smcfcs() and its use is similar to the use of mice().

Task 1

Check the help file to find out the details of this function.

Solution 1

smcfcs() takes the following arguments as input:
originaldata original, incomplete data frame
smtype type of substantive (analysis) model: lm, logistic, poisson, coxph or compet
smformula formula of analysis model
method vector of imputation methods for each variable: norm (lin. regression), logreg (logistic regression), poisson (Poisson regression),podds (prop. odds regression for ordered factors,mlogit (multinomial logistic regression for unordered factors) '' if variable is complete, or a custom expression to impute passively imputed variables, e.g. 'x^2' or x1*x2
predictorMatrix predictor matrix to define which variables are used as predictors in which imputation model (optional). The outcome must not be included
m number of imputed datasets (default is 5)
numit number of iterations before obtaining imputed dataset
rjlimit maximum number of attempts in the rejection sampling (MCMC)
noisy if TRUE: output is printed, if FALSE: less output is printed

Additional details:

The vector of imputation methods needs to be specified in the order of the columns in the dataset and contain an entry for each column, even that column is not used in the imputation model. Depending on whether additional columns of the data should be used as auxiliary variables in the imputation, the predictorMatrix needs to be adjusted (0 if the column should not be used, 1 if it should be used).

In smcfcs() the outcome of the analysis model is implicitely included in the imputation model, similarily to how it is included in JointAI. Therefore, the outcome does not need to be explicitely specified as predictor variable for the imputation models in the predictorMatrix.

smcfcs uses rejection sampling. Rejection sampling is a method to create a random sample from a complex distribution (from which we can not sample directly) by drawing from a simpler proposal distribution and rejecting draws that are unlikely under the complex target distribution. The argument rjlimit specifies how many attepts are made to draw a value from the proposal that is not rejected.

For logistic regression, smcfcs requires the outcome to be coded as numeric (with values 0 and 1).

Task 2

Impute the NHANES data with smcfcs() and take into account that DBP and educ should be used as auxiliary variables.
=> you need to specify the predictorMatrix to specify this

To create the predictor matrix: pred <- matrix(nrow = ncol(NHANES), ncol = ncol(NHANES), data = 1, dimnames = list(names(NHANES), names(NHANES)))

Then exclude variables from their own imputation model: diag(pred) <- 0

In smcfcs() you need to specify

  • the data
  • the model type
  • the model formula
  • the vector of imputation methods
  • the predictor matrix
  • possibly the number of iterations

Solution 2

# To create the predictor matrix:
pred <- matrix(nrow = ncol(NHANES), ncol = ncol(NHANES), data = 1,
               dimnames = list(names(NHANES), names(NHANES)))
# exclude variables from their own imputation model
diag(pred) <- 0


smcint <- smcfcs(NHANES,
                 smtype = "lm",
                 smformula = "weight ~ gender + bili + age * (chol + HDL) + height",
                 method = c(weight = "", gender = "", bili = "norm", age = "",
                            chol = "norm", HDL = "norm", educ = "", DBP = 'norm',
                            height = 'norm'), 
                 predictorMatrix = pred, numit = 20)
## [1] "Outcome variable(s): weight"
## [1] "Passive variables: "
## [1] "Partially obs. variables: bili,chol,HDL,DBP,height"
## [1] "Fully obs. substantive model variables: gender,age"
## [1] "Imputation  1"
## [1] "Imputing:  bili  using  gender,age,chol,HDL,educ,DBP,height  plus outcome"
## [1] "Imputing:  chol  using  gender,bili,age,HDL,educ,DBP,height  plus outcome"
## [1] "Imputing:  HDL  using  gender,bili,age,chol,educ,DBP,height  plus outcome"
## [1] "Imputing:  DBP  using  gender,bili,age,chol,HDL,educ,height  plus outcome"
## [1] "Imputing:  height  using  gender,bili,age,chol,HDL,educ,DBP  plus outcome"
## [1] "Imputation  2"
## [1] "Imputation  3"
## [1] "Imputation  4"
## [1] "Imputation  5"
## Warning in smcfcs.core(originaldata, smtype, smformula, method,
## predictorMatrix, : Rejection sampling failed 443 times (across all
## variables, iterations, and imputations). You may want to increase the
## rejection sampling limit.

Task 3

Find out the class of the object returned by smcfcs() and its elements. The section Value in the help page can help with this.

The estimates of the model parameters from each iteration are returned in the list element smCoefIter.

Then plot the chains of the estimated model parameters to evaluate if the algorithm has converged.

Solution 3

# class of the returned object
class(smcint)
## [1] "list"
# names of its components
names(smcint)
## [1] "impDatasets" "smCoefIter"
# investigate the first component
class(smcint$impDatasets)
## [1] "list"
length(smcint$impDatasets)
## [1] 5
class(smcint$impDatasets[[1]])
## [1] "data.frame"
dim(smcint$impDatasets[[1]])
## [1] 750   9
# investigate the second component
class(smcint$smCoefIter)
## [1] "array"
dim(smcint$smCoefIter)
## [1]  5  9 20
# plot the chains of estimated coefficients
par(mfrow = c(3, 3), mar = c(3.5, 3.5, 1,1), mgp = c(2, 0.6, 0))
for (i in 1:dim(smcint$smCoefIter)[2]) {
  matplot(t(smcint$smCoefIter[, i, ]), type = 'l', ylab = "coefficient", 
          xlab = "iteration")
}

Continue

The imputed datasets are returned as a list, where each element of the list is one imputed dataset.

To get from a list of datasets to pooled results there are several options, as displayed in the following flow chart:

To use datalist2mids() from the package miceadds, ordered factors need to be converted to unordered factors. To use as.mids() we would have to turn the list into a long format dataset (where the original data and all imputed datasets are stacked on top of each other) and add a column that identifies the imputation number and one that identifies subjects.

Task 4

Instead do the following:

  1. convert smcint to an imputationList object
  2. fit the models on this imputationList object
  3. use MIcombine() to obtain an MIresult object
  4. use as.mira() to convert the list of models from 2. to a mira object
  5. use pool() on the mira object
  6. compare the summary() of the mira object and the summary() of the MIresult object

Solution 4

# 1. convert to imputationList
library(mitools)
smcint_impList <- imputationList(smcint$impDatasets)

# 2. fit the models
intmod_mitools <- with(smcint_impList, 
                       lm(weight ~ gender + bili + age * (chol + HDL) + height))

# 3. Pool the results
smcintpool_mitools <- MIcombine(intmod_mitools)

# 4. create a mira object
smcint_mira <- as.mira(intmod_mitools)


# 5. pool the mira object
smcintpool_mice <- pool(smcint_mira)

# 6. compare the results
summary(smcintpool_mitools)
## Multiple imputation results:
##       with(smcint_impList, lm(weight ~ gender + bili + age * (chol + 
##     HDL) + height))
##       MIcombine.default(intmod_mitools)
##                  results          se        (lower       upper) missInfo
## (Intercept)  -85.5033793 16.90841240 -118.64482413 -52.36193437      1 %
## genderfemale   4.7373652  1.64240651    1.51724652   7.95748382      3 %
## bili          -6.4598277  2.06067909  -10.51660594  -2.40304951     13 %
## age            0.3377543  0.19042106   -0.03617934   0.71168799      8 %
## chol           5.5389860  1.85729807    1.89409190   9.18388001      7 %
## HDL          -26.8677631  5.24750801  -37.21214077 -16.52338533     15 %
## height       101.2627544  7.42851757   86.70299231 115.82251653      1 %
## age:chol      -0.1138222  0.03621496   -0.18490366  -0.04274082      7 %
## age:HDL        0.2667638  0.09716448    0.07576538   0.45776230     10 %
summary(smcintpool_mice)
##                      est          se         t       df     Pr(>|t|)
## (Intercept)  -85.5033793 16.90841240 -5.056854 709.5648 5.431652e-07
## genderfemale   4.7373652  1.64240651  2.884405 598.2365 4.062705e-03
## bili          -6.4598277  2.06067909 -3.134805 192.7048 1.988727e-03
## age            0.3377543  0.19042106  1.773724 327.8257 7.703714e-02
## chol           5.5389860  1.85729807  2.982282 399.5557 3.036378e-03
## HDL          -26.8677631  5.24750801 -5.120099 158.2709 8.763846e-07
## height       101.2627544  7.42851757 13.631623 730.7931 0.000000e+00
## age:chol      -0.1138222  0.03621496 -3.142962 379.9926 1.803792e-03
## age:HDL        0.2667638  0.09716448  2.745487 255.0003 6.472089e-03
##                      lo 95        hi 95 nmis         fmi      lambda
## (Intercept)  -118.69988298 -52.30687552   NA 0.015266208 0.012494515
## genderfemale    1.51178173   7.96294861   NA 0.036215938 0.032999231
## bili          -10.52420964  -2.39544580   NA 0.129817397 0.120832771
## age            -0.03684708   0.71235572   NA 0.085072214 0.079507395
## chol            1.88768844   9.19028347   NA 0.069624702 0.064979279
## HDL           -37.23193762 -16.50358848   NA 0.148515446 0.137823165
## height         86.67897406 115.84653478   NA 0.008250439 0.005539975
## age:chol       -0.18502906  -0.04261543   NA 0.073527884 0.068664418
## age:HDL         0.07541681   0.45811087   NA 0.105390595 0.098401468

Imputation using jomo

In this part of the practical we will use jomo to impute the NHANES data with the joint model approach to multiple imputation. For our linear regression of weight, this can be done using the function jomo.lm().

Task 1

Check the help page to find out which arguments need to be specified in jomo.lm().

Solution 1

The main arguments of jomo.lm() are:
formula model formula
data the original, incomplete data set
nburn number of burn-in iterations
nbetween Number of iterations between imputations
nimp number of imputations (default is 5)
output if 1: output at end of imputation, if $ eq 1$: no output shown
out.iter prints output info each out.iter iterations

Additional details:

jomo only produces a single chain of MCMC samples. The first nburn iterations are discarded as burn-in. The next iteration forms the first imputed dataset. The MCMC chain is then continued and after nbetween additional iterations the next imputed dataset is created. This step is repeated until nimp sets of imputed values are obtained.

Convergence of the MCMC sampler can not be checked on the same object that was used to create the imputed values. Instead, a separate function has to be used: jomo.lm.MCMCchain(). It takes the same arguments as jomo.lm(), except for nbetween and nimp, and records all burn-in iterations, but does not create multiple imputed datasets.

jomo will use all columns that are in the dataset but not used in the model formula as auxiliary variables. As in JointAI, specification of passive imputation is currently not available.

Task 2

Run jomo.lm.MCMCchain() for our linear regression of weight and investigate the resulting object.

You need to specify

  • the model formula
  • the dataset
  • the distribution family of the outcome
  • the number of burn-in iterations

Solution 2

jomointMCMC <- jomo.lm.MCMCchain(weight ~ gender + bili + age * (chol + HDL) + height,
                    data = NHANES, output = 0, nburn = 30000)
class(jomointMCMC)
## [1] "list"
names(jomointMCMC)
## [1] "finimp"         "collectbeta"    "collectomega"   "finimp.latnorm"
## [5] "collectbetaY"   "collectvarY"
lapply(jomointMCMC, class)
## $finimp
## [1] "data.frame"
## 
## $collectbeta
## [1] "array"
## 
## $collectomega
## [1] "array"
## 
## $finimp.latnorm
## [1] "matrix"
## 
## $collectbetaY
## [1] "array"
## 
## $collectvarY
## [1] "numeric"

jomoint is a list with six elements:

  • finimp is the imputed dataset that was generated at the end of the burn-in period
  • finimp.latnorm contains a matrix corresponding estimates of the latent normal distributions
  • collectbetaY and collectvarY contain the MCMC samples throughout the burn-in period for the coefficients of the analysis model and residual variation
  • collectbeta and collectomega contain the MCMC samples throughout the burn-in period for the mean and variance-covariance matrix of the multivariate normal distribution.

Task 3

Make plots to evaluate convergence.


Note:
In version 2.6-2 of jomo there is a bug in the sampling of the covariance parameters, which will be resolved in the next version. The objects jomointMCMC and jomoint (see later) used in the solution were created using a preliminary fix of the bug.

Solution 3

par(mfrow = c(3,4), mar = c(3,3,1,1), mgp = c(2, 0.6, 0))
for (i in 1:dim(jomointMCMC$collectbeta)[2]) {
  plot(jomointMCMC$collectbeta[1, i, ], type = 'l')
}

plot(jomointMCMC$collectvarY, type = 'l')

par(mfrow = c(3,3), mar = c(3,3,1,1), mgp = c(2, 0.6, 0))
for (i in 1:dim(jomointMCMC$collectbetaY)[2]) {
  plot(jomointMCMC$collectbetaY[1, i, ], type = 'l')
}

par(mfrow = c(9, 8), mar = c(2.1, 2.5, 0.5, 0.5), mgp = c(2, 0.6, 0))
for (i in 1:dim(jomointMCMC$collectomega)[2]) {
  for (j in 1:i) {
    plot(jomointMCMC$collectomega[i, j , ], type = 'l')
  }
}

Task 4

When you are satisfied with the convergence, run the imputation with the number of burn-in iterations that you found to be sufficient and explore the resulting object.

Solution 4

jomoint <- jomo.lmC(weight ~ gender + bili + age * (chol + HDL) + height,
                    nburn = 1000, nbetween = 200, nimp = 5, data = NHANES)
## This function is beta software. Use carefully and please report any bug to the package mantainer
## ....................................................................................................First imputation registered. 
## ....................Imputation number  2 registered 
## ....................Imputation number  3 registered 
## ....................Imputation number  4 registered 
## ....................Imputation number  5 registered 
## The posterior mean of the substantive model fixed effects estimates is:
##           [,1]     [,2]      [,3]      [,4]     [,5]    [,6]     [,7]
## [1,] -85.25802 4.848687 -7.031685 0.3672109 4.988777 -24.785 101.3408
##            [,8]      [,9]
## [1,] -0.1037916 0.2098365
## The posterior mean of the substantive model residual variance is:
## [1] 253.033
## The posterior mean of the fixed effects estimates is:
##           [,1]    [,2]     [,3]     [,4]     [,5]     [,6]       [,7]
## [1,] 0.7472807 47.6825 5.027179 1.380295 1.683901 70.75355 0.01799191
##            [,8]       [,9]     [,10]      [,11]
## [1,] -0.3415452 -0.2617833 -0.126716 0.02619691
## The posterior mean of the level 1 covariance matrix is:
##              [,1]        [,2]         [,3]        [,4]        [,5]
##  [1,]  0.12862580   0.6140839 -0.027653973 -0.06691115 -0.03546014
##  [2,]  0.61408390 304.6846905  1.824955569 -0.41032266 -0.59686868
##  [3,] -0.02765397   1.8249556  1.235047646  0.14701218  0.01232511
##  [4,] -0.06691115  -0.4103227  0.147012180  0.31587376  0.08073964
##  [5,] -0.03546014  -0.5968687  0.012325107  0.08073964  0.05955106
##  [6,]  0.01022712  -4.2158261  2.157704683 -0.15607124  0.17203748
##  [7,] -0.06012189  -0.7124416 -0.057771815  0.11340621  0.19244043
##  [8,]  0.04105525   6.1093452 -0.018955517 -0.22282767 -0.12330949
##  [9,] -0.01482710   2.6423658 -0.049865247 -0.11331859 -0.05010485
## [10,] -0.02141806   0.9482815  0.002079085 -0.10048987 -0.02123084
## [11,] -0.03428924  -0.5397380 -0.086099295 -0.04975737 -0.01238275
##               [,6]         [,7]        [,8]         [,9]        [,10]
##  [1,]   0.01022712 -0.060121893  0.04105525 -0.014827097 -0.021418063
##  [2,]  -4.21582611 -0.712441565  6.10934521  2.642365812  0.948281485
##  [3,]   2.15770468 -0.057771815 -0.01895552 -0.049865247  0.002079085
##  [4,]  -0.15607124  0.113406212 -0.22282767 -0.113318589 -0.100489873
##  [5,]   0.17203748  0.192440426 -0.12330949 -0.050104847 -0.021230837
##  [6,] 157.57109071  2.040882449 -0.25096545  0.614532707  0.436222115
##  [7,]   2.04088245  1.000000000 -0.20066966 -0.002008077  0.053475636
##  [8,]  -0.25096545 -0.200669663  1.00000000  0.500000000  0.500000000
##  [9,]   0.61453271 -0.002008077  0.50000000  1.000000000  0.500000000
## [10,]   0.43622212  0.053475636  0.50000000  0.500000000  1.000000000
## [11,]  -0.97614224 -0.042733669  0.50000000  0.500000000  0.500000000
##             [,11]
##  [1,] -0.03428924
##  [2,] -0.53973803
##  [3,] -0.08609930
##  [4,] -0.04975737
##  [5,] -0.01238275
##  [6,] -0.97614224
##  [7,] -0.04273367
##  [8,]  0.50000000
##  [9,]  0.50000000
## [10,]  0.50000000
## [11,]  1.00000000
class(jomoint)
## [1] "data.frame"
dim(jomoint)
## [1] 4500   11
summary(jomoint)
##      weight            bili              age             chol       
##  Min.   : 38.56   Min.   :-0.2677   Min.   :20.00   Min.   : 2.373  
##  1st Qu.: 64.86   1st Qu.: 0.5661   1st Qu.:33.00   1st Qu.: 4.270  
##  Median : 78.02   Median : 0.7000   Median :48.00   Median : 4.990  
##  Mean   : 79.74   Mean   : 0.7473   Mean   :47.69   Mean   : 5.012  
##  3rd Qu.: 90.72   3rd Qu.: 0.9000   3rd Qu.:63.00   3rd Qu.: 5.640  
##  Max.   :176.90   Max.   : 4.1000   Max.   :79.00   Max.   :10.290  
##                   NA's   :126                       NA's   :117     
##       HDL             height           DBP            gender    
##  Min.   :0.1433   Min.   :1.241   Min.   : 14.00   male  :2322  
##  1st Qu.:1.1100   1st Qu.:1.600   1st Qu.: 64.00   female:2178  
##  Median :1.3400   Median :1.676   Median : 70.67                
##  Mean   :1.3802   Mean   :1.685   Mean   : 70.79                
##  3rd Qu.:1.6000   3rd Qu.:1.753   3rd Qu.: 78.00                
##  Max.   :4.0300   Max.   :2.091   Max.   :111.48                
##  NA's   :117      NA's   :20      NA's   :77                    
##                    educ            id          Imputation 
##  Less than 9th grade : 432   Min.   :  1.0   Min.   :0.0  
##  9-11th grade        : 690   1st Qu.:188.0   1st Qu.:1.0  
##  High school graduate: 936   Median :375.5   Median :2.5  
##  some college        :1260   Mean   :375.5   Mean   :2.5  
##  College or above    :1182   3rd Qu.:563.0   3rd Qu.:4.0  
##                              Max.   :750.0   Max.   :5.0  
## 

Task 5

jomo.lm() returns a data.frame in long format, including the original data. jomo has added a column id to identify subjects and a column Imputation that identifies the imputation number.

With this information and the flowchart from above, fit the linear regression model for weight on the imputed data and pool the results.

Solution 5

jomoint_mids <- as.mids(jomoint, 
                        .imp = which(names(jomoint) == "Imputation"),
                        .id = which(names(jomoint) == "id")
)

jomoint_mira <- with(jomoint_mids,
                     lm(weight ~ gender + bili + age * (chol + HDL) + height))

summary(pool(jomoint_mira))
##                     est          se         t        df     Pr(>|t|)
## (Intercept) -91.6593083 17.27408160 -5.306175 157.01377 3.755321e-07
## gender2       5.0936148  1.67349193  3.043704 277.73912 2.560799e-03
## bili         -7.7036956  1.97222280 -3.906098 262.46753 1.193767e-04
## age           0.4458178  0.18743140  2.378565 250.05704 1.813184e-02
## chol          6.0134518  1.96429820  3.061374  91.40218 2.892984e-03
## HDL         -25.2198440  5.00056997 -5.043394 169.48278 1.167590e-06
## height      102.9499357  7.49821968 13.729917 294.35181 0.000000e+00
## age:chol     -0.1235710  0.03851946 -3.208015  87.68336 1.866853e-03
## age:HDL       0.2186667  0.09501044  2.301502 179.59171 2.251136e-02
##                     lo 95        hi 95 nmis        fmi     lambda
## (Intercept) -125.77886386 -57.53975276   NA 0.14930391 0.13853654
## gender2        1.79927552   8.38795402   NA 0.09834833 0.09187883
## bili         -11.58708802  -3.82030327  126 0.10299295 0.09618371
## age            0.07667240   0.81496325    0 0.10702794 0.09991424
## chol           2.11184600   9.91505758  117 0.20961611 0.19250864
## HDL          -35.09126865 -15.34841939  117 0.14185114 0.13178386
## height        88.19301973 117.70685168   20 0.09364434 0.08750688
## age:chol      -0.20012422  -0.04701784   NA 0.21487793 0.19717175
## age:HDL        0.03118632   0.40614714   NA 0.13634877 0.12678409

Comparison of the results for the NHANES data

We have imputed the NHANES data using five different approaches:

  • with mice, using the Just Another Variable approach
  • with mice, using passive imputation of interactions
  • with JontAI
  • with smcfcs
  • with jomo

Here is how the pooled results from these imputations compare:

Longitudinal data

Data & Model of interest

In this part of the practical, we will work with data on a trial on primary biliary cirrhosis (PBC) of the liver.

To get the pbclong data, load the file pbclong.RData.

The variables contained in the dataset pbclong are:

id patient identifier
protime blood clotting time (the longitudinal outcome)
year continuously measured year of follow-up time (the time variable)
sex patients’ sex (f: female, m: male)
trt treatment group (0: D-penicillmain, 1: placebo)
age patients’ age at intake
ascites presence of ascites at baseline (0:no, 1:yes)
bili serum bilirubin level at baseline
chol serum cholesterol level at baseline
albumin serum albumin level at follow-up (time-varying)

The variables have the following distributions and proportions of missing values:

The missing data pattern is:

##     protime year sex trt age albumin id ascites bili chol     
## 839       1    1   1   1   1       1  1       1    1    1    0
## 136       1    1   1   1   1       1  1       0    1    1    1
## 328       1    1   1   1   1       1  1       1    0    1    1
## 245       1    1   1   1   1       1  1       1    1    0    1
##  27       1    1   1   1   1       1  1       0    0    1    2
## 186       1    1   1   1   1       1  1       0    1    0    2
## 129       1    1   1   1   1       1  1       1    0    0    2
##  55       1    1   1   1   1       1  1       0    0    0    3
##           0    0   0   0   0       0  0     404  539  615 1558


The longitudinal outcome protime shows relatively linear trajectories over time:

To analyse the trajectories of protime we want to use the following linear mixed effects model with random intercept and slope (either using lme from the package nlme or using lmer from the package lme4):

# using package nlme
lme(protime ~ year + sex + trt + age + ascites + bili + chol + albumin, random = ~year|id)


# using package lme4
lmer(protime ~ year + sex + trt + age + ascites + bili + chol + albumin + (year|id))

Imputation using mice

For imputation of longitudinal data, the mice package provides special imputation methods that take into account the two-level structure of the data. In the pbclong data missing values occur in baseline covariates only. For such variables, the imputation methods 2lonly.pmm and 2lonly.norm are available. The predictorMatrix requires some extra specification to identify the id variable (set to -2) and the random effects structure (set to 2).

Task 1

Run the setup imputation and perform the necessary changes in the imputation method and predictor matrix:

Solution 1

micelong0 <- mice(pbclong, maxit = 0)
meth_micelong <- micelong0$method
pred_micelong <- micelong0$predictorMatrix


meth_micelong[c("ascites", "bili", "chol")] <- "2lonly.pmm"


pred_micelong[, "id"] <- -2
pred_micelong[, "year"] <- 2

# check the predictor matrix
pred_micelong
##         protime year sex trt age ascites bili chol albumin id
## protime       0    2   0   0   0       0    0    0       0 -2
## year          0    2   0   0   0       0    0    0       0 -2
## sex           0    2   0   0   0       0    0    0       0 -2
## trt           0    2   0   0   0       0    0    0       0 -2
## age           0    2   0   0   0       0    0    0       0 -2
## ascites       1    2   1   1   1       0    1    1       1 -2
## bili          1    2   1   1   1       1    0    1       1 -2
## chol          1    2   1   1   1       1    1    0       1 -2
## albumin       0    2   0   0   0       0    0    0       0 -2
## id            0    2   0   0   0       0    0    0       0 -2

Task 2

Now run the imputation with the adapted imputation method meth_micelong and predictor matrix pred_micelong, analyse the imputed data and pool the results.

Solution 2

micelong <- mice(pbclong, meth = meth_micelong, pred = pred_micelong,
                 maxit = 20)
## 
##  iter imp variable
##   1   1  ascites  bili  chol
##   1   2  ascites  bili  chol
##   1   3  ascites  bili  chol
##   1   4  ascites  bili  chol
##   1   5  ascites  bili  chol
##   2   1  ascites  bili  chol
##   2   2  ascites  bili  chol
##   2   3  ascites  bili  chol
##   2   4  ascites  bili  chol
##   2   5  ascites  bili  chol
##   3   1  ascites  bili  chol
##   3   2  ascites  bili  chol
##   3   3  ascites  bili  chol
##   3   4  ascites  bili  chol
##   3   5  ascites  bili  chol
##   4   1  ascites  bili  chol
##   4   2  ascites  bili  chol
##   4   3  ascites  bili  chol
##   4   4  ascites  bili  chol
##   4   5  ascites  bili  chol
##   5   1  ascites  bili  chol
##   5   2  ascites  bili  chol
##   5   3  ascites  bili  chol
##   5   4  ascites  bili  chol
##   5   5  ascites  bili  chol
##   6   1  ascites  bili  chol
##   6   2  ascites  bili  chol
##   6   3  ascites  bili  chol
##   6   4  ascites  bili  chol
##   6   5  ascites  bili  chol
##   7   1  ascites  bili  chol
##   7   2  ascites  bili  chol
##   7   3  ascites  bili  chol
##   7   4  ascites  bili  chol
##   7   5  ascites  bili  chol
##   8   1  ascites  bili  chol
##   8   2  ascites  bili  chol
##   8   3  ascites  bili  chol
##   8   4  ascites  bili  chol
##   8   5  ascites  bili  chol
##   9   1  ascites  bili  chol
##   9   2  ascites  bili  chol
##   9   3  ascites  bili  chol
##   9   4  ascites  bili  chol
##   9   5  ascites  bili  chol
##   10   1  ascites  bili  chol
##   10   2  ascites  bili  chol
##   10   3  ascites  bili  chol
##   10   4  ascites  bili  chol
##   10   5  ascites  bili  chol
##   11   1  ascites  bili  chol
##   11   2  ascites  bili  chol
##   11   3  ascites  bili  chol
##   11   4  ascites  bili  chol
##   11   5  ascites  bili  chol
##   12   1  ascites  bili  chol
##   12   2  ascites  bili  chol
##   12   3  ascites  bili  chol
##   12   4  ascites  bili  chol
##   12   5  ascites  bili  chol
##   13   1  ascites  bili  chol
##   13   2  ascites  bili  chol
##   13   3  ascites  bili  chol
##   13   4  ascites  bili  chol
##   13   5  ascites  bili  chol
##   14   1  ascites  bili  chol
##   14   2  ascites  bili  chol
##   14   3  ascites  bili  chol
##   14   4  ascites  bili  chol
##   14   5  ascites  bili  chol
##   15   1  ascites  bili  chol
##   15   2  ascites  bili  chol
##   15   3  ascites  bili  chol
##   15   4  ascites  bili  chol
##   15   5  ascites  bili  chol
##   16   1  ascites  bili  chol
##   16   2  ascites  bili  chol
##   16   3  ascites  bili  chol
##   16   4  ascites  bili  chol
##   16   5  ascites  bili  chol
##   17   1  ascites  bili  chol
##   17   2  ascites  bili  chol
##   17   3  ascites  bili  chol
##   17   4  ascites  bili  chol
##   17   5  ascites  bili  chol
##   18   1  ascites  bili  chol
##   18   2  ascites  bili  chol
##   18   3  ascites  bili  chol
##   18   4  ascites  bili  chol
##   18   5  ascites  bili  chol
##   19   1  ascites  bili  chol
##   19   2  ascites  bili  chol
##   19   3  ascites  bili  chol
##   19   4  ascites  bili  chol
##   19   5  ascites  bili  chol
##   20   1  ascites  bili  chol
##   20   2  ascites  bili  chol
##   20   3  ascites  bili  chol
##   20   4  ascites  bili  chol
##   20   5  ascites  bili  chol
micelong_mira <- with(micelong, lme(protime ~ year + sex + trt + age + 
                                      ascites + bili + chol + albumin,
                                    random = ~year|id)
)

# alternative:
micelong_mira <- with(micelong, lmer(protime ~ year + sex + trt + age + 
                                      ascites + bili + chol + albumin + 
                                    (year|id))
)

summary(pool(micelong_mira))
##                       est           se          t         df     Pr(>|t|)
## (Intercept) 12.1440772490 0.4331159193 28.0388614  260.86694 0.000000e+00
## year         0.1777262252 0.0252284464  7.0446758 1824.06668 2.624567e-12
## sex2        -0.3003235708 0.1540701831 -1.9492647   70.13239 5.526383e-02
## trt2        -0.0458308495 0.0917002778 -0.4997896  233.67788 6.176935e-01
## age          0.0009080765 0.0044242268  0.2052509  840.74221 8.374258e-01
## ascites2     0.6333646181 0.1015949611  6.2342129  172.04971 3.397080e-09
## bili         0.1119195116 0.0209596387  5.3397634   16.39779 6.113449e-05
## chol        -0.0003241279 0.0004062786 -0.7977973   16.27452 4.364749e-01
## albumin     -0.4723209273 0.0705680451 -6.6931276 1171.57563 3.379053e-11
##                    lo 95         hi 95 nmis        fmi      lambda
## (Intercept) 11.291228951 12.9969255470   NA 0.12070812 0.113992555
## year         0.128246547  0.2272059035    0 0.01105802 0.009974288
## sex2        -0.607596664  0.0069495227   NA 0.25406649 0.233093390
## trt2        -0.226495780  0.1348340814   NA 0.12890807 0.121484350
## age         -0.007775750  0.0095919029    0 0.05298855 0.050738423
## ascites2     0.432831594  0.8338976420   NA 0.15409215 0.144315678
## bili         0.067574503  0.1562645198  539 0.54237586 0.489768699
## chol        -0.001184221  0.0005359649  615 0.54438595 0.491636116
## albumin     -0.610774790 -0.3338670650    0 0.03721226 0.035570083

Imputation using JointAI

To analyse incomplete longitudinal data a linear mixed model, the function lme_imp() can be used. Its specification of the major model components is analogous to the function lme() from the nlme package.

For the specification of the other arguments of lme_imp(), refer to the help page or to the details given in the previous part of this practical.

Task

Run the imputation (start with n.iter = 500; this will take a few seconds) and check the traceplot(). If you are satisfied by convergence and mixing of the chains, get the model summary().

Solution

library(JointAI)
JointAIlong <- lme_imp(protime ~ year + sex + trt + age + ascites + bili + chol + albumin,
    random = ~year|id, data = pbclong, n.iter = 1000)
## This is new software. Please report any bug to the package maintainer.
traceplot(JointAIlong)

summary(JointAIlong)
## 
##  Linear mixed model fitted with JointAI 
## 
## Call:
## lme_imp(fixed = protime ~ year + sex + trt + age + ascites + 
##     bili + chol + albumin, data = pbclong, random = ~year | id, 
##     n.iter = 1000)
## 
## Posterior summary:
##                   Mean        SD      2.5%      97.5% tail-prob.
## (Intercept) 12.4596335 0.4655691 11.521611 13.3400223    0.00000
## sexf        -0.3730065 0.1662976 -0.725929 -0.0577856    0.01333
## trt1        -0.0436455 0.0942028 -0.224690  0.1365296    0.66467
## age         -0.0006715 0.0048848 -0.010225  0.0090504    0.89267
## ascites1     0.6352674 0.1116524  0.417236  0.8459319    0.00000
## chol        -0.0004723 0.0004191 -0.001282  0.0003174    0.26800
## bili         0.1027571 0.0174612  0.067687  0.1372530    0.00000
## year         0.1802536 0.0248264  0.132637  0.2287294    0.00000
## albumin     -0.4977301 0.0695397 -0.632876 -0.3607458    0.00000
## 
## Posterior summary of random effects covariance matrix:
##          Mean     SD   2.5%  97.5%
## D[1,1] 0.7605 0.1348 0.5267 1.0513
## D[1,2] 0.6512 0.1185 0.4429 0.9044
## D[2,2] 0.8643 0.1337 0.6325 1.1560
## 
## Posterior summary of residual std. deviation:
##                Mean      SD  2.5% 97.5%
## sigma_protime 1.056 0.02049 1.018 1.098
## 
## 
## MCMC settings:
## Iterations = 101:1100
## Sample size per chain = 1000 
## Thinning interval = 1 
## Number of chains = 3 
## 
## Number of observations: 1945 
## Number of groups: 312

Imputation using jomo

The package jomo provides functionality to impute normal and non-normal longitudinal outcomes using the functions jomo.lmer() and jomo.glmer(), which use the model specification of the complete data versions lmer() and glmer() from the package lme4.

Different to the other jomo.<...>() functions is that here the level of each variable in the dataset (1 for repeated measurements, 2 for baseline covariates) needs to be specified.

Refer to the help page of jomo.lmer() or the details provided in the previous part of this practical to get more information on the other parameters that need to be specified.

Note:
In the current version of jomo the random intercept needs to be specified explicitely, i.e. (1 + year|id) instead of (year|id).


Task 1

We skip the step to determine the number of iterations necessary in the burn-in phase (you would have to use jomo.lmer.MCMCchain() to do this step) and set nburn = 6000 and nbetween = 100. This will take approximately 2 minutes to run. (To save time, you may use less iterations.)

Solution 1

# specify the levels of the variables
lvl_jomolong <- c(protime = 1, year = 1, sex = 2, trt = 2, age = 2, ascites = 2,
                  bili = 2, chol = 2, albumin = 1, id = 1)

t0 <- Sys.time()
jomolong <- jomo.lmer(protime ~ year + sex + trt + age + ascites +
                                      bili + chol + albumin + (1 + year|id),
                                    data = pbclong,
                                    level = lvl_jomolong, meth = "common", output = 0,
                                    nburn = 6000, nbetween = 100)
## This function is beta software. Use carefully and please report any bug to the package mantainer
t1 <- Sys.time()
t1 - t0
## Time difference of 1.31898 mins

Continue

When we try to convert the long format dataset returned by jomo.lmer() to a mids object using as.mids() (like we did in the previous part of this practical), we get an error message: Error in row.names<-.data.frame(*tmp*, value = value) : duplicate 'row.names' are not allowed. This is because as.mids() wants to assign the id variable as rownames. Since we have longitudinal data, where the same id occurs more than once, this leads to an error.

Instead, we use the function datalist2mids() from the miceadds package.

Task 2

Analyse the imputed data and pool the results, using the following steps:

  1. split jomolong by imputation number into a list of imputed datasets
  2. exclude the first element, which contains the original data
  3. convert to a mids object using datalist2mids()
  4. analyse the mids object
  5. pool the results

To split a dataset into a list, use split().


Note:
jomo.lmer() changes the id variable to clus. When analysing the imputed data, you need to use clus in the random effects specification.

Solution 2

library(miceadds)
jomolong_mids <- datalist2mids(
  split(jomolong, jomolong$Imputation)[-1] # split and remove first element
)

# analyse using nlme package:
jomolong_mira <- with(jomolong_mids, lme(protime ~ year + sex + trt + age + ascites +
                                     bili + chol + albumin,
                                   random = ~ year|clus))

# alternative: lme4 package
jomolong_mira <- with(jomolong_mids, lmer(protime ~ year + sex + trt + age + ascites +
                                            bili + chol + albumin + (year|clus))
)

summary(pool(jomolong_mira))
##                       est           se          t         df     Pr(>|t|)
## (Intercept) 12.1211813665 0.4683214474 25.8821829   86.04087 0.000000e+00
## year         0.1804080015 0.0254432167  7.0906129 1754.58297 1.928013e-12
## sex2        -0.3837352619 0.1611171541 -2.3817157   43.61502 2.166261e-02
## trt2        -0.0254703288 0.0906932040 -0.2808405  474.21798 7.789552e-01
## age          0.0016644122 0.0050221796  0.3314123   65.55242 7.413902e-01
## ascites2     0.6231743090 0.1027405410  6.0655152  169.94019 8.285254e-09
## bili         0.1026290954 0.0187968019  5.4599232   16.76121 4.451162e-05
## chol        -0.0002251263 0.0004556045 -0.4941266   10.75881 6.311500e-01
## albumin     -0.4634483540 0.0703617867 -6.5866485 1481.53788 6.231415e-11
##                    lo 95         hi 95 nmis        fmi     lambda
## (Intercept) 11.190195451 13.0521672819   NA 0.22720375 0.20944667
## year         0.130505790  0.2303102135    0 0.01450538 0.01338268
## sex2        -0.708526522 -0.0589440022   NA 0.32804759 0.29792534
## trt2        -0.203680575  0.1527399170   NA 0.08250923 0.07864788
## age         -0.008363971  0.0116927954    0 0.26355962 0.24142848
## ascites2     0.420362253  0.8259863653   NA 0.15518682 0.14530250
## bili         0.062928228  0.1423299630  539 0.53656759 0.48438282
## chol        -0.001230655  0.0007804021  615 0.66277850 0.60542209
## albumin     -0.601467677 -0.3254290309    0 0.02532280 0.02400792

Comparison of the results for the longitudinal pbc data

We have imputed the longitudinal pbc data using three different approaches:

  • with mice, using the 2-level imputation methods 2lonly.norm and 2lonly.pmm
  • with JointAI
  • with jomo

Here is how the pooled results from these imputations compare:

Survival data

Data & Model of interest

In this part of the practical, we will work with a different subset of the PBC data, omitting the longitudinal structure and focusing on the survival component.

To get the pbcdat data, load the file pbcdat.RData.

The variables contained in this subset (pbcdat) are:
time number of years between inclusion and death, transplantion, or end of follow-up
status status at time (censored, transplant, dead)
age patient’s age at intake
sex patient’s sex
platelet platelet count
chol serum cholesterol
stage histologic stage of disease

The variables in pbcdat dataset have the following distributions:

The missing data pattern is

##     time status age sex stage platelet chol    
## 280    1      1   1   1     1        1    1   0
##   4    1      1   1   1     1        0    1   1
## 121    1      1   1   1     1        1    0   1
##   7    1      1   1   1     1        0    0   2
##   6    1      1   1   1     0        1    0   2
##        0      0   0   0     6       11  134 151

We are interested to determine predictor variables for patient survival, using the following Cox proportional hazards model:

coxph(Surv(time, status == 'dead') ~ platelet + age + sex + chol + stage)

Imputation with mice

As we have seen in the lecture, the mice package provides the function nelsonaalen() that calculates the Nelson-Aalen estimator with which the cumulative Hazard can be approximated.

Task 1

Calculate the Nelson-Aalen estimate for patient survival in the pbc data and perform the usual setup steps for imputation using mice(). nelsonaalen() does not accept the status == 'dead' specification of the event, hence, we need to create a new event indicator event.


Note:
nelsonaalen() uses the name status internally in such a way that it is replaced by the covariate status of our dataset pbcdat, which leads to an error. Since we do not need the column status for calculating the Nelson-Aalen estimate (we use event instead), we can exclude it from the data for just this step. To do this, use data = subset(pbcdat, select = -c(status)) inside the nelsonaalen() function.

Solution 1

pbcdat$event <- as.numeric(pbcdat$status == 'dead')

pbcdat$na <- nelsonaalen(data = subset(pbcdat, select = -c(status)),
                         timevar = "time", 
                         statusvar = "event")

micesurv0 <- mice(pbcdat, maxit = 0)
micesurvmeth <- micesurv0$meth
micesurvpred <- micesurv0$pred

micesurvmeth[c("chol")] <- "midastouch"
micesurvmeth[c("platelet")] <- "norm"

micesurvpred[, "event"] <- 0

# check the method and predictorMatrix
micesurvmeth
##         time       status     platelet          age          sex 
##           ""           ""       "norm"           ""           "" 
##         chol        stage        event           na 
## "midastouch"       "polr"           ""           ""
micesurvpred
##          time status platelet age sex chol stage event na
## time        0      0        0   0   0    0     0     0  0
## status      0      0        0   0   0    0     0     0  0
## platelet    1      1        0   1   1    1     1     0  1
## age         0      0        0   0   0    0     0     0  0
## sex         0      0        0   0   0    0     0     0  0
## chol        1      1        1   1   1    0     1     0  1
## stage       1      1        1   1   1    1     0     0  1
## event       0      0        0   0   0    0     0     0  0
## na          0      0        0   0   0    0     0     0  0

Task 2

Now run the imputation, and analyse the imputed data.

Solution 2

micesurv <- mice(pbcdat, predictorMatrix = micesurvpred, maxit = 10, m = 5)
## 
##  iter imp variable
##   1   1  platelet  chol  stage
##   1   2  platelet  chol  stage
##   1   3  platelet  chol  stage
##   1   4  platelet  chol  stage
##   1   5  platelet  chol  stage
##   2   1  platelet  chol  stage
##   2   2  platelet  chol  stage
##   2   3  platelet  chol  stage
##   2   4  platelet  chol  stage
##   2   5  platelet  chol  stage
##   3   1  platelet  chol  stage
##   3   2  platelet  chol  stage
##   3   3  platelet  chol  stage
##   3   4  platelet  chol  stage
##   3   5  platelet  chol  stage
##   4   1  platelet  chol  stage
##   4   2  platelet  chol  stage
##   4   3  platelet  chol  stage
##   4   4  platelet  chol  stage
##   4   5  platelet  chol  stage
##   5   1  platelet  chol  stage
##   5   2  platelet  chol  stage
##   5   3  platelet  chol  stage
##   5   4  platelet  chol  stage
##   5   5  platelet  chol  stage
##   6   1  platelet  chol  stage
##   6   2  platelet  chol  stage
##   6   3  platelet  chol  stage
##   6   4  platelet  chol  stage
##   6   5  platelet  chol  stage
##   7   1  platelet  chol  stage
##   7   2  platelet  chol  stage
##   7   3  platelet  chol  stage
##   7   4  platelet  chol  stage
##   7   5  platelet  chol  stage
##   8   1  platelet  chol  stage
##   8   2  platelet  chol  stage
##   8   3  platelet  chol  stage
##   8   4  platelet  chol  stage
##   8   5  platelet  chol  stage
##   9   1  platelet  chol  stage
##   9   2  platelet  chol  stage
##   9   3  platelet  chol  stage
##   9   4  platelet  chol  stage
##   9   5  platelet  chol  stage
##   10   1  platelet  chol  stage
##   10   2  platelet  chol  stage
##   10   3  platelet  chol  stage
##   10   4  platelet  chol  stage
##   10   5  platelet  chol  stage
micesurv_mira <- with(micesurv, coxph(Surv(time = time, event) ~ platelet + 
                               age + sex + chol + stage))

summary(pool(micesurv_mira))
## Warning in mice.df(m, lambda, dfcom, method): Large sample assumed.
##                   est           se          t          df     Pr(>|t|)
## platelet -0.001735838 0.0009212356 -1.8842500    288.7869 6.053549e-02
## age       0.031921824 0.0082122636  3.8870919   8613.1672 1.022137e-04
## sex2     -0.087872239 0.2362019158 -0.3720217   1382.9532 7.099337e-01
## chol      0.001712381 0.0002908418  5.8876719    185.3976 1.804560e-08
## stage2    0.822266919 0.7438101615  1.1054795  34380.0625 2.689596e-01
## stage3    1.215504072 0.7297672590  1.6656051  47202.4012 9.579873e-02
## stage4    2.174046613 0.7223116486  3.0098457 436715.8270 2.613953e-03
##                 lo 95        hi 95 nmis         fmi      lambda
## platelet -0.003549026 7.734929e-05   11 0.123718956 0.117671195
## age       0.015823820 4.801983e-02    0 0.021682153 0.021455012
## sex2     -0.551225009 3.754805e-01   NA 0.055106860 0.053741362
## chol      0.001138596 2.286166e-03  134 0.155925928 0.146869221
## stage2   -0.635625534 2.280159e+00   NA 0.010654893 0.010597341
## stage3   -0.214850150 2.645858e+00   NA 0.009025594 0.008983606
## stage4    0.758337873 3.589755e+00   NA 0.002273966 0.002269397

Again, we remove the added column from the dataset to use the original data in the subsequent imputations.

pbcdat <- subset(pbcdat, select = -c(na))

Imputation with smcfcs

An alternative approach, that aims to remove the bias introduced by using the Nelson-Aalen estimator to approximate the cumulative hazard is implemented in the package smcfcs.

If you are uncertain which parameters need to be specified, check the help page for the function smcfcs() or go back to the first part of this practical where some of the details about the specification were described.

Task 1

Impute the pbcdat data with smcfcs() and the model formula specified above. We will skip the confirmation of convergence in the interest of time.

smcfcs() does can not handle the event specification used in the complete data model above (status == 'dead'), so you need to use event instead.

Solution 1

# specify imputation method (in order of variables in dataset)
meth_smcsurv <- c(time = "", status = "", platelet = "norm",
                  age = "", sex = "", chol = "norm", stage = "podds",
                  event = "")

# impute the data
smcsurv <- smcfcs(originaldata = pbcdat,
                  smtype = "coxph", 
                  smformula = "Surv(time = time, event) ~ platelet + 
                               age + sex + chol + stage",
                  method = meth_smcsurv, numit = 20)
## [1] "Outcome variable(s): time,event"
## [1] "Passive variables: "
## [1] "Partially obs. variables: platelet,chol,stage"
## [1] "Fully obs. substantive model variables: age,sex"
## [1] "Imputation  1"
## [1] "Imputing:  platelet  using  chol,stage,age,sex  plus outcome"
## [1] "Imputing:  chol  using  platelet,stage,age,sex  plus outcome"
## [1] "Imputing:  stage  using  platelet,chol,age,sex  plus outcome"
## [1] "Imputation  2"
## [1] "Imputation  3"
## [1] "Imputation  4"
## [1] "Imputation  5"

Task 2

Assuming the algorithm has converged (it has for the settings chosen in the solution), fit the Cox model and pool the results.

Solution 2

# create mids object or imputationList object
smcsurv_impList <- mitools::imputationList(smcsurv$impDatasets)

opt <- getOption("contrasts")
# this specifies that ordinal factors are treated as unordered factors when covariate in a model
options(contrasts = rep("contr.treatment", 2))

smcsurv_models <- with(smcsurv_impList, coxph(Surv(time = time, event) ~ platelet + 
                               age + sex + chol + stage))
# re-set the option for factors
options(contrasts = opt)

smcsurv_mira <- as.mira(smcsurv_models)

summary(pool(smcsurv_mira))
## Warning in mice.df(m, lambda, dfcom, method): Large sample assumed.
##                   est           se          t         df     Pr(>|t|)
## platelet -0.001845292 0.0009359963 -1.9714741  224.87336 0.0498958048
## age       0.029968160 0.0082623858  3.6270589 1930.49730 0.0002941095
## sexf     -0.164311796 0.2346507399 -0.7002398 1278.55849 0.4839049319
## chol      0.001487577 0.0004008993  3.7105997   22.15989 0.0012070539
## stage2    0.793470917 0.7325710678  1.0831317 2261.56145 0.2788654514
## stage3    1.180754737 0.7209558146  1.6377630 1653.17146 0.1016614558
## stage4    2.124575481 0.7118224306  2.9846987 2844.71708 0.0028627479
##                  lo 95         hi 95 nmis        fmi     lambda
## platelet -0.0036897381 -8.468001e-07   NA 0.14095996 0.13335357
## age       0.0137640219  4.617230e-02   NA 0.04646059 0.04547323
## sexf     -0.6246545774  2.960310e-01   NA 0.05736869 0.05589532
## chol      0.0006565102  2.318643e-03   NA 0.47057154 0.42485212
## stage2   -0.6431108286  2.230053e+00   NA 0.04285219 0.04200612
## stage3   -0.2333279972  2.594837e+00   NA 0.05029482 0.04914657
## stage4    0.7288353014  3.520316e+00   NA 0.03811877 0.03744275

Imputation with jomo

To impute the pbc survival data with the package jomo we use the function jomo.coxph(), which works analogously to the other jomo.<...>() functions we have seen.

Task

Perform the imputation, analyse the imputed data and pool the results. Again, we will skip the convergence check to save some time. Use nburn = 1000 and nbetween = 500. As in the previous imputation approaches, we use the new event indicator event instead of status == 'dead'. Since jomo considers all extra columns as auxiliary variables, we need to exclude status to prevent multicollinearity issues.

To exclude the variable status use data = subset(pbcdat, select = -status).

To run the analysis on the imputed data, convert the long format data to a mids object, like we did in the NHANES example.

Solution

jomosurv <- jomo.coxph(formula = Surv(time = time, event) ~ platelet + 
                         age + sex + chol + stage,
                       data = subset(pbcdat, select = -status), 
                       nburn = 1000, nbetween = 500, nimp = 5)
## This function is beta software. Use carefully and please report any bug to the package mantainer
## ....................................................................................................First imputation registered. 
## ..................................................Imputation number  2 registered 
## ..................................................Imputation number  3 registered 
## ..................................................Imputation number  4 registered 
## ..................................................Imputation number  5 registered 
## The posterior mean of the substantive model fixed effects estimates is:
##              [,1]       [,2]       [,3]        [,4]      [,5]     [,6]
## [1,] -0.001714145 0.03118608 -0.1500963 0.001628265 0.8112937 1.171847
##         [,7]
## [1,] 2.14595
## The posterior mean of the fixed effects estimates is:
##          [,1]     [,2]     [,3]      [,4]       [,5]       [,6]       [,7]
## [1,] 257.5485 50.74662 362.9459 -1.240514 -0.8258932 -0.2330519 0.04388935
## The posterior mean of the level 1 covariance matrix is:
##             [,1]        [,2]         [,3]         [,4]         [,5]
## [1,] 10064.71424 -160.422961  4387.388754 -15.00747951  28.51097587
## [2,]  -160.42296  113.474201  -418.918982   2.77597525  -2.82686212
## [3,]  4387.38875 -418.918982 57296.144073  -2.92997690 -65.81701667
## [4,]   -15.00748    2.775975    -2.929977   1.00000000   0.04727537
## [5,]    28.51098   -2.826862   -65.817017   0.04727537   1.00000000
## [6,]    33.73793   -2.301642    11.885783  -0.09646215   0.50000000
## [7,]    22.78600   -2.778327    38.757157  -0.05911280   0.50000000
##             [,6]       [,7]
## [1,] 33.73792518 22.7860002
## [2,] -2.30164238 -2.7783268
## [3,] 11.88578283 38.7571566
## [4,] -0.09646215 -0.0591128
## [5,]  0.50000000  0.5000000
## [6,]  1.00000000  0.5000000
## [7,]  0.50000000  1.0000000
jomosurv_mids <- as.mids(jomosurv,
                         .id = which(names(jomosurv) == "id"),
                         .imp = which(names(jomosurv) == "Imputation"))


jomosurv_mira <- with(jomosurv_mids, coxph(Surv(time, event) ~ platelet + 
                                             age + sex + chol + stage))


summary(pool(jomosurv_mira))
## Warning in mice.df(m, lambda, dfcom, method): Large sample assumed.
##                   est           se          t         df     Pr(>|t|)
## platelet -0.001812719 0.0009026999 -2.0081077   938.7056 4.491794e-02
## age       0.031359505 0.0083208859  3.7687700  1771.2879 1.694271e-04
## sex2     -0.143246361 0.2320142519 -0.6174033  3734.5524 5.370064e-01
## chol      0.001565833 0.0003282181  4.7707085   114.8626 5.440550e-06
## stage2    0.828615944 0.7556223446  1.0966006  2573.6901 2.729186e-01
## stage3    1.203302917 0.7349855006  1.6371791 10311.2249 1.016236e-01
## stage4    2.145234890 0.7315858524  2.9323078  6481.3797 3.376304e-03
##                  lo 95         hi 95 nmis        fmi     lambda
## platelet -0.0035842619 -4.117517e-05   11 0.06723022 0.06524498
## age       0.0150397167  4.767929e-02    0 0.04855048 0.04747678
## sex2     -0.5981333660  3.116406e-01   NA 0.03318174 0.03266410
## chol      0.0009156876  2.215978e-03  134 0.20040179 0.18659927
## stage2   -0.6530734463  2.310305e+00   NA 0.04011598 0.03937035
## stage3   -0.2374113090  2.644017e+00   NA 0.01978212 0.01959201
## stage4    0.7110851482  3.579385e+00   NA 0.02506066 0.02475987

Comparison of the results for the pbc survival data

We have imputed the pbc survival data using three different approaches:

  • with mice, using the Nelson-Aalen approximation
  • with smcfcs
  • with jomo

Here is how the pooled results from these imputations compare:

You have reached the end of this practical. Well done!