The focus of this practical is the imputation of data that has features that require special attention. In the interest of time, we will focus on these features and abbreviate steps that are the same as in any imputation setting (e.g., getting to know the data or checking that imputed values are realistic). Nevertheless, these steps are of course required when analysing data in practice.
To practice imputation when non-linear functional forms or interaction terms are involved, we use a subset of the NHANES data.
To get the NHANES data, load the file NHANES_2.RData
.
The variables in this subset have the following distributions and proportions of missing values:
The missing data pattern is
## weight gender age educ height DBP chol HDL bili
## 547 1 1 1 1 1 1 1 1 1 0
## 8 1 1 1 1 1 1 1 1 0 1
## 59 1 1 1 1 1 0 1 1 1 1
## 13 1 1 1 1 0 1 1 1 1 1
## 1 1 1 1 1 1 0 1 1 0 2
## 5 1 1 1 1 0 0 1 1 1 2
## 104 1 1 1 1 1 1 0 0 0 3
## 11 1 1 1 1 1 0 0 0 0 4
## 1 1 1 1 1 0 1 0 0 0 4
## 1 1 1 1 1 0 0 0 0 0 5
## 0 0 0 0 20 77 117 117 126 457
Our aim is to fit the following linear regression model for weight:
mod <- lm(weight ~ gender + bili + age * (chol + HDL) + height)
We expect that the effects of cholesterol and HDL may differ with age, and, hence, include interaction terms between age
and chol
and HDL
, respectively. Additionally, we want to include educ
and DBP
as auxiliary variables.
When the analysis model of interest involves interaction terms between incomplete variables, mice has limited options to reduce the bias that may be introduced by naive handling of the missing values.
Use of the “Just Another Variable” approach can in some settings reduce bias. Alternatively, we can use passive imputation, i.e., calculate the interaction terms in each iteration of the MICE algorithm. Furthermore, predictive mean matching tends to lead to less bias than normal imputation models.
Calculate the interaction terms in the incomplete data and perform the imputation with mice()
, using the JAV approach and using passive imputation. Follow the usual procedure of doing a setup run and adjusting the imputation method and predictor matrix.
# calculate the interaction terms
NHANES$agechol <- NHANES$age * NHANES$chol
NHANES$ageHDL <- NHANES$age * NHANES$HDL
# setup run
miceint0 <- mice(NHANES, maxit = 0)
meth_miceint <- miceint0$method
pred_miceint <- miceint0$predictorMatrix
# change the imputation method:
meth_miceint["DBP"] <- "norm"
# replace all "pmm" with "midastouch"
meth_miceint <- gsub("pmm", "midastouch", meth_miceint)
# changes in predictor matrix to prevent that the original variables are
# imputed based on the interaction terms
pred_miceint["chol", "agechol"] <- 0
pred_miceint["HDL", "ageHDL"] <- 0
# changes in imputation method for passive imputation only
meth_miceintPAS <- meth_miceint
meth_miceintPAS[c("agechol", "ageHDL")] <- c("~I(age*chol)", "~I(age*HDL)")
# run imputation with the JAV approach
miceintJAV <- mice(NHANES, method = meth_miceint, predictorMatrix = pred_miceint,
maxit = 10, m = 5)
##
## iter imp variable
## 1 1 bili chol HDL DBP height agechol ageHDL
## 1 2 bili chol HDL DBP height agechol ageHDL
## 1 3 bili chol HDL DBP height agechol ageHDL
## 1 4 bili chol HDL DBP height agechol ageHDL
## 1 5 bili chol HDL DBP height agechol ageHDL
## 2 1 bili chol HDL DBP height agechol ageHDL
## 2 2 bili chol HDL DBP height agechol ageHDL
## 2 3 bili chol HDL DBP height agechol ageHDL
## 2 4 bili chol HDL DBP height agechol ageHDL
## 2 5 bili chol HDL DBP height agechol ageHDL
## 3 1 bili chol HDL DBP height agechol ageHDL
## 3 2 bili chol HDL DBP height agechol ageHDL
## 3 3 bili chol HDL DBP height agechol ageHDL
## 3 4 bili chol HDL DBP height agechol ageHDL
## 3 5 bili chol HDL DBP height agechol ageHDL
## 4 1 bili chol HDL DBP height agechol ageHDL
## 4 2 bili chol HDL DBP height agechol ageHDL
## 4 3 bili chol HDL DBP height agechol ageHDL
## 4 4 bili chol HDL DBP height agechol ageHDL
## 4 5 bili chol HDL DBP height agechol ageHDL
## 5 1 bili chol HDL DBP height agechol ageHDL
## 5 2 bili chol HDL DBP height agechol ageHDL
## 5 3 bili chol HDL DBP height agechol ageHDL
## 5 4 bili chol HDL DBP height agechol ageHDL
## 5 5 bili chol HDL DBP height agechol ageHDL
## 6 1 bili chol HDL DBP height agechol ageHDL
## 6 2 bili chol HDL DBP height agechol ageHDL
## 6 3 bili chol HDL DBP height agechol ageHDL
## 6 4 bili chol HDL DBP height agechol ageHDL
## 6 5 bili chol HDL DBP height agechol ageHDL
## 7 1 bili chol HDL DBP height agechol ageHDL
## 7 2 bili chol HDL DBP height agechol ageHDL
## 7 3 bili chol HDL DBP height agechol ageHDL
## 7 4 bili chol HDL DBP height agechol ageHDL
## 7 5 bili chol HDL DBP height agechol ageHDL
## 8 1 bili chol HDL DBP height agechol ageHDL
## 8 2 bili chol HDL DBP height agechol ageHDL
## 8 3 bili chol HDL DBP height agechol ageHDL
## 8 4 bili chol HDL DBP height agechol ageHDL
## 8 5 bili chol HDL DBP height agechol ageHDL
## 9 1 bili chol HDL DBP height agechol ageHDL
## 9 2 bili chol HDL DBP height agechol ageHDL
## 9 3 bili chol HDL DBP height agechol ageHDL
## 9 4 bili chol HDL DBP height agechol ageHDL
## 9 5 bili chol HDL DBP height agechol ageHDL
## 10 1 bili chol HDL DBP height agechol ageHDL
## 10 2 bili chol HDL DBP height agechol ageHDL
## 10 3 bili chol HDL DBP height agechol ageHDL
## 10 4 bili chol HDL DBP height agechol ageHDL
## 10 5 bili chol HDL DBP height agechol ageHDL
# run imputation with passive imputation
miceintPAS <- mice(NHANES, method = meth_miceintPAS, predictorMatrix = pred_miceint,
maxit = 10, m = 5)
##
## iter imp variable
## 1 1 bili chol HDL DBP height agechol ageHDL
## 1 2 bili chol HDL DBP height agechol ageHDL
## 1 3 bili chol HDL DBP height agechol ageHDL
## 1 4 bili chol HDL DBP height agechol ageHDL
## 1 5 bili chol HDL DBP height agechol ageHDL
## 2 1 bili chol HDL DBP height agechol ageHDL
## 2 2 bili chol HDL DBP height agechol ageHDL
## 2 3 bili chol HDL DBP height agechol ageHDL
## 2 4 bili chol HDL DBP height agechol ageHDL
## 2 5 bili chol HDL DBP height agechol ageHDL
## 3 1 bili chol HDL DBP height agechol ageHDL
## 3 2 bili chol HDL DBP height agechol ageHDL
## 3 3 bili chol HDL DBP height agechol ageHDL
## 3 4 bili chol HDL DBP height agechol ageHDL
## 3 5 bili chol HDL DBP height agechol ageHDL
## 4 1 bili chol HDL DBP height agechol ageHDL
## 4 2 bili chol HDL DBP height agechol ageHDL
## 4 3 bili chol HDL DBP height agechol ageHDL
## 4 4 bili chol HDL DBP height agechol ageHDL
## 4 5 bili chol HDL DBP height agechol ageHDL
## 5 1 bili chol HDL DBP height agechol ageHDL
## 5 2 bili chol HDL DBP height agechol ageHDL
## 5 3 bili chol HDL DBP height agechol ageHDL
## 5 4 bili chol HDL DBP height agechol ageHDL
## 5 5 bili chol HDL DBP height agechol ageHDL
## 6 1 bili chol HDL DBP height agechol ageHDL
## 6 2 bili chol HDL DBP height agechol ageHDL
## 6 3 bili chol HDL DBP height agechol ageHDL
## 6 4 bili chol HDL DBP height agechol ageHDL
## 6 5 bili chol HDL DBP height agechol ageHDL
## 7 1 bili chol HDL DBP height agechol ageHDL
## 7 2 bili chol HDL DBP height agechol ageHDL
## 7 3 bili chol HDL DBP height agechol ageHDL
## 7 4 bili chol HDL DBP height agechol ageHDL
## 7 5 bili chol HDL DBP height agechol ageHDL
## 8 1 bili chol HDL DBP height agechol ageHDL
## 8 2 bili chol HDL DBP height agechol ageHDL
## 8 3 bili chol HDL DBP height agechol ageHDL
## 8 4 bili chol HDL DBP height agechol ageHDL
## 8 5 bili chol HDL DBP height agechol ageHDL
## 9 1 bili chol HDL DBP height agechol ageHDL
## 9 2 bili chol HDL DBP height agechol ageHDL
## 9 3 bili chol HDL DBP height agechol ageHDL
## 9 4 bili chol HDL DBP height agechol ageHDL
## 9 5 bili chol HDL DBP height agechol ageHDL
## 10 1 bili chol HDL DBP height agechol ageHDL
## 10 2 bili chol HDL DBP height agechol ageHDL
## 10 3 bili chol HDL DBP height agechol ageHDL
## 10 4 bili chol HDL DBP height agechol ageHDL
## 10 5 bili chol HDL DBP height agechol ageHDL
We skip the evaluation of convergence and investigation of the imputed values. With the settings given in the solution the chains have converged and distributions of the imputed values match the distributions of the observed data close enough. Omitting the changes in the predictorMatrix
would have resulted in convergence problems for HDL
, ageHDL
, chol
and agechol
.
Analyse the imputed data and pool the results.
miceint_miraJAV <- with(miceintJAV,
lm(weight ~ gender + bili + age + chol + HDL + agechol +
ageHDL + height))
miceint_miraPAS <- with(miceintPAS,
lm(weight ~ gender + bili + age + chol + HDL + agechol +
ageHDL + height))
summary(pool(miceint_miraJAV))
## est se t df Pr(>|t|)
## (Intercept) -76.63915455 19.39481388 -3.9515282 27.155545 0.0004990517
## gender2 4.64534644 1.65210959 2.8117665 527.128374 0.0051105490
## bili -5.76016870 2.73320552 -2.1074773 14.565363 0.0528448141
## age 0.21290955 0.24100360 0.8834289 7.220343 0.4054571384
## chol 2.70748429 1.60998512 1.6816828 27.704040 0.1038710010
## HDL -22.29193149 7.95680304 -2.8016191 7.203969 0.0256977968
## agechol -0.05763036 0.02959673 -1.9471870 34.545334 0.0596782985
## ageHDL 0.16270072 0.15434401 1.0541434 7.008678 0.3267956761
## height 99.99894552 7.52485321 13.2891557 672.844617 0.0000000000
## lo 95 hi 95 nmis fmi lambda
## (Intercept) -116.4233626 -36.85494645 NA 0.41401610 0.37239134
## gender2 1.3998192 7.89087364 NA 0.04750585 0.04389880
## bili -11.6010360 0.08069864 126 0.56873791 0.51332486
## age -0.3534668 0.77928590 0 0.78338247 0.73067968
## chol -0.5920087 6.00697726 117 0.40965945 0.36852652
## HDL -40.9993270 -3.58453600 117 0.78412569 0.73149890
## agechol -0.1177432 0.00248247 117 0.36401112 0.32822646
## ageHDL -0.2021733 0.52757472 117 0.79313475 0.74147447
## height 85.2239266 114.77396440 20 0.02337114 0.02047246
summary(pool(miceint_miraPAS))
## est se t df Pr(>|t|)
## (Intercept) -84.41034507 17.41918620 -4.845826 482.19221 1.702527e-06
## gender2 4.55210592 1.63844712 2.778305 604.66134 5.633998e-03
## bili -6.38473588 2.29994476 -2.776039 42.82735 8.126896e-03
## age 0.34631723 0.19904132 1.739926 188.54777 8.350368e-02
## chol 4.33360717 1.96548001 2.204859 126.55016 2.927101e-02
## HDL -22.73238938 5.34296081 -4.254643 111.49202 4.377521e-05
## agechol -0.08985437 0.03809106 -2.358936 121.92983 1.991703e-02
## ageHDL 0.17825559 0.10114365 1.762400 105.97395 8.088488e-02
## height 100.64330497 7.46032910 13.490465 721.24291 0.000000e+00
## lo 95 hi 95 nmis fmi lambda
## (Intercept) -118.63723278 -50.18345735 NA 0.05482793 0.050915736
## gender2 1.33436779 7.76984406 NA 0.03518144 0.031995435
## bili -11.02355733 -1.74591442 126 0.32368415 0.292821416
## age -0.04631678 0.73895124 0 0.13182017 0.122659636
## chol 0.44414395 8.22307039 117 0.17165361 0.158665055
## HDL -33.31930808 -12.14547069 117 0.18578119 0.171305165
## agechol -0.16525986 -0.01444887 117 0.17571999 0.162309414
## ageHDL -0.02227210 0.37878328 117 0.19166842 0.176555735
## height 85.99675001 115.28985993 20 0.01187011 0.009133832
Because we want to perform the imputation with other methods on the same original dataset, we remove the two columns agechol
and ageHDL
that we had added to the data.
NHANES <- subset(NHANES, select = -c(agechol, ageHDL))
The package JointAI performs analysis and imputation jointly. There are three main functions, lm_imp()
, glm_imp()
and lme_imp()
that perform linear regression, generalized linear regression and analysis with linear mixed models. The specification of these functions is very similar to the use of their complete data versions lm()
, glm()
and lme
(from the nlme package).
Check the help file for lm_imp()
to find out which arguments you need to specify to fit the linear regression model for weight
.
lm_imp()
, glm_imp()
and lme_imp()
have many arguments, but not all of them are relevant for this practical. The most important arguments are:
formula
|
model formula |
data
|
original, incomplete dataset |
family
|
for glm’s: the distribution family of the outcome |
meth
|
vector of imputation methods. Implemented are norm (normal model), logit (logistic model), lognorm (linear with log-transformed outcome), multilogit (multinomial logit model), cumlogit (cumulative logit model)
|
n.chains
|
number of MCMC chains |
n.adapt
|
number of iterations used in the adaptive phase (details see below) |
n.iter
|
number of iterations per MCMC chain in the sampling phase |
auxvars
|
vector of names of variables that are not part of the analysis model but should be used to predict missing values (optional) |
refcats
|
allows to specify which category of categorical variables is used as reference (optional) |
Additional details:
Contrary to MICE, the n.chains
different MCMC chains are not used to create multiple imputed datasets. Multiple chains are necessary to evaluate if the model has converged. For the final result the chains are combined.
Depending on the type of variable, different types of samplers (within the Gibbs sampling) are used. JAGS (which is called from within lm_imp()
) needs an adaptive phase to find a suitable sampler for each of the parameters. Often n.adapt = 100
iterations (the default value) are sufficient for the adaption phase. These iterations are not included in the final result.
The number of iterations n.iter
that is required depends on the specific data and how complex the model is, hence, no default value is given.
Columns of the dataset that are not used in the model formula, and are not specified as auxiliary variables are ignored. In the current version of JointAI, passive imputation (calculating variables based on other variables) is not yet available.
Analyse (and impute) the data using lm_imp()
and the linear regression model specified above.
n.iter = 100
, to get an idea about the computational time.traceplot()
.n.iter
if necessary.You need to specify the
inits = NULL
traceplot()
shows by default the MCMC chains of the main model parameters. This is different to the traceplots we have seen when using mice()
, but the principle of convergence is the same.
The result of the model can be displayed with the function summary()
. To exclude parts of the MCMC chains from the result (if the ) part of the MCMC chains after convergence was achieved, the argument start
can be used to specify the first iteration to be used. See an example in the help page of the summary()
for JointAI
objects.
Note:
Due to a bug in version 0.1.0 of JointAI you need to specify inits = NULL
when using auxiliary variables. This bug is resolved in the development version on GitHub.
library(JointAI)
JointAIint <- lm_imp(weight ~ gender + bili + age * (chol + HDL) + height,
data = NHANES,
auxvars = c("educ", "DBP"),
n.iter = 500, inits = NULL)
## This is new software. Please report any bug to the package maintainer.
traceplot(JointAIint)
summary(JointAIint, start = 200)
##
## Linear model fitted with JointAI
##
## Call:
## lm_imp(formula = weight ~ gender + bili + age * (chol + HDL) +
## height, data = NHANES, n.iter = 500, inits = NULL, auxvars = c("educ",
## "DBP"))
##
## Posterior summary:
## Mean SD 2.5% 97.5% tail-prob.
## (Intercept) -86.4904 17.18574 -120.69996 -52.9285 0.000000
## genderfemale 4.9002 1.58440 1.79233 7.9446 0.004430
## age 0.3879 0.19603 0.01859 0.7758 0.035437
## height 101.1454 7.40559 87.54574 115.1592 0.000000
## chol 5.3293 1.88690 1.84898 9.1587 0.004430
## HDL -25.1377 5.20770 -34.99747 -15.4332 0.000000
## bili -6.9972 2.05366 -10.99778 -2.9416 0.004430
## age:chol -0.1092 0.03814 -0.18346 -0.0342 0.006645
## age:HDL 0.2174 0.09936 0.03328 0.4088 0.024363
##
## Posterior summary of residual std. deviation:
## Mean SD 2.5% 97.5%
## sigma_weight 15.92 0.4263 15.13 16.83
##
##
## MCMC settings:
## Iterations = 200:500
## Sample size per chain = 301
## Thinning interval = 1
## Number of chains = 3
##
## Number of observations: 750
The package smcfcs performs imputation using “substantive model compatible fully conditional specification”.
The main function is called smcfcs()
and its use is similar to the use of mice()
.
Check the help file to find out the details of this function.
smcfcs()
takes the following arguments as input:
originaldata
|
original, incomplete data frame |
smtype
|
type of substantive (analysis) model: lm , logistic , poisson , coxph or compet
|
smformula
|
formula of analysis model |
method
|
vector of imputation methods for each variable: norm (lin. regression), logreg (logistic regression), poisson (Poisson regression),podds (prop. odds regression for ordered factors,mlogit (multinomial logistic regression for unordered factors) '' if variable is complete, or a custom expression to impute passively imputed variables, e.g. 'x^2' or x1*x2
|
predictorMatrix
|
predictor matrix to define which variables are used as predictors in which imputation model (optional). The outcome must not be included |
m
|
number of imputed datasets (default is 5) |
numit
|
number of iterations before obtaining imputed dataset |
rjlimit
|
maximum number of attempts in the rejection sampling (MCMC) |
noisy
|
if TRUE : output is printed, if FALSE : less output is printed
|
Additional details:
The vector of imputation methods needs to be specified in the order of the columns in the dataset and contain an entry for each column, even that column is not used in the imputation model. Depending on whether additional columns of the data should be used as auxiliary variables in the imputation, the predictorMatrix
needs to be adjusted (0 if the column should not be used, 1 if it should be used).
In smcfcs()
the outcome of the analysis model is implicitely included in the imputation model, similarily to how it is included in JointAI. Therefore, the outcome does not need to be explicitely specified as predictor variable for the imputation models in the predictorMatrix
.
smcfcs uses rejection sampling. Rejection sampling is a method to create a random sample from a complex distribution (from which we can not sample directly) by drawing from a simpler proposal distribution and rejecting draws that are unlikely under the complex target distribution. The argument rjlimit
specifies how many attepts are made to draw a value from the proposal that is not rejected.
For logistic regression, smcfcs
requires the outcome to be coded as numeric (with values 0 and 1).
Impute the NHANES data with smcfcs()
and take into account that DBP
and educ
should be used as auxiliary variables.
=> you need to specify the predictorMatrix
to specify this
To create the predictor matrix: pred <- matrix(nrow = ncol(NHANES), ncol = ncol(NHANES), data = 1, dimnames = list(names(NHANES), names(NHANES)))
Then exclude variables from their own imputation model: diag(pred) <- 0
In smcfcs()
you need to specify
# To create the predictor matrix:
pred <- matrix(nrow = ncol(NHANES), ncol = ncol(NHANES), data = 1,
dimnames = list(names(NHANES), names(NHANES)))
# exclude variables from their own imputation model
diag(pred) <- 0
smcint <- smcfcs(NHANES,
smtype = "lm",
smformula = "weight ~ gender + bili + age * (chol + HDL) + height",
method = c(weight = "", gender = "", bili = "norm", age = "",
chol = "norm", HDL = "norm", educ = "", DBP = 'norm',
height = 'norm'),
predictorMatrix = pred, numit = 20)
## [1] "Outcome variable(s): weight"
## [1] "Passive variables: "
## [1] "Partially obs. variables: bili,chol,HDL,DBP,height"
## [1] "Fully obs. substantive model variables: gender,age"
## [1] "Imputation 1"
## [1] "Imputing: bili using gender,age,chol,HDL,educ,DBP,height plus outcome"
## [1] "Imputing: chol using gender,bili,age,HDL,educ,DBP,height plus outcome"
## [1] "Imputing: HDL using gender,bili,age,chol,educ,DBP,height plus outcome"
## [1] "Imputing: DBP using gender,bili,age,chol,HDL,educ,height plus outcome"
## [1] "Imputing: height using gender,bili,age,chol,HDL,educ,DBP plus outcome"
## [1] "Imputation 2"
## [1] "Imputation 3"
## [1] "Imputation 4"
## [1] "Imputation 5"
## Warning in smcfcs.core(originaldata, smtype, smformula, method,
## predictorMatrix, : Rejection sampling failed 443 times (across all
## variables, iterations, and imputations). You may want to increase the
## rejection sampling limit.
class
of the object returned by smcfcs()
and its elements. The section Value in the help page can help with this.
The estimates of the model parameters from each iteration are returned in the list element smCoefIter
.
Then plot the chains of the estimated model parameters to evaluate if the algorithm has converged.
# class of the returned object
class(smcint)
## [1] "list"
# names of its components
names(smcint)
## [1] "impDatasets" "smCoefIter"
# investigate the first component
class(smcint$impDatasets)
## [1] "list"
length(smcint$impDatasets)
## [1] 5
class(smcint$impDatasets[[1]])
## [1] "data.frame"
dim(smcint$impDatasets[[1]])
## [1] 750 9
# investigate the second component
class(smcint$smCoefIter)
## [1] "array"
dim(smcint$smCoefIter)
## [1] 5 9 20
# plot the chains of estimated coefficients
par(mfrow = c(3, 3), mar = c(3.5, 3.5, 1,1), mgp = c(2, 0.6, 0))
for (i in 1:dim(smcint$smCoefIter)[2]) {
matplot(t(smcint$smCoefIter[, i, ]), type = 'l', ylab = "coefficient",
xlab = "iteration")
}
The imputed datasets are returned as a list
, where each element of the list
is one imputed dataset.
To get from a list
of datasets to pooled results there are several options, as displayed in the following flow chart:
To use datalist2mids()
from the package miceadds, ordered factors need to be converted to unordered factors. To use as.mids()
we would have to turn the list into a long format dataset (where the original data and all imputed datasets are stacked on top of each other) and add a column that identifies the imputation number and one that identifies subjects.
Instead do the following:
smcint
to an imputationList
objectimputationList
objectMIcombine()
to obtain an MIresult
objectas.mira()
to convert the list
of models from 2. to a mira
objectpool()
on the mira
objectsummary()
of the mira
object and the summary()
of the MIresult
object# 1. convert to imputationList
library(mitools)
smcint_impList <- imputationList(smcint$impDatasets)
# 2. fit the models
intmod_mitools <- with(smcint_impList,
lm(weight ~ gender + bili + age * (chol + HDL) + height))
# 3. Pool the results
smcintpool_mitools <- MIcombine(intmod_mitools)
# 4. create a mira object
smcint_mira <- as.mira(intmod_mitools)
# 5. pool the mira object
smcintpool_mice <- pool(smcint_mira)
# 6. compare the results
summary(smcintpool_mitools)
## Multiple imputation results:
## with(smcint_impList, lm(weight ~ gender + bili + age * (chol +
## HDL) + height))
## MIcombine.default(intmod_mitools)
## results se (lower upper) missInfo
## (Intercept) -85.5033793 16.90841240 -118.64482413 -52.36193437 1 %
## genderfemale 4.7373652 1.64240651 1.51724652 7.95748382 3 %
## bili -6.4598277 2.06067909 -10.51660594 -2.40304951 13 %
## age 0.3377543 0.19042106 -0.03617934 0.71168799 8 %
## chol 5.5389860 1.85729807 1.89409190 9.18388001 7 %
## HDL -26.8677631 5.24750801 -37.21214077 -16.52338533 15 %
## height 101.2627544 7.42851757 86.70299231 115.82251653 1 %
## age:chol -0.1138222 0.03621496 -0.18490366 -0.04274082 7 %
## age:HDL 0.2667638 0.09716448 0.07576538 0.45776230 10 %
summary(smcintpool_mice)
## est se t df Pr(>|t|)
## (Intercept) -85.5033793 16.90841240 -5.056854 709.5648 5.431652e-07
## genderfemale 4.7373652 1.64240651 2.884405 598.2365 4.062705e-03
## bili -6.4598277 2.06067909 -3.134805 192.7048 1.988727e-03
## age 0.3377543 0.19042106 1.773724 327.8257 7.703714e-02
## chol 5.5389860 1.85729807 2.982282 399.5557 3.036378e-03
## HDL -26.8677631 5.24750801 -5.120099 158.2709 8.763846e-07
## height 101.2627544 7.42851757 13.631623 730.7931 0.000000e+00
## age:chol -0.1138222 0.03621496 -3.142962 379.9926 1.803792e-03
## age:HDL 0.2667638 0.09716448 2.745487 255.0003 6.472089e-03
## lo 95 hi 95 nmis fmi lambda
## (Intercept) -118.69988298 -52.30687552 NA 0.015266208 0.012494515
## genderfemale 1.51178173 7.96294861 NA 0.036215938 0.032999231
## bili -10.52420964 -2.39544580 NA 0.129817397 0.120832771
## age -0.03684708 0.71235572 NA 0.085072214 0.079507395
## chol 1.88768844 9.19028347 NA 0.069624702 0.064979279
## HDL -37.23193762 -16.50358848 NA 0.148515446 0.137823165
## height 86.67897406 115.84653478 NA 0.008250439 0.005539975
## age:chol -0.18502906 -0.04261543 NA 0.073527884 0.068664418
## age:HDL 0.07541681 0.45811087 NA 0.105390595 0.098401468
In this part of the practical we will use jomo to impute the NHANES data with the joint model approach to multiple imputation. For our linear regression of weight
, this can be done using the function jomo.lm()
.
Check the help page to find out which arguments need to be specified in jomo.lm()
.
jomo.lm()
are:
formula
|
model formula |
data
|
the original, incomplete data set |
nburn
|
number of burn-in iterations |
nbetween
|
Number of iterations between imputations |
nimp
|
number of imputations (default is 5) |
output
|
if 1: output at end of imputation, if $ eq 1$: no output shown |
out.iter
|
prints output info each out.iter iterations
|
Additional details:
jomo only produces a single chain of MCMC samples. The first nburn
iterations are discarded as burn-in. The next iteration forms the first imputed dataset. The MCMC chain is then continued and after nbetween
additional iterations the next imputed dataset is created. This step is repeated until nimp
sets of imputed values are obtained.
Convergence of the MCMC sampler can not be checked on the same object that was used to create the imputed values. Instead, a separate function has to be used: jomo.lm.MCMCchain()
. It takes the same arguments as jomo.lm()
, except for nbetween
and nimp
, and records all burn-in iterations, but does not create multiple imputed datasets.
jomo will use all columns that are in the dataset but not used in the model formula as auxiliary variables. As in JointAI, specification of passive imputation is currently not available.
jomo.lm.MCMCchain()
for our linear regression of weight
and investigate the resulting object.
You need to specify
jomointMCMC <- jomo.lm.MCMCchain(weight ~ gender + bili + age * (chol + HDL) + height,
data = NHANES, output = 0, nburn = 30000)
class(jomointMCMC)
## [1] "list"
names(jomointMCMC)
## [1] "finimp" "collectbeta" "collectomega" "finimp.latnorm"
## [5] "collectbetaY" "collectvarY"
lapply(jomointMCMC, class)
## $finimp
## [1] "data.frame"
##
## $collectbeta
## [1] "array"
##
## $collectomega
## [1] "array"
##
## $finimp.latnorm
## [1] "matrix"
##
## $collectbetaY
## [1] "array"
##
## $collectvarY
## [1] "numeric"
jomoint
is a list
with six elements:
finimp
is the imputed dataset that was generated at the end of the burn-in periodfinimp.latnorm
contains a matrix
corresponding estimates of the latent normal distributionscollectbetaY
and collectvarY
contain the MCMC samples throughout the burn-in period for the coefficients of the analysis model and residual variationcollectbeta
and collectomega
contain the MCMC samples throughout the burn-in period for the mean and variance-covariance matrix of the multivariate normal distribution.Make plots to evaluate convergence.
Note:
In version 2.6-2 of jomo there is a bug in the sampling of the covariance parameters, which will be resolved in the next version. The objects jomointMCMC
and jomoint
(see later) used in the solution were created using a preliminary fix of the bug.
par(mfrow = c(3,4), mar = c(3,3,1,1), mgp = c(2, 0.6, 0))
for (i in 1:dim(jomointMCMC$collectbeta)[2]) {
plot(jomointMCMC$collectbeta[1, i, ], type = 'l')
}
plot(jomointMCMC$collectvarY, type = 'l')
par(mfrow = c(3,3), mar = c(3,3,1,1), mgp = c(2, 0.6, 0))
for (i in 1:dim(jomointMCMC$collectbetaY)[2]) {
plot(jomointMCMC$collectbetaY[1, i, ], type = 'l')
}
par(mfrow = c(9, 8), mar = c(2.1, 2.5, 0.5, 0.5), mgp = c(2, 0.6, 0))
for (i in 1:dim(jomointMCMC$collectomega)[2]) {
for (j in 1:i) {
plot(jomointMCMC$collectomega[i, j , ], type = 'l')
}
}
When you are satisfied with the convergence, run the imputation with the number of burn-in iterations that you found to be sufficient and explore the resulting object.
jomoint <- jomo.lmC(weight ~ gender + bili + age * (chol + HDL) + height,
nburn = 1000, nbetween = 200, nimp = 5, data = NHANES)
## This function is beta software. Use carefully and please report any bug to the package mantainer
## ....................................................................................................First imputation registered.
## ....................Imputation number 2 registered
## ....................Imputation number 3 registered
## ....................Imputation number 4 registered
## ....................Imputation number 5 registered
## The posterior mean of the substantive model fixed effects estimates is:
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] -85.25802 4.848687 -7.031685 0.3672109 4.988777 -24.785 101.3408
## [,8] [,9]
## [1,] -0.1037916 0.2098365
## The posterior mean of the substantive model residual variance is:
## [1] 253.033
## The posterior mean of the fixed effects estimates is:
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.7472807 47.6825 5.027179 1.380295 1.683901 70.75355 0.01799191
## [,8] [,9] [,10] [,11]
## [1,] -0.3415452 -0.2617833 -0.126716 0.02619691
## The posterior mean of the level 1 covariance matrix is:
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.12862580 0.6140839 -0.027653973 -0.06691115 -0.03546014
## [2,] 0.61408390 304.6846905 1.824955569 -0.41032266 -0.59686868
## [3,] -0.02765397 1.8249556 1.235047646 0.14701218 0.01232511
## [4,] -0.06691115 -0.4103227 0.147012180 0.31587376 0.08073964
## [5,] -0.03546014 -0.5968687 0.012325107 0.08073964 0.05955106
## [6,] 0.01022712 -4.2158261 2.157704683 -0.15607124 0.17203748
## [7,] -0.06012189 -0.7124416 -0.057771815 0.11340621 0.19244043
## [8,] 0.04105525 6.1093452 -0.018955517 -0.22282767 -0.12330949
## [9,] -0.01482710 2.6423658 -0.049865247 -0.11331859 -0.05010485
## [10,] -0.02141806 0.9482815 0.002079085 -0.10048987 -0.02123084
## [11,] -0.03428924 -0.5397380 -0.086099295 -0.04975737 -0.01238275
## [,6] [,7] [,8] [,9] [,10]
## [1,] 0.01022712 -0.060121893 0.04105525 -0.014827097 -0.021418063
## [2,] -4.21582611 -0.712441565 6.10934521 2.642365812 0.948281485
## [3,] 2.15770468 -0.057771815 -0.01895552 -0.049865247 0.002079085
## [4,] -0.15607124 0.113406212 -0.22282767 -0.113318589 -0.100489873
## [5,] 0.17203748 0.192440426 -0.12330949 -0.050104847 -0.021230837
## [6,] 157.57109071 2.040882449 -0.25096545 0.614532707 0.436222115
## [7,] 2.04088245 1.000000000 -0.20066966 -0.002008077 0.053475636
## [8,] -0.25096545 -0.200669663 1.00000000 0.500000000 0.500000000
## [9,] 0.61453271 -0.002008077 0.50000000 1.000000000 0.500000000
## [10,] 0.43622212 0.053475636 0.50000000 0.500000000 1.000000000
## [11,] -0.97614224 -0.042733669 0.50000000 0.500000000 0.500000000
## [,11]
## [1,] -0.03428924
## [2,] -0.53973803
## [3,] -0.08609930
## [4,] -0.04975737
## [5,] -0.01238275
## [6,] -0.97614224
## [7,] -0.04273367
## [8,] 0.50000000
## [9,] 0.50000000
## [10,] 0.50000000
## [11,] 1.00000000
class(jomoint)
## [1] "data.frame"
dim(jomoint)
## [1] 4500 11
summary(jomoint)
## weight bili age chol
## Min. : 38.56 Min. :-0.2677 Min. :20.00 Min. : 2.373
## 1st Qu.: 64.86 1st Qu.: 0.5661 1st Qu.:33.00 1st Qu.: 4.270
## Median : 78.02 Median : 0.7000 Median :48.00 Median : 4.990
## Mean : 79.74 Mean : 0.7473 Mean :47.69 Mean : 5.012
## 3rd Qu.: 90.72 3rd Qu.: 0.9000 3rd Qu.:63.00 3rd Qu.: 5.640
## Max. :176.90 Max. : 4.1000 Max. :79.00 Max. :10.290
## NA's :126 NA's :117
## HDL height DBP gender
## Min. :0.1433 Min. :1.241 Min. : 14.00 male :2322
## 1st Qu.:1.1100 1st Qu.:1.600 1st Qu.: 64.00 female:2178
## Median :1.3400 Median :1.676 Median : 70.67
## Mean :1.3802 Mean :1.685 Mean : 70.79
## 3rd Qu.:1.6000 3rd Qu.:1.753 3rd Qu.: 78.00
## Max. :4.0300 Max. :2.091 Max. :111.48
## NA's :117 NA's :20 NA's :77
## educ id Imputation
## Less than 9th grade : 432 Min. : 1.0 Min. :0.0
## 9-11th grade : 690 1st Qu.:188.0 1st Qu.:1.0
## High school graduate: 936 Median :375.5 Median :2.5
## some college :1260 Mean :375.5 Mean :2.5
## College or above :1182 3rd Qu.:563.0 3rd Qu.:4.0
## Max. :750.0 Max. :5.0
##
jomo.lm()
returns a data.frame
in long format, including the original data. jomo has added a column id
to identify subjects and a column Imputation
that identifies the imputation number.
With this information and the flowchart from above, fit the linear regression model for weight
on the imputed data and pool the results.
jomoint_mids <- as.mids(jomoint,
.imp = which(names(jomoint) == "Imputation"),
.id = which(names(jomoint) == "id")
)
jomoint_mira <- with(jomoint_mids,
lm(weight ~ gender + bili + age * (chol + HDL) + height))
summary(pool(jomoint_mira))
## est se t df Pr(>|t|)
## (Intercept) -91.6593083 17.27408160 -5.306175 157.01377 3.755321e-07
## gender2 5.0936148 1.67349193 3.043704 277.73912 2.560799e-03
## bili -7.7036956 1.97222280 -3.906098 262.46753 1.193767e-04
## age 0.4458178 0.18743140 2.378565 250.05704 1.813184e-02
## chol 6.0134518 1.96429820 3.061374 91.40218 2.892984e-03
## HDL -25.2198440 5.00056997 -5.043394 169.48278 1.167590e-06
## height 102.9499357 7.49821968 13.729917 294.35181 0.000000e+00
## age:chol -0.1235710 0.03851946 -3.208015 87.68336 1.866853e-03
## age:HDL 0.2186667 0.09501044 2.301502 179.59171 2.251136e-02
## lo 95 hi 95 nmis fmi lambda
## (Intercept) -125.77886386 -57.53975276 NA 0.14930391 0.13853654
## gender2 1.79927552 8.38795402 NA 0.09834833 0.09187883
## bili -11.58708802 -3.82030327 126 0.10299295 0.09618371
## age 0.07667240 0.81496325 0 0.10702794 0.09991424
## chol 2.11184600 9.91505758 117 0.20961611 0.19250864
## HDL -35.09126865 -15.34841939 117 0.14185114 0.13178386
## height 88.19301973 117.70685168 20 0.09364434 0.08750688
## age:chol -0.20012422 -0.04701784 NA 0.21487793 0.19717175
## age:HDL 0.03118632 0.40614714 NA 0.13634877 0.12678409
We have imputed the NHANES data using five different approaches:
Here is how the pooled results from these imputations compare:
In this part of the practical, we will work with data on a trial on primary biliary cirrhosis (PBC) of the liver.
To get the pbclong data, load the file pbclong.RData
.
The variables contained in the dataset pbclong
are:
id
|
patient identifier |
protime
|
blood clotting time (the longitudinal outcome) |
year
|
continuously measured year of follow-up time (the time variable) |
sex
|
patients’ sex (f: female, m: male) |
trt
|
treatment group (0: D-penicillmain, 1: placebo) |
age
|
patients’ age at intake |
ascites
|
presence of ascites at baseline (0:no, 1:yes) |
bili
|
serum bilirubin level at baseline |
chol
|
serum cholesterol level at baseline |
albumin
|
serum albumin level at follow-up (time-varying) |
The variables have the following distributions and proportions of missing values:
The missing data pattern is:
## protime year sex trt age albumin id ascites bili chol
## 839 1 1 1 1 1 1 1 1 1 1 0
## 136 1 1 1 1 1 1 1 0 1 1 1
## 328 1 1 1 1 1 1 1 1 0 1 1
## 245 1 1 1 1 1 1 1 1 1 0 1
## 27 1 1 1 1 1 1 1 0 0 1 2
## 186 1 1 1 1 1 1 1 0 1 0 2
## 129 1 1 1 1 1 1 1 1 0 0 2
## 55 1 1 1 1 1 1 1 0 0 0 3
## 0 0 0 0 0 0 0 404 539 615 1558
The longitudinal outcome protime
shows relatively linear trajectories over time:
To analyse the trajectories of protime
we want to use the following linear mixed effects model with random intercept and slope (either using lme
from the package nlme or using lmer
from the package lme4):
# using package nlme
lme(protime ~ year + sex + trt + age + ascites + bili + chol + albumin, random = ~year|id)
# using package lme4
lmer(protime ~ year + sex + trt + age + ascites + bili + chol + albumin + (year|id))
For imputation of longitudinal data, the mice package provides special imputation methods that take into account the two-level structure of the data. In the pbclong
data missing values occur in baseline covariates only. For such variables, the imputation methods 2lonly.pmm
and 2lonly.norm
are available. The predictorMatrix
requires some extra specification to identify the id variable (set to -2) and the random effects structure (set to 2).
Run the setup imputation and perform the necessary changes in the imputation method and predictor matrix:
micelong0 <- mice(pbclong, maxit = 0)
meth_micelong <- micelong0$method
pred_micelong <- micelong0$predictorMatrix
meth_micelong[c("ascites", "bili", "chol")] <- "2lonly.pmm"
pred_micelong[, "id"] <- -2
pred_micelong[, "year"] <- 2
# check the predictor matrix
pred_micelong
## protime year sex trt age ascites bili chol albumin id
## protime 0 2 0 0 0 0 0 0 0 -2
## year 0 2 0 0 0 0 0 0 0 -2
## sex 0 2 0 0 0 0 0 0 0 -2
## trt 0 2 0 0 0 0 0 0 0 -2
## age 0 2 0 0 0 0 0 0 0 -2
## ascites 1 2 1 1 1 0 1 1 1 -2
## bili 1 2 1 1 1 1 0 1 1 -2
## chol 1 2 1 1 1 1 1 0 1 -2
## albumin 0 2 0 0 0 0 0 0 0 -2
## id 0 2 0 0 0 0 0 0 0 -2
Now run the imputation with the adapted imputation method meth_micelong
and predictor matrix pred_micelong
, analyse the imputed data and pool the results.
micelong <- mice(pbclong, meth = meth_micelong, pred = pred_micelong,
maxit = 20)
##
## iter imp variable
## 1 1 ascites bili chol
## 1 2 ascites bili chol
## 1 3 ascites bili chol
## 1 4 ascites bili chol
## 1 5 ascites bili chol
## 2 1 ascites bili chol
## 2 2 ascites bili chol
## 2 3 ascites bili chol
## 2 4 ascites bili chol
## 2 5 ascites bili chol
## 3 1 ascites bili chol
## 3 2 ascites bili chol
## 3 3 ascites bili chol
## 3 4 ascites bili chol
## 3 5 ascites bili chol
## 4 1 ascites bili chol
## 4 2 ascites bili chol
## 4 3 ascites bili chol
## 4 4 ascites bili chol
## 4 5 ascites bili chol
## 5 1 ascites bili chol
## 5 2 ascites bili chol
## 5 3 ascites bili chol
## 5 4 ascites bili chol
## 5 5 ascites bili chol
## 6 1 ascites bili chol
## 6 2 ascites bili chol
## 6 3 ascites bili chol
## 6 4 ascites bili chol
## 6 5 ascites bili chol
## 7 1 ascites bili chol
## 7 2 ascites bili chol
## 7 3 ascites bili chol
## 7 4 ascites bili chol
## 7 5 ascites bili chol
## 8 1 ascites bili chol
## 8 2 ascites bili chol
## 8 3 ascites bili chol
## 8 4 ascites bili chol
## 8 5 ascites bili chol
## 9 1 ascites bili chol
## 9 2 ascites bili chol
## 9 3 ascites bili chol
## 9 4 ascites bili chol
## 9 5 ascites bili chol
## 10 1 ascites bili chol
## 10 2 ascites bili chol
## 10 3 ascites bili chol
## 10 4 ascites bili chol
## 10 5 ascites bili chol
## 11 1 ascites bili chol
## 11 2 ascites bili chol
## 11 3 ascites bili chol
## 11 4 ascites bili chol
## 11 5 ascites bili chol
## 12 1 ascites bili chol
## 12 2 ascites bili chol
## 12 3 ascites bili chol
## 12 4 ascites bili chol
## 12 5 ascites bili chol
## 13 1 ascites bili chol
## 13 2 ascites bili chol
## 13 3 ascites bili chol
## 13 4 ascites bili chol
## 13 5 ascites bili chol
## 14 1 ascites bili chol
## 14 2 ascites bili chol
## 14 3 ascites bili chol
## 14 4 ascites bili chol
## 14 5 ascites bili chol
## 15 1 ascites bili chol
## 15 2 ascites bili chol
## 15 3 ascites bili chol
## 15 4 ascites bili chol
## 15 5 ascites bili chol
## 16 1 ascites bili chol
## 16 2 ascites bili chol
## 16 3 ascites bili chol
## 16 4 ascites bili chol
## 16 5 ascites bili chol
## 17 1 ascites bili chol
## 17 2 ascites bili chol
## 17 3 ascites bili chol
## 17 4 ascites bili chol
## 17 5 ascites bili chol
## 18 1 ascites bili chol
## 18 2 ascites bili chol
## 18 3 ascites bili chol
## 18 4 ascites bili chol
## 18 5 ascites bili chol
## 19 1 ascites bili chol
## 19 2 ascites bili chol
## 19 3 ascites bili chol
## 19 4 ascites bili chol
## 19 5 ascites bili chol
## 20 1 ascites bili chol
## 20 2 ascites bili chol
## 20 3 ascites bili chol
## 20 4 ascites bili chol
## 20 5 ascites bili chol
micelong_mira <- with(micelong, lme(protime ~ year + sex + trt + age +
ascites + bili + chol + albumin,
random = ~year|id)
)
# alternative:
micelong_mira <- with(micelong, lmer(protime ~ year + sex + trt + age +
ascites + bili + chol + albumin +
(year|id))
)
summary(pool(micelong_mira))
## est se t df Pr(>|t|)
## (Intercept) 12.1440772490 0.4331159193 28.0388614 260.86694 0.000000e+00
## year 0.1777262252 0.0252284464 7.0446758 1824.06668 2.624567e-12
## sex2 -0.3003235708 0.1540701831 -1.9492647 70.13239 5.526383e-02
## trt2 -0.0458308495 0.0917002778 -0.4997896 233.67788 6.176935e-01
## age 0.0009080765 0.0044242268 0.2052509 840.74221 8.374258e-01
## ascites2 0.6333646181 0.1015949611 6.2342129 172.04971 3.397080e-09
## bili 0.1119195116 0.0209596387 5.3397634 16.39779 6.113449e-05
## chol -0.0003241279 0.0004062786 -0.7977973 16.27452 4.364749e-01
## albumin -0.4723209273 0.0705680451 -6.6931276 1171.57563 3.379053e-11
## lo 95 hi 95 nmis fmi lambda
## (Intercept) 11.291228951 12.9969255470 NA 0.12070812 0.113992555
## year 0.128246547 0.2272059035 0 0.01105802 0.009974288
## sex2 -0.607596664 0.0069495227 NA 0.25406649 0.233093390
## trt2 -0.226495780 0.1348340814 NA 0.12890807 0.121484350
## age -0.007775750 0.0095919029 0 0.05298855 0.050738423
## ascites2 0.432831594 0.8338976420 NA 0.15409215 0.144315678
## bili 0.067574503 0.1562645198 539 0.54237586 0.489768699
## chol -0.001184221 0.0005359649 615 0.54438595 0.491636116
## albumin -0.610774790 -0.3338670650 0 0.03721226 0.035570083
To analyse incomplete longitudinal data a linear mixed model, the function lme_imp()
can be used. Its specification of the major model components is analogous to the function lme()
from the nlme package.
For the specification of the other arguments of lme_imp()
, refer to the help page or to the details given in the previous part of this practical.
Run the imputation (start with n.iter = 500
; this will take a few seconds) and check the traceplot()
. If you are satisfied by convergence and mixing of the chains, get the model summary()
.
library(JointAI)
JointAIlong <- lme_imp(protime ~ year + sex + trt + age + ascites + bili + chol + albumin,
random = ~year|id, data = pbclong, n.iter = 1000)
## This is new software. Please report any bug to the package maintainer.
traceplot(JointAIlong)
summary(JointAIlong)
##
## Linear mixed model fitted with JointAI
##
## Call:
## lme_imp(fixed = protime ~ year + sex + trt + age + ascites +
## bili + chol + albumin, data = pbclong, random = ~year | id,
## n.iter = 1000)
##
## Posterior summary:
## Mean SD 2.5% 97.5% tail-prob.
## (Intercept) 12.4596335 0.4655691 11.521611 13.3400223 0.00000
## sexf -0.3730065 0.1662976 -0.725929 -0.0577856 0.01333
## trt1 -0.0436455 0.0942028 -0.224690 0.1365296 0.66467
## age -0.0006715 0.0048848 -0.010225 0.0090504 0.89267
## ascites1 0.6352674 0.1116524 0.417236 0.8459319 0.00000
## chol -0.0004723 0.0004191 -0.001282 0.0003174 0.26800
## bili 0.1027571 0.0174612 0.067687 0.1372530 0.00000
## year 0.1802536 0.0248264 0.132637 0.2287294 0.00000
## albumin -0.4977301 0.0695397 -0.632876 -0.3607458 0.00000
##
## Posterior summary of random effects covariance matrix:
## Mean SD 2.5% 97.5%
## D[1,1] 0.7605 0.1348 0.5267 1.0513
## D[1,2] 0.6512 0.1185 0.4429 0.9044
## D[2,2] 0.8643 0.1337 0.6325 1.1560
##
## Posterior summary of residual std. deviation:
## Mean SD 2.5% 97.5%
## sigma_protime 1.056 0.02049 1.018 1.098
##
##
## MCMC settings:
## Iterations = 101:1100
## Sample size per chain = 1000
## Thinning interval = 1
## Number of chains = 3
##
## Number of observations: 1945
## Number of groups: 312
The package jomo provides functionality to impute normal and non-normal longitudinal outcomes using the functions jomo.lmer()
and jomo.glmer()
, which use the model specification of the complete data versions lmer()
and glmer()
from the package lme4.
Different to the other jomo.<...>()
functions is that here the level of each variable in the dataset (1 for repeated measurements, 2 for baseline covariates) needs to be specified.
Refer to the help page of jomo.lmer()
or the details provided in the previous part of this practical to get more information on the other parameters that need to be specified.
Note:
In the current version of jomo the random intercept needs to be specified explicitely, i.e. (1 + year|id)
instead of (year|id)
.
We skip the step to determine the number of iterations necessary in the burn-in phase (you would have to use jomo.lmer.MCMCchain()
to do this step) and set nburn = 6000
and nbetween = 100
. This will take approximately 2 minutes to run. (To save time, you may use less iterations.)
# specify the levels of the variables
lvl_jomolong <- c(protime = 1, year = 1, sex = 2, trt = 2, age = 2, ascites = 2,
bili = 2, chol = 2, albumin = 1, id = 1)
t0 <- Sys.time()
jomolong <- jomo.lmer(protime ~ year + sex + trt + age + ascites +
bili + chol + albumin + (1 + year|id),
data = pbclong,
level = lvl_jomolong, meth = "common", output = 0,
nburn = 6000, nbetween = 100)
## This function is beta software. Use carefully and please report any bug to the package mantainer
t1 <- Sys.time()
t1 - t0
## Time difference of 1.31898 mins
When we try to convert the long format dataset returned by jomo.lmer()
to a mids
object using as.mids()
(like we did in the previous part of this practical), we get an error message: Error in row.names<-.data.frame(*tmp*, value = value) : duplicate 'row.names' are not allowed
. This is because as.mids()
wants to assign the id variable as rownames. Since we have longitudinal data, where the same id occurs more than once, this leads to an error.
Instead, we use the function datalist2mids()
from the miceadds package.
Analyse the imputed data and pool the results, using the following steps:
jomolong
by imputation number into a list of imputed datasetsmids
object using datalist2mids()
mids
objectTo split a dataset into a list, use split()
.
Note:
jomo.lmer()
changes the id
variable to clus
. When analysing the imputed data, you need to use clus
in the random effects specification.
library(miceadds)
jomolong_mids <- datalist2mids(
split(jomolong, jomolong$Imputation)[-1] # split and remove first element
)
# analyse using nlme package:
jomolong_mira <- with(jomolong_mids, lme(protime ~ year + sex + trt + age + ascites +
bili + chol + albumin,
random = ~ year|clus))
# alternative: lme4 package
jomolong_mira <- with(jomolong_mids, lmer(protime ~ year + sex + trt + age + ascites +
bili + chol + albumin + (year|clus))
)
summary(pool(jomolong_mira))
## est se t df Pr(>|t|)
## (Intercept) 12.1211813665 0.4683214474 25.8821829 86.04087 0.000000e+00
## year 0.1804080015 0.0254432167 7.0906129 1754.58297 1.928013e-12
## sex2 -0.3837352619 0.1611171541 -2.3817157 43.61502 2.166261e-02
## trt2 -0.0254703288 0.0906932040 -0.2808405 474.21798 7.789552e-01
## age 0.0016644122 0.0050221796 0.3314123 65.55242 7.413902e-01
## ascites2 0.6231743090 0.1027405410 6.0655152 169.94019 8.285254e-09
## bili 0.1026290954 0.0187968019 5.4599232 16.76121 4.451162e-05
## chol -0.0002251263 0.0004556045 -0.4941266 10.75881 6.311500e-01
## albumin -0.4634483540 0.0703617867 -6.5866485 1481.53788 6.231415e-11
## lo 95 hi 95 nmis fmi lambda
## (Intercept) 11.190195451 13.0521672819 NA 0.22720375 0.20944667
## year 0.130505790 0.2303102135 0 0.01450538 0.01338268
## sex2 -0.708526522 -0.0589440022 NA 0.32804759 0.29792534
## trt2 -0.203680575 0.1527399170 NA 0.08250923 0.07864788
## age -0.008363971 0.0116927954 0 0.26355962 0.24142848
## ascites2 0.420362253 0.8259863653 NA 0.15518682 0.14530250
## bili 0.062928228 0.1423299630 539 0.53656759 0.48438282
## chol -0.001230655 0.0007804021 615 0.66277850 0.60542209
## albumin -0.601467677 -0.3254290309 0 0.02532280 0.02400792
We have imputed the longitudinal pbc data using three different approaches:
2lonly.norm
and 2lonly.pmm
Here is how the pooled results from these imputations compare:
In this part of the practical, we will work with a different subset of the PBC data, omitting the longitudinal structure and focusing on the survival component.
To get the pbcdat data, load the file pbcdat.RData
.
pbcdat
) are:
time | number of years between inclusion and death, transplantion, or end of follow-up |
status |
status at time (censored, transplant, dead)
|
age | patient’s age at intake |
sex | patient’s sex |
platelet | platelet count |
chol | serum cholesterol |
stage | histologic stage of disease |
The variables in pbcdat
dataset have the following distributions:
The missing data pattern is
## time status age sex stage platelet chol
## 280 1 1 1 1 1 1 1 0
## 4 1 1 1 1 1 0 1 1
## 121 1 1 1 1 1 1 0 1
## 7 1 1 1 1 1 0 0 2
## 6 1 1 1 1 0 1 0 2
## 0 0 0 0 6 11 134 151
We are interested to determine predictor variables for patient survival, using the following Cox proportional hazards model:
coxph(Surv(time, status == 'dead') ~ platelet + age + sex + chol + stage)
As we have seen in the lecture, the mice package provides the function nelsonaalen()
that calculates the Nelson-Aalen estimator with which the cumulative Hazard can be approximated.
Calculate the Nelson-Aalen estimate for patient survival in the pbc data and perform the usual setup steps for imputation using mice()
. nelsonaalen()
does not accept the status == 'dead'
specification of the event, hence, we need to create a new event indicator event
.
Note:
nelsonaalen()
uses the name status
internally in such a way that it is replaced by the covariate status
of our dataset pbcdat
, which leads to an error. Since we do not need the column status
for calculating the Nelson-Aalen estimate (we use event
instead), we can exclude it from the data for just this step. To do this, use data = subset(pbcdat, select = -c(status))
inside the nelsonaalen()
function.
pbcdat$event <- as.numeric(pbcdat$status == 'dead')
pbcdat$na <- nelsonaalen(data = subset(pbcdat, select = -c(status)),
timevar = "time",
statusvar = "event")
micesurv0 <- mice(pbcdat, maxit = 0)
micesurvmeth <- micesurv0$meth
micesurvpred <- micesurv0$pred
micesurvmeth[c("chol")] <- "midastouch"
micesurvmeth[c("platelet")] <- "norm"
micesurvpred[, "event"] <- 0
# check the method and predictorMatrix
micesurvmeth
## time status platelet age sex
## "" "" "norm" "" ""
## chol stage event na
## "midastouch" "polr" "" ""
micesurvpred
## time status platelet age sex chol stage event na
## time 0 0 0 0 0 0 0 0 0
## status 0 0 0 0 0 0 0 0 0
## platelet 1 1 0 1 1 1 1 0 1
## age 0 0 0 0 0 0 0 0 0
## sex 0 0 0 0 0 0 0 0 0
## chol 1 1 1 1 1 0 1 0 1
## stage 1 1 1 1 1 1 0 0 1
## event 0 0 0 0 0 0 0 0 0
## na 0 0 0 0 0 0 0 0 0
Now run the imputation, and analyse the imputed data.
micesurv <- mice(pbcdat, predictorMatrix = micesurvpred, maxit = 10, m = 5)
##
## iter imp variable
## 1 1 platelet chol stage
## 1 2 platelet chol stage
## 1 3 platelet chol stage
## 1 4 platelet chol stage
## 1 5 platelet chol stage
## 2 1 platelet chol stage
## 2 2 platelet chol stage
## 2 3 platelet chol stage
## 2 4 platelet chol stage
## 2 5 platelet chol stage
## 3 1 platelet chol stage
## 3 2 platelet chol stage
## 3 3 platelet chol stage
## 3 4 platelet chol stage
## 3 5 platelet chol stage
## 4 1 platelet chol stage
## 4 2 platelet chol stage
## 4 3 platelet chol stage
## 4 4 platelet chol stage
## 4 5 platelet chol stage
## 5 1 platelet chol stage
## 5 2 platelet chol stage
## 5 3 platelet chol stage
## 5 4 platelet chol stage
## 5 5 platelet chol stage
## 6 1 platelet chol stage
## 6 2 platelet chol stage
## 6 3 platelet chol stage
## 6 4 platelet chol stage
## 6 5 platelet chol stage
## 7 1 platelet chol stage
## 7 2 platelet chol stage
## 7 3 platelet chol stage
## 7 4 platelet chol stage
## 7 5 platelet chol stage
## 8 1 platelet chol stage
## 8 2 platelet chol stage
## 8 3 platelet chol stage
## 8 4 platelet chol stage
## 8 5 platelet chol stage
## 9 1 platelet chol stage
## 9 2 platelet chol stage
## 9 3 platelet chol stage
## 9 4 platelet chol stage
## 9 5 platelet chol stage
## 10 1 platelet chol stage
## 10 2 platelet chol stage
## 10 3 platelet chol stage
## 10 4 platelet chol stage
## 10 5 platelet chol stage
micesurv_mira <- with(micesurv, coxph(Surv(time = time, event) ~ platelet +
age + sex + chol + stage))
summary(pool(micesurv_mira))
## Warning in mice.df(m, lambda, dfcom, method): Large sample assumed.
## est se t df Pr(>|t|)
## platelet -0.001735838 0.0009212356 -1.8842500 288.7869 6.053549e-02
## age 0.031921824 0.0082122636 3.8870919 8613.1672 1.022137e-04
## sex2 -0.087872239 0.2362019158 -0.3720217 1382.9532 7.099337e-01
## chol 0.001712381 0.0002908418 5.8876719 185.3976 1.804560e-08
## stage2 0.822266919 0.7438101615 1.1054795 34380.0625 2.689596e-01
## stage3 1.215504072 0.7297672590 1.6656051 47202.4012 9.579873e-02
## stage4 2.174046613 0.7223116486 3.0098457 436715.8270 2.613953e-03
## lo 95 hi 95 nmis fmi lambda
## platelet -0.003549026 7.734929e-05 11 0.123718956 0.117671195
## age 0.015823820 4.801983e-02 0 0.021682153 0.021455012
## sex2 -0.551225009 3.754805e-01 NA 0.055106860 0.053741362
## chol 0.001138596 2.286166e-03 134 0.155925928 0.146869221
## stage2 -0.635625534 2.280159e+00 NA 0.010654893 0.010597341
## stage3 -0.214850150 2.645858e+00 NA 0.009025594 0.008983606
## stage4 0.758337873 3.589755e+00 NA 0.002273966 0.002269397
Again, we remove the added column from the dataset to use the original data in the subsequent imputations.
pbcdat <- subset(pbcdat, select = -c(na))
An alternative approach, that aims to remove the bias introduced by using the Nelson-Aalen estimator to approximate the cumulative hazard is implemented in the package smcfcs.
If you are uncertain which parameters need to be specified, check the help page for the function smcfcs()
or go back to the first part of this practical where some of the details about the specification were described.
Impute the pbcdat
data with smcfcs()
and the model formula specified above. We will skip the confirmation of convergence in the interest of time.
smcfcs()
does can not handle the event specification used in the complete data model above (status == 'dead'
), so you need to use event
instead.
# specify imputation method (in order of variables in dataset)
meth_smcsurv <- c(time = "", status = "", platelet = "norm",
age = "", sex = "", chol = "norm", stage = "podds",
event = "")
# impute the data
smcsurv <- smcfcs(originaldata = pbcdat,
smtype = "coxph",
smformula = "Surv(time = time, event) ~ platelet +
age + sex + chol + stage",
method = meth_smcsurv, numit = 20)
## [1] "Outcome variable(s): time,event"
## [1] "Passive variables: "
## [1] "Partially obs. variables: platelet,chol,stage"
## [1] "Fully obs. substantive model variables: age,sex"
## [1] "Imputation 1"
## [1] "Imputing: platelet using chol,stage,age,sex plus outcome"
## [1] "Imputing: chol using platelet,stage,age,sex plus outcome"
## [1] "Imputing: stage using platelet,chol,age,sex plus outcome"
## [1] "Imputation 2"
## [1] "Imputation 3"
## [1] "Imputation 4"
## [1] "Imputation 5"
Assuming the algorithm has converged (it has for the settings chosen in the solution), fit the Cox model and pool the results.
# create mids object or imputationList object
smcsurv_impList <- mitools::imputationList(smcsurv$impDatasets)
opt <- getOption("contrasts")
# this specifies that ordinal factors are treated as unordered factors when covariate in a model
options(contrasts = rep("contr.treatment", 2))
smcsurv_models <- with(smcsurv_impList, coxph(Surv(time = time, event) ~ platelet +
age + sex + chol + stage))
# re-set the option for factors
options(contrasts = opt)
smcsurv_mira <- as.mira(smcsurv_models)
summary(pool(smcsurv_mira))
## Warning in mice.df(m, lambda, dfcom, method): Large sample assumed.
## est se t df Pr(>|t|)
## platelet -0.001845292 0.0009359963 -1.9714741 224.87336 0.0498958048
## age 0.029968160 0.0082623858 3.6270589 1930.49730 0.0002941095
## sexf -0.164311796 0.2346507399 -0.7002398 1278.55849 0.4839049319
## chol 0.001487577 0.0004008993 3.7105997 22.15989 0.0012070539
## stage2 0.793470917 0.7325710678 1.0831317 2261.56145 0.2788654514
## stage3 1.180754737 0.7209558146 1.6377630 1653.17146 0.1016614558
## stage4 2.124575481 0.7118224306 2.9846987 2844.71708 0.0028627479
## lo 95 hi 95 nmis fmi lambda
## platelet -0.0036897381 -8.468001e-07 NA 0.14095996 0.13335357
## age 0.0137640219 4.617230e-02 NA 0.04646059 0.04547323
## sexf -0.6246545774 2.960310e-01 NA 0.05736869 0.05589532
## chol 0.0006565102 2.318643e-03 NA 0.47057154 0.42485212
## stage2 -0.6431108286 2.230053e+00 NA 0.04285219 0.04200612
## stage3 -0.2333279972 2.594837e+00 NA 0.05029482 0.04914657
## stage4 0.7288353014 3.520316e+00 NA 0.03811877 0.03744275
To impute the pbc survival data with the package jomo we use the function jomo.coxph()
, which works analogously to the other jomo.<...>()
functions we have seen.
Perform the imputation, analyse the imputed data and pool the results. Again, we will skip the convergence check to save some time. Use nburn = 1000
and nbetween = 500
. As in the previous imputation approaches, we use the new event indicator event
instead of status == 'dead'
. Since jomo considers all extra columns as auxiliary variables, we need to exclude status
to prevent multicollinearity issues.
To exclude the variable status
use data = subset(pbcdat, select = -status)
.
To run the analysis on the imputed data, convert the long format data to a mids
object, like we did in the NHANES example.
jomosurv <- jomo.coxph(formula = Surv(time = time, event) ~ platelet +
age + sex + chol + stage,
data = subset(pbcdat, select = -status),
nburn = 1000, nbetween = 500, nimp = 5)
## This function is beta software. Use carefully and please report any bug to the package mantainer
## ....................................................................................................First imputation registered.
## ..................................................Imputation number 2 registered
## ..................................................Imputation number 3 registered
## ..................................................Imputation number 4 registered
## ..................................................Imputation number 5 registered
## The posterior mean of the substantive model fixed effects estimates is:
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] -0.001714145 0.03118608 -0.1500963 0.001628265 0.8112937 1.171847
## [,7]
## [1,] 2.14595
## The posterior mean of the fixed effects estimates is:
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 257.5485 50.74662 362.9459 -1.240514 -0.8258932 -0.2330519 0.04388935
## The posterior mean of the level 1 covariance matrix is:
## [,1] [,2] [,3] [,4] [,5]
## [1,] 10064.71424 -160.422961 4387.388754 -15.00747951 28.51097587
## [2,] -160.42296 113.474201 -418.918982 2.77597525 -2.82686212
## [3,] 4387.38875 -418.918982 57296.144073 -2.92997690 -65.81701667
## [4,] -15.00748 2.775975 -2.929977 1.00000000 0.04727537
## [5,] 28.51098 -2.826862 -65.817017 0.04727537 1.00000000
## [6,] 33.73793 -2.301642 11.885783 -0.09646215 0.50000000
## [7,] 22.78600 -2.778327 38.757157 -0.05911280 0.50000000
## [,6] [,7]
## [1,] 33.73792518 22.7860002
## [2,] -2.30164238 -2.7783268
## [3,] 11.88578283 38.7571566
## [4,] -0.09646215 -0.0591128
## [5,] 0.50000000 0.5000000
## [6,] 1.00000000 0.5000000
## [7,] 0.50000000 1.0000000
jomosurv_mids <- as.mids(jomosurv,
.id = which(names(jomosurv) == "id"),
.imp = which(names(jomosurv) == "Imputation"))
jomosurv_mira <- with(jomosurv_mids, coxph(Surv(time, event) ~ platelet +
age + sex + chol + stage))
summary(pool(jomosurv_mira))
## Warning in mice.df(m, lambda, dfcom, method): Large sample assumed.
## est se t df Pr(>|t|)
## platelet -0.001812719 0.0009026999 -2.0081077 938.7056 4.491794e-02
## age 0.031359505 0.0083208859 3.7687700 1771.2879 1.694271e-04
## sex2 -0.143246361 0.2320142519 -0.6174033 3734.5524 5.370064e-01
## chol 0.001565833 0.0003282181 4.7707085 114.8626 5.440550e-06
## stage2 0.828615944 0.7556223446 1.0966006 2573.6901 2.729186e-01
## stage3 1.203302917 0.7349855006 1.6371791 10311.2249 1.016236e-01
## stage4 2.145234890 0.7315858524 2.9323078 6481.3797 3.376304e-03
## lo 95 hi 95 nmis fmi lambda
## platelet -0.0035842619 -4.117517e-05 11 0.06723022 0.06524498
## age 0.0150397167 4.767929e-02 0 0.04855048 0.04747678
## sex2 -0.5981333660 3.116406e-01 NA 0.03318174 0.03266410
## chol 0.0009156876 2.215978e-03 134 0.20040179 0.18659927
## stage2 -0.6530734463 2.310305e+00 NA 0.04011598 0.03937035
## stage3 -0.2374113090 2.644017e+00 NA 0.01978212 0.01959201
## stage4 0.7110851482 3.579385e+00 NA 0.02506066 0.02475987
We have imputed the pbc survival data using three different approaches:
Here is how the pooled results from these imputations compare: