Authors: Meta-Lina Spohn and Jeffrey Näf, ETH Zurich
This is a notebook with an example illustrating the use of the R-package Iscores, available on CRAN.
The function Iscores provided in the package computes a score for several imputations of a data set with missing values (Imputation Score, I-Score). As such, it helps select the best imputation of a data set when the goal is to impute with the true joint distribution of the data, which in turn is crucial for valid inference on data with missing values. The algorithm employs random projections in the variable space and is based on density ratios, resulting in the so-called DR I-Score. The density ratios are estimated through classification random forests. See the paper Imputation Scores for further elaboration on the algorithm and its properties.
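To convey the intuition behind the classification-based density-ratio idea, the following conceptual sketch (not the internal implementation of Iscores; all object names and parameter choices are illustrative) labels imputed and fully observed points, fits a probability forest with ranger, and translates the predicted class probabilities into a density ratio:
library(ranger)
set.seed(1)
observed <- matrix(rnorm(200), ncol = 2)            # stand-in for fully observed points
imputed  <- matrix(rnorm(200, sd = 1.5), ncol = 2)  # stand-in for imputed points
# label the two samples and fit a probability forest on the pooled data
dat <- data.frame(rbind(observed, imputed),
                  label = factor(rep(c("obs", "imp"), each = 100)))
fit <- ranger(label ~ ., data = dat, probability = TRUE,
              num.trees = 50, min.node.size = 10)
# the predicted class probabilities give an estimate of the density ratio
# p_imputed(x) / p_observed(x); ratios far from 1 indicate that the classifier
# can tell imputed from observed points, i.e. a poor imputation
p <- predict(fit, data = dat)$predictions
ratio <- p[, "imp"] / p[, "obs"]
summary(ratio)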
In more detail, the function Iscores is applied to a data set with missing values and a corresponding list of imputed versions of this data set. The imputed data sets may have been obtained by applying any imputation method to the data set with missing values. The function Iscores has several arguments, most of which have default values.
A full list of all function arguments of Iscores is given below:
X.NA: A matrix, containing missing values: the data to impute.
imputations: A list of lists of imputation matrices of the same size as X.NA, containing no more missing values. If the methods were applied \(m\) times, yielding \(m\) multiple imputations, then each list element contains a list of length \(m\); see the sketch after this list.
methods: A vector of strings with the same length as imputations, indicating the names of imputations (often the methods used for imputation).
m: An integer \(\geq 1\), the number of times each imputation method was used for imputation (multiple imputation). Default is the number of provided multiple imputations (the length of each list element of imputations).
num.proj: An integer, specifying the number of projections in the variable space to consider for the DR I-Score. Default is \(100\). It may be lowered if the dimension \(d\) of the data set is small, and increased if it is large.
num.trees.per.proj: An integer, the number of trees fitted per projection. Default is \(5\); it may be increased if computation capacity allows.
min.node.size: An integer, the minimum number of observations in a leaf node of a tree. Default is \(10\), the same default as for a regular probability forest.
n.cores: An integer, the number of cores to use for parallel computation; values \(> 1\) are only possible if multiple cores are available. Default is \(1\).
projection.function: A function providing user-specific projections. Default is NULL; if not provided, random projections are employed.
rescale: A boolean, TRUE if the computed DR I-Scores should be rescaled such that the maximal score is \(0\). Default is TRUE.
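For orientation, the following minimal sketch illustrates the expected nesting of the imputations argument for two methods applied \(m = 2\) times each. It uses a toy data set and a hypothetical random-fill imputation (X.NA.toy, rand.imp and the method names are illustrative only and are not part of the example below):
library(Iscores)
set.seed(1)
# toy data set with missing values in the first column
X.NA.toy <- matrix(rnorm(100), ncol = 2)
X.NA.toy[sample(50, 10), 1] <- NA
# hypothetical imputation: fill each NA with random noise
rand.imp <- function(x){
  x[is.na(x)] <- rnorm(sum(is.na(x)))
  x
}
# list of lists: one element per method, each containing m = 2 imputed matrices
imputations.toy <- list(methodA = list(rand.imp(X.NA.toy), rand.imp(X.NA.toy)),
                        methodB = list(rand.imp(X.NA.toy), rand.imp(X.NA.toy)))
Iscores(imputations = imputations.toy,
        methods = c("methodA", "methodB"),
        m = 2,
        X.NA = X.NA.toy,
        num.proj = 5)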
We create a simple \(2d\) data set to be able to visualize the data and its imputations. We use a noisy version of the spiral from the R-package mlbench, available on CRAN. We apply \(3\) different imputation methods. In this example, the superiority of one of the methods can be detected visually, which makes it suitable for showcasing the I-Score. In the upcoming code, we apply the following steps before the actual scoring:
Generate the true data \(X\) without missing values.
Generate a data set \(X.NA\) from \(X\) by setting random entries to \(NA\) (MCAR).
Create the imputations with \(3\) different methods: i) mean imputation, replacing each \(NA\) entry with the mean of the respective column, ii) sample imputation, drawing a value for each \(NA\) entry from the observed values in the respective column, iii) mice-cart imputation with the package mice, available on CRAN.
Plot the true data and the imputations of the \(3\) methods for visualization.
# parameter set-up
n <- 500 # sample size
epsilon <- 0.05 # parameter for the spiral
m <- 1 # number of imputations per method
methods <- c("mean","sample", "mice-cart") # names of imputation methods
# generate the true data X
library(mlbench)
X <- mlbench.spirals(n)$x # create a spiral in 2d
X <- jitter(X, amount = 0.05) # add noise to the spiral
# generate the data with missing values X.NA with a MCAR mechanism
X.NA <- X
X.NA[,1] <- ifelse(stats::runif(n) <= 0.3, NA, X[,1])
X.NA[,2] <- ifelse(stats::runif(n) <= 0.3, NA, X[,2])
# remove observations that have no observed value
X <- X[rowSums(is.na(X.NA)) < 2,]
X.NA <- X.NA[rowSums(is.na(X.NA)) < 2,]
# the resulting overall probability of missingness is given by
mean(is.na(X.NA))
## [1] 0.2291667
# initialize list of lists for the imputations
imputations <- list()
for(i in 1:length(methods)){
  imputations[[i]] <- list()
}
names(imputations) <- methods
# generate imputation with method "mean"
X.imp <- X.NA
X.imp[is.na(X.NA[,1]),1] <- mean(X.NA[,1], na.rm = TRUE)
X.imp[is.na(X.NA[,2]),2] <- mean(X.NA[,2], na.rm = TRUE)
imputations[[1]][[1]] <- X.imp # we need to have a list of lists, even
# though m = 1 only.
# generate imputation with method "sample"
X.imp <- X.NA
X.imp[is.na(X.NA[,1]),1] <- sample(X.NA[!is.na(X.NA[,1]),1],
size = sum(is.na(X.NA[,1])), replace = TRUE)
X.imp[is.na(X.NA[,2]),2] <- sample(X.NA[!is.na(X.NA[,2]),2],
size = sum(is.na(X.NA[,2])), replace = TRUE)
imputations[[2]][[1]] <- X.imp
# generate imputation with method "mice-cart"
library(mice)
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
##     filter
## The following objects are masked from 'package:base':
##
##     cbind, rbind
X.imp <- mice::complete(mice(X.NA, method = "cart", m = m), action="all")
##
## iter imp variable
## 1 1 V1 V2
## 2 1 V1 V2
## 3 1 V1 V2
## 4 1 V1 V2
## 5 1 V1 V2
imputations[[3]][[1]] <- unname(do.call(cbind, X.imp$`1`))
# plot the true data and the imputations of the 3 methods
# store the indices of observations with at least one NA
ind.candidates <- which(!complete.cases(X.NA))
ind.candidates <- sort(ind.candidates)
# layout settings
layout(mat = matrix(c(1, 2, 3, 4), nrow = 1, ncol = 4),
       heights = 1, widths = 1)
par(mai = c(0.4, 0.2, 0.4, 0.01))
# plot
plot(X, pch=19, col="gray", cex=1.2, yaxt="n", xaxt="n",xlab="", ylab="",
font.lab=1, main = "true data", cex.main = 1)
points(X[ind.candidates,],pch=1,col="black",cex=1.2)
for(i in 1:3){
  plot(X.NA[-ind.candidates,], pch=19, col="gray", cex=1.2, yaxt="n", xaxt="n",
       xlab="", ylab="", font.lab=1, main=methods[i], cex.main=1)
  points(imputations[[i]][[1]][ind.candidates,], pch=1, col="black", cex=1.2)
}
In gray, one can see the data points that had no \(NA\); in black, the data points that had a missing entry in one of the two dimensions and were imputed accordingly. In this example, we can indeed establish the ranking of the imputation methods visually: mice-cart is best, followed by sample, and mean is worst. However, most data sets have more than \(2\) or \(3\) dimensions, so that a visual inspection is not possible. In these cases, one can use our I-Score, which, as we show below, returns the same ranking as the visual inspection in the example at hand.
We therefore call the function Iscores to rank the \(3\) imputed data sets obtained by mean, sample and mice-cart imputation. We use num.proj \(= 5\), since the dimension \(d\) is only \(2\). For the remaining arguments we use the default values. In particular, we use the default rescale = TRUE, such that the highest score lies at \(0\). The result is shown in the next code snippet:
library(Iscores)
## Loading required package: parallel
## Loading required package: ranger
## Loading required package: kernlab
Iscores(imputations = imputations,
        methods = methods,
        m = 1,
        X.NA = X.NA,
        num.proj = 5)
## [1] "Evaluating method mean"
## [1] "nr of projections 5"
## [1] "nr of projections 5"
## [1] "nr of projections 5"
## [1] "nr of projections 5"
## [1] "Evaluating method sample"
## [1] "nr of projections 5"
## [1] "nr of projections 5"
## [1] "nr of projections 5"
## [1] "nr of projections 5"
## [1] "Evaluating method mice-cart"
## [1] "nr of projections 5"
## [1] "nr of projections 5"
## [1] "nr of projections 5"
## [1] "nr of projections 5"
## [1] "done scoring"
## mean sample mice-cart
## [1,] -22.37682 -1.393615 0
The I-Score does indeed return the same ranking of the imputed data sets: the best method is mice-cart, the second best is sample and the worst is mean.
We briefly compare this ranking to a ranking obtained by a method that is often used in research: the root mean squared error (RMSE), calculated between imputed and true values. This is not a method that can be used in practice, since the actual values underlying the missing observations need to be known, but it is an interesting benchmark. We calculate the negative RMSE (NRMSE) so that the scores have the same orientation (the higher, the better) as the I-Score above:
# store the indices of observations with at least one NA
ind.candidates <- which(!complete.cases(X.NA))
ind.candidates <- sort(ind.candidates)
# get the number of NAs in the data set
nrofmissing <- sum(is.na(X.NA))
# calculate the NRMSE
NRMSE <- list()
for(method in methods){
  NRMSE[[method]] <- -sqrt(sum((X[ind.candidates,] -
    imputations[[method]][[1]][ind.candidates,])^2) / nrofmissing)
}
# sort the NRMSE and rescale it as in the above Iscore
sort(unlist(NRMSE) - max(unlist(NRMSE)))
## mice-cart sample mean
## -0.2163287 -0.1805961 0.0000000
We again rescale the values such that the maximal NRMSE value lies at \(0\). Crucially, the mean imputation is now ranked highest, and with a seemingly large margin over the other two. This is no coincidence: RMSE tends to favor methods that impute with (conditional) means, even though we can see visually that this is not ideal in this case. The I-Score avoids this trap, while also working without access to the true underlying data.
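To illustrate why RMSE rewards mean-type imputations, consider the following toy example (an illustration that is not part of the notebook above): for a Gaussian variable with MCAR missingness, imputing the observed mean yields a lower RMSE than sampling from the observed values, although it collapses the variance of the imputed entries.
set.seed(2)
x <- rnorm(10000)                                       # true values
miss <- runif(10000) < 0.3                              # MCAR missingness pattern
x.obs <- x[!miss]                                       # observed values
imp.mean <- rep(mean(x.obs), sum(miss))                 # mean imputation
imp.sample <- sample(x.obs, sum(miss), replace = TRUE)  # sampling imputation
# RMSE on the missing entries: mean imputation wins (about 1 vs. about 1.4) ...
sqrt(mean((x[miss] - imp.mean)^2))
sqrt(mean((x[miss] - imp.sample)^2))
# ... but it destroys the distribution of the imputed entries
var(imp.mean)                                           # 0, far from the true variance of about 1
var(imp.sample)                                         # close to 1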