1) Regression with NA (quantitative) for ozone

First of all you will need to install the following packages

install.packages("VIM")
install.packages("naniar")
install.packages("missMDA")
install.packages("Amelia")
install.packages("mice")
install.packages("missForest")
install.packages("FactoMineR")
install.packages("tidyverse")

Air pollution is currently one of the most serious public health worries worldwide. Many epidemiological studies have proved the influence that some chemical compounds, such as sulphur dioxide (SO2), nitrogen dioxide (NO2), ozone (O3) can have on our health. Associations set up to monitor air quality are active all over the world to measure the concentration of these pollutants. They also keep a record of meteorological conditions such as temperature, cloud cover, wind, etc.

We have at our disposal 112 observations collected during the summer of 2001 in Rennes. The variables available are:

maxO3 (maximum daily ozone)
maxO3v (maximum daily ozone the previous day)
T12 (temperature at midday)
T9
T15 (Temp at 3pm)
Vx12 (projection of the wind speed vector on the east-west axis at midday)
Vx9 and Vx15 as well as the Nebulosity (cloud) Ne9, Ne12, Ne15

Here the final aim is to analyse the relationship between the maximum daily ozone (maxO3) level and the other meteorological variables. To do so we will perform regression to explain maxO3 in function of all the other variables. This data is incomplete - there are missing values. Indeed, it occurs frenquently to have machines that fail one day, leading to some information not recorded. We will therefore perform regression with missing values via multiple imputation.

Importing the data.

ozo <- read.table("ozoneNA.csv",
                  header = TRUE,
                  sep = ",",
                  row.names = 1)

WindDirection <- ozo[, 12]
don <- ozo[, 1:11]   #### keep the continuous variables

summary(don)

##      maxO3              T9             T12             T15       
##  Min.   : 42.00   Min.   :11.30   Min.   :14.30   Min.   :14.90  
##  1st Qu.: 71.00   1st Qu.:16.00   1st Qu.:18.60   1st Qu.:18.90  
##  Median : 81.50   Median :17.70   Median :20.40   Median :21.40  
##  Mean   : 91.24   Mean   :18.22   Mean   :21.46   Mean   :22.41  
##  3rd Qu.:108.25   3rd Qu.:19.90   3rd Qu.:23.60   3rd Qu.:25.65  
##  Max.   :166.00   Max.   :25.30   Max.   :33.50   Max.   :35.50  
##  NA's   :16       NA's   :37      NA's   :33      NA's   :37     
##       Ne9             Ne12            Ne15           Vx9         
##  Min.   :0.000   Min.   :0.000   Min.   :0.00   Min.   :-7.8785  
##  1st Qu.:3.000   1st Qu.:4.000   1st Qu.:3.00   1st Qu.:-3.0000  
##  Median :5.000   Median :5.000   Median :5.00   Median :-0.8671  
##  Mean   :4.987   Mean   :4.986   Mean   :4.60   Mean   :-1.0958  
##  3rd Qu.:7.000   3rd Qu.:7.000   3rd Qu.:6.25   3rd Qu.: 0.6919  
##  Max.   :8.000   Max.   :8.000   Max.   :8.00   Max.   : 5.1962  
##  NA's   :34      NA's   :42      NA's   :32     NA's   :18       
##       Vx12              Vx15            maxO3v      
##  Min.   :-7.8785   Min.   :-9.000   Min.   : 42.00  
##  1st Qu.:-3.6941   1st Qu.:-3.759   1st Qu.: 70.00  
##  Median :-1.9284   Median :-1.710   Median : 82.50  
##  Mean   :-1.6853   Mean   :-1.830   Mean   : 89.39  
##  3rd Qu.:-0.1302   3rd Qu.: 0.000   3rd Qu.:101.00  
##  Max.   : 6.5778   Max.   : 3.830   Max.   :166.00  
##  NA's   :10        NA's   :21       NA's   :12

head(don)

##          maxO3   T9  T12  T15 Ne9 Ne12 Ne15     Vx9    Vx12    Vx15 maxO3v
## 20010601    87 15.6 18.5   NA   4    4    8  0.6946 -1.7101 -0.6946     84
## 20010602    82   NA   NA   NA   5    5    7 -4.3301 -4.0000 -3.0000     87
## 20010603    92 15.3 17.6 19.5   2   NA   NA  2.9544      NA  0.5209     82
## 20010604   114 16.2 19.7   NA   1    1    0      NA  0.3473 -0.1736     92
## 20010605    94   NA 20.5 20.4  NA   NA   NA -0.5000 -2.9544 -4.3301    114
## 20010606    80 17.7 19.8 18.3   6   NA    7 -5.6382 -5.0000 -6.0000     94

dim(don)

## [1] 112  11

Load the libraries.

library(VIM)
library(FactoMineR)
library(missMDA)

1.1) Descriptive statistics with missing values

Q1 When could it be a good idea to delete rows or columns with missing values to work with a complete data set? Could you do it here?

dim(na.omit(don))

## [1] 13 11

Deleting rows or columns is possible as long as there is enough data left and the missing values are of the MCAR type so that the sample is a subsample of the original data. We will obtain unbiased estimators but with more variance. Deleting observations with missing data for ozone data leads to a table with 13 rows.

First, we perfom some descriptive statistics (how many missing? how many variables, individuals with missing?) and try to inspect and vizualize the pattern of missing entries and get hints on the mechanism. For this purpose, we use the R package VIM (Visualization and Imputation of Missing Values - Mathias Templ) as well as Multiple Correspondence Analysis (FactoMineR package). You should install the package VIM, then you can check the documentation by executing

?VIM

The other package that can be used is the package naniar developped by Nick Tierney’s and which is based on ggplot. Naniar provides principled, tidy ways to summarise, visualise, and manipulate missing data with minimal deviations from the workflows in ggplot2 and tidy data.

library(naniar)
gg_miss_var(don)

The function VIM aggr calculates and represents the number of missing entries in each variable and for certain combinations of variables (which tend to be missing simultaneously).

res<-summary(aggr(don, sortVar=TRUE))$combinations

## 
##  Variables sorted by number of missings: 
##  Variable      Count
##      Ne12 0.37500000
##        T9 0.33035714
##       T15 0.33035714
##       Ne9 0.30357143
##       T12 0.29464286
##      Ne15 0.28571429
##      Vx15 0.18750000
##       Vx9 0.16071429
##     maxO3 0.14285714
##    maxO3v 0.10714286
##      Vx12 0.08928571

head(res[rev(order(res[,2])),])

##             Combinations Count   Percent
## 1  0:0:0:0:0:0:0:0:0:0:0    13 11.607143
## 45 0:1:1:1:0:0:0:0:0:0:0     7  6.250000
## 10 0:0:0:0:0:1:0:0:0:0:0     5  4.464286
## 35 0:1:0:0:0:0:0:0:0:0:0     4  3.571429
## 41 0:1:0:0:1:1:1:0:0:0:0     3  2.678571
## 28 0:0:1:0:0:0:0:0:0:0:0     3  2.678571

We can see that the combination which is the most frequent is the one where all the variables are observed (13 values). Then, the second one is the one where T9, T12 and T15 are simultaneously missing (7 rows) (1 is missing, 0 is observed - there is a 1 for the second, third and fourth variables). The graph on the right panel represents the pattern, with blue for observed and red for missing.

The VIM function matrixplot creates a matrix plot in which all cells of a data matrix are visualized by rectangles. Available data is coded according to a continuous color scheme (gray scale), while missing/imputed data is visualized by a clearly distinguishable color (red). If you use Rstudio the plot is not interactive (there are the warnings), but if you use R directly, you can click on a column of your choice: the rows are sorted (decreasing order) of the values of this column. This is useful to check if there is an association between the value of a variable and the missingness of another one.

matrixplot(don, sortby = 2)

## 
## Click in a column to sort by the corresponding variable.
## To regain use of the VIM GUI and the R console, click outside the plot region.

#Here the variable selected is variable 2.

Q2 Do you observe any associations between the missing entries ? When values are missing on a variable does it correspond to small or large values on another one ?

We observe that the temperature variables T9, T12 and T15 tend to be missing together (probably indicating that thermometers failed) [as well as the Ne9, Ne12 and Ne15 variables.] We see more “red” values. We do not see more black or white values which should imply that when T9 is missing it would have corresponded to high or low values in another variable which should suggest MAR missing values for instance. Here everything points to MCAR values.

The VIM function marginplot creates a scatterplot with additional information on the missing values. If you plot the variables (x,y), the points with no missing values are represented as in a standard scatterplot. The points for which x (resp. y) is missing are represented in red along the y (resp. x) axis. In addition, boxplots of the x and y variables are represented along the axes with and without missing values (in red all variables x where y is missing, in blue all variables x where y is observed).

marginplot(don[,c("T9","maxO3")])

We can see that the distribution of T9 is the same when maxO3 is oberved and when maxO3 is missing. If the two boxplots (red and blue) would have been very different it would imply that when maxO3 is missing the values of T9 can be very high or very low which lead to suspect the MAR hypothesis.

R1 Create a categorical dataset with “o” when the value of the cell is observed and “m” when it is missing, and with the same row and column names as in the original data. Then, you can perform Multiple Correspondence Analysis with the MCA function of the FactoMineR package.

?MCA

MCA can be seen as the counterpart of PCA for categorical data and here is used to study associations between missing and observed entries. MCA is a straightforwardly tool to visualise the missing data pattern even if the number of variable is large. It shows if missing values simultaneously occur in several variables or if missing values occur when some other variables are observed

data_miss <- data.frame(is.na(don))
data_miss <- apply(X=data_miss, FUN=function(x) if(x) "m" else "o", MARGIN=c(1,2))

# data_miss <- as_shadow(don) with the naniar package.
res.mca <- MCA(data_miss, graph = F)
plot(res.mca, invis = "ind", title = "MCA graph of the categories", cex  = 0.5)

Other features from the package naniar

Summary with missing values

pct_miss(don) # percentage of missing value in the data.

## [1] 23.7013

n_miss(don) # number of missing values in the

## [1] 292

n_complete(don) # without missing value

## [1] 940

n_miss(don$maxO3) # number of missing value for maxO3

## [1] 16

A matrix with missing and non missing:

as_shadow(don)

## # A tibble: 112 x 11
##    maxO3_NA T9_NA T12_NA T15_NA Ne9_NA Ne12_NA Ne15_NA Vx9_NA Vx12_NA
##    <fct>    <fct> <fct>  <fct>  <fct>  <fct>   <fct>   <fct>  <fct>  
##  1 !NA      !NA   !NA    NA     !NA    !NA     !NA     !NA    !NA    
##  2 !NA      NA    NA     NA     !NA    !NA     !NA     !NA    !NA    
##  3 !NA      !NA   !NA    !NA    !NA    NA      NA      !NA    NA     
##  4 !NA      !NA   !NA    NA     !NA    !NA     !NA     NA     !NA    
##  5 !NA      NA    !NA    !NA    NA     NA      NA      !NA    !NA    
##  6 !NA      !NA   !NA    !NA    !NA    NA      !NA     !NA    !NA    
##  7 !NA      !NA   !NA    !NA    !NA    !NA     NA      !NA    !NA    
##  8 !NA      !NA   !NA    !NA    !NA    !NA     NA      !NA    !NA    
##  9 !NA      !NA   !NA    !NA    !NA    NA      !NA     !NA    !NA    
## 10 !NA      !NA   NA     !NA    !NA    NA      NA      !NA    !NA    
## # ... with 102 more rows, and 2 more variables: Vx15_NA <fct>,
## #   maxO3v_NA <fct>

The initial matrix concatenated with the matrix with missing and non missing:

bind_shadow(don)

## # A tibble: 112 x 22
##    maxO3    T9   T12   T15   Ne9  Ne12  Ne15     Vx9    Vx12   Vx15 maxO3v
##    <int> <dbl> <dbl> <dbl> <int> <int> <int>   <dbl>   <dbl>  <dbl>  <int>
##  1    87  15.6  18.5  NA       4     4     8   0.695  -1.71  -0.695     84
##  2    82  NA    NA    NA       5     5     7  -4.33   -4     -3         87
##  3    92  15.3  17.6  19.5     2    NA    NA   2.95   NA      0.521     82
##  4   114  16.2  19.7  NA       1     1     0  NA       0.347 -0.174     92
##  5    94  NA    20.5  20.4    NA    NA    NA  -0.5    -2.95  -4.33     114
##  6    80  17.7  19.8  18.3     6    NA     7  -5.64   -5     -6         94
##  7    79  16.8  15.6  14.9     7     8    NA  -4.33   -1.88  -3.76      80
##  8    79  14.9  17.5  18.9     5     5    NA   0      -1.04  -1.39      99
##  9   101  16.1  19.6  21.4     2    NA     4  -0.766  -1.03  -2.30      79
## 10   106  18.3  NA    22.9     5    NA    NA   1.29   -2.30  -3.94     101
## # ... with 102 more rows, and 11 more variables: maxO3_NA <fct>,
## #   T9_NA <fct>, T12_NA <fct>, T15_NA <fct>, Ne9_NA <fct>, Ne12_NA <fct>,
## #   Ne15_NA <fct>, Vx9_NA <fct>, Vx12_NA <fct>, Vx15_NA <fct>,
## #   maxO3v_NA <fct>

Replacing values with NA: replace_with_na recodes various values with a missing value (NA). For example, we might know that all values of “N/A”, “N A”, and “Not Available”, or -99, or -1 are supposed to be missing.

Plot with missing values

Missing values in each variable per category of another variable the wind direction:

library(dplyr)
don %>%
  group_by(ozo$WindDirection) %>%
  miss_var_summary()

## # A tibble: 44 x 5
##    `ozo$WindDirection` variable n_miss pct_miss n_miss_cumsum
##    <fct>               <chr>     <int>    <dbl>         <int>
##  1 North               T15          12     38.7            39
##  2 North               T9           11     35.5            16
##  3 North               T12          11     35.5            27
##  4 North               Ne12          9     29.0            55
##  5 North               Ne9           7     22.6            46
##  6 North               Ne15          7     22.6            62
##  7 North               Vx9           6     19.4            68
##  8 North               maxO3         5     16.1             5
##  9 North               Vx15          5     16.1            75
## 10 North               maxO3v        4     12.9            79
## # ... with 34 more rows

Below with using bind_shadow function, we show the mean, sd, variance, and min and max values of T9 for when maximum daily ozone level is present, and when it is missing.

don %>%
  bind_shadow() %>%
  group_by(maxO3_NA) %>%
  summarise_at(.vars = "T9",
               .funs = c("mean", "sd", "var", "min", "max"),
               na.rm = TRUE)

## # A tibble: 2 x 6
##   maxO3_NA  mean    sd   var   min   max
##   <fct>    <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 !NA       18.3  3.06  9.36  11.3  25.3
## 2 NA        17.9  2.61  6.83  13.3  21

As VIM package has matrix plot, similarly naniar has the var_miss() function. It provides a summary of whether the data is missing (in black) or not. It also provides the percentage of missing values in each column.

vis_miss(don, sort_miss = TRUE)

The function geom_miss_point() is close to the margin plot function of VIM but within the ggplot framework.

library(ggplot2)
ggplot(don, 
       aes(x = T9, 
           y = maxO3)) + 
  geom_miss_point() + 
  facet_wrap(~ozo$WindDirection)+ 
  theme_dark()

Below, we can plot the distribution of Temperature at 9, plotting for values of temperature when Ozone is missing, and when it is not missing.

ggplot( bind_shadow(don),
       aes(x = T9,
           fill = maxO3_NA)) + 
  geom_density(alpha=0.5)

1.2) PCA with missing values

Then, before modeling the data, we perform a PCA with missing values to explore the correlation between variables. Use the R package missMDA dedicated to perform principal components methods with missing values and to impute data with PC methods.

Perform PCA with missing values using the imputePCA functions, with the number of components determined by the estim_ncpPCA. Then plot the variables circle.

library(missMDA)

?estim_ncpPCA
?imputePCA

The package missMDA allows the use of principal component methods for an incomplete data set. To achieve this goal in the case of PCA, the missing values are predicted using the iterative PCA algorithm for a predefined number of dimensions. Then, PCA is performed on the imputed data set. The single imputation step requires tuning the number of dimensions used to impute the data.

nb <- estim_ncpPCA(don,method.cv = "Kfold", verbose = FALSE) # estimate the number of components from incomplete data
#(available methods include GCV to approximate CV)
nb$ncp #2

## [1] 2

plot(0:5, nb$criterion, xlab = "nb dim", ylab = "MSEP")

res.comp <- imputePCA(don, ncp = nb$ncp) # iterativePCA algorithm
res.comp$completeObs[1:3,] # the imputed data set

##          maxO3      T9      T12      T15 Ne9     Ne12     Ne15     Vx9
## 20010601    87 15.6000 18.50000 20.47146   4 4.000000 8.000000  0.6946
## 20010602    82 18.5047 20.86986 21.79932   5 5.000000 7.000000 -4.3301
## 20010603    92 15.3000 17.60000 19.50000   2 3.984066 3.812104  2.9544
##               Vx12    Vx15 maxO3v
## 20010601 -1.710100 -0.6946     84
## 20010602 -4.000000 -3.0000     87
## 20010603  1.950563  0.5209     82

imp <- cbind.data.frame(res.comp$completeObs,WindDirection)

res.pca <- PCA(imp, quanti.sup = 1, quali.sup = 12, ncp = nb$ncp, graph=FALSE)
plot(res.pca, hab=12, lab="quali");

plot(res.pca, choix="var")

head(res.pca$ind$coord) #scores (principal components)

##               Dim.1      Dim.2
## 20010601 -0.6604580 -1.2048271
## 20010602 -1.2317545  1.0465411
## 20010603  0.7984643 -2.7299508
## 20010604  2.5423205 -1.7435774
## 20010605 -0.4047517  0.8406578
## 20010606 -2.6701824  1.6934864

The incomplete data set can be imputed using the function imputePCA performing the iterative PCA algorithm, specifying the number of dimensions through the argument ncp=2. At convergence the algorithm provides both an estimation of the scores and loadings as well as a completed data set. The imputePCA function outputs the imputed data set. The completed data set is in the object completeObs. The imputePCA function also outputs the fitted matrix \(\hat X\) in the object fitted.

Q3 Could you guess how cross-validation is performed to select the number of components?

The cross-validation is performed with the Kfold methodFor the Kfold. A percentage pNA of missing values is inserted and predicted with a PCA model using ncp.min to ncp.max dimensions. This process is repeated nbsim times. The number of components which leads to the smallest MSEP (Mean Standard Error of Prediction) is retained.

Through the argument method.cv, the function estim_ncpPCA proposes several cross-validation procedures to choose this number. The default method is the generalised cross-validation method (method.cv=“gcv”). It consists in searching the number of dimensions which minimises the generalised cross-validation criterion, which can be seen as an approximation of the leave-one-out cross-validation criterion. The procedure is very fast, because it does not require adding explicitly missing values and predicting them for each cell of the data set. However, the number of dimensions minimising the criterion can sometimes be unobvious when several local minimum occur. In such a case, more computationally intensive methods, those performing explicit cross-validation, can be used, such as Kfold (method.cv=“Kfold”) or leave-one-out (method.cv=“loo”).

The Kfold cross-validation suggests to retain 2 dimensions for the imputation of the data set.

1.3) Multiple imputation

Generate multiple data sets

We perform multiple imputation either assuming 1) Joint Modeling (one joint probabilistic model for the variables all together) - We use the R package Amelia, which is by default consider Gaussian distribution 2) Condional Modeling (one model per variable) approach - We use the R package mice which by default consider one model of linear regression per variable 3) a PCA based model - We use the R package missMDA

For each approach we generate 100 imputed data sets.

library(Amelia)

?amelia

res.amelia <- amelia(don, m = 5)

## -- Imputation 1 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
##  41 42 43 44 45 46 47 48 49
## 
## -- Imputation 2 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
##  41 42 43 44 45
## 
## -- Imputation 3 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
##  41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
##  61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
##  81
## 
## -- Imputation 4 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
## 
## -- Imputation 5 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
##  41 42 43 44

#names(res.amelia$imputations) 
#res.amelia$imputations$imp1# the first imputed data set

library(mice)

imp.mice <- mice(don, m = 100, defaultMethod = "norm.boot") # the variability of the parameters is obtained

Now generate 100 imputed data sets with the MIPCA method and 2 components. Store the result in a variable called res.MIPCA.

?MIPCA
?plot.MIPCA

res.MIPCA <- MIPCA(don, ncp = 2, nboot = 100) # MI with PCA using 2 dimensions

The function MIPCA gives as output the data set imputed by the iterative PCA algorithm (in res.imputePCA) and the other data sets generated by the MIPCA algorithm (in res.MI). The number of data sets generated by this algorithm is controlled by the nboot argument, equal to 100 by default. The other arguments of this function are the same as those for the imputePCA function.

Inspect the imputed values

Exploratory analysis is very important and even at this stage of the analysis.

We will inspect the imputed values created to know if the imputation method should require more investigation or if we can continue and analyze the data. A common practice consists in comparing the distribution of the imputed values and of the observed values. Check the compare.density function.

compare.density(res.amelia, var = "T12")

Q Do both distributions need to be close? Could the missing values differ from the observed ones both in spread and in location?

Note that a difference between these distributions does not mean that the model is unsuitable. Indeed, when the missing data mechanism is not MCAR, it could make sense to observe differences between the distribution of imputed values and the distribution of observed values. However, if differences occur, more investigations would be required to try to explain them.

The quality of imputation can also be assessed with cross-validation using the overimpute function. Each observed value is deleted and for each one 100 values are predicted (using the same MI method) and the mean and 90% confidence intervals are computed for these 100 values. Then, we inspect whether the observed value falls within the obtained interval. On the graph, the y=x line is plotted (where the imputations should fall if they were perfect), as well as the mean (dots) and intervals (lines) for each value. Around ninety percent of these confidence intervals should contain the y = x line, which means that the true observed value falls within this range. The color of the line (as coded in the legend) represents the fraction of missing observations in the pattern of missingness for that observation (ex: blue=0-2 missing entries).

overimpute(res.amelia, var = "maxO3")

Comment the quality of the imputation.

We can also examine the variability by projecting as supplementary tables the imputed data sets on the PCA configuration (plot the results of MI with PCA).

plot(res.MIPCA,choice= "ind.supp")

plot(res.MIPCA,choice= "var")

The plots represent the projection of the individuals (top) and variables (bottom) of each imputed data set as supplementary elements onto the reference configuration obtained with the iterative PCA algorithm. For the individuals, a confidence area is constructed for each, and if one has no missing entries, its confidence area is restricted to a point. All the plots show that the variability across different imputations is small and a user can interpret the PCA results with confidence.

Perform regression

MI aims to apply a statistical method on an incomplete data set. We now apply a regression model on each imputed data set of the amelia method and MIPCA methods.

resamelia <- lapply(res.amelia$imputations, as.data.frame)
# A regression on each imputed dataset
fitamelia<-lapply(resamelia, lm, 
                  formula="maxO3~ T9+T12+T15+Ne9+Ne12+Ne15+Vx9+Vx12+Vx15+maxO3v")  
# fitamelia <- lapply(resamelia, with, 
#                     lm(maxO3 ~ T9+T12+T15+Ne9+Ne12+Ne15+Vx9+Vx12+Vx15+maxO3v))

imp.mice <- mice(don, m=100,defaultMethod="norm.boot") # the variability of the parameters is obtained 
lm.mice.out <- with(imp.mice, lm(maxO3 ~ T9+T12+T15+Ne9+Ne12+Ne15+Vx9+Vx12+Vx15+maxO3v))

res.MIPCA <- lapply(res.MIPCA$res.MI, as.data.frame)
fitMIPCA<-lapply(res.MIPCA,lm, formula="maxO3~T9+T12+T15+Ne9+Ne12+Ne15+Vx9+Vx12+Vx15+maxO3v")

Aggregate the results of Regression with Multiple Imputation according to Rubin’s rule for MI with amelia and with PCA with the pool function from the mice package

poolamelia<-pool(as.mira(fitamelia)) 
summary(poolamelia)

##               estimate  std.error  statistic        df    p.value
## (Intercept) 19.9420015 19.4901617  1.0231830  7.078897 0.32625157
## T9           1.0070129  4.2031210  0.2395869  3.161697 0.81466268
## T12          2.4735036  4.1893532  0.5904261  3.383789 0.56577415
## T15         -0.2473683  1.8644070 -0.1326794  5.808150 0.89662884
## Ne9         -2.5547392  1.1369961 -2.2469200 12.094691 0.04407947
## Ne12        -2.3052677  3.6854195 -0.6255102  3.750581 0.54325910
## Ne15         0.6871875  2.2297828  0.3081858  4.390821 0.76318418
## Vx9          0.4595960  2.2721088  0.2022773  4.249724 0.84306258
## Vx12         0.4610013  3.0938075  0.1490078  3.694674 0.88400324
## Vx15         0.4372643  1.4972484  0.2920453  6.784286 0.77520100
## maxO3v       0.3124507  0.1236261  2.5273857  6.256804 0.02641370

poolMIPCA<-pool(as.mira(fitMIPCA))
summary(poolMIPCA)

##               estimate   std.error  statistic       df     p.value
## (Intercept) 12.2231485 18.40710622  0.6640451 61.82679 0.508868203
## T9           1.1201262  1.28505469  0.8716564 46.74630 0.386410286
## T12          1.6440663  1.03302947  1.5914999 50.14252 0.116055786
## T15          0.6877634  0.95080846  0.7233459 47.30412 0.471905654
## Ne9         -1.2553566  1.15348210 -1.0883191 57.43342 0.280230483
## Ne12        -1.6173184  1.41326639 -1.1443832 62.90186 0.256408058
## Ne15         0.3069931  1.17113893  0.2621321 59.25107 0.793998434
## Vx9          0.7566807  1.07550651  0.7035575 66.90143 0.484069200
## Vx12         0.9674316  1.19089898  0.8123541 62.56397 0.419374622
## Vx15         0.3065800  1.18920510  0.2578025 58.31779 0.797324535
## maxO3v       0.2524263  0.09034403  2.7940564 69.19377 0.006728215

#pool.mice <- pool(lm.mice.out)
#summary(pool.mice)

Write a function that removes the variables with the largest pvalues step by step (each time a variable is removed the regression model is performed again) until all variables are significant.

don2 <- don
reg <- lm(maxO3 ~. , data = don2)
while(any(summary(reg)$coeff[-1, 4]>0.05)){
  don2 <- don2[,!(colnames(don2)%in%names(which.max(summary(reg)$coeff[-1, 4])))]
  reg <- lm(maxO3 ~. , data = don2)
  }

We combine the results and performed the regression with missing values

# Submodel to compare
fitMIPCA<-lapply(res.MIPCA,lm, formula="maxO3~ T12+Ne9+Vx12+maxO3v")
poolMIPCA<-pool(as.mira(fitMIPCA))
summary(poolMIPCA)

##               estimate   std.error  statistic       df      p.value
## (Intercept)  9.7948175 14.53718673  0.6737767 61.07589 5.024488e-01
## T12          2.9627531  0.61038949  4.8538731 67.15423 6.108510e-06
## Ne9         -1.9143282  1.06362944 -1.7998075 60.08612 7.576516e-02
## Vx12         1.7828147  0.72646723  2.4540882 72.60097 1.635728e-02
## maxO3v       0.3279477  0.08035646  4.0811612 77.85290 1.076020e-04

#lm.mice.out <- with(imp.mice, lm(maxO3 ~ T12+Ne9+Vx12+maxO3v))
#pool.mice <- pool(lm.mice.out)
#summary(pool.mice)
fitamelia<-lapply(resamelia,lm, formula="maxO3~ T12+Ne9+Vx12+maxO3v")
poolamelia<-pool(as.mira(fitamelia))
summary(poolamelia)

##               estimate   std.error statistic       df      p.value
## (Intercept) 12.1506142 11.72848055  1.035992 18.58387 3.090306e-01
## T12          3.2340389  0.51325416  6.301048 23.78404 8.008505e-07
## Ne9         -3.2861788  0.95822637 -3.429439 10.79985 1.885150e-03
## Vx12         1.2487496  0.69891023  1.786710 15.37911 8.476774e-02
## maxO3v       0.3077342  0.06916046  4.449568 28.12829 1.235728e-04

1.4) Ecological example

Studies in community ecology aim to understand how and why individuals of different species co-occur in the same location at the same time. Hence, ecologists usually collect and store data on species distribution as tables containing the abundances of different species in several sampling sites. Additional information such as measures of environmental variables or species traits can also be recorded to examine the effects of abiotic features (characteristics, i.e. due to physico-chemical action and no biological action) and biotic features. Several projects compile data from preexisting databases. Due to the wide heterogeneity of measurement methods and research objectives, these huge data sets are often characterized by a high number of missing values. Hence, in addition to ecological questions, such data sets also present some important methodological and technical challenges for multivariate analysis. The GLOPNET data set contains 6 traits measured for 2494 plant species: LMA (leaf mass per area), LL (leaf lifes-pan), Amass (photosynthetic assimilation), Nmass (leaf nitrogen), Pmass (leaf phosphorus), Rmass (dark respiration rate). The last four variables are expressed per leaf dry mass. GLOPNET is a compilation of several existing data sets and thus contains a large proportion of missing values. All traits were log-normally distributed and log-transformed before analysis.

Ecolo <- read.csv("ecological.csv", header = TRUE, sep=";",dec=",")

## Delete species with only missing values for contiuous variables
ind <- which(rowSums(is.na(Ecolo[,-1])) == 6)
biome <- Ecolo[-ind,1]    ### Keep a categorical variable
Ecolo <- Ecolo[-ind,-1]   ### Select continuous variables
dim(Ecolo)

## [1] 2494    6

## proportion of missing values
sum(is.na(Ecolo))/(nrow(Ecolo)*ncol(Ecolo)) # 55% of missing values

## [1] 0.5338145

## Delete species with missing values
dim(na.omit(Ecolo)) # only 72 remaining species!

## [1] 72  6

53.38% of the entries in the GLOPNET data set are missing. Only 72 species have complete information for the 6 traits and the proportion of missing values varied between 4.97 % (LMA) to 89.01 % (Rmass).

# Visualize the pattern
library(VIM)
#aggr(Ecolo)
aggr(Ecolo,only.miss=TRUE,numbers=TRUE,sortVar=TRUE)

## 
##  Variables sorted by number of missings: 
##  Variable      Count
##     Rmass 0.89013633
##        LL 0.69967923
##     Pmass 0.69847634
##     Amass 0.69125902
##     Nmass 0.17361668
##       LMA 0.04971933

res <- summary(aggr(Ecolo,prop=TRUE,combined=TRUE))$combinations

#res[rev(order(res[,2])),]

mis.ind <- matrix("o",nrow=nrow(Ecolo),ncol=ncol(Ecolo))
mis.ind[is.na(Ecolo)] <- "m"
dimnames(mis.ind) <- dimnames(Ecolo)
library(FactoMineR)
resMCA <- MCA(mis.ind)

plot(resMCA,invis="ind",title="MCA graph of the categories")

### Impute the incomplete data set
library(missMDA)
### nb <- estim_ncpPCA(Ecolo,method.cv="Kfold",nbsim=100) ### Time consuming!
res.comp <- imputePCA(Ecolo,ncp=2)

#Perform a PCA on the completed data set
imp <- cbind.data.frame(res.comp$completeObs,biome)
res.pca <- PCA(imp,quali.sup=7,graph=FALSE)
plot(res.pca, hab=7, lab="quali")

plot(res.pca, hab=7, lab="quali",invisible="ind")

plot(res.pca, choix="var")

# Compare with PCA on the data imputed by the mean
PCA(Ecolo)

## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 2494 individuals, described by 6 variables
## *The results are available in the following objects:
## 
##    name               description                          
## 1  "$eig"             "eigenvalues"                        
## 2  "$var"             "results for the variables"          
## 3  "$var$coord"       "coord. for the variables"           
## 4  "$var$cor"         "correlations variables - dimensions"
## 5  "$var$cos2"        "cos2 for the variables"             
## 6  "$var$contrib"     "contributions of the variables"     
## 7  "$ind"             "results for the individuals"        
## 8  "$ind$coord"       "coord. for the individuals"         
## 9  "$ind$cos2"        "cos2 for the individuals"           
## 10 "$ind$contrib"     "contributions of the individuals"   
## 11 "$call"            "summary statistics"                 
## 12 "$call$centre"     "mean of the variables"              
## 13 "$call$ecart.type" "standard error of the variables"    
## 14 "$call$row.w"      "weights for the individuals"        
## 15 "$call$col.w"      "weights for the variables"

This first axis corresponding to the “leaf economic spectrum” separates species with potential for quick returns for investment with high values for Nmass, Amass, Rmass and Pmass and low values for LL and LMA (right part) from species with slow returns on the left part. Scores for the traits are very consistent between methods, to a lesser extent for the Mean. This representation can be used to add external information: grouping species by major biomes illustrates the universality of the leaf economic spectrum but also some specificities (e.g., Desert and Boreal forest mainly contain species of the quick-return end).

The graphical representation obtained by the Mean imputation highlights a very particular shape indicating that results are not reliable.

2) Categorical/mixed/multi-block data with missing values

2.1) Single imputation of categorical data with MCA/ MCA with missing values

We use the survey data set health concerning students’ health. 320 students answered 20 questions on their consumption of products (drugs, alcohol), on their psychological state and their sleeping condition. In addition, we have information regarding their gender, age and accommodation. The aim is to study the principal dimensions of variability of this data and to see if there are relationships between alcohol consumption and psychological state for instance. Then, after grouping individuals with the same profile, one can “label” them and see if there are relationships with the socio-economic questions.

Missing values are inserted to illustrate the methods.

library(FactoMineR)
health <- read.csv("sante.tex",sep=";",header=T)
dim(health)

## [1] 327  20

summary(health)

##   Pbsleep           Asleep          Fatigue         Nightmare  
##  Never:136   Never     : 66   Never     : 41   Never     :172  
##  Often: 70   QuiteOften: 86   QuiteOften:159   QuiteOften: 28  
##  Rare :121   Rare      :142   Rare      : 91   Rare      :119  
##              Veryoften : 33   Veryoften : 36   Veryoften :  8  
##                                                                
##                                                                
##                                                                
##      Fatiguecste        Insomnia      Sex               Age     
##  Never     : 73   Never     :250   Boy  :120   18yrsorless:147  
##  QuiteOften: 98   QuiteOften: 16   Girls:207   19yrs      : 99  
##  Rare      :139   Rare      : 54               20yrs      : 49  
##  Veryoften : 17   Veryoften :  7               21yrsetplus: 32  
##                                                                 
##                                                                 
##                                                                 
##  Liveparents          Housing           Absenteeism         Tabac    
##  No :186     Campus       : 31   Exceptionally:112   Frequent  : 81  
##  Yes:141     Flat_alone   : 59   Never        :129   Never     :208  
##              FurnishedRoom: 18   Often        : 21   Occasional: 38  
##              Hostel       : 10   Sometimes    : 57                   
##              Other        :  8   Veryoften    :  8                   
##              Parents      :141                                       
##              Roomate      : 60                                       
##       Alcohol      Druunk          Cannabis     Lonely     Depress   
##  Frequent : 27   Never:122   Frequent  : 22   Never:128   Never:119  
##  Never    : 42   Yes  :205   Never     :165   Often: 55   Often: 54  
##  Ocasional:258               Occasional:140   Rare :144   Rare :154  
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##  Desperate   Aggressiv       Hallucination
##  Never:184   Never:182   Never      :314  
##  Often: 52   Often: 31   RareorOften: 13  
##  Rare : 91   Rare :114                    
##                                           
##                                           
##                                           
##

healthNA <-health 
healthNA[5:10,4:6] <- NA
healthNA[55:60,12:14] <- NA

First, we can explore the pattern of missing using MCA (by default it codes a missing values as a new category):

res.mcaNA  <- MCA(healthNA, quali.sup = c(7:11))

Then, we can study the similarities between the students and the associations between categories performing MCA while skipping the missing values. We carry-out the following steps:

library(missMDA)
## First the number of components has to be estimated
# nb <- estim_ncpMCA(healthNA[,c(1:6,12:20)],ncp.max=10) ## Time-consuming, nb = 5

## Impute the indicator matrix and perform MCA
res.impute <- imputeMCA(healthNA[,c(1:6,12:20)], ncp=5)
res.impute$tab.disj[1:10, 10:21]

##    Rare Veryoften        Never QuiteOften      Rare   Veryoften      Never
## 1     1         0  1.000000000 0.00000000 0.0000000  0.00000000 1.00000000
## 2     0         0  1.000000000 0.00000000 0.0000000  0.00000000 1.00000000
## 3     0         0  0.000000000 0.00000000 1.0000000  0.00000000 1.00000000
## 4     0         0  1.000000000 0.00000000 0.0000000  0.00000000 0.00000000
## 5     0         0  0.645530145 0.06407905 0.2462122  0.04417856 0.34030543
## 6     0         0  0.632528934 0.12949032 0.2588293 -0.02084853 0.17287949
## 7     0         0  0.239751676 0.17029060 0.5334148  0.05654293 0.19788142
## 8     0         1  0.005433251 0.21241405 0.6274436  0.15470908 0.10561374
## 9     0         0 -0.049112351 0.04268786 0.9488273  0.05759715 0.08682299
## 10    0         0  0.502837245 0.07045245 0.4656037 -0.03889340 0.08939468
##    QuiteOften      Rare    Veryoften     Never  QuiteOften
## 1   0.0000000 0.0000000  0.000000000 1.0000000 0.000000000
## 2   0.0000000 0.0000000  0.000000000 1.0000000 0.000000000
## 3   0.0000000 0.0000000  0.000000000 1.0000000 0.000000000
## 4   1.0000000 0.0000000  0.000000000 1.0000000 0.000000000
## 5   0.2443741 0.3281611  0.087159320 0.8930104 0.006230533
## 6   0.3393262 0.4931404 -0.005346056 0.9004445 0.035455413
## 7   0.4458801 0.2826159  0.073622655 0.6013528 0.079902701
## 8   0.4111983 0.2504949  0.232692973 0.2616871 0.188485280
## 9   0.4161824 0.4202936  0.076701008 0.3061799 0.071792221
## 10  0.3394858 0.6013289 -0.030209419 0.7766499 0.027674843

apply(res.impute$tab.disj[1:10, 12:15],1,sum) # sum to 1 per variable

##  1  2  3  4  5  6  7  8  9 10 
##  1  1  1  1  1  1  1  1  1  1

res.impute$comp[5:10,4:6]  # the completed data set with the most plausible category

##    Nightmare Fatiguecste Insomnia
## 5      Never       Never    Never
## 6      Never        Rare    Never
## 7       Rare  QuiteOften    Never
## 8       Rare  QuiteOften     Rare
## 9       Rare        Rare     Rare
## 10     Never        Rare    Never

health[5:10,4:6]

##     Nightmare Fatiguecste   Insomnia
## 5  QuiteOften  QuiteOften QuiteOften
## 6        Rare        Rare      Never
## 7   Veryoften  QuiteOften QuiteOften
## 8       Never  QuiteOften      Never
## 9        Rare        Rare       Rare
## 10      Never        Rare      Never

## The imputed indicator matrix can be used as an input of the MCA function of the
## FactoMineR package to perform the MCA on the incomplete data 

res.mca <- MCA(healthNA,tab.disj=res.impute$tab.disj,quali.sup=7:11)

plot(res.mca, invisible=c("var","quali.sup"))

plot(res.mca, invisible=c("ind","quali.sup"), cex = 0.6)

plot(res.mca, invisible=c("ind","var"),  cex = 0.6)

plot(res.mca,invisible=c("ind"),autoLab="yes", selectMod="cos2 15", cex  = 0.6)

plot(res.mca,autoLab="yes", selectMod="cos2 5", select="cos2 5")

res.mca

## **Results of the Multiple Correspondence Analysis (MCA)**
## The analysis was performed on 327 individuals, described by 20 variables
## *The results are available in the following objects:
## 
##    name               
## 1  "$eig"             
## 2  "$var"             
## 3  "$var$coord"       
## 4  "$var$cos2"        
## 5  "$var$contrib"     
## 6  "$var$v.test"      
## 7  "$ind"             
## 8  "$ind$coord"       
## 9  "$ind$cos2"        
## 10 "$ind$contrib"     
## 11 "$quali.sup"       
## 12 "$quali.sup$coord" 
## 13 "$quali.sup$cos2"  
## 14 "$quali.sup$v.test"
## 15 "$call"            
## 16 "$call$marge.col"  
## 17 "$call$marge.li"   
##    description                                          
## 1  "eigenvalues"                                        
## 2  "results for the variables"                          
## 3  "coord. of the categories"                           
## 4  "cos2 for the categories"                            
## 5  "contributions of the categories"                    
## 6  "v-test for the categories"                          
## 7  "results for the individuals"                        
## 8  "coord. for the individuals"                         
## 9  "cos2 for the individuals"                           
## 10 "contributions of the individuals"                   
## 11 "results for the supplementary categorical variables"
## 12 "coord. for the supplementary categories"            
## 13 "cos2 for the supplementary categories"              
## 14 "v-test for the supplementary categories"            
## 15 "intermediate results"                               
## 16 "weights of columns"                                 
## 17 "weights of rows"

 ## Another ex of imputation of categorical data
data(vnf)

# Look at the pattern of missing values with MCA
MCA(vnf)

## **Results of the Multiple Correspondence Analysis (MCA)**
## The analysis was performed on 1232 individuals, described by 14 variables
## *The results are available in the following objects:
## 
##    name              description                       
## 1  "$eig"            "eigenvalues"                     
## 2  "$var"            "results for the variables"       
## 3  "$var$coord"      "coord. of the categories"        
## 4  "$var$cos2"       "cos2 for the categories"         
## 5  "$var$contrib"    "contributions of the categories" 
## 6  "$var$v.test"     "v-test for the categories"       
## 7  "$ind"            "results for the individuals"     
## 8  "$ind$coord"      "coord. for the individuals"      
## 9  "$ind$cos2"       "cos2 for the individuals"        
## 10 "$ind$contrib"    "contributions of the individuals"
## 11 "$call"           "intermediate results"            
## 12 "$call$marge.col" "weights of columns"              
## 13 "$call$marge.li"  "weights of rows"

#1) select the number of components
#nb <- estim_ncpMCA(vnf, ncp.max = 5) #Time-consuming, nb = 4

#2) Impute the indicator matrix 
res.impute <- imputeMCA(vnf, ncp = 4)
res.impute$tab.disj[1:5, 1:5]

##   Q7.1.1 Q7.1.2 Q7.1.3 Q7.2.1 Q7.2.2
## 1      1      0      0      1      0
## 2      1      0      0      1      0
## 3      1      0      0      1      0
## 4      1      0      0      1      0
## 5      1      0      0      0      0

res.impute$comp[1:5, 1:5]

##   Q7.1 Q7.2 Q7.4 Q8.1 Q8.2
## 1    1    1    3    1    2
## 2    1    1    1    2    1
## 3    1    1    1    2    2
## 4    1    1    1    1    1
## 5    1    3    1    1    1

## 2.2) Single imputation for mixed data with FAMD and with Forest

 res.ncp <- estim_ncpFAMD(ozo)

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=                                                                |   1%
  |                                                                       
  |=                                                                |   2%
  |                                                                       
  |==                                                               |   3%
  |                                                                       
  |===                                                              |   4%
  |                                                                       
  |===                                                              |   5%
  |                                                                       
  |====                                                             |   6%
  |                                                                       
  |=====                                                            |   7%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |======                                                           |   9%
  |                                                                       
  |=======                                                          |  10%
  |                                                                       
  |=======                                                          |  11%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |=========                                                        |  13%
  |                                                                       
  |=========                                                        |  14%
  |                                                                       
  |==========                                                       |  15%
  |                                                                       
  |===========                                                      |  16%
  |                                                                       
  |===========                                                      |  17%
  |                                                                       
  |============                                                     |  18%
  |                                                                       
  |============                                                     |  19%
  |                                                                       
  |=============                                                    |  20%
  |                                                                       
  |==============                                                   |  21%
  |                                                                       
  |==============                                                   |  22%
  |                                                                       
  |===============                                                  |  23%
  |                                                                       
  |================                                                 |  24%
  |                                                                       
  |================                                                 |  25%
  |                                                                       
  |=================                                                |  26%
  |                                                                       
  |==================                                               |  27%
  |                                                                       
  |==================                                               |  28%
  |                                                                       
  |===================                                              |  29%
  |                                                                       
  |====================                                             |  30%
  |                                                                       
  |====================                                             |  31%
  |                                                                       
  |=====================                                            |  32%
  |                                                                       
  |======================                                           |  33%
  |                                                                       
  |======================                                           |  34%
  |                                                                       
  |=======================                                          |  35%
  |                                                                       
  |========================                                         |  36%
  |                                                                       
  |========================                                         |  37%
  |                                                                       
  |=========================                                        |  38%
  |                                                                       
  |==========================                                       |  39%
  |                                                                       
  |==========================                                       |  40%
  |                                                                       
  |===========================                                      |  41%
  |                                                                       
  |============================                                     |  42%
  |                                                                       
  |============================                                     |  43%
  |                                                                       
  |=============================                                    |  44%
  |                                                                       
  |==============================                                   |  45%
  |                                                                       
  |==============================                                   |  46%
  |                                                                       
  |===============================                                  |  47%
  |                                                                       
  |================================                                 |  48%
  |                                                                       
  |================================                                 |  49%
  |                                                                       
  |=================================                                |  51%
  |                                                                       
  |=================================                                |  52%
  |                                                                       
  |==================================                               |  53%
  |                                                                       
  |===================================                              |  54%
  |                                                                       
  |===================================                              |  55%
  |                                                                       
  |====================================                             |  56%
  |                                                                       
  |=====================================                            |  57%
  |                                                                       
  |=====================================                            |  58%
  |                                                                       
  |======================================                           |  59%
  |                                                                       
  |=======================================                          |  60%
  |                                                                       
  |=======================================                          |  61%
  |                                                                       
  |========================================                         |  62%
  |                                                                       
  |=========================================                        |  63%
  |                                                                       
  |=========================================                        |  64%
  |                                                                       
  |==========================================                       |  65%
  |                                                                       
  |===========================================                      |  66%
  |                                                                       
  |===========================================                      |  67%
  |                                                                       
  |============================================                     |  68%
  |                                                                       
  |=============================================                    |  69%
  |                                                                       
  |=============================================                    |  70%
  |                                                                       
  |==============================================                   |  71%
  |                                                                       
  |===============================================                  |  72%
  |                                                                       
  |===============================================                  |  73%
  |                                                                       
  |================================================                 |  74%
  |                                                                       
  |=================================================                |  75%
  |                                                                       
  |=================================================                |  76%
  |                                                                       
  |==================================================               |  77%
  |                                                                       
  |===================================================              |  78%
  |                                                                       
  |===================================================              |  79%
  |                                                                       
  |====================================================             |  80%
  |                                                                       
  |=====================================================            |  81%
  |                                                                       
  |=====================================================            |  82%
  |                                                                       
  |======================================================           |  83%
  |                                                                       
  |======================================================           |  84%
  |                                                                       
  |=======================================================          |  85%
  |                                                                       
  |========================================================         |  86%
  |                                                                       
  |========================================================         |  87%
  |                                                                       
  |=========================================================        |  88%
  |                                                                       
  |==========================================================       |  89%
  |                                                                       
  |==========================================================       |  90%
  |                                                                       
  |===========================================================      |  91%
  |                                                                       
  |============================================================     |  92%
  |                                                                       
  |============================================================     |  93%
  |                                                                       
  |=============================================================    |  94%
  |                                                                       
  |==============================================================   |  95%
  |                                                                       
  |==============================================================   |  96%
  |                                                                       
  |===============================================================  |  97%
  |                                                                       
  |================================================================ |  98%
  |                                                                       
  |================================================================ |  99%
  |                                                                       
  |=================================================================| 100%

 res.famd <-imputeFAMD(ozo, ncp = 2)
 res.famd$completeObs[1:5,1:5]

##          maxO3       T9    T12      T15      Ne9
## 20010601    87 15.60000 18.500 20.17593 4.000000
## 20010602    82 17.20488 19.579 20.42794 5.000000
## 20010603    92 15.30000 17.600 19.50000 2.000000
## 20010604   114 16.20000 19.700 24.34365 1.000000
## 20010605    94 18.89175 20.500 20.40000 5.395526

library(missForest)
 res.rf <- missForest(ozo)

##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
##   missForest iteration 4 in progress...done!
##   missForest iteration 5 in progress...done!

 res.rf$ximp[1:5,1:5]

##          maxO3     T9    T12    T15  Ne9
## 20010601    87 15.600 18.500 19.185 4.00
## 20010602    82 16.181 18.593 18.797 5.00
## 20010603    92 15.300 17.600 19.500 2.00
## 20010604   114 16.200 19.700 23.046 1.00
## 20010605    94 17.930 20.500 20.400 5.19

2.3) Multiple imputation for categorical data: Multinomial regression with missing values

To perform a multinomial regression with missing values, we can use Multiple Imputation.

# With mice
library(mice)
x.impmi<-mice(healthNA[,c(1:6,12:20)], m = 5, printFlag = FALSE)

# with MCA
x.impmimca<-MIMCA(healthNA[,c(1:6,12:20)], ncp = 5)

# Perfoming a model on each imputed data table 
lm.mice.out <- with( x.impmi, nnet::multinom(Alcohol ~ Pbsleep + Fatigue +Nightmare,  trace = FALSE)) 
pool.mice <- pool(lm.mice.out) #combining the results
summary(pool.mice)

##                                   estimate   std.error   statistic
## Never:(Intercept)               0.67110392   0.6344357  1.05779660
## Never:PbsleepOften              0.20414120   0.7819089  0.26108054
## Never:PbsleepRare               0.12950926   0.6113161  0.21185319
## Never:FatigueQuiteOften        -0.47398908   0.7480367 -0.63364415
## Never:FatigueRare              -0.33793016   0.7795125 -0.43351475
## Never:FatigueVeryoften         -0.65536095   1.0394159 -0.63050888
## Never:NightmareQuiteOften       0.68569880   1.2737349  0.53833714
## Never:NightmareRare             0.37047385   0.6229343  0.59472377
## Never:NightmareVeryoften      -11.75103275 310.5442829 -0.03784012
## Ocasional:(Intercept)           1.49481817   0.5658549  2.64169864
## Ocasional:PbsleepOften         -0.12278255   0.6618573 -0.18551212
## Ocasional:PbsleepRare           0.41658350   0.5065365  0.82241554
## Ocasional:FatigueQuiteOften     0.39074738   0.6487862  0.60227453
## Ocasional:FatigueRare           0.47770770   0.6774482  0.70515758
## Ocasional:FatigueVeryoften      0.05897967   0.8665900  0.06805948
## Ocasional:NightmareQuiteOften   1.19305394   1.1102553  1.07457624
## Ocasional:NightmareRare         0.96956732   0.5256050  1.84466914
## Ocasional:NightmareVeryoften   -0.63515392   1.1552053 -0.54981911
##                                     df     p.value
## Never:(Intercept)             306.4054 0.290980080
## Never:PbsleepOften            244.3243 0.794205510
## Never:PbsleepRare             305.7557 0.832362199
## Never:FatigueQuiteOften       303.1832 0.526785031
## Never:FatigueRare             306.8137 0.664945411
## Never:FatigueVeryoften        303.7270 0.528830908
## Never:NightmareQuiteOften     175.7557 0.590734534
## Never:NightmareRare           288.4736 0.552466342
## Never:NightmareVeryoften      306.8727 0.969839759
## Ocasional:(Intercept)         306.6144 0.008671815
## Ocasional:PbsleepOften        216.4953 0.852949918
## Ocasional:PbsleepRare         306.2545 0.411478956
## Ocasional:FatigueQuiteOften   306.6040 0.547436134
## Ocasional:FatigueRare         306.8949 0.481247020
## Ocasional:FatigueVeryoften    302.8580 0.945782607
## Ocasional:NightmareQuiteOften 179.0795 0.283408618
## Ocasional:NightmareRare       286.0343 0.066049186
## Ocasional:NightmareVeryoften   81.2193 0.582843349

imp<-prelim(x.impmimca,healthNA[,c(1:6,12:20)])
fit <- with(data=imp,exp=nnet::multinom(Alcohol ~ Pbsleep + Fatigue +Nightmare,  trace = FALSE))
res.pool<-pool(fit)
summary(res.pool)

##                                    estimate   std.error   statistic
## Never:(Intercept)               0.654898879   0.6342719  1.03252071
## Never:PbsleepOften              0.289362763   0.7814772  0.37027666
## Never:PbsleepRare               0.100590492   0.6124561  0.16424113
## Never:FatigueQuiteOften        -0.462642261   0.7488728 -0.61778485
## Never:FatigueRare              -0.334404679   0.7796742 -0.42890310
## Never:FatigueVeryoften         -0.623359906   1.0439450 -0.59711949
## Never:NightmareQuiteOften       0.723419192   1.2491948  0.57910840
## Never:NightmareRare             0.455304320   0.6331114  0.71915353
## Never:NightmareVeryoften      -12.226832279 345.9404211 -0.03534375
## Ocasional:(Intercept)           1.483090478   0.5658218  2.62112634
## Ocasional:PbsleepOften         -0.009405037   0.6637354 -0.01416986
## Ocasional:PbsleepRare           0.404146999   0.5067273  0.79756317
## Ocasional:FatigueQuiteOften     0.394544457   0.6510951  0.60597052
## Ocasional:FatigueRare           0.478611135   0.6778531  0.70606908
## Ocasional:FatigueVeryoften      0.058675772   0.8749965  0.06705830
## Ocasional:NightmareQuiteOften   1.170419328   1.0971861  1.06674644
## Ocasional:NightmareRare         1.027996671   0.5366310  1.91564896
## Ocasional:NightmareVeryoften   -0.809033753   1.0995226 -0.73580457
##                                     df     p.value
## Never:(Intercept)             306.4604 0.302641077
## Never:PbsleepOften            301.8843 0.711431731
## Never:PbsleepRare             303.7831 0.869649418
## Never:FatigueQuiteOften       304.3513 0.537175236
## Never:FatigueRare             306.9062 0.668294598
## Never:FatigueVeryoften        302.9307 0.550867801
## Never:NightmareQuiteOften     291.9789 0.562940757
## Never:NightmareRare           299.2522 0.472593454
## Never:NightmareVeryoften      306.9523 0.971828609
## Ocasional:(Intercept)         306.6102 0.009200397
## Ocasional:PbsleepOften        300.4142 0.988703672
## Ocasional:PbsleepRare         304.9798 0.425740571
## Ocasional:FatigueQuiteOften   305.7482 0.544982157
## Ocasional:FatigueRare         306.9335 0.480680726
## Ocasional:FatigueVeryoften    303.9714 0.946578952
## Ocasional:NightmareQuiteOften 292.5767 0.286924648
## Ocasional:NightmareRare       301.1098 0.056338660
## Ocasional:NightmareVeryoften  286.7724 0.462411519

2.3) Imputation with groups of variables/Multiple Factor Analysis with missing values.

Let us consider the journal impact factors data from journalmetrics.com. We use a subset of 443 journals of the same sections than Journal of Statistical Software (Computer Science :: Software“, Decision Sciences :: Statistics, Probability and Uncertainty” and Mathematics :: Statistics and Probability“). This data has 45 columns which correspond to three metrics recorded each year from 1999 to 2013: IPP - impact per publication (it is closed to the ISI impact factor but for three rather than two years), SNIP - source normalized impact per paper (tries to weight by the number of citations per subject field to adjust for different citation cultures) and the SJR - SCImago journal rank (tries to capture average prestige per publication). This data contains 31% of missing values. We impute it with single imputation by Multiple Factor Analysis.

library(denoiseR)
data(impactfactor)
summary(impactfactor)

##     SNIP1999        SNIP2000         SNIP2001        SNIP2002     
##  Min.   :0.000   Min.   :0.0000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.516   1st Qu.:0.5640   1st Qu.:0.525   1st Qu.:0.5205  
##  Median :0.830   Median :0.8545   Median :0.821   Median :0.8630  
##  Mean   :1.061   Mean   :1.0439   Mean   :1.035   Mean   :1.0851  
##  3rd Qu.:1.406   3rd Qu.:1.4058   3rd Qu.:1.382   3rd Qu.:1.3475  
##  Max.   :7.149   Max.   :4.9960   Max.   :4.820   Max.   :6.7660  
##  NA's   :202     NA's   :197      NA's   :190     NA's   :181     
##     SNIP2003         SNIP2004         SNIP2005        SNIP2006     
##  Min.   :0.0000   Min.   : 0.000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.6735   1st Qu.: 0.752   1st Qu.:0.747   1st Qu.:0.6797  
##  Median :1.0560   Median : 1.199   Median :1.204   Median :1.2110  
##  Mean   :1.4958   Mean   : 1.688   Mean   :1.667   Mean   :1.5612  
##  3rd Qu.:1.6400   3rd Qu.: 1.896   3rd Qu.:2.015   3rd Qu.:1.9042  
##  Max.   :9.7420   Max.   :11.601   Max.   :8.962   Max.   :8.3660  
##  NA's   :172      NA's   :166      NA's   :155     NA's   :143     
##     SNIP2007         SNIP2008         SNIP2009         SNIP2010     
##  Min.   : 0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 0.652   1st Qu.:0.6425   1st Qu.:0.6258   1st Qu.:0.6015  
##  Median : 1.151   Median :1.1060   Median :1.0750   Median :1.0295  
##  Mean   : 1.516   Mean   :1.4387   Mean   :1.3975   Mean   :1.3493  
##  3rd Qu.: 1.931   3rd Qu.:1.8960   3rd Qu.:1.8805   3rd Qu.:1.7915  
##  Max.   :10.803   Max.   :7.8170   Max.   :7.0950   Max.   :7.8100  
##  NA's   :133      NA's   :112      NA's   :91       NA's   :67      
##     SNIP2011        SNIP2012        SNIP2013         IPP1999      
##  Min.   :0.000   Min.   :0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.712   1st Qu.:0.656   1st Qu.:0.6915   1st Qu.:0.2140  
##  Median :1.111   Median :1.087   Median :1.1350   Median :0.4320  
##  Mean   :1.458   Mean   :1.462   Mean   :1.4269   Mean   :0.5808  
##  3rd Qu.:1.844   3rd Qu.:1.915   3rd Qu.:1.8173   3rd Qu.:0.8000  
##  Max.   :7.917   Max.   :9.526   Max.   :8.8220   Max.   :5.0000  
##  NA's   :55      NA's   :40      NA's   :33       NA's   :202     
##     IPP2000          IPP2001          IPP2002          IPP2003      
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.2500   1st Qu.:0.2350   1st Qu.:0.2440   1st Qu.:0.3560  
##  Median :0.4275   Median :0.4460   Median :0.4745   Median :0.5840  
##  Mean   :0.5530   Mean   :0.5977   Mean   :0.6118   Mean   :0.7975  
##  3rd Qu.:0.7380   3rd Qu.:0.7750   3rd Qu.:0.7933   3rd Qu.:0.9805  
##  Max.   :2.8140   Max.   :3.7060   Max.   :4.1240   Max.   :5.5160  
##  NA's   :197      NA's   :190      NA's   :181      NA's   :172     
##     IPP2004          IPP2005          IPP2006          IPP2007      
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.3540   1st Qu.:0.3530   1st Qu.:0.3563   1st Qu.:0.3755  
##  Median :0.6400   Median :0.7235   Median :0.8000   Median :0.7385  
##  Mean   :0.9127   Mean   :0.9774   Mean   :1.0083   Mean   :1.0035  
##  3rd Qu.:1.1340   3rd Qu.:1.1572   3rd Qu.:1.2195   3rd Qu.:1.3013  
##  Max.   :6.1930   Max.   :5.7760   Max.   :6.1930   Max.   :6.7280  
##  NA's   :166      NA's   :155      NA's   :143      NA's   :133     
##     IPP2008         IPP2009          IPP2010          IPP2011      
##  Min.   :0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.343   1st Qu.:0.3738   1st Qu.:0.3575   1st Qu.:0.4145  
##  Median :0.800   Median :0.8165   Median :0.8820   Median :0.8980  
##  Mean   :1.011   Mean   :1.0450   Mean   :1.0785   Mean   :1.1539  
##  3rd Qu.:1.381   3rd Qu.:1.4072   3rd Qu.:1.4318   3rd Qu.:1.5097  
##  Max.   :5.006   Max.   :4.9080   Max.   :5.9190   Max.   :5.7920  
##  NA's   :112     NA's   :91       NA's   :67       NA's   :55      
##     IPP2012         IPP2013          SJR1999          SJR2000      
##  Min.   :0.000   Min.   :0.0000   Min.   :0.1000   Min.   :0.1000  
##  1st Qu.:0.424   1st Qu.:0.4627   1st Qu.:0.1690   1st Qu.:0.2077  
##  Median :0.899   Median :0.8885   Median :0.3865   Median :0.4015  
##  Mean   :1.207   Mean   :1.2348   Mean   :0.6501   Mean   :0.6368  
##  3rd Qu.:1.622   3rd Qu.:1.6042   3rd Qu.:0.8063   3rd Qu.:0.7420  
##  Max.   :7.103   Max.   :7.8650   Max.   :5.9850   Max.   :5.7420  
##  NA's   :40      NA's   :33       NA's   :177      NA's   :173     
##     SJR2001          SJR2002          SJR2003          SJR2004      
##  Min.   :0.1000   Min.   :0.1000   Min.   :0.1000   Min.   :0.1000  
##  1st Qu.:0.2570   1st Qu.:0.2105   1st Qu.:0.2137   1st Qu.:0.2200  
##  Median :0.4820   Median :0.5020   Median :0.5450   Median :0.5480  
##  Mean   :0.8048   Mean   :0.7308   Mean   :0.8066   Mean   :0.8134  
##  3rd Qu.:0.9935   3rd Qu.:0.8935   3rd Qu.:0.9150   3rd Qu.:1.0150  
##  Max.   :5.4930   Max.   :8.1210   Max.   :5.5480   Max.   :6.6790  
##  NA's   :168      NA's   :160      NA's   :151      NA's   :142     
##     SJR2005          SJR2006          SJR2007          SJR2008      
##  Min.   :0.1000   Min.   :0.1000   Min.   :0.1000   Min.   :0.1000  
##  1st Qu.:0.2160   1st Qu.:0.2515   1st Qu.:0.2657   1st Qu.:0.2747  
##  Median :0.5830   Median :0.5790   Median :0.5980   Median :0.5960  
##  Mean   :0.8521   Mean   :0.8490   Mean   :0.9131   Mean   :0.8706  
##  3rd Qu.:1.0680   3rd Qu.:1.0825   3rd Qu.:1.1983   3rd Qu.:1.1345  
##  Max.   :5.3830   Max.   :4.7700   Max.   :7.5350   Max.   :5.6900  
##  NA's   :134      NA's   :124      NA's   :115      NA's   :95      
##     SJR2009          SJR2010          SJR2011          SJR2012      
##  Min.   :0.1000   Min.   :0.1000   Min.   :0.1000   Min.   :0.1000  
##  1st Qu.:0.2420   1st Qu.:0.2250   1st Qu.:0.2475   1st Qu.:0.2480  
##  Median :0.5540   Median :0.5770   Median :0.5750   Median :0.5280  
##  Mean   :0.8313   Mean   :0.8146   Mean   :0.8268   Mean   :0.8264  
##  3rd Qu.:1.0810   3rd Qu.:1.0170   3rd Qu.:1.0765   3rd Qu.:1.0300  
##  Max.   :5.5740   Max.   :5.9280   Max.   :6.6580   Max.   :7.7180  
##  NA's   :78       NA's   :54       NA's   :48       NA's   :34      
##     SJR2013      
##  Min.   :0.1000  
##  1st Qu.:0.2572  
##  Median :0.5545  
##  Mean   :0.8736  
##  3rd Qu.:1.1318  
##  Max.   :7.8630  
##  NA's   :29

year=NULL; for (i in 1: 15) year= c(year, seq(i,45,15)) 
res.imp <- imputeMFA(impactfactor,  group = rep(3, 15),  type = rep("s", 15))

## MFA on the imputed data set
res.mfa  <-MFA(res.imp$completeObs, group=rep(3,15),  type=rep("s",15), 
name.group=paste("year", 1999:2013,sep="_"),graph=F)

plot(res.mfa, choix = "ind", select = "contrib 15", habillage = "group", cex = 0.7)
points(res.mfa$ind$coord[c("Journal of Statistical Software", "Journal of the American Statistical Association", "Annals of Statistics"), 1:2], col=2, cex=0.6)
text(res.mfa$ind$coord[c("Journal of Statistical Software"), 1], 
res.mfa$ind$coord[c("Journal of Statistical Software"), 2],cex=1,
labels=c("Journal of Statistical Software"),pos=3, col=2)

plot.MFA(res.mfa,choix="var", cex=0.5,shadow=TRUE, autoLab = "yes")

plot(res.mfa, select="IEEE/ACM Transactions on Networking",  partial="all", habillage="group",unselect=0.9,chrono=TRUE)

3) Contingency tables with count data and missing values

4) Multilevel (mixed) data with missing values

Handling missing values with R

Julie Josse