First of all you will need to install the following packages
install.packages("VIM")
install.packages("devtools")
library(devtools)
install_github("njtierney/naniar")
install.packages("missMDA")
install.packages("Amelia")
install.packages("mice")
install.packages("missForest")
install.packages("FactoMineR")
install.packages("tidyverse")
Air pollution is currently one of the most serious public health worries worldwide. Many epidemiological studies have proved the influence that some chemical compounds, such as sulphur dioxide (SO2), nitrogen dioxide (NO2), ozone (O3) can have on our health. Associations set up to monitor air quality are active all over the world to measure the concentration of these pollutants. They also keep a record of meteorological conditions such as temperature, cloud cover, wind, etc.
We have at our disposal 112 observations collected during the summer of 2001 in Rennes. The variables available are:
Here the final aim is to analyse the relationship between the maximum daily ozone (maxO3) level and the other meteorological variables. To do so we will perform regression to explain maxO3 in function of all the other variables. This data is incomplete - there are missing values. Indeed, it occurs frenquently to have machines that fail one day, leading to some information not recorded. We will therefore perform regression with missing values via multiple imputation.
ozo <- read.table("ozoneNA.csv",
header = TRUE,
sep = ",",
row.names = 1)
WindDirection <- ozo[, 12]
don <- ozo[, 1:11] #### keep the continuous variables
summary(don)
## maxO3 T9 T12 T15
## Min. : 42.00 Min. :11.30 Min. :14.30 Min. :14.90
## 1st Qu.: 71.00 1st Qu.:16.00 1st Qu.:18.60 1st Qu.:18.90
## Median : 81.50 Median :17.70 Median :20.40 Median :21.40
## Mean : 91.24 Mean :18.22 Mean :21.46 Mean :22.41
## 3rd Qu.:108.25 3rd Qu.:19.90 3rd Qu.:23.60 3rd Qu.:25.65
## Max. :166.00 Max. :25.30 Max. :33.50 Max. :35.50
## NA's :16 NA's :37 NA's :33 NA's :37
## Ne9 Ne12 Ne15 Vx9
## Min. :0.000 Min. :0.000 Min. :0.00 Min. :-7.8785
## 1st Qu.:3.000 1st Qu.:4.000 1st Qu.:3.00 1st Qu.:-3.0000
## Median :5.000 Median :5.000 Median :5.00 Median :-0.8671
## Mean :4.987 Mean :4.986 Mean :4.60 Mean :-1.0958
## 3rd Qu.:7.000 3rd Qu.:7.000 3rd Qu.:6.25 3rd Qu.: 0.6919
## Max. :8.000 Max. :8.000 Max. :8.00 Max. : 5.1962
## NA's :34 NA's :42 NA's :32 NA's :18
## Vx12 Vx15 maxO3v
## Min. :-7.8785 Min. :-9.000 Min. : 42.00
## 1st Qu.:-3.6941 1st Qu.:-3.759 1st Qu.: 70.00
## Median :-1.9284 Median :-1.710 Median : 82.50
## Mean :-1.6853 Mean :-1.830 Mean : 89.39
## 3rd Qu.:-0.1302 3rd Qu.: 0.000 3rd Qu.:101.00
## Max. : 6.5778 Max. : 3.830 Max. :166.00
## NA's :10 NA's :21 NA's :12
head(don)
## maxO3 T9 T12 T15 Ne9 Ne12 Ne15 Vx9 Vx12 Vx15 maxO3v
## 20010601 87 15.6 18.5 NA 4 4 8 0.6946 -1.7101 -0.6946 84
## 20010602 82 NA NA NA 5 5 7 -4.3301 -4.0000 -3.0000 87
## 20010603 92 15.3 17.6 19.5 2 NA NA 2.9544 NA 0.5209 82
## 20010604 114 16.2 19.7 NA 1 1 0 NA 0.3473 -0.1736 92
## 20010605 94 NA 20.5 20.4 NA NA NA -0.5000 -2.9544 -4.3301 114
## 20010606 80 17.7 19.8 18.3 6 NA 7 -5.6382 -5.0000 -6.0000 94
dim(don)
## [1] 112 11
library(VIM)
library(FactoMineR)
library(missMDA)
Q1 When could it be a good idea to delete rows or columns with missing values to work with a complete data set? Could you do it here?
dim(na.omit(don))
## [1] 13 11
Deleting rows or columns is possible as long as there is enough data left and the missing values are of the MCAR type so that the sample is a subsample of the original data. We will obtain unbiased estimators but with more variance. Deleting observations with missing data for ozone data leads to a table with 13 rows.
First, we perfom some descriptive statistics (how many missing? how many variables, individuals with missing?) and try to inspect and vizualize the pattern of missing entries and get hints on the mechanism. For this purpose, we use the R package naniar as well as Multiple Correspondence Analysis (FactoMineR package). An alternative would be to use VIM (Visualization and Imputation of Missing Values - Mathias Templ)
naniar
provides principled, tidy ways to summarise, visualise, and manipulate missing data with minimal deviations from the workflows in ggplot2 and tidy data.
We can start off with some quick summaries of the amount of missing and complete data in don
using:
pct_miss()
To give us the percentage of missings in the datan_miss()
to give the number of missings,and their complete
equivalents:
pct_complete()
To give us the percentage of completes in the datan_complete()
to give the number of complete valueslibrary(naniar)
library(tidyverse)
pct_miss(don) # percentage of missing value in the data.
## [1] 23.7013
n_miss(don) # number of missing values in the
## [1] 292
n_complete(don) # without missing value
## [1] 940
pct_complete(don) # without missing value
## [1] 76.2987
This is useful, but would be repetitive if you wanted to repeat this for every variable. We can instead look at summaries across the variables and cases
You can find the number and percentage missings in each variable and case using miss_case_summary
and miss_var_summary
.
miss_var_summary(don)
## # A tibble: 11 x 4
## variable n_miss pct_miss n_miss_cumsum
## <chr> <int> <dbl> <int>
## 1 Ne12 42 37.5 199
## 2 T9 37 33.0 53
## 3 T15 37 33.0 123
## 4 Ne9 34 30.4 157
## 5 T12 33 29.5 86
## 6 Ne15 32 28.6 231
## 7 Vx15 21 18.8 280
## 8 Vx9 18 16.1 249
## 9 maxO3 16 14.3 16
## 10 maxO3v 12 10.7 292
## 11 Vx12 10 8.93 259
miss_case_summary(don)
## # A tibble: 112 x 4
## case n_miss pct_miss n_miss_cumsum
## <int> <int> <dbl> <int>
## 1 18 8 72.7 44
## 2 56 8 72.7 167
## 3 25 7 63.6 69
## 4 47 7 63.6 138
## 5 110 7 63.6 286
## 6 24 6 54.5 62
## 7 31 6 54.5 91
## 8 45 6 54.5 131
## 9 22 5 45.5 55
## 10 33 5 45.5 97
## # ... with 102 more rows
Another way to summarise the missingness is to look at the number of times that there are a number of missings - to look at the tabulations using: miss_var_table
and miss_case_table
.
These show the number of times that there are a certain number of missings in a variable or a case, respectively:
miss_var_table(don)
## # A tibble: 10 x 3
## n_miss_in_var n_vars pct_vars
## <int> <int> <dbl>
## 1 10 1 9.09
## 2 12 1 9.09
## 3 16 1 9.09
## 4 18 1 9.09
## 5 21 1 9.09
## 6 32 1 9.09
## 7 33 1 9.09
## 8 34 1 9.09
## 9 37 2 18.2
## 10 42 1 9.09
miss_case_table(don)
## # A tibble: 9 x 3
## n_miss_in_case n_cases pct_cases
## <int> <int> <dbl>
## 1 0 13 11.6
## 2 1 24 21.4
## 3 2 22 19.6
## 4 3 19 17.0
## 5 4 18 16.1
## 6 5 8 7.14
## 7 6 3 2.68
## 8 7 3 2.68
## 9 8 2 1.79
This shows us there are two variables with exactly 37 missings in both, but that each individual variable seems to have unique numbers of missings. We also note that there are no variables with zero missings.
For the cases, we see that there are 13 cases with no missings, 24 with 1 missing, 22 with 2 missing, and so on.
A quick way to get a look at the missingness in the data is to use vis_miss
.
This visualises the missingness across the entire dataset.
library(visdat)
vis_miss(don)
You can also apply clustering to find similar missingness groups by setting cluster = TRUE
.
vis_miss(don, cluster = TRUE)
There are a lot of different clusters here, it is difficult to see clear relationships here.
Another technique is to try arranging by different variables using arrange
.
don %>%
arrange(maxO3) %>%
vis_miss()
don %>%
arrange(T12) %>%
vis_miss()
don %>%
arrange(maxO3v) %>%
vis_miss()
Another way to look at missings is to visualise them by variables
and cases
To visualise the missings for each variable, we use gg_miss_var
:
library(naniar)
gg_miss_var(don)
And show the percent missing by setting show_pct = TRUE
, and set the ylimits to be between 0 and 100.
library(tidyverse)
gg_miss_var(don,
show_pct = TRUE) +
ylim(0, 100)
We can look at the missings across cases using gg_miss_case
:
gg_miss_case(don)
And we can look at the combination and patterns of missingness by looking at an upset plot of the missingness - with gg_miss_upset.
The upset shows the combination of missings, by default choosing the 5 variables with the most missings, and then orders by the size of the missings in that set.
We set order.by = "freq"
to order the missingness by their frequency.
#gg_miss_upset(don,
# order.by = "freq")
library(UpSetR)
don %>%
as_shadow_upset() %>%
upset()
We can see that the combination which is the most frequent is the one where all the variables are observed (13 values). Then, the second one is the one where T9, T12 and T15 are simultaneously missing (7 rows) (1 is missing, 0 is observed - there is a 1 for the second, third and fourth variables).
We can look at the relationship between variables in a scatterplot, but we get a warning!
ggplot(don,
aes(x = T9,
y = maxO3)) +
geom_point()
We can display the missing values using geom_miss_point
. The points with missings are represented in red below the main cloud of points.
ggplot(don,
aes(x = T9,
y = maxO3)) +
geom_miss_point()
We can then explore the missingness by some categorical variable using facetting:
ggplot(don,
aes(x = T9,
y = maxO3)) +
geom_miss_point() +
facet_wrap(~ozo$WindDirection) +
theme_dark()
To take a closer look at the distribution of missings we add some missingness indicator information to the data. We call this indicator information a “shadow matrix”, and it gets added to the data with bind_shadow
. This creates a copy of the data with the names “Variable_NA”, and the values “NA” and “!NA” for missing, and not missing, respectively.
don %>% bind_shadow() %>% glimpse()
## Observations: 112
## Variables: 22
## $ maxO3 <int> 87, 82, 92, 114, 94, 80, 79, 79, 101, 106, 101, 90, ...
## $ T9 <dbl> 15.6, NA, 15.3, 16.2, NA, 17.7, 16.8, 14.9, 16.1, 18...
## $ T12 <dbl> 18.5, NA, 17.6, 19.7, 20.5, 19.8, 15.6, 17.5, 19.6, ...
## $ T15 <dbl> NA, NA, 19.5, NA, 20.4, 18.3, 14.9, 18.9, 21.4, 22.9...
## $ Ne9 <int> 4, 5, 2, 1, NA, 6, 7, 5, 2, 5, 7, NA, 7, NA, 8, 6, N...
## $ Ne12 <int> 4, 5, NA, 1, NA, NA, 8, 5, NA, NA, 7, 6, 5, 7, 7, 5,...
## $ Ne15 <int> 8, 7, NA, 0, NA, 7, NA, NA, 4, NA, 3, 8, 6, NA, NA, ...
## $ Vx9 <dbl> 0.6946, -4.3301, 2.9544, NA, -0.5000, -5.6382, -4.33...
## $ Vx12 <dbl> -1.7101, -4.0000, NA, 0.3473, -2.9544, -5.0000, -1.8...
## $ Vx15 <dbl> -0.6946, -3.0000, 0.5209, -0.1736, -4.3301, -6.0000,...
## $ maxO3v <int> 84, 87, 82, 92, 114, 94, 80, 99, 79, 101, 106, 101, ...
## $ maxO3_NA <fct> !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !N...
## $ T9_NA <fct> !NA, NA, !NA, !NA, NA, !NA, !NA, !NA, !NA, !NA, !NA,...
## $ T12_NA <fct> !NA, NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, NA, !NA,...
## $ T15_NA <fct> NA, NA, !NA, NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, ...
## $ Ne9_NA <fct> !NA, !NA, !NA, !NA, NA, !NA, !NA, !NA, !NA, !NA, !NA...
## $ Ne12_NA <fct> !NA, !NA, NA, !NA, NA, NA, !NA, !NA, NA, NA, !NA, !N...
## $ Ne15_NA <fct> !NA, !NA, NA, !NA, NA, !NA, NA, NA, !NA, NA, !NA, !N...
## $ Vx9_NA <fct> !NA, !NA, !NA, NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA...
## $ Vx12_NA <fct> !NA, !NA, NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA...
## $ Vx15_NA <fct> !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !N...
## $ maxO3v_NA <fct> !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !N...
This allows us to think about the “missingness” of a variable as it’s own variable.
So we can look at a density plot of max03
in ggplot2:
ggplot(don,
aes(x = maxO3)) +
geom_density()
We can use the “shadow matrix” to allow us to look at the density according to whether T9_NA is missing:
don %>%
bind_shadow() %>%
ggplot(aes(x = maxO3,
fill = T9_NA)) +
geom_density()
Or equivalently look at the variable T9 according to whether maxO3
is missing:
don %>%
bind_shadow() %>%
ggplot(aes(x = T9,
fill = maxO3_NA)) +
geom_density()
We can see that the distribution of T9 is the same when maxO3 is oberved and when maxO3 is missing. If the two densities (red and blue) were very different it would imply that when maxO3 is missing the values of T9 can be very high or very low which lead to suspect the MAR hypothesis.
We can use bind_shadow()
to then group by the missingness of a variable and perform some summary statistics on T9 for when maximum daily ozone level is present, and when it is missing.
don %>%
bind_shadow() %>%
group_by(maxO3_NA) %>%
summarise_at(
.vars = vars(T9),
.funs = funs(mean, sd, var, min, max),
na.rm = TRUE
)
## # A tibble: 2 x 6
## maxO3_NA mean sd var min max
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 !NA 18.3 3.06 9.36 11.3 25.3
## 2 NA 17.9 2.61 6.83 13.3 21
Q2 Do you observe any associations between the missing entries ? When values are missing on a variable does it correspond to small or large values on another one ?
We observe that the temperature variables T9, T12 and T15 tend to be missing together (probably indicating that thermometers failed) [as well as the Ne9, Ne12 and Ne15 variables.] We see more “red” values. We do not see more black or white values which should imply that when T9 is missing it would have corresponded to high or low values in another variable which should suggest MAR missing values for instance. Here everything points to MCAR values.
R1 Create a categorical dataset with “o” when the value of the cell is observed and “m” when it is missing, and with the same row and column names as in the original data. Then, you can perform Multiple Correspondence Analysis with the MCA function of the FactoMineR package.
?MCA
MCA can be seen as the counterpart of PCA for categorical data and here is used to study associations between missing and observed entries. MCA is a straightforwardly tool to visualise the missing data pattern even if the number of variable is large. It shows if missing values simultaneously occur in several variables or if missing values occur when some other variables are observed
data_miss <- data.frame(is.na(don))
data_miss <- apply(X=data_miss, FUN=function(x) if(x) "m" else "o", MARGIN=c(1,2))
# data_miss <- as_shadow(don) with the naniar package.
res.mca <- MCA(data_miss, graph = F)
plot(res.mca, invis = "ind", title = "MCA graph of the categories", cex = 0.5)
Then, before modeling the data, we perform a PCA with missing values to explore the correlation between variables. Use the R package missMDA dedicated to perform principal components methods with missing values and to impute data with PC methods.
library(missMDA)
?estim_ncpPCA
?imputePCA
The package missMDA allows the use of principal component methods for an incomplete data set. To achieve this goal in the case of PCA, the missing values are predicted using the iterative PCA algorithm for a predefined number of dimensions. Then, PCA is performed on the imputed data set. The single imputation step requires tuning the number of dimensions used to impute the data.
nb <- estim_ncpPCA(don,method.cv = "Kfold", verbose = FALSE) # estimate the number of components from incomplete data
#(available methods include GCV to approximate CV)
nb$ncp #2
## [1] 2
plot(0:5, nb$criterion, xlab = "nb dim", ylab = "MSEP")
res.comp <- imputePCA(don, ncp = nb$ncp) # iterativePCA algorithm
res.comp$completeObs[1:3,] # the imputed data set
## maxO3 T9 T12 T15 Ne9 Ne12 Ne15 Vx9
## 20010601 87 15.6000 18.50000 20.47146 4 4.000000 8.000000 0.6946
## 20010602 82 18.5047 20.86986 21.79932 5 5.000000 7.000000 -4.3301
## 20010603 92 15.3000 17.60000 19.50000 2 3.984066 3.812104 2.9544
## Vx12 Vx15 maxO3v
## 20010601 -1.710100 -0.6946 84
## 20010602 -4.000000 -3.0000 87
## 20010603 1.950563 0.5209 82
imp <- cbind.data.frame(res.comp$completeObs,WindDirection)
res.pca <- PCA(imp, quanti.sup = 1, quali.sup = 12, ncp = nb$ncp, graph=FALSE)
plot(res.pca, hab=12, lab="quali");
plot(res.pca, choix="var")
head(res.pca$ind$coord) #scores (principal components)
## Dim.1 Dim.2
## 20010601 -0.6604580 -1.2048271
## 20010602 -1.2317545 1.0465411
## 20010603 0.7984643 -2.7299508
## 20010604 2.5423205 -1.7435774
## 20010605 -0.4047517 0.8406578
## 20010606 -2.6701824 1.6934864
The incomplete data set can be imputed using the function imputePCA performing the iterative PCA algorithm, specifying the number of dimensions through the argument ncp=2. At convergence the algorithm provides both an estimation of the scores and loadings as well as a completed data set. The imputePCA function outputs the imputed data set. The completed data set is in the object completeObs. The imputePCA function also outputs the fitted matrix \(\hat X\) in the object fitted.
Q3 Could you guess how cross-validation is performed to select the number of components?
The cross-validation is performed with the Kfold methodFor the Kfold. A percentage pNA of missing values is inserted and predicted with a PCA model using ncp.min to ncp.max dimensions. This process is repeated nbsim times. The number of components which leads to the smallest MSEP (Mean Standard Error of Prediction) is retained.
Through the argument method.cv, the function estim_ncpPCA proposes several cross-validation procedures to choose this number. The default method is the generalised cross-validation method (method.cv=“gcv”). It consists in searching the number of dimensions which minimises the generalised cross-validation criterion, which can be seen as an approximation of the leave-one-out cross-validation criterion. The procedure is very fast, because it does not require adding explicitly missing values and predicting them for each cell of the data set. However, the number of dimensions minimising the criterion can sometimes be unobvious when several local minimum occur. In such a case, more computationally intensive methods, those performing explicit cross-validation, can be used, such as Kfold (method.cv=“Kfold”) or leave-one-out (method.cv=“loo”).
The Kfold cross-validation suggests to retain 2 dimensions for the imputation of the data set.
We perform multiple imputation either assuming 1) Joint Modeling (one joint probabilistic model for the variables all together) - We use the R package Amelia, which is by default consider Gaussian distribution 2) Condional Modeling (one model per variable) approach - We use the R package mice which by default consider one model of linear regression per variable 3) a PCA based model - We use the R package missMDA
For each approach we generate 100 imputed data sets.
library(Amelia)
?amelia
res.amelia <- amelia(don, m = 5)
## -- Imputation 1 --
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 21 22 23 24 25 26 27
##
## -- Imputation 2 --
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 61 62 63 64
##
## -- Imputation 3 --
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
##
## -- Imputation 4 --
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 41 42 43 44 45 46
##
## -- Imputation 5 --
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 41 42 43 44 45 46 47 48
#names(res.amelia$imputations)
#res.amelia$imputations$imp1# the first imputed data set
library(mice)
imp.mice <- mice(don, m = 100, defaultMethod = "norm.boot") # the variability of the parameters is obtained
?MIPCA
?plot.MIPCA
res.MIPCA <- MIPCA(don, ncp = 2, nboot = 100) # MI with PCA using 2 dimensions
The function MIPCA gives as output the data set imputed by the iterative PCA algorithm (in res.imputePCA) and the other data sets generated by the MIPCA algorithm (in res.MI). The number of data sets generated by this algorithm is controlled by the nboot argument, equal to 100 by default. The other arguments of this function are the same as those for the imputePCA function.
Exploratory analysis is very important and even at this stage of the analysis.
We will inspect the imputed values created to know if the imputation method should require more investigation or if we can continue and analyze the data. A common practice consists in comparing the distribution of the imputed values and of the observed values. Check the compare.density function.
compare.density(res.amelia, var = "T12")
Q Do both distributions need to be close? Could the missing values differ from the observed ones both in spread and in location?
Note that a difference between these distributions does not mean that the model is unsuitable. Indeed, when the missing data mechanism is not MCAR, it could make sense to observe differences between the distribution of imputed values and the distribution of observed values. However, if differences occur, more investigations would be required to try to explain them.
The quality of imputation can also be assessed with cross-validation using the overimpute function. Each observed value is deleted and for each one 100 values are predicted (using the same MI method) and the mean and 90% confidence intervals are computed for these 100 values. Then, we inspect whether the observed value falls within the obtained interval. On the graph, the y=x line is plotted (where the imputations should fall if they were perfect), as well as the mean (dots) and intervals (lines) for each value. Around ninety percent of these confidence intervals should contain the y = x line, which means that the true observed value falls within this range. The color of the line (as coded in the legend) represents the fraction of missing observations in the pattern of missingness for that observation (ex: blue=0-2 missing entries).
overimpute(res.amelia, var = "maxO3")
We can also examine the variability by projecting as supplementary tables the imputed data sets on the PCA configuration (plot the results of MI with PCA).
plot(res.MIPCA,choice= "ind.supp")
plot(res.MIPCA,choice= "var")
The plots represent the projection of the individuals (top) and variables (bottom) of each imputed data set as supplementary elements onto the reference configuration obtained with the iterative PCA algorithm. For the individuals, a confidence area is constructed for each, and if one has no missing entries, its confidence area is restricted to a point. All the plots show that the variability across different imputations is small and a user can interpret the PCA results with confidence.
MI aims to apply a statistical method on an incomplete data set. We now apply a regression model on each imputed data set of the amelia method and MIPCA methods.
resamelia <- lapply(res.amelia$imputations, as.data.frame)
# A regression on each imputed dataset
fitamelia<-lapply(resamelia, lm,
formula="maxO3~ T9+T12+T15+Ne9+Ne12+Ne15+Vx9+Vx12+Vx15+maxO3v")
# fitamelia <- lapply(resamelia, with,
# lm(maxO3 ~ T9+T12+T15+Ne9+Ne12+Ne15+Vx9+Vx12+Vx15+maxO3v))
imp.mice <- mice(don, m=100,defaultMethod="norm.boot") # the variability of the parameters is obtained
lm.mice.out <- with(imp.mice, lm(maxO3 ~ T9+T12+T15+Ne9+Ne12+Ne15+Vx9+Vx12+Vx15+maxO3v))
res.MIPCA <- lapply(res.MIPCA$res.MI, as.data.frame)
fitMIPCA<-lapply(res.MIPCA,lm, formula="maxO3~T9+T12+T15+Ne9+Ne12+Ne15+Vx9+Vx12+Vx15+maxO3v")
poolamelia<-pool(as.mira(fitamelia))
summary(poolamelia)
## estimate std.error statistic df p.value
## (Intercept) 19.3008802 20.4269267 0.9448744 9.156416 0.36334317
## T9 1.0365941 3.7064439 0.2796735 3.725997 0.78448590
## T12 1.3936132 3.1734209 0.4391517 4.852282 0.66834788
## T15 0.6400731 1.9088331 0.3353217 6.137191 0.74316648
## Ne9 -1.2425404 1.6724030 -0.7429671 5.934270 0.47178381
## Ne12 -2.7872454 2.9442641 -0.9466696 5.307437 0.36246397
## Ne15 0.8739977 1.3634798 0.6410052 12.011132 0.53355926
## Vx9 0.8666599 1.6405449 0.5282756 7.015473 0.60693100
## Vx12 0.4887195 2.1687298 0.2253482 6.209202 0.82549695
## Vx15 0.4964256 1.4392222 0.3449263 9.386397 0.73611455
## maxO3v 0.2981957 0.1203265 2.4782219 7.689351 0.02903622
poolMIPCA<-pool(as.mira(fitMIPCA))
summary(poolMIPCA)
## estimate std.error statistic df p.value
## (Intercept) 12.7547158 18.44444620 0.6915207 62.65724 0.491533365
## T9 1.0124513 1.32664563 0.7631663 43.94939 0.447937301
## T12 1.5481541 1.02177779 1.5151573 53.57541 0.134250639
## T15 0.8270436 0.90466203 0.9142018 56.48618 0.363759801
## Ne9 -1.0669031 1.18182976 -0.9027553 56.20932 0.369762308
## Ne12 -1.8046512 1.56165049 -1.1556051 50.39130 0.251785274
## Ne15 0.3818986 1.16884956 0.3267303 62.74103 0.744850124
## Vx9 0.8234073 1.06833916 0.7707358 69.73461 0.443466007
## Vx12 0.9935546 1.14848906 0.8650971 69.76246 0.389950634
## Vx15 0.2081217 1.13987925 0.1825822 65.20298 0.855655377
## maxO3v 0.2502536 0.09206577 2.7182051 67.67098 0.008273753
#pool.mice <- pool(lm.mice.out)
#summary(pool.mice)
don2 <- don
reg <- lm(maxO3 ~. , data = don2)
while(any(summary(reg)$coeff[-1, 4]>0.05)){
don2 <- don2[,!(colnames(don2)%in%names(which.max(summary(reg)$coeff[-1, 4])))]
reg <- lm(maxO3 ~. , data = don2)
}
We combine the results and performed the regression with missing values
# Submodel to compare
fitMIPCA<-lapply(res.MIPCA,lm, formula="maxO3~ T12+Ne9+Vx12+maxO3v")
poolMIPCA<-pool(as.mira(fitMIPCA))
summary(poolMIPCA)
## estimate std.error statistic df p.value
## (Intercept) 9.4746829 14.26052896 0.6643991 66.39501 0.5085254775
## T12 2.9522183 0.63368128 4.6588378 63.63925 0.0000139317
## Ne9 -1.8307128 1.07180284 -1.7080686 61.45186 0.0918675454
## Vx12 1.7929235 0.74833086 2.3958968 69.73607 0.0191404878
## maxO3v 0.3286382 0.08415626 3.9050956 73.09463 0.0002079746
#lm.mice.out <- with(imp.mice, lm(maxO3 ~ T12+Ne9+Vx12+maxO3v))
#pool.mice <- pool(lm.mice.out)
#summary(pool.mice)
fitamelia<-lapply(resamelia,lm, formula="maxO3~ T12+Ne9+Vx12+maxO3v")
poolamelia<-pool(as.mira(fitamelia))
summary(poolamelia)
## estimate std.error statistic df p.value
## (Intercept) 6.3208449 14.98158031 0.4219078 11.699053 6.767519e-01
## T12 3.1149265 0.58410855 5.3327869 20.590983 1.653628e-05
## Ne9 -2.0510525 1.52007385 -1.3493111 5.446113 1.895097e-01
## Vx12 1.6882470 0.88284691 1.9122761 9.195620 6.753819e-02
## maxO3v 0.3400338 0.07501389 4.5329447 24.632555 1.289599e-04
Studies in community ecology aim to understand how and why individuals of different species co-occur in the same location at the same time. Hence, ecologists usually collect and store data on species distribution as tables containing the abundances of different species in several sampling sites. Additional information such as measures of environmental variables or species traits can also be recorded to examine the effects of abiotic features (characteristics, i.e. due to physico-chemical action and no biological action) and biotic features. Several projects compile data from preexisting databases. Due to the wide heterogeneity of measurement methods and research objectives, these huge data sets are often characterized by a high number of missing values. Hence, in addition to ecological questions, such data sets also present some important methodological and technical challenges for multivariate analysis. The GLOPNET data set contains 6 traits measured for 2494 plant species: LMA (leaf mass per area), LL (leaf lifes-pan), Amass (photosynthetic assimilation), Nmass (leaf nitrogen), Pmass (leaf phosphorus), Rmass (dark respiration rate). The last four variables are expressed per leaf dry mass. GLOPNET is a compilation of several existing data sets and thus contains a large proportion of missing values. All traits were log-normally distributed and log-transformed before analysis.
Ecolo <- read.csv("ecological.csv", header = TRUE, sep=";",dec=",")
## Delete species with only missing values for contiuous variables
ind <- which(rowSums(is.na(Ecolo[,-1])) == 6)
biome <- Ecolo[-ind,1] ### Keep a categorical variable
Ecolo <- Ecolo[-ind,-1] ### Select continuous variables
dim(Ecolo)
## [1] 2494 6
## proportion of missing values
sum(is.na(Ecolo))/(nrow(Ecolo)*ncol(Ecolo)) # 55% of missing values
## [1] 0.5338145
## Delete species with missing values
dim(na.omit(Ecolo)) # only 72 remaining species!
## [1] 72 6
53.38% of the entries in the GLOPNET data set are missing. Only 72 species have complete information for the 6 traits and the proportion of missing values varied between 4.97 % (LMA) to 89.01 % (Rmass).
vis_miss(Ecolo)
vis_miss(Ecolo, cluster = TRUE)
gg_miss_case(Ecolo)
gg_miss_var(Ecolo)
gg_miss_upset(Ecolo,
order.by = "freq")
miss_case_table(Ecolo)
## # A tibble: 6 x 3
## n_miss_in_case n_cases pct_cases
## <int> <int> <dbl>
## 1 0 72 2.89
## 2 1 248 9.94
## 3 2 277 11.1
## 4 3 807 32.4
## 5 4 685 27.5
## 6 5 405 16.2
miss_var_table(Ecolo)
## # A tibble: 6 x 3
## n_miss_in_var n_vars pct_vars
## <int> <int> <dbl>
## 1 124 1 16.7
## 2 433 1 16.7
## 3 1724 1 16.7
## 4 1742 1 16.7
## 5 1745 1 16.7
## 6 2220 1 16.7
# Visualize the pattern
# library(VIM)
#aggr(Ecolo)
# aggr(Ecolo,only.miss=TRUE,numbers=TRUE,sortVar=TRUE)
# res <- summary(aggr(Ecolo,prop=TRUE,combined=TRUE))$combinations
#res[rev(order(res[,2])),]
mis.ind <- matrix("o",nrow=nrow(Ecolo),ncol=ncol(Ecolo))
mis.ind[is.na(Ecolo)] <- "m"
dimnames(mis.ind) <- dimnames(Ecolo)
library(FactoMineR)
resMCA <- MCA(mis.ind)
plot(resMCA,invis="ind",title="MCA graph of the categories")
### Impute the incomplete data set
library(missMDA)
### nb <- estim_ncpPCA(Ecolo,method.cv="Kfold",nbsim=100) ### Time consuming!
res.comp <- imputePCA(Ecolo,ncp=2)
#Perform a PCA on the completed data set
imp <- cbind.data.frame(res.comp$completeObs,biome)
res.pca <- PCA(imp,quali.sup=7,graph=FALSE)
plot(res.pca, hab=7, lab="quali")
plot(res.pca, hab=7, lab="quali",invisible="ind")
plot(res.pca, choix="var")
# Compare with PCA on the data imputed by the mean
PCA(Ecolo)
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 2494 individuals, described by 6 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. for the variables"
## 4 "$var$cor" "correlations variables - dimensions"
## 5 "$var$cos2" "cos2 for the variables"
## 6 "$var$contrib" "contributions of the variables"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "summary statistics"
## 12 "$call$centre" "mean of the variables"
## 13 "$call$ecart.type" "standard error of the variables"
## 14 "$call$row.w" "weights for the individuals"
## 15 "$call$col.w" "weights for the variables"
This first axis corresponding to the “leaf economic spectrum” separates species with potential for quick returns for investment with high values for Nmass, Amass, Rmass and Pmass and low values for LL and LMA (right part) from species with slow returns on the left part. Scores for the traits are very consistent between methods, to a lesser extent for the Mean. This representation can be used to add external information: grouping species by major biomes illustrates the universality of the leaf economic spectrum but also some specificities (e.g., Desert and Boreal forest mainly contain species of the quick-return end).
The graphical representation obtained by the Mean imputation highlights a very particular shape indicating that results are not reliable.
We use the survey data set health concerning students’ health. 320 students answered 20 questions on their consumption of products (drugs, alcohol), on their psychological state and their sleeping condition. In addition, we have information regarding their gender, age and accommodation. The aim is to study the principal dimensions of variability of this data and to see if there are relationships between alcohol consumption and psychological state for instance. Then, after grouping individuals with the same profile, one can “label” them and see if there are relationships with the socio-economic questions.
Missing values are inserted to illustrate the methods.
library(FactoMineR)
health <- read.csv("sante.tex",sep=";",header=T)
dim(health)
## [1] 327 20
summary(health)
## Pbsleep Asleep Fatigue Nightmare
## Never:136 Never : 66 Never : 41 Never :172
## Often: 70 QuiteOften: 86 QuiteOften:159 QuiteOften: 28
## Rare :121 Rare :142 Rare : 91 Rare :119
## Veryoften : 33 Veryoften : 36 Veryoften : 8
##
##
##
## Fatiguecste Insomnia Sex Age
## Never : 73 Never :250 Boy :120 18yrsorless:147
## QuiteOften: 98 QuiteOften: 16 Girls:207 19yrs : 99
## Rare :139 Rare : 54 20yrs : 49
## Veryoften : 17 Veryoften : 7 21yrsetplus: 32
##
##
##
## Liveparents Housing Absenteeism Tabac
## No :186 Campus : 31 Exceptionally:112 Frequent : 81
## Yes:141 Flat_alone : 59 Never :129 Never :208
## FurnishedRoom: 18 Often : 21 Occasional: 38
## Hostel : 10 Sometimes : 57
## Other : 8 Veryoften : 8
## Parents :141
## Roomate : 60
## Alcohol Druunk Cannabis Lonely Depress
## Frequent : 27 Never:122 Frequent : 22 Never:128 Never:119
## Never : 42 Yes :205 Never :165 Often: 55 Often: 54
## Ocasional:258 Occasional:140 Rare :144 Rare :154
##
##
##
##
## Desperate Aggressiv Hallucination
## Never:184 Never:182 Never :314
## Often: 52 Often: 31 RareorOften: 13
## Rare : 91 Rare :114
##
##
##
##
healthNA <-health
healthNA[5:10,4:6] <- NA
healthNA[55:60,12:14] <- NA
First, we can explore the pattern of missing using MCA (by default it codes a missing values as a new category):
res.mcaNA <- MCA(healthNA, quali.sup = c(7:11))
We can also explore some of the healthNA
missingness using tools from naniar:
Then, we can study the similarities between the students and the associations between categories performing MCA while skipping the missing values. We carry-out the following steps:
library(missMDA)
## First the number of components has to be estimated
# nb <- estim_ncpMCA(healthNA[,c(1:6,12:20)],ncp.max=10) ## Time-consuming, nb = 5
## Impute the indicator matrix and perform MCA
res.impute <- imputeMCA(healthNA[,c(1:6,12:20)], ncp=5)
res.impute$tab.disj[1:10, 10:21]
## Rare Veryoften Never QuiteOften Rare Veryoften Never
## 1 1 0 1.000000000 0.00000000 0.0000000 0.00000000 1.00000000
## 2 0 0 1.000000000 0.00000000 0.0000000 0.00000000 1.00000000
## 3 0 0 0.000000000 0.00000000 1.0000000 0.00000000 1.00000000
## 4 0 0 1.000000000 0.00000000 0.0000000 0.00000000 0.00000000
## 5 0 0 0.645530145 0.06407905 0.2462122 0.04417856 0.34030543
## 6 0 0 0.632528934 0.12949032 0.2588293 -0.02084853 0.17287949
## 7 0 0 0.239751676 0.17029060 0.5334148 0.05654293 0.19788142
## 8 0 1 0.005433251 0.21241405 0.6274436 0.15470908 0.10561374
## 9 0 0 -0.049112351 0.04268786 0.9488273 0.05759715 0.08682299
## 10 0 0 0.502837245 0.07045245 0.4656037 -0.03889340 0.08939468
## QuiteOften Rare Veryoften Never QuiteOften
## 1 0.0000000 0.0000000 0.000000000 1.0000000 0.000000000
## 2 0.0000000 0.0000000 0.000000000 1.0000000 0.000000000
## 3 0.0000000 0.0000000 0.000000000 1.0000000 0.000000000
## 4 1.0000000 0.0000000 0.000000000 1.0000000 0.000000000
## 5 0.2443741 0.3281611 0.087159320 0.8930104 0.006230533
## 6 0.3393262 0.4931404 -0.005346056 0.9004445 0.035455413
## 7 0.4458801 0.2826159 0.073622655 0.6013528 0.079902701
## 8 0.4111983 0.2504949 0.232692973 0.2616871 0.188485280
## 9 0.4161824 0.4202936 0.076701008 0.3061799 0.071792221
## 10 0.3394858 0.6013289 -0.030209419 0.7766499 0.027674843
apply(res.impute$tab.disj[1:10, 12:15],1,sum) # sum to 1 per variable
## 1 2 3 4 5 6 7 8 9 10
## 1 1 1 1 1 1 1 1 1 1
res.impute$comp[5:10,4:6] # the completed data set with the most plausible category
## Nightmare Fatiguecste Insomnia
## 5 Never Never Never
## 6 Never Rare Never
## 7 Rare QuiteOften Never
## 8 Rare QuiteOften Rare
## 9 Rare Rare Rare
## 10 Never Rare Never
health[5:10,4:6]
## Nightmare Fatiguecste Insomnia
## 5 QuiteOften QuiteOften QuiteOften
## 6 Rare Rare Never
## 7 Veryoften QuiteOften QuiteOften
## 8 Never QuiteOften Never
## 9 Rare Rare Rare
## 10 Never Rare Never
## The imputed indicator matrix can be used as an input of the MCA function of the
## FactoMineR package to perform the MCA on the incomplete data
res.mca <- MCA(healthNA,tab.disj=res.impute$tab.disj,quali.sup=7:11)
plot(res.mca, invisible=c("var","quali.sup"))
plot(res.mca, invisible=c("ind","quali.sup"), cex = 0.6)
plot(res.mca, invisible=c("ind","var"), cex = 0.6)
plot(res.mca,invisible=c("ind"),autoLab="yes", selectMod="cos2 15", cex = 0.6)
plot(res.mca,autoLab="yes", selectMod="cos2 5", select="cos2 5")
res.mca
## **Results of the Multiple Correspondence Analysis (MCA)**
## The analysis was performed on 327 individuals, described by 20 variables
## *The results are available in the following objects:
##
## name
## 1 "$eig"
## 2 "$var"
## 3 "$var$coord"
## 4 "$var$cos2"
## 5 "$var$contrib"
## 6 "$var$v.test"
## 7 "$ind"
## 8 "$ind$coord"
## 9 "$ind$cos2"
## 10 "$ind$contrib"
## 11 "$quali.sup"
## 12 "$quali.sup$coord"
## 13 "$quali.sup$cos2"
## 14 "$quali.sup$v.test"
## 15 "$call"
## 16 "$call$marge.col"
## 17 "$call$marge.li"
## description
## 1 "eigenvalues"
## 2 "results for the variables"
## 3 "coord. of the categories"
## 4 "cos2 for the categories"
## 5 "contributions of the categories"
## 6 "v-test for the categories"
## 7 "results for the individuals"
## 8 "coord. for the individuals"
## 9 "cos2 for the individuals"
## 10 "contributions of the individuals"
## 11 "results for the supplementary categorical variables"
## 12 "coord. for the supplementary categories"
## 13 "cos2 for the supplementary categories"
## 14 "v-test for the supplementary categories"
## 15 "intermediate results"
## 16 "weights of columns"
## 17 "weights of rows"
## Another ex of imputation of categorical data
data(vnf)
# Look at the pattern of missing values with MCA
MCA(vnf)
## **Results of the Multiple Correspondence Analysis (MCA)**
## The analysis was performed on 1232 individuals, described by 14 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. of the categories"
## 4 "$var$cos2" "cos2 for the categories"
## 5 "$var$contrib" "contributions of the categories"
## 6 "$var$v.test" "v-test for the categories"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "intermediate results"
## 12 "$call$marge.col" "weights of columns"
## 13 "$call$marge.li" "weights of rows"
#1) select the number of components
#nb <- estim_ncpMCA(vnf, ncp.max = 5) #Time-consuming, nb = 4
#2) Impute the indicator matrix
res.impute <- imputeMCA(vnf, ncp = 4)
res.impute$tab.disj[1:5, 1:5]
## Q7.1.1 Q7.1.2 Q7.1.3 Q7.2.1 Q7.2.2
## 1 1 0 0 1 0
## 2 1 0 0 1 0
## 3 1 0 0 1 0
## 4 1 0 0 1 0
## 5 1 0 0 0 0
res.impute$comp[1:5, 1:5]
## Q7.1 Q7.2 Q7.4 Q8.1 Q8.2
## 1 1 1 3 1 2
## 2 1 1 1 2 1
## 3 1 1 1 2 2
## 4 1 1 1 1 1
## 5 1 3 1 1 1
## 2.2) Single imputation for mixed data with FAMD and with Forest
res.ncp <- estim_ncpFAMD(ozo)
##
|
| | 0%
|
|= | 1%
|
|= | 2%
|
|== | 3%
|
|=== | 4%
|
|=== | 5%
|
|==== | 6%
|
|===== | 7%
|
|===== | 8%
|
|====== | 9%
|
|======= | 10%
|
|======= | 11%
|
|======== | 12%
|
|========= | 13%
|
|========= | 14%
|
|========== | 15%
|
|=========== | 16%
|
|=========== | 17%
|
|============ | 18%
|
|============ | 19%
|
|============= | 20%
|
|============== | 21%
|
|============== | 22%
|
|=============== | 23%
|
|================ | 24%
|
|================ | 25%
|
|================= | 26%
|
|================== | 27%
|
|================== | 28%
|
|=================== | 29%
|
|==================== | 30%
|
|==================== | 31%
|
|===================== | 32%
|
|====================== | 33%
|
|====================== | 34%
|
|======================= | 35%
|
|======================== | 36%
|
|======================== | 37%
|
|========================= | 38%
|
|========================== | 39%
|
|========================== | 40%
|
|=========================== | 41%
|
|============================ | 42%
|
|============================ | 43%
|
|============================= | 44%
|
|============================== | 45%
|
|============================== | 46%
|
|=============================== | 47%
|
|================================ | 48%
|
|================================ | 49%
|
|================================= | 51%
|
|================================= | 52%
|
|================================== | 53%
|
|=================================== | 54%
|
|=================================== | 55%
|
|==================================== | 56%
|
|===================================== | 57%
|
|===================================== | 58%
|
|====================================== | 59%
|
|======================================= | 60%
|
|======================================= | 61%
|
|======================================== | 62%
|
|========================================= | 63%
|
|========================================= | 64%
|
|========================================== | 65%
|
|=========================================== | 66%
|
|=========================================== | 67%
|
|============================================ | 68%
|
|============================================= | 69%
|
|============================================= | 70%
|
|============================================== | 71%
|
|=============================================== | 72%
|
|=============================================== | 73%
|
|================================================ | 74%
|
|================================================= | 75%
|
|================================================= | 76%
|
|================================================== | 77%
|
|=================================================== | 78%
|
|=================================================== | 79%
|
|==================================================== | 80%
|
|===================================================== | 81%
|
|===================================================== | 82%
|
|====================================================== | 83%
|
|====================================================== | 84%
|
|======================================================= | 85%
|
|======================================================== | 86%
|
|======================================================== | 87%
|
|========================================================= | 88%
|
|========================================================== | 89%
|
|========================================================== | 90%
|
|=========================================================== | 91%
|
|============================================================ | 92%
|
|============================================================ | 93%
|
|============================================================= | 94%
|
|============================================================== | 95%
|
|============================================================== | 96%
|
|=============================================================== | 97%
|
|================================================================ | 98%
|
|================================================================ | 99%
|
|=================================================================| 100%
res.famd <-imputeFAMD(ozo, ncp = 2)
res.famd$completeObs[1:5,1:5]
## maxO3 T9 T12 T15 Ne9
## 20010601 87 15.60000 18.500 20.17593 4.000000
## 20010602 82 17.20488 19.579 20.42794 5.000000
## 20010603 92 15.30000 17.600 19.50000 2.000000
## 20010604 114 16.20000 19.700 24.34365 1.000000
## 20010605 94 18.89175 20.500 20.40000 5.395526
library(missForest)
res.rf <- missForest(ozo)
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
## missForest iteration 4 in progress...done!
## missForest iteration 5 in progress...done!
## missForest iteration 6 in progress...done!
## missForest iteration 7 in progress...done!
## missForest iteration 8 in progress...done!
res.rf$ximp[1:5,1:5]
## maxO3 T9 T12 T15 Ne9
## 20010601 87 15.600 18.500 19.070 4.00
## 20010602 82 16.527 18.632 19.174 5.00
## 20010603 92 15.300 17.600 19.500 2.00
## 20010604 114 16.200 19.700 22.216 1.00
## 20010605 94 18.387 20.500 20.400 5.26
To perform a multinomial regression with missing values, we can use Multiple Imputation.
# With mice
library(mice)
x.impmi<-mice(healthNA[,c(1:6,12:20)], m = 5, printFlag = FALSE)
# with MCA
x.impmimca<-MIMCA(healthNA[,c(1:6,12:20)], ncp = 5)
# Perfoming a model on each imputed data table
lm.mice.out <- with( x.impmi, nnet::multinom(Alcohol ~ Pbsleep + Fatigue +Nightmare, trace = FALSE))
pool.mice <- pool(lm.mice.out) #combining the results
summary(pool.mice)
## estimate std.error statistic df
## Never:(Intercept) 0.67707080 0.6352599 1.0658170 305.991449
## Never:PbsleepOften 0.21876201 0.7751914 0.2822039 275.216435
## Never:PbsleepRare 0.10456264 0.6099034 0.1714413 297.947549
## Never:FatigueQuiteOften -0.46872375 0.7454158 -0.6288085 302.558070
## Never:FatigueRare -0.33956346 0.7794693 -0.4356342 306.838211
## Never:FatigueVeryoften -0.55999746 1.0516879 -0.5324749 291.390991
## Never:NightmareQuiteOften 0.77883699 1.2583534 0.6189335 301.315751
## Never:NightmareRare 0.35684427 0.6175182 0.5778685 286.656229
## Never:NightmareVeryoften -11.83460688 6.6795725 -1.7717611 1.721062
## Ocasional:(Intercept) 1.51242913 0.5677928 2.6636990 303.394763
## Ocasional:PbsleepOften -0.09757196 0.6588871 -0.1480860 251.348070
## Ocasional:PbsleepRare 0.39063928 0.5052704 0.7731292 299.139080
## Ocasional:FatigueQuiteOften 0.36682928 0.6489295 0.5652837 302.769471
## Ocasional:FatigueRare 0.47033399 0.6770749 0.6946558 306.905482
## Ocasional:FatigueVeryoften 0.10355154 0.8812286 0.1175081 298.091870
## Ocasional:NightmareQuiteOften 1.30304785 1.1047645 1.1794802 306.101057
## Ocasional:NightmareRare 0.92246444 0.5329642 1.7308187 215.798410
## Ocasional:NightmareVeryoften -0.74421685 1.1261818 -0.6608319 138.356600
## p.value
## Never:(Intercept) 0.287344101
## Never:PbsleepOften 0.777977396
## Never:PbsleepRare 0.863989760
## Never:FatigueQuiteOften 0.529942170
## Never:FatigueRare 0.663408410
## Never:FatigueVeryoften 0.594782385
## Never:NightmareQuiteOften 0.536419328
## Never:NightmareRare 0.563776651
## Never:NightmareVeryoften 0.077426431
## Ocasional:(Intercept) 0.008136773
## Ocasional:PbsleepOften 0.882372135
## Ocasional:PbsleepRare 0.440040958
## Ocasional:FatigueQuiteOften 0.572293802
## Ocasional:FatigueRare 0.487796585
## Ocasional:FatigueVeryoften 0.906534252
## Ocasional:NightmareQuiteOften 0.239120533
## Ocasional:NightmareRare 0.084489189
## Ocasional:NightmareVeryoften 0.509215854
imp<-prelim(x.impmimca,healthNA[,c(1:6,12:20)])
fit <- with(data=imp,exp=nnet::multinom(Alcohol ~ Pbsleep + Fatigue +Nightmare, trace = FALSE))
res.pool<-pool(fit)
summary(res.pool)
## estimate std.error statistic
## Never:(Intercept) 0.661718593 0.6348120 1.04238517
## Never:PbsleepOften 0.288514505 0.7815972 0.36913452
## Never:PbsleepRare 0.079914622 0.6126321 0.13044471
## Never:FatigueQuiteOften -0.472917675 0.7482938 -0.63199467
## Never:FatigueRare -0.334661367 0.7796712 -0.42923398
## Never:FatigueVeryoften -0.634307807 1.0459340 -0.60645109
## Never:NightmareQuiteOften 0.743783692 1.2535872 0.59332425
## Never:NightmareRare 0.444595404 0.6327971 0.70258759
## Never:NightmareVeryoften -11.825321874 357.7210671 -0.03305738
## Ocasional:(Intercept) 1.490236403 0.5663645 2.63123179
## Ocasional:PbsleepOften -0.008082391 0.6628675 -0.01219307
## Ocasional:PbsleepRare 0.396025974 0.5058571 0.78288108
## Ocasional:FatigueQuiteOften 0.393681883 0.6505090 0.60519055
## Ocasional:FatigueRare 0.477286666 0.6776656 0.70431001
## Ocasional:FatigueVeryoften 0.057572156 0.8743279 0.06584733
## Ocasional:NightmareQuiteOften 1.179666637 1.0991253 1.07327769
## Ocasional:NightmareRare 1.007035652 0.5381321 1.87135403
## Ocasional:NightmareVeryoften -0.836610658 1.0953290 -0.76379849
## df p.value
## Never:(Intercept) 306.2339 0.298053428
## Never:PbsleepOften 301.3213 0.712281958
## Never:PbsleepRare 303.3691 0.896300041
## Never:FatigueQuiteOften 304.7282 0.527860789
## Never:FatigueRare 306.9038 0.668054073
## Never:FatigueVeryoften 301.8462 0.544663504
## Never:NightmareQuiteOften 292.2901 0.553401098
## Never:NightmareRare 297.5941 0.482845275
## Never:NightmareVeryoften 306.9485 0.973650314
## Ocasional:(Intercept) 306.1600 0.008937206
## Ocasional:PbsleepOften 299.8354 0.990279498
## Ocasional:PbsleepRare 304.3165 0.434300529
## Ocasional:FatigueQuiteOften 305.5360 0.545499553
## Ocasional:FatigueRare 306.9327 0.481773731
## Ocasional:FatigueVeryoften 304.0138 0.947542240
## Ocasional:NightmareQuiteOften 294.4692 0.283989579
## Ocasional:NightmareRare 295.8139 0.062247112
## Ocasional:NightmareVeryoften 287.3435 0.445574033
Let us consider the journal impact factors data from journalmetrics.com. We use a subset of 443 journals of the same sections than Journal of Statistical Software (Computer Science :: Software“, Decision Sciences :: Statistics, Probability and Uncertainty” and Mathematics :: Statistics and Probability"). This data has 45 columns which correspond to three metrics recorded each year from 1999 to 2013: IPP - impact per publication (it is closed to the ISI impact factor but for three rather than two years), SNIP - source normalized impact per paper (tries to weight by the number of citations per subject field to adjust for different citation cultures) and the SJR - SCImago journal rank (tries to capture average prestige per publication). This data contains 31% of missing values. We impute it with single imputation by Multiple Factor Analysis.
library(denoiseR)
data(impactfactor)
summary(impactfactor)
## SNIP1999 SNIP2000 SNIP2001 SNIP2002
## Min. :0.000 Min. :0.0000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.516 1st Qu.:0.5640 1st Qu.:0.525 1st Qu.:0.5205
## Median :0.830 Median :0.8545 Median :0.821 Median :0.8630
## Mean :1.061 Mean :1.0439 Mean :1.035 Mean :1.0851
## 3rd Qu.:1.406 3rd Qu.:1.4058 3rd Qu.:1.382 3rd Qu.:1.3475
## Max. :7.149 Max. :4.9960 Max. :4.820 Max. :6.7660
## NA's :202 NA's :197 NA's :190 NA's :181
## SNIP2003 SNIP2004 SNIP2005 SNIP2006
## Min. :0.0000 Min. : 0.000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.6735 1st Qu.: 0.752 1st Qu.:0.747 1st Qu.:0.6797
## Median :1.0560 Median : 1.199 Median :1.204 Median :1.2110
## Mean :1.4958 Mean : 1.688 Mean :1.667 Mean :1.5612
## 3rd Qu.:1.6400 3rd Qu.: 1.896 3rd Qu.:2.015 3rd Qu.:1.9042
## Max. :9.7420 Max. :11.601 Max. :8.962 Max. :8.3660
## NA's :172 NA's :166 NA's :155 NA's :143
## SNIP2007 SNIP2008 SNIP2009 SNIP2010
## Min. : 0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 0.652 1st Qu.:0.6425 1st Qu.:0.6258 1st Qu.:0.6015
## Median : 1.151 Median :1.1060 Median :1.0750 Median :1.0295
## Mean : 1.516 Mean :1.4387 Mean :1.3975 Mean :1.3493
## 3rd Qu.: 1.931 3rd Qu.:1.8960 3rd Qu.:1.8805 3rd Qu.:1.7915
## Max. :10.803 Max. :7.8170 Max. :7.0950 Max. :7.8100
## NA's :133 NA's :112 NA's :91 NA's :67
## SNIP2011 SNIP2012 SNIP2013 IPP1999
## Min. :0.000 Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.712 1st Qu.:0.656 1st Qu.:0.6915 1st Qu.:0.2140
## Median :1.111 Median :1.087 Median :1.1350 Median :0.4320
## Mean :1.458 Mean :1.462 Mean :1.4269 Mean :0.5808
## 3rd Qu.:1.844 3rd Qu.:1.915 3rd Qu.:1.8173 3rd Qu.:0.8000
## Max. :7.917 Max. :9.526 Max. :8.8220 Max. :5.0000
## NA's :55 NA's :40 NA's :33 NA's :202
## IPP2000 IPP2001 IPP2002 IPP2003
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.2500 1st Qu.:0.2350 1st Qu.:0.2440 1st Qu.:0.3560
## Median :0.4275 Median :0.4460 Median :0.4745 Median :0.5840
## Mean :0.5530 Mean :0.5977 Mean :0.6118 Mean :0.7975
## 3rd Qu.:0.7380 3rd Qu.:0.7750 3rd Qu.:0.7933 3rd Qu.:0.9805
## Max. :2.8140 Max. :3.7060 Max. :4.1240 Max. :5.5160
## NA's :197 NA's :190 NA's :181 NA's :172
## IPP2004 IPP2005 IPP2006 IPP2007
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.3540 1st Qu.:0.3530 1st Qu.:0.3563 1st Qu.:0.3755
## Median :0.6400 Median :0.7235 Median :0.8000 Median :0.7385
## Mean :0.9127 Mean :0.9774 Mean :1.0083 Mean :1.0035
## 3rd Qu.:1.1340 3rd Qu.:1.1572 3rd Qu.:1.2195 3rd Qu.:1.3013
## Max. :6.1930 Max. :5.7760 Max. :6.1930 Max. :6.7280
## NA's :166 NA's :155 NA's :143 NA's :133
## IPP2008 IPP2009 IPP2010 IPP2011
## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.343 1st Qu.:0.3738 1st Qu.:0.3575 1st Qu.:0.4145
## Median :0.800 Median :0.8165 Median :0.8820 Median :0.8980
## Mean :1.011 Mean :1.0450 Mean :1.0785 Mean :1.1539
## 3rd Qu.:1.381 3rd Qu.:1.4072 3rd Qu.:1.4318 3rd Qu.:1.5097
## Max. :5.006 Max. :4.9080 Max. :5.9190 Max. :5.7920
## NA's :112 NA's :91 NA's :67 NA's :55
## IPP2012 IPP2013 SJR1999 SJR2000
## Min. :0.000 Min. :0.0000 Min. :0.1000 Min. :0.1000
## 1st Qu.:0.424 1st Qu.:0.4627 1st Qu.:0.1690 1st Qu.:0.2077
## Median :0.899 Median :0.8885 Median :0.3865 Median :0.4015
## Mean :1.207 Mean :1.2348 Mean :0.6501 Mean :0.6368
## 3rd Qu.:1.622 3rd Qu.:1.6042 3rd Qu.:0.8063 3rd Qu.:0.7420
## Max. :7.103 Max. :7.8650 Max. :5.9850 Max. :5.7420
## NA's :40 NA's :33 NA's :177 NA's :173
## SJR2001 SJR2002 SJR2003 SJR2004
## Min. :0.1000 Min. :0.1000 Min. :0.1000 Min. :0.1000
## 1st Qu.:0.2570 1st Qu.:0.2105 1st Qu.:0.2137 1st Qu.:0.2200
## Median :0.4820 Median :0.5020 Median :0.5450 Median :0.5480
## Mean :0.8048 Mean :0.7308 Mean :0.8066 Mean :0.8134
## 3rd Qu.:0.9935 3rd Qu.:0.8935 3rd Qu.:0.9150 3rd Qu.:1.0150
## Max. :5.4930 Max. :8.1210 Max. :5.5480 Max. :6.6790
## NA's :168 NA's :160 NA's :151 NA's :142
## SJR2005 SJR2006 SJR2007 SJR2008
## Min. :0.1000 Min. :0.1000 Min. :0.1000 Min. :0.1000
## 1st Qu.:0.2160 1st Qu.:0.2515 1st Qu.:0.2657 1st Qu.:0.2747
## Median :0.5830 Median :0.5790 Median :0.5980 Median :0.5960
## Mean :0.8521 Mean :0.8490 Mean :0.9131 Mean :0.8706
## 3rd Qu.:1.0680 3rd Qu.:1.0825 3rd Qu.:1.1983 3rd Qu.:1.1345
## Max. :5.3830 Max. :4.7700 Max. :7.5350 Max. :5.6900
## NA's :134 NA's :124 NA's :115 NA's :95
## SJR2009 SJR2010 SJR2011 SJR2012
## Min. :0.1000 Min. :0.1000 Min. :0.1000 Min. :0.1000
## 1st Qu.:0.2420 1st Qu.:0.2250 1st Qu.:0.2475 1st Qu.:0.2480
## Median :0.5540 Median :0.5770 Median :0.5750 Median :0.5280
## Mean :0.8313 Mean :0.8146 Mean :0.8268 Mean :0.8264
## 3rd Qu.:1.0810 3rd Qu.:1.0170 3rd Qu.:1.0765 3rd Qu.:1.0300
## Max. :5.5740 Max. :5.9280 Max. :6.6580 Max. :7.7180
## NA's :78 NA's :54 NA's :48 NA's :34
## SJR2013
## Min. :0.1000
## 1st Qu.:0.2572
## Median :0.5545
## Mean :0.8736
## 3rd Qu.:1.1318
## Max. :7.8630
## NA's :29
year=NULL; for (i in 1: 15) year= c(year, seq(i,45,15))
res.imp <- imputeMFA(impactfactor, group = rep(3, 15), type = rep("s", 15))
## MFA on the imputed data set
res.mfa <-MFA(res.imp$completeObs, group=rep(3,15), type=rep("s",15),
name.group=paste("year", 1999:2013,sep="_"),graph=F)
plot(res.mfa, choix = "ind", select = "contrib 15", habillage = "group", cex = 0.7)
points(res.mfa$ind$coord[c("Journal of Statistical Software", "Journal of the American Statistical Association", "Annals of Statistics"), 1:2], col=2, cex=0.6)
text(res.mfa$ind$coord[c("Journal of Statistical Software"), 1],
res.mfa$ind$coord[c("Journal of Statistical Software"), 2],cex=1,
labels=c("Journal of Statistical Software"),pos=3, col=2)
plot.MFA(res.mfa,choix="var", cex=0.5,shadow=TRUE, autoLab = "yes")
plot(res.mfa, select="IEEE/ACM Transactions on Networking", partial="all", habillage="group",unselect=0.9,chrono=TRUE)
The function VIM aggr calculates and represents the number of missing entries in each variable and for certain combinations of variables (which tend to be missing simultaneously).
res<-summary(aggr(don, sortVar=TRUE))$combinations
##
## Variables sorted by number of missings:
## Variable Count
## Ne12 0.37500000
## T9 0.33035714
## T15 0.33035714
## Ne9 0.30357143
## T12 0.29464286
## Ne15 0.28571429
## Vx15 0.18750000
## Vx9 0.16071429
## maxO3 0.14285714
## maxO3v 0.10714286
## Vx12 0.08928571
head(res[rev(order(res[,2])),])
## Combinations Count Percent
## 1 0:0:0:0:0:0:0:0:0:0:0 13 11.607143
## 45 0:1:1:1:0:0:0:0:0:0:0 7 6.250000
## 10 0:0:0:0:0:1:0:0:0:0:0 5 4.464286
## 35 0:1:0:0:0:0:0:0:0:0:0 4 3.571429
## 41 0:1:0:0:1:1:1:0:0:0:0 3 2.678571
## 28 0:0:1:0:0:0:0:0:0:0:0 3 2.678571
The VIM function matrixplot creates a matrix plot in which all cells of a data matrix are visualized by rectangles. Available data is coded according to a continuous color scheme (gray scale), while missing/imputed data is visualized by a clearly distinguishable color (red). If you use Rstudio the plot is not interactive (there are the warnings), but if you use R directly, you can click on a column of your choice: the rows are sorted (decreasing order) of the values of this column. This is useful to check if there is an association between the value of a variable and the missingness of another one.
matrixplot(don, sortby = 2)
##
## Click in a column to sort by the corresponding variable.
## To regain use of the VIM GUI and the R console, click outside the plot region.
#Here the variable selected is variable 2.
The VIM function marginplot creates a scatterplot with additional information on the missing values. If you plot the variables (x,y), the points with no missing values are represented as in a standard scatterplot. The points for which x (resp. y) is missing are represented in red along the y (resp. x) axis. In addition, boxplots of the x and y variables are represented along the axes with and without missing values (in red all variables x where y is missing, in blue all variables x where y is observed).
marginplot(don[,c("T9","maxO3")])
# Visualize the pattern
library(VIM)
#aggr(Ecolo)
aggr(Ecolo,
only.miss = TRUE,
numbers = TRUE,
sortVar = TRUE)
##
## Variables sorted by number of missings:
## Variable Count
## Rmass 0.89013633
## LL 0.69967923
## Pmass 0.69847634
## Amass 0.69125902
## Nmass 0.17361668
## LMA 0.04971933
res <- summary(aggr(Ecolo, prop = TRUE, combined = TRUE))$combinations
#res[rev(order(res[,2])),]