First of all, you will need to install the following packages:
install.packages("VIM")
install.packages("devtools")
library(devtools)
install_github("njtierney/naniar")
install.packages("missMDA")
install.packages("Amelia")
install.packages("mice")
install.packages("missForest")
install.packages("FactoMineR")
install.packages("tidyverse")
Air pollution is currently one of the most serious public health concerns worldwide. Many epidemiological studies have demonstrated the influence that chemical compounds such as sulphur dioxide (SO2), nitrogen dioxide (NO2) and ozone (O3) can have on our health. Associations set up to monitor air quality are active all over the world, measuring the concentration of these pollutants. They also keep a record of meteorological conditions such as temperature, cloud cover, wind, etc.
We have at our disposal 112 observations collected during the summer of 2001 in Rennes. The variables available are:

maxO3: maximum daily ozone concentration
T9, T12, T15: temperature at 9h, 12h and 15h
Ne9, Ne12, Ne15: cloudiness (nebulosity) at 9h, 12h and 15h
Vx9, Vx12, Vx15: projection of the wind speed vector on the east-west axis at 9h, 12h and 15h
maxO3v: maximum ozone concentration of the previous day
WindDirection: wind direction
Here the final aim is to analyse the relationship between the maximum daily ozone level (maxO3) and the other meteorological variables. To do so, we will perform a regression to explain maxO3 as a function of all the other variables. This data set is incomplete - there are missing values. Indeed, it happens frequently that a machine fails for a day, so that some information is not recorded. We will therefore perform regression with missing values via multiple imputation.
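As a preview of the multiple-imputation workflow, here is a minimal sketch with the mice package on a small synthetic data set (not the ozone data; the variables x1, x2 and y below are purely illustrative):

```r
library(mice)

# Build a toy data set with a known linear relationship
set.seed(1)
toy <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
toy$y <- toy$x1 + 0.5 * toy$x2 + rnorm(50)
toy$x1[sample(50, 10)] <- NA         # introduce some missing values in x1

imp  <- mice(toy, m = 5, printFlag = FALSE)  # create 5 imputed data sets
fits <- with(imp, lm(y ~ x1 + x2))           # fit the regression on each
summary(pool(fits))                          # pool the results (Rubin's rules)
```

The same three steps (impute, analyse, pool) will be applied to the ozone data later on.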
ozo <- read.table("ozoneNA.csv",
header = TRUE,
sep = ",",
row.names = 1)
WindDirection <- ozo[, 12]
don <- ozo[, 1:11] #### keep the continuous variables
summary(don)
## maxO3 T9 T12 T15
## Min. : 42.00 Min. :11.30 Min. :14.30 Min. :14.90
## 1st Qu.: 71.00 1st Qu.:16.00 1st Qu.:18.60 1st Qu.:18.90
## Median : 81.50 Median :17.70 Median :20.40 Median :21.40
## Mean : 91.24 Mean :18.22 Mean :21.46 Mean :22.41
## 3rd Qu.:108.25 3rd Qu.:19.90 3rd Qu.:23.60 3rd Qu.:25.65
## Max. :166.00 Max. :25.30 Max. :33.50 Max. :35.50
## NA's :16 NA's :37 NA's :33 NA's :37
## Ne9 Ne12 Ne15 Vx9
## Min. :0.000 Min. :0.000 Min. :0.00 Min. :-7.8785
## 1st Qu.:3.000 1st Qu.:4.000 1st Qu.:3.00 1st Qu.:-3.0000
## Median :5.000 Median :5.000 Median :5.00 Median :-0.8671
## Mean :4.987 Mean :4.986 Mean :4.60 Mean :-1.0958
## 3rd Qu.:7.000 3rd Qu.:7.000 3rd Qu.:6.25 3rd Qu.: 0.6919
## Max. :8.000 Max. :8.000 Max. :8.00 Max. : 5.1962
## NA's :34 NA's :42 NA's :32 NA's :18
## Vx12 Vx15 maxO3v
## Min. :-7.8785 Min. :-9.000 Min. : 42.00
## 1st Qu.:-3.6941 1st Qu.:-3.759 1st Qu.: 70.00
## Median :-1.9284 Median :-1.710 Median : 82.50
## Mean :-1.6853 Mean :-1.830 Mean : 89.39
## 3rd Qu.:-0.1302 3rd Qu.: 0.000 3rd Qu.:101.00
## Max. : 6.5778 Max. : 3.830 Max. :166.00
## NA's :10 NA's :21 NA's :12
head(don)
## maxO3 T9 T12 T15 Ne9 Ne12 Ne15 Vx9 Vx12 Vx15 maxO3v
## 20010601 87 15.6 18.5 NA 4 4 8 0.6946 -1.7101 -0.6946 84
## 20010602 82 NA NA NA 5 5 7 -4.3301 -4.0000 -3.0000 87
## 20010603 92 15.3 17.6 19.5 2 NA NA 2.9544 NA 0.5209 82
## 20010604 114 16.2 19.7 NA 1 1 0 NA 0.3473 -0.1736 92
## 20010605 94 NA 20.5 20.4 NA NA NA -0.5000 -2.9544 -4.3301 114
## 20010606 80 17.7 19.8 18.3 6 NA 7 -5.6382 -5.0000 -6.0000 94
dim(don)
## [1] 112 11
library(VIM)
library(FactoMineR)
library(missMDA)
Q1. When could it be a good idea to delete rows or columns with missing values in order to work with a complete data set? Could you do it here?
dim(na.omit(don))
## [1] 13 11
Deleting rows or columns is possible as long as there is enough data left and the missing values are of the MCAR (missing completely at random) type, so that the remaining sample is a random subsample of the original data. We then obtain unbiased estimators, but with a larger variance. Here, deleting the observations with missing data leads to a table with only 13 rows.
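A small simulation (on synthetic data, not the ozone set) illustrates why complete-case analysis is acceptable under MCAR: the estimator stays close to the true value, but it is computed from fewer observations, hence with more variance:

```r
set.seed(42)
n <- 1000
x <- rnorm(n)
y <- 2 * x + rnorm(n)            # true slope is 2

# Delete 30% of y completely at random (MCAR)
y_mcar <- y
y_mcar[sample(n, 0.3 * n)] <- NA

fit_full <- lm(y ~ x)
fit_cc   <- lm(y_mcar ~ x)       # lm() drops incomplete rows by default

coef(fit_full)["x"]              # close to the true slope of 2
coef(fit_cc)["x"]                # also close to 2, but estimated from 700 rows
```

Under MAR or MNAR mechanisms, by contrast, complete-case analysis can be badly biased.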
First, we perform some descriptive statistics (how many missing values? how many variables and individuals with missing values?) and try to inspect and visualize the pattern of missing entries, to get hints on the mechanism. For this purpose, we use the R package naniar as well as Multiple Correspondence Analysis (FactoMineR package). An alternative would be to use VIM (Visualization and Imputation of Missing Values, by Matthias Templ).
naniar provides principled, tidy ways to summarise, visualise, and manipulate missing data, with minimal deviations from the workflows in ggplot2 and tidy data.

We can start off with some quick summaries of the amount of missing and complete data in don using:

pct_miss() to give us the percentage of missings in the data,
n_miss() to give the number of missings,

and their complete equivalents:

pct_complete() to give us the percentage of completes in the data,
n_complete() to give the number of complete values.

library(naniar)
library(tidyverse)
pct_miss(don) # percentage of missing values in the data
## [1] 23.7013
n_miss(don) # number of missing values in the data
## [1] 292
n_complete(don) # number of complete (non-missing) values
## [1] 940
pct_complete(don) # percentage of complete values in the data
## [1] 76.2987
This is useful, but would be repetitive if you wanted to do it for every variable. We can instead look at summaries across the variables and cases. You can find the number and percentage of missings in each variable and each case using miss_var_summary() and miss_case_summary().
miss_var_summary(don)
## # A tibble: 11 x 4
## variable n_miss pct_miss n_miss_cumsum
## <chr> <int> <dbl> <int>
## 1 Ne12 42 37.5 199
## 2 T9 37 33.0 53
## 3 T15 37 33.0 123
## 4 Ne9 34 30.4 157
## 5 T12 33 29.5 86
## 6 Ne15 32 28.6 231
## 7 Vx15 21 18.8 280
## 8 Vx9 18 16.1 249
## 9 maxO3 16 14.3 16
## 10 maxO3v 12 10.7 292
## 11 Vx12 10 8.93 259
miss_case_summary(don)
## # A tibble: 112 x 4
## case n_miss pct_miss n_miss_cumsum
## <int> <int> <dbl> <int>
## 1 18 8 72.7 44
## 2 56 8 72.7 167
## 3 25 7 63.6 69
## 4 47 7 63.6 138
## 5 110 7 63.6 286
## 6 24 6 54.5 62
## 7 31 6 54.5 91
## 8 45 6 54.5 131
## 9 22 5 45.5 55
## 10 33 5 45.5 97
## # ... with 102 more rows
Another way to summarise the missingness is to look at how often a given number of missings occurs - that is, to look at tabulations using miss_var_table() and miss_case_table(). These show the number of times that there is a certain number of missings in a variable or in a case, respectively:
miss_var_table(don)
## # A tibble: 10 x 3
## n_miss_in_var n_vars pct_vars
## <int> <int> <dbl>
## 1 10 1 9.09
## 2 12 1 9.09
## 3 16 1 9.09
## 4 18 1 9.09
## 5 21 1 9.09
## 6 32 1 9.09
## 7 33 1 9.09
## 8 34 1 9.09
## 9 37 2 18.2
## 10 42 1 9.09
miss_case_table(don)
## # A tibble: 9 x 3
## n_miss_in_case n_cases pct_cases
## <int> <int> <dbl>
## 1 0 13 11.6
## 2 1 24 21.4
## 3 2 22 19.6
## 4 3 19 17.0
## 5 4 18 16.1
## 6 5 8 7.14
## 7 6 3 2.68
## 8 7 3 2.68
## 9 8 2 1.79
This shows us that there are two variables with exactly 37 missings (T9 and T15), but that otherwise each variable has its own number of missings. We also note that there are no variables with zero missings.
For the cases, we see that there are 13 cases with no missings, 24 with 1 missing, 22 with 2 missings, and so on.
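These tabulations can be cross-checked against the totals computed earlier: summing (number of missings per case) times (number of cases with that count) over the rows of miss_case_table(don) must recover the 292 missing values, and the case counts must sum to the 112 observations. A quick sketch, using the counts copied from the table above:

```r
# Counts taken from the miss_case_table(don) output above
n_miss_in_case <- c(0, 1, 2, 3, 4, 5, 6, 7, 8)
n_cases        <- c(13, 24, 22, 19, 18, 8, 3, 3, 2)

sum(n_miss_in_case * n_cases)  # 292, matching n_miss(don)
sum(n_cases)                   # 112, the number of observations
```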
A quick way to get a look at the missingness in the data is to use vis_miss(). This visualises the missingness across the entire dataset.
library(visdat)
vis_miss(don)
You can also apply clustering to find groups of observations with similar missingness patterns by setting cluster = TRUE.
vis_miss(don, cluster = TRUE)