1) Regression with NA (quantitative) for ozone

First of all you will need to install the following packages

install.packages("VIM")
install.packages("devtools")
library(devtools)
install_github("njtierney/naniar")

install.packages("missMDA")
install.packages("Amelia")
install.packages("mice")
install.packages("missForest")
install.packages("FactoMineR")
install.packages("tidyverse")

Air pollution is currently one of the most serious public health worries worldwide. Many epidemiological studies have proved the influence that some chemical compounds, such as sulphur dioxide (SO2), nitrogen dioxide (NO2), ozone (O3) can have on our health. Associations set up to monitor air quality are active all over the world to measure the concentration of these pollutants. They also keep a record of meteorological conditions such as temperature, cloud cover, wind, etc.

We have at our disposal 112 observations collected during the summer of 2001 in Rennes. The variables available are:

  • maxO3 (maximum daily ozone)
  • maxO3v (maximum daily ozone the previous day)
  • T12 (temperature at midday)
  • T9
  • T15 (Temp at 3pm)
  • Vx12 (projection of the wind speed vector on the east-west axis at midday)
  • Vx9 and Vx15 as well as the Nebulosity (cloud) Ne9, Ne12, Ne15

Here the final aim is to analyse the relationship between the maximum daily ozone (maxO3) level and the other meteorological variables. To do so we will perform regression to explain maxO3 in function of all the other variables. This data is incomplete - there are missing values. Indeed, it occurs frenquently to have machines that fail one day, leading to some information not recorded. We will therefore perform regression with missing values via multiple imputation.

  • Importing the data.
ozo <- read.table("ozoneNA.csv",
                  header = TRUE,
                  sep = ",",
                  row.names = 1)

WindDirection <- ozo[, 12]
don <- ozo[, 1:11]   #### keep the continuous variables

summary(don)
##      maxO3              T9             T12             T15       
##  Min.   : 42.00   Min.   :11.30   Min.   :14.30   Min.   :14.90  
##  1st Qu.: 71.00   1st Qu.:16.00   1st Qu.:18.60   1st Qu.:18.90  
##  Median : 81.50   Median :17.70   Median :20.40   Median :21.40  
##  Mean   : 91.24   Mean   :18.22   Mean   :21.46   Mean   :22.41  
##  3rd Qu.:108.25   3rd Qu.:19.90   3rd Qu.:23.60   3rd Qu.:25.65  
##  Max.   :166.00   Max.   :25.30   Max.   :33.50   Max.   :35.50  
##  NA's   :16       NA's   :37      NA's   :33      NA's   :37     
##       Ne9             Ne12            Ne15           Vx9         
##  Min.   :0.000   Min.   :0.000   Min.   :0.00   Min.   :-7.8785  
##  1st Qu.:3.000   1st Qu.:4.000   1st Qu.:3.00   1st Qu.:-3.0000  
##  Median :5.000   Median :5.000   Median :5.00   Median :-0.8671  
##  Mean   :4.987   Mean   :4.986   Mean   :4.60   Mean   :-1.0958  
##  3rd Qu.:7.000   3rd Qu.:7.000   3rd Qu.:6.25   3rd Qu.: 0.6919  
##  Max.   :8.000   Max.   :8.000   Max.   :8.00   Max.   : 5.1962  
##  NA's   :34      NA's   :42      NA's   :32     NA's   :18       
##       Vx12              Vx15            maxO3v      
##  Min.   :-7.8785   Min.   :-9.000   Min.   : 42.00  
##  1st Qu.:-3.6941   1st Qu.:-3.759   1st Qu.: 70.00  
##  Median :-1.9284   Median :-1.710   Median : 82.50  
##  Mean   :-1.6853   Mean   :-1.830   Mean   : 89.39  
##  3rd Qu.:-0.1302   3rd Qu.: 0.000   3rd Qu.:101.00  
##  Max.   : 6.5778   Max.   : 3.830   Max.   :166.00  
##  NA's   :10        NA's   :21       NA's   :12
head(don)
##          maxO3   T9  T12  T15 Ne9 Ne12 Ne15     Vx9    Vx12    Vx15 maxO3v
## 20010601    87 15.6 18.5   NA   4    4    8  0.6946 -1.7101 -0.6946     84
## 20010602    82   NA   NA   NA   5    5    7 -4.3301 -4.0000 -3.0000     87
## 20010603    92 15.3 17.6 19.5   2   NA   NA  2.9544      NA  0.5209     82
## 20010604   114 16.2 19.7   NA   1    1    0      NA  0.3473 -0.1736     92
## 20010605    94   NA 20.5 20.4  NA   NA   NA -0.5000 -2.9544 -4.3301    114
## 20010606    80 17.7 19.8 18.3   6   NA    7 -5.6382 -5.0000 -6.0000     94
dim(don)
## [1] 112  11
  • Load the libraries.
library(VIM)
library(FactoMineR)
library(missMDA)

1.1) Descriptive statistics, visualization with missing values

Q1 When could it be a good idea to delete rows or columns with missing values to work with a complete data set? Could you do it here?

dim(na.omit(don))
## [1] 13 11

Deleting rows or columns is possible as long as there is enough data left and the missing values are of the MCAR type so that the sample is a subsample of the original data. We will obtain unbiased estimators but with more variance. Deleting observations with missing data for ozone data leads to a table with 13 rows.

First, we perfom some descriptive statistics (how many missing? how many variables, individuals with missing?) and try to inspect and vizualize the pattern of missing entries and get hints on the mechanism. For this purpose, we use the R package naniar as well as Multiple Correspondence Analysis (FactoMineR package). An alternative would be to use VIM (Visualization and Imputation of Missing Values - Mathias Templ)

naniar provides principled, tidy ways to summarise, visualise, and manipulate missing data with minimal deviations from the workflows in ggplot2 and tidy data.

We can start off with some quick summaries of the amount of missing and complete data in don using:

  • pct_miss() To give us the percentage of missings in the data
  • n_miss() to give the number of missings,

and their complete equivalents:

  • pct_complete() To give us the percentage of completes in the data
  • n_complete() to give the number of complete values
library(naniar)
library(tidyverse)

pct_miss(don) # percentage of missing value in the data.
## [1] 23.7013
n_miss(don) # number of missing values in the 
## [1] 292
n_complete(don) # without missing value
## [1] 940
pct_complete(don) # without missing value
## [1] 76.2987

This is useful, but would be repetitive if you wanted to repeat this for every variable. We can instead look at summaries across the variables and cases

1.2.1) Tabulations and summaries

You can find the number and percentage missings in each variable and case using miss_case_summary and miss_var_summary.

miss_var_summary(don)
## # A tibble: 11 x 4
##    variable n_miss pct_miss n_miss_cumsum
##    <chr>     <int>    <dbl>         <int>
##  1 Ne12         42    37.5            199
##  2 T9           37    33.0             53
##  3 T15          37    33.0            123
##  4 Ne9          34    30.4            157
##  5 T12          33    29.5             86
##  6 Ne15         32    28.6            231
##  7 Vx15         21    18.8            280
##  8 Vx9          18    16.1            249
##  9 maxO3        16    14.3             16
## 10 maxO3v       12    10.7            292
## 11 Vx12         10     8.93           259
miss_case_summary(don)
## # A tibble: 112 x 4
##     case n_miss pct_miss n_miss_cumsum
##    <int>  <int>    <dbl>         <int>
##  1    18      8     72.7            44
##  2    56      8     72.7           167
##  3    25      7     63.6            69
##  4    47      7     63.6           138
##  5   110      7     63.6           286
##  6    24      6     54.5            62
##  7    31      6     54.5            91
##  8    45      6     54.5           131
##  9    22      5     45.5            55
## 10    33      5     45.5            97
## # ... with 102 more rows

Another way to summarise the missingness is to look at the number of times that there are a number of missings - to look at the tabulations using: miss_var_table and miss_case_table.

These show the number of times that there are a certain number of missings in a variable or a case, respectively:

miss_var_table(don)
## # A tibble: 10 x 3
##    n_miss_in_var n_vars pct_vars
##            <int>  <int>    <dbl>
##  1            10      1     9.09
##  2            12      1     9.09
##  3            16      1     9.09
##  4            18      1     9.09
##  5            21      1     9.09
##  6            32      1     9.09
##  7            33      1     9.09
##  8            34      1     9.09
##  9            37      2    18.2 
## 10            42      1     9.09
miss_case_table(don)
## # A tibble: 9 x 3
##   n_miss_in_case n_cases pct_cases
##            <int>   <int>     <dbl>
## 1              0      13     11.6 
## 2              1      24     21.4 
## 3              2      22     19.6 
## 4              3      19     17.0 
## 5              4      18     16.1 
## 6              5       8      7.14
## 7              6       3      2.68
## 8              7       3      2.68
## 9              8       2      1.79

This shows us there are two variables with exactly 37 missings in both, but that each individual variable seems to have unique numbers of missings. We also note that there are no variables with zero missings.

For the cases, we see that there are 13 cases with no missings, 24 with 1 missing, 22 with 2 missing, and so on.

1.2.2) Visualisation

A quick way to get a look at the missingness in the data is to use vis_miss.

This visualises the missingness across the entire dataset.

library(visdat)

vis_miss(don)

You can also apply clustering to find similar missingness groups by setting cluster = TRUE.

vis_miss(don, cluster = TRUE)