Imagine you want to fit a linear regression, but your covariate matrix contains many missing values: how can you deal with all these “NA”s? Such missingness may be caused by survey non-response, lost records or machine failures. Classical estimation routines simply discard every row that contains a missing value (complete-case analysis), and as a result much information is lost.
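As a quick illustration of this default behaviour, here is a minimal sketch in R (the small data frame and variable names are made up for the example): `lm()` silently drops every incomplete row.

```r
# Minimal sketch: lm() silently discards every row containing an NA
# (complete-case analysis), so part of the sample is never used.
dat <- data.frame(y  = c(4.2, 5.1, 3.7, 6.0, 5.5),
                  x1 = c(1.0,  NA, 0.8, 1.4, 1.1),
                  x2 = c(0.3, 0.7,  NA, 0.9, 0.6))
fit <- lm(y ~ x1 + x2, data = dat)   # default na.action = na.omit
nobs(fit)                            # 3 observations used instead of 5
```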
A more reasonable way to handle this problem is to modify the estimation process so that it can be applied to incomplete data. For example, one can use the Expectation-Maximization (EM) algorithm (Dempster, Laird, and Rubin 1977) to obtain the maximum likelihood estimate (MLE) despite the missing values. This strategy is valid under missing at random (MAR) mechanisms (Little and Rubin 2002; Seaman et al. 2013), in which the missingness is independent of the missing values, given the observed data. Even though this approach is perfectly suited to specific inference problems with missing values, few solutions or implementations are available, even for simple models. In the following, we demonstrate how to estimate parameters with the EM algorithm in a linear regression model and in a logistic regression model.
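As a first sketch of this strategy for a Gaussian sample, the `norm` package is one possible off-the-shelf implementation of the EM algorithm for an incomplete multivariate normal matrix (the toy data below are made up for illustration):

```r
# Sketch: MLE of (mu, Sigma) for an incomplete Gaussian sample via EM,
# using the norm package (one possible implementation among others).
library(MASS)   # mvrnorm, for the toy data
library(norm)   # prelim.norm, em.norm, getparam.norm

set.seed(1)
X <- mvrnorm(200, mu = c(0, 2), Sigma = matrix(c(1, 0.5, 0.5, 1), 2))
X[sample(length(X), 40)] <- NA             # missing values, completely at random

s         <- prelim.norm(X)                # summarize the missingness pattern
theta_hat <- em.norm(s, showits = FALSE)   # run the EM iterations
getparam.norm(s, theta_hat)                # MLE of the mean vector and covariance
```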
Another popular choice is multiple imputation (Buuren and Groothuis-Oudshoorn 2011): a classical regression procedure is applied to each imputed dataset, and the resulting estimates are then aggregated with Rubin’s combining rules. In the following, we compare this method to EM and illustrate the bias and variance of the estimators on a simulated dataset.
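For completeness, here is a minimal multiple-imputation sketch with the `mice` package; the toy data and the number of imputations `m = 5` are arbitrary choices for illustration. Each dataset is imputed, analysed with `lm()`, and the results are pooled with Rubin’s rules.

```r
# Sketch: multiple imputation with mice, analysis with lm() on each imputed
# dataset, pooling with Rubin's combining rules.
library(MASS)   # mvrnorm, for the toy data
library(mice)

set.seed(2)
dat <- as.data.frame(mvrnorm(200, mu = c(0, 1, 2), Sigma = diag(3) + 0.5))
names(dat) <- c("y", "x1", "x2")
dat$x1[sample(200, 30)] <- NA              # missing values in the covariates
dat$x2[sample(200, 30)] <- NA

imp  <- mice(dat, m = 5, printFlag = FALSE, seed = 2)  # m imputed datasets
fits <- with(imp, lm(y ~ x1 + x2))                     # fit on each of them
summary(pool(fits))                                    # Rubin's rules
```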
Consider the following linear regression model, which covers numerous practical settings even though it is restricted to a joint multivariate normal assumption:
\[ y = \tilde{X}^\top \boldsymbol{\beta} + \varepsilon\,, \] where \(\tilde{X} = (1, X^\top)^\top\) with random variable \(X = (x_1,...,x_p)^\top \sim\mathcal{N}_p(\mu_X,\Sigma_X)\) independent of the noise \(\varepsilon\sim\mathcal{N}(0,\sigma^2)\). Let \(n\) i.i.d. observations \((y_i, X_i^\top)^\top,\,i=1,...,n\) be available but only partly observed. We aim to estimate \(\mathbb{E}[y\mid X] = \tilde{X}^\top\boldsymbol{\beta}\).
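To make the setting concrete, here is a sketch that simulates such partly observed data; the dimension \(p = 2\), the parameter values and the 30% missingness rate are arbitrary choices, and the missing values are inserted completely at random in \(X\).

```r
# Sketch: simulate n partly observed i.i.d. copies of (y, X) from the model
# above; the parameter values and the dimension p = 2 are arbitrary choices.
library(MASS)   # mvrnorm

set.seed(3)
n       <- 500
beta    <- c(1, 2, -1)                         # (intercept, beta_1, beta_2)
mu_X    <- c(1, 1)
Sigma_X <- matrix(c(1, 0.5, 0.5, 1), 2)
sigma   <- 0.5

X <- mvrnorm(n, mu = mu_X, Sigma = Sigma_X)
y <- cbind(1, X) %*% beta + rnorm(n, sd = sigma)

X_na <- X
X_na[sample(length(X_na), round(0.3 * length(X_na)))] <- NA   # 30% MCAR in X

dat_miss <- cbind(y, X_na)                     # incomplete data matrix (y, X)
colnames(dat_miss) <- c("y", "x1", "x2")
```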
Since \(X\) is Gaussian and independent of the Gaussian noise \(\varepsilon\), the pair \((y,X)\) is jointly Gaussian: \[ (y,X)\sim\mathcal{N}(\mu_{y,X},\Sigma_{y,X}) \quad \text{with}\quad \mu_{y,X} = \begin{pmatrix}\mu_y \\ \mu_X\end{pmatrix} \quad \text{and} \quad \Sigma_{y,X} = \begin{pmatrix}\Sigma_{y} & \Sigma_{y,X} \\ \Sigma_{X,y} & \Sigma_{X}\end{pmatrix}\,. \] Let \(\theta = (\mu_{y,X}, \Sigma_{y,X})\) denote the set of parameters. \(\theta\) can be estimated with the EM algorithm (Dempster, Laird, and Rubin 1977), and this parametrization allows us to write \[ \mathbb{E}[y\mid X] = (\mu_y - \Sigma_{y,X}\Sigma_{X}^{-1}\mu_X) + \Sigma_{y,X}\Sigma_{X}^{-1}X\,, \] so that \(\boldsymbol{\beta}\) can be estimated by plug-in from \[ \boldsymbol{\beta} = (\mu_y - \Sigma_{y,X}\Sigma_{X}^{-1}\mu_X, \Sigma_{y,X}\Sigma_{X}^{-1})^\top\,. \] In the same way, the variances of the estimates can be obtained (via the Gram matrix \(\mathbb{E}[\tilde{X}\tilde{X}^\top]\)) as \[ \mathbb{V}[\boldsymbol{\beta}] = \text{diag}(C) \quad \text{with} \quad C = \bigl(\Sigma_{y} - \Sigma_{y,X}\Sigma_{X}^{-1}\Sigma_{X,y}\bigr)\bigl((\boldsymbol{0}_{p+1},(\boldsymbol{0}_p,\Sigma_{X})^\top)^\top + (1,\mu_X^\top)^\top(1,\mu_X^\top)\bigr)^{-1} / n \,, \] where \(\Sigma_{y} - \Sigma_{y,X}\Sigma_{X}^{-1}\Sigma_{X,y} = \sigma^2\) is the residual variance and the second factor is the inverse of \(\mathbb{E}[\tilde{X}\tilde{X}^\top]\).
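Putting these formulas into practice, here is a sketch of the plug-in step, continuing from the incomplete matrix `dat_miss` simulated above and again using the `norm` package (one possible EM implementation) to estimate \(\theta\); it simply translates the expressions for \(\boldsymbol{\beta}\) and \(\text{diag}(C)\) into R.

```r
# Sketch: estimate theta = (mu, Sigma) by EM on the incomplete matrix
# `dat_miss` (first column y, simulated above), then plug the estimates
# into the formulas for beta and its variance.
library(norm)

s         <- prelim.norm(dat_miss)          # summarize the missingness pattern
theta_hat <- em.norm(s, showits = FALSE)    # EM iterations
param     <- getparam.norm(s, theta_hat)    # list with $mu and $sigma

mu_hat    <- param$mu                       # (mu_y, mu_X)
Sigma_hat <- param$sigma                    # joint covariance of (y, X)

mu_y_hat <- mu_hat[1];        mu_X_hat <- mu_hat[-1]
S_y      <- Sigma_hat[1, 1];  S_yX     <- Sigma_hat[1, -1, drop = FALSE]
S_X      <- Sigma_hat[-1, -1]

slope     <- drop(S_yX %*% solve(S_X))      # Sigma_{y,X} Sigma_X^{-1}
intercept <- mu_y_hat - sum(slope * mu_X_hat)
beta_hat  <- c(intercept, slope)

# Residual variance and Gram matrix E[X~ X~'] = [[1, mu_X'], [mu_X, Sigma_X + mu_X mu_X']]
sigma2_hat <- drop(S_y - S_yX %*% solve(S_X) %*% t(S_yX))
gram       <- rbind(c(1, mu_X_hat),
                    cbind(mu_X_hat, S_X + tcrossprod(mu_X_hat)))
V_beta <- diag(sigma2_hat * solve(gram) / nrow(dat_miss))

cbind(estimate = beta_hat, std.error = sqrt(V_beta))
```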