1. Let $X, Y$ and $Z$ be random variables and let $\operatorname{cov}(A, B)$ denote the covariance of any two random variables $A$ and $B$. Show that
    (a) $\operatorname{cov}(a X, Y)=a \operatorname{cov}(X, Y)$
    (b) $\operatorname{cov}(X, Y+Z)=\operatorname{cov}(X, Y)+\operatorname{cov}(X, Z)$
    (c) If $X_1, \ldots, X_p$ is a set of random variables then show that$$\operatorname{cov}\left(\sum_{i=1}^{p} α_{i} X_{i}, \sum_{j=1}^{p} β_{j} X_{j}\right)=\sum_{i=1}^{p} \sum_{j=1}^{p} α_{i} β_{j} \operatorname{cov}\left(X_{i}, X_{j}\right)$$Proof.
    (a) $\operatorname{cov}\left(aX, Y\right)=E\left[(aX-a\bar X)(Y-\bar Y)\right]=a\,E\left[(X-\bar X)(Y-\bar Y)\right]=a\operatorname{cov}(X, Y)$
    (b) $\operatorname{cov}\left(X, Y+Z\right)=E\left[(X-\bar X)(Y+Z-\bar Y-\bar Z)\right]=E\left[(X-\bar X)(Y-\bar Y)\right]+E\left[(X-\bar X)(Z-\bar Z)\right]=\operatorname{cov}(X, Y)+\operatorname{cov}(X, Z)$
    (c)\begin{align*}\operatorname{cov}\left(\sum_{i=1}^{p} α_{i} X_{i}, \sum_{j=1}^{p} β_{j} X_{j}\right)&\xlongequal{(b)}\sum_{i=1}^{p}\sum_{j=1}^{p}\operatorname{cov}\left(α_{i} X_{i}, β_{j} X_{j}\right)\\
    &\xlongequal{(a)}\sum_{i=1}^{p} \sum_{j=1}^{p} α_{i} β_{j} \operatorname{cov}\left(X_{i}, X_{j}\right)
    \end{align*}
  2. Suppose $X=\left(X_{1}, \ldots, X_{p}\right)^⊺$ is a $p$-vector of random variables with covariance matrix $\Sigma$ where\begin{aligned}\operatorname{var}\left(X_{i}\right) &=\boldsymbol{\Sigma}_{i i} & & \text { for } i \in 1, \ldots, p \\\operatorname{cov}\left(X_{i}, X_{j}\right) &=\boldsymbol{\Sigma}_{i j} & & \text { for } i \neq j \in 1, \ldots, p\end{aligned}Define new random variables $Z$ and $W$ to be linear combinations of $X=\left(X_{1}, \ldots, X_{p}\right)$ such that\begin{aligned}&Z=α^⊺ X=α_{1} X_{1}+α_{2} X_{2}+\ldots+α_{p} X_{p} \\&W=β^⊺ X=β_{1} X_{1}+β_{2} X_{2}+\ldots+β_{p} X_{p}\end{aligned}where\begin{aligned}α &=\left(α_{1}, α_{2}, \ldots, α_{p}\right)^⊺ \\β &=\left(β_{1}, β_{2}, \ldots, β_{p}\right)^⊺\end{aligned}then show that
    (a) $\operatorname{var}(Z)=α^⊺ \boldsymbol{\Sigma} α$
    (b) $\operatorname{cov}(Z, W)=α^⊺ \boldsymbol\Sigma β$
    Solution.
    (a) $\operatorname{var}Z=\operatorname{var}(α_{1} X_{1}+α_{2} X_{2}+\ldots+α_{p} X_{p})=α^2_1\operatorname{var}X_1+α^2_2\operatorname{var}X_2+⋯+α^2_p\operatorname{var}X_p+\sum\limits_{1≤i≠j≤p}α_iα_j\operatorname{cov}(X_i,X_j)=α^⊺ \boldsymbol{\Sigma} α$
    (b) $\operatorname{cov}(Z, W)=\operatorname{cov}(α_{1} X_{1}+α_{2} X_{2}+\ldots+α_{p} X_{p},β_{1} X_{1}+β_{2} X_{2}+\ldots+β_{p} X_{p})=\sum_{i=1}^{p} \sum_{j=1}^{p}α_iβ_j\operatorname{cov}(X_i,X_j)=α^⊺ \boldsymbol\Sigma β$
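    As a quick numerical sanity check (not required by the question), the following sketch simulates a large sample and compares the empirical $\operatorname{var}(Z)$ and $\operatorname{cov}(Z,W)$ with $α^⊺\boldsymbol\Sigma α$ and $α^⊺\boldsymbol\Sigma β$. It assumes Python with numpy, and the values of $α$, $β$ and $\boldsymbol\Sigma$ are made up for illustration:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    p, n = 3, 200_000

    # made-up positive-definite covariance matrix and weight vectors
    A = rng.normal(size=(p, p))
    Sigma = A @ A.T + p * np.eye(p)
    alpha = np.array([1.0, -2.0, 0.5])
    beta = np.array([0.3, 1.0, -1.0])

    # simulate X ~ N_p(0, Sigma) and form Z = alpha'X, W = beta'X
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    Z, W = X @ alpha, X @ beta

    print(np.var(Z, ddof=1), alpha @ Sigma @ alpha)    # var(Z)   vs  alpha' Sigma alpha
    print(np.cov(Z, W)[0, 1], alpha @ Sigma @ beta)    # cov(Z,W) vs  alpha' Sigma beta
    ```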
3. If $X=\left(X_{1}, \ldots, X_{p}\right)^⊺$ is a $p$-dimensional random column vector such that $X \sim N_{p}(\mu, \boldsymbol{\Sigma})$ and $\boldsymbol{B}$ is an $m \times p$ matrix then
    (i) Show that the $m$-dimensional random column vector $Y=\boldsymbol{B} X$ has covariance matrix $\operatorname{cov}(Y)=\boldsymbol{B} \boldsymbol{\Sigma} \boldsymbol{B}^⊺$.
    (ii) If $m=p$ how can we choose the matrix $\boldsymbol{B}$ so that the transformed variable $\boldsymbol{B} X$ has a covariance matrix that is the identity matrix?
    Solution.
    (i) $Y=\boldsymbol{B} X⇒Y_i=∑_{k=1}^p\boldsymbol B_{ik}X_k$. By Q2(b), $\operatorname{cov}(Y_i,Y_j)=∑_{k=1}^p∑_{l=1}^p\boldsymbol B_{ik}\boldsymbol\Sigma_{kl}\boldsymbol B_{jl}=\left(\boldsymbol{B} \boldsymbol{\Sigma} \boldsymbol{B}^⊺\right)_{ij}⇒\operatorname{cov}(Y)=\boldsymbol{B} \boldsymbol{\Sigma} \boldsymbol{B}^⊺$
    (ii) By the spectral theorem for real symmetric matrices we can find an orthogonal matrix $\boldsymbol M$ such that $\boldsymbol{M}\boldsymbol{\Sigma} \boldsymbol{M}^⊺=\operatorname{diag}(λ_1,⋯,λ_p)$. The eigenvalues $λ_1,⋯,λ_p$ of the covariance matrix $\boldsymbol\Sigma$ are non-negative; provided they are all strictly positive (i.e. $\boldsymbol\Sigma$ is positive definite), setting $\boldsymbol A=\operatorname{diag}(λ_1^{-\frac12},⋯,λ_p^{-\frac12})$ gives $\boldsymbol A\operatorname{diag}(λ_1,⋯,λ_p)\boldsymbol A^⊺=I$. Therefore $\boldsymbol B=\boldsymbol{AM}$ satisfies $\boldsymbol{B} \boldsymbol{\Sigma} \boldsymbol{B}^⊺=\boldsymbol A\boldsymbol M\boldsymbol\Sigma\boldsymbol M^⊺\boldsymbol A^⊺=I$.
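    To make this construction concrete, here is a minimal sketch (Python with numpy, which the original does not specify; the $\boldsymbol\Sigma$ below is made up for illustration) that builds $\boldsymbol B=\boldsymbol{AM}$ from the eigendecomposition of $\boldsymbol\Sigma$ and checks that $\boldsymbol B\boldsymbol\Sigma\boldsymbol B^⊺=I$:

    ```python
    import numpy as np

    # a made-up positive-definite covariance matrix, purely for illustration
    Sigma = np.array([[4.0, 1.0, 0.5],
                      [1.0, 3.0, 0.2],
                      [0.5, 0.2, 2.0]])

    # spectral decomposition: Sigma = V diag(lam) V', so M = V' gives M Sigma M' = diag(lam)
    lam, V = np.linalg.eigh(Sigma)
    M = V.T
    A = np.diag(lam ** -0.5)       # A = diag(lambda_i^{-1/2}); requires lam > 0
    B = A @ M

    print(np.allclose(B @ Sigma @ B.T, np.eye(3)))   # True: B Sigma B' = I
    ```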
  4. Let $X=\left(X_{1}, X_{2}\right)^⊺$ be a 2-dimensional random column vector such that $X \sim N_{p}(\mu,\boldsymbol\Sigma)$ with $\mu=(0,0)^⊺$, $\boldsymbol{\Sigma}_{11}=\boldsymbol{\Sigma}_{22}=1$ and $\boldsymbol{\Sigma}_{12}=\boldsymbol{\Sigma}_{21}=\rho$.
    (i) Show that pdf of $X$ is given by$$f(\mathbf{x})=\frac{1}{2 \pi \sqrt{1-\rho^{2}}} \exp \left(-\frac{x_{1}^{2}-2 \rho x_{1} x_{2}+x_{2}^{2}}{2\left(1-\rho^{2}\right)}\right) \quad x \in \mathbb{R}^{2}$$(ii) Describe the shape of the pdf of $X$ at any fixed value of the first random variable $X_{1}=a \in \mathbb{R}$. (profile likelihood – wikipedia)
    Solution.
    (i) $|\boldsymbol\Sigma|=\left|\begin{smallmatrix}1&ρ\\ρ&1\end{smallmatrix}\right|=1-ρ^2,\ \boldsymbol\Sigma^{-1}=\frac1{1-ρ^2}\left(\begin{smallmatrix}1&-ρ\\-ρ&1\end{smallmatrix}\right),\ f({\bf x})=\frac{1}{2π|\boldsymbol\Sigma|^{1 / 2}} \exp \left(-\frac{1}{2}{\bf x}^⊺\boldsymbol{\Sigma}^{-1}{\bf x}\right)=\frac{1}{2 \pi \sqrt{1-\rho^{2}}} \exp \left(-\frac{x_{1}^{2}-2 \rho x_{1} x_{2}+x_{2}^{2}}{2\left(1-\rho^{2}\right)}\right)$
    (ii) $f(a,x_2)=\frac{1}{2 \pi \sqrt{1-\rho^{2}}} \exp \left(-\frac{(x_2-ρa)^2+(1-ρ^2)a^2}{2\left(1-\rho^{2}\right)}\right)=e^{-a^2/2}\frac{1}{2 \pi \sqrt{1-\rho^{2}}} \exp \left(-\frac{(x_2-ρa)^2}{2\left(1-\rho^{2}\right)}\right)$
    $⇒f(x_2\mid X_1=a)={f(a,x_2)\over∫_{-∞}^∞f(a,x_2)\,\mathrm dx_2}=\frac{1}{\sqrt{2\pi\left(1-\rho^{2}\right)}} \exp \left(-\frac{(x_2-ρa)^2}{2\left(1-\rho^{2}\right)}\right)⇒X_2\mid X_1=a∼N(ρa,1-ρ^2)$. So at any fixed $X_1=a$ the pdf, viewed as a function of $x_2$, has the shape of a $N(ρa,\,1-ρ^2)$ density, scaled by the constant $e^{-a^2/2}/\sqrt{2π}$.
    Plot of the conditional density when $a=1,ρ=0.35$ (see the sketch below).
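    A minimal sketch to produce this plot, assuming Python with scipy and matplotlib (neither is specified in the original):

    ```python
    import numpy as np
    from scipy.stats import norm
    import matplotlib.pyplot as plt

    a, rho = 1.0, 0.35
    x2 = np.linspace(-4, 4, 400)

    # conditional density of X2 given X1 = a is N(rho*a, 1 - rho^2)
    pdf = norm.pdf(x2, loc=rho * a, scale=np.sqrt(1 - rho ** 2))

    plt.plot(x2, pdf)
    plt.xlabel(r"$x_2$")
    plt.ylabel(r"$f(x_2 | X_1 = 1)$")
    plt.title(r"$N(\rho a, 1-\rho^2)$ with $a=1$, $\rho=0.35$")
    plt.show()
    ```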
  5. Let $X=\left(X_{1}, X_{2}\right)^⊺$ be a 2-dimensional random column vector such that $X \sim N_{p}(\mu, \boldsymbol{\Sigma})$ with $\mu=(120,80)^⊺$, $\boldsymbol{\Sigma}_{11}=25, \boldsymbol{\Sigma}_{22}=16$ and $\boldsymbol{\Sigma}_{12}=\boldsymbol{\Sigma}_{21}=12$. Define the new random variable $Y=2 X_1-3 X_2$.
    (i) Calculate $P(Y>20)$
    (ii) If $\boldsymbol\Sigma_{21}=0$ what is $P(Y>20)$?
    Hint: you may assume the result stated in the notes that if $X=\left(X_1, \ldots, X_p\right)^⊺$ is a $p$-dimensional random column vector such that $X∼N_p(\mu, \boldsymbol{\Sigma})$ and $\boldsymbol{B}$ is an $m \times p$ matrix then $Y=\boldsymbol{B}X \sim N_m\left(\boldsymbol{B} \mu, \boldsymbol{B} \boldsymbol{\Sigma} \boldsymbol{B}^⊺\right)$
    Solution.
    (i) $E(Y)=2×120-3×80=0,\operatorname{var}Y=2^2×{\boldsymbol\Sigma}_{11}+(-3)^2×{\boldsymbol\Sigma}_{22}+2×(-3)×({\boldsymbol\Sigma}_{12}+{\boldsymbol\Sigma}_{21})=100⇒Y∼N(0,100)⇒P(Y>20)=1-Φ\left(20\over\sqrt{100}\right)≈0.0227501$
    (ii) $\operatorname{var}Y=2^2×{\boldsymbol\Sigma}_{11}+(-3)^2×{\boldsymbol\Sigma}_{22}=244⇒Y∼N(0,244)⇒P(Y>20)=1-Φ\left(20\over\sqrt{244}\right)≈0.100208$
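    Both probabilities can be verified numerically; a quick sketch assuming Python with scipy (not part of the required working):

    ```python
    from scipy.stats import norm

    # (i)  Y ~ N(0, 100):  P(Y > 20) = 1 - Phi(20 / 10)
    print(norm.sf(20, loc=0, scale=10))             # ~0.0227501

    # (ii) with the covariance term zero:  Y ~ N(0, 244)
    print(norm.sf(20, loc=0, scale=244 ** 0.5))     # ~0.100208
    ```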
  6. Let $x_{1}, \ldots, x_{n}$ be iid realizations of a $p$-dimensional random column vector $X=\left(X_{1}, \ldots, X_{p}\right)^⊺$ such that $X \sim N_{p}(\mu, \boldsymbol{\Sigma})$.
    (i) Prove that the maximum likelihood estimator of $\mu$ is$$\hat{\mu}=\frac{1}{n} \sum_{i=1}^{n} x_{i}$$(ii) Show that the log-likelihood can be expressed as$$\ell(\mu, \boldsymbol{\Sigma})=\frac{n}{2} \log \left|\boldsymbol{\Sigma}^{-1}\right|-\frac{1}{2} \operatorname{tr}\left(\boldsymbol{\Sigma}^{-1} \sum_{i=1}^{n}\left(x_{i}-\mu\right)\left(x_{i}-\mu\right)^⊺\right)$$(iii) Using Hints (c) and (d) prove that the maximum likelihood estimator of $\boldsymbol{\Sigma}$ is$$\hat{\boldsymbol{\Sigma}}=\frac{1}{n} \sum_{i=1}^{n}\left(x_{i}-\hat{\mu}\right)\left(x_{i}-\hat{\mu}\right)^⊺$$(iv) Show that the sample covariance $\boldsymbol{S}$ is unbiased for $\boldsymbol{\Sigma}$ where$$\boldsymbol{S}=\frac{1}{n-1} \sum_{i=1}^{n}\left(x_{i}-\hat{\mu}\right)\left(x_{i}-\hat{\mu}\right)^⊺$$Hint You may find it helpful to use the following results
    (a) If $a$ and $x$ are $p$-column vectors, $\boldsymbol{B}$ is a $p \times p$ symmetric matrix and $y$ and $z$ are scalars such that $y=a^⊺ x$ and $z=x^⊺ \boldsymbol{B} x$ then$$\nabla y=a \text { and } \nabla z=2 \boldsymbol{B} x$$(b) If $\boldsymbol{C}$ is a $n \times m$ matrix and $\boldsymbol{D}$ is $m \times n$ matrix then $\operatorname{tr}[\boldsymbol{C} \boldsymbol{D}]=\operatorname{tr}[\boldsymbol{D} \boldsymbol{C}]$
    (c) The matrix of partial derivatives of a scalar function $y$ of an $n \times n$ matrix $\boldsymbol{X}$ of independent variables, with respect to the matrix $\boldsymbol X$, is defined as$$\nabla_{\boldsymbol{X}} y=\left[\begin{array}{cccc}\frac{\partial y}{\partial x_{11}} & \frac{\partial y}{\partial x_{12}} & \cdots & \frac{\partial y}{\partial x_{1 n}} \\\frac{\partial y}{\partial x_{21}} & \frac{\partial y}{\partial x_{22}} & \cdots & \frac{\partial y}{\partial x_{2 n}} \\\vdots & \vdots & \ddots & \vdots \\\frac{\partial y}{\partial x_{n 1}} & \frac{\partial y}{\partial x_{n 2}} & \cdots & \frac{\partial y}{\partial x_{n n}}\end{array}\right]$$If $\boldsymbol{E}$ and $\boldsymbol{F}$ are $p \times p$ matrices then\begin{aligned}\nabla_{\boldsymbol{E}} \log\left|\boldsymbol{E}\right|&=\left(\boldsymbol{E}^{-1}\right)^⊺ \\\nabla_{\boldsymbol{E}} \operatorname{tr}[\boldsymbol{E} \boldsymbol{F}] &=\boldsymbol{F}^⊺\end{aligned}(d) If $L(\theta)$ is a likelihood function with Maximum Likelihood Estimator (MLE) $\hat{\theta}$ and $g(\theta)$ is a function of $\theta$ then the MLE of $g(\theta)$ is $g(\hat{\theta})$.
    [This is known as the invariance property of MLEs (Wikipedia | stats.stackexchange.com) and it is covered formally in the Part A Statistics course, but it is straightforward to prove, and could be covered in tutorials if you ask your tutor nicely. This property may help you when choosing what to differentiate the log-likelihood in (ii) with respect to when deriving the maximum likelihood estimator of $\boldsymbol{\Sigma}$.]

    Solution.
    (i) $f(x)=\frac{1}{(2π)^{p/2}|\boldsymbol\Sigma|^{1 / 2}} \exp \left(-\frac{1}{2}(x-μ)^⊺\boldsymbol{\Sigma}^{-1}(x-μ)\right)$
    $\ell(μ)=∑_{i=1}^n-\frac12(x_i-μ)^⊺\boldsymbol{\Sigma}^{-1}(x_i-μ)+$constant
    From (a) we get $\ell'(μ)=∑_{i=1}^n\boldsymbol{\Sigma}^{-1}(x_i-μ)=\boldsymbol{\Sigma}^{-1}∑_{i=1}^n(x_i-μ)$
    $\ell'(\widehatμ)=0⇒\hat{\mu}=\frac{1}{n} \sum_{i=1}^{n} x_{i}$
    (ii) Notice each quadratic form $(x_i-μ)^⊺\boldsymbol{\Sigma}^{-1}(x_i-μ)$ is a scalar, hence it equals its trace.\begin{align*}\ell(\mu, \boldsymbol{\Sigma})&=\frac{n}{2} \log \left|\boldsymbol{\Sigma}^{-1}\right|-\frac12∑_{i=1}^n(x_i-μ)^⊺\boldsymbol{\Sigma}^{-1}(x_i-μ)
    \\&=\frac{n}{2} \log \left|\boldsymbol{\Sigma}^{-1}\right|-\frac{1}{2} \operatorname{tr}\left(\sum_{i=1}^{n}\left(x_{i}-\mu\right)^⊺\boldsymbol{\Sigma}^{-1}\left(x_{i}-\mu\right)\right)
    \\&=\frac{n}{2} \log \left|\boldsymbol{\Sigma}^{-1}\right|-\frac{1}{2} \sum_{i=1}^{n}\operatorname{tr}\left(\left(x_{i}-\mu\right)^⊺\boldsymbol{\Sigma}^{-1}\left(x_{i}-\mu\right)\right)&&\text{interchange $∑$ and $\rm tr$}
    \\&=\frac{n}{2} \log \left|\boldsymbol{\Sigma}^{-1}\right|-\frac{1}{2} \operatorname{tr}\left(\boldsymbol{\Sigma}^{-1}\sum_{i=1}^{n}\left(x_{i}-\mu\right)\left(x_{i}-\mu\right)^⊺\right)&&\text{apply (b) to }\boldsymbol{\Sigma}^{-1}\left(x_{i}-\mu\right)\text{ and }\left(x_{i}-\mu\right)^⊺\end{align*}(iii) By (d), the MLE of $\boldsymbol{\Sigma}^{-1}$ is $\hat{\boldsymbol{\Sigma}}^{-1}$, so we may differentiate the log-likelihood with respect to $\boldsymbol{\Sigma}^{-1}$ rather than $\boldsymbol{\Sigma}$. Using (c) and $\left(\left(x_{i}-\mu\right)\left(x_{i}-\mu\right)^⊺\right)^⊺=\left(x_{i}-\mu\right)\left(x_{i}-\mu\right)^⊺$, we have$${∂\ell\over∂\boldsymbol{\Sigma}^{-1}}(μ,\boldsymbol{\Sigma})=\frac{n}{2}\boldsymbol{\Sigma}^⊺-\frac{1}{2}\sum_{i=1}^{n}\left(x_{i}-\mu\right)\left(x_{i}-\mu\right)^⊺$$Setting this to zero at $(\hatμ,\hat{\boldsymbol{\Sigma}})$ and using $\boldsymbol{\Sigma}^⊺=\boldsymbol{\Sigma}$ gives$$\hat{\boldsymbol{\Sigma}}=\frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-\hatμ\right)\left(x_{i}-\hatμ\right)^⊺$$
    (iv) To examine the bias of $\hat{\boldsymbol\Sigma}$ consider\begin{aligned}E[\hat{\boldsymbol\Sigma}]&=E\left[\frac1n\sum_{i=1}^n(x_i-\hatμ)(x_i-\hatμ)^⊺\right]\\ &=\frac1nE\left[\sum_{i=1}^nx_ix_i^⊺-\sum_{i=1}^nx_i\hatμ^⊺-\sum_{i=1}^n\hatμx_i^⊺+n\hatμ\hatμ^⊺\right]\\&=\frac1n\left(\sum_{i=1}^nE\left[x_ix_i^⊺\right]-2nE\left[\hatμ\hatμ^⊺\right]+nE\left[\hatμ\hatμ^⊺\right]\right)\quad\text{as }\sum_{i=1}^nx_i=n\hatμ\text{ so }\sum_{i=1}^nx_i\hatμ^⊺=\sum_{i=1}^n\hatμx_i^⊺=n\hatμ\hatμ^⊺\quad(\mmlToken{mtext}[href="https://en.wikipedia.org/wiki/Outer_product"]{outer products})\\&=\frac1n\left(\sum_{i=1}^nE\left[x_ix_i^⊺\right]-nE\left[\hatμ\hatμ^⊺\right]\right)\end{aligned}Now $x_i∼N_p(μ,\boldsymbol\Sigma)$ and $\hatμ∼N_p\left(μ,\frac{\boldsymbol\Sigma}n\right)$ so
    $E[XX^⊺]=\operatorname{cov}(X)+E[X]E[X]^⊺$ (the multivariate analogue of $E[X^2]=\operatorname{var}(X)+[E(X)]^2$) gives $E[x_ix_i^⊺]=\operatorname{cov}(x_i)+E[x_i]E[x_i]^⊺=\boldsymbol\Sigma+μμ^⊺$ and
    $E[\hatμ\hatμ^⊺]=\operatorname{cov}(\hatμ)+E[\hatμ]E[\hatμ]^⊺=\frac{\boldsymbol\Sigma}n+μμ^⊺$
    so $E[\hat{\boldsymbol\Sigma}]=\frac1n\left(n(\boldsymbol\Sigma+μμ^⊺)-n\left(\frac{\boldsymbol\Sigma}n+μμ^⊺\right)\right)=\frac{n-1}n\boldsymbol\Sigma$. Hence $E[\boldsymbol S]=\frac n{n-1}E[\hat{\boldsymbol\Sigma}]=\frac n{n-1}\cdot\frac{n-1}n\boldsymbol\Sigma=\boldsymbol\Sigma$, so the sample covariance $\boldsymbol S$ is unbiased for $\boldsymbol\Sigma$.
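    As an empirical illustration of (i), (iii) and (iv), the sketch below (Python with numpy, which the original does not specify; the parameters reuse the made-up numbers from Question 5) computes $\hatμ$ and the $1/n$ estimator $\hat{\boldsymbol\Sigma}$ from simulated data and compares the latter with the unbiased $\boldsymbol S$, which `np.cov` returns by default:

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    n = 500
    mu = np.array([120.0, 80.0])                     # illustrative parameters (from Q5)
    Sigma = np.array([[25.0, 12.0],
                      [12.0, 16.0]])

    X = rng.multivariate_normal(mu, Sigma, size=n)   # rows are x_1, ..., x_n

    mu_hat = X.mean(axis=0)                          # MLE of mu: the sample mean
    centred = X - mu_hat
    Sigma_hat = centred.T @ centred / n              # MLE of Sigma: divisor n
    S = centred.T @ centred / (n - 1)                # unbiased sample covariance

    print(mu_hat, Sigma_hat, sep="\n")               # Sigma_hat is biased by factor (n-1)/n
    print(np.allclose(S, np.cov(X, rowvar=False)))   # True: np.cov uses divisor n-1
    ```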