2018 Paper Ⅲ
- [10 marks] Solve the following initial value problems, in each case finding $y(x)$ explicitly in terms of $x$.
(i)$$x \frac{\mathrm{d} y}{\mathrm{d} x}+3 y=2, \quad y(1)=2,$$(ii)\begin{aligned}& \frac{\mathrm{d} y}{\mathrm{d} x}=\frac{x+y+2}{1-x-y}, \quad y(1)=1 .\end{aligned} - [10 marks] By using the substitution $y(x)=x^{n} v(x)$ for a suitable $n$, which you should find, solve the initial value problem$$4 x^{2} \frac{\mathrm{d}^{2} y}{\mathrm{d} x^{2}}+4 x \frac{\mathrm{d} y}{\mathrm{d} x}+\left(x^{2}-1\right) y=0, \quad y(π)=1, \quad y'(π)=1$$
Solution.
- (i)Find an integrating factor $I=e^{\int \frac3x{\rm d}x}=x^3$, so $\frac{\rm d}{\mathrm d x}\left(x^3 y\right)=2 x^2$, which integrates to $x^{3} y=\frac{2}{3} x^{3}+c$
$y(1)=2⇒2=\frac23+c⇒c=\frac43$, so $y=\frac23+\frac4{3x^3}$.
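As a quick sanity check (a short Python sketch, not part of the exam solution), one can verify that $y=\frac23+\frac4{3x^3}$ satisfies $x y'+3y=2$ and $y(1)=2$:

```python
# Check that y(x) = 2/3 + 4/(3 x^3) solves x y' + 3 y = 2 with y(1) = 2.
def y(x):
    return 2 / 3 + 4 / (3 * x**3)

def dy(x):
    # derivative computed by hand: d/dx [4/(3 x^3)] = -4 x^{-4}
    return -4 / x**4

assert abs(y(1.0) - 2.0) < 1e-12
for x in [0.5, 1.0, 2.0, 10.0]:
    assert abs(x * dy(x) + 3 * y(x) - 2.0) < 1e-9
print("ok")
```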
(ii)Let $x+y=u$, then $\frac{\mathrm du}{\mathrm dx}=1+\frac{u+2}{1-u}=\frac{3}{1-u}$. So $\int(1-u)\,\mathrm du=\int3\,\mathrm dx$⇒$u-\frac{u^2}2=3x+c$.
$y(1)=1⇒2x+2y-x^2-2xy-y^2=6x-6$⇒$y^{2}+2(x-1) y+x^{2}+4 x-6=0$⇒$y=1-x±\sqrt{7-6 x}$. To satisfy the initial condition, $y=1-x+\sqrt{7-6 x}$. - Let $y=vx^n$. Then $4 x^{2}\left(v''x^{n}+2v'nx^{n-1}+vn(n-1)x^{n-2}\right)+4 x\left(v n x^{n-1}+v'x^n\right)+\left(x^{2}-1\right) v x^n=0$
$(4 v''+v) x^{n+2}+(8 n+4) v' x^{n+1}+\left(4 n^{2}-1\right) v x^{n}=0$, so $n=-\frac12$ makes the last two coefficients vanish. With $y=vx^{-1/2}$ the differential equation becomes $4 x^{3/2} v''+x^{3 / 2} v=0$⇒$v''+\frac{1}{4}v=0$⇒$v=A \cos \frac{x}{2}+B \sin \frac{x}{2}$. Hence the general solution is $y=\frac{1}{\sqrt{x}}\left(A \cos \frac{x}{2}+B \sin \frac{x}{2}\right)$.
$y(π)=1 \Rightarrow B=\sqrt{π}$
$y'(π)=1 \Rightarrow A=-\frac1{\sqrt{π}}(2 π+1)$
So $y(x)=\frac{1}{\sqrt{x}}\left(\sqrt{π} \sin \frac{x}{2}-\frac{1}{\sqrt{π}}(2 π+1) \cos \frac{x}{2}\right)$
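The boxed solution can be checked numerically with finite differences (an illustrative Python sketch; the step size and tolerances are assumptions):

```python
import math

# Finite-difference check of
# y(x) = (1/sqrt(x)) * (sqrt(pi) sin(x/2) - (2 pi + 1)/sqrt(pi) cos(x/2)).
def y(x):
    A = -(2 * math.pi + 1) / math.sqrt(math.pi)
    B = math.sqrt(math.pi)
    return (A * math.cos(x / 2) + B * math.sin(x / 2)) / math.sqrt(x)

h = 1e-4
def d1(x): return (y(x + h) - y(x - h)) / (2 * h)
def d2(x): return (y(x + h) - 2 * y(x) + y(x - h)) / h**2

assert abs(y(math.pi) - 1.0) < 1e-9          # y(pi) = 1
assert abs(d1(math.pi) - 1.0) < 1e-6         # y'(pi) = 1
for x in [1.0, 2.0, math.pi, 5.0]:           # ODE residual at sample points
    assert abs(4 * x**2 * d2(x) + 4 * x * d1(x) + (x**2 - 1) * y(x)) < 1e-3
print("ok")
```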
- (a) [8 marks] If $x=e^{s} \cos t, y=e^{s} \sin t$, show that$$
\frac{\partial^{2} u}{\partial x^{2}}+\frac{\partial^{2} u}{\partial y^{2}}=e^{-2 s}\left(\frac{\partial^{2} u}{\partial s^{2}}+\frac{\partial^{2} u}{\partial t^{2}}\right)
$$where $u$ is a suitably differentiable function of two variables.
(b) [3 marks] Determine the second-order Taylor polynomial about (0,0) for the function$$f(x, y)=e^{x y}+(x+y)^{2}$$(c) [9 marks] Let $G(x, y)=g(y-x+h(y+x))$, where $g$ and $h$ are suitably differentiable functions of one variable. Given that the equation $G(x, y)=0$ implicitly defines a function $y(x)$, show that$$\frac{\mathrm{d} y}{\mathrm{d} x}=\frac{1-h'(y+x)}{1+h'(y+x)} \quad \text { and } \quad \frac{\mathrm{d}^{2} y}{\mathrm{d} x^{2}}=-4 \frac{h''(y+x)}{\left(1+h'(y+x)\right)^{3}}$$State any restrictions that are needed on $g$ and $h$. - [10 marks] Find and classify the critical points of the function $f: \mathbb{R}^2→\mathbb{R}$ defined by$$f(x, y)=e^{x+y}\left(x^{2}-x y+y^{2}\right)$$
- [10 marks] Use the method of Lagrange multipliers to find the maximum distance from the origin $(0,0)$ to the curve $g(x, y)=4 x^2+4y^2+5xy-3=0$.
Illustrate your solution by means of a sketch of $g(x, y)=0$ and reference to $\nabla g(x, y)$.
Solution.
$\begin{array}{l}f=e^{x+y}\left(x^{2}-x y+y^{2}\right)\\
f_{x}=e^{x+y}\left(x^{2}-x y+y^{2}+2 x-y\right)\\
f_{y}=e^{x+y}\left(x^{2}-x y+y^{2}-x+2 y\right)\end{array}$
At critical points (where $f_x=f_y=0$) the second derivatives reduce to $f_{x x}=e^{x+y}(2 x-y+2)$, $f_{y y}=e^{x+y}(-x+2 y+2)$, $f_{x y}=e^{x+y}(-x+2 y-1)$.
At critical points $\left.\begin{array}{l}x^{2}-x y+y^{2}+2 x-y=0 \\ x^{2}-x y+y^{2}-x+2 y=0\end{array}\right\}$ subtracting gives $3x-3y=0$, so $x=y$.
So $x^{2}-x^{2}+x^{2}+2 x-x=0$ i.e. $x^{2}+x=0 \Rightarrow x=0,-1$.
At $(0,0)$, $f_{x x}=2>0, f_{y y}=2, f_{x y}=-1$, so $f_{x x} f_{y y}-f_{x y}^{2}=4-1=3>0$ and $f_{xx}>0$, so $f$ has a local minimum at $(0,0)$.
At $(-1,-1)$, $f_{xx}=e^{-2}$, $f_{yy}=e^{-2}$, $f_{xy}=-2e^{-2}$, so $f_{xx}f_{yy}-f_{x y}^{2}=e^{-4}-4 e^{-4}<0$, so $f$ has a saddle point at $(-1,-1)$. - Max/Min of $x^{2}+y^{2}$ subject to $4 x^{2}+4 y^{2}+5 x y-3=0$
So $G(x, y, λ)=x^{2}+y^{2}-λ\left(4 x^{2}+4 y^{2}+5 x y-3\right)$
$\begin{array}{ll}G_x=2 x-λ(8 x+5 y)=0&①\\
G_y=2 y-λ(8 y+5 x)=0&②\\
G_λ=-(4 x^{2}+4 y^{2}+5 x y-3)=0&③\end{array}$
$\left.\begin{array}{ll}①×y:&2 x y-λ\left(8 x y+5 y^{2}\right)=0\\②×x:&2 x y-λ\left(8 x y+5 x^{2}\right)=0\end{array}\right\}$ subtracting, $5λ\left(x^{2}-y^{2}\right)=0$, and $λ≠0$ (else $x=y=0$, violating ③), so $y=±x$.
$y=x: \quad 8 x^{2}+5 x^{2}-3=0 \quad x=±\sqrt{3 / 13}$
$y=-x: \quad 8 x^{2}-5 x^{2}-3=0 \quad x=±1$
So critical points at $\left(±\sqrt{\frac{3}{13}}, ±\sqrt{\frac{3}{13}}\right)$ where $x^2+y^2=\frac{6}{13}$ and at $(±1,∓1)$ where $x^2+y^2=2$, so max distance is $\sqrt2$.
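A quick numerical check (an illustrative Python sketch, not part of the exam solution) that all four critical points lie on the constraint and that the largest distance is $\sqrt2$:

```python
import math

# Verify the constrained critical points and their distances from the origin.
def g(x, y):
    return 4 * x**2 + 4 * y**2 + 5 * x * y - 3

a = math.sqrt(3 / 13)
pts = [(a, a), (-a, -a), (1.0, -1.0), (-1.0, 1.0)]
for (x, y) in pts:
    assert abs(g(x, y)) < 1e-12          # each point satisfies g = 0

dists2 = [x * x + y * y for (x, y) in pts]
assert abs(min(dists2) - 6 / 13) < 1e-12  # squared distance 6/13
assert abs(max(dists2) - 2.0) < 1e-12     # squared distance 2
print("max distance:", math.sqrt(max(dists2)))
```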
- (a) [6 marks] (i) A probability space is a triple of mathematical objects $(\Omega, \mathcal{F}, \mathbb{P})$. State what $\mathbb{P}$ is and list the axioms that it satisfies.
(ii) Show that the axioms imply that for $A, B ∈\mathcal{F}$ we have $\mathbb{P}(A) ⩽ \mathbb{P}(B)$ whenever $A \subset B$.
(b) [5 marks] Recall that a standard normal distribution has density $ϕ(z)=(2 π)^{-1 / 2} \mathrm{e}^{-z^{2} / 2}$. Let $Z$ be a random variable with standard normal distribution. Compute $\mathbb{E}[|Z|]$.
(c) [9 marks] Determine the solution to each of the following difference equations:
(i) $u_{n+1}=2 u_{n}+n$; $u_{0}=1$.
(ii) $u_{n+1}=3 u_{n}-2 u_{n-1}$; $u_{1}=0, u_{0}=1 .$
(iii) $u_{n+1}=4 u_{n}-4 u_{n-1}+1$; $u_{1}=1, u_{0}=0$.
Solution.
- (i) A probability space is $(\Omega, \mathcal{F}, \mathbb{P})$ where $\Omega$ is a set, $\mathcal{F}$ is a collection of subsets of $\Omega$, and $\mathbb{P}$ is a function $\mathbb{P}: \mathcal{F} →[0,1]$ satisfying
◇ For all $A ∈\mathcal{F}, \mathbb{P}(A) \geq 0$. (Note: This is unnecessary if the definition says that the range of $\mathbb{P}$ is $[0,1]$.)
◇ $\mathbb{P}(\Omega)=1$.
◇ If $\left\{A_{i}, i ∈I\right\} \subset \mathcal{F}$ is a countable subcollection with $A_{i} \cap A_{j}=\emptyset$ for $i ≠ j$, then$$\mathbb{P}\left(\bigcup_{i ∈I} A_{i}\right)=\sum_{i ∈I} \mathbb{P}\left(A_{i}\right) .$$(ii) We have $B=A \cup(B \backslash A)$, and $A \cap(B \backslash A)=\emptyset$. By countable additivity (the third axiom),$$\mathbb{P}(B)=\mathbb{P}(A)+\mathbb{P}(B \backslash A) \geq \mathbb{P}(A),$$since the first axiom (nonnegativity) implies that $\mathbb{P}(B \backslash A) \geq 0$. - \begin{aligned} \mathbb{E}[|Z|] &=\int_{-∞}^{∞}|z| ·(2 π)^{-1 / 2} \mathrm{e}^{-z^{2} / 2} \mathrm{d} z \\ &=2 \int_{0}^{∞}(2 π)^{-1 / 2} \mathrm{e}^{-z^{2} / 2} z \mathrm{d} z \\ &=2(2 π)^{-1 / 2} \int_{0}^{∞} \mathrm{e}^{-y} \mathrm{d} y, \text { substituting } y=z^{2} / 2 \\ &=2(2 π)^{-1 / 2}=\sqrt{2 / π} \end{aligned}
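The value $\mathbb{E}[|Z|]=\sqrt{2/π}\approx0.798$ can be confirmed by a simple midpoint-rule integration (an illustrative Python sketch; the grid size and the truncation at $±8$ are assumptions):

```python
import math

# Midpoint-rule approximation of E|Z| = integral of |z| * phi(z), truncated to [-8, 8].
N = 200_000
a, b = -8.0, 8.0
h = (b - a) / N
total = 0.0
for i in range(N):
    z = a + (i + 0.5) * h
    total += abs(z) * math.exp(-z * z / 2) / math.sqrt(2 * math.pi) * h

assert abs(total - math.sqrt(2 / math.pi)) < 1e-6
print(round(total, 4))  # ≈ 0.7979
```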
- (i) The homogeneous equation has solution $w_{n}=2^{n}$. We seek a particular solution of the form $v_{n}=C n+D$. We get$$C(n+1)+D=2 C n+2 D+n \Longrightarrow C=2 C+1 \text { and } C+D=2 D,$$so $C=D=-1$ and $v_{n}=-n-1$. So $u_{n}=A 2^{n}-n-1$ for some constant $A$. For $n=0$ we get $1=A-1$, so finally$$u_{n}=2^{n+1}-n-1$$(ii) The auxiliary equation is$$λ^{2}-3 λ+2=0,$$which has solutions $λ=1,2$, so $u_{n}=A+B(2)^{n}$. We get\begin{aligned}1 &=A+B, \quad 0=A+2 B, \\B &=-1, \quad A=2,\end{aligned}so$$u_{n}=2-2^{n}$$(iii) The auxiliary equation is$$λ^{2}-4 λ+4=0,$$which has the repeated root $λ=2$, so the homogeneous solutions are $u_{n}=A 2^{n}+B n 2^{n}$. We know that $w_{n}=1$ is a particular solution, so our general solution is$$u_{n}=2^{n}(A+B n)+1$$Substituting the initial conditions yields\begin{aligned}0 &=A+1, \quad 1=2 A+2 B+1 \\A &=-1, \quad B=1\end{aligned}so$$u_{n}=2^{n}(n-1)+1$$
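All three closed forms can be verified by iterating the recurrences directly (a Python check, not required for the exam):

```python
# (i) u_{n+1} = 2 u_n + n, u_0 = 1; claimed u_n = 2^(n+1) - n - 1
u = 1
for n in range(20):
    assert u == 2**(n + 1) - n - 1
    u = 2 * u + n

# (ii) u_{n+1} = 3 u_n - 2 u_{n-1}, u_0 = 1, u_1 = 0; claimed u_n = 2 - 2^n
prev, cur = 1, 0
for n in range(1, 20):
    assert cur == 2 - 2**n
    prev, cur = cur, 3 * cur - 2 * prev

# (iii) u_{n+1} = 4 u_n - 4 u_{n-1} + 1, u_0 = 0, u_1 = 1; claimed u_n = 2^n (n-1) + 1
prev, cur = 0, 1
for n in range(1, 20):
    assert cur == 2**n * (n - 1) + 1
    prev, cur = cur, 4 * cur - 4 * prev + 1
print("all closed forms verified")
```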
- [10 marks] A fair 6-sided die is painted with three sides red, two sides blue, one side green.
(i) The die is rolled three times. Calculate the probability that three different colours come up.
(ii) The die is rolled three times. Let $A$ be the event {the same colour comes up every time}, and let $B$ be the event {red comes up every time}. Compute the conditional probability $\mathbb{P}(B \mid A)$.
(iii) The die is rolled $n$ times. Let $X$ be the number of different colours that come up. Show that$$\mathbb{E}[X]=3-\left(\frac{5}{6}\right)^{n}-\left(\frac{2}{3}\right)^{n}-\left(\frac{1}{2}\right)^{n} .$$ - [10 marks] For real numbers $c$ and $α$, let $f: \mathbb{R} → \mathbb{R}$ be defined by$$f_{c, α}(x)= \begin{cases}c x^{-α}-x^{-α-1} & \text { for } x>1, \\ 0 & \text { for } x ⩽ 1 .\end{cases}$$(i) For which values of $α$ does there exist a value of $c$ such that $f_{c, α}$ is a probability density? Give the appropriate value of $c$ for each such $α$.
(ii) For $α=3$ let $X$ be a random variable with density $f_{c, 3}$, for the appropriate value of $c$. Compute $\mathbb{P}\{X ⩽ 2\}$ and $\mathbb{E}[X]$.
Solution.
- (i) There are 6 possible orders for the colours, and the probability for each order is $1 / 2 · 1 / 3 · 1 / 6=1 / 36$. So $\mathbb{P}\{3$ different $\}=6 · \frac{1}{36}=\frac{1}{6}$.
(ii)\begin{array}{r}\mathbb{P}(A)=\frac{1}{2^{3}}+\frac{1}{3^{3}}+\frac{1}{6^{3}}=\frac{1}{6} \\\mathbb{P}(B)=\mathbb{P}(A \cap B)=\frac{1}{2^{3}} \\\mathbb{P}(B \mid A)=\frac{\mathbb{P}(A \cap B)}{\mathbb{P}(A)}=\frac{3}{4}\end{array}(iii) Let $X_{R}, X_{B}, X_{G}$ be the indicators of the events that red, blue, and green come up, respectively. Then $X=X_{R}+X_{B}+X_{G}$, and\begin{aligned}\mathbb{E}[X] &=\mathbb{E}\left[X_{R}\right]+\mathbb{E}\left[X_{B}\right]+\mathbb{E}\left[X_{G}\right] \\&=1-\mathbb{P}\{\text { no red }\}+1-\mathbb{P}\{\text { no blue }\}+1-\mathbb{P}\{\text { no green }\} \\&=3-\left(\frac{1}{2}\right)^{n}-\left(\frac{2}{3}\right)^{n}-\left(\frac{5}{6}\right)^{n}\end{aligned} - (i) To be a density it must be nonnegative and have integral equal to 1. For the integral to converge at all it must have $α>1$. The integral is then$$1=\frac{c}{α-1}-\frac{1}{α},$$implying $c=α-\frac{1}{α}$. Since $f_{c, α}(x) x^{α+1}=c x-1$, the density is positive for all $x>1$ as long as $c \geq 1$. The condition $α-\frac{1}{α} \geq 1$ is equivalent to $α^{2}-α-1 \geq 0$, so by the quadratic formula it must be that $α \geq(1+\sqrt{5}) / 2$.(ii)$$\mathbb{P}\{X \leq 2\}=\int_{1}^{2}\left(\frac{8}{3} x^{-3}-x^{-4}\right) \mathrm{d} x=-\frac{4}{3}\left(\frac{1}{4}-1\right)+\frac{1}{3}\left(\frac{1}{8}-1\right)=\frac{17}{24} .$$The expectation is$$\mathbb{E}[X]=\int_{1}^{∞} x\left(\frac{8}{3} x^{-3}-x^{-4}\right) \mathrm{d} x=\frac{8}{3}-\frac{1}{2}=\frac{13}{6} .$$
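The dice answers can be confirmed by brute-force enumeration of the $6^3$ equally likely outcomes, and the density computation checks exactly in rational arithmetic (an illustrative Python sketch):

```python
from itertools import product
from fractions import Fraction

# Die faces: 3 red, 2 blue, 1 green; enumerate all 6^3 equally likely rolls.
faces = ['R'] * 3 + ['B'] * 2 + ['G']
rolls = list(product(faces, repeat=3))
total = len(rolls)                                          # 216

three_diff = sum(1 for r in rolls if len(set(r)) == 3)
assert Fraction(three_diff, total) == Fraction(1, 6)        # part (i)

same = [r for r in rolls if len(set(r)) == 1]               # event A
all_red = [r for r in same if r[0] == 'R']                  # event A ∩ B
assert Fraction(len(all_red), len(same)) == Fraction(3, 4)  # part (ii)

n = 3                                                       # part (iii), n = 3
ex = Fraction(sum(len(set(r)) for r in rolls), total)
assert ex == 3 - Fraction(1, 2)**n - Fraction(2, 3)**n - Fraction(5, 6)**n

# Density question, part (ii): antiderivative check of P{X <= 2} = 17/24
F = lambda x: Fraction(-4, 3) / x**2 + Fraction(1, 3) / x**3
assert F(2) - F(1) == Fraction(17, 24)
print("all checks pass")
```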
- (a) [12 marks] Let $X$ and $Y$ be independent Poisson-distributed random variables, with parameters $\mu$ and $λ$ respectively.
[Recall that the Poisson distribution with parameter $λ$ has probability mass function $p_{k}=\mathrm{e}^{-λ} λ^{k} / k !$ for $k=0,1,2, \ldots$]
(i) Define and determine the probability generating function $G_{X}(s)$ of $X$.
(ii) Show that $X+Y$ is also Poisson distributed. [You may use general results about probability distributions or probability generating functions, as long as they are clearly stated.]
(iii) Let $X$ be defined as above. Suppose we have a random variable $Z$ such that, conditioned on $\{X=k\}, Z$ has Poisson distribution with parameter $k$. Determine the probability generating function of $Z$.
(iv) Evaluate $\mathbb{E}[Z]$.
(b) [8 marks] A box contains 1 red ball and 3 white balls. A ball is chosen at random (with each ball having the same chance of being selected) and replaced by one of the other colour. This process is repeated until all the balls have the same colour.
(i) Compute the probability that the balls are white at the end.
(ii) Compute the expected total number of draws.
Solution.
- (i) Definition$$G_{X}(s)=\mathbb{E}\left[s^{X}\right]$$or$$G_{X}(s)=\sum_{i=0}^{∞} \mathbb{P}\{X=i\} s^{i}$$Then for Poisson $X$,\begin{aligned}G_{X}(s) &=\sum_{i=0}^{∞} \mathrm{e}^{-\mu} \frac{\mu^{i}}{i !} s^{i} \\&=\sum_{i=0}^{∞} \mathrm{e}^{-\mu} \frac{(\mu s)^{i}}{i !} \\&=\mathrm{e}^{\mu(s-1)}\end{aligned}(ii) The probability generating function of the sum of two independent random variables is the product of the probability generating functions. So$$G_{X+Y}(s)=\mathrm{e}^{\mu(s-1)} · \mathrm{e}^{λ(s-1)}=\mathrm{e}^{(\mu+λ)(s-1)}$$This is the probability generating function of a Poisson random variable with parameter $\mu+λ$. Since a distribution is uniquely determined by its probability generating function, $X+Y$ has Poisson distribution with parameter $\mu+λ$.
(iii) $Z$ has the same distribution as$$\sum_{i=1}^{X} Z_{i}$$where $Z_{i}$ are i.i.d. Poisson distributed with parameter 1. By the standard result for probability generating functions of random sums,$$G_{Z}(s)=G_{X}\left(G_{Z_{1}}(s)\right)=G_{X}\left(\mathrm{e}^{s-1}\right)=\mathrm{e}^{\mu\left(\mathrm{e}^{s-1}-1\right)} .$$(iv) One may observe that $Z$ can be represented as the sum of $X$ independent Poisson random variables with parameter 1, each with expected value 1. The formula for the expected value of a random sum then gives $\mathbb{E}[Z]=\mathbb{E}\left[Z_{1}\right] · \mathbb{E}[X]=\mu$.
Otherwise, one may compute$$G_{Z}'(s)=\mu \mathrm{e}^{s-1} \mathrm{e}^{\mu\left(\mathrm{e}^{s-1}-1\right)}$$Thus $\mathbb{E}[Z]=G_{Z}'(1)=\mu$. -
(i) Let $p_{k}$ be the probability of all white at the end, given that we start with $k$ white and $4-k$ red. Then $p_{0}=0, p_{4}=1$, and for $k=1,2,3$,$$p_{k}=\frac{k}{4} p_{k-1}+\frac{4-k}{4} p_{k+1} .$$Thus\begin{aligned}p_{1} &=\frac{3}{4} p_{2}, \\p_{2} &=\frac{1}{2} p_{3}+\frac{1}{2} p_{1} \Longrightarrow p_{2}=\frac{4}{5} p_{3}, \\p_{3} &=\frac{3}{4} p_{2}+\frac{1}{4} \Longrightarrow p_{3}=\frac{5}{8}\end{aligned}(It would also be legitimate to claim that $p_{2}=1 / 2$ by symmetry.)
(ii) Let $e_{k}$ be the expected number of draws, given that we start with $k$ white and $4-k$ red. Then $e_{0}=e_{4}=0$, and for $k=1,2,3$,$$e_{k}=\frac{k}{4} e_{k-1}+\frac{4-k}{4} e_{k+1}+1$$Thus\begin{aligned}e_{1} &=\frac{3}{4} e_{2}+1 \\e_{2} &=\frac{1}{2} e_{3}+\frac{1}{2} e_{1}+1 \Longrightarrow e_{2}=\frac{4}{5} e_{3}+\frac{12}{5} \\e_{3} &=\frac{3}{4} e_{2}+1 \Longrightarrow e_{3}=\frac{3}{5} e_{3}+\frac{14}{5} \Longrightarrow e_{3}=7\end{aligned}
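A Monte Carlo simulation of the urn (an illustrative Python sketch; the seed, trial count, and tolerances are assumptions) agrees with $p_3=5/8$ and $e_3=7$:

```python
import random

# Simulate the urn starting from 3 white, 1 red: draw a ball uniformly,
# replace it by one of the other colour, repeat until one colour remains.
random.seed(0)
trials = 200_000
white_end = 0
total_draws = 0
for _ in range(trials):
    white = 3                       # red = 4 - white
    draws = 0
    while 0 < white < 4:
        draws += 1
        if random.random() < white / 4:
            white -= 1              # drew white, replaced by red
        else:
            white += 1              # drew red, replaced by white
    total_draws += draws
    white_end += (white == 4)

assert abs(white_end / trials - 5 / 8) < 0.01   # p_3 = 0.625
assert abs(total_draws / trials - 7) < 0.2      # e_3 = 7
print(white_end / trials, total_draws / trials)
```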
- [10 marks] A factory produces $n$ silicon wafers. The number of flaws in one wafer is believed to be Poisson distributed with unknown parameter $λ$. Thus, the number of flaws $X_{i}$ counted in wafer $i(i \in\{1, \ldots, n\})$ has probability mass function $p_{k}=\mathrm{e}^{-λ} λ^{k} / k !$; its expected value and variance are both equal to $λ$. Assume the $X_{i}$ are independent.
(i) Compute the maximum likelihood estimator $\hat{λ}$ for $λ$.
(ii) Compute the standard error for $\hat{λ}$.
(iii) Suppose the total number of flaws in 100 wafers is found to be 200. Compute an approximate $95 \%$ confidence interval for $λ$. You may use the quantities $\Phi(1.64) \approx 0.95$ and $\Phi(1.96) \approx 0.975$, where $\Phi$ is the cumulative distribution function of the standard normal distribution. State clearly any mathematical results you make use of. - [10 marks] Let $x_{1}, \ldots, x_{10}$ be real numbers. We have ten random variables $Y_{i}=α+β x_{i}+ϵ_{i}$, where $ϵ_{i}$ are independent normal with expectation 0 and the same unknown variance $\sigma^{2}$, and $α$ and $β$ are unknown real parameters. We observe values $y_{1}, \ldots, y_{10}$ for $Y_{1}, \ldots, Y_{10}$. The points $\left(x_{i}, y_{i}\right)$ are labelled $\rm A$ through $\mathrm{J}$ in the plot below. The least-squares regression line, estimated as $y=\hat{α}+\hat{β} x$ with $\hat{α}=-0.58$ and $\hat{β}=0.74$, has been drawn in, with a $*$ marking the point $(\bar{x}, \bar{y})$, the pair given by the mean of $x_{i}$ and the mean of $y_{i}$. We calculate the studentised residuals\begin{array}{cccccccccc}\hline \mathrm{A} & \mathrm{B} & \mathrm{C} & \mathrm{D} & \mathrm{E} & \mathrm{F} & \mathrm{G} & \mathrm{H} & \mathrm{I} & \mathrm{J} \\-3.04 & 0.05 & 0.85 & 0.36 & -0.50 & 1.48 & 0.36 & 0.71 & -0.36 & -0.09 \\\hline\end{array}(i) The studentised residuals are calculated according to the formula $r_{i}=e_{i} /\left(s \sqrt{1-h_{i}}\right)$. Explain what each of the elements of this formula is, and state the formula for computing $s$.
(ii) Which points have high leverage? Explain your answer.
(iii) Which point should be treated as a possible outlier? State two possible explanations for an outlier that should be investigated.
(iv) If we removed one point and recalculated the regression model, which point would produce the largest change in the estimate $\hat{β}$? Explain your answer.
Solution.
- (i) The log likelihood is$$\ell(λ)=\sum_{i=1}^{n}\left(-λ+X_{i} \log λ-\log X_{i} !\right)=-n λ+\left(\sum X_{i}\right) \log λ-\sum \log X_{i} !$$Thus$$\ell'(λ)=-n+λ^{-1} \sum X_{i},$$and solving $\ell'(\hat{λ})=0$ yields$$\hat{λ}=n^{-1} \sum X_{i} .$$Since $\ell''(λ)$ is always negative, this is a maximum.
(ii) The variance of $\hat{λ}$ is $n^{-2} · n \operatorname{Var}\left(X_{i}\right)=λ / n$. Thus the standard error is $\sqrt{λ / n}$.
(iii) We have $\hat{λ}=2$, and approximate the standard error by $\sqrt{\hat{λ} / n}$. By the Central Limit Theorem,$$\frac{\sum X_{i}-λ n}{\sqrt{λ n}} \approx \frac{\sum X_{i}-λ n}{\sqrt{2 n}}$$is approximately standard normal, so $\hat{λ}$ is approximately normal with mean $λ$ and standard deviation $\sqrt{2 / n}$. A $95 \%$ confidence interval is $\hat{λ} ±1.96 \sqrt{2 / n}=2 ±1.96 \sqrt{0.02} ≈(1.72,2.28)$. - (i) $e_{i}$ is the $i$-th residual; $h_{i}$ is the $i$-th leverage; $s$ is the residual standard error, defined by$$s=\left(\frac{1}{n-2} \sum e_{i}^{2}\right)^{1 / 2}$$(ii) The high-leverage points are the points whose $x$ value is farthest from the mean. These are $\mathrm{C}$ and $\mathrm{F}$.
(iii) Point A has the largest studentised residual. Studentised residuals larger (in absolute value) than 3 are taken to be possible outliers. Outliers should be investigated to decide whether they may be due to errors in data collection or model failure (such as a missing regressor).
(iv) Large changes in the slope estimate are produced by points that have high leverage and high residual. The only such point is F. (C has about the same leverage and significantly lower residual.)
- [9 marks] Consider the linear regression model$$Y_i=α+β x_i+ϵ_i$$where $x_{1}, \ldots, x_{n}$ are known constants, $α$ and $β$ are unknown parameters, and $ϵ_{1}, \ldots, ϵ_{n}$ are random variables with mean 0 and variance $\sigma^2$. Suppose that we observe values $y_{1}, \ldots, y_{n}$ of $Y_{1}, \ldots, Y_{n}$.
(i) What is meant by the least squares estimators $\hat{α}$ and $\hat{β}$ of $α$ and $β$?
(ii) State a specific model for the joint distribution of $ϵ_{1}, \ldots, ϵ_{n}$ under which $\hat{α}$ and $\hat{β}$ are the maximum likelihood estimators of $α$ and $β$. (A proof is not required.)
(iii) Assume that $\sum x_{i}=0 .$ Show that in this case $\hat{α}=\sum y_{i} / n$, and find $\hat{β}$. Show that $\hat{α}$ and $\hat{β}$ are unbiased estimators of $α$ and $β$. - [11 marks] We have a data set of $n$ observations of $p$ variables, written as $x_{i j}$ for $i \in\{1, \ldots, n\}, j \in\{1, \ldots, p\} .$
(i) What is the difference between $k$-means clustering and hierarchical clustering?
(ii) Describe the $k$-means algorithm for a fixed choice of $k$. You should include the objective function that the algorithm aims to minimise, and a description of the steps of the algorithm.
(iii) Should you choose $k$ to minimise the objective function? Why or why not?
(iv) If the $k$-means algorithm is run twice on the same data will the clusters obtained both times necessarily be identical? Explain your answer.
Solution.
- (i) This is the pair of values $\hat{α}$ and $\hat{β}$ that minimise$$\sum_{i=1}^{n}\left(y_{i}-\hat{α}-\hat{β} x_{i}\right)^{2} .$$(ii) The errors $ϵ_{i}$ are i.i.d. normal with mean 0 and common variance $σ^{2}$.
(iii)\begin{aligned}
\frac{\mathrm{d}}{\mathrm{d} \hat{α}} \sum_{i=1}^{n}\left(y_{i}-\hat{α}-\hat{β} x_{i}\right)^{2} &=-\sum_{i=1}^{n} 2\left(y_{i}-\hat{α}-\hat{β} x_{i}\right) \\
&=-2 \sum y_{i}+2 n \hat{α}, \text { using } \sum x_{i}=0 .
\end{aligned}The derivative is 0 when $\hat{\alpha}=\bar{y}$. Substituting in, we get\begin{aligned}\frac{\mathrm{d}}{\mathrm{d} \hat{\beta}} \sum_{i=1}^{n}\left(y_{i}-\bar{y}-\hat{\beta} x_{i}\right)^{2} &=-\sum_{i=1}^{n} 2 x_{i}\left(y_{i}-\bar{y}-\hat{\beta} x_{i}\right) \\&=-2 \sum x_{i} y_{i}+2 \hat{\beta} \sum x_{i}^{2},\end{aligned}again using $\sum x_{i}=0$, which is 0 when$$\hat{\beta}=\frac{\sum x_{i} y_{i}}{\sum x_{i}^{2}} .$$(The second partial derivatives are both positive, and the mixed partial is 0 , so this point is indeed a minimum, but it is not necessary to say so.) We have\begin{aligned}\mathbb{E}[\hat{\alpha}] &=\mathbb{E}[\bar{y}] \\&=\frac{1}{n} \sum \mathbb{E}\left[\alpha+\beta x_{i}+\epsilon_{i}\right] \\&=\alpha+\frac{\beta}{n} \sum x_{i}+\frac{1}{n} \sum \mathbb{E}\left[\epsilon_{i}\right] \\&=\alpha \\\mathbb{E}[\hat{\beta}] &=\mathbb{E}\left[\left(\sum x_{i}^{2}\right)^{-1} \sum x_{i} Y_{i}\right] \\&=\left(\sum x_{i}^{2}\right)^{-1} \sum \mathbb{E}\left[x_{i} \alpha+x_{i} \beta x_{i}+x_{i} \epsilon_{i}\right] \\&=\left(\sum x_{i}^{2}\right)^{-1}\left(\alpha \sum x_{i}+\beta \sum x_{i}^{2}\right)=\beta\end{aligned}
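A numerical check with made-up data satisfying $\sum x_i=0$ (hypothetical values, for illustration only) confirms that the shortcut formulas agree with the general centred least-squares formulas:

```python
# Illustrative data with sum(x) = 0, roughly following y = 2 + x.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [0.1, 1.2, 1.9, 3.1, 4.2]
assert abs(sum(xs)) < 1e-12

n = len(xs)
alpha_hat = sum(ys) / n                                        # = mean(y)
beta_hat = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# General centred formulas (should coincide since x-bar = 0)
xbar, ybar = sum(xs) / n, sum(ys) / n
beta_gen = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
           sum((x - xbar) ** 2 for x in xs)
alpha_gen = ybar - beta_gen * xbar
assert abs(beta_hat - beta_gen) < 1e-12
assert abs(alpha_hat - alpha_gen) < 1e-12
print(alpha_hat, beta_hat)
```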
- (i) In $k$-means clustering we group the data into a predetermined number $k$ of clusters. In hierarchical clustering we do not pre-specify the number of clusters: we build a nested hierarchy (dendrogram) that can be cut at any level to give a clustering.
(ii)- Randomly assign the observations to one of the clusters $C_{1}, \ldots, C_{k}$.
- Iterate until the cluster assignments stop changing:
- Compute the cluster means$$\mu_{j}=\frac{1}{\left|C_{j}\right|} \sum_{i \in C_{j}}\left(x_{i 1}, \ldots, x_{i p}\right) .$$
- Reassign observations to the cluster whose mean is closest, where closest is defined by Euclidean distance.The objective function to be minimised is$$\sum_{j=1}^{k} \sum_{i \in C_{j}} \sum_{\ell=1}^{p}\left(x_{i \ell}-\mu_{j \ell}\right)^{2}$$
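The loop described above can be sketched as runnable code (a minimal illustration; the fixed initial assignment replaces the usual random one, for reproducibility):

```python
# Minimal k-means sketch following the steps above.
def kmeans(data, k, init):
    assign = list(init)
    while True:
        # compute the mean of each non-empty cluster
        means = {}
        for j in range(k):
            members = [x for x, a in zip(data, assign) if a == j]
            if members:
                means[j] = tuple(sum(c) / len(members) for c in zip(*members))
        # reassign each observation to the nearest cluster mean (Euclidean)
        new_assign = []
        for x in data:
            new_assign.append(min(
                means,
                key=lambda j: sum((xi - mi) ** 2 for xi, mi in zip(x, means[j])),
            ))
        if new_assign == assign:      # assignments stopped changing
            return assign, means
        assign = new_assign

# two well-separated toy clusters
data = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
assign, means = kmeans(data, 2, init=[0, 1, 0, 1, 0, 1])
assert assign[0] == assign[1] == assign[2]
assert assign[3] == assign[4] == assign[5]
assert assign[0] != assign[3]
print(assign)  # → [0, 0, 0, 1, 1, 1]
```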
(iii) No. You can always reduce the objective function by increasing the number of clusters, until it reaches 0 at $k=n$.
(iv) No. The random assignment at the beginning can lead to different local minima of the objective function, so the two runs need not produce identical clusters.- [5 marks] $X_{1}, \ldots, X_{n}$ are independent samples from a distribution with unknown mean $\mu$ and unknown variance $\sigma^{2}$. We estimate the variance by$$\tilde{\sigma}^{2}=\frac{1}{n} \sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)^{2}$$where $\bar{X}=\left(X_{1}+⋯+X_{n}\right) / n$ is the sample mean. Compute the bias.
[You may wish to use the expansion $\left(X_{i}-\bar{X}\right)^{2}=\left(\left(X_{i}-\mu\right)+(\mu-\bar{X})\right)^{2}$.] - 100 unfit older men are evaluated for four fitness characteristics: systolic blood pressure (in mmHg), diastolic blood pressure (in mmHg), aerobic capacity (in ml/min/kg), and body fat fraction. The plan is to summarise each individual's condition with principal components that explain at least $75 \%$ of the total variance, and to use them as statistical predictors for the subjects' improvement in running speed over a four-week training course. The sample variance-covariance matrix $S$ is calculated to be\begin{array}{ccccc} & \text { Systolic } & \text { Diastolic } & \text { aerobic } & \text { body fat fraction } \\ \text { Systolic } & 121 & 81 & 22 & 0.11 \\ \text { Diastolic } & 81 & 81 & 18 & 0.09 \\ \text { aerobic } & 22 & 18 & 25 & 0.10 \\ \text { body fat fraction } & 0.11 & 0.09 & 0.10 & 0.0016\end{array}(i) [3 marks] The measurement of characteristic $j(j=1,2,3,4)$ for individual $i\ (i=1, \ldots, 100)$ is $x_{i j}$. Explain how $S$ was computed.
(ii) [3 marks] Compute the sample correlation matrix.
(iii) [2 marks] Two different researchers, who disagree about whether or not to standardise the variables, each compute the principal components. One of them computes the loadings on the first principal component to be $(0.56,0.56,0.48,0.39)$; the other computes loadings $(0.78,0.61,0.17,0.0008)$. Which one came from the standardised variables? Explain your answer.
(iv) [2 marks] State two reasons why the loadings based on the standardised variables should be preferred in this case.
(v) [3 marks] It is decided to work with principal components based on the standardised variables. The correlation matrix has eigenvalues $2.32,1.04,0.48,0.16$. Make a scree plot.
(vi) [2 marks] Based on the researchers' goals, as stated above, how many principal components should they use to summarise individuals' condition? Explain your answer.
Solution.
- We know that $X_i$ has mean $µ$ and variance $σ^2$; $\overline X$ has mean $µ$ and variance $\sigma^{2} / n ;$ and $\mu-\bar{X}=\frac{1}{n} \sum_{j=1}^{n}\left(\mu-X_{j}\right)$, so that$$\mathbb{E}\left[\left(X_{i}-\mu\right)(\mu-\bar{X})\right]=-\frac{1}{n} \sum_{j=1}^{n} \mathbb{E}\left[\left(X_{i}-\mu\right)\left(X_{j}-\mu\right)\right]=-\frac{1}{n} \mathbb{E}\left[\left(X_{i}-\mu\right)^2\right]=-\frac{\sigma^{2}}{n}$$Thus\begin{aligned}\mathbb{E}\left[\tilde{\sigma}^{2}\right] &=\frac{1}{n} \sum \left(\mathbb{E}\left[\left(X_{i}-\mu\right)^{2}\right]+\mathbb{E}\left[(\mu-\bar{X})^{2}\right]+2 \mathbb{E}\left[(\mu-\bar{X})\left(X_{i}-\mu\right)\right]\right)\\&=\frac{1}{n} \sum\left(\sigma^{2}+\frac{\sigma^{2}}{n}+2 \cdot-\frac{\sigma^{2}}{n}\right) \\&=\frac{n-1}{n} \sigma^{2}\end{aligned}The bias is $-\sigma^{2} / n$.
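The identity $\mathbb{E}[\tilde\sigma^2]=\frac{n-1}{n}\sigma^2$ can be checked exactly for a small discrete example (fair-coin samples with $n=4$, $\sigma^2=1/4$; an illustrative Python sketch):

```python
from fractions import Fraction
from itertools import product

# Enumerate all 2^n equally likely outcomes of X_i in {0, 1} (mu = 1/2, sigma^2 = 1/4)
# and compute E[sigma_tilde^2] exactly.
n = 4
sigma2 = Fraction(1, 4)
total = Fraction(0)
for xs in product([0, 1], repeat=n):
    xbar = Fraction(sum(xs), n)
    total += sum((Fraction(x) - xbar) ** 2 for x in xs) / n
e_sigma_tilde2 = total / 2**n

assert e_sigma_tilde2 == Fraction(n - 1, n) * sigma2
print(e_sigma_tilde2)  # 3/16
```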
- (i)$$S_{j k}=\frac{1}{n-1} \sum_{i=1}^{n}\left(x_{i j}-\mu_{j}\right)\left(x_{i k}-\mu_{k}\right)$$where $\mu_{j}=\frac{1}{n}\sum_{i=1}^{n} x_{i j}$ is the sample mean of variable $j$.
(ii) $R_{i j}=S_{ij}/ \sqrt{S_{i i} S_{j j}}$. So the sample correlation matrix $R$ is\begin{array}{ccccc} & \text { Systolic } & \text { Diastolic } & \text { aerobic } & \text { body fat fraction } \\ \text { Systolic } & 1 & 0.82 & 0.4 & 0.25 \\ \text { Diastolic } & 0.82 & 1 & 0.4 & 0.25 \\ \text { aerobic } & 0.4 & 0.4 & 1 & 0.5 \\ \text { body fat fraction } & 0.25 & 0.25 & 0.5 & 1\end{array}(iii) The first set of loadings, $(0.56,0.56,0.48,0.39)$, came from the standardised variables. Without standardisation the first principal component emphasises the variables with the highest variance; in particular, the second set has a very small loading on body fat fraction, the variable with by far the smallest variance.
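As a numerical cross-check of the correlation matrix in (ii) (an illustrative Python sketch):

```python
import math

# Recompute R_{ij} = S_{ij} / sqrt(S_{ii} S_{jj}) from the given S.
S = [
    [121, 81, 22, 0.11],
    [81, 81, 18, 0.09],
    [22, 18, 25, 0.10],
    [0.11, 0.09, 0.10, 0.0016],
]
R = [[S[i][j] / math.sqrt(S[i][i] * S[j][j]) for j in range(4)] for i in range(4)]

expected = [   # values as quoted in the solution, to 2 decimal places
    [1, 0.82, 0.4, 0.25],
    [0.82, 1, 0.4, 0.25],
    [0.4, 0.4, 1, 0.5],
    [0.25, 0.25, 0.5, 1],
]
for i in range(4):
    for j in range(4):
        assert abs(R[i][j] - expected[i][j]) < 0.005
print("correlation matrix matches to 2 dp")
```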
(iv) Since the variables are incommensurable — they measure different things — there is no way to compare their scales, except relative to their own SDs. Also, changing the units of measurement will produce completely different principal components, which is undesirable, given that the units are arbitrary.
(v) (Scree plot: the eigenvalues $2.32, 1.04, 0.48, 0.16$ plotted in decreasing order against component number.)
(vi) The first two components together account for more than 75% of the variance in the data, so two components suffice.