### Summary

One can imagine least squares regression as solving $$Ax=b$$ where $$b$$ is (generally) not in the columnspace of $$A$$. To handle this, one projects $$b$$ into the columnspace of $$A$$ and solves $$Ax=p$$. Note the error of the projection ($$e=b-p$$) is orthogonal to $$A^T$$ and is thus in the nullspace of $$A^T$$.

• Starts with a 2D example of projecting a vector onto another vector.
• The projection $$p$$ of vector $$b$$ onto vector $$a$$
• The projection is $$a$$ scaled by $$x$$: $$p=xa$$
• Error is difference of projection $$p$$ and projected vector $$b$$: $$b-xa$$
• Error should be orthogonal to $$a$$: $$a^T(b-xa) = 0$$
• Solve for $$x$$: $$x = \frac{a^Tb}{a^Ta}$$
• Substitute back into $$p$$: $$p=a\frac{a^Tb}{a^Ta}$$
• The projection matrix is $$\text{proj}(p)=Pb=\frac{aa^T}{a^Ta}$$
• The column space of the projection matrix is a line through $$a$$
• The rank of the projection matrix is 1
• The projection matrix is symmetric: $$P^T=P$$
• Projecting more than once will give you same result: $$P^2=P$$
• Why project?
• Because $$Ax=b$$ may have no solution so we want to solve the closest problem with a solution.
• So we solve the closest vector in the columnspace to $$b$$ ($$Ax=p$$) where $$p$$ is in the columnspace of $$A$$.
• You are given a plane defined by a basis $$a_1$$ and $$a_2$$:
• $$A$$ is the matrix where $$a_1$$ is column 1 and $$a_2$$ is column 2
• We will solve $$Ax=b$$ by projecting $$b$$ into the columnspace of $$A$$
• Error is the difference between $$b$$ and projection $$p$$: $$e=b-p$$
• The projection $$p$$ is some multiple of the basis:

$p = \hat{x}_1a_1 + \hat{x}_2a_2 = A\hat{x}$
• The above projection defines two equations:
$a^T_1(b-A\hat{x}) = 0$ $a^T_2(b-A\hat{x}) = 0$
• Combines into: $$A^T(b-A\hat{x}) = 0$$
• Note the error is in the nullspace of $$A^T$$ by the above equation
• error is perpendicular to the columnspace of $$A$$
• Rewrite equation: $$A^TA\hat{x} = A^Tb$$
• Solve for $$\hat{x}$$: $$\hat{x} = (A^TA)^{-1}A^Tb$$

$p = A\hat{x} = A(A^TA)^{-1}A^Tb$
• You generally can’t distribute the inverse above because $$A$$ isn’t square
• If $$A$$ is invertible, then $$b$$ is in the columnspace and the projection matrix is the identity matrix.

• You can conceptualize least squares as a projection problem.
• You are given a bunch of data points that lie close to a line
• e.g. you are given (1,1), (2,2), and (3,2). What is the line that minimizes error?
• Find the best line $$b=C+Dt$$, so we’re solving the equations:

$C+D=1$ $C+2D=2$ $C+3D=2$
• In matrix form, this is:
$Ax = b$ $\begin{bmatrix} \begin{array}{rr} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ \end{array} \end{bmatrix} * \begin{bmatrix} \begin{array}{r} C \\ D \\ \end{array} \end{bmatrix} = \begin{bmatrix} \begin{array}{r} 1 \\ 2 \\ 2 \\ \end{array} \end{bmatrix}$
• Now we project $$b$$ into the columnspace of $$A$$ and solve.