Our basic linear regression model is:

$$y = \beta_0 + \beta_1 x + \epsilon$$
We wish to minimise the residual sum of squares (I drop the summation limits for clarity throughout):

$$\mathrm{RSS} = \sum_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$$
Some calculus gives us the following solutions:

$$\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
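These closed-form solutions are easy to check numerically. A minimal sketch in Python/NumPy (the synthetic data and variable names here are illustrative, not from the text):

```python
import numpy as np

# Synthetic data from a known model: y = 2 + 3x + noise.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=100)

# Closed-form least-squares estimates.
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
```

The estimates should land close to the true values of 2 and 3.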
The (squared) standard errors for our parameters are given by:

$$\mathrm{SE}(\hat{\beta}_0)^2 = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_i (x_i - \bar{x})^2} \right], \qquad \mathrm{SE}(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_i (x_i - \bar{x})^2}$$

(assuming our errors are uncorrelated, with common variance $\sigma^2$).
Although we don't know the exact value of $\sigma$, we can estimate it using the residual standard error:

$$\mathrm{RSE} = \sqrt{\frac{\mathrm{RSS}}{n - 2}}$$
Roughly, this measures the average amount that responses deviate from the true regression line - i.e. the lack of fit.
We can use these standard error estimates to compute confidence intervals. If we assume a Gaussian distribution for our residuals, we can say, for instance, that the 95% confidence interval for each parameter is given (approximately) by:

$$\hat{\beta} \pm 2 \cdot \mathrm{SE}(\hat{\beta})$$
Given null hypothesis $H_0: \beta_1 = 0$ (X and Y are not correlated) and alternative hypothesis $H_a: \beta_1 \neq 0$ (X and Y are correlated), we can compute a t-statistic:

$$t = \frac{\hat{\beta}_1 - 0}{\mathrm{SE}(\hat{\beta}_1)}$$
This measures the number of standard deviations $\hat{\beta}_1$ is away from 0. We can compare this value to a t-distribution with $n - 2$ degrees of freedom to calculate a p-value, which enables us to reject the null hypothesis or not.
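Putting the last few steps together, here is a sketch of the standard-error, confidence-interval and t-statistic computations in Python/NumPy (the data and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)  # true beta0 = 1, beta1 = 2

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

resid = y - (beta0 + beta1 * x)
rse = np.sqrt(np.sum(resid ** 2) / (n - 2))            # residual standard error
se_beta1 = rse / np.sqrt(np.sum((x - x.mean()) ** 2))  # SE(beta1)

ci = (beta1 - 2 * se_beta1, beta1 + 2 * se_beta1)  # ~95% confidence interval
t_stat = beta1 / se_beta1  # standard deviations away from 0 under H0
```

The p-value would then come from comparing `t_stat` against a t-distribution with $n - 2$ degrees of freedom.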
The RSE is a useful measure of fit, but is in the units of Y. The $R^2$ statistic addresses this. It measures the proportion of the variance of $y$ explained by the model:

$$R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}, \qquad \mathrm{TSS} = \sum_i (y_i - \bar{y})^2$$

The closer $R^2$ is to 1, the better.
In the simple linear regression case, this is equivalent to the squared correlation (covariance over the product of the individual standard deviations), but unlike correlation it also extends to the multivariable case (see below).
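This equivalence is easy to verify numerically (Python/NumPy sketch with synthetic data):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=50)
y = 1.5 * x + rng.normal(size=50)

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

rss = np.sum((y - beta0 - beta1 * x) ** 2)
tss = np.sum((y - y.mean()) ** 2)
r2 = 1 - rss / tss  # proportion of variance explained

corr = np.corrcoef(x, y)[0, 1]  # should satisfy r2 == corr**2
```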
Our multiple linear regression model now becomes:

$$y = \mathbf{w}^\top \mathbf{x} + b + \epsilon$$
The standard approach here omits the bias term and simply adds a column of 1s to the input data $X$. This makes for a simple solution: $\hat{\mathbf{w}} = (X^\top X)^{-1} X^\top \mathbf{y}$. However, I feel this glosses over what's actually going on here with both $\mathbf{w}$ and $b$—as the following explanation will demonstrate.
Given training data $(X, \mathbf{y})$, the maximum likelihood estimator can be shown to minimise the RSS:

$$\mathrm{RSS}(\mathbf{w}, b) = \lVert \mathbf{y} - X\mathbf{w} - b\mathbf{1} \rVert^2$$
We can solve this by differentiating with respect to our weights/bias, setting the gradient/derivative to 0, and solving.
A quick note on notation here: we will use $\bar{\mathbf{x}}$ and $\bar{y}$ to represent the means of the $\mathbf{x}_i$ and $y_i$. We also have versions of these broadcasted to a matrix and a vector respectively: $\bar{X}$ (each row equal to $\bar{\mathbf{x}}^\top$) and $\bar{\mathbf{y}}$ (each entry equal to $\bar{y}$).
We begin with $b$:

$$\frac{\partial \mathrm{RSS}}{\partial b} = -2\,\mathbf{1}^\top(\mathbf{y} - X\mathbf{w} - b\mathbf{1}) = 0$$

Solving for $b$:

$$b = \bar{y} - \bar{\mathbf{x}}^\top \mathbf{w}$$

For $\mathbf{w}$ we have:

$$\frac{\partial \mathrm{RSS}}{\partial \mathbf{w}} = -2\,X^\top(\mathbf{y} - X\mathbf{w} - b\mathbf{1}) = 0$$

Solving for $\mathbf{w}$:

$$\mathbf{w} = (X^\top X)^{-1} X^\top (\mathbf{y} - b\mathbf{1})$$
We now have expressions for $\mathbf{w}$ and $b$ in terms of each other. We can eliminate $b$ to give expressions for each in terms of just the training data:

$$\mathbf{w} = \left(X^\top (X - \bar{X})\right)^{-1} X^\top (\mathbf{y} - \bar{\mathbf{y}}), \qquad b = \bar{y} - \bar{\mathbf{x}}^\top \mathbf{w}$$
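We can check this closed form numerically against the usual column-of-ones formulation. A Python/NumPy sketch (synthetic data, my own names; it uses the centred normal equations, which are mathematically equivalent):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 100, 3
true_w, true_b = np.array([1.0, -2.0, 0.5]), 4.0
X = rng.normal(size=(n, d))
y = X @ true_w + true_b + rng.normal(scale=0.1, size=n)

# Closed form via centred data: w from the centred normal equations,
# b recovered from the means.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
w = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)
b = y.mean() - X.mean(axis=0) @ w

# Reference: append a column of ones and solve in one go.
X1 = np.hstack([X, np.ones((n, 1))])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
```

Both routes should recover the same weights and bias.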
This gives us a closed-form solution for the weights and bias of our model. The bias appears to have an intuitive interpretation, but not so the weights. Fortunately we can make more sense of this with help from something called the centring matrix.
The centring matrix simply subtracts the mean of a vector's components from each of its values:

$$C_n = I_n - \tfrac{1}{n} J_n$$

Where $J_n$ is the $n \times n$ matrix of ones.
When multiplied with a matrix such as our data matrix $X$, it performs the same operation on each vector independently. Typically, we will want to left-multiply, as in $C_n X$, which results in each column (i.e. dimension) having its mean subtracted. Right-multiplying does the same for our rows, although this is of less use for our purposes.
The centring matrix demonstrates one very curious and useful property in linear algebra: for matrix multiplication, centring one matrix is the same as centring both. Since $C_n$ is symmetric ($C_n^\top = C_n$) and idempotent ($C_n^2 = C_n$):

$$(C_n A)^\top (C_n B) = A^\top C_n^\top C_n B = A^\top C_n B = A^\top (C_n B)$$
This also gives us a succinct way of representing the singly/doubly-centred (they are the same) matrix multiplication: $A^\top C_n B$.
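A quick numerical confirmation of these centring identities (Python/NumPy sketch; the matrices are arbitrary random data):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6
A = rng.normal(size=(n, 3))
B = rng.normal(size=(n, 2))

C = np.eye(n) - np.ones((n, n)) / n  # centring matrix

singly = A.T @ (C @ B)        # centre only B
doubly = (C @ A).T @ (C @ B)  # centre both A and B
```

Because `C` is symmetric and idempotent, `singly` and `doubly` are identical (up to floating-point error).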
Using the rules demonstrated above for the centring matrix, we can reformulate our solution for $\mathbf{w}$:

$$\mathbf{w} = \left((C_n X)^\top (C_n X)\right)^{-1} (C_n X)^\top (C_n \mathbf{y})$$
This is just the centred version of the solution we typically use when there is no bias term!
This is an important realisation, as it allows us to interpret what $\mathbf{w}$ represents. We can think of $\mathbf{w}$ as fitting a non-affine line (i.e. it passes through the origin) to an altered version of our dataset which is centred such that $\bar{X} = 0$ and $\bar{\mathbf{y}} = \mathbf{0}$, i.e. the means of our dataset on each axis lie at the origin. This centring means that a bias term is not required.
The role of the bias is then to un-centre our predictions. Examining our overall model we have:

$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b = \mathbf{w}^\top \mathbf{x} + \bar{y} - \mathbf{w}^\top \bar{\mathbf{x}} = \mathbf{w}^\top (\mathbf{x} - \bar{\mathbf{x}}) + \bar{y}$$
We see how the bias term changes our prediction to first centre the input $\mathbf{x}$, and then un-centre the result by adding back the $\bar{y}$ term.
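To make this concrete, a small sketch (Python/NumPy, synthetic data) confirming the two views of prediction agree:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0]) + 3.0 + rng.normal(scale=0.1, size=50)

x_mean, y_mean = X.mean(axis=0), y.mean()
Xc = X - x_mean
w = np.linalg.solve(Xc.T @ Xc, Xc.T @ (y - y_mean))
b = y_mean - x_mean @ w

x_new = np.array([0.3, -0.7])
pred_direct = w @ x_new + b                   # w^T x + b
pred_centred = w @ (x_new - x_mean) + y_mean  # centre, predict, un-centre
```

The two predictions coincide because $b = \bar{y} - \mathbf{w}^\top \bar{\mathbf{x}}$.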
Using the same setup as above, with the null hypothesis assuming all $\beta_j = 0$, we now compute the test using the F-statistic:

$$F = \frac{(\mathrm{TSS} - \mathrm{RSS})/p}{\mathrm{RSS}/(n - p - 1)}$$
This value is close to 1 when there is no relationship between X and Y, and greater than 1 otherwise.
Again we can use the F-distribution to calculate p-values.
If we wish to test a subset of the parameters, we can repeat this using a second model fitted without those features, comparing its RSS against that of the full model.
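The overall F-statistic above can be sketched as follows (Python/NumPy, synthetic data; names are my own):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.5, -1.5]) + rng.normal(size=n)

# Fit the full model via the column-of-ones formulation.
X1 = np.hstack([np.ones((n, 1)), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)

rss = np.sum((y - X1 @ coef) ** 2)
tss = np.sum((y - y.mean()) ** 2)

f_stat = ((tss - rss) / p) / (rss / (n - p - 1))
```

Since this data has a genuine relationship, `f_stat` comes out far above 1; the p-value would come from the F-distribution with $(p, n - p - 1)$ degrees of freedom.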