- This is part of probstat.
In this section, we shall discuss linear regression, focusing on one-variable (simple) linear regression.
Model
We consider two variables $x$ and $y$, where $y$ is a function of $x$. We refer to $x$ as the independent or input variable, and $y$ as the dependent variable. We consider a linear relationship between the independent variable and the dependent variable. We assume that there exist hidden parameters $\beta_0$ and $\beta_1$ such that

$$y = \beta_0 + \beta_1 x + \epsilon,$$

where $\epsilon$ is a random error. We further assume that the error is unbiased, i.e., $E[\epsilon] = 0$, and is independent of $x$.
Input: As an input to the regression process, we are given a set of $n$ data points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ generated from the previous equation.
Goal: We want to estimate $\beta_0$ and $\beta_1$.
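To make the setup concrete, the following sketch generates such a dataset; the parameter values, noise level, and variable names are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "hidden" parameters, chosen only for this example.
beta0, beta1 = 2.0, 0.5
sigma = 1.0          # standard deviation of the random error
n = 50               # number of data points

x = np.linspace(0, 10, n)              # fixed (non-random) inputs
eps = rng.normal(0.0, sigma, size=n)   # unbiased error, independent of x
y = beta0 + beta1 * x + eps            # the model: y = beta0 + beta1*x + error
```

The regression procedure below sees only the pairs $(x_i, y_i)$ and tries to recover $\beta_0$ and $\beta_1$.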
The least squares estimators
Denote our estimate for $\beta_0$ as $b_0$ and our estimate for $\beta_1$ as $b_1$. Using these estimates, the error at data point $(x_i, y_i)$ is $y_i - (b_0 + b_1 x_i)$. We focus on the sum of squared errors, i.e.,

$$SS = \sum_{i=1}^{n} \left(y_i - b_0 - b_1 x_i\right)^2.$$
The method of least squares uses the parameters that minimize the squared errors as estimators. Therefore, we want to find $b_0$ and $b_1$ that minimize the sum of squared errors $SS$. To do so, we partially differentiate $SS$ with respect to $b_0$ and $b_1$:

$$\frac{\partial SS}{\partial b_0} = -2\sum_{i=1}^{n} \left(y_i - b_0 - b_1 x_i\right) \qquad \text{(Eq1)}$$

$$\frac{\partial SS}{\partial b_1} = -2\sum_{i=1}^{n} x_i \left(y_i - b_0 - b_1 x_i\right) \qquad \text{(Eq2)}$$
We set these two derivatives to zero to find the minimum, and obtain two equations that we have to solve.
Before solving these two equations, let's define $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$.
We start by rewriting the first equation (Eq1) as

$$b_0 = \bar{y} - b_1\bar{x}, \qquad \text{(Eq3)}$$

and substitute it into (Eq2) to get

$$\sum_{i=1}^{n} x_i y_i - (\bar{y} - b_1\bar{x})\,n\bar{x} - b_1\sum_{i=1}^{n} x_i^2 = 0.$$

With some calculation, we get

$$b_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}.$$

To find $b_0$, we can just use equation (Eq3).
Estimated regression parameters. Using the least squares method, we obtain the following estimates:

$$b_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}, \quad\text{and}\quad b_0 = \bar{y} - b_1\bar{x}.$$
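These closed-form estimates translate directly into code. A minimal sketch (the function and variable names are mine, not part of the text):

```python
import numpy as np

def least_squares(x, y):
    """Return (b0, b1) minimizing the sum of squared errors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    # b1 = (sum x_i y_i - n*xbar*ybar) / (sum x_i^2 - n*xbar^2)
    b1 = (np.sum(x * y) - n * xbar * ybar) / (np.sum(x ** 2) - n * xbar ** 2)
    # b0 follows from (Eq3): b0 = ybar - b1*xbar
    b0 = ybar - b1 * xbar
    return b0, b1

# Points lying exactly on the line y = 1 + 2x are recovered exactly.
b0, b1 = least_squares([0, 1, 2, 3], [1, 3, 5, 7])
# b0 = 1.0, b1 = 2.0
```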
Distribution of regression parameters
Although the estimators $b_0$ and $b_1$ are least-squares estimators, we are not sure if they are good estimators. In this section, we shall discuss various properties of these estimators.

We make an assumption on the error, namely that it is normally distributed with mean 0 and variance $\sigma^2$. Therefore, $\epsilon \sim N(0, \sigma^2)$.
We shall start with $b_1$. First, note that the $x_i$'s are inputs and are not random. Therefore, if we look at the formula for $b_1$, we see that $b_1$ is actually a weighted sum of the independent normal random variables $y_1, \ldots, y_n$. This implies that $b_1$ is a normal random variable. If we can find its mean and its variance, we have complete information about the distribution of $b_1$.
We can calculate the expectation and variance of $b_1$ as follows:

$$E[b_1] = \beta_1,$$

$$\mathrm{Var}(b_1) = \frac{\sigma^2}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}.$$
We can also calculate the expectation and variance of $b_0$:

$$E[b_0] = \beta_0,$$

$$\mathrm{Var}(b_0) = \frac{\sigma^2 \sum_{i=1}^{n} x_i^2}{n\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)}.$$
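These formulas can be checked empirically: simulate many datasets from the model with fixed inputs and compare the sample mean and variance of $b_1$ against the theory. A rough sketch with arbitrary parameter values:

```python
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1, sigma = 1.0, 2.0, 1.0
x = np.linspace(0, 1, 20)                    # fixed inputs
n = len(x)
denom = np.sum(x**2) - n * x.mean()**2       # sum x_i^2 - n*xbar^2

estimates = []
for _ in range(20000):
    y = beta0 + beta1 * x + rng.normal(0, sigma, n)
    b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / denom
    estimates.append(b1)

emp_mean = np.mean(estimates)   # should be close to beta1
emp_var = np.var(estimates)     # should be close to sigma^2 / denom
```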
Statistical tests on regression parameters
We focus on how to test the null hypothesis $H_0\colon \beta_1 = 0$, i.e., that $y$ does not depend on $x$.
Since $b_1$ is normal with mean $\beta_1$ and variance $\sigma^2 / \left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)$, we know that the statistic

$$Z = \frac{b_1 - \beta_1}{\sigma \big/ \sqrt{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}}$$

is unit normal, and it is possible to perform various statistical tests on $\beta_1$ based on the estimated value $b_1$ if we know the parameter $\sigma$. However, usually, we do not.
We end up with a situation similar to when we perform sampling on populations with unknown variances. Another key quantity in this case is the sum of squares of the residuals:

$$SS_R = \sum_{i=1}^{n} \left(y_i - b_0 - b_1 x_i\right)^2.$$

Note that if we substitute $b_0$ with $\beta_0$ and $b_1$ with $\beta_1$, the term $y_i - \beta_0 - \beta_1 x_i$ is exactly the error (which is normally distributed with mean 0 and variance $\sigma^2$). This motivates the fact that $SS_R$ can be used to estimate $\sigma^2$:

$$E\left[\frac{SS_R}{n-2}\right] = \sigma^2.$$
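A quick simulation (parameter values are made up) illustrates that the expectation of $SS_R$ is indeed $(n-2)\sigma^2$ rather than $n\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(2)
beta0, beta1, sigma = 1.0, 2.0, 1.5
x = np.linspace(0, 1, 10)
n = len(x)

vals = []
for _ in range(20000):
    y = beta0 + beta1 * x + rng.normal(0, sigma, n)
    xbar, ybar = x.mean(), y.mean()
    b1 = (np.sum(x * y) - n * xbar * ybar) / (np.sum(x**2) - n * xbar**2)
    b0 = ybar - b1 * xbar
    vals.append(np.sum((y - b0 - b1 * x) ** 2))   # SS_R for this dataset

mean_ssr = np.mean(vals)   # close to (n - 2) * sigma^2, not n * sigma^2
```

Intuitively, two degrees of freedom are "used up" by fitting $b_0$ and $b_1$, which is why the normalization is $n-2$.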
Moreover, it can be shown that

$$\frac{SS_R}{\sigma^2} \sim \chi^2_{n-2},$$

and that $SS_R$ is independent of $b_1$. These two facts imply that

$$T = \frac{b_1 - \beta_1}{\sqrt{\dfrac{SS_R/(n-2)}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}}} \sim t_{n-2},$$

where $t_{n-2}$ denotes the t-distribution with $n-2$ degrees of freedom.
Therefore, if we want to test the hypothesis that $\beta_1 = 0$, we can check whether the statistic

$$b_1\sqrt{\frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{SS_R/(n-2)}}$$

deviates far enough from what the t-distribution with $n-2$ degrees of freedom would predict.
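The whole test can be carried out in a few lines. A sketch (the helper name is mine) that computes this statistic for $H_0\colon \beta_1 = 0$ on a small hand-made dataset:

```python
import numpy as np

def t_statistic(x, y):
    """T = b1 / sqrt( (SS_R/(n-2)) / (sum x_i^2 - n*xbar^2) ) for H0: beta1 = 0."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    denom = np.sum(x**2) - n * xbar**2
    b1 = (np.sum(x * y) - n * xbar * ybar) / denom
    b0 = ybar - b1 * xbar
    ss_r = np.sum((y - b0 - b1 * x) ** 2)   # sum of squared residuals
    s2 = ss_r / (n - 2)                     # estimate of sigma^2
    return b1 / np.sqrt(s2 / denom)

T = t_statistic([0, 1, 2, 3, 4], [0, 1, 2, 3, 5])
# Here |T| = 6*sqrt(3) ≈ 10.39, far above the 5%-level critical value of the
# t-distribution with n-2 = 3 degrees of freedom (about 3.18), so we would
# reject H0: beta1 = 0.
```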
Notes: This is fairly similar to the use of the t-distribution for the sample mean when the variance of the population is unknown, where the quantity $SS_R/(n-2)$ acts as the sample variance $s^2$ in that case.