Simple Linear Regression

Assumes there is approximately a linear relationship between $X$ and $Y$:

$$Y \approx \beta_0 + \beta_1 X$$

$\beta_0$ and $\beta_1$ are known as coefficients or parameters.
Use training data to produce estimates $\hat{\beta}_0$ and $\hat{\beta}_1$, then predict future response values by:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$
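
As a minimal sketch of prediction in NumPy (the coefficient estimates and new $x$ values below are made up for illustration, not taken from any dataset):

```python
import numpy as np

# Hypothetical coefficient estimates (made-up values, for illustration only).
beta0_hat, beta1_hat = 2.0, 0.5

# Predict responses for new X values via y_hat = beta0_hat + beta1_hat * x.
x_new = np.array([1.0, 3.0, 10.0])
y_hat = beta0_hat + beta1_hat * x_new
print(y_hat)  # [2.5 3.5 7. ]
```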

Estimating Coefficients

Let

$$(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$$

represent $n$ observation pairs, each of which consists of a measurement of $X$ and a measurement of $Y$.
Measure the closeness of the fit using least squares (one of several possible ways to measure closeness).
Let $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ be the prediction for $Y$ based on the $i$th value of $X$. Then $e_i = y_i - \hat{y}_i$ represents the $i$th residual: the difference between the $i$th observed response value and the $i$th predicted response value. We define the residual sum of squares (RSS) as

$$\mathrm{RSS} = e_1^2 + e_2^2 + \cdots + e_n^2$$

or

$$\mathrm{RSS} = (y_1 - \hat{\beta}_0 - \hat{\beta}_1 x_1)^2 + (y_2 - \hat{\beta}_0 - \hat{\beta}_1 x_2)^2 + \cdots + (y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n)^2$$
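
A small sketch of how the residuals and the RSS are computed; the toy data and coefficient estimates here are made up for illustration:

```python
import numpy as np

# Toy data and hypothetical coefficient estimates (made up for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.3, 2.9, 3.6, 4.2])
beta0_hat, beta1_hat = 1.6, 0.65

# Residuals e_i = y_i - y_hat_i and the residual sum of squares.
y_hat = beta0_hat + beta1_hat * x
e = y - y_hat
rss = np.sum(e**2)
print(e, rss)
```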

Least squares chooses $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimize the RSS. The minimizers are:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

where $\bar{y} \equiv \frac{1}{n} \sum_{i=1}^{n} y_i$ and $\bar{x} \equiv \frac{1}{n} \sum_{i=1}^{n} x_i$ are the sample means.
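
A minimal sketch of the closed-form least squares estimates on synthetic data; the data-generating line $y = 2.0 + 0.5x$ plus noise is an assumption chosen only for this demo:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: y = 2.0 + 0.5*x + noise (true values chosen for the demo).
n = 100
x = rng.uniform(0.0, 10.0, size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)

# Closed-form least squares estimates.
x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

print(beta0_hat, beta1_hat)  # should land near 2.0 and 0.5
```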

Assessing Coefficient Accuracy

Standard errors of $\hat{\beta}_0$ and $\hat{\beta}_1$:

$$\mathrm{SE}(\hat{\beta}_0)^2 = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right), \qquad \mathrm{SE}(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

where $\sigma^2 = \mathrm{Var}(\epsilon)$, assuming the errors $\epsilon_i$ have a common variance $\sigma^2$ and are uncorrelated.
Estimate $\sigma$ using the residual standard error:

$$\mathrm{RSE} = \sqrt{\mathrm{RSS}/(n-2)}$$
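
A sketch of how the RSE and the standard errors could be computed, with $\sigma$ replaced by the RSE; NumPy and the simulated data are assumptions for the demo, not part of the source:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0.0, 10.0, size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)  # synthetic data for the demo

# Least squares fit (closed form).
x_bar, y_bar = x.mean(), y.mean()
sxx = np.sum((x - x_bar) ** 2)
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / sxx
beta0_hat = y_bar - beta1_hat * x_bar

# Residual standard error: estimate of sigma based on the RSS.
rss = np.sum((y - beta0_hat - beta1_hat * x) ** 2)
rse = np.sqrt(rss / (n - 2))

# Standard errors of the coefficient estimates, with sigma replaced by the RSE.
se_beta1 = np.sqrt(rse**2 / sxx)
se_beta0 = np.sqrt(rse**2 * (1.0 / n + x_bar**2 / sxx))
print(rse, se_beta0, se_beta1)
```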

A 95% confidence interval can be calculated approximately as:

$$\hat{\beta}_1 \pm 2 \cdot \mathrm{SE}(\hat{\beta}_1)$$

and

$$\hat{\beta}_0 \pm 2 \cdot \mathrm{SE}(\hat{\beta}_0)$$

Hypothesis testing:

$$H_0: \beta_1 = 0, \qquad H_a: \beta_1 \neq 0, \qquad t = \frac{\hat{\beta}_1 - 0}{\mathrm{SE}(\hat{\beta}_1)}$$

where the $t$-statistic follows a $t$ distribution with $n-2$ degrees of freedom, which is close to the normal distribution for $n$ greater than about 30.
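
A sketch covering both the approximate 95% intervals and the $t$-test for $H_0: \beta_1 = 0$ on synthetic data; SciPy's `stats.t` is used for the p-value, and the simulated setup is an assumption for the demo:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0.0, 10.0, size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)  # synthetic data for the demo

# Least squares fit and standard errors (sigma estimated by the RSE).
x_bar, y_bar = x.mean(), y.mean()
sxx = np.sum((x - x_bar) ** 2)
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / sxx
beta0_hat = y_bar - beta1_hat * x_bar
rse = np.sqrt(np.sum((y - beta0_hat - beta1_hat * x) ** 2) / (n - 2))
se_beta1 = rse / np.sqrt(sxx)
se_beta0 = rse * np.sqrt(1.0 / n + x_bar**2 / sxx)

# Approximate 95% confidence intervals: estimate +/- 2 * SE.
ci_beta1 = (beta1_hat - 2 * se_beta1, beta1_hat + 2 * se_beta1)
ci_beta0 = (beta0_hat - 2 * se_beta0, beta0_hat + 2 * se_beta0)

# t-statistic and two-sided p-value for H0: beta1 = 0 (t distribution, n - 2 df).
t_stat = (beta1_hat - 0.0) / se_beta1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
print(ci_beta0, ci_beta1, t_stat, p_value)
```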

Assessing Model Accuracy

Model accuracy is typically assessed with the Residual Standard Error (RSE) and the $R^2$ statistic.
The $R^2$ statistic measures the proportion of variance explained:

$$R^2 = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$$

where the total sum of squares (TSS) and the residual sum of squares (RSS) are

$$\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad \mathrm{RSS} = e_1^2 + e_2^2 + \cdots + e_n^2 = (y_1 - \hat{\beta}_0 - \hat{\beta}_1 x_1)^2 + \cdots + (y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n)^2$$

$R^2$ measures the proportion of variability in $Y$ that can be explained using $X$. An $R^2$ statistic that is close to 1 indicates that a large proportion of the variability in the response is explained by the regression. A number near 0 indicates that the regression does not explain much of the variability in the response; this might occur because the linear model is wrong, or the error variance $\sigma^2$ is high, or both.
For simple linear regression (but not for multiple linear regression), $R^2 = r^2$, where $r = \mathrm{Cor}(X, Y)$:

$$\mathrm{Cor}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
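
A sketch computing TSS, RSS, and $R^2$, and checking that $R^2$ equals the squared sample correlation in the simple linear case; the synthetic data is again an assumption for the demo, and `np.corrcoef` is NumPy's sample correlation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0.0, 10.0, size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)  # synthetic data for the demo

# Least squares fit (closed form).
x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# TSS, RSS, and R^2 = 1 - RSS/TSS.
rss = np.sum((y - beta0_hat - beta1_hat * x) ** 2)
tss = np.sum((y - y_bar) ** 2)
r_squared = 1 - rss / tss

# In simple linear regression, R^2 equals the squared sample correlation r^2.
r = np.corrcoef(x, y)[0, 1]
print(r_squared, r**2)  # the two values agree (up to floating-point error)
```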
