Simple Linear Regression

Assumes there is approximately a linear relationship between $X$ and $Y$:

$$Y \approx \beta_0 + \beta_1 X$$

$\beta_0$ and $\beta_1$ are known as coefficients or parameters.
Use training data to produce estimates $\hat{\beta}_0$ and $\hat{\beta}_1$, then predict future response values by:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$
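
As a minimal sketch of prediction in NumPy (the coefficient estimates and new $x$ values below are made up for illustration, not taken from any dataset):

```python
import numpy as np

# Hypothetical coefficient estimates (made-up values, for illustration only).
beta0_hat, beta1_hat = 2.0, 0.5

# Predict responses for new X values via y_hat = beta0_hat + beta1_hat * x.
x_new = np.array([1.0, 3.0, 10.0])
y_hat = beta0_hat + beta1_hat * x_new
print(y_hat)  # [2.5 3.5 7. ]
```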

Estimating Coefficients

Let

$$(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$$

represent $n$ observation pairs, each of which consists of a measurement of $X$ and a measurement of $Y$.
Measure the closeness of the fit using least squares (one of several possible ways to measure closeness).
Let $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ be the prediction for $Y$ based on the $i$th value of $X$. Then $e_i = y_i - \hat{y}_i$ represents the $i$th residual: the difference between the $i$th observed response value and the $i$th predicted response value. We define the residual sum of squares (RSS) as

$$\mathrm{RSS} = e_1^2 + e_2^2 + \cdots + e_n^2$$

or

$$\mathrm{RSS} = (y_1 - \hat{\beta}_0 - \hat{\beta}_1 x_1)^2 + (y_2 - \hat{\beta}_0 - \hat{\beta}_1 x_2)^2 + \cdots + (y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n)^2$$
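
A small sketch of how the residuals and the RSS are computed; the toy data and coefficient estimates here are made up for illustration:

```python
import numpy as np

# Toy data and hypothetical coefficient estimates (made up for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.3, 2.9, 3.6, 4.2])
beta0_hat, beta1_hat = 1.6, 0.65

# Residuals e_i = y_i - y_hat_i and the residual sum of squares.
y_hat = beta0_hat + beta1_hat * x
e = y - y_hat
rss = np.sum(e**2)
print(e, rss)
```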

Least squares chooses $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimize the RSS. The minimizers are:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

where $\bar{y} \equiv \frac{1}{n} \sum_{i=1}^{n} y_i$ and $\bar{x} \equiv \frac{1}{n} \sum_{i=1}^{n} x_i$ are the sample means.
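
A minimal sketch of the closed-form least squares estimates on synthetic data; the data-generating line $y = 2.0 + 0.5x$ plus noise is an assumption chosen only for this demo:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: y = 2.0 + 0.5*x + noise (true values chosen for the demo).
n = 100
x = rng.uniform(0.0, 10.0, size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)

# Closed-form least squares estimates.
x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

print(beta0_hat, beta1_hat)  # should land near 2.0 and 0.5
```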

Assessing Coefficient Accuracy

Standard errors of $\hat{\beta}_0$ and $\hat{\beta}_1$:

$$\mathrm{SE}(\hat{\beta}_0)^2 = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right), \qquad \mathrm{SE}(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

where $\sigma^2 = \mathrm{Var}(\epsilon)$, assuming the errors $\epsilon_i$ have a common variance $\sigma^2$ and are uncorrelated.
Estimate $\sigma$ using the residual standard error:

$$\mathrm{RSE} = \sqrt{\mathrm{RSS}/(n-2)}$$
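
A sketch of how the RSE and the standard errors could be computed, with $\sigma$ replaced by the RSE; NumPy and the simulated data are assumptions for the demo, not part of the source:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0.0, 10.0, size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)  # synthetic data for the demo

# Least squares fit (closed form).
x_bar, y_bar = x.mean(), y.mean()
sxx = np.sum((x - x_bar) ** 2)
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / sxx
beta0_hat = y_bar - beta1_hat * x_bar

# Residual standard error: estimate of sigma based on the RSS.
rss = np.sum((y - beta0_hat - beta1_hat * x) ** 2)
rse = np.sqrt(rss / (n - 2))

# Standard errors of the coefficient estimates, with sigma replaced by the RSE.
se_beta1 = np.sqrt(rse**2 / sxx)
se_beta0 = np.sqrt(rse**2 * (1.0 / n + x_bar**2 / sxx))
print(rse, se_beta0, se_beta1)
```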

A 95% confidence interval can be calculated approximately as:

$$\hat{\beta}_1 \pm 2 \cdot \mathrm{SE}(\hat{\beta}_1)$$

and

$$\hat{\beta}_0 \pm 2 \cdot \mathrm{SE}(\hat{\beta}_0)$$

Hypothesis testing:

$$H_0: \beta_1 = 0, \qquad H_a: \beta_1 \neq 0, \qquad t = \frac{\hat{\beta}_1 - 0}{\mathrm{SE}(\hat{\beta}_1)}$$

where the $t$-statistic follows a $t$ distribution with $n-2$ degrees of freedom, which is close to the normal distribution for $n$ greater than about 30.
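
A sketch covering both the approximate 95% intervals and the $t$-test for $H_0: \beta_1 = 0$ on synthetic data; SciPy's `stats.t` is used for the p-value, and the simulated setup is an assumption for the demo:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0.0, 10.0, size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)  # synthetic data for the demo

# Least squares fit and standard errors (sigma estimated by the RSE).
x_bar, y_bar = x.mean(), y.mean()
sxx = np.sum((x - x_bar) ** 2)
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / sxx
beta0_hat = y_bar - beta1_hat * x_bar
rse = np.sqrt(np.sum((y - beta0_hat - beta1_hat * x) ** 2) / (n - 2))
se_beta1 = rse / np.sqrt(sxx)
se_beta0 = rse * np.sqrt(1.0 / n + x_bar**2 / sxx)

# Approximate 95% confidence intervals: estimate +/- 2 * SE.
ci_beta1 = (beta1_hat - 2 * se_beta1, beta1_hat + 2 * se_beta1)
ci_beta0 = (beta0_hat - 2 * se_beta0, beta0_hat + 2 * se_beta0)

# t-statistic and two-sided p-value for H0: beta1 = 0 (t distribution, n - 2 df).
t_stat = (beta1_hat - 0.0) / se_beta1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
print(ci_beta0, ci_beta1, t_stat, p_value)
```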

Assessing Model Accuracy

Model accuracy is typically assessed with the Residual Standard Error (RSE) and the $R^2$ statistic.
The $R^2$ statistic measures the proportion of variance explained:

$$R^2 = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$$

where the total sum of squares (TSS) and the residual sum of squares (RSS) are

$$\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad \mathrm{RSS} = e_1^2 + e_2^2 + \cdots + e_n^2 = (y_1 - \hat{\beta}_0 - \hat{\beta}_1 x_1)^2 + \cdots + (y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n)^2$$

$R^2$ measures the proportion of variability in $Y$ that can be explained using $X$. An $R^2$ statistic that is close to 1 indicates that a large proportion of the variability in the response is explained by the regression. A number near 0 indicates that the regression does not explain much of the variability in the response; this might occur because the linear model is wrong, or the error variance $\sigma^2$ is high, or both.
For simple linear regression (but not for multiple linear regression), $R^2 = r^2$, where $r = \mathrm{Cor}(X, Y)$:

$$\mathrm{Cor}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
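
A sketch computing TSS, RSS, and $R^2$, and checking that $R^2$ equals the squared sample correlation in the simple linear case; the synthetic data is again an assumption for the demo, and `np.corrcoef` is NumPy's sample correlation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0.0, 10.0, size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)  # synthetic data for the demo

# Least squares fit (closed form).
x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# TSS, RSS, and R^2 = 1 - RSS/TSS.
rss = np.sum((y - beta0_hat - beta1_hat * x) ** 2)
tss = np.sum((y - y_bar) ** 2)
r_squared = 1 - rss / tss

# In simple linear regression, R^2 equals the squared sample correlation r^2.
r = np.corrcoef(x, y)[0, 1]
print(r_squared, r**2)  # the two values agree (up to floating-point error)
```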
