Linear Regression Considerations

Some Important Questions

Is at least one of the predictors X1, X2, …, Xp useful in predicting the response?
Hypothesis test:

H0: β1 = β2 = ⋯ = βp = 0    vs.    Ha: at least one βj ≠ 0

F-statistic:

F = [(TSS − RSS) / p] / [RSS / (n − p − 1)]

where TSS = ∑(yi − ȳ)² and RSS = ∑(yi − ŷi)²
If the linear model assumptions are correct, one can show that:

E[RSS / (n − p − 1)] = σ²

and that, provided H0 is true:

E[(TSS − RSS) / p] = σ²

When there is no relationship between the response and predictors, one would expect the F-statistic to take on a value close to 1. On the other hand, if Ha is true, then E[(TSS − RSS) / p] > σ², so we expect F to be greater than 1.

When n is large, an F-statistic that is just a little larger than 1 might still provide evidence against H0. In contrast, a larger F-statistic is needed to reject H0 if n is small. When H0 is true and the errors ϵi have a normal distribution, the F-statistic follows an F-distribution. For any given value of n and p, any statistical software package can be used to compute the p-value associated with the F-statistic using this distribution. Based on this p-value, we can determine whether or not to reject H0.
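As a hedged sketch (synthetic data; numpy and scipy assumed, and all numbers are illustrative rather than from the text), the overall F-statistic and its p-value can be computed directly from TSS and RSS:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 100, 3                                   # illustrative sample size and predictor count
X = rng.normal(size=(n, p))
y = 2 + X @ np.array([1.5, 0.0, -0.7]) + rng.normal(size=n)

X1 = np.column_stack([np.ones(n), X])           # design matrix with intercept column
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
y_hat = X1 @ beta_hat

TSS = np.sum((y - y.mean()) ** 2)
RSS = np.sum((y - y_hat) ** 2)

F = ((TSS - RSS) / p) / (RSS / (n - p - 1))
p_value = stats.f.sf(F, p, n - p - 1)           # upper tail of the F_{p, n-p-1} distribution
print(F, p_value)
```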

Sometimes we want to test whether a particular subset of q of the coefficients is zero; when q = 1, this measures the partial effect of adding that single variable to the model. This corresponds to the null hypothesis:

H0: βp−q+1 = βp−q+2 = ⋯ = βp = 0

In this case we fit a second model that uses all the variables except those last q. Suppose that the residual sum of squares for that model is RSS0. Then the appropriate F-statistic is:

F = [(RSS0 − RSS) / q] / [RSS / (n − p − 1)]
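A minimal sketch of this nested-model comparison on synthetic data (the names and values are assumptions made for illustration): fit the full model, fit the model without the last q predictors to get RSS0, and compare.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p, q = 100, 3, 1
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # intercept + p predictors
y = X @ np.array([2.0, 1.5, 0.0, -0.7]) + rng.normal(size=n)

def rss(design):
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return np.sum((y - design @ beta) ** 2)

RSS = rss(X)                       # full model
RSS0 = rss(X[:, :-q])              # reduced model without the last q predictors

F = ((RSS0 - RSS) / q) / (RSS / (n - p - 1))
print(F, stats.f.sf(F, q, n - p - 1))
```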

If p > n then there are more coefficients βj to estimate than observations from which to estimate them. In this case we cannot even fit the multiple linear regression model using least squares, so the F-statistic cannot be used. When p is large, methods designed for the high-dimensional setting are needed instead.
Do all the predictors help to explain Y, or is only a subset of the predictors useful?
Variable Selection is the task of determining which predictors are associated with the response in order to fit a single model involving only those predictors. Covered more in Ch 6

For a multiple linear regression:

R² = Cor(Y, Ŷ)²

On training data, R² will always increase when more variables are added to the model, even if those variables are only weakly associated with the response. If adding a variable produces only a tiny increase in R², that variable likely adds little, and including it may simply lead to overfitting.

RSE is defined as:

RSE = √(RSS / (n − p − 1))

Thus, models with more variables can have higher RSE if the decrease in RSS is small relative to the increase in p.
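As a small sketch of both quantities (the function name is made up for illustration), R² and RSE can be computed from a fit's residuals; comparing nested fits with a helper like this shows R² never decreasing while RSE can increase.

```python
import numpy as np

def r2_and_rse(y, y_hat, p):
    """Return (R^2, RSE) for a fit with p predictors plus an intercept."""
    n = len(y)
    RSS = np.sum((y - y_hat) ** 2)
    TSS = np.sum((y - np.mean(y)) ** 2)
    return 1 - RSS / TSS, np.sqrt(RSS / (n - p - 1))
```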

Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
We have to take uncertainty into account when using the model to predict future values

Other Regression Model Considerations

Qualitative Predictors

Predictors with Two Levels
factor: a qualitative predictor
levels: possible values for a factor
dummy variable: a numerical variable that takes values such as 0/1 or -1/1 to indicate which level of a factor an observation belongs to

It is important to note that the final predictions for different levels will be identical regardless of the coding scheme used. The only difference is in the way that the coefficients are interpreted.
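As an illustration of that point (synthetic two-level factor, numpy only; all names are made up), the 0/1 and -1/+1 codings below produce different coefficients but identical fitted values:

```python
import numpy as np

rng = np.random.default_rng(1)
group = rng.integers(0, 2, size=50)            # two-level factor: 0 = "A", 1 = "B"
y = 5 + 2 * group + rng.normal(size=50)

def fit(design):
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return design @ beta, beta

X_01 = np.column_stack([np.ones(50), group])            # 0/1 dummy coding
X_pm = np.column_stack([np.ones(50), 2 * group - 1])    # -1/+1 dummy coding
fitted_01, beta_01 = fit(X_01)
fitted_pm, beta_pm = fit(X_pm)

print(np.allclose(fitted_01, fitted_pm))   # True: identical predictions
print(beta_01, beta_pm)                    # different coefficients, different interpretation
```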

Predictors with More than Two Levels
Make more dummy variables!
There will always be one fewer dummy variable than the number of levels. The level with no dummy variable is known as the baseline. The level selected as the baseline category is arbitrary, and the final predictions for each group will be the same regardless of this choice.

However, the coefficients and their p-values do depend on the choice of dummy variable coding. Rather than rely on the individual coefficients, we can use an F-test to test H0:β1=β2=⋯=βp=0; this does not depend on the coding.
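A hedged sketch of the three-level case (synthetic data; the helper names are made up): one dummy per non-baseline level is created, and changing the baseline changes the coefficients but leaves the fitted values unchanged.

```python
import numpy as np

rng = np.random.default_rng(2)
level = rng.integers(0, 3, size=60)            # factor with levels 0, 1, 2
y = np.array([3.0, 5.0, 8.0])[level] + rng.normal(size=60)

def dummies(level, baseline):
    """Intercept plus one dummy column per non-baseline level."""
    others = [k for k in range(3) if k != baseline]
    return np.column_stack([np.ones(len(level))] +
                           [(level == k).astype(float) for k in others])

def fitted(design):
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return design @ beta

# Same predictions whichever level serves as the baseline
print(np.allclose(fitted(dummies(level, baseline=0)),
                  fitted(dummies(level, baseline=2))))   # True
```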

Extensions of the Linear Model

The standard linear regression model makes several highly restrictive assumptions that are often violated in practice; the most important assumptions state that the relationship between the predictors and the response is additive and linear.

Removing the Additive Assumption
Add an interaction term to capture how the effect of one predictor depends on the value of another:

Y = β0 + β1X1 + β2X2 + β3X1X2 + ϵ

hierarchical principle: if we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant.
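A minimal sketch of fitting the interaction model above on synthetic data (values are illustrative); per the hierarchical principle, the main effects X1 and X2 stay in the design matrix alongside the product X1·X2:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
X1, X2 = rng.normal(size=n), rng.normal(size=n)
y = 1 + 2 * X1 - 1 * X2 + 0.5 * X1 * X2 + rng.normal(size=n)

# Design matrix: intercept, both main effects, and the interaction term
X = np.column_stack([np.ones(n), X1, X2, X1 * X2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # estimates of beta0, beta1, beta2, beta3
```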

Non-linear Relationships
Let's use polynomials!
polynomial regression: include polynomial functions of the predictors in the regression model
More in Ch 7
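A short sketch on synthetic data: polynomial regression is still a linear model, just with powers of X appended as extra columns of the design matrix.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-2, 2, size=100)
y = 1 + x - 0.8 * x**2 + rng.normal(scale=0.5, size=100)

# Quadratic fit: columns 1, x, x^2
X_poly = np.column_stack([np.ones_like(x), x, x**2])
beta_hat, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
print(beta_hat)
```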

Potential Problems

Most common problems:

Non-linearity of the Data

If the true relationship is far from linear, then virtually all of the conclusions that we draw from the fit are suspect.

Check using a residual plot: ei = yi − ŷi versus xi. In the case of a multiple regression model we plot the residuals versus the predicted (or fitted) values ŷi. Ideally, the residual plot will show no discernible pattern. The presence of a pattern may indicate a problem with some aspect of the linear model.
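A hedged sketch (synthetic data; matplotlib assumed): fitting a straight line to a quadratic relationship and plotting residuals against fitted values makes the leftover pattern visible.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(9)
x = rng.uniform(0, 3, size=120)
y = 1 + x + 0.6 * x**2 + rng.normal(scale=0.5, size=120)   # true relationship is non-linear

# Fit a straight line, then plot residuals against fitted values
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

plt.scatter(y_hat, y - y_hat, s=10)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```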

Correlation of Error Terms

If there is correlation among the error terms, then the estimated standard errors will tend to underestimate the true standard errors. As a result, confidence and prediction intervals will be narrower than they should be.

Correlations among error terms usually occur in the context of time series data. In order to determine if this is the case for a given data set, we can plot the residuals from our model as a function of time. If the errors are uncorrelated, then there should be no discernible pattern. On the other hand, if the error terms are positively correlated, then we may see adjacent residuals having similar values.

Correlation among the error terms can also occur outside of time series data.

Non-constant Variance of Error Terms

One can identify non-constant variances in the errors, or heteroscedasticity, from the presence of a funnel shape in the residual plot.

Reduce heteroscedasticity by transforming Y with a concave function such as log Y or √Y.

Sometimes we have a good idea of the variance of each response. For example, the ith response could be an average of ni raw observations. If each of these raw observations is uncorrelated with variance σ², then their average has variance σi² = σ²/ni. In this case a simple remedy is to fit our model by weighted least squares, with weights proportional to the inverse variances, i.e. wi = ni in this case.
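A minimal sketch of that remedy on synthetic data (names and numbers are illustrative): scaling each row of the response and design matrix by √wi and running ordinary least squares is equivalent to weighted least squares with weights wi = ni.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 80
n_i = rng.integers(1, 10, size=n)                    # each y_i is an average of n_i raw obs
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n) / np.sqrt(n_i)    # Var(eps_i) = sigma^2 / n_i

X = np.column_stack([np.ones(n), x])
w = n_i                                              # weights proportional to inverse variances
sw = np.sqrt(w)

# WLS via row-scaling: minimizes sum_i w_i (y_i - x_i beta)^2
beta_wls, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
print(beta_wls)
```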

Outliers

While an outlier may not have much of an effect on the least squares line, it can greatly affect the RSE and R². Since the RSE is used to compute all confidence intervals and p-values, a dramatic increase caused by a single data point can have implications for the interpretation of the fit. Similarly, inclusion of the outlier causes the R² to decline.

Residual plots can be used to identify outliers. More precise are the studentized residuals, computed by dividing each residual ei by its estimated standard error. Observations whose studentized residuals are greater than 3 in absolute value are possible outliers.
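A hedged sketch (numpy only, synthetic data) of internally studentized residuals, which divide each residual by its estimated standard error using the leverage values hi (defined in the next subsection):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # intercept + p predictors
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix
h = np.diag(H)                              # leverage values h_i
resid = y - H @ y                           # residuals e_i
sigma_hat = np.sqrt(np.sum(resid**2) / (n - p - 1))

studentized = resid / (sigma_hat * np.sqrt(1 - h))
print(np.where(np.abs(studentized) > 3)[0])   # flag possible outliers
```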

If we believe that an outlier has occurred due to an error in data collection or recording, then one solution is to simply remove the observation. However, care should be taken, since an outlier may instead indicate a deficiency with the model, such as a missing predictor.

High Leverage Points

Observations with high leverage have an unusual value for xi

High leverage observations tend to have a sizable impact on the estimated regression line. It is cause for concern if the least squares line is heavily affected by just a couple of observations, because any problems with these points may invalidate the entire fit.

In order to quantify an observation’s leverage, we compute the leverage statistic. A large value of this statistic indicates an observation with high leverage. For a simple linear regression:

hi = 1/n + (xi − x̄)² / ∑i′ (xi′ − x̄)²

There is a simple extension of hi to the case of multiple predictors. The leverage statistic hi is always between 1/n and 1, and the average leverage for all the observations is always equal to (p+1)/n. So if a given observation has a leverage statistic that greatly exceeds (p+1)/n, then we may suspect that the corresponding point has high leverage.
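A short sketch of the multiple-predictor case (synthetic data; the cutoff of three times the average is only an illustrative rule of thumb): the hi are the diagonal of the hat matrix H = X(XᵀX)⁻¹Xᵀ, and they average exactly to (p + 1)/n.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # intercept + p predictors

h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)    # leverage statistics h_i
print(h.mean(), (p + 1) / n)                     # average leverage equals (p + 1) / n
print(np.where(h > 3 * (p + 1) / n)[0])          # observations far above the average
```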

Collinearity

Collinearity refers to the situation in which two or more predictor variables are closely related to one another. The presence of collinearity can pose problems in the regression context, since it can be difficult to separate out the individual effects of collinear variables on the response. Collinearity also reduces the power of the hypothesis test of H0: βj = 0, that is, the probability of correctly detecting a non-zero coefficient.

To avoid such a situation, it is desirable to identify and address potential collinearity problems while fitting the model. One way is to compute the variance inflation factor (VIF). The smallest possible value for the VIF is 1, which indicates the complete absence of collinearity. Typically in practice there is a small amount of collinearity among the predictors. As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity. The VIF for each variable can be computed using the formula:

VIF(β̂j) = 1 / (1 − R²Xj|X−j)

where R²Xj|X−j is the R² from a regression of Xj onto all of the other predictors.
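A minimal sketch of the VIF computation with numpy (the function name and data are illustrative): regress each predictor on the others and apply the formula above.

```python
import numpy as np

def vif(X):
    """VIF for each column of X (predictors only, no intercept column)."""
    n, p = X.shape
    out = []
    for j in range(p):
        xj = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, xj, rcond=None)
        resid = xj - others @ beta
        r2 = 1 - resid @ resid / np.sum((xj - xj.mean()) ** 2)
        out.append(1 / (1 - r2))
    return np.array(out)

# Example: the third column is nearly a linear combination of the first two, so its VIF is large
rng = np.random.default_rng(8)
X = rng.normal(size=(200, 2))
X = np.column_stack([X, X[:, 0] + X[:, 1] + 0.05 * rng.normal(size=200)])
print(vif(X))
```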

When faced with the problem of collinearity, there are two simple solutions. The first is to drop one of the problematic variables from the regression. The second solution is to combine the collinear variables together into a single predictor.

