Linear Regression Considerations
Some Important Questions
Is at least one of the predictors $X_1, X_2, \dots, X_p$ useful in predicting the response?
Hypothesis test: $H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0$ versus $H_a$: at least one $\beta_j$ is non-zero. The test is performed by computing the F-statistic
$$F = \frac{(\mathrm{TSS} - \mathrm{RSS})/p}{\mathrm{RSS}/(n - p - 1)},$$
where $\mathrm{TSS} = \sum_i (y_i - \bar{y})^2$ and $\mathrm{RSS} = \sum_i (y_i - \hat{y}_i)^2$.
If the linear model assumptions are correct, one can show that $E\{\mathrm{RSS}/(n - p - 1)\} = \sigma^2$,
and that, provided $H_0$ is true, $E\{(\mathrm{TSS} - \mathrm{RSS})/p\} = \sigma^2$.
When there is no relationship between the response and predictors, one would expect the F-statistic to take on a value close to 1.
When $H_a$ is true, $E\{(\mathrm{TSS} - \mathrm{RSS})/p\} > \sigma^2$, so we expect F to be greater than 1.
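A minimal sketch of computing this F-statistic directly with NumPy; the predictor matrix `X` and response `y` are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))                      # hypothetical predictors
y = 2 + X @ np.array([1.5, 0.0, -0.7]) + rng.normal(size=n)

# Least squares fit with an intercept column.
Xd = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)
y_hat = Xd @ beta_hat

TSS = np.sum((y - y.mean()) ** 2)
RSS = np.sum((y - y_hat) ** 2)
F = ((TSS - RSS) / p) / (RSS / (n - p - 1))
print(F)   # a large F is evidence against H0: beta_1 = ... = beta_p = 0
```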
Sometimes we want to test that a particular subset of $q$ of the coefficients are zero: $H_0: \beta_{p-q+1} = \beta_{p-q+2} = \dots = \beta_p = 0$.
In this case we fit a second model that uses all the variables except those last $q$. Suppose the residual sum of squares for that model is $\mathrm{RSS}_0$.
Then the appropriate F-statistic is
$$F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS})/q}{\mathrm{RSS}/(n - p - 1)}.$$
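This partial F-test can be run by fitting the full and the restricted model and comparing them. A sketch with statsmodels, using made-up data and treating the last predictor as the subset being tested ($q = 1$):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                            # hypothetical predictors
y = 2 + X @ np.array([1.5, 0.0, -0.7]) + rng.normal(size=100)

full = sm.OLS(y, sm.add_constant(X)).fit()               # all p = 3 predictors
restricted = sm.OLS(y, sm.add_constant(X[:, :2])).fit()  # drop the last q = 1 predictor

f_value, p_value, df_diff = full.compare_f_test(restricted)
print(f_value, p_value)   # tests H0: the dropped coefficient equals zero
```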
Do all the predictors help to explain $Y$, or is only a subset of the predictors useful?
Variable Selection is the task of determining which predictors are associated with the response in order to fit a single model involving only those predictors. Covered more in Ch 6
- Various statistics can be used to judge the quality of a model. These include Mallow's $C_p$, Akaike information criterion (AIC), Bayesian information criterion (BIC), and adjusted $R^2$.
We use an automated approach to find subsets of predictors to consider. Three classical approaches (a forward-selection sketch follows the list):
- forward selection. Start with a null model (intercept with no predictors). Fit $p$ simple linear regressions and add to the null model the variable that results in the lowest RSS. Then add the variable that results in the lowest RSS for the new two-variable model. Keep going until some stopping rule is satisfied. A greedy approach that might include variables early that later become redundant.
- backward selection. Start with all variables in the model. Remove the variable with the largest p-value. Continue removing variables until a stopping rule is reached. Can't be used if $p > n$.
- mixed selection. Combination of forward and backward selection. Start with forward selection. If the p-value for one of the variables in the model rises above a certain threshold, remove that variable. Continue to perform these forward and backward steps until all variables in the model have a sufficiently low p-value, and all variables outside the model would have a large p-value if added to the model.
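A minimal sketch of greedy forward selection by RSS, on made-up data and with a fixed number of steps standing in for the stopping rule:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                        # hypothetical predictors
y = 1 + 2 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=100)

def rss_of(cols):
    """RSS of a least-squares fit of y on an intercept plus the given columns of X."""
    Xd = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return np.sum((y - Xd @ beta) ** 2)

selected, remaining = [], list(range(X.shape[1]))
for _ in range(2):                                   # stopping rule: keep at most 2 predictors
    best = min(remaining, key=lambda j: rss_of(selected + [j]))
    selected.append(best)
    remaining.remove(best)
print(selected)                                      # indices of the chosen predictors
```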
How well does the model fit the data?
Most common measures of fit are RSE and $R^2$.
For a multiple linear regression, $R^2 = 1 - \mathrm{RSS}/\mathrm{TSS}$, which equals $\mathrm{Cor}(Y, \hat{Y})^2$, the square of the correlation between the response and the fitted values.
On training data, $R^2$ will always increase when more variables are added to the model, even if those variables are only weakly associated with the response.
RSE is defined as:
$$\mathrm{RSE} = \sqrt{\frac{\mathrm{RSS}}{n - p - 1}}$$
Thus, models with more variables can have a higher RSE if the decrease in RSS is small relative to the increase in $p$.
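A sketch of both measures obtained from a fitted statsmodels model on made-up data (`mse_resid` is RSS divided by $n - p - 1$, so its square root is the RSE):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                              # hypothetical predictors
y = 1.0 + X @ np.array([0.5, -0.2, 0.0]) + rng.normal(size=100)

fit = sm.OLS(y, sm.add_constant(X)).fit()
print(fit.rsquared)              # R^2 = 1 - RSS/TSS
print(np.sqrt(fit.mse_resid))    # RSE = sqrt(RSS / (n - p - 1))
```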
Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
We have to take several sources of uncertainty into account when using the model to predict future values:
- The coefficient estimates are only estimates, so the least squares plane $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \dots + \hat{\beta}_p X_p$ is only an estimate for the true population regression plane $f(X) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$. We compute a confidence interval to determine how close $\hat{Y}$ will be to $f(X)$.
- model bias: potentially reducible error introduced by assuming a specific (linear) model. Here we ignore this discrepancy and operate as if the linear model were correct.
- Even if we knew $f(X)$, the response value can't be predicted perfectly because of the random error $\epsilon$ (irreducible error). We take this into account with prediction intervals, which are always wider than confidence intervals because they account for both the reducible and the irreducible error.
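A sketch of obtaining both interval types from statsmodels' `get_prediction`; the training data and the new point `x0` are made up for illustration:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))                      # hypothetical predictors
y = 3 + X @ np.array([1.0, -0.5]) + rng.normal(size=100)

fit = sm.OLS(y, sm.add_constant(X)).fit()
x0 = sm.add_constant(np.array([[0.2, 1.1]]), has_constant='add')   # new observation

pred = fit.get_prediction(x0).summary_frame(alpha=0.05)
print(pred[['mean', 'mean_ci_lower', 'mean_ci_upper']])   # 95% confidence interval
print(pred[['obs_ci_lower', 'obs_ci_upper']])             # wider 95% prediction interval
```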
Other Regression Model Considerations
Qualitative Predictors
Predictors with Two Levels
factor: a qualitative predictor
levels: possible values for a factor
dummy variable: create a variable that takes the value of 0/1 or -1/1 to indicate the level of a factor that a response value is associated with
It is important to note that the final predictions for different levels will be identical regardless of the coding scheme used. The only difference is in the way that the coefficients are interpreted.
Predictors with More than Two Levels
Make more dummy variables!
There will always be one fewer dummy variable than the number of levels. The level with no dummy variable is known as the baseline. The level selected as the baseline category is arbitrary, and the final predictions for each group will be the same regardless of this choice.
However, the coefficients and their p-values do depend on the choice of dummy variable coding.
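A sketch of the usual 0/1 dummy coding with pandas on a made-up three-level factor; with `drop_first=True` the dropped level becomes the baseline, and in a regression on the dummies the intercept equals the baseline mean while the other coefficients are differences from it:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    'region': ['East', 'West', 'South', 'West', 'East'],   # hypothetical factor with 3 levels
    'balance': [531.0, 486.0, 512.0, 475.0, 545.0],        # hypothetical response
})

# Two dummy variables for three levels; the dropped level ('East') is the baseline.
dummies = pd.get_dummies(df['region'], prefix='region', drop_first=True, dtype=float)

fit = sm.OLS(df['balance'], sm.add_constant(dummies)).fit()
print(fit.params)   # intercept = baseline ('East') mean; other coefs = differences from it
```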
Extensions of the Linear Model
The standard linear regression model makes several highly restrictive assumptions that are often violated in practice; the two most important assumptions state that the relationship between the predictors and the response is additive and linear.
- additive: the association between a predictor $X_j$ and the response $Y$ does not depend on the values of the other predictors
- linear: the change in the response $Y$ associated with a one-unit change in $X_j$ is constant, regardless of the value of $X_j$
Removing the Additive Assumption
Add an interaction term so that the effect of one predictor on the response can depend on the value of another:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon$$
hierarchical principle: if we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant.
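A sketch of fitting an interaction on made-up data with statsmodels' formula interface; `x1 * x2` expands to the main effects plus the interaction, which respects the hierarchical principle:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({'x1': rng.normal(size=200), 'x2': rng.normal(size=200)})
df['y'] = 1 + 2 * df.x1 + 0.5 * df.x2 + 1.5 * df.x1 * df.x2 + rng.normal(size=200)

# y ~ x1 * x2  fits  y = b0 + b1*x1 + b2*x2 + b3*x1:x2 + eps
fit = smf.ols('y ~ x1 * x2', data=df).fit()
print(fit.params)
```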
Non-linear Relationships
Let's use polynomials!
polynomial regression: include polynomial functions of the predictors in the regression model
More in Ch 7
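A sketch of polynomial regression on made-up data: power terms of the predictor are added to the design matrix, and the model stays linear in the coefficients:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.uniform(-3, 3, size=200)                       # hypothetical predictor
y = 1 + 0.5 * x - 0.8 * x**2 + rng.normal(size=200)    # quadratic truth

# Quadratic regression: Y = b0 + b1*X + b2*X^2 + eps
X_poly = sm.add_constant(np.column_stack([x, x**2]))
fit = sm.OLS(y, X_poly).fit()
print(fit.params)
```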
Potential Problems
Most common problems:
- Non-linearity of the response-predictor relationships
- Correlation of error terms
- Non-constant variance of error terms
- Outliers
- High-leverage points
- Collinearity
Non-linearity of the Data
If the true relationship is far from linear, then virtually all of the conclusions that we draw from the fit are suspect.
Check using a residual plot: plot the residuals $e_i = y_i - \hat{y}_i$ against the fitted values $\hat{y}_i$ and look for a pattern. If the plot suggests non-linearity, a simple approach is to use non-linear transformations of the predictors, such as $\log X$, $\sqrt{X}$, or $X^2$.
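A sketch of such a residual plot on made-up data whose true relationship is non-linear (assumes an interactive matplotlib backend for `plt.show()`):

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.uniform(0, 4, size=200)
y = np.exp(0.5 * x) + rng.normal(scale=0.5, size=200)   # truth is non-linear in x

fit = sm.OLS(y, sm.add_constant(x)).fit()
plt.scatter(fit.fittedvalues, fit.resid, s=10)
plt.axhline(0, color='grey', linewidth=1)
plt.xlabel('Fitted values')
plt.ylabel('Residuals')       # a curved pattern here suggests non-linearity
plt.show()
```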
Correlation of Error Terms
If there is correlation among the error terms, then the estimated standard errors will tend to underestimate the true standard errors. As a result, confidence and prediction intervals will be narrower than they should be.
Correlations among error terms usually occur in the context of time series data. In order to determine if this is the case for a given data set, we can plot the residuals from our model as a function of time. If the errors are uncorrelated, then there should be no discernible pattern. On the other hand, if the error terms are positively correlated, then we may see adjacent residuals having similar values.
Correlation among the error terms can also occur outside of time series data.
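One common numeric check for serially correlated errors is the Durbin-Watson statistic (values near 2 suggest little autocorrelation); a sketch on made-up time-ordered data with AR(1) errors:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(6)
t = np.arange(300, dtype=float)
eps = np.zeros(300)
for i in range(1, 300):                      # AR(1) errors: positively correlated
    eps[i] = 0.8 * eps[i - 1] + rng.normal()
y = 1 + 0.05 * t + eps

fit = sm.OLS(y, sm.add_constant(t)).fit()
print(durbin_watson(fit.resid))              # well below 2 here, flagging correlation
```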
Non-constant Variance of Error Terms
One can identify non-constant variances in the errors, or heteroscedasticity, from the presence of a funnel shape in the residual plot.
Reduce heteroscedasticity by transforming the response $Y$ using a concave function such as $\log Y$ or $\sqrt{Y}$; such a transformation shrinks the larger responses.
Sometimes we have a good idea of the variance of each response. For example, if the $i$th response is an average of $n_i$ raw observations, each uncorrelated with variance $\sigma^2$, then the average has variance $\sigma_i^2 = \sigma^2 / n_i$. In this case a simple remedy is to fit the model by weighted least squares, with weights proportional to the inverse variances (here $w_i = n_i$).
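A sketch of weighted least squares with statsmodels on made-up data where each response is an average of a known number of raw observations:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=150)
n_i = rng.integers(1, 20, size=150)                       # raw observations behind each response
y = 2 + 0.5 * x + rng.normal(scale=1 / np.sqrt(n_i))      # Var(y_i) = sigma^2 / n_i

# Weights proportional to the inverse variances, i.e. w_i = n_i.
fit = sm.WLS(y, sm.add_constant(x), weights=n_i).fit()
print(fit.params)
```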
Outliers
While an outlier may not have much of an effect on the least squares line, it can greatly affect the RSE and $R^2$, which are used to interpret the fit.
Residual plots can be used to identify outliers. Specifically, the studentized residuals, computed by dividing each residual $e_i$ by its estimated standard error, can be examined; observations whose studentized residuals exceed 3 in absolute value are possible outliers.
If we believe that an outlier has occurred due to an error in data collection or recording, then one solution is to simply remove the observation. However, care should be taken, since an outlier may instead indicate a deficiency with the model, such as a missing predictor.
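A sketch of flagging possible outliers with studentized residuals from statsmodels' influence measures; the data are made up, with one response deliberately corrupted:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
x = rng.normal(size=100)
y = 1 + 2 * x + rng.normal(size=100)
y[10] += 8                                   # inject an outlier in the response

fit = sm.OLS(y, sm.add_constant(x)).fit()
stud = fit.get_influence().resid_studentized_external
print(np.where(np.abs(stud) > 3)[0])         # indices of possible outliers
```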
High Leverage Points
Observations with high leverage have an unusual value for $x_i$.
High leverage observations tend to have a sizable impact on the estimated regression line. It is cause for concern if the least squares line is heavily affected by just a couple of observations, because any problems with these points may invalidate the entire fit.
In order to quantify an observation's leverage, we compute the leverage statistic. A large value of this statistic indicates an observation with high leverage. For a simple linear regression:
$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{i'=1}^{n} (x_{i'} - \bar{x})^2}$$
There is a simple extension of $h_i$ to the case of multiple predictors. The leverage statistic is always between $1/n$ and 1, and the average leverage over all observations is $(p + 1)/n$; an observation whose leverage greatly exceeds $(p + 1)/n$ may be a high-leverage point.
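A sketch of extracting the leverage statistics (the hat values) from a fitted model and comparing them to the average $(p + 1)/n$; the data are made up, with one observation given an unusual combination of predictor values:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
X = rng.normal(size=(100, 2))
X[5] = [8.0, -8.0]                                # an unusual predictor combination
y = 1 + X @ np.array([1.0, 0.5]) + rng.normal(size=100)

fit = sm.OLS(y, sm.add_constant(X)).fit()
h = fit.get_influence().hat_matrix_diag           # leverage statistics h_i
avg = (X.shape[1] + 1) / len(y)                   # average leverage (p + 1) / n
print(np.where(h > 3 * avg)[0])                   # flag points with leverage well above average
```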
Collinearity
Collinearity refers to the situation in which two or more predictor variables are closely related to one another. The presence of collinearity can pose problems in the regression context, since it can be difficult to separate out the individual effects of collinear variables on the response. The power of the hypothesis test (the probability of correctly detecting a non-zero coefficient) is reduced by collinearity, because collinearity inflates the standard errors of the coefficient estimates.
To avoid such a situation, it is desirable to identify and address potential collinearity problems while fitting the model: compute the variance inflation factor (VIF). The smallest possible value for VIF is 1, which indicates the complete absence of collinearity. Typically in practice there is a small amount of collinearity among the predictors. As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity. The VIF for each variable can be computed using the formula:
$$\mathrm{VIF}(\hat{\beta}_j) = \frac{1}{1 - R^2_{X_j \mid X_{-j}}}$$
where $R^2_{X_j \mid X_{-j}}$ is the $R^2$ from a regression of $X_j$ onto all of the other predictors. If it is close to one, collinearity is present, and the VIF will be large.
When faced with the problem of collinearity, there are two simple solutions. The first is to drop one of the problematic variables from the regression. The second solution is to combine the collinear variables together into a single predictor.
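A sketch of computing VIF for each predictor with statsmodels, on made-up data where the third column is nearly a linear combination of the first two:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(10)
X = rng.normal(size=(200, 2))
x3 = X[:, 0] + 0.9 * X[:, 1] + 0.05 * rng.normal(size=200)   # nearly collinear predictor
exog = sm.add_constant(np.column_stack([X, x3]))

# VIF for each predictor column (skip the constant at index 0).
vifs = [variance_inflation_factor(exog, i) for i in range(1, exog.shape[1])]
print(vifs)                       # values above 5 or 10 indicate problematic collinearity
```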