Linear Regression Considerations

Some Important Questions

Is at least one of the predictors X1, X2, …, Xp useful in predicting the response?
Hypothesis test:

H0: β1 = β2 = ⋯ = βp = 0    vs.    Ha: at least one βj ≠ 0

F-statistic:

F = [(TSS − RSS) / p] / [RSS / (n − p − 1)]

where TSS = ∑(yi − ȳ)² and RSS = ∑(yi − ŷi)²
If the linear model assumptions are correct, one can show that:

E[RSS / (n − p − 1)] = σ²

and that, provided H0 is true:

E[(TSS − RSS) / p] = σ²

When there is no relationship between the response and predictors, one would expect the F-statistic to take on a value close to 1. On the other hand, if Ha is true, then E[(TSS − RSS) / p] > σ², so we expect F to be greater than 1.

When n is large, an F-statistic that is just a little larger than 1 might still provide evidence against H0. In contrast, a larger F-statistic is needed to reject H0 if n is small. When H0 is true and the errors ϵi have a normal distribution, the F-statistic follows an F-distribution. For any given value of n and p, any statistical software package can be used to compute the p-value associated with the F-statistic using this distribution. Based on this p-value, we can determine whether or not to reject H0.
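As a hedged sketch (synthetic data; numpy and scipy assumed, and all numbers are illustrative rather than from the text), the overall F-statistic and its p-value can be computed directly from TSS and RSS:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 100, 3                                   # illustrative sample size and predictor count
X = rng.normal(size=(n, p))
y = 2 + X @ np.array([1.5, 0.0, -0.7]) + rng.normal(size=n)

X1 = np.column_stack([np.ones(n), X])           # design matrix with intercept column
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
y_hat = X1 @ beta_hat

TSS = np.sum((y - y.mean()) ** 2)
RSS = np.sum((y - y_hat) ** 2)

F = ((TSS - RSS) / p) / (RSS / (n - p - 1))
p_value = stats.f.sf(F, p, n - p - 1)           # upper tail of the F_{p, n-p-1} distribution
print(F, p_value)
```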

Sometimes we want to test whether a particular subset of q of the coefficients is zero; when q = 1, this measures the partial effect of adding that single variable to the model. This corresponds to the null hypothesis:

H0: βp−q+1 = βp−q+2 = ⋯ = βp = 0

In this case we fit a second model that uses all the variables except those last q. Suppose that the residual sum of squares for that model is RSS0. Then the appropriate F-statistic is:

F = [(RSS0 − RSS) / q] / [RSS / (n − p − 1)]
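A minimal sketch of this nested-model comparison on synthetic data (the names and values are assumptions made for illustration): fit the full model, fit the model without the last q predictors to get RSS0, and compare.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p, q = 100, 3, 1
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # intercept + p predictors
y = X @ np.array([2.0, 1.5, 0.0, -0.7]) + rng.normal(size=n)

def rss(design):
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return np.sum((y - design @ beta) ** 2)

RSS = rss(X)                       # full model
RSS0 = rss(X[:, :-q])              # reduced model without the last q predictors

F = ((RSS0 - RSS) / q) / (RSS / (n - p - 1))
print(F, stats.f.sf(F, q, n - p - 1))
```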

If p > n then there are more coefficients βj to estimate than observations from which to estimate them. In this case we cannot even fit the multiple linear regression model using least squares, so the F-statistic cannot be used. When p is large, methods designed for the high-dimensional setting are needed instead.
Do all the predictors help to explain Y, or is only a subset of the predictors useful?
Variable Selection is the task of determining which predictors are associated with the response in order to fit a single model involving only those predictors. Covered more in Ch 6

For a multiple linear regression:

R² = Cor(Y, Ŷ)²

On training data, R² will always increase when more variables are added to the model, even if those variables are only weakly associated with the response. If adding a variable produces only a tiny increase in R², that variable likely adds little, and including it may simply lead to overfitting.

RSE is defined as:

RSE = √(RSS / (n − p − 1))

Thus, models with more variables can have higher RSE if the decrease in RSS is small relative to the increase in p.
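As a small sketch of both quantities (the function name is made up for illustration), R² and RSE can be computed from a fit's residuals; comparing nested fits with a helper like this shows R² never decreasing while RSE can increase.

```python
import numpy as np

def r2_and_rse(y, y_hat, p):
    """Return (R^2, RSE) for a fit with p predictors plus an intercept."""
    n = len(y)
    RSS = np.sum((y - y_hat) ** 2)
    TSS = np.sum((y - np.mean(y)) ** 2)
    return 1 - RSS / TSS, np.sqrt(RSS / (n - p - 1))
```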

Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
We have to take uncertainty into account when using the model to predict future values

Other Regression Model Considerations

Qualitative Predictors

Predictors with Two Levels
factor: a qualitative predictor
levels: possible values for a factor
dummy variable: a numerical variable that takes values such as 0/1 or -1/1 to indicate which level of a factor an observation belongs to

It is important to note that the final predictions for different levels will be identical regardless of the coding scheme used. The only difference is in the way that the coefficients are interpreted.
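As an illustration of that point (synthetic two-level factor, numpy only; all names are made up), the 0/1 and -1/+1 codings below produce different coefficients but identical fitted values:

```python
import numpy as np

rng = np.random.default_rng(1)
group = rng.integers(0, 2, size=50)            # two-level factor: 0 = "A", 1 = "B"
y = 5 + 2 * group + rng.normal(size=50)

def fit(design):
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return design @ beta, beta

X_01 = np.column_stack([np.ones(50), group])            # 0/1 dummy coding
X_pm = np.column_stack([np.ones(50), 2 * group - 1])    # -1/+1 dummy coding
fitted_01, beta_01 = fit(X_01)
fitted_pm, beta_pm = fit(X_pm)

print(np.allclose(fitted_01, fitted_pm))   # True: identical predictions
print(beta_01, beta_pm)                    # different coefficients, different interpretation
```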

Predictors with More than Two Levels
Make more dummy variables!
There will always be one fewer dummy variable than the number of levels. The level with no dummy variable is known as the baseline. The level selected as the baseline category is arbitrary, and the final predictions for each group will be the same regardless of this choice.

However, the coefficients and their p-values do depend on the choice of dummy variable coding. Rather than rely on the individual coefficients, we can use an F-test to test H0:β1=β2=⋯=βp=0; this does not depend on the coding.
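A hedged sketch of the three-level case (synthetic data; the helper names are made up): one dummy per non-baseline level is created, and changing the baseline changes the coefficients but leaves the fitted values unchanged.

```python
import numpy as np

rng = np.random.default_rng(2)
level = rng.integers(0, 3, size=60)            # factor with levels 0, 1, 2
y = np.array([3.0, 5.0, 8.0])[level] + rng.normal(size=60)

def dummies(level, baseline):
    """Intercept plus one dummy column per non-baseline level."""
    others = [k for k in range(3) if k != baseline]
    return np.column_stack([np.ones(len(level))] +
                           [(level == k).astype(float) for k in others])

def fitted(design):
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return design @ beta

# Same predictions whichever level serves as the baseline
print(np.allclose(fitted(dummies(level, baseline=0)),
                  fitted(dummies(level, baseline=2))))   # True
```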

Extensions of the Linear Model

The standard linear regression model makes several highly restrictive assumptions that are often violated in practice; the most important assumptions state that the relationship between the predictors and the response is additive and linear.

Removing the Additive Assumption
Add an interaction term to capture how the effect of one predictor depends on the value of another:

Y = β0 + β1X1 + β2X2 + β3X1X2 + ϵ

hierarchical principle: if we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant.
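A minimal sketch of fitting the interaction model above on synthetic data (values are illustrative); per the hierarchical principle, the main effects X1 and X2 stay in the design matrix alongside the product X1·X2:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
X1, X2 = rng.normal(size=n), rng.normal(size=n)
y = 1 + 2 * X1 - 1 * X2 + 0.5 * X1 * X2 + rng.normal(size=n)

# Design matrix: intercept, both main effects, and the interaction term
X = np.column_stack([np.ones(n), X1, X2, X1 * X2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # estimates of beta0, beta1, beta2, beta3
```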

Non-linear Relationships
Let's use polynomials!
polynomial regression: include polynomial functions of the predictors in the regression model
More in Ch 7
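A short sketch on synthetic data: polynomial regression is still a linear model, just with powers of X appended as extra columns of the design matrix.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-2, 2, size=100)
y = 1 + x - 0.8 * x**2 + rng.normal(scale=0.5, size=100)

# Quadratic fit: columns 1, x, x^2
X_poly = np.column_stack([np.ones_like(x), x, x**2])
beta_hat, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
print(beta_hat)
```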

Potential Problems

Most common problems:

Non-linearity of the Data

If the true relationship is far from linear, then virtually all of the conclusions that we draw from the fit are suspect.

Check using a residual plot: ei = yi − ŷi versus xi. In the case of a multiple regression model we plot the residuals versus the predicted (or fitted) values ŷi. Ideally, the residual plot will show no discernible pattern. The presence of a pattern may indicate a problem with some aspect of the linear model.
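A hedged sketch (synthetic data; matplotlib assumed): fitting a straight line to a quadratic relationship and plotting residuals against fitted values makes the leftover pattern visible.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(9)
x = rng.uniform(0, 3, size=120)
y = 1 + x + 0.6 * x**2 + rng.normal(scale=0.5, size=120)   # true relationship is non-linear

# Fit a straight line, then plot residuals against fitted values
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

plt.scatter(y_hat, y - y_hat, s=10)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```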

Correlation of Error Terms

If there is correlation among the error terms, then the estimated standard errors will tend to underestimate the true standard errors. As a result, confidence and prediction intervals will be narrower than they should be.

Correlations among error terms usually occur in the context of time series data. In order to determine if this is the case for a given data set, we can plot the residuals from our model as a function of time. If the errors are uncorrelated, then there should be no discernible pattern. On the other hand, if the error terms are positively correlated, then we may see adjacent residuals having similar values.

Correlation among the error terms can also occur outside of time series data.

Non-constant Variance of Error Terms

One can identify non-constant variances in the errors, or heteroscedasticity, from the presence of a funnel shape in the residual plot.

Reduce heteroscedasticity by transforming Y with a concave function such as log Y or √Y.

Sometimes we have a good idea of the variance of each response. For example, the ith response could be an average of ni raw observations. If each of these raw observations is uncorrelated with variance σ², then their average has variance σi² = σ²/ni. In this case a simple remedy is to fit our model by weighted least squares, with weights proportional to the inverse variances, i.e. wi = ni in this case.
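A minimal sketch of that remedy on synthetic data (names and numbers are illustrative): scaling each row of the response and design matrix by √wi and running ordinary least squares is equivalent to weighted least squares with weights wi = ni.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 80
n_i = rng.integers(1, 10, size=n)                    # each y_i is an average of n_i raw obs
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n) / np.sqrt(n_i)    # Var(eps_i) = sigma^2 / n_i

X = np.column_stack([np.ones(n), x])
w = n_i                                              # weights proportional to inverse variances
sw = np.sqrt(w)

# WLS via row-scaling: minimizes sum_i w_i (y_i - x_i beta)^2
beta_wls, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
print(beta_wls)
```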

Outliers

While an outlier may not have much of an effect on the least squares line, it can greatly affect the RSE and R². Since the RSE is used to compute all confidence intervals and p-values, a dramatic increase caused by a single data point can have implications for the interpretation of the fit. Similarly, inclusion of the outlier causes the R² to decline.

Residual plots can be used to identify outliers. More precise are the studentized residuals, computed by dividing each residual ei by its estimated standard error. Observations whose studentized residuals are greater than 3 in absolute value are possible outliers.
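A hedged sketch (numpy only, synthetic data) of internally studentized residuals, which divide each residual by its estimated standard error using the leverage values hi (defined in the next subsection):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # intercept + p predictors
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix
h = np.diag(H)                              # leverage values h_i
resid = y - H @ y                           # residuals e_i
sigma_hat = np.sqrt(np.sum(resid**2) / (n - p - 1))

studentized = resid / (sigma_hat * np.sqrt(1 - h))
print(np.where(np.abs(studentized) > 3)[0])   # flag possible outliers
```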

If we believe that an outlier has occurred due to an error in data collection or recording, then one solution is to simply remove the observation. However, care should be taken, since an outlier may instead indicate a deficiency with the model, such as a missing predictor.

High Leverage Points

Observations with high leverage have an unusual value for xi

High leverage observations tend to have a sizable impact on the estimated regression line. It is cause for concern if the least squares line is heavily affected by just a couple of observations, because any problems with these points may invalidate the entire fit.

In order to quantify an observation’s leverage, we compute the leverage statistic. A large value of this statistic indicates an observation with high leverage. For a simple linear regression:

hi = 1/n + (xi − x̄)² / ∑i′ (xi′ − x̄)²

There is a simple extension of hi to the case of multiple predictors. The leverage statistic hi is always between 1/n and 1, and the average leverage for all the observations is always equal to (p+1)/n. So if a given observation has a leverage statistic that greatly exceeds (p+1)/n, then we may suspect that the corresponding point has high leverage.
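A short sketch of the multiple-predictor case (synthetic data; the cutoff of three times the average is only an illustrative rule of thumb): the hi are the diagonal of the hat matrix H = X(XᵀX)⁻¹Xᵀ, and they average exactly to (p + 1)/n.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # intercept + p predictors

h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)    # leverage statistics h_i
print(h.mean(), (p + 1) / n)                     # average leverage equals (p + 1) / n
print(np.where(h > 3 * (p + 1) / n)[0])          # observations far above the average
```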

Collinearity

Collinearity refers to the situation in which two or more predictor variables are closely related to one another. The presence of collinearity can pose problems in the regression context, since it can be difficult to separate out the individual effects of collinear variables on the response. Collinearity also reduces the power of the hypothesis test of H0: βj = 0, that is, the probability of correctly detecting a non-zero coefficient.

To avoid such a situation, it is desirable to identify and address potential collinearity problems while fitting the model. One way is to compute the variance inflation factor (VIF). The smallest possible value for the VIF is 1, which indicates the complete absence of collinearity. Typically in practice there is a small amount of collinearity among the predictors. As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity. The VIF for each variable can be computed using the formula:

VIF(β̂j) = 1 / (1 − R²Xj|X−j)

where R²Xj|X−j is the R² from a regression of Xj onto all of the other predictors.
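A minimal sketch of the VIF computation with numpy (the function name and data are illustrative): regress each predictor on the others and apply the formula above.

```python
import numpy as np

def vif(X):
    """VIF for each column of X (predictors only, no intercept column)."""
    n, p = X.shape
    out = []
    for j in range(p):
        xj = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, xj, rcond=None)
        resid = xj - others @ beta
        r2 = 1 - resid @ resid / np.sum((xj - xj.mean()) ** 2)
        out.append(1 / (1 - r2))
    return np.array(out)

# Example: the third column is nearly a linear combination of the first two, so its VIF is large
rng = np.random.default_rng(8)
X = rng.normal(size=(200, 2))
X = np.column_stack([X, X[:, 0] + X[:, 1] + 0.05 * rng.normal(size=200)])
print(vif(X))
```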

When faced with the problem of collinearity, there are two simple solutions. The first is to drop one of the problematic variables from the regression. The second solution is to combine the collinear variables together into a single predictor.

