StatsModels

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor as VIF
from statsmodels.stats.anova import anova_lm
```
Simple Linear Regression
Use the `sm.OLS()` function to fit a simple linear regression model (ordinary least squares):
- separate the X and y values from the data frame into their own matrices
- the X matrix needs an extra column of ones (one per observation) for the intercept; consider `np.ones()`
- this can be handled automatically by `MS()` from ISLP

```python
model = sm.OLS(y, X)
results = model.fit()
summarize(results)  # summarize() also comes from ISLP
```
Multiple Linear Regression
Use `MS()` again, but with multiple columns:

```python
X = MS(['col1', 'col2']).fit_transform(dataframe)
model = sm.OLS(y, X)
results = model.fit()
```
- use `dataframe.columns.drop('col')` to drop columns (this returns just a list of the remaining column names, not a new data frame)
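For example (hypothetical data frame and column names), `columns.drop()` yields only the remaining names, which can then be handed to `MS()`:

```python
import pandas as pd

# hypothetical data frame with three predictor columns
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': [7, 8, 9]})

# .columns.drop() returns the remaining column NAMES (an Index),
# not a new data frame
remaining = df.columns.drop('col3')
print(list(remaining))  # ['col1', 'col2']
```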
Interaction Terms
Add a tuple to `MS()` to add an interaction term:

```python
X = MS(['col1',
        'col2',
        ('col1', 'col2')]).fit_transform(dataframe)
```
Non-linear Transformations of the Predictors
Add `poly()` from ISLP to `MS()`:

```python
X = MS([poly('col1', degree=2), 'col2']).fit_transform(dataframe)
```
- by default, the columns created by `poly()` do not include an intercept column (that is added by `MS()`)
- adding `raw=True` to `poly()` means the matrix would contain `col1` and `col1**2` instead of orthogonal polynomials. Fitted values wouldn't change, but the polynomial coefficients would.
- `anova_lm()` performs hypothesis tests comparing two successive (nested) models
- see the documentation to use anova with a model spec
Qualitative Predictors
`MS()` automatically creates dummy variables for qualitative columns. Remember that one column (the first level of the factor) is dropped: when all the other level indicators equal 0, the fitted value corresponds to that dropped baseline level.
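`pd.get_dummies(..., drop_first=True)` mimics this encoding; a sketch with a hypothetical `region` column:

```python
import pandas as pd

df = pd.DataFrame({'region': ['East', 'West', 'South', 'East']})

# drop_first drops the first level ('East'), which becomes the baseline:
# a row of all zeros in the indicator columns means region == 'East'
dummies = pd.get_dummies(df['region'], drop_first=True)
print(dummies)
```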