StatsModels

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor as VIF
from statsmodels.stats.anova import anova_lm
```
Simple Linear Regression
Use the `sm.OLS()` function to fit a simple linear regression model (ordinary least squares):
- separate the X and y values from the data frame into their own matrices
- the X matrix needs an extra column of ones (one per observation) for the intercept; consider `np.ones()`
- this can be handled automatically by `MS()` from ISLP

```python
model = sm.OLS(y, X)
results = model.fit()
summarize(results)  # summarize() also comes from ISLP
```
Multiple Linear Regression
Use `MS()` again, but with multiple columns:

```python
X = MS(['col1', 'col2']).fit_transform(dataframe)
model = sm.OLS(y, X)
results = model.fit()
```
- use `dataframe.columns.drop('col')` to drop columns (this returns just a list of the remaining column names, not a new data frame)
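For example (hypothetical data frame and column names), `columns.drop()` yields only the remaining names, which can then be handed to `MS()`:

```python
import pandas as pd

# hypothetical data frame with three predictor columns
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': [7, 8, 9]})

# .columns.drop() returns the remaining column NAMES (an Index),
# not a new data frame
remaining = df.columns.drop('col3')
print(list(remaining))  # ['col1', 'col2']
```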
Interaction Terms
Add a tuple to `MS()` to add an interaction term:

```python
X = MS(['col1',
        'col2',
        ('col1', 'col2')]).fit_transform(dataframe)
```
Non-linear Transformations of the Predictors
Add `poly()` from ISLP to `MS()`:

```python
X = MS([poly('col1', degree=2), 'col2']).fit_transform(dataframe)
```
- by default, the columns created by `poly()` do not include an intercept column (that is added by `MS()`)
- adding `raw=True` to `poly()` means the matrix would contain `col1` and `col1**2` instead of orthogonal polynomials. Fitted values wouldn't change, but the polynomial coefficients would.
- `anova_lm()` performs hypothesis tests comparing two successive (nested) models
- see the documentation to use anova with a model spec
Qualitative Predictors
`MS()` automatically creates dummy variables for qualitative columns. Remember that one column (the first level of the factor) is dropped: when all the other level indicators equal 0, the fitted value corresponds to that dropped baseline level.
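`pd.get_dummies(..., drop_first=True)` mimics this encoding; a sketch with a hypothetical `region` column:

```python
import pandas as pd

df = pd.DataFrame({'region': ['East', 'West', 'South', 'East']})

# drop_first drops the first level ('East'), which becomes the baseline:
# a row of all zeros in the indicator columns means region == 'East'
dummies = pd.get_dummies(df['region'], drop_first=True)
print(dummies)
```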