Correlation and Regression

Definitions

Bivariate data

Bivariate data records the values of two variables for each observation. If one of the variables has been controlled in some way or is used to explain the other, it is called the independent or explanatory variable. The other variable is called the dependent or response variable.

Plot bivariate data on a scatter diagram, with the explanatory variable on the x axis and the response variable on the y axis.

Correlation

A correlation is a mathematical relationship between variables. It does not mean that one variable causes the other.

Linear Correlations

A linear correlation is one that follows a straight line.

Line of Best Fit

The line that best fits the data points is called the line of best fit.
Linear regression is a mathematical way of finding the line of best fit, which is written as

y=a+bx

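Least squares regression chooses a and b to minimise the sum of squared errors, or SSE, where $\hat{y} = a + bx$ is the value the line predicts for each x. The SSE is given by

$$\mathrm{SSE} = \sum (y - \hat{y})^2$$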

The slope of the line y=a+bx is

$$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$

The value of a is given by

$$a = \bar{y} - b\bar{x}$$
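The formulas for b and a translate almost directly into code. The following is a minimal Python sketch, not a definitive implementation; the function names `fit_line` and `sum_squared_errors` and the five data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: sum of products of deviations over sum of squared x deviations.
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    # Intercept: the fitted line passes through the point (x_bar, y_bar).
    a = y_bar - b * x_bar
    return a, b


def sum_squared_errors(xs, ys, a, b):
    """Return the SSE of the line y = a + b*x for the given points."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Illustrative data only (hypothetical, not taken from the notes).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = fit_line(xs, ys)
print(a, b)                              # approximately 2.2 and 0.6
print(sum_squared_errors(xs, ys, a, b))  # approximately 2.4
```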

The correlation coefficient, r, is a number between -1 and 1 that describes the scatter of data away from the line of best fit. If r=−1, there is perfect negative linear correlation. If r=1, there is perfect positive linear correlation. If r=0, there is no linear correlation.
You find r by calculating

$$r = \frac{b\,s_x}{s_y}$$

where

$$s_x = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

and

$$s_y = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}}$$
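As a rough illustration of the calculation, the Python sketch below computes the two sample standard deviations and then scales a fitted slope by their ratio. The helper names `sample_sd` and `correlation` and the data are hypothetical; the slope 0.6 is the least squares slope for these particular points.

```python
import math


def sample_sd(values):
    """Sample standard deviation, dividing by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def correlation(xs, ys, b):
    """Correlation coefficient r = b * s_x / s_y for a fitted slope b."""
    return b * sample_sd(xs) / sample_sd(ys)


# Illustrative data only; b = 0.6 is the least squares slope for these points.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys, b=0.6)
print(round(r, 3))  # 0.775
```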

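Least squares regression: alternate notation

The covariance of x and y, written $s_{xy}$, is a measure of how x and y vary together.

$$s_x^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

$$s_y^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$$

$$s_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{n - 1}$$

$$b = \frac{s_{xy}}{s_x^2}$$

$$r = \frac{s_{xy}}{s_x s_y}$$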
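A short worked derivation shows that this notation agrees with the earlier formulas: the common factor of n − 1 cancels in the slope, and substituting the slope into r = b·sx/sy gives the covariance form of r. Only symbols already defined in these notes are used.

```latex
\begin{align*}
b &= \frac{s_{xy}}{s_x^2}
   = \frac{\sum (x - \bar{x})(y - \bar{y}) / (n - 1)}{\sum (x - \bar{x})^2 / (n - 1)}
   = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} \\[4pt]
r &= b \, \frac{s_x}{s_y}
   = \frac{s_{xy}}{s_x^2} \cdot \frac{s_x}{s_y}
   = \frac{s_{xy}}{s_x s_y}
\end{align*}
```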

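The coefficient of determination

The coefficient of determination is given by $r^2$ or $R^2$. It is the proportion of the variation in the y variable that is explained by the x variable, and is often quoted as a percentage.

$$r^2 = \left(\frac{s_{xy}}{s_x s_y}\right)^2 \quad \text{or} \quad r^2 = \frac{\sum (\hat{y} - \bar{y})^2}{\sum (y - \bar{y})^2}$$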
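To make the "explained variation" reading concrete, here is a small Python sketch that fits the line and then divides the variation of the predicted values about the mean of y by the total variation of y about its mean. The function name `r_squared` and the data are illustrative assumptions.

```python
def r_squared(xs, ys):
    """Return r^2 as explained variation divided by total variation in y."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    predicted = [a + b * x for x in xs]
    # Variation of the fitted values about the mean of y.
    explained = sum((y_hat - y_bar) ** 2 for y_hat in predicted)
    # Total variation of the observed values about the mean of y.
    total = sum((y - y_bar) ** 2 for y in ys)
    return explained / total


# Illustrative data only (hypothetical).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(round(r_squared(xs, ys), 3))  # 0.6
```

For this data, about 60% of the variation in y is explained by the fitted line.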

Non-linear relationships

If the relationship between your variables isn't linear, you can sometimes transform it to a linear form.
You can then perform linear regression on the transformed data to find the values of a and b. The trick is to transform your non-linear equation so that it takes the form

y′=a+bx′

where y′ is a function of y and x′ is a function of x.
Once you've transformed your values, you can use least squares regression to find the values of a and b, then substitute these back into your original equation.
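One common case, used here purely as an example, is an exponential curve y = c·e^(bx). Taking logs gives ln y = ln c + bx, which matches the linear form with y′ = ln y, x′ = x and a = ln c. The Python sketch below follows those steps; the function name `fit_exponential`, the constant c and the generated data are assumptions for illustration.

```python
import math


def fit_exponential(xs, ys):
    """Fit y = c * exp(b * x) by least squares on (x, ln y).

    Taking logs gives ln y = ln c + b * x, which has the linear form
    y' = a + b * x' with y' = ln y, x' = x and a = ln c.
    """
    y_primes = [math.log(y) for y in ys]  # requires every y > 0
    n = len(xs)
    x_bar = sum(xs) / n
    yp_bar = sum(y_primes) / n
    b = (sum((x - x_bar) * (yp - yp_bar) for x, yp in zip(xs, y_primes))
         / sum((x - x_bar) ** 2 for x in xs))
    a = yp_bar - b * x_bar
    # Substitute back: c = exp(a) recovers the original equation.
    return math.exp(a), b


# Illustrative data generated from y = 2 * exp(0.5 * x), so the fit
# should recover c close to 2 and b close to 0.5.
xs = [0, 1, 2, 3, 4]
ys = [2 * math.exp(0.5 * x) for x in xs]
c, b = fit_exponential(xs, ys)
print(round(c, 6), round(b, 6))  # 2.0 0.5
```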

The confidence interval for the slope of a regression line

The confidence interval for b takes the form

$$\hat{b} \pm (\text{margin of error})$$

margin of error

$$\text{margin of error} = t(\nu) \times (\text{standard deviation of } b)$$

standard deviation of b

$$s_b = \frac{\sqrt{\dfrac{\sum (y - \hat{y})^2}{n - 2}}}{\sqrt{\sum (x - \bar{x})^2}}$$

confidence interval

$$\left(\hat{b} - t(\nu)\,s_b,\ \hat{b} + t(\nu)\,s_b\right) \quad \text{where } \nu = n - 2$$
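Putting the pieces together, the Python sketch below estimates the slope, computes its standard deviation from the residuals, and builds the interval. It is only a sketch under stated assumptions: the function name `slope_confidence_interval` and the data are hypothetical, and the critical value t(ν) is passed in by the caller, who would look it up in a t table for ν = n − 2 and the chosen confidence level.

```python
import math


def slope_confidence_interval(xs, ys, t_crit):
    """Return (low, high) for the slope, given a t critical value.

    t_crit is t(nu) for nu = n - 2 degrees of freedom, looked up from a
    t table (or statistics software) for the chosen confidence level.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    # Standard deviation of the slope estimate.
    s_b = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)
    margin = t_crit * s_b
    return b - margin, b + margin


# Illustrative data only. n = 5, so nu = 3; the two-sided 95% critical
# value from a standard t table is about 3.182.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
low, high = slope_confidence_interval(xs, ys, t_crit=3.182)
print(round(low, 2), round(high, 2))  # -0.3 1.5
```

With only five points the interval is wide and includes 0, so this sample alone does not rule out a slope of zero.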
