Before starting the lesson, please download the datasets.
Regression
Contents
- Correlation
- Simple Regression
- R-Squared
- Multiple Regression
- Adj R-Squared
- P-value
- Multicollinearity
- Interaction terms
Correlation
Why do we need correlation?
- Is there any association between hours of study and grades?
- Is there any association between number of temples in a city & murder rate?
- What happens to sweater sales with increase in temperature? What is the strength of association between them?
- What happens to ice-cream sales vs. temperature? What is the strength of association between them?
- How to quantify the association?
- Which of the above examples has very strong association?
- Correlation
Correlation coefficient
- It is a measure of linear association
- r is the ratio of the covariance of x and y to the product of their individual standard deviations: $r = \dfrac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}}$
- Correlation 0 No linear association
- Correlation 0 to 0.25 Negligible positive association
- Correlation 0.25-0.5 Weak positive association
- Correlation 0.5-0.75 Moderate positive association
- Correlation >0.75 Very Strong positive association
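The ratio above can be verified directly in code; a minimal sketch with made-up numbers (not from the lab datasets), comparing the hand-computed ratio with np.corrcoef:

```python
# Pearson's r computed from its definition, then checked against np.corrcoef.
import numpy as np

x = np.array([2, 4, 6, 8, 10], dtype=float)     # hypothetical example data
y = np.array([65, 70, 78, 85, 92], dtype=float)

r_manual = np.cov(x, y, ddof=1)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)   # both values should match
```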
LAB: Correlation Calculation
- Dataset: AirPassengers/AirPassengers.csv
- Find the correlation between number of passengers and promotional budget.
- Draw a scatter plot between number of passengers and promotional budget.
- Find the correlation between number of passengers and Service_Quality_Score.
In [1]:
import pandas as pd
air = pd.read_csv("DatasetsAirPassengersAirPassengers.csv")
air.shape
Out[1]:
In [2]:
#Name of the columns in the dataset:
air.columns.values
Out[2]:
In [3]:
#Find the correlation between number of passengers and promotional budget.
import numpy as np
np.corrcoef(air.Passengers,air.Promotion_Budget)
Out[3]:
In [4]:
#Draw a scatter plot between number of passengers and promotional budget
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(air.Passengers, air.Promotion_Budget)
Out[4]:
In [5]:
#Find the correlation between number of passengers and Service_Quality_Score
np.corrcoef(air.Passengers,air.Service_Quality_Score)
Out[5]:
Beyond Pearson Correlation
- Different correlation coefficient measures exist for different types of data
| Variable Y \ X | Quantitative/Continuous X | Ordinal/Ranked/Discrete X | Nominal/Categorical X |
|---|---|---|---|
| Quantitative Y | Pearson r | Biserial | Point Biserial |
| Ordinal/Ranked/Discrete Y | Biserial | Spearman rho / Kendall's tau | Rank Biserial |
| Nominal/Categorical Y | Point Biserial | Rank Biserial | Phi, Contingency Coeff, V |
From Correlation to Regression
- Correlation is just a measure of association
- It can’t be used for prediction.
- Given the predictor variable, we can’t estimate the dependent variable.
- In the air passengers example, given the promotion budget, we can’t get the estimated value of passengers
- We need a model, an equation, a fit for the data.
- That is known as the regression line.
What is Regression
- A regression line is a mathematical formula that quantifies the general relation between a predictor/independent variable (or known variable x) and the target/dependent variable (or the unknown variable y).
- The regression line has the form $y = \beta_0 + \beta_1 x$. If we have data on x and y, then we can build a model to generalize their relation
- What is the best fit for our data?
- The one which goes through the core of the data
- The one which minimizes the error
(Figures: regression line fitted to the data, and the errors/residuals around the fitted line)
Minimizing the error
- The best line will have the minimum error
- Some errors are positive and some errors are negative. Taking their sum is not a good idea
- We can either minimize the sum of squared errors, or minimize the sum of absolute errors
- The sum of squared errors is mathematically convenient to minimize
- The method of minimizing the sum of squared errors is called the least squares method of regression
Least Squares Estimation
- Imagine a line through all the points
- Deviation from each point (residual or error)
- Square of the deviation
- Minimizing sum of squares of deviation
- For the fitted line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$, the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ are obtained by minimizing the sum of the squared residuals: $\min_{\beta_0,\beta_1} \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_i\right)^2$
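A minimal sketch of the closed-form least squares estimates (illustrative only, with made-up numbers rather than the lab data):

```python
# Simple linear regression in closed form:
# beta1 = cov(x, y) / var(x), beta0 = mean(y) - beta1 * mean(x)
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)   # hypothetical predictor
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])      # hypothetical target

beta1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
beta0 = y.mean() - beta1 * x.mean()

print(beta0, beta1)   # the intercept and slope that minimize the sum of squared residuals
```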
LAB: Regression Line Fitting
- Dataset: AirPassengers/AirPassengers.csv
- Find the correlation between Promotion_Budget and Passengers
- Draw a scatter plot between Promotion_Budget and Passengers. Is there any pattern between Promotion_Budget and Passengers?
- Build a linear regression model on Promotion_Budget and Passengers.
- Build a regression line to predict the passengers using Inter_metro_flight_ratio
In [6]:
import pandas as pd
air = pd.read_csv("DatasetsAirPassengersAirPassengers.csv")
air.shape
Out[6]:
In [7]:
air.columns.values
Out[7]:
In [8]:
air.head(5)
Out[8]:
In [9]:
# Find the correlation between Promotion_Budget and Passengers
import numpy as np
np.corrcoef(air.Passengers,air.Promotion_Budget)
Out[9]:
In [10]:
# Draw a scatter plot between Promotion_Budget and Passengers. Is there any pattern between Promotion_Budget and Passengers?
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(air.Passengers, air.Promotion_Budget)
Out[10]:
In [11]:
#Build a linear regression model and estimate the expected passengers for a Promotion_Budget of 650,000
##Regression Model promotion and passengers count
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(air[["Promotion_Budget"]], air[["Passengers"]])
predictions = lr.predict([[650000]])  # sklearn expects a 2-D array: one row, one feature
predictions
Out[11]:
In [12]:
import statsmodels.formula.api as sm
model = sm.ols(formula='Passengers ~ Promotion_Budget', data=air)
fitted1 = model.fit()
In [13]:
fitted1.summary()
Out[13]:
In [14]:
# Build a regression line to predict the passengers using Inter_metro_flight_ratio
plt.scatter(air.Inter_metro_flight_ratio,air.Passengers)
Out[14]:
In [15]:
import sklearn as sk
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(air[["Inter_metro_flight_ratio"]], air[["Passengers"]])
Out[15]:
In [16]:
predictions = lr.predict(air[["Inter_metro_flight_ratio"]])
In [17]:
import statsmodels.formula.api as sm
model = sm.ols(formula='Passengers ~ Inter_metro_flight_ratio', data=air)
fitted2 = model.fit()
In [18]:
fitted2.summary()
Out[18]:
How good is my regression line?
- Take an (x, y) point from the data.
- Substitute x into the regression line to get a prediction $\hat{y}$.
- If the regression line is a good fit, then we expect $\hat{y} = y$, i.e. $(y - \hat{y}) = 0$.
- At every point of x, if we repeat the same, then we get multiple error values $(y_i - \hat{y}_i)$.
- Some of them might be positive, some of them might be negative, so we take the square of all such errors.
- For a good model, we need SSE to be zero or near zero.
- A standalone SSE value does not tell us much. For example, SSE = 100 is very small when y varies in the thousands; the same value is very large when y varies in decimals.
- We have to consider the variance of y while assessing the accuracy of the regression line.
- Error Sum of Squares: $SSE = \sum_{i}(y_i - \hat{y}_i)^2$
- Total Variance in Y: $SST = \sum_{i}(y_i - \bar{y})^2$
- So, total variance in Y is divided into two parts,
- Variance that cannot be explained by x (error)
- Variance that can be explained by x, using regression
Explained and Unexplained Variation
- So, the total variance in Y is divided into two parts:
- Variance that can be explained by x, using regression: $SSR = \sum_{i}(\hat{y}_i - \bar{y})^2$
- Variance that cannot be explained by x (error): $SSE = \sum_{i}(y_i - \hat{y}_i)^2$
- SST = SSE + SSR
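A minimal sketch of this decomposition, assuming the air DataFrame and the fitted1 model (Passengers ~ Promotion_Budget) from the cells above:

```python
# Decompose the total variation in Passengers for the simple regression fitted above.
import numpy as np

y = air["Passengers"]
y_hat = fitted1.fittedvalues           # predictions from the statsmodels fit

sse = np.sum((y - y_hat) ** 2)         # variation not explained by the model
ssr = np.sum((y_hat - y.mean()) ** 2)  # variation explained by the regression
sst = np.sum((y - y.mean()) ** 2)      # total variation in y

print(sse + ssr, sst)                  # the two should be (almost) equal
print(ssr / sst, fitted1.rsquared)     # SSR/SST should match the reported R-squared
```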
R-Squared
- A good fit will have
- SSE (Minimum or Maximum?)
- SSR (Minimum or Maximum?)
- And we know SST= SSE + SSR
- SSE/SST(Minimum or Maximum?)
- SSR/SST(Minimum or Maximum?)
- The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable
- The coefficient of determination is also called R-squared and is denoted as $R^2$, where $0 \le R^2 \le 1$: $R^2 = \dfrac{SSR}{SST} = 1 - \dfrac{SSE}{SST}$
LAB: R-Square
- What is the R-square value of Passengers vs Promotion_Budget model?
- What is the R-square value of Passengers vs Inter_metro_flight_ratio model?
In [19]:
#What is the R-square value of Passengers vs Promotion_Budget model?
fitted1.summary()
Out[19]:
In [20]:
#What is the R-square value of Passengers vs Inter_metro_flight_ratio
fitted2.summary()
Out[20]:
Multiple Regression
- Using multiple predictor variables instead of single variable
- With multiple predictors, we need to find the best-fitting plane (or hyperplane) rather than a line
Code – Multiple Regression
In [21]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(air[["Promotion_Budget"]+["Inter_metro_flight_ratio"]], air[["Passengers"]])
Out[21]:
In [22]:
predictions = lr.predict(air[["Promotion_Budget"]+["Inter_metro_flight_ratio"]])
In [23]:
import statsmodels.formula.api as sm
model = sm.ols(formula='Passengers ~ Promotion_Budget+Inter_metro_flight_ratio', data=air)
fitted = model.fit()
fitted.summary()
Out[23]:
Individual Impact of variables
- Look at the P-value
- The P-value is the probability of observing a coefficient this large purely by chance, assuming the true coefficient is zero.
- Individual variable coefficients are tested for significance.
- The estimated beta coefficients follow a t-distribution.
- Individual P-values tell us about the significance of each variable.
- A variable is significant if its P-value is less than 5%. The lower the P-value, the more significant the variable.
- Note: It is possible for all the variables in a regression together to produce a great fit, and yet very few of the variables to be individually significant.
To test $H_0: \beta_j = 0$ against $H_1: \beta_j \ne 0$
Test statistic: $t = \dfrac{\hat{\beta}_j}{SE(\hat{\beta}_j)}$
Reject $H_0$ if $|t| > t_{\alpha/2,\, n-k-1}$, or equivalently if the p-value is less than $\alpha$
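A small sketch of reading these individual coefficient tests off a fitted statsmodels result, assuming the multiple-regression object fitted from the cell above:

```python
# Coefficient estimates, t statistics and p-values from a fitted statsmodels OLS model.
print(fitted.params)    # estimated beta coefficients
print(fitted.tvalues)   # t statistics for H0: beta_j = 0
print(fitted.pvalues)   # corresponding p-values

# Variables whose p-value exceeds 0.05 are candidates for removal.
print(fitted.pvalues[fitted.pvalues > 0.05])
```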
LAB: Multiple Regression
- Build a multiple regression model to predict the number of passengers
- What is R-square value
- Are there any predictor variables that are not impacting the dependent variable
In [24]:
#Build a multiple regression model to predict the number of passengers
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(air[["Promotion_Budget"]+["Inter_metro_flight_ratio"]+["Service_Quality_Score"]], air[["Passengers"]])
predictions = lr.predict(air[["Promotion_Budget"]+["Inter_metro_flight_ratio"]+["Service_Quality_Score"]])
In [25]:
import statsmodels.formula.api as sm
model = sm.ols(formula='Passengers ~ Promotion_Budget+Inter_metro_flight_ratio+Service_Quality_Score', data=air)
fitted = model.fit()
fitted.summary()
Out[25]:
- What is R-square value
0.951
- Are there any predictor variables that are not impacting the dependent variable
Inter_metro_flight_ratio
Adjusted R-Squared
- Is it good to have as many independent variables as possible? Nope
- R-square is deceptive. R-squared value never decreases when a new X variable is added to the model – True?
- We need a better measure or an adjustment to the original R-squared formula.
- Adjusted R squared
- Its value depends on the number of explanatory variables
- Imposes a penalty for adding additional explanatory variables
- It is usually written as $R^2_{adj}$ (or $\bar{R}^2$): $R^2_{adj} = 1 - \dfrac{(1 - R^2)(n - 1)}{n - k - 1}$
- It can be very different from $R^2$ when there are too many predictors and n is small
- where n is the number of observations and k is the number of predictors
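A minimal sketch of the adjustment as a helper function (the numbers below are hypothetical, chosen only to show the penalty at work):

```python
# Adjusted R-squared from R-squared, the number of observations n and the number of predictors k.
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(0.717, 12, 6))   # adding predictors with little signal pulls the adjusted value down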
LAB: Adjusted R-Square
- Dataset: "Adjusted RSquare/Adj_Sample.csv"
- Build a model to predict y using x1,x2 and x3. Note down R-Square and Adj R-Square values
- Build a model to predict y using x1,x2,x3,x4,x5 and x6. Note down R-Square and Adj R-Square values
- Build a model to predict y using x1,x2,x3,x4,x5,x6,x7 and x8. Note down R-Square and Adj R-Square values
In [26]:
adj_sample=pd.read_csv("Datasets/Adjusted RSquare/Adj_Sample.csv")
adj_sample.shape
Out[26]:
In [27]:
adj_sample.columns.values
Out[27]:
In [28]:
#Build a model to predict y using x1,x2 and x3. Note down R-Square and Adj R-Square values
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(adj_sample[["x1"]+["x2"]+["x3"]], adj_sample[["Y"]])
predictions = lr.predict(adj_sample[["x1"]+["x2"]+["x3"]])
In [29]:
import statsmodels.formula.api as sm
model = sm.ols(formula='Y ~ x1+x2+x3', data=adj_sample)
fitted1 = model.fit()
fitted1.summary()
Out[29]:
In [30]:
#Build a model to predict y using x1,x2,x3,x4,x5 and x6. Note down R-Square and Adj R-Square values
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(adj_sample[["x1"]+["x2"]+["x3"]+["x4"]+["x5"]+["x6"]], adj_sample[["Y"]])
predictions = lr.predict(adj_sample[["x1"]+["x2"]+["x3"]+["x4"]+["x5"]+["x6"]])
In [31]:
import statsmodels.formula.api as sm
model = sm.ols(formula='Y ~ x1+x2+x3+x4+x5+x6', data=adj_sample)
fitted2 = model.fit()
fitted2.summary()
Out[31]:
In [32]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(adj_sample[["x1"]+["x2"]+["x3"]+["x4"]+["x5"]+["x6"]+["x7"]+["x8"]], adj_sample[["Y"]])
predictions = lr.predict(adj_sample[["x1"]+["x2"]+["x3"]+["x4"]+["x5"]+["x6"]+["x7"]+["x8"]])
In [33]:
import statsmodels.formula.api as sm
model = sm.ols(formula='Y ~ x1+x2+x3+x4+x5+x6+x7+x8', data=adj_sample)
fitted3 = model.fit()
fitted3.summary()
Out[33]:
| Model | R-Squared | Adj R-Squared |
|---|---|---|
| Model1 | 0.684 | 0.566 |
| Model2 | 0.717 | 0.377 |
| Model3 | 0.805 | 0.285 |
R-Squared vs Adjusted R-Squared
We have built three models on the Adj_Sample data (Model1, Model2 and Model3) with different numbers of variables. As more variables are added, R-squared keeps increasing while adjusted R-squared drops.
LAB: Multiple Regression - Issues
- Import Final Exam Score data
- Build a model to predict final score using the rest of the variables.
- How are Sem2_Math & Final score related? As Sem2_Math score increases, what happens to Final score?
- Remove “Sem1_Math” variable from the model and rebuild the model
- Is there any change in R square or Adj R square
- How are Sem2_Math & Final score related now? As Sem2_Math score increases, what happens to Final score?
- Draw a scatter plot between Sem1_Math & Sem2_Math
- Find the correlation between Sem1_Math & Sem2_Math
In [34]:
#Import Final Exam Score data
final_exam=pd.read_csv("Datasets/Final Exam/Final Exam Score.csv")
In [35]:
#Size of the data
final_exam.shape
Out[35]:
In [36]:
#Variable names
final_exam.columns
Out[36]:
In [37]:
#Build a model to predict final score using the rest of the variables.
from sklearn.linear_model import LinearRegression
lr1 = LinearRegression()
lr1.fit(final_exam[["Sem1_Science", "Sem2_Science", "Sem1_Math", "Sem2_Math"]], final_exam[["Final_exam_marks"]])
predictions1 = lr1.predict(final_exam[["Sem1_Science", "Sem2_Science", "Sem1_Math", "Sem2_Math"]])
import statsmodels.formula.api as sm
model1 = sm.ols(formula='Final_exam_marks ~ Sem1_Science+Sem2_Science+Sem1_Math+Sem2_Math', data=final_exam)
fitted1 = model1.fit()
fitted1.summary()
Out[37]:
In [38]:
fitted1.rsquared
Out[38]:
- How are Sem2_Math & Final score related? As Sem2_Math score increases, what happens to Final score?
As the Sem2_Math score increases, the Final score decreases (the coefficient of Sem2_Math in this model is negative).
In [39]:
#Remove "Sem1_Math" variable from the model and rebuild the model
from sklearn.linear_model import LinearRegression
lr2 = LinearRegression()
lr2.fit(final_exam[["Sem1_Science", "Sem2_Science", "Sem2_Math"]], final_exam[["Final_exam_marks"]])
predictions2 = lr2.predict(final_exam[["Sem1_Science", "Sem2_Science", "Sem2_Math"]])
import statsmodels.formula.api as sm
model2 = sm.ols(formula='Final_exam_marks ~ Sem1_Science+Sem2_Science+Sem2_Math', data=final_exam)
fitted2 = model2.fit()
fitted2.summary()
Out[39]:
- Is there any change in R square or Adj R square
| Model | R-Squared | Adj R-Squared |
|---|---|---|
| model1 | 0.990 | 0.987 |
| model2 | 0.981 | 0.978 |
- How are Sem2_Math & Final score related now? As Sem2_Math score increases, what happens to Final score?
As the Sem2_Math score increases, the Final score also increases (the coefficient of Sem2_Math is now positive).
In [40]:
#Draw a scatter plot between Sem1_Math & Sem2_Math
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(final_exam.Sem1_Math,final_exam.Sem2_Math)
Out[40]:
In [41]:
#Find the correlation between Sem1_Math & Sem2_Math
np.corrcoef(final_exam.Sem1_Math,final_exam.Sem2_Math)
Out[41]:
Multicollinearity
- Multiple regression is wonderful – It allows you to consider the effect of multiple variables simultaneously.
- Multiple regression is extremely unpleasant – Because it allows you to consider the effect of multiple variables simultaneously.
- The relationships between the explanatory variables are the key to understanding multiple regression.
- Multicollinearity (or inter correlation) exists when at least some of the predictor variables are correlated among themselves.
- The parameter estimates will have inflated variance in the presence of multicollinearity.
- Sometimes the signs of the parameter estimates tend to change
- If the relation between the independent variables grows really strong, then the variance of parameter estimates tends to be infinity – Can you prove it?
Multicollinearity Detection
- Build a model X1 vs X2, X3, X4 and find its $R^2$, say $R_1^2$
- Build a model X2 vs X1, X3, X4 and find its $R^2$, say $R_2^2$
- Build a model X3 vs X1, X2, X4 and find its $R^2$, say $R_3^2$
- Build a model X4 vs X1, X2, X3 and find its $R^2$, say $R_4^2$
- For example, if $R_3^2$ is 95%, then we don't really need X3 in the model
- Since it can be explained as a linear combination of the other three
- For each predictor variable we find its individual $R^2$; $VIF = \dfrac{1}{1 - R^2}$ is called the Variance Inflation Factor
- The VIF option in SAS automatically calculates VIF values for each of the predictor variables
| $R^2$ | 40% | 50% | 60% | 70% | 75% | 80% | 90% |
|---|---|---|---|---|---|---|---|
| VIF | 1.67 | 2.00 | 2.50 | 3.33 | 4.00 | 5.00 | 10.00 |
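As an alternative to the hand-rolled vif_cal function used in the lab below, statsmodels also ships a VIF helper; a minimal sketch, assuming the final_exam DataFrame loaded earlier:

```python
# VIF via statsmodels' built-in helper; each predictor is referenced by its column index.
import statsmodels.api as sm_api
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = final_exam.drop(["Final_exam_marks"], axis=1)
X = sm_api.add_constant(X)   # add the intercept column expected by the VIF computation

for i, col in enumerate(X.columns):
    if col == "const":
        continue
    print(col, round(variance_inflation_factor(X.values, i), 2))
```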
LAB: Multicollinearity
- Identify the Multicollinearity in the Final Exam Score model.
- Drop the variables one by one to reduce the multicollinearity.
- Identify and eliminate the Multicollinearity in the Air passengers model.
In [42]:
from sklearn.linear_model import LinearRegression
lr1 = LinearRegression()
lr1.fit(final_exam[["Sem1_Science"]+["Sem2_Science"]+["Sem1_Math"]+["Sem2_Math"]], final_exam[["Final_exam_marks"]])
predictions1 = lr1.predict(final_exam[["Sem1_Science"]+["Sem2_Science"]+["Sem1_Math"]+["Sem2_Math"]])
In [43]:
import statsmodels.formula.api as sm
model1 = sm.ols(formula='Final_exam_marks ~ Sem1_Science+Sem2_Science+Sem1_Math+Sem2_Math', data=final_exam)
fitted1 = model1.fit()
fitted1.summary()
Out[43]:
In [44]:
fitted1.summary2()
Out[44]:
In [45]:
#Code for VIF Calculation
#Writing a function to calculate the VIF values
def vif_cal(input_data, dependent_col):
    x_vars = input_data.drop([dependent_col], axis=1)
    xvar_names = x_vars.columns
    for i in range(0, xvar_names.shape[0]):
        y = x_vars[xvar_names[i]]
        x = x_vars[xvar_names.drop(xvar_names[i])]
        rsq = sm.ols(formula="y~x", data=x_vars).fit().rsquared
        vif = round(1/(1 - rsq), 2)
        print(xvar_names[i], " VIF = ", vif)
In [46]:
#Calculating VIF values using that function
vif_cal(input_data=final_exam, dependent_col="Final_exam_marks")
In [47]:
import statsmodels.formula.api as sm
model2 = sm.ols(formula='Final_exam_marks ~ Sem1_Science+Sem2_Science+Sem2_Math', data=final_exam)
fitted2 = model2.fit()
fitted2.summary()
Out[47]:
In [48]:
vif_cal(input_data=final_exam.drop(["Sem1_Math"], axis=1), dependent_col="Final_exam_marks")
In [49]:
vif_cal(input_data=final_exam.drop(["Sem1_Math","Sem1_Science"], axis=1), dependent_col="Final_exam_marks")
In [50]:
#Identify and eliminate the Multicollinearity in the Air passengers model
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(air[["Promotion_Budget"]+["Inter_metro_flight_ratio"]+["Service_Quality_Score"]], air[["Passengers"]])
predictions = lr.predict(air[["Promotion_Budget"]+["Inter_metro_flight_ratio"]+["Service_Quality_Score"]])
import statsmodels.formula.api as sm
model = sm.ols(formula='Passengers ~ Promotion_Budget+Inter_metro_flight_ratio+Service_Quality_Score', data=air)
fitted = model.fit()
fitted.summary()
Out[50]:
In [51]:
air.columns.values
Out[51]:
In [52]:
#Calculating VIF values using that function
vif_cal(input_data=air.drop(["Holiday_week","Delayed_Cancelled_flight_ind", "Bad_Weather_Ind", "Technical_issues_ind"], axis=1), dependent_col="Passengers")
Note: For calculating VIF, all the variables have to be numerical, i.e., no categorical variables. Categorical variables should either be dropped or converted into numerical (dummy) variables.
In [53]:
air
Out[53]:
LAB: Multiple Regression (Webpage Product Sales)
- Dataset: Webpage_Product_Sales/Webpage_Product_Sales.csv
- Build a model to predict sales using the rest of the variables
- Drop the less impactful variables based on p-values.
- Is there any multicollinearity?
- How many variables are there in the final model?
- What is the R-squared of the final model?
- Can you improve the model using the same data and variables?
In [54]:
import pandas as pd
Webpage_Product_Sales=pd.read_csv("Datasets/Webpage_Product_Sales/Webpage_Product_Sales.csv")
Webpage_Product_Sales.shape
Out[54]:
In [55]:
Webpage_Product_Sales.columns
Out[55]:
In [56]:
import statsmodels.formula.api as sm
model1 = sm.ols(formula='Sales ~ Web_UI_Score+Server_Down_time_Sec+Holiday+Special_Discount+Clicks_From_Serach_Engine+Online_Ad_Paid_ref_links+Social_Network_Ref_links+Month+Weekday+DayofMonth', data=Webpage_Product_Sales)
fitted1 = model1.fit()
fitted1.summary()
Out[56]:
In [57]:
#VIF
vif_cal(Webpage_Product_Sales,"Sales")
In [58]:
##Dropped Clicks_From_Serach_Engine based on VIF
import statsmodels.formula.api as sm
model2 = sm.ols(formula='Sales ~ Web_UI_Score+Server_Down_time_Sec+Holiday+Special_Discount+Online_Ad_Paid_ref_links+Social_Network_Ref_links+Month+Weekday+DayofMonth', data=Webpage_Product_Sales)
fitted2 = model2.fit()
fitted2.summary()
Out[58]:
In [59]:
#VIF for the updated model
vif_cal(Webpage_Product_Sales.drop(["Clicks_From_Serach_Engine"],axis=1),"Sales")
In [60]:
##Drop the less impacting variables based on p-values.
##Dropped Web_UI_Score based on P-value
import statsmodels.formula.api as sm
model3 = sm.ols(formula='Sales ~ Server_Down_time_Sec+Holiday+Special_Discount+Online_Ad_Paid_ref_links+Social_Network_Ref_links+Month+Weekday+DayofMonth', data=Webpage_Product_Sales)
fitted3 = model3.fit()
fitted3.summary()
Out[60]:
In [61]:
#How many variables are there in the final model?
8
Out[61]:
In [62]:
#What is the R-squared of the final model?
fitted3.rsquared
Out[62]:
Interaction Terms
- Adding interaction terms might help in improving the prediction accuracy of the model.
- Adding interaction terms needs prior knowledge of the dataset and its variables (see the sketch below).
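In a statsmodels/patsy formula, A*B expands to the two main effects plus their interaction (A + B + A:B), while A:B adds only the product term. A small sketch, assuming the Webpage_Product_Sales DataFrame loaded in the lab above and a hypothetical choice of interacting variables:

```python
# Compare a main-effects-only model with one that adds a Holiday x Special_Discount interaction.
import statsmodels.formula.api as sm

m_main = sm.ols(formula='Sales ~ Holiday + Special_Discount', data=Webpage_Product_Sales).fit()
m_inter = sm.ols(formula='Sales ~ Holiday * Special_Discount', data=Webpage_Product_Sales).fit()

print(m_main.rsquared_adj, m_inter.rsquared_adj)   # does the interaction term improve the fit?
```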
LAB: Interaction Terms
- Add few interaction terms to above web product sales model and see the increase in the accuracy
In [63]:
import statsmodels.formula.api as sm
model4 = sm.ols(formula='Sales ~ Server_Down_time_Sec+Holiday+Special_Discount+Online_Ad_Paid_ref_links+Social_Network_Ref_links+Month+Weekday+DayofMonth+Holiday*Weekday', data=Webpage_Product_Sales)
fitted4 = model4.fit()
fitted4.summary()
Out[63]:
Conclusion – Regression
- In this chapter we discussed simple and multiple regression: how to build simple and multiple linear regression models, which metrics in the regression output matter most, what multicollinearity is, how to detect and eliminate it, what R-squared and adjusted R-squared are and how they differ, and how to assess the individual impact of each variable.
- This is a basic regression class; once you have a good grasp of regression, you can explore advanced topics such as adding polynomial and interaction terms to your regression model.
- Adjusted R-squared is a good measure of training (in-sample) error. We cannot be sure about final model performance based on it alone; we may have to perform cross-validation to get an idea of the testing error.
- We will talk about cross-validation in more detail in future lectures.
- Outliers can influence the regression line, so we need to sanitize the data before building it; at the end of the day these are all mathematical formulas, and if the wrong data goes in, we get the wrong results. Data cleaning is very important before getting into regression.