
Handout – Linear Regression

 

Before we start the lesson, please download the datasets.

Regression

Contents

  • Correlation
  • Simple Regression
  • R-Squared
  • Multiple Regression
  • Adj R-Squared
  • P-value
  • Multicollinearity
  • Interaction terms

Correlation

What is the need for correlation?

  • Is there any association between hours of study and grades?
  • Is there any association between number of temples in a city & murder rate?
  • What happens to sweater sales with increase in temperature? What is the strength of association between them?
  • What happens to ice-cream sales vs. temperature? What is the strength of association between them?
  • How to quantify the association?
  • Which of the above examples has very strong association?
  • Correlation answers these questions: it quantifies the association

Correlation coefficient

  • It is a measure of linear association
  • r is the ratio of the covariance of X and Y to the square root of the product of their individual variances

Correlation coefficient (r) = \frac{Cov(X, Y)}{\sqrt{Var(X) \cdot Var(Y)}}

  • Correlation 0: No linear association
  • Correlation 0 to 0.25: Negligible positive association
  • Correlation 0.25 to 0.5: Weak positive association
  • Correlation 0.5 to 0.75: Moderate positive association
  • Correlation > 0.75: Very strong positive association
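As a quick sanity check, r can be computed by hand from the covariance and the two variances and compared with np.corrcoef. This is a minimal sketch using small illustrative arrays (not the course dataset):

import numpy as np

# Illustrative arrays (not the course dataset)
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([65.0, 70.0, 78.0, 85.0, 92.0])

# r = Cov(X, Y) / sqrt(Var(X) * Var(Y)), using sample (ddof=1) estimates
cov_xy = np.cov(x, y)[0, 1]
r_manual = cov_xy / np.sqrt(np.var(x, ddof=1) * np.var(y, ddof=1))

print(r_manual)                  # same value as np.corrcoef(x, y)[0, 1]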

LAB – Correlation Calculation

  • Dataset: AirPassengers/AirPassengers.csv
  • Find the correlation between number of passengers and promotional budget.
  • Draw a scatter plot between number of passengers and promotional budget.
  • Find the correlation between number of passengers and Service_Quality_Score.
In [1]:
import pandas as pd
air = pd.read_csv("Datasets/AirPassengers/AirPassengers.csv")
air.shape
Out[1]:
(80, 9)
In [2]:
#Name of the columns in the dataset:
air.columns.values
Out[2]:
array([ 'Week_num', 'Passengers', 'Promotion_Budget',
	'Service_Quality_Score', 'Holiday_week',
	'Delayed_Cancelled_flight_ind', 'Inter_metro_flight_ratio',
	'Bad_Weather_Ind', 'Technical_issues_ind'], dtype=object)
In [3]:
#Find the correlation between number of passengers and promotional budget.
import numpy as np
np.corrcoef(air.Passengers,air.Promotion_Budget)
Out[3]:
array([[ 1.,0.96585103],
	[ 0.96585103,1.]])
In [4]:
#Draw a scatter plot between number of passengers and promotional budget
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(air.Passengers, air.Promotion_Budget)
Out[4]:
<matplotlib.collections.PathCollection at 0x28c0eb906a0>
In [5]:
#Find the correlation between number of passengers and Service_Quality_Score
np.corrcoef(air.Passengers,air.Service_Quality_Score)
Out[5]:
array([[ 1., -0.88653002],
[-0.88653002,  1.]])

Beyond Pearson Correlation

  • Correlation coefficient measures for different types of data
Y \ X                       Quantitative/Continuous X   Ordinal/Ranked/Discrete X      Nominal/Categorical X
Quantitative Y              Pearson r                   Biserial r_b                   Point Biserial r_pb
Ordinal/Ranked/Discrete Y   Biserial r_b                Spearman rho / Kendall's tau   Rank Biserial r_rb
Nominal/Categorical Y       Point Biserial r_pb         Rank Biserial r_rb             Phi, Contingency Coeff, V
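For example, for ranked/ordinal data the Spearman rank correlation can be computed with scipy. This is a small illustrative sketch with toy arrays (not the course data); scipy.stats.spearmanr returns the correlation and a p-value:

import numpy as np
from scipy import stats

# Toy ranked data: hours of study vs. class rank (1 = best), illustrative values only
hours = np.array([1, 3, 4, 6, 8, 9])
rank  = np.array([6, 5, 4, 3, 2, 1])

rho, p_value = stats.spearmanr(hours, rank)
print(rho, p_value)              # rho close to -1: more study hours, better (lower) rank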

From Correlation to Regression

  • Correlation is just a measure of association
  • It can’t be used for prediction.
  • Given the predictor variable, we can’t estimate the dependent variable.
  • In the air passengers example, given the promotion budget, we can’t get the estimated value of passengers
  • We need a model, an equation, a fit for the data.
  • That is known as the regression line

What is Regression

  • A regression line is a mathematical formula that quantifies the general relation between a predictor/independent variable (or known variable x) and the target/dependent variable (or the unknown variable y).
  • Below is the regression line. If we have the data of x and y, then we can build a model to generalize their relation

y = \beta_0 + \beta_1 x

  • What is the best fit for our data?
    • The one which goes through the core of the data
    • The one which minimizes the error

Regression Line Fitting and Error

Minimizing the error

  • The best line will have the minimum error
  • Some errors are positive and some errors are negative. Taking their sum is not a good idea
  • We can either minimize the squared sum of errors Or we can minimize the absolute sum of errors
  • Squared sum of errors is mathematically convenient to minimize
  • The method of minimizing the squared sum of errors is called the least squares method of regression

Least Squares Estimation

  • X: x_1, x_2, x_3,... x_n
  • Y: y_1, y_2, y_3,... y_n
  • Imagine a line through all the points
  • Deviation from each point (residual or error)
  • Square of the deviation
  • Minimizing sum of squares of deviation

\sum e^2 = \sum (y - \hat{y})^2 = \sum \left(y - (\beta_0 + \beta_1 x)\right)^2

  • \beta_0 and \beta_1 are obtained by minimizing the sum of the squared residuals
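For a single predictor this minimization has a closed-form solution: \beta_1 = Cov(x, y)/Var(x) and \beta_0 = \bar{y} - \beta_1 \bar{x}. A minimal sketch with illustrative arrays (not the AirPassengers data), checked against numpy's own least squares fit:

import numpy as np

# Illustrative data (not the AirPassengers dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 7.9, 10.1])

# Closed-form least squares estimates for a single predictor
beta_1 = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
beta_0 = y.mean() - beta_1 * x.mean()

# The same line from numpy's polynomial least squares fit
slope, intercept = np.polyfit(x, y, deg=1)
print(beta_0, beta_1)
print(intercept, slope)          # identical values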

LAB: Regression Line Fitting

  • Dataset: AirPassengers/AirPassengers.csv
  • Find the correlation between Promotion_Budget and Passengers
  • Draw a scatter plot between Promotion_Budget and Passengers. Is there any pattern between Promotion_Budget and Passengers?
  • Build a linear regression model on Promotion_Budget and Passengers.
  • Build a regression line to predict the passengers using Inter_metro_flight_ratio
In [6]:
import pandas as pd
air = pd.read_csv("Datasets/AirPassengers/AirPassengers.csv")
air.shape
Out[6]:
(80, 9)
In [7]:
air.columns.values
Out[7]:
array(['Week_num', 'Passengers', 'Promotion_Budget',
       'Service_Quality_Score', 'Holiday_week',
       'Delayed_Cancelled_flight_ind', 'Inter_metro_flight_ratio',
'Bad_Weather_Ind', 'Technical_issues_ind'], dtype=object)
In [8]:
air.head(5)
Out[8]:
Week_num Passengers Promotion_Budget Service_Quality_Score Holiday_week Delayed_Cancelled_flight_ind Inter_metro_flight_ratio Bad_Weather_Ind Technical_issues_ind
0 1 37824 517356 4.00000 NO NO 0.70 YES YES
1 2 43936 646086 2.67466 NO YES 0.80 YES YES
2 3 42896 638330 3.29473 NO NO 0.90 NO NO
3 4 35792 506492 3.85684 NO NO 0.40 NO NO
4 5 38624 609658 3.90757 NO NO 0.87 NO YES
In [9]:
# Find the correlation between Promotion_Budget and Passengers
import numpy as np
np.corrcoef(air.Passengers,air.Promotion_Budget)
Out[9]:
array([[ 1.,  0.96585103],
[ 0.96585103, 1.]])
In [10]:
# Draw a scatter plot between Promotion_Budget and Passengers. Is there any pattern between Promotion_Budget and Passengers?
import matplotlib.pyplot as plt
%matplotlib inline 
plt.scatter(air.Passengers, air.Promotion_Budget)
Out[10]:
<matplotlib.collections.PathCollection at 0x28c0f0f0e48>
In [11]:
#Build a linear regression model and estimate the expected passengers for a Promotion_Budget of 65,000
##Regression Model: Promotion_Budget vs. Passengers
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(air[["Promotion_Budget"]], air[["Passengers"]])
predictions = lr.predict([[65000]])   # predict() expects a 2-D array of samples
predictions
Out[11]:
array([[ 5779.03537577]])
In [12]:
import statsmodels.formula.api as sm
model = sm.ols(formula='Passengers ~ Promotion_Budget', data=air)
fitted1 = model.fit()
In [13]:
fitted1.summary()
Out[13]:
OLS Regression Results
Dep. Variable: Passengers R-squared: 0.933
Model: OLS Adj. R-squared: 0.932
Method: Least Squares F-statistic: 1084.
Date: Tue, 14 Feb 2017 Prob (F-statistic): 1.66e-47
Time: 17:26:26 Log-Likelihood: -751.34
No. Observations: 80 AIC: 1507.
Df Residuals: 78 BIC: 1511.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 1259.6058 1361.071 0.925 0.358 -1450.078 3969.290
Promotion_Budget 0.0695 0.002 32.923 0.000 0.065 0.074
Omnibus: 26.624 Durbin-Watson: 1.831
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5.188
Skew: -0.128 Prob(JB): 0.0747
Kurtosis: 1.779 Cond. No. 2.67e+06
In [14]:
# Build a regression line to predict the passengers using Inter_metro_flight_ratio
plt.scatter(air.Inter_metro_flight_ratio,air.Passengers)
Out[14]:
<matplotlib.collections.PathCollection at 0x28c10f92630>
In [15]:
import sklearn as sk
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(air[["Inter_metro_flight_ratio"]], air[["Passengers"]])
Out[15]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [16]:
predictions = lr.predict(air[["Inter_metro_flight_ratio"]])
In [17]:
import statsmodels.formula.api as sm
model = sm.ols(formula='Passengers ~ Inter_metro_flight_ratio', data=air)
fitted2 = model.fit()
In [18]:
fitted2.summary()
Out[18]:
OLS Regression Results
Dep. Variable: Passengers R-squared: 0.242
Model: OLS Adj. R-squared: 0.232
Method: Least Squares F-statistic: 24.90
Date: Tue, 14 Feb 2017 Prob (F-statistic): 3.58e-06
Time: 17:26:34 Log-Likelihood: -848.30
No. Observations: 80 AIC: 1701.
Df Residuals: 78 BIC: 1705.
Df Model: 1
Covariance Type: nonrobust

 

coef std err t P>|t| [95.0% Conf. Int.]
Intercept 2.044e+04 4993.747 4.093 0.000 1.05e+04 3.04e+04
Inter_metro_flight_ratio 3.507e+04 7027.768 4.990 0.000 2.11e+04 4.91e+04

 

Omnibus: 10.172 Durbin-Watson: 1.385
Prob(Omnibus): 0.006 Jarque-Bera (JB): 10.098
Skew: 0.822 Prob(JB): 0.00641
Kurtosis: 3.573 Cond. No. 9.48

How good is my regression line?

  • Take an (x, y) point from the data.
  • Substitute x in the regression line to get a prediction y_{pred}
  • If the regression line is a good fit, then we expect y_{pred} = y, i.e. (y - y_{pred}) = 0
  • Repeating this at every point of x gives multiple error values (y - y_{pred}).
  • Some of them might be positive and some negative, so we take the square of all such errors

SSE = \sum (y - \hat{y})^2

  • For a good model, we need SSE to be zero or near to zero
  • A standalone SSE value does not tell us much. For example, SSE = 100 is very small when y varies in the thousands, but the same value is very large when y varies in decimals.
  • We have to consider variance of y, while calculating the regression line accuracy
  • Error sum of squares (SSE, sum of squares of error): SSE = \sum (y - \hat{y})^2
  • Total variance in Y (SST, sum of squares total): SST = \sum (y - \bar{y})^2
  • SST = \sum (y - \hat{y} + \hat{y} - \bar{y})^2 = \sum (y - \hat{y})^2 + \sum (\hat{y} - \bar{y})^2 = SSE + SSR
  • So, total variance in Y is divided into two parts,
    • Variance that cannot be explained by x (error)
    • Variance that can be explained by x, using regression

Explained and Unexplained Variation

  • So, total variance in Y is divided into two parts,
    • Variance that can be explained by x, using regression
    • Variance that cannot be explained by x: SST = SSE + SSR

Total sum of squares = Sum of squares error + Sum of squares regression

SST = \sum (y - \bar{y})^2,  SSE = \sum (y - \hat{y})^2,  SSR = \sum (\hat{y} - \bar{y})^2

R-Squared

  • A good fit will have
    • SSE (Minimum or Maximum?)
    • SSR (Minimum or Maximum?)
    • And we know SST= SSE + SSR
    • SSE/SST(Minimum or Maximum?)
    • SSR/SST(Minimum or Maximum?)
  • The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable
  • The coefficient of determination is also called R-squared and is denoted as R^2

R^2 = \frac{SSR}{SST}, where 0 \le R^2 \le 1
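These quantities can be read off a fitted model directly. A minimal sketch, reusing the simple Passengers ~ Promotion_Budget model (fitted1) and the air data built in the lab above:

import numpy as np

y     = air.Passengers
y_hat = fitted1.fittedvalues             # predictions from Passengers ~ Promotion_Budget

sse = np.sum((y - y_hat) ** 2)           # unexplained variation
ssr = np.sum((y_hat - y.mean()) ** 2)    # variation explained by the regression
sst = np.sum((y - y.mean()) ** 2)        # total variation in y

print(ssr / sst)                         # matches fitted1.rsquared (about 0.933)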

LAB: R-Square

  • What is the R-square value of Passengers vs Promotion_Budget model?
  • What is the R-square value of the Passengers vs Inter_metro_flight_ratio model?
In [19]:
#What is the R-square value of Passengers vs Promotion_Budget model?
fitted1.summary()
Out[19]:
OLS Regression Results
Dep. Variable: Passengers R-squared: 0.933
Model: OLS Adj. R-squared: 0.932
Method: Least Squares F-statistic: 1084.
Date: Tue, 14 Feb 2017 Prob (F-statistic): 1.66e-47
Time: 17:26:45 Log-Likelihood: -751.34
No. Observations: 80 AIC: 1507.
Df Residuals: 78 BIC: 1511.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 1259.6058 1361.071 0.925 0.358 -1450.078 3969.290
Promotion_Budget 0.0695 0.002 32.923 0.000 0.065 0.074
Omnibus: 26.624 Durbin-Watson: 1.831
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5.188
Skew: -0.128 Prob(JB): 0.0747
Kurtosis: 1.779 Cond. No. 2.67e+06
In [20]:
#What is the R-square value of Passengers vs Inter_metro_flight_ratio
fitted2.summary()
Out[20]:
OLS Regression Results
Dep. Variable: Passengers R-squared: 0.242
Model: OLS Adj. R-squared: 0.232
Method: Least Squares F-statistic: 24.90
Date: Tue, 14 Feb 2017 Prob (F-statistic): 3.58e-06
Time: 17:26:47 Log-Likelihood: -848.30
No. Observations: 80 AIC: 1701.
Df Residuals: 78 BIC: 1705.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 2.044e+04 4993.747 4.093 0.000 1.05e+04 3.04e+04
Inter_metro_flight_ratio 3.507e+04 7027.768 4.990 0.000 2.11e+04 4.91e+04
Omnibus: 10.172 Durbin-Watson: 1.385
Prob(Omnibus): 0.006 Jarque-Bera (JB): 10.098
Skew: 0.822 Prob(JB): 0.00641
Kurtosis: 3.573 Cond. No. 9.48

Multiple Regression

  • Using multiple predictor variables instead of single variable
  • Instead of a line, we now need to find the best-fitting plane (or hyperplane) through the data

Code – Multiple Regression

In [21]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(air[["Promotion_Budget"]+["Inter_metro_flight_ratio"]], air[["Passengers"]])
Out[21]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [22]:
predictions = lr.predict(air[["Promotion_Budget"]+["Inter_metro_flight_ratio"]])
In [23]:
import statsmodels.formula.api as sm
model = sm.ols(formula='Passengers ~ Promotion_Budget+Inter_metro_flight_ratio', data=air)
fitted = model.fit()
fitted.summary()
Out[23]:
OLS Regression Results
Dep. Variable: Passengers R-squared: 0.934
Model: OLS Adj. R-squared: 0.932
Method: Least Squares F-statistic: 540.5
Date: Tue, 14 Feb 2017 Prob (F-statistic): 4.76e-46
Time: 17:26:53 Log-Likelihood: -750.96
No. Observations: 80 AIC: 1508.
Df Residuals: 77 BIC: 1515.
Df Model: 2
Covariance Type: nonrobust

 

coef std err t P>|t| [95.0% Conf. Int.]
Intercept 2017.7724 1624.803 1.242 0.218 -1217.624 5253.169
Promotion_Budget 0.0707 0.002 28.297 0.000 0.066 0.076
Inter_metro_flight_ratio -2121.5208 2473.189 -0.858 0.394 -7046.268 2803.227

 

Omnibus: 26.259 Durbin-Watson: 1.800
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5.075
Skew: -0.096 Prob(JB): 0.0791
Kurtosis: 1.781 Cond. No. 5.25e+06

Individual Impact of variables

  • Look at the P-value of each coefficient
  • The P-value is the probability of seeing such a coefficient estimate if the true coefficient were zero (the null hypothesis)
  • Each variable's coefficient is individually tested for significance
  • Beta coefficients follow a t-distribution
  • Individual P-values tell us about the significance of each variable
  • A variable is significant if its P-value is less than 5%. The lower the P-value, the more significant the variable
  • Note: It is possible for all the variables in a regression together to produce a great fit, and yet very few of the variables to be individually significant.

To test H_0: \beta_i = 0 vs. H_a: \beta_i \ne 0, the test statistic is t = \frac{b_i}{s(b_i)}. Reject H_0 if t > t(\frac{\alpha}{2}; n-k-1) or t < -t(\frac{\alpha}{2}; n-k-1).
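With statsmodels, these individual p-values sit on the fitted result object, so the significant variables can be listed programmatically. A minimal sketch using the two-variable fit (fitted) from the multiple-regression code above:

# Coefficient p-values from the multiple-regression fit above ('fitted')
print(fitted.pvalues)

# Keep only the terms whose p-value is below 5%
significant = fitted.pvalues[fitted.pvalues < 0.05].index.tolist()
print(significant)               # Inter_metro_flight_ratio drops out (p ~ 0.39)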

LAB: Multiple Regression

  • Build a multiple regression model to predict the number of passengers
  • What is the R-square value?
  • Are there any predictor variables that are not impacting the dependent variable?
In [24]:
#Build a multiple regression model to predict the number of passengers
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(air[["Promotion_Budget"]+["Inter_metro_flight_ratio"]+["Service_Quality_Score"]], air[["Passengers"]])
predictions = lr.predict(air[["Promotion_Budget"]+["Inter_metro_flight_ratio"]+["Service_Quality_Score"]])
In [25]:
import statsmodels.formula.api as sm
model = sm.ols(formula='Passengers ~ Promotion_Budget+Inter_metro_flight_ratio+Service_Quality_Score', data=air)
fitted = model.fit()
fitted.summary()
Out[25]:
OLS Regression Results
Dep. Variable: Passengers R-squared: 0.951
Model: OLS Adj. R-squared: 0.949
Method: Least Squares F-statistic: 495.6
Date: Tue, 14 Feb 2017 Prob (F-statistic): 8.71e-50
Time: 17:26:58 Log-Likelihood: -738.45
No. Observations: 80 AIC: 1485.
Df Residuals: 76 BIC: 1494.
Df Model: 3
Covariance Type: nonrobust

 

coef std err t P>|t| [95.0% Conf. Int.]
Intercept 1.921e+04 3542.694 5.424 0.000 1.22e+04 2.63e+04
Promotion_Budget 0.0555 0.004 15.476 0.000 0.048 0.063
Inter_metro_flight_ratio -2003.4508 2129.095 -0.941 0.350 -6243.912 2237.010
Service_Quality_Score -2802.0708 530.382 -5.283 0.000 -3858.419 -1745.723

 

Omnibus: 6.902 Durbin-Watson: 2.312
Prob(Omnibus): 0.032 Jarque-Bera (JB): 2.759
Skew: -0.051 Prob(JB): 0.252
Kurtosis: 2.096 Cond. No. 8.22e+06
  • What is R-square value

0.951

  • Are there any predictor variables that are not impacting the dependent variable

Inter_metro_flight_ratio

Adjusted R-Squared

  • Is it good to have as many independent variables as possible? Nope
  • R-square is deceptive. R-squared value never decreases when a new X variable is added to the model – True?
  • We need a better measure or an adjustment to the original R-squared formula.
  • Adjusted R squared
    • Its value depends on the number of explanatory variables
    • Imposes a penalty for adding additional explanatory variables
    • It is usually written as \bar{R}^2
    • It can be very different from R^2 when there are many predictors and n is small

    \bar{R}^2 = R^2 - \frac{k-1}{n-k}(1 - R^2)

  • where n is the number of observations and k is the number of parameters (including the intercept)
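The adjustment can be verified by hand from any fitted model. A minimal sketch using the three-predictor air-passengers model (fitted) from the lab above, where n = 80 and k = 4 parameters including the intercept:

r2 = fitted.rsquared
n  = int(fitted.nobs)            # 80 observations
k  = int(fitted.df_model) + 1    # 3 predictors + intercept = 4 parameters

adj_r2 = r2 - (k - 1) / (n - k) * (1 - r2)
print(adj_r2, fitted.rsquared_adj)   # both about 0.949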

LAB: Adjusted R-Square

  • Dataset: “Adjusted Rsquare/ Adj_Sample.csv”
  • Build a model to predict y using x1,x2 and x3. Note down R-Square and Adj R-Square values
  • Build a model to predict y using x1,x2,x3,x4,x5 and x6. Note down R-Square and Adj R-Square values
  • Build a model to predict y using x1,x2,x3,x4,x5,x6,x7 and x8. Note down R-Square and Adj R-Square values
In [26]:
adj_sample=pd.read_csv("Datasets/Adjusted RSquare/Adj_Sample.csv")
adj_sample.shape
Out[26]:
(12, 9)
In [27]:
adj_sample.columns.values
Out[27]:
array(['Y', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8'], dtype=object)
In [28]:
#Build a model to predict y using x1,x2 and x3. Note down R-Square and Adj R-Square values 
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(adj_sample[["x1"]+["x2"]+["x3"]], adj_sample[["Y"]])
predictions = lr.predict(adj_sample[["x1"]+["x2"]+["x3"]])
In [29]:
import statsmodels.formula.api as sm
model = sm.ols(formula='Y ~ x1+x2+x3', data=adj_sample)
fitted1 = model.fit()
fitted1.summary()
C:\Users\ADMIN\Anaconda3\lib\site-packages\scipy\stats\stats.py:1327: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=12
  "anyway, n=%i" % int(n))
Out[29]:
OLS Regression Results
Dep. Variable: Y R-squared: 0.684
Model: OLS Adj. R-squared: 0.566
Method: Least Squares F-statistic: 5.785
Date: Tue, 14 Feb 2017 Prob (F-statistic): 0.0211
Time: 17:27:07 Log-Likelihood: -10.430
No. Observations: 12 AIC: 28.86
Df Residuals: 8 BIC: 30.80
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -2.8798 1.163 -2.477 0.038 -5.561 -0.199
x1 -0.4894 0.370 -1.324 0.222 -1.342 0.363
x2 0.0029 0.001 2.586 0.032 0.000 0.005
x3 0.4572 0.176 2.595 0.032 0.051 0.864
Omnibus: 1.113 Durbin-Watson: 1.978
Prob(Omnibus): 0.573 Jarque-Bera (JB): 0.763
Skew: -0.562 Prob(JB): 0.683
Kurtosis: 2.489 Cond. No. 6.00e+03
In [30]:
#Build a model to predict y using x1,x2,x3,x4,x5 and x6. Note down R-Square and Adj R-Square values 
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(adj_sample[["x1"]+["x2"]+["x3"]+["x4"]+["x5"]+["x6"]], adj_sample[["Y"]])
predictions = lr.predict(adj_sample[["x1"]+["x2"]+["x3"]+["x4"]+["x5"]+["x6"]])
In [31]:
import statsmodels.formula.api as sm
model = sm.ols(formula='Y ~ x1+x2+x3+x4+x5+x6', data=adj_sample)
fitted2 = model.fit()
fitted2.summary()
C:\Users\ADMIN\Anaconda3\lib\site-packages\scipy\stats\stats.py:1327: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=12
  "anyway, n=%i" % int(n))
Out[31]:
OLS Regression Results
Dep. Variable: Y R-squared: 0.717
Model: OLS Adj. R-squared: 0.377
Method: Least Squares F-statistic: 2.111
Date: Tue, 14 Feb 2017 Prob (F-statistic): 0.215
Time: 17:27:10 Log-Likelihood: -9.7790
No. Observations: 12 AIC: 33.56
Df Residuals: 5 BIC: 36.95
Df Model: 6
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -5.3751 4.687 -1.147 0.303 -17.423 6.673
x1 -0.6697 0.537 -1.247 0.268 -2.050 0.711
x2 0.0030 0.002 1.956 0.108 -0.001 0.007
x3 0.5063 0.249 2.036 0.097 -0.133 1.146
x4 0.0376 0.084 0.449 0.672 -0.178 0.253
x5 0.0436 0.169 0.258 0.806 -0.390 0.478
x6 0.0516 0.088 0.588 0.582 -0.174 0.277
Omnibus: 0.426 Durbin-Watson: 2.065
Prob(Omnibus): 0.808 Jarque-Bera (JB): 0.434
Skew: -0.347 Prob(JB): 0.805
Kurtosis: 2.378 Cond. No. 1.98e+04
In [32]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(adj_sample[["x1"]+["x2"]+["x3"]+["x4"]+["x5"]+["x6"]+["x7"]+["x8"]], adj_sample[["Y"]])
predictions = lr.predict(adj_sample[["x1"]+["x2"]+["x3"]+["x4"]+["x5"]+["x6"]+["x7"]+["x8"]])
In [33]:
import statsmodels.formula.api as sm
model = sm.ols(formula='Y ~ x1+x2+x3+x4+x5+x6+x7+x8', data=adj_sample)
fitted3 = model.fit()
fitted3.summary()
C:\Users\ADMIN\Anaconda3\lib\site-packages\scipy\stats\stats.py:1327: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=12
  "anyway, n=%i" % int(n))
Out[33]:
OLS Regression Results
Dep. Variable: Y R-squared: 0.805
Model: OLS Adj. R-squared: 0.285
Method: Least Squares F-statistic: 1.549
Date: Tue, 14 Feb 2017 Prob (F-statistic): 0.393
Time: 17:27:13 Log-Likelihood: -7.5390
No. Observations: 12 AIC: 33.08
Df Residuals: 3 BIC: 37.44
Df Model: 8
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 17.0440 19.903 0.856 0.455 -46.297 80.385
x1 -0.0956 0.761 -0.126 0.908 -2.519 2.328
x2 0.0007 0.003 0.291 0.790 -0.007 0.009
x3 0.5157 0.306 1.684 0.191 -0.459 1.490
x4 0.0579 0.103 0.560 0.615 -0.271 0.387
x5 0.0858 0.191 0.448 0.684 -0.524 0.695
x6 -0.1747 0.220 -0.795 0.485 -0.874 0.525
x7 -0.0324 0.153 -0.212 0.846 -0.519 0.455
x8 -0.2321 0.207 -1.124 0.343 -0.890 0.425
Omnibus: 1.329 Durbin-Watson: 1.594
Prob(Omnibus): 0.514 Jarque-Bera (JB): 0.875
Skew: -0.339 Prob(JB): 0.646
Kurtosis: 1.863 Cond. No. 7.85e+04
Model R^2 Adj R^2
Model1 0.684 0.566
Model2 0.717 0.377
Model3 0.805 0.285

R-Squared vs Adjusted R-Squared

We have built three models on the Adj_Sample data (model1, model2 and model3) with an increasing number of predictor variables. As the table above shows, R-squared keeps increasing as variables are added, while adjusted R-squared falls, so the additional variables are not adding real explanatory power. Adjusted R-squared is therefore the better measure when comparing models with different numbers of predictors.

LAB: Multiple Regression- issues

  • Import Final Exam Score data
  • Build a model to predict final score using the rest of the variables.
  • How are Sem2_Math & Final score related? As Sem2_Math score increases, what happens to Final score?
  • Remove “Sem1_Math” variable from the model and rebuild the model
  • Is there any change in R square or Adj R square
  • How are Sem2_Math & Final score related now? As Sem2_Math score increases, what happens to Final score?
  • Draw a scatter plot between Sem1_Math & Sem2_Math
  • Find the correlation between Sem1_Math & Sem2_Math
In [34]:
#Import Final Exam Score data
final_exam=pd.read_csv("Datasets/Final Exam/Final Exam Score.csv")
In [35]:
#Size of the data
final_exam.shape
Out[35]:
(24, 5)
In [36]:
#Variable names
final_exam.columns
Out[36]:
Index(['Sem1_Science', 'Sem2_Science', 'Sem1_Math', 'Sem2_Math',
       'Final_exam_marks'],
dtype='object')
In [37]:
#Build a model to predict final score using the rest of the variables.
from sklearn.linear_model import LinearRegression
lr1 = LinearRegression()
lr1.fit(final_exam[["Sem1_Science"]+["Sem2_Science"]+["Sem1_Math"]+["Sem2_Math"]], final_exam[["Final_exam_marks"]])
predictions1 = lr1.predict(final_exam[["Sem1_Science"]+["Sem2_Science"]+["Sem1_Math"]+["Sem2_Math"]])
import statsmodels.formula.api as sm
model1 = sm.ols(formula='Final_exam_marks ~ Sem1_Science+Sem2_Science+Sem1_Math+Sem2_Math', data=final_exam)
fitted1 = model1.fit()
fitted1.summary()
Out[37]:
OLS Regression Results
Dep. Variable: Final_exam_marks R-squared: 0.990
Model: OLS Adj. R-squared: 0.987
Method: Least Squares F-statistic: 452.3
Date: Tue, 14 Feb 2017 Prob (F-statistic): 1.50e-18
Time: 17:27:21 Log-Likelihood: -38.099
No. Observations: 24 AIC: 86.20
Df Residuals: 19 BIC: 92.09
Df Model: 4
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -1.6226 1.999 -0.812 0.427 -5.806 2.561
Sem1_Science 0.1738 0.063 2.767 0.012 0.042 0.305
Sem2_Science 0.2785 0.052 5.379 0.000 0.170 0.387
Sem1_Math 0.7890 0.197 4.002 0.001 0.376 1.202
Sem2_Math -0.2063 0.191 -1.078 0.294 -0.607 0.194
Omnibus: 6.343 Durbin-Watson: 1.863
Prob(Omnibus): 0.042 Jarque-Bera (JB): 4.332
Skew: 0.973 Prob(JB): 0.115
Kurtosis: 3.737 Cond. No. 1.20e+03
In [38]:
fitted1.rsquared
Out[38]:
0.98960765475687229
  • How are Sem2_Math & Final score related? As Sem2_Math score increases, what happens to Final score?

As the Sem2_Math score increases, the Final score decreases.

In [39]:
#Remove "Sem1_Math" variable from the model and rebuild the model
from sklearn.linear_model import LinearRegression
lr2 = LinearRegression()
lr2.fit(final_exam[["Sem1_Science"]+["Sem2_Science"]+["Sem2_Math"]], final_exam[["Final_exam_marks"]])
predictions2 = lr2.predict(final_exam[["Sem1_Science"]+["Sem2_Science"]+["Sem2_Math"]])
import statsmodels.formula.api as sm
model2 = sm.ols(formula='Final_exam_marks ~ Sem1_Science+Sem2_Science+Sem2_Math', data=final_exam)
fitted2 = model2.fit()
fitted2.summary()
Out[39]:
OLS Regression Results
Dep. Variable: Final_exam_marks R-squared: 0.981
Model: OLS Adj. R-squared: 0.978
Method: Least Squares F-statistic: 341.4
Date: Tue, 14 Feb 2017 Prob (F-statistic): 2.44e-17
Time: 17:27:25 Log-Likelihood: -45.436
No. Observations: 24 AIC: 98.87
Df Residuals: 20 BIC: 103.6
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -2.3986 2.632 -0.911 0.373 -7.889 3.092
Sem1_Science 0.2130 0.082 2.595 0.017 0.042 0.384
Sem2_Science 0.2686 0.068 3.925 0.001 0.126 0.411
Sem2_Math 0.5320 0.067 7.897 0.000 0.391 0.673
Omnibus: 5.869 Durbin-Watson: 2.424
Prob(Omnibus): 0.053 Jarque-Bera (JB): 3.793
Skew: 0.864 Prob(JB): 0.150
Kurtosis: 3.898 Cond. No. 1.03e+03
  • Is there any change in R square or Adj R square
Model R^2 Adj R^2
model1 0.990 0.987
model2 0.981 0.978
  • How are Sem2_Math & Final score related now? As Sem2_Math score increases, what happens to Final score?

As the Sem2_Math score increases, the Final score also increases.

In [40]:
#Draw a scatter plot between Sem1_Math & Sem2_Math
import matplotlib.pyplot as plt
%matplotlib inline 
plt.scatter(final_exam.Sem1_Math,final_exam.Sem2_Math)
Out[40]:
<matplotlib.collections.PathCollection at 0x28c11508ac8>
In [41]:
#Find the correlation between Sem1_Math & Sem2_Math 
np.corrcoef(final_exam.Sem1_Math,final_exam.Sem2_Math)
Out[41]:
array([[ 1.       ,  0.9924948],
[ 0.9924948,  1.       ]])

Multicollinearity

  • Multiple regression is wonderful – It allows you to consider the effect of multiple variables simultaneously.
  • Multiple regression is extremely unpleasant – Because it allows you to consider the effect of multiple variables simultaneously.
  • The relationships between the explanatory variables are the key to understanding multiple regression.
  • Multicollinearity (or inter correlation) exists when at least some of the predictor variables are correlated among themselves.
  • The parameter estimates will have inflated variance in the presence of multicollinearity.
  • Sometimes the signs of the parameter estimates tend to change.
  • If the relation between the independent variables grows really strong, then the variance of the parameter estimates tends to infinity – can you prove it? (A simulation illustrating this is sketched below.)
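A quick way to see the effect is to simulate two almost identical predictors and watch the coefficient standard errors blow up. This is an illustrative simulation (simulated data, not the course dataset), using the same statsmodels formula API as the labs:

import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

np.random.seed(1)
n  = 100
x1 = np.random.normal(size=n)
x2 = x1 + np.random.normal(scale=0.01, size=n)    # x2 is almost a copy of x1
y  = 3 * x1 + np.random.normal(size=n)
sim = pd.DataFrame({"y": y, "x1": x1, "x2": x2})

# With both near-duplicate predictors, the coefficient standard errors explode
print(sm.ols("y ~ x1 + x2", data=sim).fit().bse)

# Dropping one of them brings the standard errors back to a sensible size
print(sm.ols("y ~ x1", data=sim).fit().bse)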

Multicollinearity Detection

Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4

  • Build a model X1 vs. X2, X3, X4; find R^2, say R1
  • Build a model X2 vs. X1, X3, X4; find R^2, say R2
  • Build a model X3 vs. X1, X2, X4; find R^2, say R3
  • Build a model X4 vs. X1, X2, X3; find R^2, say R4
  • For example, if R3 is 95%, then we don't really need X3 in the model
  • Since it can be explained as a linear combination of the other three
  • For each variable we find the individual R^2
  • \frac{1}{1-R^2} is called the VIF (variance inflation factor)
  • The VIF option in SAS automatically calculates VIF values for each of the predictor variables
R^2 40% 50% 60% 70% 75% 80% 90%
VIF 1.67 2.00 2.50 3.33 4.00 5.00 10.00
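In Python, statsmodels also ships a ready-made VIF helper. A minimal sketch of how it could be applied to the numeric air-passengers predictors, as an alternative to the vif_cal function written in the lab below:

import statsmodels.api as sma
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF for the numeric predictors of the air-passengers data
X = air[["Promotion_Budget", "Inter_metro_flight_ratio", "Service_Quality_Score"]]
X = sma.add_constant(X)          # the VIF calculation expects the intercept column too

for i, col in enumerate(X.columns):
    if col != "const":
        print(col, round(variance_inflation_factor(X.values, i), 2))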

LAB: Multicollinearity

  • Identify the Multicollinearity in the Final Exam Score model.
  • Drop the variables one by one to reduce the multicollinearity.
  • Identify and eliminate the Multicollinearity in the Air passengers model.
In [42]:
from sklearn.linear_model import LinearRegression
lr1 = LinearRegression()
lr1.fit(final_exam[["Sem1_Science"]+["Sem2_Science"]+["Sem1_Math"]+["Sem2_Math"]], final_exam[["Final_exam_marks"]])
predictions1 = lr1.predict(final_exam[["Sem1_Science"]+["Sem2_Science"]+["Sem1_Math"]+["Sem2_Math"]])
In [43]:
import statsmodels.formula.api as sm
model1 = sm.ols(formula='Final_exam_marks ~ Sem1_Science+Sem2_Science+Sem1_Math+Sem2_Math', data=final_exam)
fitted1 = model1.fit()
fitted1.summary()
Out[43]:
OLS Regression Results
Dep. Variable: Final_exam_marks R-squared: 0.990
Model: OLS Adj. R-squared: 0.987
Method: Least Squares F-statistic: 452.3
Date: Tue, 14 Feb 2017 Prob (F-statistic): 1.50e-18
Time: 17:27:34 Log-Likelihood: -38.099
No. Observations: 24 AIC: 86.20
Df Residuals: 19 BIC: 92.09
Df Model: 4
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -1.6226 1.999 -0.812 0.427 -5.806 2.561
Sem1_Science 0.1738 0.063 2.767 0.012 0.042 0.305
Sem2_Science 0.2785 0.052 5.379 0.000 0.170 0.387
Sem1_Math 0.7890 0.197 4.002 0.001 0.376 1.202
Sem2_Math -0.2063 0.191 -1.078 0.294 -0.607 0.194
Omnibus: 6.343 Durbin-Watson: 1.863
Prob(Omnibus): 0.042 Jarque-Bera (JB): 4.332
Skew: 0.973 Prob(JB): 0.115
Kurtosis: 3.737 Cond. No. 1.20e+03
In [44]:
fitted1.summary2()
Out[44]:
Model: OLS Adj. R-squared: 0.987
Dependent Variable: Final_exam_marks AIC: 86.1980
Date: 2017-02-14 17:27 BIC: 92.0883
No. Observations: 24 Log-Likelihood: -38.099
Df Model: 4 F-statistic: 452.3
Df Residuals: 19 Prob (F-statistic): 1.50e-18
R-squared: 0.990 Scale: 1.7694
Coef. Std.Err. t P>|t| [0.025 0.975]
Intercept -1.6226 1.9987 -0.8118 0.4269 -5.8060 2.5607
Sem1_Science 0.1738 0.0628 2.7668 0.0123 0.0423 0.3052
Sem2_Science 0.2785 0.0518 5.3795 0.0000 0.1702 0.3869
Sem1_Math 0.7890 0.1971 4.0023 0.0008 0.3764 1.2016
Sem2_Math -0.2063 0.1914 -1.0782 0.2944 -0.6069 0.1942
Omnibus: 6.343 Durbin-Watson: 1.863
Prob(Omnibus): 0.042 Jarque-Bera (JB): 4.332
Skew: 0.973 Prob(JB): 0.115
Kurtosis: 3.737 Condition No.: 1200
In [45]:
#Code for VIF Calculation
#Writing a function to calculate the VIF values
def vif_cal(input_data, dependent_col):
    # For each predictor: regress it on all the other predictors and
    # turn the resulting R-squared into VIF = 1 / (1 - R^2)
    x_vars = input_data.drop([dependent_col], axis=1)
    xvar_names = x_vars.columns
    for i in range(0, xvar_names.shape[0]):
        y = x_vars[xvar_names[i]]                    # current predictor as the target
        x = x_vars[xvar_names.drop(xvar_names[i])]   # all remaining predictors
        rsq = sm.ols(formula="y~x", data=x_vars).fit().rsquared
        vif = round(1/(1-rsq), 2)
        print(xvar_names[i], " VIF = ", vif)
In [46]:
#Calculating VIF values using that function
vif_cal(input_data=final_exam, dependent_col="Final_exam_marks")
Sem1_Science  VIF =  7.4
Sem2_Science  VIF =  5.4
Sem1_Math  VIF =  68.79
Sem2_Math  VIF =  68.01
In [47]:
import statsmodels.formula.api as sm
model2 = sm.ols(formula='Final_exam_marks ~ Sem1_Science+Sem2_Science+Sem2_Math', data=final_exam)
fitted2 = model2.fit()
fitted2.summary()
Out[47]:
OLS Regression Results
Dep. Variable: Final_exam_marks R-squared: 0.981
Model: OLS Adj. R-squared: 0.978
Method: Least Squares F-statistic: 341.4
Date: Tue, 14 Feb 2017 Prob (F-statistic): 2.44e-17
Time: 17:27:39 Log-Likelihood: -45.436
No. Observations: 24 AIC: 98.87
Df Residuals: 20 BIC: 103.6
Df Model: 3
Covariance Type: nonrobust

 

coef std err t P>|t| [95.0% Conf. Int.]
Intercept -2.3986 2.632 -0.911 0.373 -7.889 3.092
Sem1_Science 0.2130 0.082 2.595 0.017 0.042 0.384
Sem2_Science 0.2686 0.068 3.925 0.001 0.126 0.411
Sem2_Math 0.5320 0.067 7.897 0.000 0.391 0.673

 

Omnibus: 5.869 Durbin-Watson: 2.424
Prob(Omnibus): 0.053 Jarque-Bera (JB): 3.793
Skew: 0.864 Prob(JB): 0.150
Kurtosis: 3.898 Cond. No. 1.03e+03

 

In [48]:
vif_cal(input_data=final_exam.drop(["Sem1_Math"], axis=1), dependent_col="Final_exam_marks")
Sem1_Science  VIF =  7.22
Sem2_Science  VIF =  5.38
Sem2_Math  VIF =  4.81
In [49]:
vif_cal(input_data=final_exam.drop(["Sem1_Math","Sem1_Science"], axis=1), dependent_col="Final_exam_marks")
Sem2_Science  VIF =  3.4
Sem2_Math  VIF =  3.4
In [50]:
#Identify and eliminate the Multicollinearity  in the Air passengers model
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(air[["Promotion_Budget"]+["Inter_metro_flight_ratio"]+["Service_Quality_Score"]], air[["Passengers"]])
predictions = lr.predict(air[["Promotion_Budget"]+["Inter_metro_flight_ratio"]+["Service_Quality_Score"]])
import statsmodels.formula.api as sm
model = sm.ols(formula='Passengers ~ Promotion_Budget+Inter_metro_flight_ratio+Service_Quality_Score', data=air)
fitted = model.fit()
fitted.summary()
Out[50]:
OLS Regression Results
Dep. Variable: Passengers R-squared: 0.951
Model: OLS Adj. R-squared: 0.949
Method: Least Squares F-statistic: 495.6
Date: Tue, 14 Feb 2017 Prob (F-statistic): 8.71e-50
Time: 17:27:42 Log-Likelihood: -738.45
No. Observations: 80 AIC: 1485.
Df Residuals: 76 BIC: 1494.
Df Model: 3
Covariance Type: nonrobust

 

coef std err t P>|t| [95.0% Conf. Int.]
Intercept 1.921e+04 3542.694 5.424 0.000 1.22e+04 2.63e+04
Promotion_Budget 0.0555 0.004 15.476 0.000 0.048 0.063
Inter_metro_flight_ratio -2003.4508 2129.095 -0.941 0.350 -6243.912 2237.010
Service_Quality_Score -2802.0708 530.382 -5.283 0.000 -3858.419 -1745.723

 

Omnibus: 6.902 Durbin-Watson: 2.312
Prob(Omnibus): 0.032 Jarque-Bera (JB): 2.759
Skew: -0.051 Prob(JB): 0.252
Kurtosis: 2.096 Cond. No. 8.22e+06

In [51]:

air.columns.values
Out[51]:
array(['Week_num', 'Passengers', 'Promotion_Budget',
       'Service_Quality_Score', 'Holiday_week',
       'Delayed_Cancelled_flight_ind', 'Inter_metro_flight_ratio',
'Bad_Weather_Ind', 'Technical_issues_ind'], dtype=object)
In [52]:
#Calculating VIF values using that function
vif_cal(input_data=air.drop(["Holiday_week","Delayed_Cancelled_flight_ind", "Bad_Weather_Ind", "Technical_issues_ind"], axis=1), dependent_col="Passengers")
Week_num  VIF =  1.2
Promotion_Budget  VIF =  3.94
Service_Quality_Score  VIF =  3.52
Inter_metro_flight_ratio  VIF =  1.39
Note: For calculating VIF, all the variables have to be numerical, i.e., no categorical variables. Categorical variables should either be dropped or converted into numerical form, as sketched below.
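For example, the YES/NO indicator columns could be recoded to 1/0 before running the VIF check. A minimal sketch (this recoding step is an illustration, not part of the original lab), reusing the air data and the vif_cal function defined above:

# Recode the YES/NO indicator columns to 1/0 so they can be kept in the VIF check
air_numeric = air.copy()
for col in ["Holiday_week", "Delayed_Cancelled_flight_ind", "Bad_Weather_Ind", "Technical_issues_ind"]:
    air_numeric[col] = (air_numeric[col] == "YES").astype(int)

vif_cal(input_data=air_numeric, dependent_col="Passengers")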
In [53]:
air
Out[53]:
Week_num Passengers Promotion_Budget Service_Quality_Score Holiday_week Delayed_Cancelled_flight_ind Inter_metro_flight_ratio Bad_Weather_Ind Technical_issues_ind
0 1 37824 517356 4.00000 NO NO 0.70 YES YES
1 2 43936 646086 2.67466 NO YES 0.80 YES YES
2 3 42896 638330 3.29473 NO NO 0.90 NO NO
3 4 35792 506492 3.85684 NO NO 0.40 NO NO
4 5 38624 609658 3.90757 NO NO 0.87 NO YES
5 6 35744 476084 3.83710 NO YES 0.66 YES NO
6 7 40752 635978 3.60259 NO YES 0.74 YES NO
7 8 34592 495152 3.60086 NO YES 0.39 NO NO
8 9 35136 429800 3.62776 NO NO 0.61 NO YES
9 10 43328 613326 2.98305 NO NO 0.66 NO NO
10 11 34960 492758 3.60089 NO NO 0.77 NO NO
11 12 44464 600726 2.56064 NO YES 0.74 YES NO
12 13 36464 456960 3.89655 NO YES 0.39 YES NO
13 14 44464 586096 2.47713 NO YES 0.79 YES NO
14 15 51888 704802 1.77422 YES YES 0.72 YES YES
15 16 36800 536970 3.92254 NO NO 0.43 NO YES
16 17 48688 742308 1.93589 NO NO 0.90 NO YES
17 18 37456 500234 3.99060 NO NO 0.46 NO NO
18 19 44800 570682 2.43241 NO YES 0.79 YES YES
19 20 56032 826420 1.41139 YES YES 0.80 YES NO
20 21 58800 761040 1.24488 YES NO 0.69 NO NO
21 22 57440 753466 1.36091 YES NO 0.60 NO NO
22 23 32752 502712 3.37428 NO YES 0.45 YES YES
23 24 43424 653856 2.88878 NO YES 0.89 YES YES
24 25 45968 706748 2.31898 NO YES 0.62 YES NO
25 26 38816 532602 3.85307 NO NO 0.75 NO YES
26 27 35168 518070 3.70671 NO YES 0.47 YES YES
27 28 34496 539378 3.48455 NO YES 0.78 YES YES
28 29 34208 414120 3.48166 NO YES 0.38 YES NO
29 30 44320 653338 2.58325 NO NO 0.71 NO YES
50 51 43728 590492 2.77882 NO YES 0.47 YES NO
51 52 47040 694568 2.06989 NO YES 0.55 YES NO
52 53 34512 493444 3.57125 NO NO 0.74 NO YES
53 54 57600 781718 1.35511 YES NO 0.67 NO YES
54 55 36064 526162 3.87218 NO YES 0.73 NO YES
55 56 49392 707070 1.91865 NO NO 0.75 NO NO
56 57 42378 545510 3.46630 NO NO 0.62 NO YES
57 58 38584 555170 3.99116 NO NO 0.77 NO NO
58 59 28700 405916 3.07021 NO NO 0.72 NO NO
59 60 55160 738794 1.48667 YES YES 0.71 YES NO
60 61 52472 666778 1.58686 YES YES 0.90 YES NO
61 62 54474 715498 1.52341 YES YES 0.55 YES NO
62 63 54222 754418 1.58647 YES NO 0.78 YES NO
63 64 73444 1012130 0.91298 YES YES 0.90 YES NO
64 65 67130 1003002 0.98050 YES NO 0.79 NO YES
65 66 39984 589526 3.77575 NO NO 0.81 NO NO
66 67 41972 550872 3.49699 NO YES 0.68 YES YES
67 68 43722 652680 2.84565 NO YES 0.69 YES NO
68 69 76972 1041796 0.87470 YES YES 0.90 YES NO
69 70 58156 881818 1.33013 YES NO 0.82 NO NO
70 71 52304 679938 1.68678 YES NO 0.63 NO YES
71 72 76524 1024450 0.87933 YES YES 0.90 YES NO
72 73 60620 844578 1.15504 YES NO 0.90 NO YES
73 74 32018 445424 3.23666 NO YES 0.64 YES YES
74 75 51814 669144 1.87321 YES NO 0.88 NO YES
75 76 66934 927696 1.07138 YES YES 0.84 NO NO
76 77 81228 1108254 0.85536 YES YES 0.90 YES NO
77 78 43288 638162 3.08191 NO NO 0.62 NO NO
78 79 43834 636636 2.75382 NO YES 0.79 YES YES
79 80 40852 575008 3.52768 NO YES 0.54 YES YES

80 rows × 9 columns

Lab: Multiple Regression

  • Dataset: Webpage_Product_Sales/Webpage_Product_Sales.csv
  • Build a model to predict sales using rest of the variables
  • Drop the less impacting variables based on p-values.
  • Is there any multicollinearity?
  • How many variables are there in the final model?
  • What is the R-squared of the final model?
  • Can you improve the model using same data and variables?
In [54]:
import pandas as pd 
Webpage_Product_Sales=pd.read_csv("Datasets/Webpage_Product_Sales/Webpage_Product_Sales.csv")
Webpage_Product_Sales.shape
Out[54]:
(675, 12)
In [55]:
Webpage_Product_Sales.columns
Out[55]:
Index(['ID', 'DayofMonth', 'Weekday', 'Month', 'Social_Network_Ref_links',
       'Online_Ad_Paid_ref_links', 'Clicks_From_Serach_Engine',
       'Special_Discount', 'Holiday', 'Server_Down_time_Sec', 'Web_UI_Score',
       'Sales'],
dtype='object')
In [56]:
import statsmodels.formula.api as sm
model1 = sm.ols(formula='Sales ~ Web_UI_Score+Server_Down_time_Sec+Holiday+Special_Discount+Clicks_From_Serach_Engine+Online_Ad_Paid_ref_links+Social_Network_Ref_links+Month+Weekday+DayofMonth', data=Webpage_Product_Sales)
fitted1 = model1.fit()
fitted1.summary()
Out[56]:
OLS Regression Results
Dep. Variable: Sales R-squared: 0.818
Model: OLS Adj. R-squared: 0.815
Method: Least Squares F-statistic: 298.4
Date: Tue, 14 Feb 2017 Prob (F-statistic): 5.54e-238
Time: 17:27:52 Log-Likelihood: -6456.7
No. Observations: 675 AIC: 1.294e+04
Df Residuals: 664 BIC: 1.299e+04
Df Model: 10
Covariance Type: nonrobust

 

coef std err t P>|t| [95.0% Conf. Int.]
Intercept 6545.8922 1286.240 5.089 0.000 4020.304 9071.481
Web_UI_Score -6.2582 11.545 -0.542 0.588 -28.928 16.412
Server_Down_time_Sec -134.0441 14.009 -9.569 0.000 -161.551 -106.537
Holiday 1.877e+04 683.077 27.477 0.000 1.74e+04 2.01e+04
Special_Discount 4718.3978 402.019 11.737 0.000 3929.016 5507.780
Clicks_From_Serach_Engine -0.1258 0.944 -0.133 0.894 -1.980 1.728
Online_Ad_Paid_ref_links 6.1557 1.002 6.142 0.000 4.188 8.124
Social_Network_Ref_links 6.6841 0.411 16.261 0.000 5.877 7.491
Month 481.0294 41.508 11.589 0.000 399.527 562.532
Weekday 1355.2153 67.224 20.160 0.000 1223.218 1487.213
DayofMonth 47.0579 15.198 3.096 0.002 17.216 76.900

 

Omnibus: 40.759 Durbin-Watson: 1.356
Prob(Omnibus): 0.000 Jarque-Bera (JB): 102.136
Skew: 0.297 Prob(JB): 6.63e-23
Kurtosis: 4.811 Cond. No. 2.57e+04
In [57]:
#VIF
vif_cal(Webpage_Product_Sales,"Sales")
ID  VIF =  1.18
DayofMonth  VIF =  1.01
Weekday  VIF =  1.0
Month  VIF =  1.19
Social_Network_Ref_links  VIF =  1.02
Online_Ad_Paid_ref_links  VIF =  12.13
Clicks_From_Serach_Engine  VIF =  12.08
Special_Discount  VIF =  1.37
Holiday  VIF =  1.38
Server_Down_time_Sec  VIF =  1.02
Web_UI_Score  VIF =  1.02
In [58]:
##Dropped Clicks_From_Serach_Engine based on VIF
import statsmodels.formula.api as sm
model2 = sm.ols(formula='Sales ~ Web_UI_Score+Server_Down_time_Sec+Holiday+Special_Discount+Online_Ad_Paid_ref_links+Social_Network_Ref_links+Month+Weekday+DayofMonth', data=Webpage_Product_Sales)
fitted2 = model2.fit()
fitted2.summary()
Out[58]:
OLS Regression Results
Dep. Variable: Sales R-squared: 0.818
Model: OLS Adj. R-squared: 0.815
Method: Least Squares F-statistic: 332.0
Date: Tue, 14 Feb 2017 Prob (F-statistic): 2.98e-239
Time: 17:27:55 Log-Likelihood: -6456.7
No. Observations: 675 AIC: 1.293e+04
Df Residuals: 665 BIC: 1.298e+04
Df Model: 9
Covariance Type: nonrobust

 

coef std err t P>|t| [95.0% Conf. Int.]
Intercept 6598.7469 1222.658 5.397 0.000 4198.012 8999.482
Web_UI_Score -6.3332 11.523 -0.550 0.583 -28.959 16.293
Server_Down_time_Sec -133.9518 13.981 -9.581 0.000 -161.405 -106.499
Holiday 1.877e+04 681.292 27.557 0.000 1.74e+04 2.01e+04
Special_Discount 4713.9295 400.323 11.775 0.000 3927.881 5499.978
Online_Ad_Paid_ref_links 6.0279 0.291 20.740 0.000 5.457 6.599
Social_Network_Ref_links 6.6872 0.410 16.307 0.000 5.882 7.492
Month 480.6876 41.398 11.611 0.000 399.401 561.974
Weekday 1355.2536 67.174 20.175 0.000 1223.355 1487.152
DayofMonth 47.0168 15.184 3.097 0.002 17.203 76.831

 

Omnibus: 40.826 Durbin-Watson: 1.356
Prob(Omnibus): 0.000 Jarque-Bera (JB): 102.313
Skew: 0.298 Prob(JB): 6.07e-23
Kurtosis: 4.812 Cond. No. 1.94e+04
In [59]:
#VIF for the updated model
vif_cal(Webpage_Product_Sales.drop(["Clicks_From_Serach_Engine"],axis=1),"Sales")
ID  VIF =  1.18
DayofMonth  VIF =  1.01
Weekday  VIF =  1.0
Month  VIF =  1.19
Social_Network_Ref_links  VIF =  1.01
Online_Ad_Paid_ref_links  VIF =  1.02
Special_Discount  VIF =  1.36
Holiday  VIF =  1.38
Server_Down_time_Sec  VIF =  1.02
Web_UI_Score  VIF =  1.02
In [60]:
##Drop the less impacting variables based on p-values.
##Dropped Web_UI_Score based on P-value
import statsmodels.formula.api as sm
model3 = sm.ols(formula='Sales ~ Server_Down_time_Sec+Holiday+Special_Discount+Online_Ad_Paid_ref_links+Social_Network_Ref_links+Month+Weekday+DayofMonth', data=Webpage_Product_Sales)
fitted3 = model3.fit()
fitted3.summary()
Out[60]:
OLS Regression Results
Dep. Variable: Sales R-squared: 0.818
Model: OLS Adj. R-squared: 0.816
Method: Least Squares F-statistic: 373.9
Date: Tue, 14 Feb 2017 Prob (F-statistic): 1.74e-240
Time: 17:27:57 Log-Likelihood: -6456.9
No. Observations: 675 AIC: 1.293e+04
Df Residuals: 666 BIC: 1.297e+04
Df Model: 8
Covariance Type: nonrobust

 

coef std err t P>|t| [95.0% Conf. Int.]
Intercept 6101.1539 821.286 7.429 0.000 4488.532 7713.776
Server_Down_time_Sec -134.0717 13.972 -9.596 0.000 -161.507 -106.637
Holiday 1.874e+04 678.528 27.623 0.000 1.74e+04 2.01e+04
Special_Discount 4726.1858 399.491 11.831 0.000 3941.771 5510.600
Online_Ad_Paid_ref_links 6.0357 0.290 20.802 0.000 5.466 6.605
Social_Network_Ref_links 6.6738 0.409 16.312 0.000 5.870 7.477
Month 479.5231 41.322 11.605 0.000 398.386 560.660
Weekday 1354.4252 67.122 20.179 0.000 1222.629 1486.221
DayofMonth 46.9564 15.175 3.094 0.002 17.159 76.754

 

Omnibus: 41.049 Durbin-Watson: 1.352
Prob(Omnibus): 0.000 Jarque-Bera (JB): 103.243
Skew: 0.298 Prob(JB): 3.81e-23
Kurtosis: 4.821 Cond. No. 1.31e+04

 

In [61]:
#How many variables are there in the final model?
8
Out[61]:
8
In [62]:
#What is the R-squared of the final model?
fitted3.rsquared
Out[62]:
0.8178742020411971

Interaction Terms

  • Adding interaction terms might help in improving the prediction accuracy of the model.
  • The addition of interaction terms needs prior knowledge of the dataset and its variables (the formula syntax is sketched below).
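In a statsmodels formula, 'A:B' adds only the interaction term, while 'A*B' expands to both main effects plus the interaction. A minimal sketch on the Webpage_Product_Sales data loaded in the earlier lab:

import statsmodels.formula.api as sm

# 'Holiday:Weekday' adds only the interaction term;
# 'Holiday*Weekday' expands to Holiday + Weekday + Holiday:Weekday
inter_only = sm.ols("Sales ~ Holiday:Weekday", data=Webpage_Product_Sales).fit()
full       = sm.ols("Sales ~ Holiday*Weekday", data=Webpage_Product_Sales).fit()

print(inter_only.params.index.tolist())
print(full.params.index.tolist())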

LAB: Interaction Terms

  • Add a few interaction terms to the web product sales model above and see the increase in accuracy
In [63]:
import statsmodels.formula.api as sm
model4 = sm.ols(formula='Sales ~ Server_Down_time_Sec+Holiday+Special_Discount+Online_Ad_Paid_ref_links+Social_Network_Ref_links+Month+Weekday+DayofMonth+Holiday*Weekday', data=Webpage_Product_Sales)
fitted4 = model4.fit()
fitted4.summary()
Out[63]:
OLS Regression Results
Dep. Variable: Sales R-squared: 0.865
Model: OLS Adj. R-squared: 0.863
Method: Least Squares F-statistic: 473.6
Date: Tue, 14 Feb 2017 Prob (F-statistic): 2.17e-282
Time: 17:28:04 Log-Likelihood: -6355.7
No. Observations: 675 AIC: 1.273e+04
Df Residuals: 665 BIC: 1.278e+04
Df Model: 9
Covariance Type: nonrobust

 

coef std err t P>|t| [95.0% Conf. Int.]
Intercept 6753.6923 708.791 9.528 0.000 5361.955 8145.430
Server_Down_time_Sec -140.4922 12.044 -11.665 0.000 -164.141 -116.844
Holiday 2201.8694 1232.336 1.787 0.074 -217.870 4621.608
Special_Discount 4749.0044 344.145 13.799 0.000 4073.262 5424.747
Online_Ad_Paid_ref_links 5.9515 0.250 23.805 0.000 5.461 6.442
Social_Network_Ref_links 7.0657 0.353 19.994 0.000 6.372 7.760
Month 480.3156 35.597 13.493 0.000 410.420 550.212
Weekday 1164.8864 59.143 19.696 0.000 1048.756 1281.017
DayofMonth 47.0967 13.073 3.603 0.000 21.428 72.766
Holiday:Weekday 4294.6865 281.683 15.247 0.000 3741.592 4847.782

 

Omnibus: 7.552 Durbin-Watson: 0.867
Prob(Omnibus): 0.023 Jarque-Bera (JB): 7.305
Skew: 0.219 Prob(JB): 0.0259
Kurtosis: 2.740 Cond. No. 2.32e+04

Conclusion – Regression

  • In this chapter, we discussed simple and multiple regression: how to build simple and multiple linear regression models, the most important metrics to consider in the output of a regression, what multicollinearity is, how to detect and eliminate it, what R-squared and adjusted R-squared are and how they differ, and how to see the individual impact of the variables.
  • This is a basic regression class. Once you have a good grasp of regression, you can explore advanced topics such as adding polynomial and interaction terms to your regression line.
  • Adjusted R-squared is a good measure of training (in-sample) error. We can't be sure about the final model performance based on it alone; we may have to perform cross-validation to get an idea of the testing error.
  • We will talk about cross-validation in more detail in future lectures.
  • Outliers can influence the regression line, so the data needs to be cleaned before fitting. At the end of the day these are all mathematical formulas: if the inputs are wrong, the results will be wrong, which is why data cleaning is very important before getting into regression.

 

 
