
Regression

Contents

  • Correlation
  • Simple Regression
  • R-Squared
  • Multiple Regression
  • Adj R-Squared
  • P-value
  • Multicollinearity
  • Interaction terms

Correlation

Why do we need correlation?

  • Is there any association between hours of study and grades?
  • Is there any association between the number of temples in a city and its murder rate?
  • What happens to sweater sales as temperature increases? What is the strength of the association between them?
  • What happens to ice-cream sales as temperature increases? What is the strength of the association between them?
  • How do we quantify the association?
  • Which of the above examples has a very strong association?
  • Correlation

Correlation coefficient

  • It is a measure of linear association between two variables
  • r is the ratio of the covariance of X and Y to the square root of the product of their individual variances

$$r = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$

  • Correlation = 0: no linear association
  • Correlation 0 to 0.25: negligible positive association
  • Correlation 0.25 to 0.5: weak positive association
  • Correlation 0.5 to 0.75: moderate positive association
  • Correlation > 0.75: very strong positive association
  • Negative values are read the same way, with the direction of the association reversed
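
As a minimal sketch of the formula above, the snippet below computes r from the covariance and the two variances and checks it against `np.corrcoef`. The hours-of-study and grades numbers are made up for illustration; they are an assumption and do not come from any dataset used in this tutorial.

```python
import numpy as np

# Hypothetical data: hours of study and grades (illustrative values only)
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
grades = np.array([52.0, 55.0, 61.0, 64.0, 70.0, 71.0])

# Covariance of X and Y, and the two individual variances (population form)
cov_xy = np.mean((hours - hours.mean()) * (grades - grades.mean()))
var_x = np.var(hours)
var_y = np.var(grades)

# r = Cov(X, Y) / sqrt(Var(X) * Var(Y))
r_manual = cov_xy / np.sqrt(var_x * var_y)

# np.corrcoef returns the full correlation matrix; r is the off-diagonal entry
r_numpy = np.corrcoef(hours, grades)[0, 1]

print(r_manual, r_numpy)
```

Both values should agree, and for this made-up data r falls in the "very strong positive association" band.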

LAB: Correlation Calculation

  • Dataset: AirPassengers/AirPassengers.csv
  • Find the correlation between number of passengers and promotional budget.
  • Draw a scatter plot between number of passengers and promotional budget
  • Find the correlation between number of passengers and Service_Quality_Score
In [1]:
import pandas as pd
# Raw string, so the backslashes in the Windows path are not read as escape sequences
air = pd.read_csv(r"C:\Amrita\Datavedi\AirPassengers\AirPassengers.csv")
air.shape
Out[1]:
(80, 9)
In [2]:
air.columns.values
Out[2]:
array(['Week_num', 'Passengers', 'Promotion_Budget',
       'Service_Quality_Score', 'Holiday_week',
       'Delayed_Cancelled_flight_ind', 'Inter_metro_flight_ratio',
       'Bad_Weather_Ind', 'Technical_issues_ind'], dtype=object)
In [3]:
#Find the correlation between number of passengers and promotional budget.
import numpy as np
np.corrcoef(air.Passengers,air.Promotion_Budget)
Out[3]:
array([[ 1.        ,  0.96585103],
       [ 0.96585103,  1.        ]])
In [4]:
#Draw a scatter plot between number of passengers and promotional budget
import matplotlib.pyplot as plt
%matplotlib inline  
plt.scatter(air.Passengers, air.Promotion_Budget)
Out[4]:
<matplotlib.collections.PathCollection at 0x8feb8d0>
In [5]:
#Find the correlation between number of passengers and Service_Quality_Score
np.corrcoef(air.Passengers,air.Service_Quality_Score)
Out[5]:
array([[ 1.        , -0.88653002],
       [-0.88653002,  1.        ]])

Beyond Pearson Correlation

  • Correlation coefficient measures for different types of data

| Variable Y \ X | Quantitative/Continuous X | Ordinal/Ranked/Discrete X | Nominal/Categorical X |
|---|---|---|---|
| Quantitative Y | Pearson r | Biserial $r_b$ | Point Biserial $r_{pb}$ |
| Ordinal/Ranked/Discrete Y | Biserial $r_b$ | Spearman rho / Kendall's tau | Rank Biserial $r_{rb}$ |
| Nominal/Categorical Y | Point Biserial $r_{pb}$ | Rank Biserial $r_{rb}$ | Phi, Contingency Coeff, Cramér's V |
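
To make one entry of this table concrete, here is a small sketch of Spearman's rho computed as the Pearson correlation of the ranks. The `ranks` helper and the data are hypothetical, and the helper assumes there are no tied values.

```python
import numpy as np

def ranks(values):
    """Return 1-based ranks; assumes there are no tied values."""
    order = np.argsort(values)
    result = np.empty(len(values))
    result[order] = np.arange(1, len(values) + 1)
    return result

# Hypothetical monotonic but non-linear relation (illustrative values only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = x ** 3

# Pearson measures linear association; Spearman measures monotonic association
pearson_r = np.corrcoef(x, y)[0, 1]
spearman_rho = np.corrcoef(ranks(x), ranks(y))[0, 1]

print(pearson_r, spearman_rho)
```

Because y always increases with x, the rank correlation is 1, while Pearson r stays below 1 since the relation is not a straight line.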

From Correlation to Regression

  • Correlation is only a measure of association
  • On its own, it can't be used for prediction
  • Given a value of the predictor variable, it doesn't give us an estimate of the dependent variable
  • In the air passengers example, the correlation alone doesn't tell us how many passengers to expect for a given promotion budget
  • We need a model, an equation, a fit for the data
  • That fit is known as the regression line

What is Regression

  • A regression line is a mathematical formula that quantifies the general relation between a predictor/independent variable (the known variable x) and the target/dependent variable (the unknown variable y)
  • Below is the regression line. If we have data for x and y, we can build a model that generalizes their relation

$$ y = \beta_0 + \beta_1 x $$

  • What is the best fit for our data?
  • The one that goes through the core of the data
  • The one that minimizes the error

Regression

Regression Line fitting

Error

Minimizing the error

  • The best line will have the minimum error
  • Some errors are positive and some are negative, so they cancel out when added. Taking their plain sum is not a good idea
  • We can either minimize the sum of squared errors or minimize the sum of absolute errors
  • The sum of squared errors is mathematically convenient to minimize
  • The method of minimizing the sum of squared errors is called the least squares method of regression

Least Squares Estimation

  • X: $x_1$, $x_2$, $x_3$, … $x_n$
  • Y: $y_1$, $y_2$, $y_3$, … $y_n$
  • Imagine a line through all the points
  • Measure the deviation of each point from the line (the residual or error)
  • Square each deviation
  • Minimize the sum of the squared deviations

$$\sum e^2 = \sum (y - \hat{y})^2$$
$$\sum e^2 = \sum (y - (\beta_0 + \beta_1 x))^2$$

  • $\beta_0$ and $\beta_1$ are obtained by minimizing the sum of the squared residuals
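
The minimization has a well-known closed-form solution for a single predictor. The sketch below applies it to a handful of made-up points (an assumption, not tutorial data) and cross-checks the result against `np.polyfit`.

```python
import numpy as np

# Hypothetical (x, y) points, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Setting the derivatives of sum((y - (b0 + b1*x))**2) to zero gives:
#   b1 = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)**2)
#   b0 = y_mean - b1 * x_mean
x_mean, y_mean = x.mean(), y.mean()
beta_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta_0 = y_mean - beta_1 * x_mean

# Sum of squared residuals at the fitted line
residuals = y - (beta_0 + beta_1 * x)
sse = np.sum(residuals ** 2)

# Cross-check against NumPy's own least-squares polynomial fit (degree 1)
slope_check, intercept_check = np.polyfit(x, y, 1)
print(beta_0, beta_1, sse)
```

Any other choice of intercept or slope gives a larger sum of squared residuals for the same points, which is what "best fit" means here.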

LAB: Regression Line Fitting

  • Dataset: AirPassengers/AirPassengers.csv
  • Find the correlation between Promotion_Budget and Passengers
  • Draw a scatter plot between Promotion_Budget and Passengers. Is there any pattern between Promotion_Budget and Passengers?
  • Build a linear regression model on Promotion_Budget and Passengers.
  • Build a regression line to predict the passengers using Inter_metro_flight_ratio
In [6]:
import pandas as pd
# Raw string, so the backslashes in the Windows path are not read as escape sequences
air = pd.read_csv(r"C:\Amrita\Datavedi\AirPassengers\AirPassengers.csv")
air.shape
Out[6]:
(80, 9)
In [7]:
air.columns.values
Out[7]:
array(['Week_num', 'Passengers', 'Promotion_Budget',
       'Service_Quality_Score', 'Holiday_week',
       'Delayed_Cancelled_flight_ind', 'Inter_metro_flight_ratio',
       'Bad_Weather_Ind', 'Technical_issues_ind'], dtype=object)
In [8]:
air.head(5)
Out[8]:
Week_num Passengers Promotion_Budget Service_Quality_Score Holiday_week Delayed_Cancelled_flight_ind Inter_metro_flight_ratio Bad_Weather_Ind Technical_issues_ind
0 1 37824 517356 4.00000 NO NO 0.70 YES YES
1 2 43936 646086 2.67466 NO YES 0.80 YES YES
2 3 42896 638330 3.29473 NO NO 0.90 NO NO
3 4 35792 506492 3.85684 NO NO 0.40 NO NO
4 5 38624 609658 3.90757 NO NO 0.87 NO YES
In [9]:
# Find the correlation between Promotion_Budget and Passengers
import numpy as np
np.corrcoef(air.Passengers,air.Promotion_Budget)
Out[9]:
array([[ 1.        ,  0.96585103],
       [ 0.96585103,  1.        ]])
In [10]:
# Draw a scatter plot between Promotion_Budget and Passengers. Is there any pattern between Promotion_Budget and Passengers?

import matplotlib.pyplot as plt
%matplotlib inline 

plt.scatter(air.Passengers, air.Promotion_Budget)
Out[10]:
<matplotlib.collections.PathCollection at 0x90bda20>
In [11]:
#Build a linear regression model to estimate the expected passengers for a given Promotion_Budget (for example 650,000)
##Regression model: Promotion_Budget and passenger count

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(air[["Promotion_Budget"]], air[["Passengers"]])
predictions = lr.predict(air[["Promotion_Budget"]])
In [12]:
import statsmodels.formula.api as sm
model = sm.ols(formula='Passengers ~ Promotion_Budget', data=air)
fitted1 = model.fit()
In [13]:
fitted1.summary()
Out[13]:
OLS Regression Results
Dep. Variable: Passengers R-squared: 0.933
Model: OLS Adj. R-squared: 0.932
Method: Least Squares F-statistic: 1084.
Date: Wed, 27 Jul 2016 Prob (F-statistic): 1.66e-47
Time: 11:48:26 Log-Likelihood: -751.34
No. Observations: 80 AIC: 1507.
Df Residuals: 78 BIC: 1511.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 1259.6058 1361.071 0.925 0.358 -1450.078 3969.290
Promotion_Budget 0.0695 0.002 32.923 0.000 0.065 0.074
Omnibus: 26.624 Durbin-Watson: 1.831
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5.188
Skew: -0.128 Prob(JB): 0.0747
Kurtosis: 1.779 Cond. No. 2.67e+06
In [14]:
# Build a regression line to predict the passengers using Inter_metro_flight_ratio

plt.scatter(air.Inter_metro_flight_ratio,air.Passengers)
Out[14]:
<matplotlib.collections.PathCollection at 0xb13f2b0>
In [15]:
import sklearn as sk

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(air[["Inter_metro_flight_ratio"]], air[["Passengers"]])
Out[15]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [16]:
predictions = lr.predict(air[["Inter_metro_flight_ratio"]])
In [17]:
import statsmodels.formula.api as sm
model = sm.ols(formula='Passengers ~ Inter_metro_flight_ratio', data=air)
fitted2 = model.fit()
In [18]:
fitted2.summary()
Out[18]:
OLS Regression Results
Dep. Variable: Passengers R-squared: 0.242
Model: OLS Adj. R-squared: 0.232
Method: Least Squares F-statistic: 24.90
Date: Wed, 27 Jul 2016 Prob (F-statistic): 3.58e-06
Time: 11:48:27 Log-Likelihood: -848.30
No. Observations: 80 AIC: 1701.
Df Residuals: 78 BIC: 1705.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 2.044e+04 4993.747 4.093 0.000 1.05e+04 3.04e+04
Inter_metro_flight_ratio 3.507e+04 7027.768 4.990 0.000 2.11e+04 4.91e+04
Omnibus: 10.172 Durbin-Watson: 1.385
Prob(Omnibus): 0.006 Jarque-Bera (JB): 10.098
Skew: 0.822 Prob(JB): 0.00641
Kurtosis: 3.573 Cond. No. 9.48

How good is my regression line?

  • Take an (x, y) point from the data.
  • Imagine that we substitute x into the regression line and get a prediction $y_{pred}$
  • If the regression line is a good fit, then we expect $y_{pred} = y$, or $(y - y_{pred}) = 0$
  • If we repeat this at every x, we get one error value $(y - y_{pred})$ per point
  • Some of them will be positive and some negative, so we square the errors before adding them up

$$SSE = \sum (y - \hat{y})^2$$

  • For a good model we need SSE to be zero or close to zero
  • SSE on its own is hard to interpret. For example, SSE = 100 is very small when y varies in the thousands, but the same value is very large when y varies in decimals.
  • We have to consider the variance of y while calculating the accuracy of the regression line
  • Error sum of squares (SSE, sum of squares of error)
    $$SSE = \sum (y - \hat{y})^2$$
  • Total variance in Y (SST, sum of squares total)
    $$SST = \sum (y - \bar{y})^2$$
    $$SST = \sum (y - \hat{y} + \hat{y} - \bar{y})^2$$
    $$SST = \sum (y - \hat{y})^2 + \sum (\hat{y} - \bar{y})^2$$
    $$SST = SSE + \sum (\hat{y} - \bar{y})^2$$
    $$SST = SSE + SSR$$
  • The cross term $2\sum (y - \hat{y})(\hat{y} - \bar{y})$ drops out because it is zero for a least squares line fitted with an intercept
  • So, the total variance in Y is divided into two parts:
    • Variance that can't be explained by x (error)
    • Variance that can be explained by x, using regression

Explained and Unexplained Variation

  • So, the total variance in Y is divided into two parts:
    • Variance that can be explained by x, using regression
    • Variance that can't be explained by x
      $$SST = SSE + SSR$$
      $$\text{Total Sum of Squares} = \text{Sum of Squares Error} + \text{Sum of Squares Regression}$$
      $$SST = \sum (y - \bar{y})^2$$
      $$SSE = \sum (y - \hat{y})^2$$
      $$SSR = \sum (\hat{y} - \bar{y})^2$$

R-Squared

  • A good fit will have
    • SSE (Minimum or Maximum?)
    • SSR (Minimum or Maximum?)
    • And we know SST= SSE + SSR
    • SSE/SST(Minimum or Maximum?)
    • SSR/SST(Minimum or Maximum?)
  • The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable
  • The coefficient of determination is also called R-squared and is denoted as $R^2$

$$ R^2 = \frac{SSR}{SST} $$

where $0 \le R^2 \le 1$
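
The decomposition and the resulting $R^2$ can be checked numerically. This is a small sketch on made-up data (an assumption, not the tutorial's datasets), using `np.polyfit` for the least squares line.

```python
import numpy as np

# Hypothetical data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([3.0, 4.5, 7.2, 8.1, 11.0, 11.9])

# Least-squares line with an intercept (needed for SST = SSE + SSR to hold)
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

sse = np.sum((y - y_hat) ** 2)         # unexplained variation
ssr = np.sum((y_hat - y.mean()) ** 2)  # variation explained by the regression
sst = np.sum((y - y.mean()) ** 2)      # total variation

r_squared = ssr / sst
print(sst, sse + ssr, r_squared)
```

For a simple regression with one predictor, this $R^2$ also equals the square of the correlation coefficient r between x and y.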

LAB: R-Squared

  • What is the R-square value of Passengers vs Promotion_Budget model?
  • What is the R-square value of Passengers vs Inter_metro_flight_ratio
In [19]:
#What is the R-square value of Passengers vs Promotion_Budget model?
fitted1.summary()
Out[19]:
OLS Regression Results
Dep. Variable: Passengers R-squared: 0.933
Model: OLS Adj. R-squared: 0.932
Method: Least Squares F-statistic: 1084.
Date: Wed, 27 Jul 2016 Prob (F-statistic): 1.66e-47
Time: 11:48:27 Log-Likelihood: -751.34
No. Observations: 80 AIC: 1507.
Df Residuals: 78 BIC: 1511.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 1259.6058 1361.071 0.925 0.358 -1450.078 3969.290
Promotion_Budget 0.0695 0.002 32.923 0.000 0.065 0.074
Omnibus: 26.624 Durbin-Watson: 1.831
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5.188
Skew: -0.128 Prob(JB): 0.0747
Kurtosis: 1.779 Cond. No. 2.67e+06
In [20]:
#What is the R-square value of Passengers vs Inter_metro_flight_ratio

fitted2.summary()
Out[20]:
OLS Regression Results
Dep. Variable: Passengers R-squared: 0.242
Model: OLS Adj. R-squared: 0.232
Method: Least Squares F-statistic: 24.90
Date: Wed, 27 Jul 2016 Prob (F-statistic): 3.58e-06
Time: 11:48:27 Log-Likelihood: -848.30
No. Observations: 80 AIC: 1701.
Df Residuals: 78 BIC: 1705.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 2.044e+04 4993.747 4.093 0.000 1.05e+04 3.04e+04
Inter_metro_flight_ratio 3.507e+04 7027.768 4.990 0.000 2.11e+04 4.91e+04
Omnibus: 10.172 Durbin-Watson: 1.385
Prob(Omnibus): 0.006 Jarque-Bera (JB): 10.098
Skew: 0.822 Prob(JB): 0.00641
Kurtosis: 3.573 Cond. No. 9.48

Multiple Regression

  • Using multiple predictor variables instead of a single variable
  • Instead of a best-fit line, we now need to find the best-fit plane (a hyperplane when there are more than two predictors)
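
Before the lab code, here is a minimal sketch of what "fitting a plane" means. The data is synthetic and lies exactly on a known plane (an assumption made so that the recovered coefficients are easy to verify); `np.linalg.lstsq` solves the least squares problem.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical predictors and a target that lies exactly on a known plane
x1 = rng.uniform(0, 10, size=50)
x2 = rng.uniform(0, 5, size=50)
y = 2.0 + 3.0 * x1 - 1.5 * x2

# Design matrix: a column of ones for the intercept, then one column per predictor
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares solution of X @ coef ~= y
coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # intercept, coefficient of x1, coefficient of x2
```

With noise-free data the fit recovers the plane's intercept and slopes; with real data the same call returns the plane that minimizes the sum of squared errors.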

Code – Multiple Regression

In [21]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(air[["Promotion_Budget", "Inter_metro_flight_ratio"]], air[["Passengers"]])
Out[21]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [22]:
predictions = lr.predict(air[["Promotion_Budget", "Inter_metro_flight_ratio"]])
In [23]:
import statsmodels.formula.api as sm
model = sm.ols(formula='Passengers ~ Promotion_Budget+Inter_metro_flight_ratio', data=air)
fitted = model.fit()
fitted.summary()
Out[23]:
OLS Regression Results
Dep. Variable: Passengers R-squared: 0.934
Model: OLS Adj. R-squared: 0.932
Method: Least Squares F-statistic: 540.5
Date: Wed, 27 Jul 2016 Prob (F-statistic): 4.76e-46
Time: 11:48:27 Log-Likelihood: -750.96
No. Observations: 80 AIC: 1508.
Df Residuals: 77 BIC: 1515.
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 2017.7724 1624.803 1.242 0.218 -1217.624 5253.169
Promotion_Budget 0.0707 0.002 28.297 0.000 0.066 0.076
Inter_metro_flight_ratio -2121.5208 2473.189 -0.858 0.394 -7046.268 2803.227
Omnibus: 26.259 Durbin-Watson: 1.800
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5.075
Skew: -0.096 Prob(JB): 0.0791
Kurtosis: 1.781 Cond. No. 5.25e+06

Individual Impact of variables

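  • Look at the P-value
  • It is the probability of seeing a coefficient estimate at least as extreme as the one observed if the true coefficient were zero. It is not the probability of the hypothesis being right.
  • Each individual variable coefficient is tested for significance
  • Under the null hypothesis, the test statistic for each beta coefficient follows a t distribution
  • Individual P-values tell us about the significance of each variable
  • A variable is conventionally called significant if its P-value is less than 5%. The smaller the P-value, the stronger the evidence that the variable has an impact
  • Note that it is possible for all the variables in a regression together to produce a great fit, and yet for very few of them to be individually significant.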

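To test
$$H_0 : \beta_i = 0$$
$$H_a : \beta_i \neq 0$$

Test Statistic

$$t = \frac{b_i}{s(b_i)}$$

Here $b_i$ is the estimated coefficient, $s(b_i)$ is its standard error, n is the number of observations and k is the number of predictors.

Reject $H_0$ if

$$t > t\left(\frac{\alpha}{2}; n-k-1\right)$$

or
$$t < -t\left(\frac{\alpha}{2}; n-k-1\right)$$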
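
The test can be worked through by hand with numbers from the two-predictor OLS summary above (Out[23]). In this sketch the coefficients and standard errors are copied from that output; the critical value of about 1.99 for 77 degrees of freedom is an assumption read from a standard t table, not something computed here.

```python
# Coefficients and standard errors copied from the two-predictor OLS summary (Out[23])
estimates = {
    "Intercept": (2017.7724, 1624.803),
    "Inter_metro_flight_ratio": (-2121.5208, 2473.189),
}

# Assumption: two-sided critical value for alpha = 0.05 with n - k - 1 = 80 - 2 - 1 = 77
# degrees of freedom, read from a t table (approximately 1.99)
T_CRITICAL = 1.99

t_stats = {}
for name, (coef, std_err) in estimates.items():
    t = coef / std_err              # t = b_i / s(b_i)
    t_stats[name] = t
    reject = abs(t) > T_CRITICAL    # same as t > t_crit or t < -t_crit
    print(name, round(t, 3), "reject H0" if reject else "fail to reject H0")
```

Both t values match the `t` column of the summary, and neither exceeds the critical value in absolute terms, which agrees with their P-values being above 5%.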

LAB: Multiple Regression

  • Build a multiple regression model to predict the number of passengers
  • What is R-square value
  • Are there any predictor variables that are not impacting the dependent variable
In [24]:
#Build a multiple regression model to predict the number of passengers

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(air[["Promotion_Budget", "Inter_metro_flight_ratio", "Service_Quality_Score"]], air[["Passengers"]])
predictions = lr.predict(air[["Promotion_Budget", "Inter_metro_flight_ratio", "Service_Quality_Score"]])
In [25]:
import statsmodels.formula.api as sm
model = sm.ols(formula='Passengers ~ Promotion_Budget+Inter_metro_flight_ratio+Service_Quality_Score', data=air)
fitted = model.fit()
fitted.summary()
Out[25]:
OLS Regression Results
Dep. Variable: Passengers R-squared: 0.951
Model: OLS Adj. R-squared: 0.949
Method: Least Squares F-statistic: 495.6
Date: Wed, 27 Jul 2016 Prob (F-statistic): 8.71e-50
Time: 11:48:28 Log-Likelihood: -738.45
No. Observations: 80 AIC: 1485.
Df Residuals: 76 BIC: 1494.
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 1.921e+04 3542.694 5.424 0.000 1.22e+04 2.63e+04
Promotion_Budget 0.0555 0.004 15.476 0.000 0.048 0.063
Inter_metro_flight_ratio -2003.4508 2129.095 -0.941 0.350 -6243.912 2237.010
Service_Quality_Score -2802.0708 530.382 -5.283 0.000 -3858.419 -1745.723
Omnibus: 6.902 Durbin-Watson: 2.312
Prob(Omnibus): 0.032 Jarque-Bera (JB): 2.759
Skew: -0.051 Prob(JB): 0.252
Kurtosis: 2.096 Cond. No. 8.22e+06
  • What is the R-square value?

0.951

  • Are there any predictor variables that are not impacting the dependent variable?

Inter_metro_flight_ratio. Its P-value is 0.350, well above 5%.

Adjusted R-Squared

  • Is it good to have as many independent variables as possible? No
  • R-squared is deceptive: it never decreases when a new X variable is added to the model, even if that variable is unrelated to y
  • We need a better measure, or an adjustment to the original R-squared formula.
  • Adjusted R-squared
    • Its value depends on the number of explanatory variables
    • It imposes a penalty for adding additional explanatory variables
    • It is usually written as $\bar{R}^2$
    • It is very different from $R^2$ when there are too many predictors and n is small

    $$ \bar{R}^2 = R^2 - \frac{k-1}{n-k}(1-R^2) $$
    where n is the number of observations and k is the number of parameters, counting the intercept
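
As a quick check of the formula, the sketch below applies it to the $R^2$ values reported in this tutorial's Adj_Sample lab (12 observations; models with 3, 6 and 8 predictors). The function name `adjusted_r_squared` is a hypothetical helper, and k is taken as the number of predictors plus one for the intercept.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R-squared; k counts all estimated parameters, intercept included."""
    return r_squared - (k - 1) / (n - k) * (1 - r_squared)

# R-squared values reported in the Adj_Sample lab of this tutorial (n = 12 observations);
# k = number of predictors + 1 for the intercept
n = 12
models = {"Model1": (0.684, 4), "Model2": (0.717, 7), "Model3": (0.805, 9)}

adjusted = {name: adjusted_r_squared(r2, n, k) for name, (r2, k) in models.items()}
for name, value in adjusted.items():
    print(name, round(value, 3))
```

Up to rounding of the inputs, the results agree with the Adj. R-squared values that statsmodels reports for the three models: the penalty grows quickly as k approaches n.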

LAB: Adjusted R-Square

  • Dataset: “Adjusted RSquare/Adj_Sample.csv”
  • Build a model to predict y using x1,x2 and x3. Note down R-Square and Adj R-Square values
  • Build a model to predict y using x1,x2,x3,x4,x5 and x6. Note down R-Square and Adj R-Square values
  • Build a model to predict y using x1,x2,x3,x4,x5,x6,x7 and x8. Note down R-Square and Adj R-Square values
In [26]:
# Raw string, so the backslashes in the Windows path are not read as escape sequences
adj_sample=pd.read_csv(r"C:\Amrita\Datavedi\Adjusted RSquare\Adj_Sample.csv")
adj_sample.shape
Out[26]:
(12, 9)
In [27]:
adj_sample.columns.values
Out[27]:
array(['Y', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8'], dtype=object)
In [28]:
#Build a model to predict y using x1,x2 and x3. Note down R-Square and Adj R-Square values 
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(adj_sample[["x1", "x2", "x3"]], adj_sample[["Y"]])
predictions = lr.predict(adj_sample[["x1", "x2", "x3"]])
In [29]:
import statsmodels.formula.api as sm
model = sm.ols(formula='Y ~ x1+x2+x3', data=adj_sample)
fitted1 = model.fit()
fitted1.summary()
C:\Anaconda3\lib\site-packages\scipy\stats\stats.py:1557: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=12
  "anyway, n=%i" % int(n))
Out[29]:
OLS Regression Results
Dep. Variable: Y R-squared: 0.684
Model: OLS Adj. R-squared: 0.566
Method: Least Squares F-statistic: 5.785
Date: Wed, 27 Jul 2016 Prob (F-statistic): 0.0211
Time: 11:48:28 Log-Likelihood: -10.430
No. Observations: 12 AIC: 28.86
Df Residuals: 8 BIC: 30.80
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -2.8798 1.163 -2.477 0.038 -5.561 -0.199
x1 -0.4894 0.370 -1.324 0.222 -1.342 0.363
x2 0.0029 0.001 2.586 0.032 0.000 0.005
x3 0.4572 0.176 2.595 0.032 0.051 0.864
Omnibus: 1.113 Durbin-Watson: 1.978
Prob(Omnibus): 0.573 Jarque-Bera (JB): 0.763
Skew: -0.562 Prob(JB): 0.683
Kurtosis: 2.489 Cond. No. 6.00e+03
In [30]:
#Build a model to predict y using x1,x2,x3,x4,x5 and x6. Note down R-Square and Adj R-Square values 

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(adj_sample[["x1", "x2", "x3", "x4", "x5", "x6"]], adj_sample[["Y"]])
predictions = lr.predict(adj_sample[["x1", "x2", "x3", "x4", "x5", "x6"]])
In [31]:
import statsmodels.formula.api as sm
model = sm.ols(formula='Y ~ x1+x2+x3+x4+x5+x6', data=adj_sample)
fitted2 = model.fit()
fitted2.summary()
C:\Anaconda3\lib\site-packages\scipy\stats\stats.py:1557: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=12
  "anyway, n=%i" % int(n))
Out[31]:
OLS Regression Results
Dep. Variable: Y R-squared: 0.717
Model: OLS Adj. R-squared: 0.377
Method: Least Squares F-statistic: 2.111
Date: Wed, 27 Jul 2016 Prob (F-statistic): 0.215
Time: 11:48:28 Log-Likelihood: -9.7790
No. Observations: 12 AIC: 33.56
Df Residuals: 5 BIC: 36.95
Df Model: 6
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -5.3751 4.687 -1.147 0.303 -17.423 6.673
x1 -0.6697 0.537 -1.247 0.268 -2.050 0.711
x2 0.0030 0.002 1.956 0.108 -0.001 0.007
x3 0.5063 0.249 2.036 0.097 -0.133 1.146
x4 0.0376 0.084 0.449 0.672 -0.178 0.253
x5 0.0436 0.169 0.258 0.806 -0.390 0.478
x6 0.0516 0.088 0.588 0.582 -0.174 0.277
Omnibus: 0.426 Durbin-Watson: 2.065
Prob(Omnibus): 0.808 Jarque-Bera (JB): 0.434
Skew: -0.347 Prob(JB): 0.805
Kurtosis: 2.378 Cond. No. 1.98e+04
In [32]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(adj_sample[["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]], adj_sample[["Y"]])
predictions = lr.predict(adj_sample[["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]])
In [33]:
import statsmodels.formula.api as sm
model = sm.ols(formula='Y ~ x1+x2+x3+x4+x5+x6+x7+x8', data=adj_sample)
fitted3 = model.fit()
fitted3.summary()
C:\Anaconda3\lib\site-packages\scipy\stats\stats.py:1557: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=12
  "anyway, n=%i" % int(n))
Out[33]:
OLS Regression Results
Dep. Variable: Y R-squared: 0.805
Model: OLS Adj. R-squared: 0.285
Method: Least Squares F-statistic: 1.549
Date: Wed, 27 Jul 2016 Prob (F-statistic): 0.393
Time: 11:48:28 Log-Likelihood: -7.5390
No. Observations: 12 AIC: 33.08
Df Residuals: 3 BIC: 37.44
Df Model: 8
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 17.0440 19.903 0.856 0.455 -46.297 80.385
x1 -0.0956 0.761 -0.126 0.908 -2.519 2.328
x2 0.0007 0.003 0.291 0.790 -0.007 0.009
x3 0.5157 0.306 1.684 0.191 -0.459 1.490
x4 0.0579 0.103 0.560 0.615 -0.271 0.387
x5 0.0858 0.191 0.448 0.684 -0.524 0.695
x6 -0.1747 0.220 -0.795 0.485 -0.874 0.525
x7 -0.0324 0.153 -0.212 0.846 -0.519 0.455
x8 -0.2321 0.207 -1.124 0.343 -0.890 0.425
Omnibus: 1.329 Durbin-Watson: 1.594
Prob(Omnibus): 0.514 Jarque-Bera (JB): 0.875
Skew: -0.339 Prob(JB): 0.646
Kurtosis: 1.863 Cond. No. 7.85e+04

| Model | $R^2$ | Adj $R^2$ |
|---|---|---|
| Model1 | 0.684 | 0.566 |
| Model2 | 0.717 | 0.377 |
| Model3 | 0.805 | 0.285 |

R-Squared vs Adjusted R-Squared

We have built three models on the Adj_Sample data (model1, model2 and model3) with different numbers of variables. R-squared keeps rising as variables are added, while adjusted R-squared falls.

LAB: Multiple Regression- issues

  • Import Final Exam Score data
  • Build a model to predict final score using the rest of the variables.
  • How are Sem2_Math & Final score related? As Sem2_Math score increases, what happens to Final score?
  • Remove “Sem1_Math” variable from the model and rebuild the model
  • Is there any change in R square or Adj R square
  • How are Sem2_Math & Final score related now? As Sem2_Math score increases, what happens to Final score?
  • Draw a scatter plot between Sem1_Math & Sem2_Math
  • Find the correlation between Sem1_Math & Sem2_Math
In [34]:
#Import Final Exam Score data
final_exam=pd.read_csv(r"C:\Amrita\Datavedi\Final Exam\Final Exam Score.csv")
In [35]:
#Size of the data
final_exam.shape
Out[35]:
(24, 5)
In [36]:
#Variable names
final_exam.columns
Out[36]:
Index(['Sem1_Science', 'Sem2_Science', 'Sem1_Math', 'Sem2_Math',
       'Final_exam_marks'],
      dtype='object')
In [37]:
#Build a model to predict final score using the rest of the variables.
from sklearn.linear_model import LinearRegression
lr1 = LinearRegression()
lr1.fit(final_exam[["Sem1_Science", "Sem2_Science", "Sem1_Math", "Sem2_Math"]], final_exam[["Final_exam_marks"]])
predictions1 = lr1.predict(final_exam[["Sem1_Science", "Sem2_Science", "Sem1_Math", "Sem2_Math"]])

import statsmodels.formula.api as sm
model1 = sm.ols(formula='Final_exam_marks ~ Sem1_Science+Sem2_Science+Sem1_Math+Sem2_Math', data=final_exam)
fitted1 = model1.fit()
fitted1.summary()
Out[37]:
OLS Regression Results
Dep. Variable: Final_exam_marks R-squared: 0.990
Model: OLS Adj. R-squared: 0.987
Method: Least Squares F-statistic: 452.3
Date: Wed, 27 Jul 2016 Prob (F-statistic): 1.50e-18
Time: 11:48:28 Log-Likelihood: -38.099
No. Observations: 24 AIC: 86.20
Df Residuals: 19 BIC: 92.09
Df Model: 4
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -1.6226 1.999 -0.812 0.427 -5.806 2.561
Sem1_Science 0.1738 0.063 2.767 0.012 0.042 0.305
Sem2_Science 0.2785 0.052 5.379 0.000 0.170 0.387
Sem1_Math 0.7890 0.197 4.002 0.001 0.376 1.202
Sem2_Math -0.2063 0.191 -1.078 0.294 -0.607 0.194
Omnibus: 6.343 Durbin-Watson: 1.863
Prob(Omnibus): 0.042 Jarque-Bera (JB): 4.332
Skew: 0.973 Prob(JB): 0.115
Kurtosis: 3.737 Cond. No. 1.20e+03
In [38]:
fitted1.rsquared
Out[38]:
0.98960765475687229
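  • How are Sem2_Math & Final score related? As Sem2_Math score increases, what happens to Final score?

According to this model, as Sem2_Math score increases the Final score decreases (coefficient -0.2063), although the coefficient is not significant (P-value 0.294). A negative sign here is counterintuitive.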

In [39]:
#Remove "Sem1_Math" variable from the model and rebuild the model
from sklearn.linear_model import LinearRegression
lr2 = LinearRegression()
lr2.fit(final_exam[["Sem1_Science", "Sem2_Science", "Sem2_Math"]], final_exam[["Final_exam_marks"]])
predictions2 = lr2.predict(final_exam[["Sem1_Science", "Sem2_Science", "Sem2_Math"]])

import statsmodels.formula.api as sm
model2 = sm.ols(formula='Final_exam_marks ~ Sem1_Science+Sem2_Science+Sem2_Math', data=final_exam)
fitted2 = model2.fit()
fitted2.summary()
Out[39]:
OLS Regression Results
Dep. Variable: Final_exam_marks R-squared: 0.981
Model: OLS Adj. R-squared: 0.978
Method: Least Squares F-statistic: 341.4
Date: Wed, 27 Jul 2016 Prob (F-statistic): 2.44e-17
Time: 11:48:29 Log-Likelihood: -45.436
No. Observations: 24 AIC: 98.87
Df Residuals: 20 BIC: 103.6
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -2.3986 2.632 -0.911 0.373 -7.889 3.092
Sem1_Science 0.2130 0.082 2.595 0.017 0.042 0.384
Sem2_Science 0.2686 0.068 3.925 0.001 0.126 0.411
Sem2_Math 0.5320 0.067 7.897 0.000 0.391 0.673
Omnibus: 5.869 Durbin-Watson: 2.424
Prob(Omnibus): 0.053 Jarque-Bera (JB): 3.793
Skew: 0.864 Prob(JB): 0.150
Kurtosis: 3.898 Cond. No. 1.03e+03
  • Is there any change in R square or Adj R square?

| Model | $R^2$ | Adj $R^2$ |
|---|---|---|
| model1 | 0.990 | 0.987 |
| model2 | 0.981 | 0.978 |

  • How are Sem2_Math & Final score related now? As Sem2_Math score increases, what happens to Final score?

As Sem2_Math score increases, the Final score also increases (coefficient 0.5320, P-value 0.000).

In [40]:
#Draw a scatter plot between Sem1_Math & Sem2_Math

import matplotlib.pyplot as plt
%matplotlib inline 
plt.scatter(final_exam.Sem1_Math,final_exam.Sem2_Math)
Out[40]:
<matplotlib.collections.PathCollection at 0xb2cf0f0>
In [41]:
#Find the correlation between Sem1_Math & Sem2_Math 
np.corrcoef(final_exam.Sem1_Math,final_exam.Sem2_Math)
Out[41]:
array([[ 1.       ,  0.9924948],
       [ 0.9924948,  1.       ]])

Multicollinearity

  • Multiple regression is wonderful, in that it allows you to consider the effect of multiple variables simultaneously.
  • Multiple regression is extremely unpleasant, because it allows you to consider the effect of multiple variables simultaneously.
  • The relationships between the explanatory variables are the key to understanding multiple regression.
  • Multicollinearity (or intercorrelation) exists when at least some of the predictor variables are correlated among themselves.
  • The parameter estimates will have inflated variance in the presence of multicollinearity
  • Sometimes the signs of the parameter estimates change
  • As the relation between the independent variables grows stronger, the variance of the parameter estimates tends to infinity. Can you prove it?

Multicollinearity Detection

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4$$

  • Build a model of $X_1$ vs $X_2$, $X_3$, $X_4$ and find its $R^2$, say R1
  • Build a model of $X_2$ vs $X_1$, $X_3$, $X_4$ and find its $R^2$, say R2
  • Build a model of $X_3$ vs $X_1$, $X_2$, $X_4$ and find its $R^2$, say R3
  • Build a model of $X_4$ vs $X_1$, $X_2$, $X_3$ and find its $R^2$, say R4
  • For example, if R3 is 95% then we don't really need $X_3$ in the model
  • It can be explained almost entirely as a linear combination of the other three
  • For each variable we find its individual $R^2$.
  • $\frac{1}{1-R^2}$ is called the VIF (variance inflation factor).
  • The VIF option in SAS automatically calculates VIF values for each of the predictor variables

| $R^2$ | 40% | 50% | 60% | 70% | 75% | 80% | 90% |
|---|---|---|---|---|---|---|---|
| VIF | 1.67 | 2.00 | 2.50 | 3.33 | 4.00 | 5.00 | 10.00 |
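
The detection procedure above can be sketched with NumPy alone. The helper names `r_squared` and `vif_values` are hypothetical, and the predictors are synthetic: `x2` is built as a near copy of `x1` so that both should show a large VIF, while `x3` is unrelated to either.

```python
import numpy as np

def r_squared(X, y):
    """R-squared of a least-squares fit of y on the columns of X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    return 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

def vif_values(X):
    """VIF of each column: regress it on the other columns, then 1 / (1 - R^2)."""
    vifs = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        vifs.append(1.0 / (1.0 - r_squared(others, X[:, j])))
    return vifs

# Hypothetical predictors: x2 is nearly a copy of x1, x3 is unrelated to both
rng = np.random.default_rng(42)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)

vifs = vif_values(np.column_stack([x1, x2, x3]))
print([round(v, 2) for v in vifs])
```

The two nearly identical predictors get very large VIF values and the unrelated one stays close to 1, which is the pattern to look for in the lab that follows.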

LAB: Multicollinearity

  • Identify the Multicollinearity in the Final Exam Score model
  • Drop the variable one by one to reduce the multicollinearity
  • Identify and eliminate the Multicollinearity in the Air passengers model
In [42]:
from sklearn.linear_model import LinearRegression
lr1 = LinearRegression()
lr1.fit(final_exam[["Sem1_Science", "Sem2_Science", "Sem1_Math", "Sem2_Math"]], final_exam[["Final_exam_marks"]])
predictions1 = lr1.predict(final_exam[["Sem1_Science", "Sem2_Science", "Sem1_Math", "Sem2_Math"]])
In [44]:
import statsmodels.formula.api as sm
model1 = sm.ols(formula='Final_exam_marks ~ Sem1_Science+Sem2_Science+Sem1_Math+Sem2_Math', data=final_exam)
fitted1 = model1.fit()
fitted1.summary()
Out[44]:
OLS Regression Results
Dep. Variable: Final_exam_marks R-squared: 0.990
Model: OLS Adj. R-squared: 0.987
Method: Least Squares F-statistic: 452.3
Date: Wed, 27 Jul 2016 Prob (F-statistic): 1.50e-18
Time: 11:52:41 Log-Likelihood: -38.099
No. Observations: 24 AIC: 86.20
Df Residuals: 19 BIC: 92.09
Df Model: 4
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -1.6226 1.999 -0.812 0.427 -5.806 2.561
Sem1_Science 0.1738 0.063 2.767 0.012 0.042 0.305
Sem2_Science 0.2785 0.052 5.379 0.000 0.170 0.387
Sem1_Math 0.7890 0.197 4.002 0.001 0.376 1.202
Sem2_Math -0.2063 0.191 -1.078 0.294 -0.607 0.194
Omnibus: 6.343 Durbin-Watson: 1.863
Prob(Omnibus): 0.042 Jarque-Bera (JB): 4.332
Skew: 0.973 Prob(JB): 0.115
Kurtosis: 3.737 Cond. No. 1.20e+03
In [45]:
fitted1.summary2()
Out[45]:
Model: OLS Adj. R-squared: 0.987
Dependent Variable: Final_exam_marks AIC: 86.1980
Date: 2016-07-27 11:53 BIC: 92.0883
No. Observations: 24 Log-Likelihood: -38.099
Df Model: 4 F-statistic: 452.3
Df Residuals: 19 Prob (F-statistic): 1.50e-18
R-squared: 0.990 Scale: 1.7694
Coef. Std.Err. t P>|t| [0.025 0.975]
Intercept -1.6226 1.9987 -0.8118 0.4269 -5.8060 2.5607
Sem1_Science 0.1738 0.0628 2.7668 0.0123 0.0423 0.3052
Sem2_Science 0.2785 0.0518 5.3795 0.0000 0.1702 0.3869
Sem1_Math 0.7890 0.1971 4.0023 0.0008 0.3764 1.2016
Sem2_Math -0.2063 0.1914 -1.0782 0.2944 -0.6069 0.1942
Omnibus: 6.343 Durbin-Watson: 1.863
Prob(Omnibus): 0.042 Jarque-Bera (JB): 4.332
Skew: 0.973 Prob(JB): 0.115
Kurtosis: 3.737 Condition No.: 1200
In [48]:
#Code for VIF Calculation

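#Writing a function to calculate the VIF values
#Each predictor is regressed on the remaining predictors; VIF = 1/(1 - R-squared)
#Assumes the column names are valid Python identifiers, so they can be used in a formula

def vif_cal(input_data, dependent_col):
    x_vars = input_data.drop([dependent_col], axis=1)
    xvar_names = x_vars.columns
    for target in xvar_names:
        # Regress the current predictor on all the other predictors
        others = xvar_names.drop(target)
        formula = target + " ~ " + " + ".join(others)
        rsq = sm.ols(formula=formula, data=x_vars).fit().rsquared
        vif = round(1/(1-rsq), 2)
        print(target, " VIF = ", vif)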
In [49]:
#Calculating VIF values using that function
vif_cal(input_data=final_exam, dependent_col="Final_exam_marks")
Sem1_Science  VIF =  7.4
Sem2_Science  VIF =  5.4
Sem1_Math  VIF =  68.79
Sem2_Math  VIF =  68.01
In [51]:
import statsmodels.formula.api as sm
model2 = sm.ols(formula='Final_exam_marks ~ Sem1_Science+Sem2_Science+Sem2_Math', data=final_exam)
fitted2 = model2.fit()
fitted2.summary()
Out[51]:
OLS Regression Results
Dep. Variable: Final_exam_marks R-squared: 0.981
Model: OLS Adj. R-squared: 0.978
Method: Least Squares F-statistic: 341.4
Date: Wed, 27 Jul 2016 Prob (F-statistic): 2.44e-17
Time: 12:03:55 Log-Likelihood: -45.436
No. Observations: 24 AIC: 98.87
Df Residuals: 20 BIC: 103.6
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -2.3986 2.632 -0.911 0.373 -7.889 3.092
Sem1_Science 0.2130 0.082 2.595 0.017 0.042 0.384
Sem2_Science 0.2686 0.068 3.925 0.001 0.126 0.411
Sem2_Math 0.5320 0.067 7.897 0.000 0.391 0.673
Omnibus: 5.869 Durbin-Watson: 2.424
Prob(Omnibus): 0.053 Jarque-Bera (JB): 3.793
Skew: 0.864 Prob(JB): 0.150
Kurtosis: 3.898 Cond. No. 1.03e+03
In [52]:
vif_cal(input_data=final_exam.drop(["Sem1_Math"], axis=1), dependent_col="Final_exam_marks")
Sem1_Science  VIF =  7.22
Sem2_Science  VIF =  5.38
Sem2_Math  VIF =  4.81
In [53]:
vif_cal(input_data=final_exam.drop(["Sem1_Math","Sem1_Science"], axis=1), dependent_col="Final_exam_marks")
Sem2_Science  VIF =  3.4
Sem2_Math  VIF =  3.4
In [54]:
#Identify and eliminate the Multicollinearity in the Air passengers model
from sklearn.linear_model import LinearRegression
features = ["Promotion_Budget", "Inter_metro_flight_ratio", "Service_Quality_Score"]
lr = LinearRegression()
lr.fit(air[features], air[["Passengers"]])
predictions = lr.predict(air[features])

import statsmodels.formula.api as sm
model = sm.ols(formula='Passengers ~ Promotion_Budget+Inter_metro_flight_ratio+Service_Quality_Score', data=air)
fitted = model.fit()
fitted.summary()
Out[54]:
OLS Regression Results
Dep. Variable: Passengers R-squared: 0.951
Model: OLS Adj. R-squared: 0.949
Method: Least Squares F-statistic: 495.6
Date: Wed, 27 Jul 2016 Prob (F-statistic): 8.71e-50
Time: 12:11:42 Log-Likelihood: -738.45
No. Observations: 80 AIC: 1485.
Df Residuals: 76 BIC: 1494.
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 1.921e+04 3542.694 5.424 0.000 1.22e+04 2.63e+04
Promotion_Budget 0.0555 0.004 15.476 0.000 0.048 0.063
Inter_metro_flight_ratio -2003.4508 2129.095 -0.941 0.350 -6243.912 2237.010
Service_Quality_Score -2802.0708 530.382 -5.283 0.000 -3858.419 -1745.723
Omnibus: 6.902 Durbin-Watson: 2.312
Prob(Omnibus): 0.032 Jarque-Bera (JB): 2.759
Skew: -0.051 Prob(JB): 0.252
Kurtosis: 2.096 Cond. No. 8.22e+06
In [55]:
#Calculating VIF values using that function

vif_cal(input_data=air, dependent_col="Passengers")
---------------------------------------------------------------------------
PatsyError                                Traceback (most recent call last)
<ipython-input-55-b281c5a9ab02> in <module>()
      1 #Calculating VIF values using that function
----> 2 vif_cal(input_data=air, dependent_col="Passengers")

<ipython-input-48-149609aac97d> in vif_cal(input_data, dependent_col)
      9         y=x_vars[xvar_names[i]]
     10         x=x_vars[xvar_names.drop(xvar_names[i])]
---> 11         rsq=sm.ols(formula="y~x", data=x_vars).fit().rsquared
     12         vif=round(1/(1-rsq),2)
     13         print (xvar_names[i], " VIF = " , vif)

C:\Anaconda3\lib\site-packages\statsmodels\base\model.py in from_formula(cls, formula, data, subset, *args, **kwargs)
    145         (endog, exog), missing_idx = handle_formula_data(data, None, formula,
    146                                                          depth=eval_env,
--> 147                                                          missing=missing)
    148         kwargs.update({'missing_idx': missing_idx,
    149                        'missing': missing})

C:\Anaconda3\lib\site-packages\statsmodels\formula\formulatools.py in handle_formula_data(Y, X, formula, depth, missing)
     63         if data_util._is_using_pandas(Y, None):
     64             result = dmatrices(formula, Y, depth, return_type='dataframe',
---> 65                                NA_action=na_action)
     66         else:
     67             result = dmatrices(formula, Y, depth, return_type='dataframe',

C:\Anaconda3\lib\site-packages\patsy\highlevel.py in dmatrices(formula_like, data, eval_env, NA_action, return_type)
    295     eval_env = EvalEnvironment.capture(eval_env, reference=1)
    296     (lhs, rhs) = _do_highlevel_design(formula_like, data, eval_env,
--> 297                                       NA_action, return_type)
    298     if lhs.shape[1] == 0:
    299         raise PatsyError("model is missing required outcome variables")

C:\Anaconda3\lib\site-packages\patsy\highlevel.py in _do_highlevel_design(formula_like, data, eval_env, NA_action, return_type)
    150         return iter([data])
    151     design_infos = _try_incr_builders(formula_like, data_iter_maker, eval_env,
--> 152                                       NA_action)
    153     if design_infos is not None:
    154         return build_design_matrices(design_infos, data,

C:\Anaconda3\lib\site-packages\patsy\highlevel.py in _try_incr_builders(formula_like, data_iter_maker, eval_env, NA_action)
     55                                       data_iter_maker,
     56                                       eval_env,
---> 57                                       NA_action)
     58     else:
     59         return None

C:\Anaconda3\lib\site-packages\patsy\build.py in design_matrix_builders(termlists, data_iter_maker, eval_env, NA_action)
    694                                                    factor_states,
    695                                                    data_iter_maker,
--> 696                                                    NA_action)
    697     # Now we need the factor infos, which encapsulate the knowledge of
    698     # how to turn any given factor into a chunk of data:

C:\Anaconda3\lib\site-packages\patsy\build.py in _examine_factor_types(factors, factor_states, data_iter_maker, NA_action)
    446                     cat_sniffers[factor] = CategoricalSniffer(NA_action,
    447                                                               factor.origin)
--> 448                 done = cat_sniffers[factor].sniff(value)
    449                 if done:
    450                     examine_needed.remove(factor)

C:\Anaconda3\lib\site-packages\patsy\categorical.py in sniff(self, data)
    200             return True
    201 
--> 202         data = _categorical_shape_fix(data)
    203 
    204         for value in data:

C:\Anaconda3\lib\site-packages\patsy\categorical.py in _categorical_shape_fix(data)
    155     # wrong shape.
    156     if hasattr(data, "ndim") and data.ndim > 1:
--> 157         raise PatsyError("categorical data cannot be >1-dimensional")
    158     # coerce scalars into 1d, which is consistent with what we do for numeric
    159     # factors. (See statsmodels/statsmodels#1881)

PatsyError: categorical data cannot be >1-dimensional
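The PatsyError above is most likely caused by non-numeric columns in air. This is an assumption, since the column types are not shown here, but patsy treats a non-numeric block as categorical and rejects it when it has more than one column. One way around it is to compute VIF on the numeric predictors only. The sketch below does that with plain NumPy least squares instead of the "y~x" formula. The name vif_numeric and the demo frame are hypothetical and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def vif_numeric(input_data, dependent_col):
    """Hypothetical helper: VIF for the numeric predictors only."""
    # Drop the target, then keep only numeric columns (assumption: the
    # non-numeric columns are what break the "y~x" formula in vif_cal).
    x_vars = input_data.drop(columns=[dependent_col]).select_dtypes(include="number")
    vifs = {}
    for name in x_vars.columns:
        y = x_vars[name].to_numpy(dtype=float)
        others = x_vars.drop(columns=[name]).to_numpy(dtype=float)
        # Add an intercept column and regress this predictor on the others.
        design = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ coef
        rsq = 1 - resid.var() / y.var()
        vifs[name] = round(1 / (1 - rsq), 2)
    return pd.Series(vifs, name="VIF")

# Tiny made-up frame (not the AirPassengers data) with one text column.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "target": rng.normal(size=200),
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),   # nearly a copy of "a"
    "c": rng.normal(size=200),
    "flag": ["YES", "NO"] * 100,                # non-numeric, ignored
})
print(vif_numeric(demo, "target"))
```

On the real data the call would be vif_numeric(air, "Passengers"). Note that this skips text columns rather than encoding them. If such indicators belong in the model, converting them to 0/1 first would be the more complete fix.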

Lab: Multiple Regression

  • Dataset: Webpage_Product_Sales/Webpage_Product_Sales.csv
  • Build a model to predict sales using the rest of the variables
  • Drop the least significant variables based on p-values.
  • Is there any multicollinearity?
  • How many variables are there in the final model?
  • What is the R-squared of the final model?
  • Can you improve the model using same data and variables?
In [57]:
import pandas as pd 
Webpage_Product_Sales=pd.read_csv(r"C:\Amrita\Datavedi\Webpage_Product_Sales\Webpage_Product_Sales.csv")
Webpage_Product_Sales.shape
Out[57]:
(675, 12)
In [58]:
Webpage_Product_Sales.columns
Out[58]:
Index(['ID', 'DayofMonth', 'Weekday', 'Month', 'Social_Network_Ref_links',
       'Online_Ad_Paid_ref_links', 'Clicks_From_Serach_Engine',
       'Special_Discount', 'Holiday', 'Server_Down_time_Sec', 'Web_UI_Score',
       'Sales'],
      dtype='object')
In [59]:
import statsmodels.formula.api as sm
model1 = sm.ols(formula='Sales ~ Web_UI_Score+Server_Down_time_Sec+Holiday+Special_Discount+Clicks_From_Serach_Engine+Online_Ad_Paid_ref_links+Social_Network_Ref_links+Month+Weekday+DayofMonth', data=Webpage_Product_Sales)
fitted1 = model1.fit()
fitted1.summary()
Out[59]:
OLS Regression Results
Dep. Variable: Sales R-squared: 0.818
Model: OLS Adj. R-squared: 0.815
Method: Least Squares F-statistic: 298.4
Date: Wed, 27 Jul 2016 Prob (F-statistic): 5.54e-238
Time: 12:45:36 Log-Likelihood: -6456.7
No. Observations: 675 AIC: 1.294e+04
Df Residuals: 664 BIC: 1.299e+04
Df Model: 10
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 6545.8922 1286.240 5.089 0.000 4020.304 9071.481
Web_UI_Score -6.2582 11.545 -0.542 0.588 -28.928 16.412
Server_Down_time_Sec -134.0441 14.009 -9.569 0.000 -161.551 -106.537
Holiday 1.877e+04 683.077 27.477 0.000 1.74e+04 2.01e+04
Special_Discount 4718.3978 402.019 11.737 0.000 3929.016 5507.780
Clicks_From_Serach_Engine -0.1258 0.944 -0.133 0.894 -1.980 1.728
Online_Ad_Paid_ref_links 6.1557 1.002 6.142 0.000 4.188 8.124
Social_Network_Ref_links 6.6841 0.411 16.261 0.000 5.877 7.491
Month 481.0294 41.508 11.589 0.000 399.527 562.532
Weekday 1355.2153 67.224 20.160 0.000 1223.218 1487.213
DayofMonth 47.0579 15.198 3.096 0.002 17.216 76.900
Omnibus: 40.759 Durbin-Watson: 1.356
Prob(Omnibus): 0.000 Jarque-Bera (JB): 102.136
Skew: 0.297 Prob(JB): 6.63e-23
Kurtosis: 4.811 Cond. No. 2.57e+04
In [60]:
#VIF
vif_cal(Webpage_Product_Sales,"Sales")
ID  VIF =  1.18
DayofMonth  VIF =  1.01
Weekday  VIF =  1.0
Month  VIF =  1.19
Social_Network_Ref_links  VIF =  1.02
Online_Ad_Paid_ref_links  VIF =  12.13
Clicks_From_Serach_Engine  VIF =  12.08
Special_Discount  VIF =  1.37
Holiday  VIF =  1.38
Server_Down_time_Sec  VIF =  1.02
Web_UI_Score  VIF =  1.02
In [61]:
##Dropped Clicks_From_Serach_Engine based on VIF

import statsmodels.formula.api as sm
model2 = sm.ols(formula='Sales ~ Web_UI_Score+Server_Down_time_Sec+Holiday+Special_Discount+Online_Ad_Paid_ref_links+Social_Network_Ref_links+Month+Weekday+DayofMonth', data=Webpage_Product_Sales)
fitted2 = model2.fit()
fitted2.summary()
Out[61]:
OLS Regression Results
Dep. Variable: Sales R-squared: 0.818
Model: OLS Adj. R-squared: 0.815
Method: Least Squares F-statistic: 332.0
Date: Wed, 27 Jul 2016 Prob (F-statistic): 2.98e-239
Time: 12:48:18 Log-Likelihood: -6456.7
No. Observations: 675 AIC: 1.293e+04
Df Residuals: 665 BIC: 1.298e+04
Df Model: 9
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 6598.7469 1222.658 5.397 0.000 4198.012 8999.482
Web_UI_Score -6.3332 11.523 -0.550 0.583 -28.959 16.293
Server_Down_time_Sec -133.9518 13.981 -9.581 0.000 -161.405 -106.499
Holiday 1.877e+04 681.292 27.557 0.000 1.74e+04 2.01e+04
Special_Discount 4713.9295 400.323 11.775 0.000 3927.881 5499.978
Online_Ad_Paid_ref_links 6.0279 0.291 20.740 0.000 5.457 6.599
Social_Network_Ref_links 6.6872 0.410 16.307 0.000 5.882 7.492
Month 480.6876 41.398 11.611 0.000 399.401 561.974
Weekday 1355.2536 67.174 20.175 0.000 1223.355 1487.152
DayofMonth 47.0168 15.184 3.097 0.002 17.203 76.831
Omnibus: 40.826 Durbin-Watson: 1.356
Prob(Omnibus): 0.000 Jarque-Bera (JB): 102.313
Skew: 0.298 Prob(JB): 6.07e-23
Kurtosis: 4.812 Cond. No. 1.94e+04
In [62]:
#VIF for the updated model
vif_cal(Webpage_Product_Sales.drop(["Clicks_From_Serach_Engine"],axis=1),"Sales")
ID  VIF =  1.18
DayofMonth  VIF =  1.01
Weekday  VIF =  1.0
Month  VIF =  1.19
Social_Network_Ref_links  VIF =  1.01
Online_Ad_Paid_ref_links  VIF =  1.02
Special_Discount  VIF =  1.36
Holiday  VIF =  1.38
Server_Down_time_Sec  VIF =  1.02
Web_UI_Score  VIF =  1.02
In [63]:
##Drop the less impacting variables based on p-values.
##Dropped Web_UI_Score based on P-value

import statsmodels.formula.api as sm
model3 = sm.ols(formula='Sales ~ Server_Down_time_Sec+Holiday+Special_Discount+Online_Ad_Paid_ref_links+Social_Network_Ref_links+Month+Weekday+DayofMonth', data=Webpage_Product_Sales)
fitted3 = model3.fit()
fitted3.summary()
Out[63]:
OLS Regression Results
Dep. Variable: Sales R-squared: 0.818
Model: OLS Adj. R-squared: 0.816
Method: Least Squares F-statistic: 373.9
Date: Wed, 27 Jul 2016 Prob (F-statistic): 1.74e-240
Time: 12:49:15 Log-Likelihood: -6456.9
No. Observations: 675 AIC: 1.293e+04
Df Residuals: 666 BIC: 1.297e+04
Df Model: 8
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 6101.1539 821.286 7.429 0.000 4488.532 7713.776
Server_Down_time_Sec -134.0717 13.972 -9.596 0.000 -161.507 -106.637
Holiday 1.874e+04 678.528 27.623 0.000 1.74e+04 2.01e+04
Special_Discount 4726.1858 399.491 11.831 0.000 3941.771 5510.600
Online_Ad_Paid_ref_links 6.0357 0.290 20.802 0.000 5.466 6.605
Social_Network_Ref_links 6.6738 0.409 16.312 0.000 5.870 7.477
Month 479.5231 41.322 11.605 0.000 398.386 560.660
Weekday 1354.4252 67.122 20.179 0.000 1222.629 1486.221
DayofMonth 46.9564 15.175 3.094 0.002 17.159 76.754
Omnibus: 41.049 Durbin-Watson: 1.352
Prob(Omnibus): 0.000 Jarque-Bera (JB): 103.243
Skew: 0.298 Prob(JB): 3.81e-23
Kurtosis: 4.821 Cond. No. 1.31e+04
In [65]:
#How many variables are there in the final model?
8
Out[65]:
8
In [69]:
#What is the R-squared of the final model?
fitted3.rsquared
Out[69]:
0.8178742020411971
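As a quick check, the adjusted R-squared reported in the summary can be reproduced from this R-squared, the number of observations (675) and the number of predictors (8). The function name below is only illustrative.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Values read from the fitted3 summary above.
adj = adjusted_r_squared(0.8178742020411971, n_obs=675, n_predictors=8)
print(round(adj, 3))  # 0.816, matching "Adj. R-squared" in the summary
```

statsmodels exposes the same number directly as fitted3.rsquared_adj.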

Interaction Terms

  • Adding interaction terms can improve the prediction accuracy of the model when the effect of one variable depends on the level of another.
  • Choosing which interaction terms to add requires prior knowledge of the dataset and its variables.
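An interaction term is simply the product of two predictors added as an extra column. In formula syntax, a:b is that product, and a*b is shorthand for a + b + a:b. The sketch below uses made-up data (not the web sales dataset) in which the weekday effect is stronger on holidays, and compares the fit with and without the product column. All names and coefficients are assumptions chosen for illustration.

```python
import numpy as np

def r_squared(design, y):
    """R-squared of an ordinary least-squares fit of y on design."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1 - resid.var() / y.var()

# Made-up data: the weekday effect is stronger on holidays,
# so the two variables interact.
rng = np.random.default_rng(42)
n = 500
holiday = rng.integers(0, 2, size=n)          # 0/1 indicator
weekday = rng.integers(1, 8, size=n)          # 1..7
sales = (100 + 20 * holiday + 10 * weekday
         + 40 * holiday * weekday
         + rng.normal(scale=25, size=n))

ones = np.ones(n)
main_effects = np.column_stack([ones, holiday, weekday])
with_interaction = np.column_stack([ones, holiday, weekday, holiday * weekday])

r2_main = r_squared(main_effects, sales)
r2_inter = r_squared(with_interaction, sales)
print(round(r2_main, 3), round(r2_inter, 3))
```

The model with the product column fits better here because the data were generated with an interaction. On real data the gain has to be checked, not assumed.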

LAB: Interaction Terms

  • Add a few interaction terms to the web product sales model above and check whether the accuracy improves
In [70]:
import statsmodels.formula.api as sm
model4 = sm.ols(formula='Sales ~ Server_Down_time_Sec+Holiday+Special_Discount+Online_Ad_Paid_ref_links+Social_Network_Ref_links+Month+Weekday+DayofMonth+Holiday*Weekday', data=Webpage_Product_Sales)
fitted4 = model4.fit()
fitted4.summary()
Out[70]:
OLS Regression Results
Dep. Variable: Sales R-squared: 0.865
Model: OLS Adj. R-squared: 0.863
Method: Least Squares F-statistic: 473.6
Date: Wed, 27 Jul 2016 Prob (F-statistic): 2.17e-282
Time: 12:59:08 Log-Likelihood: -6355.7
No. Observations: 675 AIC: 1.273e+04
Df Residuals: 665 BIC: 1.278e+04
Df Model: 9
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 6753.6923 708.791 9.528 0.000 5361.955 8145.430
Server_Down_time_Sec -140.4922 12.044 -11.665 0.000 -164.141 -116.844
Holiday 2201.8694 1232.336 1.787 0.074 -217.870 4621.608
Special_Discount 4749.0044 344.145 13.799 0.000 4073.262 5424.747
Online_Ad_Paid_ref_links 5.9515 0.250 23.805 0.000 5.461 6.442
Social_Network_Ref_links 7.0657 0.353 19.994 0.000 6.372 7.760
Month 480.3156 35.597 13.493 0.000 410.420 550.212
Weekday 1164.8864 59.143 19.696 0.000 1048.756 1281.017
DayofMonth 47.0967 13.073 3.603 0.000 21.428 72.766
Holiday:Weekday 4294.6865 281.683 15.247 0.000 3741.592 4847.782
Omnibus: 7.552 Durbin-Watson: 0.867
Prob(Omnibus): 0.023 Jarque-Bera (JB): 7.305
Skew: 0.219 Prob(JB): 0.0259
Kurtosis: 2.740 Cond. No. 2.32e+04

Conclusion – Regression

  • Try adding polynomial and interaction terms to your regression model. Sometimes they work like a charm.
  • Adjusted R-squared measures the fit on the training (in-time) sample only, so it cannot tell us how the final model will perform on new data. Cross-validation is needed to estimate the testing error.
  • Outliers can influence the regression line, so the data should be cleaned before fitting the model.
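The cross-validation mentioned above can be sketched by hand in a few lines. This is a minimal illustration on made-up data, not the procedure used in this course. The helper name kfold_r2 and the simulated coefficients are assumptions.

```python
import numpy as np

def kfold_r2(X, y, k=5, seed=0):
    """Hypothetical helper: out-of-fold R-squared for a linear model."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        # Fit on the training folds (intercept column added by hand).
        A_train = np.column_stack([np.ones(train_mask.sum()), X[train_mask]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_mask], rcond=None)
        # Score on the held-out fold.
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        resid = y[test_idx] - A_test @ coef
        ss_tot = ((y[test_idx] - y[test_idx].mean()) ** 2).sum()
        scores.append(1 - (resid ** 2).sum() / ss_tot)
    return np.array(scores)

# Made-up data for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = 5 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=300)
scores = kfold_r2(X, y, k=5)
print(scores.round(3), scores.mean().round(3))
```

If the held-out scores are much lower than the training R-squared, the model is probably overfitting. scikit-learn's cross_val_score offers the same idea with less code.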