
203.1.12 Linear Regression with Multicollinearity in R and Conclusion

Finishing what we started

Practice: Multiple Regression

In the previous section, we studied the issue of multicollinearity in R.

  1. Import Dataset: Webpage_Product_Sales/Webpage_Product_Sales.csv
  2. Build a model to predict Sales using the rest of the variables.
  3. Drop the least impactful variables based on p-values.
  4. Is there any multicollinearity?
  5. How many variables are there in the final model?
  6. What is the R-squared of the final model?
  7. Can you improve the model using the same data and variables?

Solution

  1. Import Dataset: Webpage_Product_Sales/Webpage_Product_Sales.csv
Webpage_Product_Sales = read.csv("R dataset\\Webpage_Product_Sales\\Webpage_Product_Sales.csv")
  2. Build a model to predict Sales using the rest of the variables.
web_sales_model1<-lm(Sales~Web_UI_Score+Server_Down_time_Sec+Holiday+Special_Discount+Clicks_From_Serach_Engine+Online_Ad_Paid_ref_links+Social_Network_Ref_links+Month+Weekday+DayofMonth, data=Webpage_Product_Sales)
summary(web_sales_model1)
## 
## Call:
## lm(formula = Sales ~ Web_UI_Score + Server_Down_time_Sec + Holiday + 
##     Special_Discount + Clicks_From_Serach_Engine + Online_Ad_Paid_ref_links + 
##     Social_Network_Ref_links + Month + Weekday + DayofMonth, 
##     data = Webpage_Product_Sales)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14391.9  -2186.2   -191.6   2243.1  15462.1 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                6545.8922  1286.2404   5.089 4.69e-07 ***
## Web_UI_Score                 -6.2582    11.5453  -0.542  0.58796    
## Server_Down_time_Sec       -134.0441    14.0087  -9.569  < 2e-16 ***
## Holiday                   18768.5954   683.0769  27.477  < 2e-16 ***
## Special_Discount           4718.3978   402.0193  11.737  < 2e-16 ***
## Clicks_From_Serach_Engine    -0.1258     0.9443  -0.133  0.89403    
## Online_Ad_Paid_ref_links      6.1557     1.0022   6.142 1.40e-09 ***
## Social_Network_Ref_links      6.6841     0.4111  16.261  < 2e-16 ***
## Month                       481.0294    41.5079  11.589  < 2e-16 ***
## Weekday                    1355.2153    67.2243  20.160  < 2e-16 ***
## DayofMonth                   47.0579    15.1982   3.096  0.00204 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3480 on 664 degrees of freedom
## Multiple R-squared:  0.818,  Adjusted R-squared:  0.8152 
## F-statistic: 298.4 on 10 and 664 DF,  p-value: < 2.2e-16
  3. Drop the least impactful variables based on p-values.

From the p-values in the output, we can see that Clicks_From_Serach_Engine and Web_UI_Score are insignificant (p > 0.05), so we drop these two variables.

web_sales_model2<-lm(Sales~Server_Down_time_Sec+Holiday+Special_Discount+Online_Ad_Paid_ref_links+Social_Network_Ref_links+Month+Weekday+DayofMonth,data=Webpage_Product_Sales)
summary(web_sales_model2)
## 
## Call:
## lm(formula = Sales ~ Server_Down_time_Sec + Holiday + Special_Discount + 
##     Online_Ad_Paid_ref_links + Social_Network_Ref_links + Month + 
##     Weekday + DayofMonth, data = Webpage_Product_Sales)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14305.2  -2154.8   -185.7   2252.3  15383.2 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               6101.1539   821.2864   7.429 3.37e-13 ***
## Server_Down_time_Sec      -134.0717    13.9722  -9.596  < 2e-16 ***
## Holiday                  18742.7123   678.5281  27.623  < 2e-16 ***
## Special_Discount          4726.1858   399.4915  11.831  < 2e-16 ***
## Online_Ad_Paid_ref_links     6.0357     0.2901  20.802  < 2e-16 ***
## Social_Network_Ref_links     6.6738     0.4091  16.312  < 2e-16 ***
## Month                      479.5231    41.3221  11.605  < 2e-16 ***
## Weekday                   1354.4252    67.1219  20.179  < 2e-16 ***
## DayofMonth                  46.9564    15.1755   3.094  0.00206 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3476 on 666 degrees of freedom
## Multiple R-squared:  0.8179, Adjusted R-squared:  0.8157 
## F-statistic: 373.9 on 8 and 666 DF,  p-value: < 2.2e-16
  4. Is there any multicollinearity?
library(car)
vif(web_sales_model2)
##     Server_Down_time_Sec                  Holiday         Special_Discount 
##                 1.018345                 1.366781                 1.353936 
## Online_Ad_Paid_ref_links Social_Network_Ref_links                    Month 
##                 1.018222                 1.004572                 1.011388 
##                  Weekday               DayofMonth 
##                 1.004399                 1.003881

No. All the VIF values above are close to 1, well below the usual cutoff, so there is no multicollinearity in this model.
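As a quick sanity check, the printed VIF values can be compared against a conventional cutoff (VIF above 5, or 10 by some conventions, is the common rule of thumb for problematic multicollinearity):

```r
library(car)  # for vif(), already loaded above

# All VIFs here are near 1, so none crosses the usual cutoff of 5
any(vif(web_sales_model2) > 5)
## [1] FALSE
```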

  5. How many variables are there in the final model?

Eight

  6. What is the R-squared of the final model?

0.8179

  7. Can you improve the model using the same data and variables?

Not by adding or dropping variables alone. However, an interaction term built from the same variables can improve the fit.

Interaction Terms

Adding interaction terms might help improve the prediction accuracy of the model. Choosing which interaction terms to add requires prior knowledge of the dataset and its variables.

web_sales_model3<-lm(Sales~Server_Down_time_Sec+Holiday+Special_Discount+Online_Ad_Paid_ref_links+Social_Network_Ref_links+Month+Weekday+DayofMonth+Holiday*Weekday,data=Webpage_Product_Sales)

summary(web_sales_model3)
## 
## Call:
## lm(formula = Sales ~ Server_Down_time_Sec + Holiday + Special_Discount + 
##     Online_Ad_Paid_ref_links + Social_Network_Ref_links + Month + 
##     Weekday + DayofMonth + Holiday * Weekday, data = Webpage_Product_Sales)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7486.3 -2073.0  -270.4  2104.2  9146.2 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              6753.6923   708.7910   9.528  < 2e-16 ***
## Server_Down_time_Sec     -140.4922    12.0438 -11.665  < 2e-16 ***
## Holiday                  2201.8694  1232.3364   1.787 0.074434 .  
## Special_Discount         4749.0044   344.1454  13.799  < 2e-16 ***
## Online_Ad_Paid_ref_links    5.9515     0.2500  23.805  < 2e-16 ***
## Social_Network_Ref_links    7.0657     0.3534  19.994  < 2e-16 ***
## Month                     480.3156    35.5970  13.493  < 2e-16 ***
## Weekday                  1164.8864    59.1435  19.696  < 2e-16 ***
## DayofMonth                 47.0967    13.0729   3.603 0.000339 ***
## Holiday:Weekday          4294.6865   281.6829  15.247  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2994 on 665 degrees of freedom
## Multiple R-squared:  0.865,  Adjusted R-squared:  0.8632 
## F-statistic: 473.6 on 9 and 665 DF,  p-value: < 2.2e-16
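The gain from the Holiday and Weekday interaction can be read off the two summaries; the adjusted R-squared values can also be pulled directly from the fitted objects (a small convenience check, not part of the original solution):

```r
# Compare in-sample fit before and after adding Holiday*Weekday
summary(web_sales_model2)$adj.r.squared  # ~0.8157 (no interaction)
summary(web_sales_model3)$adj.r.squared  # ~0.8632 (with interaction)
```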

Conclusion

In this chapter, we discussed simple and multiple regression: how to build simple and multiple linear regression models, the most important metrics to consider in regression output, what multicollinearity is and how to detect and eliminate it, what R-squared and adjusted R-squared are and the difference between them, and how to assess the individual impact of each variable.
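One of those points is worth a formula: adjusted R-squared penalizes R-squared for the number of predictors, via adj. R² = 1 − (1 − R²)(n − 1)/(n − p − 1). Plugging in the final model's numbers (n = 675 observations, i.e. 666 residual degrees of freedom plus 8 predictors plus the intercept) reproduces the reported value:

```r
# Adjusted R-squared recovered from R-squared, sample size n, predictor count p
r_squared <- 0.8179   # Multiple R-squared of web_sales_model2
n <- 675              # observations: 666 residual df + 8 predictors + 1
p <- 8                # predictors in the final model
1 - (1 - r_squared) * (n - 1) / (n - p - 1)  # ~0.8157, matching the summary
```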

This is a basic regression class. Once you have a good grasp of regression, you can explore advanced topics such as adding polynomial and interaction terms to your regression line; sometimes they work like a charm. Adjusted R-squared is a good measure of training (in-sample) error, but it alone cannot tell us the final model's performance on new data. We may have to perform cross-validation to get an idea of the testing error.

We will discuss cross-validation in more detail in future lectures. Outliers can influence the regression line, so data sanitization is essential before fitting one: regression is, in the end, a mathematical formula, and wrong inputs yield wrong results. Data cleaning is therefore very important before getting into regression.
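As a quick preview of that idea, a simple hold-out validation might look like the sketch below; the 80/20 split and the seed are illustrative choices, not part of the lecture:

```r
# Illustrative hold-out validation: fit on 80% of the rows, test on the rest
set.seed(1)                                   # reproducible split
n          <- nrow(Webpage_Product_Sales)
train_rows <- sample(n, size = floor(0.8 * n))
train_data <- Webpage_Product_Sales[train_rows, ]
test_data  <- Webpage_Product_Sales[-train_rows, ]

holdout_model <- lm(Sales ~ Server_Down_time_Sec + Holiday + Special_Discount +
                      Online_Ad_Paid_ref_links + Social_Network_Ref_links +
                      Month + Weekday + DayofMonth + Holiday * Weekday,
                    data = train_data)

# RMSE on the unseen 20% approximates the testing error
predictions <- predict(holdout_model, newdata = test_data)
sqrt(mean((test_data$Sales - predictions)^2))
```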

In the next section, we will study logistic regression and why we need it.
