Before starting the lesson, please download the datasets.
Regression
Contents
- Correlation
- Simple Regression
- R-Squared
- Multiple Regression
- Adj R-Squared
- P-value
- Multicollinearity
- Interaction terms
Correlation
Why do we need correlation?
- Is there any association between hours of study and grades?
- Is there any association between number of temples in a city & murder rate?
- What happens to sweater sales with increase in temperature? What is the strength of association between them?
- What happens to ice-cream sales vs. temperature? What is the strength of association between them?
- How to quantify the association?
- Which of the above examples has very strong association?
- Correlation
Correlation coefficient
- It is a measure of linear association
- r is the ratio of the covariance of x and y to the product of their individual standard deviations: $r = \dfrac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}}$
- Correlation 0 No linear association
- Correlation 0 to 0.25 Negligible positive association
- Correlation 0.25-0.5 Weak positive association
- Correlation 0.5-0.75 Moderate positive association
- Correlation >0.75 Very Strong positive association
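The ratio above can be verified directly in code; a minimal sketch with made-up numbers (not from the lab datasets), comparing the hand-computed ratio with np.corrcoef:

```python
# Pearson's r computed from its definition, then checked against np.corrcoef.
import numpy as np

x = np.array([2, 4, 6, 8, 10], dtype=float)     # hypothetical example data
y = np.array([65, 70, 78, 85, 92], dtype=float)

r_manual = np.cov(x, y, ddof=1)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)   # both values should match
```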
LAB: Correlation Calculation
- Dataset: AirPassengers/AirPassengers.csv
- Find the correlation between number of passengers and promotional budget.
- Draw a scatter plot between number of passengers and promotional budget.
- Find the correlation between number of passengers and Service_Quality_Score.
In [1]:
import pandas as pd
air = pd.read_csv("DatasetsAirPassengersAirPassengers.csv")
air.shape
Out[1]:
In [2]:
#Name of the columns in the dataset:
air.columns.values
Out[2]:
In [3]:
#Find the correlation between number of passengers and promotional budget.
import numpy as np
np.corrcoef(air.Passengers,air.Promotion_Budget)
Out[3]:
In [4]:
#Draw a scatter plot between number of passengers and promotional budget
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(air.Passengers, air.Promotion_Budget)
Out[4]:
In [5]:
#Find the correlation between number of passengers and Service_Quality_Score
np.corrcoef(air.Passengers,air.Service_Quality_Score)
Out[5]:
Beyond Pearson Correlation
- Different correlation coefficient measures exist for different types of data
| Variable Y \ X | Quantitative/Continuous X | Ordinal/Ranked/Discrete X | Nominal/Categorical X |
|---|---|---|---|
| Quantitative Y | Pearson r | Biserial | Point Biserial |
| Ordinal/Ranked/Discrete Y | Biserial | Spearman rho / Kendall's tau | Rank Biserial |
| Nominal/Categorical Y | Point Biserial | Rank Biserial | Phi, Contingency Coeff, V |
From Correlation to Regression
- Correlation is just a measure of association
- It can’t be used for prediction.
- Given the predictor variable, we can’t estimate the dependent variable.
- In the air passengers example, given the promotion budget, we can’t get the estimated value of passengers
- We need a model, an equation, a fit for the data.
- That is known as the regression line.
What is Regression
- A regression line is a mathematical formula that quantifies the general relation between a predictor/independent variable (or known variable x) and the target/dependent variable (or the unknown variable y).
- The regression line has the form $y = \beta_0 + \beta_1 x$. If we have data on x and y, then we can build a model to generalize their relation
- What is the best fit for our data?
- The one which goes through the core of the data
- The one which minimizes the error
(Figures: regression line fitted to the data, and the errors/residuals around the fitted line)
Minimizing the error
- The best line will have the minimum error
- Some errors are positive and some errors are negative. Taking their sum is not a good idea
- We can either minimize the sum of squared errors, or minimize the sum of absolute errors
- The sum of squared errors is mathematically convenient to minimize
- The method of minimizing the sum of squared errors is called the least squares method of regression
Least Squares Estimation
- Imagine a line through all the points
- Deviation from each point (residual or error)
- Square of the deviation
- Minimizing sum of squares of deviation
- For the fitted line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$, the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ are obtained by minimizing the sum of the squared residuals: $\min_{\beta_0,\beta_1} \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_i\right)^2$
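A minimal sketch of the closed-form least squares estimates (illustrative only, with made-up numbers rather than the lab data):

```python
# Simple linear regression in closed form:
# beta1 = cov(x, y) / var(x), beta0 = mean(y) - beta1 * mean(x)
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)   # hypothetical predictor
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])      # hypothetical target

beta1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
beta0 = y.mean() - beta1 * x.mean()

print(beta0, beta1)   # the intercept and slope that minimize the sum of squared residuals
```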
LAB: Regression Line Fitting
- Dataset: AirPassengers/AirPassengers.csv
- Find the correlation between Promotion_Budget and Passengers
- Draw a scatter plot between Promotion_Budget and Passengers. Is there any pattern between Promotion_Budget and Passengers?
- Build a linear regression model on Promotion_Budget and Passengers.
- Build a regression line to predict the passengers using Inter_metro_flight_ratio
In [6]:
import pandas as pd
air = pd.read_csv("DatasetsAirPassengersAirPassengers.csv")
air.shape
Out[6]:
In [7]:
air.columns.values
Out[7]:
In [8]:
air.head(5)
Out[8]:
In [9]:
# Find the correlation between Promotion_Budget and Passengers
import numpy as np
np.corrcoef(air.Passengers,air.Promotion_Budget)
Out[9]:
In [10]:
# Draw a scatter plot between Promotion_Budget and Passengers. Is there any pattern between Promotion_Budget and Passengers?
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(air.Passengers, air.Promotion_Budget)
Out[10]:
In [11]:
#Build a linear regression model and estimate the expected passengers for a Promotion_Budget of 650,000
##Regression Model promotion and passengers count
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(air[["Promotion_Budget"]], air[["Passengers"]])
predictions = lr.predict([[650000]])  # sklearn expects a 2-D array: one row, one feature
predictions
Out[11]:
In [12]:
import statsmodels.formula.api as sm
model = sm.ols(formula='Passengers ~ Promotion_Budget', data=air)
fitted1 = model.fit()
In [13]:
fitted1.summary()
Out[13]:
In [14]:
# Build a regression line to predict the passengers using Inter_metro_flight_ratio
plt.scatter(air.Inter_metro_flight_ratio,air.Passengers)
Out[14]:
In [15]:
import sklearn as sk
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(air[["Inter_metro_flight_ratio"]], air[["Passengers"]])
Out[15]:
In [16]:
predictions = lr.predict(air[["Inter_metro_flight_ratio"]])
In [17]:
import statsmodels.formula.api as sm
model = sm.ols(formula='Passengers ~ Inter_metro_flight_ratio', data=air)
fitted2 = model.fit()
In [18]:
fitted2.summary()
Out[18]:
How good is my regression line?
- Take an (x, y) point from the data.
- Substitute x into the regression line to get a prediction $\hat{y}$.
- If the regression line is a good fit, then we expect $\hat{y} = y$, i.e. $(y - \hat{y}) = 0$.
- At every point of x, if we repeat the same, then we get multiple error values $(y_i - \hat{y}_i)$.
- Some of them might be positive, some of them might be negative, so we take the square of all such errors.
- For a good model, we need SSE to be zero or near zero.
- A standalone SSE value does not tell us much. For example, SSE = 100 is very small when y varies in the thousands; the same value is very large when y varies in decimals.
- We have to consider the variance of y while assessing the accuracy of the regression line.
- Error Sum of Squares: $SSE = \sum_{i}(y_i - \hat{y}_i)^2$
- Total Variance in Y: $SST = \sum_{i}(y_i - \bar{y})^2$
- So, total variance in Y is divided into two parts,
- Variance that cannot be explained by x (error)
- Variance that can be explained by x, using regression
Explained and Unexplained Variation
- So, the total variance in Y is divided into two parts:
- Variance that can be explained by x, using regression: $SSR = \sum_{i}(\hat{y}_i - \bar{y})^2$
- Variance that cannot be explained by x (error): $SSE = \sum_{i}(y_i - \hat{y}_i)^2$
- SST = SSE + SSR
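A minimal sketch of this decomposition, assuming the air DataFrame and the fitted1 model (Passengers ~ Promotion_Budget) from the cells above:

```python
# Decompose the total variation in Passengers for the simple regression fitted above.
import numpy as np

y = air["Passengers"]
y_hat = fitted1.fittedvalues           # predictions from the statsmodels fit

sse = np.sum((y - y_hat) ** 2)         # variation not explained by the model
ssr = np.sum((y_hat - y.mean()) ** 2)  # variation explained by the regression
sst = np.sum((y - y.mean()) ** 2)      # total variation in y

print(sse + ssr, sst)                  # the two should be (almost) equal
print(ssr / sst, fitted1.rsquared)     # SSR/SST should match the reported R-squared
```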
R-Squared
- A good fit will have
- SSE (Minimum or Maximum?)
- SSR (Minimum or Maximum?)
- And we know SST= SSE + SSR
- SSE/SST(Minimum or Maximum?)
- SSR/SST(Minimum or Maximum?)
- The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable
- The coefficient of determination is also called R-squared and is denoted as $R^2$, where $0 \le R^2 \le 1$: $R^2 = \dfrac{SSR}{SST} = 1 - \dfrac{SSE}{SST}$
LAB: R-Square
- What is the R-square value of Passengers vs Promotion_Budget model?
- What is the R-square value of Passengers vs Inter_metro_flight_ratio model?
In [19]:
#What is the R-square value of Passengers vs Promotion_Budget model?
fitted1.summary()
Out[19]:
In [20]:
#What is the R-square value of Passengers vs Inter_metro_flight_ratio
fitted2.summary()
Out[20]:
Multiple Regression
- Using multiple predictor variables instead of single variable
- With multiple predictors, we need to find the best-fitting plane (or hyperplane) rather than a line
Code – Multiple Regression
In [21]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(air[["Promotion_Budget"]+["Inter_metro_flight_ratio"]], air[["Passengers"]])
Out[21]:
In [22]:
predictions = lr.predict(air[["Promotion_Budget"]+["Inter_metro_flight_ratio"]])
In [23]:
import statsmodels.formula.api as sm
model = sm.ols(formula='Passengers ~ Promotion_Budget+Inter_metro_flight_ratio', data=air)
fitted = model.fit()
fitted.summary()
Out[23]:
Individual Impact of variables
- Look at the P-value
- The P-value is the probability of observing a coefficient this large purely by chance, assuming the true coefficient is zero.
- Individual variable coefficients are tested for significance.
- The estimated beta coefficients follow a t-distribution.
- Individual P-values tell us about the significance of each variable.
- A variable is significant if its P-value is less than 5%. The lower the P-value, the more significant the variable.
- Note: It is possible for all the variables in a regression together to produce a great fit, and yet very few of the variables to be individually significant.
To test $H_0: \beta_j = 0$ against $H_1: \beta_j \ne 0$
Test statistic: $t = \dfrac{\hat{\beta}_j}{SE(\hat{\beta}_j)}$
Reject $H_0$ if $|t| > t_{\alpha/2,\, n-k-1}$, or equivalently if the p-value is less than $\alpha$
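A small sketch of reading these individual coefficient tests off a fitted statsmodels result, assuming the multiple-regression object fitted from the cell above:

```python
# Coefficient estimates, t statistics and p-values from a fitted statsmodels OLS model.
print(fitted.params)    # estimated beta coefficients
print(fitted.tvalues)   # t statistics for H0: beta_j = 0
print(fitted.pvalues)   # corresponding p-values

# Variables whose p-value exceeds 0.05 are candidates for removal.
print(fitted.pvalues[fitted.pvalues > 0.05])
```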
LAB: Multiple Regression
- Build a multiple regression model to predict the number of passengers
- What is R-square value
- Are there any predictor variables that are not impacting the dependent variable
In [24]:
#Build a multiple regression model to predict the number of passengers
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(air[["Promotion_Budget"]+["Inter_metro_flight_ratio"]+["Service_Quality_Score"]], air[["Passengers"]])
predictions = lr.predict(air[["Promotion_Budget"]+["Inter_metro_flight_ratio"]+["Service_Quality_Score"]])
In [25]:
import statsmodels.formula.api as sm
model = sm.ols(formula='Passengers ~ Promotion_Budget+Inter_metro_flight_ratio+Service_Quality_Score', data=air)
fitted = model.fit()
fitted.summary()
Out[25]:
- What is R-square value
0.951
- Are there any predictor variables that are not impacting the dependent variable
Inter_metro_flight_ratio
Adjusted R-Squared
- Is it good to have as many independent variables as possible? Nope
- R-square is deceptive. R-squared value never decreases when a new X variable is added to the model – True?
- We need a better measure or an adjustment to the original R-squared formula.
- Adjusted R squared
- Its value depends on the number of explanatory variables
- Imposes a penalty for adding additional explanatory variables
- It is usually written as $R^2_{adj}$ (or $\bar{R}^2$): $R^2_{adj} = 1 - \dfrac{(1 - R^2)(n - 1)}{n - k - 1}$
- It can be very different from $R^2$ when there are too many predictors and n is small
- where n is the number of observations and k is the number of predictors
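A minimal sketch of the adjustment as a helper function (the numbers below are hypothetical, chosen only to show the penalty at work):

```python
# Adjusted R-squared from R-squared, the number of observations n and the number of predictors k.
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(0.717, 12, 6))   # adding predictors with little signal pulls the adjusted value down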
LAB: Adjusted R-Square
- Dataset: "Adjusted RSquare/Adj_Sample.csv"
- Build a model to predict y using x1,x2 and x3. Note down R-Square and Adj R-Square values
- Build a model to predict y using x1,x2,x3,x4,x5 and x6. Note down R-Square and Adj R-Square values
- Build a model to predict y using x1,x2,x3,x4,x5,x6,x7 and x8. Note down R-Square and Adj R-Square values
In [26]:
adj_sample=pd.read_csv("Datasets/Adjusted RSquare/Adj_Sample.csv")
adj_sample.shape
Out[26]:
In [27]:
adj_sample.columns.values
Out[27]:
In [28]:
#Build a model to predict y using x1,x2 and x3. Note down R-Square and Adj R-Square values
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(adj_sample[["x1"]+["x2"]+["x3"]], adj_sample[["Y"]])
predictions = lr.predict(adj_sample[["x1"]+["x2"]+["x3"]])
In [29]:
import statsmodels.formula.api as sm
model = sm.ols(formula='Y ~ x1+x2+x3', data=adj_sample)
fitted1 = model.fit()
fitted1.summary()
Out[29]:
In [30]:
#Build a model to predict y using x1,x2,x3,x4,x5 and x6. Note down R-Square and Adj R-Square values
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(adj_sample[["x1"]+["x2"]+["x3"]+["x4"]+["x5"]+["x6"]], adj_sample[["Y"]])
predictions = lr.predict(adj_sample[["x1"]+["x2"]+["x3"]+["x4"]+["x5"]+["x6"]])
In [31]:
import statsmodels.formula.api as sm
model = sm.ols(formula='Y ~ x1+x2+x3+x4+x5+x6', data=adj_sample)
fitted2 = model.fit()
fitted2.summary()
Out[31]:
In [32]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(adj_sample[["x1"]+["x2"]+["x3"]+["x4"]+["x5"]+["x6"]+["x7"]+["x8"]], adj_sample[["Y"]])
predictions = lr.predict(adj_sample[["x1"]+["x2"]+["x3"]+["x4"]+["x5"]+["x6"]+["x7"]+["x8"]])
In [33]:
import statsmodels.formula.api as sm
model = sm.ols(formula='Y ~ x1+x2+x3+x4+x5+x6+x7+x8', data=adj_sample)
fitted3 = model.fit()
fitted3.summary()
Out[33]:
| Model | R-Squared | Adj R-Squared |
|---|---|---|
| Model1 | 0.684 | 0.566 |
| Model2 | 0.717 | 0.377 |
| Model3 | 0.805 | 0.285 |
R-Squared vs Adjusted R-Squared
We have built three models on the Adj_Sample data (Model1, Model2 and Model3) with different numbers of variables. As more variables are added, R-squared keeps increasing while adjusted R-squared drops.
LAB: Multiple Regression - Issues
- Import Final Exam Score data
- Build a model to predict final score using the rest of the variables.
- How are Sem2_Math & Final score related? As Sem2_Math score increases, what happens to Final score?
- Remove “Sem1_Math” variable from the model and rebuild the model
- Is there any change in R square or Adj R square
- How are Sem2_Math & Final score related now? As Sem2_Math score increases, what happens to Final score?
- Draw a scatter plot between Sem1_Math & Sem2_Math
- Find the correlation between Sem1_Math & Sem2_Math
In [34]:
#Import Final Exam Score data
final_exam=pd.read_csv("Datasets/Final Exam/Final Exam Score.csv")
In [35]:
#Size of the data
final_exam.shape
Out[35]:
In [36]:
#Variable names
final_exam.columns
Out[36]:
In [37]:
#Build a model to predict final score using the rest of the variables.
from sklearn.linear_model import LinearRegression
lr1 = LinearRegression()
lr1.fit(final_exam[["Sem1_Science", "Sem2_Science", "Sem1_Math", "Sem2_Math"]], final_exam[["Final_exam_marks"]])
predictions1 = lr1.predict(final_exam[["Sem1_Science", "Sem2_Science", "Sem1_Math", "Sem2_Math"]])
import statsmodels.formula.api as sm
model1 = sm.ols(formula='Final_exam_marks ~ Sem1_Science+Sem2_Science+Sem1_Math+Sem2_Math', data=final_exam)
fitted1 = model1.fit()
fitted1.summary()
Out[37]:
In [38]:
fitted1.rsquared
Out[38]:
- How are Sem2_Math & Final score related? As Sem2_Math score increases, what happens to Final score?
As the Sem2_Math score increases, the Final score decreases (the coefficient of Sem2_Math in this model is negative).
In [39]:
#Remove "Sem1_Math" variable from the model and rebuild the model
from sklearn.linear_model import LinearRegression
lr2 = LinearRegression()
lr2.fit(final_exam[["Sem1_Science", "Sem2_Science", "Sem2_Math"]], final_exam[["Final_exam_marks"]])
predictions2 = lr2.predict(final_exam[["Sem1_Science", "Sem2_Science", "Sem2_Math"]])
import statsmodels.formula.api as sm
model2 = sm.ols(formula='Final_exam_marks ~ Sem1_Science+Sem2_Science+Sem2_Math', data=final_exam)
fitted2 = model2.fit()
fitted2.summary()
Out[39]:
- Is there any change in R square or Adj R square
| Model | R-Squared | Adj R-Squared |
|---|---|---|
| model1 | 0.990 | 0.987 |
| model2 | 0.981 | 0.978 |
- How are Sem2_Math & Final score related now? As Sem2_Math score increases, what happens to Final score?
As the Sem2_Math score increases, the Final score also increases (the coefficient of Sem2_Math is now positive).
In [40]:
#Draw a scatter plot between Sem1_Math & Sem2_Math
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(final_exam.Sem1_Math,final_exam.Sem2_Math)
Out[40]:
In [41]:
#Find the correlation between Sem1_Math & Sem2_Math
np.corrcoef(final_exam.Sem1_Math,final_exam.Sem2_Math)
Out[41]:
Multicollinearity
- Multiple regression is wonderful – It allows you to consider the effect of multiple variables simultaneously.
- Multiple regression is extremely unpleasant – Because it allows you to consider the effect of multiple variables simultaneously.
- The relationships between the explanatory variables are the key to understanding multiple regression.
- Multicollinearity (or inter correlation) exists when at least some of the predictor variables are correlated among themselves.
- The parameter estimates will have inflated variance in the presence of multicollinearity.
- Sometimes the signs of the parameter estimates tend to change
- If the relation between the independent variables grows really strong, then the variance of parameter estimates tends to be infinity – Can you prove it?
Multicollinearity Detection
- Build a model X1 vs X2, X3, X4 and find its $R^2$, say $R_1^2$
- Build a model X2 vs X1, X3, X4 and find its $R^2$, say $R_2^2$
- Build a model X3 vs X1, X2, X4 and find its $R^2$, say $R_3^2$
- Build a model X4 vs X1, X2, X3 and find its $R^2$, say $R_4^2$
- For example, if $R_3^2$ is 95%, then we don't really need X3 in the model
- Since it can be explained as a linear combination of the other three
- For each predictor variable we find its individual $R^2$; $VIF = \dfrac{1}{1 - R^2}$ is called the Variance Inflation Factor
- The VIF option in SAS automatically calculates VIF values for each of the predictor variables
| $R^2$ | 40% | 50% | 60% | 70% | 75% | 80% | 90% |
|---|---|---|---|---|---|---|---|
| VIF | 1.67 | 2.00 | 2.50 | 3.33 | 4.00 | 5.00 | 10.00 |
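As an alternative to the hand-rolled vif_cal function used in the lab below, statsmodels also ships a VIF helper; a minimal sketch, assuming the final_exam DataFrame loaded earlier:

```python
# VIF via statsmodels' built-in helper; each predictor is referenced by its column index.
import statsmodels.api as sm_api
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = final_exam.drop(["Final_exam_marks"], axis=1)
X = sm_api.add_constant(X)   # add the intercept column expected by the VIF computation

for i, col in enumerate(X.columns):
    if col == "const":
        continue
    print(col, round(variance_inflation_factor(X.values, i), 2))
```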
LAB: Multicollinearity
- Identify the Multicollinearity in the Final Exam Score model.
- Drop the variables one by one to reduce the multicollinearity.
- Identify and eliminate the Multicollinearity in the Air passengers model.
In [42]:
from sklearn.linear_model import LinearRegression
lr1 = LinearRegression()
lr1.fit(final_exam[["Sem1_Science"]+["Sem2_Science"]+["Sem1_Math"]+["Sem2_Math"]], final_exam[["Final_exam_marks"]])
predictions1 = lr1.predict(final_exam[["Sem1_Science"]+["Sem2_Science"]+["Sem1_Math"]+["Sem2_Math"]])
In [43]:
import statsmodels.formula.api as sm
model1 = sm.ols(formula='Final_exam_marks ~ Sem1_Science+Sem2_Science+Sem1_Math+Sem2_Math', data=final_exam)
fitted1 = model1.fit()
fitted1.summary()
Out[43]:
In [44]:
fitted1.summary2()
Out[44]:
In [45]:
#Code for VIF Calculation
#Writing a function to calculate the VIF values
def vif_cal(input_data, dependent_col):
    x_vars = input_data.drop([dependent_col], axis=1)
    xvar_names = x_vars.columns
    for i in range(0, xvar_names.shape[0]):
        y = x_vars[xvar_names[i]]
        x = x_vars[xvar_names.drop(xvar_names[i])]
        rsq = sm.ols(formula="y~x", data=x_vars).fit().rsquared
        vif = round(1/(1 - rsq), 2)
        print(xvar_names[i], " VIF = ", vif)
In [46]:
#Calculating VIF values using that function
vif_cal(input_data=final_exam, dependent_col="Final_exam_marks")
In [47]:
import statsmodels.formula.api as sm
model2 = sm.ols(formula='Final_exam_marks ~ Sem1_Science+Sem2_Science+Sem2_Math', data=final_exam)
fitted2 = model2.fit()
fitted2.summary()
Out[47]:
In [48]:
vif_cal(input_data=final_exam.drop(["Sem1_Math"], axis=1), dependent_col="Final_exam_marks")
In [49]:
vif_cal(input_data=final_exam.drop(["Sem1_Math","Sem1_Science"], axis=1), dependent_col="Final_exam_marks")
In [50]:
#Identify and eliminate the Multicollinearity in the Air passengers model
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(air[["Promotion_Budget"]+["Inter_metro_flight_ratio"]+["Service_Quality_Score"]], air[["Passengers"]])
predictions = lr.predict(air[["Promotion_Budget"]+["Inter_metro_flight_ratio"]+["Service_Quality_Score"]])
import statsmodels.formula.api as sm
model = sm.ols(formula='Passengers ~ Promotion_Budget+Inter_metro_flight_ratio+Service_Quality_Score', data=air)
fitted = model.fit()
fitted.summary()
Out[50]:
In [51]:
air.columns.values
Out[51]:
In [52]:
#Calculating VIF values using that function
vif_cal(input_data=air.drop(["Holiday_week","Delayed_Cancelled_flight_ind", "Bad_Weather_Ind", "Technical_issues_ind"], axis=1), dependent_col="Passengers")
Note: For calculating VIF, all the variables have to be numerical, i.e., no categorical variables. Categorical variables should either be dropped or converted into numerical (dummy) variables.
In [53]:
air
Out[53]:
LAB: Multiple Regression (Webpage Product Sales)
- Dataset: Webpage_Product_Sales/Webpage_Product_Sales.csv
- Build a model to predict sales using the rest of the variables
- Drop the less impactful variables based on p-values.
- Is there any multicollinearity?
- How many variables are there in the final model?
- What is the R-squared of the final model?
- Can you improve the model using the same data and variables?
In [54]:
import pandas as pd
Webpage_Product_Sales=pd.read_csv("Datasets/Webpage_Product_Sales/Webpage_Product_Sales.csv")
Webpage_Product_Sales.shape
Out[54]:
In [55]:
Webpage_Product_Sales.columns
Out[55]:
In [56]:
import statsmodels.formula.api as sm
model1 = sm.ols(formula='Sales ~ Web_UI_Score+Server_Down_time_Sec+Holiday+Special_Discount+Clicks_From_Serach_Engine+Online_Ad_Paid_ref_links+Social_Network_Ref_links+Month+Weekday+DayofMonth', data=Webpage_Product_Sales)
fitted1 = model1.fit()
fitted1.summary()
Out[56]:
In [57]:
#VIF
vif_cal(Webpage_Product_Sales,"Sales")
In [58]:
##Dropped Clicks_From_Serach_Engine based on VIF
import statsmodels.formula.api as sm
model2 = sm.ols(formula='Sales ~ Web_UI_Score+Server_Down_time_Sec+Holiday+Special_Discount+Online_Ad_Paid_ref_links+Social_Network_Ref_links+Month+Weekday+DayofMonth', data=Webpage_Product_Sales)
fitted2 = model2.fit()
fitted2.summary()
Out[58]:
In [59]:
#VIF for the updated model
vif_cal(Webpage_Product_Sales.drop(["Clicks_From_Serach_Engine"],axis=1),"Sales")
In [60]:
##Drop the less impacting variables based on p-values.
##Dropped Web_UI_Score based on P-value
import statsmodels.formula.api as sm
model3 = sm.ols(formula='Sales ~ Server_Down_time_Sec+Holiday+Special_Discount+Online_Ad_Paid_ref_links+Social_Network_Ref_links+Month+Weekday+DayofMonth', data=Webpage_Product_Sales)
fitted3 = model3.fit()
fitted3.summary()
Out[60]:
In [61]:
#How many variables are there in the final model?
8
Out[61]:
In [62]:
#What is the R-squared of the final model?
fitted3.rsquared
Out[62]:
Interaction Terms
- Adding interaction terms might help in improving the prediction accuracy of the model.
- Adding interaction terms needs prior knowledge of the dataset and its variables (see the sketch below).
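In a statsmodels/patsy formula, A*B expands to the two main effects plus their interaction (A + B + A:B), while A:B adds only the product term. A small sketch, assuming the Webpage_Product_Sales DataFrame loaded in the lab above and a hypothetical choice of interacting variables:

```python
# Compare a main-effects-only model with one that adds a Holiday x Special_Discount interaction.
import statsmodels.formula.api as sm

m_main = sm.ols(formula='Sales ~ Holiday + Special_Discount', data=Webpage_Product_Sales).fit()
m_inter = sm.ols(formula='Sales ~ Holiday * Special_Discount', data=Webpage_Product_Sales).fit()

print(m_main.rsquared_adj, m_inter.rsquared_adj)   # does the interaction term improve the fit?
```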
LAB: Interaction Terms
- Add few interaction terms to above web product sales model and see the increase in the accuracy
In [63]:
import statsmodels.formula.api as sm
model4 = sm.ols(formula='Sales ~ Server_Down_time_Sec+Holiday+Special_Discount+Online_Ad_Paid_ref_links+Social_Network_Ref_links+Month+Weekday+DayofMonth+Holiday*Weekday', data=Webpage_Product_Sales)
fitted4 = model4.fit()
fitted4.summary()
Out[63]:
Conclusion – Regression
- In this chapter we discussed simple and multiple regression: how to build simple and multiple linear regression models, which metrics in the regression output matter most, what multicollinearity is, how to detect and eliminate it, what R-squared and adjusted R-squared are and how they differ, and how to assess the individual impact of each variable.
- This is a basic regression class; once you have a good grasp of regression, you can explore advanced topics such as adding polynomial and interaction terms to your regression model.
- Adjusted R-squared is a good measure of training (in-sample) error. We cannot be sure about final model performance based on it alone; we may have to perform cross-validation to get an idea of the testing error.
- We will talk about cross-validation in more detail in future lectures.
- Outliers can influence the regression line, so we need to sanitize the data before building it; at the end of the day these are all mathematical formulas, and if the wrong data goes in, we get the wrong results. Data cleaning is very important before getting into regression.