Regression
Contents
- Correlation
- Simple Regression
- R-Squared
- Multiple Regression
- Adj R-Squared
- P-value
- Multicollinearity
- Interaction terms
Correlation
Why do we need correlation?
- Is there any association between hours of study and grades?
- Is there any association between number of temples in a city & murder rate?
- What happens to sweater sales with increase in temperature? What is the strength of association between them?
- What happens to ice-cream sales vs. temperature? What is the strength of association between them?
- How to quantify the association?
- Which of the above examples has very strong association?
- Correlation
Correlation coefficient
- It is a measure of linear association
- r is the ratio of the covariance of X and Y to the product of their standard deviations.
$$r = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$
- r always lies between -1 and 1; negative values indicate a negative association of the corresponding strength
- Correlation 0: no linear association
- Correlation 0 to 0.25: negligible positive association
- Correlation 0.25 to 0.5: weak positive association
- Correlation 0.5 to 0.75: moderate positive association
- Correlation > 0.75: very strong positive association
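The definition above can be checked by hand. The following is a minimal sketch, assuming a small made-up sample of study hours and grades (the numbers are hypothetical, not from any dataset used in this tutorial). It computes r from the covariance and the two variances and compares it with `np.corrcoef`.

```python
import numpy as np

# Hypothetical data: hours of study and exam grades for eight students.
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
grades = np.array([52.0, 55.0, 61.0, 60.0, 68.0, 71.0, 75.0, 80.0])

def pearson_r(x, y):
    """Covariance of x and y divided by the square root of Var(x) * Var(y)."""
    cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
    var_x = np.mean((x - x.mean()) ** 2)
    var_y = np.mean((y - y.mean()) ** 2)
    return cov_xy / np.sqrt(var_x * var_y)

r_manual = pearson_r(hours, grades)
r_numpy = np.corrcoef(hours, grades)[0, 1]  # off-diagonal entry of the 2x2 matrix
print(r_manual, r_numpy)
```

Both values should agree up to floating-point error, and for this sample they fall in the "very strong positive association" band.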
LAB: Correlation Calculation
- Dataset: AirPassengers/AirPassengers.csv
- Find the correlation between number of passengers and promotional budget.
- Draw a scatter plot between number of passengers and promotional budget
- Find the correlation between number of passengers and Service_Quality_Score
import pandas as pd
air = pd.read_csv(r"C:\Amrita\Datavedi\AirPassengers\AirPassengers.csv")  # raw string, so backslashes are not read as escapes
air.shape
air.columns.values
#Find the correlation between number of passengers and promotional budget.
import numpy as np
np.corrcoef(air.Passengers,air.Promotion_Budget)
#Draw a scatter plot between number of passengers and promotional budget
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(air.Passengers, air.Promotion_Budget)
#Find the correlation between number of passengers and Service_Quality_Score
np.corrcoef(air.Passengers,air.Service_Quality_Score)
Beyond Pearson Correlation
- Correlation coefficient measures for different types of data
| Variable Y vs. X | Quantitative/Continuous X | Ordinal/Ranked/Discrete X | Nominal/Categorical X |
|---|---|---|---|
| Quantitative Y | Pearson $r$ | Biserial $r_b$ | Point biserial $r_{pb}$ |
| Ordinal/Ranked/Discrete Y | Biserial $r_b$ | Spearman's rho / Kendall's tau | Rank biserial $r_{rb}$ |
| Nominal/Categorical Y | Point biserial $r_{pb}$ | Rank biserial $r_{rb}$ | Phi, contingency coefficient, Cramér's V |
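To illustrate why the choice of coefficient matters, here is a small sketch contrasting Pearson's r with Spearman's rho on a monotonic but non-linear relation. It assumes made-up data without tied values; the helper `ranks` is hypothetical and would need average ranks to handle ties.

```python
import numpy as np

def ranks(values):
    """Rank values from 1 to n. Assumes there are no ties."""
    order = np.argsort(values)
    result = np.empty(len(values))
    result[order] = np.arange(1, len(values) + 1)
    return result

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = x ** 3  # y always increases with x, but not along a straight line

pearson = np.corrcoef(x, y)[0, 1]
# Spearman's rho is Pearson's r computed on the ranks instead of the raw values
spearman = np.corrcoef(ranks(x), ranks(y))[0, 1]
print(pearson, spearman)
```

Spearman's rho treats the relation as perfect because the ordering is preserved, while Pearson's r is lower because the points do not lie on a line.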
From Correlation to Regression
- Correlation is just a measure of association
- It can’t be used for prediction.
- Given the predictor variable, we can’t estimate the dependent variable.
- In the air passengers example, given the promotion budget, we can’t get an estimated value of passengers
- We need a model, an equation, a fit for the data.
- That fitted equation is known as the regression line
What is Regression
- A regression line is a mathematical formula that quantifies the general relation between a predictor/independent variable (the known variable x) and the target/dependent variable (the unknown variable y)
- Below is the regression line. If we have data on x and y, we can build a model to generalize their relation
$$y = \beta_0 + \beta_1 x$$
- What is the best fit for our data?
- The one which goes through the core of the data
- The one which minimizes the error
Regression

Regression Line fitting

Error

Minimizing the error

- The best line will have the minimum error
- Some errors are positive and some are negative, so they cancel out. Taking their plain sum is not a good idea
- We can either minimize the sum of squared errors or minimize the sum of absolute errors
- The sum of squared errors is mathematically convenient to minimize
- The method of minimizing the sum of squared errors is called the least squares method of regression
Least Squares Estimation
- X: $x_1, x_2, x_3, \ldots, x_n$
- Y: $y_1, y_2, y_3, \ldots, y_n$
- Imagine a line through all the points
- Deviation from each point (residual or error)
- Square of the deviation
- Minimizing sum of squares of deviation
$$\sum e^2 = \sum (y - \hat{y})^2$$
$$\sum e^2 = \sum (y - (\beta_0 + \beta_1 x))^2$$
- $\beta_0$ and $\beta_1$ are obtained by minimizing the sum of the squared residuals
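For a single predictor, setting the derivatives of the sum of squared residuals to zero gives the closed-form estimates $\beta_1 = \sum (x - \bar{x})(y - \bar{y}) / \sum (x - \bar{x})^2$ and $\beta_0 = \bar{y} - \beta_1 \bar{x}$. The sketch below applies them to a small made-up sample (hypothetical numbers) and checks the result against `np.polyfit`.

```python
import numpy as np

# Hypothetical data, roughly y = 2 + 3x with small disturbances.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 7.9, 11.2, 13.8, 17.0])

# Closed-form least squares estimates for a single predictor
beta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta_0 = y.mean() - beta_1 * x.mean()

# Cross-check with NumPy's degree-1 polynomial fit (returns slope first)
slope_check, intercept_check = np.polyfit(x, y, deg=1)

def sse(b0, b1):
    """Sum of squared residuals for the line y = b0 + b1 * x."""
    return np.sum((y - (b0 + b1 * x)) ** 2)

print(beta_0, beta_1, sse(beta_0, beta_1))
```

Nudging either coefficient away from these values, for example `sse(beta_0 + 0.1, beta_1)`, should only increase the sum of squared residuals.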
LAB: Regression Line Fitting
- Dataset: AirPassengers/AirPassengers.csv
- Find the correlation between Promotion_Budget and Passengers
- Draw a scatter plot between Promotion_Budget and Passengers. Is there any pattern between Promotion_Budget and Passengers?
- Build a linear regression model on Promotion_Budget and Passengers.
- Build a regression line to predict the passengers using Inter_metro_flight_ratio
import pandas as pd
air = pd.read_csv(r"C:\Amrita\Datavedi\AirPassengers\AirPassengers.csv")  # raw string, so backslashes are not read as escapes
air.shape
air.columns.values
air.head(5)
# Find the correlation between Promotion_Budget and Passengers
import numpy as np
np.corrcoef(air.Passengers,air.Promotion_Budget)
# Draw a scatter plot between Promotion_Budget and Passengers. Is there any pattern between Promotion_Budget and Passengers?
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(air.Promotion_Budget, air.Passengers)  # predictor on the x-axis, target on the y-axis
# Build a linear regression model to estimate the expected passengers for a given Promotion_Budget (for example 650,000)
## Regression model: promotion budget and passenger count
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(air[["Promotion_Budget"]], air[["Passengers"]])
predictions = lr.predict(air[["Promotion_Budget"]])
import statsmodels.formula.api as sm
model = sm.ols(formula='Passengers ~ Promotion_Budget', data=air)
fitted1 = model.fit()
fitted1.summary()
# Build a regression line to predict the passengers using Inter_metro_flight_ratio
plt.scatter(air.Inter_metro_flight_ratio,air.Passengers)
import sklearn as sk
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(air[["Inter_metro_flight_ratio"]], air[["Passengers"]])
predictions = lr.predict(air[["Inter_metro_flight_ratio"]])
import statsmodels.formula.api as sm
model = sm.ols(formula='Passengers ~ Inter_metro_flight_ratio', data=air)
fitted2 = model.fit()
fitted2.summary()
How good is my regression line?
- Take an (x, y) point from the data.
- Suppose we substitute x into the regression line and get a prediction $y_{pred}$
- If the regression line is a good fit, then we expect $y_{pred} = y$, or $(y - y_{pred}) = 0$
- Repeating this at every value of x gives one error value $(y - y_{pred})$ per point
- Some of them may be positive and some negative, so we square the errors before adding them up
$$SSE = \sum (y - \hat{y})^2$$
- For a good model we need SSE to be zero or close to zero
- A standalone SSE value is not meaningful. For example, SSE = 100 is very small when y varies in the thousands, but the same value is very large when y varies in decimals.
- We have to consider the variance of y when assessing the accuracy of the regression line
- Error sum of squares (SSE, sum of squares of error)
$$SSE = \sum (y - \hat{y})^2$$
- Total variance in Y (SST, total sum of squares)
$$SST = \sum (y - \bar{y})^2$$
$$SST = \sum (y - \hat{y} + \hat{y} - \bar{y})^2$$
$$SST = \sum (y - \hat{y})^2 + \sum (\hat{y} - \bar{y})^2 + 2\sum (y - \hat{y})(\hat{y} - \bar{y})$$
- For a least squares line with an intercept, the cross term $2\sum (y - \hat{y})(\hat{y} - \bar{y})$ is zero, so
$$SST = \sum (y - \hat{y})^2 + \sum (\hat{y} - \bar{y})^2$$
$$SST = SSE + \sum (\hat{y} - \bar{y})^2$$
$$SST = SSE + SSR$$
- So, total variance in Y is divided into two parts:
- Variance that can’t be explained by x (error)
- Variance that can be explained by x, using regression
Explained and Unexplained Variation

- So, total variance in Y is divided into two parts,
- Variance that can be explained by x, using regression
- Variance that can’t be explained by x
$$SST = SSE + SSR$$
$$\text{Total Sum of Squares} = \text{Sum of Squares Error} + \text{Sum of Squares Regression}$$
$$SST = \sum (y - \bar{y})^2$$
$$SSE = \sum (y - \hat{y})^2$$
$$SSR = \sum (\hat{y} - \bar{y})^2$$
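The decomposition can be verified numerically. This is a minimal sketch on synthetic data (the data-generating line and noise level are assumptions chosen for illustration); it fits a least squares line with `np.polyfit` and checks that SST equals SSE plus SSR and that the cross term is zero.

```python
import numpy as np

# Synthetic data: an assumed linear relation plus random noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 4.0 + 1.5 * x + rng.normal(0, 2.0, size=50)

# Least squares line (polyfit returns the slope first, then the intercept)
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

sst = np.sum((y - y.mean()) ** 2)        # total variation in y
sse = np.sum((y - y_hat) ** 2)           # variation left unexplained
ssr = np.sum((y_hat - y.mean()) ** 2)    # variation explained by the line
cross = np.sum((y - y_hat) * (y_hat - y.mean()))

print(sst, sse + ssr, cross)
```

The identity holds only for a least squares fit that includes an intercept; for an arbitrary line the cross term would not vanish.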
R-Squared
- A good fit will have
- SSE (Minimum or Maximum?)
- SSR (Minimum or Maximum?)
- And we know SST= SSE + SSR
- SSE/SST(Minimum or Maximum?)
- SSR/SST(Minimum or Maximum?)
- The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable
- The coefficient of determination is also called R-squared and is denoted as $R^2$
$$R^2 = \frac{SSR}{SST}$$
where $0 \le R^2 \le 1$
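As a quick check of the definition, the sketch below computes $R^2$ as SSR/SST and as 1 - SSE/SST on synthetic data (an assumed linear relation with noise) and compares both with the squared correlation coefficient, which they equal when there is a single predictor.

```python
import numpy as np

# Synthetic data: an assumed linear relation plus random noise
rng = np.random.default_rng(5)
x = rng.uniform(0, 100, size=80)
y = 10.0 + 0.8 * x + rng.normal(0, 12.0, size=80)

slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

sst = np.sum((y - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)
ssr = np.sum((y_hat - y.mean()) ** 2)

r2_from_ssr = ssr / sst          # share of variation explained
r2_from_sse = 1 - sse / sst      # one minus the unexplained share
r = np.corrcoef(x, y)[0, 1]      # Pearson correlation between x and y

print(r2_from_ssr, r2_from_sse, r ** 2)
```

With more than one predictor, $R^2$ is no longer the square of any single pairwise correlation, but the SSR/SST definition still applies.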
Lab: R-Squared
- What is the R-squared value of the Passengers vs Promotion_Budget model?
- What is the R-squared value of the Passengers vs Inter_metro_flight_ratio model?
#What is the R-square value of Passengers vs Promotion_Budget model?
fitted1.summary()
#What is the R-square value of Passengers vs Inter_metro_flight_ratio
fitted2.summary()
Multiple Regression
- Using multiple predictor variables instead of a single variable
- Instead of a best-fit line, we now need to find the best-fit plane (a hyperplane when there are more than two predictors)

Code – Multiple Regression
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(air[["Promotion_Budget", "Inter_metro_flight_ratio"]], air[["Passengers"]])
predictions = lr.predict(air[["Promotion_Budget", "Inter_metro_flight_ratio"]])
import statsmodels.formula.api as sm
model = sm.ols(formula='Passengers ~ Promotion_Budget+Inter_metro_flight_ratio', data=air)
fitted = model.fit()
fitted.summary()
Individual Impact of variables
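- Look at the P-value
- The P-value is the probability of seeing a coefficient estimate at least as extreme as the one observed if the true coefficient were zero. It is not the probability of the hypothesis being right.
- Each variable's coefficient is tested for significance
- Under the null hypothesis, each estimated coefficient divided by its standard error follows a t distribution.
- Individual P-values tell us about the significance of each variable
- A variable is conventionally considered significant if its P-value is less than 5%. The smaller the P-value, the stronger the evidence that the variable has an impact
- Note that all the variables in a regression can together produce a great fit, and yet very few of them may be individually significant.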
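To test
$$H_0 : \beta_i = 0$$
$$H_a : \beta_i \neq 0$$
Test Statistic
$$t = \frac{b_i}{s(b_i)}$$
where $b_i$ is the estimated coefficient, $s(b_i)$ its standard error, $n$ the number of observations and $k$ the number of predictors.
Reject $H_0$ if
$$t > t(\frac{\alpha}{2}; n-k-1)$$
or
$$t < -t(\frac{\alpha}{2}; n-k-1)$$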
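The test statistic can be computed directly from the least squares fit. The sketch below is illustrative only: it uses synthetic data in which `x1` drives `y` and `x2` does not (both are assumptions of the example), and it approximates the two-sided P-value with the standard normal distribution, which is close to the t distribution when n - k - 1 is large. An exact P-value would use the t distribution itself.

```python
import math
import numpy as np

rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)                  # related to y by construction
x2 = rng.normal(size=n)                  # unrelated to y by construction
y = 3.0 + 2.0 * x1 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])    # intercept column + k predictors
k = X.shape[1] - 1

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta
sigma2 = residuals @ residuals / (n - k - 1)     # residual variance estimate
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_stats = beta / std_err                         # t = b_i / s(b_i)

# Two-sided P-values, using the normal approximation to the t distribution
p_values = np.array([math.erfc(abs(t) / math.sqrt(2)) for t in t_stats])

for name, t, p in zip(["intercept", "x1", "x2"], t_stats, p_values):
    print(name, t, p)
```

The coefficient of `x1` should come out with a large t statistic and a tiny P-value, which is the pattern to look for in the `summary()` tables of the labs.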
LAB: Multiple Regression
- Build a multiple regression model to predict the number of passengers
- What is the R-squared value?
- Are there any predictor variables that are not impacting the dependent variable?
#Build a multiple regression model to predict the number of passengers
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(air[["Promotion_Budget", "Inter_metro_flight_ratio", "Service_Quality_Score"]], air[["Passengers"]])
predictions = lr.predict(air[["Promotion_Budget", "Inter_metro_flight_ratio", "Service_Quality_Score"]])
import statsmodels.formula.api as sm
model = sm.ols(formula='Passengers ~ Promotion_Budget+Inter_metro_flight_ratio+Service_Quality_Score', data=air)
fitted = model.fit()
fitted.summary()
- What is R-square value
0.951
- Are there any predictor variables that are not impacting the dependent variable
Inter_metro_flight_ratio
Adjusted R-Squared
- Is it good to have as many independent variables as possible? No
- R-squared is deceptive: it never decreases when a new X variable is added to the model, even if that variable is irrelevant
- We need a better measure or an adjustment to the original R-squared formula.
- Adjusted R squared
- Its value depends on the number of explanatory variables
- Imposes a penalty for adding additional explanatory variables
- It is usually written as ($bar{R}^2$)
- Very different from $R^2$ when there are too many predictors and n is small
$$\bar{R}^2 = R^2 - \frac{k-1}{n-k}(1 - R^2)$$
where n is the number of observations and k is the number of parameters (including the intercept)
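The behaviour described above can be seen on synthetic data. In this sketch (all data and names are made up for illustration) one real predictor is followed by pure-noise predictors; $R^2$ can only go up as columns are added, while the adjusted value applies the penalty from the formula above.

```python
import numpy as np

def r2_and_adjusted(X, y):
    """Fit OLS with an intercept and return (R^2, adjusted R^2)."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    k = design.shape[1]                  # number of parameters, intercept included
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    sse = np.sum((y - design @ beta) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    r2 = 1 - sse / sst
    adj = r2 - (k - 1) / (n - k) * (1 - r2)
    return r2, adj

rng = np.random.default_rng(1)
n = 15
signal = rng.normal(size=(n, 1))         # the only predictor that matters
noise_cols = rng.normal(size=(n, 6))     # predictors unrelated to y
y = 1.0 + 2.0 * signal[:, 0] + rng.normal(size=n)

results = []
for extra in range(0, 7, 2):             # add 0, 2, 4, then 6 noise columns
    X = np.column_stack([signal, noise_cols[:, :extra]])
    results.append((1 + extra,) + r2_and_adjusted(X, y))

for n_vars, r2, adj in results:
    print(n_vars, round(r2, 3), round(adj, 3))
```

The gap between the two columns widens as more irrelevant predictors are added relative to the sample size.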
LAB: Adjusted R-Square
- Dataset: “Adjusted RSquare/Adj_Sample.csv”
- Build a model to predict y using x1,x2 and x3. Note down R-Square and Adj R-Square values
- Build a model to predict y using x1,x2,x3,x4,x5 and x6. Note down R-Square and Adj R-Square values
- Build a model to predict y using x1,x2,x3,x4,x5,x6,x7 and x8. Note down R-Square and Adj R-Square values
adj_sample=pd.read_csv(r"C:\Amrita\Datavedi\Adjusted RSquare\Adj_Sample.csv")  # raw string, so backslashes are not read as escapes
adj_sample.shape
adj_sample.columns.values
#Build a model to predict y using x1,x2 and x3. Note down R-Square and Adj R-Square values
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(adj_sample[["x1", "x2", "x3"]], adj_sample[["Y"]])
predictions = lr.predict(adj_sample[["x1", "x2", "x3"]])
import statsmodels.formula.api as sm
model = sm.ols(formula='Y ~ x1+x2+x3', data=adj_sample)
fitted1 = model.fit()
fitted1.summary()
#Build a model to predict y using x1,x2,x3,x4,x5 and x6. Note down R-Square and Adj R-Square values
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(adj_sample[["x1", "x2", "x3", "x4", "x5", "x6"]], adj_sample[["Y"]])
predictions = lr.predict(adj_sample[["x1", "x2", "x3", "x4", "x5", "x6"]])
import statsmodels.formula.api as sm
model = sm.ols(formula='Y ~ x1+x2+x3+x4+x5+x6', data=adj_sample)
fitted2 = model.fit()
fitted2.summary()
#Build a model to predict y using x1,x2,x3,x4,x5,x6,x7 and x8. Note down R-Square and Adj R-Square values
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(adj_sample[["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]], adj_sample[["Y"]])
predictions = lr.predict(adj_sample[["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]])
import statsmodels.formula.api as sm
model = sm.ols(formula='Y ~ x1+x2+x3+x4+x5+x6+x7+x8', data=adj_sample)
fitted3 = model.fit()
fitted3.summary()
| Model | $R^2$ | Adjusted $R^2$ |
|---|---|---|
| Model1 | 0.684 | 0.566 |
| Model2 | 0.717 | 0.377 |
| Model3 | 0.805 | 0.285 |
R-Squared vs Adjusted R-Squared
We have built three models on the Adj_Sample data (model1, model2 and model3) with different numbers of variables
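The adjusted values reported for the three models can be reproduced from the formula given earlier. The sketch below assumes a sample size of n = 12; that figure is not stated in the lab and is only inferred because it makes the formula match the reported numbers. k counts the predictors plus the intercept.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2; k counts all estimated parameters, including the intercept."""
    return r2 - (k - 1) / (n - k) * (1 - r2)

n = 12  # assumption: inferred from the reported values, not given in the lab

# model name -> (reported R^2, number of parameters = predictors + intercept)
models = {"Model1": (0.684, 4), "Model2": (0.717, 7), "Model3": (0.805, 9)}

adjusted = {name: adjusted_r2(r2, n, k) for name, (r2, k) in models.items()}
for name, value in adjusted.items():
    print(name, value)
```

Under that assumption the three results agree with the reported adjusted values to within rounding, even though $R^2$ itself rises from model to model.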

LAB: Multiple Regression Issues
- Import Final Exam Score data
- Build a model to predict final score using the rest of the variables.
- How are Sem2_Math & Final score related? As Sem2_Math score increases, what happens to Final score?
- Remove “Sem1_Math” variable from the model and rebuild the model
- Is there any change in R square or Adj R square
- How are Sem2_Math & Final score related now? As Sem2_Math score increases, what happens to Final score?
- Draw a scatter plot between Sem1_Math & Sem2_Math
- Find the correlation between Sem1_Math & Sem2_Math
#Import Final Exam Score data
final_exam=pd.read_csv(r"C:\Amrita\Datavedi\Final Exam\Final Exam Score.csv")  # raw string, so backslashes are not read as escapes
#Size of the data
final_exam.shape
#Variable names
final_exam.columns
#Build a model to predict final score using the rest of the variables.
from sklearn.linear_model import LinearRegression
lr1 = LinearRegression()
lr1.fit(final_exam[["Sem1_Science", "Sem2_Science", "Sem1_Math", "Sem2_Math"]], final_exam[["Final_exam_marks"]])
predictions1 = lr1.predict(final_exam[["Sem1_Science", "Sem2_Science", "Sem1_Math", "Sem2_Math"]])
import statsmodels.formula.api as sm
model1 = sm.ols(formula='Final_exam_marks ~ Sem1_Science+Sem2_Science+Sem1_Math+Sem2_Math', data=final_exam)
fitted1 = model1.fit()
fitted1.summary()
fitted1.rsquared
- How are Sem2_Math & Final score related? As Sem2_Math score increases, what happens to Final score?
As the Sem2_Math score increases, the Final score decreases (the coefficient of Sem2_Math is negative in this model).
#Remove "Sem1_Math" variable from the model and rebuild the model
from sklearn.linear_model import LinearRegression
lr2 = LinearRegression()
lr2.fit(final_exam[["Sem1_Science", "Sem2_Science", "Sem2_Math"]], final_exam[["Final_exam_marks"]])
predictions2 = lr2.predict(final_exam[["Sem1_Science", "Sem2_Science", "Sem2_Math"]])
import statsmodels.formula.api as sm
model2 = sm.ols(formula='Final_exam_marks ~ Sem1_Science+Sem2_Science+Sem2_Math', data=final_exam)
fitted2 = model2.fit()
fitted2.summary()
- Is there any change in R-squared or adjusted R-squared?
| Model | $R^2$ | Adjusted $R^2$ |
|---|---|---|
| model1 | 0.990 | 0.987 |
| model2 | 0.981 | 0.978 |
- How are Sem2_Math & Final score related now? As Sem2_Math score increases, what happens to Final score?
As the Sem2_Math score increases, the Final score also increases.
#Draw a scatter plot between Sem1_Math & Sem2_Math
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(final_exam.Sem1_Math,final_exam.Sem2_Math)
#Find the correlation between Sem1_Math & Sem2_Math
np.corrcoef(final_exam.Sem1_Math,final_exam.Sem2_Math)
Multicollinearity
- Multiple regression is wonderful, in that it allows you to consider the effect of multiple variables simultaneously.
- Multiple regression is extremely unpleasant, because it allows you to consider the effect of multiple variables simultaneously.
- The relationships between the explanatory variables are the key to understanding multiple regression.
- Multicollinearity (or intercorrelation) exists when at least some of the predictor variables are correlated among themselves.
- The parameter estimates will have inflated variance in the presence of multicollinearity
- Sometimes the signs of the parameter estimates tend to change
- If the relation between the independent variables grows really strong, the variance of the parameter estimates tends to infinity. Can you prove it?
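One way to see the last point is the standard expression for the variance of an OLS coefficient, quoted here as a known result rather than derived. The symbols $\sigma^2$ (error variance), $x_{ij}$ (value of predictor j for observation i) and $R_j^2$ (the $R^2$ from regressing predictor j on the other predictors) are notation introduced for this note.

$$\mathrm{Var}(\hat{\beta}_j) = \frac{\sigma^2}{(1 - R_j^2)\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2}$$

As predictor j becomes better explained by the other predictors, $R_j^2 \to 1$, the factor $(1 - R_j^2)$ goes to zero and the variance grows without bound.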
Multicollinearity Detection
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4$$
- Build a model of $X_1$ vs $X_2$, $X_3$, $X_4$ and find its $R^2$, say R1
- Build a model of $X_2$ vs $X_1$, $X_3$, $X_4$ and find its $R^2$, say R2
- Build a model of $X_3$ vs $X_1$, $X_2$, $X_4$ and find its $R^2$, say R3
- Build a model of $X_4$ vs $X_1$, $X_2$, $X_3$ and find its $R^2$, say R4
- For example, if R3 is 95% then we don’t really need $X_3$ in the model
- Since it can be explained as a linear combination of the other three
- For each variable we find its individual $R^2$.
- $\frac{1}{1-R^2}$ is called the VIF (variance inflation factor).
- The VIF option in SAS automatically calculates VIF values for each of the predictor variables
| $R^2$ | 40% | 50% | 60% | 70% | 75% | 80% | 90% |
|---|---|---|---|---|---|---|---|
| VIF | 1.67 | 2.00 | 2.50 | 3.33 | 4.00 | 5.00 | 10.00 |
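The detection procedure can be sketched with NumPy alone. The function name `vif_values` and the synthetic columns are assumptions made for this illustration: `x2` is built as a near copy of `x1`, while `x3` is independent of both.

```python
import numpy as np

def vif_values(X):
    """Return the VIF of every column of X (2-D array, one column per predictor)."""
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        # Regress column j on an intercept plus all the other columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        sse = np.sum((target - others @ beta) ** 2)
        sst = np.sum((target - target.mean()) ** 2)
        r2 = 1 - sse / sst
        vifs.append(1 / (1 - r2))
    return np.array(vifs)

rng = np.random.default_rng(7)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # almost a copy of x1
x3 = rng.normal(size=n)                   # unrelated to the others

vifs = vif_values(np.column_stack([x1, x2, x3]))
print(vifs)
```

The two near-duplicate columns should show very large VIF values and the independent one a value close to 1. statsmodels also provides `variance_inflation_factor` in `statsmodels.stats.outliers_influence` for the same purpose.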
LAB: Multicollinearity
- Identify the Multicollinearity in the Final Exam Score model
- Drop the variable one by one to reduce the multicollinearity
- Identify and eliminate the Multicollinearity in the Air passengers model
from sklearn.linear_model import LinearRegression
lr1 = LinearRegression()
lr1.fit(final_exam[["Sem1_Science", "Sem2_Science", "Sem1_Math", "Sem2_Math"]], final_exam[["Final_exam_marks"]])
predictions1 = lr1.predict(final_exam[["Sem1_Science", "Sem2_Science", "Sem1_Math", "Sem2_Math"]])
import statsmodels.formula.api as sm
model1 = sm.ols(formula='Final_exam_marks ~ Sem1_Science+Sem2_Science+Sem1_Math+Sem2_Math', data=final_exam)
fitted1 = model1.fit()
fitted1.summary()
fitted1.summary2()
#Code for VIF Calculation
#Writing a function to calculate the VIF values
def vif_cal(input_data, dependent_col):
    # Keep only the predictor columns
    x_vars = input_data.drop([dependent_col], axis=1)
    xvar_names = x_vars.columns
    for i in range(0, xvar_names.shape[0]):
        # Regress each predictor on all the remaining predictors
        y = x_vars[xvar_names[i]]
        x = x_vars[xvar_names.drop(xvar_names[i])]
        rsq = sm.ols(formula="y~x", data=x_vars).fit().rsquared
        vif = round(1 / (1 - rsq), 2)
        print(xvar_names[i], " VIF = ", vif)
#Calculating VIF values using that function
vif_cal(input_data=final_exam, dependent_col="Final_exam_marks")
import statsmodels.formula.api as sm
model2 = sm.ols(formula='Final_exam_marks ~ Sem1_Science+Sem2_Science+Sem2_Math', data=final_exam)
fitted2 = model2.fit()
fitted2.summary()
vif_cal(input_data=final_exam.drop(["Sem1_Math"], axis=1), dependent_col="Final_exam_marks")
vif_cal(input_data=final_exam.drop(["Sem1_Math","Sem1_Science"], axis=1), dependent_col="Final_exam_marks")
#Identify and eliminate the Multicollinearity in the Air passengers model
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(air[["Promotion_Budget", "Inter_metro_flight_ratio", "Service_Quality_Score"]], air[["Passengers"]])
predictions = lr.predict(air[["Promotion_Budget", "Inter_metro_flight_ratio", "Service_Quality_Score"]])
import statsmodels.formula.api as sm
model = sm.ols(formula='Passengers ~ Promotion_Budget+Inter_metro_flight_ratio+Service_Quality_Score', data=air)
fitted = model.fit()
fitted.summary()
#Calculating VIF values using that function
vif_cal(input_data=air, dependent_col="Passengers")
Lab: Multiple Regression
- Dataset: Webpage_Product_Sales/Webpage_Product_Sales.csv
- Build a model to predict sales using rest of the variables
- Drop the less impacting variables based on p-values.
- Is there any multicollinearity?
- How many variables are there in the final model?
- What is the R-squared of the final model?
- Can you improve the model using same data and variables?
import pandas as pd
Webpage_Product_Sales=pd.read_csv(r"C:\Amrita\Datavedi\Webpage_Product_Sales\Webpage_Product_Sales.csv")  # raw string, so backslashes are not read as escapes
Webpage_Product_Sales.shape
Webpage_Product_Sales.columns
import statsmodels.formula.api as sm
model1 = sm.ols(formula='Sales ~ Web_UI_Score+Server_Down_time_Sec+Holiday+Special_Discount+Clicks_From_Serach_Engine+Online_Ad_Paid_ref_links+Social_Network_Ref_links+Month+Weekday+DayofMonth', data=Webpage_Product_Sales)
fitted1 = model1.fit()
fitted1.summary()
#VIF
vif_cal(Webpage_Product_Sales,"Sales")
##Dropped Clicks_From_Serach_Engine based on VIF
import statsmodels.formula.api as sm
model2 = sm.ols(formula='Sales ~ Web_UI_Score+Server_Down_time_Sec+Holiday+Special_Discount+Online_Ad_Paid_ref_links+Social_Network_Ref_links+Month+Weekday+DayofMonth', data=Webpage_Product_Sales)
fitted2 = model2.fit()
fitted2.summary()
#VIF for the updated model
vif_cal(Webpage_Product_Sales.drop(["Clicks_From_Serach_Engine"],axis=1),"Sales")
##Drop the less impacting variables based on p-values.
##Dropped Web_UI_Score based on P-value
import statsmodels.formula.api as sm
model3 = sm.ols(formula='Sales ~ Server_Down_time_Sec+Holiday+Special_Discount+Online_Ad_Paid_ref_links+Social_Network_Ref_links+Month+Weekday+DayofMonth', data=Webpage_Product_Sales)
fitted3 = model3.fit()
fitted3.summary()
#How many variables are there in the final model?
8
#What is the R-squared of the final model?
fitted3.rsquared
Interaction Terms
- An interaction term is the product of two predictors. It lets the effect of one variable depend on the value of the other, which might improve the prediction accuracy of the model.
- Adding interaction terms needs prior knowledge of the dataset and its variables
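The sketch below shows the idea on synthetic data. The variables `holiday` and `discount`, and the assumption that a discount is more effective on holidays, are invented for this illustration and are not taken from the web sales dataset.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an OLS fit of y on X with an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    sse = np.sum((y - design @ beta) ** 2)
    return 1 - sse / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(3)
n = 400
holiday = rng.integers(0, 2, size=n).astype(float)   # 0/1 flag
discount = rng.uniform(0, 30, size=n)                # discount in percent

# Assumed data-generating process: the discount works far better on holidays
sales = (100 + 5 * holiday + 1.0 * discount
         + 4.0 * holiday * discount + rng.normal(0, 5, size=n))

main_effects = np.column_stack([holiday, discount])
with_interaction = np.column_stack([holiday, discount, holiday * discount])

r2_main = r_squared(main_effects, sales)
r2_inter = r_squared(with_interaction, sales)
print(r2_main, r2_inter)
```

In a statsmodels formula the same extra column is requested with `Holiday:Weekday`, or with `Holiday*Weekday`, which expands to both main effects plus their interaction.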
LAB: Interaction Terms
- Add a few interaction terms to the web product sales model above and see whether the accuracy increases
import statsmodels.formula.api as sm
model4 = sm.ols(formula='Sales ~ Server_Down_time_Sec+Holiday+Special_Discount+Online_Ad_Paid_ref_links+Social_Network_Ref_links+Month+Weekday+DayofMonth+Holiday*Weekday', data=Webpage_Product_Sales)
fitted4 = model4.fit()
fitted4.summary()
Conclusion – Regression
- Try adding polynomial and interaction terms to your regression line. Sometimes they work like a charm.
- Adjusted R-squared is a good measure of training/in-time sample error, but we can’t be sure about the final model performance based on it. We may have to perform cross-validation to get an idea of the testing error.
- Outliers can influence the regression line, so we need to take care of data sanitization before building the regression line.
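Cross-validation, mentioned above, can be sketched in a few lines. This is a minimal k-fold version written with NumPy on synthetic data; the helper name `k_fold_mse`, the number of folds and the data-generating coefficients are all assumptions made for the example.

```python
import numpy as np

def k_fold_mse(X, y, n_folds=5, seed=0):
    """Average out-of-fold mean squared error of an OLS fit with an intercept."""
    n = len(y)
    indices = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(indices, n_folds)
    errors = []
    for test_idx in folds:
        # Train on every row that is not in the current test fold
        train_idx = np.setdiff1d(indices, test_idx)
        a_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        a_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        beta, *_ = np.linalg.lstsq(a_train, y[train_idx], rcond=None)
        errors.append(np.mean((y[test_idx] - a_test @ beta) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(11)
n = 120
X = rng.normal(size=(n, 3))
y = 2.0 + X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=n)

cv_mse = k_fold_mse(X, y)
print(cv_mse)
```

Because each fold is scored on rows the model never saw during fitting, the averaged error is an estimate of testing error rather than training error.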


