Regression

Correlation

What is the need for correlation?

Is there any association between hours of study and grades?
Is there any association between number of temples in a city & murder rate?
What happens to sweater sales as temperature increases? What is the strength of association between them?
What happens to ice-cream sales vs. temperature? What is the strength of association between them?
How do we quantify the association?
Which of the above examples has a very strong association?
Correlation

Correlation coefficient

It is a measure of linear association
r is the ratio of the covariance of X and Y to the product of their standard deviations.

$$r = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X)\,\text{Var}(Y)}}$$
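The formula can be checked numerically. The following is a minimal sketch on hypothetical toy data (the `hours` and `grades` arrays are made up for illustration and are not part of the lab dataset); it computes r from the covariance and the variances and compares it with `np.corrcoef`.

```python
import numpy as np

# Hypothetical toy data: hours studied and exam grades (not from the lab dataset)
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
grades = np.array([52.0, 55.0, 61.0, 64.0, 70.0, 71.0])

# Sample covariance and variances (ddof=1 gives the n-1 denominator)
cov_xy = np.cov(hours, grades, ddof=1)[0, 1]
var_x = np.var(hours, ddof=1)
var_y = np.var(grades, ddof=1)

# r = Cov(X, Y) / sqrt(Var(X) * Var(Y))
r_manual = cov_xy / np.sqrt(var_x * var_y)

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r
r_numpy = np.corrcoef(hours, grades)[0, 1]

print(r_manual, r_numpy)
```

Both values should agree, because the n-1 denominators cancel between the numerator and the denominator.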

Correlation 0 No linear association
Correlation 0 to 0.25 Negligible positive association
Correlation 0.25-0.5 Weak positive association
Correlation 0.5-0.75 Moderate positive association
Correlation >0.75 Very Strong positive association

LAB –Correlation Calculation

Dataset: AirPassengers/AirPassengers.csv
Find the correlation between number of passengers and promotional budget.
Draw a scatter plot between number of passengers and promotional budget
Find the correlation between number of passengers and Service_Quality_Score

import pandas as pd
air = pd.read_csv(r"C:\Amrita\Datavedi\AirPassengers\AirPassengers.csv")
air.shape

(80, 9)

air.columns.values

array(['Week_num', 'Passengers', 'Promotion_Budget',
       'Service_Quality_Score', 'Holiday_week',
       'Delayed_Cancelled_flight_ind', 'Inter_metro_flight_ratio',
       'Bad_Weather_Ind', 'Technical_issues_ind'], dtype=object)

#Find the correlation between number of passengers and promotional budget.
import numpy as np
np.corrcoef(air.Passengers,air.Promotion_Budget)

array([[ 1.        ,  0.96585103],
       [ 0.96585103,  1.        ]])

#Draw a scatter plot between number of passengers and promotional budget
import matplotlib.pyplot as plt
%matplotlib inline  
plt.scatter(air.Passengers, air.Promotion_Budget)

<matplotlib.collections.PathCollection at 0x8feb8d0>

#Find the correlation between number of passengers and Service_Quality_Score
np.corrcoef(air.Passengers,air.Service_Quality_Score)

array([[ 1.        , -0.88653002],
       [-0.88653002,  1.        ]])

Beyond Pearson Correlation

Correlation coefficient measures for different types of data

Y \ X	Quantitative/Continuous X	Ordinal/Ranked/Discrete X	Nominal/Categorical X
Quantitative Y	Pearson r	Biserial $r_b$	Point Biserial $r_{pb}$
Ordinal/Ranked/Discrete Y	Biserial $r_b$	Spearman rho/Kendall’s tau	Rank Biserial $r_{rb}$
Nominal/Categorical Y	Point Biserial $r_{pb}$	Rank Biserial $r_{rb}$	Phi, Contingency Coeff, V
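As an illustration of one entry in this table, Spearman's rho is the Pearson correlation computed on ranks. The sketch below uses hypothetical data and a small helper `ranks` (an assumption for this example, valid only when there are no tied values) to show how the two measures differ on a monotonic but non-linear relationship.

```python
import numpy as np

def ranks(values):
    """Return 1-based ranks; assumes no tied values."""
    order = np.argsort(values)
    result = np.empty(len(values), dtype=float)
    result[order] = np.arange(1, len(values) + 1)
    return result

# Hypothetical monotonic but non-linear relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = x ** 3

pearson_r = np.corrcoef(x, y)[0, 1]
# Spearman's rho is the Pearson correlation of the ranks
spearman_rho = np.corrcoef(ranks(x), ranks(y))[0, 1]

print(pearson_r, spearman_rho)
```

Pearson r stays below 1 because the relationship is not a straight line, while the rank-based measure only asks whether y increases whenever x does.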

From Correlation to Regression

Correlation is just a measure of association
On its own, it can’t be used for prediction.
Given a value of the predictor variable, the correlation coefficient alone doesn’t give an estimate of the dependent variable.
In the air passengers example, given the promotion budget, correlation alone doesn’t give an estimated value of passengers
We need a model, an equation, a fit for the data.
That fit is known as the regression line

What is Regression

A regression line is a mathematical formula that quantifies the general relation between a predictor/independent (or known variable x) and the target/dependent (or the unknown variable y)
Below is the regression line. If we have the data of x and y then we can build a model to generalize their relation

$$ y = \beta_0 + \beta_1 x$$

- What is the best fit for our data?
- The one which goes through the core of the data
- The one which minimizes the error

Regression

Regression Line fitting

Error

Minimizing the error

The best line will have the minimum error
Some errors are positive and some errors are negative, so they cancel out. Taking their plain sum is not a good idea
We can either minimize the sum of squared errors or minimize the sum of absolute errors
The sum of squared errors is mathematically convenient to minimize
The method of minimizing the sum of squared errors is called the least squares method of regression

Least Squares Estimation

X: $x_1$, $x_2$, $x_3$,… $x_n$
Y: $y_1$, $y_2$, $y_3$,… $y_n$
Imagine a line through all the points
Deviation from each point (residual or error)
Square of the deviation
Minimizing sum of squares of deviation

$$ \sum e^2 = \sum (y - \hat{y})^2$$
$$\sum e^2 = \sum (y - (\beta_0 + \beta_1 x))^2$$

$\beta_0$ and $\beta_1$ are obtained by minimizing the sum of the squared residuals
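For a single predictor, the minimization has a closed-form solution: the slope is the sum of cross-deviations divided by the sum of squared deviations of x, and the intercept makes the line pass through the point of means. A minimal sketch on hypothetical toy data, cross-checked against `np.polyfit`:

```python
import numpy as np

# Hypothetical toy data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

x_mean, y_mean = x.mean(), y.mean()

# Setting the derivatives of sum((y - (b0 + b1*x))**2) to zero gives:
beta_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta_0 = y_mean - beta_1 * x_mean

residuals = y - (beta_0 + beta_1 * x)
sse = np.sum(residuals ** 2)

# Cross-check against NumPy's degree-1 polynomial fit (returns slope, then intercept)
slope_check, intercept_check = np.polyfit(x, y, 1)
print(beta_0, beta_1, sse)
```

The residuals of a least squares line with an intercept always sum to zero, which is why the plain sum of errors cannot be used to compare lines.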

LAB: Regression Line Fitting

Dataset: AirPassengers/AirPassengers.csv
Find the correlation between Promotion_Budget and Passengers
Draw a scatter plot between Promotion_Budget and Passengers. Is there any pattern between Promotion_Budget and Passengers?
Build a linear regression model on Promotion_Budget and Passengers.
Build a regression line to predict the passengers using Inter_metro_flight_ratio

import pandas as pd
air = pd.read_csv(r"C:\Amrita\Datavedi\AirPassengers\AirPassengers.csv")
air.shape

(80, 9)

air.columns.values

array(['Week_num', 'Passengers', 'Promotion_Budget',
       'Service_Quality_Score', 'Holiday_week',
       'Delayed_Cancelled_flight_ind', 'Inter_metro_flight_ratio',
       'Bad_Weather_Ind', 'Technical_issues_ind'], dtype=object)

air.head(5)

# Find the correlation between Promotion_Budget and Passengers
import numpy as np
np.corrcoef(air.Passengers,air.Promotion_Budget)

array([[ 1.        ,  0.96585103],
       [ 0.96585103,  1.        ]])

# Draw a scatter plot between Promotion_Budget and Passengers. Is there any pattern between Promotion_Budget and Passengers?

import matplotlib.pyplot as plt
%matplotlib inline 

plt.scatter(air.Passengers, air.Promotion_Budget)

<matplotlib.collections.PathCollection at 0x90bda20>

#Build a linear regression model and estimate the expected passengers for a Promotion_Budget of 650,000
##Regression model: promotion budget and passengers count

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(air[["Promotion_Budget"]], air[["Passengers"]])
predictions = lr.predict(air[["Promotion_Budget"]])

import statsmodels.formula.api as sm
model = sm.ols(formula='Passengers ~ Promotion_Budget', data=air)
fitted1 = model.fit()

fitted1.summary()
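The lab asks for the expected passengers at a Promotion_Budget of 650,000, but the code stops at the summary. The arithmetic can be sketched by hand, assuming the rounded intercept (1259.6058) and slope (0.0695) reported in the summary output for this model:

```python
# Coefficients as reported (rounded) in the summary output for
# Passengers ~ Promotion_Budget
intercept = 1259.6058
slope = 0.0695

promotion_budget = 650_000
expected_passengers = intercept + slope * promotion_budget
print(round(expected_passengers))
```

With the fitted model in memory, `fitted1.predict(pd.DataFrame({"Promotion_Budget": [650000]}))` should give a similar figure; any small difference comes from the rounded slope used here.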

# Build a regression line to predict the passengers using Inter_metro_flight_ratio

plt.scatter(air.Inter_metro_flight_ratio,air.Passengers)

<matplotlib.collections.PathCollection at 0xb13f2b0>

import sklearn as sk

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(air[["Inter_metro_flight_ratio"]], air[["Passengers"]])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

predictions = lr.predict(air[["Inter_metro_flight_ratio"]])

import statsmodels.formula.api as sm
model = sm.ols(formula='Passengers ~ Inter_metro_flight_ratio', data=air)
fitted2 = model.fit()

fitted2.summary()

How good is my regression line?

Take an (x,y) point from data.
Imagine that we substituted x into the regression line and got a prediction $y_{pred}$
If the regression line is a good fit, then we expect $y_{pred}$=y or (y-$y_{pred}$) =0
If we repeat the same at every point x, we will get multiple error values (y-$y_{pred}$)
Some of them might be positive and some negative, so we take the square of all such errors

$$SSE = \sum(y - \hat{y})^2$$

For a good model we need SSE to be zero or close to zero
SSE on its own is hard to interpret. For example, SSE = 100 is very small when y varies in the thousands, but the same value is very large when y varies in decimals.
We have to consider the variance of y while assessing the accuracy of the regression line
Error sum of squares (SSE, sum of squares of error)
$$SSE = \sum(y - \hat{y})^2$$
Total variation in Y (SST, total sum of squares)
$$SST = \sum(y - \bar{y})^2$$
$$SST = \sum\left((y - \hat{y}) + (\hat{y} - \bar{y})\right)^2$$
$$SST = \sum(y - \hat{y})^2 + \sum(\hat{y} - \bar{y})^2 + 2\sum(y - \hat{y})(\hat{y} - \bar{y})$$
For a least squares line with an intercept, the cross term $\sum(y - \hat{y})(\hat{y} - \bar{y})$ is zero, so
$$SST = \sum(y - \hat{y})^2 + \sum(\hat{y} - \bar{y})^2$$
$$SST = SSE + \sum(\hat{y} - \bar{y})^2$$
$$SST = SSE + SSR$$
So, total variance in Y is divided into two parts,
- Variance that can’t be explained by x (error)
- Variance that can be explained by x, using regression

Explained and Unexplained Variation

So, total variance in Y is divided into two parts,
- Variance that can be explained by x, using regression
- Variance that can’t be explained by x
  $$SST = SSE + SSR$$
  $$\text{Total Sum of Squares} = \text{Sum of Squares Error} + \text{Sum of Squares Regression}$$
  $$SST = \sum(y - \bar{y})^2$$
  $$SSE = \sum(y - \hat{y})^2$$
  $$SSR = \sum(\hat{y} - \bar{y})^2$$
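The decomposition can be verified numerically. A minimal sketch on hypothetical toy data, using `np.polyfit` as a stand-in for the least squares fit:

```python
import numpy as np

# Hypothetical toy data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([3.0, 4.5, 7.1, 8.2, 11.0, 11.9])

# Least squares line with an intercept
beta_1, beta_0 = np.polyfit(x, y, 1)
y_hat = beta_0 + beta_1 * x
y_bar = y.mean()

sst = np.sum((y - y_bar) ** 2)       # total variation
sse = np.sum((y - y_hat) ** 2)       # unexplained variation
ssr = np.sum((y_hat - y_bar) ** 2)   # explained variation

print(sst, sse + ssr)
```

The two printed numbers should match up to floating point error. The identity relies on the line being the least squares fit with an intercept; for an arbitrary line it does not hold in general.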

R-Squared

A good fit will have
- SSE (Minimum or Maximum?)
- SSR (Minimum or Maximum?)
- And we know SST= SSE + SSR
- SSE/SST(Minimum or Maximum?)
- SSR/SST(Minimum or Maximum?)
The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable
The coefficient of determination is also called R-squared and is denoted as $R^2$

$$ R^2 = \frac{SSR}{SST}$$

where $0 \le R^2 \le 1$
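In a regression with a single predictor, $R^2$ equals the square of the Pearson correlation between x and y. The sketch below checks this against the correlation reported earlier for Passengers and Promotion_Budget (0.96585103) and then on hypothetical toy data.

```python
import numpy as np

# Correlation reported earlier for Passengers vs Promotion_Budget
r = 0.96585103
print(round(r ** 2, 3))  # compare with the R-squared in the summary output

# Same identity on hypothetical toy data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([3.0, 4.5, 7.1, 8.2, 11.0, 11.9])
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

r_squared = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)  # SSR / SST
r_toy = np.corrcoef(x, y)[0, 1]
print(r_squared, r_toy ** 2)
```

This shortcut only applies with one predictor; with several predictors $R^2$ has to be computed from SSR and SST.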

Lab: R- Square

What is the R-square value of Passengers vs Promotion_Budget model?
What is the R-square value of Passengers vs Inter_metro_flight_ratio

#What is the R-square value of Passengers vs Promotion_Budget model?
fitted1.summary()

#What is the R-square value of Passengers vs Inter_metro_flight_ratio

fitted2.summary()

Multiple Regression

Using multiple predictor variables instead of a single variable
With two predictors, the best fit is a plane rather than a line (a hyperplane with more than two)

Code – Multiple Regression

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(air[["Promotion_Budget", "Inter_metro_flight_ratio"]], air[["Passengers"]])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

predictions = lr.predict(air[["Promotion_Budget", "Inter_metro_flight_ratio"]])

import statsmodels.formula.api as sm
model = sm.ols(formula='Passengers ~ Promotion_Budget+Inter_metro_flight_ratio', data=air)
fitted = model.fit()
fitted.summary()

Individual Impact of variables

Look at the P-value
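The P-value is the probability of seeing a test statistic at least as extreme as the observed one if the null hypothesis (coefficient = 0) were true. It is not the probability of the hypothesis being right.
Each individual variable coefficient is tested for significance
Under the null hypothesis, the standardized coefficient estimates follow a t distribution.
Individual P values tell us about the significance of each variable
A variable is considered significant if its P-value is less than 5%. The smaller the P-value, the stronger the evidence that the variable has an impact
Note that it is possible for all the variables in a regression together to produce a great fit, and yet for very few of them to be individually significant.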

To test
$$H_0 : \beta_i = 0$$
$$H_a : \beta_i \neq 0$$

Test Statistic

$$t=\frac{b_i}{s(b_i)}$$

Reject $H_0$ if

$$t > t(\frac{\alpha}{2};n-k-1)$$

or
$$t < -t(\frac{\alpha}{2};n-k-1)$$

LAB: Multiple Regression

Build a multiple regression model to predict the number of passengers
What is R-square value
Are there any predictor variables that are not impacting the dependent variable

#Build a multiple regression model to predict the number of passengers

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(air[["Promotion_Budget", "Inter_metro_flight_ratio", "Service_Quality_Score"]], air[["Passengers"]])
predictions = lr.predict(air[["Promotion_Budget", "Inter_metro_flight_ratio", "Service_Quality_Score"]])

import statsmodels.formula.api as sm
model = sm.ols(formula='Passengers ~ Promotion_Budget+Inter_metro_flight_ratio+Service_Quality_Score', data=air)
fitted = model.fit()
fitted.summary()

What is R-square value

0.951

Are there any predictor variables that are not impacting the dependent variable

Inter_metro_flight_ratio
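This answer can be reproduced by hand from the coefficient table of the three-predictor model. The sketch below recomputes the t statistic as coefficient divided by standard error, using the values reported in the summary output. The critical value 1.99 is an assumption read from a t table for 76 degrees of freedom, not something computed here.

```python
# Values as reported in the three-predictor Air Passengers summary output:
# (coefficient, standard error)
coefficients = {
    "Inter_metro_flight_ratio": (-2003.4508, 2129.095),
    "Service_Quality_Score": (-2802.0708, 530.382),
}

# Approximate two-sided 5% critical value for n - k - 1 = 80 - 3 - 1 = 76
# degrees of freedom (read from a t table; an assumption, not computed here)
t_critical = 1.99

decisions = {}
t_stats = {}
for name, (b_i, se_b_i) in coefficients.items():
    t_stats[name] = b_i / se_b_i
    # Two-sided test: reject H0 when |t| exceeds the critical value
    decisions[name] = abs(t_stats[name]) > t_critical
    verdict = "reject H0" if decisions[name] else "fail to reject H0"
    print(name, round(t_stats[name], 3), verdict)
```

Inter_metro_flight_ratio has a t statistic well inside the critical range, which matches its reported P-value of 0.350, so there is no evidence that it affects Passengers once the other two predictors are in the model.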

Adjusted R-Squared

Is it good to have as many independent variables as possible? Nope
R-square is deceptive. R-squared never decreases when a new X variable is added to the model – True?
We need a better measure or an adjustment to the original R-squared formula.
Adjusted R squared
- Its value depends on the number of explanatory variables
- Imposes a penalty for adding additional explanatory variables
- It is usually written as $\bar{R}^2$
- Very different from $R^2$ when there are too many predictors and n is small
$$ \bar{R}^2 = R^2 - \frac{k-1}{n-k}(1-R^2)$$
where n is the number of observations and k is the number of parameters (including the intercept)
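The claim that R-squared never decreases when a variable is added can be tested directly. The following sketch uses hypothetical random data in which none of the predictors is related to the target, with n = 12 as in the lab dataset. The helper `r_squared` is written for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 12                                       # small sample, as in the lab dataset
y = rng.normal(size=n)                       # hypothetical target
noise_predictors = rng.normal(size=(n, 8))   # predictors unrelated to y

def r_squared(X, y):
    """R-squared of a least squares fit of y on X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    sse = np.sum((y - design @ coef) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - sse / sst

r2_values = []
for p in range(1, 9):
    r2 = r_squared(noise_predictors[:, :p], y)
    k = p + 1                                # parameters, including the intercept
    adj_r2 = r2 - (k - 1) / (n - k) * (1 - r2)
    r2_values.append(r2)
    print(p, round(r2, 3), round(adj_r2, 3))
```

R-squared climbs with every added column even though the columns are pure noise, while the adjusted value applies a growing penalty as k approaches n.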

LAB: Adjusted R-Square

Dataset: “Adjusted Rsquare/ Adj_Sample.csv”
Build a model to predict y using x1,x2 and x3. Note down R-Square and Adj R-Square values
Build a model to predict y using x1,x2,x3,x4,x5 and x6. Note down R-Square and Adj R-Square values
Build a model to predict y using x1,x2,x3,x4,x5,x6,x7 and x8. Note down R-Square and Adj R-Square values

adj_sample=pd.read_csv(r"C:\Amrita\Datavedi\Adjusted RSquare\Adj_Sample.csv")
adj_sample.shape

(12, 9)

adj_sample.columns.values

array(['Y', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8'], dtype=object)

#Build a model to predict y using x1,x2 and x3. Note down R-Square and Adj R-Square values 
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(adj_sample[["x1", "x2", "x3"]], adj_sample[["Y"]])
predictions = lr.predict(adj_sample[["x1", "x2", "x3"]])

import statsmodels.formula.api as sm
model = sm.ols(formula='Y ~ x1+x2+x3', data=adj_sample)
fitted1 = model.fit()
fitted1.summary()

C:Anaconda3libsite-packagesscipystatsstats.py:1557: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=12
  "anyway, n=%i" % int(n))

#Build a model to predict y using x1,x2,x3,x4,x5 and x6. Note down R-Square and Adj R-Square values 

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(adj_sample[["x1", "x2", "x3", "x4", "x5", "x6"]], adj_sample[["Y"]])
predictions = lr.predict(adj_sample[["x1", "x2", "x3", "x4", "x5", "x6"]])

import statsmodels.formula.api as sm
model = sm.ols(formula='Y ~ x1+x2+x3+x4+x5+x6', data=adj_sample)
fitted2 = model.fit()
fitted2.summary()


#Build a model to predict y using x1,x2,x3,x4,x5,x6,x7 and x8. Note down R-Square and Adj R-Square values

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(adj_sample[["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]], adj_sample[["Y"]])
predictions = lr.predict(adj_sample[["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]])

import statsmodels.formula.api as sm
model = sm.ols(formula='Y ~ x1+x2+x3+x4+x5+x6+x7+x8', data=adj_sample)
fitted3 = model.fit()
fitted3.summary()


Model	$R^2$	Adj $R^2$
Model1	0.684	0.566
Model2	0.717	0.377
Model3	0.805	0.285

R-Squared vs Adjusted R-Squared

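We have built three models on the Adj_sample data, model1, model2 and model3, with different numbers of variables. $R^2$ rises from 0.684 to 0.805 as variables are added, while adjusted $R^2$ falls from 0.566 to 0.285.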
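The adjusted values in the table can be reproduced from the adjusted R-squared formula. A minimal sketch, assuming n = 12 rows (the shape reported for Adj_Sample.csv) and counting k as the number of predictors plus one for the intercept; the function name `adjusted_r_squared` is chosen for this example.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R-squared; k counts all parameters, including the intercept."""
    return r2 - (k - 1) / (n - k) * (1 - r2)

n = 12  # rows in Adj_Sample.csv
# (R-squared from the table, number of predictors in the model)
models = {"Model1": (0.684, 3), "Model2": (0.717, 6), "Model3": (0.805, 8)}

adjusted = {}
for name, (r2, predictors) in models.items():
    adjusted[name] = adjusted_r_squared(r2, n, k=predictors + 1)
    print(name, round(adjusted[name], 3))
```

The results agree with the table up to rounding of the reported R-squared values, which confirms that k in the formula includes the intercept.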

LAB: Multiple Regression- issues

Import Final Exam Score data
Build a model to predict final score using the rest of the variables.
How are Sem2_Math & Final score related? As Sem2_Math score increases, what happens to Final score?
Remove “Sem1_Math” variable from the model and rebuild the model
Is there any change in R square or Adj R square
How are Sem2_Math & Final score related now? As Sem2_Math score increases, what happens to Final score?
Draw a scatter plot between Sem1_Math & Sem2_Math
Find the correlation between Sem1_Math & Sem2_Math

#Import Final Exam Score data
final_exam=pd.read_csv(r"C:\Amrita\Datavedi\Final Exam\Final Exam Score.csv")

#Size of the data
final_exam.shape

(24, 5)

#Variable names
final_exam.columns

Index(['Sem1_Science', 'Sem2_Science', 'Sem1_Math', 'Sem2_Math',
       'Final_exam_marks'],
      dtype='object')

#Build a model to predict final score using the rest of the variables.
from sklearn.linear_model import LinearRegression
lr1 = LinearRegression()
lr1.fit(final_exam[["Sem1_Science", "Sem2_Science", "Sem1_Math", "Sem2_Math"]], final_exam[["Final_exam_marks"]])
predictions1 = lr1.predict(final_exam[["Sem1_Science", "Sem2_Science", "Sem1_Math", "Sem2_Math"]])

import statsmodels.formula.api as sm
model1 = sm.ols(formula='Final_exam_marks ~ Sem1_Science+Sem2_Science+Sem1_Math+Sem2_Math', data=final_exam)
fitted1 = model1.fit()
fitted1.summary()

fitted1.rsquared

0.98960765475687229

How are Sem2_Math & Final score related? As Sem2_Math score increases, what happens to Final score?

As Sem2_Math score increases Final score decreases

#Remove "Sem1_Math" variable from the model and rebuild the model
from sklearn.linear_model import LinearRegression
lr2 = LinearRegression()
lr2.fit(final_exam[["Sem1_Science", "Sem2_Science", "Sem2_Math"]], final_exam[["Final_exam_marks"]])
predictions2 = lr2.predict(final_exam[["Sem1_Science", "Sem2_Science", "Sem2_Math"]])

import statsmodels.formula.api as sm
model2 = sm.ols(formula='Final_exam_marks ~ Sem1_Science+Sem2_Science+Sem2_Math', data=final_exam)
fitted2 = model2.fit()
fitted2.summary()

Is there any change in R square or Adj R square

Model	$R^2$	Adj $R^2$
model1	0.990	0.987
model2	0.981	0.978

How are Sem2_Math & Final score related now? As Sem2_Math score increases, what happens to Final score?

As Sem2_Math score increases Final score also increases.

#Draw a scatter plot between Sem1_Math & Sem2_Math

import matplotlib.pyplot as plt
%matplotlib inline 
plt.scatter(final_exam.Sem1_Math,final_exam.Sem2_Math)

<matplotlib.collections.PathCollection at 0xb2cf0f0>

#Find the correlation between Sem1_Math & Sem2_Math 
np.corrcoef(final_exam.Sem1_Math,final_exam.Sem2_Math)

array([[ 1.       ,  0.9924948],
       [ 0.9924948,  1.       ]])

Multicollinearity

Multiple regression is wonderful, in that it allows you to consider the effect of multiple variables simultaneously.
Multiple regression is extremely unpleasant, because it allows you to consider the effect of multiple variables simultaneously.
The relationships between the explanatory variables are the key to understanding multiple regression.
Multicollinearity (or intercorrelation) exists when at least some of the predictor variables are correlated among themselves.
The parameter estimates will have inflated variance in the presence of multicollinearity
Sometimes the signs of the parameter estimates change, as the Sem2_Math coefficient did in the Final Exam Score lab
If the relation between the independent variables grows really strong, then the variance of the parameter estimates tends to infinity. Can you prove it?
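A numerical illustration of the last point, not a proof: the variance of the least squares estimates is proportional to the diagonal of the inverse of X'X. The sketch below builds two hypothetical predictors whose sample correlation is exactly `rho` and tracks that diagonal entry as `rho` approaches 1. All names and data are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50

# Two hypothetical building blocks: centered, unit length, exactly orthogonal
x1 = rng.normal(size=n)
x1 -= x1.mean()
x1 /= np.linalg.norm(x1)
z = rng.normal(size=n)
z -= z.mean()
z -= (z @ x1) * x1
z /= np.linalg.norm(z)

inflation = {}
for rho in [0.0, 0.5, 0.9, 0.99, 0.999]:
    # x2 has sample correlation exactly rho with x1
    x2 = rho * x1 + np.sqrt(1 - rho ** 2) * z
    X = np.column_stack([np.ones(n), x1, x2])
    # Var(beta_hat) = sigma^2 * inv(X'X); take the diagonal entry for x1
    inflation[rho] = np.linalg.inv(X.T @ X)[1, 1]
    print(rho, round(inflation[rho], 2))
```

In this standardized setup the entry equals 1/(1 - rho^2), so it grows without bound as the two predictors become perfectly correlated.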

Multicollinearity Detection

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 $$

Build a model $X_1$ vs $X_2$ $X_3$ $X_4$ find $R^2$, say R1
Build a model $X_2$ vs $X_1$ $X_3$ $X_4$ find $R^2$, say R2
Build a model $X_3$ vs $X_1$ $X_2$ $X_4$ find $R^2$, say R3
Build a model $X_4$ vs $X_1$ $X_2$ $X_3$ find $R^2$, say R4
For example, if R3 is 95% then we don’t really need $X_3$ in the model,
since it can be explained as a linear combination of the other three
For each variable we find its individual $R^2$.
$\frac{1}{1-R^2}$ is called the VIF (variance inflation factor).
VIF option in SAS automatically calculates VIF values for each of the predictor variables

$R^2$	40%	50%	60%	70%	75%	80%	90%
VIF	1.67	2.00	2.50	3.33	4.00	5.00	10.00
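The table values follow directly from the VIF formula. A short sketch that recomputes them (the function name `vif` is chosen for this example):

```python
def vif(r_squared):
    """Variance inflation factor for a predictor whose auxiliary regression has this R-squared."""
    return 1.0 / (1.0 - r_squared)

r_squared_values = [0.40, 0.50, 0.60, 0.70, 0.75, 0.80, 0.90]
vif_values = [round(vif(r2), 2) for r2 in r_squared_values]
print(vif_values)
```

The relationship is strongly non-linear: moving from 80% to 90% doubles the VIF, which is why thresholds such as 5 or 10 are often used as rules of thumb for flagging a predictor.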

LAB: Multicollinearity

Identify the Multicollinearity in the Final Exam Score model
Drop the variable one by one to reduce the multicollinearity
Identify and eliminate the Multicollinearity in the Air passengers model

from sklearn.linear_model import LinearRegression
lr1 = LinearRegression()
lr1.fit(final_exam[["Sem1_Science", "Sem2_Science", "Sem1_Math", "Sem2_Math"]], final_exam[["Final_exam_marks"]])
predictions1 = lr1.predict(final_exam[["Sem1_Science", "Sem2_Science", "Sem1_Math", "Sem2_Math"]])

import statsmodels.formula.api as sm
model1 = sm.ols(formula='Final_exam_marks ~ Sem1_Science+Sem2_Science+Sem1_Math+Sem2_Math', data=final_exam)
fitted1 = model1.fit()
fitted1.summary()

fitted1.summary2()

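#Code for VIF Calculation

#Writing a function to calculate the VIF values
#Note: all predictor columns must be numeric; string columns make patsy raise an error

def vif_cal(input_data, dependent_col):
    #Keep only the predictors
    x_vars=input_data.drop([dependent_col], axis=1)
    xvar_names=x_vars.columns
    for i in range(0,xvar_names.shape[0]):
        #Regress the i-th predictor on all the other predictors
        y=x_vars[xvar_names[i]]
        x=x_vars[xvar_names.drop(xvar_names[i])]
        #The formula picks up the local variables y and x
        rsq=sm.ols(formula="y~x", data=x_vars).fit().rsquared
        #VIF = 1/(1-R^2)
        vif=round(1/(1-rsq),2)
        print (xvar_names[i], " VIF = " , vif)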

#Calculating VIF values using that function
vif_cal(input_data=final_exam, dependent_col="Final_exam_marks")

Sem1_Science  VIF =  7.4
Sem2_Science  VIF =  5.4
Sem1_Math  VIF =  68.79
Sem2_Math  VIF =  68.01

import statsmodels.formula.api as sm
model2 = sm.ols(formula='Final_exam_marks ~ Sem1_Science+Sem2_Science+Sem2_Math', data=final_exam)
fitted2 = model2.fit()
fitted2.summary()

vif_cal(input_data=final_exam.drop(["Sem1_Math"], axis=1), dependent_col="Final_exam_marks")

Sem1_Science  VIF =  7.22
Sem2_Science  VIF =  5.38
Sem2_Math  VIF =  4.81

vif_cal(input_data=final_exam.drop(["Sem1_Math","Sem1_Science"], axis=1), dependent_col="Final_exam_marks")

Sem2_Science  VIF =  3.4
Sem2_Math  VIF =  3.4

#Identify and eliminate the Multicollinearity  in the Air passengers model
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(air[["Promotion_Budget", "Inter_metro_flight_ratio", "Service_Quality_Score"]], air[["Passengers"]])
predictions = lr.predict(air[["Promotion_Budget", "Inter_metro_flight_ratio", "Service_Quality_Score"]])

import statsmodels.formula.api as sm
model = sm.ols(formula='Passengers ~ Promotion_Budget+Inter_metro_flight_ratio+Service_Quality_Score', data=air)
fitted = model.fit()
fitted.summary()

#Calculating VIF values using that function
#Calling vif_cal on the full air data fails with
#"PatsyError: categorical data cannot be >1-dimensional", because air also holds
#YES/NO string columns. Restrict the input to the numeric variables used in the model.

vif_cal(input_data=air[["Passengers", "Promotion_Budget", "Inter_metro_flight_ratio", "Service_Quality_Score"]], dependent_col="Passengers")

Lab: Multiple Regression

Dataset: Webpage_Product_Sales/Webpage_Product_Sales.csv
Build a model to predict sales using rest of the variables
Drop the less impacting variables based on p-values.
Is there any multicollinearity?
How many variables are there in the final model?
What is the R-squared of the final model?
Can you improve the model using same data and variables?

import pandas as pd 
Webpage_Product_Sales=pd.read_csv(r"C:\Amrita\Datavedi\Webpage_Product_Sales\Webpage_Product_Sales.csv")
Webpage_Product_Sales.shape

(675, 12)

Webpage_Product_Sales.columns

Index(['ID', 'DayofMonth', 'Weekday', 'Month', 'Social_Network_Ref_links',
       'Online_Ad_Paid_ref_links', 'Clicks_From_Serach_Engine',
       'Special_Discount', 'Holiday', 'Server_Down_time_Sec', 'Web_UI_Score',
       'Sales'],
      dtype='object')

import statsmodels.formula.api as sm
model1 = sm.ols(formula='Sales ~ Web_UI_Score+Server_Down_time_Sec+Holiday+Special_Discount+Clicks_From_Serach_Engine+Online_Ad_Paid_ref_links+Social_Network_Ref_links+Month+Weekday+DayofMonth', data=Webpage_Product_Sales)
fitted1 = model1.fit()
fitted1.summary()

#VIF
vif_cal(Webpage_Product_Sales,"Sales")

ID  VIF =  1.18
DayofMonth  VIF =  1.01
Weekday  VIF =  1.0
Month  VIF =  1.19
Social_Network_Ref_links  VIF =  1.02
Online_Ad_Paid_ref_links  VIF =  12.13
Clicks_From_Serach_Engine  VIF =  12.08
Special_Discount  VIF =  1.37
Holiday  VIF =  1.38
Server_Down_time_Sec  VIF =  1.02
Web_UI_Score  VIF =  1.02

##Dropped Clicks_From_Serach_Engine based on VIF

import statsmodels.formula.api as sm
model2 = sm.ols(formula='Sales ~ Web_UI_Score+Server_Down_time_Sec+Holiday+Special_Discount+Online_Ad_Paid_ref_links+Social_Network_Ref_links+Month+Weekday+DayofMonth', data=Webpage_Product_Sales)
fitted2 = model2.fit()
fitted2.summary()

#VIF for the updated model
vif_cal(Webpage_Product_Sales.drop(["Clicks_From_Serach_Engine"],axis=1),"Sales")

ID  VIF =  1.18
DayofMonth  VIF =  1.01
Weekday  VIF =  1.0
Month  VIF =  1.19
Social_Network_Ref_links  VIF =  1.01
Online_Ad_Paid_ref_links  VIF =  1.02
Special_Discount  VIF =  1.36
Holiday  VIF =  1.38
Server_Down_time_Sec  VIF =  1.02
Web_UI_Score  VIF =  1.02

##Drop the less impacting variables based on p-values.
##Dropped Web_UI_Score based on P-value

import statsmodels.formula.api as sm
model3 = sm.ols(formula='Sales ~ Server_Down_time_Sec+Holiday+Special_Discount+Online_Ad_Paid_ref_links+Social_Network_Ref_links+Month+Weekday+DayofMonth', data=Webpage_Product_Sales)
fitted3 = model3.fit()
fitted3.summary()

#How many variables are there in the final model?
8

8

#What is the R-squared of the final model?
fitted3.rsquared

0.8178742020411971

Interaction Terms

Adding interaction terms might help in improving the prediction accuracy of the model.
The addition of interaction terms needs prior knowledge of the dataset and variables
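An interaction term is simply the product of two predictors, added as an extra column. The sketch below uses hypothetical data in which the effect of `x1` depends on `x2`, and compares the fit with and without the product column. The helper `r_squared` and all data are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Hypothetical data in which the effect of x1 depends on x2
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 0.5 * x2 + 3.0 * x1 * x2 + rng.normal(scale=0.5, size=n)

def r_squared(columns, y):
    """R-squared of a least squares fit of y on the given columns plus an intercept."""
    design = np.column_stack([np.ones(len(y))] + columns)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return 1 - np.sum((y - design @ coef) ** 2) / np.sum((y - y.mean()) ** 2)

r2_main_effects = r_squared([x1, x2], y)
# The interaction term is simply the product of the two predictors
r2_with_interaction = r_squared([x1, x2, x1 * x2], y)

print(round(r2_main_effects, 3), round(r2_with_interaction, 3))
```

The gain is large here only because the data were generated with a strong interaction. On real data the improvement depends on whether such an effect exists, which is why prior knowledge of the variables matters.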

LAB: Interaction Terms

Add a few interaction terms to the above web product sales model and see whether the accuracy improves

import statsmodels.formula.api as sm
model4 = sm.ols(formula='Sales ~ Server_Down_time_Sec+Holiday+Special_Discount+Online_Ad_Paid_ref_links+Social_Network_Ref_links+Month+Weekday+DayofMonth+Holiday*Weekday', data=Webpage_Product_Sales)
fitted4 = model4.fit()
fitted4.summary()

Conclusion – Regression

Try adding polynomial and interaction terms to your regression line. Sometimes they work like a charm.
Adjusted R-squared is a good measure of training/in-time sample error. We can’t be sure about the final model performance based on it alone; we may have to perform cross-validation to get an idea of the testing error.
Outliers can influence the regression line, so we need to clean the data before building the regression line.
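Cross-validation, mentioned above, can be sketched in a few lines. This is a minimal k-fold example on hypothetical data with a straight-line model; the function `k_fold_mse` and its parameters are assumptions for illustration, not part of the labs.

```python
import numpy as np

def k_fold_mse(x, y, k=5, seed=0):
    """Mean squared error on each held-out fold for a straight-line fit."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for test_idx in folds:
        # Train on every row except the held-out fold
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        slope, intercept = np.polyfit(x[train_mask], y[train_mask], 1)
        predictions = intercept + slope * x[test_idx]
        scores.append(np.mean((y[test_idx] - predictions) ** 2))
    return scores

# Hypothetical data: a linear signal plus noise
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=100)

fold_mse = k_fold_mse(x, y, k=5)
print(np.mean(fold_mse))
```

Each fold's error is measured on rows the model never saw during fitting, so the average is an estimate of testing error rather than training error.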

	Week_num	Passengers	Promotion_Budget	Service_Quality_Score	Holiday_week	Delayed_Cancelled_flight_ind	Inter_metro_flight_ratio	Bad_Weather_Ind	Technical_issues_ind
0	1	37824	517356	4.00000	NO	NO	0.70	YES	YES
1	2	43936	646086	2.67466	NO	YES	0.80	YES	YES
2	3	42896	638330	3.29473	NO	NO	0.90	NO	NO
3	4	35792	506492	3.85684	NO	NO	0.40	NO	NO
4	5	38624	609658	3.90757	NO	NO	0.87	NO	YES

Dep. Variable:	Passengers	R-squared:	0.933
Model:	OLS	Adj. R-squared:	0.932
Method:	Least Squares	F-statistic:	1084.
Date:	Wed, 27 Jul 2016	Prob (F-statistic):	1.66e-47
Time:	11:48:26	Log-Likelihood:	-751.34
No. Observations:	80	AIC:	1507.
Df Residuals:	78	BIC:	1511.
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[95.0% Conf. Int.]
Intercept	1259.6058	1361.071	0.925	0.358	-1450.078 3969.290
Promotion_Budget	0.0695	0.002	32.923	0.000	0.065 0.074

Omnibus:	26.624	Durbin-Watson:	1.831
Prob(Omnibus):	0.000	Jarque-Bera (JB):	5.188
Skew:	-0.128	Prob(JB):	0.0747
Kurtosis:	1.779	Cond. No.	2.67e+06

Dep. Variable:	Passengers	R-squared:	0.242
Model:	OLS	Adj. R-squared:	0.232
Method:	Least Squares	F-statistic:	24.90
Date:	Wed, 27 Jul 2016	Prob (F-statistic):	3.58e-06
Time:	11:48:27	Log-Likelihood:	-848.30
No. Observations:	80	AIC:	1701.
Df Residuals:	78	BIC:	1705.
Df Model:	1
Covariance Type:	nonrobust

Omnibus:	10.172	Durbin-Watson:	1.385
Prob(Omnibus):	0.006	Jarque-Bera (JB):	10.098
Skew:	0.822	Prob(JB):	0.00641
Kurtosis:	3.573	Cond. No.	9.48

Omnibus:	26.259	Durbin-Watson:	1.800
Prob(Omnibus):	0.000	Jarque-Bera (JB):	5.075
Skew:	-0.096	Prob(JB):	0.0791
Kurtosis:	1.781	Cond. No.	5.25e+06

Omnibus:	6.902	Durbin-Watson:	2.312
Prob(Omnibus):	0.032	Jarque-Bera (JB):	2.759
Skew:	-0.051	Prob(JB):	0.252
Kurtosis:	2.096	Cond. No.	8.22e+06

Omnibus:	1.113	Durbin-Watson:	1.978
Prob(Omnibus):	0.573	Jarque-Bera (JB):	0.763
Skew:	-0.562	Prob(JB):	0.683
Kurtosis:	2.489	Cond. No.	6.00e+03

Omnibus:	0.426	Durbin-Watson:	2.065
Prob(Omnibus):	0.808	Jarque-Bera (JB):	0.434
Skew:	-0.347	Prob(JB):	0.805
Kurtosis:	2.378	Cond. No.	1.98e+04

Omnibus:	1.329	Durbin-Watson:	1.594
Prob(Omnibus):	0.514	Jarque-Bera (JB):	0.875
Skew:	-0.339	Prob(JB):	0.646
Kurtosis:	1.863	Cond. No.	7.85e+04

Dep. Variable:	Final_exam_marks	R-squared:	0.990
Model:	OLS	Adj. R-squared:	0.987
Method:	Least Squares	F-statistic:	452.3
Date:	Wed, 27 Jul 2016	Prob (F-statistic):	1.50e-18
Time:	11:48:28	Log-Likelihood:	-38.099
No. Observations:	24	AIC:	86.20
Df Residuals:	19	BIC:	92.09
Df Model:	4
Covariance Type:	nonrobust

Omnibus:	6.343	Durbin-Watson:	1.863
Prob(Omnibus):	0.042	Jarque-Bera (JB):	4.332
Skew:	0.973	Prob(JB):	0.115
Kurtosis:	3.737	Cond. No.	1.20e+03

Omnibus:	5.869	Durbin-Watson:	2.424
Prob(Omnibus):	0.053	Jarque-Bera (JB):	3.793
Skew:	0.864	Prob(JB):	0.150
Kurtosis:	3.898	Cond. No.	1.03e+03

	Coef.	Std.Err.	t	P>|t|	[0.025	0.975]
Intercept	-1.6226	1.9987	-0.8118	0.4269	-5.8060	2.5607
Sem1_Science	0.1738	0.0628	2.7668	0.0123	0.0423	0.3052
Sem2_Science	0.2785	0.0518	5.3795	0.0000	0.1702	0.3869
Sem1_Math	0.7890	0.1971	4.0023	0.0008	0.3764	1.2016
Sem2_Math	-0.2063	0.1914	-1.0782	0.2944	-0.6069	0.1942

	coef	std err	t	P>|t|	[95.0% Conf. Int.]
Intercept	2017.7724	1624.803	1.242	0.218	-1217.624 5253.169
Promotion_Budget	0.0707	0.002	28.297	0.000	0.066 0.076
Inter_metro_flight_ratio	-2121.5208	2473.189	-0.858	0.394	-7046.268 2803.227

	coef	std err	t	P>|t|	[95.0% Conf. Int.]
Intercept	1.921e+04	3542.694	5.424	0.000	1.22e+04 2.63e+04
Promotion_Budget	0.0555	0.004	15.476	0.000	0.048 0.063
Inter_metro_flight_ratio	-2003.4508	2129.095	-0.941	0.350	-6243.912 2237.010
Service_Quality_Score	-2802.0708	530.382	-5.283	0.000	-3858.419 -1745.723

	coef	std err	t	P>|t|	[95.0% Conf. Int.]
Intercept	-2.8798	1.163	-2.477	0.038	-5.561 -0.199
x1	-0.4894	0.370	-1.324	0.222	-1.342 0.363
x2	0.0029	0.001	2.586	0.032	0.000 0.005
x3	0.4572	0.176	2.595	0.032	0.051 0.864

	coef	std err	t	P>|t|	[95.0% Conf. Int.]
Intercept	-5.3751	4.687	-1.147	0.303	-17.423 6.673
x1	-0.6697	0.537	-1.247	0.268	-2.050 0.711
x2	0.0030	0.002	1.956	0.108	-0.001 0.007
x3	0.5063	0.249	2.036	0.097	-0.133 1.146
x4	0.0376	0.084	0.449	0.672	-0.178 0.253
x5	0.0436	0.169	0.258	0.806	-0.390 0.478
x6	0.0516	0.088	0.588	0.582	-0.174 0.277

Dep. Variable:	Sales	R-squared:	0.818
Model:	OLS	Adj. R-squared:	0.815
Method:	Least Squares	F-statistic:	298.4
Date:	Wed, 27 Jul 2016	Prob (F-statistic):	5.54e-238
Time:	12:45:36	Log-Likelihood:	-6456.7
No. Observations:	675	AIC:	1.294e+04
Df Residuals:	664	BIC:	1.299e+04
Df Model:	10
Covariance Type:	nonrobust

Omnibus:	40.759	Durbin-Watson:	1.356
Prob(Omnibus):	0.000	Jarque-Bera (JB):	102.136
Skew:	0.297	Prob(JB):	6.63e-23
Kurtosis:	4.811	Cond. No.	2.57e+04

Omnibus:	40.826	Durbin-Watson:	1.356
Prob(Omnibus):	0.000	Jarque-Bera (JB):	102.313
Skew:	0.298	Prob(JB):	6.07e-23
Kurtosis:	4.812	Cond. No.	1.94e+04

Omnibus:	41.049	Durbin-Watson:	1.352
Prob(Omnibus):	0.000	Jarque-Bera (JB):	103.243
Skew:	0.298	Prob(JB):	3.81e-23
Kurtosis:	4.821	Cond. No.	1.31e+04

Omnibus:	7.552	Durbin-Watson:	0.867
Prob(Omnibus):	0.023	Jarque-Bera (JB):	7.305
Skew:	0.219	Prob(JB):	0.0259
Kurtosis:	2.740	Cond. No.	2.32e+04
