Link to the previous post: https://course.dvanalyticsmds.com/204-1-7-adjusted-r-squared-in-python/

In the last post of this session, we covered the basics of multiple linear regression. In this post, we will practice building a multiple regression model and look at an issue that can arise with it.

Practice: Multiple Regression Issues

Import Final Exam Score data
Build a model to predict final score using the rest of the variables.
How are Sem2_Math & Final score related? As Sem2_Math score increases, what happens to Final score?
Remove “Sem1_Math” variable from the model and rebuild the model
Is there any change in R square or Adj R square?
How are Sem2_Math & Final score related now? As Sem2_Math score increases, what happens to Final score?
Draw a scatter plot between Sem1_Math & Sem2_Math
Find the correlation between Sem1_Math & Sem2_Math

In [34]:

#Import Final Exam Score data
import pandas as pd
final_exam=pd.read_csv("datasets\\Final Exam\\Final Exam Score.csv")

In [35]:

#Size of the data
final_exam.shape

Out[35]:

(24, 5)

In [36]:

#Variable names
final_exam.columns

Out[36]:

Index(['Sem1_Science', 'Sem2_Science', 'Sem1_Math', 'Sem2_Math',
       'Final_exam_marks'],
      dtype='object')

In [37]:

#Build a model to predict final score using the rest of the variables.
from sklearn.linear_model import LinearRegression
predictors1 = ["Sem1_Science", "Sem2_Science", "Sem1_Math", "Sem2_Math"]
lr1 = LinearRegression()
lr1.fit(final_exam[predictors1], final_exam[["Final_exam_marks"]])
predictions1 = lr1.predict(final_exam[predictors1])

#Fit the same model with statsmodels to get the full regression summary
import statsmodels.formula.api as sm
model1 = sm.ols(formula='Final_exam_marks ~ Sem1_Science+Sem2_Science+Sem1_Math+Sem2_Math', data=final_exam)
fitted1 = model1.fit()
fitted1.summary()

Out[37]:

OLS Regression Results
Dep. Variable:	Final_exam_marks	R-squared:	0.990
Model:	OLS	Adj. R-squared:	0.987
Method:	Least Squares	F-statistic:	452.3
Date:	Wed, 27 Jul 2016	Prob (F-statistic):	1.50e-18
Time:	11:48:28	Log-Likelihood:	-38.099
No. Observations:	24	AIC:	86.20
Df Residuals:	19	BIC:	92.09
Df Model:	4
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[95.0% Conf. Int.]
Intercept	-1.6226	1.999	-0.812	0.427	-5.806 2.561
Sem1_Science	0.1738	0.063	2.767	0.012	0.042 0.305
Sem2_Science	0.2785	0.052	5.379	0.000	0.170 0.387
Sem1_Math	0.7890	0.197	4.002	0.001	0.376 1.202
Sem2_Math	-0.2063	0.191	-1.078	0.294	-0.607 0.194

Omnibus:	6.343	Durbin-Watson:	1.863
Prob(Omnibus):	0.042	Jarque-Bera (JB):	4.332
Skew:	0.973	Prob(JB):	0.115
Kurtosis:	3.737	Cond. No.	1.20e+03

In [38]:

fitted1.rsquared

Out[38]:

0.98960765475687229

How are Sem2_Math & Final score related? As Sem2_Math score increases, what happens to Final score?

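In this model the coefficient of Sem2_Math is negative (-0.2063), so as Sem2_Math score increases, the predicted Final score decreases when the other scores are held fixed. The p-value of 0.294 means this effect is not statistically significant.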

In [39]:

#Remove "Sem1_Math" variable from the model and rebuild the model
from sklearn.linear_model import LinearRegression
predictors2 = ["Sem1_Science", "Sem2_Science", "Sem2_Math"]
lr2 = LinearRegression()
lr2.fit(final_exam[predictors2], final_exam[["Final_exam_marks"]])
predictions2 = lr2.predict(final_exam[predictors2])

#Fit the same model with statsmodels to get the full regression summary
import statsmodels.formula.api as sm
model2 = sm.ols(formula='Final_exam_marks ~ Sem1_Science+Sem2_Science+Sem2_Math', data=final_exam)
fitted2 = model2.fit()
fitted2.summary()

Out[39]:

OLS Regression Results
Dep. Variable:	Final_exam_marks	R-squared:	0.981
Model:	OLS	Adj. R-squared:	0.978
Method:	Least Squares	F-statistic:	341.4
Date:	Wed, 27 Jul 2016	Prob (F-statistic):	2.44e-17
Time:	11:48:29	Log-Likelihood:	-45.436
No. Observations:	24	AIC:	98.87
Df Residuals:	20	BIC:	103.6
Df Model:	3
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[95.0% Conf. Int.]
Intercept	-2.3986	2.632	-0.911	0.373	-7.889 3.092
Sem1_Science	0.2130	0.082	2.595	0.017	0.042 0.384
Sem2_Science	0.2686	0.068	3.925	0.001	0.126 0.411
Sem2_Math	0.5320	0.067	7.897	0.000	0.391 0.673

Omnibus:	5.869	Durbin-Watson:	2.424
Prob(Omnibus):	0.053	Jarque-Bera (JB):	3.793
Skew:	0.864	Prob(JB):	0.150
Kurtosis:	3.898	Cond. No.	1.03e+03

Is there any change in R square or Adj R square?

Model	R-squared	Adj. R-squared
model1	0.990	0.987
model2	0.981	0.978

Both values drop slightly, by about 0.009, after Sem1_Math is removed.
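As a quick check on the numbers above, the adjusted R-squared can be recomputed by hand from R-squared, the number of observations (24) and the number of predictors (4 and 3). This is only a sketch: the helper name adjusted_r2 is made up for illustration, and the R-squared of the second model is taken as the rounded 0.981 from the summary, so its result is approximate.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# model1: R-squared from fitted1.rsquared, 24 observations, 4 predictors
adj1 = adjusted_r2(0.98960765475687229, 24, 4)

# model2: rounded R-squared from the summary table, 24 observations, 3 predictors
adj2 = adjusted_r2(0.981, 24, 3)

print(round(adj1, 3), round(adj2, 3))
```

Rounded to three decimals, both results agree with the Adj. R-squared values reported in the two summaries.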

How are Sem2_Math & Final score related now? As Sem2_Math score increases, what happens to Final score?

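Sem2_Math now has a positive coefficient (0.5320, p-value 0.000): as Sem2_Math score increases, Final score also increases. The sign has flipped compared with the first model, which also included Sem1_Math.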
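The change in the Sem2_Math coefficient between the two models can be explored on made-up data. The sketch below does not use the Final Exam Score file: sem1, sem2 and final are synthetic stand-ins, the helper ols_coefficients is hypothetical, and plain NumPy least squares is swapped in for sklearn and statsmodels to keep the example self-contained. It fits one model with both highly correlated predictors and one with only the second predictor.

```python
import numpy as np

def ols_coefficients(predictors, target):
    """Return least-squares coefficients, intercept first."""
    design = np.column_stack([np.ones(len(target))] + list(predictors))
    coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coefs

# Synthetic stand-in data (an assumption, not the real exam scores)
rng = np.random.default_rng(0)
n = 24
sem1 = rng.uniform(40, 100, n)            # plays the role of Sem1_Math
sem2 = sem1 + rng.normal(0, 2, n)         # almost a copy of sem1
final = 0.8 * sem1 + rng.normal(0, 3, n)  # target driven by sem1

corr = np.corrcoef(sem1, sem2)[0, 1]
full = ols_coefficients([sem1, sem2], final)  # both predictors
reduced = ols_coefficients([sem2], final)     # sem1 removed

print("correlation between sem1 and sem2:", corr)
print("sem2 coefficient with sem1 in the model:", full[2])
print("sem2 coefficient without sem1:", reduced[1])
```

With sem1 in the model, the sem2 coefficient is typically small and can come out with either sign; once sem1 is dropped, sem2 takes over its effect and the coefficient becomes clearly positive. The exact values depend on the random seed.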

In [40]:

#Draw a scatter plot between Sem1_Math & Sem2_Math

import matplotlib.pyplot as plt
%matplotlib inline 
plt.scatter(final_exam.Sem1_Math,final_exam.Sem2_Math)

Out[40]:

<matplotlib.collections.PathCollection at 0xb2cf0f0>

In [41]:

#Find the correlation between Sem1_Math & Sem2_Math
import numpy as np
np.corrcoef(final_exam.Sem1_Math,final_exam.Sem2_Math)

Out[41]:

array([[ 1.       ,  0.9924948],
       [ 0.9924948,  1.       ]])
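np.corrcoef returns a 2 x 2 matrix, and the correlation itself is the off-diagonal entry. Squaring it gives the share of variance that one variable explains in the other in a simple linear fit. The sketch below works that out for the value in Out[41]; the second half uses small made-up arrays (not the real exam data) only to show how to pull the single number out of the matrix.

```python
import numpy as np

# Correlation between Sem1_Math and Sem2_Math, copied from Out[41]
r = 0.9924948
shared_variance = r ** 2
print(round(shared_variance, 3))

# Hypothetical stand-in arrays, used only to illustrate the indexing
sem1 = np.array([50.0, 60.0, 70.0, 80.0, 90.0])
sem2 = np.array([52.0, 59.0, 73.0, 79.0, 91.0])
corr_matrix = np.corrcoef(sem1, sem2)
corr_value = corr_matrix[0, 1]  # same as corr_matrix[1, 0]
print(corr_value)
```

A squared correlation of about 0.985 means the two math scores carry almost the same information.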

The next post covers the issue of multicollinearity in Python.

Link to the next post : https://course.dvanalyticsmds.com/204-1-9-issue-of-multicollinearity-in-python/

23rd January 2018

204.1.8 Practice : Multiple Regression Issues

Practicing Multi Variable Linear Regression model.

Dv Analytics
