
204.1.7 Adjusted R-squared in Python

R-squared and adjusted R-squared for regression with multiple variables.

Link to the previous post : https://course.dvanalyticsmds.com/204-1-6-multiple-regression-in-python/

Adjusted R-Squared

  • Is it good to have as many independent variables as possible? No.
  • R-squared can be deceptive: it never decreases when a new X variable is added to the model, even if that variable carries no real information.
  • We need a better measure, or an adjustment to the original R-squared formula.
  • Adjusted R-squared
    • Its value depends on the number of explanatory variables
    • Imposes a penalty for adding additional explanatory variables
    • It is usually written as [latex]\bar{R}^2[/latex]
    • Can be very different from [latex]R^2[/latex] when there are many predictors and n is small

[latex]\bar{R}^2 = R^2 - \frac{k-1}{n-k}(1-R^2)[/latex]

 

where n is the number of observations and k is the number of parameters (including the intercept).

The R-squared value increases whenever we add independent variables. Adjusted R-squared increases only if a significant variable is added. Look at the example below: as we keep adding new variables, R-squared keeps increasing, but adjusted R-squared may not.
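As a quick sanity check of the formula, here is a minimal sketch of the calculation in Python. The helper name adjusted_r2 is only for illustration; the values plugged in are the Model 1 figures reported further below (R-squared = 0.684, n = 12 observations, k = 4 parameters including the intercept).

# Minimal sketch of the adjusted R-squared formula above
# n = number of observations, k = number of parameters (including the intercept)
def adjusted_r2(r2, n, k):
    return r2 - (k - 1) / (n - k) * (1 - r2)

print(adjusted_r2(0.684, 12, 4))  # ~0.566 (up to rounding), matching Adj. R-squared of Model 1 below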

Practice : Adjusted R-Square

  • Dataset: “Adjusted Rsquare/ Adj_Sample.csv”
  • Build a model to predict y using x1,x2 and x3. Note down R-Square and Adj R-Square values
  • Build a model to predict y using x1,x2,x3,x4,x5 and x6. Note down R-Square and Adj R-Square values
  • Build a model to predict y using x1,x2,x3,x4,x5,x6,x7 and x8. Note down R-Square and Adj R-Square values
In [26]:
adj_sample=pd.read_csv("datasets\\Adjusted RSquare\\Adj_Sample.csv")
adj_sample.shape
Out[26]:
(12, 9)
In [27]:
adj_sample.columns.values
Out[27]:
array(['Y', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8'], dtype=object)
In [28]:
#Build a model to predict y using x1,x2 and x3. Note down R-Square and Adj R-Square values
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(adj_sample[["x1", "x2", "x3"]], adj_sample[["Y"]])
predictions = lr.predict(adj_sample[["x1", "x2", "x3"]])
In [29]:
import statsmodels.formula.api as sm
model = sm.ols(formula='Y ~ x1+x2+x3', data=adj_sample)
fitted1 = model.fit()
fitted1.summary()
C:\Anaconda3\lib\site-packages\scipy\stats\stats.py:1557: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=12
  "anyway, n=%i" % int(n))
Out[29]:
OLS Regression Results
Dep. Variable: Y R-squared: 0.684
Model: OLS Adj. R-squared: 0.566
Method: Least Squares F-statistic: 5.785
Date: Wed, 27 Jul 2016 Prob (F-statistic): 0.0211
Time: 11:48:28 Log-Likelihood: -10.430
No. Observations: 12 AIC: 28.86
Df Residuals: 8 BIC: 30.80
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -2.8798 1.163 -2.477 0.038 -5.561 -0.199
x1 -0.4894 0.370 -1.324 0.222 -1.342 0.363
x2 0.0029 0.001 2.586 0.032 0.000 0.005
x3 0.4572 0.176 2.595 0.032 0.051 0.864
Omnibus: 1.113 Durbin-Watson: 1.978
Prob(Omnibus): 0.573 Jarque-Bera (JB): 0.763
Skew: -0.562 Prob(JB): 0.683
Kurtosis: 2.489 Cond. No. 6.00e+03
In [30]:
#Build a model to predict y using x1,x2,x3,x4,x5 and x6. Note down R-Square and Adj R-Square values

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(adj_sample[["x1", "x2", "x3", "x4", "x5", "x6"]], adj_sample[["Y"]])
predictions = lr.predict(adj_sample[["x1", "x2", "x3", "x4", "x5", "x6"]])
In [31]:
import statsmodels.formula.api as sm
model = sm.ols(formula='Y ~ x1+x2+x3+x4+x5+x6', data=adj_sample)
fitted2 = model.fit()
fitted2.summary()
C:\Anaconda3\lib\site-packages\scipy\stats\stats.py:1557: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=12
  "anyway, n=%i" % int(n))
Out[31]:
OLS Regression Results
Dep. Variable: Y R-squared: 0.717
Model: OLS Adj. R-squared: 0.377
Method: Least Squares F-statistic: 2.111
Date: Wed, 27 Jul 2016 Prob (F-statistic): 0.215
Time: 11:48:28 Log-Likelihood: -9.7790
No. Observations: 12 AIC: 33.56
Df Residuals: 5 BIC: 36.95
Df Model: 6
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -5.3751 4.687 -1.147 0.303 -17.423 6.673
x1 -0.6697 0.537 -1.247 0.268 -2.050 0.711
x2 0.0030 0.002 1.956 0.108 -0.001 0.007
x3 0.5063 0.249 2.036 0.097 -0.133 1.146
x4 0.0376 0.084 0.449 0.672 -0.178 0.253
x5 0.0436 0.169 0.258 0.806 -0.390 0.478
x6 0.0516 0.088 0.588 0.582 -0.174 0.277
Omnibus: 0.426 Durbin-Watson: 2.065
Prob(Omnibus): 0.808 Jarque-Bera (JB): 0.434
Skew: -0.347 Prob(JB): 0.805
Kurtosis: 2.378 Cond. No. 1.98e+04
In [32]:
#Build a model to predict y using x1,x2,x3,x4,x5,x6,x7 and x8. Note down R-Square and Adj R-Square values
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(adj_sample[["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]], adj_sample[["Y"]])
predictions = lr.predict(adj_sample[["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]])
In [33]:
import statsmodels.formula.api as sm
model = sm.ols(formula='Y ~ x1+x2+x3+x4+x5+x6+x7+x8', data=adj_sample)
fitted3 = model.fit()
fitted3.summary()
C:\Anaconda3\lib\site-packages\scipy\stats\stats.py:1557: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=12
  "anyway, n=%i" % int(n))
Out[33]:
OLS Regression Results
Dep. Variable: Y R-squared: 0.805
Model: OLS Adj. R-squared: 0.285
Method: Least Squares F-statistic: 1.549
Date: Wed, 27 Jul 2016 Prob (F-statistic): 0.393
Time: 11:48:28 Log-Likelihood: -7.5390
No. Observations: 12 AIC: 33.08
Df Residuals: 3 BIC: 37.44
Df Model: 8
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 17.0440 19.903 0.856 0.455 -46.297 80.385
x1 -0.0956 0.761 -0.126 0.908 -2.519 2.328
x2 0.0007 0.003 0.291 0.790 -0.007 0.009
x3 0.5157 0.306 1.684 0.191 -0.459 1.490
x4 0.0579 0.103 0.560 0.615 -0.271 0.387
x5 0.0858 0.191 0.448 0.684 -0.524 0.695
x6 -0.1747 0.220 -0.795 0.485 -0.874 0.525
x7 -0.0324 0.153 -0.212 0.846 -0.519 0.455
x8 -0.2321 0.207 -1.124 0.343 -0.890 0.425
Omnibus: 1.329 Durbin-Watson: 1.594
Prob(Omnibus): 0.514 Jarque-Bera (JB): 0.875
Skew: -0.339 Prob(JB): 0.646
Kurtosis: 1.863 Cond. No. 7.85e+04
Model     R-Square   Adj R-Square
Model 1   0.684      0.566
Model 2   0.717      0.377
Model 3   0.805      0.285
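The table above can also be pulled straight from the fitted statsmodels results; a small sketch, assuming fitted1, fitted2 and fitted3 from the cells above are still in memory:

# R-squared and adjusted R-squared straight from the fitted results objects
for name, fit in [("Model1", fitted1), ("Model2", fitted2), ("Model3", fitted3)]:
    print(name, round(fit.rsquared, 3), round(fit.rsquared_adj, 3))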

R-Squared vs Adjusted R-Squared

We have built three models on the Adj_Sample data (model 1, model 2 and model 3) with different numbers of variables.

1) What does it indicate if R-squared is very far from adjusted R-squared? It is an indication of too many variables, or too many insignificant variables. We may have to check the impact of each variable and drop a few independent variables from the model.

2) How do you use adjusted R-squared? Build a model and check whether R-squared is close to adjusted R-squared. If not, use variable selection techniques to bring R-squared near adjusted R-squared. A difference of about 2% between R-squared and adjusted R-squared is acceptable; a quick check is sketched below.
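A small sketch of that rule of thumb in code; the 2% threshold is just the guideline from the point above, not a library setting:

# Flag a model whose R-squared and adjusted R-squared differ by more than ~2%
gap = fitted3.rsquared - fitted3.rsquared_adj
if gap > 0.02:
    print("Gap of %.3f between R-squared and Adj R-squared: drop insignificant variables" % gap)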

3) Is the number of independent variables the only thing that pulls adjusted R-squared down? No. If we look at the formula carefully, we can see that adjusted R-squared is influenced by both k (the number of parameters) and n (the number of observations). If k is high and n is low, adjusted R-squared will be much lower than R-squared, as the sketch below illustrates.
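To see the effect of n directly, here is a small illustrative sketch that keeps the Model 3 values of R-squared (0.805) and k (9 parameters) fixed and only increases the sample size; the larger sample sizes are hypothetical.

# Same R-squared and k as Model 3, evaluated at increasing (hypothetical) sample sizes
r2, k = 0.805, 9
for n in [12, 30, 100]:
    adj = r2 - (k - 1) / (n - k) * (1 - r2)
    print(n, round(adj, 3))
# 12 -> 0.285, 30 -> 0.731, 100 -> 0.788: adjusted R-squared approaches R-squared as n grows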

Finally, either reduce the number of variables or increase the number of observations to bring adjusted R-squared close to R-squared.

The next post is a practice session on multiple regression issues.