
204.1.7 Adjusted R-squared in Python

R-squared and adjusted R-squared for regression with multiple variables.

Link to the previous post : https://course.dvanalyticsmds.com/204-1-6-multiple-regression-in-python/

Adjusted R-Squared

  • Is it good to have as many independent variables as possible? No.
  • R-squared can be deceptive: it never decreases when a new X variable is added to the model, even if that variable carries no real information.
  • We need a better measure, or an adjustment to the original R-squared formula.
  • Adjusted R-squared
    • Its value depends on the number of explanatory variables
    • Imposes a penalty for adding additional explanatory variables
    • It is usually written as [latex]\bar{R}^2[/latex]
    • Can be very different from [latex]R^2[/latex] when there are many predictors and n is small

[latex]\bar{R}^2 = R^2 - \frac{k-1}{n-k}(1-R^2)[/latex]

 

where n is the number of observations and k is the number of parameters (including the intercept).

The R-squared value increases whenever we add independent variables. Adjusted R-squared increases only if a significant variable is added. Look at the example below: as we keep adding new variables, R-squared keeps increasing, but adjusted R-squared may not.
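As a quick sanity check of the formula, here is a minimal sketch of the calculation in Python. The helper name adjusted_r2 is only for illustration; the values plugged in are the Model 1 figures reported further below (R-squared = 0.684, n = 12 observations, k = 4 parameters including the intercept).

# Minimal sketch of the adjusted R-squared formula above
# n = number of observations, k = number of parameters (including the intercept)
def adjusted_r2(r2, n, k):
    return r2 - (k - 1) / (n - k) * (1 - r2)

print(adjusted_r2(0.684, 12, 4))  # ~0.566 (up to rounding), matching Adj. R-squared of Model 1 below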

Practice : Adjusted R-Square

  • Dataset: “Adjusted Rsquare/ Adj_Sample.csv”
  • Build a model to predict y using x1,x2 and x3. Note down R-Square and Adj R-Square values
  • Build a model to predict y using x1,x2,x3,x4,x5 and x6. Note down R-Square and Adj R-Square values
  • Build a model to predict y using x1,x2,x3,x4,x5,x6,x7 and x8. Note down R-Square and Adj R-Square values
In [26]:
adj_sample=pd.read_csv("datasets\\Adjusted RSquare\\Adj_Sample.csv")
adj_sample.shape
Out[26]:
(12, 9)
In [27]:
adj_sample.columns.values
Out[27]:
array(['Y', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8'], dtype=object)
In [28]:
#Build a model to predict y using x1,x2 and x3. Note down R-Square and Adj R-Square values
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(adj_sample[["x1", "x2", "x3"]], adj_sample[["Y"]])
predictions = lr.predict(adj_sample[["x1", "x2", "x3"]])
In [29]:
import statsmodels.formula.api as sm
model = sm.ols(formula='Y ~ x1+x2+x3', data=adj_sample)
fitted1 = model.fit()
fitted1.summary()
C:\Anaconda3\lib\site-packages\scipy\stats\stats.py:1557: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=12
  "anyway, n=%i" % int(n))
Out[29]:
OLS Regression Results
Dep. Variable: Y R-squared: 0.684
Model: OLS Adj. R-squared: 0.566
Method: Least Squares F-statistic: 5.785
Date: Wed, 27 Jul 2016 Prob (F-statistic): 0.0211
Time: 11:48:28 Log-Likelihood: -10.430
No. Observations: 12 AIC: 28.86
Df Residuals: 8 BIC: 30.80
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -2.8798 1.163 -2.477 0.038 -5.561 -0.199
x1 -0.4894 0.370 -1.324 0.222 -1.342 0.363
x2 0.0029 0.001 2.586 0.032 0.000 0.005
x3 0.4572 0.176 2.595 0.032 0.051 0.864
Omnibus: 1.113 Durbin-Watson: 1.978
Prob(Omnibus): 0.573 Jarque-Bera (JB): 0.763
Skew: -0.562 Prob(JB): 0.683
Kurtosis: 2.489 Cond. No. 6.00e+03
In [30]:
#Build a model to predict y using x1,x2,x3,x4,x5 and x6. Note down R-Square and Adj R-Square values

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(adj_sample[["x1", "x2", "x3", "x4", "x5", "x6"]], adj_sample[["Y"]])
predictions = lr.predict(adj_sample[["x1", "x2", "x3", "x4", "x5", "x6"]])
In [31]:
import statsmodels.formula.api as sm
model = sm.ols(formula='Y ~ x1+x2+x3+x4+x5+x6', data=adj_sample)
fitted2 = model.fit()
fitted2.summary()
C:\Anaconda3\lib\site-packages\scipy\stats\stats.py:1557: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=12
  "anyway, n=%i" % int(n))
Out[31]:
OLS Regression Results
Dep. Variable: Y R-squared: 0.717
Model: OLS Adj. R-squared: 0.377
Method: Least Squares F-statistic: 2.111
Date: Wed, 27 Jul 2016 Prob (F-statistic): 0.215
Time: 11:48:28 Log-Likelihood: -9.7790
No. Observations: 12 AIC: 33.56
Df Residuals: 5 BIC: 36.95
Df Model: 6
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -5.3751 4.687 -1.147 0.303 -17.423 6.673
x1 -0.6697 0.537 -1.247 0.268 -2.050 0.711
x2 0.0030 0.002 1.956 0.108 -0.001 0.007
x3 0.5063 0.249 2.036 0.097 -0.133 1.146
x4 0.0376 0.084 0.449 0.672 -0.178 0.253
x5 0.0436 0.169 0.258 0.806 -0.390 0.478
x6 0.0516 0.088 0.588 0.582 -0.174 0.277
Omnibus: 0.426 Durbin-Watson: 2.065
Prob(Omnibus): 0.808 Jarque-Bera (JB): 0.434
Skew: -0.347 Prob(JB): 0.805
Kurtosis: 2.378 Cond. No. 1.98e+04
In [32]:
#Build a model to predict y using x1,x2,x3,x4,x5,x6,x7 and x8. Note down R-Square and Adj R-Square values
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(adj_sample[["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]], adj_sample[["Y"]])
predictions = lr.predict(adj_sample[["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]])
In [33]:
import statsmodels.formula.api as sm
model = sm.ols(formula='Y ~ x1+x2+x3+x4+x5+x6+x7+x8', data=adj_sample)
fitted3 = model.fit()
fitted3.summary()
C:\Anaconda3\lib\site-packages\scipy\stats\stats.py:1557: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=12
  "anyway, n=%i" % int(n))
Out[33]:
OLS Regression Results
Dep. Variable: Y R-squared: 0.805
Model: OLS Adj. R-squared: 0.285
Method: Least Squares F-statistic: 1.549
Date: Wed, 27 Jul 2016 Prob (F-statistic): 0.393
Time: 11:48:28 Log-Likelihood: -7.5390
No. Observations: 12 AIC: 33.08
Df Residuals: 3 BIC: 37.44
Df Model: 8
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 17.0440 19.903 0.856 0.455 -46.297 80.385
x1 -0.0956 0.761 -0.126 0.908 -2.519 2.328
x2 0.0007 0.003 0.291 0.790 -0.007 0.009
x3 0.5157 0.306 1.684 0.191 -0.459 1.490
x4 0.0579 0.103 0.560 0.615 -0.271 0.387
x5 0.0858 0.191 0.448 0.684 -0.524 0.695
x6 -0.1747 0.220 -0.795 0.485 -0.874 0.525
x7 -0.0324 0.153 -0.212 0.846 -0.519 0.455
x8 -0.2321 0.207 -1.124 0.343 -0.890 0.425
Omnibus: 1.329 Durbin-Watson: 1.594
Prob(Omnibus): 0.514 Jarque-Bera (JB): 0.875
Skew: -0.339 Prob(JB): 0.646
Kurtosis: 1.863 Cond. No. 7.85e+04
Model     R-Square   Adj R-Square
Model 1   0.684      0.566
Model 2   0.717      0.377
Model 3   0.805      0.285
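The table above can also be pulled straight from the fitted statsmodels results; a small sketch, assuming fitted1, fitted2 and fitted3 from the cells above are still in memory:

# R-squared and adjusted R-squared straight from the fitted results objects
for name, fit in [("Model1", fitted1), ("Model2", fitted2), ("Model3", fitted3)]:
    print(name, round(fit.rsquared, 3), round(fit.rsquared_adj, 3))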

R-Squared vs Adjusted R-Squared

We have built three models on the Adj_Sample data (model 1, model 2 and model 3) with different numbers of variables.

1) What does it indicate if R-squared is very far from adjusted R-squared? It is an indication of too many variables, or too many insignificant variables. We may have to check the impact of each variable and drop a few independent variables from the model.

2) How do you use adjusted R-squared? Build a model and check whether R-squared is close to adjusted R-squared. If not, use variable selection techniques to bring R-squared near adjusted R-squared. A difference of about 2% between R-squared and adjusted R-squared is acceptable; a quick check is sketched below.
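A small sketch of that rule of thumb in code; the 2% threshold is just the guideline from the point above, not a library setting:

# Flag a model whose R-squared and adjusted R-squared differ by more than ~2%
gap = fitted3.rsquared - fitted3.rsquared_adj
if gap > 0.02:
    print("Gap of %.3f between R-squared and Adj R-squared: drop insignificant variables" % gap)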

3) Is the number of independent variables the only thing that pulls adjusted R-squared down? No. If we look at the formula carefully, we can see that adjusted R-squared is influenced by both k (the number of parameters) and n (the number of observations). If k is high and n is low, adjusted R-squared will be much lower than R-squared, as the sketch below illustrates.
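To see the effect of n directly, here is a small illustrative sketch that keeps the Model 3 values of R-squared (0.805) and k (9 parameters) fixed and only increases the sample size; the larger sample sizes are hypothetical.

# Same R-squared and k as Model 3, evaluated at increasing (hypothetical) sample sizes
r2, k = 0.805, 9
for n in [12, 30, 100]:
    adj = r2 - (k - 1) / (n - k) * (1 - r2)
    print(n, round(adj, 3))
# 12 -> 0.285, 30 -> 0.731, 100 -> 0.788: adjusted R-squared approaches R-squared as n grows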

Finally, either reduce the number of variables or increase the number of observations to bring adjusted R-squared close to R-squared.

The next post is a practice session on multiple regression issues.