
203.4.3 ROC and AUC

What they are and why they are important.

ROC Curve – Interpretation

In the previous section, we studied Calculating Sensitivity and Specificity in R. The ROC curve builds on those ideas and answers questions such as:

  • How many mistakes are we making to identify all the positives?
  • How many mistakes are we making to identify 70%, 80% and 90% of the positives?
  • 1-Specificity (the false positive rate) gives us an idea of the mistakes we are making
  • Ideally, we would like to make 0% mistakes while identifying 100% of the positives
  • In practice, we would like to make very few mistakes while identifying the maximum number of positives
  • We want the curve to be as far away from the diagonal (random-guess) line as possible; ideally, we want the area under the curve to be as high as possible (see the sketch below)
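
To see where the points on a ROC curve come from, here is a minimal sketch that sweeps a few threshold values over predicted probabilities and computes sensitivity and 1-specificity at each step. The actual and predicted vectors are toy values assumed for illustration; they are not from the course data.

# Toy data (assumed): 'actual' is 1 for a positive, 0 for a negative
actual    <- c(1, 1, 1, 0, 1, 0, 0, 1, 0, 0)
predicted <- c(0.95, 0.90, 0.80, 0.70, 0.65, 0.40, 0.35, 0.30, 0.20, 0.10)

for (t in c(0.3, 0.5, 0.7)) {
  pred_class <- as.numeric(predicted >= t)
  sens <- sum(pred_class == 1 & actual == 1) / sum(actual == 1)  # sensitivity
  fpr  <- sum(pred_class == 1 & actual == 0) / sum(actual == 0)  # 1 - specificity
  cat("threshold:", t, " sensitivity:", sens, " 1-specificity:", fpr, "\n")
}

Each (1-specificity, sensitivity) pair is one point on the ROC curve; the full curve is traced by trying every possible threshold.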

Three Scenarios from the ROC Curve

Scenario-1 (Point A on the ROC curve)

  • Imagine that t1 is the threshold value that results in point A. t1 gives some sensitivity and specificity.
  • If we take t1 as the threshold value, we have the scenario below
  • True positive rate 65% and false positive rate 10%
  • To capture nearly 65% of the positives (the target), we are making 10% mistakes
  • Are you happy with losing 35% here while making only 10% mistakes there?
  • For example, suppose you are dealing with loans, where your target is finding the bad customers in a loan portfolio. Out of all the loan applications, your model successfully identified 65% of the bad customers. In the process, it also wrongly classified 10% of the good customers as bad.
  • So finally, scenario-1 with probability threshold t1 gives two losses: 35% of bad customers will be given loans, and 10% of good customers will be rejected for loans. The sketch below turns these percentages into counts.
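
To make the trade-off concrete, the sketch below works the scenario-1 percentages into counts for a hypothetical portfolio of 1,000 applications with 200 bad customers. The portfolio size and bad rate are assumptions for illustration, not from the data.

# Hypothetical portfolio (assumed): 1,000 applications, 200 bad (target), 800 good
n_bad  <- 200
n_good <- 800
tpr_t1 <- 0.65   # true positive rate at point A
fpr_t1 <- 0.10   # false positive rate at point A
tpr_t1 * n_bad          # 130 bad customers correctly flagged and rejected
(1 - tpr_t1) * n_bad    #  70 bad customers still given loans
fpr_t1 * n_good         #  80 good customers wrongly rejected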

Scenario-2 (Point B on the ROC curve)

  • Imagine that t2 is the threshold value that results in point B.
  • If we take t2 as the threshold value, we have the scenario below
  • True positive rate 80% and false positive rate 30%
  • To capture nearly 80% of the positives (the target), we are making 30% mistakes
  • Are you happy with capturing 80% here while making 30% mistakes there?
  • In our loans example, out of all the loan applications, your model successfully identified 80% of the bad customers. In the process, it also wrongly classified 30% of the good customers as bad.
  • So scenario-2 with probability threshold t2 gives two losses: 20% of bad customers will be given loans, and 30% of good customers will be rejected for loans.

Scenario-3 (Point C on the ROC curve)

  • Imagine that t3 is the threshold value that results in point C.
  • True positive rate 90% and false positive rate 60%
  • To capture nearly 90% of the positives (the target), we are making 60% mistakes
  • Are you happy with capturing 90% here while making as many as 60% mistakes there?
  • In our loans example, out of all the loan applications, your model successfully identified 90% of the bad customers. In the process, it also wrongly classified 60% of the good customers as bad.
  • So scenario-3 with probability threshold t3 gives two losses: 10% of bad customers will be given loans, and 60% of good customers will be rejected for loans.

Scenario Analysis Conclusion:

  • Depending on your business, you should choose the threshold (a cost-based sketch follows this list).
  • If the problem you are handling is detecting a bomb, you may want to be nearly 100% accurate, which means you will make a lot of mistakes (false positives): Scenario-3.
  • In a loan portfolio, you don't want to lose a lot of good customers. You would prefer Scenario-1 or Scenario-2.
  • If it is e-mail marketing and you want to capture as many responders as possible, you will choose Scenario-3.
  • If it is outbound telephone marketing, you don't want to unnecessarily call non-responders; there is a cost associated with false positives. You would prefer Scenario-1 or Scenario-2.
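
One way to make "choose the threshold depending on your business" operational is to attach a unit cost to each type of error and compare the scenarios. The sketch below is illustrative: the costs are assumptions, and it treats the percentages from the three scenarios as directly comparable.

# Assumed unit costs: a missed bad customer is 5x as costly as a rejected good one
cost_missed_bad    <- 5
cost_rejected_good <- 1
bad_missed    <- c(t1 = 35, t2 = 20, t3 = 10)   # % of bad customers given loans
good_rejected <- c(t1 = 10, t2 = 30, t3 = 60)   # % of good customers rejected
total_cost <- cost_missed_bad * bad_missed + cost_rejected_good * good_rejected
total_cost             # t1: 185, t2: 130, t3: 110
which.min(total_cost)  # t3 wins here; with equal costs (1:1), t1 would win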

ROC and AUC

  • We want the curve to be far away from the diagonal line; ideally, we want the area under the curve to be as high as possible
  • ROC comes with a connected measure, AUC: the Area Under the Curve
  • The ROC curve gives us an idea of the performance of the model under all possible threshold values
  • We want to make almost 0% mistakes while identifying all the positives, which means we want an AUC value near 1

AUC

  • AUC is near 1 for a good model and near 0.5 for a model that is no better than random guessing (see the sketch below)
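
A useful way to read the AUC number: it is the probability that a randomly chosen positive receives a higher predicted score than a randomly chosen negative. A minimal sketch, reusing the assumed toy vectors from the sketch above:

# Rank-based AUC: P(score of a random positive > score of a random negative)
pos <- predicted[actual == 1]
neg <- predicted[actual == 0]
pairs <- expand.grid(pos = pos, neg = neg)  # all positive-negative pairs
mean(pairs$pos > pairs$neg) + 0.5 * mean(pairs$pos == pairs$neg)  # ties count half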

ROC and AUC Calculation

Building a Logistic Regression Model

Product_sales <- read.csv("C:\\Amrita\\Datavedi\\Product Sales Data\\Product_sales.csv")
prod_sales_Logit_model <- glm(Bought ~ Age, family = binomial, data = Product_sales)
summary(prod_sales_Logit_model)
## 
## Call:
## glm(formula = Bought ~ Age, family = binomial, data = Product_sales)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.6922  -0.1645  -0.0619   0.1246   3.5378  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -6.90975    0.72755  -9.497   <2e-16 ***
## Age          0.21786    0.02091  10.418   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 640.425  on 466  degrees of freedom
## Residual deviance:  95.015  on 465  degrees of freedom
## AIC: 99.015
## 
## Number of Fisher Scoring iterations: 7

Code – ROC Calculation

library(pROC)
## Warning: package 'pROC' was built under R version 3.1.3
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
predicted_prob <- predict(prod_sales_Logit_model, type = "response")
roccurve <- roc(prod_sales_Logit_model$y, predicted_prob)
plot(roccurve)

## 
## Call:
## roc.default(response = prod_sales_Logit_model$y, predictor = predicted_prob)
## 
## Data: predicted_prob in 262 controls (prod_sales_Logit_model$y 0) < 205 cases (prod_sales_Logit_model$y 1).
## Area under the curve: 0.983

Code – AUC Calculation

auc(roccurve)
## Area under the curve: 0.983

Or

auc(prod_sales_Logit_model$y, predicted_prob)
## Area under the curve: 0.983
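
pROC can also suggest a threshold directly: coords() with "best" returns the point that maximizes sensitivity + specificity (the Youden index) by default. A quick sketch on the roccurve object above:

coords(roccurve, "best", ret = c("threshold", "sensitivity", "specificity"))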

Code – ROC from Fiberbits Model

This repeats the same steps for Fiberbits_model_1, the logistic regression model built on the Fiberbits data in an earlier section.

predicted_prob <- predict(Fiberbits_model_1, type = "response")
roccurve <- roc(Fiberbits_model_1$y, predicted_prob)
plot(roccurve)

## 
## Call:
## roc.default(response = Fiberbits_model_1$y, predictor = predicted_prob)
## 
## Data: predicted_prob in 42141 controls (Fiberbits_model_1$y 0) < 57859 cases (Fiberbits_model_1$y 1).
## Area under the curve: 0.835

Code – AUC of Fiberbits Model

auc(roccurve)
## Area under the curve: 0.835
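
To compare the two models visually, the ROC curves can be overlaid on one plot. This assumes the two roc objects were saved under distinct names; roc_prod and roc_fiber are hypothetical names, since both runs above reused the name roccurve.

# Overlay both ROC curves (roc_prod and roc_fiber are assumed names)
plot(roc_prod, col = "blue")
plot(roc_fiber, add = TRUE, col = "red")
legend("bottomright", legend = c("Product sales (AUC 0.983)", "Fiberbits (AUC 0.835)"),
       col = c("blue", "red"), lwd = 2)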

What is the best model? How do we build it?

  • A model with maximum accuracy / least error
  • A model that uses the maximum information available in the given data
  • A model that has minimum squared error
  • A model that captures all the hidden patterns in the data
  • A model that produces the best prediction results

The next post is about What is the Best Model.
