
203.4.4 What is a Best Model?

What qualifies a model as the best?

What is the best model, and how do we build it?

In the previous section, we studied ROC and AUC. A best model could be described as:

  • A model with maximum accuracy / least error
  • A model that uses the maximum information available in the given data
  • A model that has minimum squared error
  • A model that captures all the hidden patterns in the data
  • A model that produces the best prediction results

Model Selection

  • How do we build/choose the best model?
  • Error on the training data is not a good measure of performance on future data
  • How do we select the best model out of the set of available models?
  • Are there any methods/metrics to choose the best model?
  • What is training error? What is testing error? What is holdout sample error?

LAB: The Most Accurate Model

  • Data: Fiberbits/Fiberbits.csv
  • Build a decision tree to predict active_cust
  • What is the accuracy of your model?
  • Grow the tree as much as you can and try to achieve 95% accuracy.

Solution

  • Model-1
library(rpart)
library(rpart.plot)
Fiber_bits_tree1<-rpart(active_cust~., method="class", control=rpart.control(minsplit=30, cp=0.01), data=Fiberbits)
prp(Fiber_bits_tree1)

Fbits_pred1<-predict(Fiber_bits_tree1, type="class")
conf_matrix1<-table(Fbits_pred1,Fiberbits$active_cust)
accuracy1<-(conf_matrix1[1,1]+conf_matrix1[2,2])/(sum(conf_matrix1))
accuracy1
## [1] 0.84629
  • Model-2
Fiber_bits_tree2<-rpart(active_cust~., method="class", control=rpart.control(minsplit=5, cp=0.000001), data=Fiberbits)
Fbits_pred2<-predict(Fiber_bits_tree2, type="class")
conf_matrix2<-table(Fbits_pred2,Fiberbits$active_cust)
accuracy2<-(conf_matrix2[1,1]+conf_matrix2[2,2])/(sum(conf_matrix2))
accuracy2
## [1] 0.95063

Different Types of Datasets and Errors

The Training Error

  • The accuracy of our best model is 95%. Is a model with 5% error really good?
  • The error on the training data is known as training error.
  • A low error rate on training data does not always mean the model is good.
  • What really matters is how the model is going to perform on unknown data, i.e. test data.
  • We need to find a way to get an idea of the error rate on test data.
  • We may have to keep aside a part of the data and use it for validation.
  • There are two types of datasets and two types of errors

Two Types of Datasets

  • There are two types of datasets
  • Training set: This is the input data used in model building.
  • Test set: The unknown dataset. This dataset gives the accuracy of the final model.
  • We may not have access to both of these datasets for all machine learning problems. In some cases, we can take 90% of the available data and use it as training data, and the remaining 10% can be treated as validation data (a minimal splitting sketch follows this list).
  • Validation set: This dataset is kept aside for model validation and selection. It is a temporary substitute for the test dataset, not a third type of data.
  • We create the validation data with the hope that the error rate on the validation data will give us a basic idea of the test error.
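
A minimal sketch of such a split in base R, assuming the Fiberbits data frame is already loaded; the 90% fraction, the seed and the object names (fb_train, fb_validation) are illustrative choices, not part of the original lab:

# Random 90/10 split; fraction, seed and object names are illustrative
set.seed(123)
train_rows    <- sample(seq_len(nrow(Fiberbits)), size = 0.9 * nrow(Fiberbits))
fb_train      <- Fiberbits[train_rows, ]   # 90% used for model building
fb_validation <- Fiberbits[-train_rows, ]  # 10% kept aside for validation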

Types of Errors

  • The training error
      • The error on the training dataset
      • In-time error
      • Error on the known data
      • Can be reduced while building the model
  • The test error
      • The error that matters
      • Out-of-time error
      • The error on unknown/new data

“A good model will have both training and test error very near to each other and close to zero”

The Problem of Overfitting

  • In search of the best model on the given data, we add many predictors: polynomial terms, interaction terms, variable transformations, derived variables, indicator/dummy variables, etc.
  • Most of the time we succeed in reducing the error. But what error is this? It is the training error.
  • So by complicating the model, we fit the best model for the training data.
  • Sometimes the error on the training data can be reduced to near zero.
  • But the same best model on training data fails miserably on test data.
  • Imagine building multiple models with small changes in the training data. The resultant set of models will have huge variance in their parameter estimates.
  • The model is made so complicated that it is very sensitive to minimal changes in the data.
  • By complicating the model, the variance of the parameter estimates inflates.
  • The model tries to fit the irrelevant characteristics in the data.
  • Overfitting
      • The model is super good on training data but not so good on test data
      • We fit the model to the noise in the data
      • Low training error, high testing error
      • The model is overcomplicated with too many predictors
      • The model needs to be simplified
      • A model with a lot of variance

LAB: Model with huge Variance

  • Data: Fiberbits/Fiberbits.csv
  • Take the initial 90% of the data and consider it as training data. Keep the final 10% of the records for validation.
  • Build the best model (5% error) on the training data.
  • Use the validation data to verify the error rate. Is the error rate on the training data and the validation data the same?

Solution

fiber_bits_train<-Fiberbits[1:90000,]
fiber_bits_validation<-Fiberbits[90001:100000,]

Model on training data

Fiber_bits_tree3<-rpart(active_cust~., method="class", control=rpart.control(minsplit=5, cp=0.000001), data=fiber_bits_train)
Fbits_pred3<-predict(Fiber_bits_tree3, type="class")
conf_matrix3<-table(Fbits_pred3,fiber_bits_train$active_cust)
accuracy3<-(conf_matrix3[1,1]+conf_matrix3[2,2])/(sum(conf_matrix3))
accuracy3
## [1] 0.9524889

Validation Accuracy

fiber_bits_validation$pred <- predict(Fiber_bits_tree3, fiber_bits_validation,type="class")

conf_matrix_val<-table(fiber_bits_validation$pred,fiber_bits_validation$active_cust)
accuracy_val<-(conf_matrix_val[1,1]+conf_matrix_val[2,2])/(sum(conf_matrix_val))
accuracy_val
## [1] 0.7116

The error rate on the validation data (about 29%) is much higher than the training error (about 5%).

The Problem of Underfitting

  • Simple models are better. That is true, but is it always true? Not necessarily.
  • We might have given up too early. Did we really capture all the information?
  • Did we do enough research and feature engineering to fit the best model? Is it the best model that can be fit on this data?
  • By being overcautious about variance in the parameters, we might miss out on some patterns in the data.
  • The model needs to be complicated enough to capture all the information present.
  • If the training error itself is high, how can we be sure about the model's performance on unknown data?
  • Most accuracy and error statistics give us a clear idea of the training error; this is one advantage of underfitting: we can identify it confidently.
  • Underfitting
      • A model that is too simple
      • A model with scope for improvement
      • A model with a lot of bias

LAB: Model with huge Bias

  • Let's simplify the model.
  • Take the high-variance model and prune it.
  • Make it as simple as possible.
  • Find the training error and the validation error.

Solution

  • Simple Model
Fiber_bits_tree4<-rpart(active_cust~., method="class", control=rpart.control(minsplit=30, cp=0.25), data=fiber_bits_train)
prp(Fiber_bits_tree4)

Fbits_pred4<-predict(Fiber_bits_tree4, type="class")
conf_matrix4<-table(Fbits_pred4,fiber_bits_train$active_cust)
conf_matrix4
##            
## Fbits_pred4     0     1
##           0 11209   921
##           1 25004 52866
accuracy4<-(conf_matrix4[1,1]+conf_matrix4[2,2])/(sum(conf_matrix4))
accuracy4
## [1] 0.7119444
  • Validation accuracy
fiber_bits_validation$pred1 <- predict(Fiber_bits_tree4, fiber_bits_validation,type="class")

conf_matrix_val1<-table(fiber_bits_validation$pred1,fiber_bits_validation$active_cust)
accuracy_val1<-(conf_matrix_val1[1,1]+conf_matrix_val1[2,2])/(sum(conf_matrix_val1))
accuracy_val1
## [1] 0.4224

Model Bias and Variance

  • Overfitting
      • Low bias with high variance
      • Low training error – ‘low bias’
      • High testing error
      • Unstable model – ‘high variance’
      • The coefficients of the model change with small changes in the data (see the instability sketch after this list)
  • Underfitting
      • High bias with low variance
      • High training error – ‘high bias’
      • Testing error almost equal to the training error
      • Stable model – ‘low variance’
      • The coefficients of the model do not change much with small changes in the data
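
A minimal sketch of this instability, assuming the Fiberbits data frame is loaded: fit the same over-complex tree specification on two random halves of the data and compare the number of splits and the variable importance rankings. The object names (half1, tree_a, deep_spec, etc.) are illustrative; the exact numbers depend on the data, and the point is that they differ noticeably between the two fits.

library(rpart)

# Two random halves of the data (names and seeds are illustrative)
set.seed(1)
half1 <- Fiberbits[sample(nrow(Fiberbits), nrow(Fiberbits) / 2), ]
set.seed(2)
half2 <- Fiberbits[sample(nrow(Fiberbits), nrow(Fiberbits) / 2), ]

# Same deep-tree specification fitted on each half
deep_spec <- rpart.control(minsplit = 5, cp = 0.000001)
tree_a <- rpart(active_cust ~ ., method = "class", control = deep_spec, data = half1)
tree_b <- rpart(active_cust ~ ., method = "class", control = deep_spec, data = half2)

# A high-variance model: split counts and importance rankings differ between halves
nrow(tree_a$splits); nrow(tree_b$splits)
head(tree_a$variable.importance); head(tree_b$variable.importance)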

The Bias-Variance Decomposition

\[Y = f(X) + \epsilon, \qquad Var(\epsilon) = \sigma^2\] \[\text{Squared Error} = E\big[(Y - \hat{f}(x_0))^2 \mid X = x_0\big]\] \[= \sigma^2 + \big[E\hat{f}(x_0) - f(x_0)\big]^2 + E\big[\hat{f}(x_0) - E\hat{f}(x_0)\big]^2\] \[= \sigma^2 + \text{Bias}^2\big(\hat{f}(x_0)\big) + Var\big(\hat{f}(x_0)\big)\]

Overall Model Squared Error = Irreducible Error + \(Bias^2\) + Variance
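
The middle step comes from splitting the error into noise and estimation error, and then adding and subtracting \(E\hat{f}(x_0)\); a sketch of that expansion:

\[E\big[(Y - \hat{f}(x_0))^2\big] = E\big[(f(x_0) + \epsilon - \hat{f}(x_0))^2\big] = \sigma^2 + E\big[(f(x_0) - \hat{f}(x_0))^2\big]\]

since \(\epsilon\) has mean zero and is independent of \(\hat{f}(x_0)\), and

\[E\big[(f(x_0) - \hat{f}(x_0))^2\big] = \big(f(x_0) - E\hat{f}(x_0)\big)^2 + E\big[(\hat{f}(x_0) - E\hat{f}(x_0))^2\big],\]

which are exactly the squared bias and the variance.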

Bias-Variance Decomposition

  • Overall Model Squared Error = Irreducible Error + \(Bias^2\) + Variance
  • The overall error is made up of bias and variance together
  • High bias with low variance and low bias with high variance are both bad for the overall accuracy of the model
  • A good model needs to have low bias and low variance, or at least an optimal point where both of them are jointly low
  • How do we choose such an optimal model? How do we choose that optimal model complexity? (A complexity-sweep sketch follows this list.)
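
A minimal sketch of what searching for the optimal complexity looks like in practice, assuming the fiber_bits_train / fiber_bits_validation split created in the earlier lab; the helper function acc and the cp grid are illustrative choices, not part of the original lab:

library(rpart)

# Hypothetical helper: fraction of correct class predictions
acc <- function(model, data) {
  mean(predict(model, data, type = "class") == data$active_cust)
}

# Sweep model complexity: smaller cp means more splits, hence a more complex tree
cp_grid <- c(0.1, 0.01, 0.001, 0.0001, 0.00001)
for (cp in cp_grid) {
  fit <- rpart(active_cust ~ ., method = "class",
               control = rpart.control(minsplit = 5, cp = cp),
               data = fiber_bits_train)
  cat("cp =", cp,
      " train acc =", round(acc(fit, fiber_bits_train), 4),
      " validation acc =", round(acc(fit, fiber_bits_validation), 4), "\n")
}

Training accuracy typically keeps rising as cp shrinks, while validation accuracy peaks at some intermediate complexity and then falls; that peak is the optimal point in the bias-variance tradeoff.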

Choosing the Optimal Model: The Bias-Variance Tradeoff

[Figure: the bias-variance tradeoff – test and training error versus model complexity]

Choosing Optimal Model

  • Unfortunately:
      • There is no scientific method of choosing the optimal model complexity that gives the minimum test error.
      • Training error is not a good estimate of the test error.
      • There is always a bias-variance tradeoff in choosing the appropriate complexity of the model.
  • We can use cross validation methods, bootstrapping and bagging to choose an optimal and consistent model

Holdout Data Cross Validation

  • The best solution is out-of-time validation; at the very least, the testing error should be given higher priority than the training error.
  • A model that performs well on training data and equally well on testing data is preferred.
  • We may not always have the test data. How do we estimate the test error?
  • We take part of the data as training data and keep aside some portion for validation, perhaps an 80%-20% or 90%-10% split.
  • Data splitting is a very basic, intuitive method (a minimal sketch follows this list).
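
A minimal sketch of such a split in base R, assuming the Fiberbits data frame is loaded; the names fb_train_80 and fb_holdout are illustrative. The lab solution below uses caret's createDataPartition instead, which additionally stratifies the split on the outcome:

# Random 80/20 holdout split; fraction, seed and names are illustrative
set.seed(99)
idx         <- sample(seq_len(nrow(Fiberbits)), size = 0.8 * nrow(Fiberbits))
fb_train_80 <- Fiberbits[idx, ]    # 80% training sample
fb_holdout  <- Fiberbits[-idx, ]   # 20% holdout sample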

LAB: Holdout Data Cross Validation

  • Data: Fiberbits/Fiberbits.csv
  • Take a random sample with 80% of the data as the training sample
  • Use the remaining 20% as the holdout sample.
  • Build a model on 80% of the data. Try to validate it on the holdout sample.
  • Try increasing or reducing the complexity, and choose the best model that performs well on the training data as well as on the holdout data

Solution

  • caret is a good package for cross validation
library(caret)
sampleseed <- createDataPartition(Fiberbits$active_cust, p=0.80, list=FALSE)
train_new <- Fiberbits[sampleseed,]
hold_out <- Fiberbits[-sampleseed,]
  • Model1
library(rpart)
Fiber_bits_tree5<-rpart(active_cust~., method="class", control=rpart.control(minsplit=5, cp=0.000001), data=train_new)
Fbits_pred5<-predict(Fiber_bits_tree5, type="class")
  • Accuracy on Training Data
conf_matrix5<-table(Fbits_pred5,train_new$active_cust)
conf_matrix5
##            
## Fbits_pred5     0     1
##           0 31482  1689
##           1  2230 44599
accuracy5<-(conf_matrix5[1,1]+conf_matrix5[2,2])/(sum(conf_matrix5))
accuracy5
## [1] 0.9510125
  • Model1 Validation accuracy
hold_out$pred <- predict(Fiber_bits_tree5, hold_out, type="class")
conf_matrix_val<-table(hold_out$pred,hold_out$active_cust)
conf_matrix_val
##    
##         0     1
##   0  7003  1333
##   1  1426 10238
accuracy_val<-(conf_matrix_val[1,1]+conf_matrix_val[2,2])/(sum(conf_matrix_val))
accuracy_val
## [1] 0.86205
  • Model2
Fiber_bits_tree5<-rpart(active_cust~., method="class", control=rpart.control(minsplit=30, cp=0.05), data=train_new)
Fbits_pred5<-predict(Fiber_bits_tree5, type="class")
conf_matrix5<-table(Fbits_pred5,train_new$active_cust)
  • Accuracy on Training Data
accuracy5<-(conf_matrix5[1,1]+conf_matrix5[2,2])/(sum(conf_matrix5))
accuracy5
## [1] 0.7882375
  • Model2 Validation accuracy
hold_out$pred <- predict(Fiber_bits_tree5, hold_out,type="class")
conf_matrix_val<-table(hold_out$pred,hold_out$active_cust)
accuracy_val<-(conf_matrix_val[1,1]+conf_matrix_val[2,2])/(sum(conf_matrix_val))
accuracy_val
## [1] 0.79225
  • Model3
Fiber_bits_tree5<-rpart(active_cust~., method="class", control=rpart.control(minsplit=30, cp=0.001), data=train_new)
Fbits_pred5<-predict(Fiber_bits_tree5, type="class")
conf_matrix5<-table(Fbits_pred5,train_new$active_cust)
  • Accuracy on Training Data
accuracy5<-(conf_matrix5[1,1]+conf_matrix5[2,2])/(sum(conf_matrix5))
accuracy5
## [1] 0.8673
  • Model3 Validation accuracy
hold_out$pred <- predict(Fiber_bits_tree5, hold_out,type="class")
conf_matrix_val<-table(hold_out$pred,hold_out$active_cust)
accuracy_val<-(conf_matrix_val[1,1]+conf_matrix_val[2,2])/(sum(conf_matrix_val))
accuracy_val
## [1] 0.8661

Ten-fold Cross Validation

  • Divide the data into 10 parts (randomly)
  • Use 9 parts as training data (90%) and the tenth part as holdout data (10%)
  • We can repeat this process 10 times
  • Build 10 models and find the average error on the 10 holdout samples. This gives us an idea of the testing error (a minimal sketch of the mechanics follows this list)
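
A minimal sketch of those mechanics in base R, assuming the Fiberbits data frame is loaded; the seed, the tree settings and the names fold_id / fold_acc are illustrative choices:

library(rpart)

# Manual 10-fold cross validation: assign each row to one of 10 folds at random,
# train on 9 folds, measure accuracy on the held-out fold
set.seed(42)
k <- 10
fold_id <- sample(rep(1:k, length.out = nrow(Fiberbits)))

fold_acc <- numeric(k)
for (i in 1:k) {
  train_i <- Fiberbits[fold_id != i, ]
  test_i  <- Fiberbits[fold_id == i, ]
  fit <- rpart(active_cust ~ ., method = "class",
               control = rpart.control(minsplit = 10, cp = 0.001),
               data = train_i)
  fold_acc[i] <- mean(predict(fit, test_i, type = "class") == test_i$active_cust)
}

mean(fold_acc)   # average holdout accuracy across the 10 folds

The caret package automates exactly this bookkeeping, as shown in the lab below.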

K-fold Cross Validation

  • A generalization of cross validation.
  • Divide the whole dataset into K equal parts
  • Use the kth part of the data as the holdout sample and the remaining K-1 parts as training data
  • Repeat this K times and build K models. The average error on the holdout samples gives us an idea of the testing error
  • Which model to choose?
      • Choose the model with the least error and the least complexity
      • Or a model with less-than-average error that is simple (fewer parameters)
  • Finally, use the complete data and build a model with the chosen number of parameters
  • Note: It is better to choose K between 5 and 10, which gives 80% to 90% training data and the remaining 20% to 10% as holdout data (a selection sketch follows this list)
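
One way to operationalise "less-than-average error and simple" is caret's one-standard-error rule. A sketch under the assumptions that the Fiberbits data is loaded and active_cust has been converted to a factor; the cp grid and the names ctrl / cv_tree are illustrative choices:

library(caret)
library(rpart)

# 10-fold CV that picks the simplest cp within one standard error of the best one
ctrl    <- trainControl(method = "cv", number = 10, selectionFunction = "oneSE")
cp_grid <- expand.grid(cp = c(0.05, 0.01, 0.001, 0.0001))

cv_tree <- train(active_cust ~ ., data = Fiberbits, method = "rpart",
                 trControl = ctrl, tuneGrid = cp_grid)
cv_tree$bestTune   # chosen cp: simple, yet close to the lowest cross-validated error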

LAB – K-fold Cross Validation

  • Build a tree model on the Fiberbits data.
  • Try to build the best model by making all the possible adjustments to the parameters.
  • What is the accuracy of the above model?
  • Perform 10-fold cross validation. What is the final accuracy?
  • Perform 20-fold cross validation. What is the final accuracy?
  • What accuracy can be expected on an unknown dataset?

Solution

  • Model on complete training data
Fiber_bits_tree3<-rpart(active_cust~., method="class", control=rpart.control(minsplit=10, cp=0.000001), data=Fiberbits)
Fbits_pred3<-predict(Fiber_bits_tree3, type="class")
conf_matrix3<-table(Fbits_pred3,Fiberbits$active_cust)
conf_matrix3
##            
## Fbits_pred3     0     1
##           0 38154  2849
##           1  3987 55010
  • Accuracy on Training Data
accuracy3<-(conf_matrix3[1,1]+conf_matrix3[2,2])/(sum(conf_matrix3))
accuracy3
## [1] 0.93164
  • K-fold cross validation model building
  • K=10
library(caret)
train_dat <- trainControl(method="cv", number=10)

We need to convert the dependent variable to a factor before fitting the model:

Fiberbits$active_cust<-as.factor(Fiberbits$active_cust)
  • Building the models on K-fold samples
library(e1071)
K_fold_tree<-train(active_cust~., method="rpart", trControl=train_dat, control=rpart.control(minsplit=10, cp=0.000001),  data=Fiberbits)
K_fold_tree$finalModel
## n= 100000 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 100000 42141 1 (0.42141000 0.57859000)  
##   2) relocated>=0.5 12348   954 0 (0.92274052 0.07725948) *
##   3) relocated< 0.5 87652 30747 1 (0.35078492 0.64921508)  
##     6) Speed_test_result< 78.5 27517 10303 0 (0.62557692 0.37442308) *
##     7) Speed_test_result>=78.5 60135 13533 1 (0.22504365 0.77495635) *
prp(K_fold_tree$finalModel)

Kfold_pred<-predict(K_fold_tree)
conf_matrix6<-confusionMatrix(Kfold_pred,Fiberbits$active_cust)
conf_matrix6
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 28608 11257
##          1 13533 46602
##                                           
##                Accuracy : 0.7521          
##                  95% CI : (0.7494, 0.7548)
##     No Information Rate : 0.5786          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4879          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.6789          
##             Specificity : 0.8054          
##          Pos Pred Value : 0.7176          
##          Neg Pred Value : 0.7750          
##              Prevalence : 0.4214          
##          Detection Rate : 0.2861          
##    Detection Prevalence : 0.3987          
##       Balanced Accuracy : 0.7422          
##                                           
##        'Positive' Class : 0               
## 
  • K=20
library(caret)
train_dat <- trainControl(method="cv", number=20)

We need to convert the dependent variable to a factor before fitting the model:

Fiberbits$active_cust<-as.factor(Fiberbits$active_cust)

Building the models on K-fold samples

library(e1071)
K_fold_tree_1<-train(active_cust~., method="rpart", trControl=train_dat, control=rpart.control(minsplit=10, cp=0.000001),  data=Fiberbits)
K_fold_tree_1$finalModel
## n= 100000 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 100000 42141 1 (0.42141000 0.57859000)  
##   2) relocated>=0.5 12348   954 0 (0.92274052 0.07725948) *
##   3) relocated< 0.5 87652 30747 1 (0.35078492 0.64921508)  
##     6) Speed_test_result< 78.5 27517 10303 0 (0.62557692 0.37442308) *
##     7) Speed_test_result>=78.5 60135 13533 1 (0.22504365 0.77495635) *
prp(K_fold_tree_1$finalModel)

Kfold_pred<-predict(K_fold_tree_1)

The caret package has a confusionMatrix function:

conf_matrix6_1<-confusionMatrix(Kfold_pred,Fiberbits$active_cust)
conf_matrix6_1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 28608 11257
##          1 13533 46602
##                                           
##                Accuracy : 0.7521          
##                  95% CI : (0.7494, 0.7548)
##     No Information Rate : 0.5786          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4879          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.6789          
##             Specificity : 0.8054          
##          Pos Pred Value : 0.7176          
##          Neg Pred Value : 0.7750          
##              Prevalence : 0.4214          
##          Detection Rate : 0.2861          
##    Detection Prevalence : 0.3987          
##       Balanced Accuracy : 0.7422          
##                                           
##        'Positive' Class : 0               
## 

Bootstrap Cross Validation

Bootstrap Methods

  • Bootstrapping is a powerful tool for getting an idea of the accuracy of the model and the test error
  • It can estimate the likely future performance of a given modeling procedure on new data not yet realized.
  • The Algorithm
      • We have training data of size N
      • Draw a random sample of size N with replacement. This gives a new dataset; it might have repeated observations, and some observations might not appear even once.
      • Create B such new datasets. These are called bootstrap datasets
      • Build the model on these B datasets; we can test the models on the original training dataset.

Bootstrap Example

  • Example
  1. We have training data of size 500.
  2. Bootstrap Data-1: Create a dataset of size 500. To create it, draw a random point, note it down, then replace it. Draw another sample point. Repeat this process 500 times. This makes a dataset of size 500; call it Bootstrap Data-1.
  3. Multiple bootstrap datasets: Repeat the procedure in step 2 multiple times, say 200 times. Then we have 200 bootstrap datasets.
  4. We can build models on these 200 bootstrap datasets, and the average error gives a good idea of the overall error. We can even use the original training data as the test data for each of the models (a minimal sketch follows this list).
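
A minimal sketch of the procedure in base R, assuming the Fiberbits data frame is loaded; B, the seed, the tree settings and the name boot_acc are illustrative choices (the example above uses B = 200, which simply takes longer):

library(rpart)

set.seed(7)
B <- 20
boot_acc <- numeric(B)

for (b in 1:B) {
  # Draw a bootstrap sample: same size as the data, sampled with replacement
  idx <- sample(nrow(Fiberbits), replace = TRUE)
  fit <- rpart(active_cust ~ ., method = "class",
               control = rpart.control(minsplit = 10, cp = 0.001),
               data = Fiberbits[idx, ])
  # Test each bootstrap model on the original training data
  boot_acc[b] <- mean(predict(fit, Fiberbits, type = "class") == Fiberbits$active_cust)
}

mean(boot_acc)   # average accuracy across the B bootstrap models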

LAB: Bootstrap Cross Validation

  • Draw a bootstrap sample of sufficient size
  • Build a tree model and get an estimate of the true accuracy of the model

Solution

  • Draw a bootstrap sample of sufficient size

Here, number is B, the number of bootstrap samples:

train_control <- trainControl(method="boot", number=20) 

Tree model on the bootstrapped data:

Boot_Strap_model <- train(active_cust~., method="rpart", trControl= train_control, control=rpart.control(minsplit=10, cp=0.000001),  data=Fiberbits)
Boot_Strap_predictions <- predict(Boot_Strap_model)

conf_matrix7<-confusionMatrix(Boot_Strap_predictions,Fiberbits$active_cust)
conf_matrix7
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 28608 11257
##          1 13533 46602
##                                           
##                Accuracy : 0.7521          
##                  95% CI : (0.7494, 0.7548)
##     No Information Rate : 0.5786          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4879          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.6789          
##             Specificity : 0.8054          
##          Pos Pred Value : 0.7176          
##          Neg Pred Value : 0.7750          
##              Prevalence : 0.4214          
##          Detection Rate : 0.2861          
##    Detection Prevalence : 0.3987          
##       Balanced Accuracy : 0.7422          
##                                           
##        'Positive' Class : 0               
## 

