
203.4.7 Cross Validation

Cross validating a model

Choosing Optimal Model

In the previous section, we studied the Bias-Variance Tradeoff.

  • Unfortunately, there is no scientific method for choosing the optimal model complexity that gives the minimum test error.
  • Training error is not a good estimate of the test error.
  • There is always a bias-variance tradeoff in choosing the appropriate complexity of the model.
  • We can use cross validation methods, bootstrapping and bagging to choose an optimal and consistent model.

Holdout Data Cross Validation

  • The best solution is out-of-time validation; in other words, testing error should be given higher priority than training error.
  • A model that performs well on training data and equally well on testing data is preferred.
  • We may not always have test data. How do we estimate the test error?
  • We use part of the data for training and set aside a portion for validation, perhaps 80%-20% or 90%-10%.
  • Data splitting is a very basic and intuitive method; a minimal sketch of such a split is shown below
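
As a simple illustration of the idea (not part of the lab below), here is a minimal holdout split in base R, assuming a hypothetical data frame mydata:

set.seed(123)                                            # for reproducibility
n <- nrow(mydata)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))   # random 80% of row indices
train_data <- mydata[train_idx, ]                        # 80% training data
holdout_data <- mydata[-train_idx, ]                     # remaining 20% holdout data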

LAB: Holdout Data Cross Validation

  • Data: Fiberbits/Fiberbits.csv
  • Take a random sample of 80% of the data as the training sample
  • Use the remaining 20% as the holdout sample.
  • Build a model on 80% of the data and validate it on the holdout sample.
  • Try increasing or reducing the complexity and choose the best model that performs well on the training data as well as the holdout data

Solution

  • The caret package is useful for cross validation
library(caret)
# createDataPartition returns row indices for a random 80% sample
sampleseed <- createDataPartition(Fiberbits$active_cust, p=0.80, list=FALSE)
train_new <- Fiberbits[sampleseed,]   # 80% training data
hold_out <- Fiberbits[-sampleseed,]   # 20% holdout data
  • Model1
library(rpart)
# A complex tree: very small minsplit and cp allow the tree to grow deep
Fiber_bits_tree5<-rpart(active_cust~., method="class", control=rpart.control(minsplit=5, cp=0.000001), data=train_new)
Fbits_pred5<-predict(Fiber_bits_tree5, type="class")   # predictions on the training data
  • Accuracy on Training Data
conf_matrix5<-table(Fbits_pred5,train_new$active_cust)
conf_matrix5
##            
## Fbits_pred5     0     1
##           0 31482  1689
##           1  2230 44599
accuracy5<-(conf_matrix5[1,1]+conf_matrix5[2,2])/(sum(conf_matrix5))
accuracy5
## [1] 0.9510125
  • Model1 Validation accuracy
hold_out$pred <- predict(Fiber_bits_tree5, hold_out, type="class")
conf_matrix_val<-table(hold_out$pred,hold_out$active_cust)
conf_matrix_val
##    
##         0     1
##   0  7003  1333
##   1  1426 10238
accuracy_val<-(conf_matrix_val[1,1]+conf_matrix_val[2,2])/(sum(conf_matrix_val))
accuracy_val
## [1] 0.86205
  • Model2
Fiber_bits_tree5<-rpart(active_cust~., method="class", control=rpart.control(minsplit=30, cp=0.05), data=train_new)
Fbits_pred5<-predict(Fiber_bits_tree5, type="class")
conf_matrix5<-table(Fbits_pred5,train_new$active_cust)
  • Accuracy on Training Data
accuracy5<-(conf_matrix5[1,1]+conf_matrix5[2,2])/(sum(conf_matrix5))
accuracy5
## [1] 0.7882375
  • Model2 Validation accuracy
hold_out$pred <- predict(Fiber_bits_tree5, hold_out,type="class")
conf_matrix_val<-table(hold_out$pred,hold_out$active_cust)
accuracy_val<-(conf_matrix_val[1,1]+conf_matrix_val[2,2])/(sum(conf_matrix_val))
accuracy_val
## [1] 0.79225
  • Model3
Fiber_bits_tree5<-rpart(active_cust~., method="class", control=rpart.control(minsplit=30, cp=0.001), data=train_new)
Fbits_pred5<-predict(Fiber_bits_tree5, type="class")
conf_matrix5<-table(Fbits_pred5,train_new$active_cust)
  • Accuracy on Training Data
accuracy5<-(conf_matrix5[1,1]+conf_matrix5[2,2])/(sum(conf_matrix5))
accuracy5
## [1] 0.8673
  • Model3 Validation accuracy
hold_out$pred <- predict(Fiber_bits_tree5, hold_out,type="class")
conf_matrix_val<-table(hold_out$pred,hold_out$active_cust)
accuracy_val<-(conf_matrix_val[1,1]+conf_matrix_val[2,2])/(sum(conf_matrix_val))
accuracy_val
## [1] 0.8661

Ten-fold Cross Validation

  • Divide the data randomly into 10 parts
  • Use 9 parts as training data (90%) and the tenth part as holdout data (10%)
  • We can repeat this process 10 times
  • Build 10 models and find the average error on the 10 holdout samples. This gives us an idea of the testing error; a minimal manual sketch is shown below
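
As a rough illustration, the ten folds can also be coded by hand. The following sketch uses rpart on the Fiberbits data (assuming it is already loaded) and is separate from the caret-based approach used in the labs:

library(rpart)
set.seed(123)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(Fiberbits)))    # random fold labels 1..10
fold_accuracy <- numeric(k)
for (i in 1:k) {
  train_part <- Fiberbits[folds != i, ]                    # 9 parts (90%) as training data
  holdout_part <- Fiberbits[folds == i, ]                  # 1 part (10%) as holdout data
  fit <- rpart(active_cust ~ ., data = train_part, method = "class",
               control = rpart.control(minsplit = 30, cp = 0.001))
  pred <- predict(fit, holdout_part, type = "class")
  fold_accuracy[i] <- mean(as.character(pred) == as.character(holdout_part$active_cust))
}
mean(fold_accuracy)   # average holdout accuracy across the 10 folds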

K-fold Cross Validation

  • K-fold cross validation is a generalization of this idea.
  • Divide the whole dataset into K equal parts
  • Use the kth part of the data as the holdout sample and the remaining K-1 parts as training data
  • Repeat this K times and build K models. The average error on the holdout samples gives us an idea of the testing error
  • Which model to choose?
  • Choose the model with the least error and the least complexity
  • Or a model with less-than-average error that is simple (fewer parameters)
  • Finally, use the complete data and build a model with the chosen number of parameters
  • Note: It is better to choose K between 5 and 10, which gives 80% to 90% training data and the remaining 10% to 20% as holdout data. A sketch of choosing tree complexity by k-fold cross validation is shown below.
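
One way to apply the "least error, least complexity" rule in practice is to let caret tune the rpart complexity parameter cp with k-fold cross validation. The grid of cp values below is an illustrative assumption, not part of the lab, and active_cust is expected to already be a factor:

library(caret)
library(rpart)
cv_control <- trainControl(method = "cv", number = 10)            # 10-fold cross validation
cp_grid <- expand.grid(cp = c(0.05, 0.01, 0.001, 0.0001))         # candidate complexities (illustrative)
cv_tree <- train(active_cust ~ ., data = Fiberbits, method = "rpart",
                 trControl = cv_control, tuneGrid = cp_grid)
cv_tree$results    # cross-validated accuracy for each cp
cv_tree$bestTune   # the cp value with the best cross-validated accuracy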

LAB – K-fold Cross Validation

  • Build a tree model on the Fiberbits data.
  • Try to build the best model by making all possible adjustments to the parameters.
  • What is the accuracy of the above model?
  • Perform 10-fold cross validation. What is the final accuracy?
  • Perform 20-fold cross validation. What is the final accuracy?
  • What accuracy can be expected on an unknown dataset?

Solution

  • Model on complete training data
Fiber_bits_tree3<-rpart(active_cust~., method="class", control=rpart.control(minsplit=10, cp=0.000001), data=Fiberbits)
Fbits_pred3<-predict(Fiber_bits_tree3, type="class")
conf_matrix3<-table(Fbits_pred3,Fiberbits$active_cust)
conf_matrix3
##            
## Fbits_pred3     0     1
##           0 38154  2849
##           1  3987 55010
  • Accuracy on Training Data
accuracy3<-(conf_matrix3[1,1]+conf_matrix3[2,2])/(sum(conf_matrix3))
accuracy3
## [1] 0.93164
  • Building the k-fold cross validation model
  • K=10
library(caret)
train_dat <- trainControl(method="cv", number=10)

We need to convert the dependent variable to a factor before fitting the model.

Fiberbits$active_cust<-as.factor(Fiberbits$active_cust)
  • Building the models on K-fold samples
library(e1071)
## Warning: package 'e1071' was built under R version 3.1.3
K_fold_tree<-train(active_cust~., method="rpart", trControl=train_dat, control=rpart.control(minsplit=10, cp=0.000001),  data=Fiberbits)
K_fold_tree$finalModel
## n= 100000 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 100000 42141 1 (0.42141000 0.57859000)  
##   2) relocated>=0.5 12348   954 0 (0.92274052 0.07725948) *
##   3) relocated< 0.5 87652 30747 1 (0.35078492 0.64921508)  
##     6) Speed_test_result< 78.5 27517 10303 0 (0.62557692 0.37442308) *
##     7) Speed_test_result>=78.5 60135 13533 1 (0.22504365 0.77495635) *
library(rpart.plot)   # prp() comes from the rpart.plot package
prp(K_fold_tree$finalModel)

Kfold_pred<-predict(K_fold_tree)
conf_matrix6<-confusionMatrix(Kfold_pred,Fiberbits$active_cust)
conf_matrix6
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 28608 11257
##          1 13533 46602
##                                           
##                Accuracy : 0.7521          
##                  95% CI : (0.7494, 0.7548)
##     No Information Rate : 0.5786          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4879          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.6789          
##             Specificity : 0.8054          
##          Pos Pred Value : 0.7176          
##          Neg Pred Value : 0.7750          
##              Prevalence : 0.4214          
##          Detection Rate : 0.2861          
##    Detection Prevalence : 0.3987          
##       Balanced Accuracy : 0.7422          
##                                           
##        'Positive' Class : 0               
## 
  • K=20
library(caret)
train_dat <- trainControl(method="cv", number=20)

We need to convert the dependent variable to a factor before fitting the model.

Fiberbits$active_cust<-as.factor(Fiberbits$active_cust)

Building the models on K-fold samples

library(e1071)
K_fold_tree_1<-train(active_cust~., method="rpart", trControl=train_dat, control=rpart.control(minsplit=10, cp=0.000001),  data=Fiberbits)
K_fold_tree_1$finalModel
## n= 100000 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 100000 42141 1 (0.42141000 0.57859000)  
##   2) relocated>=0.5 12348   954 0 (0.92274052 0.07725948) *
##   3) relocated< 0.5 87652 30747 1 (0.35078492 0.64921508)  
##     6) Speed_test_result< 78.5 27517 10303 0 (0.62557692 0.37442308) *
##     7) Speed_test_result>=78.5 60135 13533 1 (0.22504365 0.77495635) *
prp(K_fold_tree_1$finalModel)

Kfold_pred<-predict(K_fold_tree_1)

The caret package has a confusionMatrix function.

conf_matrix6_1<-confusionMatrix(Kfold_pred,Fiberbits$active_cust)
conf_matrix6_1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 28608 11257
##          1 13533 46602
##                                           
##                Accuracy : 0.7521          
##                  95% CI : (0.7494, 0.7548)
##     No Information Rate : 0.5786          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4879          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.6789          
##             Specificity : 0.8054          
##          Pos Pred Value : 0.7176          
##          Neg Pred Value : 0.7750          
##              Prevalence : 0.4214          
##          Detection Rate : 0.2861          
##    Detection Prevalence : 0.3987          
##       Balanced Accuracy : 0.7422          
##                                           
##        'Positive' Class : 0               
## 

Bootstrap Cross Validation

Bootstrap Methods

  • Bootstrapping is a powerful tool for getting an idea of the accuracy of the model and the test error
  • It can estimate the likely future performance of a given modeling procedure on new data not yet realized.
  • The Algorithm
  • We have training data of size N
  • Draw a random sample of size N with replacement. This gives a new dataset; it might have repeated observations, and some observations might not appear even once.
  • Create B such new datasets. These are called bootstrap datasets
  • Build models on these B datasets; we can test the models on the original training dataset.

Bootstrap Example

  • Example
  1. We have training data of size 500.
  2. Bootstrap Data-1:
  • Create a dataset of size 500. To create this dataset, draw a random point, note it down, then replace it back. Draw another sample point. Repeat this process 500 times. This makes a dataset of size 500. Call this Bootstrap Data-1.
  3. Multiple bootstrap datasets:
  • Repeat the procedure in step 2 multiple times, say 200 times. Then we have 200 bootstrap datasets.
  4. We can build models on these 200 bootstrap datasets, and the average error gives a good idea of the overall error. We can even use the original training data as the test data for each of the models. A minimal manual sketch of this procedure is shown below.
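
The procedure above can be sketched by hand as follows. This is only an illustration with rpart on the Fiberbits data, with B kept small; the lab below uses caret's built-in bootstrap option instead:

library(rpart)
set.seed(123)
B <- 20                                                        # number of bootstrap datasets
n <- nrow(Fiberbits)
boot_accuracy <- numeric(B)
for (b in 1:B) {
  boot_idx <- sample(seq_len(n), size = n, replace = TRUE)     # sample of size N with replacement
  boot_data <- Fiberbits[boot_idx, ]                           # bootstrap dataset b
  fit <- rpart(active_cust ~ ., data = boot_data, method = "class",
               control = rpart.control(minsplit = 30, cp = 0.001))
  pred <- predict(fit, Fiberbits, type = "class")              # test on the original training data
  boot_accuracy[b] <- mean(as.character(pred) == as.character(Fiberbits$active_cust))
}
mean(boot_accuracy)   # average accuracy across the B bootstrap models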

LAB: Bootstrap Cross Validation

  • Draw a bootstrap sample with a sufficient sample size
  • Build a tree model and get an estimate of the true accuracy of the model

Solution

  • Draw a bootstrap sample with a sufficient sample size

Here, number is B (the number of bootstrap samples).

train_control <- trainControl(method="boot", number=20) 

Tree model on the bootstrapped data

Boot_Strap_model <- train(active_cust~., method="rpart", trControl= train_control, control=rpart.control(minsplit=10, cp=0.000001),  data=Fiberbits)
Boot_Strap_predictions <- predict(Boot_Strap_model)

conf_matrix7<-confusionMatrix(Boot_Strap_predictions,Fiberbits$active_cust)
conf_matrix7
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 28608 11257
##          1 13533 46602
##                                           
##                Accuracy : 0.7521          
##                  95% CI : (0.7494, 0.7548)
##     No Information Rate : 0.5786          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4879          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.6789          
##             Specificity : 0.8054          
##          Pos Pred Value : 0.7176          
##          Neg Pred Value : 0.7750          
##              Prevalence : 0.4214          
##          Detection Rate : 0.2861          
##    Detection Prevalence : 0.3987          
##       Balanced Accuracy : 0.7422          
##                                           
##        'Positive' Class : 0               
## 

Conclusion

  • We studied:
  • Validating a model, types of data and types of errors
  • The problem of overfitting and the problem of underfitting
  • Bias-Variance Tradeoff
  • Cross validation and bootstrapping
  • Training error is what we see, and it is not the true performance metric
  • Test error plays a vital role in model selection
  • R-square, Adjusted R-square, Accuracy, ROC, AUC, AIC and BIC can be used to get an idea of the training error
  • Cross validation and bootstrapping techniques give us an idea of the test error
  • Choose the model based on a combination of AIC, cross validation and bootstrapping results
  • Bootstrapping is widely used in ensemble models and random forests.

In the next section, we will study Neural Networks.
