Before starting the lesson, please download the datasets.
Contents
- Model Validation
- Metrics for validating Classification Problems
- Sensitivity and Specificity
- Sensitivity vs Specificity
- When Sensitivity is given High Priority
- When Specificity is given High Priority
- Receiver Operating Characteristic (ROC) Curve and its Interpretation
- Area Under Curve (AUC)
- What is meant by best model
- Model Selection
- Types of data
- Types of errors
- Overfitting
- The Problem of Overfitting
- Under-fitting
- The Problem of Under-fitting
- Bias-Variance Tradeoff
- Bias-Variance Decomposition
- Choosing optimal model
- Cross validation Techniques
- Holdout Data Cross Validation
- Ten-Fold Cross Validation
- K-Fold Cross Validation
- Bootstrap Cross Validation
- Conclusion
Model Validation
Model validation is checking how good our model is. In regression, model validation is done through R-squared and adjusted R-squared, whereas in classification techniques such as logistic regression we have used measures like the confusion matrix and accuracy. In addition to the confusion matrix and accuracy, there are many more validation metrics for classification problems.
Metrics for validating Classification Problems
- Confusion Matrix
- Sensitivity
- Specificity
- ROC
- KS
- Gini
- Concordance and Discordance
- Chi-Square
- Hosmer-Lemeshow Goodness-of-Fit Test
- Lift Curve
All of these metrics measure model accuracy in one way or another; some work particularly well for certain classes of problems. The confusion matrix, ROC and AUC are sufficient for most business problems.
Sensitivity and Specificity
Sensitivity and specificity are derived from the confusion matrix. Sensitivity is the percentage of positives that are successfully classified as positive, i.e., out of all the actual positives, how many we classified correctly; it is highest when there are no false negatives. Specificity is the percentage of negatives that are successfully classified as negative; it is highest when there are no false positives. The formulas are listed below, followed by a small numeric example.
- Accuracy = (TP+TN)/(TP+FP+FN+TN)
- Misclassification Rate = (FP+FN)/(TP+FP+FN+TN)
- Sensitivity = TP/(TP+FN): percentage of positives that are successfully classified as positive
- Specificity = TN/(TN+FP): percentage of negatives that are successfully classified as negative
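As a quick numeric check of these formulas, here is a tiny sketch with made-up counts (the numbers are purely hypothetical):
#Hypothetical confusion-matrix counts, purely for illustration
TP, FN, FP, TN = 40, 10, 5, 45
accuracy = (TP + TN) / (TP + FP + FN + TN)                 #0.85
misclassification_rate = (FP + FN) / (TP + FP + FN + TN)   #0.15
sensitivity = TP / (TP + FN)                               #0.80 -> 80% of actual positives caught
specificity = TN / (TN + FP)                               #0.90 -> 90% of actual negatives caught
print(accuracy, misclassification_rate, sensitivity, specificity)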
Calculating Sensitivity and Specificity
Building a Logistic Regression Model
#Importing necessary libraries
import sklearn as sk
import pandas as pd
import numpy as np
import scipy as sp
#Importing the dataset
Fiber_df = pd.read_csv("datasets/Fiberbits/Fiberbits.csv")
###to see head and tail of the Fiber dataset
Fiber_df.head(5)
#Name of the columns/Variables
Fiber_df.columns
#Building and training a Logistic Regression model
import statsmodels.formula.api as sm
logistic1 = sm.logit(formula='active_cust~income+months_on_network+Num_complaints+number_plan_changes+relocated+monthly_bill+technical_issues_per_month+Speed_test_result', data=Fiber_df)
fitted1 = logistic1.fit()
fitted1.summary()
###predicting values
predicted_values1 = fitted1.predict(Fiber_df[['income', 'months_on_network', 'Num_complaints', 'number_plan_changes', 'relocated', 'monthly_bill', 'technical_issues_per_month', 'Speed_test_result']])
predicted_values1[1:10]
### Converting predicted values into classes using threshold
threshold=0.5
predicted_class1=np.zeros(predicted_values1.shape)
predicted_class1[predicted_values1>threshold]=1
predicted_class1
#Confusion matrix, Accuracy, sensitivity and specificity
from sklearn.metrics import confusion_matrix
cm1 = confusion_matrix(Fiber_df[['active_cust']],predicted_class1)
print('Confusion Matrix : \n', cm1)
total1=sum(sum(cm1))
#####from confusion matrix calculate accuracy
accuracy1=(cm1[0,0]+cm1[1,1])/total1
print ('Accuracy : ', accuracy1)
#Note: sklearn's confusion_matrix puts actual classes on rows and predicted classes on columns,
#so with classes [0,1]: TN=cm1[0,0], FP=cm1[0,1], FN=cm1[1,0], TP=cm1[1,1]
sensitivity1 = cm1[1,1]/(cm1[1,0]+cm1[1,1])
print('Sensitivity : ', sensitivity1 )
specificity1 = cm1[0,0]/(cm1[0,0]+cm1[0,1])
print('Specificity : ', specificity1)
Changing Threshold to 0.8
### Converting predicted values into classes using new threshold
threshold=0.8
predicted_class1=np.zeros(predicted_values1.shape)
predicted_class1[predicted_values1>threshold]=1
predicted_class1
Change in Confusion Matrix, Accuracy and Sensitivity-Specificity
#Confusion matrix, Accuracy, sensitivity and specificity
from sklearn.metrics import confusion_matrix
cm1 = confusion_matrix(Fiber_df[['active_cust']],predicted_class1)
print('Confusion Matrix : \n', cm1)
total1=sum(sum(cm1))
#####from confusion matrix calculate accuracy
accuracy1=(cm1[0,0]+cm1[1,1])/total1
print ('Accuracy : ', accuracy1)
#TP=cm1[1,1], FN=cm1[1,0], TN=cm1[0,0], FP=cm1[0,1] (see the note above)
sensitivity1 = cm1[1,1]/(cm1[1,0]+cm1[1,1])
print('Sensitivity : ', sensitivity1 )
specificity1 = cm1[0,0]/(cm1[0,0]+cm1[0,1])
print('Specificity : ', specificity1)
Sensitivity vs Specificity
By changing the threshold, the classification of customers changes, and hence sensitivity and specificity change. Which of the two should we maximize? What should the ideal threshold be? Ideally we want to maximize both sensitivity and specificity, but that is not always possible: there is a tradeoff. Sometimes we want to be 100% sure about the predicted negatives; sometimes we want to be 100% sure about the predicted positives. Sometimes we simply don't want to compromise on sensitivity, and sometimes we don't want to compromise on specificity. The threshold is set based on the business problem.
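To see the tradeoff concretely, we can reuse predicted_values1 from the model above and recompute sensitivity and specificity at a few candidate thresholds. This is only a sketch; the exact numbers depend on the fitted model.
#Sketch: how sensitivity and specificity move as the threshold changes
from sklearn.metrics import confusion_matrix
actual = Fiber_df['active_cust']
for t in [0.3, 0.5, 0.7, 0.8]:
    predicted_class_t = (predicted_values1 > t).astype(int)
    tn, fp, fn, tp = confusion_matrix(actual, predicted_class_t).ravel()
    print('threshold:', t,
          'sensitivity:', round(tp / (tp + fn), 3),
          'specificity:', round(tn / (tn + fp), 3))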
When Sensitivity is a High Priority
Imagine we are building a model to predict bad (default) customers before issuing a loan. The profit on one good customer's loan is not equal to the loss on one bad customer's loan, so one bad customer is not equal to one good customer.
- If p is the probability of default, we set the threshold in such a way that we don't miss any of the bad customers.
- We set the threshold so that sensitivity is high.
- We can compromise on specificity here: if we wrongly reject a good customer, the loss is small compared to giving a loan to a bad customer.
- We don't really worry about the good customers here; they are not harmful, so we can live with lower specificity.
When Specificity is a High Priority
- Testing whether a medicine is good or poisonous
In this case, we really have to avoid the situation where the medicine is actually poisonous but the model predicts it as good.
- We can't take any chances here.
- Specificity needs to be near 100%.
- Sensitivity can be compromised here: not using a good medicine is far less harmful than the reverse.
Sensitivity vs Specificity – Importance
- There are some cases where Sensitivity is important and needs to be near 1.
- There are business cases where Specificity is important and needs to be near 1.
- We need to understand the business problem and decide the relative importance of sensitivity and specificity.
ROC Curve
The ROC (Receiver Operating Characteristic) curve is drawn by taking the false positive rate on the X-axis and the true positive rate on the Y-axis. The ROC curve tells us how many mistakes we are making in order to identify all the positives.
ROC Curve – Interpretation
- How many mistakes are we making to identify all the positives?
- How many mistakes are we making to identify 70%, 80% or 90% of the positives?
- 1 - Specificity (the false positive rate) gives us an idea of the mistakes we are making.
- We would like to make 0% mistakes while identifying 100% of the positives.
- We would like to make very few mistakes while identifying the maximum number of positives.
- We want the curve to be as far away from the diagonal (random-guess) line as possible.
- Ideally, we want the area under the curve to be as high as possible.
AUC
We want the curve to be far away from the diagonal line; ideally, we want the area under the curve to be as high as possible. The ROC curve shows the overall performance of the model across thresholds, whereas AUC quantifies it as a single number, from which we can directly tell whether the model is good or bad. We want to make almost 0% mistakes while identifying all the positives, which means we want the AUC value to be near 1.
- AUC is near 1 for a good model
ROC and AUC Calculation
Building a Logistic Regression Model
###for visualising the plots use matplotlib and import roc_curve,auc from sklearn.metrics
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
%matplotlib inline
actual = Fiber_df[['active_cust']]
false_positive_rate, true_positive_rate, thresholds = roc_curve(actual, predicted_values1)
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, true_positive_rate)
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate (Sensitivity)')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.show()
###Threshold values used for the roc_curve can be viewed from threshold array
thresholds
###Area under Curve-AUC
roc_auc = auc(false_positive_rate, true_positive_rate)
roc_auc
What is the best model? How do we build it?
- A model with maximum accuracy / least error.
- A model that uses the maximum information available in the given data.
- A model that has minimum squared error.
- A model that captures all the hidden patterns in the data.
- A model that produces the best prediction results.
Model Selection
- How do we build/choose the best model?
- Error on the training data is not a good measure of performance on future data.
- How do we select the best model out of the set of available models?
- Are there any methods/metrics for choosing the best model?
- What is training error? What is testing error? What is hold-out sample error?
LAB: The Most Accurate Model
- Data: Fiberbits/Fiberbits.csv
- Build a decision tree to predict active_user
- What is the accuracy of your model?
- Grow the tree as much as you can and achieve 95% accuracy.
Solution
#Preparing the X and y to train the model
features = list(Fiber_df.drop(['active_cust'], axis=1).columns)
X = np.array(Fiber_df[features])
y = np.array(Fiber_df['active_cust'])
#Let's make a model by choosing some initial parameters.
from sklearn import tree
tree_config = tree.DecisionTreeClassifier(criterion='gini',
                                          splitter='best',
                                          max_depth=10,
                                          min_samples_split=2,  #must be >= 2 in current sklearn
                                          min_samples_leaf=30,
                                          max_leaf_nodes=10)
#Training the model and finding the accuracy of the model
tree_config.fit(X,y)
tree_config.score(X,y)
tree_config_new = tree.DecisionTreeClassifier(criterion='gini',
splitter='best',
max_depth=None,
min_samples_split=2,
min_samples_leaf=1,
max_leaf_nodes=None)
#Training the model and accuracy
tree_config_new.fit(X,y)
tree_config_new.score(X,y)
Different Types of Datasets and Errors
Datasets
The data we work with falls into the following sets:
- Training set
- Test set
- Validation set
Training set
The data used for building the model; the input data.
Test set
An unseen dataset that gives the accuracy of the final model. We may not have access to both of these datasets for every machine learning problem: we will always have a training dataset, and sometimes we may also have a test dataset. If only a training dataset is available, we treat 90% of the available data as training data and the remaining 10% as validation data.
Validation set
A dataset kept aside for model validation and selection. It is a temporary substitute for the test dataset, not a third kind of data. We create the validation data in the hope that the error rate on it gives us a basic idea of the test error. Once we have the training and validation datasets, we build the best model on the training set; its error on the training data is called the training error.
Types of Errors
There are two types of errors. They are
- Training Error and
- Testing Error
Training error
- The error on training dataset
- In-time error
- Error on the known data
- Can be reduced while building the model
Testing error
- The error that matters
- Out-of-time error
- The error on unknown/new dataset.
A good model will have training and test errors very near to each other and close to zero. If the model is very good on training data but not good on testing data, it is called overfitting.
Overfitting
- The model is very good on training data but not so good on test data.
- Low training error, high testing error.
- The model is over-complicated, with too many predictors.
- A model with a lot of variance.
The Problem of Overfitting
- In search of the best model on the given data, we add many predictors: polynomial terms, interaction terms, variable transformations, derived variables, indicator/dummy variables, etc.
- Most of the time we succeed in reducing the error. But what error is this?
- By complicating the model, we fit the best possible model to the training data.
- Sometimes the error on the training data can be reduced to near zero.
- But that same best-on-training model fails miserably on test data.
- Imagine building multiple models with small changes in the training data: the resulting set of models will show huge variance in their parameter estimates (see the sketch after this list).
- The model is made so complicated that it is very sensitive to minimal changes in the data.
- By complicating the model, the variance of the parameter estimates inflates.
- The model tries to fit the irrelevant characteristics (noise) in the data.
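Here is the sketch referred to above. Assuming X and y from the earlier lab, it refits the same fully grown tree on slightly different random subsets of the training data; the spread of the hold-out accuracies gives a feel for the variance of an over-complicated model.
#Sketch: an over-complicated tree refit on slightly different training samples
from sklearn import tree
from sklearn.model_selection import train_test_split   #sklearn.cross_validation in older versions
X_tr, X_hold, y_tr, y_hold = train_test_split(X, y, train_size=0.9, random_state=0)
for seed in range(5):
    #drop a different random 10% of the training rows each time
    X_sub, _, y_sub, _ = train_test_split(X_tr, y_tr, train_size=0.9, random_state=seed)
    deep_tree = tree.DecisionTreeClassifier(max_depth=None, min_samples_leaf=1)
    deep_tree.fit(X_sub, y_sub)
    print('sample', seed, 'hold-out accuracy:', round(deep_tree.score(X_hold, y_hold), 4))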
LAB: Model with huge Variance
- Data: Fiberbits/Fiberbits.csv
- Take the initial 90% of the data and consider it as training data. Keep the final 10% of the records for validation.
- Build the best possible model (around 5% error) on the training data.
- Use the validation data to verify the error rate. Is the error rate on the training data the same as on the validation data?
#Splitting the dataset into training and testing datasets
X = np.array(Fiber_df[features])
y = np.array(Fiber_df['active_cust'])
from sklearn.model_selection import train_test_split   #this lived in sklearn.cross_validation in older sklearn versions
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size = 0.9)
#Building model on training data.
tree_var = tree.DecisionTreeClassifier(criterion='gini',
splitter='best',
max_depth=20,
min_samples_split=2,
min_samples_leaf=1,
max_leaf_nodes=None)
tree_var.fit(X_train,y_train)
#Accuracy of the model on training data
tree_var.score(X_train,y_train)
#Accuracy on the test data
tree_var.score(X_test,y_test)
- Error rate on validation data is more than the training data error.
Under-fitting
If the model is over-simplified, we might not be capturing all the information in the data. If we really require 20 variables but use only 10, we might lose real information that is present in the dataset. If we don't do enough research, we can't build the best model for the data we are given. If the model is too simple, the training error itself will be high. The model needs to be complicated enough to capture all the information present; we can't over-simplify it. Losing information by over-simplifying the model is called the problem of under-fitting.
The Problem of Under-fitting
- Simple models are better. That is true, but is it always true? Not necessarily.
- We might have given up too early. Did we really capture all the information?
- Did we do enough research and feature engineering to fit the best model? Is this the best model that can be fit on this data?
- By being over-cautious about variance in the parameters, we might miss out on some patterns in the data.
- The model needs to be complicated enough to capture all the information present.
- If the training error itself is high, how can we be sure about the model's performance on unknown data?
- Most accuracy and error statistics give us a clear idea of the training error; this is one advantage of under-fitting, since we can identify it confidently.
Summary of Under-fitting
- A model that is too simple
- A model with scope for improvement
- A model with a lot of bias
LAB: Model with huge Bias
- Let's simplify the model.
- Take the high variance model and prune it.
- Make it as simple as possible.
- Find the training error and validation error.
Solution
#We can prune the tree by changing the parameters
tree_bias = tree.DecisionTreeClassifier(criterion='gini',
splitter='best',
max_depth=10,
min_samples_split=30,
min_samples_leaf=30,
max_leaf_nodes=20)
tree_bias.fit(X_train,y_train)
#Training accuracy
tree_bias.score(X_train,y_train)
#Let's prune the tree further and oversimplify the model
tree_bias1 = tree.DecisionTreeClassifier(criterion='gini',
splitter='random',
max_depth=1,
min_samples_split=100,
min_samples_leaf=100,
max_leaf_nodes=2)
tree_bias1.fit(X_train,y_train)
#Training Accuracy of new model
tree_bias1.score(X_train,y_train)
#Validation accuracy on test data
tree_bias1.score(X_test,y_test)
Model Bias and Variance
- Over fitting
- Low Bias with High Variance
- Low training error – ‘Low Bias’
- High testing error
- Unstable model – ‘High Variance’
- The coefficients of the model change with small changes in the data
- Under fitting
- High Bias with Low Variance
- High training error – 'High Bias'
- Testing error almost equal to the training error
- Stable model – 'Low Variance'
- The coefficients of the model don't change with small changes in the data
Observations
Look at the bottom-left picture: if our aim is to hit the center of the circle and all our shots land near the center, that is a model with low bias and low variance. In the second picture at the bottom, the points are close to each other but far away from the center of the circle; this indicates low variance with high bias. In the top-left picture, the points are spread around the center but none of them is particularly close to it, so this model has low bias with high variance. In the top-right picture, the points are away from the center of the circle and also scattered far from each other; this indicates high bias with high variance.
The Bias-Variance Decomposition
- Overall Model Squared Error = Irreducible Error + Bias² + Variance
- The overall error is made up of bias and variance together.
- High bias with low variance, or low bias with high variance: both are bad for the overall accuracy of the model.
- A good model needs to have low bias and low variance, or at least an optimal point where both of them are jointly low.
- How do we choose such an optimal model? How do we choose the optimal model complexity? (A small simulation sketch follows below.)
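Here is the simulation sketch mentioned above. It uses synthetic data (not the Fiberbits set): a very simple and a very flexible polynomial model are refit on many resampled training sets, and bias² and variance of their predictions are estimated at a fixed test point. Typically the simple model shows the larger bias² and the flexible model the larger variance.
#Sketch: estimating bias^2 and variance on synthetic data
import numpy as np
rng = np.random.RandomState(0)
def true_f(x):                                 #the signal we are trying to learn
    return np.sin(2 * np.pi * x)
x_test = 0.3                                   #a fixed test point
preds = {'simple (degree 1)': [], 'complex (degree 12)': []}
for _ in range(200):                           #200 resampled training sets
    x = rng.uniform(0, 1, 30)
    y_noisy = true_f(x) + rng.normal(0, 0.3, 30)
    for name, degree in [('simple (degree 1)', 1), ('complex (degree 12)', 12)]:
        coefs = np.polyfit(x, y_noisy, degree)
        preds[name].append(np.polyval(coefs, x_test))
for name, p in preds.items():
    p = np.array(p)
    print(name, 'bias^2:', round((p.mean() - true_f(x_test)) ** 2, 4),
          'variance:', round(p.var(), 4))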
Choosing the Optimal Model - Bias-Variance Tradeoff
(Figure: bias and variance plotted against model complexity)
Observations
The figure shows that as model complexity increases, bias decreases. The variance first reduces, but as the complexity becomes very high, the variance of the model increases. The best spot is the complexity at which bias and variance are jointly at their optimal (lowest combined) value.
Test and Training Error
Observations
As model complexity increases, the training error keeps reducing, while the test error may reduce initially but keeps increasing as the complexity grows further. If the model does not have enough complexity, both the training and test errors will be high. The best spot is the complexity with the optimal (lowest) combination of training and test error.
Choosing Optimal Model
Unfortunately, there is no purely analytical way of choosing the optimal model complexity that gives the minimum test error. Training error is not a good estimate of the test error, and there is always a bias-variance tradeoff in choosing the appropriate complexity of the model. We can use cross-validation methods, bootstrapping and bagging to choose an optimal and consistent model.
Holdout Data Cross Validation
Using hold-out data, or cross validation, is one of the best ways of estimating the true error of a model. As we know, the training error does not give us the real error of the model: we may be overfitting or underfitting. The cross-validation error gives a much better picture of the final accuracy of the model, which we can use to choose the model complexity and build a better model. A model that performs well on the training data and equally well on the testing data is preferred. If testing data is not given, we split the training data into two parts (say 80%-20% or 90%-10%): the first part is used to build the model and the second part is used to validate it.
LAB: Holdout Data Cross Validation
- Data: Fiberbits/Fiberbits.csv
- Take a random sample of 80% of the data as the training sample.
- Use the remaining 20% as the hold-out sample.
- Build a model on the 80% and validate it on the hold-out sample.
- Try increasing or reducing the complexity and choose the best model, i.e., one that performs well on the training data as well as on the hold-out data.
Solution
#Splitting data into 80:20::train:test
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size = 0.8)
#Defining tree parameters and training the tree
tree_CV = tree.DecisionTreeClassifier(criterion='gini',
splitter='best',
max_depth=20,
min_samples_split=2,
min_samples_leaf=1)
tree_CV.fit(X_train,y_train)
#Training score
tree_CV.score(X_train,y_train)
#Validation Accuracy on test data
tree_CV.score(X_test,y_test)
tree_CV1 = tree.DecisionTreeClassifier(criterion='gini',
splitter='best',
max_depth=10,
min_samples_split=30,
min_samples_leaf=30,
max_leaf_nodes=30)
tree_CV1.fit(X_train,y_train)
#Training score of this pruned tree model
tree_CV1.score(X_train,y_train)
#Validation score of pruned tree model
tree_CV1.score(X_test,y_test)
Ten-fold Cross Validation
Divide the whole data into 10 parts. Use 9 parts as training data (90%) and the tenth part as hold-out data (10%). Repeat this whole process 10 times, building 10 models; the average error across the 10 hold-out parts gives a far more realistic estimate of the true error than the training error alone.
Working
Consider the bars above as the overall data divided into 10 parts. While building Model 1, the last part is taken as test data and the remaining 9 parts as training data, and we record both the training error and the test error. While building Model 2, parts 1 to 8 and part 10 are used as training data and part 9 as test data, and again we record the training and test errors. This process is repeated until every part has been used once as test data. That is 10-fold cross validation: it yields 10 error estimates, and their average gives us an idea of the overall accuracy of the model we are building. The same procedure can be written as a short loop, as in the sketch below.
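A minimal sketch of that loop, assuming X and y from the earlier labs (the tree settings here are only illustrative, and the newer sklearn.model_selection API is used):
#Sketch: 10-fold cross validation written out as an explicit loop
import numpy as np
from sklearn import tree
from sklearn.model_selection import KFold
kf = KFold(n_splits=10, shuffle=True, random_state=0)
train_scores, test_scores = [], []
for train_idx, test_idx in kf.split(X):
    model = tree.DecisionTreeClassifier(max_depth=10, min_samples_leaf=30)
    model.fit(X[train_idx], y[train_idx])
    train_scores.append(model.score(X[train_idx], y[train_idx]))
    test_scores.append(model.score(X[test_idx], y[test_idx]))
print('average training accuracy:', np.mean(train_scores))
print('average fold (test) accuracy:', np.mean(test_scores))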
K-fold Cross Validation
K-fold cross validation is the generalization of this idea: divide the dataset into K equal parts; each part in turn is treated as test data while the remaining (K-1) parts are used as training data. Since there are K parts, we build K models, and the average of their errors gives an estimate of the testing error.
Which model to choose?
- Choose the model with the least error and the least complexity,
- or a model with below-average error that is simpler (fewer parameters).
- Finally, use the complete data to build a model with the chosen number of parameters.
Note
It is better to choose K between 5 and 10, which gives 80% to 90% of the data for training and the remaining 20% to 10% as hold-out data.
LAB – K-fold Cross Validation
- Build a tree model on the Fiberbits data.
- Try to build the best model by making all possible adjustments to the parameters.
- What is the accuracy of the above model?
- Perform 10-fold cross validation. What is the final accuracy?
- Perform 20-fold cross validation. What is the final accuracy?
- What can be the expected accuracy on the unknown dataset?
Solution
## Defining the model parameters
tree_KF = tree.DecisionTreeClassifier(criterion='gini',
splitter='best',
max_depth=30,
min_samples_split=30,
min_samples_leaf=30,
max_leaf_nodes=60)
#Importing KFold and cross_val_score (these lived in sklearn.cross_validation in older sklearn versions)
from sklearn.model_selection import KFold, cross_val_score
#Simple K-Fold cross validation. 10 folds.
kfold = KFold(n_splits=10)
## Checking the accuracy of the model on the 10 folds
score10 = cross_val_score(tree_KF, X, y, cv=kfold)
score10
score10
#Mean accuracy of 10-fold
score10.mean()
#Simple K-Fold cross validation. 20 folds.
kfold = KFold(n_splits=20)
#Accuracy scores of the 20-fold model
score20 = cross_val_score(tree_KF, X, y, cv=kfold)
score20
#Mean accuracy of 20-fold
score20.mean()
Bootstrap Cross Validation
Bootstrapping is a powerful tool for getting an idea of the accuracy of a model and its test error. It is very widely used and gives a good picture of the overall error; it can also be used for building stable models. The results coming out of the bootstrap can be trusted and accepted easily. The main use of the bootstrap is to estimate the future performance of a given model on new data that has not yet been seen.
Algorithm
- We have training data of size N.
- Draw a random sample with replacement of size N. This gives a new dataset; it may contain repeated observations, and some observations may not appear at all.
- Create B such new datasets. These are called bootstrap datasets.
- Build a model on each of the B datasets; we can then test the models on the original training dataset.
Bootstrap Method
Here we take the data in a slightly different manner: we take bootstrap samples. Suppose the training data size is N, say 100,000 observations. We draw a random sample with replacement of size N, one observation at a time: take a sample point, note it down, and put it back into the dataset, so the dataset still has 100,000 observations; then draw the next point. Repeating this 100,000 times gives a brand-new dataset, because we sampled with replacement: some observations may appear twice or three times and some may not be picked at all, even though we drew 100,000 points. That is one bootstrap sample of size 100,000. If we draw B such samples, they are called B bootstrap samples. Now, instead of one training dataset, we have B training datasets. On each of them we can build a model, then test every model on the original training dataset to get an idea of the overall accuracy of the model.
Bootstrap Example
- We have training data of size 500.
- Bootstrap Data-1:
- Create a dataset of size 500. To create this dataset, draw a random point, note it down, then put it back. Draw another sample point and repeat this process 500 times. This makes a dataset of size 500; call it Bootstrap Data-1.
- Multiple bootstrap datasets:
- Repeat the procedure above multiple times, say 200 times. Then we have 200 bootstrap datasets.
- We can build models on these 200 bootstrap datasets, and their average error gives a good idea of the overall error. We can even use the original training data as the test data for each of the models (see the sketch below).
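Here is the sketch referred to above: a rough, hand-rolled bootstrap using numpy (assuming X and y from earlier; the tree settings are only illustrative). Draw B samples with replacement, fit a tree on each, and score every model on the original training data.
#Sketch: manual bootstrap - sample with replacement, fit, score on the original data
import numpy as np
from sklearn import tree
rng = np.random.RandomState(0)
n = len(y)
B = 10                                          #number of bootstrap datasets
scores = []
for b in range(B):
    idx = rng.choice(n, size=n, replace=True)   #bootstrap indices (repeats allowed)
    model = tree.DecisionTreeClassifier(max_depth=10, min_samples_leaf=30)
    model.fit(X[idx], y[idx])
    scores.append(model.score(X, y))            #evaluate on the original training data
print('bootstrap accuracy estimates:', np.round(scores, 4))
print('average:', round(np.mean(scores), 4))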
LAB: Bootstrap Cross Validation
- Draw a bootstrap sample of sufficient size.
- Build a tree model and get an estimate of the true accuracy of the model.
Solution
# Defining the tree parameters
tree_BS = tree.DecisionTreeClassifier(criterion='gini',
splitter='best',
max_depth=30,
min_samples_split=30,
min_samples_leaf=30,
max_leaf_nodes=60)
# Defining the cross-validation splitter for 10 random train/test splits
# (ShuffleSplit draws random splits without replacement; it is used here as a convenient
#  stand-in for bootstrap validation, as in the older sklearn.cross_validation API)
from sklearn.model_selection import ShuffleSplit, cross_val_score
bootstrap = ShuffleSplit(n_splits=10, test_size=0.1, random_state=0)
###checking the error in the bootstrap models###
BS_score = cross_val_score(tree_BS, X, y, cv=bootstrap)
BS_score
#Expected accuracy according to bootstrap validation
BS_score.mean()
Conclusion
- We studied
- Validating a model, types of data and types of errors
- The problem of overfitting and the problem of underfitting
- Bias-Variance Tradeoff
- Cross validation and bootstrapping
- Training error is what we see, and it is not the true performance metric
- Test error plays a vital role in model selection
- R-squared, adjusted R-squared, Accuracy, ROC, AUC, AIC and BIC can be used to get an idea of the training error
- Cross validation and bootstrapping techniques give us an idea of the test error
- Choose the model based on a combination of AIC, cross validation and bootstrap results
- The bootstrap is widely used in ensemble models and random forests