
Handout – Model Selection and Cross Validation in Python

Before starting the lesson, please download the datasets.

Contents

  • Model Validation
  • Metrics for validating Classification Problems
  • Sensitivity and Specificity
  • Sensitivity vs Specificity
  • When Sensitivity is a High Priority
  • When Specificity is a High Priority
  • Receiver Operating Characteristic (ROC) Curve and its Interpretation
  • Area Under Curve (AUC)
  • What is meant by the best model
  • Model Selection
  • Types of data
  • Types of errors
  • Overfitting
  • The problem of overfitting
  • Underfitting
  • The problem of underfitting
  • Bias-Variance Tradeoff
  • Bias-Variance Decomposition
  • Choosing the optimal model
  • Cross validation techniques
    • Holdout data cross validation
    • Ten-fold cross validation
    • K-fold cross validation
    • Bootstrap cross validation
  • Conclusion

Model Validation

Model validation is simply checking how good our model is. In regression, model validation is done through R-squared and adjusted R-squared, whereas in classification techniques such as logistic regression we have used measures like the confusion matrix and accuracy. In addition to the confusion matrix and accuracy, there are many more validation metrics for classification problems.

Metrics for validating Classification Problems

  1. Confusion Matrix
  2. Sensitivity
  3. Specificity
  4. ROC curve
  5. Hosmer-Lemeshow Goodness-of-Fit Test
  6. KS (Kolmogorov-Smirnov) statistic
  7. Gini coefficient
  8. Concordance and discordance
  9. Chi-square
  10. Lift curve

All of these measure model accuracy in one way or another. Some metrics work particularly well for certain classes of problems. The confusion matrix, ROC curve and AUC will be sufficient for most business problems.

Sensitivity and Specificity

Sensitivity and specificity are derived from the confusion matrix. Sensitivity is the percentage of positives that are successfully classified as positive, i.e., out of all actual positives, how many we have correctly classified as positive. Sensitivity is highest when there are no false negatives. Specificity is the percentage of negatives that are successfully classified as negative; it is highest when there are no false positives.

  • Accuracy = (TP+TN)/(TP+FP+FN+TN)
  • Misclassification Rate = (FP+FN)/(TP+FP+FN+TN)
  • Sensitivity = TP/(TP+FN) : percentage of positives that are successfully classified as positive
  • Specificity = TN/(TN+FP) : percentage of negatives that are successfully classified as negative
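
As a minimal illustration (with made-up counts, not the Fiberbits data used below), these metrics follow directly from the four cells of a confusion matrix:

# Toy confusion-matrix counts -- illustrative only, not from the Fiberbits dataset
TP, FP, FN, TN = 40, 10, 5, 45

accuracy = (TP + TN) / (TP + FP + FN + TN)            # 0.85
misclassification = (FP + FN) / (TP + FP + FN + TN)   # 0.15
sensitivity = TP / (TP + FN)                          # 0.889 -- true positive rate
specificity = TN / (TN + FP)                          # 0.818 -- true negative rate
print(accuracy, misclassification, sensitivity, specificity)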

Calculating Sensitivity and Specificity

Building Logistic Regression Model

In [1]:
#Importing necessary libraries
import sklearn as sk
import pandas as pd
import numpy as np
import scipy as sp
In [2]:
#Importing the dataset
Fiber_df = pd.read_csv("datasets/Fiberbits/Fiberbits.csv")
###to see head and tail of the Fiber dataset
Fiber_df.head(5)
Out[2]:
   active_cust  income  months_on_network  Num_complaints  number_plan_changes  relocated  monthly_bill  technical_issues_per_month  Speed_test_result
0            0    1586                 85               4                    1          0           121                           4                 85
1            0    1581                 85               4                    1          0           133                           4                 85
2            0    1594                 82               4                    1          0           118                           4                 85
3            0    1594                 82               4                    1          0           123                           4                 85
4            1    1609                 80               4                    1          0           177                           4                 85
In [3]:
#Name of the columns/Variables
Fiber_df.columns
Out[3]:
Index(['active_cust', 'income', 'months_on_network', 'Num_complaints',
       'number_plan_changes', 'relocated', 'monthly_bill',
       'technical_issues_per_month', 'Speed_test_result'],
      dtype='object')
In [4]:
#Building and training a Logistic Regression model
import statsmodels.formula.api as sm
logistic1 = sm.logit(formula='active_cust~income+months_on_network+Num_complaints+number_plan_changes+relocated+monthly_bill+technical_issues_per_month+Speed_test_result', data=Fiber_df)
fitted1 = logistic1.fit()
fitted1.summary()
Optimization terminated successfully.
		         Current function value: 0.493647
		         Iterations 9
Out[4]:
Logit Regression Results
==============================================================================
Dep. Variable:            active_cust   No. Observations:               100000
Model:                          Logit   Df Residuals:                    99991
Method:                           MLE   Df Model:                            8
Date:                Fri, 18 Nov 2016   Pseudo R-squ.:                  0.2748
Time:                        19:16:40   Log-Likelihood:                -49365.
converged:                       True   LL-Null:                       -68074.
                                        LLR p-value:                     0.000
==============================================================================
                               coef    std err          z      P>|z|   [95.0% Conf. Int.]
Intercept                  -17.6101      0.301    -58.538      0.000    -18.200   -17.020
income                       0.0017   8.21e-05     20.820      0.000      0.002     0.002
months_on_network            0.0288      0.001     28.654      0.000      0.027     0.031
Num_complaints              -0.6865      0.030    -22.811      0.000     -0.746    -0.628
number_plan_changes         -0.1896      0.008    -24.940      0.000     -0.205    -0.175
relocated                   -3.1626      0.040    -79.927      0.000     -3.240    -3.085
monthly_bill                -0.0022      0.000    -13.995      0.000     -0.003    -0.002
technical_issues_per_month  -0.3904      0.007    -54.581      0.000     -0.404    -0.376
Speed_test_result            0.2222      0.002     93.435      0.000      0.218     0.227
In [5]:
###predicting values
predicted_values1 = fitted1.predict(Fiber_df[['income', 'months_on_network', 'Num_complaints', 'number_plan_changes', 'relocated', 'monthly_bill', 'technical_issues_per_month', 'Speed_test_result']])
predicted_values1[1:10]
Out[5]:
array([ 0.83701059,  0.83271114,  0.83117449,  0.80896979,  0.8520262 ,
0.82713018,  0.85504571,  0.85131352,  0.85537857])
In [6]:
### Converting predicted values into classes using threshold
threshold=0.5
predicted_class1=np.zeros(predicted_values1.shape)
predicted_class1[predicted_values1>threshold]=1
predicted_class1
Out[6]:
array([ 1.,  1.,  1., ...,  1.,  1.,  1.])
In [7]:
#Confusion matrix, Accuracy, sensitivity and specificity
from sklearn.metrics import confusion_matrix
cm1 = confusion_matrix(Fiber_df[['active_cust']],predicted_class1)
print('Confusion Matrix : \n', cm1)
total1=sum(sum(cm1))
#####from confusion matrix calculate accuracy
accuracy1=(cm1[0,0]+cm1[1,1])/total1
print ('Accuracy : ', accuracy1)
sensitivity1 = cm1[0,0]/(cm1[0,0]+cm1[0,1])
print('Sensitivity : ', sensitivity1 )
specificity1 = cm1[1,1]/(cm1[1,0]+cm1[1,1])
print('Specificity : ', specificity1)
Confusion Matrix : 
 [[29492 12649]
 [10847 47012]]
Accuracy :  0.76504
Sensitivity :  0.699841009943
Specificity :  0.812527005306

Changing Threshold to 0.8

In [8]:
### Converting predicted values into classes using new threshold
threshold=0.8
predicted_class1=np.zeros(predicted_values1.shape)
predicted_class1[predicted_values1>threshold]=1
predicted_class1
Out[8]:
array([ 1.,  1.,  1., ...,  1.,  1.,  1.])

Change in Confusion Matrix, Accuracy and Sensitivity-Specificity

In [9]:
#Confusion matrix, Accuracy, sensitivity and specificity
from sklearn.metrics import confusion_matrix
cm1 = confusion_matrix(Fiber_df[['active_cust']],predicted_class1)
print('Confusion Matrix : \n', cm1)
total1=sum(sum(cm1))
#####from confusion matrix calculate accuracy
accuracy1=(cm1[0,0]+cm1[1,1])/total1
print ('Accuracy : ', accuracy1)
sensitivity1 = cm1[0,0]/(cm1[0,0]+cm1[0,1])
print('Sensitivity : ', sensitivity1 )
specificity1 = cm1[1,1]/(cm1[1,0]+cm1[1,1])
print('Specificity : ', specificity1)
Confusion Matrix : 
 [[37767  4374]
 [30521 27338]]
Accuracy :  0.65105
Sensitivity :  0.896205595501
Specificity :  0.472493475518

Sensitivity vs Specificity

By changing the threshold, the classification of customers changes, and hence the sensitivity and specificity change. Which of these two should we maximize? What should the ideal threshold be? Ideally we want to maximize both sensitivity and specificity, but that is not always possible; there is always a tradeoff. Sometimes we want to be 100% sure about the predicted negatives; sometimes we want to be 100% sure about the predicted positives. Sometimes we simply don't want to compromise on sensitivity, and sometimes we don't want to compromise on specificity. The threshold is set based on the business problem.
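
A minimal sketch (reusing predicted_values1 and the actual labels from the cells above) of how the two rates move in opposite directions as the threshold changes:

# Sweep a few thresholds and watch the sensitivity/specificity tradeoff.
# Uses predicted_values1 and Fiber_df from the cells above; the metric
# definitions follow the same confusion-matrix cell positions used earlier.
from sklearn.metrics import confusion_matrix

actual = Fiber_df['active_cust']
for threshold in [0.3, 0.5, 0.7, 0.8]:
    predicted_class = (predicted_values1 > threshold).astype(int)
    cm = confusion_matrix(actual, predicted_class)
    sensitivity = cm[0, 0] / (cm[0, 0] + cm[0, 1])
    specificity = cm[1, 1] / (cm[1, 0] + cm[1, 1])
    print(threshold, round(sensitivity, 3), round(specificity, 3))
# As the threshold increases, one of the two rates improves while the other drops.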

When Sensitivity is a High Priority

Imagine that we are building a model to predict bad or default customers before issuing a loan. The profit on one good customer's loan is not equal to the loss on one bad customer's loan; one bad customer is not equal to one good customer.

  • If p is the probability of default, we would like to set our threshold in such a way that we don't miss any of the bad customers.
  • We set the threshold in such a way that sensitivity is high.
  • We can compromise on specificity here. If we wrongly reject a good customer, our loss is very small compared with the loss from giving a loan to a bad customer.
  • We don't really worry about the good customers here; they are not harmful, hence we can accept lower specificity.

When Specificity is a High Priority

  • Testing whether a medicine is good or poisonous

In this case, we really have to avoid situations where the medicine is actually poisonous but the model predicts it as good.

  • We can't take any chances here.
  • Specificity needs to be near 100%.
  • Sensitivity can be compromised here. Not using a good medicine is far less harmful than the reverse.

Sensitivity vs Specificity – Importance

  • There are cases where sensitivity is important and needs to be near 1.
  • There are business cases where specificity is important and needs to be near 1.
  • We need to understand the business problem and decide the relative importance of sensitivity and specificity.

ROC Curve

The ROC (Receiver Operating Characteristic) curve is drawn by taking the false positive rate on the X-axis and the true positive rate on the Y-axis. The ROC curve tells us how many mistakes we are making in order to identify all the positives.

ROC Curve – Interpretation

  • How many mistakes are we making to identify all the positives?
  • How many mistakes are we making to identify 70%, 80% or 90% of the positives?
  • 1-Specificity (the false positive rate) gives us an idea of the mistakes we are making.
  • Ideally, we would like to make 0% mistakes while identifying 100% of the positives.
  • We would like to make very few mistakes while identifying the maximum number of positives.
  • We want the curve to be far away from the diagonal straight line.
  • Ideally we want the area under the curve to be as high as possible.

AUC

We want the curve to be far away from the diagonal line and, ideally, the area under the curve to be as high as possible. The ROC curve shows the overall performance of the model graphically, whereas AUC quantifies it as a single number from which we can directly tell whether the model is good or bad. We want to make almost 0% mistakes while identifying all the positives, which means we want the AUC value to be near 1.

  • AUC is near to 1 for a good model

ROC and AUC Calculation

Building a Logistic Regression Model

In [10]:
###for visualising the plots use matplotlib and import roc_curve,auc from sklearn.metrics 
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
%matplotlib inline
actual = Fiber_df[['active_cust']]
false_positive_rate, true_positive_rate, thresholds = roc_curve(actual, predicted_values1)
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, true_positive_rate)
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate (Sensitivity)')
plt.xlabel('False Positive Rate (1-Specificity)')
plt.show()
In [11]:
###Threshold values used for the roc_curve can be viewed from threshold array
thresholds
Out[11]:
array([  2.00000000e+00,   1.00000000e+00,   9.99978894e-01, ...,
8.28263852e-03,   8.28015047e-03,   9.42770507e-04])
In [12]:
###Area under Curve-AUC
roc_auc = auc(false_positive_rate, true_positive_rate)
roc_auc
Out[12]:
0.83503740455417319

What is the best model? How do we build it?

  • A model with maximum accuracy / least error.
  • A model that uses the maximum information available in the given data.
  • A model that has minimum squared error.
  • A model that captures all the hidden patterns in the data.
  • A model that produces the best prediction results.

Model Selection

  • How do we build/choose the best model?
  • Error on the training data is not a good measure of performance on future data.
  • How do we select the best model out of the set of available models?
  • Are there any methods/metrics to choose the best model?
  • What is training error? What is testing error? What is holdout sample error?

LAB: The Most Accurate Model

  • Data: Fiberbits/Fiberbits.csv
  • Build a decision tree to predict active_user
  • What is the accuracy of your model?
  • Grow the tree as much as you can and achieve 95% accuracy.

Solution

In [13]:
#Preparing the X and y to train the model
features = list(Fiber_df.drop(['active_cust'],1).columns)
X = np.array(Fiber_df[features])
y = np.array(Fiber_df['active_cust'])
In [14]:
#Let's make a model by choosing some initial  parameters.
from sklearn import tree
tree_config = tree.DecisionTreeClassifier(criterion='gini', 
                                   splitter='best', 
                                   max_depth=10, 
                                   min_samples_split=1, 
                                   min_samples_leaf=30, 
                                   max_leaf_nodes=10)
In [15]:
#Training the model and finding the accuracy of the model                 
tree_config.fit(X,y)
tree_config.score(X,y)
Out[15]:
0.84972999999999999
The first decision tree we have built is giving us an accuracy of 84.97% on the training data. We will grow the tree to achieve 95% accuracy.
In [16]:
tree_config_new = tree.DecisionTreeClassifier(criterion='gini', 
                                              splitter='best', 
                                              max_depth=None, 
                                              min_samples_split=2, 
                                              min_samples_leaf=1, 
                                              max_leaf_nodes=None)
In [17]:
#Training the model and accuracy
tree_config_new.fit(X,y)
tree_config_new.score(X,y)
Out[17]:
0.99668999999999996

Different Type of Datasets and Errors

Datasets

There are two main types of datasets, plus a validation set that serves as a stand-in for the test set:

  1. Training set
  2. Test set
  3. Validation set

Training set

This is the input data used for building the model.

Test set

This is the unknown dataset on which the accuracy of the final model is measured. We may not have access to both datasets for every machine learning problem: we always have a training dataset, and sometimes we also have a test dataset. If only a training dataset is available, we can treat 90% of the available data as training data and the remaining 10% as validation data.

Validation set

This is a dataset kept aside for model validation and selection. It is a temporary substitute for the test dataset, not a third type of data. We create the validation data with the hope that the error rate on it gives us a basic idea of the test error. Once we have the training and validation datasets, we build the best model on the training set and check its accuracy; the error it makes on the training data is called the training error. A minimal split sketch follows.
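
A minimal sketch of carving a validation set out of the available data (a hypothetical 90/10 split; the LAB sections below use sklearn's train_test_split in the same way):

# Hypothetical 90/10 train/validation split; assumes X and y have been
# prepared from Fiber_df as in the cells above.
from sklearn.cross_validation import train_test_split   # sklearn.model_selection in newer versions

X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.9)

# Fit any model on (X_train, y_train); its error on (X_valid, y_valid)
# stands in for the unknown test error when no separate test set exists.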

Types of Errors

There are two types of errors. They are

  1. Training Error and
  2. Testing Error

Training error

  1. The error on training dataset
  2. In-time error
  3. Error on the known data
  4. Can be reduced while building the model

Testing error

  1. The error that matters
  2. Out-of-time error
  3. The error on unknown/new dataset.

“A good model will have both training and test errors very near to each other and close to zero.” If the model is very good on the training data but not good on the testing data, it is called overfitting.

Overfitting

  1. The model is very good on training data but not so good on test data.
  2. Low training error, high testing error.
  3. The model is overcomplicated, with too many predictors.
  4. A model with a lot of variance.

The Problem of Over Fitting

  • In search of the best model on the given data, we add many predictors, polynomial terms, interaction terms, variable transformations, derived variables, indicator/dummy variables, etc.
  • Most of the time we succeed in reducing the error. But what error is this?
  • By complicating the model we fit the best model for the training data.
  • Sometimes the error on the training data can reduce to near zero.
  • But the same model that is best on the training data fails miserably on test data.
  • Imagine building multiple models with small changes in the training data. The resulting set of models will have huge variance in their parameter estimates.
  • The model becomes so complicated that it is very sensitive to minimal changes in the data.
  • By complicating the model, the variance of the parameter estimates inflates.
  • The model tries to fit the irrelevant characteristics (noise) in the data.

LAB: Model with huge Variance

  • Data: Fiberbits/Fiberbits.csv
  • Take the initial 90% of the data and consider it as training data. Keep the final 10% of the records for validation.
  • Build the best model (5% error) on the training data.
  • Use the validation data to verify the error rate. Is the error rate on the training data the same as on the validation data?
In [18]:
#Splitting the dataset into training and testing datasets
X = np.array(Fiber_df[features])
y = np.array(Fiber_df['active_cust'])
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size = 0.9)
In [19]:
#Building model on training data.
tree_var = tree.DecisionTreeClassifier(criterion='gini', 
                                              splitter='best', 
                                              max_depth=20, 
                                              min_samples_split=2, 
                                              min_samples_leaf=1, 
                                              max_leaf_nodes=None)
tree_var.fit(X_train,y_train)
Out[19]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=20,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=None, splitter='best')
In [20]:
#Accuracy of the model on training data
tree_var.score(X_train,y_train)
Out[20]:
0.95315555555555553
Validation accuracy :
In [21]:
#Accuracy on the test data
tree_var.score(X_test,y_test)
Out[21]:
0.86550000000000005
  • Error rate on validation data is more than the training data error.

Under-fitting

If the model is oversimplified, we might not be capturing all the information in the data. If we really require 20 variables but use only 10, we might lose real information that is present in the dataset. If we don't do enough research, we can't build the best model for the given data. If the model is too simple, the training error itself will be high. The model needs to be complicated enough to capture all the information present; we can't oversimplify it. Losing information by oversimplifying the model is called the problem of under-fitting.

The Problem of Under-fitting

  • Simple models are better. That is true, but is it always true? Not necessarily.
  • We might have given up too early. Did we really capture all the information?
  • Did we do enough research and feature engineering to fit the best model? Is it the best model that can be fit on this data?
  • By being overcautious about variance in the parameters, we might miss out on some patterns in the data.
  • The model needs to be complicated enough to capture all the information present.
  • If the training error itself is high, how can we be sure about the model performance on unknown data?
  • Most accuracy and error statistics give us a clear idea of the training error; this is one advantage of under-fitting, we can identify it confidently.

Summary of Under-fitting

  1. A model that is too simple
  2. A model with scope for improvement
  3. A model with a lot of bias

LAB: Model with huge Bias

  • Let's simplify the model.
  • Take the high-variance model and prune it.
  • Make it as simple as possible.
  • Find the training error and the validation error.

Solution

In [22]:
#We can prune the tree by changing the parameters 
tree_bias = tree.DecisionTreeClassifier(criterion='gini', 
                                              splitter='best', 
                                              max_depth=10, 
                                              min_samples_split=30, 
                                              min_samples_leaf=30, 
                                              max_leaf_nodes=20)
tree_bias.fit(X_train,y_train)
Out[22]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
            max_features=None, max_leaf_nodes=20, min_samples_leaf=30,
            min_samples_split=30, min_weight_fraction_leaf=0.0,
presort=False, random_state=None, splitter='best')
In [23]:
#Training accuracy
tree_bias.score(X_train,y_train)
Out[23]:
0.85344444444444445
In [24]:
#Let's prune the tree further. Let's oversimplify the model
tree_bias1 = tree.DecisionTreeClassifier(criterion='gini', 
                                              splitter='random', 
                                              max_depth=1, 
                                              min_samples_split=100, 
                                              min_samples_leaf=100, 
                                              max_leaf_nodes=2)
tree_bias1.fit(X_train,y_train)
Out[24]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=1,
            max_features=None, max_leaf_nodes=2, min_samples_leaf=100,
            min_samples_split=100, min_weight_fraction_leaf=0.0,
presort=False, random_state=None, splitter='random')
In [25]:
#Training Accuracy of new model
tree_bias1.score(X_train,y_train)
Out[25]:
0.68231111111111109
In [26]:
#Validation accuracy on test data
tree_bias1.score(X_test,y_test)
Out[26]:
0.68910000000000005

Model Bias and Variance

  • Overfitting
    • Low bias with high variance
    • Low training error – ‘low bias’
    • High testing error
    • Unstable model – ‘high variance’
    • The coefficients of the model change with small changes in the data
  • Underfitting
    • High bias with low variance
    • High training error – ‘high bias’
    • Testing error almost equal to training error
    • Stable model – ‘low variance’
    • The coefficients of the model don’t change with small changes in the data

Observations

Look at the bottom-left panel: if our aim is to hit the center of the circle and all our shots land near the center, that is a model with low bias and low variance. In the second panel at the bottom, all the points are very near to each other but far away from the center of the circle; this indicates low variance with high bias. In the top-left panel, the points are spread around the center but none of them is close to it, so this model has low bias with high variance. In the top-right panel, the points are far from the center of the circle and also far from each other; this indicates high bias with high variance.

The Bias-Variance Decomposition

 

Let Y = f(X) + \epsilon with Var(\epsilon) = \sigma^2. The expected squared error of a fitted model \hat{f} at a point x_0 decomposes as

    E[(Y - \hat{f}(x_0))^2 \mid X = x_0] = \sigma^2 + [E\hat{f}(x_0) - f(x_0)]^2 + E[(\hat{f}(x_0) - E\hat{f}(x_0))^2]
                                         = \sigma^2 + \mathrm{Bias}^2(\hat{f}(x_0)) + \mathrm{Var}(\hat{f}(x_0))

  • Overall model squared error = Irreducible error + Bias^2 + Variance
  • The overall error is made up of bias and variance together.
  • High bias with low variance, and low bias with high variance, are both bad for the overall accuracy of the model.
  • A good model needs to have low bias and low variance, or at least an optimal point where both of them are jointly low.
  • How do we choose such an optimal model, and how do we choose the optimal model complexity? (A small simulation of the decomposition is sketched below.)
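
As an illustrative (not authoritative) check of the decomposition, the simulation below repeatedly fits two polynomial models to noisy samples of a known f(x) and estimates bias² and variance of the prediction at a single point x0; the true function, noise level and polynomial degrees are arbitrary choices for this sketch:

# Minimal simulation of the bias-variance decomposition at one point x0.
# The true function f, the noise level sigma and the polynomial degrees
# are illustrative assumptions, not taken from the handout.
import numpy as np

rng = np.random.RandomState(0)
f = lambda x: np.sin(2 * x)      # "true" function
sigma = 0.3                      # sd of the irreducible noise
x0, n, runs = 1.0, 50, 2000

def predictions_at_x0(degree):
    """Fit a polynomial of the given degree to a fresh noisy sample
    each run and return all predictions at x0."""
    preds = np.empty(runs)
    for r in range(runs):
        x = rng.uniform(0, 2, n)
        y = f(x) + rng.normal(0, sigma, n)
        preds[r] = np.polyval(np.polyfit(x, y, degree), x0)
    return preds

for degree in (1, 7):            # a simple model vs a complex model
    preds = predictions_at_x0(degree)
    bias2 = (preds.mean() - f(x0)) ** 2
    var = preds.var()
    print("degree=%d  bias^2=%.4f  variance=%.4f  sum=%.4f" % (degree, bias2, var, bias2 + var))
# The simple model shows higher bias^2, the complex one higher variance;
# adding sigma^2 to either sum approximates the expected squared error at x0.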

Choosing the Optimal Model – Bias-Variance Tradeoff

Bias Variance Trade off

Observations

The figure above shows that as the model complexity increases, bias decreases. The variance first reduces as complexity increases, but as the complexity becomes very high, the variance of the model increases. The best spot is the complexity at which bias and variance are jointly at their lowest (optimal) value.

Test and Training Error

Observations

As the model complexity increases, the training error keeps reducing, while the test error may reduce initially but keeps increasing as the complexity grows further. If the model is not complex enough, both training and test errors will be high. The best spot is the complexity with the lowest (optimal) combination of training and test error; the sketch below sweeps the tree depth on the Fiberbits data to make this visible.
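
A minimal sketch (assuming X_train, X_test, y_train and y_test from the earlier 90/10 split are still in scope) that sweeps the tree depth and reports training versus holdout accuracy, so the under-fitting and over-fitting regions show up directly:

# Sweep max_depth and compare training vs holdout accuracy.
# Assumes X_train, X_test, y_train, y_test exist from the earlier split.
from sklearn import tree

for depth in [1, 3, 5, 10, 20, 30]:
    model = tree.DecisionTreeClassifier(max_depth=depth)
    model.fit(X_train, y_train)
    print("max_depth=%2d  train=%.3f  holdout=%.3f"
          % (depth, model.score(X_train, y_train), model.score(X_test, y_test)))
# Training accuracy keeps climbing with depth; holdout accuracy improves at
# first and then flattens or drops -- the widening gap is the variance.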

Choosing Optimal Model

Unfortunately, there is no purely scientific method of choosing the optimal model complexity that guarantees minimum test error. The training error is not a good estimate of the test error, and there is always a bias-variance tradeoff in choosing the appropriate complexity of the model. We can use cross validation, bootstrapping and bagging to choose an optimal and consistent model.

Holdout Data Cross Validation

Using holdout data, or cross validation, is one of the best ways of finding the actual error in a model. As we know, the training error doesn't give us the true error of the model; we may overfit or underfit. The cross validation error gives a much better estimate of the final accuracy of the model, which we can use to choose the model complexity and build a better model. A model that performs well on training data and equally well on testing data is preferred. If test data is not given, we split the training data into two parts (maybe 80%-20% or 90%-10%); the first part is used to build the model and the second part is used to validate it.

LAB: Holdout Data Cross Validation

  • Data: Fiberbits/Fiberbits.csv
  • Take a random sample with 80% of the data as the training sample.
  • Use the remaining 20% as the holdout sample.
  • Build a model on the 80% and validate it on the holdout sample.
  • Increase or reduce the complexity and choose the best model that performs well on the training data as well as on the holdout data.

Solution

In [27]:
#Splitting data into 80:20::train:test
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size = 0.8)
In [28]:
#Defining tree parameters and training the tree
tree_CV = tree.DecisionTreeClassifier(criterion='gini', 
                                              splitter='best', 
                                              max_depth=20, 
                                              min_samples_split=2, 
                                              min_samples_leaf=1)
tree_CV.fit(X_train,y_train)
Out[28]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=20,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=None, splitter='best')
In [29]:
#Training score
tree_CV.score(X_train,y_train)
Out[29]:
0.95631250000000001
In [30]:
#Validation Accuracy on test data
tree_CV.score(X_test,y_test)
Out[30]:
0.85909999999999997
Improving the above model:
In [31]:
tree_CV1 = tree.DecisionTreeClassifier(criterion='gini', 
                                              splitter='best', 
                                              max_depth=10, 
                                              min_samples_split=30, 
                                              min_samples_leaf=30, 
                                              max_leaf_nodes=30)
tree_CV1.fit(X_train,y_train)
Out[31]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
            max_features=None, max_leaf_nodes=30, min_samples_leaf=30,
            min_samples_split=30, min_weight_fraction_leaf=0.0,
presort=False, random_state=None, splitter='best')
In [32]:
#Training score of this pruned tree model
tree_CV1.score(X_train,y_train)
Out[32]:
0.85914999999999997
In [33]:
#Validation score of pruned tree model
tree_CV1.score(X_test,y_test)
Out[33]:
0.85624999999999996
The model above gives almost the same accuracy on the training and holdout data.

Ten-fold Cross Validation

Divide the whole data into 10 parts. Use 9 parts (90%) as training data and the tenth part (10%) as holdout data. Repeat this whole process 10 times, building 10 models; the average error across these models gives a much more reliable estimate of the true error than any single split.

Working

Consider the bars above as the overall data, divided into 10 parts. While building Model 1, the last part of the first bar is taken as test data and the remaining 9 parts as training data, and we record the training error and the test error. While building Model 2, parts 1 to 8 and part 10 are used as training data and the 9th part as test data, and again we record both errors. This process is repeated until the 1st part has been used as the test data. In this way, 10-fold cross validation produces 10 error estimates; averaging them gives the overall training error and test error, which gives us a good idea of the overall accuracy of the model we are building.

K-fold Cross Validation

K-fold cross validation is a generalization of this idea: divide the dataset into k equal parts, treat each part in turn as the test data, and use the remaining (k-1) parts as training data. Since there are k parts we build k models, and the average of their errors gives an idea of the testing error. A minimal manual version of the loop is sketched below.
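
A minimal manual sketch of the k-fold loop, using the era-appropriate sklearn.cross_validation.KFold as in the LAB below (newer sklearn exposes KFold in sklearn.model_selection with a slightly different call):

# Manual k-fold loop: build k models and average their holdout accuracies.
# Assumes X and y are the arrays prepared earlier from Fiber_df.
import numpy as np
from sklearn import tree
from sklearn.cross_validation import KFold   # sklearn.model_selection.KFold in newer versions

k = 5
scores = []
for train_idx, test_idx in KFold(len(y), n_folds=k):
    model = tree.DecisionTreeClassifier(max_depth=10, min_samples_leaf=30)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

print(np.mean(scores))   # average holdout accuracy across the k folds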

Which model to choose?

  • Choose the model with the least error and the least complexity,
  • or a model with below-average error that is simpler (fewer parameters).
  • Finally, use the complete data and build a model with the chosen parameters (a small comparison sketch follows).
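
For illustration only, a hedged sketch that compares two candidate tree configurations by their mean cross-validated accuracy and keeps the simpler one unless the complex one is clearly better (the depths and the 0.01 margin are arbitrary choices):

# Compare two candidate models by mean cross-validated accuracy.
# Assumes X and y are the arrays prepared earlier from Fiber_df.
from sklearn import tree, cross_validation   # cross_val_score is in sklearn.model_selection in newer versions

simple_tree  = tree.DecisionTreeClassifier(max_depth=5,  min_samples_leaf=30)
complex_tree = tree.DecisionTreeClassifier(max_depth=30, min_samples_leaf=30)

simple_cv  = cross_validation.cross_val_score(simple_tree,  X, y, cv=10).mean()
complex_cv = cross_validation.cross_val_score(complex_tree, X, y, cv=10).mean()

# Prefer the simpler model unless the complex one is clearly better.
best = complex_tree if complex_cv > simple_cv + 0.01 else simple_tree
best.fit(X, y)   # refit the chosen configuration on the complete data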

Note

It is better to choose K between 5 and 10. This gives 80% to 90% of the data for training and the remaining 10% to 20% as holdout data.

LAB – K-fold Cross Validation

  • Build a tree model on the Fiberbits data.
  • Try to build the best model by making all the possible adjustments to the parameters.
  • What is the accuracy of the above model?
  • Perform 10-fold cross validation. What is the final accuracy?
  • Perform 20-fold cross validation. What is the final accuracy?
  • What accuracy can be expected on an unknown dataset?

Solution

In [34]:
## Defining the model parameters
from sklearn import tree
tree_KF = tree.DecisionTreeClassifier(criterion='gini', 
                                              splitter='best', 
                                              max_depth=30, 
                                              min_samples_split=30, 
                                              min_samples_leaf=30, 
                                              max_leaf_nodes=60)
In [35]:
#Importing kfold from cross_validation
from sklearn.cross_validation import KFold
In [36]:
#Simple K-Fold cross validation. 10 folds.
kfold = KFold(len(Fiber_df), n_folds=10)
In [37]:
## Checking the accuracy of model on 10-folds
from sklearn import cross_validation
score10 = cross_validation.cross_val_score(tree_KF,X, y,cv=kfold)
score10
Out[37]:
array([ 0.8358,  0.703 ,  0.6184,  0.8047,  0.8385,  0.7994,  0.7675,
0.7507,  0.7913,  0.7206])
In [38]:
#Mean accuracy of 10-fold
score10.mean()
Out[38]:
0.76299000000000006
In [39]:
#Simple K-Fold cross validation. 20 folds.
kfold = KFold(len(Fiber_df), n_folds=20)
In [40]:
#Accuracy score of 20-fold model
score20 = cross_validation.cross_val_score(tree_KF,X, y,cv=kfold)
score20
Out[40]:
array([ 0.9048,  0.781 ,  0.8288,  0.612 ,  0.283 ,  0.6676,  0.9226,
        0.7482,  0.907 ,  0.7866,  0.6784,  0.866 ,  0.8788,  0.911 ,
0.925 ,  0.7318,  0.9724,  0.7502,  0.6954,  0.7456])
In [41]:
#Mean accuracy of 20-fold
score20.mean()
Out[41]:
0.77981
With 10-fold cross validation we can expect an accuracy of 76.29%. With 20-fold cross validation we can expect an accuracy of 77.98%.

Bootstrap Cross Validation

Bootstrapping is a powerful tool for getting an idea of the accuracy of the model and the test error. Bootstrap cross validation is widely used; it gives a good idea of the overall error and can be used for building stable models. The results coming out of the bootstrap can be trusted and accepted easily. The main use of the bootstrap is to estimate the future performance of a given model on new data that has not yet been realized.

Algorithm

  1. We have training data of size N.
  2. Draw a random sample with replacement of size N. This gives a new dataset; it might have repeated observations, and some observations might not appear even once.
  3. Create B such new datasets. These are called bootstrap datasets.
  4. Build the model on these B datasets; we can test each model on the original training dataset.

Bootstrap Method

Here we take the data in a slightly different manner: we draw bootstrap samples. Suppose the training data size is N, say 100,000 observations. We draw a random sample with replacement of size N, one observation at a time: we pick a sample point, note it down and put it back, so the dataset still has 100,000 rows, then we pick again. Repeating this 100,000 times gives a brand new dataset. Because we sample with replacement, some observations may appear twice or thrice in the new dataset and some may never be picked at all. That is one bootstrap sample of size 100,000. If we draw B such samples, they are called B bootstrap samples. Now, instead of one training dataset, we have B training datasets. On each of them we can build a model, test it on the original training dataset, and get an idea of the overall accuracy of the model. A minimal sampling sketch follows.
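
A minimal sketch of drawing a single bootstrap sample with replacement and scoring the resulting model on the original data (numpy's choice is used here for the resampling; the LAB below uses sklearn's ShuffleSplit as a convenient stand-in for repeated resampling):

# Draw one bootstrap sample of the same size as the data (with replacement),
# fit a tree on it, and score the model on the original training data.
# Assumes X and y are the arrays prepared earlier from Fiber_df.
import numpy as np
from sklearn import tree

N = len(y)
rng = np.random.RandomState(0)
boot_idx = rng.choice(N, size=N, replace=True)   # some rows repeat, some never appear

model = tree.DecisionTreeClassifier(max_depth=10, min_samples_leaf=30)
model.fit(X[boot_idx], y[boot_idx])
print(model.score(X, y))                         # evaluate on the original data

# Repeating this for B bootstrap samples and averaging the scores gives the
# bootstrap estimate of the model's accuracy.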

Bootstrap Example

  • We have training data of size 500.
  • Bootstrap Data-1:
    • Create a dataset of size 500. To create this dataset, draw a random point, note it down, then replace it back. Draw another sample point, and repeat this process 500 times. This makes a dataset of size 500; call it Bootstrap Data-1.
  • Multiple bootstrap datasets:
    • Repeat the above procedure multiple times, say 200 times. Then we have 200 bootstrap datasets.
  • We can build models on these 200 bootstrap datasets, and the average error gives a good idea of the overall error. We can even use the original training data as the test data for each of the models.

LAB: Bootstrap Cross Validation

  • Draw a bootstrap sample with a sufficient sample size.
  • Build a tree model and get an estimate of the true accuracy of the model.

Solution

In [42]:
# Defining the tree parameters
tree_BS = tree.DecisionTreeClassifier(criterion='gini', 
                                              splitter='best', 
                                              max_depth=30, 
                                              min_samples_split=30, 
                                              min_samples_leaf=30, 
                                              max_leaf_nodes=60)
In [43]:
# Defining the bootstrap variable for 10 random samples
bootstrap=cross_validation.ShuffleSplit(n=len(Fiber_df), 
                                        n_iter=10, 
                                        random_state=0)
In [44]:
###checking the error in the Boot Strap models###
BS_score = cross_validation.cross_val_score(tree_BS,X, y,cv=bootstrap)
BS_score
Out[44]:
array([ 0.8658,  0.8699,  0.8658,  0.8655,  0.8694,  0.8741,  0.8689,
0.8689,  0.8639,  0.8672])
In [45]:
#Expected accuracy according to bootstrap validation
###checking the error in the Boot Strap models###
BS_score.mean()
Out[45]:
0.86793999999999993
With 10 bootstrap samples we can expect an accuracy of about 86.79%.

Conclusion

  • We studied:
    • Validating a model, types of data and types of errors
    • The problem of overfitting and the problem of underfitting
    • The bias-variance tradeoff
    • Cross validation and bootstrapping
  • The training error is what we see, and it is not the true performance metric.
  • The test error plays a vital role in model selection.
  • R-squared, adjusted R-squared, accuracy, ROC, AUC, AIC and BIC give us an idea of the training error.
  • Cross validation and bootstrapping techniques give us an idea of the test error.
  • Choose the model based on a combination of AIC, cross validation and bootstrap results.
  • The bootstrap is widely used in ensemble models and random forests.

