Link to the previous post: https://course.dvanalyticsmds.com/204-4-7-problem-of-overfitting/

The Problem of Under-fitting

Simpler models are generally preferred, but is that always true? Not necessarily.
We might have stopped improving the model too early. Did we really capture all the information in the data?
Did we do enough research and feature engineering to fit the best model? Is this the best model that can be fit on this data?
By being overcautious about variance in the parameters, we might miss some patterns in the data.
A model needs to be complex enough to capture all the information present.
If the training error itself is high, how can we be confident about the model's performance on unseen data?
Most accuracy and error statistics give a clear picture of the training error. This is one advantage of under-fitting: it can be identified with confidence.
Under-fitting
- A model that is too simple
- A model with scope for improvement
- A model with a lot of bias
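
As a rough illustration of the points above, the sketch below labels a model as under-fitted when its training accuracy is itself low. The helper name `diagnose_fit`, the target accuracy of 0.80 and the gap tolerance of 0.05 are assumptions made for this example, not part of the course material; sensible thresholds depend on the problem.

```python
def diagnose_fit(train_acc, val_acc, target_acc=0.80, gap_tol=0.05):
    """Rough heuristic label for a model's fit (hypothetical helper).

    target_acc and gap_tol are illustrative assumptions, not standard values.
    """
    if train_acc < target_acc:
        # High training error: the model misses patterns even on data it has seen.
        return "under-fit (high bias)"
    if train_acc - val_acc > gap_tol:
        # Good on training data but clearly worse on validation data.
        return "over-fit (high variance)"
    return "reasonable fit"


# Training and validation accuracies of about 0.68 and 0.69
print(diagnose_fit(0.682, 0.689))
```

The training accuracy is checked first because, as noted above, a high training error is enough on its own to identify under-fitting.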

Practice : Model with huge Bias

Let's simplify the model.
Take the high variance model and prune it.
Make it as simple as possible.
Find the training error and validation error.

Solution

In [22]:

#We can prune the tree by restricting its growth parameters
#(assumes `from sklearn import tree` and X_train, y_train from the earlier cells)
tree_bias = tree.DecisionTreeClassifier(criterion='gini',
                                        splitter='best',
                                        max_depth=10,
                                        min_samples_split=30,
                                        min_samples_leaf=30,
                                        max_leaf_nodes=20)
tree_bias.fit(X_train,y_train)

Out[22]:

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
            max_features=None, max_leaf_nodes=20, min_samples_leaf=30,
            min_samples_split=30, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [23]:

#Training accuracy
tree_bias.score(X_train,y_train)

Out[23]:

0.85344444444444445

In [24]:

#Let's prune the tree further and oversimplify the model
tree_bias1 = tree.DecisionTreeClassifier(criterion='gini', 
                                              splitter='random', 
                                              max_depth=1, 
                                              min_samples_split=100, 
                                              min_samples_leaf=100, 
                                              max_leaf_nodes=2)
tree_bias1.fit(X_train,y_train)

Out[24]:

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=1,
            max_features=None, max_leaf_nodes=2, min_samples_leaf=100,
            min_samples_split=100, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='random')

In [25]:

#Training Accuracy of new model
tree_bias1.score(X_train,y_train)

Out[25]:

0.68231111111111109

In [26]:

#Validation accuracy on test data
tree_bias1.score(X_test,y_test)

Out[26]:

0.68910000000000005
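
The depth-1 tree above scores about 0.68 on both the training and the test data. Low and nearly equal accuracies like these are the signature of high bias. The sketch below repeats the comparison for several depths. It is only an illustration under stated assumptions: the data comes from scikit-learn's `make_classification` rather than the course data set, and the sample size, feature counts and depths tried are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): the course data set is not used here.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3,
                                            random_state=0)

scores = {}
for depth in (1, 3, 5):
    # Smaller max_depth means a simpler, more heavily pruned tree.
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_tr, y_tr)
    # Store (training accuracy, validation accuracy) for this depth.
    scores[depth] = (model.score(X_tr, y_tr), model.score(X_val, y_val))
    print(depth, scores[depth])
```

If the training accuracy stays low at small depths and rises as the tree is allowed to grow, the shallow tree is too simple for the data.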

In the next post, we will discuss how to choose the optimal model using the bias-variance trade-off.

Link to the next post : https://course.dvanalyticsmds.com/204-4-9-model-bias-variance-tradeoff/

21st June 2017

204.4.8 Problem of Under-fitting

What happens if the model is Under-fitted? Huge Bias?

Dv Analytics
