
204.4.5 What is a Best Model?

What qualifies a model as the best?
Link to the previous post : https://course.dvanalyticsmds.com/204-4-4-roc-and-auc/

What is a best model? How do we build one?

  • A model with maximum accuracy / least error.
  • A model that uses the maximum information available in the given data.
  • A model that has minimum squared error.
  • A model that captures all the hidden patterns in the data.
  • A model that produces the best prediction results.
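
As a rough illustration of the first and third criteria, the sketch below computes accuracy, error rate, and mean squared error for a handful of predictions. The label arrays are made up for this example and do not come from the course data.

```python
import numpy as np

# Hypothetical true labels and predictions, made up for illustration only
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Accuracy: fraction of predictions that match the true labels
accuracy = np.mean(y_true == y_pred)

# Error rate is the complement of accuracy
error_rate = 1.0 - accuracy

# Mean squared error; for 0/1 labels it equals the error rate
mse = np.mean((y_true - y_pred) ** 2)

print("accuracy:", accuracy, "error rate:", error_rate, "MSE:", mse)
```

For 0/1 class labels, maximum accuracy and minimum squared error point to the same model; the two criteria differ once the model outputs probabilities or continuous values.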

Model Selection

  • How do we build or choose the best model?
  • Error on the training data is not a good measure of performance on future data.
  • How do we select the best model out of the set of available models?
  • Are there any methods/metrics to choose the best model?
  • What is training error? What is testing error? What is hold out sample error?
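
The difference between training error and hold-out (testing) error can be sketched as below. This is a minimal example under stated assumptions: the data is synthetic, generated with scikit-learn's make_classification as a stand-in for real data, and the 70/30 split and the names X_demo and y_demo are choices made for this illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: this is NOT the Fiberbits data)
X_demo, y_demo = make_classification(n_samples=2000, n_features=8,
                                     n_informative=4, flip_y=0.1,
                                     random_state=0)

# Hold out 30% of the rows; the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Training error: error on the same rows the model was fitted on
train_error = 1 - model.score(X_train, y_train)

# Testing / hold-out error: error on rows kept aside from training
test_error = 1 - model.score(X_test, y_test)

print("training error:", train_error)
print("hold-out error:", test_error)
```

The hold-out error is the more honest estimate of how the model would behave on future data, because those rows played no part in fitting the tree.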

Practice : The Most Accurate Model

  • Data: Fiberbits/Fiberbits.csv
  • Build a decision tree to predict active_cust
  • What is the accuracy of your model?
  • Grow the tree as much as you can and achieve 95% accuracy.

Solution

In [13]:
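#Preparing the X and y to train the model
#Fiber_df is assumed to be the pandas DataFrame loaded earlier from Fiberbits/Fiberbits.csv
import numpy as np

#All columns except the target are used as features
features = list(Fiber_df.drop(columns=['active_cust']).columns)

X = np.array(Fiber_df[features])
y = np.array(Fiber_df['active_cust'])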
In [14]:
#Let's make a model by choosing some initial parameters.
#Note: recent scikit-learn versions require min_samples_split to be at least 2.
from sklearn import tree

tree_config = tree.DecisionTreeClassifier(criterion='gini',
                                   splitter='best',
                                   max_depth=10,
                                   min_samples_split=2,
                                   min_samples_leaf=30,
                                   max_leaf_nodes=10)
In [15]:
#Training the model and finding its accuracy on the training data
tree_config.fit(X,y)
tree_config.score(X,y)
Out[15]:
0.84972999999999999

The first decision tree we built gives an accuracy of 84.97% on the training data. Next, we remove the growth limits (depth, leaf size, number of leaves) so that the tree can grow far enough to reach 95% accuracy.

In [16]:
tree_config_new = tree.DecisionTreeClassifier(criterion='gini', 
                                              splitter='best', 
                                              max_depth=None, 
                                              min_samples_split=2, 
                                              min_samples_leaf=1, 
                                              max_leaf_nodes=None)
In [17]:
#Training the model and finding its accuracy on the training data
tree_config_new.fit(X,y)
tree_config_new.score(X,y)
Out[17]:
0.99668999999999996

It seems to be all about accuracy: the higher the accuracy, the better the model appears. The fully grown tree reaches 99.67% accuracy on the training data. But high accuracy comes at a price too, as we will see in the next posts.
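
One way to see how much a tree has grown is to compare its depth and leaf count before and after removing the limits. The sketch below does this on synthetic data (an assumption: it is a stand-in, not the Fiberbits data), so the printed numbers will not match the outputs in this post. It relies on the get_depth() and get_n_leaves() methods of scikit-learn's DecisionTreeClassifier.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: this is NOT the Fiberbits data)
X_demo, y_demo = make_classification(n_samples=2000, n_features=8,
                                     n_informative=4, flip_y=0.1,
                                     random_state=0)

# Constrained tree, using the same limits as the first model in this post
small_tree = DecisionTreeClassifier(max_depth=10, min_samples_leaf=30,
                                    max_leaf_nodes=10, random_state=0)
small_tree.fit(X_demo, y_demo)

# Unconstrained tree: keeps splitting until leaves are pure
# or cannot be split any further
full_tree = DecisionTreeClassifier(random_state=0)
full_tree.fit(X_demo, y_demo)

for name, model in [("constrained", small_tree), ("fully grown", full_tree)]:
    print(name,
          "| depth:", model.get_depth(),
          "| leaves:", model.get_n_leaves(),
          "| training accuracy:", model.score(X_demo, y_demo))
```

The fully grown tree typically needs far more leaves to push training accuracy up, which is a first hint at the cost of chasing accuracy on the training data alone.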

The next post is about types of datasets, types of errors, and the problem of overfitting.

Link to the next post : https://course.dvanalyticsmds.com/204-4-6-type-of-datasets-type-of-errors-and-problem-of-overfitting/
