Decision Trees
Introduction
A decision tree is a supervised learning algorithm that is mostly used for classification problems. In this technique, we split the population or sample into two or more homogeneous sets based on the most significant differentiator among the input variables.
Contents
1. What is Segmentation?
2. Segmentation Business Problem
3. The Decision Tree Approach
4. The Splitting Criterion
5. Impurity or Diversity Measures
6. Entropy
7. Information Gain
8. Purity Measures
9. The Decision Tree Algorithm
10. Multiple Splits for a Single Variable
11. Modified Decision Tree Algorithm
12. The Problem of Overfitting
13. Pruning
14. Conclusion
What is Segmentation?
To understand segmentation, let us imagine a scenario where we want to run an SMS marketing campaign to attract more customers. We have a list of customers and we need to send them SMS messages, maybe with coupons, discounts, etc. We do not send a single generic SMS to all the customers, because customers differ:
- Some customers like to see high discount
- Some customers want to see a large collection of items
- Some customers are fans of particular brands
- Some customers are Male and some are Female
So, instead of sending one SMS to all the customers, if we send a customized SMS to each segment of customers, the marketing campaign will be more effective. When customers feel connected to the campaign or SMS, they become interested in buying the product, and then we can say the campaign went well. So we want to divide the customers based on their past purchases.
We divide the customers in such a way that customers inside a group are homogeneous, whereas customers across groups are heterogeneous; that is, customers in two different groups behave differently, so we might have to send them two different messages. To divide the customers into groups that satisfy this condition, we use an algorithm called a decision tree.
Segmentation Business Problem
The Data
Observations
The problem involves three fields: Gender, Marital Status and whether or not the product was ordered by the customer. Some customers are male and some are female; some are married and some are unmarried. We now have to decide: if a customer is male and married, will he order the product or not? Likewise, if a customer is female and unmarried, will she order the product or not?
Using the historical data, if someone has a higher probability of ordering the product then we might send them a different message, and if a customer has a very low probability of ordering then we send a message with a discount to attract them to buy the product.
If a customer already has a high probability of ordering, then we can try upselling by sending other offers, or cross-selling by showing them a different product so that they buy two products. In this way we can improve our business.
To solve the problem, we first have to rearrange the data.
Re-Arranging the data
Observations
From the above result we notice that there are 14 customers in total. Among these 14 customers, 10 did not order the product and 4 did. There are 8 males and 6 females. Among the 8 males, 6 are married and 2 are unmarried; one married male and one unmarried male ordered the product. Among the 6 females, 3 are married and 3 are unmarried. All the unmarried females ordered the product, whereas the married females did not.
This analysis clearly shows that unmarried females have a high probability of buying the product, whereas married females are not interested. On the other side, married males are not interested in buying the product, whereas 50% of unmarried males are interested.
Therefore,
Married Males won't buy the product whereas
Unmarried Females will buy the product
The Decision Tree Approach
The aim is to divide the dataset into segments. Each segment needs to be useful for business decision making; that is, each segment should be pure. If we segment the whole population into two groups, one group should contain the buyers and the other the non-buyers, so that we can send different messages to buyers and non-buyers.
Example Sales Segmentation Based on Age
Observations
Let us consider 100 customers to start with. If we divide these 100 customers based on Age, we get two segments: Young and Old. From the above picture we notice that the Young segment has 60 people, whereas the Old segment has 40 people. Within the 60 young people, 31 are buying and 29 are not. Within the 40 old customers, 19 are buying and 21 are not. At the overall level, 50% are buying and 50% are not, counting both Young and Old. Even after dividing the customers based on Age, each segment still shows roughly 50% buying and 50% not buying, so dividing the whole population based on Age doesn't really help us. Now let us try dividing the whole population based on Gender and see whether this attribute splits the population in a better way.
Example Sales Segmentation Based on Gender
Observations
Here the whole population is divided into two segments (Male and Female) based on Gender. There are 60 male customers and 40 female customers. Within the 60 male customers, 48 are buying and 12 are not; within the 40 female customers, 2 are buying and 38 are not. Overall, out of the 50 customers who bought the product, 48 are male and only 2 are female. Within males there is an 80% chance of buying and within females there is only a 5% chance of buying. When we divide the whole population based on Gender, we can make really good business inferences: most of the buyers are male customers and most of the non-buyers are female customers. So, dividing the whole population based on Gender actually gives us a better split and a better intuition for a business strategy.
Main Questions
- We are looking for pure segments
- Dataset has many attributes
- Which is the right attribute for pure segmentation?
- Can we start with any attribute?
- Which attribute to start with? – The best separating attribute
- Customer age can impact sales, gender can impact sales, and customer location and demographics can impact sales. How do we identify the best attribute and the best split?
The Splitting Criterion
The best split is the one that does the best job of separating the data into groups where each group is dominated by a single class; that is, if we divide the whole population into groups, one group should contain all the buyers and the other all the non-buyers. Let us see this concept clearly through the sales segmentation based on Age and the sales segmentation based on Gender.
Example Sales Segmentation Based on Age
Observations
At the root level there is a 50% chance of buying and a 50% chance of not buying. After the split, the Young group is 52% positive and 48% negative, whereas the Old group is 52% negative and 48% positive. This is not really giving us pure splits, because we did not gain much information here: at the root level there are 100 customers with 50% buying and 50% not buying, and even at the split level both the Young and Old segments still have roughly a 50% chance of buying and a 50% chance of not buying. So Age is not really a good splitting variable.
Example Sales Segmentation Based on Gender
Observations
At the root level there is again a 50% chance of buying and a 50% chance of not buying, but at the individual splits the male population has an 80% chance of buying whereas the female population has only a 5% chance of buying; that means 95% of the female population is not interested in buying the product. This is a pure segment from the point of view of not buying, and the male population is a fairly pure segment from the point of view of buying. So this split is much better than the earlier split based on Age. We are looking for variables like Gender.
Impurity (Diversity) Measures
- We are looking for an impurity or diversity measure that gives a high score for the Age variable (high impurity while segmenting) and a low score for the Gender variable (low impurity while segmenting).
For calculating this impurity there is a mathematical formula called entropy.
Entropy
Entropy characterizes the impurity/diversity of a segment, so the impurity measure we are looking for is entropy. Entropy is a measure of uncertainty/impurity/diversity; it measures the amount of information in a message. If S is a segment of training examples, then
Entropy(S) = -p+ log2(p+) - p- log2(p-)
where
p+ is the probability of the positive class and
p- is the probability of the negative class.
Entropy is highest when the split has p of 0.5.
Entropy is lowest when the split is pure, i.e., p of 1.
Note
If entropy is low we will get better segmentation
If entropy is high we will not get better segmentation
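To make the definition concrete, here is a minimal Python sketch of the formula above (the helper name entropy and the example probabilities are ours, purely for illustration):
from math import log2

def entropy(p_pos):
    # Entropy of a two-class segment, given the probability of the positive class
    if p_pos in (0, 1):          # a pure segment has zero entropy by convention
        return 0.0
    p_neg = 1 - p_pos
    return -p_pos * log2(p_pos) - p_neg * log2(p_neg)

print(entropy(0.5))   # 1.0 -> maximally impure segment
print(entropy(1.0))   # 0.0 -> pure segment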
Entropy is highest when the split has p of 0.5
- A 50-50 class ratio in a segment is really impure, hence the entropy is high
- Entropy(S) = -(0.5) log2(0.5) - (0.5) log2(0.5) = 0.5 + 0.5 = 1
Entropy is least when the split is pure, i.e., p of 1
- A 100-0 class ratio in a segment is really pure, hence the entropy is low
- Entropy(S) = -(1) log2(1) - (0) log2(0) = 0 - 0 = 0 (taking 0 · log2(0) = 0 by convention)
The less the entropy, the better the split
- Entropy is formulated in such a way that its value is high for impure segments and low for pure ones
Entropy Calculation – Example
- Entropy at root
- Total population at root: 100 [50+, 50-]
- Entropy(S) = -(50/100) log2(50/100) - (50/100) log2(50/100) = 1
- 100% impurity at the root
- Age splits the population into two segments
- Segment-1: Age = "Young"
- Segment-2: Age = "Old"
- Entropy at segment-1
- Age = "Young" segment has 60 records [31+, 29-]
- Entropy(S) = -(31/60) log2(31/60) - (29/60) log2(29/60) = 0.9991984 (99% impurity in this segment)
- Entropy at segment-2
- Age = "Old" segment has 40 records [19+, 21-]
- Entropy(S) = -(19/40) log2(19/40) - (21/40) log2(21/40) = 0.9981959 (99% impurity in this segment too)
LAB: Entropy Calculation – Example
- Calculate entropy at the root for the given population
- Calculate the entropy for the two distinct gender segments
Code - Entropy Calculation
from math import log2
# Entropy at root: [50+, 50-] -> 100% impurity
-(50/100)*log2(50/100) - (50/100)*log2(50/100)   # 1.0
# Male segment: [48+, 12-]
-(48/60)*log2(48/60) - (12/60)*log2(12/60)       # 0.7219281
# Female segment: [2+, 38-]
-(2/40)*log2(2/40) - (38/40)*log2(38/40)         # 0.2863970
Information Gain
Information gain gives us a better picture of a variable's usefulness as a splitting criterion: it compares the impurity before and after the segmentation.
- Information Gain = entropy before split - entropy after split
- An easy way to understand it: Information Gain = (overall entropy at the parent node) - (sum of weighted entropies at the child nodes)
- The attribute with the maximum information gain is the best split attribute (a small sketch of computing this follows below)
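As a quick illustration of this formula, here is a hedged Python sketch that computes the information gain of a split from raw class counts, reusing the entropy() helper sketched earlier (the function name and counts are ours; the numbers are the Gender split from the examples above):
def information_gain(parent_counts, child_counts_list):
    # parent_counts: (pos, neg); child_counts_list: one (pos, neg) pair per child node
    n = sum(parent_counts)
    before = entropy(parent_counts[0] / n)
    after = sum((p + q) / n * entropy(p / (p + q)) for p, q in child_counts_list)
    return before - after

# Root [50+, 50-]; Male child [48+, 12-]; Female child [2+, 38-]
print(information_gain((50, 50), [(48, 12), (2, 38)]))   # ~0.452, i.e., 45.2%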
Information Gain - Calculation
Split based on Age:
- Entropy overall = 100% (impurity)
- Entropy of the Young segment = 99%
- Entropy of the Old segment = 99%
- Information Gain for Age = 100 - (0.6 × 99 + 0.4 × 99) = 1
Split based on Gender:
- Entropy overall = 100% (impurity)
- Entropy of the Male segment = 72%
- Entropy of the Female segment = 29%
- Information Gain for Gender = 100 - (0.6 × 72 + 0.4 × 29) = 45.2
LAB: Information Gain
Calculate the information gain for each variable split.
Output-Information Gain
Split With Respect to 'Owning a car'
- Entropy([28+,39-]) overall = -(28/67) log2(28/67) - (39/67) log2(39/67) = 98% (impurity)
- Entropy([25+,4-]) owns a car = 57%
- Entropy([3+,35-]) no car = 40%
- Information Gain for 'Owning a car' = 98 - ((29/67) × 57 + (38/67) × 40) = 50.6
Split With Respect to ‘Gender’
- Entropy([19+,21-]) Male = 99%
- Entropy([9+,18-]) Female = 91%
- Information Gain for Gender = 98 - ((40/67) × 99 + (27/67) × 91) = 2.2
Other Purity (Diversity) Measures
- Chi-square measure of association
- Gini Index: Gini(T) = 1 - Σ pᵢ², where pᵢ is the proportion of class i in node T (a small sketch of the Gini computation follows this list)
- Information Gain Ratio
- Misclassification error
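For comparison with entropy, here is a minimal sketch of the Gini index computed from raw class counts (the helper name is ours):
def gini(class_counts):
    # Gini impurity of a node, given raw class counts
    n = sum(class_counts)
    return 1 - sum((c / n) ** 2 for c in class_counts)

print(gini([50, 50]))   # 0.5 -> maximally impure for two classes
print(gini([100, 0]))   # 0.0 -> pure segment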
The Decision tree Algorithm
A major step is to identify the best split variable and the best split criterion. Once we have the split, we go to each segment and drill down further. To find the best attribute for segmentation, we calculate the information gain for all the attributes; the one with the highest information gain is selected (a runnable sketch of this procedure appears after the stopping criteria below).
Until stopped:
- Select a leaf node
- Find the best splitting attribute
- Split the node using the attribute
- Go to each child node and repeat steps 2 and 3
Stopping criteria:
- Each leaf-node contains examples of one type
- Algorithm ran out of attributes
- No further significant information gain
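Below is a minimal, hedged Python sketch of this loop in recursive form. All the function names are ours (not from any library), and it assumes a pandas DataFrame with categorical feature columns and a discrete target:
import numpy as np
import pandas as pd

def entropy_of(labels):
    # Entropy of a pandas Series of class labels
    probs = labels.value_counts(normalize=True)
    return float(-(probs * np.log2(probs)).sum())

def gain(df, attribute, target):
    # Information gain of splitting df on the given attribute
    parent = entropy_of(df[target])
    weighted = sum(len(g) / len(df) * entropy_of(g[target])
                   for _, g in df.groupby(attribute))
    return parent - weighted

def build_tree(df, target, attributes):
    # Stopping criteria: pure leaf, or no attributes left
    if df[target].nunique() == 1 or not attributes:
        return df[target].mode()[0]
    gains = {a: gain(df, a, target) for a in attributes}
    best = max(gains, key=gains.get)
    if gains[best] <= 0:                      # no further significant information gain
        return df[target].mode()[0]
    remaining = [a for a in attributes if a != best]
    return {best: {value: build_tree(group, target, remaining)
                   for value, group in df.groupby(best)}}

# Tiny illustration with made-up rows mirroring the Gender/Marital example
data = pd.DataFrame({'Gender':  ['M', 'M', 'F', 'F'],
                     'Marital': ['Married', 'Unmarried', 'Married', 'Unmarried'],
                     'Bought':  ['No', 'Yes', 'No', 'Yes']})
print(build_tree(data, 'Bought', ['Gender', 'Marital']))
# {'Marital': {'Married': 'No', 'Unmarried': 'Yes'}}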
The Decision tree Algorithm – Demo
Entropy([4+,10-]) overall = 86.3% (impurity)
- Entropy([7+,1-]) Male = 54.3%
- Entropy([3+,3-]) Female = 100%
- Information Gain for Gender = 86.3 - ((8/14) × 54.3 + (6/14) × 100) = 12.4
Entropy([4+,10-]) overall = 86.3% (impurity)
- Entropy([0+,9-]) Married = 0%
- Entropy([4+,1-]) Unmarried = 72.1%
- Information Gain for Marital Status = 86.3 - ((9/14) × 0 + (5/14) × 72.1) = 60.5
- The information gain for Marital Status is the highest, so it is the first variable used for segmentation.
- Now we take the segment "Married" and repeat the same process, looking for the best splitting variable for this sub-segment.
Many Splits for a Single Variable
- Sometimes a variable takes multiple distinct values,
- which leads to multiple split options for a single variable,
- and thus multiple information gain values for that single variable.
What is the information gain for income?
- There are multiple options for calculating the information gain
- For income, we consider all possible split scenarios and calculate the information gain for each scenario
- Within income, out of all the options, the split with the highest information gain is the one chosen
- So, node partitioning for multi-valued attributes needs to be included in the decision tree algorithm
- We need to find the best splitting attribute along with the best split rule (a sketch of enumerating the candidate splits follows below)
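As an illustration of the "all possible scenarios" step, here is a small hedged sketch that enumerates every two-way grouping of a categorical variable's values exactly once (the function name and the income levels are just examples):
from itertools import combinations

def candidate_splits(values):
    # Yield each two-way partition of the category values exactly once;
    # pinning the first value to the left side avoids mirror-image duplicates.
    values = list(values)
    first, rest = values[0], values[1:]
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left = {first, *extra}
            yield left, set(values) - left

for left, right in candidate_splits(['Low', 'Medium', 'High']):
    print(left, 'vs', right)
# e.g. {'Low'} vs {'Medium', 'High'}, {'Low', 'Medium'} vs {'High'}, ...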
The Decision tree Algorithm- Full version
Until stopped:
- Select a leaf node
- Select an attribute
- Partition the node population and calculate information gain.
- Find the split with maximum information gain for this attribute.
- Repeat this for all attributes
- Find the best splitting attribute along with the best split rule.
- Split the node using that attribute.
- Go to each child node and repeat step 2 to 4.
Stopping criteria:
- Each leaf-node contains examples of one type.
- Algorithm ran out of attributes.
- No further significant information gain.
Explanation of the Algorithm
We select one attribute at a time. For each attribute we consider all the candidate partitions of the node, calculate the information gain of each, and keep the best split criterion. This gives us the information gain for that particular variable along with its best split criterion; we then do the same for all the other variables. In the end, we have the best splitting attribute along with the best split rule.
LAB: Decision Tree Building
- Import data: Ecom_Cust_Relationship_Management/Ecom_Cust_Survey.csv
- How many customers have participated in the survey?
- Are most of the customers satisfied or dissatisfied?
- Can you segment the data and find the concentrated satisfied and dissatisfied customer segments?
- What are the major characteristics of satisfied customers?
- What are the major characteristics of dissatisfied customers?
Solution
#Import Data
import pandas as pd
df = pd.read_csv('Datasets/Ecom_Cust_Relationship_Management/Ecom_Cust_Survey.csv', header=0)
# remove all rows with missing values
df.dropna(inplace=True)  # inplace expects the boolean True, not the string 'True'
#Q 1. How many customers have participated in the survey?
df.shape
#ANS: 11805
#total number of customer rows
df.shape[0]
#Q 2. Are most of the customers satisfied or dissatisfied overall?
df.Overall_Satisfaction.value_counts()
#number of satisfied customers
satisfied = df['Overall_Satisfaction'].map( {'Dis Satisfied': 0, 'Satisfied': 1} ).astype(int).sum()
satisfied
#number of dis satisfied customers
df.shape[0]-satisfied
#ANS: 6411
df.Overall_Satisfaction.value_counts()
- Can you segment the data and find the concentrated satisfied and dissatisfied customer segments?
We will create a tree model in Python using the scikit-learn module.
Before that, we need to convert most of the feature data into numerical or hash values, as scikit-learn only works with numerical data.
# Welcome to variable transformation
df['Region'] = df['Region'].map({'EAST': 1, 'WEST': 2, 'NORTH': 3, 'SOUTH':4}).astype(int)
df['Customer_Type'] = df['Customer_Type'].map({'Prime': 1, 'Non_Prime': 0}).astype(int)
#We also need to change the column names, since '.' and spaces interfere with many basic functions in Python
df.rename(columns={'Order Quantity':'Order_Quantity', 'Improvement Area' :'Improvement_Area'}, inplace=True)
df['Improvement_Area'] = df['Improvement_Area'].map({'Website UI':1, 'Packing Shipping':2, 'Product Quality':3,}).astype(int)
df['Overall_Satisfaction'] = df['Overall_Satisfaction'].map( {'Dis Satisfied': 0, 'Satisfied': 1} ).astype(int)
#Need the library to create the tree
from sklearn import tree
#Defining features and labels
features= list(df.columns[:6])
X = df[features]
y = df['Overall_Satisfaction']
#Building Tree Model
clf = tree.DecisionTreeClassifier(max_depth=2)
clf.fit(X,y)
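Before wrestling with graphviz below, note that newer scikit-learn versions (0.21+) can print a plain-text rendering of the fitted tree, which is often enough for a quick look:
#Quick text view of the fitted tree (requires scikit-learn >= 0.21)
from sklearn.tree import export_text
print(export_text(clf, feature_names=features))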
Plotting the tree
Unfortunately, drawing a beautiful tree is not as easy in Python as it is in R; nonetheless, we need a way out.
- Have a recent version of Jupyter and Python installed: Anaconda with Python 3.5 and Jupyter 4.2.3 is used in this session
- We need to install graphviz tool in our system and set the path in environment variables.
- Visit http://www.graphviz.org/Download..php and find the optimal version for your computer.
- Get the path to gvedit.exe in the installation directory (for me it was "C:\Program Files (x86)\Graphviz2.38\bin")
- Go to Start -> Computer -> System Properties -> Advanced Settings -> Environment Variables and add the path.
- To make graphviz work in your python environment, we need to install graphviz package too.
- use ‘conda install graphviz’ in command prompt.
- We will need the Python package pydotplus (for older Python versions, pydot)
- use this command in your anaconda prompt:
conda install -c conda-forge pydotplus
- If any version-related error occurs while installing the package, go to https://anaconda.org/search?q=pydotplus
- This link will show the channel name of the suitable version,
- and we can then use:
conda install -c <channel name here> pydotplus
Note: older versions of Python use pydot (now deprecated); use pydotplus for newer versions.
Please follow all the steps in the given order.
#What are the major characteristics of satisfied customers?
from IPython.display import Image
from io import StringIO  # sklearn.externals.six was removed in newer scikit-learn versions
import pydotplus
dot_data = StringIO()
tree.export_graphviz(clf,
out_file = dot_data,
feature_names = features,
filled=True, rounded=True,
impurity=False)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
By looking at the plot we can answer two questions:
- What are the major characteristics of satisfied customers? Order_Quantity less than 40.5 with Age less than 29.5, or Order_Quantity greater than or equal to 40.5.
- What are the major characteristics of dissatisfied customers? Order_Quantity less than 40.5 with Age greater than or equal to 29.5.
LAB: Tree Validation
- Find the accuracy of the classification for the tree model
#Tree Validation
predict1 = clf.predict(X)
from sklearn.metrics import confusion_matrix  # for building the confusion matrix
cm = confusion_matrix(y, predict1)
print (cm)
total = sum(sum(cm))
# calculate accuracy from the confusion matrix
accuracy = (cm[0,0]+cm[1,1])/total
accuracy
- We can also use the .score() function from the sklearn library to get the accuracy in Python.
- However, the confusion matrix lets us see the wrong classifications, which gives an intuitive understanding of our predictions.
clf.score(X,y)
LAB: The Problem of Overfitting
- Import Dataset: “Buyers Profiles/Train_data.csv”
- Import both test and training data
- Build a decision tree model on training data
- Find the accuracy on training data
- Find the predictions for test data
- What is the model prediction accuracy on test data?
Solution
- Import Dataset: “Buyers Profiles/Train_data.csv”
- Import both test and training data
import pandas as pd
train = pd.read_csv("Datasets/Buyers Profiles/Train_data.csv", header=0)
test = pd.read_csv("Datasets/Buyers Profiles/Test_data.csv", header=0)
train.shape
test.shape
train.info()
# the data has string values; we need to convert them into numerical values
train['Gender'] = train['Gender'].map( {'Male': 1, 'Female': 0} ).astype(int)
train['Bought'] = train['Bought'].map({'Yes':1, 'No':0}).astype(int)
test['Gender'] = test['Gender'].map( {'Male': 1, 'Female': 0} ).astype(int)
test['Bought'] = test['Bought'].map({'Yes':1, 'No':0}).astype(int)
train.info()
from sklearn import tree
#Defining features and labels
features = list(train.columns[:2])
X_train = train[features]
y_train = train['Bought']
#X_train
X_test = test[features]
y_test = test['Bought']
#training Tree Model
clf = tree.DecisionTreeClassifier()
clf.fit(X_train,y_train)
#Plotting the trees
dot_data = StringIO()
tree.export_graphviz(clf,
out_file = dot_data,
feature_names = features,
filled=True, rounded=True,
impurity=False)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
predict1 = clf.predict(X_train)
print(predict1)
predict2 = clf.predict(X_test)
print(predict2)
####Calculation of Accuracy and Confusion Matrix on the training data
from sklearn.metrics import confusion_matrix  # for building the confusion matrix
cm1 = confusion_matrix(y_train,predict1)
cm1
total1 = sum(sum(cm1))
# calculate accuracy from the confusion matrix
accuracy1 = (cm1[0,0]+cm1[1,1])/total1
accuracy1
#Accuracy On Test Data
cm2 = confusion_matrix(y_test,predict2)
cm2
total2 = sum(sum(cm2))
# calculate accuracy from the confusion matrix
accuracy2 = (cm2[0,0]+cm2[1,1])/total2
accuracy2
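Putting the two numbers side by side makes the gap visible (this simply reuses the accuracies computed above):
#An unpruned tree typically scores near 100% on its own training data
#while dropping noticeably on unseen test data -- the signature of overfitting
print("Training accuracy:", accuracy1)
print("Test accuracy:", accuracy2)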
The Problem of Overfitting
- If we grow the tree further, each row of the input data table may end up as its own final rule
- The model will be really good on the training data, but it will fail to validate on the test data
- Growing the tree beyond a certain level of complexity leads to overfitting
- A really big tree is very likely to suffer from overfitting.
Pruning
- Growing the tree beyond a certain level of complexity leads to overfitting
- In our data, age doesn’t have any impact on the target variable.
- Growing the tree beyond Gender is not going to add any value. Need to cut it at Gender
- This process of trimming trees is called Pruning
Pruning to Avoid Overfitting
- Pruning helps us to avoid overfitting
- Generally a simpler model is preferred, as it avoids the overfitting issue.
- Any additional split that does not add significant value is not worthwhile.
- In R we can use cp, the complexity parameter, to control the tree's growth; scikit-learn offers analogous controls, as sketched below.
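In scikit-learn, an analogous knob is cost-complexity pruning via the ccp_alpha parameter (available from version 0.22; the alpha value below is purely illustrative, not tuned):
#Cost-complexity pruning: larger ccp_alpha -> smaller, more pruned tree
pruned = tree.DecisionTreeClassifier(ccp_alpha=0.01)  # 0.01 is an illustrative value
pruned.fit(X_train, y_train)
print(pruned.score(X_test, y_test))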
Code-Tree Pruning
#We will rebuild a new tree using the above data and see how it works by tweaking the parameters
dtree = tree.DecisionTreeClassifier(criterion = "gini", splitter = 'random', max_leaf_nodes = 10, min_samples_leaf = 5, max_depth= 5)
dtree.fit(X_train,y_train)
predict3 = dtree.predict(X_train)
print(predict3)
predict4 = dtree.predict(X_test)
print(predict4)
#Accuracy of the model that we created with modified model parameters.
score2 = dtree.score(X_test, y_test)
score2
LAB: Tree Building Model Selection
- Import the fiber bits data. This is internet service provider data; the idea is to predict customer attrition based on some independent factors.
- Build a decision tree model for fiber bits data.
- Prune the tree if required.
- Find out the final accuracy.
- Is there any 100% active/inactive customer segment?
Solution
import pandas as pd
import numpy as np
Fiber_df = pd.read_csv("Datasets/Fiberbits/Fiberbits.csv", header=0)
Fiber_df.info()
from sklearn import tree
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer versions
#Defining features and labels
features = list(Fiber_df.drop(['active_cust'], axis=1).columns)  # list of column names except 'active_cust'
features
X = np.array(Fiber_df[features])
y = np.array(Fiber_df['active_cust'])
We will split the data into train and test sets to validate the model accuracy.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
clf1 = tree.DecisionTreeClassifier()
clf1.fit(X_train,y_train)
#Accuracy of the model clf1
clf1.score(X_test,y_test)
We can tweak the parameters and build a new model.
#Let's make a model by changing the parameters.
clf2 = tree.DecisionTreeClassifier(criterion='gini',
splitter='best',
max_depth=10,
min_samples_split=5,
min_samples_leaf=5,
min_weight_fraction_leaf=0.5,
max_leaf_nodes=10)
clf2.fit(X_train,y_train)
clf2.score(X_test,y_test)
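Rather than hand-tuning one parameter combination at a time, we could let a cross-validated grid search pick one; a hedged sketch follows (the grid values below are illustrative, not tuned):
#Cross-validated search over a small illustrative parameter grid
from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth': [5, 10, 15],
              'min_samples_leaf': [5, 20, 50]}
search = GridSearchCV(tree.DecisionTreeClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)
print(search.score(X_test, y_test))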
Conclusion
- Decision trees are powerful and very simple to represent and understand.
- One needs to be careful with the size of the tree; decision trees are more prone to overfitting than other algorithms.
- Can be applied to any type of data, especially with categorical predictors.
- One can use decision trees to perform a basic customer segmentation and build a different predictive model on the segments.