Handout – Decision Trees in Python

Before we start the lesson, please download the datasets.

Introduction

A decision tree is a supervised learning algorithm that is mostly used for classification problems. In this technique, we split the population or sample into two or more homogeneous sets based on the most significant differentiator among the input variables.

Contents

1. What is segmentation
2. Segmentation Business Problem
3. Decision Tree Approach
4. Splitting Criterion
5. Impurity or Diversity Measures
6. Entropy
7. Information Gain
8. Purity Measures
9. Decision Tree Algorithm
10. Multiple Splits for a single variable
11. Modified Decision Tree Algorithm
12. Problem of overfitting
13. Pruning
14. Conclusion

What is Segmentation?

To understand segmentation, imagine that we want to run an SMS marketing campaign to attract more customers. We have a list of customers and need to send them SMS messages, perhaps with coupons or discounts. We do not send the same SMS to every customer, because customers are different.

  • Some customers like to see high discounts
  • Some customers want to see a large collection of items
  • Some customers are fans of a particular brand
  • Some customers are male and some are female

So, instead of sending one SMS to all the customers, we send customized SMS messages to segments of customers, which makes the marketing campaign more effective. When customers feel connected to the campaign, they become more interested in buying the product, and we can say that the campaign went well. We therefore want to divide the customers based on their past purchases, in such a way that customers inside a group are homogeneous whereas customers across groups are heterogeneous. Customers in two different groups behave differently, so we might have to send them two different SMS messages. To divide the customers into groups that satisfy this condition, we use an algorithm called Decision Trees.

Segmentation Business Problem

The Data

Observations

The problem above has three fields: Gender, Marital Status, and whether the customer ordered the product. Some customers are male and some are female; some are married and some are unmarried. We have to decide, for example, whether a married male customer will order the product, and whether an unmarried female customer will order it. Using the historical data, if a customer has a very low probability of ordering, we send a message with a discount to attract them to buy the product. If a customer already has a high probability of ordering, we send a different message: we try upselling by sending other discounts, or cross-selling by showing them a different product so that they buy two products. In this way we can improve our business. To solve the problem, we first have to rearrange the data.

Re-Arranging the data

Observations

From the result above, we notice that there are 14 customers in total. Of these, 10 did not order the product and 4 ordered it. There are 8 males and 6 females. Of the 8 males, 6 are married and 2 are unmarried. One married male and one unmarried male ordered the product. Of the 6 females, 3 are married and 3 are unmarried. All the unmarried females ordered the product, whereas none of the married females did. This analysis shows that unmarried females have a high probability of buying the product, whereas married females are not interested in buying it. On the other side, married males are mostly not interested in buying the product, whereas 50% of unmarried males are interested. Therefore,

  Married males are unlikely to buy the product, whereas

  unmarried females are likely to buy the product.

The Decision Tree Approach

The aim is to divide the dataset into segments. Each segment needs to be useful for business decision making, which means the segment should be pure. For example, if we split the whole population into two groups, one group should contain the buyers and the other the non-buyers, so that we can send different messages to buyers and non-buyers.

Example Sales Segmentation Based on Age

Observations

Suppose we start with 100 customers. If we divide them based on Age, we get two segments: Young and Old. From the picture above, the Young segment has 60 people and the Old segment has 40. Of the 60 young customers, 31 are buying and 29 are not. Of the 40 old customers, 19 are buying and 21 are not. At the overall level, 50% are buying and 50% are not, across both Young and Old. After the split, the young customers are still close to 50% buying and 50% not buying, and so are the old customers, so dividing the whole population based on Age doesn't really help us. Now let us try dividing the whole population based on Gender and see whether this attribute splits the population in a better way.

Example Sales Segmentation Based on Gender

Observations

Here the whole population is divided into two segments (Male and Female) based on Gender. There are 60 male customers and 40 female customers. Of the 60 male customers, 48 are buying and 12 are not. Of the 40 female customers, 2 are buying and 38 are not. Overall, of the 50 customers who bought the product, 48 are male and only 2 are female. Within the Male segment there is an 80% chance of buying, and within the Female segment only a 5% chance. Dividing the whole population based on Gender therefore lets us make really good business inferences: most of the buyers are male and most of the non-buyers are female. So Gender gives us a better split and a better intuition for running a business strategy.
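The comparison between the two candidate splits can be checked with a few lines of Python. This is only a minimal sketch: the counts are the ones quoted in the Age and Gender examples of this handout, while the names `age_split`, `gender_split` and `buy_rates` are made up for illustration.

```python
# Counts taken from the Age and Gender examples in this handout.
# Each entry maps a segment name to (buyers, non_buyers).
age_split = {"Young": (31, 29), "Old": (19, 21)}
gender_split = {"Male": (48, 12), "Female": (2, 38)}


def buy_rates(split):
    """Return the share of buyers in each segment of a split."""
    return {
        name: buyers / (buyers + non_buyers)
        for name, (buyers, non_buyers) in split.items()
    }


age_rates = buy_rates(age_split)
gender_rates = buy_rates(gender_split)

# Print the buy rate of every segment for both candidate splits.
for label, rates in (("Age", age_rates), ("Gender", gender_rates)):
    for segment, rate in rates.items():
        print(f"{label:<6} {segment:<6} buy rate = {rate:.1%}")
```

The Age segments stay close to 50%, while the Gender segments move far away from 50% in opposite directions, which is what makes Gender the more useful split.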

Main Questions

  • We are looking for pure segments
  • Dataset has many attributes
  • Which is the right attribute for pure segmentation?
  • Can we start with any attribute?
  • Which attribute to start with? – The best separating attribute
  • Customer age, gender, location, and demographics can all impact sales. How do we identify the best attribute and the best split?

The Splitting Criterion

The best split is the one that does the best job of separating the data into groups where each group has a single, very dominant class. That means that if we divide the whole population into groups, one group should ideally contain all the buyers and the other all the non-buyers. Let us look at this concept through sales segmentation based on Age as well as sales segmentation based on Gender.

Example Sales Segmentation Based on Age

Observations

At the root level there is a 50% chance of buying and a 50% chance of not buying. After the split, about 52% are positive and 48% negative in the young group, whereas in the old group about 52% are negative and 48% positive. This does not give us pure segments, because we did not gain much information. At the root level there are 100 customers, 50% buying and 50% not buying; at the split level, both the young and old segments still have roughly a 50% chance of buying and a 50% chance of not buying. So Age is not really a good splitting variable.

Example Sales Segmentation Based on Gender

Observations

At the root level there is again a 50% chance of buying and a 50% chance of not buying, but after the split the male population has an 80% chance of buying whereas the female population has only a 5% chance, which means 95% of the female population is not interested in buying the product. The female segment is nearly pure from the point of view of not buying, and the male segment is fairly pure from the point of view of buying. So this split is much better than the earlier split based on Age, and we are looking for variables like Gender.

Impurity (Diversity) Measures

  • We are looking for an impurity or diversity measure that gives a high score for the Age variable (high impurity after segmenting) and a low score for the Gender variable (low impurity after segmenting).

To calculate the impurity, we use a mathematical measure called entropy.

Entropy

Entropy characterizes the impurity (diversity) of a segment, so it is the impurity measure we are looking for. It is a measure of uncertainty, and it quantifies the amount of information in a message. If S is a segment of the training data, then

$Entropy(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$

where

  p+ is the probability of the positive class and
  p- is the probability of the negative class

Entropy is highest when the segment has p of 0.5, and lowest when the segment is pure, i.e. p of 1.

Note

 If entropy is low, the segmentation is good
 If entropy is high, the segmentation is poor
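The entropy formula above can be sketched as a small Python function. This is an illustrative sketch, not code from the lesson: the function name `entropy` and its argument `p_pos` are assumptions, and the term 0 * log2(0) is treated as 0, which is the usual convention for pure segments.

```python
import math


def entropy(p_pos):
    """Binary entropy (in bits) of a segment whose positive-class share is p_pos."""
    if not 0.0 <= p_pos <= 1.0:
        raise ValueError("p_pos must be between 0 and 1")
    total = 0.0
    for p in (p_pos, 1.0 - p_pos):
        if p > 0:  # convention: a class with probability 0 contributes nothing
            total -= p * math.log2(p)
    return total


print(entropy(0.5))  # 50-50 segment: the most impure case
print(entropy(1.0))  # pure segment: the least impure case
```

A 50-50 segment gives the maximum entropy of 1, and a pure segment gives 0, matching the two cases described in the text.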

Entropy is highest when the split has p of 0.5

  • A 50-50 class ratio in a segment is really impure, hence entropy is high
    • $Entropy(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$
    • $Entropy(S) = -0.5 \log_2(0.5) - 0.5 \log_2(0.5)$
    • $Entropy(S) = 1$

Entropy is least when the split is pure, i.e. p of 1

  • A 100-0 class ratio in a segment is really pure, hence entropy is low
    • $Entropy(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$
    • $Entropy(S) = -1 \log_2(1) - 0 \log_2(0)$, where the term $0 \log_2(0)$ is taken to be 0
    • $Entropy(S) = 0$

The lower the entropy, the better the split

  • Entropy is formulated in such a way that its value is high for impure segments and low for pure segments.

Entropy Calculation – Example

  • Entropy at root
  • Total population at root: 100 [50+, 50-]
    • $Entropy(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$
    • $= -0.5 \log_2(0.5) - 0.5 \log_2(0.5)$
    • $= -(0.5)(-1) - (0.5)(-1)$
    • $= 1$
    • 100% impurity at root

Entropy Calculation

  • Age splits the population into two segments
  • Segment-1: Age="Young"
  • Segment-2: Age="Old"
  • Entropy at segment-1
    • The Age="Young" segment has 60 records [31+, 29-]
    • $Entropy(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$
    • $= -\frac{31}{60} \log_2 \frac{31}{60} - \frac{29}{60} \log_2 \frac{29}{60}$
    • (-31/60)*log(31/60,2) - (29/60)*log(29/60,2)
    • 0.9991984 (about 99.9% impurity in this segment)

  • Entropy at segment-2
    • The Age="Old" segment has 40 records [19+, 21-]
    • $Entropy(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$
    • $= -\frac{19}{40} \log_2 \frac{19}{40} - \frac{21}{40} \log_2 \frac{21}{40}$
    • (-19/40)*log(19/40,2) - (21/40)*log(21/40,2)
    • 0.9981959 (about 99.8% impurity in this segment too)

LAB: Entropy Calculation – Example

  • Calculate entropy at the root for the given population
  • Calculate the entropy for the two distinct gender segments

Code – Entropy Calculation

  • Entropy at root: 1 (100% impurity)
  • Male segment: (-48/60)*log(48/60,2) - (12/60)*log(12/60,2)
    • 0.7219281
  • Female segment: (-2/40)*log(2/40,2) - (38/40)*log(38/40,2)
    • 0.286397
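One possible way to work through this lab in Python is sketched below. It is not the lesson's own code: the helper `entropy_from_counts` is a hypothetical name, and the counts are the ones from the Gender example (root [50+, 50-], Male [48+, 12-], Female [2+, 38-]).

```python
import math


def entropy_from_counts(pos, neg):
    """Entropy (in bits) of a segment with pos positive and neg negative records."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count > 0:  # an empty class contributes nothing to the entropy
            p = count / total
            result -= p * math.log2(p)
    return result


root = entropy_from_counts(50, 50)    # whole population
male = entropy_from_counts(48, 12)    # Gender = Male segment
female = entropy_from_counts(2, 38)   # Gender = Female segment

print(f"Root:   {root:.7f}")
print(f"Male:   {male:.7f}")
print(f"Female: {female:.7f}")
```

Both gender segments have a noticeably lower entropy than the root, which is what a useful split should achieve.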

Information Gain

Information gain gives us a better picture of how well a variable splits the data, by comparing the entropy before and after the segmentation.

  • Information Gain = entropyBeforeSplit – entropyAfterSplit
  • An easy way to understand it: Information Gain = (overall entropy at the parent node) – (sum of the weighted entropies at each child node), where each child is weighted by its share of the parent's records
  • The attribute with the maximum information gain is the best split attribute

Information Gain- Calculation
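As an illustration, the information gain of the Age and Gender splits used in this handout can be computed as follows. This is a sketch under stated assumptions: `entropy` and `information_gain` are hypothetical helper names, each child is weighted by its share of the parent's records, and the counts come from the earlier examples.

```python
import math


def entropy(pos, neg):
    """Entropy (in bits) of a segment with pos positive and neg negative records."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count > 0:  # an empty class contributes nothing to the entropy
            p = count / total
            result -= p * math.log2(p)
    return result


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child segments.

    parent is a (pos, neg) pair; children is a list of (pos, neg) pairs.
    """
    n = sum(parent)
    weighted = sum((pos + neg) / n * entropy(pos, neg) for pos, neg in children)
    return entropy(*parent) - weighted


# Root [50+, 50-]; Age -> Young [31+, 29-], Old [19+, 21-]
age_gain = information_gain((50, 50), [(31, 29), (19, 21)])
# Root [50+, 50-]; Gender -> Male [48+, 12-], Female [2+, 38-]
gender_gain = information_gain((50, 50), [(48, 12), (2, 38)])

print(f"Information gain for Age:    {age_gain:.4f}")
print(f"Information gain for Gender: {gender_gain:.4f}")
```

Gender has by far the larger information gain, so it would be preferred over Age as the splitting attribute.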
