
203.3.11 Practice : Tree Building & Model Selection

The concluding lab of the Decision Tree series.

LAB: Tree Building & Model Selection

In the previous section, we studied pruning a decision tree in R. This lab puts tree building and pruning into practice.

  • Import the Fiberbits data. This is internet service provider data; the goal is to predict customer attrition (the active_cust column) from the other variables
  • Build a decision tree model for the Fiberbits data
  • Prune the tree if required
  • Find the final accuracy
  • Is there any customer segment that is 100% active or 100% inactive?

Solution

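library(rpart)  # provides rpart(), prune(), printcp() and plotcp()
# Load the data; adjust the path to wherever Fiberbits.csv is stored
Fiberbits <- read.csv("C:\\Amrita\\Datavedi\\Fiberbits\\Fiberbits.csv")
# Grow a large classification tree (low cp) so that it can be pruned afterwards
Fiber_bits_tree <- rpart(active_cust ~ ., method = "class", control = rpart.control(minsplit = 30, cp = 0.001), data = Fiberbits)
Fiber_bits_tree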
## n= 100000 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##    1) root 100000 42141 1 (0.42141000 0.57859000)  
##      2) relocated>=0.5 12348   954 0 (0.92274052 0.07725948)  
##        4) technical_issues_per_month>=1.5 11294   526 0 (0.95342660 0.04657340) *
##        5) technical_issues_per_month< 1.5 1054   428 0 (0.59392789 0.40607211)  
##         10) number_plan_changes>=4.5 495    45 0 (0.90909091 0.09090909) *
##         11) number_plan_changes< 4.5 559   176 1 (0.31484794 0.68515206)  
##           22) Speed_test_result< 79.5 45     0 0 (1.00000000 0.00000000) *
##           23) Speed_test_result>=79.5 514   131 1 (0.25486381 0.74513619) *
##      3) relocated< 0.5 87652 30747 1 (0.35078492 0.64921508)  
##        6) Speed_test_result< 78.5 27517 10303 0 (0.62557692 0.37442308)  
##         12) technical_issues_per_month>=3.5 22187  5791 0 (0.73899130 0.26100870)  
##           24) number_plan_changes< 0.5 9735  1132 0 (0.88371854 0.11628146)  
##             48) Speed_test_result< 77.5 7750   541 0 (0.93019355 0.06980645) *
##             49) Speed_test_result>=77.5 1985   591 0 (0.70226700 0.29773300)  
##               98) income>=2008.5 1211   133 0 (0.89017341 0.10982659)  
##                196) income< 2526 1163    85 0 (0.92691316 0.07308684) *
##                197) income>=2526 48     0 1 (0.00000000 1.00000000) *
##               99) income< 2008.5 774   316 1 (0.40826873 0.59173127)  
##                198) income< 1785.5 270    97 0 (0.64074074 0.35925926) *
##                199) income>=1785.5 504   143 1 (0.28373016 0.71626984) *
##           25) number_plan_changes>=0.5 12452  4659 0 (0.62584324 0.37415676)  
##             50) number_plan_changes>=1.5 7867  1358 0 (0.82738020 0.17261980) *
##             51) number_plan_changes< 1.5 4585  1284 1 (0.28004362 0.71995638) *
##         13) technical_issues_per_month< 3.5 5330   818 1 (0.15347092 0.84652908)  
##           26) income>=1945.5 1849   619 1 (0.33477555 0.66522445)  
##             52) monthly_bill>=148 167    29 0 (0.82634731 0.17365269) *
##             53) monthly_bill< 148 1682   481 1 (0.28596908 0.71403092)  
##              106) income< 2362 1407   472 1 (0.33546553 0.66453447)  
##                212) technical_issues_per_month>=1.5 176    25 0 (0.85795455 0.14204545) *
##                213) technical_issues_per_month< 1.5 1231   321 1 (0.26076361 0.73923639)  
##                  426) income>=2180.5 126    21 0 (0.83333333 0.16666667) *
##                  427) income< 2180.5 1105   216 1 (0.19547511 0.80452489) *
##              107) income>=2362 275     9 1 (0.03272727 0.96727273) *
##           27) income< 1945.5 3481   199 1 (0.05716748 0.94283252) *
##        7) Speed_test_result>=78.5 60135 13533 1 (0.22504365 0.77495635)  
##         14) Speed_test_result< 82.5 25734  9271 1 (0.36026269 0.63973731)  
##           28) Speed_test_result>=80.5 11671  5312 0 (0.54485477 0.45514523)  
##             56) income>=1722.5 6306  1299 0 (0.79400571 0.20599429)  
##              112) income< 1992.5 5828   888 0 (0.84763212 0.15236788)  
##                224) number_plan_changes>=1.5 3053   189 0 (0.93809368 0.06190632) *
##                225) number_plan_changes< 1.5 2775   699 0 (0.74810811 0.25189189)  
##                  450) number_plan_changes< 0.5 2284   358 0 (0.84325744 0.15674256)  
##                    900) technical_issues_per_month>=3.5 1511    57 0 (0.96227664 0.03772336) *
##                    901) technical_issues_per_month< 3.5 773   301 0 (0.61060802 0.38939198)  
##                     1802) monthly_bill>=148 364    41 0 (0.88736264 0.11263736) *
##                     1803) monthly_bill< 148 409   149 1 (0.36430318 0.63569682) *
##                  451) number_plan_changes>=0.5 491   150 1 (0.30549898 0.69450102) *
##              113) income>=1992.5 478    67 1 (0.14016736 0.85983264) *
##             57) income< 1722.5 5365  1352 1 (0.25200373 0.74799627)  
##              114) number_plan_changes>=1.5 1586   680 1 (0.42875158 0.57124842)  
##                228) Speed_test_result< 81.5 894   370 0 (0.58612975 0.41387025)  
##                  456) technical_issues_per_month>=3.5 301    54 0 (0.82059801 0.17940199) *
##                  457) technical_issues_per_month< 3.5 593   277 1 (0.46711636 0.53288364)  
##                    914) income< 1604.5 261    92 0 (0.64750958 0.35249042) *
##                    915) income>=1604.5 332   108 1 (0.32530120 0.67469880) *
##                229) Speed_test_result>=81.5 692   156 1 (0.22543353 0.77456647) *
##              115) number_plan_changes< 1.5 3779   672 1 (0.17782482 0.82217518) *
##           29) Speed_test_result< 80.5 14063  2912 1 (0.20706819 0.79293181)  
##             58) income< 1960.5 11360  2725 1 (0.23987676 0.76012324)  
##              116) Num_complaints>=4.5 292    87 0 (0.70205479 0.29794521)  
##                232) technical_issues_per_month>=3.5 197     0 0 (1.00000000 0.00000000) *
##                233) technical_issues_per_month< 3.5 95     8 1 (0.08421053 0.91578947) *
##              117) Num_complaints< 4.5 11068  2520 1 (0.22768341 0.77231659)  
##                234) number_plan_changes>=1.5 4003  1180 1 (0.29477892 0.70522108)  
##                  468) income>=1809.5 1229   582 1 (0.47355574 0.52644426)  
##                    936) Speed_test_result>=79.5 477   132 0 (0.72327044 0.27672956) *
##                    937) Speed_test_result< 79.5 752   237 1 (0.31515957 0.68484043) *
##                  469) income< 1809.5 2774   598 1 (0.21557318 0.78442682) *
##                235) number_plan_changes< 1.5 7065  1340 1 (0.18966737 0.81033263) *
##             59) income>=1960.5 2703   187 1 (0.06918239 0.93081761) *
##         15) Speed_test_result>=82.5 34401  4262 1 (0.12389175 0.87610825) *
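The task list asks whether any segment is 100% active or inactive. In the printed tree, terminal nodes with a loss of 0 answer this: node 22 (45 customers, all class 0), node 197 (48 customers, all class 1) and node 232 (197 customers, all class 0). Below is a minimal sketch of how such nodes could be listed programmatically, relying on the `dev` column of `tree$frame`, which for a classification tree holds the misclassified count printed as loss. The helper `pure_leaves` and the toy data are hypothetical and only stand in for `Fiber_bits_tree`.

```r
library(rpart)

# Hypothetical helper: list terminal nodes that contain a single class.
pure_leaves <- function(tree) {
  fr <- tree$frame
  is_pure_leaf <- fr$var == "<leaf>" & fr$dev == 0
  data.frame(
    node  = as.integer(rownames(fr)[is_pure_leaf]),
    n     = fr$n[is_pure_leaf],
    class = attr(tree, "ylevels")[fr$yval[is_pure_leaf]]
  )
}

# Toy data (an assumption for illustration): x separates the classes perfectly.
toy <- data.frame(
  x = rep(c(0, 1), each = 20),
  y = factor(rep(c("no", "yes"), each = 20))
)
toy_tree <- rpart(y ~ x, data = toy, method = "class",
                  control = rpart.control(minsplit = 2, cp = 0))

pure_leaves(toy_tree)
```

With the lab's objects, the call would be `pure_leaves(Fiber_bits_tree)`.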

Plotting the Tree

library(rpart.plot)  # provides prp()
prp(Fiber_bits_tree, box.col = c("Grey", "Orange")[Fiber_bits_tree$frame$yval], varlen = 0, faclen = 0, type = 1, extra = 4, under = TRUE)

Code-Choosing Cp and Cross Validation Error

printcp(Fiber_bits_tree)
## 
## Classification tree:
## rpart(formula = active_cust ~ ., data = Fiberbits, method = "class", 
##     control = rpart.control(minsplit = 30, cp = 0.001))
## 
## Variables actually used in tree construction:
## [1] income                     monthly_bill              
## [3] Num_complaints             number_plan_changes       
## [5] relocated                  Speed_test_result         
## [7] technical_issues_per_month
## 
## Root node error: 42141/100000 = 0.42141
## 
## n= 100000 
## 
##           CP nsplit rel error  xerror      xstd
## 1  0.2477397      0   1.00000 1.00000 0.0037054
## 2  0.1639971      1   0.75226 0.75226 0.0034917
## 3  0.0876581      2   0.58826 0.58826 0.0032402
## 4  0.0293301      3   0.50061 0.50061 0.0030616
## 5  0.0239316      6   0.41261 0.41295 0.0028450
## 6  0.0081631      8   0.36475 0.37498 0.0027372
## 7  0.0024560      9   0.35659 0.35811 0.0026862
## 8  0.0022662     11   0.35168 0.35362 0.0026723
## 9  0.0018272     13   0.34714 0.34520 0.0026457
## 10 0.0016848     15   0.34349 0.34228 0.0026364
## 11 0.0014001     18   0.33832 0.33825 0.0026234
## 12 0.0013763     24   0.32859 0.33495 0.0026127
## 13 0.0013170     26   0.32583 0.33115 0.0026003
## 14 0.0012933     28   0.32320 0.32859 0.0025918
## 15 0.0011390     33   0.31563 0.32465 0.0025787
## 16 0.0010678     34   0.31449 0.32088 0.0025661
## 17 0.0010000     35   0.31342 0.31926 0.0025606
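The rel error and xerror columns are expressed relative to the root node error (0.42141), so multiplying by it gives absolute error rates. Here is a short worked calculation using rows 5, 6 and 17 of the table above. The variable names are illustrative.

```r
# Values copied from the printcp() output above.
root_error <- 42141 / 100000

cp_rows <- data.frame(
  nsplit    = c(6, 8, 35),
  rel_error = c(0.41261, 0.36475, 0.31342),
  xerror    = c(0.41295, 0.37498, 0.31926)
)

# Absolute error = relative error * root node error; accuracy = 1 - error.
cp_rows$train_accuracy <- 1 - cp_rows$rel_error * root_error
cp_rows$cv_accuracy    <- 1 - cp_rows$xerror * root_error

print(round(cp_rows, 4))
```

This gives training accuracies of about 82.6%, 84.6% and 86.8% for the trees with 6, 8 and 35 splits respectively.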

Plot-Choosing Cp and Cross Validation Error

plotcp(Fiber_bits_tree) 
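By default, plotcp() draws a horizontal line one standard error above the minimum cross-validated error. A common convention, the one-standard-error rule, picks the smallest tree whose xerror falls on or under that line. This lab does not apply the rule itself; the sketch below only shows what it would select from the table printed earlier, with the values typed in by hand.

```r
# CP table copied from the printcp(Fiber_bits_tree) output.
cp_table <- data.frame(
  CP = c(0.2477397, 0.1639971, 0.0876581, 0.0293301, 0.0239316, 0.0081631,
         0.0024560, 0.0022662, 0.0018272, 0.0016848, 0.0014001, 0.0013763,
         0.0013170, 0.0012933, 0.0011390, 0.0010678, 0.0010000),
  nsplit = c(0, 1, 2, 3, 6, 8, 9, 11, 13, 15, 18, 24, 26, 28, 33, 34, 35),
  xerror = c(1.00000, 0.75226, 0.58826, 0.50061, 0.41295, 0.37498,
             0.35811, 0.35362, 0.34520, 0.34228, 0.33825, 0.33495,
             0.33115, 0.32859, 0.32465, 0.32088, 0.31926),
  xstd = c(0.0037054, 0.0034917, 0.0032402, 0.0030616, 0.0028450, 0.0027372,
           0.0026862, 0.0026723, 0.0026457, 0.0026364, 0.0026234, 0.0026127,
           0.0026003, 0.0025918, 0.0025787, 0.0025661, 0.0025606)
)

# Threshold = minimum cross-validated error plus its standard error.
best <- which.min(cp_table$xerror)
threshold <- cp_table$xerror[best] + cp_table$xstd[best]

# Rows are ordered from smallest to largest tree, so take the first one under the line.
chosen <- cp_table[which(cp_table$xerror <= threshold)[1], ]
print(chosen)
```

On these numbers the rule selects the 34-split row. The much smaller trees built in the following steps are therefore a choice for readability rather than for the lowest cross-validated error.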

Pruning

Fiber_bits_tree_1<-prune(Fiber_bits_tree, cp=0.0081631)
Fiber_bits_tree_1
## n= 100000 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 100000 42141 1 (0.42141000 0.57859000)  
##    2) relocated>=0.5 12348   954 0 (0.92274052 0.07725948) *
##    3) relocated< 0.5 87652 30747 1 (0.35078492 0.64921508)  
##      6) Speed_test_result< 78.5 27517 10303 0 (0.62557692 0.37442308)  
##       12) technical_issues_per_month>=3.5 22187  5791 0 (0.73899130 0.26100870)  
##         24) number_plan_changes< 0.5 9735  1132 0 (0.88371854 0.11628146) *
##         25) number_plan_changes>=0.5 12452  4659 0 (0.62584324 0.37415676)  
##           50) number_plan_changes>=1.5 7867  1358 0 (0.82738020 0.17261980) *
##           51) number_plan_changes< 1.5 4585  1284 1 (0.28004362 0.71995638) *
##       13) technical_issues_per_month< 3.5 5330   818 1 (0.15347092 0.84652908) *
##      7) Speed_test_result>=78.5 60135 13533 1 (0.22504365 0.77495635)  
##       14) Speed_test_result< 82.5 25734  9271 1 (0.36026269 0.63973731)  
##         28) Speed_test_result>=80.5 11671  5312 0 (0.54485477 0.45514523)  
##           56) income>=1722.5 6306  1299 0 (0.79400571 0.20599429) *
##           57) income< 1722.5 5365  1352 1 (0.25200373 0.74799627) *
##         29) Speed_test_result< 80.5 14063  2912 1 (0.20706819 0.79293181) *
##       15) Speed_test_result>=82.5 34401  4262 1 (0.12389175 0.87610825) *

Plot after Pruning

prp(Fiber_bits_tree_1, box.col = c("Grey", "Orange")[Fiber_bits_tree_1$frame$yval], varlen = 0, faclen = 0, type = 1, extra = 4, under = TRUE)

Choosing Cp and Cross Validation Error with New Model

printcp(Fiber_bits_tree_1) 
## 
## Classification tree:
## rpart(formula = active_cust ~ ., data = Fiberbits, method = "class", 
##     control = rpart.control(minsplit = 30, cp = 0.001))
## 
## Variables actually used in tree construction:
## [1] income                     number_plan_changes       
## [3] relocated                  Speed_test_result         
## [5] technical_issues_per_month
## 
## Root node error: 42141/100000 = 0.42141
## 
## n= 100000 
## 
##          CP nsplit rel error  xerror      xstd
## 1 0.2477397      0   1.00000 1.00000 0.0037054
## 2 0.1639971      1   0.75226 0.75226 0.0034917
## 3 0.0876581      2   0.58826 0.58826 0.0032402
## 4 0.0293301      3   0.50061 0.50061 0.0030616
## 5 0.0239316      6   0.41261 0.41295 0.0028450
## 6 0.0081631      8   0.36475 0.37498 0.0027372
plotcp(Fiber_bits_tree_1) 

Pruning further

Fiber_bits_tree_2<-prune(Fiber_bits_tree, cp=0.0239316)
Fiber_bits_tree_2
## n= 100000 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 100000 42141 1 (0.42141000 0.57859000)  
##    2) relocated>=0.5 12348   954 0 (0.92274052 0.07725948) *
##    3) relocated< 0.5 87652 30747 1 (0.35078492 0.64921508)  
##      6) Speed_test_result< 78.5 27517 10303 0 (0.62557692 0.37442308)  
##       12) technical_issues_per_month>=3.5 22187  5791 0 (0.73899130 0.26100870) *
##       13) technical_issues_per_month< 3.5 5330   818 1 (0.15347092 0.84652908) *
##      7) Speed_test_result>=78.5 60135 13533 1 (0.22504365 0.77495635)  
##       14) Speed_test_result< 82.5 25734  9271 1 (0.36026269 0.63973731)  
##         28) Speed_test_result>=80.5 11671  5312 0 (0.54485477 0.45514523)  
##           56) income>=1722.5 6306  1299 0 (0.79400571 0.20599429) *
##           57) income< 1722.5 5365  1352 1 (0.25200373 0.74799627) *
##         29) Speed_test_result< 80.5 14063  2912 1 (0.20706819 0.79293181) *
##       15) Speed_test_result>=82.5 34401  4262 1 (0.12389175 0.87610825) *

Tree- After Pruning further

prp(Fiber_bits_tree_2, box.col = c("Grey", "Orange")[Fiber_bits_tree_2$frame$yval], varlen = 0, faclen = 0, type = 1, extra = 4, under = TRUE)
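The task list also asks for the final accuracy, which the steps above do not compute. Below is a minimal sketch of one way to get training accuracy from a confusion matrix. The helper `tree_accuracy` is hypothetical, and the `kyphosis` data set shipped with rpart stands in for Fiberbits so that the snippet runs on its own.

```r
library(rpart)

# Hypothetical helper: share of rows whose predicted class matches the actual one.
tree_accuracy <- function(tree, data, target) {
  predicted <- predict(tree, newdata = data, type = "class")
  conf_matrix <- table(actual = data[[target]], predicted = predicted)
  print(conf_matrix)
  sum(diag(conf_matrix)) / sum(conf_matrix)
}

# Stand-in model on rpart's built-in kyphosis data (not the lab's data).
demo_tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")
demo_accuracy <- tree_accuracy(demo_tree, kyphosis, "Kyphosis")
demo_accuracy
```

For the lab's final tree the call would be `tree_accuracy(Fiber_bits_tree_2, Fiberbits, "active_cust")`. Summing the loss column over the seven terminal nodes printed for Fiber_bits_tree_2 gives 17388 misclassified rows out of 100000, a training accuracy of about 82.6%. Because this is measured on the same data the tree was fit to, it is likely optimistic.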

Conclusion

  • Decision trees are powerful, yet simple to represent and understand.
  • One needs to be careful with the size of the tree. A tree grown without pruning can keep splitting until it fits noise, which makes decision trees more prone to overfitting than many other algorithms.
  • They can be applied to most types of data and work well with categorical predictors.
  • One can use decision trees to perform a basic customer segmentation and then build a different predictive model on each segment.
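The segmentation idea in the last point can be sketched with the `where` component of a fitted rpart object, which records the frame row of the leaf each training row falls into. The example uses rpart's built-in `kyphosis` data as a stand-in (an assumption; any fitted tree would do) and splits the data by leaf, so that a separate model could then be fit on each segment.

```r
library(rpart)

# Stand-in tree on rpart's built-in kyphosis data.
seg_tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# Map each training row to the node number of the leaf it lands in.
leaf_id <- as.integer(rownames(seg_tree$frame))[seg_tree$where]

# One data frame per leaf; each could feed its own predictive model.
segments <- split(kyphosis, leaf_id)
segment_sizes <- sapply(segments, nrow)
print(segment_sizes)
```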

 

In the next section, we will study Model Selection and Cross Validation.
