• No products in the cart.

Random Forests

Ensemble Models & Random Forests

Contents

  • Introduction
  • Ensemble Learning
  • How ensemble learning works
  • Bagging
  • Building models using Bagging
  • Random Forest algorithm
  • Random Forest model building
  • Boosting
  • Building models using boosting
  • Conclusion

The Wisdom of Crowds

  • One should not expend energy trying to identify an expert within a group but instead rely on the group’s collective wisdom, however make sure that Opinions must be independent and some knowledge of the truth must reside with some group members – Surowiecki
  • So instead of trying to build one great model, its better to build some independent moderate models and take their average as final prediction

What is Ensemble Learning

  • Imagine a classifier problem, there are two classes +1 & -1 in the target
  • Imagine that we built a best possible decision tree, it has 91% accuracy
  • Let x be the new data point and our decision tree predicts it to be +1. Is there a way I can do better than 91% by using the same data
  • Lets build 3 more models on the same data. And see we can improve the performance
  • We have four models on the same dataset, Each of them have different accuracy. But unfortunately there seem to be no real improvement in the accuracy.
  • What about prediction of the data point x?
  • Except the decision tree, the rest all algorithms are predicting the class of x as -1; Intuitively we would like to believe that the class of x is -1
  • The combined voting model seem to be having less error than each of the individual models. This is the actual philosophy of ensemble learning

Ensemble Models

  • Obtaining a better predictions using multiple models on the same dataset
  • Not every time it is possible to find single best fit model for our data, ensemble model combines multiple models to come up with one consolidated model
  • Ensemble models work on the principle that multiple moderately accurate models can give us a highly accurate model
  • Understandably, the Building and Evaluating the ensemble models is computationally expensive
  • Build one really good model is the usual statistical approach. Build many models and average the results is the philosophy of Ensemble learning

Why Ensemble technique works?

  • Imagine three models
    • M1 with an error rate of 10%
    • M2 with an error rate of 10%
    • M3 with an error rate of 10%
  • The three models have to be independent, we can’t build the same model three times and expect the error to reduce. Any changes to the modeling technique in model -1 should not impact model-2
  • In this scenario, the worst ensemble model will have 10% error rate
  • The best ensemble model will have an error rate of 2.8%
    • 2 out of 3 models predicted wrong + all models predicted wrong
    • (3C2)*(0.1)(0.1)(0.9) + (0.1)(0.1)(0.1)
    • 2.8% The best ensemble model will have an error rate of 2.8%

Types of Ensemble Models

  • The above example is a very primitive type of ensemble model. There are better and statistically stronger ensemble methods that will yield better results
  • Two most popular ensemble methodologies are
    • Bagging
    • Boosting

Bagging

  • Take multiple boot strap samples from the population and build classifiers on each of the samples. For prediction take mean or mode of all the individual model predictions.
  • Bagging has two major parts 1) Boot strap sampling 2) Aggregation of learners
  • Bagging = Bootstrap Aggregating
  • In Bagging we combine many unstable models to produce a stable model. Hence the predictors will be very reliable(less variance in the final model).

Boot strapping

  • We have a training data is of size N
  • Draw random sample with replacement of size N – This gives a new dataset, it might have repeated observations, some observations might not have even appeared once.
  • We are selecting records one-at-a-time, returning each selected record back in the population, giving it a chance to be selected again
  • Create B such new datasets. These are called boot strap datasets

The Bagging Algorithm

  • The training dataset D
  • Draw k boot strap sample sets from dataset D
  • For each boot strap sample i
    • Build a classifier model (M_i)
  • We will have total of k classifiers (M_1 , M_2 ,… M_k)
  • Vote over for the final classifier output and take the average for regression output

Why Bagging Works

  • We are selecting records one-at-a-time, returning each selected record back in the population, giving it a chance to be selected again
  • Note that the variance in the consolidated prediction is reduced, if we have independent samples. That way we can reduce the unavoidable errors made by the single model.
  • In a given boot strap sample, some observations have chance to select multiple times and some observations might not have selected at all.
  • There a proven theory that boot strap samples have only 63% of overall population and rest 37% is not present.
  • So the data used in each of these models is not exactly same, This makes our learning models independent. This helps our predictors have the uncorrelated errors.
  • Finally the errors from the individual models cancel out and give us a better ensemble model with higher accuracy
  • Bagging is really useful when there is lot of variance in our data

LAB: Bagging Models

  • Import Boston house price data. It is part of MASS package
  • Get some basic meta details of the data
  • Take 90% data use it for training and take rest 10% as holdout data
  • Build a single linear regression model on the training data.
  • On the hold out data, calculate the error (squared deviation) for the regression model.
  • Build the regression model using bagging technique. Build at least 25 models
  • On the hold out data, calculate the error (squared deviation) for the consolidated bagged regression model.
  • What is the improvement of the bagged model when compared with the single model?

Solution

#Importing Boston  house pricing data. 
library(MASS)
data(Boston)
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7
dim(Boston)
## [1] 506  14
##Training and holdout sample
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(500)
sampleseed <- createDataPartition(Boston$medv, p=0.9, list=FALSE)

train_boston<-Boston[sampleseed,]
test_boston<-Boston[-sampleseed,]

###Regression Model
reg_model<- lm(medv ~ ., data=train_boston)
summary(reg_model)
## 
## Call:
## lm(formula = medv ~ ., data = train_boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.4763  -2.7684  -0.4912   1.9030  26.4569 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.637e+01  5.534e+00   6.572 1.40e-10 ***
## crim        -1.042e-01  3.513e-02  -2.965 0.003195 ** 
## zn           4.482e-02  1.459e-02   3.073 0.002248 ** 
## indus        1.986e-02  6.566e-02   0.302 0.762462    
## chas         2.733e+00  8.765e-01   3.118 0.001939 ** 
## nox         -1.844e+01  4.018e+00  -4.590 5.79e-06 ***
## rm           3.845e+00  4.670e-01   8.234 2.04e-15 ***
## age          8.782e-04  1.434e-02   0.061 0.951211    
## dis         -1.488e+00  2.096e-01  -7.101 4.94e-12 ***
## rad          2.770e-01  6.993e-02   3.960 8.71e-05 ***
## tax         -1.062e-02  3.944e-03  -2.693 0.007348 ** 
## ptratio     -9.799e-01  1.385e-01  -7.073 5.92e-12 ***
## black        9.620e-03  2.827e-03   3.403 0.000726 ***
## lstat       -5.051e-01  5.706e-02  -8.852  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.787 on 444 degrees of freedom
## Multiple R-squared:  0.7309, Adjusted R-squared:  0.723 
## F-statistic: 92.75 on 13 and 444 DF,  p-value: < 2.2e-16
###Accuracy testing on holdout data
pred_reg<-predict(reg_model, newdata=test_boston[,-14])
reg_err<-sum((test_boston$medv-pred_reg)^2)
reg_err
## [1] 918.5927
###Bagging Ensemble Model
library(ipred)
bagg_model<- bagging(medv ~ ., data=train_boston , nbagg=30)

###Accuracy testing on holout data
pred_bagg<-predict(bagg_model, newdata=test_boston[,-14])
bgg_err<-sum((test_boston$medv-pred_bagg)^2)
bgg_err
## [1] 390.9028
###Overall Improvement
reg_err
## [1] 918.5927
bgg_err
## [1] 390.9028
(reg_err-bgg_err)/reg_err
## [1] 0.5744547

Random Forest

  • Like many trees form a forest, many decision tree model together form a Random Forest model
  • Random forest is a specific case of bagging methodology. Bagging on decision trees is random forest
  • In random forest we induce two types of randomness
    • Firstly, we take the boot strap samples of the population and build decision trees on each of the sample.
    • While building the individual trees on boot strap samples, we take a subset of the features randomly
  • Random forests are very stable they are as good as SVMs and sometimes better

Random Forest Algorithm

  • The training dataset D with t number of features
  • Draw k boot strap sample sets from dataset D
  • For each boot strap sample i
    • Build a decision tree model (M_i) using only p number of features (where p<<t)
    • Each tree has maximal strength they are fully grown and not pruned.
  • We will have total of k decision treed (M_1 , M_2 ,… M_k); Each of these trees are built on reactively different training data and different set of features
  • Vote over for the final classifier output and take the average for regression output

The Random Factors in Random Forest

  • We need to note the most important aspect of random forest, i.e inducing randomness into the bagging of trees. There are two major sources of randomness
    • Randomness in data: Boot strapping, this will make sure that any two samples data is somewhat different
    • Randomness in features: While building the decision trees on boot strapped samples we consider only a random subset of features.
  • Why to induce the randomness?
    • The major trick of ensemble models is the independence of models.
    • If we take the same data and build same model for 100 times, we will not see any improvement
    • To make all our decision trees independent, we take independent samples set and independent features set
    • As a rule of thumb we can consider square root of the number of features, if ‘t’ is very large else p=t/3

Why Random Forest Works

  • For a training data with 20 features we are building 100 decision trees with 5 features each, instated of single great decision. The individual trees may be weak classifiers.
  • Its like building weak classifiers on subsets of data. The grouping of large sets of random trees generally produces accurate models.
  • In this example we have three simple classifiers.
  • m1 classifies anything above the line as +1 and below as -1, m2 classifies all the points above the line as -1 and below as +1 and m3 classifies everything on the left as -1 and right as +1
  • Each of these models have fair amount of misclassification error.
  • All these three weak models together make a strong model.

LAB: Random Forest

  • Dataset: /Car Accidents IOT/Train.csv
  • Build a decision tree model to predict the fatality of accident
  • Build a decision tree model on the training data.
  • On the test data, calculate the classification error and accuracy.
  • Build a random forest model on the training data.
  • On the test data, calculate the classification error and accuracy.
  • What is the improvement of the Random Forest model when compared with the single tree?

Solution

#Data Import
train<- read.csv("C:/Amrita/Datavedi/Car Accidents IOT/Train.csv")
test<- read.csv("C:/Amrita/Datavedi/Car Accidents IOT/Test.csv")

dim(train)
## [1] 15109    23
head(train)
##   Fatal      S1       S2       S3  S4       S5 S6 S7 S8 S9      S10 S11
## 1     1 36.2247 10.77330 0.243897 596 100.6710  0  0  1 28 0.016064 313
## 2     1 35.7343 17.45510 0.243897 600 100.0000  0  0  1 14 0.015812 319
## 3     1 31.6561  7.61366 0.308763 604  99.3377  0  0  1  4 0.015560 323
## 4     1 33.8320 13.11190 0.293195 616  97.4026  0  0  1  8 0.016001 320
## 5     1 42.5138 13.99850 0.259465 632  94.9367  0  0  1  8 0.016064 322
## 6     1 36.1261 14.85930 0.278925 600 100.0000  0  0  1  4 0.015749 314
##   S12 S13 S14 S15   S16  S17     S18 S19  S20 S21     S22
## 1   1   1  57   0 0.280  240 5.99375   0  0.0   4 14.9382
## 2   1   1  57   0 0.175  240 5.99375   0  0.0   4 14.8827
## 3   1   1  58   0 0.280  240 5.99375   0  0.0   4 14.6005
## 4   1   1  58   0 0.385  240 4.50625   0 13.0   4 14.6782
## 5   1   1  57   0 0.070  240 5.99375   0 19.5   4 15.3461
## 6   1   1  58   0 0.175 1008 4.50625   0 23.9   4 15.0559
###Decision Tree
library(rpart)
crash_model_ds<-rpart(Fatal ~ ., method="class", control=rpart.control(minsplit=30, cp=0.03),   data=train)

#Training accuarcy
predicted_y<-predict(crash_model_ds, type="class")
table(predicted_y)
## predicted_y
##    0    1 
## 5745 9364
confusionMatrix(predicted_y,train$Fatal)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4735 1010
##          1 1581 7783
##                                           
##                Accuracy : 0.8285          
##                  95% CI : (0.8224, 0.8345)
##     No Information Rate : 0.582           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.643           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7497          
##             Specificity : 0.8851          
##          Pos Pred Value : 0.8242          
##          Neg Pred Value : 0.8312          
##              Prevalence : 0.4180          
##          Detection Rate : 0.3134          
##    Detection Prevalence : 0.3802          
##       Balanced Accuracy : 0.8174          
##                                           
##        'Positive' Class : 0               
## 
#Accuaracy on Test data
predicted_test_ds<-predict(crash_model_ds, test, type="class")
confusionMatrix(predicted_test_ds,test$Fatal)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2897  561
##          1  995 4612
##                                           
##                Accuracy : 0.8284          
##                  95% CI : (0.8204, 0.8361)
##     No Information Rate : 0.5707          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6448          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7443          
##             Specificity : 0.8916          
##          Pos Pred Value : 0.8378          
##          Neg Pred Value : 0.8225          
##              Prevalence : 0.4293          
##          Detection Rate : 0.3196          
##    Detection Prevalence : 0.3815          
##       Balanced Accuracy : 0.8179          
##                                           
##        'Positive' Class : 0               
## 
###Random Forest
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
rf_model <- randomForest(as.factor(train$Fatal) ~ ., ntree=200,   mtry=ncol(train)/3, data=train)

#Training accuaracy
predicted_y<-predict(rf_model)
table(predicted_y)
## predicted_y
##    0    1 
## 5921 9188
confusionMatrix(predicted_y,train$Fatal)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5600  321
##          1  716 8472
##                                           
##                Accuracy : 0.9314          
##                  95% CI : (0.9272, 0.9353)
##     No Information Rate : 0.582           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8577          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8866          
##             Specificity : 0.9635          
##          Pos Pred Value : 0.9458          
##          Neg Pred Value : 0.9221          
##              Prevalence : 0.4180          
##          Detection Rate : 0.3706          
##    Detection Prevalence : 0.3919          
##       Balanced Accuracy : 0.9251          
##                                           
##        'Positive' Class : 0               
## 
#Accuaracy on Test data
predicted_test_rf<-predict(rf_model,test, type="class")
confusionMatrix(predicted_test_rf,test$Fatal)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3479  192
##          1  413 4981
##                                           
##                Accuracy : 0.9333          
##                  95% CI : (0.9279, 0.9383)
##     No Information Rate : 0.5707          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8628          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8939          
##             Specificity : 0.9629          
##          Pos Pred Value : 0.9477          
##          Neg Pred Value : 0.9234          
##              Prevalence : 0.4293          
##          Detection Rate : 0.3838          
##    Detection Prevalence : 0.4050          
##       Balanced Accuracy : 0.9284          
##                                           
##        'Positive' Class : 0               
## 

Boosting

  • Boosting is one more famous ensemble method
  • Boosting uses a slightly different techniques to that of bagging.
  • Boosting is a well proven theory that works really well on many of the machine learning problems like speech recognition
  • If bagging is wisdom of crowds then boosting is wisdom of crowds where each individual is given some weight based on their expertise
  • Boosting in general decreases the bias error and builds strong predictive models.
  • Boosting is an iterative technique. We adjust the weight of the observation based on the previous classification.
  • If an observation was classified incorrectly, it tries to increase the weight of this observation and vice versa.

Boosting Main idea

Final Classifier (C = sum alpha_i c_i)

How weighted samples are taken

Boosting Illustration

Below is the training data and their classes We need to take a note of record numbers, they will help us in weighted sampling later

Theory behind Boosting Algorithm

  • Take the dataset
  • Build a classifier (C_m) and find the error
  • Calculate error rate of the classifier
    • Error rate of (epsilon _m = sum w_i I (y_i neq C_m (x)) / sum w_i) = Sum of misclassification weight / sum of sample weights
  • Calculate an intermediate factor called a. It analogous to accuracy rate of the model. It will be later used in weight updating. It is derived from error
    • (alpha _m = log(1- epsilon _m)/epsilon _m))
  • Update weights of each record in the sample using the a factor. The indicator function will make sure that the misclassifications are given more weight
    • For i =1,2,… N
      • (W_(i+1) = w_i e^(alpha _m I(y_ineq C_m (x))))
      • Renormalize so that sum of weights is 1
  • Repeat this model building and weight update process until we have no misclassification
  • Final collation is done by voting from all the modes. While taking the votes, each model is weighted by the accuracy factor (alpha)
    • (C = sign(sum alpha _i C_i(x)))

Gradient Boosting

  • Ada boosting
    • Adaptive Boosting
    • Till now we discussed Ada boosting technique. Here we give high weight to misclassified records.
  • Gradient Boosting
    • Similar to Ada boosting algorithm.
    • The approach is same but there are slight modifications during re-weighted sampling.
    • We update the weights based on misclassification rate and gradient
    • Gradient boosting serves better for some class of problems like regression.

LAB: Boosting

  • Rightly categorizing the items based on their detailed feature specifications. More than 100 specifications have been collected.
  • Data: Ecom_Products_Menu/train.csv
  • Build a decision tree model and check the training and testing accuracy
  • Build a boosted decision tree.
  • Is there any improvement from the earlier decision tree

Solution

train <- read.csv("C:/Amrita/Datavedi/Ecom_Products_Menu/train.csv")
test <- read.csv("C:/Amrita/Datavedi/Ecom_Products_Menu/test.csv")

dim(train)
## [1] 50122   102
##Decison Tree
library(rpart)
ecom_products_ds<-rpart(Category ~ ., method="class", control=rpart.control(minsplit=30, cp=0.01),  data=train[,-1])
library(rattle)
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
fancyRpartPlot(ecom_products_ds)

#Training accuarcy
library(caret)
predicted_y<-predict(ecom_products_ds, type="class")
table(predicted_y)
## predicted_y
##   Accessories    Appliances        Camara          Ipod       Laptops 
##             0         10899          2733          2442             0 
##       Mobiles Personal_Care       Tablets            TV 
##             0         10288         23760             0
confusionMatrix(predicted_y,train$Category)
## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      Accessories Appliances Camara  Ipod Laptops Mobiles
##   Accessories             0          0      0     0       0       0
##   Appliances            825       5536   1086   130     506     709
##   Camara                 88        387   1456     4      55     388
##   Ipod                   30         17     23  2032     144       5
##   Laptops                 0          0      0     0       0       0
##   Mobiles                 0          0      0     0       0       0
##   Personal_Care         110        308    152     0      18      79
##   Tablets              1288        615   1247    51    5743     377
##   TV                      0          0      0     0       0       0
##                Reference
## Prediction      Personal_Care Tablets    TV
##   Accessories               0       0     0
##   Appliances             1035     932   140
##   Camara                  252      84    19
##   Ipod                     13     159    19
##   Laptops                   0       0     0
##   Mobiles                   0       0     0
##   Personal_Care          9545      19    57
##   Tablets                 607   11885  1947
##   TV                        0       0     0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6076          
##                  95% CI : (0.6033, 0.6119)
##     No Information Rate : 0.2609          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5053          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Accessories Class: Appliances Class: Camara
## Sensitivity                     0.00000            0.8066       0.36731
## Specificity                     1.00000            0.8760       0.97233
## Pos Pred Value                      NaN            0.5079       0.53275
## Neg Pred Value                  0.95329            0.9662       0.94708
## Prevalence                      0.04671            0.1369       0.07909
## Detection Rate                  0.00000            0.1105       0.02905
## Detection Prevalence            0.00000            0.2174       0.05453
## Balanced Accuracy               0.50000            0.8413       0.66982
##                      Class: Ipod Class: Laptops Class: Mobiles
## Sensitivity              0.91655          0.000        0.00000
## Specificity              0.99144          1.000        1.00000
## Pos Pred Value           0.83210            NaN            NaN
## Neg Pred Value           0.99612          0.871        0.96892
## Prevalence               0.04423          0.129        0.03108
## Detection Rate           0.04054          0.000        0.00000
## Detection Prevalence     0.04872          0.000        0.00000
## Balanced Accuracy        0.95400          0.500        0.50000
##                      Class: Personal_Care Class: Tablets Class: TV
## Sensitivity                        0.8335         0.9087   0.00000
## Specificity                        0.9808         0.6794   1.00000
## Pos Pred Value                     0.9278         0.5002       NaN
## Neg Pred Value                     0.9521         0.9547   0.95647
## Prevalence                         0.2285         0.2609   0.04353
## Detection Rate                     0.1904         0.2371   0.00000
## Detection Prevalence               0.2053         0.4740   0.00000
## Balanced Accuracy                  0.9071         0.7941   0.50000
#Accuarcy on Test data
predicted_test_ds<-predict(ecom_products_ds, test[,-1], type="class")
confusionMatrix(predicted_test_ds,test$Category)
## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      Accessories Appliances Camara Ipod Laptops Mobiles
##   Accessories             0          0      0    0       0       0
##   Appliances            172       1308    269   40      92     170
##   Camara                 15         80    383    1      16      95
##   Ipod                   14          4      3  469      28       0
##   Laptops                 0          0      0    0       0       0
##   Mobiles                 0          0      0    0       0       0
##   Personal_Care          23         75     42    0       1      23
##   Tablets               274        134    294   12    1401      83
##   TV                      0          0      0    0       0       0
##                Reference
## Prediction      Personal_Care Tablets   TV
##   Accessories               0       0    0
##   Appliances              234     210   42
##   Camara                   52      23    3
##   Ipod                      3      49    5
##   Laptops                   0       0    0
##   Mobiles                   0       0    0
##   Personal_Care          2242      10   17
##   Tablets                 152    2751  442
##   TV                        0       0    0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6085          
##                  95% CI : (0.5996, 0.6173)
##     No Information Rate : 0.2588          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5071          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Accessories Class: Appliances Class: Camara
## Sensitivity                     0.00000            0.8170       0.38648
## Specificity                     1.00000            0.8790       0.97353
## Pos Pred Value                      NaN            0.5156       0.57335
## Neg Pred Value                  0.95764            0.9682       0.94517
## Prevalence                      0.04236            0.1362       0.08430
## Detection Rate                  0.00000            0.1113       0.03258
## Detection Prevalence            0.00000            0.2158       0.05682
## Balanced Accuracy               0.50000            0.8480       0.68000
##                      Class: Ipod Class: Laptops Class: Mobiles
## Sensitivity              0.89847         0.0000        0.00000
## Specificity              0.99056         1.0000        1.00000
## Pos Pred Value           0.81565            NaN            NaN
## Neg Pred Value           0.99526         0.8692        0.96844
## Prevalence               0.04440         0.1308        0.03156
## Detection Rate           0.03989         0.0000        0.00000
## Detection Prevalence     0.04891         0.0000        0.00000
## Balanced Accuracy        0.94452         0.5000        0.50000
##                      Class: Personal_Care Class: Tablets Class: TV
## Sensitivity                        0.8356         0.9040    0.0000
## Specificity                        0.9789         0.6796    1.0000
## Pos Pred Value                     0.9215         0.4963       NaN
## Neg Pred Value                     0.9527         0.9530    0.9567
## Prevalence                         0.2282         0.2588    0.0433
## Detection Rate                     0.1907         0.2340    0.0000
## Detection Prevalence               0.2070         0.4715    0.0000
## Balanced Accuracy                  0.9073         0.7918    0.5000
###Boosting

library(xgboost)
library(methods)
library(data.table)
library(magrittr)

# converting datasets to Numeric format. xgboost needs at least one numeric column 
train[,c(-1,-102)] <- lapply( train[,c(-1,-102)], as.numeric)
test[,c(-1,-102)] <- lapply( test[,c(-1,-102)], as.numeric)

# converting datasets to Matrix format. Data frame is not supported by xgboost
trainMatrix <- train[,c(-1,-102)] %>% as.matrix
testMatrix <- test[,c(-1,-102)] %>% as.matrix

#The label should be in numeric format and it should start from 0
y<-as.integer(train$Category)-1
table(y,train$Category)
##    
## y   Accessories Appliances Camara  Ipod Laptops Mobiles Personal_Care
##   0        2341          0      0     0       0       0             0
##   1           0       6863      0     0       0       0             0
##   2           0          0   3964     0       0       0             0
##   3           0          0      0  2217       0       0             0
##   4           0          0      0     0    6466       0             0
##   5           0          0      0     0       0    1558             0
##   6           0          0      0     0       0       0         11452
##   7           0          0      0     0       0       0             0
##   8           0          0      0     0       0       0             0
##    
## y   Tablets    TV
##   0       0     0
##   1       0     0
##   2       0     0
##   3       0     0
##   4       0     0
##   5       0     0
##   6       0     0
##   7   13079     0
##   8       0  2182
test_y<-as.integer(test$Category)-1
table(test_y,test$Category)
##       
## test_y Accessories Appliances Camara Ipod Laptops Mobiles Personal_Care
##      0         498          0      0    0       0       0             0
##      1           0       1601      0    0       0       0             0
##      2           0          0    991    0       0       0             0
##      3           0          0      0  522       0       0             0
##      4           0          0      0    0    1538       0             0
##      5           0          0      0    0       0     371             0
##      6           0          0      0    0       0       0          2683
##      7           0          0      0    0       0       0             0
##      8           0          0      0    0       0       0             0
##       
## test_y Tablets   TV
##      0       0    0
##      1       0    0
##      2       0    0
##      3       0    0
##      4       0    0
##      5       0    0
##      6       0    0
##      7    3043    0
##      8       0  509
#Setting the parameters for multiclass classification
param <- list("objective" = "multi:softprob","eval.metric" = "merror",   "num_class" =9)
#"multi:softmax" --set XGBoost to do multiclass classification using the softmax objective, you also need to set num_class(number of classes)     
#"merror": Multiclass classification error rate. It is calculated as #(wrong cases)/#(all cases).

XGBModel <- xgboost(param=param, data = trainMatrix, label = y, nrounds=40)
## [0]  train-merror:0.269223
## [1]  train-merror:0.241750
## [2]  train-merror:0.229500
## [3]  train-merror:0.222776
## [4]  train-merror:0.218966
## [5]  train-merror:0.211923
## [6]  train-merror:0.208312
## [7]  train-merror:0.203703
## [8]  train-merror:0.199553
## [9]  train-merror:0.196481
## [10] train-merror:0.192969
## [11] train-merror:0.190695
## [12] train-merror:0.188241
## [13] train-merror:0.185487
## [14] train-merror:0.183193
## [15] train-merror:0.180400
## [16] train-merror:0.177886
## [17] train-merror:0.175552
## [18] train-merror:0.173217
## [19] train-merror:0.171362
## [20] train-merror:0.168968
## [21] train-merror:0.166474
## [22] train-merror:0.164379
## [23] train-merror:0.162743
## [24] train-merror:0.161925
## [25] train-merror:0.160389
## [26] train-merror:0.158214
## [27] train-merror:0.156478
## [28] train-merror:0.155521
## [29] train-merror:0.154284
## [30] train-merror:0.152628
## [31] train-merror:0.151271
## [32] train-merror:0.149356
## [33] train-merror:0.147879
## [34] train-merror:0.146283
## [35] train-merror:0.144827
## [36] train-merror:0.143749
## [37] train-merror:0.142053
## [38] train-merror:0.140358
## [39] train-merror:0.139240
#Training accuarcy
predicted_y<-predict(XGBModel, trainMatrix)
probs <- data.frame(matrix(predicted_y, nrow=nrow(train), ncol=9,  byrow = TRUE))

probs_final<-as.data.frame(cbind(row.names(probs),apply(probs,1, function(x) c(0:8)[which(x==max(x))])))
table(probs_final$V2)
## 
##     0     1     2     3     4     5     6     7     8 
##  2140  6969  3997  2227  5142  1242 11418 15605  1382
confusionMatrix(probs_final$V2,y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1     2     3     4     5     6     7     8
##          0  1820    32    13     1    74    26    94    58    22
##          1    73  6495   123     1    13   129   119    12     4
##          2    13    78  3584     2     4   204   103     9     0
##          3     8     5     3  2192     0     1     0     8    10
##          4    84    20     4     3  3830     5    12   970   214
##          5    28    55    60     2     2  1051    37     7     0
##          6    81   105    93     1     5    95 10987    20    31
##          7   216    73    82    15  2500    46    92 11932   649
##          8    18     0     2     0    38     1     8    63  1252
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8608          
##                  95% CI : (0.8577, 0.8638)
##     No Information Rate : 0.2609          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8306          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.77745   0.9464  0.90414  0.98872  0.59233  0.67458
## Specificity           0.99330   0.9890  0.99105  0.99927  0.96995  0.99607
## Pos Pred Value        0.85047   0.9320  0.89667  0.98428  0.74485  0.84622
## Neg Pred Value        0.98914   0.9915  0.99176  0.99948  0.94140  0.98963
## Prevalence            0.04671   0.1369  0.07909  0.04423  0.12901  0.03108
## Detection Rate        0.03631   0.1296  0.07151  0.04373  0.07641  0.02097
## Detection Prevalence  0.04270   0.1390  0.07975  0.04443  0.10259  0.02478
## Balanced Accuracy     0.88537   0.9677  0.94759  0.99400  0.78114  0.83532
##                      Class: 6 Class: 7 Class: 8
## Sensitivity            0.9594   0.9123  0.57379
## Specificity            0.9889   0.9008  0.99729
## Pos Pred Value         0.9623   0.7646  0.90593
## Neg Pred Value         0.9880   0.9668  0.98092
## Prevalence             0.2285   0.2609  0.04353
## Detection Rate         0.2192   0.2381  0.02498
## Detection Prevalence   0.2278   0.3113  0.02757
## Balanced Accuracy      0.9741   0.9066  0.78554
#Accuarcy on Test data

predicted_test_boost<-predict(XGBModel, testMatrix)
probs_test <- data.frame(matrix(predicted_test_boost, nrow=nrow(test), ncol=9,  byrow = TRUE))

probs_final_test<-as.data.frame(cbind(row.names(probs_test),apply(probs_test,1, function(x) c(0:8)[which(x==max(x))])))
table(probs_final_test$V2)
## 
##    0    1    2    3    4    5    6    7    8 
##  446 1654 1037  514 1202  231 2699 3707  266
confusionMatrix(probs_final_test$V2,test_y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2    3    4    5    6    7    8
##          0  327   15    2    1   26    8   38   22    7
##          1   27 1476   34    0    4   66   37    9    1
##          2    1   29  881    0    4   78   34   10    0
##          3    1    1    1  502    0    1    2    4    2
##          4   29    6    2    1  743    4    2  344   71
##          5   11   21   22    0    0  163   13    0    1
##          6   38   35   32    1    2   35 2526    9   21
##          7   58   18   17   15  733   16   26 2620  204
##          8    6    0    0    2   26    0    5   25  202
## 
## Overall Statistics
##                                           
##                Accuracy : 0.803           
##                  95% CI : (0.7957, 0.8102)
##     No Information Rate : 0.2588          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.76            
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.65663   0.9219  0.88900  0.96169   0.4831  0.43935
## Specificity           0.98943   0.9825  0.98551  0.99893   0.9551  0.99403
## Pos Pred Value        0.73318   0.8924  0.84957  0.97665   0.6181  0.70563
## Neg Pred Value        0.98488   0.9876  0.98974  0.99822   0.9247  0.98195
## Prevalence            0.04236   0.1362  0.08430  0.04440   0.1308  0.03156
## Detection Rate        0.02782   0.1256  0.07494  0.04270   0.0632  0.01387
## Detection Prevalence  0.03794   0.1407  0.08821  0.04372   0.1022  0.01965
## Balanced Accuracy     0.82303   0.9522  0.93725  0.98031   0.7191  0.71669
##                      Class: 6 Class: 7 Class: 8
## Sensitivity            0.9415   0.8610  0.39686
## Specificity            0.9809   0.8752  0.99431
## Pos Pred Value         0.9359   0.7068  0.75940
## Neg Pred Value         0.9827   0.9474  0.97328
## Prevalence             0.2282   0.2588  0.04330
## Detection Rate         0.2149   0.2229  0.01718
## Detection Prevalence   0.2296   0.3153  0.02263
## Balanced Accuracy      0.9612   0.8681  0.69558

When Ensemble doesn’t work?

  • The models have to be independent, we can’t build the same model multiple times and expect the error to reduce.
  • We may have to bring in the independence by choosing subsets of data, or subset of features while building the individual models
  • Ensemble may backfire if we use dependent models that are already less accurate. The final ensemble might turn out to be even worse model.
  • Yes, there is a small disclaimer in “Wisdom of Crowd” theory. We need good independent individuals. If we collate any dependent individuals with poor knowledge, then we might end with an even worse ensemble.
  • For example, we built three models, model-1 , model-2 are bad, model-3 is good. Most of the times ensemble will result the combined output of model-1 and model-2, based on voting

LAB: When Ensemble doesn’t work?

  • When the individual models/ sample are dependent
#Data Import
train<- read.csv("C:/Amrita/Datavedi/Car Accidents IOT/Train.csv")
test<- read.csv("C:/Amrita/Datavedi/Car Accidents IOT/Test.csv")

####Logistic Regression
crash_model_logistic <- glm(Fatal ~ . , data=train, family = binomial())
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(crash_model_logistic)
## 
## Call:
## glm(formula = Fatal ~ ., family = binomial(), data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -8.4904  -0.8571   0.3656   0.8242   3.1945  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  8.954e-01  5.412e-01   1.654 0.098067 .  
## S1          -1.045e-02  2.860e-03  -3.653 0.000259 ***
## S2          -3.740e-03  5.454e-03  -0.686 0.492915    
## S3           2.638e-01  6.112e-02   4.316 1.59e-05 ***
## S4           1.605e-03  2.197e-04   7.304 2.80e-13 ***
## S5           3.161e-02  2.718e-03  11.631  < 2e-16 ***
## S6           3.748e-03  2.414e-03   1.553 0.120537    
## S7          -8.739e-04  2.476e-04  -3.530 0.000415 ***
## S8           1.684e-01  3.209e-02   5.247 1.54e-07 ***
## S9          -8.099e-04  7.008e-04  -1.156 0.247805    
## S10         -9.886e+01  9.210e+00 -10.734  < 2e-16 ***
## S11         -1.538e-02  8.875e-04 -17.334  < 2e-16 ***
## S12         -2.447e-01  2.161e-02 -11.324  < 2e-16 ***
## S13          3.227e+00  1.092e-01  29.549  < 2e-16 ***
## S14          7.233e-03  1.663e-03   4.350 1.36e-05 ***
## S15          6.571e-03  4.373e-03   1.503 0.132889    
## S16         -7.763e-02  5.666e-02  -1.370 0.170693    
## S17         -3.497e-04  6.861e-05  -5.097 3.46e-07 ***
## S18         -2.865e-04  4.433e-04  -0.646 0.518052    
## S19         -6.798e-02  6.262e-02  -1.086 0.277665    
## S20         -1.001e-02  2.043e-03  -4.902 9.49e-07 ***
## S21         -4.146e-01  2.398e-02 -17.291  < 2e-16 ***
## S22          1.678e-01  6.718e-03  24.981  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20538  on 15108  degrees of freedom
## Residual deviance: 14794  on 15086  degrees of freedom
## AIC: 14840
## 
## Number of Fisher Scoring iterations: 8
#Training accuarcy
predicted_y<-round(predict(crash_model_logistic,type="response"),0)
confusionMatrix(predicted_y,crash_model_logistic$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4394 1300
##          1 1922 7493
##                                           
##                Accuracy : 0.7867          
##                  95% CI : (0.7801, 0.7933)
##     No Information Rate : 0.582           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5556          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.6957          
##             Specificity : 0.8522          
##          Pos Pred Value : 0.7717          
##          Neg Pred Value : 0.7959          
##              Prevalence : 0.4180          
##          Detection Rate : 0.2908          
##    Detection Prevalence : 0.3769          
##       Balanced Accuracy : 0.7739          
##                                           
##        'Positive' Class : 0               
## 
#Accuarcy on Test data
predicted_test_logistic<-round(predict(crash_model_logistic,test, type="response"),0)
confusionMatrix(predicted_test_logistic,test$Fatal)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2766  781
##          1 1126 4392
##                                          
##                Accuracy : 0.7896         
##                  95% CI : (0.7811, 0.798)
##     No Information Rate : 0.5707         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.5659         
##  Mcnemar's Test P-Value : 3.343e-15      
##                                          
##             Sensitivity : 0.7107         
##             Specificity : 0.8490         
##          Pos Pred Value : 0.7798         
##          Neg Pred Value : 0.7959         
##              Prevalence : 0.4293         
##          Detection Rate : 0.3051         
##    Detection Prevalence : 0.3913         
##       Balanced Accuracy : 0.7799         
##                                          
##        'Positive' Class : 0              
## 
###Decision Tree

library(rpart)
crash_model_ds<-rpart(Fatal ~ ., method="class",   data=train)

#Training accuarcy
predicted_y<-predict(crash_model_ds, type="class")
table(predicted_y)
## predicted_y
##    0    1 
## 5544 9565
confusionMatrix(predicted_y,train$Fatal)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4705  839
##          1 1611 7954
##                                           
##                Accuracy : 0.8378          
##                  95% CI : (0.8319, 0.8437)
##     No Information Rate : 0.582           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6609          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7449          
##             Specificity : 0.9046          
##          Pos Pred Value : 0.8487          
##          Neg Pred Value : 0.8316          
##              Prevalence : 0.4180          
##          Detection Rate : 0.3114          
##    Detection Prevalence : 0.3669          
##       Balanced Accuracy : 0.8248          
##                                           
##        'Positive' Class : 0               
## 
#Accuaracy on Test data
predicted_test_ds<-predict(crash_model_ds, test, type="class")
confusionMatrix(predicted_test_ds,test$Fatal)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2884  454
##          1 1008 4719
##                                          
##                Accuracy : 0.8387         
##                  95% CI : (0.831, 0.8462)
##     No Information Rate : 0.5707         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.665          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.7410         
##             Specificity : 0.9122         
##          Pos Pred Value : 0.8640         
##          Neg Pred Value : 0.8240         
##              Prevalence : 0.4293         
##          Detection Rate : 0.3181         
##    Detection Prevalence : 0.3682         
##       Balanced Accuracy : 0.8266         
##                                          
##        'Positive' Class : 0              
## 
####SVM Model
library(e1071)
pc <- proc.time()
crash_model_svm <- svm(Fatal ~ . , type="C", data = train)
proc.time() - pc
##    user  system elapsed 
##   89.49    0.13   92.84
summary(crash_model_svm)
## 
## Call:
## svm(formula = Fatal ~ ., data = train, type = "C")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.04545455 
## 
## Number of Support Vectors:  6992
## 
##  ( 3582 3410 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1
#Confusion Matrix
library(caret)
label_predicted<-predict(crash_model_svm, type = "class")
confusionMatrix(label_predicted,train$Fatal)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4811  538
##          1 1505 8255
##                                           
##                Accuracy : 0.8648          
##                  95% CI : (0.8592, 0.8702)
##     No Information Rate : 0.582           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.716           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7617          
##             Specificity : 0.9388          
##          Pos Pred Value : 0.8994          
##          Neg Pred Value : 0.8458          
##              Prevalence : 0.4180          
##          Detection Rate : 0.3184          
##    Detection Prevalence : 0.3540          
##       Balanced Accuracy : 0.8503          
##                                           
##        'Positive' Class : 0               
## 
#Out of time validation with test data
predicted_test_svm<-predict(crash_model_svm, newdata =test[,-1] , type = "class")
confusionMatrix(predicted_test_svm,test[,1])
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2933  399
##          1  959 4774
##                                           
##                Accuracy : 0.8502          
##                  95% CI : (0.8427, 0.8575)
##     No Information Rate : 0.5707          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6887          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7536          
##             Specificity : 0.9229          
##          Pos Pred Value : 0.8803          
##          Neg Pred Value : 0.8327          
##              Prevalence : 0.4293          
##          Detection Rate : 0.3236          
##    Detection Prevalence : 0.3676          
##       Balanced Accuracy : 0.8382          
##                                           
##        'Positive' Class : 0               
## 
####Ensemble Model

#DS and SVM are predictng 1 & 2
predicted_test_logistic1<-predicted_test_logistic+1

Ens_predicted_data<-data.frame(lg=as.numeric(predicted_test_logistic1),ds=as.numeric(predicted_test_ds), svm=as.numeric(predicted_test_svm))

Ens_predicted_data$final<-ifelse(Ens_predicted_data$lg+Ens_predicted_data$ds+Ens_predicted_data$svm<4.5,0,1)
table(Ens_predicted_data$final)
## 
##    0    1 
## 3340 5725
##Ensemble Model accuracy test data
confusionMatrix(Ens_predicted_data$final,test[,1])
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2878  462
##          1 1014 4711
##                                           
##                Accuracy : 0.8372          
##                  95% CI : (0.8294, 0.8447)
##     No Information Rate : 0.5707          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6618          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7395          
##             Specificity : 0.9107          
##          Pos Pred Value : 0.8617          
##          Neg Pred Value : 0.8229          
##              Prevalence : 0.4293          
##          Detection Rate : 0.3175          
##    Detection Prevalence : 0.3685          
##       Balanced Accuracy : 0.8251          
##                                           
##        'Positive' Class : 0               
## 

Conclusion

  • Ensemble methods are most widely used methods these days. With advanced machines, its not really a huge task to build multiple models.
  • Both bagging and boosting does a good job of reducing bias and variance
  • Random forests are relatively fast, since we are building many small trees, it doesn’t put lot of pressure on the computing machine
  • Random forest can also give the variable importance. We need to be careful with categorical features, random forests tend to give higher importance to variables with higher number of levels.
  • In Boosted algorithms we may have to restrict the number of iterations to avoid overfitting
  • Ensemble models are the final effort of a data scientist, while building the most suitable predictive model for the data

 

DV Analytics

DV Data & Analytics is a leading data science training and consulting firm, led by industry experts. We are aiming to train and prepare resources to acquire the most in-demand data science job opportunities in India and abroad.

Bangalore Center

DV Data & Analytics Bangalore Private Limited
#52, 2nd Floor:
Malleshpalya Maruthinagar Bengaluru.
Bangalore 560075
India
(+91) 9019 030 033 (+91) 8095 881 188
Email: info@dvanalyticsmds.com

Bhubneshwar Center

DV Data & Analytics Private Limited Bhubaneswar
Plot No A/7 :
Adjacent to Maharaja Cine Complex, Bhoinagar, Acharya Vihar
Bhubaneswar 751022
(+91) 8095 881 188 (+91) 8249 430 414
Email: info@dvanalyticsmds.com

top
© 2020. All Rights Reserved.