Ensemble Models & Random Forests

Introduction
Ensemble Learning
How ensemble learning works
Bagging
Building models using Bagging
Random Forest algorithm
Random Forest model building
Boosting
Building models using boosting
Conclusion

The Wisdom of Crowds

One should not expend energy trying to identify an expert within a group but instead rely on the group's collective wisdom. However, the opinions must be independent, and some knowledge of the truth must reside with some group members (Surowiecki)
So instead of trying to build one great model, it's better to build several independent, moderately accurate models and take their average as the final prediction

What is Ensemble Learning

Imagine a classification problem with two classes, +1 and -1, in the target
Imagine that we built the best possible decision tree and it has 91% accuracy
Let x be a new data point that our decision tree predicts to be +1. Is there a way to do better than 91% using the same data?
Let's build 3 more models on the same data and see whether we can improve the performance

We have four models on the same dataset, each with a different accuracy. Unfortunately, there seems to be no real improvement in accuracy.

What about prediction of the data point x?
Except for the decision tree, all the other algorithms predict the class of x as -1. Intuitively, we would like to believe that the class of x is -1
The combined voting model seems to have less error than each of the individual models. This is the core philosophy of ensemble learning
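The voting idea above can be sketched in a few lines of R. The model names and their individual votes below are illustrative assumptions that stand in for the four models discussed above; they are not taken from real fitted models.

```r
# Hypothetical predictions for the new point x from four models
# (names and votes are assumptions made for illustration)
votes <- c(decision_tree = 1, model_2 = -1, model_3 = -1, model_4 = -1)

# Majority vote: with classes coded +1/-1, the sign of the summed
# votes is the class predicted by most models
ensemble_prediction <- sign(sum(votes))
ensemble_prediction
```

With an even number of models a tie is possible (the sum would be 0), so in practice an odd number of voters or a tie-breaking rule is needed.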

Ensemble Models

Ensemble modeling means obtaining better predictions by using multiple models on the same dataset
It is not always possible to find a single best-fit model for our data. An ensemble model combines multiple models to come up with one consolidated model
Ensemble models work on the principle that multiple moderately accurate models can together give us a highly accurate model
Understandably, building and evaluating ensemble models is computationally expensive
Building one really good model is the usual statistical approach. Building many models and averaging the results is the philosophy of ensemble learning

Why the Ensemble Technique Works

Imagine three models
- M1 with an error rate of 10%
- M2 with an error rate of 10%
- M3 with an error rate of 10%
The three models have to be independent. We can't build the same model three times and expect the error to reduce. Any change to the modeling technique in model 1 should not impact model 2
In this scenario, the worst ensemble model will have a 10% error rate (the three models always make the same mistakes)
With fully independent models combined by majority vote, the ensemble model will have an error rate of 2.8%
- The ensemble is wrong when 2 out of 3 models predict wrong or when all 3 models predict wrong
- (3C2)*(0.1)(0.1)(0.9) + (0.1)(0.1)(0.1) = 0.027 + 0.001
- = 2.8%
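The 2.8% figure can be checked numerically. This is a minimal sketch that assumes the models are fully independent and combined by simple majority vote; the helper name majority_error is illustrative and not part of any package.

```r
# Probability that a majority of n independent models are wrong,
# when each is wrong with probability p_err (n_models assumed odd)
majority_error <- function(n_models, p_err) {
  pbinom(floor(n_models / 2), size = n_models, prob = p_err,
         lower.tail = FALSE)
}

# Three models with a 10% error rate each
majority_error(3, 0.1)

# The same number written out term by term, as in the text
choose(3, 2) * 0.1 * 0.1 * 0.9 + 0.1^3
```

Calling majority_error with 5 or 7 models shows how quickly the error would fall if more truly independent models were available, which is rarely the case in practice.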

Types of Ensemble Models

The above example is a very primitive type of ensemble model. There are better and statistically stronger ensemble methods that yield better results
The two most popular ensemble methodologies are
- Bagging
- Boosting

Bagging

Take multiple bootstrap samples from the population and build a classifier on each sample. For prediction, take the mean or mode of all the individual model predictions.
Bagging has two major parts: 1) bootstrap sampling and 2) aggregation of learners
Bagging = Bootstrap Aggregating
In bagging we combine many unstable models to produce a stable model. Hence the predictions are more reliable (less variance in the final model).

Bootstrapping

We have training data of size N
Draw a random sample of size N with replacement. This gives a new dataset that might contain repeated observations, while some observations might not appear even once.
We select records one at a time and return each selected record to the population, giving it a chance to be selected again
Create B such new datasets. These are called bootstrap datasets
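The sampling steps above can be sketched with base R's sample(). The values of N and B and the seed are arbitrary choices made for illustration.

```r
set.seed(42)                      # arbitrary seed, for reproducibility only
N <- 1000                         # size of the training data (assumed)
training_rows <- seq_len(N)       # row numbers stand in for the records

# One bootstrap dataset: N row numbers drawn with replacement
boot_rows <- sample(training_rows, size = N, replace = TRUE)

length(boot_rows)                 # same size as the original data
any(duplicated(boot_rows))        # TRUE when some records are repeated
length(unique(boot_rows)) / N     # share of distinct records that were drawn

# B bootstrap datasets, kept as a list of row-number vectors
B <- 25
boot_sets <- replicate(B, sample(training_rows, N, replace = TRUE),
                       simplify = FALSE)
length(boot_sets)
```

Storing row numbers rather than copies of the data keeps memory use low; a model for sample b can then be fitted on the rows given by boot_sets[[b]].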

The Bagging Algorithm

Start with the training dataset D
Draw k bootstrap sample sets from dataset D
For each bootstrap sample i
- Build a classifier model (M_i)
We will have a total of k classifiers (M_1, M_2, …, M_k)
Take a vote for the final classifier output, and take the average for regression output
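The algorithm above can be written out by hand. This is a minimal sketch, not the implementation used by any package: it assumes the built-in mtcars data as a stand-in for D, rpart regression trees as the base model, and k = 25.

```r
library(rpart)   # decision trees, used here as the unstable base model

set.seed(1)
D <- mtcars      # stand-in for the training dataset D (built into R)
k <- 25          # number of bootstrap samples (assumed)

# Draw k bootstrap samples and build one model M_i on each
models <- lapply(seq_len(k), function(i) {
  boot <- D[sample(nrow(D), replace = TRUE), ]
  rpart(mpg ~ ., data = boot, control = rpart.control(minsplit = 5))
})

# Regression output: average the k individual predictions
all_preds <- sapply(models, predict, newdata = D)  # rows = records, columns = models
bagged_pred <- rowMeans(all_preds)
head(bagged_pred)
```

For a classification target, the last step would take the most frequent class in each row instead of the row mean.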

Why Bagging Works

Bootstrap sampling selects records one at a time and returns each selected record to the population, so any record can be selected again
The variance of the consolidated prediction is reduced when the samples are independent. That way we can reduce the unavoidable errors made by a single model.
In a given bootstrap sample, some observations are selected multiple times and some are not selected at all.
It can be shown that a bootstrap sample contains, on average, only about 63% of the distinct observations; the remaining 37% or so are not present.
So the data used in each of these models is not exactly the same. This makes our learning models closer to independent and helps the predictors have uncorrelated errors.
Finally, the errors from the individual models tend to cancel out and give us a better ensemble model with higher accuracy
Bagging is really useful when the individual models have a lot of variance, for example when they are sensitive to small changes in the training data
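The 63% figure follows from a short calculation: a given record is missed by one draw with probability 1 - 1/N, so it is missed by all N draws with probability (1 - 1/N)^N, which approaches 1/e (about 0.37) as N grows. The sketch below evaluates this; the function name p_included is illustrative.

```r
# Probability that a given record appears at least once
# in a bootstrap sample of size N
p_included <- function(N) 1 - (1 - 1 / N)^N

p_included(c(10, 100, 1000, 100000))

# Limiting value as N grows: 1 - 1/e, roughly 0.632
1 - exp(-1)
```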

LAB: Bagging Models

Import the Boston house price data. It is part of the MASS package
Get some basic meta details of the data
Take 90% of the data for training and keep the remaining 10% as holdout data
Build a single linear regression model on the training data.
On the holdout data, calculate the error (sum of squared deviations) for the regression model.
Build a regression model using the bagging technique. Build at least 25 models
On the holdout data, calculate the error (sum of squared deviations) for the consolidated bagged regression model.
What is the improvement of the bagged model when compared with the single model?

Solution

#Importing Boston  house pricing data. 
library(MASS)
data(Boston)
head(Boston)

##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7

dim(Boston)

## [1] 506  14

##Training and holdout sample
library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

set.seed(500)
sampleseed <- createDataPartition(Boston$medv, p=0.9, list=FALSE)

train_boston<-Boston[sampleseed,]
test_boston<-Boston[-sampleseed,]

###Regression Model
reg_model<- lm(medv ~ ., data=train_boston)
summary(reg_model)

## 
## Call:
## lm(formula = medv ~ ., data = train_boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.4763  -2.7684  -0.4912   1.9030  26.4569 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.637e+01  5.534e+00   6.572 1.40e-10 ***
## crim        -1.042e-01  3.513e-02  -2.965 0.003195 ** 
## zn           4.482e-02  1.459e-02   3.073 0.002248 ** 
## indus        1.986e-02  6.566e-02   0.302 0.762462    
## chas         2.733e+00  8.765e-01   3.118 0.001939 ** 
## nox         -1.844e+01  4.018e+00  -4.590 5.79e-06 ***
## rm           3.845e+00  4.670e-01   8.234 2.04e-15 ***
## age          8.782e-04  1.434e-02   0.061 0.951211    
## dis         -1.488e+00  2.096e-01  -7.101 4.94e-12 ***
## rad          2.770e-01  6.993e-02   3.960 8.71e-05 ***
## tax         -1.062e-02  3.944e-03  -2.693 0.007348 ** 
## ptratio     -9.799e-01  1.385e-01  -7.073 5.92e-12 ***
## black        9.620e-03  2.827e-03   3.403 0.000726 ***
## lstat       -5.051e-01  5.706e-02  -8.852  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.787 on 444 degrees of freedom
## Multiple R-squared:  0.7309, Adjusted R-squared:  0.723 
## F-statistic: 92.75 on 13 and 444 DF,  p-value: < 2.2e-16

###Accuracy testing on holdout data
pred_reg<-predict(reg_model, newdata=test_boston[,-14])
reg_err<-sum((test_boston$medv-pred_reg)^2)
reg_err

## [1] 918.5927

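###Bagging Ensemble Model
#ipred's bagging() fits bagged regression trees, so this compares a single linear model with an ensemble of trees
library(ipred)
bagg_model<- bagging(medv ~ ., data=train_boston , nbagg=30)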

###Accuracy testing on holdout data
pred_bagg<-predict(bagg_model, newdata=test_boston[,-14])
bgg_err<-sum((test_boston$medv-pred_bagg)^2)
bgg_err

## [1] 390.9028

###Overall Improvement
reg_err

## [1] 918.5927

bgg_err

## [1] 390.9028

(reg_err-bgg_err)/reg_err

## [1] 0.5744547

Random Forest

Just as many trees form a forest, many decision tree models together form a Random Forest model
Random forest is a specific case of the bagging methodology: bagging applied to decision trees, with extra randomness in the features
In random forest we induce two types of randomness
- Firstly, we take bootstrap samples of the population and build a decision tree on each sample.
- Secondly, while building the individual trees on the bootstrap samples, we consider only a random subset of the features (the standard algorithm draws a fresh random subset at every split)
Random forests are very stable; they are often as good as SVMs and sometimes better

Random Forest Algorithm

Start with the training dataset D with t features
Draw k bootstrap sample sets from dataset D
For each bootstrap sample i
- Build a decision tree model (M_i) using only p features (where p << t)
- Each tree has maximal strength: it is fully grown and not pruned.
We will have a total of k decision trees (M_1, M_2, …, M_k); each of these trees is built on relatively different training data and a different set of features
Take a vote for the final classifier output, and take the average for regression output
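The steps above can be sketched by hand with rpart trees. This is an illustration under stated assumptions, not how the randomForest package is implemented: it uses the built-in iris data as D, picks one random feature subset per tree as described in the steps above (the standard algorithm re-draws the subset at every split), and the values k = 50 and p = 2 are arbitrary.

```r
library(rpart)

set.seed(7)
D <- iris                                   # stand-in dataset with t = 4 features
features <- setdiff(names(D), "Species")
k <- 50                                     # number of trees (assumed)
p <- 2                                      # features used per tree (assumed, p < t)

trees <- lapply(seq_len(k), function(i) {
  boot  <- D[sample(nrow(D), replace = TRUE), ]   # randomness in data
  feats <- sample(features, p)                    # randomness in features
  rpart(Species ~ ., data = boot[, c(feats, "Species")], method = "class",
        control = rpart.control(minsplit = 2, cp = 0))  # fully grown, not pruned
})

# Every tree votes on every record; the most frequent class wins
votes <- sapply(trees, function(tr) {
  as.character(predict(tr, newdata = D, type = "class"))
})
forest_pred <- apply(votes, 1, function(v) names(which.max(table(v))))

# Share of records classified correctly (on the data used for fitting)
mean(forest_pred == D$Species)
```

Because the accuracy here is measured on the same data the trees were grown on, it is optimistic; a holdout set or out-of-bag records give a fairer estimate.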

The Random Factors in Random Forest

We need to note the most important aspect of random forest, i.e., inducing randomness into the bagging of trees. There are two major sources of randomness
- Randomness in data: bootstrapping makes sure that the data in any two samples is somewhat different
- Randomness in features: while building the decision trees on bootstrapped samples, we consider only a random subset of features.
Why induce the randomness?
- The major trick of ensemble models is the independence of the models.
- If we take the same data and build the same model 100 times, we will not see any improvement
- To make all our decision trees independent, we take independent sample sets and independent feature sets
- As a rule of thumb, we can take p as the square root of the number of features if t is very large, else p = t/3
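The rule of thumb can be written as a small helper. The function name and the large flag are illustrative assumptions, since the text does not say how large t must be before the square-root rule applies. For comparison, the randomForest package defaults to the square root of the number of features for classification and one third of them for regression.

```r
# Rule-of-thumb number of features p for t available features
# (helper name and `large` flag are illustrative, not from any package)
features_per_tree <- function(t, large = TRUE) {
  if (large) floor(sqrt(t)) else max(floor(t / 3), 1)
}

features_per_tree(100)                # square-root rule for a large t
features_per_tree(22, large = FALSE)  # t/3 rule for a smaller t
```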

Why Random Forest Works

For training data with 20 features, we build 100 decision trees with 5 features each instead of a single great decision tree. The individual trees may be weak classifiers.
It's like building weak classifiers on subsets of the data. Grouping a large set of random trees generally produces accurate models.

In this example we have three simple classifiers.
m1 classifies anything above the line as +1 and below as -1, m2 classifies all the points above the line as -1 and below as +1, and m3 classifies everything on the left as -1 and on the right as +1
Each of these models has a fair amount of misclassification error.
All three weak models together make a strong model.

LAB: Random Forest

Dataset: /Car Accidents IOT/Train.csv
The goal is to predict the fatality of an accident
Build a decision tree model on the training data.
On the test data, calculate the classification error and accuracy.
Build a random forest model on the training data.
On the test data, calculate the classification error and accuracy.
What is the improvement of the random forest model when compared with the single tree?

Solution

#Data Import
train<- read.csv("C:/Amrita/Datavedi/Car Accidents IOT/Train.csv")
test<- read.csv("C:/Amrita/Datavedi/Car Accidents IOT/Test.csv")

dim(train)

## [1] 15109    23

head(train)

##   Fatal      S1       S2       S3  S4       S5 S6 S7 S8 S9      S10 S11
## 1     1 36.2247 10.77330 0.243897 596 100.6710  0  0  1 28 0.016064 313
## 2     1 35.7343 17.45510 0.243897 600 100.0000  0  0  1 14 0.015812 319
## 3     1 31.6561  7.61366 0.308763 604  99.3377  0  0  1  4 0.015560 323
## 4     1 33.8320 13.11190 0.293195 616  97.4026  0  0  1  8 0.016001 320
## 5     1 42.5138 13.99850 0.259465 632  94.9367  0  0  1  8 0.016064 322
## 6     1 36.1261 14.85930 0.278925 600 100.0000  0  0  1  4 0.015749 314
##   S12 S13 S14 S15   S16  S17     S18 S19  S20 S21     S22
## 1   1   1  57   0 0.280  240 5.99375   0  0.0   4 14.9382
## 2   1   1  57   0 0.175  240 5.99375   0  0.0   4 14.8827
## 3   1   1  58   0 0.280  240 5.99375   0  0.0   4 14.6005
## 4   1   1  58   0 0.385  240 4.50625   0 13.0   4 14.6782
## 5   1   1  57   0 0.070  240 5.99375   0 19.5   4 15.3461
## 6   1   1  58   0 0.175 1008 4.50625   0 23.9   4 15.0559

###Decision Tree
library(rpart)
crash_model_ds<-rpart(Fatal ~ ., method="class", control=rpart.control(minsplit=30, cp=0.03),   data=train)

#Training accuracy
predicted_y<-predict(crash_model_ds, type="class")
table(predicted_y)

## predicted_y
##    0    1 
## 5745 9364

confusionMatrix(predicted_y, as.factor(train$Fatal))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4735 1010
##          1 1581 7783
##                                           
##                Accuracy : 0.8285          
##                  95% CI : (0.8224, 0.8345)
##     No Information Rate : 0.582           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.643           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7497          
##             Specificity : 0.8851          
##          Pos Pred Value : 0.8242          
##          Neg Pred Value : 0.8312          
##              Prevalence : 0.4180          
##          Detection Rate : 0.3134          
##    Detection Prevalence : 0.3802          
##       Balanced Accuracy : 0.8174          
##                                           
##        'Positive' Class : 0               
##

#Accuracy on test data
predicted_test_ds<-predict(crash_model_ds, test, type="class")
confusionMatrix(predicted_test_ds, as.factor(test$Fatal))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2897  561
##          1  995 4612
##                                           
##                Accuracy : 0.8284          
##                  95% CI : (0.8204, 0.8361)
##     No Information Rate : 0.5707          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6448          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7443          
##             Specificity : 0.8916          
##          Pos Pred Value : 0.8378          
##          Neg Pred Value : 0.8225          
##              Prevalence : 0.4293          
##          Detection Rate : 0.3196          
##    Detection Prevalence : 0.3815          
##       Balanced Accuracy : 0.8179          
##                                           
##        'Positive' Class : 0               
##

###Random Forest
library(randomForest)

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

#Refer to the column by name so the formula and the data argument stay consistent
rf_model <- randomForest(as.factor(Fatal) ~ ., ntree=200, mtry=floor(ncol(train)/3), data=train)

#Training accuracy (predict() without newdata returns the out-of-bag predictions)
predicted_y<-predict(rf_model)
table(predicted_y)

## predicted_y
##    0    1 
## 5921 9188

confusionMatrix(predicted_y, as.factor(train$Fatal))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5600  321
##          1  716 8472
##                                           
##                Accuracy : 0.9314          
##                  95% CI : (0.9272, 0.9353)
##     No Information Rate : 0.582           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8577          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8866          
##             Specificity : 0.9635          
##          Pos Pred Value : 0.9458          
##          Neg Pred Value : 0.9221          
##              Prevalence : 0.4180          
##          Detection Rate : 0.3706          
##    Detection Prevalence : 0.3919          
##       Balanced Accuracy : 0.9251          
##                                           
##        'Positive' Class : 0               
##

#Accuracy on test data
predicted_test_rf<-predict(rf_model,test, type="class")
confusionMatrix(predicted_test_rf, as.factor(test$Fatal))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3479  192
##          1  413 4981
##                                           
##                Accuracy : 0.9333          
##                  95% CI : (0.9279, 0.9383)
##     No Information Rate : 0.5707          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8628          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8939          
##             Specificity : 0.9629          
##          Pos Pred Value : 0.9477          
##          Neg Pred Value : 0.9234          
##              Prevalence : 0.4293          
##          Detection Rate : 0.3838          
##    Detection Prevalence : 0.4050          
##       Balanced Accuracy : 0.9284          
##                                           
##        'Positive' Class : 0               
##

Boosting

Boosting is another well-known ensemble method
Boosting uses a slightly different technique from bagging.
Boosting is a well-established technique that works really well on many machine learning problems, such as speech recognition
If bagging is the wisdom of crowds, then boosting is the wisdom of crowds where each individual is given a weight based on their expertise
Boosting in general decreases the bias error and builds strong predictive models.
Boosting is an iterative technique. We adjust the weight of each observation based on the previous classification.
If an observation was classified incorrectly, boosting increases its weight, and vice versa.

Boosting Main idea

Final Classifier \(C = \sum_i \alpha_i c_i\)

How weighted samples are taken

Boosting Illustration

Below is the training data with the class of each record. Note the record numbers; they will help us in weighted sampling later

Theory behind Boosting Algorithm

Take the dataset and give every record an equal starting weight
Build a classifier \(C_m\) on the weighted data
Calculate the error rate of the classifier
- \(\epsilon_m = \frac{\sum_i w_i I(y_i \neq C_m(x_i))}{\sum_i w_i}\) = sum of the weights of the misclassified records / sum of all sample weights
Calculate an intermediate factor called \(\alpha_m\). It is analogous to the accuracy of the model, is derived from the error rate, and is used later in the weight update
- \(\alpha_m = \log\left(\frac{1 - \epsilon_m}{\epsilon_m}\right)\)
Update the weight of each record in the sample using the \(\alpha_m\) factor. The indicator function makes sure that only the misclassified records are given more weight
- For i = 1, 2, …, N
  - \(w_i \leftarrow w_i e^{\alpha_m I(y_i \neq C_m(x_i))}\)
  - Renormalize so that the sum of the weights is 1
Repeat this model building and weight update process until there are no misclassifications (or until a chosen maximum number of rounds is reached)
Final collation is done by voting across all the models. While taking the votes, each model is weighted by its accuracy factor \(\alpha\)
- \(C(x) = \mathrm{sign}\left(\sum_i \alpha_i C_i(x)\right)\)
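The steps above can be followed in a small base R sketch. It is an illustration only, under several assumptions: the base classifier is a one-feature decision stump, the toy data is made up, and all function names (fit_stump, predict_stump, predict_boosted, adaboost) are hypothetical helpers rather than functions from a package.

```r
# Weighted decision stump on one numeric feature:
# predicts `dir` when x > thr and -dir otherwise
fit_stump <- function(x, y, w) {
  best <- list(err = Inf)
  for (thr in sort(unique(x))) {
    for (dir in c(1, -1)) {
      pred <- ifelse(x > thr, dir, -dir)
      err <- sum(w * (pred != y)) / sum(w)     # weighted error rate epsilon_m
      if (err < best$err) best <- list(thr = thr, dir = dir, err = err)
    }
  }
  best
}

predict_stump <- function(s, x) ifelse(x > s$thr, s$dir, -s$dir)

# Final classifier: sign of the alpha-weighted vote of all stumps
predict_boosted <- function(model, x) {
  score <- numeric(length(x))
  for (m in seq_along(model$alphas)) {
    score <- score + model$alphas[m] * predict_stump(model$stumps[[m]], x)
  }
  sign(score)
}

adaboost <- function(x, y, max_rounds = 10) {
  w <- rep(1 / length(y), length(y))           # equal starting weights
  model <- list(stumps = list(), alphas = numeric(0))
  for (m in seq_len(max_rounds)) {
    s <- fit_stump(x, y, w)
    eps <- min(max(s$err, 1e-10), 1 - 1e-10)   # keep the log finite
    alpha <- log((1 - eps) / eps)              # accuracy factor alpha_m
    miss <- predict_stump(s, x) != y
    w <- w * exp(alpha * miss)                 # raise weights of misclassified records
    w <- w / sum(w)                            # renormalize so the weights sum to 1
    model$stumps[[m]] <- s
    model$alphas[m] <- alpha
    if (all(predict_boosted(model, x) == y)) break  # no misclassification left
  }
  model
}

# Toy data that no single threshold can separate (made up for illustration)
x <- 0:9
y <- c(1, 1, 1, -1, -1, -1, 1, 1, 1, -1)

model <- adaboost(x, y)
length(model$alphas)          # number of rounds that were needed
predict_boosted(model, x)     # combined prediction for every record
```

No single stump can get all ten records right, but the weighted vote of a few stumps can, which is the point of the weight update.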

Gradient Boosting

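AdaBoost
- Short for Adaptive Boosting
- The technique discussed so far is AdaBoost. It gives a higher weight to misclassified records.
Gradient Boosting
- Similar in spirit to the AdaBoost algorithm: models are built one after another, each correcting the errors of the previous ones.
- The difference is in how the errors are passed on. Instead of re-weighting the records, each new model is fitted to the negative gradient of a loss function evaluated at the current predictions.
- For squared-error loss, the negative gradient is simply the residual, so each new model is fitted to the residuals left by the models before it.
- Gradient boosting works well for several classes of problems, including regression.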
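For regression with squared-error loss, gradient boosting amounts to repeatedly fitting a small tree to the current residuals. The sketch below shows that loop under stated assumptions: synthetic data, rpart trees of depth 2 as the base learner, and an arbitrary learning rate and number of rounds. It is not how xgboost is implemented internally.

```r
library(rpart)

set.seed(3)
n <- 200
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.2)     # synthetic regression data (assumed)
d <- data.frame(x = x, y = y)

n_rounds <- 50                       # number of boosting rounds (assumed)
learn_rate <- 0.1                    # shrinkage applied to every tree (assumed)
pred <- rep(mean(y), n)              # start from a constant prediction
mse <- numeric(n_rounds)

for (m in seq_len(n_rounds)) {
  # Residuals are the negative gradient of squared-error loss
  d$resid <- y - pred
  tree <- rpart(resid ~ x, data = d,
                control = rpart.control(maxdepth = 2, cp = 0))
  # Move the prediction a small step in the direction the tree suggests
  pred <- pred + learn_rate * predict(tree, newdata = d)
  mse[m] <- mean((y - pred)^2)
}

c(first_round = mse[1], last_round = mse[n_rounds])
```

The training error shrinks round after round; in practice the number of rounds and the learning rate are chosen on holdout data to avoid overfitting.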

LAB: Boosting

The task is to correctly categorize items based on their detailed feature specifications. More than 100 specifications have been collected.
Data: Ecom_Products_Menu/train.csv
Build a decision tree model and check the training and testing accuracy
Build a boosted decision tree.
Is there any improvement over the earlier decision tree?

Solution

train <- read.csv("C:/Amrita/Datavedi/Ecom_Products_Menu/train.csv")
test <- read.csv("C:/Amrita/Datavedi/Ecom_Products_Menu/test.csv")

dim(train)

## [1] 50122   102

##Decision Tree
library(rpart)
ecom_products_ds<-rpart(Category ~ ., method="class", control=rpart.control(minsplit=30, cp=0.01),  data=train[,-1])
library(rattle)

## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.

fancyRpartPlot(ecom_products_ds)

#Training accuracy
library(caret)
predicted_y<-predict(ecom_products_ds, type="class")
table(predicted_y)

## predicted_y
##   Accessories    Appliances        Camara          Ipod       Laptops 
##             0         10899          2733          2442             0 
##       Mobiles Personal_Care       Tablets            TV 
##             0         10288         23760             0

confusionMatrix(predicted_y, as.factor(train$Category))

## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      Accessories Appliances Camara  Ipod Laptops Mobiles
##   Accessories             0          0      0     0       0       0
##   Appliances            825       5536   1086   130     506     709
##   Camara                 88        387   1456     4      55     388
##   Ipod                   30         17     23  2032     144       5
##   Laptops                 0          0      0     0       0       0
##   Mobiles                 0          0      0     0       0       0
##   Personal_Care         110        308    152     0      18      79
##   Tablets              1288        615   1247    51    5743     377
##   TV                      0          0      0     0       0       0
##                Reference
## Prediction      Personal_Care Tablets    TV
##   Accessories               0       0     0
##   Appliances             1035     932   140
##   Camara                  252      84    19
##   Ipod                     13     159    19
##   Laptops                   0       0     0
##   Mobiles                   0       0     0
##   Personal_Care          9545      19    57
##   Tablets                 607   11885  1947
##   TV                        0       0     0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6076          
##                  95% CI : (0.6033, 0.6119)
##     No Information Rate : 0.2609          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5053          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Accessories Class: Appliances Class: Camara
## Sensitivity                     0.00000            0.8066       0.36731
## Specificity                     1.00000            0.8760       0.97233
## Pos Pred Value                      NaN            0.5079       0.53275
## Neg Pred Value                  0.95329            0.9662       0.94708
## Prevalence                      0.04671            0.1369       0.07909
## Detection Rate                  0.00000            0.1105       0.02905
## Detection Prevalence            0.00000            0.2174       0.05453
## Balanced Accuracy               0.50000            0.8413       0.66982
##                      Class: Ipod Class: Laptops Class: Mobiles
## Sensitivity              0.91655          0.000        0.00000
## Specificity              0.99144          1.000        1.00000
## Pos Pred Value           0.83210            NaN            NaN
## Neg Pred Value           0.99612          0.871        0.96892
## Prevalence               0.04423          0.129        0.03108
## Detection Rate           0.04054          0.000        0.00000
## Detection Prevalence     0.04872          0.000        0.00000
## Balanced Accuracy        0.95400          0.500        0.50000
##                      Class: Personal_Care Class: Tablets Class: TV
## Sensitivity                        0.8335         0.9087   0.00000
## Specificity                        0.9808         0.6794   1.00000
## Pos Pred Value                     0.9278         0.5002       NaN
## Neg Pred Value                     0.9521         0.9547   0.95647
## Prevalence                         0.2285         0.2609   0.04353
## Detection Rate                     0.1904         0.2371   0.00000
## Detection Prevalence               0.2053         0.4740   0.00000
## Balanced Accuracy                  0.9071         0.7941   0.50000

#Accuracy on test data
predicted_test_ds<-predict(ecom_products_ds, test[,-1], type="class")
confusionMatrix(predicted_test_ds, as.factor(test$Category))

## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      Accessories Appliances Camara Ipod Laptops Mobiles
##   Accessories             0          0      0    0       0       0
##   Appliances            172       1308    269   40      92     170
##   Camara                 15         80    383    1      16      95
##   Ipod                   14          4      3  469      28       0
##   Laptops                 0          0      0    0       0       0
##   Mobiles                 0          0      0    0       0       0
##   Personal_Care          23         75     42    0       1      23
##   Tablets               274        134    294   12    1401      83
##   TV                      0          0      0    0       0       0
##                Reference
## Prediction      Personal_Care Tablets   TV
##   Accessories               0       0    0
##   Appliances              234     210   42
##   Camara                   52      23    3
##   Ipod                      3      49    5
##   Laptops                   0       0    0
##   Mobiles                   0       0    0
##   Personal_Care          2242      10   17
##   Tablets                 152    2751  442
##   TV                        0       0    0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6085          
##                  95% CI : (0.5996, 0.6173)
##     No Information Rate : 0.2588          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5071          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Accessories Class: Appliances Class: Camara
## Sensitivity                     0.00000            0.8170       0.38648
## Specificity                     1.00000            0.8790       0.97353
## Pos Pred Value                      NaN            0.5156       0.57335
## Neg Pred Value                  0.95764            0.9682       0.94517
## Prevalence                      0.04236            0.1362       0.08430
## Detection Rate                  0.00000            0.1113       0.03258
## Detection Prevalence            0.00000            0.2158       0.05682
## Balanced Accuracy               0.50000            0.8480       0.68000
##                      Class: Ipod Class: Laptops Class: Mobiles
## Sensitivity              0.89847         0.0000        0.00000
## Specificity              0.99056         1.0000        1.00000
## Pos Pred Value           0.81565            NaN            NaN
## Neg Pred Value           0.99526         0.8692        0.96844
## Prevalence               0.04440         0.1308        0.03156
## Detection Rate           0.03989         0.0000        0.00000
## Detection Prevalence     0.04891         0.0000        0.00000
## Balanced Accuracy        0.94452         0.5000        0.50000
##                      Class: Personal_Care Class: Tablets Class: TV
## Sensitivity                        0.8356         0.9040    0.0000
## Specificity                        0.9789         0.6796    1.0000
## Pos Pred Value                     0.9215         0.4963       NaN
## Neg Pred Value                     0.9527         0.9530    0.9567
## Prevalence                         0.2282         0.2588    0.0433
## Detection Rate                     0.1907         0.2340    0.0000
## Detection Prevalence               0.2070         0.4715    0.0000
## Balanced Accuracy                  0.9073         0.7918    0.5000

###Boosting

library(xgboost)
library(methods)
library(data.table)
library(magrittr)

# converting datasets to Numeric format. xgboost needs at least one numeric column 
train[,c(-1,-102)] <- lapply( train[,c(-1,-102)], as.numeric)
test[,c(-1,-102)] <- lapply( test[,c(-1,-102)], as.numeric)

# converting datasets to Matrix format. Data frame is not supported by xgboost
trainMatrix <- train[,c(-1,-102)] %>% as.matrix
testMatrix <- test[,c(-1,-102)] %>% as.matrix

#The label should be in numeric format and it should start from 0
#as.factor() keeps this working when read.csv returns Category as character
y<-as.integer(as.factor(train$Category))-1
table(y,train$Category)

##    
## y   Accessories Appliances Camara  Ipod Laptops Mobiles Personal_Care
##   0        2341          0      0     0       0       0             0
##   1           0       6863      0     0       0       0             0
##   2           0          0   3964     0       0       0             0
##   3           0          0      0  2217       0       0             0
##   4           0          0      0     0    6466       0             0
##   5           0          0      0     0       0    1558             0
##   6           0          0      0     0       0       0         11452
##   7           0          0      0     0       0       0             0
##   8           0          0      0     0       0       0             0
##    
## y   Tablets    TV
##   0       0     0
##   1       0     0
##   2       0     0
##   3       0     0
##   4       0     0
##   5       0     0
##   6       0     0
##   7   13079     0
##   8       0  2182

test_y<-as.integer(as.factor(test$Category))-1
table(test_y,test$Category)

##       
## test_y Accessories Appliances Camara Ipod Laptops Mobiles Personal_Care
##      0         498          0      0    0       0       0             0
##      1           0       1601      0    0       0       0             0
##      2           0          0    991    0       0       0             0
##      3           0          0      0  522       0       0             0
##      4           0          0      0    0    1538       0             0
##      5           0          0      0    0       0     371             0
##      6           0          0      0    0       0       0          2683
##      7           0          0      0    0       0       0             0
##      8           0          0      0    0       0       0             0
##       
## test_y Tablets   TV
##      0       0    0
##      1       0    0
##      2       0    0
##      3       0    0
##      4       0    0
##      5       0    0
##      6       0    0
##      7    3043    0
##      8       0  509

#Setting the parameters for multiclass classification
param <- list("objective" = "multi:softprob", "eval_metric" = "merror", "num_class" = 9)
#"multi:softprob" -- multiclass classification with the softmax objective, returning a probability for each class; num_class (number of classes) must also be set
#"merror": Multiclass classification error rate. It is calculated as #(wrong cases)/#(all cases).

XGBModel <- xgboost(params=param, data = trainMatrix, label = y, nrounds=40)

## [0]  train-merror:0.269223
## [1]  train-merror:0.241750
## [2]  train-merror:0.229500
## [3]  train-merror:0.222776
## [4]  train-merror:0.218966
## [5]  train-merror:0.211923
## [6]  train-merror:0.208312
## [7]  train-merror:0.203703
## [8]  train-merror:0.199553
## [9]  train-merror:0.196481
## [10] train-merror:0.192969
## [11] train-merror:0.190695
## [12] train-merror:0.188241
## [13] train-merror:0.185487
## [14] train-merror:0.183193
## [15] train-merror:0.180400
## [16] train-merror:0.177886
## [17] train-merror:0.175552
## [18] train-merror:0.173217
## [19] train-merror:0.171362
## [20] train-merror:0.168968
## [21] train-merror:0.166474
## [22] train-merror:0.164379
## [23] train-merror:0.162743
## [24] train-merror:0.161925
## [25] train-merror:0.160389
## [26] train-merror:0.158214
## [27] train-merror:0.156478
## [28] train-merror:0.155521
## [29] train-merror:0.154284
## [30] train-merror:0.152628
## [31] train-merror:0.151271
## [32] train-merror:0.149356
## [33] train-merror:0.147879
## [34] train-merror:0.146283
## [35] train-merror:0.144827
## [36] train-merror:0.143749
## [37] train-merror:0.142053
## [38] train-merror:0.140358
## [39] train-merror:0.139240

#Training accuracy
predicted_y<-predict(XGBModel, trainMatrix)
probs <- data.frame(matrix(predicted_y, nrow=nrow(train), ncol=9,  byrow = TRUE))

#Pick the class (0 to 8) with the highest probability in each row; which.max returns a single index even when there are ties
probs_final<-as.data.frame(cbind(row.names(probs),apply(probs,1, function(x) c(0:8)[which.max(x)])))
table(probs_final$V2)

## 
##     0     1     2     3     4     5     6     7     8 
##  2140  6969  3997  2227  5142  1242 11418 15605  1382
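The reshaping step above is easy to get wrong, so here is a minimal sketch on made-up numbers. It assumes, as the lab code does, that the softprob predictions come back as one flat vector stored row by row, with num_class values per row. The names flat, probs_toy and truth_toy are hypothetical and only for illustration.

```r
# Hypothetical flat softprob output for 2 rows and 3 classes (row by row)
num_class <- 3
flat <- c(0.7, 0.2, 0.1,   # row 1: class 0 is most likely
          0.1, 0.3, 0.6)   # row 2: class 2 is most likely

# byrow = TRUE puts each observation's probabilities on one row
probs_toy <- matrix(flat, ncol = num_class, byrow = TRUE)

# max.col gives the 1-based column of the row maximum; subtract 1 for 0-based labels
pred_class <- max.col(probs_toy, ties.method = "first") - 1

# merror = #(wrong cases) / #(all cases), using made-up true labels
truth_toy <- c(0, 1)
merror <- mean(pred_class != truth_toy)

print(pred_class)
print(merror)
```

With byrow = FALSE the same vector would be filled column by column and the probabilities of different rows would get mixed up, which is why the lab code sets byrow = TRUE.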

confusionMatrix(factor(probs_final$V2, levels = 0:8), factor(y, levels = 0:8))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1     2     3     4     5     6     7     8
##          0  1820    32    13     1    74    26    94    58    22
##          1    73  6495   123     1    13   129   119    12     4
##          2    13    78  3584     2     4   204   103     9     0
##          3     8     5     3  2192     0     1     0     8    10
##          4    84    20     4     3  3830     5    12   970   214
##          5    28    55    60     2     2  1051    37     7     0
##          6    81   105    93     1     5    95 10987    20    31
##          7   216    73    82    15  2500    46    92 11932   649
##          8    18     0     2     0    38     1     8    63  1252
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8608          
##                  95% CI : (0.8577, 0.8638)
##     No Information Rate : 0.2609          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8306          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.77745   0.9464  0.90414  0.98872  0.59233  0.67458
## Specificity           0.99330   0.9890  0.99105  0.99927  0.96995  0.99607
## Pos Pred Value        0.85047   0.9320  0.89667  0.98428  0.74485  0.84622
## Neg Pred Value        0.98914   0.9915  0.99176  0.99948  0.94140  0.98963
## Prevalence            0.04671   0.1369  0.07909  0.04423  0.12901  0.03108
## Detection Rate        0.03631   0.1296  0.07151  0.04373  0.07641  0.02097
## Detection Prevalence  0.04270   0.1390  0.07975  0.04443  0.10259  0.02478
## Balanced Accuracy     0.88537   0.9677  0.94759  0.99400  0.78114  0.83532
##                      Class: 6 Class: 7 Class: 8
## Sensitivity            0.9594   0.9123  0.57379
## Specificity            0.9889   0.9008  0.99729
## Pos Pred Value         0.9623   0.7646  0.90593
## Neg Pred Value         0.9880   0.9668  0.98092
## Prevalence             0.2285   0.2609  0.04353
## Detection Rate         0.2192   0.2381  0.02498
## Detection Prevalence   0.2278   0.3113  0.02757
## Balanced Accuracy      0.9741   0.9066  0.78554

#Accuracy on Test data

predicted_test_boost<-predict(XGBModel, testMatrix)
probs_test <- data.frame(matrix(predicted_test_boost, nrow=nrow(test), ncol=9,  byrow = TRUE))

#Pick the class (0 to 8) with the highest probability in each row
probs_final_test<-as.data.frame(cbind(row.names(probs_test),apply(probs_test,1, function(x) c(0:8)[which.max(x)])))
table(probs_final_test$V2)

## 
##    0    1    2    3    4    5    6    7    8 
##  446 1654 1037  514 1202  231 2699 3707  266

confusionMatrix(factor(probs_final_test$V2, levels = 0:8), factor(test_y, levels = 0:8))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2    3    4    5    6    7    8
##          0  327   15    2    1   26    8   38   22    7
##          1   27 1476   34    0    4   66   37    9    1
##          2    1   29  881    0    4   78   34   10    0
##          3    1    1    1  502    0    1    2    4    2
##          4   29    6    2    1  743    4    2  344   71
##          5   11   21   22    0    0  163   13    0    1
##          6   38   35   32    1    2   35 2526    9   21
##          7   58   18   17   15  733   16   26 2620  204
##          8    6    0    0    2   26    0    5   25  202
## 
## Overall Statistics
##                                           
##                Accuracy : 0.803           
##                  95% CI : (0.7957, 0.8102)
##     No Information Rate : 0.2588          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.76            
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.65663   0.9219  0.88900  0.96169   0.4831  0.43935
## Specificity           0.98943   0.9825  0.98551  0.99893   0.9551  0.99403
## Pos Pred Value        0.73318   0.8924  0.84957  0.97665   0.6181  0.70563
## Neg Pred Value        0.98488   0.9876  0.98974  0.99822   0.9247  0.98195
## Prevalence            0.04236   0.1362  0.08430  0.04440   0.1308  0.03156
## Detection Rate        0.02782   0.1256  0.07494  0.04270   0.0632  0.01387
## Detection Prevalence  0.03794   0.1407  0.08821  0.04372   0.1022  0.01965
## Balanced Accuracy     0.82303   0.9522  0.93725  0.98031   0.7191  0.71669
##                      Class: 6 Class: 7 Class: 8
## Sensitivity            0.9415   0.8610  0.39686
## Specificity            0.9809   0.8752  0.99431
## Pos Pred Value         0.9359   0.7068  0.75940
## Neg Pred Value         0.9827   0.9474  0.97328
## Prevalence             0.2282   0.2588  0.04330
## Detection Rate         0.2149   0.2229  0.01718
## Detection Prevalence   0.2296   0.3153  0.02263
## Balanced Accuracy      0.9612   0.8681  0.69558

When Ensemble doesn’t work?

The models have to be independent. We can’t build the same model multiple times and expect the error to reduce.
We may have to bring in the independence by choosing subsets of the data, or subsets of the features, while building the individual models
An ensemble may backfire if we use dependent models that are already inaccurate. The final ensemble might turn out to be an even worse model.
Yes, there is a small disclaimer in the “Wisdom of Crowds” theory: we need good, independent individuals. If we combine dependent individuals with poor knowledge, we might end up with an even worse ensemble.
For example, suppose we built three models where model-1 and model-2 are bad and model-3 is good. Under majority voting, whenever model-1 and model-2 agree, the ensemble simply returns their answer and the good model is outvoted
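The points above can be checked with a small simulation. This is only a sketch on synthetic data: noisy_model and majority_vote are hypothetical helpers, and the accuracy levels (70%, 55%, 90%) are arbitrary choices made for illustration.

```r
set.seed(42)
n <- 10000
truth <- rbinom(n, 1, 0.5)   # synthetic 0/1 target

# Hypothetical helper: a model that is right with probability `acc`, independently per row
noisy_model <- function(truth, acc) {
  correct <- rbinom(length(truth), 1, acc) == 1
  ifelse(correct, truth, 1 - truth)
}

# Hypothetical helper: majority vote over 0/1 prediction vectors
majority_vote <- function(...) {
  votes <- cbind(...)
  as.integer(rowSums(votes) > ncol(votes) / 2)
}

# Case 1: three independent models, each about 70% accurate
m1 <- noisy_model(truth, 0.7)
m2 <- noisy_model(truth, 0.7)
m3 <- noisy_model(truth, 0.7)
acc_single      <- mean(m1 == truth)
acc_independent <- mean(majority_vote(m1, m2, m3) == truth)

# Case 2: the same model repeated three times (fully dependent)
acc_dependent <- mean(majority_vote(m1, m1, m1) == truth)

# Case 3: two identical bad models outvote one good model
bad  <- noisy_model(truth, 0.55)
good <- noisy_model(truth, 0.90)
acc_outvoted <- mean(majority_vote(bad, bad, good) == truth)

print(c(single = acc_single, independent = acc_independent,
        dependent = acc_dependent, outvoted = acc_outvoted))
```

In this setup the independent ensemble should beat a single model, the fully dependent ensemble is exactly as good as one copy of the model, and the outvoted ensemble is exactly as bad as the bad model, no matter how good the third model is.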

LAB: When Ensemble doesn’t work?

When the individual models or samples are dependent

#Data Import
train<- read.csv("C:/Amrita/Datavedi/Car Accidents IOT/Train.csv")
test<- read.csv("C:/Amrita/Datavedi/Car Accidents IOT/Test.csv")

####Logistic Regression
crash_model_logistic <- glm(Fatal ~ . , data=train, family = binomial())

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(crash_model_logistic)

## 
## Call:
## glm(formula = Fatal ~ ., family = binomial(), data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -8.4904  -0.8571   0.3656   0.8242   3.1945  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  8.954e-01  5.412e-01   1.654 0.098067 .  
## S1          -1.045e-02  2.860e-03  -3.653 0.000259 ***
## S2          -3.740e-03  5.454e-03  -0.686 0.492915    
## S3           2.638e-01  6.112e-02   4.316 1.59e-05 ***
## S4           1.605e-03  2.197e-04   7.304 2.80e-13 ***
## S5           3.161e-02  2.718e-03  11.631  < 2e-16 ***
## S6           3.748e-03  2.414e-03   1.553 0.120537    
## S7          -8.739e-04  2.476e-04  -3.530 0.000415 ***
## S8           1.684e-01  3.209e-02   5.247 1.54e-07 ***
## S9          -8.099e-04  7.008e-04  -1.156 0.247805    
## S10         -9.886e+01  9.210e+00 -10.734  < 2e-16 ***
## S11         -1.538e-02  8.875e-04 -17.334  < 2e-16 ***
## S12         -2.447e-01  2.161e-02 -11.324  < 2e-16 ***
## S13          3.227e+00  1.092e-01  29.549  < 2e-16 ***
## S14          7.233e-03  1.663e-03   4.350 1.36e-05 ***
## S15          6.571e-03  4.373e-03   1.503 0.132889    
## S16         -7.763e-02  5.666e-02  -1.370 0.170693    
## S17         -3.497e-04  6.861e-05  -5.097 3.46e-07 ***
## S18         -2.865e-04  4.433e-04  -0.646 0.518052    
## S19         -6.798e-02  6.262e-02  -1.086 0.277665    
## S20         -1.001e-02  2.043e-03  -4.902 9.49e-07 ***
## S21         -4.146e-01  2.398e-02 -17.291  < 2e-16 ***
## S22          1.678e-01  6.718e-03  24.981  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20538  on 15108  degrees of freedom
## Residual deviance: 14794  on 15086  degrees of freedom
## AIC: 14840
## 
## Number of Fisher Scoring iterations: 8

#Training accuracy
predicted_y<-round(predict(crash_model_logistic,type="response"),0)
confusionMatrix(factor(predicted_y, levels = c(0, 1)), factor(crash_model_logistic$y, levels = c(0, 1)))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4394 1300
##          1 1922 7493
##                                           
##                Accuracy : 0.7867          
##                  95% CI : (0.7801, 0.7933)
##     No Information Rate : 0.582           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5556          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.6957          
##             Specificity : 0.8522          
##          Pos Pred Value : 0.7717          
##          Neg Pred Value : 0.7959          
##              Prevalence : 0.4180          
##          Detection Rate : 0.2908          
##    Detection Prevalence : 0.3769          
##       Balanced Accuracy : 0.7739          
##                                           
##        'Positive' Class : 0               
##

#Accuracy on Test data
predicted_test_logistic<-round(predict(crash_model_logistic,test, type="response"),0)
confusionMatrix(factor(predicted_test_logistic, levels = c(0, 1)), factor(test$Fatal, levels = c(0, 1)))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2766  781
##          1 1126 4392
##                                          
##                Accuracy : 0.7896         
##                  95% CI : (0.7811, 0.798)
##     No Information Rate : 0.5707         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.5659         
##  Mcnemar's Test P-Value : 3.343e-15      
##                                          
##             Sensitivity : 0.7107         
##             Specificity : 0.8490         
##          Pos Pred Value : 0.7798         
##          Neg Pred Value : 0.7959         
##              Prevalence : 0.4293         
##          Detection Rate : 0.3051         
##    Detection Prevalence : 0.3913         
##       Balanced Accuracy : 0.7799         
##                                          
##        'Positive' Class : 0              
##
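To make the statistics above easier to read, here is a small sketch that recomputes three of them by hand from the four counts in the logistic regression test confusion matrix. The counts are copied from the output above; the object name cm is just for illustration. Note that the 'Positive' class is 0, so sensitivity refers to class 0.

```r
# Counts copied from the test confusion matrix above (rows = Prediction, columns = Reference)
cm <- matrix(c(2766, 1126,   # Reference 0: predicted 0, predicted 1
                781, 4392),  # Reference 1: predicted 0, predicted 1
             nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

# Accuracy = correctly classified cases / all cases
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity = share of actual class-0 cases predicted as 0 (positive class is 0)
sensitivity <- cm["0", "0"] / sum(cm[, "0"])

# Specificity = share of actual class-1 cases predicted as 1
specificity <- cm["1", "1"] / sum(cm[, "1"])

print(round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4))
```

The rounded values should agree with the Accuracy, Sensitivity and Specificity lines reported by confusionMatrix above.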

###Decision Tree

library(rpart)
crash_model_ds<-rpart(Fatal ~ ., method="class",   data=train)

#Training accuracy
predicted_y<-predict(crash_model_ds, type="class")
table(predicted_y)

## predicted_y
##    0    1 
## 5544 9565

confusionMatrix(predicted_y, factor(train$Fatal, levels = c(0, 1)))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4705  839
##          1 1611 7954
##                                           
##                Accuracy : 0.8378          
##                  95% CI : (0.8319, 0.8437)
##     No Information Rate : 0.582           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6609          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7449          
##             Specificity : 0.9046          
##          Pos Pred Value : 0.8487          
##          Neg Pred Value : 0.8316          
##              Prevalence : 0.4180          
##          Detection Rate : 0.3114          
##    Detection Prevalence : 0.3669          
##       Balanced Accuracy : 0.8248          
##                                           
##        'Positive' Class : 0               
##

#Accuracy on Test data
predicted_test_ds<-predict(crash_model_ds, test, type="class")
confusionMatrix(predicted_test_ds, factor(test$Fatal, levels = c(0, 1)))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2884  454
##          1 1008 4719
##                                          
##                Accuracy : 0.8387         
##                  95% CI : (0.831, 0.8462)
##     No Information Rate : 0.5707         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.665          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.7410         
##             Specificity : 0.9122         
##          Pos Pred Value : 0.8640         
##          Neg Pred Value : 0.8240         
##              Prevalence : 0.4293         
##          Detection Rate : 0.3181         
##    Detection Prevalence : 0.3682         
##       Balanced Accuracy : 0.8266         
##                                          
##        'Positive' Class : 0              
##

####SVM Model
library(e1071)
pc <- proc.time()
crash_model_svm <- svm(Fatal ~ . , type="C", data = train)
proc.time() - pc

##    user  system elapsed 
##   89.49    0.13   92.84

summary(crash_model_svm)

## 
## Call:
## svm(formula = Fatal ~ ., data = train, type = "C")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.04545455 
## 
## Number of Support Vectors:  6992
## 
##  ( 3582 3410 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

#Confusion matrix on training data
library(caret)
label_predicted<-predict(crash_model_svm, type = "class")
confusionMatrix(label_predicted, factor(train$Fatal, levels = c(0, 1)))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4811  538
##          1 1505 8255
##                                           
##                Accuracy : 0.8648          
##                  95% CI : (0.8592, 0.8702)
##     No Information Rate : 0.582           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.716           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7617          
##             Specificity : 0.9388          
##          Pos Pred Value : 0.8994          
##          Neg Pred Value : 0.8458          
##              Prevalence : 0.4180          
##          Detection Rate : 0.3184          
##    Detection Prevalence : 0.3540          
##       Balanced Accuracy : 0.8503          
##                                           
##        'Positive' Class : 0               
##

#Out of time validation with test data
predicted_test_svm<-predict(crash_model_svm, newdata =test[,-1] , type = "class")
confusionMatrix(predicted_test_svm, factor(test[,1], levels = c(0, 1)))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2933  399
##          1  959 4774
##                                           
##                Accuracy : 0.8502          
##                  95% CI : (0.8427, 0.8575)
##     No Information Rate : 0.5707          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6887          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7536          
##             Specificity : 0.9229          
##          Pos Pred Value : 0.8803          
##          Neg Pred Value : 0.8327          
##              Prevalence : 0.4293          
##          Detection Rate : 0.3236          
##    Detection Prevalence : 0.3676          
##       Balanced Accuracy : 0.8382          
##                                           
##        'Positive' Class : 0               
##

####Ensemble Model

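#The decision tree and SVM predictions are factors, so as.numeric() codes them as 1 & 2
#Shift the logistic predictions from 0/1 to 1/2 so that all three models use the same coding
predicted_test_logistic1<-predicted_test_logistic+1

Ens_predicted_data<-data.frame(lg=as.numeric(predicted_test_logistic1),ds=as.numeric(predicted_test_ds), svm=as.numeric(predicted_test_svm))

#Majority vote: each model contributes 1 (class 0) or 2 (class 1), so a sum of 5 or 6 means at least two models voted for class 1
Ens_predicted_data$final<-ifelse(Ens_predicted_data$lg+Ens_predicted_data$ds+Ens_predicted_data$svm<4.5,0,1)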
table(Ens_predicted_data$final)

## 
##    0    1 
## 3340 5725
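The voting rule above relies on the 1/2 coding of factors. As an alternative sketch, every prediction can first be converted back to 0/1 labels and then the votes counted directly. The vectors below are made-up toy predictions and to_01 is a hypothetical helper; the last lines check that both rules give the same answer on this toy data.

```r
# Hypothetical test predictions from three models (toy data, not the lab output)
pred_lg  <- c(0, 1, 1, 0, 1)                            # numeric 0/1, like the rounded logistic output
pred_ds  <- factor(c(0, 1, 0, 0, 1), levels = c(0, 1))  # factor, like predict(..., type = "class")
pred_svm <- factor(c(1, 1, 0, 0, 0), levels = c(0, 1))  # factor, like the SVM output

# Hypothetical helper: convert a numeric or factor prediction to integer 0/1 labels
to_01 <- function(p) as.integer(as.character(p))

votes <- cbind(lg = to_01(pred_lg), ds = to_01(pred_ds), svm = to_01(pred_svm))

# Class 1 wins when at least two of the three models vote for it
final <- as.integer(rowSums(votes) >= 2)

# The rule used in the lab: shift logistic by 1, use factor codes 1/2, threshold at 4.5
lab_rule <- ifelse((pred_lg + 1) + as.numeric(pred_ds) + as.numeric(pred_svm) < 4.5, 0, 1)

print(final)
print(all(final == lab_rule))
```

Counting votes on 0/1 labels avoids the manual +1 shift and makes the threshold easier to read when the number of models changes.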

#Ensemble model accuracy on test data
confusionMatrix(factor(Ens_predicted_data$final, levels = c(0, 1)), factor(test[,1], levels = c(0, 1)))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2878  462
##          1 1014 4711
##                                           
##                Accuracy : 0.8372          
##                  95% CI : (0.8294, 0.8447)
##     No Information Rate : 0.5707          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6618          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7395          
##             Specificity : 0.9107          
##          Pos Pred Value : 0.8617          
##          Neg Pred Value : 0.8229          
##              Prevalence : 0.4293          
##          Detection Rate : 0.3175          
##    Detection Prevalence : 0.3685          
##       Balanced Accuracy : 0.8251          
##                                           
##        'Positive' Class : 0               
##
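On the test data the ensemble accuracy (0.8372) is lower than that of the SVM alone (0.8502) and very close to the decision tree (0.8387). One way to look for dependence between the models is to measure how often each pair gives the same prediction. The sketch below uses a hypothetical helper, pairwise_agreement, on made-up toy predictions. In the lab it could be applied to the lg, ds and svm columns of Ens_predicted_data.

```r
# Hypothetical helper: share of rows on which each pair of models gives the same prediction
pairwise_agreement <- function(preds) {
  k <- ncol(preds)
  agree <- matrix(1, k, k, dimnames = list(colnames(preds), colnames(preds)))
  for (i in seq_len(k)) {
    for (j in seq_len(k)) {
      agree[i, j] <- mean(preds[, i] == preds[, j])
    }
  }
  agree
}

# Made-up toy predictions: ds and svm always agree, lg differs from them on the last row
toy <- data.frame(lg  = c(0, 1, 1, 0, 1, 1),
                  ds  = c(0, 1, 1, 0, 1, 0),
                  svm = c(0, 1, 1, 0, 1, 0))

agreement <- pairwise_agreement(toy)
print(round(agreement, 3))
```

When two models agree on almost every row, they make the same mistakes, and the third model can rarely change the outcome of the vote.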

Conclusion

Ensemble methods are among the most widely used methods these days. With modern machines, it’s not really a huge task to build multiple models.
Both bagging and boosting do a good job of reducing error: bagging mainly by reducing variance, boosting mainly by reducing bias
Random forests are relatively fast. Each tree sees only a bootstrap sample of the rows and a random subset of the features at each split, and the trees can be built independently, so it doesn’t put a lot of pressure on the computing machine
Random forests can also give the variable importance. We need to be careful with categorical features: random forests tend to give higher importance to variables with a higher number of levels.
In boosting algorithms we may have to restrict the number of iterations to avoid overfitting
Ensemble models are often the final effort of a data scientist while building the most suitable predictive model for the data
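The point about restricting the number of boosting iterations can be sketched with a simple early stopping rule: track the error on a validation set after every round and stop once it has not improved for a few rounds. The validation errors below are made-up numbers and early_stop_round is a hypothetical helper written only for this illustration. The xgboost package offers an early_stopping_rounds argument for the same purpose.

```r
# Hypothetical validation error after each boosting round (made-up numbers)
valid_error <- c(0.300, 0.260, 0.240, 0.230, 0.225, 0.223, 0.224, 0.226, 0.229, 0.233)

# Hypothetical helper: return the best round, stopping the search once
# `patience` rounds in a row have failed to improve on it
early_stop_round <- function(errors, patience = 3) {
  best_round <- 1
  for (i in seq_along(errors)) {
    if (errors[i] < errors[best_round]) {
      best_round <- i          # new best validation error
    } else if (i - best_round >= patience) {
      break                    # no improvement for `patience` rounds
    }
  }
  best_round
}

best <- early_stop_round(valid_error, patience = 3)
print(best)
```

On these numbers the validation error is lowest at round 6 and then starts rising, so the rule keeps 6 rounds even though the training error would typically keep falling with more rounds.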
