
Handout – Random Forest and Boosting

 

You can download the datasets and R code file for this session here.

Random Forests

 

Ensemble Models & Random Forests

Contents

  • Introduction
  • Ensemble Learning
  • How Ensemble Learning Works
  • Bagging
  • Building Models Using Bagging
  • Random Forest Algorithm
  • Random Forest Model Building
  • Boosting
  • Building Models Using Boosting
  • Conclusion

The Wisdom of Crowds

  • To understand the Ensemble Modelling technique, we need to understand a concept called The Wisdom Of Crowds.
  • Let us take an example to understand what exactly the Wisdom Of Crowds is.
  • Here is a problem statement which says “Estimate the monthly expenditure of a family in a city”.
  • There are numerous techniques to predict the average monthly expenditure of each family.
  • For instance, we can use simple descriptive statistics, or we can use multiple variables which together will finally predict the average monthly expenditure.
  • Let us take a scenario where different predictive models have come up.
  • On one hand, there is a renowned and eminent professor, an expert in predictive modelling and one of the world’s best data scientists, who has come up with a single predictive model.
  • And on the other hand, there is a group of a hundred assistant professors who have taken up this challenge independently, by collecting their own data or maybe by adopting their own modelling techniques, which may use a different set of variables altogether.
  • Note that, each one of them has come up with a unique predictive model to estimate the expenditure.
  • That is, 100 assistant professors, who do not know each other, have built 100 different models.
  • In case-1, we can see that the professor’s model predicted an estimated monthly expenditure of $6,500 (six thousand five hundred dollars), whereas in case-2, some of the assistant professors predicted $8,000, some predicted $6,000, some $7,500.
  • We can see that the average of the 100 predicted values, calculated from these different estimates, is $7,200.
  • Now the question arises….
  • One model says the estimated expenditure is $6,500, while the average of the 100 models says $7,200. Which estimate should we trust?
  • Individually, these 100 assistant professors might not be as good as the eminent professor.
  • However, they are good data scientists since they have built fairly good models from different perspectives. And thus it makes sense to choose the average of 100 predictions rather than to rely on a single model.
  • Here is the definition of wisdom of crowds.
  • Putting the example simply: instead of taking one best model or relying on one best expert, it is better to look at as many fairly good models as possible and take their average.
  • Here we believe that the averaged wisdom of the crowd is better than relying on one good model.
  • In other words, one should not expend energy trying to identify an expert within a group; instead, one should rely on the group’s collective wisdom. However, one has to ensure that the opinions are independent.
  • That is, these 100 assistant professors should not talk to each other, should not be depending on each other.
  • Otherwise they will not be building 100 different models.
  • They might not be carrying information from 100 different angles.
  • If they depend on each other, model-1 might be the same as model-2 or model-3, which defeats the purpose.
  • Thus, all these 100 professors have to be independent.
  • Also some kind of knowledge of the truth must reside with some group members.
  • Here, the group members are the assistant professors.
  • Each of them should be reasonably good from their own perspective and is expected to carry some true knowledge.
  • We cannot afford random guesses when building these models.
  • All of them are data scientists in their own right.
  • So instead of trying to build one great model, it’s always better to build some independent moderate models and take their average as the final prediction.
  • This is the concept of the wisdom of crowds.

What is Ensemble Learning

  • Ensemble Learning is completely based on the concept of the wisdom of crowds.
  • We build multiple models and then use the average or use all the models as the final model instead of building one single model.
  • Let us see what exactly Ensemble Learning is.
  • Imagine a binary classification problem with two classes, where we want to classify a data point as +1 (plus one) or -1 (minus one) at the end of building the algorithm.
  • Let us say we have built the best possible decision tree, which has 91% accuracy.
  • Let (x) be a new data point.
  • Now, we shall use this decision tree that will finally predict whether (x) falls under class +1 or class -1.
  • Let us suppose that, the decision tree has classified the new data as +1.
  • Now let us ask ourselves that, is there a way we can do better than 91% by using the same data?
  • Now the solution for this question is as follows:
  • Let us build three more models on the same data. And see whether we can improve performance considerably.
  • We have four models on the same dataset.
  • Each of them has a different accuracy, but unfortunately there seems to be no real improvement in the accuracy.
    • The first one is (Decision Tree Model), which has already been built.
    • The second one is (Logistic Regression Model).
    • The third one is (Neural Nets Model).
    • (SVM Model) is the fourth.
  • Each of them has a different accuracy.
  • We know that Decision Tree has an accuracy of 91% and error of 9%.
  • Logistic Regression has an accuracy of 90% and an error of 10%.
  • Neural Network has accuracy of 91% and an error of 9%.
  • And in the case of SVM, the accuracy level is 92% and error is 8%.
  • We can observe that there is not much of considerable improvement.
  • Earlier, the accuracy was 91% and error was 9%.
  • However, SVM does a slightly better job with accuracy of 92% and error of 8%.
  • This cannot be considered as a real improvement over what we had earlier.
  • So where exactly does Ensemble Learning come in?
  • Consider the prediction for the data point (x).
  • We know that the new data point (x) was predicted as +1 by the decision tree model in the initial attempt to build the model using decision tree.
  • That is, when we substituted new data point (x), it was classified as class +1.
  • When we use the logistic regression model, the new data point (x) is predicted as -1.
  • The new data point (x) is predicted as -1 by Neural Nets model.
  • The new data point (x) is predicted as -1 again by SVM model.
  • Only decision tree has predicted it as +1.
  • The combined voting model seems to have less error than each of the individual models.
  • A voting model means we take all 4 models together and the final output is the value with the highest number of votes.
  • Let us say three of them predict -1 and only one predicts +1; then we should believe that the point is -1.
  • Based on voting, we choose the final prediction as -1. By combining all four models and taking a vote, instead of choosing one model, the final prediction may well be more accurate than the initial single-model approach.
  • This is the actual philosophy of ensemble modelling.
  • Therefore, instead of building one model, we build several models and take a vote over their predictions.
  • Instead of building one decision tree, we built four models; each one of them might not be very strong on its own, but we combine all of them and take a vote or an average.
  • Then finally identify the predictor based on the average of the combination of all four models.
  • This is the actual philosophy of ensemble learning.

Ensemble Models

  • Ensemble technique is all about obtaining better predictions using multiple models on the same dataset instead of building one best model.
  • Because, it is not always possible to find the single best fit model for our data.
  • Ensemble model combines multiple models to come up with one consolidated model.
  • Ensemble models work on the principle which says “multiple models which are moderately accurate can give a highly accurate model”.
  • Understandably, building and evaluating the ensemble models is computationally expensive.
  • That is instead of building one model we are building multiple models.
  • The effort to build ensemble models is far more than that of a single model.
  • Building one really good model is the usual statistical approach.
  • Building multiple models and averaging the results is the philosophy of ensemble learning, which is nothing but the wisdom of crowds.
  • Thus instead of building one best model, let us build multiple models and combine them for prediction.

Why does the Ensemble technique work?

  • Imagine three independent and equivalent models:
    • M1 with an error rate of 10%.
    • M2 with an error rate of 10%.
    • M3 with an error rate of 10%.
  • The three models have to be independent, because we can’t build the same model three times and expect the error to reduce. Any changes to the modelling technique in model-1 should not impact model-2.
  • That is, model-1, model-2 and model-3 should not be doing the same work.
  • In this scenario, the worst ensemble model will have 10% error rate.
  • For finding the best ensemble model we need to combine these models and take the voting criteria.
  • The best ensemble model will have an error rate of 2.8%.
    • P(exactly two of three models wrong) + P(all three models wrong)
    • = (3C2)·(0.1)(0.1)(0.9) + (0.1)(0.1)(0.1)
    • = 0.027 + 0.001 = 0.028
    • = 2.8%
  • Thus we just took three moderately good models each with an error rate of 10% and by merely combining them we could get the error rate to 2.8%.
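Note – The 2.8% figure above can be verified with a one-line calculation in R (a quick check, not part of the original lab code):

# Probability that a majority vote of three independent models,
# each with a 10% error rate, is wrong (i.e. at least 2 of the 3 are wrong)
p <- 0.1
ensemble_error <- choose(3, 2) * p^2 * (1 - p) + p^3
ensemble_error
## [1] 0.028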

Overview

  • Here is a quick overview of the previous topic, i.e., the Ensemble technique.
  • Instead of building one model, we prefer to build multiple models and take voting.
  • This voting criterion makes us less prone to errors, thereby raising the accuracy well above the individual accuracy of each model.
  • We will look into the techniques of bagging and boosting in the next topics.

Types of Ensemble Models

  • The above example is a very primitive type of ensemble model whose sole purpose was to give you an idea of the whole ensemble technique.
  • However, practically, ensemble models are built differently.
  • There are better and statistically stronger ensemble methods that will yield better results.
  • The two most popular ensemble methodologies are:
    • Bagging
    • Boosting

Bagging

  • Bagging is one of the ensemble techniques.
  • Before studying bagging, let us take a quick look at Bootstrap Sampling.
  • Bootstrap Sampling refers to taking sample points with replacement again and again. And these samples are called bootstrap samples.
  • Coming back to bagging, the bagging philosophy says, “Take multiple bootstrap samples from the population and build a classifier on each of the samples”.
  • Let us say, if we have ten bootstrap samples, then we shall have ten models.
  • For the prediction, take mean or mode, i.e., take the average or take a voting of all the individual model predictions and choose the final predictor.
  • Bagging has two major parts:
    • Bootstrap sampling
    • Aggregation of learners
  • Thus, Bagging = Bootstrap Aggregating
  • In Bagging we combine multiple moderately stable models to produce one final stable model.
  • Hence the predictors will be highly reliable.
  • Thus, the final model will have less variance and highly consistent coefficients.

Bootstrapping

  • Let us look at bootstrapping as it is the first step.
  • If we have a training data set of size N, then we draw samples of size N with replacement, i.e., we take a single data point, note it down and put it back.
  • Again we take another data point, note it down and put it back.
  • We are selecting records one at a time, returning each selected record to the population, giving it a chance to be selected again.
  • We repeat this process N times; sampling N times in this way gives us bootstrap sample-1.
  • Each such random draw of size N with replacement gives us a new data set.
  • Each of these bootstrap samples can have repeated observations, and some observations might not appear even once.
  • Because each selected record is put back before the next draw, the next time we might select the same record or a different one; it does not matter.
  • In this way, we create ‘B’ such new sample data sets.
  • These sets are called bootstrap samples.
  • This is the bootstrap part of bagging.
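Note – Below is a small R sketch of bootstrap sampling, purely for illustration (the dataset size N and the number of samples B are arbitrary choices here):

# Drawing B bootstrap samples of size N from the row indices 1..N
set.seed(123)
N <- 100                                   # size of the (hypothetical) training data
B <- 5                                     # number of bootstrap samples
boot_samples <- replicate(B, sample(1:N, size = N, replace = TRUE))

# A bootstrap sample typically repeats some rows and misses others entirely
sum(duplicated(boot_samples[, 1]))         # how many draws in sample-1 are repeats
length(setdiff(1:N, boot_samples[, 1]))    # how many of the N rows never appear in sample-1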

The Bagging Algorithm

  • We draw (k) bootstrap sample sets from the training dataset (D).
  • For each bootstrap sample (i), build a classifier model (M_i); i.e., we will have a total of (k) classifiers (M_1, M_2, …, M_k).
  • Then, we can either vote over the predictions or take their average, whichever we feel is the best way to aggregate.
  • Let us say we choose a vote-over to find the final classifier output, or the average in the case of regression; that gives the final bagged model.
  • Thus, we took (k) bootstrap sample sets and built (k) models, one on each of them.
  • Then all these models are combined together to form a consolidated model which is called a bagged model.
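Note – The algorithm above can be hand-rolled in a few lines of R. This is only an illustrative sketch using the Boston data from MASS (the lab below uses the ipred package instead); the object names are our own:

# A minimal hand-rolled bagging sketch for regression
library(MASS)
data(Boston)

k <- 25                                    # number of bootstrap models
N <- nrow(Boston)
models <- vector("list", k)

set.seed(42)
for (i in 1:k) {
  idx <- sample(1:N, size = N, replace = TRUE)          # bootstrap sample i
  models[[i]] <- lm(medv ~ ., data = Boston[idx, ])     # model M_i built on sample i
}

# Aggregation: average the k predictions (for classification we would take a vote)
all_preds <- sapply(models, predict, newdata = Boston)  # N x k matrix of predictions
bagged_pred <- rowMeans(all_preds)
head(bagged_pred)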

Why Bagging Works

  • Recall that we had a similar question, “Why Ensemble Works?”.
  • In fact, bagging is one of the typical ensemble models.
  • And we know that, for any ensemble model, we need to make sure that all the samples are independent.
  • We are selecting records one-at-a-time, returning each of the selected records back to the population, giving it a chance to be selected again.
  • Note that the variance in the consolidated prediction is reduced, if we have independent samples.
  • This way, we could reduce the unavoidable errors made by a single model.
  • Because if we just have one single model, it might catch some unwanted pattern or an outlier.
  • However, when we draw the bootstrap samples again and again, then we tend to have a very consistent and a robust model by the end of the bagging process.
  • We know the fact that, in a given bootstrap sample, some observations might get selected multiple times where as some observations might not get a chance at all.
  • There is a well-known result that a bootstrap sample contains, on average, only about 63% of the distinct observations in the original data; the remaining ~37% do not appear at all (a quick simulation of this appears after this list).
  • Thus, the data used in each of these models is not exactly the same. This makes our learning models independent and helps our predictors have uncorrelated errors.
  • Finally the errors from the individual models cancel out and give us a better ensemble model with higher accuracy.
  • Bagging is extremely useful when there is a lot of variance in our data.
  • When many points lie far away from the rest, or when there are too many outliers, bagging may be the way to go.
  • As we take bootstrap samples and build several models and finally combine them to build a bagging model.
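Note – The “about 63%” statement above can be checked with a quick simulation (illustrative only):

# Average fraction of distinct original rows that show up in a bootstrap sample
set.seed(1)
N <- 10000
frac_present <- replicate(200, length(unique(sample(1:N, N, replace = TRUE))) / N)
mean(frac_present)      # close to 1 - exp(-1) = 0.632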

LAB: Bagging Models

  1. Import Boston house price data. It is part of MASS package.
  2. Get some basic meta details of the data.
  3. Take 90% of the data for training and keep the remaining 10% as holdout data.
  4. Build a single linear regression model on the training data.
  5. On the hold out data, calculate the error (squared deviation) for the regression model.
  6. Build the regression model using bagging technique. Build at least 25 models.
  7. On the hold out data, calculate the error (squared deviation) for the consolidated bagged regression model.
  8. What is the improvement of the bagged model when compared with the single model?

Solution

  1. Import Boston house price data. It is part of MASS package
#Importing Boston house price data. 
library(MASS)
  2. Get some basic meta details of the data.
data(Boston)
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7
dim(Boston)
## [1] 506  14
  3. Take 90% of the data for training and keep the remaining 10% as holdout data.
##Training and holdout sample
library(caret)
## Warning: package 'caret' was built under R version 3.3.2
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(500)
sampleseed <- createDataPartition(Boston$medv, p=0.9, list=FALSE)

train_boston<-Boston[sampleseed,]
test_boston<-Boston[-sampleseed,]
  4. Build a single linear regression model on the training data.
###Regression Model
reg_model<- lm(medv ~ ., data=train_boston)
summary(reg_model)
## 
## Call:
## lm(formula = medv ~ ., data = train_boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.4763  -2.7684  -0.4912   1.9030  26.4569 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.637e+01  5.534e+00   6.572 1.40e-10 ***
## crim        -1.042e-01  3.513e-02  -2.965 0.003195 ** 
## zn           4.482e-02  1.459e-02   3.073 0.002248 ** 
## indus        1.986e-02  6.566e-02   0.302 0.762462    
## chas         2.733e+00  8.765e-01   3.118 0.001939 ** 
## nox         -1.844e+01  4.018e+00  -4.590 5.79e-06 ***
## rm           3.845e+00  4.670e-01   8.234 2.04e-15 ***
## age          8.782e-04  1.434e-02   0.061 0.951211    
## dis         -1.488e+00  2.096e-01  -7.101 4.94e-12 ***
## rad          2.770e-01  6.993e-02   3.960 8.71e-05 ***
## tax         -1.062e-02  3.944e-03  -2.693 0.007348 ** 
## ptratio     -9.799e-01  1.385e-01  -7.073 5.92e-12 ***
## black        9.620e-03  2.827e-03   3.403 0.000726 ***
## lstat       -5.051e-01  5.706e-02  -8.852  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.787 on 444 degrees of freedom
## Multiple R-squared:  0.7309, Adjusted R-squared:  0.723 
## F-statistic: 92.75 on 13 and 444 DF,  p-value: < 2.2e-16
  5. On the hold out data, calculate the error (squared deviation) for the regression model.
###Accuracy testing on holdout data
pred_reg<-predict(reg_model, newdata=test_boston[,-14])
reg_err<-sum((test_boston$medv-pred_reg)^2)
reg_err
## [1] 918.5927
  6. Build the regression model using bagging technique. Build at least 25 models.
###Bagging Ensemble Model
library(ipred)
## Warning: package 'ipred' was built under R version 3.3.2
bagg_model<- bagging(medv ~ ., data=train_boston , nbagg=30)
  7. On the hold out data, calculate the error (squared deviation) for the consolidated bagged regression model.
###Accuracy testing on holdout data
pred_bagg<-predict(bagg_model, newdata=test_boston[,-14])
bgg_err<-sum((test_boston$medv-pred_bagg)^2)
bgg_err
## [1] 390.9028
  8. What is the improvement of the bagged model when compared with the single model?
###Overall Improvement
reg_err
## [1] 918.5927
bgg_err
## [1] 390.9028
(reg_err-bgg_err)/reg_err
## [1] 0.5744547

Random Forest

  • Random forest is a specific case of ensemble techniques.
  • Precisely put, it is a specific case of the bagging methodology.
  • Bagging, applied specifically to decision trees, is known as random forest.
  • Just like many trees form a forest, many decision tree models together form a Random Forest model.
  • In random forest, we induce two types of randomness.
  • Firstly, we take the bootstrap samples of the population and build decision trees on each of the sample.
  • While building the individual trees on bootstrap samples, we take a subset of the features randomly.
  • We will not build the model on each sample using all the features.
  • We do not use all the predictor variables.
  • We do not even use all the variables that are known to have an impact.
  • Instead, we use a random subset of variables; an individual model built this way need not be one of the best models.
  • At the very least it will use a few variables and give us a loose classifier.
  • That is the major point in random forest: each tree uses only a subset of the variables.
  • Now the question is why a subset of variables is used.
  • If we use one subset of features in model-1, another subset in model-2, and yet another subset in model-3, then all these models or trees will be largely independent.
  • Even though each tree gives a different result, all the trees together might give us a good consolidated random forest model.
  • Random forests are very stable. They are as good as SVMs and sometimes better than other algorithms.

Random Forest Algorithm

  • The random forest algorithm is very close to bagging.
  • Given the training dataset (D) with (t) features, let us draw (k) bootstrap sample sets from the dataset (D).
  • For each bootstrap sample (i), build a decision tree model (M_i) using only (p) features, where (p) is much less than (t).
  • For example, if there is a dataset with 200 features or variables, then we might only use a randomly chosen 20 or 30 or 50 or 100 variables.
  • We build the decision tree model (M_1) on bootstrap sample-1, with (p) randomly chosen features.
  • Again, we build model (M_2) on bootstrap sample-2, with another (p) randomly chosen features.
  • Similarly we build further models up to (M_k), which are completely independent; there is randomness because of the different bootstrap samples being used.
  • Note that, (p) features are randomly chosen thus there is randomness being induced in the second level too.
  • And finally we consolidate them as the random forest model.
  • The way we consolidate is again through voting.

Recap..

  • Let us quickly have a recap.
  • We have a training dataset (D), with (t) number of features.
  • We draw (k) number of bootstrap samples.
  • For each bootstrap sample (i), build a decision tree model (M_i) using only (p) features, where (p) is much less than (t).
  • Each tree has maximal strength: the trees are fully grown and not pruned.
  • We shall have a total of (k) decision trees (M_1, M_2, …, M_k).
  • Each of these trees is built on relatively different training data and a different set of features.
  • Finally, we shall take a vote over the trees for the final classifier output.
  • And for regression, we shall take the average of the individual outputs as the final output.
  • The random forest model is nothing but all these models taken together; the consolidation is done based on voting.
  • If it is regression, then it takes the average of all the outputs; that is the regression consolidation.
  • If it is classification, then let us say we are predicting +1 and -1 with 30 trees.
  • If 25 trees predict +1, then based on voting we say the final prediction is +1. That is the random forest algorithm.

The Random Factors in Random Forest

  • We need to note the most important aspect of random forest, i.e inducing randomness into the bagging of trees. There are two major sources of randomness.
    • Randomness in data: Bootstrapping; this will make sure that the data in any two samples is somewhat different.
    • Randomness in features: While building the decision trees on boot strapped samples we consider only a random subset of features.
  • Why to induce the randomness?
    • The major trick of ensemble models is the independence of models.
    • If we take the same data and build the same model 100 times, we will not see any improvement.
    • To make all our decision trees independent, we take independent sample sets and independent feature sets.
    • As a rule of thumb, if (t) is very large, then we can take (p) as the square root of the number of features; otherwise (p = t/3). A small helper reflecting this rule is sketched below.
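Note – The rule of thumb above can be written as a tiny helper function. The cut-off used for “very large” is an arbitrary assumption here for illustration, not a fixed rule:

# Illustrative helper for choosing p (number of features per tree)
choose_p <- function(t) {
  if (t > 100) floor(sqrt(t)) else floor(t / 3)
}
choose_p(22)     # e.g. 22 predictors (as in the car-accidents lab below) -> 7
choose_p(900)    # e.g. a very wide dataset -> 30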

Why Random Forest Works

  • We need to note the most important aspect of random forest, i.e inducing randomness into the bagging of trees.
  • There are two major sources of randomness;
    • The randomness in the data.
    • The randomness in the features.
  • Randomness in data is induced by the bootstrapping; this makes sure that any two samples from the data are somewhat different.
  • Randomness in features is induced while building the decision trees on the bootstrapped samples, since we consider only a random subset of the features.
  • Why to induce the randomness?
  • The major trick of ensemble models is in “the models being independent”.
  • If we take the same data and build the same model for 100 times, we shall not see any improvement.
  • To make all our decision trees independent, we take a set of independent samples and a set of independent features.
  • As a rule of thumb for choosing the value of (p), we should look at the total number of features, (t).
  • We shall consider square root of (‘t’) if the value of (‘t’) is very large.
  • Else we shall consider (‘t by three’) if the value of (‘t’) is not so large, i.e., we go for one-third of the number of features when the total number of features is not so large.
  • Why Random Forest works?
  • For a training data with 20 features, we build 100 decision trees with 5 features each, instead of a single great decision tree, because the individual trees may be weak classifiers.
  • It is like building weak classifiers on subsets of data. The grouping of large sets of random trees generally produces more accurate models.
  • Suppose we have 100 trees and each one of them does one single thing very well, say each one of them is a fairly good classifier for identifying a particular pattern.
  • Then all of them will make a good random forest in identifying the patterns.
  • In this example we have three simple classifiers.
  • (M_1) classifies anything above its line as +1 and below as -1, (M_2) classifies all the points above its line as -1 and below as +1, and (M_3) classifies everything on the left as -1 and on the right as +1.
  • Each of these models has a fair amount of misclassification error.
  • All these three weak models together make a strong model.
  • If we take all these boundaries, then anything inside the boundary is -1 and anything outside the boundary is +1.
  • Hence the model consolidated from these 3 models is the best one.
  • It has almost zero error as it is not wrongly classifying anything.

LAB: Random Forest

  1. Dataset: /Car Accidents IOT/Train.csv
  2. Build a decision tree model to predict the fatality of accident.
  3. Build a decision tree model on the training data.
  4. On the test data, calculate the classification error and accuracy.
  5. Build a random forest model on the training data.
  6. On the test data, calculate the classification error and accuracy.
  7. What is the improvement of the Random Forest model when compared with the single tree?

Solution

  1. Dataset: /Car Accidents IOT/Train.csv
#Data Import
train<- read.csv("~/R Dataset/Car Accidents IOT/Train.csv")
test<- read.csv("~/R Dataset/Car Accidents IOT/Test.csv")

dim(train)
## [1] 15109    23
  • Here is the training data and here is the testing data of the car accidents IOT.
  • We know there are numerous sensors in the car.
  • Some sensors in the steering system, some in the engine, some near the wheels, some near the bumper and may be one at the clutch.
  • Essentially, in this dataset, we have the data collected from all these sensors.
  • And based on these data, we try to predict the fatality of the accident.
  • Here, we could see there are 23 variables and 15109 observations.
dim(test)
## [1] 9065   23
  • And in the test data, we have 23 variables again, however; with 9065 observations.
  • And the output is either fatal or not fatal.
  • So here we can see the training data with sensor 1, sensor 2, sensor 3 and so on up to sensor 22, and the output will either be 1 or 0, which indicates whether the accident is fatal or not.
head(train)
##   Fatal      S1       S2       S3  S4       S5 S6 S7 S8 S9      S10 S11
## 1     1 36.2247 10.77330 0.243897 596 100.6710  0  0  1 28 0.016064 313
## 2     1 35.7343 17.45510 0.243897 600 100.0000  0  0  1 14 0.015812 319
## 3     1 31.6561  7.61366 0.308763 604  99.3377  0  0  1  4 0.015560 323
## 4     1 33.8320 13.11190 0.293195 616  97.4026  0  0  1  8 0.016001 320
## 5     1 42.5138 13.99850 0.259465 632  94.9367  0  0  1  8 0.016064 322
## 6     1 36.1261 14.85930 0.278925 600 100.0000  0  0  1  4 0.015749 314
##   S12 S13 S14 S15   S16  S17     S18 S19  S20 S21     S22
## 1   1   1  57   0 0.280  240 5.99375   0  0.0   4 14.9382
## 2   1   1  57   0 0.175  240 5.99375   0  0.0   4 14.8827
## 3   1   1  58   0 0.280  240 5.99375   0  0.0   4 14.6005
## 4   1   1  58   0 0.385  240 4.50625   0 13.0   4 14.6782
## 5   1   1  57   0 0.070  240 5.99375   0 19.5   4 15.3461
## 6   1   1  58   0 0.175 1008 4.50625   0 23.9   4 15.0559
  • Now we shall try to build a decision tree model with fatality versus the rest of the variables.
  • Here we take the minimum split as 30 and (cp = 0.03), which are fairly standard values, and the data set is the training data set.
  2. Build a decision tree model to predict the fatality of accident.
###Decision Tree
library(rpart)
## Warning: package 'rpart' was built under R version 3.3.2
crash_model_ds<-rpart(Fatal ~ ., method="class", control=rpart.control(minsplit=30, cp=0.03),   data=train)
  • So, here is the tree which we have built.
  • Now, we shall check the accuracy of the model.
  • Let us run the training data code snippet.
  • We can see that the training accuracy is 82.85%, so we can say that most of the time it is predicting correctly.
  • Now let us run the testing data code snippet.
  • We can see that 82.84% is the testing accuracy.
  3. Build a decision tree model on the training data.
#Training accuracy
predicted_y<-predict(crash_model_ds, type="class")
table(predicted_y)
## predicted_y
##    0    1 
## 5745 9364
confusionMatrix(predicted_y,train$Fatal)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4735 1010
##          1 1581 7783
##                                           
##                Accuracy : 0.8285          
##                  95% CI : (0.8224, 0.8345)
##     No Information Rate : 0.582           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.643           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7497          
##             Specificity : 0.8851          
##          Pos Pred Value : 0.8242          
##          Neg Pred Value : 0.8312          
##              Prevalence : 0.4180          
##          Detection Rate : 0.3134          
##    Detection Prevalence : 0.3802          
##       Balanced Accuracy : 0.8174          
##                                           
##        'Positive' Class : 0               
## 
  4. On the test data, calculate the classification error and accuracy.
#Accuracy on Test data
predicted_test_ds<-predict(crash_model_ds, test, type="class")
confusionMatrix(predicted_test_ds,test$Fatal)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2897  561
##          1  995 4612
##                                           
##                Accuracy : 0.8284          
##                  95% CI : (0.8204, 0.8361)
##     No Information Rate : 0.5707          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6448          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7443          
##             Specificity : 0.8916          
##          Pos Pred Value : 0.8378          
##          Neg Pred Value : 0.8225          
##              Prevalence : 0.4293          
##          Detection Rate : 0.3196          
##    Detection Prevalence : 0.3815          
##       Balanced Accuracy : 0.8179          
##                                           
##        'Positive' Class : 0               
## 
  • The accuracy on the training data is 82.85% and on the testing data it is 82.84%.
  • We can say that it is a fairly good decision tree model.
  • Now, let us build a random forest; in the process we shall build 200 decision tree models.
  • For each model, we shall take a bootstrap sample.
  • However, the number of features for each tree would be about one-third of the number of training data columns.
  • That is, out of the 22 predictor features, every tree will use around 7 or 8 randomly chosen features.
  • In this way, we shall build 200 such trees and consolidate them to make the final random forest.
  • Let us run these lines and build the random forest.
  • It looks like it is taking some time to build as we are building 200 such trees.
  • Now that the random forest model is built, let us see its summary.
  5. Build a random forest model on the training data.
###Random Forest
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.3.2
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
rf_model <- randomForest(as.factor(train$Fatal) ~ ., ntree=200,   mtry=ncol(train)/3, data=train)
summary(rf_model)
##                 Length Class  Mode     
## call                5  -none- call     
## type                1  -none- character
## predicted       15109  factor numeric  
## err.rate          600  -none- numeric  
## confusion           6  -none- numeric  
## votes           30218  matrix numeric  
## oob.times       15109  -none- numeric  
## classes             2  -none- character
## importance         22  -none- numeric  
## importanceSD        0  -none- NULL     
## localImportance     0  -none- NULL     
## proximity           0  -none- NULL     
## ntree               1  -none- numeric  
## mtry                1  -none- numeric  
## forest             14  -none- list     
## y               15109  factor numeric  
## test                0  -none- NULL     
## inbag               0  -none- NULL     
## terms               3  terms  call
  • This is the summary of the random forest model.
  • Majorly, it will give ensemble or bagging type of summary.
  • However, we might want to see the confusion matrix or the final tree or the random forest itself, but it is very hard to make any sense out of the raw random forest object as we have built 200 trees.
#Training accuracy
predicted_y<-predict(rf_model)
table(predicted_y)
## predicted_y
##    0    1 
## 5921 9188
confusionMatrix(predicted_y,train$Fatal)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5600  321
##          1  716 8472
##                                           
##                Accuracy : 0.9314          
##                  95% CI : (0.9272, 0.9353)
##     No Information Rate : 0.582           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8577          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8866          
##             Specificity : 0.9635          
##          Pos Pred Value : 0.9458          
##          Neg Pred Value : 0.9221          
##              Prevalence : 0.4180          
##          Detection Rate : 0.3706          
##    Detection Prevalence : 0.3919          
##       Balanced Accuracy : 0.9251          
##                                           
##        'Positive' Class : 0               
## 
  • Let us go ahead and check the training accuracy.
  • We can see it is about 93%, which is much higher than the 82% of the earlier decision tree model.
  6. On the test data, calculate the classification error and accuracy.
#Accuracy on Test data
predicted_test_rf<-predict(rf_model,test, type="class")
confusionMatrix(predicted_test_rf,test$Fatal)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3479  192
##          1  413 4981
##                                           
##                Accuracy : 0.9333          
##                  95% CI : (0.9279, 0.9383)
##     No Information Rate : 0.5707          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8628          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8939          
##             Specificity : 0.9629          
##          Pos Pred Value : 0.9477          
##          Neg Pred Value : 0.9234          
##              Prevalence : 0.4293          
##          Detection Rate : 0.3838          
##    Detection Prevalence : 0.4050          
##       Balanced Accuracy : 0.9284          
##                                           
##        'Positive' Class : 0               
## 
  • Now let us look at the testing accuracy on the hold out data, which turns out to be 93.33%.
  • Thus, we can definitely say that this model is much more improved and is a better classifier in comparison with a single decision tree.

Boosting

  • Boosting is one more famous ensemble method.
  • Boosting uses a slightly different technique from that of bagging.
  • Boosting is a well proven theory that works really well on many of the machine learning problems like speech recognition.
  • If bagging is wisdom of crowds then boosting is wisdom of crowds where each individual is given some weight based on their expertise.
  • Boosting in general decreases the bias error and builds strong predictive models.
  • Boosting is an iterative technique. We adjust the weight of the observation based on the previous classification.
  • If an observation was classified incorrectly, it tries to increase the weight of this observation and vice versa.

Boosting Main idea

Final Classifier: (C = sum_i alpha_i · c_i), where (alpha_i) is the accuracy factor (weight) of the individual classifier (c_i).

  • The main idea or philosophy of boosting is as follows:
  • Take a random sample of the population of size (N), where each record has a (1/N) chance of being picked.
  • Let the variable (w) denote the weight of each observation point.
  • Initially, the weight (w) is (1/N).
  • Secondly, we shall pick a random sample from the population and build a classifier on that particular random sample.
  • Then, we shall note down the accuracy.
  • Obviously, the classifier might not classify all the records correctly and some might be wrong.
  • We shall identify the previously misclassified samples and add more weight to them.
  • So that means, in the new weighted model, the previously misclassified observations will be picked more often.
  • Then, we shall build a new model on the re-weighted sample that we just collected.
  • So, in the new re-weighted model, since we picked up most of the previously wrongly classified records, we expect the new model to do better on those records.
  • Now, let us check the error for this new model.
  • And if the classifier still misclassifies some of the records then we shall repeat the same process.
  • Make sure that the wrongly classified records are picked more often than the correctly classified ones.
  • Based on the new weighted samples, we shall build a new model.
  • The final weighted classifier will be the sum of the products of accuracy and classification of each of the models.
  • That’s the main idea of boosting.

Recap..

  • Let us quickly have a recap.
  • Initially, all the records have the same weight, so we simply pick a random sample.
  • Then we shall build a model directly.
  • We know that any model will most likely have some error.
  • That means, there will be an error, say epsilon-1 and an accuracy factor, say alpha-1 for this particular model.
  • And then, we update the weight.
  • That means we resample again from the previous sample which will give us a weighted sample.
  • How do we take the weighted sample??
  • That will be, by picking up the previously misclassified records more often.
  • And those which were correctly classified will be chosen less often.
  • Then we build a new model (M_2), with an error epsilon-2 and an accuracy factor alpha-2.
  • Again here, whatever is misclassified will be picked more often.
  • And the correctly classified will be picked less often.
  • We repeat the same process until we achieve the desired accuracy.

How weighted samples are taken

  • Let us try to understand in more detail, how the weighted samples are taken.
  • Consider an example:
  • Imagine a data with 10 points and actual classes are minus (-) and plus (+).
  • Now we build model (M_1). For each observation, the predicted class of (M_1) will be either plus or minus.
  • So let us take model (M_1).
  • Here in the observation-1, minus is correctly predicted as minus.
  • Even in the observation-2, minus is correctly predicted as a minus.
  • But in the third and the fourth observations, plus is wrongly predicted as minus.
  • So we can see that the model (M_1) is wrongly classifying the third, the fourth and the sixth observations.
  • However the rest of the observations are predicted correctly.
  • Now we shall take weighted samples, that is, adding more weight to the wrongly predicted observations.
  • In this weighted sample, the observations 3, 4 and 6 should appear more often than others.
  • So to build model (M_2), observations 1 & 2 have been picked, 3 & 4 have been picked, and 5, 6 & 7 have been picked.
  • We can observe that observation 4 is picked again because it had been misclassified. Similarly, observations 3 & 6 have also been picked again because of the misclassification.
  • Then we build a model (M_2) on this new data set.
  • (M_2) again classifies every observation as plus or minus, just as (M_1) did.
  • The previously misclassified observations 3, 4 & 6 are now classified correctly; however, observations 5 and 7 are misclassified.
  • We need to repeat the weighted sampling once again.
  • This time we shall give more weight to 5 and 7, because they were misclassified in the previous round on weighted sample-1.
  • They will be picked more often.
  • Thus in the weighted sample-2; 5 is chosen 3 times, 7 is chosen 3 times, 6 is picked twice and the rest of the observations are picked only once.
  • Now we build model (M_3) on weighted sample-2, against the actual classes.
  • We can see that (M_3) predicts everything correctly. That is how weighted sampling and boosting work.
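Note – In R, one round of weighted resampling of this kind can be sketched with sample() and its prob argument. The up-weighting factor used here is illustrative, not a prescribed value:

# Records the previous model misclassified get larger weights,
# so sample() picks them more often in the next round.
set.seed(7)
n <- 10
w <- rep(1 / n, n)                 # start with equal weights
misclassified <- c(3, 4, 6)        # records the first model got wrong (from the example)

w[misclassified] <- w[misclassified] * 3   # up-weight them (factor 3 is illustrative)
w <- w / sum(w)                            # renormalise so the weights sum to 1

weighted_sample <- sample(1:n, size = n, replace = TRUE, prob = w)
sort(weighted_sample)              # 3, 4 and 6 tend to appear more than once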

Boosting Illustration

  • To understand boosting clearly, we shall see a visual illustration of the same example that we discussed earlier.
  • Take note of the observation record numbers, because in boosting and its weighted sampling we need to keep track of the records.
  • The observation number matters because a record will be picked again later based on whether it was classified correctly or wrongly.

Note – Below is the training data and their classes. And we need to take a note of record numbers, they will help us in weighted sampling later.

  • These are the 10 data points; some of them are positive and some of them are negative.
  • Thus here, in the illustration we can see 5 positive points, which are in blue, and 5 negative points, which are in red.
  • Classifier model (M_1) is built; anything above the line is - and below the line is +.
  • 3 out of 10 points are misclassified by the model (M_1).
  • 3, 4 and 6: these are the points misclassified by model (M_1).
  • As per boosting technique, these will be resampled again.
  • Next time we shall give more weight to these points.
  • Thus 3, 4 and 6 are misclassified; clearly, anything above the line is classified as red and anything below the line is classified as blue.
  • The remaining points are correctly classified and these 3 are wrongly classified; this is model (M_1).
  • This is the first sampling and this is the result of model (M_1).
  • Now we shall resample and give more weight to the data points 3, 4 and 6 and then build a model (M_2).
  • We resampled so that each one of them is picked more often than the others.
  • Thus the sample points 9 and 10 didn’t appear at all.
  • But the points 3, 4, 6 are picked again.
  • (M_2) is built on this data. Anything above this line is red and below the line is blue.
  • (M_2) is classifying the points 5 & 7 incorrectly.
  • Model (M_2) made sure that the data points 3, 4 and 6 are classified correctly.
  • They are positive and they are classified as positive.
  • Compared with the previous model, the current model is an improvement.
  • This model classifies those points correctly.
  • But again this model has misclassified 5 and 7.
  • Thus in the next iteration, we have to pick 5 and 7 more often than the other points, which means these observations will get more weight.
  • This is the 3rd weighted sample; which is the final one.
  • Here the point 5 is picked thrice and 7 is picked thrice.
  • Now we build a new model (M_3) on this data.
  • Anything to the left hand side of the line is blue and anything to the right is red.
  • We can observe that (M_3) is now classifying everything correctly.
  • Thus we don’t need to do further weighted sampling.
  • By now we have built three models (M_1), (M_2) & (M_3), which all together give the final result.
  • The final prediction will now be picked on weighted votes.
  • For a given data point, at least 2 of the 3 models indicate the right class.
  • For example, take point 6: it is classified as minus (-) by (M_1), plus (+) by (M_2) and plus (+) by (M_3); thus the final result will be plus (+).
  • Similarly, take point 2: it is classified as minus (-) by (M_1), minus (-) by (M_2) and plus (+) by (M_3); the final result will be minus (-).
  • So the final weighted combination of all the three models and the predictions will yield a highly accurate model.
  • That is how the boosting works.

Theory behind Boosting Algorithm

  • Take the dataset.
  • Build a classifier (c_m) on a weighted sample and find its error.
  • Calculate the error rate of the classifier:
    • Error rate of (c_m) = sum of weights of misclassified records / sum of all sample weights.
  • Calculate an intermediate factor called (alpha). It is analogous to the accuracy rate of the model and will be used later in weight updating. It is derived from the error; in the standard AdaBoost formulation, (alpha_m = 0.5 · ln((1 − error_m) / error_m)).
  • Update the weights of each record in the sample using the (alpha) factor; the indicator function makes sure that the misclassifications are given more weight:
    • For (i = 1, 2, …, N): (w_i ← w_i · exp(alpha_m · I(record i misclassified))).
      • Renormalize so that the sum of weights is 1.
  • Repeat this model building and weight update process until we have no misclassification (or a preset number of models is reached).
  • Final collation is done by voting over all the models. While taking the votes, each model is weighted by its accuracy factor (alpha). A rough R sketch of this loop follows.
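Note – Below is a rough R sketch of this loop, following the standard AdaBoost-style formulas for (alpha) and the weight update. It uses shallow rpart trees (stumps) as weak learners and weighted resampling as described above; the function name, the assumption that the response column is called y, and all settings are our own illustration, not a tuned implementation:

library(rpart)

# `data` is a data frame whose response column `y` is a two-level factor.
ada_sketch <- function(data, M = 10) {
  n <- nrow(data)
  w <- rep(1 / n, n)                        # initial weights: 1/N for every record
  models <- list()
  alphas <- numeric(M)

  for (m in 1:M) {
    idx <- sample(1:n, size = n, replace = TRUE, prob = w)   # weighted (re)sample
    fit <- rpart(y ~ ., data = data[idx, ], method = "class",
                 control = rpart.control(maxdepth = 1))      # weak learner: a stump
    pred <- predict(fit, data, type = "class")               # predictions on the full data
    miss <- as.numeric(pred != data$y)                       # 1 if record i is misclassified
    err  <- sum(w * miss) / sum(w)                           # weighted error rate
    alpha <- 0.5 * log((1 - err) / err)                      # accuracy factor (no guard for err = 0)
    w <- w * exp(alpha * miss)                               # misclassified records get more weight
    w <- w / sum(w)                                          # renormalise so the weights sum to 1
    models[[m]] <- fit
    alphas[m]   <- alpha
  }
  list(models = models, alphas = alphas)    # final prediction: vote weighted by the alphas
}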

Gradient Boosting

  • Ada boosting
    • Adaptive Boosting
    • Till now we discussed Ada boosting technique. Here we give high weight to misclassified records.
  • Gradient Boosting
    • Similar to the Ada boosting algorithm.
    • The approach is the same, but there are slight modifications during re-weighted sampling.
    • We update the weights based on the misclassification rate and the gradient.
    • Gradient boosting serves better for some class of problems like regression.

LAB: Boosting

  1. Rightly categorizing the items based on their detailed feature specifications. More than 100 specifications have been collected.
  2. Data: Ecom_Products_Menu/train.csv
  3. Build a decision tree model and check the training and testing accuracy
  4. Build a boosted decision tree.
  5. Is there any improvement over the earlier decision tree?

Solution

train <- read.csv("~/R Dataset/Ecom_Products_Menu/train.csv")
test <- read.csv("~/R Dataset/Ecom_Products_Menu/test.csv")

dim(train)
## [1] 50122   102
dim(test)
## [1] 11756   102
names(train)
##   [1] "id"       "spec1"    "spec2"    "spec3"    "spec4"    "spec5"   
##   [7] "spec6"    "spec7"    "spec8"    "spec9"    "spec10"   "spec11"  
##  [13] "spec12"   "spec13"   "spec14"   "spec15"   "spec16"   "spec17"  
##  [19] "spec18"   "spec19"   "spec20"   "spec21"   "spec22"   "spec23"  
##  [25] "spec24"   "spec25"   "spec26"   "spec27"   "spec28"   "spec29"  
##  [31] "spec30"   "spec31"   "spec32"   "spec33"   "spec34"   "spec35"  
##  [37] "spec36"   "spec37"   "spec38"   "spec39"   "spec40"   "spec41"  
##  [43] "spec42"   "spec43"   "spec44"   "spec45"   "spec46"   "spec47"  
##  [49] "spec48"   "spec49"   "spec50"   "spec51"   "spec52"   "spec53"  
##  [55] "spec54"   "spec55"   "spec56"   "spec57"   "spec58"   "spec59"  
##  [61] "spec60"   "spec61"   "spec62"   "spec63"   "spec64"   "spec65"  
##  [67] "spec66"   "spec67"   "spec68"   "spec69"   "spec70"   "spec71"  
##  [73] "spec72"   "spec73"   "spec74"   "spec75"   "spec76"   "spec77"  
##  [79] "spec78"   "spec79"   "spec80"   "spec81"   "spec82"   "spec83"  
##  [85] "spec84"   "spec85"   "spec86"   "spec87"   "spec88"   "spec89"  
##  [91] "spec90"   "spec91"   "spec92"   "spec93"   "spec94"   "spec95"  
##  [97] "spec96"   "spec97"   "spec98"   "spec99"   "spec100"  "Category"
  • The last column is Category: the type of product class (mobile, laptop, etc.) that the item belongs to. Based on the specifications, we need to classify it automatically on the e-commerce website.
  • We build a decision tree model.
  • We try to predict the category using the specifications and build our first decision tree model on the data.
  • Dataset is fairly large, so it might take some time.
  • We can see the plot as well.
##Decison Tree
library(rpart)
ecom_products_ds<-rpart(Category ~ ., method="class", control=rpart.control(minsplit=30, cp=0.01),  data=train[,-1])
library(rattle)
## Warning: package 'rattle' was built under R version 3.3.2
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
fancyRpartPlot(ecom_products_ds)

#Training accuracy
library(caret)
predicted_y<-predict(ecom_products_ds, type="class")
table(predicted_y)
## predicted_y
##   Accessories    Appliances        Camara          Ipod       Laptops 
##             0         10899          2733          2442             0 
##       Mobiles Personal_Care       Tablets            TV 
##             0         10288         23760             0
confusionMatrix(predicted_y,train$Category)
## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      Accessories Appliances Camara  Ipod Laptops Mobiles
##   Accessories             0          0      0     0       0       0
##   Appliances            825       5536   1086   130     506     709
##   Camara                 88        387   1456     4      55     388
##   Ipod                   30         17     23  2032     144       5
##   Laptops                 0          0      0     0       0       0
##   Mobiles                 0          0      0     0       0       0
##   Personal_Care         110        308    152     0      18      79
##   Tablets              1288        615   1247    51    5743     377
##   TV                      0          0      0     0       0       0
##                Reference
## Prediction      Personal_Care Tablets    TV
##   Accessories               0       0     0
##   Appliances             1035     932   140
##   Camara                  252      84    19
##   Ipod                     13     159    19
##   Laptops                   0       0     0
##   Mobiles                   0       0     0
##   Personal_Care          9545      19    57
##   Tablets                 607   11885  1947
##   TV                        0       0     0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6076          
##                  95% CI : (0.6033, 0.6119)
##     No Information Rate : 0.2609          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5053          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Accessories Class: Appliances Class: Camara
## Sensitivity                     0.00000            0.8066       0.36731
## Specificity                     1.00000            0.8760       0.97233
## Pos Pred Value                      NaN            0.5079       0.53275
## Neg Pred Value                  0.95329            0.9662       0.94708
## Prevalence                      0.04671            0.1369       0.07909
## Detection Rate                  0.00000            0.1105       0.02905
## Detection Prevalence            0.00000            0.2174       0.05453
## Balanced Accuracy               0.50000            0.8413       0.66982
##                      Class: Ipod Class: Laptops Class: Mobiles
## Sensitivity              0.91655          0.000        0.00000
## Specificity              0.99144          1.000        1.00000
## Pos Pred Value           0.83210            NaN            NaN
## Neg Pred Value           0.99612          0.871        0.96892
## Prevalence               0.04423          0.129        0.03108
## Detection Rate           0.04054          0.000        0.00000
## Detection Prevalence     0.04872          0.000        0.00000
## Balanced Accuracy        0.95400          0.500        0.50000
##                      Class: Personal_Care Class: Tablets Class: TV
## Sensitivity                        0.8335         0.9087   0.00000
## Specificity                        0.9808         0.6794   1.00000
## Pos Pred Value                     0.9278         0.5002       NaN
## Neg Pred Value                     0.9521         0.9547   0.95647
## Prevalence                         0.2285         0.2609   0.04353
## Detection Rate                     0.1904         0.2371   0.00000
## Detection Prevalence               0.2053         0.4740   0.00000
## Balanced Accuracy                  0.9071         0.7941   0.50000
  • The accuracy is only about 61%, which is quite low even on the training data.
#Accuracy on Test data
predicted_test_ds<-predict(ecom_products_ds, test[,-1], type="class")
confusionMatrix(predicted_test_ds,test$Category)
## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      Accessories Appliances Camara Ipod Laptops Mobiles
##   Accessories             0          0      0    0       0       0
##   Appliances            172       1308    269   40      92     170
##   Camara                 15         80    383    1      16      95
##   Ipod                   14          4      3  469      28       0
##   Laptops                 0          0      0    0       0       0
##   Mobiles                 0          0      0    0       0       0
##   Personal_Care          23         75     42    0       1      23
##   Tablets               274        134    294   12    1401      83
##   TV                      0          0      0    0       0       0
##                Reference
## Prediction      Personal_Care Tablets   TV
##   Accessories               0       0    0
##   Appliances              234     210   42
##   Camara                   52      23    3
##   Ipod                      3      49    5
##   Laptops                   0       0    0
##   Mobiles                   0       0    0
##   Personal_Care          2242      10   17
##   Tablets                 152    2751  442
##   TV                        0       0    0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6085          
##                  95% CI : (0.5996, 0.6173)
##     No Information Rate : 0.2588          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5071          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Accessories Class: Appliances Class: Camara
## Sensitivity                     0.00000            0.8170       0.38648
## Specificity                     1.00000            0.8790       0.97353
## Pos Pred Value                      NaN            0.5156       0.57335
## Neg Pred Value                  0.95764            0.9682       0.94517
## Prevalence                      0.04236            0.1362       0.08430
## Detection Rate                  0.00000            0.1113       0.03258
## Detection Prevalence            0.00000            0.2158       0.05682
## Balanced Accuracy               0.50000            0.8480       0.68000
##                      Class: Ipod Class: Laptops Class: Mobiles
## Sensitivity              0.89847         0.0000        0.00000
## Specificity              0.99056         1.0000        1.00000
## Pos Pred Value           0.81565            NaN            NaN
## Neg Pred Value           0.99526         0.8692        0.96844
## Prevalence               0.04440         0.1308        0.03156
## Detection Rate           0.03989         0.0000        0.00000
## Detection Prevalence     0.04891         0.0000        0.00000
## Balanced Accuracy        0.94452         0.5000        0.50000
##                      Class: Personal_Care Class: Tablets Class: TV
## Sensitivity                        0.8356         0.9040    0.0000
## Specificity                        0.9789         0.6796    1.0000
## Pos Pred Value                     0.9215         0.4963       NaN
## Neg Pred Value                     0.9527         0.9530    0.9567
## Prevalence                         0.2282         0.2588    0.0433
## Detection Rate                     0.1907         0.2340    0.0000
## Detection Prevalence               0.2070         0.4715    0.0000
## Balanced Accuracy                  0.9073         0.7918    0.5000
###Boosting

library(xgboost)
## Warning: package 'xgboost' was built under R version 3.3.2
library(methods)
library(data.table)
## Warning: package 'data.table' was built under R version 3.3.2
library(magrittr)
## Warning: package 'magrittr' was built under R version 3.3.2
  • On the test data as well, the accuracy comes up to only about 60%.
  • Even in the confusion matrix, we can see that many accessories are classified as appliances, and so on.
  • There are a lot of errors; too many observations fall above and below the diagonal of the confusion matrix.
  • It is very clear that many predictions are going wrong.
  • Now we shall apply a boosting algorithm to the same data.
  • In boosting, we first build a model; whatever is misclassified is then sampled with a higher probability, i.e., those observations are taken more often than the rest in the next round, so that they eventually get classified correctly (a toy sketch of this re-weighting idea appears after this list).
  • Let us see how much better boosting performs on this dataset.
  • Let us include some important libraries.
  • We will be using a function called xgboost, which requires numeric input features.
  • Thus, we shall convert the feature columns to numeric format and then convert the datasets to matrices, since the xgboost function accepts data only in matrix format, not as a data frame.
  • Let us create two matrices here; namely
    • Train matrix
    • Test matrix
  • The next requirement is to convert the label to numeric format.
  • And this label should start from 0.
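  • Before the data preparation, here is a toy, self-contained sketch of the re-weighting idea mentioned above; the weak rule, the weights, and the simulated data are purely illustrative and not part of this handout's pipeline.
# Toy illustration only (not xgboost itself): points misclassified by a weak
# first rule get a higher weight, so a resample for the next model contains
# them more often.
set.seed(7)
n     <- 500
x     <- runif(n)
truth <- ifelse(x > 0.5, 1, 0)
pred1 <- ifelse(x > 0.6, 1, 0)                 # weak rule: wrong for 0.5 < x <= 0.6
w     <- ifelse(pred1 != truth, 3, 1)          # up-weight the mistakes
w     <- w / sum(w)
idx   <- sample(seq_len(n), size = n, replace = TRUE, prob = w)
mean(x > 0.5 & x <= 0.6)                       # original share of the hard region
mean(x[idx] > 0.5 & x[idx] <= 0.6)             # over-represented in the resample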
# converting feature columns to Numeric format; xgboost requires numeric input
train[,c(-1,-102)] <- lapply( train[,c(-1,-102)], as.numeric)
test[,c(-1,-102)] <- lapply( test[,c(-1,-102)], as.numeric)

# converting datasets to Matrix format. Data frame is not supported by xgboost
trainMatrix <- train[,c(-1,-102)] %>% as.matrix
testMatrix <- test[,c(-1,-102)] %>% as.matrix

#The label should be in numeric format and it should start from 0
y<-as.integer(train$Category)-1
table(y,train$Category)
##    
## y   Accessories Appliances Camara  Ipod Laptops Mobiles Personal_Care
##   0        2341          0      0     0       0       0             0
##   1           0       6863      0     0       0       0             0
##   2           0          0   3964     0       0       0             0
##   3           0          0      0  2217       0       0             0
##   4           0          0      0     0    6466       0             0
##   5           0          0      0     0       0    1558             0
##   6           0          0      0     0       0       0         11452
##   7           0          0      0     0       0       0             0
##   8           0          0      0     0       0       0             0
##    
## y   Tablets    TV
##   0       0     0
##   1       0     0
##   2       0     0
##   3       0     0
##   4       0     0
##   5       0     0
##   6       0     0
##   7   13079     0
##   8       0  2182
test_y<-as.integer(test$Category)-1
table(test_y,test$Category)
##       
## test_y Accessories Appliances Camara Ipod Laptops Mobiles Personal_Care
##      0         498          0      0    0       0       0             0
##      1           0       1601      0    0       0       0             0
##      2           0          0    991    0       0       0             0
##      3           0          0      0  522       0       0             0
##      4           0          0      0    0    1538       0             0
##      5           0          0      0    0       0     371             0
##      6           0          0      0    0       0       0          2683
##      7           0          0      0    0       0       0             0
##      8           0          0      0    0       0       0             0
##       
## test_y Tablets   TV
##      0       0    0
##      1       0    0
##      2       0    0
##      3       0    0
##      4       0    0
##      5       0    0
##      6       0    0
##      7    3043    0
##      8       0  509
  • Basically we are preparing the data for the analysis.
  • The tables above confirm that the new numeric labels (0-8) line up one-to-one with the original category names.
  • Now we create the list of parameters.
  • We set the objective of the model to multi:softprob.
  • With this objective, XGBoost does multiclass classification using the softmax objective and returns one probability per class; multi:softmax would instead return the predicted class directly (a short sketch of that alternative appears after the parameter block below).
  • Here we set the number of classes to 9.
  • The evaluation metric is merror, the multiclass classification error rate, calculated as (number of wrong cases) / (total number of cases).
  • These are the parameters that we need to set to get started with xgboost.
#Setting the parameters for multiclass classification
param <- list("objective" = "multi:softprob","eval.metric" = "merror",   "num_class" =9)
#"multi:softmax" --set XGBoost to do multiclass classification using the softmax objective, you also need to set num_class(number of classes)     
#"merror": Multiclass classification error rate. It is calculated as #(wrong cases)/#(all cases).
  • Here we call xgboost with the training matrix as the data and y as the label.
  • We perform 40 rounds of weighting and re-weighting (boosting iterations).
  • If we keep increasing the iterations, the training error keeps falling towards zero, so we restrict the number of rounds rather than chase an "error-free" model on the training data.
XGBModel <- xgboost(param=param, data = trainMatrix, label = y, nrounds=40)
## [1]  train-merror:0.270041 
## [2]  train-merror:0.244603 
## [3]  train-merror:0.232772 
## [4]  train-merror:0.226068 
## [5]  train-merror:0.220841 
## [6]  train-merror:0.215614 
## [7]  train-merror:0.211923 
## [8]  train-merror:0.207793 
## [9]  train-merror:0.204401 
## [10] train-merror:0.201568 
## [11] train-merror:0.198835 
## [12] train-merror:0.196880 
## [13] train-merror:0.193707 
## [14] train-merror:0.191812 
## [15] train-merror:0.188859 
## [16] train-merror:0.186724 
## [17] train-merror:0.185009 
## [18] train-merror:0.182056 
## [19] train-merror:0.179422 
## [20] train-merror:0.178504 
## [21] train-merror:0.176509 
## [22] train-merror:0.173975 
## [23] train-merror:0.172379 
## [24] train-merror:0.170703 
## [25] train-merror:0.168589 
## [26] train-merror:0.166015 
## [27] train-merror:0.164738 
## [28] train-merror:0.162883 
## [29] train-merror:0.161466 
## [30] train-merror:0.160249 
## [31] train-merror:0.158912 
## [32] train-merror:0.157635 
## [33] train-merror:0.155800 
## [34] train-merror:0.154503 
## [35] train-merror:0.153067 
## [36] train-merror:0.152009 
## [37] train-merror:0.151450 
## [38] train-merror:0.150014 
## [39] train-merror:0.148957 
## [40] train-merror:0.147839
  • We can see that in the first iteration the error is about 27%, in the second iteration about 24%, and as the iterations increase the error keeps reducing.
  • The error after the final (40th) iteration is about 15%.
  • Now let us calculate the training accuracy for the boosting model.
#Training accuracy
predicted_y<-predict(XGBModel, trainMatrix)
probs <- data.frame(matrix(predicted_y, nrow=nrow(train), ncol=9,  byrow = TRUE))

probs_final<-as.data.frame(cbind(row.names(probs),apply(probs,1, function(x) c(0:8)[which(x==max(x))])))
table(probs_final$V2)
## 
##     0     1     2     3     4     5     6     7     8 
##  2098  6980  4006  2217  5167  1219 11446 15643  1346
confusionMatrix(probs_final$V2,y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1     2     3     4     5     6     7     8
##          0  1768    34    14     2    78    32    92    61    17
##          1    77  6477   122     1    14   137   133    15     4
##          2     8    81  3573     2     6   222   102    11     1
##          3     8     4     2  2180     0     3     0    10    10
##          4    92    14     3     2  3740     5    14  1064   233
##          5    28    60    74     2     2  1004    43     6     0
##          6   102   117    91     2     4   106 10961    20    43
##          7   239    76    84    26  2582    48    99 11812   677
##          8    19     0     1     0    40     1     8    80  1197
## 
## Overall Statistics
##                                          
##                Accuracy : 0.8522         
##                  95% CI : (0.849, 0.8553)
##     No Information Rate : 0.2609         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.8201         
##  Mcnemar's Test P-Value : < 2.2e-16      
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.75523   0.9438  0.90136  0.98331  0.57841  0.64442
## Specificity           0.99309   0.9884  0.99062  0.99923  0.96731  0.99557
## Pos Pred Value        0.84271   0.9279  0.89191  0.98331  0.72382  0.82363
## Neg Pred Value        0.98807   0.9911  0.99152  0.99923  0.93936  0.98867
## Prevalence            0.04671   0.1369  0.07909  0.04423  0.12901  0.03108
## Detection Rate        0.03527   0.1292  0.07129  0.04349  0.07462  0.02003
## Detection Prevalence  0.04186   0.1393  0.07992  0.04423  0.10309  0.02432
## Balanced Accuracy     0.87416   0.9661  0.94599  0.99127  0.77286  0.81999
##                      Class: 6 Class: 7 Class: 8
## Sensitivity            0.9571   0.9031  0.54858
## Specificity            0.9875   0.8966  0.99689
## Pos Pred Value         0.9576   0.7551  0.88930
## Neg Pred Value         0.9873   0.9633  0.97981
## Prevalence             0.2285   0.2609  0.04353
## Detection Rate         0.2187   0.2357  0.02388
## Detection Prevalence   0.2284   0.3121  0.02685
## Balanced Accuracy      0.9723   0.8999  0.77274
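  • As a side note, the row-wise argmax used above can be written more concisely; a hedged equivalent, assuming the probs data frame just created (ties are broken at random by max.col):
# Sketch: max.col returns, for each row, the column with the highest
# probability; subtracting 1 gives labels 0..8
pred_class_train <- max.col(probs) - 1
table(pred_class_train)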
  • Training accuracy is about 85%.
  • This is consistent with the final training error of about 15%.
  • Accuracy on the test data matters the most.
  • As shown below, it turns out to be about 80%, which is much better than our earlier decision tree model's 60% accuracy.
  • Thus this is how boosting works.
#Accuracy on Test data

predicted_test_boost<-predict(XGBModel, testMatrix)
probs_test <- data.frame(matrix(predicted_test_boost, nrow=nrow(test), ncol=9,  byrow = TRUE))

probs_final_test<-as.data.frame(cbind(row.names(probs_test),apply(probs_test,1, function(x) c(0:8)[which(x==max(x))])))
table(probs_final_test$V2)
## 
##    0    1    2    3    4    5    6    7    8 
##  454 1640 1029  514 1176  239 2696 3733  275
confusionMatrix(probs_final_test$V2,test_y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2    3    4    5    6    7    8
##          0  329   16    3    1   26    7   42   22    8
##          1   25 1469   31    0    4   69   34    7    1
##          2    2   31  876    1    4   75   34    5    1
##          3    1    2    1  503    0    0    1    5    1
##          4   36    3    2    1  723    4    5  336   66
##          5   12   22   27    0    0  165   12    0    1
##          6   39   36   32    0    2   33 2526    7   21
##          7   49   22   18   14  752   17   26 2633  202
##          8    5    0    1    2   27    1    3   28  208
## 
## Overall Statistics
##                                          
##                Accuracy : 0.8023         
##                  95% CI : (0.795, 0.8095)
##     No Information Rate : 0.2588         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.7591         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.66064   0.9176  0.88396  0.96360   0.4701  0.44474
## Specificity           0.98890   0.9832  0.98579  0.99902   0.9557  0.99350
## Pos Pred Value        0.72467   0.8957  0.85131  0.97860   0.6148  0.69038
## Neg Pred Value        0.98505   0.9870  0.98928  0.99831   0.9230  0.98211
## Prevalence            0.04236   0.1362  0.08430  0.04440   0.1308  0.03156
## Detection Rate        0.02799   0.1250  0.07452  0.04279   0.0615  0.01404
## Detection Prevalence  0.03862   0.1395  0.08753  0.04372   0.1000  0.02033
## Balanced Accuracy     0.82477   0.9504  0.93487  0.98131   0.7129  0.71912
##                      Class: 6 Class: 7 Class: 8
## Sensitivity            0.9415   0.8653  0.40864
## Specificity            0.9813   0.8738  0.99404
## Pos Pred Value         0.9369   0.7053  0.75636
## Neg Pred Value         0.9827   0.9489  0.97378
## Prevalence             0.2282   0.2588  0.04330
## Detection Rate         0.2149   0.2240  0.01769
## Detection Prevalence   0.2293   0.3175  0.02339
## Balanced Accuracy      0.9614   0.8695  0.70134
  • On the test data, boosting gives an accuracy of about 80%.
  • This is almost the same as the accuracy of the random forest model built earlier.
  • Both are ensemble techniques.
  • Boosting is as strong as random forest and in some cases it works better than random forest.
  • We could also tune the parameters, such as the number of iterations.
  • As we increase the iterations the training accuracy keeps improving, and the test accuracy may improve up to a point before overfitting sets in; thus boosting gives us a handle for shaping the accuracy (a hedged sketch of monitoring the test error per round follows below).
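  • Here is a hedged sketch (not part of the handout's run) of monitoring the test error per boosting round with xgb.train, assuming the matrices and labels created above; the canonical underscore parameter names are used here.
# Hedged sketch: track test-set merror per round so that nrounds can be chosen
# where the test error stops improving, rather than trusting the train error
param_watch <- list(objective = "multi:softprob", eval_metric = "merror", num_class = 9)
dtrain <- xgb.DMatrix(data = trainMatrix, label = y)
dtest  <- xgb.DMatrix(data = testMatrix,  label = test_y)
xgb_watch <- xgb.train(params = param_watch, data = dtrain, nrounds = 40,
                       watchlist = list(train = dtrain, test = dtest))
# Each round prints train-merror and test-merror; pick nrounds where the
# test-merror curve flattens out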

When Ensemble doesn’t work?

  • So far we have seen ensemble techniques that work better than the individual models.
  • Now we shall see the situations where these ensemble techniques do not work.
  • That is, situations where ensemble models do not give any edge over ordinary single models.
  • In ensemble techniques, the models have to be independent; we cannot build the same model multiple times and expect the error to reduce.
  • We may have to introduce independence by choosing subsets of data, or subsets of features, while building the individual models.
  • An ensemble may backfire if we use dependent models that are already less accurate; the final ensemble might turn out to be an even worse model.
  • Yes, there is a small disclaimer in the "Wisdom of Crowds" theory: we need good, independent individuals. If we collate dependent individuals with poor knowledge, we might end up with an even worse ensemble.
  • For example, suppose we build three models; model-1 and model-2 are bad but model-3 is good.
  • Most of the time the ensemble will reflect the combined output of model-1 and model-2, because they are dependent and bad.
  • Based on voting, the result of model-1 and model-2 will be final; even though model-3 is good, it will be out-voted, which results in a worse final ensemble (a toy simulation of this effect appears after this list).
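  • A toy simulation of this effect is given below; the accuracy numbers and random data are purely illustrative, not taken from the handout's datasets.
# Toy illustration: majority vote of one good model and two nearly identical
# weak models ends up close to the weak models' accuracy
set.seed(1)
truth <- rbinom(1000, 1, 0.5)
good  <- ifelse(runif(1000) < 0.90, truth, 1 - truth)   # ~90% accurate
bad1  <- ifelse(runif(1000) < 0.60, truth, 1 - truth)   # ~60% accurate
bad2  <- ifelse(runif(1000) < 0.95, bad1, 1 - bad1)     # almost a copy of bad1
vote  <- ifelse(good + bad1 + bad2 >= 2, 1, 0)
mean(good == truth)   # accuracy of the single good model
mean(vote == truth)   # ensemble accuracy is noticeably lower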

LAB: When Ensemble doesn’t work?

  • When the individual models/samples are dependent.
#Data Import
train<- read.csv("~/R Dataset/Car Accidents IOT/Train.csv")
test<- read.csv("~/R Dataset/Car Accidents IOT/Test.csv")

####Logistic Regression
crash_model_logistic <- glm(Fatal ~ . , data=train, family = binomial())
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(crash_model_logistic)
## 
## Call:
## glm(formula = Fatal ~ ., family = binomial(), data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -8.4904  -0.8571   0.3656   0.8242   3.1945  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  8.954e-01  5.412e-01   1.654 0.098067 .  
## S1          -1.045e-02  2.860e-03  -3.653 0.000259 ***
## S2          -3.740e-03  5.454e-03  -0.686 0.492915    
## S3           2.638e-01  6.112e-02   4.316 1.59e-05 ***
## S4           1.605e-03  2.197e-04   7.304 2.80e-13 ***
## S5           3.161e-02  2.718e-03  11.631  < 2e-16 ***
## S6           3.748e-03  2.414e-03   1.553 0.120537    
## S7          -8.739e-04  2.476e-04  -3.530 0.000415 ***
## S8           1.684e-01  3.209e-02   5.247 1.54e-07 ***
## S9          -8.099e-04  7.008e-04  -1.156 0.247805    
## S10         -9.886e+01  9.210e+00 -10.734  < 2e-16 ***
## S11         -1.538e-02  8.875e-04 -17.334  < 2e-16 ***
## S12         -2.447e-01  2.161e-02 -11.324  < 2e-16 ***
## S13          3.227e+00  1.092e-01  29.549  < 2e-16 ***
## S14          7.233e-03  1.663e-03   4.350 1.36e-05 ***
## S15          6.571e-03  4.373e-03   1.503 0.132889    
## S16         -7.763e-02  5.666e-02  -1.370 0.170693    
## S17         -3.497e-04  6.861e-05  -5.097 3.46e-07 ***
## S18         -2.865e-04  4.433e-04  -0.646 0.518052    
## S19         -6.798e-02  6.262e-02  -1.086 0.277665    
## S20         -1.001e-02  2.043e-03  -4.902 9.49e-07 ***
## S21         -4.146e-01  2.398e-02 -17.291  < 2e-16 ***
## S22          1.678e-01  6.718e-03  24.981  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20538  on 15108  degrees of freedom
## Residual deviance: 14794  on 15086  degrees of freedom
## AIC: 14840
## 
## Number of Fisher Scoring iterations: 8
#Training accuracy
predicted_y<-round(predict(crash_model_logistic,type="response"),0)
confusionMatrix(predicted_y,crash_model_logistic$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4394 1300
##          1 1922 7493
##                                           
##                Accuracy : 0.7867          
##                  95% CI : (0.7801, 0.7933)
##     No Information Rate : 0.582           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5556          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.6957          
##             Specificity : 0.8522          
##          Pos Pred Value : 0.7717          
##          Neg Pred Value : 0.7959          
##              Prevalence : 0.4180          
##          Detection Rate : 0.2908          
##    Detection Prevalence : 0.3769          
##       Balanced Accuracy : 0.7739          
##                                           
##        'Positive' Class : 0               
## 
#Accuracy on Test data
predicted_test_logistic<-round(predict(crash_model_logistic,test, type="response"),0)
confusionMatrix(predicted_test_logistic,test$Fatal)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2766  781
##          1 1126 4392
##                                          
##                Accuracy : 0.7896         
##                  95% CI : (0.7811, 0.798)
##     No Information Rate : 0.5707         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.5659         
##  Mcnemar's Test P-Value : 3.343e-15      
##                                          
##             Sensitivity : 0.7107         
##             Specificity : 0.8490         
##          Pos Pred Value : 0.7798         
##          Neg Pred Value : 0.7959         
##              Prevalence : 0.4293         
##          Detection Rate : 0.3051         
##    Detection Prevalence : 0.3913         
##       Balanced Accuracy : 0.7799         
##                                          
##        'Positive' Class : 0              
## 
  • Logistic regression accuracy is about 79% on the test data.
###Decision Tree

library(rpart)
crash_model_ds<-rpart(Fatal ~ ., method="class",   data=train)

#Training accuracy
predicted_y<-predict(crash_model_ds, type="class")
table(predicted_y)
## predicted_y
##    0    1 
## 5544 9565
confusionMatrix(predicted_y,train$Fatal)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4705  839
##          1 1611 7954
##                                           
##                Accuracy : 0.8378          
##                  95% CI : (0.8319, 0.8437)
##     No Information Rate : 0.582           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6609          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7449          
##             Specificity : 0.9046          
##          Pos Pred Value : 0.8487          
##          Neg Pred Value : 0.8316          
##              Prevalence : 0.4180          
##          Detection Rate : 0.3114          
##    Detection Prevalence : 0.3669          
##       Balanced Accuracy : 0.8248          
##                                           
##        'Positive' Class : 0               
## 
#Accuracy on Test data
predicted_test_ds<-predict(crash_model_ds, test, type="class")
confusionMatrix(predicted_test_ds,test$Fatal)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2884  454
##          1 1008 4719
##                                          
##                Accuracy : 0.8387         
##                  95% CI : (0.831, 0.8462)
##     No Information Rate : 0.5707         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.665          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.7410         
##             Specificity : 0.9122         
##          Pos Pred Value : 0.8640         
##          Neg Pred Value : 0.8240         
##              Prevalence : 0.4293         
##          Detection Rate : 0.3181         
##    Detection Prevalence : 0.3682         
##       Balanced Accuracy : 0.8266         
##                                          
##        'Positive' Class : 0              
## 
  • Decision tree accuracy is about 84% on the test data.
####SVM Model
library(e1071)
## Warning: package 'e1071' was built under R version 3.3.2
pc <- proc.time()
crash_model_svm <- svm(Fatal ~ . , type="C", data = train)
proc.time() - pc
##    user  system elapsed 
##   68.13    0.10   68.33
summary(crash_model_svm)
## 
## Call:
## svm(formula = Fatal ~ ., data = train, type = "C")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.04545455 
## 
## Number of Support Vectors:  6992
## 
##  ( 3582 3410 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1
#Confusion Matrix
library(caret)
label_predicted<-predict(crash_model_svm, type = "class")
confusionMatrix(label_predicted,train$Fatal)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4811  538
##          1 1505 8255
##                                           
##                Accuracy : 0.8648          
##                  95% CI : (0.8592, 0.8702)
##     No Information Rate : 0.582           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.716           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7617          
##             Specificity : 0.9388          
##          Pos Pred Value : 0.8994          
##          Neg Pred Value : 0.8458          
##              Prevalence : 0.4180          
##          Detection Rate : 0.3184          
##    Detection Prevalence : 0.3540          
##       Balanced Accuracy : 0.8503          
##                                           
##        'Positive' Class : 0               
## 
#Out of time validation with test data
predicted_test_svm<-predict(crash_model_svm, newdata =test[,-1] , type = "class")
confusionMatrix(predicted_test_svm,test[,1])
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2933  399
##          1  959 4774
##                                           
##                Accuracy : 0.8502          
##                  95% CI : (0.8427, 0.8575)
##     No Information Rate : 0.5707          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6887          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7536          
##             Specificity : 0.9229          
##          Pos Pred Value : 0.8803          
##          Neg Pred Value : 0.8327          
##              Prevalence : 0.4293          
##          Detection Rate : 0.3236          
##    Detection Prevalence : 0.3676          
##       Balanced Accuracy : 0.8382          
##                                           
##        'Positive' Class : 0               
## 
  • SVM Accuracy is 85%.
####Ensemble Model

#The decision tree and SVM factor predictions convert to 1 & 2 via as.numeric();
#shift the logistic 0/1 predictions by 1 so that all three are on the same scale
predicted_test_logistic1<-predicted_test_logistic+1

Ens_predicted_data<-data.frame(lg=as.numeric(predicted_test_logistic1),ds=as.numeric(predicted_test_ds), svm=as.numeric(predicted_test_svm))

#Majority vote: each model contributes 1 (class 0) or 2 (class 1), so the sum ranges
#from 3 to 6; a sum below 4.5 means at least two of the three models predicted class 0
Ens_predicted_data$final<-ifelse(Ens_predicted_data$lg+Ens_predicted_data$ds+Ens_predicted_data$svm<4.5,0,1)
table(Ens_predicted_data$final)
## 
##    0    1 
## 3340 5725
##Ensemble Model accuracy test data
confusionMatrix(Ens_predicted_data$final,test[,1])
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2878  462
##          1 1014 4711
##                                           
##                Accuracy : 0.8372          
##                  95% CI : (0.8294, 0.8447)
##     No Information Rate : 0.5707          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6618          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7395          
##             Specificity : 0.9107          
##          Pos Pred Value : 0.8617          
##          Neg Pred Value : 0.8229          
##              Prevalence : 0.4293          
##          Detection Rate : 0.3175          
##    Detection Prevalence : 0.3685          
##       Balanced Accuracy : 0.8251          
##                                           
##        'Positive' Class : 0               
## 
  • Ensemble accuracy is about 84%, no better than the individual models (and below the SVM's 85%), because the three models are dependent: they are trained on the same data and make similar errors.

Conclusion

  • Ensemble methods are among the most widely used methods these days. With modern computing power, it is not really a huge task to build multiple models.
  • Both bagging and boosting do a good job of reducing error: bagging mainly reduces variance, while boosting mainly reduces bias.
  • Random forests are relatively fast; since we are building many small trees, they do not put a lot of pressure on the computing machine.
  • Random forests can also give variable importance (a short sketch follows after this list). We need to be careful with categorical features, as random forests tend to give higher importance to variables with a larger number of levels.
  • In boosted algorithms we may have to restrict the number of iterations: the more iterations, the lower the training error, but what really matters is the test error, so we restrict the iterations to avoid overfitting.
  • Ensemble models are often the final step for a data scientist: after building the most suitable individual predictive models for the data, we consolidate independent models into an ensemble.
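  • For reference, here is a minimal sketch of extracting variable importance from a random forest, assuming the Car Accidents train data loaded in the LAB above; the model name rf_imp_model is just an illustration.
# Hedged sketch: variable importance from a random forest on the crash data
library(randomForest)
rf_imp_model <- randomForest(as.factor(Fatal) ~ ., data = train, ntree = 100)
importance(rf_imp_model)   # mean decrease in Gini per sensor variable
varImpPlot(rf_imp_model)   # visual ranking; higher values = more important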
