Before start our lesson please download the datasets.

Ensemble Models & Random Forests

Contents

Introduction
Ensemble Learning
How ensemble learning works
Bagging
Building models using Bagging
Random Forest algorithm
Random Forest model building
Boosting
Building models using boosting
Conclusion

The Wisdom of Crowds

Let us take an example to understand what exactly the Wisdom Of Crowds is.
Here is a problem statement which says “Estimate the monthly expenditure of a family in a city”.
There are numerous techniques to predict the average monthly expenditure of each family.
For instance, we can use simple descriptive statistics, or we can use multiple variables which together will finally predict the average monthly expenditure.
Let us take a scenario where different predictive models have come up.
On one hand, there is a renowned & eminent professor, who is an expert in predictive modelling and he is the world’s best data scientist and has came up with a single predictive model.
And on the other hand, there is a group of a hundred associate professors who have taken up this challenge independently, by collecting their own data or may be by adopting their own modeling techniques which can use a different set of variables altogether.
Note that, each one of them has come up with a unique predictive model to estimate the expenditure.
That is, 100 assistant professors, who do not know each other, have built 100 different models.
In case-1, we can see that the professor’s model predicted an estimated monthly expenditure of 6500 dollars (six thousand five hundred dollars), whereas in case-2, some of them predicted 8000 dollars, some of them predicted 6000 dollars, some 7500 dollars.
We can see that the average of the 100 predicted values has been calculated from different estimates as 7200 dollars.

The question arises…

There is one model which says 6500 dollars is estimated expenditure and the other says, 7200 dollars is estimated expenditure, then what average of 100 models are.
Thus, these 100 assistant professors might not be as good as the eminent professor.
However, they are good data scientists since they have built fairly good models from different perspectives. And thus it makes sense to choose the average of 100 predictions rather than to rely on a single model.
Here is the definition of wisdom of crowds.
Let us see the example again by putting it in an easy way; wisdom of crowds is instead of taking one best model or instead of looking at one best thing or may be it is better to look at as many fairly good models as possible and take the average.
Here we believe that, the average wisdom of crowds is better than relying on one good model.
In other words, one should not expand energy trying to identify an expert within a group, instead he should rely on the group’s collective wisdom. However one has to ensure that the opinions must be independent.
That is, these 100 assistant professors should not talk to each other, should not be depending on each other.
Otherwise they will not be building 100 different models.
They might not be carrying information from 100 different angles.
If they are dependent on each other, then model-1 might be same as model-2 or as model-3, which does not make sense.
Thus, all these 100 professors have to be independent.
Also some kind of knowledge of the truth must reside with some group members.
Here, the group members are the assistant professors.
Each one of them should be at least good in their own perspective and is expected to have true knowledge.
We can not afford taking any random guesses to build these models.
All of them are also data scientist as well.
So instead of trying to build one great model, it’s always better to build some independent moderate models and take their average as the final prediction.
This is the concept of the wisdom of crowds.

What is Ensemble Learning

Ensemble Learning is completely based on the concept of wisdom of crowds
We build multiple models and then use the average or use all the models as the final model instead of building one single model.
Let us see what exactly is Ensemble Learning.
Imagine a classifier problem, just a binary classifier problem with two classes where we want to classify the data point as +1 (plus one) & -1 (minus one) at the end of building the algorithm.
Let say we want to build the best possible decision tree with 91% accuracy.
Let $x$ be a new data point.
Now, we shall use this decision tree that will finally predict whether $x$ $x$ falls under class +1 or class -1.
Let us suppose that, the decision tree has classified the new data as +1.
Now let us ask ourselves that, is there a way we can do better than 91% by using the same data?
Now the solution for this question is as follows:
Let us build three more models on the same data. And see whether we can improve performance considerably.
We have four models on the same dataset, Each of them have different accuracy. But unfortunately there seem to be no real improvement in the accuracy.
The first one is Decision Tree Model, which has already been built.
The second one is Logistic Regression Model.
The third one is Neural Nets Model.
SVM Model is the fourth.

Each of them has a different accuracy.
We know that Decision Tree has an accuracy of 91% and error of 9%.
Logistic Regression has an accuracy of 90% and an error of 10%.
Neural Network has accuracy of 91% and an error of 9%.
And in the case of SVM, the accuracy level is 92% and error is 8%.
We can observe that there is not much of considerable improvement.
Earlier, the accuracy was 91% and error was 9%.
However, SVM does a slightly better job with accuracy of 92% and error of 8%.
This cannot be considered as a real improvement over what we had earlier.
Now, what exactly is Ensemble Learning.

Now what about prediction of the data point $x$ ?
We know that the new data point $x$ was predicted as +1 by the decision tree model in the initial attempt to build the model using decision tree.
That is, when we substituted new data point $x$ , it was classified as class +1.
Here if we want to see what logistic regression fitting is, when we use the logistic regression model, then the new data point $x$ has been predicted as -1.
The new data point $x$ is predicted as -1 by Neural Nets model.
The new data point $x$ is predicted as -1 again by SVM model.
Only decision tree has predicted it as +1.
The combined voting model seem to be having less error than each of the individual models.
If we choose the voting model, then what voting model means is if we choose all the 4 models and the output is the highest voted value.
Let us say three of them are predicting as -1 and only one is predicting as +1, then we should believe that point as -1.
Based on voting model, we choose the final prediction as -1 and instead of choosing one model if we combined all those four models and take the voting, then finally the prediction might be actually accurate than the initial single model approach.
This is the actual philosophy of ensemble modelling.
Therefore, instead of building one model, we built several models and take their voting.
Instead of building one decision tree we built one, two, three and four models and then each one of the model might not be very strong in their own but we want to combine all of them and then take voting or average.
Then finally identify the predictor based on the average of the combination of all four models.
This is the actual philosophy of ensemble learning.

Ensemble Models

Ensemble technique is all about obtaining better predictions using multiple models on the same dataset instead of building one best model.
Because, it is not always possible to find the single best fit model for our data.
Ensemble model combines multiple models to come up with one consolidated model.
Ensemble models work on the principle which says “multiple models which are moderately accurate can give a highly accurate model”.
Understandably, building and evaluating the ensemble models is computationally expensive.
That is instead of building one model we are building multiple models.
The effort to build ensemble models is far more than that of a single model.
Build one real good model is the usual statistical approach.
Build multiple models and average the results is the philosophy of Ensemble learning which is nothing but the wisdom of crowds.
Thus instead of building one best model, let us build multiple models and combine them for prediction.

Why Ensemble technique works?

Imagine three independent and equivalent models:
- M1 with an error rate of 10%.
- M2 with an error rate of 10%.
- M3 with an error rate of 10%.
The three models have to be independent, because we can’t build the same model three times and expect the error to reduce. Any changes to the modeling technique in model-1 should not impact on model-2.
That is, model-1, model-2 and model-3 should not be doing the same work.
In this scenario, the worst ensemble model will have 10% error rate.
For finding the best ensemble model we need to combine these models and take the voting criteria.
The best ensemble model will have an error rate of 2.8%.
- two out of three models predicted wrong + all models predicted wrong
- = (3C2)*(0.1)(0.1)(0.9) + (0.1)(0.1)(0.1)
- = 2.8%
Thus we just took three moderately good models each with an error rate of 10% and by merely combining them we could get the error rate to 2.8%.

Overview

Now talking about the overview of the previous topic, i.e., of Ensemble Technique.
Instead of building one model, we prefer to build multiple models and take voting.
This voting criteria makes us less prone to errors, thereby increasing the accuracy level much higher than the individual accuracies levels of each models.
We will be looking forward into the techniques of bagging and boosting in next topic.

Types of Ensemble Models

The above example is a very primitive type of ensemble model whose sole purpose was to give you an idea of the whole ensemble technique.
However, practically, ensemble models are built differently.
There are better and statistically stronger ensemble methods that will yield better results.
Two most popular ensemble methodologies are
- Bagging
- Boosting

Bagging

Bagging is one of the ensemble techniques.
Before studying the bagging, let us take an overview about Bootstrap Sampling.
Bootstrap Sampling refers to taking sample points with replacement again and again. And these samples are called bootstrap samples.
Coming back to bagging, the bagging philosophy says, “Take multiple bootstrap samples from the population and build a classifier on each of the samples”.
Let us say, if we have ten bootstrap samples, then we shall have ten models.
For the prediction, take mean or mode, i.e., take the average or take a voting of all the individual model predictions and choose the final predictor.
Bagging has two major parts:
- Bootstrap sampling
- Aggregation of learners
Thus, Bagging = Bootstrap Aggregating
In Bagging we combine multiple moderately stable models to produce one final stable model.
Hence the predictors will be highly reliable.
Thus, the final model will have less variance and highly consistent coefficients. Boot strapping
Let us look at bootstrapping as it is the first step.
If we have a training data set of size N, then we draw samples of size N with replacement, i.e., we take a single data point, then we note it down and we put it back.
Again we take another data point, then note it down and put it back.
We are selecting records one-at-a-time, returning each selected record back in the population, giving it a chance to be selected again
We repeat this process until all of the N samples in the whole dataset are finished, i.e., we take bootstrap sample-1 and we sample it N times.
Then, we draw a random sample of size N with replacement, that gives us a new data set.

Each of these boot strap samples can have repeated observations. Or sometimes, it can have some observations which might not appear even once.
We are selecting records one-at-a-time, returning each of the selected records. We call these records 1,2,3 and so on up to N records.
Thus we select a record randomly, we shall keep it and then we note it down and put it back and again we select.
The next time we might be select the same one and or a different one, it does not matter.
And we repeat the process N times which is a form a bootstrap sample of size N,
And we call it as bootstrap sample-1.
In this way, we create ‘B’ such new sample data sets.
These sets are called bootstrap samples.
This is the bootstrap part is of bagging.

The Bagging Algorithm

We draw $k$ bootstrap sample sets from the training dataset $D$ .
For each bootstrap sample $i$ .
Now build a classifier model $M_i$ , i.e., we will have total of $k$ classifiers $M_1, M_2, M_3 ... M_k$ .
Then, we can either vote over or take average, whichever, we feel is the best way to aggregate.
Let us say we chose to take a vote-over to find the prediction for the final classifier output or we could choose to take the average in case of regression, then that will be the final bagged model.
Thus , we took $k$ , bootstrap sample sets, then we have built $k$ models on each of these.
Then all these models are combined together to form a consolidated model which is called a bagged model.

Why Bagging Works

Recall that we had a similar question, “Why Ensemble Works?”.
In fact, bagging is one of the typical ensemble models.
And we know that, for any ensemble model, we need to make sure that all the samples are independent.
We are selecting records one-at-a-time, returning each of the selected records back to the population, giving it a chance to be selected again.
Note that the variance in the consolidated prediction is reduced, if we have independent samples.
This way, we could reduce the unavoidable errors made by a single model.
Because if we just have one single model, it might catch some unwanted pattern or an outlier.
However, when we draw the bootstrap samples again and again, then we tend to have a very consistent and a robust model by the end of the bagging process.
We know the fact that, in a given bootstrap sample, some observations might get selected multiple times where as some observations might not get a chance at all.
There a proven theory that bootstrap samples contain only 63% of the overall population and rest 37% is not present.
Thus, the data used in each of these models is not exactly same. But this makes our learning models independent and this helps our predictors have the uncorrelated errors.
Finally the errors from the individual models cancel out and give us a better ensemble model with higher accuracy.
Bagging is extremely useful when there is lot of variance in our data.
When there are too many points that are away from the rest of the points or when there are too many outliers may be bagging is the way to go.
As we take bootstrap samples and build several models and finally combine them to build a bagging model.

LAB: Bagging Models

Import Boston house price data.
Get some basic meta details of the data.
Take 90% data use it for training and take rest 10% as holdout data.
Build a single linear regression model on the training data.
On the hold out data, calculate the error (squared deviation) for the regression model.
Build the regression model using bagging technique. Build at least 25 models.
On the hold out data, calculate the error (squared deviation) for the consolidated bagged regression model.
What is the improvement of the bagged model when compared with the single model?

1) Importing Boston house price data

In [1]:

#Importing Boston house price data
import pandas as pd
import sklearn as sk
import numpy as np
import scipy as sp
house=pd.read_csv("datasets/Housing/Boston.csv")

2) Get some basic meta details of the data.

In [2]:

house.head(5)

Out[2]:

	crim	zn	indus	nox	rm	age	dis	rad	tax	ptratio	black	lstat	medv
0	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1	296	15.3	396.90	4.98	24.0
1	0.02731	0.0	7.07	0.469	6.421	78.9	4.9671	2	242	17.8	396.90	9.14	21.6
2	0.02729	0.0	7.07	0.469	7.185	61.1	4.9671	2	242	17.8	392.83	4.03	34.7
3	0.03237	0.0	2.18	0.458	6.998	45.8	6.0622	3	222	18.7	394.63	2.94	33.4
4	0.06905	0.0	2.18	0.458	7.147	54.2	6.0622	3	222	18.7	396.90	5.33	36.2

In [3]:

###columns of the dataset##
house.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
crim       506 non-null float64
zn         506 non-null float64
indus      506 non-null float64
chas       506 non-null int64
nox        506 non-null float64
rm         506 non-null float64
age        506 non-null float64
dis        506 non-null float64
rad        506 non-null int64
tax        506 non-null int64
ptratio    506 non-null float64
black      506 non-null float64
lstat      506 non-null float64
medv       506 non-null float64
dtypes: float64(11), int64(3)
memory usage: 55.4 KB

3) Take 90% data use it for training and take rest 10% as holdout data.

In [4]:

###Splitting the dataset into training and testing datasets
from sklearn.cross_validation import train_test_split
house_train,house_test=train_test_split(house,train_size=0.9)

4) Build a single linear regression model on the training data.

In [5]:

###Building a linear Regression with medv as the predictor variable on the traiing dadaset ###
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(house_train[['crim']+['zn']+['indus']+['chas']+['nox']+['rm']+['age']+['dis']+['rad']+['tax']+['ptratio']+['black']+['lstat']],house_train[['medv']])

Out[5]:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [6]:

###predicting the model on test dataset
predict_test=lr.predict(house_test[['crim']+['zn']+['indus']+['chas']+['nox']+['rm']+['age']+['dis']+['rad']+['tax']+['ptratio']+['black']+['lstat']])

5) On the hold out data, calculate the error (squared deviation) for the regression model.

In [7]:

from sklearn.metrics import mean_squared_error

###error in linear regression model ###
mean_squared_error(house_test['medv'],predict_test, sample_weight=None, multioutput='uniform_average')

Out[7]:

34.536177420855552

6) Build the regression model using bagging technique. Build at least 25 models.

In [8]:

#Build the regression model using bagging technique. 
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

Bag=BaggingRegressor(base_estimator=LinearRegression(), n_estimators=10, max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=False, oob_score=False, warm_start=False, n_jobs=1, random_state=None, verbose=0)
features = list(house.columns[:13])
X = house_train[features]
y = house_train['medv']
Bag.fit(X,y)
bagpredict_test=Bag.predict(house_test[features])
z=(house_test[['medv']])

7) On the hold out data, calculate the error (squared deviation) for the consolidated bagged regression model.

In [9]:

### to estimate the accuracy of the Bagging model ###
mean_squared_error(z, bagpredict_test, sample_weight=None, multioutput='uniform_average')

Out[9]:

35.494008715334623

We can see the error of the model has been reduced.

Random Forest

Random forest is a specific case of ensembling techniques.
Precisely putting, then it is a specific case of bagging methodology.
Bagging, specifically on the decision trees is known as random forest.
Just Like many trees form a forest, many decision tree models together form a Random Forest model.
In random forest, we induce two types of randomness.
Firstly, we take the bootstrap samples of the population and build decision trees on each of the sample.
While building the individual trees on bootstrap samples, we take a subset of the features randomly.
We will not build the model on each of the sample using all the features.
We do not use all the predictor variables.
We do not use all the variables that are impacting.
Instead, we use a random set of variables that the model needs not to be one of the best models.
At least it will use a few variables and give us a loose classifier.
Thus, that is the major point in random forest that we will be using a subset of variables.
Now the question is why the subset of variables are used.
Thus, that if we use one subset of features or variables in model-1; another subset of variables in model-2, and again another sample of features in model-3, then all these models or trees will be totally independent.
Even if, each tree will give us a different result, however, all the trees together might give us a good consolidated random forest model.
Random forests are very stable. They are as good as SVMs and sometimes better than other algorithms.

Random Forest Algorithm

The random forest algorithm is very close to bagging.
With the training dataset $D$ with $t$ number of features let us draw k boot strap sample sets from the dataset $D$ .
For each bootstrap sample $i$ , build a decision tree model $M_i$ using only $p$ number of features, where $p$ is much less than $t$ .
For example; if there is a dataset with 200 features or variables, then we might only use randomly chosen 20 or 30 or 50 or 100 variables.
And we build the decision tree model $M_1$ with $b_1$ bootstrap samples, with p1 number of randomly chosen features.
Again, we build model $M_2$ with $b_2$ bootstrap samples, with p2 number of randomly chosen features.
Similarly we further build models till $M_i$ which are completely independent so there is randomness because of the different bootstrap samples being used.
Note that, $p$ features are randomly chosen thus there is randomness being induced in the second level too.
And finally we consolidate them as the random forest model.
The way we consolidate is again through voting.

Recap..

Let us quickly have a recap.
We have a training dataset $D$ , with $t$ number of features.
We draw $k$ number of bootstrap samples.
For each bootstrap sample $I$ , build a decision tree model $M_i$ using only $p$ number of features where $p$ is much less than t.
Each tree has maximal strength they are fully grown and not pruned.
We shall have a total of $k$ decision trees $M_1, M_2 ,., M_k$ ;
Each of these trees are built on relatively different training data and different set of features
Finally, we shall vote over for the final classifier output.
And then we shall take the average for regression output $w_h$ .

Random forest model is nothing but we have all these models together consolidation is done based on the voting.
If it is regression, then it will take the average of all the outputs that is the regression consolidation.
If it is classification, then let’s say predicting +1 and -1 with 30 trees.
Then 25 trees are predicting as +1 based on voting, then we will say the final prediction is +1 that is the random forest algorithm.

The Random Factors in Random Forest

We need to note the most important aspect of random forest, i.e inducing randomness into the bagging of trees. There are two major sources of randomness.
- Randomness in data: Boot strapping, this will make sure that any two samples data is somewhat different.
- Randomness in features: While building the decision trees on boot strapped samples we consider only a random subset of features.
Why to induce the randomness?
- The major trick of ensemble models is the independence of models.
- If we take the same data and build same model for 100 times, we will not see any improvement
- To make all our decision trees independent, we take independent samples set and independent features set.
- As a rule of thumb, if $'t'$ is very large, then we can consider square root of the number of features; or else $p=t/3$ .

Why Random Forest Works

We need to note the most important aspect of random forest, i.e inducing randomness into the bagging of trees.
There are two major sources of randomness;
- Being the randomness in the data
- Being the randomness in the features.
Randomness in data is induced because of the bootstrapping, thus this will make sure that any two samples from the data are somewhat different.
While randomness in features: is induced while building the decision trees on the bootstrapped samples since we consider only a random subset of the features.
Why to induce the randomness?
The major trick of ensemble models is in “the models being independent”.
If we take the same data and build the same model for 100 times, we shall not see any improvement.
To make all our decision trees independent, we take a set of independent samples and a set of independent features.
As a rule of thumb for choosing the value of $p$ , we should look at the total number of features, $t$ .
We shall consider square root of $'t'$ if the value of $'t'$ is very large.
Else we shall consider ‘t by three’ if the value of $'t'$ is not so large, i.e., we go for one-third of the number of features when the total number of features is not so large.
Why Random Forest works?
For a training data with 20 features, we build 100 decision trees with 5 features each, instead of a single great decision tree, because the individual trees may be weak classifiers.
It is like building weak classifiers on subsets of data. The grouping of large sets of random trees generally produces more accurate models.
Suppose if we have 100 trees and each one them does one single thing very clearly, say each one of them is a fairly good classifier to identify a particular pattern.
Then all of them will make a good random forest in identifying the patterns.

In this example we have three simple classifiers.
M1 classifies anything above the line as +1 and below as -1, M2 classifies all the points above the line as -1 and below as +1 and M3 classifies everything on the left as -1 and right as +1.
Each of these models have fair amount of misclassification error.
All these three weak models together make a strong model.
If we take all these boundaries, then anything inside the boundary is -1 and anything outside the boundary is +1.
Hence that way the model consolidated based on theses 3 models is the best one.
It has almost zero error as it is not wrongly classifying anything.

LAB: Random Forest

Dataset: /Car Accidents IOT/Train.csv
Build a decision tree model to predict the fatality of accident.
Build a decision tree model on the training data.
On the test data, calculate the classification error and accuracy.
Build a random forest model on the training data.
On the test data, calculate the classification error and accuracy.
What is the improvement of the Random Forest model when compared with the single tree?

In [10]:

#Importing dataset
car_train=pd.read_csv("datasetsCar Accidents IOTtrain.csv")
car_test=pd.read_csv("datasetsCar Accidents IOTtest.csv")

In [11]:

from sklearn import tree

var=list(car_train.columns[1:22])
c=car_train[var]
d=car_train['Fatal']

###buildng Decision tree on the training data ####
clf = tree.DecisionTreeClassifier()
clf.fit(c,d)

Out[11]:

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [12]:

#####predicting on test data ####
tree_predict=clf.predict(car_test[var])

In [13]:

from sklearn.metrics import confusion_matrix###for using confusion matrix###
cm1 = confusion_matrix(car_test[['Fatal']],tree_predict)
print(cm1)

[[3250  642]
 [ 733 4440]]

In [14]:

#####from confusion matrix calculate accuracy
total1=sum(sum(cm1))
accuracy_tree=(cm1[0,0]+cm1[1,1])/total1
accuracy_tree

Out[14]:

0.84831770546056262

In [15]:

from sklearn.metrics import confusion_matrix###for using confusion matrix###
cm1 = confusion_matrix(car_test[['Fatal']],tree_predict)
print(cm1)
total1=sum(sum(cm1))
#####from confusion matrix calculate accuracy
accuracy_tree=(cm1[0,0]+cm1[1,1])/total1
accuracy_tree

[[3250  642]
 [ 733 4440]]

Out[15]:

0.84831770546056262

In [16]:

### accuracy_score() also gives the same result[using confusion matrix]
from sklearn.metrics import accuracy_score
accuracy_score(car_test[['Fatal']],tree_predict, normalize=True, sample_weight=None)

Out[16]:

0.84831770546056262

In [17]:

####buliding a random forest classifier on training data#####
from sklearn.ensemble import RandomForestClassifier
forest=RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False, class_weight=None)

forest.fit(c,d)

Out[17]:

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [18]:

###predicting on test data with RF model
forestpredict_test=forest.predict(car_test[var])
e=car_test['Fatal']

In [19]:

###check the accuracy on test data
from sklearn.metrics import confusion_matrix###for using confusion matrix###
cm2 = confusion_matrix(car_test[['Fatal']],forestpredict_test)
print(cm2)
total2=sum(sum(cm2))
#####from confusion matrix calculate accuracy
accuracy_forest=(cm2[0,0]+cm2[1,1])/total2
accuracy_forest

[[3396  496]
 [ 436 4737]]

Out[19]:

0.89718698290126864

We can see an improvement in the Accuracy.

Boosting

Boosting is one more famous ensemble method
Boosting uses a slightly different techniques to that of bagging.
Boosting is a well proven theory that works really well on many of the machine learning problems like speech recognition.
If bagging is wisdom of crowds then boosting is wisdom of crowds where each individual is given some weight based on their expertise.
Boosting in general decreases the bias error and builds strong predictive models.
Boosting is an iterative technique. We adjust the weight of the observation based on the previous classification.
If an observation was classified incorrectly, it tries to increase the weight of this observation and vice versa.

Boosting Main idea

Final Classifier $C = sum alpha_i c_i$

The main idea or philosophy of boosting is as follows:
Take a random sample of the population of size N where each record has a 1/N chance of picking.
Let variable w denote the weight of each observation point.
Initially, the weight (w) is 1/N.
Secondly, we shall pick a random sample from the population and build a classifier on that particular random sample.
Then, we shall note down the accuracy.
Obviously, the classifier might not classify all the records correctly and some might be wrong.
We shall identify the previously misclassified samples and add more weight.
So that means, in the new weighted model, the previously misclassified observations will be picked more often.
Then, we shall build a new model on the re-weighted sample that we just collected.
So, in the new re-weighted model, since we picked up most of the previously wrongly classified records; we expect this model to build a better model for those records.
Now, let us check the error for this new model.
And if the classifier still misclassifies some of the records then we shall repeat the same process.
Make sure that the wrongly classified records are picked more often than the correctly classified ones.
Based on the new weighted samples, we shall build a new model.
The final weighted classifier will be the sum of the products of accuracy and classification of each of the models.
That’s the main idea of boosting.

Recap..

Let us quickly have a recap.
Initially, we need to pick up one of the samples as all the samples will have the same weight.
Then we shall build a model directly.
We know that any model will most likely have some error.
That means, there will be an error, say epsilon-1 and an accuracy factor, say alpha-1 for this particular model.
And then, we update the weight.
That means we resample again from the previous sample which will give us a weighted sample.
How do we take the weighted sample??
That will be, by picking up the previously misclassified records more often.
And those which were correctly classified will be chosen less often.
Then we build a new model C2 , with an error epsilon-2, the accuracy factor is alpha-2.
Again here, whatever is misclassified will be picked more often.
And the correctly classified will be picked less often.
We repeat the same process until we achieve the desired accuracy.

How weighted samples are taken

Let us try to understand in more detail, how the weighted samples are taken.
Consider an example:
Imagine a data with 10 points and actual classes are minus (-) and plus (+).
Now we build model M1. In each observation, The predicted class of M1 will classify the data into plus or minus.
So the let us take model M1.
Here in the observation-1, minus is correctly predicted as minus.
Even in the observation-2, minus is correctly predicted as a minus.
But in the third and the fourth observations, plus is wrongly predicted as minus.
So we can see that the model M1 is wrongly classifying the third, the fourth and the sixth observations.
However the rest of the observations are predicted correctly.
Now we shall take weighted samples, that is, adding more weight to the wrongly predicted observations.
In this weighted sample, the observations 3, 4 and 6 should appear more often than others.
So to build the model M2, observations 1 & 2 have been picked, 3 & 4 have been picked 5, 6 & 7 have been picked.
We could observe that the observation 4 is picked again because 4 has been misclassified. Similarly, observations 3 & 6 have also been picked again because of the misclassification.
Then again we build a model M2 on this new data set.
M2 is classifying everything just like the model M1.
Previously misclassified ones are classified correctly now in the observations 3, 4 & 6 however, observations 5 and 7 are misclassified.
We need to repeat the weighted sampling once again.
This time we shall give more weightage to 5 and 7, because they were previously misclassified in the new weighted sample-1.
They will be picked more often.
Thus in the weighted sample-2; 5 is chosen 3 times, 7 is chosen 3 times, 6 is picked twice and the rest of the observations are picked only once.
Now in the model M3, these are the actual classes.
We can see that M3 predicts everything correctly that is how weighted sampling and boosting work.

Boosting Illustration

To understand boosting clearly, we shall see a visual illustration of the same example that we discussed in the boosting earlier
Now take a look at the data observation record numbers because in boosting the sampling and weighted sampling we need to remember the records
What exactly is the observation number because it will be picked later based on whether it is classified correctly or wrongly

Note – Below is the training data and their classes. And we need to take a note of record numbers, they will help us in weighted sampling later.

These are the 10 data points some of them are positive and some of them are negative
Thus here, in the illustration we can see, 5 positive points, that are in blue and 5 negative points that are in red.

Classifier model M1 is built, anything above the line is – and below the line is +.
3 out of 10 are misclassified by the model M1.
3, 4 and 6: These are the misclassified by model M1.
As per boosting technique, these will be resampled again.
Next time we shall give more weight to these points.
Thus here 3, 4 and 6 are misclassified and clearly; we can see that, anything above is classified as red and anything below the line is classified as blue.
These are correctly classified and these 3 are wrongly classified this is model M1.

This is the first sampling and this is model M1 result.
Now we shall resample and give more weight to the data points 3, 4 and 6 and then build a model M2.
We resampled so that each one of them is picked more often than the others.
Thus the sample points 9 and 10 didn’t appear at all.
But the points 3, 4, 6 are picked again.
M2 is built on this data. Anything above this line is red and below the line is blue.
M2 is classifying the points 5 & 7 incorrectly.

Model M2 made sure that the data points 3, 4 and 6 are classified correctly.
They are positive and they are classified as positive.
From previous model, the current model has an improvement.
This model has classify the points are correctly.
But again this model has misclassified 5 and 7.
Thus in the next iteration, we have to pick 5 and 7 more often than other points that means we will give these observations will be more weights.

This is the 3rd weighted sample; which is the final one.
Here the point 5 is picked thrice and 7 is picked thrice.
Now we build a new model M3 which is built on this data.
Anything to the left hand side of the line is blue and anything to the right is red.
We can observe that M3 is now classifying everything correctly.
Thus we don’t need to do further weighted sampling.

Thus by now we have built three models $M_1, M_2 & M_3$ which all together are giving the final result.
The final model now will be picked on weighted Votes.
For a given data point more than 2 models seem to be indicating the right class.
For example take point 6, it is classified as minus $(-) by M_1 and plus(+) by M_2 and as plus(+) by M_3$ thus the final result will be +
Similarly take a point 2, it will be classified as minus(-) by $M_1$ , minus(-) by $M_2$ and plus(+) by $M_3$ , final result will be minus(-).
So the final weighted combination of all the three models and the predictions will yield a highly accurate model.
That is how the boosting works.

Theory behind Boosting Algorithm

Take the dataset
Build a classifier $c_m$ and find the error
Calculate error rate of the classifier
- Error rate of $epsilon _m = sum w_i I (y_i neq C_m (x)) / sum w_i$ = $Sum of misclassification weight / sum of sample weights$
Calculate an intermediate factor called a. It analogous to accuracy rate of the model. It will be later used in weight updating. It is derived from error
- $alpha _m = log(1- epsilon _m)/epsilon _m)$
Update weights of each record in the sample using the a factor. The indicator function will make sure that the misclassifications are given more weight
- For
  - $W_(i+1) = w_i e^(alpha _m I(y_ineq C_m (x)))$
  - Renormalize so that sum of weights is 1.
Repeat this model building and weight update process until we have no misclassification
Final collation is done by voting from all the modes. While taking the votes, each model is weighted by the accuracy factor
- $C = sign(sum alpha _i C_i(x))$

Gradient Boosting

Ada boosting
- Adaptive Boosting
- Till now we discussed Ada boosting technique. Here we give high weight to misclassified records.
Gradient Boosting
- Similar to Ada boosting algorithm.
- The approach is same but there are slight modifications during re-weighted sampling.
- We update the weights based on misclassification rate and gradient.
- Gradient boosting serves better for some class of problems like regression.

LAB: Boosting

Rightly categorizing the items based on their detailed feature specifications. More than 100 specifications have been collected.
Data: Ecom_Products_Menu/train.csv
Build a decision tree model and check the training and testing accuracy.
Build a boosted decision tree.
Is there any improvement from the earlier decision tree?

In [20]:

#importing the datasets
menu_train=pd.read_csv("datasetsEcom_Products_Menutrain.csv")
menu_test=pd.read_csv("datasetsEcom_Products_Menutest.csv")

In [21]:

lab=list(menu_train.columns[1:101])
g=menu_train[lab]
h=menu_train['Category']

In [22]:

###buildng Decision tree on the training data ####
from sklearn import tree
tree = tree.DecisionTreeClassifier()
tree.fit(g,h)

Out[22]:

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [23]:

#####predicting the tree  on test data ####
tree_predict=tree.predict(menu_test[lab])
from sklearn.metrics import f1_score
f1_score(menu_test['Category'], tree_predict, average='micro')

Out[23]:

0.70891459680163316

In [24]:

##Gradient BOOSTING ##

###Building a gradient boosting clssifier ###
from sklearn import ensemble
from sklearn.ensemble import GradientBoostingClassifier
boost=GradientBoostingClassifier(loss='deviance', learning_rate=0.1, n_estimators=100, subsample=1.0, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, init=None, random_state=None, max_features=None, verbose=0, max_leaf_nodes=None, warm_start=False, presort='auto')

In [25]:

##calculating the time while fitting the Gradient boosting classifier
import datetime
start_time = datetime.datetime.now()
##fitting the gradient boost classifier
boost.fit(g,h)
end_time = datetime.datetime.now()
print(end_time-start_time)

0:02:47.757216

In [26]:

###predicting Gradient boosting model on the test Data
boost_predict=boost.predict(menu_test[lab])
from sklearn.metrics import f1_score
f1_score(menu_test['Category'], boost_predict, average='micro')

Out[26]:

0.78725757060224566

We can see an accuracy of 78% after Gradient boosting model, where as it is 70% in decison tree building. Our accuracy has improved by 8%.

ADA Boosting

In [27]:

##building an AdaBoosting Classifier #### 
from sklearn import ensemble
from sklearn.ensemble import AdaBoostClassifier
ada=AdaBoostClassifier(base_estimator=None, n_estimators=50, learning_rate=1.0, algorithm='SAMME.R', random_state=None)
ada.fit(g,h)

Out[27]:

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)

In [28]:

### Predicting the AdaBoost clssifier on Test Data
ada_predict=ada.predict(menu_test[lab])
from sklearn.metrics import f1_score
f1_score(menu_test['Category'], ada_predict, average='micro')

Out[28]:

0.69555971418849949

When Ensemble doesn’t work?

Till now we have seen most of the ensemble techniques which were working better than the individual models.
Now we shall see the situations where these ensemble techniques do not work.
That is when the ensemble models do not give any edge over other normal models.
In the ensemble techniques, the models have to be independent and we can not build the same model multiple times and expect the error to reduce.
We may have to bring the independence by choosing subsets of data, or subset of features while building the individual models.
Ensemble may backfire if we use dependent models that are already less accurate. The final ensemble might turn out to be even worse model.
Yes, there is a small disclaimer in “Wisdom of Crowds” theory. We need good independent individuals. If we collate any dependent individuals with poor knowledge, then we might end with an even worse ensemble.
For example, we built three models; model-1 and model-2 are bad but model-3 is good.
Most of the times ensemble will result the combined output of model-1 and model-2, because they are dependent and bad.
Based on voting the results of the model-1 and model-2 will be the final. Even though model-3 is good, it will not be considered, which will result in a worse final ensemble.

Conclusion

Ensemble methods are most widely used methods these days. With advanced machines, its not really a huge task to build multiple models.
Both bagging and boosting does a good job of reducing bias and variance.
Random forests are relatively fast, since we are building many small trees, it does not put a lot of pressure on the computing machine.
Random forest can also give the variable importance. We need to be careful with categorical features, as random forests tend to give higher importance to variables with higher number of levels.
In Boosted algorithms we may have to restrict the number of iterations as more the iterations, the lesser would be the error, but what really matters is the testing error, thus we may have to restrict the number of iterations to avoid over fitting.
Ensemble models are the final effort of a data scientist, while building the most suitable predictive model for the data that we build the ensemble model which is the consolidation of independent models.

Select Category

Handout – Random Forest and Boosting in python

Before start our lesson please download the datasets.

Ensemble Models & Random Forests

The Wisdom of Crowds

What is Ensemble Learning

Ensemble Models

Why Ensemble technique works?

Overview

Types of Ensemble Models

The Bagging Algorithm

Why Bagging Works

LAB: Bagging Models

Random Forest

Random Forest Algorithm

Recap..

The Random Factors in Random Forest

Why Random Forest Works

LAB: Random Forest

Boosting

Boosting Main idea

Recap..

How weighted samples are taken

Boosting Illustration

Theory behind Boosting Algorithm

Gradient Boosting

LAB: Boosting

When Ensemble doesn’t work?

Conclusion