Random Forests
Ensemble Models & Random Forests
Contents
- Introduction
- Ensemble Learning
- How ensemble learning works
- Bagging
- Building models using Bagging
- Random Forest algorithm
- Random Forest model building
- Boosting
- Building models using boosting
- Conclusion
The Wisdom of Crowds
- One should not expend energy trying to identify an expert within a group but instead rely on the group’s collective wisdom, however make sure that Opinions must be independent and some knowledge of the truth must reside with some group members – Surowiecki
- So instead of trying to build one great model, its better to build some independent moderate models and take their average as final prediction
What is Ensemble Learning
- Imagine a classifier problem, there are two classes +1 & -1 in the target
- Imagine that we built a best possible decision tree, it has 91% accuracy
- Let x be the new data point and our decision tree predicts it to be +1. Is there a way I can do better than 91% by using the same data
- Lets build 3 more models on the same data. And see we can improve the performance
- We have four models on the same dataset, Each of them have different accuracy. But unfortunately there seem to be no real improvement in the accuracy.
- What about prediction of the data point x?
- Except the decision tree, the rest all algorithms are predicting the class of x as -1; Intuitively we would like to believe that the class of x is -1
- The combined voting model seem to be having less error than each of the individual models. This is the actual philosophy of ensemble learning
Ensemble Models
- Obtaining a better predictions using multiple models on the same dataset
- Not every time it is possible to find single best fit model for our data, ensemble model combines multiple models to come up with one consolidated model
- Ensemble models work on the principle that multiple moderately accurate models can give us a highly accurate model
- Understandably, the Building and Evaluating the ensemble models is computationally expensive
- Build one really good model is the usual statistical approach. Build many models and average the results is the philosophy of Ensemble learning
Why Ensemble technique works?
- Imagine three models
- M1 with an error rate of 10%
- M2 with an error rate of 10%
- M3 with an error rate of 10%
- The three models have to be independent, we can’t build the same model three times and expect the error to reduce. Any changes to the modeling technique in model -1 should not impact model-2
- In this scenario, the worst ensemble model will have 10% error rate
- The best ensemble model will have an error rate of 2.8%
- 2 out of 3 models predicted wrong + all models predicted wrong
- (3C2)*(0.1)(0.1)(0.9) + (0.1)(0.1)(0.1)
- 2.8% The best ensemble model will have an error rate of 2.8%
Types of Ensemble Models
- The above example is a very primitive type of ensemble model. There are better and statistically stronger ensemble methods that will yield better results
- Two most popular ensemble methodologies are
- Bagging
- Boosting
Bagging
- Take multiple boot strap samples from the population and build classifiers on each of the samples. For prediction take mean or mode of all the individual model predictions.
- Bagging has two major parts 1) Boot strap sampling 2) Aggregation of learners
- Bagging = Bootstrap Aggregating
- In Bagging we combine many unstable models to produce a stable model. Hence the predictors will be very reliable(less variance in the final model).
Boot strapping
- We have a training data is of size N
- Draw random sample with replacement of size N – This gives a new dataset, it might have repeated observations, some observations might not have even appeared once.
- We are selecting records one-at-a-time, returning each selected record back in the population, giving it a chance to be selected again
- Create B such new datasets. These are called boot strap datasets
The Bagging Algorithm
- The training dataset D
- Draw k boot strap sample sets from dataset D
- For each boot strap sample i
- Build a classifier model (M_i)
- We will have total of k classifiers (M_1 , M_2 ,… M_k)
- Vote over for the final classifier output and take the average for regression output
Why Bagging Works
- We are selecting records one-at-a-time, returning each selected record back in the population, giving it a chance to be selected again
- Note that the variance in the consolidated prediction is reduced, if we have independent samples. That way we can reduce the unavoidable errors made by the single model.
- In a given boot strap sample, some observations have chance to select multiple times and some observations might not have selected at all.
- There a proven theory that boot strap samples have only 63% of overall population and rest 37% is not present.
- So the data used in each of these models is not exactly same, This makes our learning models independent. This helps our predictors have the uncorrelated errors.
- Finally the errors from the individual models cancel out and give us a better ensemble model with higher accuracy
- Bagging is really useful when there is lot of variance in our data
LAB: Bagging Models
- Import Boston house price data. It is part of MASS package
- Get some basic meta details of the data
- Take 90% data use it for training and take rest 10% as holdout data
- Build a single linear regression model on the training data.
- On the hold out data, calculate the error (squared deviation) for the regression model.
- Build the regression model using bagging technique. Build at least 25 models
- On the hold out data, calculate the error (squared deviation) for the consolidated bagged regression model.
- What is the improvement of the bagged model when compared with the single model?
Solution
#Importing Boston house pricing data.
library(MASS)
data(Boston)
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
## 6 5.21 28.7
dim(Boston)
## [1] 506 14
##Training and holdout sample
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(500)
sampleseed <- createDataPartition(Boston$medv, p=0.9, list=FALSE)
train_boston<-Boston[sampleseed,]
test_boston<-Boston[-sampleseed,]
###Regression Model
reg_model<- lm(medv ~ ., data=train_boston)
summary(reg_model)
##
## Call:
## lm(formula = medv ~ ., data = train_boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.4763 -2.7684 -0.4912 1.9030 26.4569
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.637e+01 5.534e+00 6.572 1.40e-10 ***
## crim -1.042e-01 3.513e-02 -2.965 0.003195 **
## zn 4.482e-02 1.459e-02 3.073 0.002248 **
## indus 1.986e-02 6.566e-02 0.302 0.762462
## chas 2.733e+00 8.765e-01 3.118 0.001939 **
## nox -1.844e+01 4.018e+00 -4.590 5.79e-06 ***
## rm 3.845e+00 4.670e-01 8.234 2.04e-15 ***
## age 8.782e-04 1.434e-02 0.061 0.951211
## dis -1.488e+00 2.096e-01 -7.101 4.94e-12 ***
## rad 2.770e-01 6.993e-02 3.960 8.71e-05 ***
## tax -1.062e-02 3.944e-03 -2.693 0.007348 **
## ptratio -9.799e-01 1.385e-01 -7.073 5.92e-12 ***
## black 9.620e-03 2.827e-03 3.403 0.000726 ***
## lstat -5.051e-01 5.706e-02 -8.852 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.787 on 444 degrees of freedom
## Multiple R-squared: 0.7309, Adjusted R-squared: 0.723
## F-statistic: 92.75 on 13 and 444 DF, p-value: < 2.2e-16
###Accuracy testing on holdout data
pred_reg<-predict(reg_model, newdata=test_boston[,-14])
reg_err<-sum((test_boston$medv-pred_reg)^2)
reg_err
## [1] 918.5927
###Bagging Ensemble Model
library(ipred)
bagg_model<- bagging(medv ~ ., data=train_boston , nbagg=30)
###Accuracy testing on holout data
pred_bagg<-predict(bagg_model, newdata=test_boston[,-14])
bgg_err<-sum((test_boston$medv-pred_bagg)^2)
bgg_err
## [1] 390.9028
###Overall Improvement
reg_err
## [1] 918.5927
bgg_err
## [1] 390.9028
(reg_err-bgg_err)/reg_err
## [1] 0.5744547
Random Forest
- Like many trees form a forest, many decision tree model together form a Random Forest model
- Random forest is a specific case of bagging methodology. Bagging on decision trees is random forest
- In random forest we induce two types of randomness
- Firstly, we take the boot strap samples of the population and build decision trees on each of the sample.
- While building the individual trees on boot strap samples, we take a subset of the features randomly
- Random forests are very stable they are as good as SVMs and sometimes better
Random Forest Algorithm
- The training dataset D with t number of features
- Draw k boot strap sample sets from dataset D
- For each boot strap sample i
- Build a decision tree model (M_i) using only p number of features (where p<<t)
- Each tree has maximal strength they are fully grown and not pruned.
- We will have total of k decision treed (M_1 , M_2 ,… M_k); Each of these trees are built on reactively different training data and different set of features
- Vote over for the final classifier output and take the average for regression output
The Random Factors in Random Forest
- We need to note the most important aspect of random forest, i.e inducing randomness into the bagging of trees. There are two major sources of randomness
- Randomness in data: Boot strapping, this will make sure that any two samples data is somewhat different
- Randomness in features: While building the decision trees on boot strapped samples we consider only a random subset of features.
- Why to induce the randomness?
- The major trick of ensemble models is the independence of models.
- If we take the same data and build same model for 100 times, we will not see any improvement
- To make all our decision trees independent, we take independent samples set and independent features set
- As a rule of thumb we can consider square root of the number of features, if ‘t’ is very large else p=t/3
Why Random Forest Works
- For a training data with 20 features we are building 100 decision trees with 5 features each, instated of single great decision. The individual trees may be weak classifiers.
- Its like building weak classifiers on subsets of data. The grouping of large sets of random trees generally produces accurate models.
- In this example we have three simple classifiers.
- m1 classifies anything above the line as +1 and below as -1, m2 classifies all the points above the line as -1 and below as +1 and m3 classifies everything on the left as -1 and right as +1
- Each of these models have fair amount of misclassification error.
- All these three weak models together make a strong model.
LAB: Random Forest
- Dataset: /Car Accidents IOT/Train.csv
- Build a decision tree model to predict the fatality of accident
- Build a decision tree model on the training data.
- On the test data, calculate the classification error and accuracy.
- Build a random forest model on the training data.
- On the test data, calculate the classification error and accuracy.
- What is the improvement of the Random Forest model when compared with the single tree?
Solution
#Data Import
train<- read.csv("C:/Amrita/Datavedi/Car Accidents IOT/Train.csv")
test<- read.csv("C:/Amrita/Datavedi/Car Accidents IOT/Test.csv")
dim(train)
## [1] 15109 23
head(train)
## Fatal S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11
## 1 1 36.2247 10.77330 0.243897 596 100.6710 0 0 1 28 0.016064 313
## 2 1 35.7343 17.45510 0.243897 600 100.0000 0 0 1 14 0.015812 319
## 3 1 31.6561 7.61366 0.308763 604 99.3377 0 0 1 4 0.015560 323
## 4 1 33.8320 13.11190 0.293195 616 97.4026 0 0 1 8 0.016001 320
## 5 1 42.5138 13.99850 0.259465 632 94.9367 0 0 1 8 0.016064 322
## 6 1 36.1261 14.85930 0.278925 600 100.0000 0 0 1 4 0.015749 314
## S12 S13 S14 S15 S16 S17 S18 S19 S20 S21 S22
## 1 1 1 57 0 0.280 240 5.99375 0 0.0 4 14.9382
## 2 1 1 57 0 0.175 240 5.99375 0 0.0 4 14.8827
## 3 1 1 58 0 0.280 240 5.99375 0 0.0 4 14.6005
## 4 1 1 58 0 0.385 240 4.50625 0 13.0 4 14.6782
## 5 1 1 57 0 0.070 240 5.99375 0 19.5 4 15.3461
## 6 1 1 58 0 0.175 1008 4.50625 0 23.9 4 15.0559
###Decision Tree
library(rpart)
crash_model_ds<-rpart(Fatal ~ ., method="class", control=rpart.control(minsplit=30, cp=0.03), data=train)
#Training accuarcy
predicted_y<-predict(crash_model_ds, type="class")
table(predicted_y)
## predicted_y
## 0 1
## 5745 9364
confusionMatrix(predicted_y,train$Fatal)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4735 1010
## 1 1581 7783
##
## Accuracy : 0.8285
## 95% CI : (0.8224, 0.8345)
## No Information Rate : 0.582
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.643
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7497
## Specificity : 0.8851
## Pos Pred Value : 0.8242
## Neg Pred Value : 0.8312
## Prevalence : 0.4180
## Detection Rate : 0.3134
## Detection Prevalence : 0.3802
## Balanced Accuracy : 0.8174
##
## 'Positive' Class : 0
##
#Accuaracy on Test data
predicted_test_ds<-predict(crash_model_ds, test, type="class")
confusionMatrix(predicted_test_ds,test$Fatal)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2897 561
## 1 995 4612
##
## Accuracy : 0.8284
## 95% CI : (0.8204, 0.8361)
## No Information Rate : 0.5707
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6448
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7443
## Specificity : 0.8916
## Pos Pred Value : 0.8378
## Neg Pred Value : 0.8225
## Prevalence : 0.4293
## Detection Rate : 0.3196
## Detection Prevalence : 0.3815
## Balanced Accuracy : 0.8179
##
## 'Positive' Class : 0
##
###Random Forest
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
rf_model <- randomForest(as.factor(train$Fatal) ~ ., ntree=200, mtry=ncol(train)/3, data=train)
#Training accuaracy
predicted_y<-predict(rf_model)
table(predicted_y)
## predicted_y
## 0 1
## 5921 9188
confusionMatrix(predicted_y,train$Fatal)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5600 321
## 1 716 8472
##
## Accuracy : 0.9314
## 95% CI : (0.9272, 0.9353)
## No Information Rate : 0.582
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8577
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8866
## Specificity : 0.9635
## Pos Pred Value : 0.9458
## Neg Pred Value : 0.9221
## Prevalence : 0.4180
## Detection Rate : 0.3706
## Detection Prevalence : 0.3919
## Balanced Accuracy : 0.9251
##
## 'Positive' Class : 0
##
#Accuaracy on Test data
predicted_test_rf<-predict(rf_model,test, type="class")
confusionMatrix(predicted_test_rf,test$Fatal)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3479 192
## 1 413 4981
##
## Accuracy : 0.9333
## 95% CI : (0.9279, 0.9383)
## No Information Rate : 0.5707
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8628
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8939
## Specificity : 0.9629
## Pos Pred Value : 0.9477
## Neg Pred Value : 0.9234
## Prevalence : 0.4293
## Detection Rate : 0.3838
## Detection Prevalence : 0.4050
## Balanced Accuracy : 0.9284
##
## 'Positive' Class : 0
##
Boosting
- Boosting is one more famous ensemble method
- Boosting uses a slightly different techniques to that of bagging.
- Boosting is a well proven theory that works really well on many of the machine learning problems like speech recognition
- If bagging is wisdom of crowds then boosting is wisdom of crowds where each individual is given some weight based on their expertise
- Boosting in general decreases the bias error and builds strong predictive models.
- Boosting is an iterative technique. We adjust the weight of the observation based on the previous classification.
- If an observation was classified incorrectly, it tries to increase the weight of this observation and vice versa.
Boosting Main idea
Final Classifier (C = sum alpha_i c_i)
How weighted samples are taken
Boosting Illustration
Below is the training data and their classes We need to take a note of record numbers, they will help us in weighted sampling later
Theory behind Boosting Algorithm
- Take the dataset
- Build a classifier (C_m) and find the error
- Calculate error rate of the classifier
- Error rate of (epsilon _m = sum w_i I (y_i neq C_m (x)) / sum w_i) = Sum of misclassification weight / sum of sample weights
- Calculate an intermediate factor called a. It analogous to accuracy rate of the model. It will be later used in weight updating. It is derived from error
- (alpha _m = log(1- epsilon _m)/epsilon _m))
- Update weights of each record in the sample using the a factor. The indicator function will make sure that the misclassifications are given more weight
- For i =1,2,… N
- (W_(i+1) = w_i e^(alpha _m I(y_ineq C_m (x))))
- Renormalize so that sum of weights is 1
- For i =1,2,… N
- Repeat this model building and weight update process until we have no misclassification
- Final collation is done by voting from all the modes. While taking the votes, each model is weighted by the accuracy factor (alpha)
- (C = sign(sum alpha _i C_i(x)))
Gradient Boosting
- Ada boosting
- Adaptive Boosting
- Till now we discussed Ada boosting technique. Here we give high weight to misclassified records.
- Gradient Boosting
- Similar to Ada boosting algorithm.
- The approach is same but there are slight modifications during re-weighted sampling.
- We update the weights based on misclassification rate and gradient
- Gradient boosting serves better for some class of problems like regression.
LAB: Boosting
- Rightly categorizing the items based on their detailed feature specifications. More than 100 specifications have been collected.
- Data: Ecom_Products_Menu/train.csv
- Build a decision tree model and check the training and testing accuracy
- Build a boosted decision tree.
- Is there any improvement from the earlier decision tree
Solution
train <- read.csv("C:/Amrita/Datavedi/Ecom_Products_Menu/train.csv")
test <- read.csv("C:/Amrita/Datavedi/Ecom_Products_Menu/test.csv")
dim(train)
## [1] 50122 102
##Decison Tree
library(rpart)
ecom_products_ds<-rpart(Category ~ ., method="class", control=rpart.control(minsplit=30, cp=0.01), data=train[,-1])
library(rattle)
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
fancyRpartPlot(ecom_products_ds)
#Training accuarcy
library(caret)
predicted_y<-predict(ecom_products_ds, type="class")
table(predicted_y)
## predicted_y
## Accessories Appliances Camara Ipod Laptops
## 0 10899 2733 2442 0
## Mobiles Personal_Care Tablets TV
## 0 10288 23760 0
confusionMatrix(predicted_y,train$Category)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Accessories Appliances Camara Ipod Laptops Mobiles
## Accessories 0 0 0 0 0 0
## Appliances 825 5536 1086 130 506 709
## Camara 88 387 1456 4 55 388
## Ipod 30 17 23 2032 144 5
## Laptops 0 0 0 0 0 0
## Mobiles 0 0 0 0 0 0
## Personal_Care 110 308 152 0 18 79
## Tablets 1288 615 1247 51 5743 377
## TV 0 0 0 0 0 0
## Reference
## Prediction Personal_Care Tablets TV
## Accessories 0 0 0
## Appliances 1035 932 140
## Camara 252 84 19
## Ipod 13 159 19
## Laptops 0 0 0
## Mobiles 0 0 0
## Personal_Care 9545 19 57
## Tablets 607 11885 1947
## TV 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.6076
## 95% CI : (0.6033, 0.6119)
## No Information Rate : 0.2609
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5053
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Accessories Class: Appliances Class: Camara
## Sensitivity 0.00000 0.8066 0.36731
## Specificity 1.00000 0.8760 0.97233
## Pos Pred Value NaN 0.5079 0.53275
## Neg Pred Value 0.95329 0.9662 0.94708
## Prevalence 0.04671 0.1369 0.07909
## Detection Rate 0.00000 0.1105 0.02905
## Detection Prevalence 0.00000 0.2174 0.05453
## Balanced Accuracy 0.50000 0.8413 0.66982
## Class: Ipod Class: Laptops Class: Mobiles
## Sensitivity 0.91655 0.000 0.00000
## Specificity 0.99144 1.000 1.00000
## Pos Pred Value 0.83210 NaN NaN
## Neg Pred Value 0.99612 0.871 0.96892
## Prevalence 0.04423 0.129 0.03108
## Detection Rate 0.04054 0.000 0.00000
## Detection Prevalence 0.04872 0.000 0.00000
## Balanced Accuracy 0.95400 0.500 0.50000
## Class: Personal_Care Class: Tablets Class: TV
## Sensitivity 0.8335 0.9087 0.00000
## Specificity 0.9808 0.6794 1.00000
## Pos Pred Value 0.9278 0.5002 NaN
## Neg Pred Value 0.9521 0.9547 0.95647
## Prevalence 0.2285 0.2609 0.04353
## Detection Rate 0.1904 0.2371 0.00000
## Detection Prevalence 0.2053 0.4740 0.00000
## Balanced Accuracy 0.9071 0.7941 0.50000
#Accuarcy on Test data
predicted_test_ds<-predict(ecom_products_ds, test[,-1], type="class")
confusionMatrix(predicted_test_ds,test$Category)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Accessories Appliances Camara Ipod Laptops Mobiles
## Accessories 0 0 0 0 0 0
## Appliances 172 1308 269 40 92 170
## Camara 15 80 383 1 16 95
## Ipod 14 4 3 469 28 0
## Laptops 0 0 0 0 0 0
## Mobiles 0 0 0 0 0 0
## Personal_Care 23 75 42 0 1 23
## Tablets 274 134 294 12 1401 83
## TV 0 0 0 0 0 0
## Reference
## Prediction Personal_Care Tablets TV
## Accessories 0 0 0
## Appliances 234 210 42
## Camara 52 23 3
## Ipod 3 49 5
## Laptops 0 0 0
## Mobiles 0 0 0
## Personal_Care 2242 10 17
## Tablets 152 2751 442
## TV 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.6085
## 95% CI : (0.5996, 0.6173)
## No Information Rate : 0.2588
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5071
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Accessories Class: Appliances Class: Camara
## Sensitivity 0.00000 0.8170 0.38648
## Specificity 1.00000 0.8790 0.97353
## Pos Pred Value NaN 0.5156 0.57335
## Neg Pred Value 0.95764 0.9682 0.94517
## Prevalence 0.04236 0.1362 0.08430
## Detection Rate 0.00000 0.1113 0.03258
## Detection Prevalence 0.00000 0.2158 0.05682
## Balanced Accuracy 0.50000 0.8480 0.68000
## Class: Ipod Class: Laptops Class: Mobiles
## Sensitivity 0.89847 0.0000 0.00000
## Specificity 0.99056 1.0000 1.00000
## Pos Pred Value 0.81565 NaN NaN
## Neg Pred Value 0.99526 0.8692 0.96844
## Prevalence 0.04440 0.1308 0.03156
## Detection Rate 0.03989 0.0000 0.00000
## Detection Prevalence 0.04891 0.0000 0.00000
## Balanced Accuracy 0.94452 0.5000 0.50000
## Class: Personal_Care Class: Tablets Class: TV
## Sensitivity 0.8356 0.9040 0.0000
## Specificity 0.9789 0.6796 1.0000
## Pos Pred Value 0.9215 0.4963 NaN
## Neg Pred Value 0.9527 0.9530 0.9567
## Prevalence 0.2282 0.2588 0.0433
## Detection Rate 0.1907 0.2340 0.0000
## Detection Prevalence 0.2070 0.4715 0.0000
## Balanced Accuracy 0.9073 0.7918 0.5000
###Boosting
library(xgboost)
library(methods)
library(data.table)
library(magrittr)
# converting datasets to Numeric format. xgboost needs at least one numeric column
train[,c(-1,-102)] <- lapply( train[,c(-1,-102)], as.numeric)
test[,c(-1,-102)] <- lapply( test[,c(-1,-102)], as.numeric)
# converting datasets to Matrix format. Data frame is not supported by xgboost
trainMatrix <- train[,c(-1,-102)] %>% as.matrix
testMatrix <- test[,c(-1,-102)] %>% as.matrix
#The label should be in numeric format and it should start from 0
y<-as.integer(train$Category)-1
table(y,train$Category)
##
## y Accessories Appliances Camara Ipod Laptops Mobiles Personal_Care
## 0 2341 0 0 0 0 0 0
## 1 0 6863 0 0 0 0 0
## 2 0 0 3964 0 0 0 0
## 3 0 0 0 2217 0 0 0
## 4 0 0 0 0 6466 0 0
## 5 0 0 0 0 0 1558 0
## 6 0 0 0 0 0 0 11452
## 7 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0
##
## y Tablets TV
## 0 0 0
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## 7 13079 0
## 8 0 2182
test_y<-as.integer(test$Category)-1
table(test_y,test$Category)
##
## test_y Accessories Appliances Camara Ipod Laptops Mobiles Personal_Care
## 0 498 0 0 0 0 0 0
## 1 0 1601 0 0 0 0 0
## 2 0 0 991 0 0 0 0
## 3 0 0 0 522 0 0 0
## 4 0 0 0 0 1538 0 0
## 5 0 0 0 0 0 371 0
## 6 0 0 0 0 0 0 2683
## 7 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0
##
## test_y Tablets TV
## 0 0 0
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## 7 3043 0
## 8 0 509
#Setting the parameters for multiclass classification
param <- list("objective" = "multi:softprob","eval.metric" = "merror", "num_class" =9)
#"multi:softmax" --set XGBoost to do multiclass classification using the softmax objective, you also need to set num_class(number of classes)
#"merror": Multiclass classification error rate. It is calculated as #(wrong cases)/#(all cases).
XGBModel <- xgboost(param=param, data = trainMatrix, label = y, nrounds=40)
## [0] train-merror:0.269223
## [1] train-merror:0.241750
## [2] train-merror:0.229500
## [3] train-merror:0.222776
## [4] train-merror:0.218966
## [5] train-merror:0.211923
## [6] train-merror:0.208312
## [7] train-merror:0.203703
## [8] train-merror:0.199553
## [9] train-merror:0.196481
## [10] train-merror:0.192969
## [11] train-merror:0.190695
## [12] train-merror:0.188241
## [13] train-merror:0.185487
## [14] train-merror:0.183193
## [15] train-merror:0.180400
## [16] train-merror:0.177886
## [17] train-merror:0.175552
## [18] train-merror:0.173217
## [19] train-merror:0.171362
## [20] train-merror:0.168968
## [21] train-merror:0.166474
## [22] train-merror:0.164379
## [23] train-merror:0.162743
## [24] train-merror:0.161925
## [25] train-merror:0.160389
## [26] train-merror:0.158214
## [27] train-merror:0.156478
## [28] train-merror:0.155521
## [29] train-merror:0.154284
## [30] train-merror:0.152628
## [31] train-merror:0.151271
## [32] train-merror:0.149356
## [33] train-merror:0.147879
## [34] train-merror:0.146283
## [35] train-merror:0.144827
## [36] train-merror:0.143749
## [37] train-merror:0.142053
## [38] train-merror:0.140358
## [39] train-merror:0.139240
#Training accuarcy
predicted_y<-predict(XGBModel, trainMatrix)
probs <- data.frame(matrix(predicted_y, nrow=nrow(train), ncol=9, byrow = TRUE))
probs_final<-as.data.frame(cbind(row.names(probs),apply(probs,1, function(x) c(0:8)[which(x==max(x))])))
table(probs_final$V2)
##
## 0 1 2 3 4 5 6 7 8
## 2140 6969 3997 2227 5142 1242 11418 15605 1382
confusionMatrix(probs_final$V2,y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2 3 4 5 6 7 8
## 0 1820 32 13 1 74 26 94 58 22
## 1 73 6495 123 1 13 129 119 12 4
## 2 13 78 3584 2 4 204 103 9 0
## 3 8 5 3 2192 0 1 0 8 10
## 4 84 20 4 3 3830 5 12 970 214
## 5 28 55 60 2 2 1051 37 7 0
## 6 81 105 93 1 5 95 10987 20 31
## 7 216 73 82 15 2500 46 92 11932 649
## 8 18 0 2 0 38 1 8 63 1252
##
## Overall Statistics
##
## Accuracy : 0.8608
## 95% CI : (0.8577, 0.8638)
## No Information Rate : 0.2609
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8306
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.77745 0.9464 0.90414 0.98872 0.59233 0.67458
## Specificity 0.99330 0.9890 0.99105 0.99927 0.96995 0.99607
## Pos Pred Value 0.85047 0.9320 0.89667 0.98428 0.74485 0.84622
## Neg Pred Value 0.98914 0.9915 0.99176 0.99948 0.94140 0.98963
## Prevalence 0.04671 0.1369 0.07909 0.04423 0.12901 0.03108
## Detection Rate 0.03631 0.1296 0.07151 0.04373 0.07641 0.02097
## Detection Prevalence 0.04270 0.1390 0.07975 0.04443 0.10259 0.02478
## Balanced Accuracy 0.88537 0.9677 0.94759 0.99400 0.78114 0.83532
## Class: 6 Class: 7 Class: 8
## Sensitivity 0.9594 0.9123 0.57379
## Specificity 0.9889 0.9008 0.99729
## Pos Pred Value 0.9623 0.7646 0.90593
## Neg Pred Value 0.9880 0.9668 0.98092
## Prevalence 0.2285 0.2609 0.04353
## Detection Rate 0.2192 0.2381 0.02498
## Detection Prevalence 0.2278 0.3113 0.02757
## Balanced Accuracy 0.9741 0.9066 0.78554
#Accuarcy on Test data
predicted_test_boost<-predict(XGBModel, testMatrix)
probs_test <- data.frame(matrix(predicted_test_boost, nrow=nrow(test), ncol=9, byrow = TRUE))
probs_final_test<-as.data.frame(cbind(row.names(probs_test),apply(probs_test,1, function(x) c(0:8)[which(x==max(x))])))
table(probs_final_test$V2)
##
## 0 1 2 3 4 5 6 7 8
## 446 1654 1037 514 1202 231 2699 3707 266
confusionMatrix(probs_final_test$V2,test_y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2 3 4 5 6 7 8
## 0 327 15 2 1 26 8 38 22 7
## 1 27 1476 34 0 4 66 37 9 1
## 2 1 29 881 0 4 78 34 10 0
## 3 1 1 1 502 0 1 2 4 2
## 4 29 6 2 1 743 4 2 344 71
## 5 11 21 22 0 0 163 13 0 1
## 6 38 35 32 1 2 35 2526 9 21
## 7 58 18 17 15 733 16 26 2620 204
## 8 6 0 0 2 26 0 5 25 202
##
## Overall Statistics
##
## Accuracy : 0.803
## 95% CI : (0.7957, 0.8102)
## No Information Rate : 0.2588
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.76
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.65663 0.9219 0.88900 0.96169 0.4831 0.43935
## Specificity 0.98943 0.9825 0.98551 0.99893 0.9551 0.99403
## Pos Pred Value 0.73318 0.8924 0.84957 0.97665 0.6181 0.70563
## Neg Pred Value 0.98488 0.9876 0.98974 0.99822 0.9247 0.98195
## Prevalence 0.04236 0.1362 0.08430 0.04440 0.1308 0.03156
## Detection Rate 0.02782 0.1256 0.07494 0.04270 0.0632 0.01387
## Detection Prevalence 0.03794 0.1407 0.08821 0.04372 0.1022 0.01965
## Balanced Accuracy 0.82303 0.9522 0.93725 0.98031 0.7191 0.71669
## Class: 6 Class: 7 Class: 8
## Sensitivity 0.9415 0.8610 0.39686
## Specificity 0.9809 0.8752 0.99431
## Pos Pred Value 0.9359 0.7068 0.75940
## Neg Pred Value 0.9827 0.9474 0.97328
## Prevalence 0.2282 0.2588 0.04330
## Detection Rate 0.2149 0.2229 0.01718
## Detection Prevalence 0.2296 0.3153 0.02263
## Balanced Accuracy 0.9612 0.8681 0.69558
When Ensemble doesn’t work?
- The models have to be independent, we can’t build the same model multiple times and expect the error to reduce.
- We may have to bring in the independence by choosing subsets of data, or subset of features while building the individual models
- Ensemble may backfire if we use dependent models that are already less accurate. The final ensemble might turn out to be even worse model.
- Yes, there is a small disclaimer in “Wisdom of Crowd” theory. We need good independent individuals. If we collate any dependent individuals with poor knowledge, then we might end with an even worse ensemble.
- For example, we built three models, model-1 , model-2 are bad, model-3 is good. Most of the times ensemble will result the combined output of model-1 and model-2, based on voting
LAB: When Ensemble doesn’t work?
- When the individual models/ sample are dependent
#Data Import
train<- read.csv("C:/Amrita/Datavedi/Car Accidents IOT/Train.csv")
test<- read.csv("C:/Amrita/Datavedi/Car Accidents IOT/Test.csv")
####Logistic Regression
crash_model_logistic <- glm(Fatal ~ . , data=train, family = binomial())
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(crash_model_logistic)
##
## Call:
## glm(formula = Fatal ~ ., family = binomial(), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.4904 -0.8571 0.3656 0.8242 3.1945
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 8.954e-01 5.412e-01 1.654 0.098067 .
## S1 -1.045e-02 2.860e-03 -3.653 0.000259 ***
## S2 -3.740e-03 5.454e-03 -0.686 0.492915
## S3 2.638e-01 6.112e-02 4.316 1.59e-05 ***
## S4 1.605e-03 2.197e-04 7.304 2.80e-13 ***
## S5 3.161e-02 2.718e-03 11.631 < 2e-16 ***
## S6 3.748e-03 2.414e-03 1.553 0.120537
## S7 -8.739e-04 2.476e-04 -3.530 0.000415 ***
## S8 1.684e-01 3.209e-02 5.247 1.54e-07 ***
## S9 -8.099e-04 7.008e-04 -1.156 0.247805
## S10 -9.886e+01 9.210e+00 -10.734 < 2e-16 ***
## S11 -1.538e-02 8.875e-04 -17.334 < 2e-16 ***
## S12 -2.447e-01 2.161e-02 -11.324 < 2e-16 ***
## S13 3.227e+00 1.092e-01 29.549 < 2e-16 ***
## S14 7.233e-03 1.663e-03 4.350 1.36e-05 ***
## S15 6.571e-03 4.373e-03 1.503 0.132889
## S16 -7.763e-02 5.666e-02 -1.370 0.170693
## S17 -3.497e-04 6.861e-05 -5.097 3.46e-07 ***
## S18 -2.865e-04 4.433e-04 -0.646 0.518052
## S19 -6.798e-02 6.262e-02 -1.086 0.277665
## S20 -1.001e-02 2.043e-03 -4.902 9.49e-07 ***
## S21 -4.146e-01 2.398e-02 -17.291 < 2e-16 ***
## S22 1.678e-01 6.718e-03 24.981 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 20538 on 15108 degrees of freedom
## Residual deviance: 14794 on 15086 degrees of freedom
## AIC: 14840
##
## Number of Fisher Scoring iterations: 8
#Training accuarcy
predicted_y<-round(predict(crash_model_logistic,type="response"),0)
confusionMatrix(predicted_y,crash_model_logistic$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4394 1300
## 1 1922 7493
##
## Accuracy : 0.7867
## 95% CI : (0.7801, 0.7933)
## No Information Rate : 0.582
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5556
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.6957
## Specificity : 0.8522
## Pos Pred Value : 0.7717
## Neg Pred Value : 0.7959
## Prevalence : 0.4180
## Detection Rate : 0.2908
## Detection Prevalence : 0.3769
## Balanced Accuracy : 0.7739
##
## 'Positive' Class : 0
##
#Accuarcy on Test data
predicted_test_logistic<-round(predict(crash_model_logistic,test, type="response"),0)
confusionMatrix(predicted_test_logistic,test$Fatal)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2766 781
## 1 1126 4392
##
## Accuracy : 0.7896
## 95% CI : (0.7811, 0.798)
## No Information Rate : 0.5707
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5659
## Mcnemar's Test P-Value : 3.343e-15
##
## Sensitivity : 0.7107
## Specificity : 0.8490
## Pos Pred Value : 0.7798
## Neg Pred Value : 0.7959
## Prevalence : 0.4293
## Detection Rate : 0.3051
## Detection Prevalence : 0.3913
## Balanced Accuracy : 0.7799
##
## 'Positive' Class : 0
##
###Decision Tree
library(rpart)
crash_model_ds<-rpart(Fatal ~ ., method="class", data=train)
#Training accuarcy
predicted_y<-predict(crash_model_ds, type="class")
table(predicted_y)
## predicted_y
## 0 1
## 5544 9565
confusionMatrix(predicted_y,train$Fatal)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4705 839
## 1 1611 7954
##
## Accuracy : 0.8378
## 95% CI : (0.8319, 0.8437)
## No Information Rate : 0.582
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6609
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7449
## Specificity : 0.9046
## Pos Pred Value : 0.8487
## Neg Pred Value : 0.8316
## Prevalence : 0.4180
## Detection Rate : 0.3114
## Detection Prevalence : 0.3669
## Balanced Accuracy : 0.8248
##
## 'Positive' Class : 0
##
#Accuaracy on Test data
predicted_test_ds<-predict(crash_model_ds, test, type="class")
confusionMatrix(predicted_test_ds,test$Fatal)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2884 454
## 1 1008 4719
##
## Accuracy : 0.8387
## 95% CI : (0.831, 0.8462)
## No Information Rate : 0.5707
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.665
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7410
## Specificity : 0.9122
## Pos Pred Value : 0.8640
## Neg Pred Value : 0.8240
## Prevalence : 0.4293
## Detection Rate : 0.3181
## Detection Prevalence : 0.3682
## Balanced Accuracy : 0.8266
##
## 'Positive' Class : 0
##
####SVM Model
library(e1071)
pc <- proc.time()
crash_model_svm <- svm(Fatal ~ . , type="C", data = train)
proc.time() - pc
## user system elapsed
## 89.49 0.13 92.84
summary(crash_model_svm)
##
## Call:
## svm(formula = Fatal ~ ., data = train, type = "C")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.04545455
##
## Number of Support Vectors: 6992
##
## ( 3582 3410 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
#Confusion Matrix
library(caret)
label_predicted<-predict(crash_model_svm, type = "class")
confusionMatrix(label_predicted,train$Fatal)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4811 538
## 1 1505 8255
##
## Accuracy : 0.8648
## 95% CI : (0.8592, 0.8702)
## No Information Rate : 0.582
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.716
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7617
## Specificity : 0.9388
## Pos Pred Value : 0.8994
## Neg Pred Value : 0.8458
## Prevalence : 0.4180
## Detection Rate : 0.3184
## Detection Prevalence : 0.3540
## Balanced Accuracy : 0.8503
##
## 'Positive' Class : 0
##
#Out of time validation with test data
predicted_test_svm<-predict(crash_model_svm, newdata =test[,-1] , type = "class")
confusionMatrix(predicted_test_svm,test[,1])
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2933 399
## 1 959 4774
##
## Accuracy : 0.8502
## 95% CI : (0.8427, 0.8575)
## No Information Rate : 0.5707
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6887
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7536
## Specificity : 0.9229
## Pos Pred Value : 0.8803
## Neg Pred Value : 0.8327
## Prevalence : 0.4293
## Detection Rate : 0.3236
## Detection Prevalence : 0.3676
## Balanced Accuracy : 0.8382
##
## 'Positive' Class : 0
##
####Ensemble Model
#DS and SVM are predictng 1 & 2
predicted_test_logistic1<-predicted_test_logistic+1
Ens_predicted_data<-data.frame(lg=as.numeric(predicted_test_logistic1),ds=as.numeric(predicted_test_ds), svm=as.numeric(predicted_test_svm))
Ens_predicted_data$final<-ifelse(Ens_predicted_data$lg+Ens_predicted_data$ds+Ens_predicted_data$svm<4.5,0,1)
table(Ens_predicted_data$final)
##
## 0 1
## 3340 5725
##Ensemble Model accuracy test data
confusionMatrix(Ens_predicted_data$final,test[,1])
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2878 462
## 1 1014 4711
##
## Accuracy : 0.8372
## 95% CI : (0.8294, 0.8447)
## No Information Rate : 0.5707
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6618
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7395
## Specificity : 0.9107
## Pos Pred Value : 0.8617
## Neg Pred Value : 0.8229
## Prevalence : 0.4293
## Detection Rate : 0.3175
## Detection Prevalence : 0.3685
## Balanced Accuracy : 0.8251
##
## 'Positive' Class : 0
##
Conclusion
- Ensemble methods are most widely used methods these days. With advanced machines, its not really a huge task to build multiple models.
- Both bagging and boosting does a good job of reducing bias and variance
- Random forests are relatively fast, since we are building many small trees, it doesn’t put lot of pressure on the computing machine
- Random forest can also give the variable importance. We need to be careful with categorical features, random forests tend to give higher importance to variables with higher number of levels.
- In Boosted algorithms we may have to restrict the number of iterations to avoid overfitting
- Ensemble models are the final effort of a data scientist, while building the most suitable predictive model for the data


