Support Vector Machines (SVM)
Contents
- Introduction
- Decision Boundary with largest margin
- SVM – The large margin classifier
- SVM Algorithm
- The kernel trick
- Building an SVM model
- Conclusion
Introduction
SVM is one more black-box method in Machine Learning. Compared to other Machine Learning algorithms, SVM takes a very different approach to learning. SVM was first introduced by Vapnik and Chervonenkis, and initially it was mostly compared with Neural Networks. Neural Networks have some issues with over-fitting and computation time. The theory behind SVM is not very straightforward, but with some effort we can understand it. The in-depth theory and mathematics of SVM require a good knowledge of vector algebra and numerical analysis. We will try to learn the basic principle, philosophy and implementation of SVM. The SVM algorithm has good generalization ability, and there are many applications where SVM works better than neural networks. Most of the time SVM also takes less computation time than Neural Networks. To understand the SVM algorithm we will start with the classifier.
Classifier
A classifier is nothing but a line that separates two classes. A good classifier is one that generalizes well: it works well on both training and testing data. A classifier need not always be a straight line, and it need not be unique; there can be many classifiers that do a good job of separating one class from the other. From these multiple classifiers, how do we choose the best one? The answer lies in the "margin of the classifier".
Margin of Classifier
In the picture above there are two classifiers. If you look at the margin from each classifier to the nearest data points, classifier 1 has a larger margin than classifier 2. The classifier with the maximum margin will generalize well. But why? Let us see this through an example.
In the picture above, two new data points appear at roughly the same location in the two graphs. Classifier 1 still holds good: it still does a good job of separating the blue and red data points. Classifier 2 fails in both cases: the blue triangle is classified as "RED" and the red circular point is classified as "BLUE". Classifier 2 works well on the training data, but on new (test) data points the classifier with the maximum margin (classifier 1) works better than the one with the smaller margin (classifier 2). So the decision boundary or classifier with the maximum margin is the best classifier.
Maximum Margin Classifier
In the picture above there are two classifiers, m1 and m2. m1 has the larger margin and is closest to the data points a, b and c, whereas m2 has a smaller margin and is closest to the data points a, c and d. The better classifier of the two is m1, because it has the larger margin. For a given dataset, the classifier with the maximum margin tends to give the best training and testing accuracy, so we generally prefer the classifier that has the maximum margin.
LAB: Simple Classifiers
- Dataset: Fraud Transaction/Transactions_sample.csv
- Draw a classification graph that shows all the classes
- Build a logistic regression classifier
- Draw the classifier on the data plot
Solution
Transactions_sample <- read.csv("~/SVM/Datasets/Transactions_sample.csv")
head(Transactions_sample)
## id Total_Amount Tr_Count_week Fraud_id
## 1 16078 7294.60 4.79 0
## 2 41365 7659.53 2.45 0
## 3 11666 8259.29 10.77 0
## 4 11824 11630.25 15.29 1
## 5 36414 12286.63 22.18 1
## 6 90 12783.34 16.34 1
names(Transactions_sample)
## [1] "id" "Total_Amount" "Tr_Count_week" "Fraud_id"
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.2
ggplot(Transactions_sample)+geom_point(aes(x=Total_Amount,y=Tr_Count_week,color=factor(Fraud_id),shape=factor(Fraud_id)),size=5)
#####Logit model
logit_model<-glm(Fraud_id~Total_Amount+Tr_Count_week,data=Transactions_sample,family=binomial())
logit_model
##
## Call: glm(formula = Fraud_id ~ Total_Amount + Tr_Count_week, family = binomial(),
## data = Transactions_sample)
##
## Coefficients:
## (Intercept) Total_Amount Tr_Count_week
## -26.148132 0.002534 0.108896
##
## Degrees of Freedom: 209 Total (i.e. Null); 207 Residual
## Null Deviance: 291.1
## Residual Deviance: 16.85 AIC: 22.85
###The classifier slope & intercept
coef(logit_model)
## (Intercept) Total_Amount Tr_Count_week
## -26.148131643 0.002533707 0.108895819
coef(logit_model)[1]
## (Intercept)
## -26.14813
coef(logit_model)[2]
## Total_Amount
## 0.002533707
coef(logit_model)[3]
## Tr_Count_week
## 0.1088958
logit_slope <- coef(logit_model)[2]/(-coef(logit_model)[3])
logit_intercept<- coef(logit_model)[1]/(-coef(logit_model)[3])
###The classifier diagram
base<-ggplot(Transactions_sample)+geom_point(aes(x=Total_Amount,y=Tr_Count_week,color=factor(Fraud_id),shape=factor(Fraud_id)),size=5)
base+geom_abline(intercept = logit_intercept , slope = logit_slope, color = "red", size = 2)
SVM – The large margin classifier
SVM is all about finding the maximum-margin classifier. Classifier is a generic name; mathematically it is called a hyperplane. In a 2-dimensional plane it is simply a line. SVM uses the nearest training data points in the objective space and, based on them, finds the hyperplane (classifier). Each data point is considered a p-dimensional vector (a list of p numbers). To find the optimal hyperplane with the maximum margin, SVM uses vector algebra and mathematical optimisation.
SVM Algorithm
If a dataset is truly linearly separable then we can definitely find a classifier that divides the overall objective space: everything on one side of the classifier is labelled positive and everything on the other side is labelled negative.
Math behind SVM Algorithm
SVM Algorithm – The Math
If you have already understood the SVM technique and find this slide too technical, you may want to skip it; the tool will take care of this optimization.
- The separating hyperplane is w^T x + b = 0
- The two margin boundaries are w^T x + b = 1 and w^T x + b = -1, and the distance between them (the margin) is 2/||w||
- Objective is to maximize 2/||w||
- i.e. minimize ||w||
- i.e. minimize ||w||^2
- A good decision boundary should satisfy
- w^T x_i + b >= 1 for all points with y_i = 1
- w^T x_i + b <= -1 for all points with y_i = -1
- i.e. for all points, y_i (w^T x_i + b) >= 1
- Now we have the optimization problem with objective and constraints
- minimize ||w|| or ||w||^2
- With the constant 1/2: minimize (1/2) ||w||^2 subject to y_i (w^T x_i + b) >= 1 for all i
- We can solve the above optimization problem to obtain w & b (a small numeric check of these conditions is sketched below)
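To make the conditions concrete, here is a small numeric sketch in R. The data, the weight vector w and the intercept b are all hypothetical, hand-picked values (not output of any solver); the sketch only checks the margin constraints and computes the margin width.
# Toy check of the hard-margin conditions (w and b are hypothetical, hand-picked values)
X <- matrix(c( 2, 2,    # two points from the positive class
               3, 3,
               0, 0,    # two points from the negative class
              -1, 0), ncol = 2, byrow = TRUE)
y <- c(1, 1, -1, -1)
w <- c(1, 1)   # hypothetical weight vector
b <- -3        # hypothetical intercept
# The constraint y_i * (w^T x_i + b) >= 1 should hold for every point
all(y * (X %*% w + b) >= 1)
# Margin width between the two supporting hyperplanes: 2 / ||w||
2 / sqrt(sum(w^2))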
SVM Result
SVM is all about fitting the hyperplane that has the maximum margin. The SVM output does not contain any probability; it directly tells us which class a new data point belongs to. For a new point x_k we calculate w^T x_k + b: if this value is positive the prediction is positive, otherwise it is negative. So at the end of SVM we get the class of the new data point as the result, not a probability, as sketched below.
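As a minimal sketch on made-up toy data (not the fraud dataset used later), the following shows how e1071's svm() returns a class directly, and how the underlying signed value of w^T x + b can be inspected through the decision.values option of predict():
# Minimal sketch on hypothetical toy data: SVM returns the class, not a probability
library(e1071)
toy <- data.frame(x1 = c(1, 2, 2, 6, 7, 8),
                  x2 = c(1, 1, 2, 6, 7, 8),
                  class = factor(c("neg", "neg", "neg", "pos", "pos", "pos")))
toy_svm <- svm(class ~ x1 + x2, data = toy, kernel = "linear")
new_point <- data.frame(x1 = 5, x2 = 5)
pred <- predict(toy_svm, new_point, decision.values = TRUE)
pred                            # predicted class only, no probability
attr(pred, "decision.values")   # signed value of w^T x + b for the new point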
SVM on R
There are multiple SVM packages available in R for building SVM models. The package "e1071" is the most widely used; within it there is a function called svm(), along with several other functions. The svm() function has various options that control how the model is built.
library(e1071)
## Warning: package 'e1071' was built under R version 3.3.2
svm_model <- svm(Fraud_id~Total_Amount+Tr_Count_week, data=Transactions_sample)
summary(svm_model)
##
## Call:
## svm(formula = Fraud_id ~ Total_Amount + Tr_Count_week, data = Transactions_sample)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1
## gamma: 0.5
## epsilon: 0.1
##
##
## Number of Support Vectors: 24
LAB: First SVM Learning Problem
- Dataset: Fraud Transaction/Transactions_sample.csv
- Draw a classification graph that shows all the classes
- Build a SVM classifier
- Draw the classifier on the data plots
- Predict the (Fraud vs not-Fraud) class for the data points Total_Amount=11000, Tr_Count_week=15 & Total_Amount=2000, Tr_Count_week=4
- Download the complete Dataset: Fraud Transaction/Transaction.csv
- Draw a classification graph that shows all the classes
- Build a SVM classifier
- Draw the classifier on the data plots
Solution
###SVM Building needs e1071 package
library(e1071)
svm_model <- svm(Fraud_id~Total_Amount+Tr_Count_week, data=Transactions_sample)
summary(svm_model)
##
## Call:
## svm(formula = Fraud_id ~ Total_Amount + Tr_Count_week, data = Transactions_sample)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1
## gamma: 0.5
## epsilon: 0.1
##
##
## Number of Support Vectors: 24
Converting the output into a factor; otherwise svm() will fit a regression model
Transactions_sample$Fraud_id <- factor(Transactions_sample$Fraud_id)
head(Transactions_sample)
SVM Model building
svm_model <- svm(Fraud_id~Total_Amount+Tr_Count_week, data=Transactions_sample)
summary(svm_model)
Plotting the SVM classification graph on the data
ggplot(Transactions_sample)+geom_point(aes(x=Total_Amount,y=Tr_Count_week,color=factor(Fraud_id),shape=factor(Fraud_id)),size=5)
Data With SVM model
plot(svm_model, Transactions_sample,Tr_Count_week~Total_Amount ) #x2~x1
Prediction in SVM
new_data1<-data.frame(Total_Amount=11000, Tr_Count_week=15)
p1<-predict(svm_model, new_data1)
p1
new_data2<-data.frame(Total_Amount=2000, Tr_Count_week=4)
p2<-predict(svm_model, new_data2)
p2
SVM on overall data
Transactions <- read.csv("C:/Amrita/Datavedi/Fraud Transaction/Transaction.csv")
dim(Transactions)
Using type="C" to force a classification model (otherwise svm() would fit a regression model on the numeric output)
svm_model_1 <- svm(Fraud_id~Total_Amount+Tr_Count_week, type="C", data=Transactions)
summary(svm_model_1)
Plotting SVM Classification graph
plot(svm_model_1, Transactions,Tr_Count_week~Total_Amount )
Non-Linear Decision Boundary
So far we have seen a linear classifier. What happens if the decision boundary is non-linear?
From the pictures above we see positive classes, then negative classes, then again some positive classes. In that case, just fitting one line and finding the maximum margin is not meaningful, since the decision boundary is not linear. When the decision boundary is non-linear, SVM struggles to classify the classes; in fact, SVM has no direct theory for non-linear decision boundary models. To fit a non-linear boundary classifier we may have to create new variables or new dimensions in the data and check whether the decision boundary becomes linear in that space. This phenomenon is called the kernel trick.
What do we do in the kernel trick?
In the kernel trick we increase the number of dimensions and try to make the non-linear data linear in a higher dimensional space.
Mapping to higher dimensional space
Example
In this example we have 0's, then 1's, then 0's again; this cannot be linearly classified directly. What we do is create a new variable x2, which is simply (x1)^2. So instead of using just one variable x1, we now use two variables, x1 and x2; we have increased the dimension of the dataset by adding a new variable. After adding the new variable, the objective space is transformed into a new dimensional space and we can clearly see a single linear decision boundary. A single linear decision boundary was not possible in the lower dimensional space, but after increasing the number of dimensions we can check whether a linear decision boundary fits. So SVM does not have a direct theory for non-linear decision boundaries, but by increasing the dimensions with the kernel trick we can fit a non-linear decision boundary. A small sketch of this mapping follows.
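A tiny sketch of this idea with made-up numbers: points of class 1 sit in the middle of the x1 range, so no single cut on x1 separates them, but after adding x2 = x1^2 a horizontal line does.
# Hypothetical 1-D data: class 1 in the middle, class 0 on both sides
x1 <- c(-4, -3, -2, -1, 0, 1, 2, 3, 4)
y  <- factor(c(0, 0, 1, 1, 1, 1, 1, 0, 0))
# New variable: the square of x1
x2 <- x1^2
# In (x1, x2) space the classes can be separated by a horizontal line
plot(x1, x2, col = as.integer(y) + 1, pch = 19)
abline(h = 6, lty = 2)   # one possible linear boundary in the new space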
Kernel Trick
The kernel trick is the most important and trickiest part of SVM, because most of the problems we see are not linearly separable. So whenever we face a non-linear scenario, we have to use the kernel trick.
In the earlier example we used a function f(x) = (x, x^2) to transform the data into a higher dimensional space. In that higher dimensional space, we could easily fit a linear decision boundary. The function f(x) is known as the kernel function and this process is known as the kernel trick. From the picture above we notice that initially the classes are not linearly separable; after applying f(x), the non-linearly separable classes are transformed into a new higher dimensional space. The kernel trick solves the non-linear decision boundary problem much like hidden layers do in neural networks: it simply increases the number of dimensions so that a non-linear decision boundary in the lower dimensional space becomes a linear decision boundary in the higher dimensional space. In simple words, the kernel trick makes the non-linear decision boundary linear (in a higher dimensional space).
In Example 1, if we use the mapping f(x) = (x, x^2) we can transform the non-linear decision boundary into a linear one; x^2 is a parabolic curve, and in that space we can find a linearly separable boundary. In Example 2 it looks like a circle within a circle: class 1 is inside the circle and class 2 is outside. Here a non-linear decision boundary in 2D space is mapped onto a 3D space using a kernel function (for example, f(x1, x2) = (x1, x2, x1^2 + x2^2)). With such a function we can find a linearly separable hyperplane in 3D space using SVM.
Kernel Function Examples
There are many more kernel functions. In some tools, RBF (the radial basis function kernel) is chosen as the default kernel function.
Choosing the kernel Function
Choosing the kernel function is the trickiest part of SVM, and there is no specific theory that tells us which kernel function to use. There is no proven theory that tells us which kernel will work for a given problem, although there is a lot of research going on in this area. In practice, a low-degree polynomial kernel or the RBF kernel is generally used as a first trial. Choosing the kernel function is similar to choosing the number of hidden layers in neural networks: neither has a proven theory to arrive at a standard value. As a first step, we can choose a low-degree polynomial or the radial basis function, or one of the other kernels from the list above; a quick comparison is sketched below.
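As a quick illustration of trying a few kernels with e1071's svm() (shown here on a two-class subset of R's built-in iris data rather than the lab dataset, purely to keep the sketch self-contained; the degree and the comparison by training accuracy are illustrative choices, not recommendations):
# Compare a few kernels on a small two-class problem (illustrative only)
library(e1071)
iris2 <- droplevels(subset(iris, Species != "setosa"))
m_linear <- svm(Species ~ ., data = iris2, kernel = "linear")
m_poly   <- svm(Species ~ ., data = iris2, kernel = "polynomial", degree = 3)
m_rbf    <- svm(Species ~ ., data = iris2, kernel = "radial")
# Training accuracy as a rough first comparison (cross-validation would be fairer)
sapply(list(linear = m_linear, poly = m_poly, rbf = m_rbf),
       function(m) mean(predict(m) == iris2$Species))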
LAB: Kernel – Non linear classifier
- Dataset : Software users/sw_user_profile.csv
- How many variables are there in software user profile data?
- Plot the active users against Age and check whether the relation between Age and the "Active" status is linear or non-linear
- Build an SVM model(model-1), make sure that there is no kernel or the kernel is linear
- For model-1, create the confusion matrix and find out the accuracy
- Create a new variable by using the polynomial (squared) mapping
- Build an SVM model(model-2), with the new data mapped on to higher dimensions. Keep the default kernel in R as linear
- For model-2, create the confusion matrix and find out the accuracy
- Plot the SVM with results.
- With the original data, re-create the model (model-3) and let R choose the default kernel function.
- What is the accuracy of model-3?
Solution
sw_user_profile <- read.csv("H:/studies/DATA ANALYTICS/2.Machine Learning Algorithms/SVM/Datasets/sw_user_profile.csv")
head(sw_user_profile)
## Id Age Active
## 1 1 9.438867 0
## 2 2 8.614807 0
## 3 3 5.817555 0
## 4 4 10.329219 0
## 5 5 6.527926 0
## 6 6 8.231147 0
###How many variables are there in software user profile data?
names(sw_user_profile)
## [1] "Id" "Age" "Active"
###Plot the active users against Age and check whether the relation between Age and "Active" status is linear or non-linear
plot(sw_user_profile$Age,sw_user_profile$Id, col=as.integer(sw_user_profile$Active+1))
###Build an SVM model(model-1), make sure that there is no kernel or the kernel is linear
library(e1071)
svm_model_nl <- svm(Active~Age, type="C", data=sw_user_profile)
summary(svm_model_nl)
##
## Call:
## svm(formula = Active ~ Age, data = sw_user_profile, type = "C")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 1
##
## Number of Support Vectors: 16
##
## ( 8 8 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
###Setting the kernel to linear
svm_model_nl <- svm(Active~Age, type="C", kernel="linear", data=sw_user_profile)
summary(svm_model_nl)
##
## Call:
## svm(formula = Active ~ Age, data = sw_user_profile, type = "C",
## kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
## gamma: 1
##
## Number of Support Vectors: 347
##
## ( 174 173 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
###For model-1, create the confusion matrix and find out the accuracy
library(caret)
## Warning: package 'caret' was built under R version 3.3.2
## Loading required package: lattice
Age_predicted<-predict(svm_model_nl)
confusionMatrix(Age_predicted,sw_user_profile$Active)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 317 173
## 1 0 0
##
## Accuracy : 0.6469
## 95% CI : (0.6028, 0.6893)
## No Information Rate : 0.6469
## P-Value [Acc > NIR] : 0.5207
##
## Kappa : 0
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.6469
## Neg Pred Value : NaN
## Prevalence : 0.6469
## Detection Rate : 0.6469
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 0
##
###Create a new variable by using the polynomial (squared) mapping
###Standardizing the data to visualize the results clearly
sw_user_profile$age_nor<-(sw_user_profile$Age-mean(sw_user_profile$Age))/sd(sw_user_profile$Age)
plot(sw_user_profile$age_nor,sw_user_profile$Id, col=as.integer(sw_user_profile$Active+1))
#Creating the new variable
sw_user_profile$new<-(sw_user_profile$age_nor)^2
plot(sw_user_profile$Age,sw_user_profile$new, col=as.integer(sw_user_profile$Active+1))
###Build an SVM model(model-2), with the new data mapped on to higher dimensions. Keep the default kernel in R as linear
svm_model_2 <- svm(Active~Age+new, type="C", kernel="linear", data=sw_user_profile)
summary(svm_model_2)
##
## Call:
## svm(formula = Active ~ Age + new, data = sw_user_profile, type = "C",
## kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
## gamma: 0.5
##
## Number of Support Vectors: 15
##
## ( 8 7 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
###For model-2, create the confusion matrix and find out the accuracy
library(caret)
Age_predicted<-predict(svm_model_2)
confusionMatrix(Age_predicted,sw_user_profile$Active)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 317 0
## 1 0 173
##
## Accuracy : 1
## 95% CI : (0.9925, 1)
## No Information Rate : 0.6469
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.6469
## Detection Rate : 0.6469
## Detection Prevalence : 0.6469
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : 0
##
###Plot the SVM with results.
plot(svm_model_2, sw_user_profile,new~Age )
###With the original data, re-create the model (model-3) and let R choose the default kernel function.
library(e1071)
svm_model_3 <- svm(Active~Age, type="C", data=sw_user_profile)
summary(svm_model_3)
##
## Call:
## svm(formula = Active ~ Age, data = sw_user_profile, type = "C")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 1
##
## Number of Support Vectors: 16
##
## ( 8 8 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
###What is the accuracy of model-3?
library(caret)
Age_predicted<-predict(svm_model_3)
confusionMatrix(Age_predicted,sw_user_profile$Active)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 317 0
## 1 0 173
##
## Accuracy : 1
## 95% CI : (0.9925, 1)
## No Information Rate : 0.6469
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.6469
## Detection Rate : 0.6469
## Detection Prevalence : 0.6469
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : 0
##
Soft Margin Classification – Noisy Data
So far we have seen examples where the two classes are perfectly separable. What if there is some noise in the data? What if the points are not that easily separable? If there is noise in the data, we should not try to fit it perfectly, because that leads to over-fitting. To get a well-generalized model we need to ignore the noise points, which means we fit a model that might not classify positives from negatives 100% correctly; we allow some errors.
In the picture above we see a hyperplane even though some data points are wrongly classified; this is called allowing a slack variable in the hyperplane. SVM will find the maximum-margin classifier while allowing some minimum error due to noise; this is called soft margin classification for noisy data. A short sketch of the cost parameter that controls this trade-off follows.
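A hedged sketch of the trade-off using e1071's cost parameter, on the Transactions_sample data loaded earlier; the cost values here are illustrative, not recommendations, and the support-vector count is only used as a rough indicator of how soft the margin is.
# Small cost => softer margin (more violations tolerated); large cost => harder margin
library(e1071)
svm_soft <- svm(Fraud_id~Total_Amount+Tr_Count_week, type="C",
                data=Transactions_sample, kernel="linear", cost=0.1)
svm_hard <- svm(Fraud_id~Total_Amount+Tr_Count_week, type="C",
                data=Transactions_sample, kernel="linear", cost=100)
# A softer margin typically keeps more support vectors
c(soft = svm_soft$tot.nSV, hard = svm_hard$tot.nSV)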
SVM Validation
SVM doesn’t give us the probability, it directly gives us the resultant classes.Usual methods of validation like sensitivity, specificity, cross validation, ROC and AUC are the validation methods
SVM Advantages & Disadvantages
SVM Advantages
1. SVMs are very good when we have no prior idea about the data.
2. They work well even with unstructured and semi-structured data like text, images and trees.
3. The kernel trick is the real strength of SVM. With an appropriate kernel function, we can solve very complex problems.
4. Unlike neural networks, SVM does not get stuck at local optima.
5. It scales relatively well to high dimensional data.
6. SVM models generalize well in practice; the risk of over-fitting is lower in SVM compared to other algorithms.
SVM Disadvantages
1. Choosing a "good" kernel function is not easy.
2. Training takes a long time on large datasets.
3. The final model, variable weights and individual impacts are difficult to understand and interpret.
4. Since the final model is not easy to see, we cannot make small calibrations to it, so it is hard to incorporate our business logic.
SVM Application
SVM can be applied wherever neural networks are applicable, and it suits most business classification problems. Some application areas:
1. Protein structure prediction
2. Intrusion detection
3. Handwriting recognition
4. Detecting steganography in digital images
5. Breast cancer diagnosis
LAB: Digit Recognition using SVM
- Take an image of a handwritten single digit, and determine what that digit is.
- Normalized handwritten digits, automatically scanned from envelopes by the U.S. Postal Service. The original scanned digits are binary and of different sizes and orientations; the images here have been deslanted and size normalized, resulting in 16 x 16 grayscale images (Le Cun et al., 1990).
- The data are in two gzipped files, and each line consists of the digit id (0-9) followed by the 256 grayscale values.
- Build an SVM model that can be used as the digit recognizer
- Use the test dataset to validate the true classification power of the model
- What is the final accuracy of the model?
Solution
#Importing test and training data
digits_train <- read.table("~/SVM/Datasets/Digit Recognizer/USPS/zip.train.txt", quote="\"", comment.char="")
digits_test <- read.table("~/SVM/Datasets/Digit Recognizer/USPS/zip.test.txt", quote="\"", comment.char="")
dim(digits_train)
## [1] 7291 257
dim(digits_test)
## [1] 2007 257
#Lets see some images.
for(i in 1:6 )
{
data_row<-digits_train[i,-1]
pixels = matrix(as.numeric(data_row),16,16,byrow=TRUE)
image(pixels, axes = FALSE)
title(main = paste("Label is" , digits_train[i,1]), font.main = 4)
}
#Are there any missing values?
sum(is.na(digits_train))
## [1] 0
sum(is.na(digits_test))
## [1] 0
#The first variable is label
table(digits_train$V1)
##
## 0 1 2 3 4 5 6 7 8 9
## 1194 1005 731 658 652 556 664 645 542 644
table(digits_test$V1)
##
## 0 1 2 3 4 5 6 7 8 9
## 359 264 198 166 200 160 170 147 166 177
########SVM Model Building
library(e1071)
#Lets keep an eye on runtime
pc <- proc.time()
#Verify the code with limited data 5000 rows
number.svm <- svm(V1 ~. , type="C", data = digits_train[1:5000,])
proc.time() - pc
## user system elapsed
## 18.19 0.11 18.33
summary(number.svm)
##
## Call:
## svm(formula = V1 ~ ., data = digits_train[1:5000, ], type = "C")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.00390625
##
## Number of Support Vectors: 2028
##
## ( 181 232 245 189 195 45 220 206 305 210 )
##
##
## Number of Classes: 10
##
## Levels:
## 0 1 2 3 4 5 6 7 8 9
#Confusion Matrix
library(caret)
label_predicted<-predict(number.svm, type = "class")
confusionMatrix(label_predicted,digits_train[1:5000, 1])
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2 3 4 5 6 7 8 9
## 0 847 0 0 0 0 0 1 0 0 0
## 1 0 674 1 0 1 0 1 0 0 0
## 2 0 0 484 0 0 1 0 0 0 0
## 3 0 0 1 392 0 0 0 0 1 1
## 4 0 0 2 0 429 0 0 1 0 0
## 5 0 0 0 1 0 350 1 0 2 0
## 6 0 0 0 0 1 1 475 0 0 0
## 7 0 0 0 0 0 0 0 459 1 2
## 8 0 0 0 2 0 0 0 0 383 0
## 9 0 0 0 0 3 0 0 1 0 481
##
## Overall Statistics
##
## Accuracy : 0.9948
## 95% CI : (0.9924, 0.9966)
## No Information Rate : 0.1694
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9942
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 1.0000 1.0000 0.9918 0.9924 0.9885 0.9943
## Specificity 0.9998 0.9993 0.9998 0.9993 0.9993 0.9991
## Pos Pred Value 0.9988 0.9956 0.9979 0.9924 0.9931 0.9887
## Neg Pred Value 1.0000 1.0000 0.9991 0.9993 0.9989 0.9996
## Prevalence 0.1694 0.1348 0.0976 0.0790 0.0868 0.0704
## Detection Rate 0.1694 0.1348 0.0968 0.0784 0.0858 0.0700
## Detection Prevalence 0.1696 0.1354 0.0970 0.0790 0.0864 0.0708
## Balanced Accuracy 0.9999 0.9997 0.9958 0.9959 0.9939 0.9967
## Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity 0.9937 0.9957 0.9897 0.9938
## Specificity 0.9996 0.9993 0.9996 0.9991
## Pos Pred Value 0.9958 0.9935 0.9948 0.9918
## Neg Pred Value 0.9993 0.9996 0.9991 0.9993
## Prevalence 0.0956 0.0922 0.0774 0.0968
## Detection Rate 0.0950 0.0918 0.0766 0.0962
## Detection Prevalence 0.0954 0.0924 0.0770 0.0970
## Balanced Accuracy 0.9966 0.9975 0.9946 0.9965
table(label_predicted,digits_train[1:5000, 1])
##
## label_predicted 0 1 2 3 4 5 6 7 8 9
## 0 847 0 0 0 0 0 1 0 0 0
## 1 0 674 1 0 1 0 1 0 0 0
## 2 0 0 484 0 0 1 0 0 0 0
## 3 0 0 1 392 0 0 0 0 1 1
## 4 0 0 2 0 429 0 0 1 0 0
## 5 0 0 0 1 0 350 1 0 2 0
## 6 0 0 0 0 1 1 475 0 0 0
## 7 0 0 0 0 0 0 0 459 1 2
## 8 0 0 0 2 0 0 0 0 383 0
## 9 0 0 0 0 3 0 0 1 0 481
###Out of time validation with test data
test_label_predicted<-predict(number.svm, newdata =digits_test[,-1] , type = "class")
confusionMatrix(test_label_predicted,digits_test[,1])
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2 3 4 5 6 7 8 9
## 0 351 0 3 0 0 3 5 0 3 0
## 1 0 253 0 0 1 0 0 0 0 0
## 2 6 2 182 6 5 4 4 3 4 0
## 3 0 0 4 144 0 3 0 0 4 0
## 4 1 5 4 0 185 1 2 5 0 4
## 5 0 0 0 11 2 145 1 0 5 1
## 6 0 3 1 0 3 0 158 0 1 0
## 7 0 0 1 1 1 0 0 137 0 1
## 8 1 0 3 3 0 1 0 0 146 2
## 9 0 1 0 1 3 3 0 2 3 169
##
## Overall Statistics
##
## Accuracy : 0.9317
## 95% CI : (0.9198, 0.9424)
## No Information Rate : 0.1789
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9233
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.9777 0.9583 0.91919 0.86747 0.92500 0.90625
## Specificity 0.9915 0.9994 0.98121 0.99402 0.98783 0.98917
## Pos Pred Value 0.9616 0.9961 0.84259 0.92903 0.89372 0.87879
## Neg Pred Value 0.9951 0.9937 0.99107 0.98812 0.99167 0.99186
## Prevalence 0.1789 0.1315 0.09865 0.08271 0.09965 0.07972
## Detection Rate 0.1749 0.1261 0.09068 0.07175 0.09218 0.07225
## Detection Prevalence 0.1819 0.1266 0.10762 0.07723 0.10314 0.08221
## Balanced Accuracy 0.9846 0.9789 0.95020 0.93075 0.95641 0.94771
## Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity 0.92941 0.93197 0.87952 0.95480
## Specificity 0.99565 0.99785 0.99457 0.99290
## Pos Pred Value 0.95181 0.97163 0.93590 0.92857
## Neg Pred Value 0.99348 0.99464 0.98920 0.99562
## Prevalence 0.08470 0.07324 0.08271 0.08819
## Detection Rate 0.07872 0.06826 0.07275 0.08421
## Detection Prevalence 0.08271 0.07025 0.07773 0.09068
## Balanced Accuracy 0.96253 0.96491 0.93704 0.97385
#####Model on Full Data
pc <- proc.time()
number.svm <- svm(V1 ~. , type="C", data = digits_train)
proc.time() - pc
## user system elapsed
## 31.22 0.19 31.42
summary(number.svm)
##
## Call:
## svm(formula = V1 ~ ., data = digits_train, type = "C")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.00390625
##
## Number of Support Vectors: 2606
##
## ( 213 326 319 235 285 63 256 262 401 246 )
##
##
## Number of Classes: 10
##
## Levels:
## 0 1 2 3 4 5 6 7 8 9
#Confusion Matrix
library(caret)
label_predicted<-predict(number.svm, type = "class")
confusionMatrix(label_predicted,digits_train[,1])
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2 3 4 5 6 7 8 9
## 0 1194 0 0 0 0 0 2 0 0 0
## 1 0 1005 1 1 2 0 1 0 1 0
## 2 0 0 724 0 0 1 0 0 0 0
## 3 0 0 2 651 0 0 0 0 0 1
## 4 0 0 4 0 648 1 0 2 1 1
## 5 0 0 0 3 0 553 0 0 2 0
## 6 0 0 0 0 0 1 661 0 0 0
## 7 0 0 0 0 0 0 0 641 2 3
## 8 0 0 0 3 0 0 0 0 536 0
## 9 0 0 0 0 2 0 0 2 0 639
##
## Overall Statistics
##
## Accuracy : 0.9947
## 95% CI : (0.9927, 0.9962)
## No Information Rate : 0.1638
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.994
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 1.0000 1.0000 0.99042 0.98936 0.99387 0.99460
## Specificity 0.9997 0.9990 0.99985 0.99955 0.99864 0.99926
## Pos Pred Value 0.9983 0.9941 0.99862 0.99541 0.98630 0.99104
## Neg Pred Value 1.0000 1.0000 0.99893 0.99895 0.99940 0.99955
## Prevalence 0.1638 0.1378 0.10026 0.09025 0.08943 0.07626
## Detection Rate 0.1638 0.1378 0.09930 0.08929 0.08888 0.07585
## Detection Prevalence 0.1640 0.1387 0.09944 0.08970 0.09011 0.07653
## Balanced Accuracy 0.9998 0.9995 0.99514 0.99445 0.99625 0.99693
## Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity 0.99548 0.99380 0.98893 0.99224
## Specificity 0.99985 0.99925 0.99956 0.99940
## Pos Pred Value 0.99849 0.99226 0.99443 0.99378
## Neg Pred Value 0.99955 0.99940 0.99911 0.99925
## Prevalence 0.09107 0.08847 0.07434 0.08833
## Detection Rate 0.09066 0.08792 0.07352 0.08764
## Detection Prevalence 0.09080 0.08860 0.07393 0.08819
## Balanced Accuracy 0.99767 0.99652 0.99424 0.99582
###Out of time validation with test data
test_label_predicted<-predict(number.svm, newdata =digits_test[,-1] , type = "class")
confusionMatrix(test_label_predicted,digits_test[,1])
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2 3 4 5 6 7 8 9
## 0 351 0 2 0 0 3 4 0 4 0
## 1 0 253 0 0 1 0 0 0 0 0
## 2 6 1 183 5 3 2 4 2 2 0
## 3 0 0 4 146 0 3 0 0 3 0
## 4 1 5 3 0 186 1 2 5 0 4
## 5 0 1 0 11 1 147 1 0 2 1
## 6 0 3 1 0 2 0 158 0 1 0
## 7 0 1 1 1 3 0 0 138 0 0
## 8 1 0 4 3 1 1 1 0 151 2
## 9 0 0 0 0 3 3 0 2 3 170
##
## Overall Statistics
##
## Accuracy : 0.9382
## 95% CI : (0.9268, 0.9484)
## No Information Rate : 0.1789
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9306
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.9777 0.9583 0.92424 0.87952 0.93000 0.91875
## Specificity 0.9921 0.9994 0.98618 0.99457 0.98838 0.99080
## Pos Pred Value 0.9643 0.9961 0.87981 0.93590 0.89855 0.89634
## Neg Pred Value 0.9951 0.9937 0.99166 0.98920 0.99222 0.99295
## Prevalence 0.1789 0.1315 0.09865 0.08271 0.09965 0.07972
## Detection Rate 0.1749 0.1261 0.09118 0.07275 0.09268 0.07324
## Detection Prevalence 0.1814 0.1266 0.10364 0.07773 0.10314 0.08171
## Balanced Accuracy 0.9849 0.9789 0.95521 0.93704 0.95919 0.95477
## Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity 0.92941 0.93878 0.90964 0.96045
## Specificity 0.99619 0.99677 0.99294 0.99399
## Pos Pred Value 0.95758 0.95833 0.92073 0.93923
## Neg Pred Value 0.99349 0.99517 0.99186 0.99617
## Prevalence 0.08470 0.07324 0.08271 0.08819
## Detection Rate 0.07872 0.06876 0.07524 0.08470
## Detection Prevalence 0.08221 0.07175 0.08171 0.09018
## Balanced Accuracy 0.96280 0.96777 0.95129 0.97722
#Lets see some predictions.
digits_test$predicted<-test_label_predicted
for(i in 1:10 )
{
data_row<-digits_test[i,c(-1,-ncol(digits_test))]
pixels = matrix(as.numeric(data_row),16,16,byrow=TRUE)
image(pixels, axes = FALSE)
title(main = paste("Label is" , digits_test[i,1] ," Prediction is" , digits_test[i,ncol(digits_test)]))
}
#Lets see some errors in predictions images.
# Wrong predictions
digits_test$predicted<-test_label_predicted
wrong_predictions<-digits_test[!(digits_test$predicted==digits_test$V1),]
nrow(wrong_predictions)
## [1] 124
for(i in 1:10 )
{
data_row<-wrong_predictions[i,c(-1,-ncol(wrong_predictions))]
pixels = matrix(as.numeric(data_row),16,16,byrow=TRUE)
image(pixels, axes = FALSE)
title(main = paste("Label is" , wrong_predictions[i,1] ," Prediction is" , wrong_predictions[i,ncol(wrong_predictions)]))
}
Conclusion
There are many software tools available for SVM implementation. SVMs are really good for text classification, and they are also good at finding the best linear separator. The kernel trick makes SVM a non-linear learning algorithm. Choosing an appropriate kernel is the key to a good SVM, and choosing the right kernel function is not easy. We need to be patient while building SVMs on large datasets; they take a lot of time for training.