Introduction to SVM

Introduction
The decision boundary with largest margin
SVM- The large margin classifier
SVM algorithm
The kernel trick
Building SVM model
Conclusion

Introduction

SVM is another black box method in the machine learning space
Compared to other ML algorithms, SVM takes a very different approach to learning.
The in-depth theory and mathematics of SVM require a solid background in vector algebra and numerical analysis
We will try to learn the basic principle, philosophy and implementation of SVM
SVM was first introduced by Vapnik and Chervonenkis
Neural networks try to reduce the squared error and often suffer from overfitting.
The SVM algorithm often has better generalization ability. There are many applications where SVM works better than neural networks

The Classifier

To understand the SVM algorithm easily, we will start with the decision boundary
The decision boundary is the line that separates the classes
Classification algorithms are all about finding the decision boundaries
A good classifier is the one that generalizes well. It should work well on both training and testing data
The decision boundary need not always be a straight line

Many Classifiers

The Margin of Classifier

Out of all the classifiers, the one that has maximum margin will generalize well. But why?

The Best Decision Boundary

Imagine two more data points. The classifier with maximum margin will be able to classify them more accurately.

The Maximum Margin Classifier

So, the best classifier has the maximum margin
It is the classifier that maximizes the distance between itself and the nearest training data
In our example a, b, c are the training data points that are nearest to model m1, and a, c, d are the training examples that are nearest to model m2.
The model m1 has the maximum margin
The model m1 works well with unseen examples
The model m1 generalizes well
For a given dataset, if we can find the classifier that has the maximum margin, it is likely to give the best accuracy on unseen data.
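The idea of comparing classifiers by their margin can be sketched in a few lines of base R. The toy points and the two candidate lines m1 and m2 below are made-up values for illustration only (they are not the points a, b, c, d from the figure); the margin is taken as the distance from the line to its nearest training point.

```r
# Toy 2-D data: two classes labelled -1 and +1 (illustrative values, not the lab dataset)
X <- rbind(c(1, 1), c(2, 1), c(1, 2),    # class -1
           c(4, 4), c(5, 4), c(4, 5))    # class +1
y <- c(-1, -1, -1, 1, 1, 1)

# Distance from each point to the line w[1]*x1 + w[2]*x2 + b = 0
point_distances <- function(w, b, X) {
  abs(X %*% w + b) / sqrt(sum(w^2))
}

# The margin of a classifier is the distance to its nearest training point
classifier_margin <- function(w, b, X) min(point_distances(w, b, X))

# Two hypothetical lines that both separate the classes perfectly
m1 <- list(w = c(1, 1), b = -5.5)   # x1 + x2 = 5.5, midway between the classes
m2 <- list(w = c(1, 1), b = -3.5)   # x1 + x2 = 3.5, very close to class -1

# TRUE when every point falls on the side of the line that matches its label
separates <- function(m) all(sign(X %*% m$w + m$b) == y)

margin_m1 <- classifier_margin(m1$w, m1$b, X)
margin_m2 <- classifier_margin(m2$w, m2$b, X)
c(m1 = margin_m1, m2 = margin_m2)
```

Both lines classify the training points correctly, but m1 leaves more room on either side, so a new point that falls slightly away from its class is less likely to cross it.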

LAB: Simple Classifiers

Dataset: Fraud Transaction/Transactions_sample.csv
Draw a classification graph that shows all the classes
Build a logistic regression classifier
Draw the classifier on the data plot

Solution

Transactions_sample <- read.csv("C:/Amrita/Datavedi/Fraud Transaction/Transactions_sample.csv")
head(Transactions_sample)

##      id Total_Amount Tr_Count_week Fraud_id
## 1 16078      7294.60          4.79        0
## 2 41365      7659.53          2.45        0
## 3 11666      8259.29         10.77        0
## 4 11824     11630.25         15.29        1
## 5 36414     12286.63         22.18        1
## 6    90     12783.34         16.34        1

names(Transactions_sample)

## [1] "id"            "Total_Amount"  "Tr_Count_week" "Fraud_id"

library(ggplot2)
ggplot(Transactions_sample)+geom_point(aes(x=Total_Amount,y=Tr_Count_week,color=factor(Fraud_id),shape=factor(Fraud_id)),size=5)

#####Logit model

logit_model<-glm(Fraud_id~Total_Amount+Tr_Count_week,data=Transactions_sample,family=binomial())
logit_model

## 
## Call:  glm(formula = Fraud_id ~ Total_Amount + Tr_Count_week, family = binomial(), 
##     data = Transactions_sample)
## 
## Coefficients:
##   (Intercept)   Total_Amount  Tr_Count_week  
##    -26.148132       0.002534       0.108896  
## 
## Degrees of Freedom: 209 Total (i.e. Null);  207 Residual
## Null Deviance:       291.1 
## Residual Deviance: 16.85     AIC: 22.85

###The classifier slope & intercept
coef(logit_model)

##   (Intercept)  Total_Amount Tr_Count_week 
## -26.148131643   0.002533707   0.108895819

coef(logit_model)[1]

## (Intercept) 
##   -26.14813

coef(logit_model)[2]

## Total_Amount 
##  0.002533707

coef(logit_model)[3]

## Tr_Count_week 
##     0.1088958

logit_slope <- coef(logit_model)[2]/(-coef(logit_model)[3])
logit_intercept<- coef(logit_model)[1]/(-coef(logit_model)[3]) 

###The classifier diagram

base<-ggplot(Transactions_sample)+geom_point(aes(x=Total_Amount,y=Tr_Count_week,color=factor(Fraud_id),shape=factor(Fraud_id)),size=5)
base+geom_abline(intercept = logit_intercept , slope = logit_slope, color = "red", size = 2)
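Why dividing the coefficients by the negative of the last coefficient gives the line: the logistic model predicts probability 0.5 exactly where b0 + b1*x1 + b2*x2 = 0, and solving for x2 gives x2 = (-b0/b2) + (-b1/b2)*x1. The sketch below checks this on simulated data; the data, the variable names x1, x2, y and the true coefficients are assumptions made for the illustration, not the fraud dataset.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
# Simulated labels with noise, so the classes overlap and glm() converges cleanly
y  <- rbinom(n, 1, plogis(1 + 2 * x1 - 1.5 * x2))
toy <- data.frame(x1, x2, y)

fit <- glm(y ~ x1 + x2, data = toy, family = binomial())
b <- coef(fit)

# Boundary: b[1] + b[2]*x1 + b[3]*x2 = 0  =>  x2 = intercept + slope * x1
slope     <- unname(b[2] / (-b[3]))
intercept <- unname(b[1] / (-b[3]))

# Points placed on that line should get a predicted probability of 0.5
on_line <- data.frame(x1 = c(-1, 0, 1))
on_line$x2 <- intercept + slope * on_line$x1
p <- predict(fit, newdata = on_line, type = "response")
p
```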

SVM- The large margin classifier

SVM is all about finding the maximum-margin Classifier.
Classifier is a generic name; the decision boundary is actually called the hyperplane
- Hyperplane: In a 3-dimensional space the hyperplanes are 2-dimensional planes; in a 2-dimensional space the hyperplanes are 1-dimensional lines.
SVM algorithm makes use of the nearest training examples to derive the classifier with maximum margin
Each data point is considered as a p-dimensional vector (a list of p numbers)
SVM uses vector algebra and mathematical optimization to find the optimal hyperplane that has maximum margin

The SVM Algorithm

If a dataset is linearly separable then we can always find a hyperplane f(x) such that
- For all negative labeled records f(x)<0
- For all positive labeled records f(x)>0
- This hyperplane f(x) is nothing but the linear classifier
- \(f(x)=w_1 x_1+ w_2 x_2 +b\)
- \(f(x)=w^T x+b\)

Math behind SVM Algorithm

SVM Algorithm – The Math

If you already understand the SVM technique and find this slide too technical, you may want to skip it. The tool will take care of this optimization

\(f(x)=w^T x+b\)
\(w^T x^+ +b=1\) and \(w^T x^- +b=-1\)
\(x^+ = x^- + \lambda w\)
\(w^T x^+ +b=1\)
- \(w^T(x^- + \lambda w)+b=1\)
- \(w^T x^- + \lambda\, w \cdot w + b=1\)
- \(-1+\lambda\, w \cdot w=1\)
- \(\lambda = 2/(w \cdot w)\)
\(m = \|x^+ - x^-\|\)
- \(m=\|\lambda w\|\)
- \(m=(2/(w \cdot w))\,\|w\|\)
- \(m=2/\|w\|\)
The objective is to maximize \(2/\|w\|\)
- i.e. minimize \(\|w\|\)
A good decision boundary should satisfy
- \(w^T x^+ +b \geq 1\) for all points with y=1
- \(w^T x^- +b \leq -1\) for all points with y=-1
- i.e. \(y(w^T x+b) \geq 1\) for all points
Now we have the optimization problem with objective and constraints
- minimize \(\|w\|\) or, equivalently, \(\tfrac{1}{2}\|w\|^2\)
- with the constraint \(y(w^T x+b) \geq 1\)
We can solve the above optimization problem to obtain w and b
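The derivation can be checked numerically. The sketch below uses a hand-picked hyperplane (w, b) and four toy points, all of which are assumptions for illustration rather than the output of an optimizer. It verifies the constraint y(w^T x + b) >= 1 for every point and confirms that stepping from a point on the -1 line along w by lambda = 2/(w.w) lands on the +1 line at distance 2/||w||.

```r
# Hypothetical separating hyperplane and toy data (values chosen for illustration)
w <- c(1, 1)
b <- -5
X <- rbind(c(1, 1), c(2, 2), c(3, 3), c(4, 4))
y <- c(-1, -1, 1, 1)

# Constraint: y * (w^T x + b) >= 1 for every training point
functional_margin <- as.vector(y * (X %*% w + b))
constraints_ok <- all(functional_margin >= 1)

# x_minus lies on w^T x + b = -1; step along w by lambda to reach w^T x + b = +1
x_minus <- c(2, 2)
lambda  <- 2 / sum(w * w)
x_plus  <- x_minus + lambda * w

margin_direct  <- sqrt(sum((x_plus - x_minus)^2))   # |x+ - x-|
margin_formula <- 2 / sqrt(sum(w^2))                # 2 / ||w||
c(direct = margin_direct, formula = margin_formula)
```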

SVM Result

By default, SVM doesn’t output a probability. It directly gives the class that the new data point belongs to
For a new point \(x_k\), calculate \(w^T x_k + b\). If this value is positive then the prediction is +1, else -1
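The prediction rule is a one-line computation once w and b are known. In the sketch below, w and b are hypothetical values standing in for the result of the optimization, and svm_predict is a helper name introduced here for illustration (it is not a function from any package).

```r
# Hypothetical weights and bias, standing in for the result of the optimization
w <- c(2, -1)
b <- -3

# Sign rule: +1 when w^T x + b is positive, otherwise -1
svm_predict <- function(x_new, w, b) {
  score <- sum(w * x_new) + b
  if (score > 0) 1 else -1
}

pred_a <- svm_predict(c(4, 1), w, b)   # score = 8 - 1 - 3 = 4
pred_b <- svm_predict(c(1, 2), w, b)   # score = 2 - 2 - 3 = -3
c(pred_a, pred_b)
```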

SVM on R

There are multiple SVM packages available in R. The package “e1071” is one of the most widely used
There is a function called svm() within the e1071 package
There are various options within the svm() function to customize the training process

library(e1071)
#Fraud_id is still numeric here, so svm() fits a regression model (see SVM-Type in the summary)
svm_model <- svm(Fraud_id~Total_Amount+Tr_Count_week, data=Transactions_sample)

summary(svm_model)

## 
## Call:
## svm(formula = Fraud_id ~ Total_Amount + Tr_Count_week, data = Transactions_sample)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.5 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  24

LAB: First SVM Learning Problem

Dataset: Fraud Transaction/Transactions_sample.csv
Draw a classification graph that shows all the classes
Build an SVM classifier
Draw the classifier on the data plots
Predict the (Fraud vs not-Fraud) class for the data points Total_Amount=11000, Tr_Count_week=15 & Total_Amount=2000, Tr_Count_week=4
Download the complete Dataset: Fraud Transaction/Transaction.csv
Draw a classification graph that shows all the classes
Build an SVM classifier
Draw the classifier on the data plots

Solution

#SVM Building needs e1071 package
library(e1071)

#Converting the output into factor, otherwise SVM will fit a regression model
Transactions_sample$Fraud_id<-factor(Transactions_sample$Fraud_id) 
head(Transactions_sample)

##      id Total_Amount Tr_Count_week Fraud_id
## 1 16078      7294.60          4.79        0
## 2 41365      7659.53          2.45        0
## 3 11666      8259.29         10.77        0
## 4 11824     11630.25         15.29        1
## 5 36414     12286.63         22.18        1
## 6    90     12783.34         16.34        1

#SVM Model building
svm_model <- svm(Fraud_id~Total_Amount+Tr_Count_week, data=Transactions_sample)
summary(svm_model)

## 
## Call:
## svm(formula = Fraud_id ~ Total_Amount + Tr_Count_week, data = Transactions_sample)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.5 
## 
## Number of Support Vectors:  12
## 
##  ( 6 6 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

#Plotting the classification graph of the data
ggplot(Transactions_sample)+geom_point(aes(x=Total_Amount,y=Tr_Count_week,color=factor(Fraud_id),shape=factor(Fraud_id)),size=5)

#Data With SVM model
plot(svm_model, Transactions_sample,Tr_Count_week~Total_Amount ) #x2~x1

#Prediction in SVM
new_data1<-data.frame(Total_Amount=11000, Tr_Count_week=15)
p1<-predict(svm_model, new_data1)
p1

## 1 
## 1 
## Levels: 0 1

new_data2<-data.frame(Total_Amount=2000, Tr_Count_week=4)
p2<-predict(svm_model, new_data2)
p2

## 1 
## 0 
## Levels: 0 1

#SVM on overall data
Transactions<- read.csv("C:/Amrita/Datavedi/Fraud Transaction/Transaction.csv")
dim(Transactions)

## [1] 45000     4

#Fraud_id is not converted to a factor here; type="C" tells svm() to fit a classification model anyway
svm_model_1 <- svm(Fraud_id~Total_Amount+Tr_Count_week, type="C", data=Transactions)
summary(svm_model_1)

## 
## Call:
## svm(formula = Fraud_id ~ Total_Amount + Tr_Count_week, data = Transactions, 
##     type = "C")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.5 
## 
## Number of Support Vectors:  44
## 
##  ( 21 23 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

#Plotting SVM Classification graph
plot(svm_model_1, Transactions,Tr_Count_week~Total_Amount )

The Non-Linear Decision Boundary

In the above examples we can clearly see the decision boundary is linear
SVM works well when the data points are linearly separable
If the decision boundary is non-linear then a linear SVM may struggle to classify
Observe the below examples, the classes are not linearly separable
The basic (linear) SVM has no direct way to model a non-linear decision boundary.

Mapping to Higher Dimensional Space

The original maximum-margin hyperplane algorithm proposed by Vapnik in 1963 constructed a linear classifier.
To fit a non-linear boundary classifier, we can create new variables (dimensions) in the data and see whether the decision boundary is linear in the new space.
In 1992, Bernhard E. Boser, Isabelle M. Guyon and Vladimir N. Vapnik suggested a way to create nonlinear classifiers by applying the kernel trick
In the below example, a single linear classifier is not sufficient
Let's create a new variable \(x_2=(x_1)^2\)
In the higher dimensional space, we can clearly see the possibility of a single linear decision boundary
This idea is the basis of the kernel trick
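A small base R sketch of this idea: the one-dimensional toy data below (made-up values, not the lab dataset) has one class in the middle and the other on both sides, so no single cut point on x1 separates them. After adding x2 = x1^2, a horizontal line in the (x1, x2) plane does the job.

```r
# Toy 1-D data: class 1 sits in the middle, class 0 on both sides (illustrative values)
x1    <- c(-3, -2.5, -2, -0.5, 0, 0.5, 2, 2.5, 3)
label <- c( 0,  0,    0,  1,   1, 1,   0, 0,   0)

# Try every cut point between neighbouring values, in both directions
sorted_x <- sort(x1)
cuts <- (head(sorted_x, -1) + tail(sorted_x, -1)) / 2
threshold_works <- sapply(cuts, function(cut) {
  all((x1 > cut) == (label == 1)) || all((x1 < cut) == (label == 1))
})
any(threshold_works)   # no single threshold on x1 separates the classes

# Add the new variable x2 = x1^2; the horizontal line x2 = 2 now separates them
x2 <- x1^2
linear_in_2d <- all((x2 < 2) == (label == 1))
linear_in_2d
```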

Kernel Trick

We used a function f(x)=(x,(x^2)) to transform the data x into a higher dimensional space.
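In the higher dimensional space, we could easily fit a linear decision boundary.
Strictly speaking, f(x) is the feature map. The kernel function K(x_i, x_j) returns the inner product of f(x_i) and f(x_j) without computing f explicitly, and using it in place of the mapping is known as the kernel trick in SVM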

The kernel trick solves the non-linear decision boundary problem much like the hidden layers in neural networks.
The kernel trick effectively increases the number of dimensions, so that a non-linear decision boundary in the lower dimensional space becomes a linear decision boundary in the higher dimensional space.
In simple words, the kernel trick makes a non-linear decision boundary linear (in a higher dimensional space)
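One way to see why this is called a "trick": a kernel gives the inner product in the higher dimensional space without ever building the new variables. The sketch below assumes a one-dimensional input and the degree-2 polynomial kernel (x*z + 1)^2; phi is a hypothetical explicit mapping chosen so that its inner product matches that kernel.

```r
# Explicit mapping of a 1-D input into 3 dimensions (chosen to match the kernel below)
phi <- function(x) c(1, sqrt(2) * x, x^2)

# Degree-2 polynomial kernel, computed directly in the original 1-D space
poly_kernel <- function(x, z) (x * z + 1)^2

x <- 1.5
z <- -0.7
explicit <- sum(phi(x) * phi(z))   # inner product after mapping
implicit <- poly_kernel(x, z)      # same value, no mapping needed
c(explicit = explicit, implicit = implicit)
```

The two numbers agree because 1 + 2xz + (xz)^2 = (xz + 1)^2. For kernels such as the Gaussian, the corresponding mapping is infinite dimensional, so computing the kernel directly is the only practical route.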

Kernel Function Examples

Name	Function	Typical use
Polynomial Kernel	\((x_i^T x_j + 1)^q\), q is the degree of the polynomial	Often used in image processing
Sigmoid Kernel	\(\tanh(a\, x_i^T x_j + k)\), k is the offset value	Very similar to a neural network
Gaussian Kernel	\(e^{-\|x_i - x_j\|^2 / (2\sigma^2)}\)	No prior knowledge of the data
Linear Spline Kernel (one-dimensional)	\(1 + x_i x_j + x_i x_j \min(x_i, x_j) - \frac{x_i + x_j}{2} \min(x_i, x_j)^2 + \frac{\min(x_i, x_j)^3}{3}\)	Text classification
Laplace Radial Basis Function (RBF)	\(e^{-\lambda \|x_i - x_j\|}\), \(\lambda \geq 0\)	No prior knowledge of the data

There are many more kernel functions.
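Several of these kernels are short enough to write out in base R. The sketch below is only an illustration: the function names and default parameter values are assumptions, the Gaussian kernel is written with the usual negative exponent, and kernel_matrix is a helper introduced here to show how a kernel turns a dataset into a matrix of pairwise similarities.

```r
# Kernel functions for two numeric vectors xi and xj.
# Parameter names (q, a, k, sigma, lambda) follow the table; default values are arbitrary.
polynomial_kernel <- function(xi, xj, q = 2) (sum(xi * xj) + 1)^q
sigmoid_kernel    <- function(xi, xj, a = 1, k = 0) tanh(a * sum(xi * xj) + k)
gaussian_kernel   <- function(xi, xj, sigma = 1) exp(-sum((xi - xj)^2) / (2 * sigma^2))
laplace_kernel    <- function(xi, xj, lambda = 1) exp(-lambda * sqrt(sum((xi - xj)^2)))

# Kernel matrix: the kernel value for every pair of rows of X
kernel_matrix <- function(X, kernel, ...) {
  n <- nrow(X)
  K <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      K[i, j] <- kernel(X[i, ], X[j, ], ...)
    }
  }
  K
}

X <- rbind(c(0, 0), c(1, 0), c(1, 1))
K_gauss <- kernel_matrix(X, gaussian_kernel, sigma = 1)
K_gauss
```

For the Gaussian kernel every point has similarity 1 with itself, and the similarity decays towards 0 as two points move apart.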

Choosing the Kernel Function

Choosing the kernel is probably the trickiest part of using SVM.
The kernel function is important because it creates the kernel matrix, which summarizes all the data
There is no proven theory for choosing a kernel function for a given problem. This is still an active area of research.
In practice, a low degree polynomial kernel or an RBF kernel with a reasonable width is a good initial try
Choosing the kernel function is similar to choosing the number of hidden layers in neural networks. Neither has a proven theory to arrive at a standard value.
As a first step, we can choose a low degree polynomial kernel, a radial basis function, or another kernel from the list above
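Since there is no theory to pick the kernel, one practical option is to compare a few candidates by cross validation. The sketch below is a minimal illustration under stated assumptions: it uses R's built-in iris data as a stand-in (not the lab datasets), keeps every other svm() parameter at its default, and relies on the cross argument of svm() in e1071, which stores the overall k-fold accuracy in tot.accuracy.

```r
library(e1071)

set.seed(42)  # cross-validation folds are random
kernels <- c("linear", "polynomial", "radial")

# 5-fold cross-validated accuracy (in percent) for each kernel, default parameters
cv_accuracy <- sapply(kernels, function(k) {
  fit <- svm(Species ~ ., data = iris, kernel = k, cross = 5)
  fit$tot.accuracy
})
cv_accuracy

# Kernel with the highest cross-validated accuracy on this dataset
best_kernel <- names(which.max(cv_accuracy))
best_kernel
```

The ranking depends on the dataset and on the other parameters (cost, gamma, degree), so this is a starting point for the search rather than a final answer.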

LAB: Kernel – Non linear classifier

Dataset : Software users/sw_user_profile.csv
How many variables are there in software user profile data?
Plot the active users against age and check whether the relation between age and “Active” status is linear or non-linear
Build an SVM model (model-1), make sure that there is no kernel or the kernel is linear
For model-1, create the confusion matrix and find out the accuracy
Create a new variable using a polynomial (squared) transformation
Build an SVM model (model-2) with the new data mapped on to higher dimensions. Set the kernel to linear
For model-2, create the confusion matrix and find out the accuracy
Plot the SVM with results.
With the original data, re-create the model (model-3) and let R choose the default kernel function.
What is the accuracy of model-3?

Solution

sw_user_profile <- read.csv("C:/Amrita/Datavedi/Software Users/sw_user_profile.csv")
head(sw_user_profile)

##   Id       Age Active
## 1  1  9.438867      0
## 2  2  8.614807      0
## 3  3  5.817555      0
## 4  4 10.329219      0
## 5  5  6.527926      0
## 6  6  8.231147      0

#How many variables are there in software user profile data?
names(sw_user_profile)

## [1] "Id"     "Age"    "Active"

#Plot the active users against age and check whether the relation between age and "Active" status is linear or non-linear
plot(sw_user_profile$Age,sw_user_profile$Id,  col=as.integer(sw_user_profile$Active+1))

#Build an SVM model(model-1), make sure that there is no kernel or the kernel is linear
library(e1071)
svm_model_nl <- svm(Active~Age,  type="C",  data=sw_user_profile)
summary(svm_model_nl)

## 
## Call:
## svm(formula = Active ~ Age, data = sw_user_profile, type = "C")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  1 
## 
## Number of Support Vectors:  16
## 
##  ( 8 8 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

#Setting the kernel to linear
svm_model_nl <- svm(Active~Age,  type="C", kernel="linear", data=sw_user_profile)
summary(svm_model_nl)

## 
## Call:
## svm(formula = Active ~ Age, data = sw_user_profile, type = "C", 
##     kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
##       gamma:  1 
## 
## Number of Support Vectors:  347
## 
##  ( 174 173 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

#For model-1, create the confusion matrix and find out the accuracy
library(caret)

## Loading required package: lattice

Age_predicted<-predict(svm_model_nl)
#Recent caret versions require both arguments to be factors with the same levels
confusionMatrix(Age_predicted,factor(sw_user_profile$Active))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 317 173
##          1   0   0
##                                           
##                Accuracy : 0.6469          
##                  95% CI : (0.6028, 0.6893)
##     No Information Rate : 0.6469          
##     P-Value [Acc > NIR] : 0.5207          
##                                           
##                   Kappa : 0               
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.6469          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.6469          
##          Detection Rate : 0.6469          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 0               
##

#Create a new variable using a squared (polynomial) term

###Standardizing the data to visualize the results clearly
sw_user_profile$age_nor<-(sw_user_profile$Age-mean(sw_user_profile$Age))/sd(sw_user_profile$Age)
plot(sw_user_profile$age_nor,sw_user_profile$Id,  col=as.integer(sw_user_profile$Active+1))

#Creating the new variable
sw_user_profile$new<-(sw_user_profile$age_nor)^2
plot(sw_user_profile$Age,sw_user_profile$new,  col=as.integer(sw_user_profile$Active+1))

#Build an SVM model(model-2), with the new data mapped on to higher dimensions. Set the kernel to linear
svm_model_2 <- svm(Active~Age+new,  type="C", kernel="linear", data=sw_user_profile)
summary(svm_model_2)

## 
## Call:
## svm(formula = Active ~ Age + new, data = sw_user_profile, type = "C", 
##     kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
##       gamma:  0.5 
## 
## Number of Support Vectors:  15
## 
##  ( 8 7 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

#For model-2, create the confusion matrix and find out the accuracy
library(caret)
Age_predicted<-predict(svm_model_2)
#Recent caret versions require both arguments to be factors with the same levels
confusionMatrix(Age_predicted,factor(sw_user_profile$Active))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 317   0
##          1   0 173
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9925, 1)
##     No Information Rate : 0.6469     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.6469     
##          Detection Rate : 0.6469     
##    Detection Prevalence : 0.6469     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : 0          
##

#Plot the SVM with results.
plot(svm_model_2, sw_user_profile,new~Age )

#With the original data re-create the model(model-3) and let R choose the default kernel function.
library(e1071)
svm_model_3 <- svm(Active~Age,  type="C", data=sw_user_profile)
summary(svm_model_3)

## 
## Call:
## svm(formula = Active ~ Age, data = sw_user_profile, type = "C")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  1 
## 
## Number of Support Vectors:  16
## 
##  ( 8 8 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

#What is the accuracy of model-3?
library(caret)
Age_predicted<-predict(svm_model_3)
#Recent caret versions require both arguments to be factors with the same levels
confusionMatrix(Age_predicted,factor(sw_user_profile$Active))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 317   0
##          1   0 173
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9925, 1)
##     No Information Rate : 0.6469     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.6469     
##          Detection Rate : 0.6469     
##    Detection Prevalence : 0.6469     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : 0          
##

Soft Margin Classification – Noisy data

Noisy data

What if there is some noise in the data?
What if the overall data can be classified perfectly except for a few points?
How do we find the hyperplane when a few points are on the wrong side?

Soft Margin Classification – Noisy data

The non-separable cases can be handled by allowing a slack variable (ξ) for each point on the wrong side.
We are allowing some errors while building the classifier
In the SVM optimization problem, we add an error (slack) term to the objective and then find the hyperplane
SVM will find the maximum margin classifier while allowing some minimum error due to noise.
Hard margin: classifying all data points correctly
Soft margin: allowing some error
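In the usual soft-margin formulation, each point gets a slack value that measures how far it violates the margin constraint y(w^T x + b) >= 1, and the objective becomes (1/2)||w||^2 + C * (sum of slacks). The base R sketch below computes these quantities for a hand-picked hyperplane and toy data with one noisy point; all values are assumptions for illustration. The constant C plays the role of the cost parameter reported in the svm() summaries in this tutorial.

```r
# Toy data with one noisy point: the last row is labelled -1 but lies on the +1 side
X <- rbind(c(1, 1), c(2, 2), c(3, 3), c(4, 4), c(3.5, 3.5))
y <- c(-1, -1, 1, 1, -1)

# Hypothetical hyperplane (w, b); in practice the optimizer chooses it
w <- c(1, 1)
b <- -5

# Slack of each point: 0 when the margin constraint y * (w^T x + b) >= 1 holds
slack <- pmax(0, 1 - as.vector(y * (X %*% w + b)))
slack

# Soft-margin objective: (1/2) * ||w||^2 + C * sum(slack)
soft_margin_objective <- function(w, slack, C) 0.5 * sum(w^2) + C * sum(slack)
obj_small_C <- soft_margin_objective(w, slack, C = 0.1)
obj_large_C <- soft_margin_objective(w, slack, C = 10)
c(small_C = obj_small_C, large_C = obj_large_C)
```

A small C tolerates the noisy point cheaply and favours a wide margin; a large C penalizes every violation heavily and pushes the solution towards the hard-margin classifier.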

SVM Validation

By default, SVM doesn’t give us a probability; it directly gives us the resultant classes
The usual validation methods, like sensitivity, specificity, cross validation, ROC and AUC, can be used to validate an SVM model
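These measures come straight from the confusion matrix. As a worked example, the base R sketch below recomputes accuracy, sensitivity and specificity from the model-1 confusion matrix shown earlier in this tutorial (the linear-kernel model on the software user data), treating class 0 as the positive class as caret does in that output.

```r
# Counts taken from the model-1 confusion matrix in this tutorial
# (rows = prediction, columns = reference; class 0 is the 'positive' class)
cm <- matrix(c(317, 0, 173, 0), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["0", "0"] / sum(cm[, "0"])   # true positives / actual positives
specificity <- cm["1", "1"] / sum(cm[, "1"])   # true negatives / actual negatives
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4)
```

The accuracy of 0.6469 looks acceptable on its own, but a specificity of 0 shows that the model never predicts class 1, which is why accuracy alone is not enough to validate a classifier.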

SVM Advantages & Disadvantages

SVM Advantages

SVMs are very good when we have little prior knowledge about the data
They work well even with unstructured and semi-structured data like text, images and trees.
The kernel trick is the real strength of SVM. With an appropriate kernel function, we can solve many complex problems
Unlike neural networks, SVM training is a convex optimization problem, so it does not get stuck in local optima.
It scales relatively well to high dimensional data
SVM models generalize well in practice; the risk of overfitting is lower in SVM.

SVM Disadvantages

Choosing a “good” kernel function is not easy.
Long training time for large datasets
Difficult to understand and interpret the final model, variable weights and individual impact
Since the final model is not so easy to inspect, we cannot make small calibrations to the model, hence it is tough to incorporate our business logic

SVM Application

Protein Structure Prediction
Intrusion Detection
Handwriting Recognition
Detecting Steganography in digital images
Breast Cancer Diagnosis

LAB: Digit Recognition using SVM

Take an image of a handwritten single digit, and determine what that digit is.
Normalized handwritten digits, automatically scanned from envelopes by the U.S. Postal Service. The original scanned digits are binary and of different sizes and orientations; the images here have been deslanted and size normalized, resulting in 16 x 16 grayscale images (Le Cun et al., 1990).
The data are in two gzipped files, and each line consists of the digit id (0-9) followed by the 256 grayscale values.
Build an SVM model that can be used as the digit recognizer
Use the test dataset to validate the true classification power of the model
What is the final accuracy of the model?

Solution

#Importing test and training data
digits_train <- read.table("C:/Amrita/Datavedi/Digit Recognizer/USPS/zip.train.txt", quote="\"", comment.char="")
digits_test <- read.table("C:/Amrita/Datavedi/Digit Recognizer/USPS/zip.test.txt", quote="\"", comment.char="")
dim(digits_train)

## [1] 7291  257

dim(digits_test)

## [1] 2007  257

#Let's see some images.
for(i in 1:6 )
{
data_row<-digits_train[i,-1]
pixels = matrix(as.numeric(data_row),16,16,byrow=TRUE)
image(pixels, axes = FALSE)
title(main = paste("Label is" , digits_train[i,1]), font.main = 4)
}

#Are there any missing values?

sum(is.na(digits_train))

## [1] 0

sum(is.na(digits_test))

## [1] 0

#The first variable is label
table(digits_train$V1)

## 
##    0    1    2    3    4    5    6    7    8    9 
## 1194 1005  731  658  652  556  664  645  542  644

table(digits_test$V1)

## 
##   0   1   2   3   4   5   6   7   8   9 
## 359 264 198 166 200 160 170 147 166 177

########SVM Model Building 
library(e1071)

#Let's keep an eye on runtime
pc <- proc.time()

#Verify the code with limited data 5000 rows
number.svm <- svm(V1 ~. , type="C", data = digits_train[1:5000,])

proc.time() - pc

##    user  system elapsed 
##   38.25    0.14   39.37

summary(number.svm)

## 
## Call:
## svm(formula = V1 ~ ., data = digits_train[1:5000, ], type = "C")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.00390625 
## 
## Number of Support Vectors:  2028
## 
##  ( 181 232 245 189 195 45 220 206 305 210 )
## 
## 
## Number of Classes:  10 
## 
## Levels: 
##  0 1 2 3 4 5 6 7 8 9

#Confusion Matrix
library(caret)
label_predicted<-predict(number.svm, type = "class")
#Recent caret versions require both arguments to be factors with the same levels
confusionMatrix(label_predicted,factor(digits_train[1:5000, 1]))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4   5   6   7   8   9
##          0 847   0   0   0   0   0   1   0   0   0
##          1   0 674   1   0   1   0   1   0   0   0
##          2   0   0 484   0   0   1   0   0   0   0
##          3   0   0   1 392   0   0   0   0   1   1
##          4   0   0   2   0 429   0   0   1   0   0
##          5   0   0   0   1   0 350   1   0   2   0
##          6   0   0   0   0   1   1 475   0   0   0
##          7   0   0   0   0   0   0   0 459   1   2
##          8   0   0   0   2   0   0   0   0 383   0
##          9   0   0   0   0   3   0   0   1   0 481
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9948          
##                  95% CI : (0.9924, 0.9966)
##     No Information Rate : 0.1694          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9942          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity            1.0000   1.0000   0.9918   0.9924   0.9885   0.9943
## Specificity            0.9998   0.9993   0.9998   0.9993   0.9993   0.9991
## Pos Pred Value         0.9988   0.9956   0.9979   0.9924   0.9931   0.9887
## Neg Pred Value         1.0000   1.0000   0.9991   0.9993   0.9989   0.9996
## Prevalence             0.1694   0.1348   0.0976   0.0790   0.0868   0.0704
## Detection Rate         0.1694   0.1348   0.0968   0.0784   0.0858   0.0700
## Detection Prevalence   0.1696   0.1354   0.0970   0.0790   0.0864   0.0708
## Balanced Accuracy      0.9999   0.9997   0.9958   0.9959   0.9939   0.9967
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity            0.9937   0.9957   0.9897   0.9938
## Specificity            0.9996   0.9993   0.9996   0.9991
## Pos Pred Value         0.9958   0.9935   0.9948   0.9918
## Neg Pred Value         0.9993   0.9996   0.9991   0.9993
## Prevalence             0.0956   0.0922   0.0774   0.0968
## Detection Rate         0.0950   0.0918   0.0766   0.0962
## Detection Prevalence   0.0954   0.0924   0.0770   0.0970
## Balanced Accuracy      0.9966   0.9975   0.9946   0.9965

table(label_predicted,digits_train[1:5000, 1])

##                
## label_predicted   0   1   2   3   4   5   6   7   8   9
##               0 847   0   0   0   0   0   1   0   0   0
##               1   0 674   1   0   1   0   1   0   0   0
##               2   0   0 484   0   0   1   0   0   0   0
##               3   0   0   1 392   0   0   0   0   1   1
##               4   0   0   2   0 429   0   0   1   0   0
##               5   0   0   0   1   0 350   1   0   2   0
##               6   0   0   0   0   1   1 475   0   0   0
##               7   0   0   0   0   0   0   0 459   1   2
##               8   0   0   0   2   0   0   0   0 383   0
##               9   0   0   0   0   3   0   0   1   0 481

###Out-of-sample validation with test data
test_label_predicted<-predict(number.svm, newdata =digits_test[,-1] , type = "class")
#Recent caret versions require both arguments to be factors with the same levels
confusionMatrix(test_label_predicted,factor(digits_test[,1]))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4   5   6   7   8   9
##          0 351   0   3   0   0   3   5   0   3   0
##          1   0 253   0   0   1   0   0   0   0   0
##          2   6   2 182   6   5   4   4   3   4   0
##          3   0   0   4 144   0   3   0   0   4   0
##          4   1   5   4   0 185   1   2   5   0   4
##          5   0   0   0  11   2 145   1   0   5   1
##          6   0   3   1   0   3   0 158   0   1   0
##          7   0   0   1   1   1   0   0 137   0   1
##          8   1   0   3   3   0   1   0   0 146   2
##          9   0   1   0   1   3   3   0   2   3 169
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9317          
##                  95% CI : (0.9198, 0.9424)
##     No Information Rate : 0.1789          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9233          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity            0.9777   0.9583  0.91919  0.86747  0.92500  0.90625
## Specificity            0.9915   0.9994  0.98121  0.99402  0.98783  0.98917
## Pos Pred Value         0.9616   0.9961  0.84259  0.92903  0.89372  0.87879
## Neg Pred Value         0.9951   0.9937  0.99107  0.98812  0.99167  0.99186
## Prevalence             0.1789   0.1315  0.09865  0.08271  0.09965  0.07972
## Detection Rate         0.1749   0.1261  0.09068  0.07175  0.09218  0.07225
## Detection Prevalence   0.1819   0.1266  0.10762  0.07723  0.10314  0.08221
## Balanced Accuracy      0.9846   0.9789  0.95020  0.93075  0.95641  0.94771
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           0.92941  0.93197  0.87952  0.95480
## Specificity           0.99565  0.99785  0.99457  0.99290
## Pos Pred Value        0.95181  0.97163  0.93590  0.92857
## Neg Pred Value        0.99348  0.99464  0.98920  0.99562
## Prevalence            0.08470  0.07324  0.08271  0.08819
## Detection Rate        0.07872  0.06826  0.07275  0.08421
## Detection Prevalence  0.08271  0.07025  0.07773  0.09068
## Balanced Accuracy     0.96253  0.96491  0.93704  0.97385

#####Model on Full Data 
pc <- proc.time()
number.svm <- svm(V1 ~. , type="C", data = digits_train)
proc.time() - pc

##    user  system elapsed 
##   76.94    0.26   87.24

summary(number.svm)

## 
## Call:
## svm(formula = V1 ~ ., data = digits_train, type = "C")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.00390625 
## 
## Number of Support Vectors:  2606
## 
##  ( 213 326 319 235 285 63 256 262 401 246 )
## 
## 
## Number of Classes:  10 
## 
## Levels: 
##  0 1 2 3 4 5 6 7 8 9

#Confusion Matrix
library(caret)
label_predicted<-predict(number.svm, type = "class")
#Recent caret versions require both arguments to be factors with the same levels
confusionMatrix(label_predicted,factor(digits_train[,1]))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2    3    4    5    6    7    8    9
##          0 1194    0    0    0    0    0    2    0    0    0
##          1    0 1005    1    1    2    0    1    0    1    0
##          2    0    0  724    0    0    1    0    0    0    0
##          3    0    0    2  651    0    0    0    0    0    1
##          4    0    0    4    0  648    1    0    2    1    1
##          5    0    0    0    3    0  553    0    0    2    0
##          6    0    0    0    0    0    1  661    0    0    0
##          7    0    0    0    0    0    0    0  641    2    3
##          8    0    0    0    3    0    0    0    0  536    0
##          9    0    0    0    0    2    0    0    2    0  639
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9947          
##                  95% CI : (0.9927, 0.9962)
##     No Information Rate : 0.1638          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.994           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity            1.0000   1.0000  0.99042  0.98936  0.99387  0.99460
## Specificity            0.9997   0.9990  0.99985  0.99955  0.99864  0.99926
## Pos Pred Value         0.9983   0.9941  0.99862  0.99541  0.98630  0.99104
## Neg Pred Value         1.0000   1.0000  0.99893  0.99895  0.99940  0.99955
## Prevalence             0.1638   0.1378  0.10026  0.09025  0.08943  0.07626
## Detection Rate         0.1638   0.1378  0.09930  0.08929  0.08888  0.07585
## Detection Prevalence   0.1640   0.1387  0.09944  0.08970  0.09011  0.07653
## Balanced Accuracy      0.9998   0.9995  0.99514  0.99445  0.99625  0.99693
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           0.99548  0.99380  0.98893  0.99224
## Specificity           0.99985  0.99925  0.99956  0.99940
## Pos Pred Value        0.99849  0.99226  0.99443  0.99378
## Neg Pred Value        0.99955  0.99940  0.99911  0.99925
## Prevalence            0.09107  0.08847  0.07434  0.08833
## Detection Rate        0.09066  0.08792  0.07352  0.08764
## Detection Prevalence  0.09080  0.08860  0.07393  0.08819
## Balanced Accuracy     0.99767  0.99652  0.99424  0.99582

# Validation on the held-out test data
# Drop the label column (V1) before predicting
test_label_predicted <- predict(number.svm, newdata = digits_test[,-1], type = "class")
confusionMatrix(test_label_predicted, digits_test[,1])

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4   5   6   7   8   9
##          0 351   0   2   0   0   3   4   0   4   0
##          1   0 253   0   0   1   0   0   0   0   0
##          2   6   1 183   5   3   2   4   2   2   0
##          3   0   0   4 146   0   3   0   0   3   0
##          4   1   5   3   0 186   1   2   5   0   4
##          5   0   1   0  11   1 147   1   0   2   1
##          6   0   3   1   0   2   0 158   0   1   0
##          7   0   1   1   1   3   0   0 138   0   0
##          8   1   0   4   3   1   1   1   0 151   2
##          9   0   0   0   0   3   3   0   2   3 170
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9382          
##                  95% CI : (0.9268, 0.9484)
##     No Information Rate : 0.1789          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9306          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity            0.9777   0.9583  0.92424  0.87952  0.93000  0.91875
## Specificity            0.9921   0.9994  0.98618  0.99457  0.98838  0.99080
## Pos Pred Value         0.9643   0.9961  0.87981  0.93590  0.89855  0.89634
## Neg Pred Value         0.9951   0.9937  0.99166  0.98920  0.99222  0.99295
## Prevalence             0.1789   0.1315  0.09865  0.08271  0.09965  0.07972
## Detection Rate         0.1749   0.1261  0.09118  0.07275  0.09268  0.07324
## Detection Prevalence   0.1814   0.1266  0.10364  0.07773  0.10314  0.08171
## Balanced Accuracy      0.9849   0.9789  0.95521  0.93704  0.95919  0.95477
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           0.92941  0.93878  0.90964  0.96045
## Specificity           0.99619  0.99677  0.99294  0.99399
## Pos Pred Value        0.95758  0.95833  0.92073  0.93923
## Neg Pred Value        0.99349  0.99517  0.99186  0.99617
## Prevalence            0.08470  0.07324  0.08271  0.08819
## Detection Rate        0.07872  0.06876  0.07524  0.08470
## Detection Prevalence  0.08221  0.07175  0.08171  0.09018
## Balanced Accuracy     0.96280  0.96777  0.95129  0.97722
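
The headline numbers in the test output can be recomputed by hand from the confusion matrix, which helps in reading the table. The sketch below copies the test counts from the output above into a matrix (the variable names are illustrative, not part of the lab code). It then derives the overall accuracy, the number of misclassified images, the per-class sensitivity, and the most frequent confusion:

```r
# Test confusion matrix copied from the output above
# (rows = predicted digit, columns = reference digit).
digit_names <- as.character(0:9)
cm <- matrix(c(
  351,   0,   2,   0,   0,   3,   4,   0,   4,   0,
    0, 253,   0,   0,   1,   0,   0,   0,   0,   0,
    6,   1, 183,   5,   3,   2,   4,   2,   2,   0,
    0,   0,   4, 146,   0,   3,   0,   0,   3,   0,
    1,   5,   3,   0, 186,   1,   2,   5,   0,   4,
    0,   1,   0,  11,   1, 147,   1,   0,   2,   1,
    0,   3,   1,   0,   2,   0, 158,   0,   1,   0,
    0,   1,   1,   1,   3,   0,   0, 138,   0,   0,
    1,   0,   4,   3,   1,   1,   1,   0, 151,   2,
    0,   0,   0,   0,   3,   3,   0,   2,   3, 170
), nrow = 10, byrow = TRUE, dimnames = list(digit_names, digit_names))

# Correct predictions sit on the diagonal.
correct  <- sum(diag(cm))
total    <- sum(cm)          # 2007 test images
errors   <- total - correct  # 124 misclassified images
accuracy <- correct / total

# Sensitivity per class = correct predictions / all true members of the class.
sensitivity <- diag(cm) / colSums(cm)

# Largest off-diagonal cell = the most common confusion.
off_diag <- cm
diag(off_diag) <- 0
worst <- which(off_diag == max(off_diag), arr.ind = TRUE)
worst_predicted <- rownames(cm)[worst[1, 1]]
worst_reference <- colnames(cm)[worst[1, 2]]

print(round(accuracy, 4))
print(round(sensitivity, 5))
cat("Most common error: true", worst_reference, "predicted as", worst_predicted, "\n")
```

The training accuracy reported earlier is 0.9947 against 0.9382 on the test data, so the model loses about 5.6 percentage points on unseen digits. The weakest class is 3, which is most often predicted as 5.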

# Let's look at some predictions.
digits_test$predicted <- test_label_predicted

for (i in 1:10) {
  # Drop the label (first) and prediction (last) columns to keep only the pixels
  data_row <- digits_test[i, c(-1, -ncol(digits_test))]
  # Each row is a 16 x 16 image stored row by row
  pixels <- matrix(as.numeric(data_row), 16, 16, byrow = TRUE)
  image(pixels, axes = FALSE)
  title(main = paste("Label is", digits_test[i, 1], "  Prediction is", digits_test[i, ncol(digits_test)]))
}

# Let's look at images of some wrong predictions.
digits_test$predicted <- test_label_predicted
# Keep only the rows where the prediction differs from the true label
wrong_predictions <- digits_test[digits_test$predicted != digits_test$V1, ]
nrow(wrong_predictions)

## [1] 124

for (i in 1:10) {
  # Drop the label (first) and prediction (last) columns to keep only the pixels
  data_row <- wrong_predictions[i, c(-1, -ncol(wrong_predictions))]
  pixels <- matrix(as.numeric(data_row), 16, 16, byrow = TRUE)
  image(pixels, axes = FALSE)
  title(main = paste("Label is", wrong_predictions[i, 1], "  Prediction is", wrong_predictions[i, ncol(wrong_predictions)]))
}

Conclusion

Many software tools are available for implementing SVMs
SVMs work particularly well for text classification, where the data is high dimensional and sparse
SVMs find the maximum-margin linear separator. The kernel trick extends them to non-linear decision boundaries
Choosing an appropriate kernel is key to a good SVM model, and it is not easy
Training SVMs on large datasets takes a long time, so we need to be patient
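
One practical way to approach the kernel choice is to compare kernels by cross-validated accuracy. The sketch below is only an illustration. It assumes the e1071 package (the svm() output shown in the labs matches its format) and uses R's built-in iris data as a small stand-in for a real problem. The `cross = 5` argument asks svm() for 5-fold cross-validation, and `tot.accuracy` is the overall cross-validated accuracy in percent.

```r
library(e1071)  # assumption: the package that provides svm() in the labs

set.seed(42)  # cross-validation folds are random, so fix the seed

kernels <- c("linear", "polynomial", "radial", "sigmoid")

# Fit one model per kernel and collect its 5-fold cross-validated accuracy.
cv_accuracy <- sapply(kernels, function(k) {
  fit <- svm(Species ~ ., data = iris, kernel = k, cross = 5)
  fit$tot.accuracy
})

print(round(cv_accuracy, 1))
```

All other parameters (cost, gamma, degree) stay at their defaults here. A fair comparison would tune them for each kernel as well, which adds to the training time on larger datasets.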
