Introduction to SVM

Contents

  • Introduction
  • The decision boundary with largest margin
  • SVM- The large margin classifier
  • SVM algorithm
  • The kernel trick
  • Building SVM model
  • Conclusion

Introduction

  • SVM is another black-box method in the machine learning space
  • Compared to other ML algorithms, SVM takes a very different approach to learning
  • The in-depth theory and mathematics of SVM require a good knowledge of vector algebra and numerical analysis
  • We will try to learn the basic principles, philosophy, and implementation of SVM
  • SVM was first introduced by Vapnik and Chervonenkis
  • Neural networks are typically trained to reduce an error measure such as the squared error, and they often suffer from overfitting
  • SVM often generalizes better. There are many applications where SVM works better than neural networks

The Classifier

  • To understand the SVM algorithm easily, we will start with the decision boundary
  • The decision boundary is the line that separates the classes
  • Classification algorithms are all about finding the decision boundaries
  • A good classifier is one that generalizes well. It should work well on both training and testing data
  • The decision boundary need not always be a straight line

Many Classifiers

The Margin of Classifier

Out of all the classifiers, the one that has maximum margin will generalize well. But why?

The Best Decision Boundary

  • Imagine two more data points. The classifier with maximum margin will be able to classify them more accurately.

The Maximum Margin Classifier

  • So, the best classifier is the one with the maximum margin
  • It is the classifier that maximizes the distance between itself and the nearest training data points
  • In our example, a, b, c are the training data points nearest to model m1, and a, c, d are the training data points nearest to model m2
  • The model m1 has the maximum margin
  • The model m1 works well with unseen examples
  • The model m1 generalizes well
  • For a given dataset, if we can find the classifier with the maximum margin, we can expect it to give the best accuracy on unseen data
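
The idea of comparing classifiers by their margin can be sketched in a few lines of R. The toy points and the two candidate lines (called m1 and m2 here only to echo the slide) are hypothetical and are not taken from the course data; the sketch only shows how the margin, the distance to the nearest training point, would be computed.

```r
# Hypothetical toy data: two features, labels -1 / +1
X <- matrix(c(1, 1,
              2, 1,
              1, 2,
              4, 4,
              5, 4,
              4, 5), ncol = 2, byrow = TRUE)
y <- c(-1, -1, -1, 1, 1, 1)

# Distance from each point to the line w[1]*x1 + w[2]*x2 + b = 0;
# the margin is the smallest of these distances
margin <- function(w, b, X) {
  min(abs(X %*% w + b) / sqrt(sum(w^2)))
}

# Two hypothetical lines that both separate the classes
m1 <- list(w = c(1, 1), b = -5.5)  # x1 + x2 = 5.5
m2 <- list(w = c(1, 0), b = -2.5)  # x1 = 2.5

margin_m1 <- margin(m1$w, m1$b, X)
margin_m2 <- margin(m2$w, m2$b, X)

# m1 keeps a larger distance from its nearest points than m2
margin_m1 > margin_m2
```

Both lines classify the training points correctly; the margin is what tells them apart.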

LAB: Simple Classifiers

  • Dataset: Fraud Transaction/Transactions_sample.csv
  • Draw a classification graph that shows all the classes
  • Build a logistic regression classifier
  • Draw the classifier on the data plot

Solution

Transactions_sample <- read.csv("C:AmritaDatavediFraud TransactionTransactions_sample.csv")
head(Transactions_sample)
##      id Total_Amount Tr_Count_week Fraud_id
## 1 16078      7294.60          4.79        0
## 2 41365      7659.53          2.45        0
## 3 11666      8259.29         10.77        0
## 4 11824     11630.25         15.29        1
## 5 36414     12286.63         22.18        1
## 6    90     12783.34         16.34        1
names(Transactions_sample)
## [1] "id"            "Total_Amount"  "Tr_Count_week" "Fraud_id"
library(ggplot2)
ggplot(Transactions_sample)+geom_point(aes(x=Total_Amount,y=Tr_Count_week,color=factor(Fraud_id),shape=factor(Fraud_id)),size=5)

#####Logit model

logit_model<-glm(Fraud_id~Total_Amount+Tr_Count_week,data=Transactions_sample,family=binomial())
logit_model
## 
## Call:  glm(formula = Fraud_id ~ Total_Amount + Tr_Count_week, family = binomial(), 
##     data = Transactions_sample)
## 
## Coefficients:
##   (Intercept)   Total_Amount  Tr_Count_week  
##    -26.148132       0.002534       0.108896  
## 
## Degrees of Freedom: 209 Total (i.e. Null);  207 Residual
## Null Deviance:       291.1 
## Residual Deviance: 16.85     AIC: 22.85
###The classifier slope & intercept
coef(logit_model)
##   (Intercept)  Total_Amount Tr_Count_week 
## -26.148131643   0.002533707   0.108895819
coef(logit_model)[1]
## (Intercept) 
##   -26.14813
coef(logit_model)[2]
## Total_Amount 
##  0.002533707
coef(logit_model)[3]
## Tr_Count_week 
##     0.1088958
logit_slope <- coef(logit_model)[2]/(-coef(logit_model)[3])
logit_intercept<- coef(logit_model)[1]/(-coef(logit_model)[3]) 

###The classifier diagram

base<-ggplot(Transactions_sample)+geom_point(aes(x=Total_Amount,y=Tr_Count_week,color=factor(Fraud_id),shape=factor(Fraud_id)),size=5)
base+geom_abline(intercept = logit_intercept , slope = logit_slope, color = "red", size = 2) 
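
The slope and intercept used above follow from setting the linear predictor of the logistic model to zero, which is where the fitted probability equals 0.5. The sketch below repeats that algebra with the coefficients printed earlier; the amount of 10000 is an arbitrary value chosen only to check the result.

```r
# Coefficients copied from the glm() output above
b0 <- -26.148131643   # (Intercept)
b1 <- 0.002533707     # Total_Amount
b2 <- 0.108895819     # Tr_Count_week

# The boundary is where b0 + b1*Total_Amount + b2*Tr_Count_week = 0.
# Solving for Tr_Count_week gives a line in the plot's coordinates.
boundary_slope     <- -b1 / b2
boundary_intercept <- -b0 / b2

# Check with an arbitrary amount: a point on the line should have
# a linear predictor of 0 and a fitted probability of 0.5
amount <- 10000
count_on_line    <- boundary_intercept + boundary_slope * amount
linear_predictor <- b0 + b1 * amount + b2 * count_on_line
prob             <- plogis(linear_predictor)
```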

SVM- The large margin classifier

  • SVM is all about finding the maximum-margin Classifier.
  • Classifier is a generic name; it is actually called a hyperplane
    • Hyperplane: In 3-dimensional space, hyperplanes are 2-dimensional planes; in 2-dimensional space, hyperplanes are 1-dimensional lines.
  • SVM algorithm makes use of the nearest training examples to derive the classifier with maximum margin
  • Each data point is considered as a p-dimensional vector (a list of p numbers)
  • SVM uses vector algebra and mathematical optimization to find the optimal hyperplane that has maximum margin

The SVM Algorithm

  • If a dataset is linearly separable, then we can always find a hyperplane f(x) such that
    • For all negatively labeled records, f(x) < 0
    • For all positively labeled records, f(x) > 0
    • This hyperplane f(x) is nothing but the linear classifier
    • \(f(x) = w_1 x_1 + w_2 x_2 + b\)
    • \(f(x) = w^T x + b\)

Math behind SVM Algorithm

SVM Algorithm – The Math

If you already understand the SVM technique, or if you find this slide too technical, you may want to skip it. The tool will take care of this optimization.

  1. \(f(x) = w^T x + b\)
  2. \(w^T x^+ + b = 1\) and \(w^T x^- + b = -1\)
  3. \(x^+ = x^- + \lambda w\)
  4. \(w^T x^+ + b = 1\)
    • \(w^T (x^- + \lambda w) + b = 1\)
    • \(w^T x^- + \lambda\, w \cdot w + b = 1\)
    • \(-1 + \lambda\, w \cdot w = 1\)
    • \(\lambda = 2 / (w \cdot w)\)
  5. \(m = \|x^+ - x^-\|\)
    • \(m = \|\lambda w\|\)
    • \(m = \frac{2}{w \cdot w} \|w\|\)
    • \(m = 2 / \|w\|\)
  6. The objective is to maximize \(2 / \|w\|\)
    • i.e. minimize \(\|w\|\)
  7. A good decision boundary should satisfy
    • \(w^T x^+ + b \ge 1\) for all points with y = 1
    • \(w^T x^- + b \le -1\) for all points with y = -1
    • i.e. \(y (w^T x + b) \ge 1\) for all points
  8. Now we have the optimization problem with an objective and constraints
    • Minimize \(\|w\|\) or, equivalently, \(\frac{1}{2} \|w\|^2\)
    • Subject to the constraint \(y (w^T x + b) \ge 1\)
  9. We can solve the above optimization problem to obtain w and b
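
Steps 3 to 5 of the derivation can be checked numerically. The values of w and b below are hypothetical and are not fitted to any data; the sketch only confirms that moving from the minus plane along w by lambda = 2/(w.w) lands on the plus plane, and that the distance covered is 2/||w||.

```r
# Hypothetical hyperplane parameters (not fitted to any data)
w <- c(3, 4)
b <- -2

# Pick a point x_minus on the plane w'x + b = -1
# (fix the second coordinate at 0 and solve for the first)
x_minus <- c((-1 - b) / w[1], 0)

# Step 4: lambda = 2 / (w . w); step 3: x_plus = x_minus + lambda * w
lambda <- 2 / sum(w * w)
x_plus <- x_minus + lambda * w

# x_plus should sit on the plane w'x + b = +1
value_at_x_plus <- sum(w * x_plus) + b

# Step 5: the margin is the distance between the two points
m <- sqrt(sum((x_plus - x_minus)^2))

# Compare with the closed form 2 / ||w||
closed_form <- 2 / sqrt(sum(w^2))
```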

SVM Result

  • SVM doesn’t output a probability. It directly gives the class that the new data point belongs to
  • For a new point \(x_k\), calculate \(w^T x_k + b\). If this value is positive, the prediction is +1; otherwise it is -1
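
The prediction rule can be written as a small R function. The weights w and b here are hypothetical stand-ins for the values an optimizer would return, and svm_predict is an illustrative helper, not a function from any package.

```r
# Hypothetical parameters of a trained linear SVM (for illustration only)
w <- c(0.8, -0.5)
b <- -1

# Decision rule: the sign of w'x + b
svm_predict <- function(x_k, w, b) {
  score <- sum(w * x_k) + b
  if (score > 0) 1 else -1
}

pred_a <- svm_predict(c(3, 1), w, b)   # score = 2.4 - 0.5 - 1 = 0.9
pred_b <- svm_predict(c(1, 2), w, b)   # score = 0.8 - 1.0 - 1 = -1.2
```

With a model fitted by e1071, the corresponding scores can be requested with predict(model, newdata, decision.values = TRUE); they are returned as the "decision.values" attribute of the prediction.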

SVM on R

  • There are multiple SVM packages available in R. The package “e1071” is the most widely used
  • There is a function called svm() within e1071 package
  • There are various options within svm() function to customize the training process
library(e1071)
svm_model <- svm(Fraud_id~Total_Amount+Tr_Count_week, data=Transactions_sample)

summary(svm_model)
## 
## Call:
## svm(formula = Fraud_id ~ Total_Amount + Tr_Count_week, data = Transactions_sample)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.5 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  24
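
As a rough illustration of the options that svm() accepts, the sketch below sets type, kernel and cost explicitly. It uses the iris data that ships with R as a stand-in (an assumption, not the course dataset), and the chosen values are arbitrary.

```r
library(e1071)

# iris ships with R and is used here only as a stand-in dataset;
# keep two of the three species to get a two-class problem
two_class <- droplevels(subset(iris, Species != "setosa"))

# type, kernel and cost are arguments of svm(); the values are arbitrary
fit <- svm(Species ~ Petal.Length + Petal.Width,
           data   = two_class,
           type   = "C-classification",
           kernel = "linear",
           cost   = 10)

fit$cost      # the cost value that was used
fit$tot.nSV   # total number of support vectors

# Predicted classes for the training rows
fitted_class <- predict(fit, two_class)
```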

LAB: First SVM Learning Problem

  • Dataset: Fraud Transaction/Transactions_sample.csv
  • Draw a classification graph that shows all the classes
  • Build an SVM classifier
  • Draw the classifier on the data plots
  • Predict the (Fraud vs not-Fraud) class for the data points Total_Amount=11000, Tr_Count_week=15 and Total_Amount=2000, Tr_Count_week=4
  • Download the complete dataset: Fraud Transaction/Transaction.csv
  • Draw a classification graph that shows all the classes
  • Build an SVM classifier
  • Draw the classifier on the data plots

Solution

#SVM Building needs e1071 package
library(e1071)

#Converting the output into factor, otherwise SVM will fit a regression model
Transactions_sample$Fraud_id<-factor(Transactions_sample$Fraud_id) 
head(Transactions_sample)
##      id Total_Amount Tr_Count_week Fraud_id
## 1 16078      7294.60          4.79        0
## 2 41365      7659.53          2.45        0
## 3 11666      8259.29         10.77        0
## 4 11824     11630.25         15.29        1
## 5 36414     12286.63         22.18        1
## 6    90     12783.34         16.34        1
#SVM Model building
svm_model <- svm(Fraud_id~Total_Amount+Tr_Count_week, data=Transactions_sample)
summary(svm_model)
## 
## Call:
## svm(formula = Fraud_id ~ Total_Amount + Tr_Count_week, data = Transactions_sample)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.5 
## 
## Number of Support Vectors:  12
## 
##  ( 6 6 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1
#Plotting SVM classification graph on the data
ggplot(Transactions_sample)+geom_point(aes(x=Total_Amount,y=Tr_Count_week,color=factor(Fraud_id),shape=factor(Fraud_id)),size=5)

#Data With SVM model
plot(svm_model, Transactions_sample,Tr_Count_week~Total_Amount ) #x2~x1

#Prediction in SVM
new_data1<-data.frame(Total_Amount=11000, Tr_Count_week=15)
p1<-predict(svm_model, new_data1)
p1
## 1 
## 1 
## Levels: 0 1
new_data2<-data.frame(Total_Amount=2000, Tr_Count_week=4)
p2<-predict(svm_model, new_data2)
p2
## 1 
## 0 
## Levels: 0 1
#SVM on overall data
Transactions<- read.csv("C:AmritaDatavediFraud TransactionTransaction.csv")
dim(Transactions)
## [1] 45000     4
#Fraud_id is still numeric here, so type="C" is used to force a classification model
svm_model_1 <- svm(Fraud_id~Total_Amount+Tr_Count_week, type="C", data=Transactions)
summary(svm_model_1)
## 
## Call:
## svm(formula = Fraud_id ~ Total_Amount + Tr_Count_week, data = Transactions, 
##     type = "C")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.5 
## 
## Number of Support Vectors:  44
## 
##  ( 21 23 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1
#Plotting SVM classification graph
plot(svm_model_1, Transactions,Tr_Count_week~Total_Amount ) 

The Non-Linear Decision Boundary

  • In the above examples, we can clearly see that the decision boundary is linear
  • SVM works well when the data points are linearly separable
  • If the decision boundary is non-linear, then a linear SVM may struggle to classify
  • Observe the examples below; the classes are not linearly separable
  • The basic linear SVM has no direct way to fit a non-linear decision boundary

Mapping to Higher Dimensional Space

  • The original maximum-margin hyperplane algorithm proposed by Vapnik in 1963 constructed a linear classifier.
  • To fit a non-linear boundary classifier, we can create new variables (dimensions) in the data and see whether the decision boundary becomes linear.
  • In 1992, Bernhard E. Boser, Isabelle M. Guyon and Vladimir N. Vapnik suggested a way to create nonlinear classifiers by applying the kernel trick
  • In the example below, a single linear classifier is not sufficient
  • Let's create a new variable \(x_2 = x_1^2\), which places the data in a higher-dimensional space
  • In that space, we can clearly see the possibility of a single linear decision boundary
  • This is the idea behind the kernel trick

Kernel Trick

  • We used a function \(f(x) = (x, x^2)\) to transform the data x into a higher-dimensional space.
  • In the higher-dimensional space, we could easily fit a linear decision boundary.
  • A kernel function computes the inner product \(f(x_i)^T f(x_j)\) directly from the original points, so the mapping f(x) never has to be built explicitly. Using it in this way is known as the kernel trick in SVM
  • The kernel trick solves the non-linear decision boundary problem much like the hidden layers in neural networks.
  • The kernel trick effectively increases the number of dimensions, so that a non-linear decision boundary in the lower-dimensional space becomes a linear decision boundary in the higher-dimensional space.
  • In simple words, the kernel trick makes the non-linear decision boundary linear (in the higher-dimensional space)
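
A minimal sketch of the mapping \(f(x) = (x, x^2)\) on hypothetical one-dimensional data is given below. It first shows that the mapped points can be split by a straight line, and then that the inner product in the mapped space can be computed from the original values without building the mapping. The data and the threshold 2.5 are made up for illustration.

```r
# Hypothetical 1-D data: the positive class sits in the middle,
# so no single threshold on x separates the classes
x <- c(-3, -2, -1, 0, 1, 2, 3)
y <- c(-1, -1, 1, 1, 1, -1, -1)

# Explicit mapping f(x) = (x, x^2)
f <- function(x) cbind(x, x^2)
z <- f(x)

# In the new space the horizontal line x^2 = 2.5 separates the classes
pred <- ifelse(z[, 2] < 2.5, 1, -1)
separable <- all(pred == y)

# The matching kernel gives the inner product in the mapped space
# directly from the original values: k(a, b) = a*b + (a*b)^2
k <- function(a, b) a * b + (a * b)^2
explicit <- sum(f(2) * f(3))   # inner product after mapping
implicit <- k(2, 3)            # same value without mapping
```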

Kernel Function Examples

Name | Function | Type of problem
Polynomial Kernel | \((x_i^T x_j + 1)^q\), where q is the degree of the polynomial | Best for image processing
Sigmoid Kernel | \(\tanh(a x_i^T x_j + k)\), where k is the offset value | Very similar to a neural network
Gaussian Kernel | \(e^{-\|x_i - x_j\|^2 / (2\sigma^2)}\) | No prior knowledge of the data
Linear Spline Kernel (one-dimensional) | \(1 + x_i x_j + x_i x_j \min(x_i, x_j) - \frac{x_i + x_j}{2} \min(x_i, x_j)^2 + \frac{\min(x_i, x_j)^3}{3}\) | Text classification
Laplace Radial Basis Function (RBF) | \(e^{-\lambda \|x_i - x_j\|}\), \(\lambda \ge 0\) | No prior knowledge of the data
  • There are many more kernel functions.
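
Several of the kernels listed above are short enough to write directly as R functions. This is only a sketch: the function names and default parameter values are illustrative choices, and the Gaussian kernel is written with the usual negative exponent.

```r
# Polynomial kernel of degree q
poly_kernel <- function(xi, xj, q = 2) (sum(xi * xj) + 1)^q

# Sigmoid kernel with scale a and offset k
sigmoid_kernel <- function(xi, xj, a = 1, k = 0) tanh(a * sum(xi * xj) + k)

# Gaussian kernel with width sigma
gauss_kernel <- function(xi, xj, sigma = 1) {
  exp(-sum((xi - xj)^2) / (2 * sigma^2))
}

# Laplace RBF kernel with rate lambda >= 0
laplace_kernel <- function(xi, xj, lambda = 1) {
  exp(-lambda * sqrt(sum((xi - xj)^2)))
}

# Two hypothetical points
xi <- c(1, 2)
xj <- c(2, 0)

k_poly    <- poly_kernel(xi, xj)      # (1*2 + 2*0 + 1)^2
k_sigmoid <- sigmoid_kernel(xi, xj)   # tanh(2)
k_gauss   <- gauss_kernel(xi, xj)     # exp(-5 / 2)
k_laplace <- laplace_kernel(xi, xj)   # exp(-sqrt(5))
```

The Gaussian and Laplace kernels both equal 1 when the two points coincide and shrink toward 0 as the points move apart.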

Choosing the Kernel Function

  • This is probably the trickiest part of using SVM.
  • The kernel function is important because it creates the kernel matrix, which summarizes all the data
  • There is no proven theory for choosing a kernel function for a given problem. It is still an active area of research.
  • In practice, a low-degree polynomial kernel or an RBF kernel with a reasonable width is a good initial try
  • Choosing a kernel function is similar to choosing the number of hidden layers in neural networks. Neither has a proven theory for arriving at a standard value.
  • As a first step, we can choose a low-degree polynomial kernel, a radial basis function, or another kernel from the list above
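
Since there is no theory to pick the kernel, one practical approach is to try a few candidates and compare their cross-validated accuracy. The sketch below does this with the cross argument of svm() in e1071. The iris data, the seed and the choice of 5 folds are assumptions made for illustration only.

```r
library(e1071)

set.seed(42)  # cross-validation folds are drawn at random

# Candidate kernels to compare, each with its default parameters
kernels <- c("linear", "polynomial", "radial")

cv_accuracy <- sapply(kernels, function(k) {
  fit <- svm(Species ~ ., data = iris, kernel = k, cross = 5)
  fit$tot.accuracy   # overall 5-fold accuracy, in percent
})

cv_accuracy
```

The kernel with the highest cross-validated accuracy is a reasonable starting point; its parameters (for example cost and gamma) can then be tuned in the same way.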

LAB: Kernel – Non linear classifier

  • Dataset : Software users/sw_user_profile.csv
  • How many variables are there in software user profile data?
  • Plot the active users against age and check whether the relation between age and “Active” status is linear or non-linear
  • Build an SVM model (model-1); make sure that there is no kernel or that the kernel is linear
  • For model-1, create the confusion matrix and find out the accuracy
  • Create a new variable by applying a polynomial (squared) transformation to age
  • Build an SVM model (model-2) with the new data mapped onto higher dimensions. Set the kernel to linear
  • For model-2, create the confusion matrix and find out the accuracy
  • Plot the SVM with results.
  • With the original data, re-create the model (model-3) and let R choose the default kernel function.
  • What is the accuracy of model-3?

Solution

sw_user_profile <- read.csv("C:AmritaDatavediSoftware Userssw_user_profile.csv")
head(sw_user_profile)
##   Id       Age Active
## 1  1  9.438867      0
## 2  2  8.614807      0
## 3  3  5.817555      0
## 4  4 10.329219      0
## 5  5  6.527926      0
## 6  6  8.231147      0
#How many variables are there in software user profile data?
names(sw_user_profile)
## [1] "Id"     "Age"    "Active"
#Plot the active users against age and check whether the relation between age and "Active" status is linear or non-linear
plot(sw_user_profile$Age,sw_user_profile$Id,  col=as.integer(sw_user_profile$Active+1))

#Build an SVM model(model-1), make sure that there is no kernel or the kernel is linear
library(e1071)
svm_model_nl <- svm(Active~Age,  type="C",  data=sw_user_profile)
summary(svm_model_nl)
## 
## Call:
## svm(formula = Active ~ Age, data = sw_user_profile, type = "C")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  1 
## 
## Number of Support Vectors:  16
## 
##  ( 8 8 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1
#Making the kernel to linear
svm_model_nl <- svm(Active~Age,  type="C", kernel="linear", data=sw_user_profile)
summary(svm_model_nl)
## 
## Call:
## svm(formula = Active ~ Age, data = sw_user_profile, type = "C", 
##     kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
##       gamma:  1 
## 
## Number of Support Vectors:  347
## 
##  ( 174 173 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1
#For model-1, create the confusion matrix and find out the accuracy
library(caret)
## Loading required package: lattice
Age_predicted<-predict(svm_model_nl)
#Recent caret versions need both arguments as factors
confusionMatrix(Age_predicted,factor(sw_user_profile$Active))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 317 173
##          1   0   0
##                                           
##                Accuracy : 0.6469          
##                  95% CI : (0.6028, 0.6893)
##     No Information Rate : 0.6469          
##     P-Value [Acc > NIR] : 0.5207          
##                                           
##                   Kappa : 0               
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.6469          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.6469          
##          Detection Rate : 0.6469          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 0               
## 
#Create a new variable by applying a polynomial (squared) transformation

###Standardizing the data to visualize the results clearly
sw_user_profile$age_nor<-(sw_user_profile$Age-mean(sw_user_profile$Age))/sd(sw_user_profile$Age)
plot(sw_user_profile$age_nor,sw_user_profile$Id,  col=as.integer(sw_user_profile$Active+1))

#Creating the new variable
sw_user_profile$new<-(sw_user_profile$age_nor)^2
plot(sw_user_profile$Age,sw_user_profile$new,  col=as.integer(sw_user_profile$Active+1))

#Build an SVM model(model-2), with the new data mapped on to higher dimensions, using a linear kernel
svm_model_2 <- svm(Active~Age+new,  type="C", kernel="linear", data=sw_user_profile)
summary(svm_model_2)
## 
## Call:
## svm(formula = Active ~ Age + new, data = sw_user_profile, type = "C", 
##     kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
##       gamma:  0.5 
## 
## Number of Support Vectors:  15
## 
##  ( 8 7 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1
#For model-2, create the confusion matrix and find out the accuracy
library(caret)
Age_predicted<-predict(svm_model_2)
confusionMatrix(Age_predicted,factor(sw_user_profile$Active))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 317   0
##          1   0 173
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9925, 1)
##     No Information Rate : 0.6469     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.6469     
##          Detection Rate : 0.6469     
##    Detection Prevalence : 0.6469     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : 0          
## 
#Plot the SVM with results.
plot(svm_model_2, sw_user_profile,new~Age ) 

#With the original data re-create the model(model-3) and let R choose the default kernel function.
library(e1071)
svm_model_3 <- svm(Active~Age,  type="C", data=sw_user_profile)
summary(svm_model_3)
## 
## Call:
## svm(formula = Active ~ Age, data = sw_user_profile, type = "C")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  1 
## 
## Number of Support Vectors:  16
## 
##  ( 8 8 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1
#What is the accuracy of model-3?
library(caret)
Age_predicted<-predict(svm_model_3)
confusionMatrix(Age_predicted,factor(sw_user_profile$Active))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 317   0
##          1   0 173
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9925, 1)
##     No Information Rate : 0.6469     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.6469     
##          Detection Rate : 0.6469     
##    Detection Prevalence : 0.6469     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : 0          
## 

Soft Margin Classification – Noisy data

Noisy data

  • What if there is some noise in the data?
  • What if the overall data can be classified perfectly except for a few points?
  • How do we find the hyperplane when a few points are on the wrong side?

Soft Margin Classification – Noisy data

  • The non-separable cases can be solved by allowing a slack variable (\(\xi\)) for each point on the wrong side.
  • We are allowing some errors while building the classifier
  • In the SVM optimization problem, we add an error (slack) term to the objective and then find the hyperplane
  • SVM will find the maximum margin classifier while allowing some minimum error due to noise.
  • Hard margin – classifying all data points correctly
  • Soft margin – allowing some error
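
The slack idea can be made concrete with the standard soft-margin formulation, in which each point gets a slack \(\xi_i = \max(0, 1 - y_i (w^T x_i + b))\) and the objective becomes \(\frac{1}{2}\|w\|^2 + C \sum_i \xi_i\). The classifier, the data and the value of C below are hypothetical; the sketch only evaluates the objective for a fixed line and does not solve the optimization.

```r
# Hypothetical linear classifier and toy data with one noisy point
w <- c(1, 1)
b <- -5
X <- matrix(c(1, 1,
              2, 2,
              4, 4,
              5, 5,
              2, 2.5), ncol = 2, byrow = TRUE)
# The last point is labelled +1 but lies on the negative side
y <- c(-1, -1, 1, 1, 1)

# Value of f(x) = w'x + b for every point
f_x <- as.vector(X %*% w + b)

# Slack is zero for points on or outside their margin plane and
# grows with the size of the violation
slack <- pmax(0, 1 - y * f_x)

# Soft-margin objective: margin term plus C times the total slack
C <- 1
objective <- 0.5 * sum(w^2) + C * sum(slack)
```

A larger C penalizes violations more heavily and moves the solution toward a hard margin. In e1071, this constant is set through the cost argument of svm().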

SVM Validation

  • SVM doesn’t give us a probability; it directly gives us the predicted classes
  • The usual validation methods, such as sensitivity, specificity, cross-validation, ROC, and AUC, can be used to validate an SVM model
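
Because the output is a class label, these measures can be computed straight from a confusion matrix. The sketch below recomputes accuracy, sensitivity and specificity from the model-1 counts reported in the kernel lab above, treating class 0 as the positive class as caret did there.

```r
# Counts taken from the model-1 confusion matrix in the kernel lab
# (rows = prediction, columns = reference; matrix() fills by column)
cm <- matrix(c(317, 0, 173, 0), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

# Share of all records that were classified correctly
accuracy <- sum(diag(cm)) / sum(cm)

# Class "0" is the positive class
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # true 0s predicted as 0
specificity <- cm["1", "1"] / sum(cm[, "1"])  # true 1s predicted as 1

round(c(accuracy = accuracy,
        sensitivity = sensitivity,
        specificity = specificity), 4)
```

The accuracy of about 0.65 looks acceptable on its own, but a specificity of 0 shows that the linear model never predicts class 1, which is why accuracy should not be the only measure.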

SVM Advantages & Disadvantages

SVM Advantages

  • SVMs are a good choice when we have little prior knowledge about the data
  • They work well even with unstructured and semi-structured data like text, images, and trees
  • The kernel trick is the real strength of SVM. With an appropriate kernel function, we can solve many complex problems
  • Unlike neural networks, SVM training is a convex optimization problem, so it does not get stuck in local optima
  • It scales relatively well to high-dimensional data
  • SVM models generalize well in practice; the risk of overfitting is lower in SVM

SVM Disadvantages

  • Choosing a “good” kernel function is not easy.
  • Long training time for large datasets
  • It is difficult to understand and interpret the final model, the variable weights, and the individual impact of each variable
  • Since the final model is not easy to inspect, we cannot make small calibrations to it, so it is tough to incorporate our business logic

SVM Application

  • Protein Structure Prediction
  • Intrusion Detection
  • Handwriting Recognition
  • Detecting Steganography in digital images
  • Breast Cancer Diagnosis

LAB: Digit Recognition using SVM

  • Take an image of a handwritten single digit, and determine what that digit is.
  • The data are normalized handwritten digits, automatically scanned from envelopes by the U.S. Postal Service. The original scanned digits are binary and of different sizes and orientations; the images here have been deslanted and size normalized, resulting in 16 x 16 grayscale images (Le Cun et al., 1990).
  • The data are in two gzipped files, and each line consists of the digit id (0-9) followed by the 256 grayscale values.
  • Build an SVM model that can be used as the digit recognizer
  • Use the test dataset to validate the true classification power of the model
  • What is the final accuracy of the model?

Solution

#Importing test and training data
digits_train <- read.table("C:AmritaDatavediDigit RecognizerUSPSzip.train.txt", quote="\"", comment.char="")
digits_test <- read.table("C:AmritaDatavediDigit RecognizerUSPSzip.test.txt", quote="\"", comment.char="")
dim(digits_train)
## [1] 7291  257
dim(digits_test)
## [1] 2007  257
#Lets see some images. 
for(i in 1:6 )
{
data_row<-digits_train[i,-1]
pixels = matrix(as.numeric(data_row),16,16,byrow=TRUE)
image(pixels, axes = FALSE)
title(main = paste("Label is" , digits_train[i,1]), font.main = 4)
}

#Are there any missing values?

sum(is.na(digits_train))
## [1] 0
sum(is.na(digits_test))
## [1] 0
#The first variable is label
table(digits_train$V1)
## 
##    0    1    2    3    4    5    6    7    8    9 
## 1194 1005  731  658  652  556  664  645  542  644
table(digits_test$V1)
## 
##   0   1   2   3   4   5   6   7   8   9 
## 359 264 198 166 200 160 170 147 166 177
########SVM Model Building 
library(e1071)

#Lets keep an eye on runtime
pc <- proc.time()

#Verify the code with limited data 5000 rows
number.svm <- svm(V1 ~. , type="C", data = digits_train[1:5000,])

proc.time() - pc
##    user  system elapsed 
##   38.25    0.14   39.37
summary(number.svm)
## 
## Call:
## svm(formula = V1 ~ ., data = digits_train[1:5000, ], type = "C")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.00390625 
## 
## Number of Support Vectors:  2028
## 
##  ( 181 232 245 189 195 45 220 206 305 210 )
## 
## 
## Number of Classes:  10 
## 
## Levels: 
##  0 1 2 3 4 5 6 7 8 9
#Confusion Matrix
library(caret)
label_predicted<-predict(number.svm, type = "class")
#Recent caret versions need both arguments as factors
confusionMatrix(label_predicted,factor(digits_train[1:5000, 1]))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4   5   6   7   8   9
##          0 847   0   0   0   0   0   1   0   0   0
##          1   0 674   1   0   1   0   1   0   0   0
##          2   0   0 484   0   0   1   0   0   0   0
##          3   0   0   1 392   0   0   0   0   1   1
##          4   0   0   2   0 429   0   0   1   0   0
##          5   0   0   0   1   0 350   1   0   2   0
##          6   0   0   0   0   1   1 475   0   0   0
##          7   0   0   0   0   0   0   0 459   1   2
##          8   0   0   0   2   0   0   0   0 383   0
##          9   0   0   0   0   3   0   0   1   0 481
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9948          
##                  95% CI : (0.9924, 0.9966)
##     No Information Rate : 0.1694          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9942          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity            1.0000   1.0000   0.9918   0.9924   0.9885   0.9943
## Specificity            0.9998   0.9993   0.9998   0.9993   0.9993   0.9991
## Pos Pred Value         0.9988   0.9956   0.9979   0.9924   0.9931   0.9887
## Neg Pred Value         1.0000   1.0000   0.9991   0.9993   0.9989   0.9996
## Prevalence             0.1694   0.1348   0.0976   0.0790   0.0868   0.0704
## Detection Rate         0.1694   0.1348   0.0968   0.0784   0.0858   0.0700
## Detection Prevalence   0.1696   0.1354   0.0970   0.0790   0.0864   0.0708
## Balanced Accuracy      0.9999   0.9997   0.9958   0.9959   0.9939   0.9967
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity            0.9937   0.9957   0.9897   0.9938
## Specificity            0.9996   0.9993   0.9996   0.9991
## Pos Pred Value         0.9958   0.9935   0.9948   0.9918
## Neg Pred Value         0.9993   0.9996   0.9991   0.9993
## Prevalence             0.0956   0.0922   0.0774   0.0968
## Detection Rate         0.0950   0.0918   0.0766   0.0962
## Detection Prevalence   0.0954   0.0924   0.0770   0.0970
## Balanced Accuracy      0.9966   0.9975   0.9946   0.9965
table(label_predicted,digits_train[1:5000, 1])
##                
## label_predicted   0   1   2   3   4   5   6   7   8   9
##               0 847   0   0   0   0   0   1   0   0   0
##               1   0 674   1   0   1   0   1   0   0   0
##               2   0   0 484   0   0   1   0   0   0   0
##               3   0   0   1 392   0   0   0   0   1   1
##               4   0   0   2   0 429   0   0   1   0   0
##               5   0   0   0   1   0 350   1   0   2   0
##               6   0   0   0   0   1   1 475   0   0   0
##               7   0   0   0   0   0   0   0 459   1   2
##               8   0   0   0   2   0   0   0   0 383   0
##               9   0   0   0   0   3   0   0   1   0 481
###Validation with the held-out test data
test_label_predicted<-predict(number.svm, newdata =digits_test[,-1] , type = "class")
confusionMatrix(test_label_predicted,factor(digits_test[,1]))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4   5   6   7   8   9
##          0 351   0   3   0   0   3   5   0   3   0
##          1   0 253   0   0   1   0   0   0   0   0
##          2   6   2 182   6   5   4   4   3   4   0
##          3   0   0   4 144   0   3   0   0   4   0
##          4   1   5   4   0 185   1   2   5   0   4
##          5   0   0   0  11   2 145   1   0   5   1
##          6   0   3   1   0   3   0 158   0   1   0
##          7   0   0   1   1   1   0   0 137   0   1
##          8   1   0   3   3   0   1   0   0 146   2
##          9   0   1   0   1   3   3   0   2   3 169
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9317          
##                  95% CI : (0.9198, 0.9424)
##     No Information Rate : 0.1789          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9233          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity            0.9777   0.9583  0.91919  0.86747  0.92500  0.90625
## Specificity            0.9915   0.9994  0.98121  0.99402  0.98783  0.98917
## Pos Pred Value         0.9616   0.9961  0.84259  0.92903  0.89372  0.87879
## Neg Pred Value         0.9951   0.9937  0.99107  0.98812  0.99167  0.99186
## Prevalence             0.1789   0.1315  0.09865  0.08271  0.09965  0.07972
## Detection Rate         0.1749   0.1261  0.09068  0.07175  0.09218  0.07225
## Detection Prevalence   0.1819   0.1266  0.10762  0.07723  0.10314  0.08221
## Balanced Accuracy      0.9846   0.9789  0.95020  0.93075  0.95641  0.94771
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           0.92941  0.93197  0.87952  0.95480
## Specificity           0.99565  0.99785  0.99457  0.99290
## Pos Pred Value        0.95181  0.97163  0.93590  0.92857
## Neg Pred Value        0.99348  0.99464  0.98920  0.99562
## Prevalence            0.08470  0.07324  0.08271  0.08819
## Detection Rate        0.07872  0.06826  0.07275  0.08421
## Detection Prevalence  0.08271  0.07025  0.07773  0.09068
## Balanced Accuracy     0.96253  0.96491  0.93704  0.97385
#####Model on Full Data 
pc <- proc.time()
number.svm <- svm(V1 ~. , type="C", data = digits_train)
proc.time() - pc
##    user  system elapsed 
##   76.94    0.26   87.24
summary(number.svm)
## 
## Call:
## svm(formula = V1 ~ ., data = digits_train, type = "C")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.00390625 
## 
## Number of Support Vectors:  2606
## 
##  ( 213 326 319 235 285 63 256 262 401 246 )
## 
## 
## Number of Classes:  10 
## 
## Levels: 
##  0 1 2 3 4 5 6 7 8 9
#Confusion Matrix
library(caret)
label_predicted<-predict(number.svm, type = "class")
confusionMatrix(label_predicted,factor(digits_train[,1]))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2    3    4    5    6    7    8    9
##          0 1194    0    0    0    0    0    2    0    0    0
##          1    0 1005    1    1    2    0    1    0    1    0
##          2    0    0  724    0    0    1    0    0    0    0
##          3    0    0    2  651    0    0    0    0    0    1
##          4    0    0    4    0  648    1    0    2    1    1
##          5    0    0    0    3    0  553    0    0    2    0
##          6    0    0    0    0    0    1  661    0    0    0
##          7    0    0    0    0    0    0    0  641    2    3
##          8    0    0    0    3    0    0    0    0  536    0
##          9    0    0    0    0    2    0    0    2    0  639
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9947          
##                  95% CI : (0.9927, 0.9962)
##     No Information Rate : 0.1638          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.994           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity            1.0000   1.0000  0.99042  0.98936  0.99387  0.99460
## Specificity            0.9997   0.9990  0.99985  0.99955  0.99864  0.99926
## Pos Pred Value         0.9983   0.9941  0.99862  0.99541  0.98630  0.99104
## Neg Pred Value         1.0000   1.0000  0.99893  0.99895  0.99940  0.99955
## Prevalence             0.1638   0.1378  0.10026  0.09025  0.08943  0.07626
## Detection Rate         0.1638   0.1378  0.09930  0.08929  0.08888  0.07585
## Detection Prevalence   0.1640   0.1387  0.09944  0.08970  0.09011  0.07653
## Balanced Accuracy      0.9998   0.9995  0.99514  0.99445  0.99625  0.99693
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           0.99548  0.99380  0.98893  0.99224
## Specificity           0.99985  0.99925  0.99956  0.99940
## Pos Pred Value        0.99849  0.99226  0.99443  0.99378
## Neg Pred Value        0.99955  0.99940  0.99911  0.99925
## Prevalence            0.09107  0.08847  0.07434  0.08833
## Detection Rate        0.09066  0.08792  0.07352  0.08764
## Detection Prevalence  0.09080  0.08860  0.07393  0.08819
## Balanced Accuracy     0.99767  0.99652  0.99424  0.99582
# Validation on the held-out test data
test_label_predicted <- predict(number.svm, newdata = digits_test[,-1])
confusionMatrix(test_label_predicted, as.factor(digits_test[,1]))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4   5   6   7   8   9
##          0 351   0   2   0   0   3   4   0   4   0
##          1   0 253   0   0   1   0   0   0   0   0
##          2   6   1 183   5   3   2   4   2   2   0
##          3   0   0   4 146   0   3   0   0   3   0
##          4   1   5   3   0 186   1   2   5   0   4
##          5   0   1   0  11   1 147   1   0   2   1
##          6   0   3   1   0   2   0 158   0   1   0
##          7   0   1   1   1   3   0   0 138   0   0
##          8   1   0   4   3   1   1   1   0 151   2
##          9   0   0   0   0   3   3   0   2   3 170
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9382          
##                  95% CI : (0.9268, 0.9484)
##     No Information Rate : 0.1789          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9306          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity            0.9777   0.9583  0.92424  0.87952  0.93000  0.91875
## Specificity            0.9921   0.9994  0.98618  0.99457  0.98838  0.99080
## Pos Pred Value         0.9643   0.9961  0.87981  0.93590  0.89855  0.89634
## Neg Pred Value         0.9951   0.9937  0.99166  0.98920  0.99222  0.99295
## Prevalence             0.1789   0.1315  0.09865  0.08271  0.09965  0.07972
## Detection Rate         0.1749   0.1261  0.09118  0.07275  0.09268  0.07324
## Detection Prevalence   0.1814   0.1266  0.10364  0.07773  0.10314  0.08171
## Balanced Accuracy      0.9849   0.9789  0.95521  0.93704  0.95919  0.95477
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           0.92941  0.93878  0.90964  0.96045
## Specificity           0.99619  0.99677  0.99294  0.99399
## Pos Pred Value        0.95758  0.95833  0.92073  0.93923
## Neg Pred Value        0.99349  0.99517  0.99186  0.99617
## Prevalence            0.08470  0.07324  0.08271  0.08819
## Detection Rate        0.07872  0.06876  0.07524  0.08470
## Detection Prevalence  0.08221  0.07175  0.08171  0.09018
## Balanced Accuracy     0.96280  0.96777  0.95129  0.97722
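The headline statistics can be reproduced by hand from the test confusion matrix, which helps in reading the report. The sketch below retypes the matrix printed above (rows are predictions, columns are the true digits); the object names are illustrative and not part of the original code.

```r
# Test-set confusion matrix copied from the output above
cm <- matrix(c(
  351,   0,   2,   0,   0,   3,   4,   0,   4,   0,
    0, 253,   0,   0,   1,   0,   0,   0,   0,   0,
    6,   1, 183,   5,   3,   2,   4,   2,   2,   0,
    0,   0,   4, 146,   0,   3,   0,   0,   3,   0,
    1,   5,   3,   0, 186,   1,   2,   5,   0,   4,
    0,   1,   0,  11,   1, 147,   1,   0,   2,   1,
    0,   3,   1,   0,   2,   0, 158,   0,   1,   0,
    0,   1,   1,   1,   3,   0,   0, 138,   0,   0,
    1,   0,   4,   3,   1,   1,   1,   0, 151,   2,
    0,   0,   0,   0,   3,   3,   0,   2,   3, 170
), nrow = 10, byrow = TRUE,
dimnames = list(Prediction = as.character(0:9), Reference = as.character(0:9)))

n_total   <- sum(cm)         # 2007 test images
n_correct <- sum(diag(cm))   # 1883 on the diagonal
n_wrong   <- n_total - n_correct   # 124 misclassified images
accuracy  <- n_correct / n_total
round(accuracy, 4)           # 0.9382, as reported

# Sensitivity per class = correct predictions / actual images of that digit
sensitivity <- diag(cm) / colSums(cm)
round(sensitivity, 5)

# Largest single confusion: ignore the diagonal and take the maximum
off_diag <- cm
diag(off_diag) <- 0
max(off_diag)                # 11: digit 3 predicted as 5
```

Test accuracy (0.9382) is noticeably lower than training accuracy (0.9947), so the test figure is the more honest estimate of how the model behaves on unseen digits.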
# Let's look at some predictions
digits_test$predicted <- test_label_predicted

for (i in 1:10) {
  # Drop the label (first) and prediction (last) columns to keep only the pixel values
  data_row <- digits_test[i, c(-1, -ncol(digits_test))]
  pixels <- matrix(as.numeric(data_row), 16, 16, byrow = TRUE)
  image(pixels, axes = FALSE)
  title(main = paste("Label is", digits_test[i, 1], "  Prediction is", digits_test[i, ncol(digits_test)]))
}

# Let's look at some of the wrongly predicted images
wrong_predictions <- digits_test[digits_test$predicted != digits_test$V1, ]
nrow(wrong_predictions)
## [1] 124
for (i in 1:10) {
  # Drop the label (first) and prediction (last) columns to keep only the pixel values
  data_row <- wrong_predictions[i, c(-1, -ncol(wrong_predictions))]
  pixels <- matrix(as.numeric(data_row), 16, 16, byrow = TRUE)
  image(pixels, axes = FALSE)
  title(main = paste("Label is", wrong_predictions[i, 1], "  Prediction is", wrong_predictions[i, ncol(wrong_predictions)]))
}

Conclusion

  • Many software tools implement SVMs; this walkthrough used the svm() function in R
  • SVMs work well for text classification, where the data is typically high-dimensional and sparse
  • SVMs find the maximum-margin linear separator. The kernel trick extends them to non-linear decision boundaries
  • Choosing an appropriate kernel is key to a good SVM model, and it is not easy: no single kernel works best for every dataset
  • We need to be patient while building SVMs on large datasets, because training takes a long time. The full-data digits model above took about 87 seconds to train
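One practical way to approach the kernel choice is to compare cross-validated accuracy across kernels. The sketch below is only an illustration: it assumes the e1071 package is installed and uses the built-in iris data rather than the digits data so that it runs quickly. The kernel list and the 5-fold setting are arbitrary choices.

```r
library(e1071)

set.seed(42)  # cross-validation folds are assigned at random

kernels <- c("linear", "polynomial", "radial", "sigmoid")

# Fit one SVM per kernel with 5-fold cross-validation (cross = 5)
cv_accuracy <- sapply(kernels, function(k) {
  fit <- svm(Species ~ ., data = iris, kernel = k, cross = 5)
  fit$tot.accuracy  # overall cross-validated accuracy, in percent
})

# Named vector: one cross-validated accuracy per kernel
print(round(cv_accuracy, 1))
```

On a large dataset such as the digits data, the same comparison would refit the model several times per kernel, so it is usually worth running it on a sample first.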

