• No products in the cart.

203.6.8 Digit Recognition using SVM

LAB: Digit Recognition using SVM

In previous section, we studied about Soft Margin Classification – Noisy Data and Validation

  • Take an image of a handwritten single digit, and determine what that digit is.
  • Normalized handwritten digits, automatically scanned from envelopes by the U.S. Postal Service. The original scanned digits are binary and of different sizes and orientations; the images here have been de slanted and size normalized, resultingin 16 x 16 grayscale images (Le Cun et al., 1990).
  • The data are in two zipped files, and each line consists of the digitid (0-9) followed by the 256 grayscale values.
  • Build an SVM model that can be used as the digit recognizer
  • Use the test dataset to validate the true classification power of the model
  • What is the final accuracy of the model?

Solution

#Importing test and training data
digits_train <- read.table("C:\\Amrita\\Datavedi\\Digit Recognizer\\USPS\\zip.train.txt", quote="\"", comment.char="")
digits_test <- read.table("C:\\Amrita\\Datavedi\\Digit Recognizer\\USPS\\zip.test.txt", quote="\"", comment.char="")
dim(digits_train)
## [1] 7291  257
dim(digits_test)
## [1] 2007  257
#Lets see some images. 
for(i in 1:6 )
{
data_row<-digits_train[i,-1]
pixels = matrix(as.numeric(data_row),16,16,byrow=TRUE)
image(pixels, axes = FALSE)
title(main = paste("Label is" , digits_train[i,1]), font.main = 4)
}

#Are there any missing values?

sum(is.na(digits_train))
## [1] 0
sum(is.na(digits_test))
## [1] 0
#The first variable is label
table(digits_train$V1)
## 
##    0    1    2    3    4    5    6    7    8    9 
## 1194 1005  731  658  652  556  664  645  542  644
table(digits_test$V1)
## 
##   0   1   2   3   4   5   6   7   8   9 
## 359 264 198 166 200 160 170 147 166 177
########SVM Model Building 
library(e1071)

#Lets keep an eye on runtime
pc <- proc.time()

#Verify the code with limited data 5000 rows
number.svm <- svm(V1 ~. , type="C", data = digits_train[1:5000,])

proc.time() - pc
##    user  system elapsed 
##   38.25    0.14   39.37
summary(number.svm)
## 
## Call:
## svm(formula = V1 ~ ., data = digits_train[1:5000, ], type = "C")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.00390625 
## 
## Number of Support Vectors:  2028
## 
##  ( 181 232 245 189 195 45 220 206 305 210 )
## 
## 
## Number of Classes:  10 
## 
## Levels: 
##  0 1 2 3 4 5 6 7 8 9
#Confusion Matrix
library(caret)
label_predicted<-predict(number.svm, type = "class")
confusionMatrix(label_predicted,digits_train[1:5000, 1])
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4   5   6   7   8   9
##          0 847   0   0   0   0   0   1   0   0   0
##          1   0 674   1   0   1   0   1   0   0   0
##          2   0   0 484   0   0   1   0   0   0   0
##          3   0   0   1 392   0   0   0   0   1   1
##          4   0   0   2   0 429   0   0   1   0   0
##          5   0   0   0   1   0 350   1   0   2   0
##          6   0   0   0   0   1   1 475   0   0   0
##          7   0   0   0   0   0   0   0 459   1   2
##          8   0   0   0   2   0   0   0   0 383   0
##          9   0   0   0   0   3   0   0   1   0 481
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9948          
##                  95% CI : (0.9924, 0.9966)
##     No Information Rate : 0.1694          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9942          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity            1.0000   1.0000   0.9918   0.9924   0.9885   0.9943
## Specificity            0.9998   0.9993   0.9998   0.9993   0.9993   0.9991
## Pos Pred Value         0.9988   0.9956   0.9979   0.9924   0.9931   0.9887
## Neg Pred Value         1.0000   1.0000   0.9991   0.9993   0.9989   0.9996
## Prevalence             0.1694   0.1348   0.0976   0.0790   0.0868   0.0704
## Detection Rate         0.1694   0.1348   0.0968   0.0784   0.0858   0.0700
## Detection Prevalence   0.1696   0.1354   0.0970   0.0790   0.0864   0.0708
## Balanced Accuracy      0.9999   0.9997   0.9958   0.9959   0.9939   0.9967
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity            0.9937   0.9957   0.9897   0.9938
## Specificity            0.9996   0.9993   0.9996   0.9991
## Pos Pred Value         0.9958   0.9935   0.9948   0.9918
## Neg Pred Value         0.9993   0.9996   0.9991   0.9993
## Prevalence             0.0956   0.0922   0.0774   0.0968
## Detection Rate         0.0950   0.0918   0.0766   0.0962
## Detection Prevalence   0.0954   0.0924   0.0770   0.0970
## Balanced Accuracy      0.9966   0.9975   0.9946   0.9965
table(label_predicted,digits_train[1:5000, 1])
##                
## label_predicted   0   1   2   3   4   5   6   7   8   9
##               0 847   0   0   0   0   0   1   0   0   0
##               1   0 674   1   0   1   0   1   0   0   0
##               2   0   0 484   0   0   1   0   0   0   0
##               3   0   0   1 392   0   0   0   0   1   1
##               4   0   0   2   0 429   0   0   1   0   0
##               5   0   0   0   1   0 350   1   0   2   0
##               6   0   0   0   0   1   1 475   0   0   0
##               7   0   0   0   0   0   0   0 459   1   2
##               8   0   0   0   2   0   0   0   0 383   0
##               9   0   0   0   0   3   0   0   1   0 481
###Out of time validation with test data
test_label_predicted<-predict(number.svm, newdata =digits_test[,-1] , type = "class")
confusionMatrix(test_label_predicted,digits_test[,1])
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4   5   6   7   8   9
##          0 351   0   3   0   0   3   5   0   3   0
##          1   0 253   0   0   1   0   0   0   0   0
##          2   6   2 182   6   5   4   4   3   4   0
##          3   0   0   4 144   0   3   0   0   4   0
##          4   1   5   4   0 185   1   2   5   0   4
##          5   0   0   0  11   2 145   1   0   5   1
##          6   0   3   1   0   3   0 158   0   1   0
##          7   0   0   1   1   1   0   0 137   0   1
##          8   1   0   3   3   0   1   0   0 146   2
##          9   0   1   0   1   3   3   0   2   3 169
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9317          
##                  95% CI : (0.9198, 0.9424)
##     No Information Rate : 0.1789          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9233          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity            0.9777   0.9583  0.91919  0.86747  0.92500  0.90625
## Specificity            0.9915   0.9994  0.98121  0.99402  0.98783  0.98917
## Pos Pred Value         0.9616   0.9961  0.84259  0.92903  0.89372  0.87879
## Neg Pred Value         0.9951   0.9937  0.99107  0.98812  0.99167  0.99186
## Prevalence             0.1789   0.1315  0.09865  0.08271  0.09965  0.07972
## Detection Rate         0.1749   0.1261  0.09068  0.07175  0.09218  0.07225
## Detection Prevalence   0.1819   0.1266  0.10762  0.07723  0.10314  0.08221
## Balanced Accuracy      0.9846   0.9789  0.95020  0.93075  0.95641  0.94771
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           0.92941  0.93197  0.87952  0.95480
## Specificity           0.99565  0.99785  0.99457  0.99290
## Pos Pred Value        0.95181  0.97163  0.93590  0.92857
## Neg Pred Value        0.99348  0.99464  0.98920  0.99562
## Prevalence            0.08470  0.07324  0.08271  0.08819
## Detection Rate        0.07872  0.06826  0.07275  0.08421
## Detection Prevalence  0.08271  0.07025  0.07773  0.09068
## Balanced Accuracy     0.96253  0.96491  0.93704  0.97385
#####Model on Full Data 
pc <- proc.time()
number.svm <- svm(V1 ~. , type="C", data = digits_train)
proc.time() - pc
##    user  system elapsed 
##   76.94    0.26   87.24
summary(number.svm)
## 
## Call:
## svm(formula = V1 ~ ., data = digits_train, type = "C")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.00390625 
## 
## Number of Support Vectors:  2606
## 
##  ( 213 326 319 235 285 63 256 262 401 246 )
## 
## 
## Number of Classes:  10 
## 
## Levels: 
##  0 1 2 3 4 5 6 7 8 9
#Confusion Matrix
library(caret)
label_predicted<-predict(number.svm, type = "class")
confusionMatrix(label_predicted,digits_train[,1])
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2    3    4    5    6    7    8    9
##          0 1194    0    0    0    0    0    2    0    0    0
##          1    0 1005    1    1    2    0    1    0    1    0
##          2    0    0  724    0    0    1    0    0    0    0
##          3    0    0    2  651    0    0    0    0    0    1
##          4    0    0    4    0  648    1    0    2    1    1
##          5    0    0    0    3    0  553    0    0    2    0
##          6    0    0    0    0    0    1  661    0    0    0
##          7    0    0    0    0    0    0    0  641    2    3
##          8    0    0    0    3    0    0    0    0  536    0
##          9    0    0    0    0    2    0    0    2    0  639
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9947          
##                  95% CI : (0.9927, 0.9962)
##     No Information Rate : 0.1638          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.994           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity            1.0000   1.0000  0.99042  0.98936  0.99387  0.99460
## Specificity            0.9997   0.9990  0.99985  0.99955  0.99864  0.99926
## Pos Pred Value         0.9983   0.9941  0.99862  0.99541  0.98630  0.99104
## Neg Pred Value         1.0000   1.0000  0.99893  0.99895  0.99940  0.99955
## Prevalence             0.1638   0.1378  0.10026  0.09025  0.08943  0.07626
## Detection Rate         0.1638   0.1378  0.09930  0.08929  0.08888  0.07585
## Detection Prevalence   0.1640   0.1387  0.09944  0.08970  0.09011  0.07653
## Balanced Accuracy      0.9998   0.9995  0.99514  0.99445  0.99625  0.99693
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           0.99548  0.99380  0.98893  0.99224
## Specificity           0.99985  0.99925  0.99956  0.99940
## Pos Pred Value        0.99849  0.99226  0.99443  0.99378
## Neg Pred Value        0.99955  0.99940  0.99911  0.99925
## Prevalence            0.09107  0.08847  0.07434  0.08833
## Detection Rate        0.09066  0.08792  0.07352  0.08764
## Detection Prevalence  0.09080  0.08860  0.07393  0.08819
## Balanced Accuracy     0.99767  0.99652  0.99424  0.99582
###Out of time validation with test data
test_label_predicted<-predict(number.svm, newdata =digits_test[,-1] , type = "class")
confusionMatrix(test_label_predicted,digits_test[,1])
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4   5   6   7   8   9
##          0 351   0   2   0   0   3   4   0   4   0
##          1   0 253   0   0   1   0   0   0   0   0
##          2   6   1 183   5   3   2   4   2   2   0
##          3   0   0   4 146   0   3   0   0   3   0
##          4   1   5   3   0 186   1   2   5   0   4
##          5   0   1   0  11   1 147   1   0   2   1
##          6   0   3   1   0   2   0 158   0   1   0
##          7   0   1   1   1   3   0   0 138   0   0
##          8   1   0   4   3   1   1   1   0 151   2
##          9   0   0   0   0   3   3   0   2   3 170
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9382          
##                  95% CI : (0.9268, 0.9484)
##     No Information Rate : 0.1789          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9306          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity            0.9777   0.9583  0.92424  0.87952  0.93000  0.91875
## Specificity            0.9921   0.9994  0.98618  0.99457  0.98838  0.99080
## Pos Pred Value         0.9643   0.9961  0.87981  0.93590  0.89855  0.89634
## Neg Pred Value         0.9951   0.9937  0.99166  0.98920  0.99222  0.99295
## Prevalence             0.1789   0.1315  0.09865  0.08271  0.09965  0.07972
## Detection Rate         0.1749   0.1261  0.09118  0.07275  0.09268  0.07324
## Detection Prevalence   0.1814   0.1266  0.10364  0.07773  0.10314  0.08171
## Balanced Accuracy      0.9849   0.9789  0.95521  0.93704  0.95919  0.95477
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           0.92941  0.93878  0.90964  0.96045
## Specificity           0.99619  0.99677  0.99294  0.99399
## Pos Pred Value        0.95758  0.95833  0.92073  0.93923
## Neg Pred Value        0.99349  0.99517  0.99186  0.99617
## Prevalence            0.08470  0.07324  0.08271  0.08819
## Detection Rate        0.07872  0.06876  0.07524  0.08470
## Detection Prevalence  0.08221  0.07175  0.08171  0.09018
## Balanced Accuracy     0.96280  0.96777  0.95129  0.97722
#Lets see some predictions. 
digits_test$predicted<-test_label_predicted

for(i in 1:10 )
{
data_row<-digits_test[i,c(-1,-ncol(digits_test))]
pixels = matrix(as.numeric(data_row),16,16,byrow=TRUE)
image(pixels, axes = FALSE)
title(main = paste("Label is" , digits_test[i,1] ,"  Prediction is" , digits_test[i,ncol(digits_test)]))
}

#Lets see some errors in predictions images. 
# Wrong predictions
digits_test$predicted<-test_label_predicted
wrong_predictions<-digits_test[!(digits_test$predicted==digits_test$V1),]
nrow(wrong_predictions)
## [1] 124
for(i in 1:10 )
{
data_row<-wrong_predictions[i,c(-1,-ncol(wrong_predictions))]
pixels = matrix(as.numeric(data_row),16,16,byrow=TRUE)
image(pixels, axes = FALSE)
title(main = paste("Label is" , wrong_predictions[i,1] ,"  Prediction is" , wrong_predictions[i,ncol(wrong_predictions)]))
}

Out[49]:

116

We can see out of 2007 images only 116 were wrongly identified by our SVM model.
With this post we will be ending the series here. In next series we will cover Random Forest and Boosting.

The next post is about SVM conclusion.

DV Analytics

DV Data & Analytics is a leading data science,  Cyber Security training and consulting firm, led by industry experts. We are aiming to train and prepare resources to acquire the most in-demand data science job opportunities in India and abroad.

Bangalore Center

DV Data & Analytics Bangalore Private Limited
#52, 2nd Floor:
Malleshpalya Maruthinagar Bengaluru.
Bangalore 560075
India
(+91) 9019 030 033 (+91) 8095 881 188
Email: info@dvanalyticsmds.com

Bhubneshwar Center

DV Data & Analytics Private Limited Bhubaneswar
Plot No A/7 :
Adjacent to Maharaja Cine Complex, Bhoinagar, Acharya Vihar
Bhubaneswar 751022
(+91) 8095 881 188 (+91) 8249 430 414
Email: info@dvanalyticsmds.com

top
© 2020. All Rights Reserved.