
Support Vector Machines in Python


Contents

  • Introduction
  • The decision boundary with largest margin
  • SVM- The large margin classifier
  • SVM algorithm
  • The kernel trick
  • Building SVM model
  • Conclusion

Introduction

SVM is another black-box method in machine learning. Compared to other machine learning algorithms, SVM takes a very different approach. SVM was first introduced by Vapnik and Chervonenkis, and initially it was mostly compared with neural networks, which have issues with overfitting and computation time. The theory behind SVM is not very straightforward, but with some effort we can understand it. The in-depth theory and mathematics of SVM require a good knowledge of vector algebra and numerical optimization; here we will try to learn the basic principle, philosophy and implementation of SVM. The SVM algorithm has good generalization ability: there are many applications where SVM works better than neural networks, and most of the time SVM takes less computation time than neural networks. To understand the SVM algorithm, we will start with the classifier.

Classifier

A classifier is simply a boundary that separates two classes. A good classifier is one that generalizes well, i.e., works well on both training and testing data. A classifier need not always be a straight line, and it need not be unique: there can be many classifiers that do a good job of separating the classes. Out of these multiple classifiers, how do we choose the best one? The answer lies in the "margin of the classifier".

Margin of Classifier

 

From the above picture, we observe that there are two classifiers. If you look at the margin from each classifier to the nearest data points, classifier 1 has a larger margin than classifier 2. The classifier with the maximum margin will generalize well. But why? Let us see this through an example.

 

From the above picture we observe two new data points that sit at almost the same locations in the two graphs. Classifier 1 is still doing a good job of separating the blue and red data points, whereas classifier 2 fails in both cases: the blue triangle is classified as "RED" and the red circular point is classified as "BLUE", because classifier 2 was tuned only to the training data. On new (testing) data points, classifier 1 with the larger margin works better than classifier 2 with the smaller margin. So, the decision boundary or classifier with the maximum margin is the best classifier.

 

Many Classifiers

The Margin of Classifier

Out of all the classifiers, the one that has maximum margin will generalize well. But why?

The Best Decision Boundary

 

Imagine two more data points. The classifier with maximum margin will be able to classify them more accurately.

The Maximum Margin Classifier

 

From the above picture we observe two classifiers, m1 and m2. m1 has the larger margin and is closest to data points a, b and c, whereas m2 has a smaller margin and is closest to data points a, c and d. So the better classifier of the two is m1, because it has the larger margin. For a given dataset, the classifier with the maximum margin keeps the training data well separated and tends to give the best testing accuracy, so we generally prefer the classifier with the maximum margin.

LAB: Simple Classifiers

  • Dataset:Fraud Transaction/Transactions_sample.csv
  • Draw a classification graph that shows all the classes
  • Build a logistic regression classifier
  • Draw the classifier on the data plot

Solution

In [1]:
#Importing the dataset:
import pandas as pd
Transactions_sample = pd.read_csv("Datasets/Fraud Transaction/Transactions_sample.csv")
Transactions_sample.head(6)
Out[1]:
id Total_Amount Tr_Count_week Fraud_id
0 16078 7294.60 4.79 0
1 41365 7659.53 2.45 0
2 11666 8259.29 10.77 0
3 11824 11630.25 15.29 1
4 36414 12286.63 22.18 1
5 90 12783.34 16.34 1
In [2]:
#Name of the columns 
Transactions_sample.columns
Out[2]:
Index(['id', 'Total_Amount', 'Tr_Count_week', 'Fraud_id'], dtype='object')
In [3]:
#The clasification graph distinguishing the two classes with colors or shapes.
import matplotlib.pyplot as plt
%matplotlib inline
fig = plt.figure()
ax1 = fig.add_subplot(111)

ax1.scatter(Transactions_sample.Total_Amount[Transactions_sample.Fraud_id==0],Transactions_sample.Tr_Count_week[Transactions_sample.Fraud_id==0], s=10, c='b', marker="o", label='Fraud_id=0')
ax1.scatter(Transactions_sample.Total_Amount[Transactions_sample.Fraud_id==1],Transactions_sample.Tr_Count_week[Transactions_sample.Fraud_id==1], s=10, c='r', marker="+", label='Fraud_id=1')
plt.legend(loc='upper left');
plt.show()
In [4]:
#build a logistic regression model
###Logistic Regerssion model1
import statsmodels.formula.api as sm
model1 = sm.logit(formula='Fraud_id ~ Total_Amount+Tr_Count_week', data=Transactions_sample)
fitted1 = model1.fit()
fitted1.summary()
Optimization terminated successfully.
         Current function value: 0.040114
         Iterations 10
Out[4]:
Logit Regression Results
Dep. Variable: Fraud_id No. Observations: 210
Model: Logit Df Residuals: 207
Method: MLE Df Model: 2
Date: Tue, 14 Feb 2017 Pseudo R-squ.: 0.9421
Time: 17:20:04 Log-Likelihood: -8.4239
converged: True LL-Null: -145.55
LLR p-value: 2.795e-60
coef std err z P>|z| [95.0% Conf. Int.]
Intercept -26.1481 7.817 -3.345 0.001 -41.469 -10.827
Total_Amount 0.0025 0.001 2.386 0.017 0.000 0.005
Tr_Count_week 0.1089 0.270 0.403 0.687 -0.421 0.638
In [5]:
# Getting the slope and intercept of the decision boundary from the fitted coefficients
# Decision boundary: Intercept + b1*Total_Amount + b2*Tr_Count_week = 0
coef=fitted1.params
print(coef.round(4))

slope1=coef['Total_Amount']/(-coef['Tr_Count_week'])
intercept1=coef['Intercept']/(-coef['Tr_Count_week'])
Intercept       -26.1481
Total_Amount      0.0025
Tr_Count_week     0.1089
dtype: float64
In [6]:
import matplotlib.pyplot as plt

fig = plt.figure()
ax1 = fig.add_subplot(111)

ax1.scatter(Transactions_sample.Total_Amount[Transactions_sample.Fraud_id==0],Transactions_sample.Tr_Count_week[Transactions_sample.Fraud_id==0], s=30, c='b', marker="o", label='Fraud_id 0')
ax1.scatter(Transactions_sample.Total_Amount[Transactions_sample.Fraud_id==1],Transactions_sample.Tr_Count_week[Transactions_sample.Fraud_id==1], s=30, c='r', marker="+", label='Fraud_id 1')

plt.xlim(min(Transactions_sample.Total_Amount), max(Transactions_sample.Total_Amount))
plt.ylim(min(Transactions_sample.Tr_Count_week), max(Transactions_sample.Tr_Count_week))

plt.legend(loc='upper left');

x_min, x_max = ax1.get_xlim()
ax1.plot([x_min, x_max], [x_min*slope1+intercept1, x_max*slope1+intercept1])
plt.show()
In [7]:
#Accuracy of the model
#Creating the confusion matrix
predicted_values=fitted1.predict(Transactions_sample[["Total_Amount"]+["Tr_Count_week"]])
print('Predicted Values: ', predicted_values[1:10])

threshold=0.5

import numpy as np
predicted_class=np.zeros(predicted_values.shape)
predicted_class[predicted_values>threshold]=1

print('Predicted Class: ', predicted_class)

from sklearn.metrics import confusion_matrix as cm
ConfusionMatrix = cm(Transactions_sample[['Fraud_id']],predicted_class)
print('Confusion Matrix: ', ConfusionMatrix)

accuracy=(ConfusionMatrix[0,0]+ConfusionMatrix[1,1])/sum(sum(ConfusionMatrix))
print('Accuracy: ', accuracy)

error=1-accuracy
print('Error: ', error)
Predicted Values:  [ 0.00154015  0.01714584  0.9932035   0.99938783  0.99967144  0.99846609
  0.99793177  0.99981494  0.99991438]
Predicted Class:  [ 0.  0.  0.  1.  1.  1.  1.  1.  1.  1.  1.  0.  0.  1.  0.  0.  0.  0.
  1.  0.  0.  0.  1.  0.  1.  1.  0.  1.  0.  1.  0.  0.  0.  0.  1.  0.
  1.  1.  0.  0.  0.  1.  1.  1.  0.  1.  1.  0.  1.  1.  0.  0.  0.  1.
  1.  1.  1.  1.  0.  0.  0.  0.  0.  1.  1.  1.  1.  1.  0.  0.  0.  0.
  0.  1.  1.  1.  1.  1.  0.  0.  0.  0.  0.  1.  1.  1.  1.  1.  0.  0.
  0.  0.  0.  1.  1.  1.  1.  1.  0.  0.  0.  0.  0.  1.  1.  1.  1.  1.
  0.  0.  0.  0.  0.  1.  1.  1.  1.  1.  0.  0.  0.  0.  0.  1.  1.  1.
  1.  1.  0.  0.  0.  0.  0.  1.  1.  1.  1.  1.  0.  0.  0.  0.  0.  1.
  1.  1.  1.  1.  0.  0.  0.  0.  0.  1.  1.  1.  1.  1.  0.  0.  0.  0.
  0.  1.  1.  1.  1.  1.  0.  0.  0.  0.  0.  1.  1.  1.  1.  1.  0.  0.
  0.  0.  0.  1.  1.  1.  1.  1.  0.  0.  0.  0.  0.  1.  1.  1.  1.  1.
  0.  0.  0.  0.  0.  1.  1.  1.  1.  1.  0.  0.]
Confusion Matrix:  [[104   0]
 [  1 105]]
Accuracy:  0.995238095238
Error:  0.0047619047619

SVM- The large margin classifier

SVM is all about finding the maximum-margin classifier. "Classifier" is a generic name; mathematically it is called a hyperplane, which in a 2-dimensional space is simply a line. SVM uses the nearest training data points in the feature space and, based on them, finds the hyperplane, i.e., the classifier. Each data point is treated as a p-dimensional vector (a list of p numbers). To find the optimal hyperplane with the maximum margin, SVM uses vector algebra and mathematical optimization.

The SVM Algorithm

  • If a dataset is linearly separable, then we can always find a hyperplane f(x) such that,
    • For all negatively labeled records, f(x) < 0
    • For all positively labeled records, f(x) > 0
    • This hyperplane f(x) is nothing but the linear classifier
    • In two dimensions, f(x) = w_1 x_1 + w_2 x_2 + b
    • In general, f(x) = w^T x + b (a short sketch of this decision function follows below)
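
As a quick illustration of the decision function f(x) = w^T x + b, the short sketch below evaluates it for two points. The weight vector and bias are made-up values chosen only for this example, not fitted ones.

import numpy as np

# Hypothetical weights and bias, for illustration only (not fitted values)
w = np.array([0.5, -1.0])
b = 0.25

def f(x):
    # Linear decision function: positive -> one class, negative -> the other
    return np.dot(w, x) + b

print(f(np.array([2.0, 0.5])))   # 0.75  -> lies on the positive side of the hyperplane
print(f(np.array([0.0, 1.0])))   # -0.75 -> lies on the negative side of the hyperplane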

Math behind SVM Algorithm

SVM Algorithm – The Math

If you already understand the SVM technique, or if you find this part too technical, you can skip it; the tool will take care of the optimization.

  1. f(x) = w^T x + b
  2. For the support vectors lying on the margins: w^T x^+ + b = 1 and w^T x^- + b = -1
  3. x^+ = x^- + lambda*w (x^+ is reached from x^- by moving along the direction of w)
  4. Substituting in w^T x^+ + b = 1:
    • w^T (x^- + lambda*w) + b = 1
    • w^T x^- + b + lambda*(w.w) = 1
    • -1 + lambda*(w.w) = 1
    • lambda = 2/(w.w)
  5. The margin is m = |x^+ - x^-|
    • m = |lambda*w| = lambda*|w|
    • m = (2/(w.w))*|w|
    • m = 2/||w||
  6. The objective is to maximize 2/||w||
    • i.e. minimize ||w||
  7. A good decision boundary should satisfy
    • w^T x + b >= 1 for all points with y = 1
    • w^T x + b <= -1 for all points with y = -1
    • i.e. y(w^T x + b) >= 1 for all points
  8. Now we have an optimization problem with an objective and constraints
    • minimize ||w||, or equivalently (1/2)*||w||^2, subject to the constraint y(w^T x + b) >= 1
  9. We can solve this optimization problem to obtain w and b (a quick numeric check of the margin formula follows below)
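
The sketch below is a small numeric check of the margin formula m = 2/||w||. It uses a tiny, made-up linearly separable dataset and a linear SVC with a large C (to approximate a hard margin); the data values are assumptions for illustration only.

import numpy as np
from sklearn import svm

# Tiny, linearly separable toy data (assumed for illustration)
X = np.array([[1.0, 1.0], [2.0, 2.5], [1.5, 2.0],
              [4.0, 4.5], [5.0, 5.0], [4.5, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = svm.SVC(kernel='linear', C=1000).fit(X, y)   # large C ~ (almost) hard margin

w = clf.coef_[0]
margin = 2 / np.linalg.norm(w)   # m = 2/||w||, as derived above
print("w =", w, " b =", clf.intercept_[0], " margin =", margin)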

SVM Result

SVM is all about fitting the hyperplane that has the maximum margin. The SVM output does not contain any probability; it directly gives the class to which a new data point belongs. For a new point x_k, calculate w^T x_k + b. If this value is positive, the prediction is the positive class; otherwise it is the negative class. So at the end of SVM we get the class of the new data point as the result, not a probability.

SVM on Python

  • There are multiple SVM libraries available in Python.
    • The package 'scikit-learn' is the most widely used for machine learning.
  • Its sklearn.svm module provides the SVC class for classification.
  • SVC has various options (kernel, C, gamma, degree, ...) to customize the training process, as sketched below.
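
A minimal usage sketch with scikit-learn's SVC is shown below; the tiny dataset is made up for illustration. predict() returns the class directly (no probability), and decision_function() returns w^T x + b, whose sign decides the class, as described in the SVM Result section above.

from sklearn import svm

# Made-up toy data, for illustration only
X = [[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]]
y = [0, 0, 0, 1, 1, 1]

clf = svm.SVC(kernel='linear')      # options such as kernel, C, gamma, degree customize training
clf.fit(X, y)

print(clf.predict([[2, 2], [7, 7]]))           # class labels directly, no probabilities
print(clf.decision_function([[2, 2], [7, 7]])) # sign of w^T x + b decides the class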

LAB: First SVM Learning Problem

  • Dataset: Fraud Transaction/Transactions_sample.csv
  • Draw a classification graph that shows all the classes
  • Build a SVM classifier
  • Draw the classifier on the data plots
  • Predict the (Fraud vs not-Fraud) class for the data points Total_Amount=11000, Tr_Count_week=15 & Total_Amount=2000, Tr_Count_week=4
  • Download the complete Dataset: Fraud Transaction/Transaction.csv
  • Draw a classification graph that shows all the classes
  • Build a SVM classifier
  • Draw the classifier on the data plots
In [8]:
# Importing the sample data
import pandas as pd
Transactions_sample= pd.read_csv("Datasets/Fraud Transaction/Transactions_sample.csv")
X = Transactions_sample[['Total_Amount', 'Tr_Count_week']]   # the two predictor columns
y = Transactions_sample[['Fraud_id']].values.ravel()
In [9]:
#Drawing a classification graph of all classes
import matplotlib.pyplot as plt

plt.scatter(X['Total_Amount'], X['Tr_Count_week'], c=y, cmap=plt.cm.Paired)
plt.show()
In [10]:
#Building a SVM Classifier in python
from sklearn import svm
import numpy
X = Transactions_sample[['Total_Amount', 'Tr_Count_week']]   # the two predictor columns
y = Transactions_sample[['Fraud_id']].values.ravel()

clf = svm.SVC(kernel='linear')
model =clf.fit(X,y)

Predicted = numpy.zeros(50)

# NOTE: If i is in range(0,n), then i takes values [0,n-1]
for i in range(0,50):
    a = Transactions_sample.Total_Amount[i]
    b = Transactions_sample.Tr_Count_week[i]
    Predicted[i]=clf.predict([[a,b]])
    del a,b
In [11]:
#Plotting in SVM
import matplotlib.pyplot as plt
plt.scatter(X['Total_Amount'], X['Tr_Count_week'], c=y, cmap=plt.cm.Paired)
w = clf.coef_[0]
o = -w[0] / w[1]
x_min, x_max = X['Total_Amount'].min(), X['Total_Amount'].max()
plt.xlim(x_min, x_max)
xx = np.linspace(x_min, x_max)
yy = o * xx - (clf.intercept_[0]) / w[1]

plt.plot(xx, yy, 'k-')
plt.show()
In [12]:
#Predict the (Fraud vs not-Fraud) class for the data points Total_Amount=11000, Tr_Count_week=15 & Total_Amount=2000, Tr_Count_week=4
#Prediction in SVM
new_data1=[11000, 15]
new_data2=[2000,4]

NewPredicted1=model.predict([new_data1])
print(NewPredicted1)

NewPredicted2=clf.predict([new_data2])
print(NewPredicted2) 
[1]
[0]
In [13]:
# Importing the whole dataset
import pandas as pd
Transactions= pd.read_csv("Datasets/Fraud Transaction/Transaction.csv")
X = Transactions[['Total_Amount', 'Tr_Count_week']]   # the two predictor columns
y = Transactions[['Fraud_id']].values.ravel()
In [14]:
#Drawing a classification graph of all classes
import matplotlib.pyplot as plt

plt.scatter(X['Total_Amount'], X['Tr_Count_week'], c=y, cmap=plt.cm.Paired)
plt.show()
In [15]:
#Build a SVM classifier 
clf = svm.SVC(kernel='linear')
Smodel =clf.fit(X,y)
In [16]:
#Plotting in SVM
import matplotlib.pyplot as plt
plt.scatter(X['Total_Amount'], X['Tr_Count_week'], c=y, cmap=plt.cm.Paired)
w = clf.coef_[0]
o = -w[0] / w[1]
x_min, x_max = X['Total_Amount'].min(), X['Total_Amount'].max()
plt.xlim(x_min, x_max)
xx = np.linspace(x_min, x_max)
yy = o * xx - (clf.intercept_[0]) / w[1]
plt.plot(xx, yy, 'k-')
plt.show()

The Non-Linear Decision Boundary

Till now we have seen a linear classifier. What happens if the decision boundary is non-linear?

From the above pictures, we observe positive-class points, then negative-class points, and then again some positive-class points. In that case, just fitting one line and finding the maximum margin is not meaningful, since the decision boundary is not linear. When the decision boundary is non-linear, SVM struggles to separate the classes; in fact, SVM has no direct theory for setting non-linear decision boundaries. To fit a non-linear boundary, we create new variables or new dimensions in the data and check whether the decision boundary becomes linear in that higher-dimensional space. This technique is called the kernel trick.

What do we do in the kernel trick?

In the kernel trick, we increase the number of dimensions and try to make the non-linearly separable data linearly separable in a higher-dimensional space.

Mapping to Higher Dimensional Space

In this example, we have 0's, then 1's, and then again 0's along a single variable x1, so the data cannot be linearly separated directly. What we do is create a new variable x2, which is simply (x1)^2. Instead of using just one variable x1, we now use two variables, x1 and x2; that is, we increase the dimension of the dataset by adding a new variable. After adding the new variable, the objective space is transformed into a higher-dimensional space in which we can clearly see a single linear decision boundary. A single linear decision boundary is not possible in the lower-dimensional space, so we increase the number of dimensions and then check whether we can fit a linear decision boundary. SVM has no direct theory for non-linear decision boundaries, but by increasing the dimensions with the kernel trick we can fit them, as sketched below.
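
A minimal sketch of this idea, assuming a made-up one-dimensional dataset where the positive class sits between two groups of the negative class: adding x2 = x1^2 as a new dimension makes the data linearly separable.

import numpy as np
import pandas as pd
from sklearn import svm

# Assumed 1-D toy data: class 1 in the middle, class 0 on both sides,
# so no single threshold on x1 can separate them
x1 = np.array([-4, -3, -2.5, -1, -0.5, 0, 0.5, 1, 2.5, 3, 4], dtype=float)
y  = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0])

# Kernel-trick style mapping by hand: add the new dimension x2 = x1^2
mapped = pd.DataFrame({'x1': x1, 'x2': x1 ** 2})

clf = svm.SVC(kernel='linear').fit(mapped, y)
print(clf.score(mapped, y))   # 1.0 here: a straight line in (x1, x2) space separates the classes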

Kernel Trick

The kernel trick is the most important and trickiest part of SVM, because most of the problems we see are not linearly separable. When the scenario is non-linear, we have to use the kernel trick. Above, we used a function phi(x) = (x, x^2) to transform the data x into a higher-dimensional space phi(x). In the higher-dimensional space, we can easily fit a linear decision boundary. This function phi(x) is known as the kernel function, and the process is known as the kernel trick in SVM. The kernel trick solves the non-linear decision boundary problem much like the hidden layers in neural networks do. In essence, the kernel trick increases the number of dimensions so that a non-linear decision boundary in the lower-dimensional space becomes a linear decision boundary in the higher-dimensional space. In simple words, the kernel trick makes the non-linear decision boundary linear (in a higher-dimensional space).

Kernel Function Examples

  • Polynomial kernel: (x_i^T x_j + 1)^q, where q is the degree of the polynomial. Best for image processing.
  • Sigmoid kernel: tanh(a x_i^T x_j + k), where k is the offset value. Behaves very much like a neural network.
  • Gaussian kernel: e^(-||x_i - x_j||^2 / (2 sigma^2)). Useful when there is no prior knowledge about the data.
  • Linear splines kernel (one dimension): 1 + x_i x_j + x_i x_j min(x_i, x_j) - ((x_i + x_j)/2) min(x_i, x_j)^2 + min(x_i, x_j)^3 / 3. Often used for text classification.
  • Laplace radial basis function (RBF) kernel: e^(-lambda ||x_i - x_j||), lambda >= 0. Useful when there is no prior knowledge about the data.
  • There are many more kernel functions.
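
As a rough guide, the sketch below shows how some of these kernels map onto scikit-learn's SVC parameters; X and y are assumed to be an already prepared feature matrix and label vector.

from sklearn import svm

poly_svm    = svm.SVC(kernel='poly', degree=3, coef0=1)   # (x_i^T x_j + 1)^q with q = 3
sigmoid_svm = svm.SVC(kernel='sigmoid', coef0=0.0)        # tanh(a x_i^T x_j + k)
rbf_svm     = svm.SVC(kernel='rbf', gamma='scale')        # Gaussian / RBF kernel
linear_svm  = svm.SVC(kernel='linear')                    # plain dot product x_i^T x_j

# e.g. poly_svm.fit(X, y); poly_svm.predict(X_new)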

Choosing the Kernel Function

Choosing the kernel function is the trickiest part of SVM, and there is no specific theory that tells us which kernel function to use. No proven rule tells us which kernel will work for a given problem, though there is a lot of research in this area. In practice, a low-degree polynomial kernel or an RBF kernel is generally tried first. Choosing the kernel function is similar to choosing the number of hidden layers in a neural network: neither has a proven theory to arrive at a standard value. As a first step, we can try a low-degree polynomial kernel, the radial basis function kernel, or one of the kernels from the list above, as in the sketch below.
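
In practice this trial-and-error is usually automated with cross-validation; the sketch below assumes X and y are already loaded and simply tries a few kernels and parameter values with GridSearchCV.

from sklearn import svm
from sklearn.model_selection import GridSearchCV

# Candidate kernels and parameter values to try (the values here are illustrative assumptions)
param_grid = [
    {'kernel': ['linear'], 'C': [0.1, 1, 10]},
    {'kernel': ['poly'],   'C': [0.1, 1, 10], 'degree': [2, 3]},
    {'kernel': ['rbf'],    'C': [0.1, 1, 10], 'gamma': ['scale', 0.1, 1]},
]

search = GridSearchCV(svm.SVC(), param_grid, cv=5)
# search.fit(X, y)
# print(search.best_params_, search.best_score_)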

LAB: Kernel – Non linear classifier

  • Dataset : Software users/sw_user_profile.csv
  • How many variables are there in software user profile data?
  • Plot the "Active" status against Age and check whether the relation between Age and "Active" status is linear or non-linear.
  • Build an SVM model(model-1), make sure that there is no kernel or the kernel is linear.
  • For model-1, create the confusion matrix and find out the accuracy.
  • Create a new variable, using the polynomial kernel.
  • Build an SVM model(model-2), with the new data mapped on to higher dimensions. Keep the default kernel as linear
  • For model-2, create the confusion matrix and find out the accuracy
  • Plot the SVM with results.
  • With the original data, re-create the model (model-3) and let python choose the default kernel function.
  • What is the accuracy of model-3?
In [17]:
#Dataset : Software users/sw_user_profile.csv  
sw_user_profile = pd.read_csv("Datasets/Software users/sw_user_profile.csv")
In [18]:
#How many variables are there in software user profile data?
sw_user_profile.shape
Out[18]:
(490, 3)
In [19]:
#Plot the "Active" status against Age and check whether the relation between Age and "Active" status is linear or non-linear
plt.scatter(sw_user_profile.Age,sw_user_profile.Active,color='blue')
Out[19]:
<matplotlib.collections.PathCollection at 0x1d21bad5e80>
In [20]:
#Build an SVM model(model-1), make sure that there is no kernel or the kernel is linear

#Model Building 
X= sw_user_profile[['Age']]
y= sw_user_profile[['Active']].values.ravel()
Linsvc = svm.SVC(kernel='linear', C=1).fit(X, y)
In [21]:
#Predicting values
predict3 = Linsvc.predict(X)
In [22]:
#For model-1, create the confusion matrix and find out the accuracy
#Confusion Matrix
from sklearn.metrics import confusion_matrix
conf_mat = confusion_matrix(sw_user_profile[['Active']],predict3)
conf_mat
Out[22]:
array([[317,   0],
       [173,   0]])
In [23]:
#Accuracy 
Accuracy3 = Linsvc.score(X, y)
Accuracy3
Out[23]:
0.64693877551020407
New variable derivation. Mapping to higher dimensions
In [24]:
#Standardizing the data to visualize the results clearly
sw_user_profile['age_nor']=(sw_user_profile.Age-numpy.mean(sw_user_profile.Age))/numpy.std(sw_user_profile.Age) 
In [25]:
#Create the new variable using a polynomial (squared) mapping, as in the kernel trick
sw_user_profile['new']=(sw_user_profile.age_nor)*(sw_user_profile.age_nor)
In [26]:
#Build an SVM model(model-2), with the new data mapped on to higher dimensions. Keep the default kernel as linear

#Model Building with new variable
X= sw_user_profile[['Age']+['new']]
y= sw_user_profile[['Active']].values.ravel()
Linsvc = svm.SVC(kernel='linear', C=1).fit(X, y)
predict4 = Linsvc.predict(X)
In [27]:
#For model-2, create the confusion matrix and find out the accuracy
#Confusion Matrix
conf_mat = confusion_matrix(sw_user_profile[['Active']],predict4)
conf_mat
Out[27]:
array([[317,   0],
       [  0, 173]])
In [28]:
#Accuracy 
Accuracy4 = Linsvc.score(X, y)
Accuracy4
Out[28]:
1.0
In [29]:
#With the original data, re-create the model (model-3) and let python choose the default kernel function.
########Model building with the radial (RBF) kernel, which is also SVC's default
X= sw_user_profile[['Age']]
y= sw_user_profile[['Active']].values.ravel()
Linsvc = svm.SVC(kernel='rbf', C=1).fit(X, y)
predict5 = Linsvc.predict(X)
conf_mat = confusion_matrix(sw_user_profile[['Active']],predict5)
conf_mat
Out[29]:
array([[317,   0],
       [  0, 173]])
In [30]:
#Accuracy model-3
Accuracy5 = Linsvc.score(X, y)
Accuracy5
Out[30]:
1.0

Soft Margin Classification – Noisy data

Noisy data

  • What if there is some noise in the data?
  • What if the overall data can be classified perfectly except for a few points?
  • How do we find the hyperplane when a few points are on the wrong side?
  • The non-separable cases can be solved by allowing a slack variable (xi) for each point on the wrong side.
  • We allow some errors while building the classifier.
  • In the SVM optimization problem, we initially allow some error and then find the hyperplane.
  • SVM will find the maximum-margin classifier while allowing some minimal error due to noise.
  • Hard margin – classifying all data points correctly.
  • Soft margin – allowing some error, as in the sketch below.
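
In scikit-learn the softness of the margin is controlled by the C parameter of SVC; the sketch below is only an illustration of that trade-off, with X and y assumed to contain some noisy, overlapping points.

from sklearn import svm

hard_ish = svm.SVC(kernel='linear', C=1000)   # large C: errors are penalized heavily (close to a hard margin)
soft     = svm.SVC(kernel='linear', C=0.01)   # small C: more slack is allowed, giving a wider, more tolerant margin

# hard_ish.fit(X, y); soft.fit(X, y)
# A smaller C usually produces more support vectors and a more error-tolerant boundary.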

SVM Validation

  • SVM doesn't give us a probability; it directly gives us the resulting class.
  • The usual validation methods, such as sensitivity, specificity, cross-validation, ROC and AUC, still apply, as sketched below.
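
A hedged sketch of these validation methods with scikit-learn, assuming X and y are already prepared; since SVC gives no probabilities by default, the decision-function scores are used for ROC/AUC.

from sklearn import svm
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score

clf = svm.SVC(kernel='linear')
# print(cross_val_score(clf, X, y, cv=5))        # cross-validated accuracy
# scores = clf.fit(X, y).decision_function(X)    # SVC gives scores, not probabilities
# print(roc_auc_score(y, scores))                # ROC AUC computed from the scores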

SVM Advantages & Disadvantages

SVM Advantages

  • SVMs are very good when we have no prior knowledge of the data.
  • They work well even with unstructured and semi-structured data such as text, images and trees.
  • The kernel trick is the real strength of SVM; with an appropriate kernel function we can solve very complex problems.
  • Unlike neural networks, SVM training does not get stuck in local optima (its optimization problem is convex).
  • It scales relatively well to high-dimensional data.
  • SVM models generalize well in practice, so the risk of overfitting is lower.

SVM Disadvantages

  • Choosing a “good” kernel function is not easy.
  • Long training time for large datasets
  • The final model, its variable weights and their individual impact are difficult to understand and interpret.
  • Since the final model is not easy to inspect, we cannot make small calibrations to it, so it is hard to incorporate business logic.

SVM Application

  • Protein Structure Prediction
  • Intrusion Detection
  • Handwriting Recognition
  • Detecting Steganography in digital images
  • Breast Cancer Diagnosis

LAB: Digit Recognition using SVM

  • Take an image of a handwritten single digit, and determine what that digit is.
  • Normalized handwritten digits, automatically scanned from envelopes by the U.S. Postal Service. The original scanned digits are binary and of different sizes and orientations; the images here have been deslanted and size-normalized, resulting in 16 x 16 grayscale images (Le Cun et al., 1990).
  • The data are in two zipped files, and each line consists of the digit id (0-9) followed by the 256 grayscale values.
  • Build an SVM model that can be used as the digit recognizer
  • Use the test dataset to validate the true classification power of the model
  • What is the final accuracy of the model?
In [31]:
#Importing test and training data

train_data = numpy.loadtxt('Datasets/Digit Recognizer/USPS/zip.train.txt')
test_data  = numpy.loadtxt('Datasets/Digit Recognizer/USPS/zip.test.txt')

train_data.shape
test_data.shape
Out[31]:
(2007, 257)
In [32]:
plt.figure(figsize=(10,10))
for i in range(0,9):
    data_row=train_data[i][1:]                        # skip the label in column 0
    pixels = numpy.matrix(data_row).reshape(16,16)    # 256 grayscale values -> 16x16 image
    plt.subplot(3,3,i+1)
    plt.imshow(pixels)
In [33]:
#Are there any missing values?
sum(sum(pd.isnull(train_data))) 
sum(sum(pd.isnull(test_data))) 
Out[33]:
0
In [34]:
#The data are in two gzipped files, and each line consists of the digit id (0-9) followed by the 256 grayscale values.
#The first variable is label
train_data1= pd.DataFrame(train_data)
train_data1[0].value_counts()
Out[34]:
0.0    1194
1.0    1005
2.0     731
6.0     664
3.0     658
4.0     652
7.0     645
9.0     644
5.0     556
8.0     542
Name: 0, dtype: int64
In [35]:
#Build an SVM model that can be used as the digit recognizer 
########SVM Model Building 
#Verify the code with small data
X1=train_data[:5000,range(1,257)]
Y1 =train_data[0:5000,0]
import time
start_time = time.time()
numbersvm = svm.SVC(kernel='rbf', C=1).fit(X1,Y1)
print("---Time taken is %s seconds ---" % (time.time() - start_time))
---Time taken is 2.6106622219085693 seconds ---
In [36]:
predict6 = numbersvm.predict(X1)
Y1 = pd.DataFrame(Y1)
Y1[0].value_counts()

predict6=pd.DataFrame(predict6)
predict6[0].value_counts()
Out[36]:
0.0    847
1.0    678
9.0    484
2.0    484
6.0    474
7.0    462
4.0    441
3.0    391
8.0    387
5.0    352
Name: 0, dtype: int64
In [37]:
#Confusion Matrix
conf_mat = confusion_matrix(Y1,predict6)
conf_mat
Out[37]:
array([[845,   0,   0,   0,   1,   0,   1,   0,   0,   0],
       [  0, 674,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   1, 478,   2,   3,   0,   0,   1,   3,   0],
       [  0,   0,   3, 385,   1,   2,   0,   0,   2,   2],
       [  0,   1,   0,   0, 428,   1,   1,   0,   0,   3],
       [  1,   0,   1,   2,   1, 346,   1,   0,   0,   0],
       [  1,   1,   1,   0,   3,   1, 471,   0,   0,   0],
       [  0,   0,   0,   0,   2,   0,   0, 455,   3,   1],
       [  0,   1,   1,   1,   1,   2,   0,   2, 379,   0],
       [  0,   0,   0,   1,   1,   0,   0,   4,   0, 478]])
In [38]:
Accuracy = numbersvm.score(X1,Y1)
Accuracy
Out[38]:
0.98780000000000001
In [39]:
#####Model on Full Data 
X2=train_data[:,range(1,257)]
Y2 =train_data[:,0]
import time
start_time = time.time()
numbersvm = svm.SVC(kernel='rbf', C=1).fit(X2,Y2)
print("---Time taken is %s seconds ---" % (time.time() - start_time)) 
---Time taken is 4.8773932456970215 seconds ---
In [40]:
#Confusion Matrix
predict7 = numbersvm.predict(X2)
conf_mat = confusion_matrix(Y2,predict7)
conf_mat
Out[40]:
array([[1191,    0,    0,    1,    1,    0,    1,    0,    0,    0],
       [   0, 1005,    0,    0,    0,    0,    0,    0,    0,    0],
       [   0,    1,  717,    3,    7,    0,    0,    1,    2,    0],
       [   0,    0,    2,  647,    0,    3,    0,    1,    4,    1],
       [   0,    2,    0,    0,  645,    0,    2,    0,    0,    3],
       [   2,    0,    3,    3,    2,  544,    2,    0,    0,    0],
       [   2,    1,    1,    0,    4,    1,  655,    0,    0,    0],
       [   0,    0,    2,    0,    2,    0,    0,  635,    3,    3],
       [   0,    1,    1,    0,    3,    3,    0,    2,  532,    0],
       [   0,    0,    0,    2,    2,    0,    0,    6,    0,  634]])
In [41]:
# Training accuracy of the full-data model, evaluated here on the first 5000 records (X1, Y1)
print('Accuracy is : ',numbersvm.score(X1,Y1))
Accuracy is :  0.9884
In [42]:
###Out of time validation with test data
Ex1 = test_data[:,range(1,257)]
Ey1 = test_data[:,0]
test_predict = numbersvm.predict(Ex1)
conf_mat = confusion_matrix(Ey1,test_predict)
conf_mat
Out[42]:
array([[355,   0,   2,   0,   1,   0,   0,   0,   1,   0],
       [  0, 255,   0,   0,   5,   0,   3,   0,   0,   1],
       [  3,   0, 181,   2,   5,   2,   0,   1,   4,   0],
       [  1,   0,   3, 146,   0,  10,   0,   1,   5,   0],
       [  0,   1,   3,   0, 188,   1,   1,   1,   1,   4],
       [  4,   0,   0,   4,   1, 147,   0,   0,   1,   3],
       [  3,   0,   3,   0,   2,   2, 159,   0,   1,   0],
       [  0,   0,   1,   0,   5,   1,   0, 137,   1,   2],
       [  3,   0,   2,   1,   0,   4,   1,   1, 153,   1],
       [  0,   0,   0,   1,   4,   0,   0,   0,   2, 170]])
In [44]:
#Let's count the wrong predictions on the test data
wrong_pred = numpy.zeros(2007)   # true labels of the misclassified test records
cnt=0
for i in range(0,2007):
    if test_predict[i]!=Ey1[i]:
        wrong_pred[cnt]=Ey1[i]
        cnt=cnt+1
cnt
Out[44]:
116

Conclusion

  • There are many software tools available for SVM implementation.
  • SVMs are really good for text classification; they are also good at finding the best linear separator.
  • The kernel trick makes SVM a non-linear learning algorithm.
  • Choosing an appropriate kernel is the key to a good SVM, and choosing the right kernel function is not easy.
  • We need to be patient while building SVMs on large datasets; they take a lot of time to train.

 
