Object recognition
Problem Statement:
Data Exploration:
Importing required libraries for Object Recognition
import pandas as pd
import sklearn
import numpy
import scipy
import statsmodels
import math
import matplotlib as matlab
Importing the dataset
train_lab = pd.read_csv("\\train\\Labels.csv")
The Dimension of the dataset is
train_lab.shape
The names of the variables in this dataset are
train_lab.columns.values
Let us see the top 10 observations of the dataset
train_lab.head(10)
So far we have the images themselves but not their intensity values, so we have to read each image and extract the pixel intensities. Before that, we store the image file paths in a list.
a = []
for i in range(1,50001):
    a.append("\\train\\train"+str(i)+".png")
a[1]
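As a minimal sketch of the same path-building step, the list can also be produced with a list comprehension and `os.path.join`, which avoids hard-coding the path separator. The helper name `build_image_paths` and the relative folder name `"train"` are assumptions made for illustration; only the `train<i>.png` naming comes from the code above.

```python
import os

def build_image_paths(root, count, prefix="train", ext=".png"):
    """Return the paths root/<prefix>1<ext> ... root/<prefix><count><ext>."""
    return [os.path.join(root, f"{prefix}{i}{ext}") for i in range(1, count + 1)]

# Hypothetical root folder; adjust to wherever the images are stored.
paths = build_image_paths("train", 50000)
```

Python lists are zero-indexed, so `paths[0]` refers to image 1 and `paths[1]` to image 2.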
Pixel values of the images are extracted and stored in the variable t.
len(a)
t = []
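for i in range(0,50000):
    # Note: scipy.misc.imread was removed in SciPy 1.2; on newer installations
    # an image reader such as imageio.imread can be used instead.
    t.append(scipy.misc.imread(a[i]))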
t[1]
import matplotlib.pyplot as matplot
matplot.imshow(t[1])
Since these are colour images, we convert them to grayscale for further processing. The function below performs the conversion as a weighted sum of the red, green and blue channels, and the converted images are stored in a list called 'gray'.
def rgb2gray(rgb):
    return numpy.dot(rgb[...,:3], [0.299, 0.587, 0.114])
gray = []
for i in range(0,50000):
    gray.append(rgb2gray(t[i]))
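To make the weighted sum concrete, here is a small self-contained check on a made-up 1x3 image (one red, one green and one white pixel). The weights are the ones used in `rgb2gray`; the names `to_gray` and `demo` are hypothetical and exist only for this illustration.

```python
import numpy as np

# Weights for the R, G and B channels (same values as in rgb2gray).
WEIGHTS = np.array([0.299, 0.587, 0.114])

def to_gray(rgb):
    """Weighted sum over the last (channel) axis; a 4th alpha channel is ignored."""
    return np.dot(rgb[..., :3], WEIGHTS)

# Hypothetical 1x3 image: pure red, pure green, white.
demo = np.array([[[255, 0, 0], [0, 255, 0], [255, 255, 255]]], dtype=np.uint8)
demo_gray = to_gray(demo)   # shape (1, 3), floating-point intensities
```

Because the three weights sum to 1, a white pixel keeps its full intensity of about 255, while pure red and pure green map to about 76 and 150 respectively.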
Each image is 32 x 32, which gives 1024 pixels, but we need them in a single row per image. So we reshape each array from [32,32] to [1,1024] for further processing.
im_data = []
for i in range(0,50000):
    data_row=gray[i]
    pixels = data_row
    im_data.append(pixels.reshape(1,1024))
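The same flattening can be sketched without an explicit loop by stacking the images into one array and reshaping once. The list `gray_demo` below is a hypothetical stand-in for `gray`, filled with random values so the snippet runs on its own.

```python
import numpy as np

# Hypothetical stand-in for the list `gray`: 5 random 32x32 grayscale images.
rng = np.random.default_rng(0)
gray_demo = [rng.random((32, 32)) for _ in range(5)]

# Stack into shape (n_images, 32, 32), then flatten each image into one row.
flat = np.stack(gray_demo).reshape(len(gray_demo), -1)
```

The result has one row of 1024 values per image, with the pixels laid out row by row (row-major order), which is the same layout produced by reshaping each image separately.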
Importing the required libraries for Model building
import statsmodels.formula.api as sm
from sklearn import svm
from sklearn.metrics import confusion_matrix
newvie=numpy.array(im_data)
a = newvie.reshape(50000,1024)
If we take each pixel as a feature, we have 1024 features and 50000 observations, which takes a very long time to process. There are two ways around this. One is to extract summary features from the images: from the 1024 pixels we compute four intensity-based features, namely mean, variance, skewness and kurtosis. The other is to apply PCA and keep the 50 leading principal components instead of the 1024 pixels. Here, we extract the intensity-based features.
import scipy.stats  # `import scipy` alone may not load the stats submodule
mean_val = []
for i in range(0,50000):
    mean_val.append(numpy.mean(a[i]))
variance = []
for i in range(0,50000):
    variance.append(numpy.var(a[i]))
skewness = []
for i in range(0,50000):
    skewness.append(scipy.stats.skew(a[i]))
skewness = pd.DataFrame(skewness)
kurtosis = []
for i in range(0,50000):
    kurtosis.append(scipy.stats.kurtosis(a[i]))
kurtosis = pd.DataFrame(kurtosis)
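The four features can also be computed for all images at once with plain NumPy. This is a sketch under stated assumptions: the function name `intensity_features` is hypothetical, and the formulas are the biased moment estimators, which should agree with the defaults of `scipy.stats.skew` and `scipy.stats.kurtosis` (excess kurtosis). The article does not show how the feature matrix named `variable` in the next section was assembled; stacking the four columns as below is one plausible way, not necessarily the one used.

```python
import numpy as np

def intensity_features(pixels):
    """Return an (n_images, 4) array: mean, variance, skewness, excess kurtosis.

    Note: a constant image has zero variance, so its skewness and kurtosis
    are undefined (division by zero) in this simple sketch.
    """
    pixels = np.asarray(pixels, dtype=float)
    mean = pixels.mean(axis=1)
    centered = pixels - mean[:, None]
    m2 = (centered ** 2).mean(axis=1)   # population variance, like numpy.var
    m3 = (centered ** 3).mean(axis=1)   # third central moment
    m4 = (centered ** 4).mean(axis=1)   # fourth central moment
    skew = m3 / m2 ** 1.5               # biased sample skewness
    kurt = m4 / m2 ** 2 - 3.0           # Fisher (excess) kurtosis
    return np.column_stack([mean, m2, skew, kurt])

# Two tiny made-up "images" of four pixels each.
demo_pixels = np.array([[0.0, 0.0, 0.0, 4.0],
                        [1.0, 2.0, 3.0, 4.0]])
features = intensity_features(demo_pixels)
```

Each row of `features` describes one image, so the array can be passed directly to a classifier as a four-column feature matrix.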
Model Building
Support Vector Machines:
import time
start_time = time.time()
# `variable` is the feature matrix passed to the classifier
numbersvm1 = svm.SVC(kernel='rbf', C=1).fit(variable,train_lab.label)
print("--- %s seconds ---" % (time.time() - start_time))
len(numbersvm1.support_)
Here we compute the confusion matrix and run shuffle-split cross-validation (repeated random train/test splits) for the SVM model.
predict = numbersvm1.predict(variable)
conf_mat = confusion_matrix(train_lab.label,predict)
# training accuracy: correctly classified (diagonal) over all observations
numpy.trace(conf_mat)/float(conf_mat.sum())
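# Note: sklearn.cross_validation was removed in scikit-learn 0.20; newer versions
# provide ShuffleSplit and cross_val_score in sklearn.model_selection
# (with a slightly different ShuffleSplit signature).
from sklearn import cross_validation
####cross-validation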
cv = cross_validation.ShuffleSplit(train_lab.label.size, n_iter=10,test_size=0.2, random_state=None)
scores = cross_validation.cross_val_score(numbersvm1,variable,train_lab.label,cv = cv)
score_mean = numpy.mean(scores)
The model accuracy after cross-validation is approximately 87%.
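To illustrate what the two evaluation steps compute, here is a rough NumPy-only sketch of shuffle-split index generation and of accuracy read off a confusion matrix. It is not scikit-learn's implementation: the helper names `shuffle_split` and `accuracy_from_confusion`, the fixed seed and the rounding of the test-set size are assumptions made for this example.

```python
import numpy as np

def shuffle_split(n_samples, n_iter=10, test_size=0.2, seed=0):
    """Yield (train_idx, test_idx) pairs, each from an independent random permutation."""
    rng = np.random.default_rng(seed)
    n_test = int(round(n_samples * test_size))
    for _ in range(n_iter):
        perm = rng.permutation(n_samples)
        yield perm[n_test:], perm[:n_test]

def accuracy_from_confusion(conf_mat):
    """Fraction of observations on the diagonal of a confusion matrix."""
    conf_mat = np.asarray(conf_mat)
    return np.trace(conf_mat) / conf_mat.sum()

# 10 random 80/20 splits of 100 hypothetical observations.
splits = list(shuffle_split(100, n_iter=10, test_size=0.2))

# Made-up 2-class confusion matrix: 17 of 20 observations are on the diagonal.
demo_acc = accuracy_from_confusion([[8, 2], [1, 9]])
```

Unlike k-fold cross-validation, the test sets of different iterations may overlap, because every iteration reshuffles the full set of indices.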
anni = a[:40000]
tt = train_lab.label[:40000]
mean_val = pd.DataFrame(mean_val)
numbersvm = svm.SVC(kernel='rbf', C=1).fit(mean_val,train_lab.label)
konni = a[49000:]
at = train_lab.label[49000:]
Model Building 2
Decision Trees:
With decision trees, the four intensity-based features are too few for the tree to classify properly. So we use all 1024 pixels as variables to train the model. Cross-validation is also done for this model.
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf.fit(a,train_lab.label)
tr_score = clf.score(a,train_lab.label)
#ts_score = clf.score(variable,train_lab.label)
cv = cross_validation.ShuffleSplit(train_lab.label.size, n_iter=10,test_size=0.2, random_state=None)
scoresTree = cross_validation.cross_val_score(clf,a,train_lab.label,cv = cv)
score_mean = numpy.mean(scoresTree)
score_mean
The accuracy of the above model is around 20%, as the raw pixels contain a lot of redundancy and many irrelevant features.
Model Building 3
Principal Component Analysis:
Here, instead of using all 1024 pixel values, we reduce them to 50 principal components with Principal Component Analysis. We then build models on these components using a decision tree as well as a random forest.
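# Note: RandomizedPCA was removed in scikit-learn 0.20; newer versions use
# PCA(svd_solver='randomized') from sklearn.decomposition instead.
from sklearn.decomposition import RandomizedPCA
n_components = 50
pca = RandomizedPCA(n_components=n_components, whiten=True).fit(a)
print("Projecting the input data on the principal components")
x_train_pca = pca.transform(a)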
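As a rough illustration of what the projection does, PCA with whitening can be sketched with an exact SVD in NumPy. Note that this swaps in a plain (non-randomized) SVD for the randomized solver used above, and the function name `pca_project` and the random demo data are assumptions for this example only.

```python
import numpy as np

def pca_project(X, n_components, whiten=True):
    """Project the rows of X onto the leading principal components via SVD."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ Vt[:n_components].T
    if whiten:
        # Rescale each component to unit sample variance.
        scores = scores / (S[:n_components] / np.sqrt(X.shape[0] - 1))
    return scores

rng = np.random.default_rng(0)
demo_X = rng.random((200, 64))     # hypothetical: 200 "images" of 64 pixels
demo_pca = pca_project(demo_X, n_components=5)
```

With whitening enabled, the projected columns are uncorrelated and each has unit variance, so no single component dominates a downstream model purely because of its scale.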
Decision Tree:
from sklearn import tree
clf1 = tree.DecisionTreeClassifier()
clf1.fit(x_train_pca,train_lab.label)
tr_score1 = clf1.score(x_train_pca,train_lab.label)
#ts_score = clf.score(variable,train_lab.label)
from sklearn import cross_validation
cv = cross_validation.ShuffleSplit(train_lab.label.size, n_iter=10,test_size=0.2, random_state=None)
scoresTree1 = cross_validation.cross_val_score(clf1,x_train_pca,train_lab.label,cv = cv)
score_mean1 = numpy.mean(scoresTree1)
score_mean1
Random Forest:
###############Building a Randomforest classifier ########
from sklearn.ensemble import RandomForestClassifier
forest=RandomForestClassifier(n_estimators=50, criterion='gini', max_depth=None,
min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto',
max_leaf_nodes=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0,
warm_start=False, class_weight=None).fit(x_train_pca,train_lab.label)
cv = cross_validation.ShuffleSplit(train_lab.label.size, n_iter=10,test_size=0.2, random_state=None)
scoresTree2 = cross_validation.cross_val_score(forest,x_train_pca,train_lab.label,cv = cv)
score_mean2 = numpy.mean(scoresTree2)
score_mean2
Conclusion:
Among all the models built, the support vector machine shows the best results compared to the rest. However, the SVM takes a considerable amount of time to train.


