Before we start the lesson, please download the datasets.
Problem Statement:
The objective is to build a predictive model that can distinguish between the main product categories. Many identical products get classified differently, so the quality of the product analysis depends heavily on the ability to accurately group similar products.
Importing the data
import pandas as pd

train = pd.read_csv("C:/Users/Personal/Google Drive/train_ecom.csv")
train.shape

test = pd.read_csv("C:/Users/Personal/Google Drive/test_ecom.csv")
test.shape
fullData = pd.concat([train, test], axis=0)  # Combine the train and test data sets
fullData.shape
fullData.columns # This will show all the column names
fullData.head(10) # Show first 10 records of dataframe
fullData.describe()  # Summary of the numerical fields using the describe() function
fullData.columns.values
# Check for missing values
fullData.isnull().sum()
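If any columns did report missing values, they would need to be handled before modeling. A minimal sketch (assuming the feature columns are numeric) is to impute each column with its median:

# Impute missing numeric values with the column median (only needed if any were found)
num_cols = fullData.select_dtypes(include='number').columns
fullData[num_cols] = fullData[num_cols].fillna(fullData[num_cols].median())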
# Since sklearn requires all inputs to be numeric, we convert our categorical variable into numeric form by encoding the categories:
from sklearn.preprocessing import LabelEncoder

var_mod = ['Category']
le = LabelEncoder()
for i in var_mod:
    fullData[i] = le.fit_transform(fullData[i])
fullData.dtypes
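LabelEncoder assigns an arbitrary integer to each category; keeping the fitted encoder around lets us decode predictions back into the original category names later:

# Mapping from encoded integers back to the original category names
print(dict(enumerate(le.classes_)))
# le.inverse_transform can decode encoded values, e.g. le.inverse_transform([0, 1])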
Splitting the data into training and testing sets
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer sklearn versions

features = list(fullData.columns[1:101])
X1 = fullData[features]
y1 = fullData['Category']
X1_train, X1_test, Y1_train, Y1_test = train_test_split(X1, y1, train_size=0.8, random_state=90)
X1_train.shape, Y1_train.shape, X1_test.shape, Y1_test.shape
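For a classification target it is often safer to stratify the split so every category keeps the same proportion in the train and test sets. An optional variant of the split above (not used in the rest of this lesson):

# Optional: a stratified 80/20 split, preserving class proportions
Xs_train, Xs_test, Ys_train, Ys_test = train_test_split(
    X1, y1, train_size=0.8, random_state=90, stratify=y1)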
Naive Bayes classifier
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X1_train,Y1_train)
#Naive Bayes classifier on training data
predict1 = model.predict(X1_train)
predict1
from sklearn.metrics import confusion_matrix
cm1 = confusion_matrix(Y1_train, predict1)
print(cm1)
#Accuracy on training data
import numpy as np
from sklearn.metrics import accuracy_score
accuracy_score(Y1_train,predict1)
#Naive Bayes classifier on test data
predict = model.predict(X1_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y1_test, predict)
print(cm)
from sklearn import metrics
print(metrics.classification_report(Y1_test, predict))
print(metrics.confusion_matrix(Y1_test, predict))
print(metrics.accuracy_score(Y1_test, predict))
#Accuracy on test data
import numpy as np
from sklearn.metrics import accuracy_score
accuracy_score(Y1_test,predict)
Decision Tree
Decision Trees (DTs) are a non-parametric supervised learning method used for classification. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
from sklearn import tree

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X1_train, Y1_train)
clf
# Save the tree as a dot file
with open("tree1.dot", 'w') as f:
    tree.export_graphviz(clf, out_file=f)
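The dot file has to be converted to a PNG before it can be displayed below. One way, assuming Graphviz is installed and its dot executable is on the PATH, is to call it from Python:

import subprocess

# Convert the saved dot file to a PNG with the Graphviz command-line tool
# (assumes Graphviz is installed and `dot` is on the PATH)
subprocess.check_call(["dot", "-Tpng", "tree1.dot", "-o", "tree1.png"])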
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import Image
Image("tree1.png")
# Decision tree on training data
predict = clf.predict(X1_train)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y1_train, predict)
print(cm)
#Accuracy on training data
import numpy as np
from sklearn.metrics import accuracy_score
accuracy_score(Y1_train,predict)
# Decision tree on test data
predict_d = clf.predict(X1_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y1_test, predict_d)
print(cm)
#Accuracy on test data
import numpy as np
from sklearn.metrics import accuracy_score
accuracy_score(Y1_test,predict_d)
# The model performs much better on the training set than on the test set (overfitting).
# Let's prune the tree by limiting its depth to simplify the model.
tree1 = tree.DecisionTreeClassifier(criterion='gini', max_depth=16,random_state=90)
tree1.fit(X1_train,Y1_train)
# Decision tree on training data
predict_1 = tree1.predict(X1_train)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y1_train, predict_1)
print(cm)
#Accuracy on training data
import numpy as np
from sklearn.metrics import accuracy_score
accuracy_score(Y1_train,predict_1)
# Decision tree on test data
predict_2 = tree1.predict(X1_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y1_test, predict_2)
print(cm)
#Accuracy on test data
import numpy as np
from sklearn.metrics import accuracy_score
accuracy_score(Y1_test,predict_2)
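To see how the depth limit trades training accuracy against test accuracy, one can sweep a few candidate depths. A quick sketch (the depth grid here is an arbitrary choice):

from sklearn import tree
from sklearn.metrics import accuracy_score

# Compare train/test accuracy across a few depth limits to pick a reasonable cut-off
for depth in [4, 8, 12, 16, 20, None]:
    t = tree.DecisionTreeClassifier(criterion='gini', max_depth=depth, random_state=90)
    t.fit(X1_train, Y1_train)
    print(depth,
          accuracy_score(Y1_train, t.predict(X1_train)),
          accuracy_score(Y1_test, t.predict(X1_test)))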
Random Forest
The sklearn.ensemble module includes two averaging algorithms based on randomized decision trees: the RandomForest algorithm and the Extra-Trees method. Both algorithms are perturb-and-combine techniques [B1998] specifically designed for trees. This means a diverse set of classifiers is created by introducing randomness in the classifier construction. The prediction of the ensemble is given as the averaged prediction of the individual classifiers.
from sklearn.ensemble import RandomForestClassifier

# max_features='sqrt' replaces the deprecated 'auto' (equivalent for classifiers)
forest = RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=16, min_samples_split=4,
                                min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='sqrt',
                                max_leaf_nodes=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=90,
                                verbose=0, warm_start=False, class_weight=None)
forest.fit(X1_train,Y1_train)
#RandomForest on training data
Predicted=forest.predict(X1_train)
from sklearn.metrics import confusion_matrix as cm
ConfusionMatrix = cm(Y1_train,Predicted)
print(ConfusionMatrix)
#Accuracy on training data
import numpy as np
from sklearn.metrics import accuracy_score
accuracy_score(Y1_train,Predicted)
#RandomForest on test data
Predicted1=forest.predict(X1_test)
from sklearn.metrics import confusion_matrix as cm
Confusion_Matrix = cm(Y1_test,Predicted1)
print(Confusion_Matrix)
#Accuracy on test data
import numpy as np
from sklearn.metrics import accuracy_score
accuracy_score(Y1_test,Predicted1)
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

importances = pd.DataFrame({'feature': X1_test.columns,
                            'importance': np.round(forest.feature_importances_, 3)})
importances = importances.sort_values('importance', ascending=False).set_index('feature')
print(importances)
importances.plot(kind='bar', figsize=(16, 6))  # a readable figure size; figsize is specified in inches
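With 100 features the full bar chart is crowded. One option is to plot only the most important features (a sketch; the cut-off of 20 is an arbitrary choice):

# Keep only the 20 most important features (cut-off is arbitrary)
top_features = importances.head(20)
top_features.plot(kind='bar', figsize=(10, 4), legend=False)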
Extremely randomized trees classifier
In extremely randomized trees, randomness goes one step further in the way splits are computed. As in random forests, a random subset of candidate features is used, but instead of looking for the most discriminative thresholds, thresholds are drawn at random for each candidate feature and the best of these randomly generated thresholds is picked as the splitting rule. This usually makes it possible to reduce the variance of the model a bit more, at the expense of a slightly greater increase in bias.
#Extremely randomized trees classifier
from sklearn.ensemble import ExtraTreesClassifier
clf = ExtraTreesClassifier(n_estimators=10,random_state=90,max_depth=22)
clf
clf.fit(X1_train,Y1_train)
#Extremely randomized trees classifier on training data
Predicted2=clf.predict(X1_train)
from sklearn.metrics import confusion_matrix as cm
ConfusionMatrix = cm(Y1_train,Predicted2)
print(ConfusionMatrix)
#Accuracy on training data
import numpy as np
from sklearn.metrics import accuracy_score
accuracy_score(Y1_train,Predicted2)
#Extremely randomized trees classifier on test data
Predicted3=clf.predict(X1_test)
from sklearn.metrics import confusion_matrix as cm
Confusion_Matrix = cm(Y1_test,Predicted3)
print(Confusion_Matrix)
#Accuracy on test data
import numpy as np
from sklearn.metrics import accuracy_score
accuracy_score(Y1_test,Predicted3)
GradientBoostingClassifier
Gradient Tree Boosting or Gradient Boosted Regression Trees (GBRT) is a generalization of boosting to arbitrary differentiable loss functions. GBRT is an accurate and effective off-the-shelf procedure that can be used for both regression and classification problems.
from sklearn.ensemble import GradientBoostingClassifier
clf_b = GradientBoostingClassifier(n_estimators=10, learning_rate=0.125,random_state=90, max_depth=4)
clf_b
clf_b.fit(X1_train, Y1_train)
#GradientBoostingClassifier on training data
Predicted4=clf_b.predict(X1_train)
from sklearn.metrics import confusion_matrix as cm
ConfusionMatrix = cm(Y1_train,Predicted4)
print(ConfusionMatrix)
#Accuracy on training data
import numpy as np
from sklearn.metrics import accuracy_score
accuracy_score(Y1_train,Predicted4)
#GradientBoostingClassifier on test data
Predicted5=clf_b.predict(X1_test)
from sklearn.metrics import confusion_matrix as cm
Confusion_Matrix = cm(Y1_test,Predicted5)
print(Confusion_Matrix)
#Accuracy on test data
import numpy as np
from sklearn.metrics import accuracy_score
accuracy_score(Y1_test,Predicted5)
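Because boosting builds the ensemble stage by stage, staged_predict can show how test accuracy evolves as each tree is added. A short sketch using the fitted clf_b from above:

from sklearn.metrics import accuracy_score

# Test accuracy after each of the 10 boosting stages
for i, stage_pred in enumerate(clf_b.staged_predict(X1_test), start=1):
    print(i, accuracy_score(Y1_test, stage_pred))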
K-Fold Cross Validation (using Random Forest)
from sklearn.model_selection import train_test_split, cross_val_score  # replaces the removed sklearn.cross_validation module

features = list(fullData.columns[1:101])
X1 = fullData[features]
y1 = fullData['Category']
X1_train, X1_test, Y1_train, Y1_test = train_test_split(X1, y1, train_size=0.8, random_state=90)
X1_train.shape, Y1_train.shape, X1_test.shape, Y1_test.shape

# 10-fold cross validation with the random forest model
score = cross_val_score(forest, X1, y1, cv=10)
score
print(forest.score(X1_test, Y1_test))
print(score)
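A common way to summarize the fold scores is by their mean and standard deviation:

# Summarize the 10 fold scores
print("CV accuracy: %0.3f (+/- %0.3f)" % (score.mean(), score.std()))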
Conclusion
In this project I used five classification models.
Naive Bayes classifier: accuracy is 61% on the training data and 61% on the test data.
Decision Tree: accuracy is 76% on the training data and 70% on the test data.
Random Forest: accuracy is 80% on the training data and 75% on the test data.
Extremely randomized trees classifier: accuracy is 77% on the training data and 70% on the test data.
Gradient Boosting Machine: accuracy is 75% on the training data and 74% on the test data.
K-Fold cross validation accuracy is 75%.
The RandomForestClassifier does the best job here, achieving the highest test accuracy.