Phishing Websites

Before starting the lesson, please download the dataset (phishing_websites.csv).

Abstract: The main objective is to predict whether a website is phishing or legitimate, using the result column as the target.

In [6]:
import pandas as pd
train1=pd.read_csv("~/datasets/phishing_websites.csv") 
train1.shape
Out[6]:
(11055, 31)
In [4]:
train1.head()
Out[4]:
a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a22 a23 a24 a25 a26 a27 a28 a29 a30 result
0 -1 1 1 1 -1 -1 -1 -1 -1 1 1 1 -1 -1 -1 -1 1 1 -1 -1
1 1 1 1 1 1 -1 0 1 -1 1 1 1 -1 -1 0 -1 1 1 1 -1
2 1 0 1 1 1 -1 -1 -1 -1 1 1 1 1 -1 1 -1 1 0 -1 -1
3 1 0 1 1 1 -1 -1 -1 1 1 1 1 -1 -1 1 -1 1 -1 1 -1
4 1 0 -1 1 1 -1 1 1 -1 1 -1 1 -1 -1 0 -1 1 1 1 1

5 rows × 31 columns

In [6]:
import pandas as pd
import numpy as np
from scipy import stats
train1=pd.read_csv("~/datasets/phishing_websites.csv")
train1.shape
train1.columns.values
train1.head(10)
train1.describe()  # only this last expression is displayed by the notebook
Out[6]:
a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a22 a23 a24 a25 a26 a27 a28 a29 a30 result
count 11055.000000 11055.000000 11055.000000 11055.000000 11055.000000 11055.000000 11055.000000 11055.000000 11055.000000 11055.000000 11055.000000 11055.000000 11055.000000 11055.000000 11055.000000 11055.000000 11055.000000 11055.000000 11055.000000 11055.000000
mean 0.313795 -0.633198 0.738761 0.700588 0.741474 -0.734962 0.063953 0.250927 -0.336771 0.628584 0.613388 0.816915 0.061239 0.377114 0.287291 -0.483673 0.721574 0.344007 0.719584 0.113885
std 0.949534 0.766095 0.673998 0.713598 0.671011 0.678139 0.817518 0.911892 0.941629 0.777777 0.789818 0.576784 0.998168 0.926209 0.827733 0.875289 0.692369 0.569944 0.694437 0.993539
min -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000
25% -1.000000 -1.000000 1.000000 1.000000 1.000000 -1.000000 -1.000000 -1.000000 -1.000000 1.000000 1.000000 1.000000 -1.000000 -1.000000 0.000000 -1.000000 1.000000 0.000000 1.000000 -1.000000
50% 1.000000 -1.000000 1.000000 1.000000 1.000000 -1.000000 0.000000 1.000000 -1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 -1.000000 1.000000 0.000000 1.000000 1.000000
75% 1.000000 -1.000000 1.000000 1.000000 1.000000 -1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

8 rows × 31 columns
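
Because result only holds -1 and 1, its mean in the describe() output can be turned back into class counts. The following is a minimal sketch of that arithmetic, using the row count and mean printed above; the variable names are illustrative and not part of the original notebook.

```python
# Values copied from the describe() output above.
n_rows = 11055
mean_result = 0.113885

# With labels in {-1, 1}: mean = (n_pos - n_neg) / n_rows and n_pos + n_neg = n_rows,
# so n_pos = n_rows * (1 + mean) / 2.
n_pos = round(n_rows * (1 + mean_result) / 2)
n_neg = n_rows - n_pos

print(n_pos, n_neg)               # counts of result == 1 and result == -1
print(round(n_pos / n_rows, 4))   # share of rows with result == 1
```

The two counts should agree with the row sums of the confusion matrices printed later in the lesson.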

Association between result and a1

In [7]:
stats.pointbiserialr(train1.result,train1.a1)
Out[7]:
PointbiserialrResult(correlation=0.09416009495620388, pvalue=3.3674882178773865e-23)
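
The point-biserial correlation is the Pearson correlation computed when one of the two variables is binary. The sketch below illustrates this on a few made-up values (not the phishing data); the helper names pearson and point_biserial are hypothetical and written only for this illustration.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def point_biserial(binary, values):
    """Textbook formula: (M1 - M0) / s * sqrt(p * q), with s the population std."""
    n = len(values)
    group1 = [v for b, v in zip(binary, values) if b == 1]
    group0 = [v for b, v in zip(binary, values) if b != 1]
    m1 = sum(group1) / len(group1)
    m0 = sum(group0) / len(group0)
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    p = len(group1) / n
    return (m1 - m0) / s * math.sqrt(p * (1 - p))

# Toy data: a binary target coded as -1/1 and one feature.
toy_result = [1, 1, -1, -1, 1, -1, 1, -1]
toy_a1 = [1, 1, -1, 1, 1, -1, -1, -1]

r_pearson = pearson(toy_result, toy_a1)
r_pb = point_biserial(toy_result, toy_a1)
print(r_pearson, r_pb)
```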

Logistic Regression

In [11]:
from sklearn.linear_model import LogisticRegression
# Feature columns a1..a30; the target is passed as a 1-D Series to avoid a DataConversionWarning
features1=['a'+str(i) for i in range(1,31)]
logistic1= LogisticRegression()
logistic1.fit(train1[features1],train1['result'])
predict1=logistic1.predict(train1[features1])
predict1
logistic1
Out[11]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Confusion matrix

In [10]:
from sklearn.metrics import confusion_matrix
# Rows are the actual classes (-1, 1); columns are the predicted classes
cm1 = confusion_matrix(train1['result'],predict1)
print(cm1)
total1=cm1.sum()
[[4440  458]
 [ 336 5821]]

Accuracy

In [13]:
accuracy1=(cm1[0,0]+cm1[1,1])/total1
accuracy1
Out[13]:
0.92817729534147442
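
Accuracy is only one of the ratios that can be read off a confusion matrix. As a rough sketch, the counts printed above can be unpacked by hand, assuming scikit-learn's layout (rows are actual classes, columns are predicted classes, labels sorted as -1 then 1). Precision, recall and specificity are added here for illustration and are not computed in the original lesson.

```python
# Counts copied from the printed confusion matrix cm1.
cm = [[4440, 458],
      [336, 5821]]

tn, fp = cm[0]   # actual -1 (phishing): predicted -1, predicted 1
fn, tp = cm[1]   # actual  1 (legitimate): predicted -1, predicted 1
total = tn + fp + fn + tp

accuracy = (tn + tp) / total      # share of all rows classified correctly
precision = tp / (tp + fp)        # of rows predicted 1, how many are really 1
recall = tp / (tp + fn)           # of rows that are really 1, how many were found
specificity = tn / (tn + fp)      # of rows that are really -1, how many were found

print(accuracy, precision, recall, specificity)
```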

result is the target variable and contains -1 and 1: -1 denotes a phishing (fake) website and 1 denotes a legitimate (genuine) website. Recode -1 as 0 before fitting the statsmodels Logit model below, because it expects a binary outcome coded as 1/0. Logistic regression predicts a binary outcome from a set of independent variables. In simple words, it predicts the probability of occurrence of an event by fitting the data to a logit function.
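
To make the logit function concrete, here is a minimal sketch of the logit and its inverse (the sigmoid), which maps a linear score to a probability between 0 and 1. The function names and the example scores are illustrative only and do not come from the fitted model.

```python
import math

def sigmoid(z):
    """Inverse logit: turns a linear score z into a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))

# A score of 0 means even odds; larger scores push the probability toward 1.
for z in (-2.0, 0.0, 2.0):
    print(z, round(sigmoid(z), 4))

# logit undoes sigmoid (up to floating-point error).
print(logit(sigmoid(1.5)))
```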

In [14]:
# Work on an explicit copy so the assignment does not trigger SettingWithCopyWarning
newdata1=train1[['result']].copy()
newdata1[newdata1<=-1]=0
newdata1.head()
Out[14]:
result
0 0
1 0
2 0
3 0
4 1
In [15]:
import statsmodels.api as sm
features1=['a'+str(i) for i in range(1,31)]
# No intercept column is added, which matches the summary (30 coefficients, no const row)
logistic=sm.Logit(newdata1,train1[features1])
result1=logistic.fit()
summary_1=result1.summary()
summary_1
Optimization terminated successfully.
         Current function value: 0.184237
         Iterations 9
Out[15]:
Logit Regression Results
Dep. Variable: result No. Observations: 11055
Model: Logit Df Residuals: 11025
Method: MLE Df Model: 29
Date: Wed, 15 Jun 2016 Pseudo R-squ.: 0.7317
Time: 13:24:38 Log-Likelihood: -2036.7
converged: True LL-Null: -7590.9
LLR p-value: 0.000
coef std err z P>|z| [95.0% Conf. Int.]
a1 0.6791 0.057 11.880 0.000 0.567 0.791
a2 -0.1658 0.069 -2.397 0.017 -0.301 -0.030
a3 -0.5945 0.138 -4.320 0.000 -0.864 -0.325
a4 0.3207 0.076 4.239 0.000 0.172 0.469
a5 0.2313 0.159 1.453 0.146 -0.081 0.543
a6 1.4086 0.107 13.154 0.000 1.199 1.618
a7 0.6409 0.053 12.002 0.000 0.536 0.746
a8 1.6605 0.048 34.289 0.000 1.566 1.755
a9 -0.0358 0.054 -0.664 0.507 -0.142 0.070
a10 -0.2624 0.178 -1.474 0.140 -0.611 0.086
a11 0.7567 0.144 5.256 0.000 0.475 1.039
a12 -0.4361 0.105 -4.140 0.000 -0.642 -0.230
a13 0.2614 0.052 4.998 0.000 0.159 0.364
a14 3.1868 0.106 30.100 0.000 2.979 3.394
a15 0.7938 0.058 13.636 0.000 0.680 0.908
a16 0.7658 0.069 11.046 0.000 0.630 0.902
a17 -0.4511 0.096 -4.678 0.000 -0.640 -0.262
a18 -0.1493 0.109 -1.365 0.172 -0.364 0.065
a19 -0.8674 0.165 -5.254 0.000 -1.191 -0.544
a20 0.2718 0.119 2.284 0.022 0.039 0.505
a21 0.6544 0.138 4.726 0.000 0.383 0.926
a22 -0.1775 0.172 -1.034 0.301 -0.514 0.159
a23 -0.3671 0.143 -2.560 0.010 -0.648 -0.086
a24 0.1128 0.044 2.589 0.010 0.027 0.198
a25 0.5189 0.059 8.825 0.000 0.404 0.634
a26 0.8085 0.055 14.816 0.000 0.702 0.915
a27 0.1112 0.051 2.198 0.028 0.012 0.210
a28 0.7523 0.061 12.284 0.000 0.632 0.872
a29 1.0804 0.086 12.610 0.000 0.912 1.248
a30 0.3544 0.077 4.592 0.000 0.203 0.506

The variables a5, a9, a10, a18 and a22 have p-values above 0.05, so they are the least impactful variables and are dropped from the next model.
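
The selection can be sketched as a simple filter on the P>|z| column of the summary. The p-values below are copied from the table above; the 0.05 cut-off is an assumption (a common significance level) and is not stated in the original lesson.

```python
# P>|z| values copied from the Logit summary, in order a1..a30.
p_values = [0.000, 0.017, 0.000, 0.000, 0.146, 0.000, 0.000, 0.000, 0.507, 0.140,
            0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.172, 0.000, 0.022,
            0.000, 0.301, 0.010, 0.010, 0.000, 0.000, 0.028, 0.000, 0.000, 0.000]
names = ['a' + str(i) for i in range(1, 31)]

alpha = 0.05  # assumed significance level

dropped = [name for name, p in zip(names, p_values) if p > alpha]
kept = [name for name, p in zip(names, p_values) if p <= alpha]

print("dropped:", dropped)
print("kept:", len(kept), "variables")
```

With the fitted statsmodels result at hand, result1.pvalues should hold the same numbers without retyping them.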

In [49]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
train1=pd.read_csv("~/datasets/phishing_websites.csv")
# Keep a1..a30 except the five variables with p-values above 0.05
dropped=['a5','a9','a10','a18','a22']
features2=['a'+str(i) for i in range(1,31) if 'a'+str(i) not in dropped]
logistic2= LogisticRegression()
logistic2.fit(train1[features2],train1['result'])
predict2=logistic2.predict(train1[features2])
predict2
Out[49]:
array([-1,  1, -1, ..., -1, -1, -1], dtype=int64)

Confusion matrix

In [58]:
from sklearn.metrics import confusion_matrix
# Rows are the actual classes (-1, 1); columns are the predicted classes
cm2=confusion_matrix(train1['result'],predict2)
print(cm2)
total2=cm2.sum()
[[4444  454]
 [ 334 5823]]

Accuracy of the final logistic model

In [59]:
accuracy2=(cm2[0,0]+cm2[1,1])/total2
accuracy2
Out[59]:
0.9287200361827227

ROC and AUC

In [71]:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
actual = train1['result']
# predict2 holds hard class labels, so the curve has a single interior point
false_positive_rate, true_positive_rate, thresholds = roc_curve(actual, predict2)
# Compute the AUC before plotting, because the legend label uses it
roc_auc = auc(false_positive_rate, true_positive_rate)
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, true_positive_rate,label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

roc_auc
Out[71]:
0.92653095372329319
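
Since the ROC curve here is drawn from hard class labels rather than predicted probabilities, it consists of the points (0, 0), (FPR, TPR) and (1, 1), and the trapezoidal area under it reduces to the average of the true positive rate and the true negative rate. The sketch below checks this against the counts of the second confusion matrix printed above; it is a hand calculation for illustration, not the scikit-learn implementation.

```python
# Counts copied from the printed confusion matrix cm2
# (rows: actual -1, actual 1; columns: predicted -1, predicted 1).
tn, fp = 4444, 454
fn, tp = 334, 5823

tpr = tp / (tp + fn)   # true positive rate
fpr = fp / (fp + tn)   # false positive rate

# Trapezoidal area under the polyline (0, 0) -> (fpr, tpr) -> (1, 1).
area = fpr * tpr / 2 + (1 - fpr) * (tpr + 1) / 2

# The same area written as the mean of TPR and TNR.
balanced = (tpr + (1 - fpr)) / 2

print(area, balanced)
```

Passing probabilities instead, for example the second column of logistic2.predict_proba, would give a curve with many thresholds and generally a different AUC.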

Decision tree

In [9]:
import pandas as pd
from sklearn import tree
train1=pd.read_csv("~/datasets/phishing_websites.csv")
features= list(train1.columns[:30])
y=train1['result']
X = train1[features]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X,y)  # fit expects the features first, then the target
clf
Out[9]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
In [29]:
# Save tree as dot file ("tree.dot" is an example file name)
# Writing the file directly avoids the pydot dependency, which was not installed here
tree.export_graphviz(clf,
                     out_file = "tree.dot",
                     feature_names = features,
                     filled=True, rounded=True,
                     impurity=False)
# The file can then be rendered with Graphviz, for example: dot -Tpng tree.dot -o tree.png
In [32]:
from sklearn.metrics import confusion_matrix
predict3 = clf.predict(X)  # the method is predict; predict3 does not exist
predict3
cm3=confusion_matrix(y, predict3)
print (cm3)
total3 = cm3.sum()
accuracy3 = (cm3[0,0]+cm3[1,1])/total3
accuracy3

K-fold cross validation (using logistic model)

In [43]:
import numpy as np
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import KFold, cross_val_score
kfold = KFold(n_splits=10)
X=train1[['a'+str(i) for i in range(1,31)]]
y=train1['result']  # 1-D target avoids the DataConversionWarning
scores = cross_val_score(logistic1,X, y,scoring='accuracy',cv=kfold)
print("Accuracy per fold: ")
print(scores)
print("Average accuracy: ", scores.mean())
Accuracy per fold: 
[ 0.92314647  0.92857143  0.92405063  0.92947559  0.92585895  0.93484163
  0.92579186  0.90859729  0.92488688  0.92850679]
Average accuracy:  0.925372750853
In [47]:
mean=np.mean(scores)
mean
Out[47]:
0.92537275085301895
In [45]:
std=np.std(scores)
std
Out[45]:
0.0064444962268194305
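
To see what each fold contains, the split can be sketched without scikit-learn. The generator below is a hypothetical helper that mimics an unshuffled K-fold split: the rows are cut into consecutive blocks, and the first n_samples % n_splits blocks receive one extra row.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) for consecutive, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# Same dimensions as the lesson: 11055 rows, 10 folds.
folds = list(kfold_indices(11055, 10))
test_sizes = [len(test) for _, test in folds]
print(test_sizes)
```

Every row lands in exactly one test block, so each observation is used for validation once and for training in the other nine folds.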

Holdout

In [7]:
import numpy as np
from sklearn.model_selection import train_test_split
k_train, k_test = train_test_split(train1,     # Data set to split
                                   test_size = 0.25,  # Split ratio
                                   random_state=1,    # Set random seed
                                   stratify = train1["result"]) # Keep the class ratio of result in both parts
print(k_train.shape)
print(k_test.shape)
(8291, 31)
(2764, 31)
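
The stratify argument keeps the share of each class roughly equal in the training and test parts. A minimal sketch of the idea follows, using a hypothetical stratified_split helper and labels rebuilt from the class counts in the confusion matrices above. The rounding is done per class, so the exact sizes may differ by a row from what train_test_split returns.

```python
import random

def stratified_split(labels, test_fraction, seed):
    """Split row indices so each class contributes about test_fraction of its rows."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                       # shuffle within the class
        n_test = round(len(idxs) * test_fraction)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

# 6157 legitimate (1) and 4898 phishing (-1) rows, as in the data set.
labels = [1] * 6157 + [-1] * 4898
train_idx, test_idx = stratified_split(labels, 0.25, seed=1)

share_all = labels.count(1) / len(labels)
share_test = sum(1 for i in test_idx if labels[i] == 1) / len(test_idx)
print(len(train_idx), len(test_idx))
print(round(share_all, 3), round(share_test, 3))
```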
In [10]:
from sklearn.model_selection import KFold
from sklearn import tree
cv = KFold(n_splits=10)  # Desired number of cv folds (no shuffling, so no seed is needed)

# a1..a30 without the five variables dropped earlier
features2=['a'+str(i) for i in range(1,31) if 'a'+str(i) not in ['a5','a9','a10','a18','a22']]
clf = tree.DecisionTreeClassifier()
fold_accuracy = []
for train_fold, valid_fold in cv.split(train1):
    train = train1.iloc[train_fold] # Extract train data with kf indices
    valid = train1.iloc[valid_fold] # Extract valid data with kf indices
    model = clf.fit(X = train[features2],y = train['result'])
    valid_acc = model.score(X = valid[features2],y = valid['result'])
    fold_accuracy.append(valid_acc)
print("Accuracy per fold: ", fold_accuracy, "\n")
print("Average accuracy: ", sum(fold_accuracy)/len(fold_accuracy))
Accuracy per fold:  [0.98101265822784811, 0.98101265822784811, 0.97287522603978305, 0.98010849909584086, 0.97377938517179019, 0.95927601809954754, 0.9312217194570136, 0.93484162895927603, 0.94660633484162893, 0.94027149321266967] 

Average accuracy:  0.960100562133

K-fold cross validation (using decision tree)

In [19]:
from sklearn.model_selection import cross_val_score
# a1..a30 without the five variables dropped earlier
features2=['a'+str(i) for i in range(1,31) if 'a'+str(i) not in ['a5','a9','a10','a18','a22']]
# Pass y as a 1-D Series; a one-column DataFrame caused an IndexError in StratifiedKFold
scores = cross_val_score(estimator=clf, # Model to test
                         X = train1[features2],
                         y = train1['result'],
                         cv=10,scoring = "accuracy")
print("Accuracy per fold: ")
print(scores)
print("Average accuracy: ", scores.mean())

Conclusion:

The cross-validated accuracy of the logistic regression model is about 92.5%, close to the 92.8% measured on the training data.
