
Bank Tele Marketing Project in Python

Before starting the lesson, please download the dataset.

Problem statement

Marketing campaigns are crucial for any institution that wants to generate business by promoting its products, and a data-driven strategy can help achieve much better results. This data relates to direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls, in which a bank representative pitched a banking product to a potential customer. The classification goal is to predict whether the client will subscribe to the specific product: ‘Yes’ or ‘No’.

Data Exploration

The dataset has 18 variables, including the dependent variable ‘y’, which denotes whether the customer subscribed to the product ‘Term Deposit’. Variable ‘y’ (term deposit) is the dependent variable; the rest are independent variables.

In [1]:
import pandas as pd
bank_market=pd.read_csv("C:\\Users\\Personal\\Google Drive\\bank_market.csv")
bank_market.shape
Out[1]:
(45211, 18)

We have 18 variables and 45211 observations.

In [2]:
bank_market.columns.values
Out[2]:
array(['Cust_num', 'age', 'job', 'marital', 'education', 'default',
       'balance', 'housing', 'loan', 'contact', 'day', 'month', 'duration',
       'campaign', 'pdays', 'previous', 'poutcome', 'y'], dtype=object)
In [3]:
bank_market.head()
Out[3]:
Cust_num age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
0 1 58 management married tertiary no 2143 yes no unknown 5 may 261 1 -1 0 unknown no
1 2 44 technician single secondary no 29 yes no unknown 5 may 151 1 -1 0 unknown no
2 3 33 entrepreneur married secondary no 2 yes yes unknown 5 may 76 1 -1 0 unknown no
3 4 47 blue-collar married unknown no 1506 yes no unknown 5 may 92 1 -1 0 unknown no
4 5 33 unknown single unknown no 1 no no unknown 5 may 198 1 -1 0 unknown no

Next, you can look at a summary of the numerical fields using the describe() function.

In [4]:
bank_market.describe()
Out[4]:
Cust_num age balance day duration campaign pdays previous
count 45211.000000 45211.000000 45211.000000 45211.000000 45211.000000 45211.000000 45211.000000 45211.000000
mean 22606.000000 40.936210 1362.272058 15.806419 258.163080 2.763841 40.197828 0.580323
std 13051.435847 10.618762 3044.765829 8.322476 257.527812 3.098021 100.128746 2.303441
min 1.000000 18.000000 -8019.000000 1.000000 0.000000 1.000000 -1.000000 0.000000
25% 11303.500000 33.000000 72.000000 8.000000 103.000000 1.000000 -1.000000 0.000000
50% 22606.000000 39.000000 448.000000 16.000000 180.000000 2.000000 -1.000000 0.000000
75% 33908.500000 48.000000 1428.000000 21.000000 319.000000 3.000000 -1.000000 0.000000
max 45211.000000 95.000000 102127.000000 31.000000 4918.000000 63.000000 871.000000 275.000000

The describe() function provides the count, mean, standard deviation (std), min, quartiles and max in its output.

The minimum value in the Cust_num column is 1 and the maximum is 45211, which is the number of rows in the data. Its mean and median are equal, as expected for a sequential customer identifier.

checking missing values

In [5]:
bank_market.isnull().sum()
Out[5]:
Cust_num     0
age          0
job          0
marital      0
education    0
default      0
balance      0
housing      0
loan         0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
y            0
dtype: int64

univariate analysis

age

It specifies the age of the customer in years. It’s an integer. Let’s see the summary of age:

In [6]:
bank_market['age'].describe()
Out[6]:
count    45211.000000
mean        40.936210
std         10.618762
min         18.000000
25%         33.000000
50%         39.000000
75%         48.000000
max         95.000000
Name: age, dtype: float64

The minimum age is 18 and the maximum is 95, which is plausible. Mean and median are very close, which suggests outliers may not be present.

Let’s see the percentile distribution.

In [7]:
bank_market['age'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.96,0.97,0.98,0.99,1])
Out[7]:
0.00    18.0
0.01    23.0
0.03    26.0
0.05    27.0
0.07    28.0
0.09    28.0
0.10    29.0
0.20    32.0
0.30    34.0
0.40    36.0
0.50    39.0
0.60    42.0
0.70    46.0
0.80    51.0
0.90    56.0
0.95    59.0
0.96    59.0
0.97    60.0
0.98    63.0
0.99    71.0
1.00    95.0
Name: age, dtype: float64

The percentile distribution shows that only about 10% of customers are younger than 28 and roughly 97% are 60 or younger, so the bulk of customers fall between 28 and 60; a more mature customer group is being targeted by the campaign.

In [8]:
import matplotlib.pyplot as plt
%matplotlib inline
bank_market.boxplot(column="age")
(boxplot of 'age' displayed)

We can notice that a good number of customers are in their early middle age. There doesn’t seem to be any sign of outliers in this variable.

duration

The variable ‘duration’ specifies the duration of the last call made to the customer, in seconds. Let’s see the summary of duration:

In [9]:
bank_market['duration'].describe()
Out[9]:
count    45211.000000
mean       258.163080
std        257.527812
min          0.000000
25%        103.000000
50%        180.000000
75%        319.000000
max       4918.000000
Name: duration, dtype: float64

The min value is 0.0 but the max value is 4918.0 seconds, which is around 1 hour 22 minutes. Let’s see the percentile distribution.

In [10]:
bank_market['duration'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.99,1])
Out[10]:
0.00       0.0
0.01      11.0
0.03      22.0
0.05      35.0
0.07      46.0
0.09      55.0
0.10      58.0
0.20      89.0
0.30     117.0
0.40     147.0
0.50     180.0
0.60     223.0
0.70     280.0
0.80     368.0
0.90     548.0
0.95     751.0
0.99    1269.0
1.00    4918.0
Name: duration, dtype: float64

The percentile distribution shows 95% of the calls ended within the first 751 seconds, or around 12.5 minutes.

Boxplot of the variable.

In [11]:
bank_market.boxplot(column="duration")
(boxplot of 'duration' displayed)

The boxplot shows quite a few extreme values in this variable. We will treat the top 1% of values (above the 99th percentile, about 1269 seconds) as outliers.
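As a quick check (an added sketch, not part of the original notebook), we can count how many calls would be flagged under this 99th-percentile rule, using the bank_market frame already loaded above:

# Hedged sketch: count calls above the 99th percentile of 'duration'.
cutoff = bank_market['duration'].quantile(0.99)        # about 1269 seconds
n_flagged = (bank_market['duration'] > cutoff).sum()
print(cutoff, n_flagged)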

campaign

This variable represents the number of contacts made during this campaign for this client, including the last contact. Summary of variable ‘campaign’:

In [12]:
bank_market['campaign'].describe()
Out[12]:
count    45211.000000
mean         2.763841
std          3.098021
min          1.000000
25%          1.000000
50%          2.000000
75%          3.000000
max         63.000000
Name: campaign, dtype: float64

The min value is 1 and the max value is 63. Contacting the same customer 63 times seems too high; this might be an outlier. (An outlier doesn’t necessarily mean the information was falsely entered; however, it can affect our predictive models.)

Let’s get into percentile distribution:

In [13]:
bank_market['campaign'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.99,1])
Out[13]:
0.00     1.0
0.01     1.0
0.03     1.0
0.05     1.0
0.07     1.0
0.09     1.0
0.10     1.0
0.20     1.0
0.30     1.0
0.40     2.0
0.50     2.0
0.60     2.0
0.70     3.0
0.80     4.0
0.90     5.0
0.95     8.0
0.99    16.0
1.00    63.0
Name: campaign, dtype: float64

We can see that about 60% of customers were contacted no more than twice, and 90% were contacted five times or fewer. The last percentile deviates sharply from the rest of the data, so treating it as an outlier is a sensible call.

Boxplot for better visual understanding:

In [14]:
bank_market.boxplot(column="campaign")
(boxplot of 'campaign' displayed)
In [15]:
freq=bank_market['campaign'].value_counts()
freq
Out[15]:
1     17544
2     12505
3      5521
4      3522
5      1764
6      1291
7       735
8       540
9       327
10      266
11      201
12      155
13      133
14       93
15       84
16       79
17       69
18       51
19       44
20       43
21       35
22       23
23       22
25       22
24       20
28       16
29       16
26       13
31       12
27       10
32        9
30        8
33        6
34        5
36        4
35        4
43        3
38        3
41        2
50        2
37        2
51        1
55        1
46        1
58        1
44        1
39        1
63        1
Name: campaign, dtype: int64

We can see from all these measures that the distribution is concentrated in a very small range, and values far outside that range can be considered outliers. Here, any value greater than 16 (the 99th percentile) will be treated as an outlier.

In [16]:
bank_market['campaign'].hist(bins=50)
(histogram of 'campaign' displayed)

The boxplot and histogram show that most customers were called fewer than 5 times, and 99% of the customers were not called more than 16 times.

pdays

This variable represents the number of days that passed after the client was last contacted in a previous campaign. It is numeric, and -1 means the client was not previously contacted.

Summary of pdays:

In [17]:
bank_market['pdays'].describe()
Out[17]:
count    45211.000000
mean        40.197828
std        100.128746
min         -1.000000
25%         -1.000000
50%         -1.000000
75%         -1.000000
max        871.000000
Name: pdays, dtype: float64

The minimum value is -1 and the maximum is 871 days. Mean and median have a huge gap, which indicates a strongly skewed distribution and the presence of outliers.

Let’s have a look at percentile distribution:

In [18]:
bank_market['pdays'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.99,1])
Out[18]:
0.00     -1.0
0.01     -1.0
0.03     -1.0
0.05     -1.0
0.07     -1.0
0.09     -1.0
0.10     -1.0
0.20     -1.0
0.30     -1.0
0.40     -1.0
0.50     -1.0
0.60     -1.0
0.70     -1.0
0.80     -1.0
0.90    185.0
0.95    317.0
0.99    370.0
1.00    871.0
Name: pdays, dtype: float64

We see that about 80% of the values are -1; these customers are being contacted for the first time. Let’s have a look at the boxplot for good measure:

In [19]:
bank_market.boxplot(column="pdays")
(boxplot of 'pdays' displayed)

However, if the customer was last contacted over a year ago we can treat the value as an outlier; values above 365 qualify, and the 99th percentile happens to be 370, very near 365.
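As a small verification (an added sketch, not part of the original notebook), we can confirm both observations directly on the data:

# Hedged sketch: share of first-time contacts and count of contacts older than a year.
share_first_contact = (bank_market['pdays'] == -1).mean()   # roughly 0.8, matching the percentile table
n_over_a_year = (bank_market['pdays'] > 365).sum()          # contacts made more than 365 days ago
print(share_first_contact, n_over_a_year)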

previous

This variable represents the number of contacts performed before this campaign for this client; it is a numeric value.

Summary of ‘previous’:

In [20]:
bank_market['previous'].describe()
Out[20]:
count    45211.000000
mean         0.580323
std          2.303441
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max        275.000000
Name: previous, dtype: float64

The minimum value is 0, the maximum is 275, the mean is 0.58, and the median is 0.

A look at percentile distribution:

In [21]:
bank_market['previous'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.99,1])
Out[21]:
0.00      0.0
0.01      0.0
0.03      0.0
0.05      0.0
0.07      0.0
0.09      0.0
0.10      0.0
0.20      0.0
0.30      0.0
0.40      0.0
0.50      0.0
0.60      0.0
0.70      0.0
0.80      0.0
0.90      2.0
0.95      3.0
0.99      8.9
1.00    275.0
Name: previous, dtype: float64

The percentile distribution suggests 99% of values are at or below 8.9, so the value 275 must be an outlier.

Visualizing the distribution of ‘previous’ for a clear view:

In [22]:
bank_market.boxplot(column="previous")
(boxplot of 'previous' displayed)

day

It’s the day of the month of the last contact with the customer. The numeric value must be between 1 and 31.

Let’s have a look at summary to see if there is any outlier or missing value.

In [23]:
bank_market['day'].describe()
Out[23]:
count    45211.000000
mean        15.806419
std          8.322476
min          1.000000
25%          8.000000
50%         16.000000
75%         21.000000
max         31.000000
Name: day, dtype: float64

The summary shows a min value of 1 and a max value of 31, which matches the valid range of days in a month.

Have a look at boxplot:

In [24]:
bank_market.boxplot(column="day")
(boxplot of 'day' displayed)

The distribution of this variable looks fine.

balance

It’s the customer’s average yearly balance, in euros. This could have a large impact on whether the customer subscribes. Let’s have a look at the summary:

In [25]:
bank_market['balance'].describe()
Out[25]:
count     45211.000000
mean       1362.272058
std        3044.765829
min       -8019.000000
25%          72.000000
50%         448.000000
75%        1428.000000
max      102127.000000
Name: balance, dtype: float64

The min value is -8019 and the max value is 102127; neither is implausible for a bank balance.

In [26]:
bank_market['balance'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.99,1])
Out[26]:
0.00     -8019.0
0.01      -627.0
0.03      -322.0
0.05      -172.0
0.07       -51.0
0.09         0.0
0.10         0.0
0.20        22.0
0.30       131.0
0.40       272.0
0.50       448.0
0.60       701.0
0.70      1126.0
0.80      1859.0
0.90      3574.0
0.95      5768.0
0.99     13164.9
1.00    102127.0
Name: balance, dtype: float64

The percentile distribution shows 99% of the values are under 13164.9; the values beyond that might be outliers.

Have a look at boxplot to understand distribution:

In [27]:
bank_market.boxplot(column="balance")
(boxplot of 'balance' displayed)
In [28]:
bank_market['balance'].hist(bins=50)
(histogram of 'balance' displayed)

The boxplot and histogram show a heavily right-skewed distribution, but we will not treat any balance values as outliers here.

Exploring categorical variables

job

This represents the type of job the customer has. We can see the count for each level of a categorical variable by using the value_counts() function:

In [29]:
frequency_table=bank_market['job'].value_counts()
frequency_table
Out[29]:
blue-collar      9732
management       9458
technician       7597
admin.           5171
services         4154
retired          2264
self-employed    1579
entrepreneur     1487
unemployed       1303
housemaid        1240
student           938
unknown           288
Name: job, dtype: int64

marital

Marital status of the customer. It’s a categorical value: “married”, “divorced”, “single”. Note: “divorced” means divorced or widowed.

In [30]:
bank_market['marital'].value_counts()
Out[30]:
married     27214
single      12790
divorced     5207
Name: marital, dtype: int64

education

Education level of the customer. (categorical: “unknown”,“secondary”,“primary”,“tertiary”)

In [31]:
bank_market['education'].value_counts()
Out[31]:
secondary    23202
tertiary     13301
primary       6851
unknown       1857
Name: education, dtype: int64

default

This variable shows whether the customer has credit in default (binary: “yes”, “no”).

In [32]:
bank_market['default'].value_counts()
Out[32]:
no     44396
yes      815
Name: default, dtype: int64

housing

If the customer has any housing loan.

In [33]:
bank_market['housing'].value_counts()
Out[33]:
yes    25130
no     20081
Name: housing, dtype: int64

loan

If the customer has any personal loan

In [34]:
bank_market['loan'].value_counts()
Out[34]:
no     37967
yes     7244
Name: loan, dtype: int64

contact

The contact communication method used to approach the customer: ‘cellular’, ‘telephone’, ‘unknown’.

In [35]:
bank_market['contact'].value_counts()
Out[35]:
cellular     29285
unknown      13020
telephone     2906
Name: contact, dtype: int64

month

It’s the last contact month of the year.

In [36]:
bank_market['month'].value_counts()
Out[36]:
may    13766
jul     6895
aug     6247
jun     5341
nov     3970
apr     2932
feb     2649
jan     1403
oct      738
sep      579
mar      477
dec      214
Name: month, dtype: int64

poutcome

outcome of the previous marketing campaign.

In [37]:
bank_market['poutcome'].value_counts()
Out[37]:
unknown    36959
failure     4901
other       1840
success     1511
Name: poutcome, dtype: int64

y

This is our output (target) variable: has the client subscribed to a term deposit? (binary: ‘yes’, ‘no’)

In [38]:
bank_market['y'].value_counts()
Out[38]:
no     39922
yes     5289
Name: y, dtype: int64

Summary of Univariate Analysis on Numerical Variables

Let’s tabulate what we found in the univariate analysis:

Variable | Outliers | Remarks
---------|----------|--------
Cust_num | Nil |
age | Nil |
duration | 1% | Values above 1269
campaign | 1% | Values above 16
pdays | 1% | Values above 370
previous | 1% | Values above 8.9
day | Nil |
balance | Nil |
y | - | Output variable
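The 1% cutoffs above can also be reproduced programmatically; this short sketch (an addition, not from the original notebook) recomputes the 99th percentile for each flagged variable:

# Hedged sketch: recompute the outlier cutoffs listed in the table.
for col in ['duration', 'campaign', 'pdays', 'previous']:
    print(col, bank_market[col].quantile(0.99))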

Model Building

As our dependent variable ‘y’ is a binary (Yes/No) variable, a natural baseline algorithm is logistic regression.

We will need to split the data into training and testing sets. Using the train_test_split() function from scikit-learn, we will split the dataset in an 80:20 ratio of training and testing sets.

Initially we will work with the raw dataset, and then apply a basic cleaning process to treat the outliers and any NA values, to see whether the cleaning improves our results.

In [39]:
bank_market.dtypes
Out[39]:
Cust_num      int64
age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
y            object
dtype: object

Since sklearn requires all inputs to be numeric, we should convert all our categorical variables into numeric form by encoding the categories. This can be done using the following code:

In [40]:
from sklearn.preprocessing import LabelEncoder
var_mod = ['job','marital','education','default','housing','loan','contact','month','poutcome','y']
le = LabelEncoder()
for i in var_mod:
    bank_market[i] = le.fit_transform(bank_market[i])
bank_market.dtypes 
Out[40]:
Cust_num     int64
age          int64
job          int32
marital      int32
education    int32
default      int32
balance      int64
housing      int32
loan         int32
contact      int32
day          int64
month        int32
duration     int64
campaign     int64
pdays        int64
previous     int64
poutcome     int32
y            int32
dtype: object
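Note that LabelEncoder assigns arbitrary integer codes to nominal categories such as job and month, which implies an ordering that does not really exist. An alternative, not used in this project, is one-hot encoding with pd.get_dummies; the sketch below applies it to a fresh, un-encoded copy of the data (the same CSV path as above) and maps the target ‘y’ to 0/1 separately:

# Hedged sketch (alternative approach, not used here): one-hot encode the nominal columns.
raw = pd.read_csv("C:\\Users\\Personal\\Google Drive\\bank_market.csv")
cat_cols = ['job', 'marital', 'education', 'default', 'housing',
            'loan', 'contact', 'month', 'poutcome']
bank_dummies = pd.get_dummies(raw, columns=cat_cols, drop_first=True)
bank_dummies['y'] = raw['y'].map({'no': 0, 'yes': 1})
bank_dummies.shape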

Splitting the data into training and testing sets

In [41]:
from sklearn.model_selection import train_test_split
features = ["age", "job", "marital", "education", "default", "balance", "housing", "loan",
            "contact", "day", "month", "duration", "campaign", "pdays", "previous", "poutcome"]
X1 = bank_market[features]
y1 = bank_market['y']
X1_train, X1_test, Y1_train, Y1_test = train_test_split(X1, y1, train_size=0.8,random_state=90) 
Y1_test.shape, X1_test.shape,X1_train.shape,Y1_train.shape
Out[41]:
((9043,), (9043, 16), (36168, 16), (36168,))
In [42]:
from sklearn.linear_model import LogisticRegression
logistic1= LogisticRegression()
logistic1.fit(X1_train,Y1_train)
Out[42]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Let’s predict the class on the test set and find the Accuracy, sensitivity and specificity of this logistic regression model we just built:

In [43]:
predict1=logistic1.predict(X1_test)
In [44]:
from sklearn.metrics import confusion_matrix
cm1 = confusion_matrix(Y1_test,predict1)
print(cm1)
total1=sum(sum(cm1))
[[7800  158]
 [ 844  241]]
In [45]:
accuracy1=(cm1[0,0]+cm1[1,1])/total1
accuracy1
Out[45]:
0.8891960632533451
In [46]:
specificity1=cm1[1,1]/(cm1[1,1]+cm1[1,0])
specificity1
Out[46]:
0.22211981566820277
In [47]:
sensitivity1=cm1[0,0]/(cm1[0,0]+cm1[0,1])
sensitivity1
Out[47]:
0.98014576526765518

Okay, accuracy is about 89%, which is not bad. But specificity is very low; as computed here, it is the proportion of actual subscribers that are correctly identified. In simple terms, the model labels a large number of potential subscribers as non-subscribers, which is bad: relying on this model, representatives would skip many customers they could actually have converted. We will work on increasing the specificity of our model.
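One common remedy (an added sketch, not the approach taken in this notebook) is to give the minority ‘yes’ class more weight so the model is less inclined to predict ‘no’ for everyone; scikit-learn’s LogisticRegression supports this via class_weight='balanced':

# Hedged sketch: class-weighted logistic regression on the same split.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
logistic_w = LogisticRegression(class_weight='balanced')
logistic_w.fit(X1_train, Y1_train)
cm_w = confusion_matrix(Y1_test, logistic_w.predict(X1_test))
print(cm_w)   # expect more correctly identified subscribers, at the cost of some overall accuracy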

Remove outliers

We created a summary table of the continuous variables during univariate analysis to document the outliers and missing values. Four continuous variables show signs of outliers: duration, campaign, pdays and previous. Let’s treat the outliers one by one.

First, create a new dataset to hold the changed variables, so the original dataset stays intact.

In [48]:
bank_market1=bank_market.copy()
bank_market1.shape
Out[48]:
(45211, 18)
In [49]:
bank_market1['duration_new']=bank_market1['duration']
bank_market1.loc[bank_market1['duration_new']>1269, 'duration_new']=180
bank_market1['duration_new'].describe()
Out[49]:
count    45211.000000
mean       243.457035
std        211.614474
min          0.000000
25%        103.000000
50%        180.000000
75%        310.000000
max       1269.000000
Name: duration_new, dtype: float64
In [50]:
bank_market1.boxplot(column="duration_new")
(boxplot of 'duration_new' displayed)
In [51]:
bank_market1['duration_new'].quantile([0.1, .25,.50,.75,0.8, 0.85, .90,0.95, .99,1])
Out[51]:
0.10      58.0
0.25     103.0
0.50     180.0
0.75     310.0
0.80     358.0
0.85     421.0
0.90     521.0
0.95     696.0
0.99    1051.0
1.00    1269.0
Name: duration_new, dtype: float64

campaign

Values above 16 are outliers and should be replaced with the median value, 2.

In [52]:
bank_market1['campaign_new']=bank_market1['campaign']
bank_market1.loc[bank_market1['campaign_new']>16, 'campaign_new']=2
bank_market1['campaign_new'].describe()
Out[52]:
count    45211.000000
mean         2.551746
std          2.214597
min          1.000000
25%          1.000000
50%          2.000000
75%          3.000000
max         16.000000
Name: campaign_new, dtype: float64
In [53]:
bank_market1.boxplot(column="campaign_new")
(boxplot of 'campaign_new' displayed)
In [54]:
bank_market1['campaign_new'].quantile([0.1, .25,.50,.75,0.8, 0.85, .90,0.95, .99,1])
Out[54]:
0.10     1.0
0.25     1.0
0.50     2.0
0.75     3.0
0.80     4.0
0.85     4.0
0.90     5.0
0.95     7.0
0.99    12.0
1.00    16.0
Name: campaign_new, dtype: float64

pdays

Values above 370 can be considered outliers according to our observations during univariate analysis. We will replace them with the mean of pdays, about 40.2.

In [55]:
bank_market1['pdays_new']=bank_market1['pdays']
bank_market1.loc[bank_market1['pdays_new']>370, 'pdays_new']=40.2
bank_market1['pdays_new'].describe()
Out[55]:
count    45211.000000
mean        36.505430
std         91.014331
min         -1.000000
25%         -1.000000
50%         -1.000000
75%         -1.000000
max        370.000000
Name: pdays_new, dtype: float64
In [56]:
bank_market1.boxplot(column="pdays_new")
(boxplot of 'pdays_new' displayed)
In [57]:
bank_market1['pdays_new'].quantile([0.1, .25,.50,.75,0.8, 0.85, .90,0.95, .99,1])
Out[57]:
0.10     -1.0
0.25     -1.0
0.50     -1.0
0.75     -1.0
0.80     -1.0
0.85     94.0
0.90    182.0
0.95    288.0
0.99    362.0
1.00    370.0
Name: pdays_new, dtype: float64

previous

The 1% of values above 8.9 can be considered outliers. We will replace these values with the mean, 0.58.

In [58]:
bank_market1['previous_new']=bank_market1['previous']
bank_market1.loc[bank_market1['previous_new']>8.9, 'previous_new']=0.58
bank_market1['previous_new'].describe()
Out[58]:
count    45211.000000
mean         0.441325
std          1.189729
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          8.000000
Name: previous_new, dtype: float64
In [59]:
bank_market1.boxplot(column="previous_new")
(boxplot of 'previous_new' displayed)
In [60]:
bank_market1['previous_new'].quantile([0.1, .25,.50,.75,0.8, 0.85, .90,0.95, .99,1])
Out[60]:
0.10    0.0
0.25    0.0
0.50    0.0
0.75    0.0
0.80    0.0
0.85    1.0
0.90    2.0
0.95    3.0
0.99    6.0
1.00    8.0
Name: previous_new, dtype: float64

After this treatment we can still detect a hint of outliers. We were deliberately lenient in choosing the outlier boundaries, to preserve the integrity of the data and avoid inducing bias. If needed, we can tighten the outlier margins and replace those values again.
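If we want to experiment with different margins, a small reusable helper (an added sketch, not from the original notebook) avoids hard-coding each cutoff and replacement value:

# Hedged sketch: cap a series at a chosen quantile, replacing larger values.
def cap_at_quantile(series, q=0.99, replacement=None):
    cutoff = series.quantile(q)
    if replacement is None:
        replacement = series.median()      # default replacement: the median
    return series.where(series <= cutoff, replacement)

# Example: a stricter 95th-percentile cap on call duration (hypothetical column name).
bank_market1['duration_new_95'] = cap_at_quantile(bank_market1['duration'], q=0.95)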

Rebuild the model after outlier removal

Again, build a logistic regression model and see whether we made any improvements with the outlier removal. But first, divide the dataset bank_market1 into training and testing sets.

In [61]:
from sklearn.model_selection import train_test_split
feature = ["age", "job", "marital", "education", "default", "balance", "housing", "loan",
           "contact", "day", "month", "duration_new", "campaign_new", "pdays_new", "previous_new", "poutcome"]
X2 = bank_market1[feature]
y2 = bank_market1['y']
X2_train, X2_test, Y2_train, Y2_test = train_test_split(X2, y2, train_size=0.8,random_state=90) 
Y2_test.shape, X2_test.shape,X2_train.shape,Y2_train.shape
Out[61]:
((9043,), (9043, 16), (36168, 16), (36168,))
In [62]:
from sklearn.linear_model import LogisticRegression
logistic2= LogisticRegression()
logistic2.fit(X2_train,Y2_train)
Out[62]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
In [63]:
predict2=logistic2.predict(X2_test)
In [64]:
from sklearn.metrics import confusion_matrix
cm2 = confusion_matrix(Y2_test,predict2)
print(cm2)
total2=sum(sum(cm2))
[[7774  184]
 [ 839  246]]
In [65]:
accuracy2=(cm2[0,0]+cm2[1,1])/total2
accuracy2
Out[65]:
0.88687382505805601
In [66]:
specificity2=cm2[1,1]/(cm2[1,1]+cm2[1,0])
specificity2
Out[66]:
0.22672811059907835
In [67]:
sensitivity2=cm2[0,0]/(cm2[0,0]+cm2[0,1])
sensitivity2
Out[67]:
0.97687861271676302

ROC AND AUC

In [68]:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
actual = Y2_test
false_positive_rate, true_positive_rate, thresholds = roc_curve(actual, predict2)
plt.title('Receiver Operating Characteristic')
roc_auc = auc(false_positive_rate, true_positive_rate)
plt.plot(false_positive_rate, true_positive_rate,label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

roc_auc = auc(false_positive_rate, true_positive_rate)
roc_auc
Out[68]:
0.6018033616579207
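Note that the curve above is built from hard class labels (predict2), which gives only a single operating point. Scoring with predicted probabilities usually yields a smoother curve and a more informative AUC; a sketch (an addition, not in the original notebook):

# Hedged sketch: ROC/AUC from predicted probabilities instead of hard labels.
probs2 = logistic2.predict_proba(X2_test)[:, 1]      # probability of the 'yes' class
fpr, tpr, thresholds = roc_curve(Y2_test, probs2)
auc(fpr, tpr)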

Building a Decision Tree

In [69]:
import pandas as pd
from sklearn import tree
clf = tree.DecisionTreeClassifier(random_state=90)
clf = clf.fit(X2_train,Y2_train)
clf
Out[69]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=90, splitter='best')
In [70]:
predict3 = clf.predict(X2_test)
from sklearn.metrics import confusion_matrix
cm3=confusion_matrix(Y2_test, predict3)
print (cm3)
[[7371  587]
 [ 537  548]]
In [71]:
total3 = sum(sum(cm3))
accuracy3 = (cm3[0,0]+cm3[1,1])/total3
accuracy3
Out[71]:
0.87570496516642704
In [72]:
specificity3=cm3[1,1]/(cm3[1,1]+cm3[1,0])
specificity3
Out[72]:
0.50506912442396312
In [73]:
sensitivity3=cm3[0,0]/(cm3[0,0]+cm3[0,1])
sensitivity3
Out[73]:
0.92623774817793414

RandomForest

In [80]:
from sklearn.ensemble import RandomForestClassifier
forest=RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=None, min_samples_split=2, 
                              min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', 
                              max_leaf_nodes=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, 
                              verbose=0, warm_start=False, class_weight=None)
In [81]:
forest.fit(X2_train,Y2_train)
Out[81]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
In [82]:
Predicted=forest.predict(X2_test)

from sklearn.metrics import confusion_matrix as cm
ConfusionMatrix = cm(Y2_test,Predicted)
print(ConfusionMatrix)
[[7725  233]
 [ 645  440]]
In [83]:
total = sum(sum(ConfusionMatrix))
accuracy = (ConfusionMatrix[0,0]+ConfusionMatrix[1,1])/total
accuracy
Out[83]:
0.90290832688267164
In [84]:
sensitivity=ConfusionMatrix[0,0]/(ConfusionMatrix[0,0]+ConfusionMatrix[0,1])
sensitivity
Out[84]:
0.97072128675546621
In [85]:
specificity=ConfusionMatrix[1,1]/(ConfusionMatrix[1,1]+ConfusionMatrix[1,0])
specificity
Out[85]:
0.40552995391705071

It seems the tree-based models do a significantly better job on specificity: the decision tree roughly doubles it to about 0.51, while the random forest gives the best accuracy, about 0.90, with specificity around 0.41. The results are still not ideal, but they are a clear improvement over the logistic regression baseline.
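To see all four models side by side, the metrics computed above can be collected into one table (an added sketch using the variables already defined in this notebook):

# Hedged sketch: compare the models built above.
import pandas as pd
comparison = pd.DataFrame({
    'model': ['logistic (raw)', 'logistic (outliers treated)', 'decision tree', 'random forest'],
    'accuracy': [accuracy1, accuracy2, accuracy3, accuracy],
    'sensitivity': [sensitivity1, sensitivity2, sensitivity3, sensitivity],
    'specificity': [specificity1, specificity2, specificity3, specificity],
})
print(comparison)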

 
