Before we start the lesson, please download the dataset.
Problem statement
Marketing campaigns are crucial for any institution that wants to generate business by promoting its products, and a data-driven strategy can be very helpful in achieving good results. This data relates to the direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls, in which a bank representative pitched a banking product to a potential customer. The classification goal is to predict whether the client will subscribe to the specific product: ‘Yes’ or ‘No’.
Data Exploration
The dataset has 18 variables, including the dependent variable ‘y’, which denotes whether the customer subscribed to the product ‘Term Deposit’. Variable ‘y’ (term deposit) is the dependent variable and the rest are independent variables.
import pandas as pd
bank_market=pd.read_csv("C:\\Users\\Personal\\Google Drive\\bank_market.csv")
bank_market.shape
We have 45211 observations and 18 variables.
bank_market.columns.values
bank_market.head()
Next, you can look at a summary of the numerical fields using the describe() function:
bank_market.describe()
The describe() function reports the count, mean, standard deviation (std), min, quartiles, and max of each numerical column.
The minimum value in the Cust_num column is 1 and the maximum is 45211, which is the number of rows in the data; Cust_num is just a serial number. Its mean and median are equal, as expected for an evenly spaced sequence.
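A quick check of that claim, using the Cust_num column named above:
bank_market['Cust_num'].mean(), bank_market['Cust_num'].median()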
Checking missing values
bank_market.isnull().sum()
Univariate analysis
age
It specifies the age of the customer in years; it's an integer. Let's see the summary of age:
bank_market['age'].describe()
The minimum age is 18 and the maximum is 95, which is plausible. Mean and median are very close, which suggests outliers may not be present.
Let's see the percentile distribution.
bank_market['age'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.96,0.97,0.98,0.99,1])
The percentile distribution shows that only 10% of customers are younger than 28, and around 75% are between the ages of 28 and 60, meaning the campaign targets a more mature customer group.
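We can sanity-check that figure directly; a quick sketch computing the share of customers aged 28 to 60:
((bank_market['age'] >= 28) & (bank_market['age'] <= 60)).mean()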
import matplotlib.pyplot as plt
%matplotlib inline
bank_market.boxplot(column="age")
We can see that a good number of customers are in their early middle age. There doesn't seem to be any sign of outliers in this variable.
duration
Variable ‘duration’ specifies the duration of the last call made to the customer, in seconds. Let's see the summary of duration:
bank_market['duration'].describe()
The min value is 0.0 but the max is 4918.0 seconds, which is around 1 hour 22 minutes. Let's see the percentile distribution.
bank_market['duration'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.99,1])
The percentile distribution shows that 95% of the calls ended within the first 751 seconds, or around 12.5 minutes.
Boxplot of the variable:
bank_market.boxplot(column="duration")
The boxplot shows that there are quite a few outliers in this variable. We can treat the extreme upper tail, values above the 99th percentile, as outliers.
campaign
This variable represents number of contacts made during this campaign and for this client, this includes the last contact. Summary of variable ‘campaign’:
bank_market['campaign'].describe()
The min value is 1 and the max value is 63. Contacting the same customer 63 times seems too high; this might be an outlier. (An outlier doesn't mean the information was falsely entered; it can, however, affect our predictive models.)
Let’s get into percentile distribution:
bank_market['campaign'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.99,1])
We can see that about 60% of customers were contacted no more than twice, and 90% were contacted fewer than 5 times. The last 1% deviates sharply from the rest of the data, so treating it as outliers is the best call.
Boxplot for a better visual understanding:
bank_market.boxplot(column="campaign")
freq=bank_market['campaign'].value_counts()
freq
Across all these views the distribution is concentrated in a very small range, and values too far outside that range can be considered outliers. Here, any value above 16, i.e. above the 99th percentile, will be treated as an outlier.
bank_market['campaign'].hist(bins=50)
The boxplot and histogram show that most customers were called fewer than 5 times, and 99% of the customers were not called more than 16 times.
pdays
This variable represents the number of days that passed since the client was last contacted in a previous campaign. It is numeric, and -1 means the client was not previously contacted.
Summary of pdays:
bank_market['pdays'].describe()
The minimum value is -1 and the maximum is 871 days. Mean and median have a huge gap, which indicates the presence of outliers.
Let’s have a look at percentile distribution:
bank_market['pdays'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.99,1])
We see that about 80% of the values are -1; these customers were contacted for the first time. Let's have a look at the boxplot for good measure:
bank_market.boxplot(column="pdays")
However, if a customer was last contacted over a year ago we can consider the value an outlier, so anything above 365 days qualifies; conveniently, the 99th percentile happens to be 370, very near 365.
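The 80% figure is easy to verify directly:
(bank_market['pdays'] == -1).mean()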
previous
This variable represents the number of contacts performed before this campaign for this client; it's a numeric value.
Summary of ‘previous’:
bank_market['previous'].describe()
The minimum value is 0 and the maximum is 275; the mean is 0.58 but the median is 0.
A look at percentile distribution:
bank_market['previous'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.99,1])
The percentile distribution suggests 99% of the values are at or below 8.9, so the value 275 must be an outlier.
Visualizing the distribution of ‘previous’ for a clear view:
bank_market.boxplot(column="previous")
day
It's the day of the month on which the customer was last contacted. The numeric value must be between 1 and 31.
Let’s have a look at summary to see if there is any outlier or missing value.
bank_market['day'].describe()
The summary shows a min value of 1 and a max of 31, exactly the valid range of days in a month.
Have a look at the boxplot:
bank_market.boxplot(column="day")
The distribution of this variable looks fine.
balance
It's the average yearly balance of the customer, in euros. This could have a large impact on whether the customer subscribes. Let's have a look at the summary:
bank_market['balance'].describe()
The min value is -8019 and the max is 102127; neither is impossible for an account balance.
bank_market['balance'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.99,1])
The percentile distribution shows 99% of the values are under 13164.9; the rest might be outliers.
Have a look at boxplot to understand distribution:
bank_market.boxplot(column="balance")
bank_market['balance'].hist(bins=50)
Despite the long upper tail, we will not treat any balance values as outliers here.
Exploring categorical variables
job
This represents the type of job the customer has. We can see the count of each level of a categorical variable using the value_counts() function:
frequency_table=bank_market['job'].value_counts()
frequency_table
marital
Marital status of the customer. It's a categorical variable: “married”, “divorced”, “single”. Note: “divorced” means divorced or widowed.
bank_market['marital'].value_counts()
education
Education level of the customer. (categorical: “unknown”,“secondary”,“primary”,“tertiary”)
bank_market['education'].value_counts()
default
This variable shows whether the customer has credit in default (binary: “yes”, “no”).
bank_market['default'].value_counts()
housing
Whether the customer has a housing loan.
bank_market['housing'].value_counts()
loan
Whether the customer has a personal loan.
bank_market['loan'].value_counts()
contact
The communication method used to approach the customer: ‘cellular’, ‘telephone’, ‘unknown’.
bank_market['contact'].value_counts()
month
It's the month of the year in which the customer was last contacted.
bank_market['month'].value_counts()
poutcome
Outcome of the previous marketing campaign.
bank_market['poutcome'].value_counts()
y
This is our output (target) variable: has the client subscribed to a term deposit? (binary: ‘yes’, ‘no’)
bank_market['y'].value_counts()
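The normalized counts show the class balance (value_counts with normalize=True gives proportions):
bank_market['y'].value_counts(normalize=True)
Keep the balance in mind: when one class dominates, plain accuracy will look flattering later on.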
Summary of Univariate Analysis on Numerical Variables
Let's tabulate what we found in the univariate analysis:
Variable | Outliers | Remarks
---------|----------|------------------
Cust_num | Nil      |
age      | Nil      |
duration | 1%       | Values above 1269
campaign | 1%       | Values above 16
pdays    | 1%       | Values above 370
previous | 1%       | Values above 8.9
day      | Nil      |
balance  | Nil      |
y        |          | Output variable
Model Building
As our dependent variable ‘y’ is a binary (yes/no) variable, a basic algorithm to start with is logistic regression.
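As a reminder of the mechanics, logistic regression passes a linear combination of the features through the sigmoid function to squash it into a probability between 0 and 1; a minimal standalone illustration (not part of the pipeline below):
import numpy as np

def sigmoid(z):
    # Map a linear score z = b0 + b.x to a probability in (0, 1)
    return 1 / (1 + np.exp(-z))

sigmoid(np.array([-2.0, 0.0, 2.0]))  # -> array([0.119..., 0.5, 0.880...])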
We will need to split the data into training and testing sets. Using the train_test_split() function from sklearn, we will split the dataset in an 80:20 ratio of training to testing data.
Initially we will work with the raw dataset; then we will run a basic cleaning process to treat the outliers and any NA values, and see whether the cleaning improves our results.
bank_market.dtypes
Since sklearn requires all inputs to be numeric, we should convert all our categorical variables into numeric form by encoding the categories. This can be done using the following code:
from sklearn.preprocessing import LabelEncoder
var_mod = ['job','marital','education','default','housing','loan','contact','month','poutcome','y']
le = LabelEncoder()
for i in var_mod:
    bank_market[i] = le.fit_transform(bank_market[i])  # replace each category with an integer code
bank_market.dtypes
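One caveat worth noting: LabelEncoder maps categories to arbitrary integers, which imposes an ordering that nominal variables such as job or marital don't actually have. A common alternative is one-hot encoding, sketched here on the raw (un-encoded) dataframe; we stick with label encoding below to keep the feature count small:
# Hypothetical alternative: each listed category column becomes a set of 0/1 indicator columns
encoded = pd.get_dummies(bank_market, columns=['job', 'marital', 'contact', 'poutcome'])
encoded.shape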
Splitting the data into training and testing sets
from sklearn.model_selection import train_test_split
features = ['age', 'job', 'marital', 'education', 'default', 'balance', 'housing', 'loan',
            'contact', 'day', 'month', 'duration', 'campaign', 'pdays', 'previous', 'poutcome']
X1 = bank_market[features]
y1 = bank_market['y']
X1_train, X1_test, Y1_train, Y1_test = train_test_split(X1, y1, train_size=0.8,random_state=90)
Y1_test.shape, X1_test.shape,X1_train.shape,Y1_train.shape
from sklearn.linear_model import LogisticRegression
logistic1= LogisticRegression()
logistic1.fit(X1_train,Y1_train)
Let’s predict the class on the test set and find the Accuracy, sensitivity and specificity of this logistic regression model we just built:
predict1=logistic1.predict(X1_test)
from sklearn.metrics import confusion_matrix
cm1 = confusion_matrix(Y1_test,predict1)
print(cm1)
total1=sum(sum(cm1))
accuracy1=(cm1[0,0]+cm1[1,1])/total1
accuracy1
specificity1=cm1[1,1]/(cm1[1,1]+cm1[1,0])
specificity1
sensitivity1=cm1[0,0]/(cm1[0,0]+cm1[0,1])
sensitivity1
Okay, accuracy is 89%; that's not bad. But specificity is very low, meaning the proportion of actual subscribers (‘yes’) that the model correctly identifies is very low. In simple terms, the model labels a large share of potential subscribers as non-subscribers, which is bad: representatives relying on it might skip customers they could actually have converted. We will work on increasing the specificity of our model.
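As a cross-check on the manual confusion-matrix arithmetic, the two quantities above are just per-class recalls, which sklearn can compute directly (keeping the same convention as above, with class 0 = ‘no’):
from sklearn.metrics import recall_score

recall_score(Y1_test, predict1, pos_label=0)  # what we called sensitivity above
recall_score(Y1_test, predict1, pos_label=1)  # what we called specificity above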
Remove outliers
We created a summary table of the continuous variables in the univariate analysis to document the outliers and missing values. Four continuous variables show signs of outliers: duration, campaign, pdays and previous. Let's treat the outliers one by one.
First, create a copy of the dataset to hold the changed variables, keeping the original dataset intact. (Plain assignment would only create a second reference to the same dataframe, so we use .copy().)
bank_market1 = bank_market.copy()
bank_market1.shape
duration
Values above 1269 (the 99th percentile) are outliers; replace them with the median value, 180.
bank_market1['duration_new'] = bank_market1['duration']
bank_market1.loc[bank_market1['duration_new'] > 1269, 'duration_new'] = 180
bank_market1['duration_new'].describe()
bank_market1.boxplot(column="duration_new")
bank_market1['duration_new'].quantile([0.1, .25,.50,.75,0.8, 0.85, .90,0.95, .99,1])
campaign
Values above 16 are outliers and should be replaced with the median value, 2.
bank_market1['campaign_new'] = bank_market1['campaign']
bank_market1.loc[bank_market1['campaign_new'] > 16, 'campaign_new'] = 2
bank_market1['campaign_new'].describe()
bank_market1.boxplot(column="campaign_new")
bank_market1['campaign_new'].quantile([0.1, .25,.50,.75,0.8, 0.85, .90,0.95, .99,1])
pdays
Values above 370 can be considered outliers, per our observations during the univariate analysis; replace them with the mean value, 40.2.
bank_market1['pdays_new'] = bank_market1['pdays']
bank_market1.loc[bank_market1['pdays_new'] > 370, 'pdays_new'] = 40.2
bank_market1['pdays_new'].describe()
bank_market1.boxplot(column="pdays_new")
bank_market1['pdays_new'].quantile([0.1, .25,.50,.75,0.8, 0.85, .90,0.95, .99,1])
previous
The 1% of values above 8.9 can be considered outliers. We will replace them with the mean, 0.58.
bank_market1['previous_new'] = bank_market1['previous']
bank_market1.loc[bank_market1['previous_new'] > 8.9, 'previous_new'] = 0.58
bank_market1['previous_new'].describe()
bank_market1.boxplot(column="previous_new")
bank_market1['previous_new'].quantile([0.1, .25,.50,.75,0.8, 0.85, .90,0.95, .99,1])
Even after treating the outliers we can still detect a hint of them. We were lenient in deciding the outlier boundaries, to keep the integrity of the data and avoid inducing bias. If needed, we can tighten the outlier margins and replace those values again.
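The four blocks above repeat the same cap-and-replace pattern, so if we do decide to re-tune the boundaries, a small helper (a sketch, using the thresholds and replacement values chosen above) keeps the treatment in one place:
def cap_outliers(df, col, cap, replacement):
    # Copy the column, then replace values above the cap with the chosen replacement
    df[col + '_new'] = df[col]
    df.loc[df[col + '_new'] > cap, col + '_new'] = replacement
    return df

# Equivalent to the manual steps above:
# bank_market1 = cap_outliers(bank_market1, 'duration', 1269, 180)
# bank_market1 = cap_outliers(bank_market1, 'campaign', 16, 2)
# bank_market1 = cap_outliers(bank_market1, 'pdays', 370, 40.2)
# bank_market1 = cap_outliers(bank_market1, 'previous', 8.9, 0.58)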
Rebuild the model after outlier removal
Let's build a logistic regression model again and see whether the outlier treatment improved anything. But first, divide the dataset bank_market1 into training and testing sets.
from sklearn.model_selection import train_test_split
feature = ['age', 'job', 'marital', 'education', 'default', 'balance', 'housing', 'loan',
           'contact', 'day', 'month', 'duration_new', 'campaign_new', 'pdays_new', 'previous_new', 'poutcome']
X2 = bank_market1[feature]
y2 = bank_market1['y']
X2_train, X2_test, Y2_train, Y2_test = train_test_split(X2, y2, train_size=0.8,random_state=90)
Y2_test.shape, X2_test.shape, X2_train.shape, Y2_train.shape
from sklearn.linear_model import LogisticRegression
logistic2= LogisticRegression()
logistic2.fit(X2_train,Y2_train)
predict2=logistic2.predict(X2_test)
from sklearn.metrics import confusion_matrix
cm2 = confusion_matrix(Y2_test,predict2)
print(cm2)
total2=sum(sum(cm2))
accuracy2=(cm2[0,0]+cm2[1,1])/total2
accuracy2
specificity2=cm2[1,1]/(cm2[1,1]+cm2[1,0])
specificity2
sensitivity2=cm2[0,0]/(cm2[0,0]+cm2[0,1])
sensitivity2
ROC and AUC
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
actual = Y2_test
false_positive_rate, true_positive_rate, thresholds = roc_curve(actual, predict2)
plt.title('Receiver Operating Characteristic')
roc_auc = auc(false_positive_rate, true_positive_rate)
plt.plot(false_positive_rate, true_positive_rate,label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
roc_auc
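One caveat: roc_curve above was fed hard 0/1 predictions, so the "curve" has only one real operating point. Feeding it predicted probabilities traces the full curve; a sketch assuming logistic2 and X2_test from above:
# Probability of class 1, so roc_curve can sweep over every threshold
probs2 = logistic2.predict_proba(X2_test)[:, 1]
fpr2, tpr2, thresholds2 = roc_curve(Y2_test, probs2)
auc(fpr2, tpr2)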
Building a Decision Tree
import pandas as pd
from sklearn import tree
clf = tree.DecisionTreeClassifier(random_state=90)
clf = clf.fit(X2_train,Y2_train)
clf
predict3 = clf.predict(X2_test)
from sklearn.metrics import confusion_matrix
cm3=confusion_matrix(Y2_test, predict3)
print (cm3)
total3 = sum(sum(cm3))
accuracy3 = (cm3[0,0]+cm3[1,1])/total3
accuracy3
specificity3=cm3[1,1]/(cm3[1,1]+cm3[1,0])
specificity3
sensitivity3=cm3[0,0]/(cm3[0,0]+cm3[0,1])
sensitivity3
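A fully grown tree can memorize the training data, so it is worth checking a depth-limited variant as a guard against overfitting; a sketch (max_depth=5 is an arbitrary illustrative choice):
clf_pruned = tree.DecisionTreeClassifier(max_depth=5, random_state=90)
clf_pruned = clf_pruned.fit(X2_train, Y2_train)
confusion_matrix(Y2_test, clf_pruned.predict(X2_test))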
RandomForest
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=100, criterion='gini')
# The remaining parameters are left at their defaults, which match the values spelled out explicitly before.
forest.fit(X2_train,Y2_train)
Predicted=forest.predict(X2_test)
from sklearn.metrics import confusion_matrix as cm
ConfusionMatrix = cm(Y2_test,Predicted)
print(ConfusionMatrix)
total = sum(sum(ConfusionMatrix))
accuracy = (ConfusionMatrix[0,0]+ConfusionMatrix[1,1])/total
accuracy
sensitivity=ConfusionMatrix[0,0]/(ConfusionMatrix[0,0]+ConfusionMatrix[0,1])
sensitivity
specificity=ConfusionMatrix[1,1]/(ConfusionMatrix[1,1]+ConfusionMatrix[1,0])
specificity
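The fitted forest also exposes feature importances, which hint at which variables drive the predictions; a quick look, assuming forest and the feature list from above:
import pandas as pd

# Rank features by their impurity-based importance in the fitted forest
pd.Series(forest.feature_importances_, index=feature).sort_values(ascending=False)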
It seems the tree-based models do a pretty good job, improving specificity significantly. The result still isn't great, but we have reached a good baseline to build on.