Before we start the lesson, please download the datasets.
Problem Statement
An insurance company needs to come up with a good marketing strategy. They want to run an e-mail campaign. Before sending a mail to all the available e-mail addresses, they want to first build a predictive model and identify the customers who are most likely to respond.
Analyze the historical data and build a predictive model that helps us maximize the response rate. The company will then be able to manage its cost by sending e-mail only to those who are most likely to respond.
Data importing
import pandas as pd
direct_mail=pd.read_csv("C:\\Users\\Personal\\Google Drive\\DirectMail.csv")
direct_mail.shape
direct_mail.columns.values
Summary of the dataset
Summary and structure let us take a preliminary look at the data, to understand what kind of data we are dealing with.
dtypes : shows the datatype of each variable; in pandas this is int64, float64, or object. Together with shape above, this tells us the number of observations and the number of variables.
describe() : gives more detail about each numeric variable: the count, mean, standard deviation, min and max values, and the 1st, 2nd (median) and 3rd quartiles. Since the count excludes missing values, it also reveals columns that contain NA's.
direct_mail.dtypes
Data Exploration
import pandas as pd
import sklearn as sk
import math
import numpy as np
from scipy import stats
import matplotlib as matlab
import statsmodels
direct_mail=pd.read_csv("C:\\Users\\Personal\\Google Drive\\DirectMail.csv")
direct_mail.shape
direct_mail.columns.values
direct_mail.head(10)
direct_mail.describe()
Checking missing values
To check whether any NA values are present in the dataset, and to count them per column:
direct_mail.isnull().sum()
We can see that all 33 NA values are in the CRED column; this column needs to be taken care of while cleaning the data.
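To see the missing counts as a share of all rows, a quick variant of the same check:
# fraction of missing values per column, expressed as a percentage
(direct_mail.isnull().mean() * 100).round(2)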
Univariate analysis
AGE
Age of the customer
direct_mail['AGE'].describe()
import matplotlib.pyplot as plt
%matplotlib inline
direct_mail.boxplot(column="AGE")
This variable is pretty clean. Mean and median are very close, and the min and max values are in a plausible range. The boxplot shows a healthy distribution of the data.
CRED : Credit score
Credit score is the creditworthiness of the person
direct_mail['CRED'].describe()
direct_mail['CRED_new']=direct_mail['CRED']
#to display all the rows which have missing values in the 'CRED_new' column
#(.loc is used here; the older .ix indexer has been removed from pandas):
direct_mail.loc[direct_mail['CRED_new'].isnull()]
#to get the row index (axis=0) of the rows which have missing values in this column
direct_mail.loc[direct_mail['CRED_new'].isnull()].index
#Once we have identified where the missing values are, the next task is usually to fill them (data imputation).
#Depending on the context, different strategies apply; in this case, I am assigning the mean value (603.6) to all positions where a missing value is present:
direct_mail.loc[direct_mail['CRED_new'].isnull(),'CRED_new']=603.6
sum(direct_mail['CRED_new'].isnull())
#and as the output suggests, this column doesn't have any missing values now
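Note that instead of hard-coding 603.6, the mean can be computed and filled in one step with fillna; a minimal sketch of the same imputation:
# compute the column mean (NaN values are skipped by default) and impute in one step
cred_mean = direct_mail['CRED'].mean()
direct_mail['CRED_new'] = direct_mail['CRED'].fillna(cred_mean)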
direct_mail['CRED_new'].describe()
direct_mail.boxplot(column="CRED_new")
direct_mail['CRED_new'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.99,1])
From the boxplot we can see there is one value, 1789.0, that stands outside the distribution; it is also the max value in the column and should be considered an outlier.
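Eyeballing the boxplot works here, but outliers can also be flagged programmatically with the common rule of 1.5 times the IQR; a sketch on the same column:
# flag values beyond 1.5 times the interquartile range from the quartiles
q1 = direct_mail['CRED_new'].quantile(0.25)
q3 = direct_mail['CRED_new'].quantile(0.75)
iqr = q3 - q1
outliers = direct_mail[(direct_mail['CRED_new'] < q1 - 1.5 * iqr) |
                       (direct_mail['CRED_new'] > q3 + 1.5 * iqr)]
outliers['CRED_new']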
MS : Marital Status
This variable gives the marital status of the customer, M: Married, U:Unmarried, X:Other/unknown
frequency=direct_mail['MS'].value_counts()
frequency
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(5,5))
frequency.plot(kind='bar')
HEQ : Home Equity
Home equity is the value of ownership built up in a home or property that represents the current market value of the house, minus any remaining mortgage payments.
direct_mail['HEQ'].describe()
direct_mail.boxplot(column='HEQ')
direct_mail['HEQ'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.99,1])
direct_mail['HEQ'].value_counts()
There aren’t any missing values. However, the boxplot and percentile distribution show some values far from the rest of the distribution. The max value, 200, seems to be an outlier.
INCOME : Income of the customer
Income of the customer
direct_mail['INCOME'].describe()
direct_mail.boxplot(column="INCOME")
direct_mail['INCOME'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.96,0.97,0.98,0.99,1])
direct_mail['INCOME'].value_counts()
We can clearly see in the boxplot that a group of data points sits outside the plot. This group represents the value 110, which is 1.6% of the total data. However, this value is not that distinct from the rest of the values.
DEPC : Depreciation
Indicates whether there is any reduction in the value of an asset.
frequ_tab=direct_mail['DEPC'].value_counts()
frequ_tab
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(5,5))
frequ_tab.plot(kind='bar')
MOB : Existing Customer
This variable shows whether the customer is existing or new.
feq=direct_mail['MOB'].value_counts()
feq
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(5,5))
feq.plot(kind='bar')
MILEAGE
Mileage of the customer vehicle
direct_mail['MILEAGE'].describe()
direct_mail.boxplot(column="MILEAGE")
direct_mail['MILEAGE'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.96,0.97,0.98,0.99,1])
Mean and median are close, but the max value is quite skewed. The boxplot shows that values beyond about 20 are dispersed. The percentile distribution shows that 99% of the values are evenly distributed, but the remaining 1% goes much higher. This 1% can be counted as outliers.
RESTYPE : Real Estate Type
This variable shows the type of house the customer is living in.
frequ=direct_mail['RESTYPE'].value_counts()
frequ
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(5,5))
frequ.plot(kind='bar')
GENDER
Gender of the customer.
feq=direct_mail['GENDER'].value_counts()
feq
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(5,5))
feq.plot(kind='bar')
EMP_STA : Employment status
frequ=direct_mail['EMP_STA'].value_counts()
frequ
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(5,5))
frequ.plot(kind='bar')
RES_STA : Residential status
Residential status of the customer
feq_tab=direct_mail['RES_STA'].value_counts()
feq_tab
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(5,5))
feq_tab.plot(kind='bar')
DELINQ : Delinquency Status
Delinquency is failure to repay a borrowed sum. This variable shows how many times the customer has been delinquent.
direct_mail['DELINQ'].describe()
feq=direct_mail['DELINQ'].value_counts()
feq
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(5,5))
feq.plot(kind='bar')
NUMTR : Number of active trades
The number of active trades: buying and selling properties within a very short duration.
direct_mail['NUMTR'].describe()
frequ=direct_mail['NUMTR'].value_counts()
frequ
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(5,5))
frequ.plot(kind='bar')
MRTGI : Mortgage Indicator
Indicates whether the customer has mortgaged properties. N: No, Y: Yes, U: Unknown.
frequency=direct_mail['MRTGI'].value_counts()
frequency
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(5,5))
frequency.plot(kind='bar')
MFDU : Multiple Family Dwelling Unit
Indicates whether the customer lives in a multi-home complex, such as an apartment building.
direct_mail['MFDU'].describe()
freq=direct_mail['MFDU'].value_counts()
freq
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(5,5))
freq.plot(kind='bar')
resp : Response
This is our target variable; the Response.
freq=direct_mail['resp'].value_counts()
freq
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(5,5))
freq.plot(kind='bar')
We see that the response is heavily skewed towards 0's. This is unbalanced data; many predictive models will show the same kind of skewness in their predictions.
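To quantify the imbalance, value_counts with normalize=True gives the class proportions directly:
# proportion of non-responders (0) vs responders (1)
direct_mail['resp'].value_counts(normalize=True)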
msn : Medical Safety Net Program
This variable shows whether the customer is enrolled in this particular medical program: a medical backup insurance program, generally offered by government health departments to low-income families.
freq_tab=direct_mail['msn'].value_counts()
freq_tab
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(5,5))
freq_tab.plot(kind='bar')
cuscode : Customer Identification Code
direct_mail['cuscode'].describe()
direct_mail.boxplot(column="cuscode")
Below are some dummy variables, derived from some of the variables above; they are already included in the dataset.
female
This variable is derived from the variable 'GENDER'.
table=direct_mail['female'].value_counts()
table
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(5,5))
table.plot(kind='bar')
Four variables: 'HOME', 'CONDO', 'COOP' and 'renter' are binary (0/1) dummy variables derived from the parent variable 'RESTYPE', which was a multiclass variable.
HOME : Home Indicator
tab=direct_mail['HOME'].value_counts()
tab
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(5,5))
tab.plot(kind='bar')
CONDO : Condominium Indicator
tab_1=direct_mail['CONDO'].value_counts()
tab_1
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(5,5))
tab_1.plot(kind='bar')
COOP : Co-Op Residence Indicator
tab_2=direct_mail['COOP'].value_counts()
tab_2
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(5,5))
tab_2.plot(kind='bar')
renter : Rental Home Indicator
direct_mail['renter'].describe()
tab_3=direct_mail['renter'].value_counts()
tab_3
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(3,3))
tab_3.plot(kind='bar')
emp1 and emp2 are derived from the variable EMP_STA (employment status).
emp1 : Employee1
This variable indicates whether the customer has had 1-2 jobs, as opposed to being unemployed or having 3 or more jobs.
tab_4=direct_mail['emp1'].value_counts()
tab_4
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(4,3))
tab_4.plot(kind='bar')
emp2 : Employee2
This variable indicates whether the customer has had 3 or more jobs.
tab_5=direct_mail['emp2'].value_counts()
tab_5
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(4,3))
tab_5.plot(kind='bar')
Summary of Univariate Analysis
Let's tabulate what we found in the univariate analysis.
| Variable | Missing Values | Outliers | Remarks |
|----------|----------------|----------|---------|
| AGE | Nil | Nil | |
| CRED | Present (<1%) | Present (<1%) | |
| MS | Nil | Nil | |
| HEQ | Nil | Present (<1%) | |
| INCOME | Nil | Nil | |
| DEPC | Nil | Nil | |
| MOB | Nil | Nil | |
| MILEAGE | Nil | Present (<1%) | |
| RESTYPE | Nil | Nil | |
| GENDER | Nil | Nil | |
| EMP_STA | Nil | Nil | |
| DELINQ | Nil | Nil | |
| NUMTR | Nil | Nil | |
| MRTGI | Nil | Nil | |
| MFDU | Nil | Nil | |
| resp | Nil | Nil | |
| female | Nil | Nil | Dummy Variable |
| HOME | Nil | Nil | Dummy Variable |
| CONDO | Nil | Nil | Dummy Variable |
| COOP | Nil | Nil | Dummy Variable |
| renter | Nil | Nil | Dummy Variable |
| emp1 | Nil | Nil | Dummy Variable |
| emp2 | Nil | Nil | Dummy Variable |
| msn | Nil | Nil | |
| cuscode | Nil | Nil | ID number |
Since sklearn requires all inputs to be numeric, we should convert all our categorical variables into numeric ones by encoding the categories. This can be done using the following code:
from sklearn.preprocessing import LabelEncoder
var_mod = ['MS','DEPC','MOB','RESTYPE','GENDER','EMP_STA','RES_STA','MRTGI']
le = LabelEncoder()
for i in var_mod:
    direct_mail[i] = le.fit_transform(direct_mail[i])
direct_mail.dtypes
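One caveat: LabelEncoder imposes an arbitrary numeric order on the categories, which a linear model may read as meaningful. For nominal variables such as RESTYPE, one-hot encoding with pd.get_dummies is a common alternative; a sketch, not used in the rest of this lesson (the dataset in fact already ships HOME, CONDO, COOP and renter built this way from RESTYPE):
# one-hot encode a nominal variable instead of label-encoding it
restype_dummies = pd.get_dummies(direct_mail['RESTYPE'], prefix='RESTYPE')
restype_dummies.head()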
Splitting the data into training and testing sets
from sklearn.model_selection import train_test_split
features = ["AGE", "CRED_new", "MS", "HEQ", "INCOME", "DEPC", "MOB", "MILEAGE",
            "RESTYPE", "GENDER", "EMP_STA", "RES_STA", "DELINQ", "NUMTR", "MRTGI",
            "MFDU", "female", "HOME", "CONDO", "COOP", "renter", "emp1", "emp2",
            "msn", "cuscode"]  # note: cuscode is an ID, so treating it as a predictor is questionable
X = direct_mail[features]
y = direct_mail['resp']
X_train, X_test, Y_train, Y_test = train_test_split(X, y, train_size=0.8)
Y_test.shape, X_test.shape,X_train.shape,Y_train.shape
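Since the response is unbalanced, it is usually safer to stratify the split so both sets keep the same 0/1 proportion; a sketch of the same split with stratification and a fixed seed for reproducibility:
# stratified split preserves the class ratio in train and test sets
X_train, X_test, Y_train, Y_test = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=42)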
Model Building
The response variable 'resp' is binary (Yes/No, 0/1), hence we use logistic regression.
Logistic regression with all the features
from sklearn.linear_model import LogisticRegression
logistic1= LogisticRegression()
logistic1.fit(X_train,Y_train)
predict1=logistic1.predict(X_test)
from sklearn.metrics import confusion_matrix
cm1 = confusion_matrix(Y_test,predict1)
print(cm1)
total1=sum(sum(cm1))
accuracy1=(cm1[0,0]+cm1[1,1])/total1
accuracy1
# with confusion_matrix(y_true, y_pred), row 0 holds the actual 0's and row 1 the actual 1's,
# so cm[1,1]/(cm[1,1]+cm[1,0]) is TP/(TP+FN): the sensitivity (recall on responders)
sensitivity1=cm1[1,1]/(cm1[1,1]+cm1[1,0])
sensitivity1
# and cm[0,0]/(cm[0,0]+cm[0,1]) is TN/(TN+FP): the specificity
specificity1=cm1[0,0]/(cm1[0,0]+cm1[0,1])
specificity1
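As a cross-check, sklearn's classification_report computes precision, recall (sensitivity) and F1 for both classes in one call:
from sklearn.metrics import classification_report
print(classification_report(Y_test, predict1))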
Remove outliers
Four continuous variables show signs of outliers: CRED_new, HEQ, INCOME and MILEAGE. We remove the outliers one by one.
First, create a copy of the dataset and put the changed variables there, keeping the original dataset intact.
direct_mail1=direct_mail.copy()
direct_mail1.shape
CRED_new
The value 1789 is considered an outlier and is replaced with the mean value 603.6.
direct_mail1['CRED_new1']=direct_mail1['CRED_new']
# use .loc to avoid chained-assignment pitfalls
direct_mail1.loc[direct_mail1['CRED_new1']==1789,'CRED_new1']=603.6
direct_mail1['CRED_new1'].describe()
direct_mail1.boxplot(column="CRED_new1")
direct_mail1['CRED_new1'].quantile([0.1, .25,.50,.75,0.8, 0.85, .90,0.95, .99,1])
HEQ
Values above 90 can be considered outliers, based on our observations during the univariate analysis, and are replaced with the mean value 38.33.
direct_mail1['HEQ_new']=direct_mail1['HEQ']
direct_mail1.loc[direct_mail1['HEQ_new']>=90,'HEQ_new']=38.33
direct_mail1['HEQ_new'].describe()
direct_mail1.boxplot(column="HEQ_new")
direct_mail1['HEQ_new'].quantile([0.1, .25,.50,.75,0.8, 0.85, .90,0.95, .99,1])
Income
Values above 100 can be considered outliers, based on our observations during the univariate analysis, and are replaced with the mean value 41.36.
direct_mail1['INCOME_new']=direct_mail1['INCOME']
direct_mail1.loc[direct_mail1['INCOME_new']>=100,'INCOME_new']=41.36
direct_mail1['INCOME_new'].describe()
direct_mail1.boxplot(column="INCOME_new")
direct_mail1['INCOME_new'].quantile([0.1, .25,.50,.75,0.8, 0.85, .90,0.95, .99,1])
Mileage
Values above 25 can be considered outliers, based on our observations during the univariate analysis, and are replaced with the mean value 11.8.
direct_mail1['MILEAGE_new']=direct_mail1['MILEAGE']
direct_mail1.loc[direct_mail1['MILEAGE_new']>=25,'MILEAGE_new']=11.8
direct_mail1['MILEAGE_new'].describe()
direct_mail1.boxplot(column="MILEAGE_new")
direct_mail1['MILEAGE_new'].quantile([0.1, .25,.50,.75,0.8, 0.85, .90,0.95, .99,1])
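Replacing outliers with the mean is one option. Another common treatment is capping (winsorizing) at a high percentile, which preserves the ordering of the values; a sketch for MILEAGE, assuming a cap at the 99th percentile:
# cap MILEAGE at its 99th percentile instead of replacing with the mean
cap = direct_mail1['MILEAGE'].quantile(0.99)
direct_mail1['MILEAGE_cap'] = direct_mail1['MILEAGE'].clip(upper=cap)
direct_mail1['MILEAGE_cap'].describe()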
Rebuild the model after outlier removal
We again build a logistic model to see whether the outlier removal gives any improvement. But first, divide the dataset direct_mail1 into training and testing sets.
from sklearn.model_selection import train_test_split
features = ["AGE", "CRED_new1", "MS", "HEQ_new", "INCOME_new", "DEPC", "MOB", "MILEAGE_new",
            "RESTYPE", "GENDER", "EMP_STA", "RES_STA", "DELINQ", "NUMTR", "MRTGI",
            "MFDU", "female", "HOME", "CONDO", "COOP", "renter", "emp1", "emp2",
            "msn", "cuscode"]
X1 = direct_mail1[features]
y1 = direct_mail1['resp']
X1_train, X1_test, Y1_train, Y1_test = train_test_split(X1, y1, train_size=0.8)
Y1_test.shape, X1_test.shape,X1_train.shape,Y1_train.shape
from sklearn.linear_model import LogisticRegression
logistic2= LogisticRegression()
logistic2.fit(X1_train,Y1_train)
predict2=logistic2.predict(X1_test)
from sklearn.metrics import confusion_matrix
cm2 = confusion_matrix(Y1_test,predict2)
print(cm2)
total2=sum(sum(cm2))
accuracy2=(cm2[0,0]+cm2[1,1])/total2
accuracy2
sensitivity2=cm2[1,1]/(cm2[1,1]+cm2[1,0])
sensitivity2
specificity2=cm2[0,0]/(cm2[0,0]+cm2[0,1])
specificity2
ROC and AUC
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
actual = Y1_test
# roc_curve expects scores or probabilities rather than hard 0/1 predictions,
# so use the predicted probability of the positive class
predicted_prob2 = logistic2.predict_proba(X1_test)[:, 1]
false_positive_rate, true_positive_rate, thresholds = roc_curve(actual, predicted_prob2)
plt.title('Receiver Operating Characteristic')
roc_auc = auc(false_positive_rate, true_positive_rate)
plt.plot(false_positive_rate, true_positive_rate,label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
roc_auc
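The same AUC can be obtained in a single call with roc_auc_score:
from sklearn.metrics import roc_auc_score
roc_auc_score(actual, predicted_prob2)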
Building a Decision Tree
import pandas as pd
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X1_train,Y1_train)
clf
predict3 = clf.predict(X1_test)
from sklearn.metrics import confusion_matrix
cm3=confusion_matrix(Y1_test, predict3)
print (cm3)
total3 = sum(sum(cm3))
accuracy3 = (cm3[0,0]+cm3[1,1])/total3
accuracy3
sensitivity3=cm3[1,1]/(cm3[1,1]+cm3[1,0])
sensitivity3
specificity3=cm3[0,0]/(cm3[0,0]+cm3[0,1])
specificity3
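An unconstrained tree can memorize the training data. Limiting depth and leaf size is a common guard against overfitting; a sketch with assumed values max_depth=5 and min_samples_leaf=20:
# a shallower, pruned tree usually generalizes better
clf_pruned = tree.DecisionTreeClassifier(max_depth=5, min_samples_leaf=20)
clf_pruned.fit(X1_train, Y1_train)
clf_pruned.score(X1_test, Y1_test)  # mean test-set accuracy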
SVM
from sklearn import svm
svm_model = svm.SVC()  # keep a separate name so the svm module is not shadowed
svm_model.fit(X1_train,Y1_train) #training the model
svm_model
# note: these predictions are made on the training set
Predicted_s=svm_model.predict(X1_train)
#confusion matrix
from sklearn.metrics import confusion_matrix as cm
ConfusionMatrix = cm(Y1_train,Predicted_s)
print(ConfusionMatrix)
#accuracy
accuracy=np.trace(ConfusionMatrix)/sum(sum(ConfusionMatrix))
print(accuracy)
sensitivity=ConfusionMatrix[1,1]/(ConfusionMatrix[1,1]+ConfusionMatrix[1,0])
sensitivity
specificity=ConfusionMatrix[0,0]/(ConfusionMatrix[0,0]+ConfusionMatrix[0,1])
specificity
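Keep in mind that the metrics above were computed on the training data, so they are not directly comparable with the test-set metrics of the other models; a sketch of the same evaluation on the held-out test set:
# evaluate the SVM on the test set for a fair comparison
Predicted_s_test = svm_model.predict(X1_test)
print(cm(Y1_test, Predicted_s_test))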
RandomForest
from sklearn.ensemble import RandomForestClassifier
forest=RandomForestClassifier(n_estimators=100)  # all other parameters are left at their defaults
forest.fit(X1_train,Y1_train)
Predicted=forest.predict(X1_test)
from sklearn.metrics import confusion_matrix as cm
ConfusionMatrix = cm(Y1_test,Predicted)
print(ConfusionMatrix)
total = sum(sum(ConfusionMatrix))
accuracy = (ConfusionMatrix[0,0]+ConfusionMatrix[1,1])/total
accuracy
sensitivity=ConfusionMatrix[1,1]/(ConfusionMatrix[1,1]+ConfusionMatrix[1,0])
sensitivity
specificity=ConfusionMatrix[0,0]/(ConfusionMatrix[0,0]+ConfusionMatrix[0,1])
specificity
Conclusion
Logistic regression: accuracy is 90%, sensitivity is 0% and specificity is 100%.
SVM: accuracy is 99%, sensitivity is 99% and specificity is 100%.
Random forest: accuracy is 90%, sensitivity is 0% and specificity is 100%.
The SVM model appears to do a significantly better job on sensitivity. Note, however, that its metrics were computed on the training set (see the sketch in the SVM section), so the comparison with the test-set numbers of the other models is not entirely fair.
Unbalanced dataset handling
Here the dataset is unbalanced, i.e. the response is heavily skewed towards 0's, and the models we built have good accuracy but poor sensitivity. This kind of unbalanced data needs to be dealt with carefully.
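One simple remedy, sketched here rather than treated in full, is to re-weight the classes so that mistakes on the rare responders cost more; sklearn's class_weight='balanced' option does this for logistic regression (resampling, i.e. over- or under-sampling, is another common route):
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
# inverse-frequency class weights make the model pay more attention to the rare class
logistic_bal = LogisticRegression(class_weight='balanced')
logistic_bal.fit(X1_train, Y1_train)
predict_bal = logistic_bal.predict(X1_test)
# sensitivity should improve, usually at the cost of some overall accuracy
print(confusion_matrix(Y1_test, predict_bal))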