Before we start the lesson, please download the datasets.
Problem Statement
An insurance company needs to come up with a good marketing strategy. They want to run an e-mail campaign. Before sending a mail to all the available e-mail addresses, they want to first build a predictive model and identify the customers who are most likely to respond.
Analyze the historical data and build a predictive model that helps us maximize the response rate. The company will then be able to manage its cost by sending e-mail only to those who are most likely to respond.
Data importing
import pandas as pd
direct_mail=pd.read_csv("C:\\Users\\Personal\\Google Drive\\DirectMail.csv")
direct_mail.shape
direct_mail.columns.values
Summary of the dataset
Summary and structure let us take a preliminary look at the data, to understand what kind of data we are dealing with.
dtypes : shows the datatype of each variable; in pandas this is int64, float64, or object. Together with shape above, this tells us the number of observations and the number of variables.
describe() : gives more detail about each numeric variable: the count, mean, standard deviation, min and max values, and the 1st, 2nd (median) and 3rd quartiles. Since the count excludes missing values, it also reveals columns that contain NA's.
direct_mail.dtypes
Data Exploration
import pandas as pd
import sklearn as sk
import math
import numpy as np
from scipy import stats
import matplotlib as matlab
import statsmodels
direct_mail=pd.read_csv("C:\\Users\\Personal\\Google Drive\\DirectMail.csv")
direct_mail.shape
direct_mail.columns.values
direct_mail.head(10)
direct_mail.describe()
Checking missing values
To check whether any NA values are present in the dataset, and to count them per column:
direct_mail.isnull().sum()
We can see that all 33 NA values are in the CRED column; this column needs to be taken care of while cleaning the data.
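To see the missing counts as a share of all rows, a quick variant of the same check:
# fraction of missing values per column, expressed as a percentage
(direct_mail.isnull().mean() * 100).round(2)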
Univariate analysis
AGE
Age of the customer
direct_mail['AGE'].describe()
import matplotlib.pyplot as plt
%matplotlib inline
direct_mail.boxplot(column="AGE")
This variable is pretty clean. Mean and median are very close, and the min and max values are in a plausible range. The boxplot shows a healthy distribution of the data.
CRED : Credit score
Credit score is the creditworthiness of the person
direct_mail['CRED'].describe()
direct_mail['CRED_new']=direct_mail['CRED']
#to display all the rows which have missing values in the 'CRED_new' column
#(.loc is used here; the older .ix indexer has been removed from pandas):
direct_mail.loc[direct_mail['CRED_new'].isnull()]
#to get the row index (axis=0) of the rows which have missing values in this column
direct_mail.loc[direct_mail['CRED_new'].isnull()].index
#Once we have identified where the missing values are, the next task is usually to fill them (data imputation).
#Depending on the context, different strategies apply; in this case, I am assigning the mean value (603.6) to all positions where a missing value is present:
direct_mail.loc[direct_mail['CRED_new'].isnull(),'CRED_new']=603.6
sum(direct_mail['CRED_new'].isnull())
#and as the output suggests, this column doesn't have any missing values now
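Note that instead of hard-coding 603.6, the mean can be computed and filled in one step with fillna; a minimal sketch of the same imputation:
# compute the column mean (NaN values are skipped by default) and impute in one step
cred_mean = direct_mail['CRED'].mean()
direct_mail['CRED_new'] = direct_mail['CRED'].fillna(cred_mean)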
direct_mail['CRED_new'].describe()
direct_mail.boxplot(column="CRED_new")
direct_mail['CRED_new'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.99,1])
From the boxplot we can see there is one value, 1789.0, that stands outside the distribution; it is also the max value in the column and should be considered an outlier.
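Eyeballing the boxplot works here, but outliers can also be flagged programmatically with the common rule of 1.5 times the IQR; a sketch on the same column:
# flag values beyond 1.5 times the interquartile range from the quartiles
q1 = direct_mail['CRED_new'].quantile(0.25)
q3 = direct_mail['CRED_new'].quantile(0.75)
iqr = q3 - q1
outliers = direct_mail[(direct_mail['CRED_new'] < q1 - 1.5 * iqr) |
                       (direct_mail['CRED_new'] > q3 + 1.5 * iqr)]
outliers['CRED_new']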
MS : Marital Status
This variable gives the marital status of the customer, M: Married, U:Unmarried, X:Other/unknown
frequency=direct_mail['MS'].value_counts()
frequency
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(5,5))
frequency.plot(kind='bar')
HEQ : Home Equity
Home equity is the value of ownership built up in a home or property that represents the current market value of the house, minus any remaining mortgage payments.
direct_mail['HEQ'].describe()
direct_mail.boxplot(column='HEQ')
direct_mail['HEQ'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.99,1])
direct_mail['HEQ'].value_counts()
There aren’t any missing values. However, the boxplot and percentile distribution show some values far from the rest of the distribution. The max value, 200, seems to be an outlier.
INCOME : Income of the customer
Income of the customer
direct_mail['INCOME'].describe()
direct_mail.boxplot(column="INCOME")
direct_mail['INCOME'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.96,0.97,0.98,0.99,1])
direct_mail['INCOME'].value_counts()
We can clearly see in the boxplot that a group of data points sits outside the plot. This group represents the value 110, which is 1.6% of the total data. However, this value is not that distinct from the rest of the values.
DEPC : Depreciation
Indicates whether there is any reduction in the value of an asset.
frequ_tab=direct_mail['DEPC'].value_counts()
frequ_tab
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(5,5))
frequ_tab.plot(kind='bar')
MOB : Existing Customer
This variable shows whether the customer is existing or new.
feq=direct_mail['MOB'].value_counts()
feq
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(5,5))
feq.plot(kind='bar')
MILEAGE
Mileage of the customer vehicle
direct_mail['MILEAGE'].describe()
direct_mail.boxplot(column="MILEAGE")
direct_mail['MILEAGE'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.96,0.97,0.98,0.99,1])
Mean and median are close, but the max value is quite skewed. The boxplot shows that values beyond about 20 are dispersed. The percentile distribution shows that 99% of the values are evenly distributed, but the remaining 1% goes much higher. This 1% can be counted as outliers.
RESTYPE : Real Estate Type
This variable shows the type of house the customer is living in.
frequ=direct_mail['RESTYPE'].value_counts()
frequ
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(5,5))
frequ.plot(kind='bar')
GENDER
Gender of the customer.
feq=direct_mail['GENDER'].value_counts()
feq
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(5,5))
feq.plot(kind='bar')
EMP_STA : Employment status
frequ=direct_mail['EMP_STA'].value_counts()
frequ
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(5,5))
frequ.plot(kind='bar')
RES_STA : Residential status
Residential status of the customer
feq_tab=direct_mail['RES_STA'].value_counts()
feq_tab
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(5,5))
feq_tab.plot(kind='bar')
DELINQ : Delinquency Status
Delinquency is failure to repay a borrowed sum. This variable shows how many times the customer has been delinquent.
direct_mail['DELINQ'].describe()
feq=direct_mail['DELINQ'].value_counts()
feq
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(5,5))
feq.plot(kind='bar')
NUMTR : Number of active trades
The number of active trades: buying and selling properties within a very short duration.
direct_mail['NUMTR'].describe()
frequ=direct_mail['NUMTR'].value_counts()
frequ
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(5,5))
frequ.plot(kind='bar')
MRTGI : Mortgage Indicator
Indicates whether the customer has mortgaged properties. N: No, Y: Yes, U: Unknown.
frequency=direct_mail['MRTGI'].value_counts()
frequency
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(5,5))
frequency.plot(kind='bar')
MFDU : Multiple Family Dwelling Unit
Indicates whether the customer lives in a multi-home complex, such as an apartment building.
direct_mail['MFDU'].describe()
freq=direct_mail['MFDU'].value_counts()
freq
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(5,5))
freq.plot(kind='bar')
resp : Response
This is our target variable; the Response.
freq=direct_mail['resp'].value_counts()
freq
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(5,5))
freq.plot(kind='bar')
We see that the response is heavily skewed towards 0's. This is unbalanced data; many predictive models will show the same kind of skewness in their predictions.
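To quantify the imbalance, value_counts with normalize=True gives the class proportions directly:
# proportion of non-responders (0) vs responders (1)
direct_mail['resp'].value_counts(normalize=True)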
msn : Medical Safety Net Program
This variable shows whether the customer is enrolled in this particular medical program: a medical backup insurance program, generally offered by government health departments to low-income families.
freq_tab=direct_mail['msn'].value_counts()
freq_tab
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(5,5))
freq_tab.plot(kind='bar')
cuscode : Customer Identification Code
direct_mail['cuscode'].describe()
direct_mail.boxplot(column="cuscode")
Below are some dummy variables, derived from some of the variables above; they are already included in the dataset.
female
This variable is derived from the variable 'GENDER'.
table=direct_mail['female'].value_counts()
table
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(5,5))
table.plot(kind='bar')
Four variables: 'HOME', 'CONDO', 'COOP' and 'renter' are binary (0/1) dummy variables derived from the parent variable 'RESTYPE', which was a multiclass variable.
HOME : Home Indicator
tab=direct_mail['HOME'].value_counts()
tab
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(5,5))
tab.plot(kind='bar')
CONDO : Condominium Indicator
tab_1=direct_mail['CONDO'].value_counts()
tab_1
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(5,5))
tab_1.plot(kind='bar')
COOP : Co-Op Residence Indicator
tab_2=direct_mail['COOP'].value_counts()
tab_2
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(5,5))
tab_2.plot(kind='bar')
renter : Rental Home Indicator
direct_mail['renter'].describe()
tab_3=direct_mail['renter'].value_counts()
tab_3
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(3,3))
tab_3.plot(kind='bar')
emp1 and emp2 are derived from the variable EMP_STA (employment status).
emp1 : Employee1
This variable indicates whether the customer has had 1-2 jobs, as opposed to being unemployed or having 3 or more jobs.
tab_4=direct_mail['emp1'].value_counts()
tab_4
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(4,3))
tab_4.plot(kind='bar')
emp2 : Employee2
This variable indicates whether the customer has had 3 or more jobs.
tab_5=direct_mail['emp2'].value_counts()
tab_5
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(4,3))
tab_5.plot(kind='bar')
Summary of Univariate Analysis
Let's tabulate what we found in the univariate analysis.
| Variable | Missing Values | Outliers | Remarks |
|----------|----------------|----------|---------|
| AGE | Nil | Nil | |
| CRED | Present (<1%) | Present (<1%) | |
| MS | Nil | Nil | |
| HEQ | Nil | Present (<1%) | |
| INCOME | Nil | Nil | |
| DEPC | Nil | Nil | |
| MOB | Nil | Nil | |
| MILEAGE | Nil | Present (<1%) | |
| RESTYPE | Nil | Nil | |
| GENDER | Nil | Nil | |
| EMP_STA | Nil | Nil | |
| DELINQ | Nil | Nil | |
| NUMTR | Nil | Nil | |
| MRTGI | Nil | Nil | |
| MFDU | Nil | Nil | |
| resp | Nil | Nil | |
| female | Nil | Nil | Dummy Variable |
| HOME | Nil | Nil | Dummy Variable |
| CONDO | Nil | Nil | Dummy Variable |
| COOP | Nil | Nil | Dummy Variable |
| renter | Nil | Nil | Dummy Variable |
| emp1 | Nil | Nil | Dummy Variable |
| emp2 | Nil | Nil | Dummy Variable |
| msn | Nil | Nil | |
| cuscode | Nil | Nil | ID number |
Since sklearn requires all inputs to be numeric, we should convert all our categorical variables into numeric ones by encoding the categories. This can be done using the following code:
from sklearn.preprocessing import LabelEncoder
var_mod = ['MS','DEPC','MOB','RESTYPE','GENDER','EMP_STA','RES_STA','MRTGI']
le = LabelEncoder()
for i in var_mod:
    direct_mail[i] = le.fit_transform(direct_mail[i])
direct_mail.dtypes
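One caveat: LabelEncoder imposes an arbitrary numeric order on the categories, which a linear model may read as meaningful. For nominal variables such as RESTYPE, one-hot encoding with pd.get_dummies is a common alternative; a sketch, not used in the rest of this lesson (the dataset in fact already ships HOME, CONDO, COOP and renter built this way from RESTYPE):
# one-hot encode a nominal variable instead of label-encoding it
restype_dummies = pd.get_dummies(direct_mail['RESTYPE'], prefix='RESTYPE')
restype_dummies.head()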
Splitting the data into training and testing sets
from sklearn.model_selection import train_test_split
features = ["AGE", "CRED_new", "MS", "HEQ", "INCOME", "DEPC", "MOB", "MILEAGE",
            "RESTYPE", "GENDER", "EMP_STA", "RES_STA", "DELINQ", "NUMTR", "MRTGI",
            "MFDU", "female", "HOME", "CONDO", "COOP", "renter", "emp1", "emp2",
            "msn", "cuscode"]  # note: cuscode is an ID, so treating it as a predictor is questionable
X = direct_mail[features]
y = direct_mail['resp']
X_train, X_test, Y_train, Y_test = train_test_split(X, y, train_size=0.8)
Y_test.shape, X_test.shape,X_train.shape,Y_train.shape
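Since the response is unbalanced, it is usually safer to stratify the split so both sets keep the same 0/1 proportion; a sketch of the same split with stratification and a fixed seed for reproducibility:
# stratified split preserves the class ratio in train and test sets
X_train, X_test, Y_train, Y_test = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=42)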
Model Building
The response variable 'resp' is binary (Yes/No, 0/1), hence we use logistic regression.
Logistic regression with all the features
from sklearn.linear_model import LogisticRegression
logistic1= LogisticRegression()
logistic1.fit(X_train,Y_train)
predict1=logistic1.predict(X_test)
from sklearn.metrics import confusion_matrix
cm1 = confusion_matrix(Y_test,predict1)
print(cm1)
total1=sum(sum(cm1))
accuracy1=(cm1[0,0]+cm1[1,1])/total1
accuracy1
# with confusion_matrix(y_true, y_pred), row 0 holds the actual 0's and row 1 the actual 1's,
# so cm[1,1]/(cm[1,1]+cm[1,0]) is TP/(TP+FN): the sensitivity (recall on responders)
sensitivity1=cm1[1,1]/(cm1[1,1]+cm1[1,0])
sensitivity1
# and cm[0,0]/(cm[0,0]+cm[0,1]) is TN/(TN+FP): the specificity
specificity1=cm1[0,0]/(cm1[0,0]+cm1[0,1])
specificity1
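As a cross-check, sklearn's classification_report computes precision, recall (sensitivity) and F1 for both classes in one call:
from sklearn.metrics import classification_report
print(classification_report(Y_test, predict1))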
Remove outliers
Four continuous variables show signs of outliers: CRED_new, HEQ, INCOME and MILEAGE. We remove the outliers one by one.
First, create a copy of the dataset and put the changed variables there, keeping the original dataset intact.
direct_mail1=direct_mail.copy()
direct_mail1.shape
CRED_new
The value 1789 is considered an outlier and is replaced with the mean value 603.6.
direct_mail1['CRED_new1']=direct_mail1['CRED_new']
# use .loc to avoid chained-assignment pitfalls
direct_mail1.loc[direct_mail1['CRED_new1']==1789,'CRED_new1']=603.6
direct_mail1['CRED_new1'].describe()
direct_mail1.boxplot(column="CRED_new1")
direct_mail1['CRED_new1'].quantile([0.1, .25,.50,.75,0.8, 0.85, .90,0.95, .99,1])
HEQ
Values above 90 can be considered outliers, based on our observations during the univariate analysis, and are replaced with the mean value 38.33.
direct_mail1['HEQ_new']=direct_mail1['HEQ']
direct_mail1.loc[direct_mail1['HEQ_new']>=90,'HEQ_new']=38.33
direct_mail1['HEQ_new'].describe()
direct_mail1.boxplot(column="HEQ_new")
direct_mail1['HEQ_new'].quantile([0.1, .25,.50,.75,0.8, 0.85, .90,0.95, .99,1])
Income
Values above 100 can be considered outliers, based on our observations during the univariate analysis, and are replaced with the mean value 41.36.
direct_mail1['INCOME_new']=direct_mail1['INCOME']
direct_mail1.loc[direct_mail1['INCOME_new']>=100,'INCOME_new']=41.36
direct_mail1['INCOME_new'].describe()
direct_mail1.boxplot(column="INCOME_new")
direct_mail1['INCOME_new'].quantile([0.1, .25,.50,.75,0.8, 0.85, .90,0.95, .99,1])
Mileage
Values above 25 can be considered outliers, based on our observations during the univariate analysis, and are replaced with the mean value 11.8.
direct_mail1['MILEAGE_new']=direct_mail1['MILEAGE']
direct_mail1.loc[direct_mail1['MILEAGE_new']>=25,'MILEAGE_new']=11.8
direct_mail1['MILEAGE_new'].describe()
direct_mail1.boxplot(column="MILEAGE_new")
direct_mail1['MILEAGE_new'].quantile([0.1, .25,.50,.75,0.8, 0.85, .90,0.95, .99,1])
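Replacing outliers with the mean is one option. Another common treatment is capping (winsorizing) at a high percentile, which preserves the ordering of the values; a sketch for MILEAGE, assuming a cap at the 99th percentile:
# cap MILEAGE at its 99th percentile instead of replacing with the mean
cap = direct_mail1['MILEAGE'].quantile(0.99)
direct_mail1['MILEAGE_cap'] = direct_mail1['MILEAGE'].clip(upper=cap)
direct_mail1['MILEAGE_cap'].describe()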
Rebuild the model after outlier removal
We again build a logistic model to see whether the outlier removal gives any improvement. But first, divide the dataset direct_mail1 into training and testing sets.
from sklearn.model_selection import train_test_split
features = ["AGE", "CRED_new1", "MS", "HEQ_new", "INCOME_new", "DEPC", "MOB", "MILEAGE_new",
            "RESTYPE", "GENDER", "EMP_STA", "RES_STA", "DELINQ", "NUMTR", "MRTGI",
            "MFDU", "female", "HOME", "CONDO", "COOP", "renter", "emp1", "emp2",
            "msn", "cuscode"]
X1 = direct_mail1[features]
y1 = direct_mail1['resp']
X1_train, X1_test, Y1_train, Y1_test = train_test_split(X1, y1, train_size=0.8)
Y1_test.shape, X1_test.shape,X1_train.shape,Y1_train.shape
from sklearn.linear_model import LogisticRegression
logistic2= LogisticRegression()
logistic2.fit(X1_train,Y1_train)
predict2=logistic2.predict(X1_test)
from sklearn.metrics import confusion_matrix
cm2 = confusion_matrix(Y1_test,predict2)
print(cm2)
total2=sum(sum(cm2))
accuracy2=(cm2[0,0]+cm2[1,1])/total2
accuracy2
sensitivity2=cm2[1,1]/(cm2[1,1]+cm2[1,0])
sensitivity2
specificity2=cm2[0,0]/(cm2[0,0]+cm2[0,1])
specificity2
ROC and AUC
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
actual = Y1_test
# roc_curve expects scores or probabilities rather than hard 0/1 predictions,
# so use the predicted probability of the positive class
predicted_prob2 = logistic2.predict_proba(X1_test)[:, 1]
false_positive_rate, true_positive_rate, thresholds = roc_curve(actual, predicted_prob2)
plt.title('Receiver Operating Characteristic')
roc_auc = auc(false_positive_rate, true_positive_rate)
plt.plot(false_positive_rate, true_positive_rate,label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
roc_auc
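The same AUC can be obtained in a single call with roc_auc_score:
from sklearn.metrics import roc_auc_score
roc_auc_score(actual, predicted_prob2)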
Building a Decision Tree
import pandas as pd
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X1_train,Y1_train)
clf
predict3 = clf.predict(X1_test)
from sklearn.metrics import confusion_matrix
cm3=confusion_matrix(Y1_test, predict3)
print (cm3)
total3 = sum(sum(cm3))
accuracy3 = (cm3[0,0]+cm3[1,1])/total3
accuracy3
sensitivity3=cm3[1,1]/(cm3[1,1]+cm3[1,0])
sensitivity3
specificity3=cm3[0,0]/(cm3[0,0]+cm3[0,1])
specificity3
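An unconstrained tree can memorize the training data. Limiting depth and leaf size is a common guard against overfitting; a sketch with assumed values max_depth=5 and min_samples_leaf=20:
# a shallower, pruned tree usually generalizes better
clf_pruned = tree.DecisionTreeClassifier(max_depth=5, min_samples_leaf=20)
clf_pruned.fit(X1_train, Y1_train)
clf_pruned.score(X1_test, Y1_test)  # mean test-set accuracy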
SVM
from sklearn import svm
svm_model = svm.SVC()  # keep a separate name so the svm module is not shadowed
svm_model.fit(X1_train,Y1_train) #training the model
svm_model
# note: these predictions are made on the training set
Predicted_s=svm_model.predict(X1_train)
#confusion matrix
from sklearn.metrics import confusion_matrix as cm
ConfusionMatrix = cm(Y1_train,Predicted_s)
print(ConfusionMatrix)
#accuracy
accuracy=np.trace(ConfusionMatrix)/sum(sum(ConfusionMatrix))
print(accuracy)
sensitivity=ConfusionMatrix[1,1]/(ConfusionMatrix[1,1]+ConfusionMatrix[1,0])
sensitivity
specificity=ConfusionMatrix[0,0]/(ConfusionMatrix[0,0]+ConfusionMatrix[0,1])
specificity
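Keep in mind that the metrics above were computed on the training data, so they are not directly comparable with the test-set metrics of the other models; a sketch of the same evaluation on the held-out test set:
# evaluate the SVM on the test set for a fair comparison
Predicted_s_test = svm_model.predict(X1_test)
print(cm(Y1_test, Predicted_s_test))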
RandomForest
from sklearn.ensemble import RandomForestClassifier
forest=RandomForestClassifier(n_estimators=100)  # all other parameters are left at their defaults
forest.fit(X1_train,Y1_train)
Predicted=forest.predict(X1_test)
from sklearn.metrics import confusion_matrix as cm
ConfusionMatrix = cm(Y1_test,Predicted)
print(ConfusionMatrix)
total = sum(sum(ConfusionMatrix))
accuracy = (ConfusionMatrix[0,0]+ConfusionMatrix[1,1])/total
accuracy
sensitivity=ConfusionMatrix[1,1]/(ConfusionMatrix[1,1]+ConfusionMatrix[1,0])
sensitivity
specificity=ConfusionMatrix[0,0]/(ConfusionMatrix[0,0]+ConfusionMatrix[0,1])
specificity
Conclusion
Logistic regression: accuracy is 90%, sensitivity is 0% and specificity is 100%.
SVM: accuracy is 99%, sensitivity is 99% and specificity is 100%.
Random forest: accuracy is 90%, sensitivity is 0% and specificity is 100%.
The SVM model appears to do a significantly better job on sensitivity. Note, however, that its metrics were computed on the training set (see the sketch in the SVM section), so the comparison with the test-set numbers of the other models is not entirely fair.
Unbalanced dataset handling
Here the dataset is unbalanced, i.e. the response is heavily skewed towards 0's, and the models we built have good accuracy but poor sensitivity. This kind of unbalanced data needs to be dealt with carefully.
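One simple remedy, sketched here rather than treated in full, is to re-weight the classes so that mistakes on the rare responders cost more; sklearn's class_weight='balanced' option does this for logistic regression (resampling, i.e. over- or under-sampling, is another common route):
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
# inverse-frequency class weights make the model pay more attention to the rare class
logistic_bal = LogisticRegression(class_weight='balanced')
logistic_bal.fit(X1_train, Y1_train)
predict_bal = logistic_bal.predict(X1_test)
# sensitivity should improve, usually at the cost of some overall accuracy
print(confusion_matrix(Y1_test, predict_bal))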