Before we start the lesson, please download the dataset.
Problem statement
Marketing campaigns are crucial for any institution that wants to generate business by promoting its products, and a data-driven strategy can be very helpful in achieving good results. This data relates to the direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls, in which a bank representative pitched a banking product to a potential customer. The classification goal is to predict whether the client will subscribe to the specific product: ‘Yes’ or ‘No’.
Data Exploration
The dataset has 18 variables, including the dependent variable ‘y’, which denotes whether the customer subscribed to the product ‘Term Deposit’. Variable ‘y’ (term deposit) is the dependent variable and the rest are independent variables.
import pandas as pd
bank_market=pd.read_csv("C:\\Users\\Personal\\Google Drive\\bank_market.csv")
bank_market.shape
We have 45211 observations and 18 variables.
bank_market.columns.values
bank_market.head()
Next, you can look at a summary of the numerical fields using the describe() function:
bank_market.describe()
The describe() function reports the count, mean, standard deviation (std), min, quartiles, and max of each numerical column.
The minimum value in the Cust_num column is 1 and the maximum is 45211, which is the number of rows in the data; Cust_num is just a serial number. Its mean and median are equal, as expected for an evenly spaced sequence.
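A quick check of that claim, using the Cust_num column named above:
bank_market['Cust_num'].mean(), bank_market['Cust_num'].median()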
Checking missing values
bank_market.isnull().sum()
Univariate analysis
age
It specifies the age of the customer in years; it's an integer. Let's see the summary of age:
bank_market['age'].describe()
The minimum age is 18 and the maximum is 95, which is plausible. Mean and median are very close, which suggests outliers may not be present.
Let's see the percentile distribution.
bank_market['age'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.96,0.97,0.98,0.99,1])
The percentile distribution shows that only 10% of customers are younger than 28, and around 75% are between the ages of 28 and 60, meaning the campaign targets a more mature customer group.
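We can sanity-check that figure directly; a quick sketch computing the share of customers aged 28 to 60:
((bank_market['age'] >= 28) & (bank_market['age'] <= 60)).mean()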
import matplotlib.pyplot as plt
%matplotlib inline
bank_market.boxplot(column="age")
We can see that a good number of customers are in their early middle age. There doesn't seem to be any sign of outliers in this variable.
duration
Variable ‘duration’ specifies the duration of the last call made to the customer, in seconds. Let's see the summary of duration:
bank_market['duration'].describe()
The min value is 0.0 but the max is 4918.0 seconds, which is around 1 hour 22 minutes. Let's see the percentile distribution.
bank_market['duration'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.99,1])
The percentile distribution shows that 95% of the calls ended within the first 751 seconds, or around 12.5 minutes.
Boxplot of the variable:
bank_market.boxplot(column="duration")
The boxplot shows that there are quite a few outliers in this variable. We can treat the extreme upper tail, values above the 99th percentile, as outliers.
campaign
This variable represents number of contacts made during this campaign and for this client, this includes the last contact. Summary of variable ‘campaign’:
bank_market['campaign'].describe()
The min value is 1 and the max value is 63. Contacting the same customer 63 times seems too high; this might be an outlier. (An outlier doesn't mean the information was falsely entered; it can, however, affect our predictive models.)
Let’s get into percentile distribution:
bank_market['campaign'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.99,1])
We can see that about 60% of customers were contacted no more than twice, and 90% were contacted fewer than 5 times. The last 1% deviates sharply from the rest of the data, so treating it as outliers is the best call.
Boxplot for a better visual understanding:
bank_market.boxplot(column="campaign")
freq=bank_market['campaign'].value_counts()
freq
Across all these views the distribution is concentrated in a very small range, and values too far outside that range can be considered outliers. Here, any value above 16, i.e. above the 99th percentile, will be treated as an outlier.
bank_market['campaign'].hist(bins=50)
The boxplot and histogram show that most customers were called fewer than 5 times, and 99% of the customers were not called more than 16 times.
pdays
This variable represents the number of days that passed since the client was last contacted in a previous campaign. It is numeric, and -1 means the client was not previously contacted.
Summary of pdays:
bank_market['pdays'].describe()
The minimum value is -1 and the maximum is 871 days. Mean and median have a huge gap, which indicates the presence of outliers.
Let’s have a look at percentile distribution:
bank_market['pdays'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.99,1])
We see that about 80% of the values are -1; these customers were contacted for the first time. Let's have a look at the boxplot for good measure:
bank_market.boxplot(column="pdays")
However, if a customer was last contacted over a year ago we can consider the value an outlier, so anything above 365 days qualifies; conveniently, the 99th percentile happens to be 370, very near 365.
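The 80% figure is easy to verify directly:
(bank_market['pdays'] == -1).mean()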
previous
This variable represents the number of contacts performed before this campaign for this client; it's a numeric value.
Summary of ‘previous’:
bank_market['previous'].describe()
The minimum value is 0 and the maximum is 275; the mean is 0.58 but the median is 0.
A look at percentile distribution:
bank_market['previous'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.99,1])
The percentile distribution suggests 99% of the values are at or below 8.9, so the value 275 must be an outlier.
Visualizing the distribution of ‘previous’ for a clear view:
bank_market.boxplot(column="previous")
day
It's the day of the month on which the customer was last contacted. The numeric value must be between 1 and 31.
Let’s have a look at summary to see if there is any outlier or missing value.
bank_market['day'].describe()
The summary shows a min value of 1 and a max of 31, exactly the valid range of days in a month.
Have a look at the boxplot:
bank_market.boxplot(column="day")
The distribution of this variable looks fine.
balance
It's the average yearly balance of the customer, in euros. This could have a large impact on whether the customer subscribes. Let's have a look at the summary:
bank_market['balance'].describe()
The min value is -8019 and the max is 102127; neither is impossible for an account balance.
bank_market['balance'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.99,1])
The percentile distribution shows 99% of the values are under 13164.9; the rest might be outliers.
Have a look at boxplot to understand distribution:
bank_market.boxplot(column="balance")
bank_market['balance'].hist(bins=50)
Despite the long upper tail, we will not treat any balance values as outliers here.
Exploring categorical variables
job
This represents the type of job the customer has. We can see the count of each level of a categorical variable using the value_counts() function:
frequency_table=bank_market['job'].value_counts()
frequency_table
marital
Marital status of the customer. It's a categorical variable: “married”, “divorced”, “single”. Note: “divorced” means divorced or widowed.
bank_market['marital'].value_counts()
education
Education level of the customer. (categorical: “unknown”,“secondary”,“primary”,“tertiary”)
bank_market['education'].value_counts()
default
This variable shows whether the customer has credit in default (binary: “yes”, “no”).
bank_market['default'].value_counts()
housing
Whether the customer has a housing loan.
bank_market['housing'].value_counts()
loan
Whether the customer has a personal loan.
bank_market['loan'].value_counts()
contact
The communication method used to approach the customer: ‘cellular’, ‘telephone’, ‘unknown’.
bank_market['contact'].value_counts()
month
It's the month of the year in which the customer was last contacted.
bank_market['month'].value_counts()
poutcome
Outcome of the previous marketing campaign.
bank_market['poutcome'].value_counts()
y
This is our output (target) variable: has the client subscribed to a term deposit? (binary: ‘yes’, ‘no’)
bank_market['y'].value_counts()
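The normalized counts show the class balance (value_counts with normalize=True gives proportions):
bank_market['y'].value_counts(normalize=True)
Keep the balance in mind: when one class dominates, plain accuracy will look flattering later on.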
Summary of Univariate Analysis on Numerical Variables
Let's tabulate what we found in the univariate analysis:
Variable | Outliers | Remarks
---------|----------|------------------
Cust_num | Nil      |
age      | Nil      |
duration | 1%       | Values above 1269
campaign | 1%       | Values above 16
pdays    | 1%       | Values above 370
previous | 1%       | Values above 8.9
day      | Nil      |
balance  | Nil      |
y        |          | Output variable
Model Building
As our dependent variable ‘y’ is a binary (yes/no) variable, a basic algorithm to start with is logistic regression.
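As a reminder of the mechanics, logistic regression passes a linear combination of the features through the sigmoid function to squash it into a probability between 0 and 1; a minimal standalone illustration (not part of the pipeline below):
import numpy as np

def sigmoid(z):
    # Map a linear score z = b0 + b.x to a probability in (0, 1)
    return 1 / (1 + np.exp(-z))

sigmoid(np.array([-2.0, 0.0, 2.0]))  # -> array([0.119..., 0.5, 0.880...])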
We will need to split the data into training and testing sets. Using the train_test_split() function from sklearn, we will split the dataset in an 80:20 ratio of training to testing data.
Initially we will work with the raw dataset; then we will run a basic cleaning process to treat the outliers and any NA values, and see whether the cleaning improves our results.
bank_market.dtypes
Since sklearn requires all inputs to be numeric, we should convert all our categorical variables into numeric form by encoding the categories. This can be done using the following code:
from sklearn.preprocessing import LabelEncoder
var_mod = ['job','marital','education','default','housing','loan','contact','month','poutcome','y']
le = LabelEncoder()
for i in var_mod:
    bank_market[i] = le.fit_transform(bank_market[i])  # replace each category with an integer code
bank_market.dtypes
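One caveat worth noting: LabelEncoder maps categories to arbitrary integers, which imposes an ordering that nominal variables such as job or marital don't actually have. A common alternative is one-hot encoding, sketched here on the raw (un-encoded) dataframe; we stick with label encoding below to keep the feature count small:
# Hypothetical alternative: each listed category column becomes a set of 0/1 indicator columns
encoded = pd.get_dummies(bank_market, columns=['job', 'marital', 'contact', 'poutcome'])
encoded.shape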
Splitting the data into training and testing sets
from sklearn.model_selection import train_test_split
features = ['age', 'job', 'marital', 'education', 'default', 'balance', 'housing', 'loan',
            'contact', 'day', 'month', 'duration', 'campaign', 'pdays', 'previous', 'poutcome']
X1 = bank_market[features]
y1 = bank_market['y']
X1_train, X1_test, Y1_train, Y1_test = train_test_split(X1, y1, train_size=0.8,random_state=90)
Y1_test.shape, X1_test.shape,X1_train.shape,Y1_train.shape
from sklearn.linear_model import LogisticRegression
logistic1= LogisticRegression()
logistic1.fit(X1_train,Y1_train)
Let’s predict the class on the test set and find the Accuracy, sensitivity and specificity of this logistic regression model we just built:
predict1=logistic1.predict(X1_test)
from sklearn.metrics import confusion_matrix
cm1 = confusion_matrix(Y1_test,predict1)
print(cm1)
total1=sum(sum(cm1))
accuracy1=(cm1[0,0]+cm1[1,1])/total1
accuracy1
specificity1=cm1[1,1]/(cm1[1,1]+cm1[1,0])
specificity1
sensitivity1=cm1[0,0]/(cm1[0,0]+cm1[0,1])
sensitivity1
Okay, accuracy is 89%; that's not bad. But specificity is very low, meaning the proportion of actual subscribers (‘yes’) that the model correctly identifies is very low. In simple terms, the model labels a large share of potential subscribers as non-subscribers, which is bad: representatives relying on it might skip customers they could actually have converted. We will work on increasing the specificity of our model.
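As a cross-check on the manual confusion-matrix arithmetic, the two quantities above are just per-class recalls, which sklearn can compute directly (keeping the same convention as above, with class 0 = ‘no’):
from sklearn.metrics import recall_score

recall_score(Y1_test, predict1, pos_label=0)  # what we called sensitivity above
recall_score(Y1_test, predict1, pos_label=1)  # what we called specificity above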
Remove outliers
We created a summary table of the continuous variables in the univariate analysis to document the outliers and missing values. Four continuous variables show signs of outliers: duration, campaign, pdays and previous. Let's treat the outliers one by one.
First, create a copy of the dataset to hold the changed variables, keeping the original dataset intact. (Plain assignment would only create a second reference to the same dataframe, so we use .copy().)
bank_market1 = bank_market.copy()
bank_market1.shape
duration
Values above 1269 (the 99th percentile) are outliers; replace them with the median value, 180.
bank_market1['duration_new'] = bank_market1['duration']
bank_market1.loc[bank_market1['duration_new'] > 1269, 'duration_new'] = 180
bank_market1['duration_new'].describe()
bank_market1.boxplot(column="duration_new")
bank_market1['duration_new'].quantile([0.1, .25,.50,.75,0.8, 0.85, .90,0.95, .99,1])
campaign
Values above 16 are outliers and should be replaced with the median value, 2.
bank_market1['campaign_new'] = bank_market1['campaign']
bank_market1.loc[bank_market1['campaign_new'] > 16, 'campaign_new'] = 2
bank_market1['campaign_new'].describe()
bank_market1.boxplot(column="campaign_new")
bank_market1['campaign_new'].quantile([0.1, .25,.50,.75,0.8, 0.85, .90,0.95, .99,1])
pdays
Values above 370 can be considered outliers, per our observations during the univariate analysis; replace them with the mean value, 40.2.
bank_market1['pdays_new'] = bank_market1['pdays']
bank_market1.loc[bank_market1['pdays_new'] > 370, 'pdays_new'] = 40.2
bank_market1['pdays_new'].describe()
bank_market1.boxplot(column="pdays_new")
bank_market1['pdays_new'].quantile([0.1, .25,.50,.75,0.8, 0.85, .90,0.95, .99,1])
previous
The 1% of values above 8.9 can be considered outliers. We will replace them with the mean, 0.58.
bank_market1['previous_new'] = bank_market1['previous']
bank_market1.loc[bank_market1['previous_new'] > 8.9, 'previous_new'] = 0.58
bank_market1['previous_new'].describe()
bank_market1.boxplot(column="previous_new")
bank_market1['previous_new'].quantile([0.1, .25,.50,.75,0.8, 0.85, .90,0.95, .99,1])
Even after treating the outliers we can still detect a hint of them. We were lenient in deciding the outlier boundaries, to keep the integrity of the data and avoid inducing bias. If needed, we can tighten the outlier margins and replace those values again.
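The four blocks above repeat the same cap-and-replace pattern, so if we do decide to re-tune the boundaries, a small helper (a sketch, using the thresholds and replacement values chosen above) keeps the treatment in one place:
def cap_outliers(df, col, cap, replacement):
    # Copy the column, then replace values above the cap with the chosen replacement
    df[col + '_new'] = df[col]
    df.loc[df[col + '_new'] > cap, col + '_new'] = replacement
    return df

# Equivalent to the manual steps above:
# bank_market1 = cap_outliers(bank_market1, 'duration', 1269, 180)
# bank_market1 = cap_outliers(bank_market1, 'campaign', 16, 2)
# bank_market1 = cap_outliers(bank_market1, 'pdays', 370, 40.2)
# bank_market1 = cap_outliers(bank_market1, 'previous', 8.9, 0.58)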
Rebuild the model after outlier removal
Let's build a logistic regression model again and see whether the outlier treatment improved anything. But first, divide the dataset bank_market1 into training and testing sets.
from sklearn.model_selection import train_test_split
feature = ['age', 'job', 'marital', 'education', 'default', 'balance', 'housing', 'loan',
           'contact', 'day', 'month', 'duration_new', 'campaign_new', 'pdays_new', 'previous_new', 'poutcome']
X2 = bank_market1[feature]
y2 = bank_market1['y']
X2_train, X2_test, Y2_train, Y2_test = train_test_split(X2, y2, train_size=0.8,random_state=90)
Y2_test.shape, X2_test.shape, X2_train.shape, Y2_train.shape
from sklearn.linear_model import LogisticRegression
logistic2= LogisticRegression()
logistic2.fit(X2_train,Y2_train)
predict2=logistic2.predict(X2_test)
from sklearn.metrics import confusion_matrix
cm2 = confusion_matrix(Y2_test,predict2)
print(cm2)
total2=sum(sum(cm2))
accuracy2=(cm2[0,0]+cm2[1,1])/total2
accuracy2
specificity2=cm2[1,1]/(cm2[1,1]+cm2[1,0])
specificity2
sensitivity2=cm2[0,0]/(cm2[0,0]+cm2[0,1])
sensitivity2
ROC and AUC
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
actual = Y2_test
false_positive_rate, true_positive_rate, thresholds = roc_curve(actual, predict2)
plt.title('Receiver Operating Characteristic')
roc_auc = auc(false_positive_rate, true_positive_rate)
plt.plot(false_positive_rate, true_positive_rate,label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
roc_auc
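One caveat: roc_curve above was fed hard 0/1 predictions, so the "curve" has only one real operating point. Feeding it predicted probabilities traces the full curve; a sketch assuming logistic2 and X2_test from above:
# Probability of class 1, so roc_curve can sweep over every threshold
probs2 = logistic2.predict_proba(X2_test)[:, 1]
fpr2, tpr2, thresholds2 = roc_curve(Y2_test, probs2)
auc(fpr2, tpr2)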
Building a Decision Tree
import pandas as pd
from sklearn import tree
clf = tree.DecisionTreeClassifier(random_state=90)
clf = clf.fit(X2_train,Y2_train)
clf
predict3 = clf.predict(X2_test)
from sklearn.metrics import confusion_matrix
cm3=confusion_matrix(Y2_test, predict3)
print (cm3)
total3 = sum(sum(cm3))
accuracy3 = (cm3[0,0]+cm3[1,1])/total3
accuracy3
specificity3=cm3[1,1]/(cm3[1,1]+cm3[1,0])
specificity3
sensitivity3=cm3[0,0]/(cm3[0,0]+cm3[0,1])
sensitivity3
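A fully grown tree can memorize the training data, so it is worth checking a depth-limited variant as a guard against overfitting; a sketch (max_depth=5 is an arbitrary illustrative choice):
clf_pruned = tree.DecisionTreeClassifier(max_depth=5, random_state=90)
clf_pruned = clf_pruned.fit(X2_train, Y2_train)
confusion_matrix(Y2_test, clf_pruned.predict(X2_test))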
RandomForest
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=100, criterion='gini')
# The remaining parameters are left at their defaults, which match the values spelled out explicitly before.
forest.fit(X2_train,Y2_train)
Predicted=forest.predict(X2_test)
from sklearn.metrics import confusion_matrix as cm
ConfusionMatrix = cm(Y2_test,Predicted)
print(ConfusionMatrix)
total = sum(sum(ConfusionMatrix))
accuracy = (ConfusionMatrix[0,0]+ConfusionMatrix[1,1])/total
accuracy
sensitivity=ConfusionMatrix[0,0]/(ConfusionMatrix[0,0]+ConfusionMatrix[0,1])
sensitivity
specificity=ConfusionMatrix[1,1]/(ConfusionMatrix[1,1]+ConfusionMatrix[1,0])
specificity
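The fitted forest also exposes feature importances, which hint at which variables drive the predictions; a quick look, assuming forest and the feature list from above:
import pandas as pd

# Rank features by their impurity-based importance in the fitted forest
pd.Series(forest.feature_importances_, index=feature).sort_values(ascending=False)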
It seems the tree-based models do a pretty good job, improving specificity significantly. The result still isn't great, but we have reached a good baseline to build on.