You can download the datasets .
Problem statement
Marketing campaigns are very crucial for any institution to generate business by promoting their products. A data driven strategy can be very helpful to achieve great results. This data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls where bank representative will pitch a banking product to potential customer. The classification goal is to predict if the client will subscribe to the specific product: ‘Yes’ or ‘No’.
Data Exploration
Basic information about data
This dataset has details of bank customers. The dataset has 18 variables including the dependent variable ‘y’ which denotes if the customer subscribed for the product ‘Term Deposite’. Variable ‘y’(term deposit) is the dependent variable & rest are independent variable.
Retrieve the data from csv
bank.market<-read.csv("D:\\Dv Analytics\\Datasets\\Bank Tele Marketing\\bank_market.csv")
head(bank.market)
## Cust_num age job marital education default balance housing loan
## 1 1 58 management married tertiary no 2143 yes no
## 2 2 44 technician single secondary no 29 yes no
## 3 3 33 entrepreneur married secondary no 2 yes yes
## 4 4 47 blue-collar married unknown no 1506 yes no
## 5 5 33 unknown single unknown no 1 no no
## 6 6 35 management married tertiary no 231 yes no
## contact day month duration campaign pdays previous poutcome y
## 1 unknown 5 may 261 1 -1 0 unknown no
## 2 unknown 5 may 151 1 -1 0 unknown no
## 3 unknown 5 may 76 1 -1 0 unknown no
## 4 unknown 5 may 92 1 -1 0 unknown no
## 5 unknown 5 may 198 1 -1 0 unknown no
## 6 unknown 5 may 139 1 -1 0 unknown no
Basic information
See the dimension of the data.
dim(bank.market)
## [1] 45211 18
we have 18 variables, and 45211 observations. Variable names:
names(bank.market)
## [1] "Cust_num" "age" "job" "marital" "education"
## [6] "default" "balance" "housing" "loan" "contact"
## [11] "day" "month" "duration" "campaign" "pdays"
## [16] "previous" "poutcome" "y"
Format & structure of the data
str(bank.market)
## 'data.frame': 45211 obs. of 18 variables:
## $ Cust_num : int 1 2 3 4 5 6 7 8 9 10 ...
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
## $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
Numerical variable: “age”, “duration”, “compaign”, “pdays”, “previous”, “day”, “balance” categorical variable: “job”, “marital”, “education”, “default”, “housing”, “loan”, “contact”, “month”, “poutcome” and “y”
See first five and last five rows of the data to get better understanding of variables.
head(bank.market)
## Cust_num age job marital education default balance housing loan
## 1 1 58 management married tertiary no 2143 yes no
## 2 2 44 technician single secondary no 29 yes no
## 3 3 33 entrepreneur married secondary no 2 yes yes
## 4 4 47 blue-collar married unknown no 1506 yes no
## 5 5 33 unknown single unknown no 1 no no
## 6 6 35 management married tertiary no 231 yes no
## contact day month duration campaign pdays previous poutcome y
## 1 unknown 5 may 261 1 -1 0 unknown no
## 2 unknown 5 may 151 1 -1 0 unknown no
## 3 unknown 5 may 76 1 -1 0 unknown no
## 4 unknown 5 may 92 1 -1 0 unknown no
## 5 unknown 5 may 198 1 -1 0 unknown no
## 6 unknown 5 may 139 1 -1 0 unknown no
tail(bank.market)
## Cust_num age job marital education default balance housing
## 45206 45206 25 technician single secondary no 505 no
## 45207 45207 51 technician married tertiary no 825 no
## 45208 45208 71 retired divorced primary no 1729 no
## 45209 45209 72 retired married secondary no 5715 no
## 45210 45210 57 blue-collar married secondary no 668 no
## 45211 45211 37 entrepreneur married secondary no 2971 no
## loan contact day month duration campaign pdays previous poutcome
## 45206 yes cellular 17 nov 386 2 -1 0 unknown
## 45207 no cellular 17 nov 977 3 -1 0 unknown
## 45208 no cellular 17 nov 456 2 -1 0 unknown
## 45209 no cellular 17 nov 1127 5 184 3 success
## 45210 no telephone 17 nov 508 4 -1 0 unknown
## 45211 no cellular 17 nov 361 2 188 11 other
## y
## 45206 yes
## 45207 yes
## 45208 yes
## 45209 yes
## 45210 no
## 45211 no
See the summary of whole data
summary(bank.market)
## Cust_num age job marital
## Min. : 1 Min. :18.00 blue-collar:9732 divorced: 5207
## 1st Qu.:11304 1st Qu.:33.00 management :9458 married :27214
## Median :22606 Median :39.00 technician :7597 single :12790
## Mean :22606 Mean :40.94 admin. :5171
## 3rd Qu.:33909 3rd Qu.:48.00 services :4154
## Max. :45211 Max. :95.00 retired :2264
## (Other) :6835
## education default balance housing loan
## primary : 6851 no :44396 Min. : -8019 no :20081 no :37967
## secondary:23202 yes: 815 1st Qu.: 72 yes:25130 yes: 7244
## tertiary :13301 Median : 448
## unknown : 1857 Mean : 1362
## 3rd Qu.: 1428
## Max. :102127
##
## contact day month duration
## cellular :29285 Min. : 1.00 may :13766 Min. : 0.0
## telephone: 2906 1st Qu.: 8.00 jul : 6895 1st Qu.: 103.0
## unknown :13020 Median :16.00 aug : 6247 Median : 180.0
## Mean :15.81 jun : 5341 Mean : 258.2
## 3rd Qu.:21.00 nov : 3970 3rd Qu.: 319.0
## Max. :31.00 apr : 2932 Max. :4918.0
## (Other): 6060
## campaign pdays previous poutcome
## Min. : 1.000 Min. : -1.0 Min. : 0.0000 failure: 4901
## 1st Qu.: 1.000 1st Qu.: -1.0 1st Qu.: 0.0000 other : 1840
## Median : 2.000 Median : -1.0 Median : 0.0000 success: 1511
## Mean : 2.764 Mean : 40.2 Mean : 0.5803 unknown:36959
## 3rd Qu.: 3.000 3rd Qu.: -1.0 3rd Qu.: 0.0000
## Max. :63.000 Max. :871.0 Max. :275.0000
##
## y
## no :39922
## yes: 5289
##
##
##
##
##
Missing Values
Summary will show the minimum,maximum,mean,median,1st quartile,3rd quartile of all the variables in the data set. It also shows whether any variable has NA(missing) values. In our data, variables doesn’t seem to have NA values. However missing values will not always be represented by ‘NA’. These can also be represented by ‘?’,‘-’,‘.’ etc. To know whether a variable has a missing value or not, summary of that variable or a frequency table can be useful. Once we discover the representation of the missing values, we can find number of missing values in each variable.
So far this dataset does not seem to contain any missing values.
We will move to Univariate Analysis to understand more about each variable.
Understanding Variables and Univariate Analysis
Univariate analysis is a way to validate one variable at a time. It allows us to understand what a variable signifies. Also any missing values or outliers can be detected with this analysis to further sanitize the data for final model building.
Exploring the numeric variables
First we will go through all numerical variables in our dataset.
Cust_num
It is the first variable in the data. It seem to be serial number of the customer. To get summary of ‘Cust_num’
summary(bank.market$Cust_num)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 11300 22610 22610 33910 45210
The minimum value in Cust_num column is 1, maximum value is 45210 which is number of rows in the data. Its mean and median are equal, which shows an equal distribution.
age
It specifies age of the barrower in years. It’s an integer. lets see the summary of age
summary(bank.market$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 33.00 39.00 40.94 48.00 95.00
Minimum age is 18, maximum age is 95 which is okay. Mean and median are very close which indicates outliers may not be present.
Lets see the percentile distribution.
quantile(bank.market$age,c(0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.96,0.97,0.98,0.99,1))
## 0% 1% 3% 5% 7% 9% 10% 20% 30% 40% 50% 60% 70% 80% 90%
## 18 23 26 27 28 28 29 32 34 36 39 42 46 51 56
## 95% 96% 97% 98% 99% 100%
## 59 59 60 63 71 95
Percentile distribution shows only 10% of customers are younger then 28, and around 75% of the customers are between age of 28-60, meaning a more mature customer group is targeted for campaign.
Plot the boxplot of the variable
library(ggplot2)
ggplot(bank.market, aes(y=age, x = 1)) + geom_boxplot()
We can notice that good number of customers are in their early mid age. There doesn’t seem to be a sign outliers in the variable.
duration
Variable ‘duration’ specifies the duration of last call made with customer in second. let’s see the summary of duration:
summary(bank.market$duration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 103.0 180.0 258.2 319.0 4918.0
The Min value is 0.0 but Max value is 4918.0 seconds which is around 1hour22minutes. Let’s see the percentile distribution.
quantile(bank.market$duration,c(0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.99,1))
## 0% 1% 3% 5% 7% 9% 10% 20% 30% 40% 50% 60% 70% 80% 90%
## 0 11 22 35 46 55 58 89 117 147 180 223 280 368 548
## 95% 99% 100%
## 751 1269 4918
Percentile distribution shows 95% of the calls ended within first 751 seconds or around 12 minutes.
Boxplot of the variable.
ggplot(bank.market, aes(y=duration, x = 1)) + geom_boxplot()
Boxplot shows that there are quite a few outliers in the this variable. We can consider last 5% values as outliers.
campaign
This variable represents number of contacts made during this campaign and for this client, this includes the last contact. Summary of variable ‘campaign’:
summary(bank.market$campaign)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 2.764 3.000 63.000
Min value is 1 and Max value is 63. Contacting same customer 63 times seem too high this might be outlier.(Outlier doesn’t mean that this is flasly entered information, however, it can affect our predictive models)
Let’s get into percentile distribution:
quantile(bank.market$campaign,c(0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.99,1))
## 0% 1% 3% 5% 7% 9% 10% 20% 30% 40% 50% 60% 70% 80% 90%
## 1 1 1 1 1 1 1 1 1 2 2 2 3 4 5
## 95% 99% 100%
## 8 16 63
We can see about 60% of customers were contacted no more than twice and 90% has been contacted less than 5 times. Last 1%ile shows too much of variance form the whole data, considering this as an outlier should be a best call.
Boxlot for better visual understanding:
ggplot(bank.market, aes(y=campaign, x = 1)) + geom_boxplot()
qplot(data = bank.market, x = campaign)
Boxplot and Barplot shows high number of cutomers were called less than 5 times and 99% of the customers were not called more than 16 times.
Lets see it’s frequency distribution.
table(bank.market$campaign)
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 17544 12505 5521 3522 1764 1291 735 540 327 266 201 155
## 13 14 15 16 17 18 19 20 21 22 23 24
## 133 93 84 79 69 51 44 43 35 23 22 20
## 25 26 27 28 29 30 31 32 33 34 35 36
## 22 13 10 16 16 8 12 9 6 5 4 4
## 37 38 39 41 43 44 46 50 51 55 58 63
## 2 3 1 2 3 1 1 2 1 1 1 1
We can see for all different measures that distribution is concentrated in a very small range and if the value is too far from the range we can consider it an outlier. Here any value that is more than 16 or above 99th percentile, we will it an outlier.
pdays
This variable represents number of days that passed by after the client was last contacted from a previous campaign. It is a numeric variable and -1 means client was not previously contacted.
Summary of pdays:
summary(bank.market$pdays)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.0 -1.0 -1.0 40.2 -1.0 871.0
Minimum value is -1 and maximum value is 871 days. Mean and median have huge gap which indicates there is presence outliers.
Let’s have a look at percentile distribution:
quantile(bank.market$pdays,c(0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.99,1))
## 0% 1% 3% 5% 7% 9% 10% 20% 30% 40% 50% 60% 70% 80% 90%
## -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 185
## 95% 99% 100%
## 317 370 871
we see about 80% values are given as -1, and these customers were contacted for the first time. Let’s have a look at boxplot for good measure:
ggplot(bank.market, aes(y=pdays, x = 1)) + geom_boxplot()
However, if the customer was last contacted over a year ago we can assume value more than 365 to be an outlier and 99th percentile happen to be 370(very near to 365).
previous
This variable represents number of contacts performed before this campaign and for this client, it’s a numeric value.
Summary of ‘previous’:
summary(bank.market$previous)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.5803 0.0000 275.0000
Minimum value is 0 maximum value is 275, mean is 0.58 but median is 0.
A look at percentile distribution:
quantile(bank.market$previous,c(0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.99,1))
## 0% 1% 3% 5% 7% 9% 10% 20% 30% 40% 50% 60%
## 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
## 70% 80% 90% 95% 99% 100%
## 0.0 0.0 2.0 3.0 8.9 275.0
Percentile distribution suggest 99% values are under 8.9, so the value 275 must be an outlier.
Visualizing the distribution of ‘previous’ for a clear view:
ggplot(bank.market, aes(y=previous, x = 1)) + geom_boxplot()
day
It’s the last contact day of the month with customer. The numeric value must be between 1 to 31.
Let’s have a look at summary to see if there is any outlier or missing value.
summary(bank.market$day)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 8.00 16.00 15.81 21.00 31.00
Summary shows min value 1 and max value being 31, which is the exact range of days in months.
Have a look at boxplot:
ggplot(bank.market, aes(y=day, x = 1)) + geom_boxplot()
Distribution of this variable seem pretty fine.
balance
It’s average yearly balance of the customer, in euros. This could have large impact on subscription by customer. Let’s have a look at summary:
summary(bank.market$balance)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -8019 72 448 1362 1428 102100
Min value is -8019 and max value is 102100, can’t really doubt these values being impossible.
quantile(bank.market$balance,c(0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.99,1))
## 0% 1% 3% 5% 7% 9% 10% 20%
## -8019.0 -627.0 -322.0 -172.0 -51.0 0.0 0.0 22.0
## 30% 40% 50% 60% 70% 80% 90% 95%
## 131.0 272.0 448.0 701.0 1126.0 1859.0 3574.0 5768.0
## 99% 100%
## 13164.9 102127.0
Percentile distribution shows 99% of the values are under 13164.9, the rest might be outliers.
Have a look at boxplot to understand distribution:
ggplot(bank.market, aes(y=balance, x = 1)) + geom_boxplot()
qplot(bank.market$balance, geom="histogram", binwidth = 5000)
Boxplot indicates no outliers.
Exploring catagorical variables
job
This represents type of job the customer has. we can see the the count value of each value of a catagorical variable by using summary function:
summary(bank.market$job)
## admin. blue-collar entrepreneur housemaid management
## 5171 9732 1487 1240 9458
## retired self-employed services student technician
## 2264 1579 4154 938 7597
## unemployed unknown
## 1303 288
marital
Marital status of the customer. It’s a categorical value: “married”,“divorced”,“single”. note: “divorced” means divorced or widowed
table(bank.market$marital)
##
## divorced married single
## 5207 27214 12790
education
Education level of the customer. (categorical: “unknown”,“secondary”,“primary”,“tertiary”)
table(bank.market$education)
##
## primary secondary tertiary unknown
## 6851 23202 13301 1857
default
This variable shows if the customer has credit in default? (binary: “yes”,“no”)
summary(bank.market$default)
## no yes
## 44396 815
housing
If the customer has any housing loan.
summary(bank.market$housing)
## no yes
## 20081 25130
there dosn’t seem to be any outlier or missing values.
loan
If the customer has any personal loan
summary(bank.market$loan)
## no yes
## 37967 7244
no missing values or outliers.
contact
The contact communication method used to aproach customer: ‘cellular’, ‘telephone’, ‘unknown’
summary(bank.market$contact)
## cellular telephone unknown
## 29285 2906 13020
no missing values or any outlier in this variable
month
It’s the last contact month of the year.
summary(bank.market$month)
## apr aug dec feb jan jul jun mar may nov oct sep
## 2932 6247 214 2649 1403 6895 5341 477 13766 3970 738 579
No missing value, no outlier.
poutcome
outcome of the previous marketing campaign.
summary(bank.market$poutcome)
## failure other success unknown
## 4901 1840 1511 36959
no outlier or missing values.
y
this is our output variable(desired varible). Has the client subscribed a term deposit? (binary: ‘yes’,‘no’)
summary(bank.market$poutcome)
## failure other success unknown
## 4901 1840 1511 36959
No missing values or outliers.
Summary of Univariate Analysis on Numerical Variables
Lets tabularize what we found in univariate analysis
Variable | Outliers |Remarks —————-|—————|————— Cust_num | Nill |
age | Nill |
duration | 1% | Value more than 1269
campaign | 1% | Value more than 16 pdays | 1% | Value more than 370
previous | 1% | Value more than 8.9 day | Nill | balance | Nill | | | y | | Output Variable
Model Building
As our dependent variable ‘y’ is binary variable(Yes-No type), basic algorithm to aproach would be logistic regression.
We will need to split the data into training and testing sets. Using sample.split() function form library caTools, we will split the dataset into 80-20 ratio of training and testing set.
Initially we will go with the raw data set and then we will go with a basic claeaning process to clean the outliers or any NA values and see if the cleaning improves our results or not.
Spliting the data into training and testing set
library(caTools)
set.seed(90) # to get the same split everytime
spl = sample.split(bank.market$y, SplitRatio = 0.8)
train = subset(bank.market, spl==TRUE)
test = subset(bank.market, spl==FALSE)
Multiple logistic regression
Great thing about R is that many of pre-built modeling functions break categorical varibles into dummy variables automatically, while building models. Building a logistic regression model:
lgm1<-glm(y~age+job+marital+loan+balance+education+default+housing+contact+day+month+duration+campaign+pdays+poutcome+previous,data=train,family=binomial())
summary(lgm1)
##
## Call:
## glm(formula = y ~ age + job + marital + loan + balance + education +
## default + housing + contact + day + month + duration + campaign +
## pdays + poutcome + previous, family = binomial(), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.6972 -0.3756 -0.2536 -0.1500 3.4366
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.493e+00 2.073e-01 -12.028 < 2e-16 ***
## age -1.606e-03 2.476e-03 -0.649 0.516565
## jobblue-collar -2.935e-01 8.070e-02 -3.637 0.000276 ***
## jobentrepreneur -2.939e-01 1.400e-01 -2.099 0.035838 *
## jobhousemaid -4.895e-01 1.536e-01 -3.188 0.001432 **
## jobmanagement -1.308e-01 8.150e-02 -1.605 0.108489
## jobretired 3.266e-01 1.086e-01 3.009 0.002623 **
## jobself-employed -2.531e-01 1.255e-01 -2.016 0.043809 *
## jobservices -1.965e-01 9.364e-02 -2.099 0.035852 *
## jobstudent 3.345e-01 1.226e-01 2.728 0.006375 **
## jobtechnician -1.664e-01 7.701e-02 -2.161 0.030721 *
## jobunemployed -2.227e-01 1.254e-01 -1.776 0.075665 .
## jobunknown -4.036e-01 2.672e-01 -1.510 0.130957
## maritalmarried -1.756e-01 6.557e-02 -2.678 0.007397 **
## maritalsingle 9.087e-02 7.480e-02 1.215 0.224395
## loanyes -4.414e-01 6.684e-02 -6.603 4.02e-11 ***
## balance 1.264e-05 5.688e-06 2.222 0.026306 *
## educationsecondary 1.820e-01 7.249e-02 2.511 0.012026 *
## educationtertiary 3.563e-01 8.417e-02 4.233 2.31e-05 ***
## educationunknown 2.475e-01 1.169e-01 2.116 0.034306 *
## defaultyes 5.938e-03 1.826e-01 0.033 0.974049
## housingyes -6.869e-01 4.904e-02 -14.006 < 2e-16 ***
## contacttelephone -1.168e-01 8.432e-02 -1.386 0.165835
## contactunknown -1.617e+00 8.129e-02 -19.893 < 2e-16 ***
## day 8.088e-03 2.788e-03 2.901 0.003719 **
## monthaug -6.964e-01 8.799e-02 -7.914 2.49e-15 ***
## monthdec 5.624e-01 1.975e-01 2.848 0.004399 **
## monthfeb -1.349e-01 9.998e-02 -1.349 0.177307
## monthjan -1.219e+00 1.375e-01 -8.863 < 2e-16 ***
## monthjul -7.716e-01 8.636e-02 -8.934 < 2e-16 ***
## monthjun 4.729e-01 1.039e-01 4.553 5.28e-06 ***
## monthmar 1.582e+00 1.350e-01 11.712 < 2e-16 ***
## monthmay -3.567e-01 8.085e-02 -4.411 1.03e-05 ***
## monthnov -8.023e-01 9.369e-02 -8.564 < 2e-16 ***
## monthoct 9.078e-01 1.229e-01 7.388 1.49e-13 ***
## monthsep 9.012e-01 1.328e-01 6.787 1.15e-11 ***
## duration 4.148e-03 7.199e-05 57.623 < 2e-16 ***
## campaign -9.240e-02 1.150e-02 -8.033 9.50e-16 ***
## pdays -2.183e-04 3.420e-04 -0.638 0.523284
## poutcomeother 2.564e-01 9.937e-02 2.580 0.009884 **
## poutcomesuccess 2.345e+00 9.202e-02 25.485 < 2e-16 ***
## poutcomeunknown -6.290e-02 1.087e-01 -0.579 0.562784
## previous 3.238e-02 1.114e-02 2.906 0.003657 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 26104 on 36168 degrees of freedom
## Residual deviance: 17276 on 36126 degrees of freedom
## AIC: 17362
##
## Number of Fisher Scoring iterations: 6
Let’s predict the class on the test set and find the Accuracy, sensitivity and specificity of this logistic regression model we just built:
predicted_class<-ifelse(predict(lgm1, test)>0.5, 1, 0)
conf_matrix<-table(test$y, predicted_class)
conf_matrix
## predicted_class
## 0 1
## no 7853 131
## yes 782 276
accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/sum(conf_matrix)
accuracy
## [1] 0.8990268
sensitivity<-conf_matrix[1,1]/(conf_matrix[1,1]+conf_matrix[1,2])
sensitivity
## [1] 0.9835922
specificity<-conf_matrix[2,2]/(conf_matrix[2,2]+conf_matrix[2,1])
specificity
## [1] 0.2608696
Okay, accuracy is 89.98 ~ 90%, that’s not bad. But specificity is very low, meaning proportion of negatives that are correctly identified is very low. In simple terms the model is identifying a large amount of potential non-subscribers as subsribers, which is bad. Depending on this model representatives migh waste energy on customers who he/she might not necessarily be able to convert into a subscribers. We will work on decreasing the Specificity of our model.
But first let’s remove the outliers:
Remove outliers
We created a ummary table of the continuous variables from Univariate Analysis to document the outliers and missing values. 4 Continuous Variables shows the sign of outliers: duration, campaign, pdays and previous. Removing the outliers one by one:
First create a new dataset in which we will put the changed variables to keep original dataset intact.
bank.market1 <- bank.market
Now with every modification we make in the variables with outliers can be directly stored in new dataframe bank.market1.
duration
Value above 1269 can be considered outlier, which we will replace with the median value 180.
bank.market1$duration<-ifelse(bank.market$duration>1269, 180, bank.market$duration)
summary(bank.market1$duration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 103.0 180.0 243.5 310.0 1269.0
ggplot(bank.market1, aes(y=duration, x = 1)) + geom_boxplot()
quantile(bank.market1$duration,c(0.1, .25,.50,.75,0.8, 0.85, .90,0.95, .99,1))
## 10% 25% 50% 75% 80% 85% 90% 95% 99% 100%
## 58 103 180 310 358 421 521 696 1051 1269
campaign
Values above 16 are outliers and should be replaced with median value : 2
bank.market1$campaign<-ifelse(bank.market$campaign>16,2,bank.market$campaign)
summary(bank.market1$campaign)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 2.552 3.000 16.000
ggplot(bank.market1, aes(y=campaign, x = 1)) + geom_boxplot()
quantile(bank.market1$campaign,c(0.1, .25,.50,.75,0.8, 0.85, .90,0.95, .99,1))
## 10% 25% 50% 75% 80% 85% 90% 95% 99% 100%
## 1 1 2 3 4 4 5 7 12 16
pdays
Values above 370 can be considered outliers according to our observations while univariate analysis.
bank.market1$pdays<-ifelse(bank.market$pdays>370,40.2,bank.market$pdays)
summary(bank.market1$pdays)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.00 -1.00 -1.00 36.51 -1.00 370.00
ggplot(bank.market1, aes(y=pdays, x = 1)) + geom_boxplot()
quantile(bank.market1$pdays,c(0.1, .25,.50,.75,0.8, 0.85, .90,0.95, .99,1))
## 10% 25% 50% 75% 80% 85% 90% 95% 99% 100%
## -1 -1 -1 -1 -1 94 182 288 362 370
previous
1% values that are above 8.9 can be considered outliers. We will replace these values with mean 0.58.
bank.market1$previous<-ifelse(bank.market$previous>8.9,0.58,bank.market$previous)
summary(bank.market1$previous)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.4413 0.0000 8.0000
ggplot(bank.market1, aes(y=duration, x = 1)) + geom_boxplot()
quantile(bank.market1$previous,c(0.1, .25,.50,.75,0.8, 0.85, .90,0.95, .99,1))
## 10% 25% 50% 75% 80% 85% 90% 95% 99% 100%
## 0 0 0 0 0 1 2 3 6 8
After removing outliers we can still detect a hint of outliers. We were lineant toward the deciding a boundary for the outliers to keep integrety of the data and not induce any bias. If needed we can again change the margin of outliers and replace those values.
Rebuild the model after outlier removal
Again build a Logistic Model and see if we made any improvements with outlier removal But first devide the dataset bank.market1 into training and testing sets.
library("caTools")
set.seed(90)
sample=sample.split(bank.market1,SplitRatio=.75)
train1=subset(bank.market1,sample==TRUE)
test1=subset(bank.market1,sample==FALSE)
Training a Logistic Regression Model:
lgm2<-glm(formula=y~age+job+marital+loan+balance+education+default+housing+contact+day+month+duration+campaign+pdays+poutcome+previous,data=train1,family=binomial())
summary(lgm2)
##
## Call:
## glm(formula = y ~ age + job + marital + loan + balance + education +
## default + housing + contact + day + month + duration + campaign +
## pdays + poutcome + previous, family = binomial(), data = train1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9740 -0.3765 -0.2496 -0.1562 3.3590
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.835e+00 2.212e-01 -8.296 < 2e-16 ***
## age -3.965e-04 2.567e-03 -0.154 0.877248
## jobblue-collar -2.522e-01 8.490e-02 -2.970 0.002974 **
## jobentrepreneur -3.015e-01 1.452e-01 -2.077 0.037805 *
## jobhousemaid -4.194e-01 1.558e-01 -2.691 0.007124 **
## jobmanagement -1.162e-01 8.586e-02 -1.353 0.176058
## jobretired 2.912e-01 1.137e-01 2.560 0.010468 *
## jobself-employed -3.414e-01 1.326e-01 -2.575 0.010011 *
## jobservices -1.420e-01 9.752e-02 -1.456 0.145307
## jobstudent 5.550e-01 1.278e-01 4.342 1.41e-05 ***
## jobtechnician -1.291e-01 8.109e-02 -1.592 0.111378
## jobunemployed -1.376e-01 1.313e-01 -1.048 0.294644
## jobunknown -2.845e-01 2.703e-01 -1.053 0.292509
## maritalmarried -2.109e-01 6.834e-02 -3.086 0.002032 **
## maritalsingle 1.682e-02 7.824e-02 0.215 0.829819
## loanyes -3.992e-01 6.867e-02 -5.814 6.11e-09 ***
## balance 1.087e-05 5.993e-06 1.813 0.069756 .
## educationsecondary 1.814e-01 7.516e-02 2.414 0.015797 *
## educationtertiary 3.298e-01 8.777e-02 3.757 0.000172 ***
## educationunknown 1.983e-01 1.228e-01 1.614 0.106473
## defaultyes 4.095e-02 1.834e-01 0.223 0.823303
## housingyes -5.942e-01 5.071e-02 -11.719 < 2e-16 ***
## contacttelephone -1.483e-01 8.704e-02 -1.704 0.088382 .
## contactunknown -1.640e+00 8.379e-02 -19.579 < 2e-16 ***
## day 6.748e-03 2.898e-03 2.329 0.019872 *
## monthaug -7.870e-01 9.259e-02 -8.500 < 2e-16 ***
## monthdec 8.486e-01 2.033e-01 4.173 3.00e-05 ***
## monthfeb -1.772e-01 1.043e-01 -1.699 0.089347 .
## monthjan -1.101e+00 1.371e-01 -8.026 1.01e-15 ***
## monthjul -8.945e-01 9.066e-02 -9.867 < 2e-16 ***
## monthjun 4.165e-01 1.094e-01 3.808 0.000140 ***
## monthmar 1.465e+00 1.409e-01 10.397 < 2e-16 ***
## monthmay -3.901e-01 8.520e-02 -4.578 4.69e-06 ***
## monthnov -9.691e-01 9.909e-02 -9.781 < 2e-16 ***
## monthoct 8.138e-01 1.277e-01 6.372 1.87e-10 ***
## monthsep 7.246e-01 1.415e-01 5.121 3.04e-07 ***
## duration 4.784e-03 8.447e-05 56.626 < 2e-16 ***
## campaign -7.698e-02 1.258e-02 -6.118 9.49e-10 ***
## pdays -4.229e-03 4.407e-04 -9.595 < 2e-16 ***
## poutcomeother 1.066e-01 1.070e-01 0.996 0.319072
## poutcomesuccess 2.027e+00 9.484e-02 21.374 < 2e-16 ***
## poutcomeunknown -8.315e-01 1.210e-01 -6.874 6.24e-12 ***
## previous 3.112e-02 2.313e-02 1.345 0.178528
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 23515 on 32651 degrees of freedom
## Residual deviance: 15795 on 32609 degrees of freedom
## AIC: 15881
##
## Number of Fisher Scoring iterations: 6
Let’s validate the model by predicting values on test set:
predicted_values<-ifelse(predict(lgm2,test1)>0.5,1,0)
** Confusion Matrix, Accuracy, Sensitivity and Specificity for model lgm2**
conf_matrix<-table(test1$y, predicted_values)
conf_matrix
## predicted_values
## 0 1
## no 10901 176
## yes 1113 369
accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/sum(conf_matrix)
accuracy
## [1] 0.8973644
sensitivity<-conf_matrix[1,1]/(conf_matrix[1,1]+conf_matrix[1,2])
sensitivity
## [1] 0.9841112
specificity<-conf_matrix[2,2]/(conf_matrix[2,2]+conf_matrix[2,1])
specificity
## [1] 0.2489879
With the new Logistic Model that we have built after removing the outliers, we don’t see any improvement in the results.
We can move further and build a Decision Tree Model.
Building Decission tree
We will build Decision Tree Model and try to fit it with the training set train1 derived from cleaned dataset.
library("rpart")
banktree1<-rpart(y~age+job+marital+loan+balance+education+default+housing+contact+day+month+duration+campaign+pdays+poutcome+previous,method="class",data=train1,control=rpart.control(minsplit=60))
summary(banktree1)
## Call:
## rpart(formula = y ~ age + job + marital + loan + balance + education +
## default + housing + contact + day + month + duration + campaign +
## pdays + poutcome + previous, data = train1, method = "class",
## control = rpart.control(minsplit = 60))
## n= 32652
##
## CP nsplit rel error xerror xstd
## 1 0.03204623 0 1.0000000 1.0000000 0.01523312
## 2 0.02495403 3 0.9038613 0.9419490 0.01484089
## 3 0.01365905 4 0.8789073 0.8857368 0.01444414
## 4 0.01000000 5 0.8652482 0.8807460 0.01440806
##
## Variable importance
## duration poutcome
## 58 42
##
## Node number 1: 32652 observations, complexity param=0.03204623
## predicted class=no expected loss=0.1165932 P(node) =1
## class counts: 28845 3807
## probabilities: 0.883 0.117
## left son=2 (29384 obs) right son=3 (3268 obs)
## Primary splits:
## duration < 521.5 to the left, improve=678.6640, (0 missing)
## poutcome splits as LLRL, improve=616.6600, (0 missing)
## month splits as LLRLLLLRLLRR, improve=371.3537, (0 missing)
## pdays < 8.5 to the left, improve=189.3622, (0 missing)
## previous < 0.29 to the left, improve=187.3590, (0 missing)
##
## Node number 2: 29384 observations, complexity param=0.03204623
## predicted class=no expected loss=0.08259597 P(node) =0.8999142
## class counts: 26957 2427
## probabilities: 0.917 0.083
## left son=4 (28436 obs) right son=5 (948 obs)
## Primary splits:
## poutcome splits as LLRL, improve=553.1033, (0 missing)
## month splits as LLRLLLLRLLRR, improve=360.2030, (0 missing)
## duration < 178.5 to the left, improve=189.3252, (0 missing)
## pdays < 8.5 to the left, improve=171.0930, (0 missing)
## previous < 0.29 to the left, improve=169.1340, (0 missing)
##
## Node number 3: 3268 observations, complexity param=0.03204623
## predicted class=no expected loss=0.4222766 P(node) =0.1000858
## class counts: 1888 1380
## probabilities: 0.578 0.422
## left son=6 (2194 obs) right son=7 (1074 obs)
## Primary splits:
## duration < 800.5 to the left, improve=69.66139, (0 missing)
## contact splits as RRL, improve=46.43019, (0 missing)
## poutcome splits as LLRL, improve=41.75036, (0 missing)
## month splits as LRRLLLLRLLRR, improve=24.18966, (0 missing)
## pdays < 0 to the left, improve=20.38622, (0 missing)
## Surrogate splits:
## age < 81 to the left, agree=0.672, adj=0.002, (0 split)
## balance < -2385.5 to the right, agree=0.672, adj=0.002, (0 split)
## pdays < 369 to the left, agree=0.672, adj=0.001, (0 split)
## previous < 7.5 to the left, agree=0.672, adj=0.001, (0 split)
##
## Node number 4: 28436 observations
## predicted class=no expected loss=0.06488254 P(node) =0.8708808
## class counts: 26591 1845
## probabilities: 0.935 0.065
##
## Node number 5: 948 observations, complexity param=0.02495403
## predicted class=yes expected loss=0.3860759 P(node) =0.02903344
## class counts: 366 582
## probabilities: 0.386 0.614
## left son=10 (185 obs) right son=11 (763 obs)
## Primary splits:
## duration < 136.5 to the left, improve=63.166340, (0 missing)
## month splits as RRRRRRRRLLRR, improve=12.627290, (0 missing)
## housing splits as RL, improve=11.976270, (0 missing)
## pdays < 202.5 to the right, improve=10.329580, (0 missing)
## age < 58.5 to the left, improve= 6.278532, (0 missing)
## Surrogate splits:
## contact splits as RRL, agree=0.813, adj=0.043, (0 split)
## campaign < 6.5 to the right, agree=0.809, adj=0.022, (0 split)
## balance < -82 to the left, agree=0.806, adj=0.005, (0 split)
## default splits as RL, agree=0.806, adj=0.005, (0 split)
##
## Node number 6: 2194 observations, complexity param=0.01365905
## predicted class=no expected loss=0.3500456 P(node) =0.06719343
## class counts: 1426 768
## probabilities: 0.650 0.350
## left son=12 (2102 obs) right son=13 (92 obs)
## Primary splits:
## poutcome splits as LLRL, improve=35.93525, (0 missing)
## contact splits as RRL, improve=33.09535, (0 missing)
## pdays < 0 to the left, improve=24.43706, (0 missing)
## previous < 0.29 to the left, improve=24.43706, (0 missing)
## duration < 647.5 to the left, improve=18.92576, (0 missing)
##
## Node number 7: 1074 observations
## predicted class=yes expected loss=0.4301676 P(node) =0.03289232
## class counts: 462 612
## probabilities: 0.430 0.570
##
## Node number 10: 185 observations
## predicted class=no expected loss=0.2432432 P(node) =0.005665809
## class counts: 140 45
## probabilities: 0.757 0.243
##
## Node number 11: 763 observations
## predicted class=yes expected loss=0.2961992 P(node) =0.02336763
## class counts: 226 537
## probabilities: 0.296 0.704
##
## Node number 12: 2102 observations
## predicted class=no expected loss=0.3311132 P(node) =0.06437584
## class counts: 1406 696
## probabilities: 0.669 0.331
##
## Node number 13: 92 observations
## predicted class=yes expected loss=0.2173913 P(node) =0.002817592
## class counts: 20 72
## probabilities: 0.217 0.783
Plotting the tree:
library(rattle)
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library("rpart.plot")
fancyRpartPlot(banktree1)
** Predicting Values, Calculating accuracy and confusion matrix**
predicted_values<-predict(banktree1,test1, type="class")
conf_matrix<-table(test1$y, predicted_values)
conf_matrix
## predicted_values
## no yes
## no 10810 267
## yes 1019 463
accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/sum(conf_matrix)
accuracy
## [1] 0.8976033
sensitivity<-conf_matrix[1,1]/(conf_matrix[1,1]+conf_matrix[1,2])
sensitivity
## [1] 0.975896
specificity<-conf_matrix[2,2]/(conf_matrix[2,2]+conf_matrix[2,1])
specificity
## [1] 0.3124157
With the model we have built, we need a good specificity too, otherwise the model results will make bank representatives to call the customers who migh not be interested in the product. This will make representatives waste time on potential non-buyers.
We can prune the tree to make itmore generic and expect the specificity to improve.
Pruning the tree and Complexity Parameter
Effect of complexity parameter can be understood by using function plotcp() and printcp() which are inbuilt with rpart library. printcp() shows the Training error, cross validation error and standard deviation at each node. – summary() function also provides the same details.
printcp(banktree1)
##
## Classification tree:
## rpart(formula = y ~ age + job + marital + loan + balance + education +
## default + housing + contact + day + month + duration + campaign +
## pdays + poutcome + previous, data = train1, method = "class",
## control = rpart.control(minsplit = 60))
##
## Variables actually used in tree construction:
## [1] duration poutcome
##
## Root node error: 3807/32652 = 0.11659
##
## n= 32652
##
## CP nsplit rel error xerror xstd
## 1 0.032046 0 1.00000 1.00000 0.015233
## 2 0.024954 3 0.90386 0.94195 0.014841
## 3 0.013659 4 0.87891 0.88574 0.014444
## 4 0.010000 5 0.86525 0.88075 0.014408
plotcp(banktree1)
From the observations above, we can choose a CP value to further prune the tree. Intutively we can choose cp as 0.013 and see the effect.
Post pruning the model
banktree_prune1 <- prune(banktree1, cp=0.013)
plotcp(banktree_prune1)
printcp(banktree_prune1)
##
## Classification tree:
## rpart(formula = y ~ age + job + marital + loan + balance + education +
## default + housing + contact + day + month + duration + campaign +
## pdays + poutcome + previous, data = train1, method = "class",
## control = rpart.control(minsplit = 60))
##
## Variables actually used in tree construction:
## [1] duration poutcome
##
## Root node error: 3807/32652 = 0.11659
##
## n= 32652
##
## CP nsplit rel error xerror xstd
## 1 0.032046 0 1.00000 1.00000 0.015233
## 2 0.024954 3 0.90386 0.94195 0.014841
## 3 0.013659 4 0.87891 0.88574 0.014444
## 4 0.010000 5 0.86525 0.88075 0.014408
predicted_values<-predict(banktree_prune1, test1, type="class")
conf_matrix<-table(test1$y, predicted_values)
conf_matrix
## predicted_values
## no yes
## no 10810 267
## yes 1019 463
accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/sum(conf_matrix)
accuracy
## [1] 0.8976033
sensitivity<-conf_matrix[1,1]/(conf_matrix[1,1]+conf_matrix[1,2])
sensitivity
## [1] 0.975896
specificity<-conf_matrix[2,2]/(conf_matrix[2,2]+conf_matrix[2,1])
specificity
## [1] 0.3124157
We don’t see any improvement in the specificity of our model from pruning.
We can take other approach and use a random forest model.
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
rf_model <- randomForest(y~ ., ntree=50, mtry=ncol(train1)/3, data=train1)
predicted_values<-predict(rf_model, test1)
conf_matrix<-table(test1$y, predicted_values)
conf_matrix
## predicted_values
## no yes
## no 10639 438
## yes 717 765
accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/sum(conf_matrix)
accuracy
## [1] 0.9080341
sensitivity<-conf_matrix[1,1]/(conf_matrix[1,1]+conf_matrix[1,2])
sensitivity
## [1] 0.9604586
specificity<-conf_matrix[2,2]/(conf_matrix[2,2]+conf_matrix[2,1])
specificity
## [1] 0.5161943
Seem like a random forest model is doing a pretty good job getting better specificity significantly. However, it’s still not that good but we reached to good result.


