You can download the datasets .

Problem statement

Data Exploration
- Basic information about data
Understanding Variables and Univariate Analysis
- Exploring the numeric variables
- Exploring catagorical variables
Model Building

Problem statement

Marketing campaigns are very crucial for any institution to generate business by promoting their products. A data driven strategy can be very helpful to achieve great results. This data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls where bank representative will pitch a banking product to potential customer. The classification goal is to predict if the client will subscribe to the specific product: ‘Yes’ or ‘No’.

Data Exploration

Basic information about data

This dataset has details of bank customers. The dataset has 18 variables including the dependent variable ‘y’ which denotes if the customer subscribed for the product ‘Term Deposite’. Variable ‘y’(term deposit) is the dependent variable & rest are independent variable.

Retrieve the data from csv

bank.market<-read.csv("D:\\Dv Analytics\\Datasets\\Bank Tele Marketing\\bank_market.csv")
head(bank.market)

##   Cust_num age          job marital education default balance housing loan
## 1        1  58   management married  tertiary      no    2143     yes   no
## 2        2  44   technician  single secondary      no      29     yes   no
## 3        3  33 entrepreneur married secondary      no       2     yes  yes
## 4        4  47  blue-collar married   unknown      no    1506     yes   no
## 5        5  33      unknown  single   unknown      no       1      no   no
## 6        6  35   management married  tertiary      no     231     yes   no
##   contact day month duration campaign pdays previous poutcome  y
## 1 unknown   5   may      261        1    -1        0  unknown no
## 2 unknown   5   may      151        1    -1        0  unknown no
## 3 unknown   5   may       76        1    -1        0  unknown no
## 4 unknown   5   may       92        1    -1        0  unknown no
## 5 unknown   5   may      198        1    -1        0  unknown no
## 6 unknown   5   may      139        1    -1        0  unknown no

Basic information

See the dimension of the data.

dim(bank.market)

## [1] 45211    18

we have 18 variables, and 45211 observations. Variable names:

names(bank.market)

##  [1] "Cust_num"  "age"       "job"       "marital"   "education"
##  [6] "default"   "balance"   "housing"   "loan"      "contact"  
## [11] "day"       "month"     "duration"  "campaign"  "pdays"    
## [16] "previous"  "poutcome"  "y"

Format & structure of the data

str(bank.market)

## 'data.frame':    45211 obs. of  18 variables:
##  $ Cust_num : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
##  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
##  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Numerical variable: “age”, “duration”, “compaign”, “pdays”, “previous”, “day”, “balance” categorical variable: “job”, “marital”, “education”, “default”, “housing”, “loan”, “contact”, “month”, “poutcome” and “y”

See first five and last five rows of the data to get better understanding of variables.

head(bank.market)

##   Cust_num age          job marital education default balance housing loan
## 1        1  58   management married  tertiary      no    2143     yes   no
## 2        2  44   technician  single secondary      no      29     yes   no
## 3        3  33 entrepreneur married secondary      no       2     yes  yes
## 4        4  47  blue-collar married   unknown      no    1506     yes   no
## 5        5  33      unknown  single   unknown      no       1      no   no
## 6        6  35   management married  tertiary      no     231     yes   no
##   contact day month duration campaign pdays previous poutcome  y
## 1 unknown   5   may      261        1    -1        0  unknown no
## 2 unknown   5   may      151        1    -1        0  unknown no
## 3 unknown   5   may       76        1    -1        0  unknown no
## 4 unknown   5   may       92        1    -1        0  unknown no
## 5 unknown   5   may      198        1    -1        0  unknown no
## 6 unknown   5   may      139        1    -1        0  unknown no

tail(bank.market)

##       Cust_num age          job  marital education default balance housing
## 45206    45206  25   technician   single secondary      no     505      no
## 45207    45207  51   technician  married  tertiary      no     825      no
## 45208    45208  71      retired divorced   primary      no    1729      no
## 45209    45209  72      retired  married secondary      no    5715      no
## 45210    45210  57  blue-collar  married secondary      no     668      no
## 45211    45211  37 entrepreneur  married secondary      no    2971      no
##       loan   contact day month duration campaign pdays previous poutcome
## 45206  yes  cellular  17   nov      386        2    -1        0  unknown
## 45207   no  cellular  17   nov      977        3    -1        0  unknown
## 45208   no  cellular  17   nov      456        2    -1        0  unknown
## 45209   no  cellular  17   nov     1127        5   184        3  success
## 45210   no telephone  17   nov      508        4    -1        0  unknown
## 45211   no  cellular  17   nov      361        2   188       11    other
##         y
## 45206 yes
## 45207 yes
## 45208 yes
## 45209 yes
## 45210  no
## 45211  no

See the summary of whole data

summary(bank.market)

##     Cust_num          age                 job           marital     
##  Min.   :    1   Min.   :18.00   blue-collar:9732   divorced: 5207  
##  1st Qu.:11304   1st Qu.:33.00   management :9458   married :27214  
##  Median :22606   Median :39.00   technician :7597   single  :12790  
##  Mean   :22606   Mean   :40.94   admin.     :5171                   
##  3rd Qu.:33909   3rd Qu.:48.00   services   :4154                   
##  Max.   :45211   Max.   :95.00   retired    :2264                   
##                                  (Other)    :6835                   
##      education     default        balance       housing      loan      
##  primary  : 6851   no :44396   Min.   : -8019   no :20081   no :37967  
##  secondary:23202   yes:  815   1st Qu.:    72   yes:25130   yes: 7244  
##  tertiary :13301               Median :   448                          
##  unknown  : 1857               Mean   :  1362                          
##                                3rd Qu.:  1428                          
##                                Max.   :102127                          
##                                                                        
##       contact           day            month          duration     
##  cellular :29285   Min.   : 1.00   may    :13766   Min.   :   0.0  
##  telephone: 2906   1st Qu.: 8.00   jul    : 6895   1st Qu.: 103.0  
##  unknown  :13020   Median :16.00   aug    : 6247   Median : 180.0  
##                    Mean   :15.81   jun    : 5341   Mean   : 258.2  
##                    3rd Qu.:21.00   nov    : 3970   3rd Qu.: 319.0  
##                    Max.   :31.00   apr    : 2932   Max.   :4918.0  
##                                    (Other): 6060                   
##     campaign          pdays          previous           poutcome    
##  Min.   : 1.000   Min.   : -1.0   Min.   :  0.0000   failure: 4901  
##  1st Qu.: 1.000   1st Qu.: -1.0   1st Qu.:  0.0000   other  : 1840  
##  Median : 2.000   Median : -1.0   Median :  0.0000   success: 1511  
##  Mean   : 2.764   Mean   : 40.2   Mean   :  0.5803   unknown:36959  
##  3rd Qu.: 3.000   3rd Qu.: -1.0   3rd Qu.:  0.0000                  
##  Max.   :63.000   Max.   :871.0   Max.   :275.0000                  
##                                                                     
##    y        
##  no :39922  
##  yes: 5289  
##             
##             
##             
##             
##

Missing Values

Summary will show the minimum,maximum,mean,median,1^st quartile,3^rd quartile of all the variables in the data set. It also shows whether any variable has NA(missing) values. In our data, variables doesn’t seem to have NA values. However missing values will not always be represented by ‘NA’. These can also be represented by ‘?’,‘-’,‘.’ etc. To know whether a variable has a missing value or not, summary of that variable or a frequency table can be useful. Once we discover the representation of the missing values, we can find number of missing values in each variable.

So far this dataset does not seem to contain any missing values.

We will move to Univariate Analysis to understand more about each variable.

Understanding Variables and Univariate Analysis

Univariate analysis is a way to validate one variable at a time. It allows us to understand what a variable signifies. Also any missing values or outliers can be detected with this analysis to further sanitize the data for final model building.

Exploring the numeric variables

First we will go through all numerical variables in our dataset.

Cust_num

It is the first variable in the data. It seem to be serial number of the customer. To get summary of ‘Cust_num’

summary(bank.market$Cust_num)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1   11300   22610   22610   33910   45210

The minimum value in Cust_num column is 1, maximum value is 45210 which is number of rows in the data. Its mean and median are equal, which shows an equal distribution.

age

It specifies age of the barrower in years. It’s an integer. lets see the summary of age

summary(bank.market$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   33.00   39.00   40.94   48.00   95.00

Minimum age is 18, maximum age is 95 which is okay. Mean and median are very close which indicates outliers may not be present.

Lets see the percentile distribution.

quantile(bank.market$age,c(0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.96,0.97,0.98,0.99,1))

##   0%   1%   3%   5%   7%   9%  10%  20%  30%  40%  50%  60%  70%  80%  90% 
##   18   23   26   27   28   28   29   32   34   36   39   42   46   51   56 
##  95%  96%  97%  98%  99% 100% 
##   59   59   60   63   71   95

Percentile distribution shows only 10% of customers are younger then 28, and around 75% of the customers are between age of 28-60, meaning a more mature customer group is targeted for campaign.

Plot the boxplot of the variable

library(ggplot2)
ggplot(bank.market, aes(y=age, x = 1)) + geom_boxplot()

We can notice that good number of customers are in their early mid age. There doesn’t seem to be a sign outliers in the variable.

duration

Variable ‘duration’ specifies the duration of last call made with customer in second. let’s see the summary of duration:

summary(bank.market$duration)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   103.0   180.0   258.2   319.0  4918.0

The Min value is 0.0 but Max value is 4918.0 seconds which is around 1hour22minutes. Let’s see the percentile distribution.

quantile(bank.market$duration,c(0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.99,1))

##   0%   1%   3%   5%   7%   9%  10%  20%  30%  40%  50%  60%  70%  80%  90% 
##    0   11   22   35   46   55   58   89  117  147  180  223  280  368  548 
##  95%  99% 100% 
##  751 1269 4918

Percentile distribution shows 95% of the calls ended within first 751 seconds or around 12 minutes.

Boxplot of the variable.

ggplot(bank.market, aes(y=duration, x = 1)) + geom_boxplot()

Boxplot shows that there are quite a few outliers in the this variable. We can consider last 5% values as outliers.

campaign

This variable represents number of contacts made during this campaign and for this client, this includes the last contact. Summary of variable ‘campaign’:

summary(bank.market$campaign)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   2.764   3.000  63.000

Min value is 1 and Max value is 63. Contacting same customer 63 times seem too high this might be outlier.(Outlier doesn’t mean that this is flasly entered information, however, it can affect our predictive models)

Let’s get into percentile distribution:

quantile(bank.market$campaign,c(0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.99,1))

##   0%   1%   3%   5%   7%   9%  10%  20%  30%  40%  50%  60%  70%  80%  90% 
##    1    1    1    1    1    1    1    1    1    2    2    2    3    4    5 
##  95%  99% 100% 
##    8   16   63

We can see about 60% of customers were contacted no more than twice and 90% has been contacted less than 5 times. Last 1%ile shows too much of variance form the whole data, considering this as an outlier should be a best call.

Boxlot for better visual understanding:

ggplot(bank.market, aes(y=campaign, x = 1)) + geom_boxplot()

qplot(data = bank.market, x = campaign)

Boxplot and Barplot shows high number of cutomers were called less than 5 times and 99% of the customers were not called more than 16 times.

Lets see it’s frequency distribution.

table(bank.market$campaign)

## 
##     1     2     3     4     5     6     7     8     9    10    11    12 
## 17544 12505  5521  3522  1764  1291   735   540   327   266   201   155 
##    13    14    15    16    17    18    19    20    21    22    23    24 
##   133    93    84    79    69    51    44    43    35    23    22    20 
##    25    26    27    28    29    30    31    32    33    34    35    36 
##    22    13    10    16    16     8    12     9     6     5     4     4 
##    37    38    39    41    43    44    46    50    51    55    58    63 
##     2     3     1     2     3     1     1     2     1     1     1     1

We can see for all different measures that distribution is concentrated in a very small range and if the value is too far from the range we can consider it an outlier. Here any value that is more than 16 or above 99th percentile, we will it an outlier.

pdays

This variable represents number of days that passed by after the client was last contacted from a previous campaign. It is a numeric variable and -1 means client was not previously contacted.

Summary of pdays:

summary(bank.market$pdays)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    -1.0    -1.0    -1.0    40.2    -1.0   871.0

Minimum value is -1 and maximum value is 871 days. Mean and median have huge gap which indicates there is presence outliers.

Let’s have a look at percentile distribution:

quantile(bank.market$pdays,c(0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.99,1))

##   0%   1%   3%   5%   7%   9%  10%  20%  30%  40%  50%  60%  70%  80%  90% 
##   -1   -1   -1   -1   -1   -1   -1   -1   -1   -1   -1   -1   -1   -1  185 
##  95%  99% 100% 
##  317  370  871

we see about 80% values are given as -1, and these customers were contacted for the first time. Let’s have a look at boxplot for good measure:

ggplot(bank.market, aes(y=pdays, x = 1)) + geom_boxplot()

However, if the customer was last contacted over a year ago we can assume value more than 365 to be an outlier and 99th percentile happen to be 370(very near to 365).

previous

This variable represents number of contacts performed before this campaign and for this client, it’s a numeric value.

Summary of ‘previous’:

summary(bank.market$previous)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0000   0.0000   0.0000   0.5803   0.0000 275.0000

Minimum value is 0 maximum value is 275, mean is 0.58 but median is 0.

A look at percentile distribution:

quantile(bank.market$previous,c(0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.99,1))

##    0%    1%    3%    5%    7%    9%   10%   20%   30%   40%   50%   60% 
##   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0 
##   70%   80%   90%   95%   99%  100% 
##   0.0   0.0   2.0   3.0   8.9 275.0

Percentile distribution suggest 99% values are under 8.9, so the value 275 must be an outlier.

Visualizing the distribution of ‘previous’ for a clear view:

ggplot(bank.market, aes(y=previous, x = 1)) + geom_boxplot()

day

It’s the last contact day of the month with customer. The numeric value must be between 1 to 31.

Let’s have a look at summary to see if there is any outlier or missing value.

summary(bank.market$day)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    8.00   16.00   15.81   21.00   31.00

Summary shows min value 1 and max value being 31, which is the exact range of days in months.

Have a look at boxplot:

ggplot(bank.market, aes(y=day, x = 1)) + geom_boxplot()

Distribution of this variable seem pretty fine.

balance

It’s average yearly balance of the customer, in euros. This could have large impact on subscription by customer. Let’s have a look at summary:

summary(bank.market$balance)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -8019      72     448    1362    1428  102100

Min value is -8019 and max value is 102100, can’t really doubt these values being impossible.

quantile(bank.market$balance,c(0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.99,1))

##       0%       1%       3%       5%       7%       9%      10%      20% 
##  -8019.0   -627.0   -322.0   -172.0    -51.0      0.0      0.0     22.0 
##      30%      40%      50%      60%      70%      80%      90%      95% 
##    131.0    272.0    448.0    701.0   1126.0   1859.0   3574.0   5768.0 
##      99%     100% 
##  13164.9 102127.0

Percentile distribution shows 99% of the values are under 13164.9, the rest might be outliers.

Have a look at boxplot to understand distribution:

ggplot(bank.market, aes(y=balance, x = 1)) + geom_boxplot()

qplot(bank.market$balance, geom="histogram", binwidth = 5000)

Boxplot indicates no outliers.

Exploring catagorical variables

job

This represents type of job the customer has. we can see the the count value of each value of a catagorical variable by using summary function:

summary(bank.market$job)

##        admin.   blue-collar  entrepreneur     housemaid    management 
##          5171          9732          1487          1240          9458 
##       retired self-employed      services       student    technician 
##          2264          1579          4154           938          7597 
##    unemployed       unknown 
##          1303           288

marital

Marital status of the customer. It’s a categorical value: “married”,“divorced”,“single”. note: “divorced” means divorced or widowed

table(bank.market$marital)

## 
## divorced  married   single 
##     5207    27214    12790

education

Education level of the customer. (categorical: “unknown”,“secondary”,“primary”,“tertiary”)

table(bank.market$education)

## 
##   primary secondary  tertiary   unknown 
##      6851     23202     13301      1857

default

This variable shows if the customer has credit in default? (binary: “yes”,“no”)

summary(bank.market$default)

##    no   yes 
## 44396   815

housing

If the customer has any housing loan.

summary(bank.market$housing)

##    no   yes 
## 20081 25130

there dosn’t seem to be any outlier or missing values.

loan

If the customer has any personal loan

summary(bank.market$loan)

##    no   yes 
## 37967  7244

no missing values or outliers.

contact

The contact communication method used to aproach customer: ‘cellular’, ‘telephone’, ‘unknown’

summary(bank.market$contact)

##  cellular telephone   unknown 
##     29285      2906     13020

no missing values or any outlier in this variable

month

It’s the last contact month of the year.

summary(bank.market$month)

##   apr   aug   dec   feb   jan   jul   jun   mar   may   nov   oct   sep 
##  2932  6247   214  2649  1403  6895  5341   477 13766  3970   738   579

No missing value, no outlier.

poutcome

outcome of the previous marketing campaign.

summary(bank.market$poutcome)

## failure   other success unknown 
##    4901    1840    1511   36959

no outlier or missing values.

y

this is our output variable(desired varible). Has the client subscribed a term deposit? (binary: ‘yes’,‘no’)

summary(bank.market$poutcome)

## failure   other success unknown 
##    4901    1840    1511   36959

No missing values or outliers.

Summary of Univariate Analysis on Numerical Variables

Lets tabularize what we found in univariate analysis

Model Building

As our dependent variable ‘y’ is binary variable(Yes-No type), basic algorithm to aproach would be logistic regression.

We will need to split the data into training and testing sets. Using sample.split() function form library caTools, we will split the dataset into 80-20 ratio of training and testing set.

Initially we will go with the raw data set and then we will go with a basic claeaning process to clean the outliers or any NA values and see if the cleaning improves our results or not.

Spliting the data into training and testing set

library(caTools)
set.seed(90) # to get the same split everytime
spl = sample.split(bank.market$y, SplitRatio = 0.8)
train = subset(bank.market, spl==TRUE)
test = subset(bank.market, spl==FALSE)

Multiple logistic regression

Great thing about R is that many of pre-built modeling functions break categorical varibles into dummy variables automatically, while building models. Building a logistic regression model:

lgm1<-glm(y~age+job+marital+loan+balance+education+default+housing+contact+day+month+duration+campaign+pdays+poutcome+previous,data=train,family=binomial())
summary(lgm1)

## 
## Call:
## glm(formula = y ~ age + job + marital + loan + balance + education + 
##     default + housing + contact + day + month + duration + campaign + 
##     pdays + poutcome + previous, family = binomial(), data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.6972  -0.3756  -0.2536  -0.1500   3.4366  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -2.493e+00  2.073e-01 -12.028  < 2e-16 ***
## age                -1.606e-03  2.476e-03  -0.649 0.516565    
## jobblue-collar     -2.935e-01  8.070e-02  -3.637 0.000276 ***
## jobentrepreneur    -2.939e-01  1.400e-01  -2.099 0.035838 *  
## jobhousemaid       -4.895e-01  1.536e-01  -3.188 0.001432 ** 
## jobmanagement      -1.308e-01  8.150e-02  -1.605 0.108489    
## jobretired          3.266e-01  1.086e-01   3.009 0.002623 ** 
## jobself-employed   -2.531e-01  1.255e-01  -2.016 0.043809 *  
## jobservices        -1.965e-01  9.364e-02  -2.099 0.035852 *  
## jobstudent          3.345e-01  1.226e-01   2.728 0.006375 ** 
## jobtechnician      -1.664e-01  7.701e-02  -2.161 0.030721 *  
## jobunemployed      -2.227e-01  1.254e-01  -1.776 0.075665 .  
## jobunknown         -4.036e-01  2.672e-01  -1.510 0.130957    
## maritalmarried     -1.756e-01  6.557e-02  -2.678 0.007397 ** 
## maritalsingle       9.087e-02  7.480e-02   1.215 0.224395    
## loanyes            -4.414e-01  6.684e-02  -6.603 4.02e-11 ***
## balance             1.264e-05  5.688e-06   2.222 0.026306 *  
## educationsecondary  1.820e-01  7.249e-02   2.511 0.012026 *  
## educationtertiary   3.563e-01  8.417e-02   4.233 2.31e-05 ***
## educationunknown    2.475e-01  1.169e-01   2.116 0.034306 *  
## defaultyes          5.938e-03  1.826e-01   0.033 0.974049    
## housingyes         -6.869e-01  4.904e-02 -14.006  < 2e-16 ***
## contacttelephone   -1.168e-01  8.432e-02  -1.386 0.165835    
## contactunknown     -1.617e+00  8.129e-02 -19.893  < 2e-16 ***
## day                 8.088e-03  2.788e-03   2.901 0.003719 ** 
## monthaug           -6.964e-01  8.799e-02  -7.914 2.49e-15 ***
## monthdec            5.624e-01  1.975e-01   2.848 0.004399 ** 
## monthfeb           -1.349e-01  9.998e-02  -1.349 0.177307    
## monthjan           -1.219e+00  1.375e-01  -8.863  < 2e-16 ***
## monthjul           -7.716e-01  8.636e-02  -8.934  < 2e-16 ***
## monthjun            4.729e-01  1.039e-01   4.553 5.28e-06 ***
## monthmar            1.582e+00  1.350e-01  11.712  < 2e-16 ***
## monthmay           -3.567e-01  8.085e-02  -4.411 1.03e-05 ***
## monthnov           -8.023e-01  9.369e-02  -8.564  < 2e-16 ***
## monthoct            9.078e-01  1.229e-01   7.388 1.49e-13 ***
## monthsep            9.012e-01  1.328e-01   6.787 1.15e-11 ***
## duration            4.148e-03  7.199e-05  57.623  < 2e-16 ***
## campaign           -9.240e-02  1.150e-02  -8.033 9.50e-16 ***
## pdays              -2.183e-04  3.420e-04  -0.638 0.523284    
## poutcomeother       2.564e-01  9.937e-02   2.580 0.009884 ** 
## poutcomesuccess     2.345e+00  9.202e-02  25.485  < 2e-16 ***
## poutcomeunknown    -6.290e-02  1.087e-01  -0.579 0.562784    
## previous            3.238e-02  1.114e-02   2.906 0.003657 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26104  on 36168  degrees of freedom
## Residual deviance: 17276  on 36126  degrees of freedom
## AIC: 17362
## 
## Number of Fisher Scoring iterations: 6

Let’s predict the class on the test set and find the Accuracy, sensitivity and specificity of this logistic regression model we just built:

predicted_class<-ifelse(predict(lgm1, test)>0.5, 1, 0)
conf_matrix<-table(test$y, predicted_class)
conf_matrix

##      predicted_class
##          0    1
##   no  7853  131
##   yes  782  276

accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/sum(conf_matrix)
accuracy

## [1] 0.8990268

sensitivity<-conf_matrix[1,1]/(conf_matrix[1,1]+conf_matrix[1,2])
sensitivity

## [1] 0.9835922

specificity<-conf_matrix[2,2]/(conf_matrix[2,2]+conf_matrix[2,1])
specificity

## [1] 0.2608696

Okay, accuracy is 89.98 ~ 90%, that’s not bad. But specificity is very low, meaning proportion of negatives that are correctly identified is very low. In simple terms the model is identifying a large amount of potential non-subscribers as subsribers, which is bad. Depending on this model representatives migh waste energy on customers who he/she might not necessarily be able to convert into a subscribers. We will work on decreasing the Specificity of our model.

But first let’s remove the outliers:

Remove outliers

We created a ummary table of the continuous variables from Univariate Analysis to document the outliers and missing values. 4 Continuous Variables shows the sign of outliers: duration, campaign, pdays and previous. Removing the outliers one by one:

First create a new dataset in which we will put the changed variables to keep original dataset intact.

bank.market1 <- bank.market

Now with every modification we make in the variables with outliers can be directly stored in new dataframe bank.market1.

duration

Value above 1269 can be considered outlier, which we will replace with the median value 180.

bank.market1$duration<-ifelse(bank.market$duration>1269, 180, bank.market$duration)

summary(bank.market1$duration)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   103.0   180.0   243.5   310.0  1269.0

ggplot(bank.market1, aes(y=duration, x = 1)) + geom_boxplot()

quantile(bank.market1$duration,c(0.1, .25,.50,.75,0.8, 0.85, .90,0.95, .99,1))

##  10%  25%  50%  75%  80%  85%  90%  95%  99% 100% 
##   58  103  180  310  358  421  521  696 1051 1269

campaign

Values above 16 are outliers and should be replaced with median value : 2

bank.market1$campaign<-ifelse(bank.market$campaign>16,2,bank.market$campaign)
summary(bank.market1$campaign)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   2.552   3.000  16.000

ggplot(bank.market1, aes(y=campaign, x = 1)) + geom_boxplot()

quantile(bank.market1$campaign,c(0.1, .25,.50,.75,0.8, 0.85, .90,0.95, .99,1))

##  10%  25%  50%  75%  80%  85%  90%  95%  99% 100% 
##    1    1    2    3    4    4    5    7   12   16

pdays

Values above 370 can be considered outliers according to our observations while univariate analysis.

bank.market1$pdays<-ifelse(bank.market$pdays>370,40.2,bank.market$pdays)
summary(bank.market1$pdays)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -1.00   -1.00   -1.00   36.51   -1.00  370.00

ggplot(bank.market1, aes(y=pdays, x = 1)) + geom_boxplot()

quantile(bank.market1$pdays,c(0.1, .25,.50,.75,0.8, 0.85, .90,0.95, .99,1))

##  10%  25%  50%  75%  80%  85%  90%  95%  99% 100% 
##   -1   -1   -1   -1   -1   94  182  288  362  370

previous

1% values that are above 8.9 can be considered outliers. We will replace these values with mean 0.58.

bank.market1$previous<-ifelse(bank.market$previous>8.9,0.58,bank.market$previous)
summary(bank.market1$previous)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.4413  0.0000  8.0000

ggplot(bank.market1, aes(y=duration, x = 1)) + geom_boxplot()

quantile(bank.market1$previous,c(0.1, .25,.50,.75,0.8, 0.85, .90,0.95, .99,1))

##  10%  25%  50%  75%  80%  85%  90%  95%  99% 100% 
##    0    0    0    0    0    1    2    3    6    8

After removing outliers we can still detect a hint of outliers. We were lineant toward the deciding a boundary for the outliers to keep integrety of the data and not induce any bias. If needed we can again change the margin of outliers and replace those values.

Rebuild the model after outlier removal

Again build a Logistic Model and see if we made any improvements with outlier removal But first devide the dataset bank.market1 into training and testing sets.

library("caTools")
set.seed(90)
sample=sample.split(bank.market1,SplitRatio=.75)
train1=subset(bank.market1,sample==TRUE)
test1=subset(bank.market1,sample==FALSE)

Training a Logistic Regression Model:

lgm2<-glm(formula=y~age+job+marital+loan+balance+education+default+housing+contact+day+month+duration+campaign+pdays+poutcome+previous,data=train1,family=binomial())
summary(lgm2)

## 
## Call:
## glm(formula = y ~ age + job + marital + loan + balance + education + 
##     default + housing + contact + day + month + duration + campaign + 
##     pdays + poutcome + previous, family = binomial(), data = train1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9740  -0.3765  -0.2496  -0.1562   3.3590  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -1.835e+00  2.212e-01  -8.296  < 2e-16 ***
## age                -3.965e-04  2.567e-03  -0.154 0.877248    
## jobblue-collar     -2.522e-01  8.490e-02  -2.970 0.002974 ** 
## jobentrepreneur    -3.015e-01  1.452e-01  -2.077 0.037805 *  
## jobhousemaid       -4.194e-01  1.558e-01  -2.691 0.007124 ** 
## jobmanagement      -1.162e-01  8.586e-02  -1.353 0.176058    
## jobretired          2.912e-01  1.137e-01   2.560 0.010468 *  
## jobself-employed   -3.414e-01  1.326e-01  -2.575 0.010011 *  
## jobservices        -1.420e-01  9.752e-02  -1.456 0.145307    
## jobstudent          5.550e-01  1.278e-01   4.342 1.41e-05 ***
## jobtechnician      -1.291e-01  8.109e-02  -1.592 0.111378    
## jobunemployed      -1.376e-01  1.313e-01  -1.048 0.294644    
## jobunknown         -2.845e-01  2.703e-01  -1.053 0.292509    
## maritalmarried     -2.109e-01  6.834e-02  -3.086 0.002032 ** 
## maritalsingle       1.682e-02  7.824e-02   0.215 0.829819    
## loanyes            -3.992e-01  6.867e-02  -5.814 6.11e-09 ***
## balance             1.087e-05  5.993e-06   1.813 0.069756 .  
## educationsecondary  1.814e-01  7.516e-02   2.414 0.015797 *  
## educationtertiary   3.298e-01  8.777e-02   3.757 0.000172 ***
## educationunknown    1.983e-01  1.228e-01   1.614 0.106473    
## defaultyes          4.095e-02  1.834e-01   0.223 0.823303    
## housingyes         -5.942e-01  5.071e-02 -11.719  < 2e-16 ***
## contacttelephone   -1.483e-01  8.704e-02  -1.704 0.088382 .  
## contactunknown     -1.640e+00  8.379e-02 -19.579  < 2e-16 ***
## day                 6.748e-03  2.898e-03   2.329 0.019872 *  
## monthaug           -7.870e-01  9.259e-02  -8.500  < 2e-16 ***
## monthdec            8.486e-01  2.033e-01   4.173 3.00e-05 ***
## monthfeb           -1.772e-01  1.043e-01  -1.699 0.089347 .  
## monthjan           -1.101e+00  1.371e-01  -8.026 1.01e-15 ***
## monthjul           -8.945e-01  9.066e-02  -9.867  < 2e-16 ***
## monthjun            4.165e-01  1.094e-01   3.808 0.000140 ***
## monthmar            1.465e+00  1.409e-01  10.397  < 2e-16 ***
## monthmay           -3.901e-01  8.520e-02  -4.578 4.69e-06 ***
## monthnov           -9.691e-01  9.909e-02  -9.781  < 2e-16 ***
## monthoct            8.138e-01  1.277e-01   6.372 1.87e-10 ***
## monthsep            7.246e-01  1.415e-01   5.121 3.04e-07 ***
## duration            4.784e-03  8.447e-05  56.626  < 2e-16 ***
## campaign           -7.698e-02  1.258e-02  -6.118 9.49e-10 ***
## pdays              -4.229e-03  4.407e-04  -9.595  < 2e-16 ***
## poutcomeother       1.066e-01  1.070e-01   0.996 0.319072    
## poutcomesuccess     2.027e+00  9.484e-02  21.374  < 2e-16 ***
## poutcomeunknown    -8.315e-01  1.210e-01  -6.874 6.24e-12 ***
## previous            3.112e-02  2.313e-02   1.345 0.178528    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 23515  on 32651  degrees of freedom
## Residual deviance: 15795  on 32609  degrees of freedom
## AIC: 15881
## 
## Number of Fisher Scoring iterations: 6

Let’s validate the model by predicting values on test set:

predicted_values<-ifelse(predict(lgm2,test1)>0.5,1,0)

** Confusion Matrix, Accuracy, Sensitivity and Specificity for model lgm2**

conf_matrix<-table(test1$y, predicted_values)
conf_matrix

##      predicted_values
##           0     1
##   no  10901   176
##   yes  1113   369

accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/sum(conf_matrix)
accuracy

## [1] 0.8973644

sensitivity<-conf_matrix[1,1]/(conf_matrix[1,1]+conf_matrix[1,2])
sensitivity

## [1] 0.9841112

specificity<-conf_matrix[2,2]/(conf_matrix[2,2]+conf_matrix[2,1])
specificity

## [1] 0.2489879

With the new Logistic Model that we have built after removing the outliers, we don’t see any improvement in the results.

We can move further and build a Decision Tree Model.

Building Decission tree

We will build Decision Tree Model and try to fit it with the training set train1 derived from cleaned dataset.

library("rpart")
banktree1<-rpart(y~age+job+marital+loan+balance+education+default+housing+contact+day+month+duration+campaign+pdays+poutcome+previous,method="class",data=train1,control=rpart.control(minsplit=60))
summary(banktree1)

## Call:
## rpart(formula = y ~ age + job + marital + loan + balance + education + 
##     default + housing + contact + day + month + duration + campaign + 
##     pdays + poutcome + previous, data = train1, method = "class", 
##     control = rpart.control(minsplit = 60))
##   n= 32652 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.03204623      0 1.0000000 1.0000000 0.01523312
## 2 0.02495403      3 0.9038613 0.9419490 0.01484089
## 3 0.01365905      4 0.8789073 0.8857368 0.01444414
## 4 0.01000000      5 0.8652482 0.8807460 0.01440806
## 
## Variable importance
## duration poutcome 
##       58       42 
## 
## Node number 1: 32652 observations,    complexity param=0.03204623
##   predicted class=no   expected loss=0.1165932  P(node) =1
##     class counts: 28845  3807
##    probabilities: 0.883 0.117 
##   left son=2 (29384 obs) right son=3 (3268 obs)
##   Primary splits:
##       duration < 521.5   to the left,   improve=678.6640, (0 missing)
##       poutcome splits as  LLRL,         improve=616.6600, (0 missing)
##       month    splits as  LLRLLLLRLLRR, improve=371.3537, (0 missing)
##       pdays    < 8.5     to the left,   improve=189.3622, (0 missing)
##       previous < 0.29    to the left,   improve=187.3590, (0 missing)
## 
## Node number 2: 29384 observations,    complexity param=0.03204623
##   predicted class=no   expected loss=0.08259597  P(node) =0.8999142
##     class counts: 26957  2427
##    probabilities: 0.917 0.083 
##   left son=4 (28436 obs) right son=5 (948 obs)
##   Primary splits:
##       poutcome splits as  LLRL,         improve=553.1033, (0 missing)
##       month    splits as  LLRLLLLRLLRR, improve=360.2030, (0 missing)
##       duration < 178.5   to the left,   improve=189.3252, (0 missing)
##       pdays    < 8.5     to the left,   improve=171.0930, (0 missing)
##       previous < 0.29    to the left,   improve=169.1340, (0 missing)
## 
## Node number 3: 3268 observations,    complexity param=0.03204623
##   predicted class=no   expected loss=0.4222766  P(node) =0.1000858
##     class counts:  1888  1380
##    probabilities: 0.578 0.422 
##   left son=6 (2194 obs) right son=7 (1074 obs)
##   Primary splits:
##       duration < 800.5   to the left,   improve=69.66139, (0 missing)
##       contact  splits as  RRL,          improve=46.43019, (0 missing)
##       poutcome splits as  LLRL,         improve=41.75036, (0 missing)
##       month    splits as  LRRLLLLRLLRR, improve=24.18966, (0 missing)
##       pdays    < 0       to the left,   improve=20.38622, (0 missing)
##   Surrogate splits:
##       age      < 81      to the left,  agree=0.672, adj=0.002, (0 split)
##       balance  < -2385.5 to the right, agree=0.672, adj=0.002, (0 split)
##       pdays    < 369     to the left,  agree=0.672, adj=0.001, (0 split)
##       previous < 7.5     to the left,  agree=0.672, adj=0.001, (0 split)
## 
## Node number 4: 28436 observations
##   predicted class=no   expected loss=0.06488254  P(node) =0.8708808
##     class counts: 26591  1845
##    probabilities: 0.935 0.065 
## 
## Node number 5: 948 observations,    complexity param=0.02495403
##   predicted class=yes  expected loss=0.3860759  P(node) =0.02903344
##     class counts:   366   582
##    probabilities: 0.386 0.614 
##   left son=10 (185 obs) right son=11 (763 obs)
##   Primary splits:
##       duration < 136.5   to the left,   improve=63.166340, (0 missing)
##       month    splits as  RRRRRRRRLLRR, improve=12.627290, (0 missing)
##       housing  splits as  RL,           improve=11.976270, (0 missing)
##       pdays    < 202.5   to the right,  improve=10.329580, (0 missing)
##       age      < 58.5    to the left,   improve= 6.278532, (0 missing)
##   Surrogate splits:
##       contact  splits as  RRL,         agree=0.813, adj=0.043, (0 split)
##       campaign < 6.5     to the right, agree=0.809, adj=0.022, (0 split)
##       balance  < -82     to the left,  agree=0.806, adj=0.005, (0 split)
##       default  splits as  RL,          agree=0.806, adj=0.005, (0 split)
## 
## Node number 6: 2194 observations,    complexity param=0.01365905
##   predicted class=no   expected loss=0.3500456  P(node) =0.06719343
##     class counts:  1426   768
##    probabilities: 0.650 0.350 
##   left son=12 (2102 obs) right son=13 (92 obs)
##   Primary splits:
##       poutcome splits as  LLRL,        improve=35.93525, (0 missing)
##       contact  splits as  RRL,         improve=33.09535, (0 missing)
##       pdays    < 0       to the left,  improve=24.43706, (0 missing)
##       previous < 0.29    to the left,  improve=24.43706, (0 missing)
##       duration < 647.5   to the left,  improve=18.92576, (0 missing)
## 
## Node number 7: 1074 observations
##   predicted class=yes  expected loss=0.4301676  P(node) =0.03289232
##     class counts:   462   612
##    probabilities: 0.430 0.570 
## 
## Node number 10: 185 observations
##   predicted class=no   expected loss=0.2432432  P(node) =0.005665809
##     class counts:   140    45
##    probabilities: 0.757 0.243 
## 
## Node number 11: 763 observations
##   predicted class=yes  expected loss=0.2961992  P(node) =0.02336763
##     class counts:   226   537
##    probabilities: 0.296 0.704 
## 
## Node number 12: 2102 observations
##   predicted class=no   expected loss=0.3311132  P(node) =0.06437584
##     class counts:  1406   696
##    probabilities: 0.669 0.331 
## 
## Node number 13: 92 observations
##   predicted class=yes  expected loss=0.2173913  P(node) =0.002817592
##     class counts:    20    72
##    probabilities: 0.217 0.783

Plotting the tree:

library(rattle)

## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.

library("rpart.plot")
fancyRpartPlot(banktree1)

** Predicting Values, Calculating accuracy and confusion matrix**

predicted_values<-predict(banktree1,test1, type="class")

conf_matrix<-table(test1$y, predicted_values)
conf_matrix

##      predicted_values
##          no   yes
##   no  10810   267
##   yes  1019   463

accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/sum(conf_matrix)
accuracy

## [1] 0.8976033

sensitivity<-conf_matrix[1,1]/(conf_matrix[1,1]+conf_matrix[1,2])
sensitivity

## [1] 0.975896

specificity<-conf_matrix[2,2]/(conf_matrix[2,2]+conf_matrix[2,1])
specificity

## [1] 0.3124157

With the model we have built, we need a good specificity too, otherwise the model results will make bank representatives to call the customers who migh not be interested in the product. This will make representatives waste time on potential non-buyers.

We can prune the tree to make itmore generic and expect the specificity to improve.

Pruning the tree and Complexity Parameter

Effect of complexity parameter can be understood by using function plotcp() and printcp() which are inbuilt with rpart library. printcp() shows the Training error, cross validation error and standard deviation at each node. – summary() function also provides the same details.

printcp(banktree1)

## 
## Classification tree:
## rpart(formula = y ~ age + job + marital + loan + balance + education + 
##     default + housing + contact + day + month + duration + campaign + 
##     pdays + poutcome + previous, data = train1, method = "class", 
##     control = rpart.control(minsplit = 60))
## 
## Variables actually used in tree construction:
## [1] duration poutcome
## 
## Root node error: 3807/32652 = 0.11659
## 
## n= 32652 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.032046      0   1.00000 1.00000 0.015233
## 2 0.024954      3   0.90386 0.94195 0.014841
## 3 0.013659      4   0.87891 0.88574 0.014444
## 4 0.010000      5   0.86525 0.88075 0.014408

plotcp(banktree1)

From the observations above, we can choose a CP value to further prune the tree. Intutively we can choose cp as 0.013 and see the effect.

Post pruning the model

banktree_prune1 <- prune(banktree1, cp=0.013)

plotcp(banktree_prune1)

printcp(banktree_prune1)

## 
## Classification tree:
## rpart(formula = y ~ age + job + marital + loan + balance + education + 
##     default + housing + contact + day + month + duration + campaign + 
##     pdays + poutcome + previous, data = train1, method = "class", 
##     control = rpart.control(minsplit = 60))
## 
## Variables actually used in tree construction:
## [1] duration poutcome
## 
## Root node error: 3807/32652 = 0.11659
## 
## n= 32652 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.032046      0   1.00000 1.00000 0.015233
## 2 0.024954      3   0.90386 0.94195 0.014841
## 3 0.013659      4   0.87891 0.88574 0.014444
## 4 0.010000      5   0.86525 0.88075 0.014408

predicted_values<-predict(banktree_prune1, test1, type="class")

conf_matrix<-table(test1$y, predicted_values)
conf_matrix

##      predicted_values
##          no   yes
##   no  10810   267
##   yes  1019   463

accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/sum(conf_matrix)
accuracy

## [1] 0.8976033

sensitivity<-conf_matrix[1,1]/(conf_matrix[1,1]+conf_matrix[1,2])
sensitivity

## [1] 0.975896

specificity<-conf_matrix[2,2]/(conf_matrix[2,2]+conf_matrix[2,1])
specificity

## [1] 0.3124157

We don’t see any improvement in the specificity of our model from pruning.

We can take other approach and use a random forest model.

library(randomForest)

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

rf_model <- randomForest(y~ ., ntree=50,   mtry=ncol(train1)/3, data=train1)

predicted_values<-predict(rf_model, test1)
 
conf_matrix<-table(test1$y, predicted_values)
conf_matrix

##      predicted_values
##          no   yes
##   no  10639   438
##   yes   717   765

accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/sum(conf_matrix)
accuracy

## [1] 0.9080341

sensitivity<-conf_matrix[1,1]/(conf_matrix[1,1]+conf_matrix[1,2])
sensitivity

## [1] 0.9604586

specificity<-conf_matrix[2,2]/(conf_matrix[2,2]+conf_matrix[2,1])
specificity

## [1] 0.5161943

Seem like a random forest model is doing a pretty good job getting better specificity significantly. However, it’s still not that good but we reached to good result.