You can download the datasets and R code file for this session here.
Problem Statement
- An insurance company need to come up with a good marketing strategy. They want to run an e-mail campaign. Before sending a mail to all the available e-mail addresses, they want to first build a predictive model and identify the customers who are most likely to respond.
- Analyze the historical data and build a predictive model that helps us in maximizing the response rate
Import dataset with the following command from your folder
DirectMail.dataset <- read.delim("C:/Amrita/Datavedi/Case Study/DirectMail.txt")
Details of the Dataset
dimension of data
dim(DirectMail.dataset)
## [1] 29904 26
Summary of the dataset
Summary and Structure
str(DirectMail.dataset)
## 'data.frame': 29904 obs. of 26 variables:
## $ AGE : int 20 18 19 18 18 19 19 20 19 19 ...
## $ CRED : int 543 445 423 416 522 467 334 519 480 309 ...
## $ MS : Factor w/ 3 levels "M","U","X": 2 2 2 3 2 2 3 2 1 2 ...
## $ HEQ : num 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
## $ INCOME : int 20 20 20 20 20 20 20 20 20 20 ...
## $ DEPC : Factor w/ 2 levels "N","Y": 1 1 1 1 2 2 1 1 1 2 ...
## $ MOB : Factor w/ 2 levels "N","Y": 2 2 1 2 2 1 2 2 2 2 ...
## $ MILEAGE: num 14.418 0.311 17.876 17.084 16.51 ...
## $ RESTYPE: Factor w/ 4 levels "CONDO","COOP",..: 3 3 3 3 3 3 4 3 4 3 ...
## $ GENDER : Factor w/ 2 levels "F","M": 2 2 2 1 1 1 2 2 2 2 ...
## $ EMP_STA: Factor w/ 3 levels "0","1,2","3+": 2 1 2 2 2 2 2 2 3 2 ...
## $ RES_STA: Factor w/ 2 levels "1,2","3+": 1 1 1 1 1 1 1 1 1 1 ...
## $ DELINQ : int 0 2 1 0 1 0 0 4 4 0 ...
## $ NUMTR : int 2 2 0 0 0 0 1 3 1 0 ...
## $ MRTGI : Factor w/ 3 levels "N","U","Y": 1 3 3 3 3 3 1 2 1 3 ...
## $ MFDU : int 0 0 0 0 0 0 1 0 1 0 ...
## $ resp : int 0 0 0 0 0 0 1 0 0 0 ...
## $ female : int 0 0 0 1 1 1 0 0 0 0 ...
## $ HOME : int 1 1 1 1 1 1 0 1 0 1 ...
## $ CONDO : int 0 0 0 0 0 0 0 0 0 0 ...
## $ COOP : int 0 0 0 0 0 0 0 0 0 0 ...
## $ renter : int 1 1 1 1 1 1 1 1 1 1 ...
## $ emp1 : int 1 0 1 1 1 1 1 1 0 1 ...
## $ emp2 : int 0 0 0 0 0 0 0 0 1 0 ...
## $ msn : int 0 0 0 0 0 0 0 0 1 0 ...
## $ cuscode: int 1 2 3 4 5 6 7 8 9 10 ...
summary(DirectMail.dataset)
## AGE CRED MS HEQ
## Min. :18.0 Min. : 300.0 M:17221 Min. : 0.10
## 1st Qu.:36.0 1st Qu.: 574.0 U:11721 1st Qu.: 10.00
## Median :50.0 Median : 617.0 X: 962 Median : 30.00
## Mean :49.3 Mean : 603.6 Mean : 38.33
## 3rd Qu.:61.0 3rd Qu.: 652.0 3rd Qu.: 50.00
## Max. :90.0 Max. :1789.0 Max. :200.00
## NA's :33
## INCOME DEPC MOB MILEAGE RESTYPE
## Min. : 20.00 N:18502 N:11340 Min. : 0.000 CONDO : 566
## 1st Qu.: 30.00 Y:11402 Y:18564 1st Qu.: 7.776 COOP : 885
## Median : 40.00 Median :12.537 HOME :16365
## Mean : 41.36 Mean :11.803 RENTER:12088
## 3rd Qu.: 50.00 3rd Qu.:16.193
## Max. :110.00 Max. :94.640
##
## GENDER EMP_STA RES_STA DELINQ NUMTR
## F:12628 0 : 968 1,2:28109 Min. :0.0000 Min. :0.0000
## M:17276 1,2:27146 3+ : 1795 1st Qu.:0.0000 1st Qu.:0.0000
## 3+ : 1790 Median :0.0000 Median :0.0000
## Mean :0.7662 Mean :0.7633
## 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :7.0000 Max. :7.0000
##
## MRTGI MFDU resp female
## N:17484 Min. :0.0000 Min. :0.00000 Min. :0.0000
## U: 1763 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Y:10657 Median :0.0000 Median :0.00000 Median :0.0000
## Mean :0.4527 Mean :0.09554 Mean :0.4223
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.00000 Max. :1.0000
##
## HOME CONDO COOP renter
## Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :1
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:1
## Median :1.0000 Median :0.00000 Median :0.00000 Median :1
## Mean :0.5473 Mean :0.01893 Mean :0.02959 Mean :1
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:1
## Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1
##
## emp1 emp2 msn cuscode
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. : 1
## 1st Qu.:1.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.: 7477
## Median :1.0000 Median :0.00000 Median :1.0000 Median :14952
## Mean :0.9078 Mean :0.05986 Mean :0.5759 Mean :14952
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:22428
## Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :29904
##
Check for nas in each variable check if any na values are present in the dataset and count the NA values in the dataset
sum(is.na(DirectMail.dataset))
## [1] 33
Lets directly check in which columns the na values are present.
col<-names(DirectMail.dataset)
nas<-c(sum(is.na(DirectMail.dataset$AGE)),sum(is.na(DirectMail.dataset$CRED)),sum(is.na(DirectMail.dataset$MS)),sum(is.na(DirectMail.dataset$HEQ)),sum(is.na(DirectMail.dataset$INCOME)),sum(is.na(DirectMail.dataset$DEPC)),sum(is.na(DirectMail.dataset$MOB)),sum(is.na(DirectMail.dataset$MILEAGE)),sum(is.na(DirectMail.dataset$RESTYPE)),sum(is.na(DirectMail.dataset$GENDER)),sum(is.na(DirectMail.dataset$EMP_STA)),sum(is.na(DirectMail.dataset$RES_STA)),sum(is.na(DirectMail.dataset$DELINQ)),sum(is.na(DirectMail.dataset$NUMTR)),sum(is.na(DirectMail.dataset$MRTGI)),sum(is.na(DirectMail.dataset$MFDU)),sum(is.na(DirectMail.dataset$resp)),sum(is.na(DirectMail.dataset$female)),sum(is.na(DirectMail.dataset$HOME)),sum(is.na(DirectMail.dataset$CONDO)),sum(is.na(DirectMail.dataset$COOP)),sum(is.na(DirectMail.dataset$renter)),sum(is.na(DirectMail.dataset$emp1)),sum(is.na(DirectMail.dataset$emp2)),sum(is.na(DirectMail.dataset$msn)),sum(is.na(DirectMail.dataset$cuscode)))
table<-data.frame(col,nas)
table
## col nas
## 1 AGE 0
## 2 CRED 33
## 3 MS 0
## 4 HEQ 0
## 5 INCOME 0
## 6 DEPC 0
## 7 MOB 0
## 8 MILEAGE 0
## 9 RESTYPE 0
## 10 GENDER 0
## 11 EMP_STA 0
## 12 RES_STA 0
## 13 DELINQ 0
## 14 NUMTR 0
## 15 MRTGI 0
## 16 MFDU 0
## 17 resp 0
## 18 female 0
## 19 HOME 0
## 20 CONDO 0
## 21 COOP 0
## 22 renter 0
## 23 emp1 0
## 24 emp2 0
## 25 msn 0
## 26 cuscode 0
So here we see all the na values are present in the CRED column(no of NA’s in dataset and CRED coulmn is same),this column need to be taken care in data cleaning
summary of each variable
univariate analysis
Age
Age of the customer
class(DirectMail.dataset$AGE)
## [1] "integer"
summary(DirectMail.dataset$AGE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.0 36.0 50.0 49.3 61.0 90.0
boxplot(DirectMail.dataset$AGE)
Credit score
Credit score is the creditworthiness of the person
class(DirectMail.dataset$CRED)
## [1] "integer"
summary(DirectMail.dataset$CRED)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 300.0 574.0 617.0 603.6 652.0 1789.0 33
boxplot(DirectMail.dataset$CRED)
Outliers are present in Credit score
MS
class(DirectMail.dataset$MS)
## [1] "factor"
summary(DirectMail.dataset$MS)
## M U X
## 17221 11721 962
barplot(table(DirectMail.dataset$MS))
Home Equity
Home equity is the value of ownership built up in a home or property that represents the current market value of the house less any remaining mortgage payments.
class(DirectMail.dataset$HEQ)
## [1] "numeric"
summary(DirectMail.dataset$HEQ)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.10 10.00 30.00 38.33 50.00 200.00
boxplot(DirectMail.dataset$HEQ)
outliers are present in Home Equity
Income of the customer
Income of the customer
class(DirectMail.dataset$INCOME)
## [1] "integer"
summary(DirectMail.dataset$INCOME)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.00 30.00 40.00 41.36 50.00 110.00
boxplot(DirectMail.dataset$INCOME)
outlier is present in Income
Depriciation
a reduction in the value of an asset.
class(DirectMail.dataset$DEPC)
## [1] "factor"
summary(DirectMail.dataset$DEPC)
## N Y
## 18502 11402
table(DirectMail.dataset$DEPC)
##
## N Y
## 18502 11402
barplot(table(DirectMail.dataset$DEPC))
Existing Customer
If object is existing or a new customer
class(DirectMail.dataset$MOB)
## [1] "factor"
summary(DirectMail.dataset$MOB)
## N Y
## 11340 18564
table(DirectMail.dataset$MOB)
##
## N Y
## 11340 18564
barplot(table(DirectMail.dataset$MOB))
Mileage
Mileage of the customer vehicle
class(DirectMail.dataset$MILEAGE)
## [1] "numeric"
summary(DirectMail.dataset$MILEAGE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 7.776 12.540 11.800 16.190 94.640
boxplot(DirectMail.dataset$MILEAGE)
Outliers are present in mileage
Real Estate Type
Type of the house,customer is living in.
class(DirectMail.dataset$RESTYPE)
## [1] "factor"
summary(DirectMail.dataset$RESTYPE)
## CONDO COOP HOME RENTER
## 566 885 16365 12088
table(DirectMail.dataset$RESTYPE)
##
## CONDO COOP HOME RENTER
## 566 885 16365 12088
barplot(table(DirectMail.dataset$RESTYPE))
Gender
Gender of the customer
class(DirectMail.dataset$GENDER)
## [1] "factor"
summary(DirectMail.dataset$GENDER)
## F M
## 12628 17276
table(DirectMail.dataset$GENDER)
##
## F M
## 12628 17276
barplot(table(DirectMail.dataset$GENDER))
Employer status
class(DirectMail.dataset$EMP_STA)
## [1] "factor"
summary(DirectMail.dataset$EMP_STA)
## 0 1,2 3+
## 968 27146 1790
table(DirectMail.dataset$EMP_STA)
##
## 0 1,2 3+
## 968 27146 1790
barplot(table(DirectMail.dataset$EMP_STA))
Residential status
Residentail status of the customer
class(DirectMail.dataset$RES_STA)
## [1] "factor"
summary(DirectMail.dataset$RES_STA)
## 1,2 3+
## 28109 1795
table(DirectMail.dataset$RES_STA)
##
## 1,2 3+
## 28109 1795
barplot(table(DirectMail.dataset$RES_STA))
Delinquency Status
Delinquency is Failure in repaying the borrowed sum
class(DirectMail.dataset$DELINQ)
## [1] "integer"
summary(DirectMail.dataset$DELINQ)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.7662 1.0000 7.0000
table(DirectMail.dataset$DELINQ)
##
## 0 1 2 3 4 5 6 7
## 17688 5991 3084 1984 1071 29 28 29
barplot(table(DirectMail.dataset$DELINQ))
Number of active trades
Buying and selling the properties in a very short duration
class(DirectMail.dataset$NUMTR)
## [1] "integer"
summary(DirectMail.dataset$NUMTR)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.7633 1.0000 7.0000
table(DirectMail.dataset$NUMTR)
##
## 0 1 2 3 4 5 6 7
## 17778 5923 3046 1997 1071 26 34 29
barplot(table(DirectMail.dataset$NUMTR))
Mortgage Indicator
If customer has Mortagaged properties
class(DirectMail.dataset$MRTGI)
## [1] "factor"
summary(DirectMail.dataset$MRTGI)
## N U Y
## 17484 1763 10657
table(DirectMail.dataset$MRTGI)
##
## N U Y
## 17484 1763 10657
barplot(table(DirectMail.dataset$MRTGI))
Multiple family Dwelling input
class(DirectMail.dataset$MFDU)
## [1] "integer"
summary(DirectMail.dataset$MFDU)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.4527 1.0000 1.0000
table(DirectMail.dataset$MFDU)
##
## 0 1
## 16365 13539
barplot(table(DirectMail.dataset$MFDU))
Response
Our target output is the Response
Recorded response
class(DirectMail.dataset$resp)
## [1] "integer"
summary(DirectMail.dataset$resp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09554 0.00000 1.00000
table(DirectMail.dataset$resp)
##
## 0 1
## 27047 2857
barplot(table(DirectMail.dataset$resp))
We see that response is more skewed towards 0’s.This is an Unbalanced Data.
Female
class(DirectMail.dataset$female)
## [1] "integer"
summary(DirectMail.dataset$female)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.4223 1.0000 1.0000
table(DirectMail.dataset$female)
##
## 0 1
## 17276 12628
barplot(table(DirectMail.dataset$female))
Home Indicator
class(DirectMail.dataset$HOME)
## [1] "integer"
summary(DirectMail.dataset$HOME)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 1.0000 0.5473 1.0000 1.0000
table(DirectMail.dataset$HOME)
##
## 0 1
## 13539 16365
barplot(table(DirectMail.dataset$HOME))
Condominium Indicator
class(DirectMail.dataset$CONDO)
## [1] "integer"
summary(DirectMail.dataset$CONDO)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.01893 0.00000 1.00000
table(DirectMail.dataset$CONDO)
##
## 0 1
## 29338 566
barplot(table(DirectMail.dataset$CONDO))
Co-Op Residence Indicator
class(DirectMail.dataset$COOP)
## [1] "integer"
summary(DirectMail.dataset$COOP)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.02959 0.00000 1.00000
table(DirectMail.dataset$COOP)
##
## 0 1
## 29019 885
barplot(table(DirectMail.dataset$COOP))
Rental Home Indicator
class(DirectMail.dataset$renter)
## [1] "integer"
summary(DirectMail.dataset$renter)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 1 1 1 1 1
table(DirectMail.dataset$renter)
##
## 1
## 29904
barplot(table(DirectMail.dataset$renter))
Employee1
class(DirectMail.dataset$emp1)
## [1] "integer"
summary(DirectMail.dataset$emp1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 1.0000 1.0000 0.9078 1.0000 1.0000
table(DirectMail.dataset$emp1)
##
## 0 1
## 2758 27146
barplot(table(DirectMail.dataset$emp1))
Employee2
class(DirectMail.dataset$emp2)
## [1] "integer"
summary(DirectMail.dataset$emp2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05986 0.00000 1.00000
table(DirectMail.dataset$emp2)
##
## 0 1
## 28114 1790
barplot(table(DirectMail.dataset$emp2))
Medical Safety Net Program
class(DirectMail.dataset$msn)
## [1] "integer"
summary(DirectMail.dataset$msn)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 1.0000 0.5759 1.0000 1.0000
table(DirectMail.dataset$msn)
##
## 0 1
## 12683 17221
barplot(table(DirectMail.dataset$msn))
Customer Identification Code
class(DirectMail.dataset$cuscode)
## [1] "integer"
summary(DirectMail.dataset$cuscode)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 7477 14950 14950 22430 29900
boxplot(DirectMail.dataset$cuscode)
bivariate analysis
checking the relation between variables
plot(DirectMail.dataset$RESTYPE,DirectMail.dataset$HOME,xlab="RESTYPE",ylab="HOME")
plot(DirectMail.dataset$RESTYPE,DirectMail.dataset$CONDO,xlab="RESTYPE",ylab="CONDO")
plot(DirectMail.dataset$RESTYPE,DirectMail.dataset$COOP,xlab="RESTYPE",ylab="COOP")
plot(DirectMail.dataset$renter,DirectMail.dataset$cuscode)
we see the variables HOME,CONDO,COOP,RENTER are derived from the Variable RESTYPE.
All the cases HOME,CONDO,COOP fall in Rental status,so it is always one.
let us check for the correlation
between numeric variables
#age-HEQ
cor(DirectMail.dataset$AGE,DirectMail.dataset$HEQ)
## [1] 0.2606553
#age-CRED
cor(DirectMail.dataset$AGE,DirectMail.dataset$CRED)
## [1] NA
#age-Income
cor(DirectMail.dataset$AGE,DirectMail.dataset$INCOME)
## [1] 0.05004371
#age-Mileage
cor(DirectMail.dataset$AGE,DirectMail.dataset$MILEAGE)
## [1] -0.002332406
#age-cuscode
cor(DirectMail.dataset$AGE,DirectMail.dataset$cuscode)
## [1] 0.9026553
between category variables
include library-ltm for using biserial.cor()
library("ltm")
## Warning: package 'ltm' was built under R version 3.1.3
## Loading required package: MASS
## Loading required package: msm
## Warning: package 'msm' was built under R version 3.1.3
## Loading required package: polycor
## Warning: package 'polycor' was built under R version 3.1.3
## Loading required package: mvtnorm
## Warning: package 'mvtnorm' was built under R version 3.1.3
## Loading required package: sfsmisc
## Warning: package 'sfsmisc' was built under R version 3.1.3
#find correlation between few variables
#age-resp
biserial.cor(DirectMail.dataset$AGE,DirectMail.dataset$resp)
## [1] 0.03099026
#income-resp
biserial.cor(DirectMail.dataset$INCOME,DirectMail.dataset$resp)
## [1] 0.04153282
#HEQ-resp
biserial.cor(DirectMail.dataset$HEQ,DirectMail.dataset$resp)
## [1] 0.006410327
#CRED-resp
biserial.cor(DirectMail.dataset$CRED,DirectMail.dataset$resp)
## [1] NA
#Mileage-resp
biserial.cor(DirectMail.dataset$MILEAGE,DirectMail.dataset$resp)
## [1] -0.01935468
#resp-cuscode
biserial.cor(DirectMail.dataset$cuscode,DirectMail.dataset$resp)
## [1] 0.03345851
check for correlation of catogorical variables
let us check for correlation of few variables
library(vcd)
## Warning: package 'vcd' was built under R version 3.1.3
## Loading required package: grid
t1<-table(DirectMail.dataset$emp1,DirectMail.dataset$emp2)
assocstats(t1)
## X^2 df P(> X^2)
## Likelihood Ratio 9976.5 1 0
## Pearson 18740.1 1 0
##
## Phi-Coefficient : 0.792
## Contingency Coeff.: 0.621
## Cramer's V : 0.792
t2<-table(DirectMail.dataset$MS,DirectMail.dataset$MOB)
assocstats(t2)
## X^2 df P(> X^2)
## Likelihood Ratio 6.1726 2 0.045670
## Pearson 6.1928 2 0.045211
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.014
## Cramer's V : 0.014
t3<-table(DirectMail.dataset$MS,DirectMail.dataset$GENDER)
assocstats(t3)
## X^2 df P(> X^2)
## Likelihood Ratio 1.7613 2 0.41451
## Pearson 1.7647 2 0.41382
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.008
## Cramer's V : 0.008
Regression Model
let us see the linear regression model
model1<-lm(DirectMail.dataset$resp~DirectMail.dataset$AGE,data = DirectMail.dataset)
summary(model1)
##
## Call:
## lm(formula = DirectMail.dataset$resp ~ DirectMail.dataset$AGE,
## data = DirectMail.dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.11388 -0.10216 -0.09337 -0.08693 0.92831
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1244291 0.0056499 22.023 < 2e-16 ***
## DirectMail.dataset$AGE -0.0005860 0.0001093 -5.362 8.31e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2938 on 29902 degrees of freedom
## Multiple R-squared: 0.0009604, Adjusted R-squared: 0.000927
## F-statistic: 28.75 on 1 and 29902 DF, p-value: 8.312e-08
Here response is a logical variable we use the logistic regression
logistic regression With respect to age
model_age<-glm(resp~AGE,family = binomial(logit),data = DirectMail.dataset)
summary(model_age)
##
## Call:
## glm(formula = resp ~ AGE, family = binomial(logit), data = DirectMail.dataset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.4949 -0.4639 -0.4418 -0.4262 2.2837
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.914613 0.064455 -29.705 < 2e-16 ***
## AGE -0.006851 0.001279 -5.355 8.57e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 18850 on 29903 degrees of freedom
## Residual deviance: 18821 on 29902 degrees of freedom
## AIC: 18825
##
## Number of Fisher Scoring iterations: 5
From the P-value we can conclude age significantly impacts our dependent variable response.
plot(DirectMail.dataset$AGE,DirectMail.dataset$resp)
curve(predict(model_age,data.frame(AGE=x),type = "resp"),add = "TRUE")
Let’s check the confusion matrix and accuracy with a threshold value of 0.4
###confusion matrix
threshold=0.4
predprop<-predict(model_age,type = "response")
predicted_values<-ifelse(predict(model_age,type = "response")>threshold,1,0)
actualvalue<-DirectMail.dataset$resp
confmat<-table(predicted_values,actualvalue)
confmat
## actualvalue
## predicted_values 0 1
## 0 27047 2857
accuracy<-(confmat[1,1])/(sum(confmat))
accuracy
## [1] 0.9044609
With the threshold value of 0.4 model predicted all cases as 0. Though accuracy is 90% this is not a good model as we couldn’t identify a single responder from this model. Let’s check logistic regression with respect to income
model_inc<-glm(resp~INCOME,family = binomial(logit),data = DirectMail.dataset)
summary(model_inc)
##
## Call:
## glm(formula = resp ~ INCOME, family = binomial(logit), data = DirectMail.dataset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.4934 -0.4708 -0.4491 -0.4283 2.4459
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.845522 0.058199 -31.711 < 2e-16 ***
## INCOME -0.009947 0.001384 -7.186 6.65e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 18850 on 29903 degrees of freedom
## Residual deviance: 18795 on 29902 degrees of freedom
## AIC: 18799
##
## Number of Fisher Scoring iterations: 5
From the P-value we can conclude Income significantly impacts our dependent variable response.
plot(DirectMail.dataset$INCOME,DirectMail.dataset$resp)
Let us decide the threshold level using ROC curve :
library(pROC)
## Warning: package 'pROC' was built under R version 3.1.3
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
predicted_prob<-predict(model_inc,type="response")
roccurve <- roc(model_inc$y, predicted_prob)
plot(roccurve)
##
## Call:
## roc.default(response = model_inc$y, predictor = predicted_prob)
##
## Data: predicted_prob in 27047 controls (model_inc$y 0) < 2857 cases (model_inc$y 1).
## Area under the curve: 0.5411
plot(specificity + sensitivity ~ threshold, t(coords(roccurve, seq(0, 1, 0.001))), type = "l")
coords(roccurve, x="local maximas", input="sensitivity", ret=c("sensitivity","specificity","threshold","tn","tp","fn","fp"))
## local maximas local maximas local maximas local maximas
## sensitivity 1 9.891495e-01 9.632482e-01 8.855443e-01
## specificity 0 1.678560e-02 5.202056e-02 1.410877e-01
## threshold -Inf 6.160358e-02 7.649067e-02 8.381833e-02
## tn 0 4.540000e+02 1.407000e+03 3.816000e+03
## tp 2857 2.826000e+03 2.752000e+03 2.530000e+03
## fn 0 3.100000e+01 1.050000e+02 3.270000e+02
## fp 27047 2.659300e+04 2.564000e+04 2.323100e+04
## local maximas local maximas local maximas local maximas
## sensitivity 6.940847e-01 3.923696e-01 1.694085e-01 0
## specificity 3.555662e-01 6.702407e-01 8.708544e-01 1
## threshold 9.177808e-02 1.004107e-01 1.097570e-01 Inf
## tn 9.617000e+03 1.812800e+04 2.355400e+04 27047
## tp 1.983000e+03 1.121000e+03 4.840000e+02 0
## fn 8.740000e+02 1.736000e+03 2.373000e+03 2857
## fp 1.743000e+04 8.919000e+03 3.493000e+03 0
###confusion matrix
threshold=0.105
#threshold taken from graph and the above output
predprop2<-predict(model_inc,type = "response")
predicted_values2<-ifelse(predict(model_inc,type = "response")>threshold,1,0)
actualvalue<-DirectMail.dataset$resp
confmat2<-table(predicted_values2,actualvalue)
confmat2
## actualvalue
## predicted_values2 0 1
## 0 23554 2373
## 1 3493 484
accuracy<-(confmat2[1,1]+confmat2[2,2])/(sum(confmat2))
accuracy
## [1] 0.803839
This model gives us a better accuracy compared to the earlier one. 80% of the time the model correctly predicts the response variable.
Lets take all variables at a time
model2<-glm(resp~.,family = binomial(logit),data = DirectMail.dataset)
summary(model2)
##
## Call:
## glm(formula = resp ~ ., family = binomial(logit), data = DirectMail.dataset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2381 -0.4829 -0.4071 -0.3412 2.6826
##
## Coefficients: (9 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.807e+00 2.854e-01 -6.332 2.42e-10 ***
## AGE -7.494e-03 3.500e-03 -2.141 0.032275 *
## CRED -7.792e-04 2.541e-04 -3.067 0.002162 **
## MSU -6.603e-02 4.144e-02 -1.594 0.111035
## MSX -2.790e-02 1.163e-01 -0.240 0.810398
## HEQ 5.532e-05 5.623e-04 0.098 0.921633
## INCOME -9.956e-03 1.398e-03 -7.120 1.08e-12 ***
## DEPCY 1.083e-01 4.064e-02 2.664 0.007720 **
## MOBY 4.576e-02 4.119e-02 1.111 0.266559
## MILEAGE 1.082e-02 3.395e-03 3.186 0.001441 **
## RESTYPECOOP 1.810e-01 1.784e-01 1.014 0.310434
## RESTYPEHOME -2.938e-01 1.461e-01 -2.010 0.044418 *
## RESTYPERENTER 3.157e-01 1.516e-01 2.082 0.037358 *
## GENDERM 9.920e-02 4.059e-02 2.444 0.014531 *
## EMP_STA1,2 2.485e-01 1.225e-01 2.028 0.042571 *
## EMP_STA3+ 1.096e-01 1.476e-01 0.742 0.457864
## RES_STA3+ -3.096e-01 9.375e-02 -3.303 0.000958 ***
## DELINQ 6.348e-02 1.685e-02 3.768 0.000165 ***
## NUMTR 2.489e-01 1.501e-02 16.584 < 2e-16 ***
## MRTGIU 4.267e-03 1.049e-01 0.041 0.967570
## MRTGIY 5.035e-02 6.356e-02 0.792 0.428275
## MFDU NA NA NA NA
## female NA NA NA NA
## HOME NA NA NA NA
## CONDO NA NA NA NA
## COOP NA NA NA NA
## renter NA NA NA NA
## emp1 NA NA NA NA
## emp2 NA NA NA NA
## msn NA NA NA NA
## cuscode 3.244e-06 6.725e-06 0.482 0.629483
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 18843 on 29870 degrees of freedom
## Residual deviance: 18239 on 29849 degrees of freedom
## (33 observations deleted due to missingness)
## AIC: 18283
##
## Number of Fisher Scoring iterations: 5
###confusion matrix
threshold=0.1
#threshold taken from the plot
predprop_3<-predict(model2,type = "response")
predicted_values_3<-ifelse(predict(model2,type = "response")>threshold,1,0)
actualvalue<-DirectMail.dataset$resp
#confmat3<-table(predicted_values_3,actualvalue)
#confmat3
#accuracy3<-(confmat3[1,1]+confmat3[2,2])/(sum(confmat3))
#accuracy3
Here Model cannot be built because CRED has Na’s.
Cleaning the data
Dealing with NA’S
create new dataset by removing CRED which is having na’s
mail<-DirectMail.dataset[,-2]
model3<-glm(resp~.,family = binomial(logit),data =mail)
summary(model3)
##
## Call:
## glm(formula = resp ~ ., family = binomial(logit), data = mail)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2614 -0.4829 -0.4074 -0.3419 2.6816
##
## Coefficients: (9 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.364e+00 2.233e-01 -10.589 < 2e-16 ***
## AGE -1.675e-03 3.029e-03 -0.553 0.580277
## MSU -6.623e-02 4.143e-02 -1.599 0.109909
## MSX -2.353e-02 1.163e-01 -0.202 0.839621
## HEQ -1.945e-05 5.618e-04 -0.035 0.972386
## INCOME -9.872e-03 1.397e-03 -7.065 1.61e-12 ***
## DEPCY 1.075e-01 4.063e-02 2.646 0.008135 **
## MOBY 4.552e-02 4.118e-02 1.105 0.268948
## MILEAGE 1.084e-02 3.393e-03 3.194 0.001401 **
## RESTYPECOOP 1.793e-01 1.784e-01 1.005 0.314886
## RESTYPEHOME -2.936e-01 1.461e-01 -2.009 0.044578 *
## RESTYPERENTER 3.143e-01 1.516e-01 2.073 0.038188 *
## GENDERM 1.004e-01 4.058e-02 2.473 0.013385 *
## EMP_STA1,2 2.480e-01 1.225e-01 2.024 0.042953 *
## EMP_STA3+ 1.101e-01 1.476e-01 0.746 0.455836
## RES_STA3+ -3.131e-01 9.373e-02 -3.340 0.000838 ***
## DELINQ 6.320e-02 1.684e-02 3.753 0.000175 ***
## NUMTR 2.484e-01 1.500e-02 16.552 < 2e-16 ***
## MRTGIU 4.165e-04 1.049e-01 0.004 0.996832
## MRTGIY 4.865e-02 6.355e-02 0.765 0.443989
## MFDU NA NA NA NA
## female NA NA NA NA
## HOME NA NA NA NA
## CONDO NA NA NA NA
## COOP NA NA NA NA
## renter NA NA NA NA
## emp1 NA NA NA NA
## emp2 NA NA NA NA
## msn NA NA NA NA
## cuscode -1.004e-05 5.397e-06 -1.860 0.062868 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 18850 on 29903 degrees of freedom
## Residual deviance: 18254 on 29883 degrees of freedom
## AIC: 18296
##
## Number of Fisher Scoring iterations: 5
###confusion matrix
threshold=0.15
predprop_4<-predict(model2,type = "response")
predicted_values_4<-ifelse(predict(model3,type = "response")>threshold,1,0)
actualvalue<-DirectMail.dataset$resp
confmat4<-table(predicted_values_4,actualvalue)
confmat4
## actualvalue
## predicted_values_4 0 1
## 0 24619 2279
## 1 2428 578
accuracy4<-(confmat4[1,1]+confmat4[2,2])/(sum(confmat4))
accuracy4
## [1] 0.8425963
Usually,It is not advisable to remove the variable from dataset,we clean the NA’s in the variable and bulid the model again
CRED
Repalce all na’s in the variable with mean
mail2<-DirectMail.dataset[,2]
mail3<-ifelse(is.na(mail2),603.6,mail2)
mail4<-DirectMail.dataset
mail4$CRED=mail3
#mail4 is the new dataset with no nas
model4<-glm(resp~.,family = binomial(logit),data = mail4)
summary(model4)
##
## Call:
## glm(formula = resp ~ ., family = binomial(logit), data = mail4)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2393 -0.4828 -0.4069 -0.3411 2.6829
##
## Coefficients: (9 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.856e+00 2.840e-01 -6.533 6.43e-11 ***
## AGE -6.394e-03 3.441e-03 -1.859 0.063087 .
## CRED -7.302e-04 2.526e-04 -2.891 0.003840 **
## MSU -6.500e-02 4.144e-02 -1.569 0.116706
## MSX -2.550e-02 1.163e-01 -0.219 0.826474
## HEQ 4.812e-05 5.623e-04 0.086 0.931810
## INCOME -9.932e-03 1.398e-03 -7.104 1.22e-12 ***
## DEPCY 1.078e-01 4.064e-02 2.652 0.008004 **
## MOBY 4.548e-02 4.119e-02 1.104 0.269477
## MILEAGE 1.084e-02 3.394e-03 3.194 0.001403 **
## RESTYPECOOP 1.800e-01 1.784e-01 1.009 0.313006
## RESTYPEHOME -2.952e-01 1.461e-01 -2.020 0.043397 *
## RESTYPERENTER 3.137e-01 1.516e-01 2.069 0.038579 *
## GENDERM 9.898e-02 4.059e-02 2.439 0.014747 *
## EMP_STA1,2 2.471e-01 1.225e-01 2.017 0.043733 *
## EMP_STA3+ 1.095e-01 1.476e-01 0.741 0.458392
## RES_STA3+ -3.119e-01 9.373e-02 -3.328 0.000874 ***
## DELINQ 6.338e-02 1.685e-02 3.762 0.000168 ***
## NUMTR 2.490e-01 1.501e-02 16.592 < 2e-16 ***
## MRTGIU 2.397e-03 1.049e-01 0.023 0.981779
## MRTGIY 4.918e-02 6.356e-02 0.774 0.439068
## MFDU NA NA NA NA
## female NA NA NA NA
## HOME NA NA NA NA
## CONDO NA NA NA NA
## COOP NA NA NA NA
## renter NA NA NA NA
## emp1 NA NA NA NA
## emp2 NA NA NA NA
## msn NA NA NA NA
## cuscode 9.771e-07 6.600e-06 0.148 0.882306
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 18850 on 29903 degrees of freedom
## Residual deviance: 18246 on 29882 degrees of freedom
## AIC: 18290
##
## Number of Fisher Scoring iterations: 5
###confusion matrix
threshold=0.1125
predprop_5<-predict(model4,type = "response")
predicted_values_5<-ifelse(predict(model4,type = "response")>threshold,1,0)
actualvalue<-DirectMail.dataset$resp
confmat5<-table(predicted_values_5,actualvalue)
confmat5
## actualvalue
## predicted_values_5 0 1
## 0 20055 1638
## 1 6992 1219
accuracy5<-(confmat5[1,1]+confmat5[2,2])/(sum(confmat5))
accuracy5
## [1] 0.7114098
Dealing with Outliers
CRED
We see the value 1789 is oulier,since it is above 99%
quantile(mail4$CRED,c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95))
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 95%
## 300 472 563 586 605 617 628 642 661 686 725
HEQ
we see value 200 is outlier
quantile(mail4$HEQ,c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95))
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 95%
## 0.1 0.1 10.0 10.0 30.0 30.0 30.0 50.0 50.0 70.0 90.0
Income
we see the value 110 is outlier
quantile(mail4$INCOME,c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.99))
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 95% 99%
## 20 20 30 30 40 40 40 50 50 60 70 110
Mileage
most of the values lies below 19.93
quantile(mail4$MILEAGE,c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.99))
## 0% 10% 20% 30% 40% 50% 60% 70%
## 0.00000 3.07860 6.28400 9.34770 11.10320 12.53650 13.99100 15.38610
## 80% 90% 95% 99%
## 16.98900 18.58670 19.32485 19.93300
Removing Outliners
ncred
ncred<-ifelse(mail4$CRED==1789,603.6,mail4$CRED)
boxplot(ncred)
>HEQ
nheq<-ifelse(mail4$HEQ>=90,38.33,mail4$HEQ)
boxplot(nheq)
>Income
ninc<-ifelse(mail4$INCOME>=100,41.36,mail4$INCOME)
boxplot(ninc)
>Mileage
nmil<-ifelse(mail4$MILEAGE>=25,11.8,mail4$MILEAGE)
boxplot(nmil)
Rebuild the model after removing outliers
make a new mail5-dataset by replacing the oulierless variables in mail4.
mail5<-mail4
mail5$CRED<-ncred
mail5$HEQ<-nheq
mail5$INCOME<-ninc
mail5$MILEAGE<-nmil
buliding model
model5<-glm(resp~.,family = binomial(logit),data = mail5)
summary(model5)
##
## Call:
## glm(formula = resp ~ ., family = binomial(logit), data = mail5)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2451 -0.4821 -0.4072 -0.3412 2.6950
##
## Coefficients: (9 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.855e+00 2.880e-01 -6.442 1.18e-10 ***
## AGE -5.921e-03 3.397e-03 -1.743 0.081328 .
## CRED -6.668e-04 2.567e-04 -2.598 0.009383 **
## MSU -6.591e-02 4.144e-02 -1.591 0.111703
## MSX -2.580e-02 1.163e-01 -0.222 0.824413
## HEQ 3.953e-04 9.733e-04 0.406 0.684651
## INCOME -1.120e-02 1.590e-03 -7.043 1.88e-12 ***
## DEPCY 1.074e-01 4.064e-02 2.644 0.008202 **
## MOBY 4.548e-02 4.118e-02 1.104 0.269416
## MILEAGE 1.008e-02 3.673e-03 2.745 0.006058 **
## RESTYPECOOP 1.823e-01 1.784e-01 1.022 0.306929
## RESTYPEHOME -2.954e-01 1.461e-01 -2.022 0.043196 *
## RESTYPERENTER 3.125e-01 1.516e-01 2.061 0.039313 *
## GENDERM 9.965e-02 4.059e-02 2.455 0.014080 *
## EMP_STA1,2 2.468e-01 1.225e-01 2.014 0.043989 *
## EMP_STA3+ 1.096e-01 1.476e-01 0.743 0.457640
## RES_STA3+ -3.130e-01 9.373e-02 -3.340 0.000839 ***
## DELINQ 6.301e-02 1.685e-02 3.740 0.000184 ***
## NUMTR 2.492e-01 1.501e-02 16.601 < 2e-16 ***
## MRTGIU 3.036e-03 1.049e-01 0.029 0.976918
## MRTGIY 4.893e-02 6.355e-02 0.770 0.441276
## MFDU NA NA NA NA
## female NA NA NA NA
## HOME NA NA NA NA
## CONDO NA NA NA NA
## COOP NA NA NA NA
## renter NA NA NA NA
## emp1 NA NA NA NA
## emp2 NA NA NA NA
## msn NA NA NA NA
## cuscode -3.808e-07 6.562e-06 -0.058 0.953721
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 18850 on 29903 degrees of freedom
## Residual deviance: 18253 on 29882 degrees of freedom
## AIC: 18297
##
## Number of Fisher Scoring iterations: 5
###confusion matrix
threshold=0.1125
predprop_6<-predict(model5,type = "response")
predicted_values_6<-ifelse(predict(model5,type = "response")>threshold,1,0)
actualvalue<-mail5$resp
confmat6<-table(actualvalue,predicted_values_6)
confmat6
## predicted_values_6
## actualvalue 0 1
## 0 20077 6970
## 1 1650 1207
accuracy6<-(confmat6[1,1]+confmat6[2,2])/(sum(confmat6))
accuracy6
## [1] 0.7117442
sensitivity6<-confmat6[1,1]/(confmat6[1,1]+confmat6[1,2])
sensitivity6
## [1] 0.7423004
specificity6<-confmat6[2,2]/(confmat6[2,1]+confmat6[2,2])
specificity6
## [1] 0.4224711
Here we can see that the variables CRED,INCOME,DEPCY,MILEAGE,DELINQ,NUMTR are highly impacting from their Z values. In presence of categorical variables decision trees gives better reults
Building a decision tree
with rpart library
#include a library-rpart for decision tree
library(rpart)
mail5$resp<-factor(mail5$resp)
tree<-rpart(resp~., method = "class",data = mail5,control = rpart.control(minsplit = 30,cp=0.00001))
#We can check the summary of the tree by -summary(tree)
#For a Shorter summary
tree$variable.importance
## cuscode CRED AGE NUMTR MILEAGE RESTYPE
## 77.1665794 64.0314285 52.1785565 51.4437608 47.9210444 35.7394500
## HOME MFDU INCOME MRTGI HEQ MS
## 32.5709146 32.5709146 27.3671233 21.1772451 16.9198385 13.7326187
## msn DEPC DELINQ EMP_STA female GENDER
## 10.9162690 9.4188378 7.1516773 5.1097097 4.0702741 4.0702741
## COOP MOB emp1 emp2 RES_STA
## 3.3501143 2.9477677 1.7250014 1.0397997 0.3546058
summary(tree$frame)
## var n wt dev
## <leaf> :129 Min. : 10.0 Min. : 10.0 Min. : 0.00
## cuscode: 28 1st Qu.: 33.0 1st Qu.: 33.0 1st Qu.: 8.00
## MILEAGE: 24 Median : 103.0 Median : 103.0 Median : 19.00
## CRED : 22 Mean : 877.3 Mean : 877.3 Mean : 96.81
## AGE : 14 3rd Qu.: 450.0 3rd Qu.: 450.0 3rd Qu.: 67.00
## HEQ : 7 Max. :29904.0 Max. :29904.0 Max. :2857.00
## (Other): 33
## yval complexity ncompete nsurrogate
## Min. :1.000 Min. :0.000e+00 Min. :0.000 Min. :0.00
## 1st Qu.:1.000 1st Qu.:0.000e+00 1st Qu.:0.000 1st Qu.:0.00
## Median :1.000 Median :1.000e-05 Median :0.000 Median :0.00
## Mean :1.105 Mean :9.121e-05 Mean :1.992 Mean :1.23
## 3rd Qu.:1.000 3rd Qu.:1.615e-04 3rd Qu.:4.000 3rd Qu.:2.00
## Max. :2.000 Max. :3.000e-04 Max. :4.000 Max. :5.00
##
## yval2.V1 yval2.V2 yval2.V3 yval2.V4 yval2.V5 yval2.nodeprob
## Min. :1.0000000 Min. : 2.000 Min. : 0.0000 Min. :0.2000000 Min. :0.0000000 Min. :0.0003344
## 1st Qu.:1.0000000 1st Qu.: 24.000 1st Qu.: 8.0000 1st Qu.:0.7142857 1st Qu.:0.1149051 1st Qu.:0.0011035
## Median :1.0000000 Median : 90.000 Median : 19.0000 Median :0.8260870 Median :0.1739130 Median :0.0034444
## Mean :1.1050584 Mean : 780.179 Mean : 97.1089 Mean :0.7732910 Mean :0.2267090 Mean :0.0293368
## 3rd Qu.:1.0000000 3rd Qu.: 403.000 3rd Qu.: 67.0000 3rd Qu.:0.8850949 3rd Qu.:0.2857143 3rd Qu.:0.0150482
## Max. :2.0000000 Max. :27047.000 Max. :2857.0000 Max. :1.0000000 Max. :0.8000000 Max. :1.0000000
##
summary(tree$where)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 5.00 42.00 60.89 86.00 257.00
summary(tree$call)
## Length Class Mode
## 5 call call
summary(tree$cptable)
## CP nsplit rel error xerror
## Min. :0.0000100 Min. : 0.00 Min. :0.9748 Min. :1.000
## 1st Qu.:0.0000875 1st Qu.: 35.00 1st Qu.:0.9776 1st Qu.:1.031
## Median :0.0001556 Median : 68.00 Median :0.9814 Median :1.049
## Mean :0.0001581 Mean : 66.77 Mean :0.9834 Mean :1.050
## 3rd Qu.:0.0002333 3rd Qu.: 97.00 3rd Qu.:0.9881 3rd Qu.:1.065
## Max. :0.0003000 Max. :128.00 Max. :1.0000 Max. :1.092
## xstd
## Min. :0.01779
## 1st Qu.:0.01804
## Median :0.01818
## Mean :0.01818
## 3rd Qu.:0.01830
## Max. :0.01851
summary(tree$method)
## Length Class Mode
## 1 character character
Let’s use PARTY package to draw a decision tree
library("party")
## Warning: package 'party' was built under R version 3.1.3
## Loading required package: modeltools
## Warning: package 'modeltools' was built under R version 3.1.3
## Loading required package: stats4
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 3.1.3
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 3.1.3
dec_tree<-ctree(resp ~., data = mail5,controls = ctree_control(mincriterion = 0.1))
summary(dec_tree)
## Length Class Mode
## 1 BinaryTree S4
dec_tree
##
## Conditional inference tree with 105 terminal nodes
##
## Response: resp
## Inputs: AGE, CRED, MS, HEQ, INCOME, DEPC, MOB, MILEAGE, RESTYPE, GENDER, EMP_STA, RES_STA, DELINQ, NUMTR, MRTGI, MFDU, female, HOME, CONDO, COOP, renter, emp1, emp2, msn, cuscode
## Number of observations: 29904
##
## 1) NUMTR <= 2; criterion = 1, statistic = 277.497
## 2) RESTYPE == {CONDO, HOME}; criterion = 1, statistic = 176.563
## 3) INCOME <= 30; criterion = 1, statistic = 116.148
## 4) INCOME <= 20; criterion = 0.969, statistic = 10.394
## 5) cuscode <= 20349; criterion = 0.671, statistic = 5.82
## 6) HEQ <= 30; criterion = 0.256, statistic = 3.741
## 7)* weights = 1001
## 6) HEQ > 30
## 8) NUMTR <= 0; criterion = 0.594, statistic = 5.359
## 9) MS == {U}; criterion = 0.431, statistic = 6.815
## 10)* weights = 128
## 9) MS == {M, X}
## 11)* weights = 175
## 8) NUMTR > 0
## 12) DEPC == {Y}; criterion = 0.186, statistic = 3.403
## 13)* weights = 69
## 12) DEPC == {N}
## 14) HEQ <= 50; criterion = 0.137, statistic = 3.141
## 15)* weights = 68
## 14) HEQ > 50
## 16)* weights = 34
## 5) cuscode > 20349
## 17)* weights = 548
## 4) INCOME > 20
## 18) NUMTR <= 0; criterion = 0.515, statistic = 4.944
## 19) HEQ <= 38.33; criterion = 0.16, statistic = 3.63
## 20) msn <= 0; criterion = 0.576, statistic = 6.509
## 21) DEPC == {Y}; criterion = 0.803, statistic = 6.871
## 22) DELINQ <= 3; criterion = 0.821, statistic = 7.07
## 23) MFDU <= 0; criterion = 0.661, statistic = 5.758
## 24)* weights = 204
## 23) MFDU > 0
## 25)* weights = 10
## 22) DELINQ > 3
## 26)* weights = 8
## 21) DEPC == {N}
## 27)* weights = 381
## 20) msn > 0
## 28) CRED <= 398; criterion = 0.241, statistic = 4.407
## 29) MOB == {N}; criterion = 0.328, statistic = 4.073
## 30)* weights = 18
## 29) MOB == {Y}
## 31)* weights = 37
## 28) CRED > 398
## 32)* weights = 764
## 19) HEQ > 38.33
## 33) MILEAGE <= 19.77; criterion = 0.146, statistic = 3.188
## 34) DEPC == {N}; criterion = 0.123, statistic = 3.059
## 35)* weights = 364
## 34) DEPC == {Y}
## 36) MS == {M, U}; criterion = 0.804, statistic = 9.493
## 37)* weights = 229
## 36) MS == {X}
## 38)* weights = 11
## 33) MILEAGE > 19.77
## 39)* weights = 9
## 18) NUMTR > 0
## 40) GENDER == {F}; criterion = 0.722, statistic = 6.173
## 41) MS == {U}; criterion = 0.778, statistic = 9.215
## 42)* weights = 173
## 41) MS == {M, X}
## 43)* weights = 277
## 40) GENDER == {M}
## 44)* weights = 592
## 3) INCOME > 30
## 45) INCOME <= 50; criterion = 1, statistic = 19.063
## 46) cuscode <= 13621; criterion = 0.997, statistic = 14.948
## 47) NUMTR <= 1; criterion = 0.964, statistic = 10.14
## 48) CRED <= 336; criterion = 0.439, statistic = 4.576
## 49)* weights = 81
## 48) CRED > 336
## 50) emp2 <= 0; criterion = 0.208, statistic = 3.609
## 51) HOME <= 0; criterion = 0.358, statistic = 4.207
## 52)* weights = 94
## 51) HOME > 0
## 53) MILEAGE <= 13.409; criterion = 0.19, statistic = 3.838
## 54)* weights = 1563
## 53) MILEAGE > 13.409
## 55)* weights = 1189
## 50) emp2 > 0
## 56)* weights = 179
## 47) NUMTR > 1
## 57) CRED <= 578; criterion = 0.551, statistic = 5.127
## 58)* weights = 148
## 57) CRED > 578
## 59) RESTYPE == {CONDO}; criterion = 0.91, statistic = 8.396
## 60)* weights = 10
## 59) RESTYPE == {HOME}
## 61)* weights = 232
## 46) cuscode > 13621
## 62) MFDU <= 0; criterion = 0.983, statistic = 11.484
## 63) emp2 <= 0; criterion = 0.911, statistic = 10.162
## 64) msn <= 0; criterion = 0.103, statistic = 3.436
## 65) EMP_STA == {0}; criterion = 0.884, statistic = 7.91
## 66) NUMTR <= 1; criterion = 0.433, statistic = 6.289
## 67) DELINQ <= 0; criterion = 0.101, statistic = 2.916
## 68)* weights = 27
## 67) DELINQ > 0
## 69)* weights = 23
## 66) NUMTR > 1
## 70)* weights = 12
## 65) EMP_STA == {1,2}
## 71) DELINQ <= 1; criterion = 0.635, statistic = 5.599
## 72) INCOME <= 41.36; criterion = 0.107, statistic = 2.956
## 73) HEQ <= 10; criterion = 0.865, statistic = 7.616
## 74) AGE <= 51; criterion = 0.159, statistic = 3.26
## 75)* weights = 36
## 74) AGE > 51
## 76) GENDER == {F}; criterion = 0.259, statistic = 3.758
## 77)* weights = 84
## 76) GENDER == {M}
## 78)* weights = 117
## 73) HEQ > 10
## 79)* weights = 571
## 72) INCOME > 41.36
## 80) HEQ <= 50; criterion = 0.146, statistic = 3.19
## 81)* weights = 492
## 80) HEQ > 50
## 82) GENDER == {F}; criterion = 0.13, statistic = 3.096
## 83)* weights = 21
## 82) GENDER == {M}
## 84)* weights = 16
## 71) DELINQ > 1
## 85)* weights = 375
## 64) msn > 0
## 86)* weights = 2512
## 63) emp2 > 0
## 87)* weights = 275
## 62) MFDU > 0
## 88) MOB == {N}; criterion = 0.964, statistic = 10.12
## 89)* weights = 54
## 88) MOB == {Y}
## 90)* weights = 95
## 45) INCOME > 50
## 91) NUMTR <= 0; criterion = 0.111, statistic = 2.983
## 92) INCOME <= 60; criterion = 0.108, statistic = 4.241
## 93)* weights = 858
## 92) INCOME > 60
## 94) MFDU <= 0; criterion = 0.564, statistic = 5.197
## 95)* weights = 332
## 94) MFDU > 0
## 96)* weights = 10
## 91) NUMTR > 0
## 97)* weights = 648
## 2) RESTYPE == {COOP, RENTER}
## 98) NUMTR <= 0; criterion = 1, statistic = 49.019
## 99) DELINQ <= 0; criterion = 0.976, statistic = 10.893
## 100) cuscode <= 23396; criterion = 0.483, statistic = 4.784
## 101) DEPC == {N}; criterion = 0.3, statistic = 4.439
## 102) MOB == {Y}; criterion = 0.314, statistic = 4.01
## 103) MILEAGE <= 5.961; criterion = 0.571, statistic = 5.234
## 104)* weights = 266
## 103) MILEAGE > 5.961
## 105) cuscode <= 21419; criterion = 0.439, statistic = 4.576
## 106)* weights = 986
## 105) cuscode > 21419
## 107) msn <= 0; criterion = 0.318, statistic = 4.602
## 108) GENDER == {F}; criterion = 0.18, statistic = 3.37
## 109)* weights = 15
## 108) GENDER == {M}
## 110)* weights = 22
## 107) msn > 0
## 111) AGE <= 62; criterion = 0.196, statistic = 3.452
## 112)* weights = 50
## 111) AGE > 62
## 113)* weights = 11
## 102) MOB == {N}
## 114) MILEAGE <= 14.973; criterion = 0.836, statistic = 7.233
## 115) RESTYPE == {COOP}; criterion = 0.145, statistic = 3.182
## 116)* weights = 52
## 115) RESTYPE == {RENTER}
## 117)* weights = 534
## 114) MILEAGE > 14.973
## 118) msn <= 0; criterion = 0.578, statistic = 5.864
## 119)* weights = 124
## 118) msn > 0
## 120)* weights = 139
## 101) DEPC == {Y}
## 121)* weights = 1423
## 100) cuscode > 23396
## 122) CRED <= 574; criterion = 0.852, statistic = 7.441
## 123) emp1 <= 0; criterion = 0.33, statistic = 4.082
## 124)* weights = 10
## 123) emp1 > 0
## 125) cuscode <= 26013; criterion = 0.176, statistic = 3.352
## 126) msn <= 0; criterion = 0.12, statistic = 3.388
## 127)* weights = 24
## 126) msn > 0
## 128) AGE <= 67; criterion = 0.104, statistic = 2.94
## 129)* weights = 7
## 128) AGE > 67
## 130)* weights = 25
## 125) cuscode > 26013
## 131)* weights = 9
## 122) CRED > 574
## 132) RES_STA == {1,2}; criterion = 0.133, statistic = 4.184
## 133)* weights = 846
## 132) RES_STA == {3+}
## 134)* weights = 52
## 99) DELINQ > 0
## 135) INCOME <= 40; criterion = 0.379, statistic = 4.956
## 136) RES_STA == {1,2}; criterion = 0.478, statistic = 4.76
## 137) CRED <= 745; criterion = 0.102, statistic = 2.923
## 138)* weights = 1878
## 137) CRED > 745
## 139)* weights = 51
## 136) RES_STA == {3+}
## 140)* weights = 143
## 135) INCOME > 40
## 141)* weights = 1055
## 98) NUMTR > 0
## 142) AGE <= 71; criterion = 0.289, statistic = 3.897
## 143)* weights = 3529
## 142) AGE > 71
## 144) AGE <= 82; criterion = 0.663, statistic = 5.772
## 145) DELINQ <= 0; criterion = 0.14, statistic = 3.154
## 146) female <= 0; criterion = 0.473, statistic = 4.736
## 147)* weights = 76
## 146) female > 0
## 148)* weights = 70
## 145) DELINQ > 0
## 149)* weights = 97
## 144) AGE > 82
## 150) MOB == {Y}; criterion = 0.312, statistic = 3.999
## 151) NUMTR <= 1; criterion = 0.801, statistic = 6.852
## 152)* weights = 36
## 151) NUMTR > 1
## 153)* weights = 22
## 150) MOB == {N}
## 154) msn <= 0; criterion = 0.513, statistic = 5.474
## 155)* weights = 16
## 154) msn > 0
## 156)* weights = 25
## 1) NUMTR > 2
## 157) MFDU <= 0; criterion = 1, statistic = 24.277
## 158) INCOME <= 30; criterion = 1, statistic = 26.157
## 159) DEPC == {N}; criterion = 0.454, statistic = 4.647
## 160) HEQ <= 30; criterion = 0.473, statistic = 4.739
## 161) EMP_STA == {3+}; criterion = 0.518, statistic = 7.299
## 162)* weights = 13
## 161) EMP_STA == {0, 1,2}
## 163)* weights = 190
## 160) HEQ > 30
## 164) INCOME <= 20; criterion = 0.865, statistic = 7.619
## 165)* weights = 57
## 164) INCOME > 20
## 166)* weights = 106
## 159) DEPC == {Y}
## 167) female <= 0; criterion = 0.392, statistic = 4.362
## 168) cuscode <= 17714; criterion = 0.327, statistic = 4.068
## 169) INCOME <= 20; criterion = 0.447, statistic = 4.616
## 170)* weights = 25
## 169) INCOME > 20
## 171)* weights = 44
## 168) cuscode > 17714
## 172)* weights = 46
## 167) female > 0
## 173) MRTGI == {U, Y}; criterion = 0.262, statistic = 5.909
## 174)* weights = 57
## 173) MRTGI == {N}
## 175)* weights = 22
## 158) INCOME > 30
## 176) GENDER == {F}; criterion = 0.488, statistic = 5.051
## 177) INCOME <= 50; criterion = 0.841, statistic = 7.304
## 178) cuscode <= 23443; criterion = 0.371, statistic = 4.264
## 179) MOB == {Y}; criterion = 0.148, statistic = 3.2
## 180)* weights = 170
## 179) MOB == {N}
## 181) INCOME <= 40; criterion = 0.149, statistic = 3.209
## 182) NUMTR <= 3; criterion = 0.462, statistic = 4.685
## 183) DELINQ <= 0; criterion = 0.144, statistic = 3.179
## 184)* weights = 18
## 183) DELINQ > 0
## 185)* weights = 16
## 182) NUMTR > 3
## 186)* weights = 36
## 181) INCOME > 40
## 187)* weights = 47
## 178) cuscode > 23443
## 188) AGE <= 69; criterion = 0.182, statistic = 3.383
## 189)* weights = 53
## 188) AGE > 69
## 190) DELINQ <= 1; criterion = 0.366, statistic = 4.245
## 191)* weights = 31
## 190) DELINQ > 1
## 192)* weights = 11
## 177) INCOME > 50
## 193) MILEAGE <= 5.617; criterion = 0.295, statistic = 3.92
## 194)* weights = 14
## 193) MILEAGE > 5.617
## 195)* weights = 82
## 176) GENDER == {M}
## 196) msn <= 0; criterion = 0.884, statistic = 8.311
## 197)* weights = 279
## 196) msn > 0
## 198)* weights = 394
## 157) MFDU > 0
## 199) AGE <= 56; criterion = 0.998, statistic = 15.829
## 200)* weights = 939
## 199) AGE > 56
## 201) NUMTR <= 3; criterion = 0.763, statistic = 6.502
## 202)* weights = 314
## 201) NUMTR > 3
## 203) HEQ <= 38.33; criterion = 0.662, statistic = 5.763
## 204)* weights = 133
## 203) HEQ > 38.33
## 205) GENDER == {M}; criterion = 0.297, statistic = 3.933
## 206)* weights = 35
## 205) GENDER == {F}
## 207) INCOME <= 30; criterion = 0.18, statistic = 3.37
## 208)* weights = 12
## 207) INCOME > 30
## 209)* weights = 13
#We can see the plotted decision tree using plot(dec_tree),the parameter for function-ctree(),can be varied suitably for our Dataset
#plot(dec_tree)
predict_dectree<-predict(dec_tree)
confmat8<-table(actualvalue,predict_dectree)
confmat8
## predict_dectree
## actualvalue 0 1
## 0 27031 16
## 1 2835 22
accuracy8<-(confmat8[1,1]+confmat8[2,2])/(sum(confmat8))
accuracy8
## [1] 0.9046616
sensitivity8<-confmat8[1,1]/(confmat8[1,2]+confmat8[1,1])
sensitivity8
## [1] 0.9994084
specificity8<-confmat8[2,2]/(confmat8[2,1]+confmat8[2,2])
specificity8
## [1] 0.007700385
Unbalanced dataset handling
Here the dataset is Unbalanced i.e Response is very much skewed towards 0’s and models we are built are with good accuracy but with poor specificity and sensitivity.These Unbalanced data need to be dealt carefully.


