You can download the datasets and R code file for this session here.

Problem Statement
Details of the Dataset
Summary of the dataset
summary of each variable
- let us check for the correlation
Regression Model
Cleaning the data
- Dealing with NA’S
- Dealing with Outliers
  - Removing Outliners
Rebuild the model after removing outliers
Building a decision tree
- Unbalanced dataset handling

Problem Statement

An insurance company need to come up with a good marketing strategy. They want to run an e-mail campaign. Before sending a mail to all the available e-mail addresses, they want to first build a predictive model and identify the customers who are most likely to respond.
Analyze the historical data and build a predictive model that helps us in maximizing the response rate

Import dataset with the following command from your folder

 DirectMail.dataset <- read.delim("C:/Amrita/Datavedi/Case Study/DirectMail.txt")

Details of the Dataset

dimension of data

dim(DirectMail.dataset)

## [1] 29904    26

Summary of the dataset

Summary and Structure

str(DirectMail.dataset)

## 'data.frame':    29904 obs. of  26 variables:
##  $ AGE    : int  20 18 19 18 18 19 19 20 19 19 ...
##  $ CRED   : int  543 445 423 416 522 467 334 519 480 309 ...
##  $ MS     : Factor w/ 3 levels "M","U","X": 2 2 2 3 2 2 3 2 1 2 ...
##  $ HEQ    : num  0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
##  $ INCOME : int  20 20 20 20 20 20 20 20 20 20 ...
##  $ DEPC   : Factor w/ 2 levels "N","Y": 1 1 1 1 2 2 1 1 1 2 ...
##  $ MOB    : Factor w/ 2 levels "N","Y": 2 2 1 2 2 1 2 2 2 2 ...
##  $ MILEAGE: num  14.418 0.311 17.876 17.084 16.51 ...
##  $ RESTYPE: Factor w/ 4 levels "CONDO","COOP",..: 3 3 3 3 3 3 4 3 4 3 ...
##  $ GENDER : Factor w/ 2 levels "F","M": 2 2 2 1 1 1 2 2 2 2 ...
##  $ EMP_STA: Factor w/ 3 levels "0","1,2","3+": 2 1 2 2 2 2 2 2 3 2 ...
##  $ RES_STA: Factor w/ 2 levels "1,2","3+": 1 1 1 1 1 1 1 1 1 1 ...
##  $ DELINQ : int  0 2 1 0 1 0 0 4 4 0 ...
##  $ NUMTR  : int  2 2 0 0 0 0 1 3 1 0 ...
##  $ MRTGI  : Factor w/ 3 levels "N","U","Y": 1 3 3 3 3 3 1 2 1 3 ...
##  $ MFDU   : int  0 0 0 0 0 0 1 0 1 0 ...
##  $ resp   : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ female : int  0 0 0 1 1 1 0 0 0 0 ...
##  $ HOME   : int  1 1 1 1 1 1 0 1 0 1 ...
##  $ CONDO  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ COOP   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ renter : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ emp1   : int  1 0 1 1 1 1 1 1 0 1 ...
##  $ emp2   : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ msn    : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ cuscode: int  1 2 3 4 5 6 7 8 9 10 ...

summary(DirectMail.dataset)

##       AGE            CRED        MS             HEQ        
##  Min.   :18.0   Min.   : 300.0   M:17221   Min.   :  0.10  
##  1st Qu.:36.0   1st Qu.: 574.0   U:11721   1st Qu.: 10.00  
##  Median :50.0   Median : 617.0   X:  962   Median : 30.00  
##  Mean   :49.3   Mean   : 603.6             Mean   : 38.33  
##  3rd Qu.:61.0   3rd Qu.: 652.0             3rd Qu.: 50.00  
##  Max.   :90.0   Max.   :1789.0             Max.   :200.00  
##                 NA's   :33                                 
##      INCOME       DEPC      MOB          MILEAGE         RESTYPE     
##  Min.   : 20.00   N:18502   N:11340   Min.   : 0.000   CONDO :  566  
##  1st Qu.: 30.00   Y:11402   Y:18564   1st Qu.: 7.776   COOP  :  885  
##  Median : 40.00                       Median :12.537   HOME  :16365  
##  Mean   : 41.36                       Mean   :11.803   RENTER:12088  
##  3rd Qu.: 50.00                       3rd Qu.:16.193                 
##  Max.   :110.00                       Max.   :94.640                 
##                                                                      
##  GENDER    EMP_STA     RES_STA         DELINQ           NUMTR       
##  F:12628   0  :  968   1,2:28109   Min.   :0.0000   Min.   :0.0000  
##  M:17276   1,2:27146   3+ : 1795   1st Qu.:0.0000   1st Qu.:0.0000  
##            3+ : 1790               Median :0.0000   Median :0.0000  
##                                    Mean   :0.7662   Mean   :0.7633  
##                                    3rd Qu.:1.0000   3rd Qu.:1.0000  
##                                    Max.   :7.0000   Max.   :7.0000  
##                                                                     
##  MRTGI          MFDU             resp             female      
##  N:17484   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  U: 1763   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Y:10657   Median :0.0000   Median :0.00000   Median :0.0000  
##            Mean   :0.4527   Mean   :0.09554   Mean   :0.4223  
##            3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:1.0000  
##            Max.   :1.0000   Max.   :1.00000   Max.   :1.0000  
##                                                               
##       HOME            CONDO              COOP             renter 
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.00000   Min.   :1  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:1  
##  Median :1.0000   Median :0.00000   Median :0.00000   Median :1  
##  Mean   :0.5473   Mean   :0.01893   Mean   :0.02959   Mean   :1  
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:1  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.00000   Max.   :1  
##                                                                  
##       emp1             emp2              msn            cuscode     
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :    1  
##  1st Qu.:1.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.: 7477  
##  Median :1.0000   Median :0.00000   Median :1.0000   Median :14952  
##  Mean   :0.9078   Mean   :0.05986   Mean   :0.5759   Mean   :14952  
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:22428  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :29904  
##

Check for nas in each variable check if any na values are present in the dataset and count the NA values in the dataset

 sum(is.na(DirectMail.dataset))

## [1] 33

Lets directly check in which columns the na values are present.

col<-names(DirectMail.dataset)
nas<-c(sum(is.na(DirectMail.dataset$AGE)),sum(is.na(DirectMail.dataset$CRED)),sum(is.na(DirectMail.dataset$MS)),sum(is.na(DirectMail.dataset$HEQ)),sum(is.na(DirectMail.dataset$INCOME)),sum(is.na(DirectMail.dataset$DEPC)),sum(is.na(DirectMail.dataset$MOB)),sum(is.na(DirectMail.dataset$MILEAGE)),sum(is.na(DirectMail.dataset$RESTYPE)),sum(is.na(DirectMail.dataset$GENDER)),sum(is.na(DirectMail.dataset$EMP_STA)),sum(is.na(DirectMail.dataset$RES_STA)),sum(is.na(DirectMail.dataset$DELINQ)),sum(is.na(DirectMail.dataset$NUMTR)),sum(is.na(DirectMail.dataset$MRTGI)),sum(is.na(DirectMail.dataset$MFDU)),sum(is.na(DirectMail.dataset$resp)),sum(is.na(DirectMail.dataset$female)),sum(is.na(DirectMail.dataset$HOME)),sum(is.na(DirectMail.dataset$CONDO)),sum(is.na(DirectMail.dataset$COOP)),sum(is.na(DirectMail.dataset$renter)),sum(is.na(DirectMail.dataset$emp1)),sum(is.na(DirectMail.dataset$emp2)),sum(is.na(DirectMail.dataset$msn)),sum(is.na(DirectMail.dataset$cuscode)))
table<-data.frame(col,nas)
table

##        col nas
## 1      AGE   0
## 2     CRED  33
## 3       MS   0
## 4      HEQ   0
## 5   INCOME   0
## 6     DEPC   0
## 7      MOB   0
## 8  MILEAGE   0
## 9  RESTYPE   0
## 10  GENDER   0
## 11 EMP_STA   0
## 12 RES_STA   0
## 13  DELINQ   0
## 14   NUMTR   0
## 15   MRTGI   0
## 16    MFDU   0
## 17    resp   0
## 18  female   0
## 19    HOME   0
## 20   CONDO   0
## 21    COOP   0
## 22  renter   0
## 23    emp1   0
## 24    emp2   0
## 25     msn   0
## 26 cuscode   0

So here we see all the na values are present in the CRED column(no of NA’s in dataset and CRED coulmn is same),this column need to be taken care in data cleaning

summary of each variable

univariate analysis

Age

Age of the customer

class(DirectMail.dataset$AGE)

## [1] "integer"

summary(DirectMail.dataset$AGE)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    18.0    36.0    50.0    49.3    61.0    90.0

boxplot(DirectMail.dataset$AGE)

Credit score

Credit score is the creditworthiness of the person

class(DirectMail.dataset$CRED)

## [1] "integer"

summary(DirectMail.dataset$CRED)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   300.0   574.0   617.0   603.6   652.0  1789.0      33

boxplot(DirectMail.dataset$CRED)

Outliers are present in Credit score

MS

class(DirectMail.dataset$MS)

## [1] "factor"

summary(DirectMail.dataset$MS)

##     M     U     X 
## 17221 11721   962

barplot(table(DirectMail.dataset$MS))

Home Equity

Home equity is the value of ownership built up in a home or property that represents the current market value of the house less any remaining mortgage payments.

class(DirectMail.dataset$HEQ)

## [1] "numeric"

summary(DirectMail.dataset$HEQ)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.10   10.00   30.00   38.33   50.00  200.00

boxplot(DirectMail.dataset$HEQ)

outliers are present in Home Equity

Income of the customer

Income of the customer

class(DirectMail.dataset$INCOME)

## [1] "integer"

summary(DirectMail.dataset$INCOME)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20.00   30.00   40.00   41.36   50.00  110.00

boxplot(DirectMail.dataset$INCOME)

outlier is present in Income

Depriciation

a reduction in the value of an asset.

class(DirectMail.dataset$DEPC)

## [1] "factor"

summary(DirectMail.dataset$DEPC)

##     N     Y 
## 18502 11402

table(DirectMail.dataset$DEPC)

## 
##     N     Y 
## 18502 11402

barplot(table(DirectMail.dataset$DEPC))

Existing Customer

If object is existing or a new customer

class(DirectMail.dataset$MOB)

## [1] "factor"

summary(DirectMail.dataset$MOB)

##     N     Y 
## 11340 18564

table(DirectMail.dataset$MOB)

## 
##     N     Y 
## 11340 18564

barplot(table(DirectMail.dataset$MOB))

Mileage

Mileage of the customer vehicle

class(DirectMail.dataset$MILEAGE)

## [1] "numeric"

summary(DirectMail.dataset$MILEAGE)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   7.776  12.540  11.800  16.190  94.640

boxplot(DirectMail.dataset$MILEAGE)

Outliers are present in mileage

Real Estate Type

Type of the house,customer is living in.

class(DirectMail.dataset$RESTYPE)

## [1] "factor"

summary(DirectMail.dataset$RESTYPE)

##  CONDO   COOP   HOME RENTER 
##    566    885  16365  12088

table(DirectMail.dataset$RESTYPE)

## 
##  CONDO   COOP   HOME RENTER 
##    566    885  16365  12088

barplot(table(DirectMail.dataset$RESTYPE))

Gender

Gender of the customer

class(DirectMail.dataset$GENDER)

## [1] "factor"

summary(DirectMail.dataset$GENDER)

##     F     M 
## 12628 17276

table(DirectMail.dataset$GENDER)

## 
##     F     M 
## 12628 17276

barplot(table(DirectMail.dataset$GENDER))

Employer status

class(DirectMail.dataset$EMP_STA)

## [1] "factor"

summary(DirectMail.dataset$EMP_STA)

##     0   1,2    3+ 
##   968 27146  1790

table(DirectMail.dataset$EMP_STA)

## 
##     0   1,2    3+ 
##   968 27146  1790

barplot(table(DirectMail.dataset$EMP_STA))

Residential status

Residentail status of the customer

class(DirectMail.dataset$RES_STA)

## [1] "factor"

summary(DirectMail.dataset$RES_STA)

##   1,2    3+ 
## 28109  1795

table(DirectMail.dataset$RES_STA)

## 
##   1,2    3+ 
## 28109  1795

barplot(table(DirectMail.dataset$RES_STA))

Delinquency Status

Delinquency is Failure in repaying the borrowed sum

class(DirectMail.dataset$DELINQ)

## [1] "integer"

summary(DirectMail.dataset$DELINQ)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.7662  1.0000  7.0000

table(DirectMail.dataset$DELINQ)

## 
##     0     1     2     3     4     5     6     7 
## 17688  5991  3084  1984  1071    29    28    29

barplot(table(DirectMail.dataset$DELINQ))

Number of active trades

Buying and selling the properties in a very short duration

class(DirectMail.dataset$NUMTR)

## [1] "integer"

summary(DirectMail.dataset$NUMTR)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.7633  1.0000  7.0000

table(DirectMail.dataset$NUMTR)

## 
##     0     1     2     3     4     5     6     7 
## 17778  5923  3046  1997  1071    26    34    29

barplot(table(DirectMail.dataset$NUMTR))

Mortgage Indicator

If customer has Mortagaged properties

class(DirectMail.dataset$MRTGI)

## [1] "factor"

summary(DirectMail.dataset$MRTGI)

##     N     U     Y 
## 17484  1763 10657

table(DirectMail.dataset$MRTGI)

## 
##     N     U     Y 
## 17484  1763 10657

barplot(table(DirectMail.dataset$MRTGI))

Multiple family Dwelling input

class(DirectMail.dataset$MFDU)

## [1] "integer"

summary(DirectMail.dataset$MFDU)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.4527  1.0000  1.0000

table(DirectMail.dataset$MFDU)

## 
##     0     1 
## 16365 13539

barplot(table(DirectMail.dataset$MFDU))

Response

Our target output is the Response

Recorded response

class(DirectMail.dataset$resp)

## [1] "integer"

summary(DirectMail.dataset$resp)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.09554 0.00000 1.00000

table(DirectMail.dataset$resp)

## 
##     0     1 
## 27047  2857

barplot(table(DirectMail.dataset$resp))

We see that response is more skewed towards 0’s.This is an Unbalanced Data.

Female

class(DirectMail.dataset$female)

## [1] "integer"

summary(DirectMail.dataset$female)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.4223  1.0000  1.0000

table(DirectMail.dataset$female)

## 
##     0     1 
## 17276 12628

barplot(table(DirectMail.dataset$female))

Home Indicator

class(DirectMail.dataset$HOME)

## [1] "integer"

summary(DirectMail.dataset$HOME)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  1.0000  0.5473  1.0000  1.0000

table(DirectMail.dataset$HOME)

## 
##     0     1 
## 13539 16365

barplot(table(DirectMail.dataset$HOME))

Condominium Indicator

class(DirectMail.dataset$CONDO)

## [1] "integer"

summary(DirectMail.dataset$CONDO)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.01893 0.00000 1.00000

table(DirectMail.dataset$CONDO)

## 
##     0     1 
## 29338   566

barplot(table(DirectMail.dataset$CONDO))

Co-Op Residence Indicator

class(DirectMail.dataset$COOP)

## [1] "integer"

summary(DirectMail.dataset$COOP)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.02959 0.00000 1.00000

table(DirectMail.dataset$COOP)

## 
##     0     1 
## 29019   885

barplot(table(DirectMail.dataset$COOP))

Rental Home Indicator

class(DirectMail.dataset$renter)

## [1] "integer"

summary(DirectMail.dataset$renter)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       1       1       1       1       1

table(DirectMail.dataset$renter)

## 
##     1 
## 29904

barplot(table(DirectMail.dataset$renter))

Employee1

class(DirectMail.dataset$emp1)

## [1] "integer"

summary(DirectMail.dataset$emp1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  1.0000  1.0000  0.9078  1.0000  1.0000

table(DirectMail.dataset$emp1)

## 
##     0     1 
##  2758 27146

barplot(table(DirectMail.dataset$emp1))

Employee2

class(DirectMail.dataset$emp2)

## [1] "integer"

summary(DirectMail.dataset$emp2)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05986 0.00000 1.00000

table(DirectMail.dataset$emp2)

## 
##     0     1 
## 28114  1790

barplot(table(DirectMail.dataset$emp2))

Medical Safety Net Program

class(DirectMail.dataset$msn)

## [1] "integer"

summary(DirectMail.dataset$msn)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  1.0000  0.5759  1.0000  1.0000

table(DirectMail.dataset$msn)

## 
##     0     1 
## 12683 17221

barplot(table(DirectMail.dataset$msn))

Customer Identification Code

class(DirectMail.dataset$cuscode)

## [1] "integer"

summary(DirectMail.dataset$cuscode)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1    7477   14950   14950   22430   29900

boxplot(DirectMail.dataset$cuscode)

bivariate analysis

checking the relation between variables

plot(DirectMail.dataset$RESTYPE,DirectMail.dataset$HOME,xlab="RESTYPE",ylab="HOME")

plot(DirectMail.dataset$RESTYPE,DirectMail.dataset$CONDO,xlab="RESTYPE",ylab="CONDO")

plot(DirectMail.dataset$RESTYPE,DirectMail.dataset$COOP,xlab="RESTYPE",ylab="COOP")

plot(DirectMail.dataset$renter,DirectMail.dataset$cuscode)

we see the variables HOME,CONDO,COOP,RENTER are derived from the Variable RESTYPE.
All the cases HOME,CONDO,COOP fall in Rental status,so it is always one.

let us check for the correlation

between numeric variables

#age-HEQ
cor(DirectMail.dataset$AGE,DirectMail.dataset$HEQ)

## [1] 0.2606553

#age-CRED
cor(DirectMail.dataset$AGE,DirectMail.dataset$CRED)

## [1] NA

#age-Income
cor(DirectMail.dataset$AGE,DirectMail.dataset$INCOME)

## [1] 0.05004371

#age-Mileage
cor(DirectMail.dataset$AGE,DirectMail.dataset$MILEAGE)

## [1] -0.002332406

#age-cuscode
cor(DirectMail.dataset$AGE,DirectMail.dataset$cuscode)

## [1] 0.9026553

between category variables
include library-ltm for using biserial.cor()

library("ltm")

## Warning: package 'ltm' was built under R version 3.1.3

## Loading required package: MASS

## Loading required package: msm

## Warning: package 'msm' was built under R version 3.1.3

## Loading required package: polycor

## Warning: package 'polycor' was built under R version 3.1.3

## Loading required package: mvtnorm

## Warning: package 'mvtnorm' was built under R version 3.1.3

## Loading required package: sfsmisc

## Warning: package 'sfsmisc' was built under R version 3.1.3

#find correlation between few variables
#age-resp
biserial.cor(DirectMail.dataset$AGE,DirectMail.dataset$resp)

## [1] 0.03099026

#income-resp
biserial.cor(DirectMail.dataset$INCOME,DirectMail.dataset$resp)

## [1] 0.04153282

#HEQ-resp
biserial.cor(DirectMail.dataset$HEQ,DirectMail.dataset$resp)

## [1] 0.006410327

#CRED-resp
biserial.cor(DirectMail.dataset$CRED,DirectMail.dataset$resp)

## [1] NA

#Mileage-resp
biserial.cor(DirectMail.dataset$MILEAGE,DirectMail.dataset$resp)

## [1] -0.01935468

#resp-cuscode
biserial.cor(DirectMail.dataset$cuscode,DirectMail.dataset$resp)

## [1] 0.03345851

check for correlation of catogorical variables

let us check for correlation of few variables

library(vcd)

## Warning: package 'vcd' was built under R version 3.1.3

## Loading required package: grid

t1<-table(DirectMail.dataset$emp1,DirectMail.dataset$emp2)
assocstats(t1)

##                      X^2 df P(> X^2)
## Likelihood Ratio  9976.5  1        0
## Pearson          18740.1  1        0
## 
## Phi-Coefficient   : 0.792 
## Contingency Coeff.: 0.621 
## Cramer's V        : 0.792

t2<-table(DirectMail.dataset$MS,DirectMail.dataset$MOB)
assocstats(t2)

##                     X^2 df P(> X^2)
## Likelihood Ratio 6.1726  2 0.045670
## Pearson          6.1928  2 0.045211
## 
## Phi-Coefficient   : NA 
## Contingency Coeff.: 0.014 
## Cramer's V        : 0.014

t3<-table(DirectMail.dataset$MS,DirectMail.dataset$GENDER)
assocstats(t3)

##                     X^2 df P(> X^2)
## Likelihood Ratio 1.7613  2  0.41451
## Pearson          1.7647  2  0.41382
## 
## Phi-Coefficient   : NA 
## Contingency Coeff.: 0.008 
## Cramer's V        : 0.008

Regression Model

let us see the linear regression model

model1<-lm(DirectMail.dataset$resp~DirectMail.dataset$AGE,data = DirectMail.dataset)
  summary(model1)

## 
## Call:
## lm(formula = DirectMail.dataset$resp ~ DirectMail.dataset$AGE, 
##     data = DirectMail.dataset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.11388 -0.10216 -0.09337 -0.08693  0.92831 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             0.1244291  0.0056499  22.023  < 2e-16 ***
## DirectMail.dataset$AGE -0.0005860  0.0001093  -5.362 8.31e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2938 on 29902 degrees of freedom
## Multiple R-squared:  0.0009604,  Adjusted R-squared:  0.000927 
## F-statistic: 28.75 on 1 and 29902 DF,  p-value: 8.312e-08

Here response is a logical variable we use the logistic regression

logistic regression With respect to age

model_age<-glm(resp~AGE,family = binomial(logit),data = DirectMail.dataset)
summary(model_age)

## 
## Call:
## glm(formula = resp ~ AGE, family = binomial(logit), data = DirectMail.dataset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.4949  -0.4639  -0.4418  -0.4262   2.2837  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.914613   0.064455 -29.705  < 2e-16 ***
## AGE         -0.006851   0.001279  -5.355 8.57e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 18850  on 29903  degrees of freedom
## Residual deviance: 18821  on 29902  degrees of freedom
## AIC: 18825
## 
## Number of Fisher Scoring iterations: 5

From the P-value we can conclude age significantly impacts our dependent variable response.

plot(DirectMail.dataset$AGE,DirectMail.dataset$resp)

curve(predict(model_age,data.frame(AGE=x),type = "resp"),add = "TRUE")

Let’s check the confusion matrix and accuracy with a threshold value of 0.4

###confusion matrix
threshold=0.4
predprop<-predict(model_age,type = "response")
predicted_values<-ifelse(predict(model_age,type = "response")>threshold,1,0)
actualvalue<-DirectMail.dataset$resp
confmat<-table(predicted_values,actualvalue)
confmat

##                 actualvalue
## predicted_values     0     1
##                0 27047  2857

accuracy<-(confmat[1,1])/(sum(confmat))
accuracy

## [1] 0.9044609

With the threshold value of 0.4 model predicted all cases as 0. Though accuracy is 90% this is not a good model as we couldn’t identify a single responder from this model. Let’s check logistic regression with respect to income

model_inc<-glm(resp~INCOME,family = binomial(logit),data = DirectMail.dataset)
summary(model_inc)

## 
## Call:
## glm(formula = resp ~ INCOME, family = binomial(logit), data = DirectMail.dataset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.4934  -0.4708  -0.4491  -0.4283   2.4459  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.845522   0.058199 -31.711  < 2e-16 ***
## INCOME      -0.009947   0.001384  -7.186 6.65e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 18850  on 29903  degrees of freedom
## Residual deviance: 18795  on 29902  degrees of freedom
## AIC: 18799
## 
## Number of Fisher Scoring iterations: 5

From the P-value we can conclude Income significantly impacts our dependent variable response.

plot(DirectMail.dataset$INCOME,DirectMail.dataset$resp)

Let us decide the threshold level using ROC curve :

library(pROC)

## Warning: package 'pROC' was built under R version 3.1.3

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

predicted_prob<-predict(model_inc,type="response")
roccurve <- roc(model_inc$y, predicted_prob)
plot(roccurve)

## 
## Call:
## roc.default(response = model_inc$y, predictor = predicted_prob)
## 
## Data: predicted_prob in 27047 controls (model_inc$y 0) < 2857 cases (model_inc$y 1).
## Area under the curve: 0.5411

plot(specificity + sensitivity ~ threshold, t(coords(roccurve, seq(0, 1, 0.001))), type = "l")

coords(roccurve, x="local maximas", input="sensitivity", ret=c("sensitivity","specificity","threshold","tn","tp","fn","fp"))

##             local maximas local maximas local maximas local maximas
## sensitivity             1  9.891495e-01  9.632482e-01  8.855443e-01
## specificity             0  1.678560e-02  5.202056e-02  1.410877e-01
## threshold            -Inf  6.160358e-02  7.649067e-02  8.381833e-02
## tn                      0  4.540000e+02  1.407000e+03  3.816000e+03
## tp                   2857  2.826000e+03  2.752000e+03  2.530000e+03
## fn                      0  3.100000e+01  1.050000e+02  3.270000e+02
## fp                  27047  2.659300e+04  2.564000e+04  2.323100e+04
##             local maximas local maximas local maximas local maximas
## sensitivity  6.940847e-01  3.923696e-01  1.694085e-01             0
## specificity  3.555662e-01  6.702407e-01  8.708544e-01             1
## threshold    9.177808e-02  1.004107e-01  1.097570e-01           Inf
## tn           9.617000e+03  1.812800e+04  2.355400e+04         27047
## tp           1.983000e+03  1.121000e+03  4.840000e+02             0
## fn           8.740000e+02  1.736000e+03  2.373000e+03          2857
## fp           1.743000e+04  8.919000e+03  3.493000e+03             0

###confusion matrix
threshold=0.105
#threshold taken from graph and the above output
predprop2<-predict(model_inc,type = "response")
predicted_values2<-ifelse(predict(model_inc,type = "response")>threshold,1,0)
actualvalue<-DirectMail.dataset$resp
confmat2<-table(predicted_values2,actualvalue)
confmat2

##                  actualvalue
## predicted_values2     0     1
##                 0 23554  2373
##                 1  3493   484

accuracy<-(confmat2[1,1]+confmat2[2,2])/(sum(confmat2))
accuracy

## [1] 0.803839

This model gives us a better accuracy compared to the earlier one. 80% of the time the model correctly predicts the response variable.

Lets take all variables at a time

model2<-glm(resp~.,family = binomial(logit),data = DirectMail.dataset)
summary(model2)

## 
## Call:
## glm(formula = resp ~ ., family = binomial(logit), data = DirectMail.dataset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2381  -0.4829  -0.4071  -0.3412   2.6826  
## 
## Coefficients: (9 not defined because of singularities)
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -1.807e+00  2.854e-01  -6.332 2.42e-10 ***
## AGE           -7.494e-03  3.500e-03  -2.141 0.032275 *  
## CRED          -7.792e-04  2.541e-04  -3.067 0.002162 ** 
## MSU           -6.603e-02  4.144e-02  -1.594 0.111035    
## MSX           -2.790e-02  1.163e-01  -0.240 0.810398    
## HEQ            5.532e-05  5.623e-04   0.098 0.921633    
## INCOME        -9.956e-03  1.398e-03  -7.120 1.08e-12 ***
## DEPCY          1.083e-01  4.064e-02   2.664 0.007720 ** 
## MOBY           4.576e-02  4.119e-02   1.111 0.266559    
## MILEAGE        1.082e-02  3.395e-03   3.186 0.001441 ** 
## RESTYPECOOP    1.810e-01  1.784e-01   1.014 0.310434    
## RESTYPEHOME   -2.938e-01  1.461e-01  -2.010 0.044418 *  
## RESTYPERENTER  3.157e-01  1.516e-01   2.082 0.037358 *  
## GENDERM        9.920e-02  4.059e-02   2.444 0.014531 *  
## EMP_STA1,2     2.485e-01  1.225e-01   2.028 0.042571 *  
## EMP_STA3+      1.096e-01  1.476e-01   0.742 0.457864    
## RES_STA3+     -3.096e-01  9.375e-02  -3.303 0.000958 ***
## DELINQ         6.348e-02  1.685e-02   3.768 0.000165 ***
## NUMTR          2.489e-01  1.501e-02  16.584  < 2e-16 ***
## MRTGIU         4.267e-03  1.049e-01   0.041 0.967570    
## MRTGIY         5.035e-02  6.356e-02   0.792 0.428275    
## MFDU                  NA         NA      NA       NA    
## female                NA         NA      NA       NA    
## HOME                  NA         NA      NA       NA    
## CONDO                 NA         NA      NA       NA    
## COOP                  NA         NA      NA       NA    
## renter                NA         NA      NA       NA    
## emp1                  NA         NA      NA       NA    
## emp2                  NA         NA      NA       NA    
## msn                   NA         NA      NA       NA    
## cuscode        3.244e-06  6.725e-06   0.482 0.629483    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 18843  on 29870  degrees of freedom
## Residual deviance: 18239  on 29849  degrees of freedom
##   (33 observations deleted due to missingness)
## AIC: 18283
## 
## Number of Fisher Scoring iterations: 5

###confusion matrix
threshold=0.1
#threshold taken from the plot
predprop_3<-predict(model2,type = "response")
predicted_values_3<-ifelse(predict(model2,type = "response")>threshold,1,0)
actualvalue<-DirectMail.dataset$resp
#confmat3<-table(predicted_values_3,actualvalue)
#confmat3
#accuracy3<-(confmat3[1,1]+confmat3[2,2])/(sum(confmat3))
#accuracy3

Here Model cannot be built because CRED has Na’s.

Cleaning the data

Dealing with NA’S

create new dataset by removing CRED which is having na’s

mail<-DirectMail.dataset[,-2]
model3<-glm(resp~.,family = binomial(logit),data =mail)
summary(model3)

## 
## Call:
## glm(formula = resp ~ ., family = binomial(logit), data = mail)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2614  -0.4829  -0.4074  -0.3419   2.6816  
## 
## Coefficients: (9 not defined because of singularities)
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -2.364e+00  2.233e-01 -10.589  < 2e-16 ***
## AGE           -1.675e-03  3.029e-03  -0.553 0.580277    
## MSU           -6.623e-02  4.143e-02  -1.599 0.109909    
## MSX           -2.353e-02  1.163e-01  -0.202 0.839621    
## HEQ           -1.945e-05  5.618e-04  -0.035 0.972386    
## INCOME        -9.872e-03  1.397e-03  -7.065 1.61e-12 ***
## DEPCY          1.075e-01  4.063e-02   2.646 0.008135 ** 
## MOBY           4.552e-02  4.118e-02   1.105 0.268948    
## MILEAGE        1.084e-02  3.393e-03   3.194 0.001401 ** 
## RESTYPECOOP    1.793e-01  1.784e-01   1.005 0.314886    
## RESTYPEHOME   -2.936e-01  1.461e-01  -2.009 0.044578 *  
## RESTYPERENTER  3.143e-01  1.516e-01   2.073 0.038188 *  
## GENDERM        1.004e-01  4.058e-02   2.473 0.013385 *  
## EMP_STA1,2     2.480e-01  1.225e-01   2.024 0.042953 *  
## EMP_STA3+      1.101e-01  1.476e-01   0.746 0.455836    
## RES_STA3+     -3.131e-01  9.373e-02  -3.340 0.000838 ***
## DELINQ         6.320e-02  1.684e-02   3.753 0.000175 ***
## NUMTR          2.484e-01  1.500e-02  16.552  < 2e-16 ***
## MRTGIU         4.165e-04  1.049e-01   0.004 0.996832    
## MRTGIY         4.865e-02  6.355e-02   0.765 0.443989    
## MFDU                  NA         NA      NA       NA    
## female                NA         NA      NA       NA    
## HOME                  NA         NA      NA       NA    
## CONDO                 NA         NA      NA       NA    
## COOP                  NA         NA      NA       NA    
## renter                NA         NA      NA       NA    
## emp1                  NA         NA      NA       NA    
## emp2                  NA         NA      NA       NA    
## msn                   NA         NA      NA       NA    
## cuscode       -1.004e-05  5.397e-06  -1.860 0.062868 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 18850  on 29903  degrees of freedom
## Residual deviance: 18254  on 29883  degrees of freedom
## AIC: 18296
## 
## Number of Fisher Scoring iterations: 5

###confusion matrix
threshold=0.15
predprop_4<-predict(model2,type = "response")
predicted_values_4<-ifelse(predict(model3,type = "response")>threshold,1,0)
actualvalue<-DirectMail.dataset$resp
confmat4<-table(predicted_values_4,actualvalue)
confmat4

##                   actualvalue
## predicted_values_4     0     1
##                  0 24619  2279
##                  1  2428   578

accuracy4<-(confmat4[1,1]+confmat4[2,2])/(sum(confmat4))
accuracy4

## [1] 0.8425963

Usually,It is not advisable to remove the variable from dataset,we clean the NA’s in the variable and bulid the model again

CRED

Repalce all na’s in the variable with mean

mail2<-DirectMail.dataset[,2]

mail3<-ifelse(is.na(mail2),603.6,mail2)

mail4<-DirectMail.dataset
mail4$CRED=mail3
#mail4 is the new dataset with no nas

model4<-glm(resp~.,family = binomial(logit),data = mail4)
summary(model4)

## 
## Call:
## glm(formula = resp ~ ., family = binomial(logit), data = mail4)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2393  -0.4828  -0.4069  -0.3411   2.6829  
## 
## Coefficients: (9 not defined because of singularities)
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -1.856e+00  2.840e-01  -6.533 6.43e-11 ***
## AGE           -6.394e-03  3.441e-03  -1.859 0.063087 .  
## CRED          -7.302e-04  2.526e-04  -2.891 0.003840 ** 
## MSU           -6.500e-02  4.144e-02  -1.569 0.116706    
## MSX           -2.550e-02  1.163e-01  -0.219 0.826474    
## HEQ            4.812e-05  5.623e-04   0.086 0.931810    
## INCOME        -9.932e-03  1.398e-03  -7.104 1.22e-12 ***
## DEPCY          1.078e-01  4.064e-02   2.652 0.008004 ** 
## MOBY           4.548e-02  4.119e-02   1.104 0.269477    
## MILEAGE        1.084e-02  3.394e-03   3.194 0.001403 ** 
## RESTYPECOOP    1.800e-01  1.784e-01   1.009 0.313006    
## RESTYPEHOME   -2.952e-01  1.461e-01  -2.020 0.043397 *  
## RESTYPERENTER  3.137e-01  1.516e-01   2.069 0.038579 *  
## GENDERM        9.898e-02  4.059e-02   2.439 0.014747 *  
## EMP_STA1,2     2.471e-01  1.225e-01   2.017 0.043733 *  
## EMP_STA3+      1.095e-01  1.476e-01   0.741 0.458392    
## RES_STA3+     -3.119e-01  9.373e-02  -3.328 0.000874 ***
## DELINQ         6.338e-02  1.685e-02   3.762 0.000168 ***
## NUMTR          2.490e-01  1.501e-02  16.592  < 2e-16 ***
## MRTGIU         2.397e-03  1.049e-01   0.023 0.981779    
## MRTGIY         4.918e-02  6.356e-02   0.774 0.439068    
## MFDU                  NA         NA      NA       NA    
## female                NA         NA      NA       NA    
## HOME                  NA         NA      NA       NA    
## CONDO                 NA         NA      NA       NA    
## COOP                  NA         NA      NA       NA    
## renter                NA         NA      NA       NA    
## emp1                  NA         NA      NA       NA    
## emp2                  NA         NA      NA       NA    
## msn                   NA         NA      NA       NA    
## cuscode        9.771e-07  6.600e-06   0.148 0.882306    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 18850  on 29903  degrees of freedom
## Residual deviance: 18246  on 29882  degrees of freedom
## AIC: 18290
## 
## Number of Fisher Scoring iterations: 5

###confusion matrix
threshold=0.1125
predprop_5<-predict(model4,type = "response")
predicted_values_5<-ifelse(predict(model4,type = "response")>threshold,1,0)
actualvalue<-DirectMail.dataset$resp
confmat5<-table(predicted_values_5,actualvalue)
confmat5

##                   actualvalue
## predicted_values_5     0     1
##                  0 20055  1638
##                  1  6992  1219

accuracy5<-(confmat5[1,1]+confmat5[2,2])/(sum(confmat5))
accuracy5

## [1] 0.7114098

Dealing with Outliers

CRED
We see the value 1789 is oulier,since it is above 99%

quantile(mail4$CRED,c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95))

##  0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 95% 
## 300 472 563 586 605 617 628 642 661 686 725

HEQ
we see value 200 is outlier

quantile(mail4$HEQ,c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95))

##   0%  10%  20%  30%  40%  50%  60%  70%  80%  90%  95% 
##  0.1  0.1 10.0 10.0 30.0 30.0 30.0 50.0 50.0 70.0 90.0

Income
we see the value 110 is outlier

quantile(mail4$INCOME,c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.99))

##  0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 95% 99% 
##  20  20  30  30  40  40  40  50  50  60  70 110

Mileage
most of the values lies below 19.93

quantile(mail4$MILEAGE,c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.99))

##       0%      10%      20%      30%      40%      50%      60%      70% 
##  0.00000  3.07860  6.28400  9.34770 11.10320 12.53650 13.99100 15.38610 
##      80%      90%      95%      99% 
## 16.98900 18.58670 19.32485 19.93300

Removing Outliners

ncred

ncred<-ifelse(mail4$CRED==1789,603.6,mail4$CRED)
boxplot(ncred)

>HEQ

nheq<-ifelse(mail4$HEQ>=90,38.33,mail4$HEQ)
boxplot(nheq)

>Income

ninc<-ifelse(mail4$INCOME>=100,41.36,mail4$INCOME)
boxplot(ninc)

>Mileage

nmil<-ifelse(mail4$MILEAGE>=25,11.8,mail4$MILEAGE)
boxplot(nmil)

Rebuild the model after removing outliers

make a new mail5-dataset by replacing the oulierless variables in mail4.

mail5<-mail4
mail5$CRED<-ncred
mail5$HEQ<-nheq
mail5$INCOME<-ninc
mail5$MILEAGE<-nmil

buliding model

model5<-glm(resp~.,family = binomial(logit),data = mail5)
summary(model5)

## 
## Call:
## glm(formula = resp ~ ., family = binomial(logit), data = mail5)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2451  -0.4821  -0.4072  -0.3412   2.6950  
## 
## Coefficients: (9 not defined because of singularities)
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -1.855e+00  2.880e-01  -6.442 1.18e-10 ***
## AGE           -5.921e-03  3.397e-03  -1.743 0.081328 .  
## CRED          -6.668e-04  2.567e-04  -2.598 0.009383 ** 
## MSU           -6.591e-02  4.144e-02  -1.591 0.111703    
## MSX           -2.580e-02  1.163e-01  -0.222 0.824413    
## HEQ            3.953e-04  9.733e-04   0.406 0.684651    
## INCOME        -1.120e-02  1.590e-03  -7.043 1.88e-12 ***
## DEPCY          1.074e-01  4.064e-02   2.644 0.008202 ** 
## MOBY           4.548e-02  4.118e-02   1.104 0.269416    
## MILEAGE        1.008e-02  3.673e-03   2.745 0.006058 ** 
## RESTYPECOOP    1.823e-01  1.784e-01   1.022 0.306929    
## RESTYPEHOME   -2.954e-01  1.461e-01  -2.022 0.043196 *  
## RESTYPERENTER  3.125e-01  1.516e-01   2.061 0.039313 *  
## GENDERM        9.965e-02  4.059e-02   2.455 0.014080 *  
## EMP_STA1,2     2.468e-01  1.225e-01   2.014 0.043989 *  
## EMP_STA3+      1.096e-01  1.476e-01   0.743 0.457640    
## RES_STA3+     -3.130e-01  9.373e-02  -3.340 0.000839 ***
## DELINQ         6.301e-02  1.685e-02   3.740 0.000184 ***
## NUMTR          2.492e-01  1.501e-02  16.601  < 2e-16 ***
## MRTGIU         3.036e-03  1.049e-01   0.029 0.976918    
## MRTGIY         4.893e-02  6.355e-02   0.770 0.441276    
## MFDU                  NA         NA      NA       NA    
## female                NA         NA      NA       NA    
## HOME                  NA         NA      NA       NA    
## CONDO                 NA         NA      NA       NA    
## COOP                  NA         NA      NA       NA    
## renter                NA         NA      NA       NA    
## emp1                  NA         NA      NA       NA    
## emp2                  NA         NA      NA       NA    
## msn                   NA         NA      NA       NA    
## cuscode       -3.808e-07  6.562e-06  -0.058 0.953721    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 18850  on 29903  degrees of freedom
## Residual deviance: 18253  on 29882  degrees of freedom
## AIC: 18297
## 
## Number of Fisher Scoring iterations: 5

###confusion matrix
threshold=0.1125
predprop_6<-predict(model5,type = "response")
predicted_values_6<-ifelse(predict(model5,type = "response")>threshold,1,0)
actualvalue<-mail5$resp
confmat6<-table(actualvalue,predicted_values_6)
confmat6

##            predicted_values_6
## actualvalue     0     1
##           0 20077  6970
##           1  1650  1207

accuracy6<-(confmat6[1,1]+confmat6[2,2])/(sum(confmat6))
accuracy6

## [1] 0.7117442

sensitivity6<-confmat6[1,1]/(confmat6[1,1]+confmat6[1,2])
sensitivity6

## [1] 0.7423004

specificity6<-confmat6[2,2]/(confmat6[2,1]+confmat6[2,2])
specificity6

## [1] 0.4224711

Here we can see that the variables CRED,INCOME,DEPCY,MILEAGE,DELINQ,NUMTR are highly impacting from their Z values. In presence of categorical variables decision trees gives better reults

Building a decision tree

with rpart library

#include a library-rpart for decision tree
library(rpart)
mail5$resp<-factor(mail5$resp)
tree<-rpart(resp~., method = "class",data = mail5,control = rpart.control(minsplit = 30,cp=0.00001))
#We can check  the summary of the tree by -summary(tree)

#For a Shorter summary
tree$variable.importance

##    cuscode       CRED        AGE      NUMTR    MILEAGE    RESTYPE 
## 77.1665794 64.0314285 52.1785565 51.4437608 47.9210444 35.7394500 
##       HOME       MFDU     INCOME      MRTGI        HEQ         MS 
## 32.5709146 32.5709146 27.3671233 21.1772451 16.9198385 13.7326187 
##        msn       DEPC     DELINQ    EMP_STA     female     GENDER 
## 10.9162690  9.4188378  7.1516773  5.1097097  4.0702741  4.0702741 
##       COOP        MOB       emp1       emp2    RES_STA 
##  3.3501143  2.9477677  1.7250014  1.0397997  0.3546058

summary(tree$frame)

##       var            n                 wt               dev         
##  <leaf> :129   Min.   :   10.0   Min.   :   10.0   Min.   :   0.00  
##  cuscode: 28   1st Qu.:   33.0   1st Qu.:   33.0   1st Qu.:   8.00  
##  MILEAGE: 24   Median :  103.0   Median :  103.0   Median :  19.00  
##  CRED   : 22   Mean   :  877.3   Mean   :  877.3   Mean   :  96.81  
##  AGE    : 14   3rd Qu.:  450.0   3rd Qu.:  450.0   3rd Qu.:  67.00  
##  HEQ    :  7   Max.   :29904.0   Max.   :29904.0   Max.   :2857.00  
##  (Other): 33                                                        
##       yval         complexity           ncompete       nsurrogate  
##  Min.   :1.000   Min.   :0.000e+00   Min.   :0.000   Min.   :0.00  
##  1st Qu.:1.000   1st Qu.:0.000e+00   1st Qu.:0.000   1st Qu.:0.00  
##  Median :1.000   Median :1.000e-05   Median :0.000   Median :0.00  
##  Mean   :1.105   Mean   :9.121e-05   Mean   :1.992   Mean   :1.23  
##  3rd Qu.:1.000   3rd Qu.:1.615e-04   3rd Qu.:4.000   3rd Qu.:2.00  
##  Max.   :2.000   Max.   :3.000e-04   Max.   :4.000   Max.   :5.00  
##                                                                    
##       yval2.V1             yval2.V2             yval2.V3             yval2.V4             yval2.V5          yval2.nodeprob   
##  Min.   :1.0000000    Min.   :    2.000    Min.   :   0.0000    Min.   :0.2000000    Min.   :0.0000000    Min.   :0.0003344  
##  1st Qu.:1.0000000    1st Qu.:   24.000    1st Qu.:   8.0000    1st Qu.:0.7142857    1st Qu.:0.1149051    1st Qu.:0.0011035  
##  Median :1.0000000    Median :   90.000    Median :  19.0000    Median :0.8260870    Median :0.1739130    Median :0.0034444  
##  Mean   :1.1050584    Mean   :  780.179    Mean   :  97.1089    Mean   :0.7732910    Mean   :0.2267090    Mean   :0.0293368  
##  3rd Qu.:1.0000000    3rd Qu.:  403.000    3rd Qu.:  67.0000    3rd Qu.:0.8850949    3rd Qu.:0.2857143    3rd Qu.:0.0150482  
##  Max.   :2.0000000    Max.   :27047.000    Max.   :2857.0000    Max.   :1.0000000    Max.   :0.8000000    Max.   :1.0000000  
##

summary(tree$where)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00    5.00   42.00   60.89   86.00  257.00

summary(tree$call)

## Length  Class   Mode 
##      5   call   call

summary(tree$cptable)

##        CP                nsplit         rel error          xerror     
##  Min.   :0.0000100   Min.   :  0.00   Min.   :0.9748   Min.   :1.000  
##  1st Qu.:0.0000875   1st Qu.: 35.00   1st Qu.:0.9776   1st Qu.:1.031  
##  Median :0.0001556   Median : 68.00   Median :0.9814   Median :1.049  
##  Mean   :0.0001581   Mean   : 66.77   Mean   :0.9834   Mean   :1.050  
##  3rd Qu.:0.0002333   3rd Qu.: 97.00   3rd Qu.:0.9881   3rd Qu.:1.065  
##  Max.   :0.0003000   Max.   :128.00   Max.   :1.0000   Max.   :1.092  
##       xstd        
##  Min.   :0.01779  
##  1st Qu.:0.01804  
##  Median :0.01818  
##  Mean   :0.01818  
##  3rd Qu.:0.01830  
##  Max.   :0.01851

summary(tree$method)

##    Length     Class      Mode 
##         1 character character

Let’s use PARTY package to draw a decision tree

library("party")

## Warning: package 'party' was built under R version 3.1.3

## Loading required package: modeltools

## Warning: package 'modeltools' was built under R version 3.1.3

## Loading required package: stats4

## Loading required package: strucchange

## Warning: package 'strucchange' was built under R version 3.1.3

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## Loading required package: sandwich

## Warning: package 'sandwich' was built under R version 3.1.3

dec_tree<-ctree(resp ~., data = mail5,controls = ctree_control(mincriterion = 0.1))
summary(dec_tree)

##     Length      Class       Mode 
##          1 BinaryTree         S4

dec_tree

## 
##   Conditional inference tree with 105 terminal nodes
## 
## Response:  resp 
## Inputs:  AGE, CRED, MS, HEQ, INCOME, DEPC, MOB, MILEAGE, RESTYPE, GENDER, EMP_STA, RES_STA, DELINQ, NUMTR, MRTGI, MFDU, female, HOME, CONDO, COOP, renter, emp1, emp2, msn, cuscode 
## Number of observations:  29904 
## 
## 1) NUMTR <= 2; criterion = 1, statistic = 277.497
##   2) RESTYPE == {CONDO, HOME}; criterion = 1, statistic = 176.563
##     3) INCOME <= 30; criterion = 1, statistic = 116.148
##       4) INCOME <= 20; criterion = 0.969, statistic = 10.394
##         5) cuscode <= 20349; criterion = 0.671, statistic = 5.82
##           6) HEQ <= 30; criterion = 0.256, statistic = 3.741
##             7)*  weights = 1001 
##           6) HEQ > 30
##             8) NUMTR <= 0; criterion = 0.594, statistic = 5.359
##               9) MS == {U}; criterion = 0.431, statistic = 6.815
##                 10)*  weights = 128 
##               9) MS == {M, X}
##                 11)*  weights = 175 
##             8) NUMTR > 0
##               12) DEPC == {Y}; criterion = 0.186, statistic = 3.403
##                 13)*  weights = 69 
##               12) DEPC == {N}
##                 14) HEQ <= 50; criterion = 0.137, statistic = 3.141
##                   15)*  weights = 68 
##                 14) HEQ > 50
##                   16)*  weights = 34 
##         5) cuscode > 20349
##           17)*  weights = 548 
##       4) INCOME > 20
##         18) NUMTR <= 0; criterion = 0.515, statistic = 4.944
##           19) HEQ <= 38.33; criterion = 0.16, statistic = 3.63
##             20) msn <= 0; criterion = 0.576, statistic = 6.509
##               21) DEPC == {Y}; criterion = 0.803, statistic = 6.871
##                 22) DELINQ <= 3; criterion = 0.821, statistic = 7.07
##                   23) MFDU <= 0; criterion = 0.661, statistic = 5.758
##                     24)*  weights = 204 
##                   23) MFDU > 0
##                     25)*  weights = 10 
##                 22) DELINQ > 3
##                   26)*  weights = 8 
##               21) DEPC == {N}
##                 27)*  weights = 381 
##             20) msn > 0
##               28) CRED <= 398; criterion = 0.241, statistic = 4.407
##                 29) MOB == {N}; criterion = 0.328, statistic = 4.073
##                   30)*  weights = 18 
##                 29) MOB == {Y}
##                   31)*  weights = 37 
##               28) CRED > 398
##                 32)*  weights = 764 
##           19) HEQ > 38.33
##             33) MILEAGE <= 19.77; criterion = 0.146, statistic = 3.188
##               34) DEPC == {N}; criterion = 0.123, statistic = 3.059
##                 35)*  weights = 364 
##               34) DEPC == {Y}
##                 36) MS == {M, U}; criterion = 0.804, statistic = 9.493
##                   37)*  weights = 229 
##                 36) MS == {X}
##                   38)*  weights = 11 
##             33) MILEAGE > 19.77
##               39)*  weights = 9 
##         18) NUMTR > 0
##           40) GENDER == {F}; criterion = 0.722, statistic = 6.173
##             41) MS == {U}; criterion = 0.778, statistic = 9.215
##               42)*  weights = 173 
##             41) MS == {M, X}
##               43)*  weights = 277 
##           40) GENDER == {M}
##             44)*  weights = 592 
##     3) INCOME > 30
##       45) INCOME <= 50; criterion = 1, statistic = 19.063
##         46) cuscode <= 13621; criterion = 0.997, statistic = 14.948
##           47) NUMTR <= 1; criterion = 0.964, statistic = 10.14
##             48) CRED <= 336; criterion = 0.439, statistic = 4.576
##               49)*  weights = 81 
##             48) CRED > 336
##               50) emp2 <= 0; criterion = 0.208, statistic = 3.609
##                 51) HOME <= 0; criterion = 0.358, statistic = 4.207
##                   52)*  weights = 94 
##                 51) HOME > 0
##                   53) MILEAGE <= 13.409; criterion = 0.19, statistic = 3.838
##                     54)*  weights = 1563 
##                   53) MILEAGE > 13.409
##                     55)*  weights = 1189 
##               50) emp2 > 0
##                 56)*  weights = 179 
##           47) NUMTR > 1
##             57) CRED <= 578; criterion = 0.551, statistic = 5.127
##               58)*  weights = 148 
##             57) CRED > 578
##               59) RESTYPE == {CONDO}; criterion = 0.91, statistic = 8.396
##                 60)*  weights = 10 
##               59) RESTYPE == {HOME}
##                 61)*  weights = 232 
##         46) cuscode > 13621
##           62) MFDU <= 0; criterion = 0.983, statistic = 11.484
##             63) emp2 <= 0; criterion = 0.911, statistic = 10.162
##               64) msn <= 0; criterion = 0.103, statistic = 3.436
##                 65) EMP_STA == {0}; criterion = 0.884, statistic = 7.91
##                   66) NUMTR <= 1; criterion = 0.433, statistic = 6.289
##                     67) DELINQ <= 0; criterion = 0.101, statistic = 2.916
##                       68)*  weights = 27 
##                     67) DELINQ > 0
##                       69)*  weights = 23 
##                   66) NUMTR > 1
##                     70)*  weights = 12 
##                 65) EMP_STA == {1,2}
##                   71) DELINQ <= 1; criterion = 0.635, statistic = 5.599
##                     72) INCOME <= 41.36; criterion = 0.107, statistic = 2.956
##                       73) HEQ <= 10; criterion = 0.865, statistic = 7.616
##                         74) AGE <= 51; criterion = 0.159, statistic = 3.26
##                           75)*  weights = 36 
##                         74) AGE > 51
##                           76) GENDER == {F}; criterion = 0.259, statistic = 3.758
##                             77)*  weights = 84 
##                           76) GENDER == {M}
##                             78)*  weights = 117 
##                       73) HEQ > 10
##                         79)*  weights = 571 
##                     72) INCOME > 41.36
##                       80) HEQ <= 50; criterion = 0.146, statistic = 3.19
##                         81)*  weights = 492 
##                       80) HEQ > 50
##                         82) GENDER == {F}; criterion = 0.13, statistic = 3.096
##                           83)*  weights = 21 
##                         82) GENDER == {M}
##                           84)*  weights = 16 
##                   71) DELINQ > 1
##                     85)*  weights = 375 
##               64) msn > 0
##                 86)*  weights = 2512 
##             63) emp2 > 0
##               87)*  weights = 275 
##           62) MFDU > 0
##             88) MOB == {N}; criterion = 0.964, statistic = 10.12
##               89)*  weights = 54 
##             88) MOB == {Y}
##               90)*  weights = 95 
##       45) INCOME > 50
##         91) NUMTR <= 0; criterion = 0.111, statistic = 2.983
##           92) INCOME <= 60; criterion = 0.108, statistic = 4.241
##             93)*  weights = 858 
##           92) INCOME > 60
##             94) MFDU <= 0; criterion = 0.564, statistic = 5.197
##               95)*  weights = 332 
##             94) MFDU > 0
##               96)*  weights = 10 
##         91) NUMTR > 0
##           97)*  weights = 648 
##   2) RESTYPE == {COOP, RENTER}
##     98) NUMTR <= 0; criterion = 1, statistic = 49.019
##       99) DELINQ <= 0; criterion = 0.976, statistic = 10.893
##         100) cuscode <= 23396; criterion = 0.483, statistic = 4.784
##           101) DEPC == {N}; criterion = 0.3, statistic = 4.439
##             102) MOB == {Y}; criterion = 0.314, statistic = 4.01
##               103) MILEAGE <= 5.961; criterion = 0.571, statistic = 5.234
##                 104)*  weights = 266 
##               103) MILEAGE > 5.961
##                 105) cuscode <= 21419; criterion = 0.439, statistic = 4.576
##                   106)*  weights = 986 
##                 105) cuscode > 21419
##                   107) msn <= 0; criterion = 0.318, statistic = 4.602
##                     108) GENDER == {F}; criterion = 0.18, statistic = 3.37
##                       109)*  weights = 15 
##                     108) GENDER == {M}
##                       110)*  weights = 22 
##                   107) msn > 0
##                     111) AGE <= 62; criterion = 0.196, statistic = 3.452
##                       112)*  weights = 50 
##                     111) AGE > 62
##                       113)*  weights = 11 
##             102) MOB == {N}
##               114) MILEAGE <= 14.973; criterion = 0.836, statistic = 7.233
##                 115) RESTYPE == {COOP}; criterion = 0.145, statistic = 3.182
##                   116)*  weights = 52 
##                 115) RESTYPE == {RENTER}
##                   117)*  weights = 534 
##               114) MILEAGE > 14.973
##                 118) msn <= 0; criterion = 0.578, statistic = 5.864
##                   119)*  weights = 124 
##                 118) msn > 0
##                   120)*  weights = 139 
##           101) DEPC == {Y}
##             121)*  weights = 1423 
##         100) cuscode > 23396
##           122) CRED <= 574; criterion = 0.852, statistic = 7.441
##             123) emp1 <= 0; criterion = 0.33, statistic = 4.082
##               124)*  weights = 10 
##             123) emp1 > 0
##               125) cuscode <= 26013; criterion = 0.176, statistic = 3.352
##                 126) msn <= 0; criterion = 0.12, statistic = 3.388
##                   127)*  weights = 24 
##                 126) msn > 0
##                   128) AGE <= 67; criterion = 0.104, statistic = 2.94
##                     129)*  weights = 7 
##                   128) AGE > 67
##                     130)*  weights = 25 
##               125) cuscode > 26013
##                 131)*  weights = 9 
##           122) CRED > 574
##             132) RES_STA == {1,2}; criterion = 0.133, statistic = 4.184
##               133)*  weights = 846 
##             132) RES_STA == {3+}
##               134)*  weights = 52 
##       99) DELINQ > 0
##         135) INCOME <= 40; criterion = 0.379, statistic = 4.956
##           136) RES_STA == {1,2}; criterion = 0.478, statistic = 4.76
##             137) CRED <= 745; criterion = 0.102, statistic = 2.923
##               138)*  weights = 1878 
##             137) CRED > 745
##               139)*  weights = 51 
##           136) RES_STA == {3+}
##             140)*  weights = 143 
##         135) INCOME > 40
##           141)*  weights = 1055 
##     98) NUMTR > 0
##       142) AGE <= 71; criterion = 0.289, statistic = 3.897
##         143)*  weights = 3529 
##       142) AGE > 71
##         144) AGE <= 82; criterion = 0.663, statistic = 5.772
##           145) DELINQ <= 0; criterion = 0.14, statistic = 3.154
##             146) female <= 0; criterion = 0.473, statistic = 4.736
##               147)*  weights = 76 
##             146) female > 0
##               148)*  weights = 70 
##           145) DELINQ > 0
##             149)*  weights = 97 
##         144) AGE > 82
##           150) MOB == {Y}; criterion = 0.312, statistic = 3.999
##             151) NUMTR <= 1; criterion = 0.801, statistic = 6.852
##               152)*  weights = 36 
##             151) NUMTR > 1
##               153)*  weights = 22 
##           150) MOB == {N}
##             154) msn <= 0; criterion = 0.513, statistic = 5.474
##               155)*  weights = 16 
##             154) msn > 0
##               156)*  weights = 25 
## 1) NUMTR > 2
##   157) MFDU <= 0; criterion = 1, statistic = 24.277
##     158) INCOME <= 30; criterion = 1, statistic = 26.157
##       159) DEPC == {N}; criterion = 0.454, statistic = 4.647
##         160) HEQ <= 30; criterion = 0.473, statistic = 4.739
##           161) EMP_STA == {3+}; criterion = 0.518, statistic = 7.299
##             162)*  weights = 13 
##           161) EMP_STA == {0, 1,2}
##             163)*  weights = 190 
##         160) HEQ > 30
##           164) INCOME <= 20; criterion = 0.865, statistic = 7.619
##             165)*  weights = 57 
##           164) INCOME > 20
##             166)*  weights = 106 
##       159) DEPC == {Y}
##         167) female <= 0; criterion = 0.392, statistic = 4.362
##           168) cuscode <= 17714; criterion = 0.327, statistic = 4.068
##             169) INCOME <= 20; criterion = 0.447, statistic = 4.616
##               170)*  weights = 25 
##             169) INCOME > 20
##               171)*  weights = 44 
##           168) cuscode > 17714
##             172)*  weights = 46 
##         167) female > 0
##           173) MRTGI == {U, Y}; criterion = 0.262, statistic = 5.909
##             174)*  weights = 57 
##           173) MRTGI == {N}
##             175)*  weights = 22 
##     158) INCOME > 30
##       176) GENDER == {F}; criterion = 0.488, statistic = 5.051
##         177) INCOME <= 50; criterion = 0.841, statistic = 7.304
##           178) cuscode <= 23443; criterion = 0.371, statistic = 4.264
##             179) MOB == {Y}; criterion = 0.148, statistic = 3.2
##               180)*  weights = 170 
##             179) MOB == {N}
##               181) INCOME <= 40; criterion = 0.149, statistic = 3.209
##                 182) NUMTR <= 3; criterion = 0.462, statistic = 4.685
##                   183) DELINQ <= 0; criterion = 0.144, statistic = 3.179
##                     184)*  weights = 18 
##                   183) DELINQ > 0
##                     185)*  weights = 16 
##                 182) NUMTR > 3
##                   186)*  weights = 36 
##               181) INCOME > 40
##                 187)*  weights = 47 
##           178) cuscode > 23443
##             188) AGE <= 69; criterion = 0.182, statistic = 3.383
##               189)*  weights = 53 
##             188) AGE > 69
##               190) DELINQ <= 1; criterion = 0.366, statistic = 4.245
##                 191)*  weights = 31 
##               190) DELINQ > 1
##                 192)*  weights = 11 
##         177) INCOME > 50
##           193) MILEAGE <= 5.617; criterion = 0.295, statistic = 3.92
##             194)*  weights = 14 
##           193) MILEAGE > 5.617
##             195)*  weights = 82 
##       176) GENDER == {M}
##         196) msn <= 0; criterion = 0.884, statistic = 8.311
##           197)*  weights = 279 
##         196) msn > 0
##           198)*  weights = 394 
##   157) MFDU > 0
##     199) AGE <= 56; criterion = 0.998, statistic = 15.829
##       200)*  weights = 939 
##     199) AGE > 56
##       201) NUMTR <= 3; criterion = 0.763, statistic = 6.502
##         202)*  weights = 314 
##       201) NUMTR > 3
##         203) HEQ <= 38.33; criterion = 0.662, statistic = 5.763
##           204)*  weights = 133 
##         203) HEQ > 38.33
##           205) GENDER == {M}; criterion = 0.297, statistic = 3.933
##             206)*  weights = 35 
##           205) GENDER == {F}
##             207) INCOME <= 30; criterion = 0.18, statistic = 3.37
##               208)*  weights = 12 
##             207) INCOME > 30
##               209)*  weights = 13

#We can see the plotted decision tree using plot(dec_tree),the parameter for function-ctree(),can be varied suitably for our Dataset
#plot(dec_tree)
predict_dectree<-predict(dec_tree)
confmat8<-table(actualvalue,predict_dectree)
confmat8

##            predict_dectree
## actualvalue     0     1
##           0 27031    16
##           1  2835    22

accuracy8<-(confmat8[1,1]+confmat8[2,2])/(sum(confmat8))
accuracy8

## [1] 0.9046616

sensitivity8<-confmat8[1,1]/(confmat8[1,2]+confmat8[1,1])
sensitivity8

## [1] 0.9994084

specificity8<-confmat8[2,2]/(confmat8[2,1]+confmat8[2,2])
specificity8

## [1] 0.007700385

Unbalanced dataset handling

Here the dataset is Unbalanced i.e Response is very much skewed towards 0’s and models we are built are with good accuracy but with poor specificity and sensitivity.These Unbalanced data need to be dealt carefully.

Direct Mail Marketing