You can download the datasets .

Content

Title: “German credit”
Abstract:
Source Information
Data File for this case :
data importing:
data information
Variable1 = “Creditability”
variable2=“Account.Balance”
variable3=“Duration.of.Credit..month”
- Univariate Analysis of duration in month
variable4=“Payment.Status.of.Previous.Credit”
variable5=“purpose”
variable6=“Credit.Amount”
- Univariate Analysis of Credit.Amount
variable7=“Value.Savings.Stocks”
variable8 = “Length.of.current.employment”
variable9 =“Instalment.per.cent”
variable10 =“Sex…Marital.Status”
variable11 = “Guarantors”
variables12 = “Duration.in.Current.address”
variable13=“Most.valuable.available.asset”
variable14 = “Age..years.”
Univariate Analysis of Age..years.
variable15=“Concurrent.Credits”
variable16 = “Type.of.apartment”
variable17= “No.of.Credits.at.this.Bank”
variable18 =“Occupation”
variable19 =“No.of.dependents”
variable20 =“Telephone”
variable21= “Foreign.Worker”
Model Building
predictors – Qualitative chi-square p -value
predicators – numerical data
Building logistic Regression
classification table
Logistic regression Accuracy
Multicollinearity
Individual Impact of Variables
AIC and BIC
Model Selction
Duration.in.Current.address,Occupation and No.of.dependents are the least impacting variable, lets drop these variables and re build the model
Building Final logistic Regression
Logistic Regression Accuracy
AIC and BIC
decision tree
accuracy
outlier treatment for creditamount
Logistic model (credit amount replace with credit_amount_new)
Accuracy
logistic model1 builds with a with out outlier treatment(credit.amount) and it’s accuracy is 0.77. logistic model3 builds with a outlier treatment(credit_amount_new)and it’s accuracy is 0.77.
Changing Threshold
AIC and BIC
Decision tree(Credit.Amount replace with credit_amount_new )
Prediction using the model
german_tree builds with a with out outlier treatment and its accuracy is 0.76.german_tree1 builds with a outlier treatment and it’s accuracy is 0.76. there is no difference in accuracy with out outlier treatment/with outlier treatment.
Training and Validation data
Overfitting
Model on training data
Validation accuracy
It performs well on training data not in test data.
ROC and AUC
ROC and AUC on decision
k-fold Cross Validation building
K=10
Building the models on K-fold samples
K_fold_tree$finalModel
confusion matrix
Bootstrap
Tree model on boots straped data
conclusion:
- Confusion Matrix and Statistics
If we the change thershold value to increase the sensitivity and specificity.
real accuracy of the whole data is 76%.

Title: “German credit”

Abstract:

The objective is to predict the credit ratings of customers

Source Information

Professor Dr. Hans Hofmann
Institut f“ur Statistik und”Okonometrie
Universit“at Hamburg
FB Wirtschaftswissenschaften
Von-Melle-Park 5
2000 Hamburg 13

Data File for this case :

German Credit data – german_credit.csv

The German Credit Data contains data on 20 variables and the classification whether an applicant is considered a Good or a Bad credit risk for 1000 loan applicants.

data importing:

german_credit<-read.csv("C:\Amrita\Datavedi\Case Study\german_credit.csv")
names(german_credit)

##  [1] "Creditability"                    
##  [2] "Account.Balance"                  
##  [3] "Duration.of.Credit..month."       
##  [4] "Payment.Status.of.Previous.Credit"
##  [5] "Purpose"                          
##  [6] "Credit.Amount"                    
##  [7] "Value.Savings.Stocks"             
##  [8] "Length.of.current.employment"     
##  [9] "Instalment.per.cent"              
## [10] "Sex...Marital.Status"             
## [11] "Guarantors"                       
## [12] "Duration.in.Current.address"      
## [13] "Most.valuable.available.asset"    
## [14] "Age..years."                      
## [15] "Concurrent.Credits"               
## [16] "Type.of.apartment"                
## [17] "No.of.Credits.at.this.Bank"       
## [18] "Occupation"                       
## [19] "No.of.dependents"                 
## [20] "Telephone"                        
## [21] "Foreign.Worker"

dim(german_credit)

## [1] 1000   21

x<-sum(is.na(german_credit))
x

## [1] 0

data information

Variable1 = “Creditability”

It is a Binary data ,which can take only two possible values.The two values in a binary variable, numerically as 0 and 1

0: No, 1:Yes

It is a type of categorical data, which more generally represents experiments with a fixed number of possible outcomes.

This is the target/response variable

variable2=“Account.Balance”

It contains qualitative data. there are four categories.

1 : … < 0 DM

2 : 0 <= … < 200 DM

3 : … >= 200 DM / salary assignments for at least 1 year

4 : no checking account

DM-Deutsche mark.The basic unit of money in Germany.

Account.Balance contains qualitative data.Central tendencies ,dispersion does not make any sense.frequency table,mode and barplot are calculated for qualitative data.mode gives the maximum value of status of Account.Balance.

frequency table of Account.Balance

tab<-table(german_credit$Account.Balance)
tab

## 
##   1   2   3   4 
## 274 269  63 394

names(tab)

## [1] "1" "2" "3" "4"

x<-sum(is.na(german_credit$Account.Balance))
x

## [1] 0

1 -stands for zero balance, 2 -stands for below 200 balance, 3 -stands for above 200 balance, 4 -stands for no checking accounts.

mode of Account.Balance.It gives the maximum value.

temp <- table(as.vector(german_credit$Account.Balance))
names(temp)[temp == max(temp)]

## [1] "4"

mode of Status of Account.Balance is 4.

ggplot of Account.Balance

library("ggplot2")

## Warning: package 'ggplot2' was built under R version 3.1.3

qplot(data<-german_credit$Account.Balance,main="Account.Balance", ylab="German_currency-Dm", colour= I("purple"),size=I(5))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

output description:

1 stands for 274 people have zero balance.

2 stands for 269 people have below 200 DM balance.

3 stands for 63 people have above 200 DM balance.

4 stands for 394 people have no checking account.

correlation between Account.Balance and creditability

library("ltm")

## Warning: package 'ltm' was built under R version 3.1.3

## Loading required package: MASS

## Loading required package: msm

## Warning: package 'msm' was built under R version 3.1.3

## Loading required package: polycor

## Warning: package 'polycor' was built under R version 3.1.3

## Loading required package: mvtnorm

## Warning: package 'mvtnorm' was built under R version 3.1.3

## Loading required package: sfsmisc

## Warning: package 'sfsmisc' was built under R version 3.1.3

 biserial.cor(german_credit$Account.Balance,german_credit$Creditability)

## [1] -0.350672

library(vcd)

## Warning: package 'vcd' was built under R version 3.1.3

## Loading required package: grid

contin_table<-table(german_credit$Account.Balance,german_credit$Creditability)
contin_table

##    
##       0   1
##   1 135 139
##   2 105 164
##   3  14  49
##   4  46 348

assocstats(contin_table)

##                     X^2 df P(> X^2)
## Likelihood Ratio 131.34  3        0
## Pearson          123.72  3        0
## 
## Phi-Coefficient   : NA 
## Contingency Coeff.: 0.332 
## Cramer's V        : 0.352

correlation is -0.35. Account.Balance and creditability are negatively correlated.

crosstable of Account.Balance and creditability

library("gmodels")

## Warning: package 'gmodels' was built under R version 3.1.3

CrossTable(german_credit$Creditability, german_credit$Account.Balance, digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1000 
## 
##  
##                             | german_credit$Account.Balance 
## german_credit$Creditability |         1 |         2 |         3 |         4 | Row Total | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
##                           0 |       135 |       105 |        14 |        46 |       300 | 
##                             |       0.5 |       0.4 |       0.2 |       0.1 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
##                           1 |       139 |       164 |        49 |       348 |       700 | 
##                             |       0.5 |       0.6 |       0.8 |       0.9 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
##                Column Total |       274 |       269 |        63 |       394 |      1000 | 
##                             |       0.3 |       0.3 |       0.1 |       0.4 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  123.7209     d.f. =  3     p =  1.218902e-26 
## 
## 
##

variable3=“Duration.of.Credit..month”

It is a Numerical data.

head(german_credit$Duration.of.Credit..month.)

## [1] 18  9 12 12 12 10

Univariate Analysis of duration in month

Central tendencies of duration in month

mean of duration in month

mean(german_credit$Duration.of.Credit..month.)

## [1] 20.903

median of duration in month

median(german_credit$Duration.of.Credit..month.)

## [1] 18

Dispersion of duration in month Variance of duration in month

var(german_credit$Duration.of.Credit..month.)

## [1] 145.415

Standard deviation of duration in month

sd(german_credit$Duration.of.Credit..month.)

## [1] 12.05881

summary gives four quartiles of duration in month

summary(german_credit$Duration.of.Credit..month.)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     4.0    12.0    18.0    20.9    24.0    72.0

boxplot of duration of credit month

quantile(german_credit$Duration.of.Credit..month.)

##   0%  25%  50%  75% 100% 
##    4   12   18   24   72

quantile(german_credit$Duration.of.Credit..month.,c(0.75,0.80,0.90,1))

##  75%  80%  90% 100% 
##   24   30   36   72

boxplot(german_credit$Duration.of.Credit..month.)

output description

In this boxplot the minimum is 4 , maximum is 72, and median is 18. first quartile is 12,third quartile is 24.

histogram of Duration.of.Credit..month.

hist(german_credit$Duration.of.Credit..month.)

correlation between Duration.of.Credit..month. and response

library("ltm")
biserial.cor(german_credit$Duration.of.Credit..month.,german_credit$Creditability)

## [1] 0.2148192

correlation is 0.21.Duration.of.Credit..month. and Creditability positively correlated. #t-test

t.test(german_credit$Duration.of.Credit..month.)

## 
##  One Sample t-test
## 
## data:  german_credit$Duration.of.Credit..month.
## t = 54.8156, df = 999, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  20.15469 21.65131
## sample estimates:
## mean of x 
##    20.903

variable4=“Payment.Status.of.Previous.Credit”

It is a Categorical data.It contains 5 categories.

0: no credits taken

1: all credits at this bank paid back duly

2: existing credits paid back duly till now

3: delay in paying off in the past

4: critical account

Payment.Status.of.Previous.Credit contains qualitative data.Central tendencies ,dispersion does not make any sense.frequency table,mode and barplot are calculated for qualitative data.mode gives the maximum value of status of Account.Balance.

frequency table of Payment.Status.of.Previous.Credit

tab<-table(german_credit$Payment.Status.of.Previous.Credit)
tab

## 
##   0   1   2   3   4 
##  40  49 530  88 293

names(tab)

## [1] "0" "1" "2" "3" "4"

0: no credits taken

1: all credits at this bank paid back duly

2: existing credits paid back duly till now

3: delay in paying off in the past

4: critical account

mode of Payment.Status.of.Previous.Credit It gives the maximum value.

temp <- table(as.vector(german_credit$Payment.Status.of.Previous.Credit))
names(temp)[temp == max(temp)]

## [1] "2"

mode of Status of Payment.Status.of.Previous.Credit.=2

ggplot of Payment.Status.of.Previous.Credit

library("ggplot2")
qplot(data<-german_credit$Payment.Status.of.Previous.Credit,main="Payment.Status.of.Previous.Credit", ylab="customers", colour= I("purple"),size=I(5))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

output description:

0 stands for 40 customers have no credits taken/all credits paid back duly.

1 stands for 49 customers are all credits at this bank paid back duly.

2 stands for 530 customers have existing credits paid back duly till now.

3 stands for 88 customers are delay in paying off in the past.

4 stands for 293 customers have critical account/other credits existing (not at this bank)

correlation between Payment.Status.of.Previous.Credit and creditability

library("ltm")
 biserial.cor(german_credit$Payment.Status.of.Previous.Credit,german_credit$Creditability)

## [1] -0.2286703

library(vcd)
contin_table<-table(german_credit$Payment.Status.of.Previous.Credit,german_credit$Creditability)
contin_table

##    
##       0   1
##   0  25  15
##   1  28  21
##   2 169 361
##   3  28  60
##   4  50 243

assocstats(contin_table)

##                     X^2 df   P(> X^2)
## Likelihood Ratio 60.467  4 2.3139e-12
## Pearson          61.691  4 1.2792e-12
## 
## Phi-Coefficient   : NA 
## Contingency Coeff.: 0.241 
## Cramer's V        : 0.248

correlation is -0.22. Payment.Status.of.Previous.Credit and creditability are negatively correlated.

library("gmodels")
CrossTable(german_credit$Creditability, german_credit$Payment.Status.of.Previous.Credit, digits=1,prop.r=F, prop.t=F, prop.chisq=F, chisq=T)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1000 
## 
##  
##                             | german_credit$Payment.Status.of.Previous.Credit 
## german_credit$Creditability |         0 |         1 |         2 |         3 |         4 | Row Total | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
##                           0 |        25 |        28 |       169 |        28 |        50 |       300 | 
##                             |       0.6 |       0.6 |       0.3 |       0.3 |       0.2 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
##                           1 |        15 |        21 |       361 |        60 |       243 |       700 | 
##                             |       0.4 |       0.4 |       0.7 |       0.7 |       0.8 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
##                Column Total |        40 |        49 |       530 |        88 |       293 |      1000 | 
##                             |       0.0 |       0.0 |       0.5 |       0.1 |       0.3 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  61.6914     d.f. =  4     p =  1.279187e-12 
## 
## 
##

variable5=“purpose”

purpose is a qualitative data.It contains 11 categories. 0 : car (new)

1 : car (used)

2 : furniture/equipment

3 : radio/television

4 : domestic appliances

5 : repairs

6 : education

7 : (vacation – does not exist?)

8 : retraining

9 : business

10 : others

purpose contains qualitative data.Central tendencies ,dispersion does not make any sense.frequency table,mode and barplot are calculated for qualitative data.mode gives the maximum value of purpose

frequency table of purpose

tab<-table(german_credit$Purpose)
tab

## 
##   0   1   2   3   4   5   6   8   9  10 
## 234 103 181 280  12  22  50   9  97  12

names(tab)

##  [1] "0"  "1"  "2"  "3"  "4"  "5"  "6"  "8"  "9"  "10"

0 : car (new)

1 : car (used)

2 : furniture/equipment

3 : radio/television

4 : domestic appliances

5 : repairs

6 : education

7 : (vacation – does not exist?)

8 : retraining

9 : business

10 : others

mode of purpose.It gives the maximum value.

temp <- table(as.vector(german_credit$Purpose))
names(temp)[temp == max(temp)]

## [1] "3"

mode of Status of Account.Balance is 3.

ggplot of purpose

library("ggplot2", lib.loc="~/R/win-library/3.3")
qplot(data<-german_credit$Purpose,main="purpose", ylab="customers", colour= I("purple"),size=I(5))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

correlation between purpose and creditability

library("ltm", lib.loc="~/R/win-library/3.3")
 biserial.cor(german_credit$Purpose,german_credit$Creditability)

## [1] 0.01796988

library(vcd)
contin_table<-table(german_credit$Purpose,german_credit$Creditability)
contin_table

##     
##        0   1
##   0   89 145
##   1   17  86
##   2   58 123
##   3   62 218
##   4    4   8
##   5    8  14
##   6   22  28
##   8    1   8
##   9   34  63
##   10   5   7

assocstats(contin_table)

##                     X^2 df   P(> X^2)
## Likelihood Ratio 34.510  9 7.2688e-05
## Pearson          33.356  9 1.1575e-04
## 
## Phi-Coefficient   : NA 
## Contingency Coeff.: 0.18 
## Cramer's V        : 0.183

correlation is 0.017. purpose and creditability are positivevely correlated.

crosstable of purpose and creditability

library("gmodels", lib.loc="~/R/win-library/3.3")
CrossTable(german_credit$Creditability, german_credit$Purpose, digits=1,prop.r=F, prop.t=F, prop.chisq=F, chisq=T)

## Warning in chisq.test(t, correct = FALSE, ...): Chi-squared approximation
## may be incorrect

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1000 
## 
##  
##                             | german_credit$Purpose 
## german_credit$Creditability |         0 |         1 |         2 |         3 |         4 |         5 |         6 |         8 |         9 |        10 | Row Total | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                           0 |        89 |        17 |        58 |        62 |         4 |         8 |        22 |         1 |        34 |         5 |       300 | 
##                             |       0.4 |       0.2 |       0.3 |       0.2 |       0.3 |       0.4 |       0.4 |       0.1 |       0.4 |       0.4 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                           1 |       145 |        86 |       123 |       218 |         8 |        14 |        28 |         8 |        63 |         7 |       700 | 
##                             |       0.6 |       0.8 |       0.7 |       0.8 |       0.7 |       0.6 |       0.6 |       0.9 |       0.6 |       0.6 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                Column Total |       234 |       103 |       181 |       280 |        12 |        22 |        50 |         9 |        97 |        12 |      1000 | 
##                             |       0.2 |       0.1 |       0.2 |       0.3 |       0.0 |       0.0 |       0.0 |       0.0 |       0.1 |       0.0 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  33.35645     d.f. =  9     p =  0.0001157491 
## 
## 
##

variable6=“Credit.Amount”

It is a numerical data.

head(german_credit$Credit.Amount)

## [1] 1049 2799  841 2122 2171 2241

Univariate Analysis of Credit.Amount

Central tendencies of Credit.Amount

mean of Credit.Amount

mean(german_credit$Credit.Amount)

## [1] 3271.248

median of Credit.Amount

median(german_credit$Credit.Amount)

## [1] 2319.5

Dispersion of Credit.Amount Variance of Credit.Amount

var(german_credit$Credit.Amount)

## [1] 7967927

Standard deviation of Credit.Amount

sd(german_credit$Credit.Amount)

## [1] 2822.752

summary gives four quartiles of Credit.Amount

summary(german_credit$Credit.Amount)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     250    1366    2320    3271    3972   18420

hist(german_credit$Credit.Amount)

boxplot of Credit.Amount

quantile(german_credit$Credit.Amount)

##       0%      25%      50%      75%     100% 
##   250.00  1365.50  2319.50  3972.25 18424.00

quantile(german_credit$Credit.Amount,c(0.75,0.80,0.90,1))

##      75%      80%      90%     100% 
##  3972.25  4720.00  7179.40 18424.00

boxplot(german_credit$Credit.Amount)

output description Notethat outliers are discussed later.

histogram of Credit.Amount

hist(german_credit$Credit.Amount)

correlation between Credit.Amount and response

library("ltm", lib.loc="~/R/win-library/3.3")
biserial.cor(german_credit$Credit.Amount,german_credit$Creditability)

## [1] 0.1546628

correlation is 0.15. Credit.Amount and Creditability positively correlated. #t-test

t.test(german_credit$Credit.Amount)

## 
##  One Sample t-test
## 
## data:  german_credit$Credit.Amount
## t = 36.6472, df = 999, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  3096.083 3446.413
## sample estimates:
## mean of x 
##  3271.248

variable7=“Value.Savings.Stocks”

Average balance in savings account

Average balance in savings account is a qualitative data.It contains 5 categories.

1 : < 100 DM

2 : 100<= … < 500 DM

3 : 500<= … < 1000 DM

4 : =>1000 DM

5 : unknown/ no savings account

DM-Deutsche mark.The basic unit of money in Germany.

Average balance in savings account contains qualitative data.Central tendencies ,dispersion does not make any sense.frequency table,mode and barplot are calculated for qualitative data.mode gives the maximum value of Average balance in savings account

frequency table of Average balance in savings account

tab<-table(german_credit$Value.Savings.Stocks)
tab

## 
##   1   2   3   4   5 
## 603 103  63  48 183

names(tab)

## [1] "1" "2" "3" "4" "5"

1 : < 100 DM

2 : 100<= … < 500 DM

3 : 500<= … < 1000 DM

4 : =>1000 DM

5 : unknown/ no savings account

mode of Value.Savings.Stocks.It gives the maximum value.

temp <- table(as.vector(german_credit$Value.Savings.Stocks))
names(temp)[temp == max(temp)]

## [1] "1"

mode of Value.Savings.Stocks is 1.

ggplot of Value.Savings.Stocks

library("ggplot2", lib.loc="~/R/win-library/3.3")
qplot(data<-german_credit$Value.Savings.Stocks,main="Value.Savings.Stocks", ylab="German_currency-Dm", colour= I("purple"),size=I(5))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

output description: 1 stands for 603 people have below 100 DM balance.

2 stands for 103 people have below 500 DM balance.

3 stands for 63 people have below 1000 DM balance.

4 stands for 48 people have above 1000 DM balance.

5 stands for 183 people have no checking account.

correlation between Value.Savings.Stocks and creditability

library("ltm", lib.loc="~/R/win-library/3.3")
 biserial.cor(german_credit$Value.Savings.Stocks,german_credit$Creditability)

## [1] -0.1788532

library(vcd)
contin_table<-table(german_credit$Value.Savings.Stocks,german_credit$Creditability)
contin_table

##    
##       0   1
##   1 217 386
##   2  34  69
##   3  11  52
##   4   6  42
##   5  32 151

assocstats(contin_table)

##                     X^2 df   P(> X^2)
## Likelihood Ratio 38.975  4 7.0491e-08
## Pearson          36.099  4 2.7612e-07
## 
## Phi-Coefficient   : NA 
## Contingency Coeff.: 0.187 
## Cramer's V        : 0.19

correlation is -0.17. Value.Savings.Stocks and creditability are negatively correlated.

crosstable of Account.Balance and creditability

library("gmodels", lib.loc="~/R/win-library/3.3")
CrossTable(german_credit$Creditability, german_credit$Value.Savings.Stocks, digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1000 
## 
##  
##                             | german_credit$Value.Savings.Stocks 
## german_credit$Creditability |         1 |         2 |         3 |         4 |         5 | Row Total | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
##                           0 |       217 |        34 |        11 |         6 |        32 |       300 | 
##                             |       0.4 |       0.3 |       0.2 |       0.1 |       0.2 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
##                           1 |       386 |        69 |        52 |        42 |       151 |       700 | 
##                             |       0.6 |       0.7 |       0.8 |       0.9 |       0.8 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
##                Column Total |       603 |       103 |        63 |        48 |       183 |      1000 | 
##                             |       0.6 |       0.1 |       0.1 |       0.0 |       0.2 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  36.09893     d.f. =  4     p =  2.761214e-07 
## 
## 
##

variable8 = “Length.of.current.employment”

It is a qualitative data.It has 5 categories.

1 : unemployed

2: < 1 year

3 : 1 <= … < 4 years

4 : 4 <=… < 7 years

4 : >= 7 years

Length.of.current.employment contains qualitative data.Central tendencies ,dispersion does not make any sense.frequency table,mode and barplot are calculated for qualitative data.mode gives the maximum value of Length.of.current.employment

frequency table of Length.of.current.employment

tab<-table(german_credit$Length.of.current.employment)
tab

## 
##   1   2   3   4   5 
##  62 172 339 174 253

names(tab)

## [1] "1" "2" "3" "4" "5"

1 : unemployed

2: < 1 year

3 : 1 <= … < 4 years

4 : 4 <=… < 7 years

4 : >= 7 years

mode of Length.of.current.employment.It gives the maximum value.

temp <- table(as.vector(german_credit$Length.of.current.employment))
names(temp)[temp == max(temp)]

## [1] "3"

mode of Status of Length.of.current.employment is 3.

ggplot of Length.of.current.employment

library("ggplot2", lib.loc="~/R/win-library/3.3")
qplot(data<-german_credit$Length.of.current.employment,main="Length.of.current.employment", ylab="employees", colour= I("purple"),size=I(5))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

correlation between Length.of.current.employment and creditability

library("ltm", lib.loc="~/R/win-library/3.3")
 biserial.cor(german_credit$Length.of.current.employment,german_credit$Creditability)

## [1] -0.115944

library(vcd)
contin_table<-table(german_credit$Length.of.current.employment,german_credit$Creditability)
contin_table

##    
##       0   1
##   1  23  39
##   2  70 102
##   3 104 235
##   4  39 135
##   5  64 189

assocstats(contin_table)

##                     X^2 df  P(> X^2)
## Likelihood Ratio 18.164  4 0.0011464
## Pearson          18.368  4 0.0010455
## 
## Phi-Coefficient   : NA 
## Contingency Coeff.: 0.134 
## Cramer's V        : 0.136

correlation is -0.11. Length.of.current.employment and creditability are negatively correlated.

crosstable of Length.of.current.employment and creditability

library("gmodels", lib.loc="~/R/win-library/3.3")
CrossTable(german_credit$Creditability, german_credit$Length.of.current.employment, digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1000 
## 
##  
##                             | german_credit$Length.of.current.employment 
## german_credit$Creditability |         1 |         2 |         3 |         4 |         5 | Row Total | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
##                           0 |        23 |        70 |       104 |        39 |        64 |       300 | 
##                             |       0.4 |       0.4 |       0.3 |       0.2 |       0.3 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
##                           1 |        39 |       102 |       235 |       135 |       189 |       700 | 
##                             |       0.6 |       0.6 |       0.7 |       0.8 |       0.7 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
##                Column Total |        62 |       172 |       339 |       174 |       253 |      1000 | 
##                             |       0.1 |       0.2 |       0.3 |       0.2 |       0.3 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  18.36827     d.f. =  4     p =  0.001045452 
## 
## 
##

variable9 =“Instalment.per.cent”

Installment rate as % of disposable income.It is a qualitative data.It has a 4 categories.

frequency table of Instalment.per.cent

tab<-table(german_credit$Instalment.per.cent)
tab

## 
##   1   2   3   4 
## 136 231 157 476

names(tab)

## [1] "1" "2" "3" "4"

mode of Instalment.per.cent.It gives the maximum value.

temp <- table(as.vector(german_credit$Instalment.per.cent))
names(temp)[temp == max(temp)]

## [1] "4"

mode of Status of Instalment.per.cent is 4.

ggplot of Instalment.per.cent

library("ggplot2", lib.loc="~/R/win-library/3.3")
qplot(data<-german_credit$Instalment.per.cent,main="Instalment.per.cent", ylab="customers", colour= I("purple"),size=I(5))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

correlation between Instalment.per.cent and creditability

library("ltm", lib.loc="~/R/win-library/3.3")
 biserial.cor(german_credit$Instalment.per.cent,german_credit$Creditability)

## [1] 0.07236773

library(vcd)
contin_table<-table(german_credit$Instalment.per.cent,german_credit$Creditability)
contin_table

##    
##       0   1
##   1  34 102
##   2  62 169
##   3  45 112
##   4 159 317

assocstats(contin_table)

##                     X^2 df P(> X^2)
## Likelihood Ratio 5.5065  3  0.13825
## Pearson          5.4768  3  0.14003
## 
## Phi-Coefficient   : NA 
## Contingency Coeff.: 0.074 
## Cramer's V        : 0.074

correlation is 0.072. Instalment.per.cent and creditability are positively correlated.

crosstable of Instalment.per.cent and creditability

library("gmodels", lib.loc="~/R/win-library/3.3")
CrossTable(german_credit$Creditability, german_credit$Instalment.per.cent, digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1000 
## 
##  
##                             | german_credit$Instalment.per.cent 
## german_credit$Creditability |         1 |         2 |         3 |         4 | Row Total | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
##                           0 |        34 |        62 |        45 |       159 |       300 | 
##                             |       0.2 |       0.3 |       0.3 |       0.3 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
##                           1 |       102 |       169 |       112 |       317 |       700 | 
##                             |       0.8 |       0.7 |       0.7 |       0.7 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
##                Column Total |       136 |       231 |       157 |       476 |      1000 | 
##                             |       0.1 |       0.2 |       0.2 |       0.5 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  5.476792     d.f. =  3     p =  0.1400333 
## 
## 
##

variable10 =“Sex…Marital.Status”

variable 10 is Personal status and sex .It is a qualitative data.There are 4 categories.

1 : male : divorced/separated

2 : female : divorced/separated/married

3 : male : single

4 : male : married/widowed

frequency table of Sex…Marital.Status

tab<-table(german_credit$Sex...Marital.Status)
tab

## 
##   1   2   3   4 
##  50 310 548  92

names(tab)

## [1] "1" "2" "3" "4"

1 : male : divorced/separated

2 : female : divorced/separated/married

3 : male : single

4 : male : married/widowed

mode of Sex…Marital.Status.It gives the maximum value.

temp <- table(as.vector(german_credit$Sex...Marital.Status))
names(temp)[temp == max(temp)]

## [1] "3"

mode of Status of Sex…Marital.Status is 3.

ggplot of Sex…Marital.Status

library("ggplot2", lib.loc="~/R/win-library/3.3")
qplot(data<-german_credit$Sex...Marital.Status,main="Sex...Marital.Status", ylab="customers", colour= I("purple"),size=I(5))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

output description:

1-50 mens are divorced/separated.

2-310 womens aredivorced/separated/married.

3-548 males are single.

4-92 males are married/widowed.

correlation between Sex…Marital.Status and creditability

library("ltm", lib.loc="~/R/win-library/3.3")
 biserial.cor(german_credit$Sex...Marital.Status,german_credit$Creditability)

## [1] -0.0881402

library(vcd)
contin_table<-table(german_credit$Sex...Marital.Status,german_credit$Creditability)
contin_table

##    
##       0   1
##   1  20  30
##   2 109 201
##   3 146 402
##   4  25  67

assocstats(contin_table)

##                     X^2 df P(> X^2)
## Likelihood Ratio 9.4414  3 0.023963
## Pearson          9.6052  3 0.022238
## 
## Phi-Coefficient   : NA 
## Contingency Coeff.: 0.098 
## Cramer's V        : 0.098

correlation is -0.088. Sex…Marital.Status and creditability are negatively correlated.

crosstable of Sex…Marital.Status and creditability

library("gmodels", lib.loc="~/R/win-library/3.3")
CrossTable(german_credit$Creditability, german_credit$Sex...Marital.Status, digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1000 
## 
##  
##                             | german_credit$Sex...Marital.Status 
## german_credit$Creditability |         1 |         2 |         3 |         4 | Row Total | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
##                           0 |        20 |       109 |       146 |        25 |       300 | 
##                             |       0.4 |       0.4 |       0.3 |       0.3 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
##                           1 |        30 |       201 |       402 |        67 |       700 | 
##                             |       0.6 |       0.6 |       0.7 |       0.7 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
##                Column Total |        50 |       310 |       548 |        92 |      1000 | 
##                             |       0.0 |       0.3 |       0.5 |       0.1 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  9.605214     d.f. =  3     p =  0.02223801 
## 
## 
##

variable11 = “Guarantors”

It is a qualitative data.It contain 3 categories.

1 : none

2 : co-applicant

3 : guarantor

frequency table of Guarantors

tab<-table(german_credit$Guarantors)
tab

## 
##   1   2   3 
## 907  41  52

names(tab)

## [1] "1" "2" "3"

1 -stands for none, 2 -stands for co-applicant, 3 -stands for guarantor .

mode of Guarantors.It gives the maximum value.

temp <- table(as.vector(german_credit$Guarantors))
names(temp)[temp == max(temp)]

## [1] "1"

mode of Status of Guarantors 1.

ggplot of Guarantors

library("ggplot2", lib.loc="~/R/win-library/3.3")
qplot(data<-german_credit$Guarantors,main="Guarantors", ylab="customers", colour= I("purple"),size=I(5))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

output description:

1 stands for 907 customers have no Guarantors.

2 stands for 41 customers have co-applicants.

3 stands for 52 customers have Guarantors.

correlation between Guarantors and creditability

library("ltm", lib.loc="~/R/win-library/3.3")
 biserial.cor(german_credit$Guarantors,german_credit$Creditability)

## [1] -0.0251242

library(vcd)
contin_table<-table(german_credit$Guarantors,german_credit$Creditability)
contin_table

##    
##       0   1
##   1 272 635
##   2  18  23
##   3  10  42

assocstats(contin_table)

##                     X^2 df P(> X^2)
## Likelihood Ratio 6.6501  2 0.035971
## Pearson          6.6454  2 0.036056
## 
## Phi-Coefficient   : NA 
## Contingency Coeff.: 0.081 
## Cramer's V        : 0.082

correlation is -0.025. Guarantors and creditability are negatively correlated.

crosstable of Guarantors and creditability

library("gmodels", lib.loc="~/R/win-library/3.3")
CrossTable(german_credit$Creditability, german_credit$Guarantors, digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1000 
## 
##  
##                             | german_credit$Guarantors 
## german_credit$Creditability |         1 |         2 |         3 | Row Total | 
## ----------------------------|-----------|-----------|-----------|-----------|
##                           0 |       272 |        18 |        10 |       300 | 
##                             |       0.3 |       0.4 |       0.2 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|
##                           1 |       635 |        23 |        42 |       700 | 
##                             |       0.7 |       0.6 |       0.8 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|
##                Column Total |       907 |        41 |        52 |      1000 | 
##                             |       0.9 |       0.0 |       0.1 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  6.645367     d.f. =  2     p =  0.03605595 
## 
## 
##

variables12 = “Duration.in.Current.address”

It is a Categorical data.It has a 4 categories.

1: <= 1 year

2: <.<=2 years

3: <.<=3 years

4: >4years

frequency table of Duration.in.Current.address

tab<-table(german_credit$Duration.in.Current.address)
tab

## 
##   1   2   3   4 
## 130 308 149 413

names(tab)

## [1] "1" "2" "3" "4"

1: <= 1 year

2: <.<=2 years

3: <.<=3 years

4: >4years

mode of Duration.in.Current.address.It gives the maximum value.

temp <- table(as.vector(german_credit$Duration.in.Current.address))
names(temp)[temp == max(temp)]

## [1] "4"

mode of Duration.in.Current.address is 4.

ggplot of Duration.in.Current.address

library("ggplot2", lib.loc="~/R/win-library/3.3")
qplot(data<-german_credit$Duration.in.Current.address,main="Duration.in.Current.address", ylab="customers", colour= I("purple"),size=I(5))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

correlation between Duration.in.Current.address and creditability

library("ltm", lib.loc="~/R/win-library/3.3")
 biserial.cor(german_credit$Duration.in.Current.address,german_credit$Creditability)

## [1] 0.002965675

library(vcd)
contin_table<-table(german_credit$Duration.in.Current.address,german_credit$Creditability)
contin_table

##    
##       0   1
##   1  36  94
##   2  97 211
##   3  43 106
##   4 124 289

assocstats(contin_table)

##                      X^2 df P(> X^2)
## Likelihood Ratio 0.75207  3  0.86089
## Pearson          0.74930  3  0.86155
## 
## Phi-Coefficient   : NA 
## Contingency Coeff.: 0.027 
## Cramer's V        : 0.027

correlation is 0.002. Duration.in.Current.address and creditability are positively correlated.

crosstable of Duration.in.Current.address and creditability

library("gmodels", lib.loc="~/R/win-library/3.3")
CrossTable(german_credit$Creditability, german_credit$Duration.in.Current.address, digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1000 
## 
##  
##                             | german_credit$Duration.in.Current.address 
## german_credit$Creditability |         1 |         2 |         3 |         4 | Row Total | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
##                           0 |        36 |        97 |        43 |       124 |       300 | 
##                             |       0.3 |       0.3 |       0.3 |       0.3 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
##                           1 |        94 |       211 |       106 |       289 |       700 | 
##                             |       0.7 |       0.7 |       0.7 |       0.7 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
##                Column Total |       130 |       308 |       149 |       413 |      1000 | 
##                             |       0.1 |       0.3 |       0.1 |       0.4 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  0.7492964     d.f. =  3     p =  0.8615521 
## 
## 
##

variable13=“Most.valuable.available.asset”

It is a qualitative data.It contains 4 categories.

1 : real estate

2 : if not A121 : building society savings agreement/life insurance

3 : if not A121/A122 : car or other, not in variable 7

4 : unknown / no property

frequency table Most.valuable.available.asset

tab<-table(german_credit$Most.valuable.available.asset)
tab

## 
##   1   2   3   4 
## 282 232 332 154

names(tab)

## [1] "1" "2" "3" "4"

1 : real estate

2 : if not A121 : building society savings agreement/life insurance

3 : if not A121/A122 : car or other, not in variable 7

4 : unknown / no property

mode of Most.valuable.available.asset.It gives the maximum value.

temp <- table(as.vector(german_credit$Most.valuable.available.asset))
names(temp)[temp == max(temp)]

## [1] "3"

mode of Most.valuable.available.asset is 3.

ggplot of Most.valuable.available.asset

library("ggplot2", lib.loc="~/R/win-library/3.3")
qplot(data<-german_credit$Most.valuable.available.asset,main="Most.valuable.available.asset", ylab="customers", colour= I("purple"),size=I(5))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

correlation between Most.valuable.available.asset and creditability

library("ltm", lib.loc="~/R/win-library/3.3")
 biserial.cor(german_credit$Most.valuable.available.asset,german_credit$Creditability)

## [1] 0.1425406

library(vcd)
contin_table<-table(german_credit$Most.valuable.available.asset,german_credit$Creditability)
contin_table

##    
##       0   1
##   1  60 222
##   2  71 161
##   3 102 230
##   4  67  87

assocstats(contin_table)

##                     X^2 df   P(> X^2)
## Likelihood Ratio 23.546  3 3.1063e-05
## Pearson          23.720  3 2.8584e-05
## 
## Phi-Coefficient   : NA 
## Contingency Coeff.: 0.152 
## Cramer's V        : 0.154

correlation is 0.14. Most.valuable.available.asset and creditability are positively correlated.

crosstable of Most.valuable.available.asset and creditability

library("gmodels", lib.loc="~/R/win-library/3.3")
CrossTable(german_credit$Creditability, german_credit$Most.valuable.available.asset, digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1000 
## 
##  
##                             | german_credit$Most.valuable.available.asset 
## german_credit$Creditability |         1 |         2 |         3 |         4 | Row Total | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
##                           0 |        60 |        71 |       102 |        67 |       300 | 
##                             |       0.2 |       0.3 |       0.3 |       0.4 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
##                           1 |       222 |       161 |       230 |        87 |       700 | 
##                             |       0.8 |       0.7 |       0.7 |       0.6 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
##                Column Total |       282 |       232 |       332 |       154 |      1000 | 
##                             |       0.3 |       0.2 |       0.3 |       0.2 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  23.71955     d.f. =  3     p =  2.858442e-05 
## 
## 
##

variable14 = “Age..years.”

It is a Numerical data.

head(german_credit$Age..years.)

## [1] 21 36 23 39 38 48

Univariate Analysis of Age..years.

Central tendencies of Age..years.

mean of Age..years.

mean(german_credit$Age..years.)

## [1] 35.542

median of Age..years.

median(german_credit$Age..years.)

## [1] 33

Dispersion of Age..years. Variance of Age..years.

var(german_credit$Age..years.)

## [1] 128.8831

Standard deviation of Age..years.

sd(german_credit$Age..years.)

## [1] 11.35267

summary gives four quartiles of Age..years.

summary(german_credit$Age..years.)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   19.00   27.00   33.00   35.54   42.00   75.00

boxplot of Age..years.

quantile(german_credit$Age..years.)

##   0%  25%  50%  75% 100% 
##   19   27   33   42   75

quantile(german_credit$Age..years.,c(0.75,0.80,0.90,1))

##  75%  80%  90% 100% 
##   42   44   52   75

boxplot(german_credit$Age..years.)

output description

In this boxplot the minimum is 19 , maximum is 75, and median is 33. first quartile is 27,third quartile is 42. Notethat outliers are discussed later.

histogram of Age..years.

hist(german_credit$Age..years.)

correlation between Age..years. and Creditability

library("ltm", lib.loc="~/R/win-library/3.3")
biserial.cor(german_credit$Age..years.,german_credit$Creditability)

## [1] -0.0912263

correlation is -0.091.Age..years.and Creditability negatively correlated. #t-test

t.test(german_credit$Age..years.)

## 
##  One Sample t-test
## 
## data:  german_credit$Age..years.
## t = 99.002, df = 999, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  34.83751 36.24649
## sample estimates:
## mean of x 
##    35.542

variable15=“Concurrent.Credits”

It is a qualitative data.It contains 3 categories.

1 : bank

2 : stores

3 : none

frequency table of Concurrent.Credits

tab<-table(german_credit$Concurrent.Credits)
tab

## 
##   1   2   3 
## 139  47 814

names(tab)

## [1] "1" "2" "3"

1 : bank

2 : stores

3 : none

mode of Concurrent.Credits.It gives the maximum value.

temp <- table(as.vector(german_credit$Concurrent.Credits))
names(temp)[temp == max(temp)]

## [1] "3"

mode of Concurrent.Credits is 3.

ggplot of Concurrent.Credits

library("ggplot2", lib.loc="~/R/win-library/3.3")
qplot(data<-german_credit$Concurrent.Credits,main="Concurrent.Credits", ylab="customers", colour= I("purple"),size=I(5))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

output description:

1 stands for 139 customers are in bank.

2 stands for 47 customers are in store.

3 stands for 814 customers have no concurrent credits.

correlation between Concurrent.Credits and creditability

library("ltm", lib.loc="~/R/win-library/3.3")
 biserial.cor(german_credit$Concurrent.Credits,german_credit$Creditability)

## [1] -0.1097892

library(vcd)
contin_table<-table(german_credit$Concurrent.Credits,german_credit$Creditability)
contin_table

##    
##       0   1
##   1  57  82
##   2  19  28
##   3 224 590

assocstats(contin_table)

##                     X^2 df  P(> X^2)
## Likelihood Ratio 12.303  2 0.0021298
## Pearson          12.839  2 0.0016293
## 
## Phi-Coefficient   : NA 
## Contingency Coeff.: 0.113 
## Cramer's V        : 0.113

correlation is -0.109. Concurrent.Credits and creditability are negatively correlated.

crosstable of Concurrent.Credits and creditability

library("gmodels", lib.loc="~/R/win-library/3.3")
CrossTable(german_credit$Creditability, german_credit$Concurrent.Credits, digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1000 
## 
##  
##                             | german_credit$Concurrent.Credits 
## german_credit$Creditability |         1 |         2 |         3 | Row Total | 
## ----------------------------|-----------|-----------|-----------|-----------|
##                           0 |        57 |        19 |       224 |       300 | 
##                             |       0.4 |       0.4 |       0.3 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|
##                           1 |        82 |        28 |       590 |       700 | 
##                             |       0.6 |       0.6 |       0.7 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|
##                Column Total |       139 |        47 |       814 |      1000 | 
##                             |       0.1 |       0.0 |       0.8 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  12.83919     d.f. =  2     p =  0.001629318 
## 
## 
##

variable16 = “Type.of.apartment”

It is a qualitative data.It has a 3 categories.

1 : rent

2 : own

3 : for free

frequency table of Type.of.apartment

tab<-table(german_credit$Type.of.apartment)
tab

## 
##   1   2   3 
## 179 714 107

names(tab)

## [1] "1" "2" "3"

1 : rent

2 : own

3 : for free

mode of Type.of.apartment.It gives the maximum value.

temp <- table(as.vector(german_credit$Type.of.apartment))
names(temp)[temp == max(temp)]

## [1] "2"

mode of Status of Type.of.apartment is 2.

ggplot of Type.of.apartment

library("ggplot2", lib.loc="~/R/win-library/3.3")
qplot(data<-german_credit$Type.of.apartment,main="Type.of.apartment", ylab="customers", colour= I("purple"),size=I(5))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

output description:

1 stands for 179 customers are staying in rent houses.

2 stands for 714 customers are staying in own houses.

3 stands for 107 customers are staying in free quaters.

correlation between Type.of.apartment and creditability

library("ltm", lib.loc="~/R/win-library/3.3")
 biserial.cor(german_credit$Type.of.apartment,german_credit$Creditability)

## [1] -0.01810985

library(vcd)
contin_table<-table(german_credit$Type.of.apartment,german_credit$Creditability)
contin_table

##    
##       0   1
##   1  70 109
##   2 186 528
##   3  44  63

assocstats(contin_table)

##                     X^2 df   P(> X^2)
## Likelihood Ratio 18.129  2 1.1573e-04
## Pearson          18.674  2 8.8103e-05
## 
## Phi-Coefficient   : NA 
## Contingency Coeff.: 0.135 
## Cramer's V        : 0.137

correlation is -0.018. Type.of.apartment and creditability are negatively correlated.

crosstable of Type.of.apartment and creditability

library("gmodels", lib.loc="~/R/win-library/3.3")
CrossTable(german_credit$Creditability, german_credit$Type.of.apartment, digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1000 
## 
##  
##                             | german_credit$Type.of.apartment 
## german_credit$Creditability |         1 |         2 |         3 | Row Total | 
## ----------------------------|-----------|-----------|-----------|-----------|
##                           0 |        70 |       186 |        44 |       300 | 
##                             |       0.4 |       0.3 |       0.4 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|
##                           1 |       109 |       528 |        63 |       700 | 
##                             |       0.6 |       0.7 |       0.6 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|
##                Column Total |       179 |       714 |       107 |      1000 | 
##                             |       0.2 |       0.7 |       0.1 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  18.67401     d.f. =  2     p =  8.810311e-05 
## 
## 
##

variable17= “No.of.Credits.at.this.Bank”

It is a qualitative data. It has a 4 categories.

frequency table No.of.Credits.at.this.Bank

tab<-table(german_credit$No.of.Credits.at.this.Bank)
tab

## 
##   1   2   3   4 
## 633 333  28   6

names(tab)

## [1] "1" "2" "3" "4"

mode of No.of.Credits.at.this.Bank.It gives the maximum value.

temp <- table(as.vector(german_credit$No.of.Credits.at.this.Bank))
names(temp)[temp == max(temp)]

## [1] "1"

mode of No.of.Credits.at.this.Bank is 1.

ggplot of No.of.Credits.at.this.Bank

library("ggplot2", lib.loc="~/R/win-library/3.3")
qplot(data<-german_credit$No.of.Credits.at.this.Bank,main="No.of.Credits.at.this.Bank", ylab="customers", colour= I("purple"),size=I(5))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

correlation between No.of.Credits.at.this.Bank and creditability

library("ltm", lib.loc="~/R/win-library/3.3")
 biserial.cor(german_credit$No.of.Credits.at.this.Bank,german_credit$Creditability)

## [1] -0.04570962

library(vcd)
contin_table<-table(german_credit$No.of.Credits.at.this.Bank,german_credit$Creditability)
contin_table

##    
##       0   1
##   1 200 433
##   2  92 241
##   3   6  22
##   4   2   4

assocstats(contin_table)

##                     X^2 df P(> X^2)
## Likelihood Ratio 2.7425  3  0.43304
## Pearson          2.6712  3  0.44514
## 
## Phi-Coefficient   : NA 
## Contingency Coeff.: 0.052 
## Cramer's V        : 0.052

correlation is -0.045. No.of.Credits.at.this.Bank and creditability are negatively correlated.

crosstable of No.of.Credits.at.this.Bank and creditability

library("gmodels", lib.loc="~/R/win-library/3.3")
CrossTable(german_credit$Creditability, german_credit$No.of.Credits.at.this.Bank, digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)

## Warning in chisq.test(t, correct = FALSE, ...): Chi-squared approximation
## may be incorrect

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1000 
## 
##  
##                             | german_credit$No.of.Credits.at.this.Bank 
## german_credit$Creditability |         1 |         2 |         3 |         4 | Row Total | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
##                           0 |       200 |        92 |         6 |         2 |       300 | 
##                             |       0.3 |       0.3 |       0.2 |       0.3 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
##                           1 |       433 |       241 |        22 |         4 |       700 | 
##                             |       0.7 |       0.7 |       0.8 |       0.7 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
##                Column Total |       633 |       333 |        28 |         6 |      1000 | 
##                             |       0.6 |       0.3 |       0.0 |       0.0 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  2.671198     d.f. =  3     p =  0.4451441 
## 
## 
##

variable18 =“Occupation”

It is a qualitative data.It contains 4 categories.

1 : unemployed/ unskilled – non-resident

2 : unskilled – resident

3 : skilled employee / official

4 : management/ self-employed/highly qualified employee/ officer

frequency table of Occupation

tab<-table(german_credit$Occupation)
tab

## 
##   1   2   3   4 
##  22 200 630 148

names(tab)

## [1] "1" "2" "3" "4"

1 : unemployed/ unskilled – non-resident

2 : unskilled – resident

3 : skilled employee / official

4 : management/ self-employed/highly qualified employee/ officer

mode of Occupation.It gives the maximum value.

temp <- table(as.vector(german_credit$Occupation))
names(temp)[temp == max(temp)]

## [1] "3"

mode of Occupation is 3.

ggplot of Occupation

library("ggplot2", lib.loc="~/R/win-library/3.3")
qplot(data<-german_credit$Occupation,main="Occupation", ylab="customers", colour= I("purple"),size=I(5))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

output description:

1 : 22 customers are unemployed/ unskilled – non-resident

2 : 200 customers are unskilled – resident

3 : 630 customers are skilled employee / official

4 : 148 customers are management/ self-employed/highly qualified employee/ officer

correlation between Occupation and creditability

library("ltm", lib.loc="~/R/win-library/3.3")
 biserial.cor(german_credit$Occupation,german_credit$Creditability)

## [1] 0.03271863

library(vcd)
contin_table<-table(german_credit$Occupation,german_credit$Creditability)
contin_table

##    
##       0   1
##   1   7  15
##   2  56 144
##   3 186 444
##   4  51  97

assocstats(contin_table)

##                     X^2 df P(> X^2)
## Likelihood Ratio 1.8540  3  0.60326
## Pearson          1.8852  3  0.59658
## 
## Phi-Coefficient   : NA 
## Contingency Coeff.: 0.043 
## Cramer's V        : 0.043

correlation is 0.032. Occupation and creditability are positively correlated.

crosstable of Occupation and creditability

library("gmodels", lib.loc="~/R/win-library/3.3")
CrossTable(german_credit$Creditability, german_credit$Occupation, digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1000 
## 
##  
##                             | german_credit$Occupation 
## german_credit$Creditability |         1 |         2 |         3 |         4 | Row Total | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
##                           0 |         7 |        56 |       186 |        51 |       300 | 
##                             |       0.3 |       0.3 |       0.3 |       0.3 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
##                           1 |        15 |       144 |       444 |        97 |       700 | 
##                             |       0.7 |       0.7 |       0.7 |       0.7 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
##                Column Total |        22 |       200 |       630 |       148 |      1000 | 
##                             |       0.0 |       0.2 |       0.6 |       0.1 |           | 
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  1.885156     d.f. =  3     p =  0.5965816 
## 
## 
##

variable19 =“No.of.dependents”

It is a qualitative data.

frequency table of No.of.dependents

tab<-table(german_credit$No.of.dependents)
tab

## 
##   1   2 
## 845 155

names(tab)

## [1] "1" "2"

mode of No.of.dependents.It gives the maximum value.

temp <- table(as.vector(german_credit$No.of.dependents))
names(temp)[temp == max(temp)]

## [1] "1"

mode of Status of No.of.dependents 1.

ggplot of No.of.dependents

library("ggplot2", lib.loc="~/R/win-library/3.3")
qplot(data<-german_credit$No.of.dependents,main="No.of.dependents", ylab="customers", colour= I("purple"),size=I(5))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

correlation between No.of.dependents and creditability

library("ltm", lib.loc="~/R/win-library/3.3")
 biserial.cor(german_credit$No.of.dependents,german_credit$Creditability)

## [1] -0.003013345

library(vcd)
contin_table<-table(german_credit$No.of.dependents,german_credit$Creditability)
contin_table

##    
##       0   1
##   1 254 591
##   2  46 109

assocstats(contin_table)

##                        X^2 df P(> X^2)
## Likelihood Ratio 0.0091047  1  0.92398
## Pearson          0.0090893  1  0.92405
## 
## Phi-Coefficient   : 0.003 
## Contingency Coeff.: 0.003 
## Cramer's V        : 0.003

correlation is -0.003. No.of.dependents and creditability are negatively correlated.

crosstable of No.of.dependents and creditability

library("gmodels", lib.loc="~/R/win-library/3.3")
CrossTable(german_credit$Creditability, german_credit$No.of.dependents, digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1000 
## 
##  
##                             | german_credit$No.of.dependents 
## german_credit$Creditability |         1 |         2 | Row Total | 
## ----------------------------|-----------|-----------|-----------|
##                           0 |       254 |        46 |       300 | 
##                             |       0.3 |       0.3 |           | 
## ----------------------------|-----------|-----------|-----------|
##                           1 |       591 |       109 |       700 | 
##                             |       0.7 |       0.7 |           | 
## ----------------------------|-----------|-----------|-----------|
##                Column Total |       845 |       155 |      1000 | 
##                             |       0.8 |       0.2 |           | 
## ----------------------------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  0.009089339     d.f. =  1     p =  0.9240463 
## 
## Pearson's Chi-squared test with Yates' continuity correction 
## ------------------------------------------------------------
## Chi^2 =  0     d.f. =  1     p =  1 
## 
##

variable20 =“Telephone”

It is a qualitative data. it contains 2 categories.

1 : none

2 : yes, registered under the customers name

frequency table of Telephone

tab<-table(german_credit$Telephone)
tab

## 
##   1   2 
## 596 404

names(tab)

## [1] "1" "2"

1 -stands for none, 2 -stands for yes, registered under the customers name.

mode of Telephone.It gives the maximum value.

temp <- table(as.vector(german_credit$Telephone))
names(temp)[temp == max(temp)]

## [1] "1"

mode of Status of Telephone is 1.

ggplot of Telephone

library("ggplot2", lib.loc="~/R/win-library/3.3")
qplot(data<-german_credit$Telephone,main="Telephone", ylab="customers", colour= I("purple"),size=I(5))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

correlation between Telephone and creditability

library("ltm", lib.loc="~/R/win-library/3.3")
 biserial.cor(german_credit$Telephone,german_credit$Creditability)

## [1] -0.03644795

library(vcd)
contin_table<-table(german_credit$Telephone,german_credit$Creditability)
contin_table

##    
##       0   1
##   1 187 409
##   2 113 291

assocstats(contin_table)

##                     X^2 df P(> X^2)
## Likelihood Ratio 1.3359  1  0.24776
## Pearson          1.3298  1  0.24884
## 
## Phi-Coefficient   : 0.036 
## Contingency Coeff.: 0.036 
## Cramer's V        : 0.036

correlation is -0.35. Telephone and creditability are negatively correlated.

crosstable of Telephone and creditability

library("gmodels", lib.loc="~/R/win-library/3.3")
CrossTable(german_credit$Creditability, german_credit$Telephone, digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1000 
## 
##  
##                             | german_credit$Telephone 
## german_credit$Creditability |         1 |         2 | Row Total | 
## ----------------------------|-----------|-----------|-----------|
##                           0 |       187 |       113 |       300 | 
##                             |       0.3 |       0.3 |           | 
## ----------------------------|-----------|-----------|-----------|
##                           1 |       409 |       291 |       700 | 
##                             |       0.7 |       0.7 |           | 
## ----------------------------|-----------|-----------|-----------|
##                Column Total |       596 |       404 |      1000 | 
##                             |       0.6 |       0.4 |           | 
## ----------------------------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  1.329783     d.f. =  1     p =  0.2488438 
## 
## Pearson's Chi-squared test with Yates' continuity correction 
## ------------------------------------------------------------
## Chi^2 =  1.172559     d.f. =  1     p =  0.2788762 
## 
##

variable21= “Foreign.Worker”

It is a qualitative data.It contains a two categories.

1 : yes

2 : no

frequency table of Foreign.Worker

tab<-table(german_credit$Foreign.Worker)
tab

## 
##   1   2 
## 963  37

names(tab)

## [1] "1" "2"

1 -stands for yes, 2 -stands for no

mode of Foreign.Worker.It gives the maximum value.

temp <- table(as.vector(german_credit$Foreign.Worker))
names(temp)[temp == max(temp)]

## [1] "1"

mode of Foreign.Workere is 1.

ggplot of Foreign.Worker

library("ggplot2", lib.loc="~/R/win-library/3.3")
qplot(data<-german_credit$Foreign.Worker,main="Foreign.Worker", ylab="customers", colour= I("purple"),size=I(5))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

correlation between Foreign.Worker and creditability

library("ltm", lib.loc="~/R/win-library/3.3")
 a<-biserial.cor(german_credit$Foreign.Worker,german_credit$Creditability)

library(vcd)
contin_table<-table(german_credit$Foreign.Worker,german_credit$Creditability)
contin_table

##    
##       0   1
##   1 296 667
##   2   4  33

assocstats(contin_table)

##                     X^2 df  P(> X^2)
## Likelihood Ratio 8.0724  1 0.0044945
## Pearson          6.7370  1 0.0094431
## 
## Phi-Coefficient   : 0.082 
## Contingency Coeff.: 0.082 
## Cramer's V        : 0.082

correlation is -0.08. Foreign.Worker and creditability are negatively correlated.

crosstable of Foreign.Worker and creditability

library("gmodels", lib.loc="~/R/win-library/3.3")
CrossTable(german_credit$Creditability, german_credit$Foreign.Worker, digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1000 
## 
##  
##                             | german_credit$Foreign.Worker 
## german_credit$Creditability |         1 |         2 | Row Total | 
## ----------------------------|-----------|-----------|-----------|
##                           0 |       296 |         4 |       300 | 
##                             |       0.3 |       0.1 |           | 
## ----------------------------|-----------|-----------|-----------|
##                           1 |       667 |        33 |       700 | 
##                             |       0.7 |       0.9 |           | 
## ----------------------------|-----------|-----------|-----------|
##                Column Total |       963 |        37 |      1000 | 
##                             |       1.0 |       0.0 |           | 
## ----------------------------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  6.737044     d.f. =  1     p =  0.009443096 
## 
## Pearson's Chi-squared test with Yates' continuity correction 
## ------------------------------------------------------------
## Chi^2 =  5.821576     d.f. =  1     p =  0.01583075 
## 
##

Model Building

predictors – Qualitative chi-square p -value

variable2 = “Account.Balance” 1.218902e-26 <0.001

variable4 = “Payment.Status.of.Previous.Credit” 1.279187e-12 <0.001

variable5 = “Purpose” 0.0001157491 <0.001

variable7 = “Value.Savings.Stocks” 2.761214e-07 <0.001

variable8 = “Length.of.current.employment” 0.001045452 <0.001

variable9 = “Instalment.per.cent” 0.1400333 = 0.14

variable10 = “Sex…Marital.Status” 0.02223801 = 0.02

variable11 = “Guarantors” 0.03605595 = 0.03

variable12 = “Duration.in.Current.address” 0.8615521 = 0.86

variable13 = “Most.valuable.available.asset” 2.858442e-05 <0.001

variable15 = “Concurrent.Credits” 0.001629318 = 0.001

variable16 = “Type.of.apartment” 8.810311e-05 < 0.001

variable17 = “No.of.Credits.at.this.Bank” 0.4451441

variable18 = “Occupation” 0.5965816

variable19 = “No.of.dependents” 1

variable20 = “Telephone” 0.2788762 = 0.27

variable21 = “Foreign.Worker” 0.01583075 = 0.01

predicators – numerical data

variable3 = “Duration.of.Credit..month.” p-value < 0.001

variable6 = “Credit.Amount” p-value < 0.001

variable14 = “Age..years.” p-value < 0.001

Building logistic Regression

Logistic Regression is a classification algorithm. It is used to predict a binary outcome (1 / 0, Yes / No, True / False) given a set of independent variables. To represent binary / categorical outcome. logistic regression as a special case of linear regression when the outcome variable is categorical, where we are using log of odds as dependent variable. In simple words, it predicts the probability of occurrence of an event by fitting data to a logit function.

names(german_credit)

##  [1] "Creditability"                    
##  [2] "Account.Balance"                  
##  [3] "Duration.of.Credit..month."       
##  [4] "Payment.Status.of.Previous.Credit"
##  [5] "Purpose"                          
##  [6] "Credit.Amount"                    
##  [7] "Value.Savings.Stocks"             
##  [8] "Length.of.current.employment"     
##  [9] "Instalment.per.cent"              
## [10] "Sex...Marital.Status"             
## [11] "Guarantors"                       
## [12] "Duration.in.Current.address"      
## [13] "Most.valuable.available.asset"    
## [14] "Age..years."                      
## [15] "Concurrent.Credits"               
## [16] "Type.of.apartment"                
## [17] "No.of.Credits.at.this.Bank"       
## [18] "Occupation"                       
## [19] "No.of.dependents"                 
## [20] "Telephone"                        
## [21] "Foreign.Worker"

LogisticModel<- glm(german_credit$Creditability~.,family=binomial,data=german_credit )
summary(LogisticModel)

## 
## Call:
## glm(formula = german_credit$Creditability ~ ., family = binomial, 
##     data = german_credit)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5854  -0.7927   0.4512   0.7445   1.9483  
## 
## Coefficients:
##                                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)                       -3.994e+00  1.024e+00  -3.901 9.58e-05
## Account.Balance                    5.799e-01  7.004e-02   8.280  < 2e-16
## Duration.of.Credit..month.        -2.457e-02  8.725e-03  -2.816 0.004862
## Payment.Status.of.Previous.Credit  3.822e-01  8.740e-02   4.373 1.23e-05
## Purpose                            3.153e-02  3.009e-02   1.048 0.294697
## Credit.Amount                     -9.340e-05  4.012e-05  -2.328 0.019908
## Value.Savings.Stocks               2.391e-01  5.827e-02   4.104 4.07e-05
## Length.of.current.employment       1.517e-01  7.118e-02   2.132 0.033027
## Instalment.per.cent               -2.983e-01  8.276e-02  -3.605 0.000312
## Sex...Marital.Status               2.574e-01  1.157e-01   2.224 0.026131
## Guarantors                         3.473e-01  1.777e-01   1.954 0.050681
## Duration.in.Current.address       -1.411e-02  7.742e-02  -0.182 0.855335
## Most.valuable.available.asset     -1.828e-01  9.101e-02  -2.009 0.044521
## Age..years.                        8.917e-03  8.206e-03   1.087 0.277218
## Concurrent.Credits                 2.419e-01  1.111e-01   2.178 0.029420
## Type.of.apartment                  2.931e-01  1.677e-01   1.748 0.080527
## No.of.Credits.at.this.Bank        -2.436e-01  1.610e-01  -1.513 0.130257
## Occupation                         1.889e-02  1.367e-01   0.138 0.890081
## No.of.dependents                  -1.708e-01  2.319e-01  -0.736 0.461567
## Telephone                          2.947e-01  1.880e-01   1.567 0.117024
## Foreign.Worker                     1.158e+00  6.078e-01   1.906 0.056680
##                                      
## (Intercept)                       ***
## Account.Balance                   ***
## Duration.of.Credit..month.        ** 
## Payment.Status.of.Previous.Credit ***
## Purpose                              
## Credit.Amount                     *  
## Value.Savings.Stocks              ***
## Length.of.current.employment      *  
## Instalment.per.cent               ***
## Sex...Marital.Status              *  
## Guarantors                        .  
## Duration.in.Current.address          
## Most.valuable.available.asset     *  
## Age..years.                          
## Concurrent.Credits                *  
## Type.of.apartment                 .  
## No.of.Credits.at.this.Bank           
## Occupation                           
## No.of.dependents                     
## Telephone                            
## Foreign.Worker                    .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1221.73  on 999  degrees of freedom
## Residual deviance:  956.56  on 979  degrees of freedom
## AIC: 998.56
## 
## Number of Fisher Scoring iterations: 5

classification table

library("caret")

## Warning: package 'caret' was built under R version 3.1.3

## Loading required package: lattice

library("rattle")

## Warning: package 'rattle' was built under R version 3.1.3

## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.

threshold=0.5
predicted_values<-ifelse(predict(LogisticModel,type="response")>threshold,1,0)
actual_values<-LogisticModel$y
conf_matrix<-table(predicted_values,actual_values)
conf_matrix

##                 actual_values
## predicted_values   0   1
##                0 144  77
##                1 156 623

sensitivity(conf_matrix)

## [1] 0.48

specificity(conf_matrix)

## [1] 0.89

Logistic regression Accuracy

accuracy1<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy1

## [1] 0.767

Multicollinearity

library("car")

## Warning: package 'car' was built under R version 3.1.3

summary(LogisticModel)

## 
## Call:
## glm(formula = german_credit$Creditability ~ ., family = binomial, 
##     data = german_credit)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5854  -0.7927   0.4512   0.7445   1.9483  
## 
## Coefficients:
##                                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)                       -3.994e+00  1.024e+00  -3.901 9.58e-05
## Account.Balance                    5.799e-01  7.004e-02   8.280  < 2e-16
## Duration.of.Credit..month.        -2.457e-02  8.725e-03  -2.816 0.004862
## Payment.Status.of.Previous.Credit  3.822e-01  8.740e-02   4.373 1.23e-05
## Purpose                            3.153e-02  3.009e-02   1.048 0.294697
## Credit.Amount                     -9.340e-05  4.012e-05  -2.328 0.019908
## Value.Savings.Stocks               2.391e-01  5.827e-02   4.104 4.07e-05
## Length.of.current.employment       1.517e-01  7.118e-02   2.132 0.033027
## Instalment.per.cent               -2.983e-01  8.276e-02  -3.605 0.000312
## Sex...Marital.Status               2.574e-01  1.157e-01   2.224 0.026131
## Guarantors                         3.473e-01  1.777e-01   1.954 0.050681
## Duration.in.Current.address       -1.411e-02  7.742e-02  -0.182 0.855335
## Most.valuable.available.asset     -1.828e-01  9.101e-02  -2.009 0.044521
## Age..years.                        8.917e-03  8.206e-03   1.087 0.277218
## Concurrent.Credits                 2.419e-01  1.111e-01   2.178 0.029420
## Type.of.apartment                  2.931e-01  1.677e-01   1.748 0.080527
## No.of.Credits.at.this.Bank        -2.436e-01  1.610e-01  -1.513 0.130257
## Occupation                         1.889e-02  1.367e-01   0.138 0.890081
## No.of.dependents                  -1.708e-01  2.319e-01  -0.736 0.461567
## Telephone                          2.947e-01  1.880e-01   1.567 0.117024
## Foreign.Worker                     1.158e+00  6.078e-01   1.906 0.056680
##                                      
## (Intercept)                       ***
## Account.Balance                   ***
## Duration.of.Credit..month.        ** 
## Payment.Status.of.Previous.Credit ***
## Purpose                              
## Credit.Amount                     *  
## Value.Savings.Stocks              ***
## Length.of.current.employment      *  
## Instalment.per.cent               ***
## Sex...Marital.Status              *  
## Guarantors                        .  
## Duration.in.Current.address          
## Most.valuable.available.asset     *  
## Age..years.                          
## Concurrent.Credits                *  
## Type.of.apartment                 .  
## No.of.Credits.at.this.Bank           
## Occupation                           
## No.of.dependents                     
## Telephone                            
## Foreign.Worker                    .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1221.73  on 999  degrees of freedom
## Residual deviance:  956.56  on 979  degrees of freedom
## AIC: 998.56
## 
## Number of Fisher Scoring iterations: 5

vif(LogisticModel, scale = FALSE)

##                   Account.Balance        Duration.of.Credit..month. 
##                          1.064438                          1.821553 
## Payment.Status.of.Previous.Credit                           Purpose 
##                          1.302567                          1.087702 
##                     Credit.Amount              Value.Savings.Stocks 
##                          2.227685                          1.079491 
##      Length.of.current.employment               Instalment.per.cent 
##                          1.158789                          1.295559 
##              Sex...Marital.Status                        Guarantors 
##                          1.070219                          1.063473 
##       Duration.in.Current.address     Most.valuable.available.asset 
##                          1.169790                          1.383774 
##                       Age..years.                Concurrent.Credits 
##                          1.306198                          1.092446 
##                 Type.of.apartment        No.of.Credits.at.this.Bank 
##                          1.347264                          1.312079 
##                        Occupation                  No.of.dependents 
##                          1.331917                          1.083981 
##                         Telephone                    Foreign.Worker 
##                          1.319990                          1.041820

Individual Impact of Variables

library("caret")
varImp(LogisticModel, scale = FALSE)

##                                     Overall
## Account.Balance                   8.2802843
## Duration.of.Credit..month.        2.8160232
## Payment.Status.of.Previous.Credit 4.3726880
## Purpose                           1.0478728
## Credit.Amount                     2.3280721
## Value.Savings.Stocks              4.1036081
## Length.of.current.employment      2.1317598
## Instalment.per.cent               3.6048549
## Sex...Marital.Status              2.2242622
## Guarantors                        1.9541747
## Duration.in.Current.address       0.1823156
## Most.valuable.available.asset     2.0091558
## Age..years.                       1.0865896
## Concurrent.Credits                2.1778089
## Type.of.apartment                 1.7476343
## No.of.Credits.at.this.Bank        1.5130896
## Occupation                        0.1382014
## No.of.dependents                  0.7362695
## Telephone                         1.5673893
## Foreign.Worker                    1.9057709

AIC and BIC

library("stats")
AIC(LogisticModel)

## [1] 998.5572

BIC(LogisticModel)

## [1] 1101.62

Model Selction

library("caret")
varImp(LogisticModel, scale = FALSE)

##                                     Overall
## Account.Balance                   8.2802843
## Duration.of.Credit..month.        2.8160232
## Payment.Status.of.Previous.Credit 4.3726880
## Purpose                           1.0478728
## Credit.Amount                     2.3280721
## Value.Savings.Stocks              4.1036081
## Length.of.current.employment      2.1317598
## Instalment.per.cent               3.6048549
## Sex...Marital.Status              2.2242622
## Guarantors                        1.9541747
## Duration.in.Current.address       0.1823156
## Most.valuable.available.asset     2.0091558
## Age..years.                       1.0865896
## Concurrent.Credits                2.1778089
## Type.of.apartment                 1.7476343
## No.of.Credits.at.this.Bank        1.5130896
## Occupation                        0.1382014
## No.of.dependents                  0.7362695
## Telephone                         1.5673893
## Foreign.Worker                    1.9057709

Duration.in.Current.address,Occupation and No.of.dependents are the least impacting variable, lets drop these variables and re build the model

Building Final logistic Regression

LogisticModel1<- glm(german_credit$Creditability~german_credit$Account.Balance+german_credit$Duration.of.Credit..month.+ german_credit$Payment.Status.of.Previous.Credit+german_credit$Purpose+german_credit$Credit.Amount+german_credit$Value.Savings.Stocks+german_credit$Length.of.current.employment+german_credit$Instalment.per.cent +german_credit$Sex...Marital.Status+german_credit$Guarantors+german_credit$Most.valuable.available.asset+german_credit$Age..years.+german_credit$Concurrent.Credits +german_credit$Type.of.apartment+german_credit$No.of.Credits.at.this.Bank+german_credit$Telephone+german_credit$Foreign.Worker) 
 
summary(LogisticModel1)

## 
## Call:
## glm(formula = german_credit$Creditability ~ german_credit$Account.Balance + 
##     german_credit$Duration.of.Credit..month. + german_credit$Payment.Status.of.Previous.Credit + 
##     german_credit$Purpose + german_credit$Credit.Amount + german_credit$Value.Savings.Stocks + 
##     german_credit$Length.of.current.employment + german_credit$Instalment.per.cent + 
##     german_credit$Sex...Marital.Status + german_credit$Guarantors + 
##     german_credit$Most.valuable.available.asset + german_credit$Age..years. + 
##     german_credit$Concurrent.Credits + german_credit$Type.of.apartment + 
##     german_credit$No.of.Credits.at.this.Bank + german_credit$Telephone + 
##     german_credit$Foreign.Worker)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0846  -0.3477   0.1123   0.2952   0.7905  
## 
## Coefficients:
##                                                   Estimate Std. Error
## (Intercept)                                     -7.003e-02  1.361e-01
## german_credit$Account.Balance                    9.936e-02  1.079e-02
## german_credit$Duration.of.Credit..month.        -4.395e-03  1.481e-03
## german_credit$Payment.Status.of.Previous.Credit  6.626e-02  1.384e-02
## german_credit$Purpose                            4.825e-03  4.811e-03
## german_credit$Credit.Amount                     -1.502e-05  6.700e-06
## german_credit$Value.Savings.Stocks               3.382e-02  8.448e-03
## german_credit$Length.of.current.employment       2.399e-02  1.129e-02
## german_credit$Instalment.per.cent               -4.573e-02  1.289e-02
## german_credit$Sex...Marital.Status               4.224e-02  1.853e-02
## german_credit$Guarantors                         5.860e-02  2.773e-02
## german_credit$Most.valuable.available.asset     -3.238e-02  1.405e-02
## german_credit$Age..years.                        9.144e-04  1.251e-03
## german_credit$Concurrent.Credits                 3.687e-02  1.879e-02
## german_credit$Type.of.apartment                  4.879e-02  2.730e-02
## german_credit$No.of.Credits.at.this.Bank        -4.490e-02  2.522e-02
## german_credit$Telephone                          5.348e-02  2.791e-02
## german_credit$Foreign.Worker                     1.111e-01  7.035e-02
##                                                 t value Pr(>|t|)    
## (Intercept)                                      -0.514 0.607051    
## german_credit$Account.Balance                     9.210  < 2e-16 ***
## german_credit$Duration.of.Credit..month.         -2.967 0.003078 ** 
## german_credit$Payment.Status.of.Previous.Credit   4.787 1.95e-06 ***
## german_credit$Purpose                             1.003 0.316163    
## german_credit$Credit.Amount                      -2.241 0.025230 *  
## german_credit$Value.Savings.Stocks                4.003 6.73e-05 ***
## german_credit$Length.of.current.employment        2.124 0.033903 *  
## german_credit$Instalment.per.cent                -3.547 0.000407 ***
## german_credit$Sex...Marital.Status                2.279 0.022852 *  
## german_credit$Guarantors                          2.113 0.034829 *  
## german_credit$Most.valuable.available.asset      -2.305 0.021359 *  
## german_credit$Age..years.                         0.731 0.464819    
## german_credit$Concurrent.Credits                  1.962 0.050000 .  
## german_credit$Type.of.apartment                   1.787 0.074226 .  
## german_credit$No.of.Credits.at.this.Bank         -1.781 0.075280 .  
## german_credit$Telephone                           1.916 0.055624 .  
## german_credit$Foreign.Worker                      1.579 0.114765    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1632154)
## 
##     Null deviance: 210.00  on 999  degrees of freedom
## Residual deviance: 160.28  on 982  degrees of freedom
## AIC: 1045
## 
## Number of Fisher Scoring iterations: 2

threshold=0.5
predicted_values<-ifelse(predict(LogisticModel1,type="response")>threshold,1,0)
actual_values<-LogisticModel1$y
conf_matrix<-table(predicted_values,actual_values)
conf_matrix

##                 actual_values
## predicted_values   0   1
##                0 139  68
##                1 161 632

sensitivity(conf_matrix)

## [1] 0.4633333

specificity(conf_matrix)

## [1] 0.9028571

Logistic Regression Accuracy

accuracy2<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy2

## [1] 0.771

AIC and BIC

##                df       AIC
## LogisticModel  21  998.5572
## LogisticModel1 19 1045.0284

##                df      BIC
## LogisticModel  21 1101.620
## LogisticModel1 19 1138.276

decision tree

The series of questions and their possible answers can be organized in the form of a decision tree, which is a hierarchical structure consisting of nodes and directed edges.The tree has 3 types of nodes:

Rootnode that has no incoming edges and zero or more outgoing edges.

Internal node,each of which has exactly one incoming edge and two or more outgoing edges.

leaf or terminal node,each of which has exactly one incoming edge and no outgoing edges.

library("rpart")
library("tree")

## Warning: package 'tree' was built under R version 3.1.3

german_tree<-rpart(german_credit$Creditability~.,method="class",  data=german_credit)
german_tree

## n= 1000 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 1000 300 1 (0.3000000 0.7000000)  
##     2) Account.Balance< 2.5 543 240 1 (0.4419890 0.5580110)  
##       4) Duration.of.Credit..month.>=22.5 237 103 0 (0.5654008 0.4345992)  
##         8) Value.Savings.Stocks< 3.5 196  74 0 (0.6224490 0.3775510) *
##         9) Value.Savings.Stocks>=3.5 41  12 1 (0.2926829 0.7073171) *
##       5) Duration.of.Credit..month.< 22.5 306 106 1 (0.3464052 0.6535948)  
##        10) Payment.Status.of.Previous.Credit< 1.5 28   7 0 (0.7500000 0.2500000) *
##        11) Payment.Status.of.Previous.Credit>=1.5 278  85 1 (0.3057554 0.6942446)  
##          22) Credit.Amount>=7491.5 7   1 0 (0.8571429 0.1428571) *
##          23) Credit.Amount< 7491.5 271  79 1 (0.2915129 0.7084871)  
##            46) Duration.of.Credit..month.>=11.5 193  67 1 (0.3471503 0.6528497)  
##              92) Credit.Amount< 1387.5 73  36 0 (0.5068493 0.4931507)  
##               184) Most.valuable.available.asset>=2.5 23   4 0 (0.8260870 0.1739130) *
##               185) Most.valuable.available.asset< 2.5 50  18 1 (0.3600000 0.6400000) *
##              93) Credit.Amount>=1387.5 120  30 1 (0.2500000 0.7500000) *
##            47) Duration.of.Credit..month.< 11.5 78  12 1 (0.1538462 0.8461538) *
##     3) Account.Balance>=2.5 457  60 1 (0.1312910 0.8687090) *

library("rattle")
library("rpart.plot")

## Warning: package 'rpart.plot' was built under R version 3.1.3

fancyRpartPlot(german_tree)

printcp(german_tree)

## 
## Classification tree:
## rpart(formula = german_credit$Creditability ~ ., data = german_credit, 
##     method = "class")
## 
## Variables actually used in tree construction:
## [1] Account.Balance                   Credit.Amount                    
## [3] Duration.of.Credit..month.        Most.valuable.available.asset    
## [5] Payment.Status.of.Previous.Credit Value.Savings.Stocks             
## 
## Root node error: 300/1000 = 0.3
## 
## n= 1000 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.051667      0   1.00000 1.00000 0.048305
## 2 0.046667      3   0.84000 0.98000 0.048024
## 3 0.016667      4   0.79333 0.86000 0.046120
## 4 0.010000      8   0.72667 0.85333 0.046003

plotcp(german_tree)

#pruning

library("rpart")
library("tree")
german_tree1<-rpart(german_credit$Creditability~.,method="class", control=rpart.control(minsplit=30, cp=0.028), data=german_credit)
german_tree1

## n= 1000 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 1000 300 1 (0.3000000 0.7000000)  
##    2) Account.Balance< 2.5 543 240 1 (0.4419890 0.5580110)  
##      4) Duration.of.Credit..month.>=22.5 237 103 0 (0.5654008 0.4345992)  
##        8) Value.Savings.Stocks< 3.5 196  74 0 (0.6224490 0.3775510) *
##        9) Value.Savings.Stocks>=3.5 41  12 1 (0.2926829 0.7073171) *
##      5) Duration.of.Credit..month.< 22.5 306 106 1 (0.3464052 0.6535948)  
##       10) Payment.Status.of.Previous.Credit< 1.5 28   7 0 (0.7500000 0.2500000) *
##       11) Payment.Status.of.Previous.Credit>=1.5 278  85 1 (0.3057554 0.6942446) *
##    3) Account.Balance>=2.5 457  60 1 (0.1312910 0.8687090) *

library("rattle")
library("rpart.plot")
fancyRpartPlot(german_tree1)

printcp(german_tree1)

## 
## Classification tree:
## rpart(formula = german_credit$Creditability ~ ., data = german_credit, 
##     method = "class", control = rpart.control(minsplit = 30, 
##         cp = 0.028))
## 
## Variables actually used in tree construction:
## [1] Account.Balance                   Duration.of.Credit..month.       
## [3] Payment.Status.of.Previous.Credit Value.Savings.Stocks             
## 
## Root node error: 300/1000 = 0.3
## 
## n= 1000 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.051667      0   1.00000 1.00000 0.048305
## 2 0.046667      3   0.84000 1.01000 0.048441
## 3 0.028000      4   0.79333 0.85333 0.046003

plotcp(german_tree1)

#Prediction using the model

accuracy

Accuracy=(TP+TN)/(TP+FP+FN+TN) Sensitivity= TP/(TP+FN) or TP/ Overall Positives Specificity = TN/(TN+FP) or TN/ Overall Negatives

library("caret")
sample_pred<-predict(german_tree1, type="class")
conf_matrix<-table(sample_pred,german_credit$Creditability)
conf_matrix

##            
## sample_pred   0   1
##           0 143  81
##           1 157 619

accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy

## [1] 0.762

outlier treatment for creditamount

summary(german_credit$Credit.Amount)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     250    1366    2320    3271    3972   18420

boxplot(german_credit$Credit.Amount)

quantile(german_credit$Credit.Amount)

##       0%      25%      50%      75%     100% 
##   250.00  1365.50  2319.50  3972.25 18424.00

quantile(german_credit$Credit.Amount,c(0.75,0.80,0.90,0.95,1))

##      75%      80%      90%      95%     100% 
##  3972.25  4720.00  7179.40  9162.70 18424.00

german_credit$credit_amount_new<-german_credit$Credit.Amount
bench<- 3972 +3*IQR(german_credit$credit_amount_new)
bench

## [1] 11792.25

mean=3271

summary(german_credit$Credit.Amount)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     250    1366    2320    3271    3972   18420

german_credit$credit_amount_new<-ifelse((german_credit$Credit.Amount)>bench,mean,german_credit$Credit.Amount)

summary(german_credit$credit_amount_new)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     250    1366    2320    3017    3786   11760

boxplot(german_credit$credit_amount_new)

length(german_credit$credit_amount_new)

## [1] 1000

names(german_credit)

##  [1] "Creditability"                    
##  [2] "Account.Balance"                  
##  [3] "Duration.of.Credit..month."       
##  [4] "Payment.Status.of.Previous.Credit"
##  [5] "Purpose"                          
##  [6] "Credit.Amount"                    
##  [7] "Value.Savings.Stocks"             
##  [8] "Length.of.current.employment"     
##  [9] "Instalment.per.cent"              
## [10] "Sex...Marital.Status"             
## [11] "Guarantors"                       
## [12] "Duration.in.Current.address"      
## [13] "Most.valuable.available.asset"    
## [14] "Age..years."                      
## [15] "Concurrent.Credits"               
## [16] "Type.of.apartment"                
## [17] "No.of.Credits.at.this.Bank"       
## [18] "Occupation"                       
## [19] "No.of.dependents"                 
## [20] "Telephone"                        
## [21] "Foreign.Worker"                   
## [22] "credit_amount_new"

dim(german_credit)

## [1] 1000   22

Logistic model (credit amount replace with credit_amount_new)

LogisticModel3<- glm(german_credit$Creditability~german_credit$Account.Balance+german_credit$Duration.of.Credit..month.+ german_credit$Payment.Status.of.Previous.Credit+german_credit$Purpose+german_credit$credit_amount_new+german_credit$Value.Savings.Stocks+german_credit$Length.of.current.employment+german_credit$Instalment.per.cent +german_credit$Sex...Marital.Status+german_credit$Guarantors+german_credit$Most.valuable.available.asset+german_credit$Age..years.+german_credit$Concurrent.Credits +german_credit$Type.of.apartment+german_credit$No.of.Credits.at.this.Bank+german_credit$Telephone+german_credit$Foreign.Worker) 
 
summary(LogisticModel3)

## 
## Call:
## glm(formula = german_credit$Creditability ~ german_credit$Account.Balance + 
##     german_credit$Duration.of.Credit..month. + german_credit$Payment.Status.of.Previous.Credit + 
##     german_credit$Purpose + german_credit$credit_amount_new + 
##     german_credit$Value.Savings.Stocks + german_credit$Length.of.current.employment + 
##     german_credit$Instalment.per.cent + german_credit$Sex...Marital.Status + 
##     german_credit$Guarantors + german_credit$Most.valuable.available.asset + 
##     german_credit$Age..years. + german_credit$Concurrent.Credits + 
##     german_credit$Type.of.apartment + german_credit$No.of.Credits.at.this.Bank + 
##     german_credit$Telephone + german_credit$Foreign.Worker)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0890  -0.3366   0.1152   0.2985   0.7536  
## 
## Coefficients:
##                                                   Estimate Std. Error
## (Intercept)                                     -6.652e-02  1.366e-01
## german_credit$Account.Balance                    9.966e-02  1.082e-02
## german_credit$Duration.of.Credit..month.        -6.623e-03  1.466e-03
## german_credit$Payment.Status.of.Previous.Credit  6.695e-02  1.387e-02
## german_credit$Purpose                            5.128e-03  4.844e-03
## german_credit$credit_amount_new                  1.268e-06  8.076e-06
## german_credit$Value.Savings.Stocks               3.288e-02  8.473e-03
## german_credit$Length.of.current.employment       2.541e-02  1.131e-02
## german_credit$Instalment.per.cent               -3.278e-02  1.292e-02
## german_credit$Sex...Marital.Status               4.173e-02  1.858e-02
## german_credit$Guarantors                         5.761e-02  2.780e-02
## german_credit$Most.valuable.available.asset     -3.751e-02  1.404e-02
## german_credit$Age..years.                        7.464e-04  1.252e-03
## german_credit$Concurrent.Credits                 3.722e-02  1.884e-02
## german_credit$Type.of.apartment                  4.825e-02  2.738e-02
## german_credit$No.of.Credits.at.this.Bank        -4.772e-02  2.536e-02
## german_credit$Telephone                          3.928e-02  2.763e-02
## german_credit$Foreign.Worker                     1.014e-01  7.040e-02
##                                                 t value Pr(>|t|)    
## (Intercept)                                      -0.487 0.626386    
## german_credit$Account.Balance                     9.213  < 2e-16 ***
## german_credit$Duration.of.Credit..month.         -4.518 7.01e-06 ***
## german_credit$Payment.Status.of.Previous.Credit   4.825 1.62e-06 ***
## german_credit$Purpose                             1.059 0.289982    
## german_credit$credit_amount_new                   0.157 0.875303    
## german_credit$Value.Savings.Stocks                3.881 0.000111 ***
## german_credit$Length.of.current.employment        2.246 0.024927 *  
## german_credit$Instalment.per.cent                -2.537 0.011327 *  
## german_credit$Sex...Marital.Status                2.246 0.024900 *  
## german_credit$Guarantors                          2.072 0.038505 *  
## german_credit$Most.valuable.available.asset      -2.671 0.007684 ** 
## german_credit$Age..years.                         0.596 0.551113    
## german_credit$Concurrent.Credits                  1.976 0.048425 *  
## german_credit$Type.of.apartment                   1.762 0.078315 .  
## german_credit$No.of.Credits.at.this.Bank         -1.882 0.060168 .  
## german_credit$Telephone                           1.421 0.155489    
## german_credit$Foreign.Worker                      1.440 0.150311    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1640462)
## 
##     Null deviance: 210.00  on 999  degrees of freedom
## Residual deviance: 161.09  on 982  degrees of freedom
## AIC: 1050.1
## 
## Number of Fisher Scoring iterations: 2

library("rattle")
threshold=0.5
predicted_values<-ifelse(predict(LogisticModel3,type="response")>threshold,1,0)
actual_values<-LogisticModel3$y
conf_matrix<-table(predicted_values,actual_values)
conf_matrix

##                 actual_values
## predicted_values   0   1
##                0 140  69
##                1 160 631

sensitivity(conf_matrix)

## [1] 0.4666667

specificity(conf_matrix)

## [1] 0.9014286

Accuracy

accuracy3<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy3

## [1] 0.771

logistic model1 builds with a with out outlier treatment(credit.amount) and it’s accuracy is 0.77. logistic model3 builds with a outlier treatment(credit_amount_new)and it’s accuracy is 0.77.

Changing Threshold

threshold=0.8
predicted_values<-ifelse(predict(LogisticModel3,type="response")>threshold,1,0)
actual_values<-LogisticModel3$y
conf_matrix<-table(predicted_values,actual_values)
conf_matrix

##                 actual_values
## predicted_values   0   1
##                0 265 385
##                1  35 315

sensitivity(conf_matrix)

## [1] 0.8833333

specificity(conf_matrix)

## [1] 0.45

accuracy4<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy4

## [1] 0.58

AIC and BIC

AIC(LogisticModel,LogisticModel1,LogisticModel3)

##                df       AIC
## LogisticModel  21  998.5572
## LogisticModel1 19 1045.0284
## LogisticModel3 19 1050.1058

BIC(LogisticModel,LogisticModel1,LogisticModel3)

##                df      BIC
## LogisticModel  21 1101.620
## LogisticModel1 19 1138.276
## LogisticModel3 19 1143.353

Decision tree(Credit.Amount replace with credit_amount_new )

library("rpart", lib.loc="~/R/win-library/3.3")
library("tree", lib.loc="~/R/win-library/3.3")
dim(german_credit)

## [1] 1000   22

names(german_credit)

##  [1] "Creditability"                    
##  [2] "Account.Balance"                  
##  [3] "Duration.of.Credit..month."       
##  [4] "Payment.Status.of.Previous.Credit"
##  [5] "Purpose"                          
##  [6] "Credit.Amount"                    
##  [7] "Value.Savings.Stocks"             
##  [8] "Length.of.current.employment"     
##  [9] "Instalment.per.cent"              
## [10] "Sex...Marital.Status"             
## [11] "Guarantors"                       
## [12] "Duration.in.Current.address"      
## [13] "Most.valuable.available.asset"    
## [14] "Age..years."                      
## [15] "Concurrent.Credits"               
## [16] "Type.of.apartment"                
## [17] "No.of.Credits.at.this.Bank"       
## [18] "Occupation"                       
## [19] "No.of.dependents"                 
## [20] "Telephone"                        
## [21] "Foreign.Worker"                   
## [22] "credit_amount_new"

german_tree1<-rpart(german_credit$Creditability~german_credit$Account.Balance+german_credit$Duration.of.Credit..month.+ german_credit$Payment.Status.of.Previous.Credit+german_credit$Purpose+german_credit$credit_amount_new+german_credit$Value.Savings.Stocks+german_credit$Length.of.current.employment+german_credit$Instalment.per.cent +german_credit$Sex...Marital.Status+german_credit$Guarantors+german_credit$Duration.in.Current.address+german_credit$Most.valuable.available.asset+german_credit$Age..years.+german_credit$Concurrent.Credits +german_credit$Type.of.apartment+german_credit$No.of.Credits.at.this.Bank+german_credit$Occupation+german_credit$No.of.dependents+german_credit$Telephone+german_credit$Foreign.Worker,method="class", control=rpart.control(minsplit=30,cp=0.028))
german_tree1

## n= 1000 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 1000 300 1 (0.3000000 0.7000000)  
##    2) german_credit$Account.Balance< 2.5 543 240 1 (0.4419890 0.5580110)  
##      4) german_credit$Duration.of.Credit..month.>=22.5 237 103 0 (0.5654008 0.4345992)  
##        8) german_credit$Value.Savings.Stocks< 3.5 196  74 0 (0.6224490 0.3775510) *
##        9) german_credit$Value.Savings.Stocks>=3.5 41  12 1 (0.2926829 0.7073171) *
##      5) german_credit$Duration.of.Credit..month.< 22.5 306 106 1 (0.3464052 0.6535948)  
##       10) german_credit$Payment.Status.of.Previous.Credit< 1.5 28   7 0 (0.7500000 0.2500000) *
##       11) german_credit$Payment.Status.of.Previous.Credit>=1.5 278  85 1 (0.3057554 0.6942446) *
##    3) german_credit$Account.Balance>=2.5 457  60 1 (0.1312910 0.8687090) *

library("rattle", lib.loc="~/R/win-library/3.3")
library("rpart.plot", lib.loc="~/R/win-library/3.3")
fancyRpartPlot(german_tree1)

printcp(german_tree1)

## 
## Classification tree:
## rpart(formula = german_credit$Creditability ~ german_credit$Account.Balance + 
##     german_credit$Duration.of.Credit..month. + german_credit$Payment.Status.of.Previous.Credit + 
##     german_credit$Purpose + german_credit$credit_amount_new + 
##     german_credit$Value.Savings.Stocks + german_credit$Length.of.current.employment + 
##     german_credit$Instalment.per.cent + german_credit$Sex...Marital.Status + 
##     german_credit$Guarantors + german_credit$Duration.in.Current.address + 
##     german_credit$Most.valuable.available.asset + german_credit$Age..years. + 
##     german_credit$Concurrent.Credits + german_credit$Type.of.apartment + 
##     german_credit$No.of.Credits.at.this.Bank + german_credit$Occupation + 
##     german_credit$No.of.dependents + german_credit$Telephone + 
##     german_credit$Foreign.Worker, method = "class", control = rpart.control(minsplit = 30, 
##     cp = 0.028))
## 
## Variables actually used in tree construction:
## [1] german_credit$Account.Balance                  
## [2] german_credit$Duration.of.Credit..month.       
## [3] german_credit$Payment.Status.of.Previous.Credit
## [4] german_credit$Value.Savings.Stocks             
## 
## Root node error: 300/1000 = 0.3
## 
## n= 1000 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.051667      0   1.00000 1.00000 0.048305
## 2 0.046667      3   0.84000 0.96000 0.047733
## 3 0.028000      4   0.79333 0.81667 0.045335

plotcp(german_tree1)

Prediction using the model

sample_pred<-predict(german_tree1, type="class")
conf_matrix<-table(sample_pred,german_credit$Creditability)
conf_matrix

##            
## sample_pred   0   1
##           0 143  81
##           1 157 619

accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy

## [1] 0.762

german_tree builds with a with out outlier treatment and its accuracy is 0.76.german_tree1 builds with a outlier treatment and it’s accuracy is 0.76. there is no difference in accuracy with out outlier treatment/with outlier treatment.

Training and Validation data

german_credit<-german_credit[-22]
dim(german_credit)

## [1] 1000   21

names(german_credit)

##  [1] "Creditability"                    
##  [2] "Account.Balance"                  
##  [3] "Duration.of.Credit..month."       
##  [4] "Payment.Status.of.Previous.Credit"
##  [5] "Purpose"                          
##  [6] "Credit.Amount"                    
##  [7] "Value.Savings.Stocks"             
##  [8] "Length.of.current.employment"     
##  [9] "Instalment.per.cent"              
## [10] "Sex...Marital.Status"             
## [11] "Guarantors"                       
## [12] "Duration.in.Current.address"      
## [13] "Most.valuable.available.asset"    
## [14] "Age..years."                      
## [15] "Concurrent.Credits"               
## [16] "Type.of.apartment"                
## [17] "No.of.Credits.at.this.Bank"       
## [18] "Occupation"                       
## [19] "No.of.dependents"                 
## [20] "Telephone"                        
## [21] "Foreign.Worker"

library("caret", lib.loc="~/R/win-library/3.3")
set.seed(121)
sampledata <- createDataPartition(german_credit$Creditability, p=0.80, list=FALSE)
train_new <- german_credit[sampledata,]
hold_out <- german_credit[-sampledata,]

Overfitting

Model on training data

library("rpart", lib.loc="~/R/win-library/3.3")
library("tree", lib.loc="~/R/win-library/3.3")
german_tree2<-rpart(Creditability~.,method="class", control=rpart.control(minsplit=30,cp=0.028), data=train_new)
german_tree2

## n= 800 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 800 230 1 (0.2875000 0.7125000)  
##    2) Account.Balance< 2.5 426 180 1 (0.4225352 0.5774648)  
##      4) Duration.of.Credit..month.>=15.5 260 127 0 (0.5115385 0.4884615)  
##        8) Value.Savings.Stocks< 3.5 210  88 0 (0.5809524 0.4190476) *
##        9) Value.Savings.Stocks>=3.5 50  11 1 (0.2200000 0.7800000) *
##      5) Duration.of.Credit..month.< 15.5 166  47 1 (0.2831325 0.7168675)  
##       10) Payment.Status.of.Previous.Credit< 1.5 12   2 0 (0.8333333 0.1666667) *
##       11) Payment.Status.of.Previous.Credit>=1.5 154  37 1 (0.2402597 0.7597403) *
##    3) Account.Balance>=2.5 374  50 1 (0.1336898 0.8663102) *

library("rattle", lib.loc="~/R/win-library/3.3")
library("tree", lib.loc="~/R/win-library/3.3")
sample_pred<-predict(german_tree2, type="class")
 

conf_matrix<-table(sample_pred,train_new$Creditability)
conf_matrix

##            
## sample_pred   0   1
##           0 132  90
##           1  98 480

accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy

## [1] 0.765

Validation accuracy

library("rattle", lib.loc="~/R/win-library/3.3")
hold_out$pred<- predict(german_tree2, hold_out,type="class")
conf_matrix_val<-table(hold_out$pred,hold_out$Creditability)
conf_matrix_val

##    
##       0   1
##   0  38  26
##   1  32 104

accuracy_val<-(conf_matrix_val[1,1]+conf_matrix_val[2,2])/(sum(conf_matrix_val))
accuracy_val

## [1] 0.71

It performs well on training data not in test data.

ROC and AUC

library("pROC")

## Warning: package 'pROC' was built under R version 3.1.3

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following object is masked from 'package:gmodels':
## 
##     ci

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

german_credit<-german_credit[-22]
dim(german_credit)

## [1] 1000   21

names(german_credit)

##  [1] "Creditability"                    
##  [2] "Account.Balance"                  
##  [3] "Duration.of.Credit..month."       
##  [4] "Payment.Status.of.Previous.Credit"
##  [5] "Purpose"                          
##  [6] "Credit.Amount"                    
##  [7] "Value.Savings.Stocks"             
##  [8] "Length.of.current.employment"     
##  [9] "Instalment.per.cent"              
## [10] "Sex...Marital.Status"             
## [11] "Guarantors"                       
## [12] "Duration.in.Current.address"      
## [13] "Most.valuable.available.asset"    
## [14] "Age..years."                      
## [15] "Concurrent.Credits"               
## [16] "Type.of.apartment"                
## [17] "No.of.Credits.at.this.Bank"       
## [18] "Occupation"                       
## [19] "No.of.dependents"                 
## [20] "Telephone"                        
## [21] "Foreign.Worker"

LogisticModel4<- glm(german_credit$Creditability~.,family=binomial(),data=german_credit)
summary(LogisticModel4)

## 
## Call:
## glm(formula = german_credit$Creditability ~ ., family = binomial(), 
##     data = german_credit)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5854  -0.7927   0.4512   0.7445   1.9483  
## 
## Coefficients:
##                                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)                       -3.994e+00  1.024e+00  -3.901 9.58e-05
## Account.Balance                    5.799e-01  7.004e-02   8.280  < 2e-16
## Duration.of.Credit..month.        -2.457e-02  8.725e-03  -2.816 0.004862
## Payment.Status.of.Previous.Credit  3.822e-01  8.740e-02   4.373 1.23e-05
## Purpose                            3.153e-02  3.009e-02   1.048 0.294697
## Credit.Amount                     -9.340e-05  4.012e-05  -2.328 0.019908
## Value.Savings.Stocks               2.391e-01  5.827e-02   4.104 4.07e-05
## Length.of.current.employment       1.517e-01  7.118e-02   2.132 0.033027
## Instalment.per.cent               -2.983e-01  8.276e-02  -3.605 0.000312
## Sex...Marital.Status               2.574e-01  1.157e-01   2.224 0.026131
## Guarantors                         3.473e-01  1.777e-01   1.954 0.050681
## Duration.in.Current.address       -1.411e-02  7.742e-02  -0.182 0.855335
## Most.valuable.available.asset     -1.828e-01  9.101e-02  -2.009 0.044521
## Age..years.                        8.917e-03  8.206e-03   1.087 0.277218
## Concurrent.Credits                 2.419e-01  1.111e-01   2.178 0.029420
## Type.of.apartment                  2.931e-01  1.677e-01   1.748 0.080527
## No.of.Credits.at.this.Bank        -2.436e-01  1.610e-01  -1.513 0.130257
## Occupation                         1.889e-02  1.367e-01   0.138 0.890081
## No.of.dependents                  -1.708e-01  2.319e-01  -0.736 0.461567
## Telephone                          2.947e-01  1.880e-01   1.567 0.117024
## Foreign.Worker                     1.158e+00  6.078e-01   1.906 0.056680
##                                      
## (Intercept)                       ***
## Account.Balance                   ***
## Duration.of.Credit..month.        ** 
## Payment.Status.of.Previous.Credit ***
## Purpose                              
## Credit.Amount                     *  
## Value.Savings.Stocks              ***
## Length.of.current.employment      *  
## Instalment.per.cent               ***
## Sex...Marital.Status              *  
## Guarantors                        .  
## Duration.in.Current.address          
## Most.valuable.available.asset     *  
## Age..years.                          
## Concurrent.Credits                *  
## Type.of.apartment                 .  
## No.of.Credits.at.this.Bank           
## Occupation                           
## No.of.dependents                     
## Telephone                            
## Foreign.Worker                    .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1221.73  on 999  degrees of freedom
## Residual deviance:  956.56  on 979  degrees of freedom
## AIC: 998.56
## 
## Number of Fisher Scoring iterations: 5

predicted_prob<-predict(LogisticModel4,type="response")
roccurve <- roc(LogisticModel4$y, predicted_prob)
plot(roccurve)

## 
## Call:
## roc.default(response = LogisticModel4$y, predictor = predicted_prob)
## 
## Data: predicted_prob in 300 controls (LogisticModel4$y 0) < 700 cases (LogisticModel4$y 1).
## Area under the curve: 0.8036

auc(roccurve)

## Area under the curve: 0.8036

auc(LogisticModel4$y, predicted_prob)

## Area under the curve: 0.8036

ROC and AUC on decision

names(german_credit)

##  [1] "Creditability"                    
##  [2] "Account.Balance"                  
##  [3] "Duration.of.Credit..month."       
##  [4] "Payment.Status.of.Previous.Credit"
##  [5] "Purpose"                          
##  [6] "Credit.Amount"                    
##  [7] "Value.Savings.Stocks"             
##  [8] "Length.of.current.employment"     
##  [9] "Instalment.per.cent"              
## [10] "Sex...Marital.Status"             
## [11] "Guarantors"                       
## [12] "Duration.in.Current.address"      
## [13] "Most.valuable.available.asset"    
## [14] "Age..years."                      
## [15] "Concurrent.Credits"               
## [16] "Type.of.apartment"                
## [17] "No.of.Credits.at.this.Bank"       
## [18] "Occupation"                       
## [19] "No.of.dependents"                 
## [20] "Telephone"                        
## [21] "Foreign.Worker"

library("pROC")

german_credit_decision<-rpart(german_credit$Creditability~.,method="class",control=rpart.control(minsplit=30,cp=0.028),data=german_credit)
summary(german_credit_decision)

## Call:
## rpart(formula = german_credit$Creditability ~ ., data = german_credit, 
##     method = "class", control = rpart.control(minsplit = 30, 
##         cp = 0.028))
##   n= 1000 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.05166667      0 1.0000000 1.0000000 0.04830459
## 2 0.04666667      3 0.8400000 0.9633333 0.04778180
## 3 0.02800000      4 0.7933333 0.8366667 0.04570424
## 
## Variable importance
##                   Account.Balance Payment.Status.of.Previous.Credit 
##                                46                                14 
##              Value.Savings.Stocks        Duration.of.Credit..month. 
##                                14                                12 
##                     Credit.Amount     Most.valuable.available.asset 
##                                 5                                 2 
##                 Type.of.apartment                           Purpose 
##                                 1                                 1 
##                       Age..years.      Length.of.current.employment 
##                                 1                                 1 
##        No.of.Credits.at.this.Bank                        Occupation 
##                                 1                                 1 
## 
## Node number 1: 1000 observations,    complexity param=0.05166667
##   predicted class=1  expected loss=0.3  P(node) =1
##     class counts:   300   700
##    probabilities: 0.300 0.700 
##   left son=2 (543 obs) right son=3 (457 obs)
##   Primary splits:
##       Account.Balance                   < 2.5    to the left,  improve=47.90962, (0 missing)
##       Payment.Status.of.Previous.Credit < 1.5    to the left,  improve=17.06212, (0 missing)
##       Value.Savings.Stocks              < 2.5    to the left,  improve=14.80642, (0 missing)
##       Duration.of.Credit..month.        < 34.5   to the right, improve=13.62155, (0 missing)
##       Credit.Amount                     < 3909.5 to the right, improve=11.08768, (0 missing)
##   Surrogate splits:
##       Value.Savings.Stocks              < 2.5    to the left,  agree=0.611, adj=0.149, (0 split)
##       Payment.Status.of.Previous.Credit < 3.5    to the left,  agree=0.592, adj=0.107, (0 split)
##       Length.of.current.employment      < 4.5    to the left,  agree=0.554, adj=0.024, (0 split)
##       Age..years.                       < 30.5   to the left,  agree=0.554, adj=0.024, (0 split)
##       No.of.Credits.at.this.Bank        < 1.5    to the left,  agree=0.554, adj=0.024, (0 split)
## 
## Node number 2: 543 observations,    complexity param=0.05166667
##   predicted class=1  expected loss=0.441989  P(node) =0.543
##     class counts:   240   303
##    probabilities: 0.442 0.558 
##   left son=4 (237 obs) right son=5 (306 obs)
##   Primary splits:
##       Duration.of.Credit..month.        < 22.5   to the right, improve=12.810640, (0 missing)
##       Payment.Status.of.Previous.Credit < 1.5    to the left,  improve= 9.653787, (0 missing)
##       Most.valuable.available.asset     < 1.5    to the right, improve= 9.181363, (0 missing)
##       Value.Savings.Stocks              < 2.5    to the left,  improve= 8.890786, (0 missing)
##       Credit.Amount                     < 8079   to the right, improve= 6.601270, (0 missing)
##   Surrogate splits:
##       Credit.Amount                 < 2805.5 to the right, agree=0.748, adj=0.422, (0 split)
##       Most.valuable.available.asset < 2.5    to the right, agree=0.646, adj=0.190, (0 split)
##       Type.of.apartment             < 2.5    to the right, agree=0.606, adj=0.097, (0 split)
##       Purpose                       < 8.5    to the right, agree=0.604, adj=0.093, (0 split)
##       Occupation                    < 3.5    to the right, agree=0.595, adj=0.072, (0 split)
## 
## Node number 3: 457 observations
##   predicted class=1  expected loss=0.131291  P(node) =0.457
##     class counts:    60   397
##    probabilities: 0.131 0.869 
## 
## Node number 4: 237 observations,    complexity param=0.05166667
##   predicted class=0  expected loss=0.4345992  P(node) =0.237
##     class counts:   134   103
##    probabilities: 0.565 0.435 
##   left son=8 (196 obs) right son=9 (41 obs)
##   Primary splits:
##       Value.Savings.Stocks       < 3.5    to the left,  improve=7.374515, (0 missing)
##       Credit.Amount              < 1381.5 to the left,  improve=3.289316, (0 missing)
##       Instalment.per.cent        < 2.5    to the right, improve=3.067516, (0 missing)
##       Duration.of.Credit..month. < 43.5   to the right, improve=2.564920, (0 missing)
##       Purpose                    < 0.5    to the left,  improve=2.358685, (0 missing)
## 
## Node number 5: 306 observations,    complexity param=0.04666667
##   predicted class=1  expected loss=0.3464052  P(node) =0.306
##     class counts:   106   200
##    probabilities: 0.346 0.654 
##   left son=10 (28 obs) right son=11 (278 obs)
##   Primary splits:
##       Payment.Status.of.Previous.Credit < 1.5    to the left,  improve=10.040510, (0 missing)
##       Most.valuable.available.asset     < 1.5    to the right, improve= 5.585685, (0 missing)
##       Guarantors                        < 2.5    to the left,  improve= 3.782059, (0 missing)
##       Duration.of.Credit..month.        < 11.5   to the right, improve= 3.766531, (0 missing)
##       Purpose                           < 7      to the left,  improve= 3.278937, (0 missing)
## 
## Node number 8: 196 observations
##   predicted class=0  expected loss=0.377551  P(node) =0.196
##     class counts:   122    74
##    probabilities: 0.622 0.378 
## 
## Node number 9: 41 observations
##   predicted class=1  expected loss=0.2926829  P(node) =0.041
##     class counts:    12    29
##    probabilities: 0.293 0.707 
## 
## Node number 10: 28 observations
##   predicted class=0  expected loss=0.25  P(node) =0.028
##     class counts:    21     7
##    probabilities: 0.750 0.250 
## 
## Node number 11: 278 observations
##   predicted class=1  expected loss=0.3057554  P(node) =0.278
##     class counts:    85   193
##    probabilities: 0.306 0.694

predicted_prob<-predict(german_credit_decision,type="class")
predicted_prob1<-ifelse(predicted_prob ==1,2,1)
roccurve <- roc(german_credit_decision$y, predicted_prob1)
plot(roccurve)

## 
## Call:
## roc.default(response = german_credit_decision$y, predictor = predicted_prob1)
## 
## Data: predicted_prob1 in 300 controls (german_credit_decision$y 1) < 700 cases (german_credit_decision$y 2).
## Area under the curve: 0.6805

auc(roccurve)

## Area under the curve: 0.6805

auc(german_credit_decision$y, predicted_prob1)

## Area under the curve: 0.6805

k-fold Cross Validation building

Divide the whole dataset into k equal parts Use kth part of the data as the holdout sample, use remaining k-1 parts of the data as training data.Repeat this K times, build K models. The average error on holdout sample gives us an idea on the testing error

K=10

library("caret", lib.loc="~/R/win-library/3.3")
train_dat <- trainControl(method="cv", number=10)
train_dat

## $method
## [1] "cv"
## 
## $number
## [1] 10
## 
## $repeats
## [1] 1
## 
## $search
## [1] "grid"
## 
## $p
## [1] 0.75
## 
## $initialWindow
## NULL
## 
## $horizon
## [1] 1
## 
## $fixedWindow
## [1] TRUE
## 
## $verboseIter
## [1] FALSE
## 
## $returnData
## [1] TRUE
## 
## $returnResamp
## [1] "final"
## 
## $savePredictions
## [1] FALSE
## 
## $classProbs
## [1] FALSE
## 
## $summaryFunction
## function (data, lev = NULL, model = NULL) 
## {
##     if (is.character(data$obs)) 
##         data$obs <- factor(data$obs, levels = lev)
##     postResample(data[, "pred"], data[, "obs"])
## }
## <environment: namespace:caret>
## 
## $selectionFunction
## [1] "best"
## 
## $preProcOptions
## $preProcOptions$thresh
## [1] 0.95
## 
## $preProcOptions$ICAcomp
## [1] 3
## 
## $preProcOptions$k
## [1] 5
## 
## 
## $sampling
## NULL
## 
## $index
## NULL
## 
## $indexOut
## NULL
## 
## $indexFinal
## NULL
## 
## $timingSamps
## [1] 0
## 
## $predictionBounds
## [1] FALSE FALSE
## 
## $seeds
## [1] NA
## 
## $adaptive
## $adaptive$min
## [1] 5
## 
## $adaptive$alpha
## [1] 0.05
## 
## $adaptive$method
## [1] "gls"
## 
## $adaptive$complete
## [1] TRUE
## 
## 
## $trim
## [1] FALSE
## 
## $allowParallel
## [1] TRUE

names(german_credit)

##  [1] "Creditability"                    
##  [2] "Account.Balance"                  
##  [3] "Duration.of.Credit..month."       
##  [4] "Payment.Status.of.Previous.Credit"
##  [5] "Purpose"                          
##  [6] "Credit.Amount"                    
##  [7] "Value.Savings.Stocks"             
##  [8] "Length.of.current.employment"     
##  [9] "Instalment.per.cent"              
## [10] "Sex...Marital.Status"             
## [11] "Guarantors"                       
## [12] "Duration.in.Current.address"      
## [13] "Most.valuable.available.asset"    
## [14] "Age..years."                      
## [15] "Concurrent.Credits"               
## [16] "Type.of.apartment"                
## [17] "No.of.Credits.at.this.Bank"       
## [18] "Occupation"                       
## [19] "No.of.dependents"                 
## [20] "Telephone"                        
## [21] "Foreign.Worker"

german_credit$Creditability<-as.factor(german_credit$Creditability)

Building the models on K-fold samples

library("e1071")

## Warning: package 'e1071' was built under R version 3.1.3

K_fold_tree<-train(Creditability~., method="rpart", trControl=train_dat, control=rpart.control(minsplit=10, cp=0.000001),  data=german_credit)
K_fold_tree

## CART 
## 
## 1000 samples
##   20 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 900, 900, 900, 900, 900, 900, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy  Kappa    
##   0.01666667  0.748     0.3496912
##   0.04666667  0.717     0.2059117
##   0.05166667  0.701     0.1253775
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was cp = 0.01666667.

K_fold_tree$finalModel

K_fold_tree$finalModel

## n= 1000 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 1000 300 1 (0.3000000 0.7000000)  
##    2) Account.Balance< 2.5 543 240 1 (0.4419890 0.5580110)  
##      4) Duration.of.Credit..month.>=22.5 237 103 0 (0.5654008 0.4345992)  
##        8) Value.Savings.Stocks< 3.5 196  74 0 (0.6224490 0.3775510) *
##        9) Value.Savings.Stocks>=3.5 41  12 1 (0.2926829 0.7073171) *
##      5) Duration.of.Credit..month.< 22.5 306 106 1 (0.3464052 0.6535948)  
##       10) Payment.Status.of.Previous.Credit< 1.5 28   7 0 (0.7500000 0.2500000) *
##       11) Payment.Status.of.Previous.Credit>=1.5 278  85 1 (0.3057554 0.6942446) *
##    3) Account.Balance>=2.5 457  60 1 (0.1312910 0.8687090) *

library("rattle", lib.loc="~/R/win-library/3.3")
fancyRpartPlot(K_fold_tree$finalModel)

Kfold_pred<-predict(K_fold_tree)

confusion matrix

conf_matrix<-confusionMatrix(Kfold_pred,german_credit$Creditability)
conf_matrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 143  81
##          1 157 619
##                                           
##                Accuracy : 0.762           
##                  95% CI : (0.7344, 0.7881)
##     No Information Rate : 0.7             
##     P-Value [Acc > NIR] : 7.345e-06       
##                                           
##                   Kappa : 0.3891          
##  Mcnemar's Test P-Value : 1.165e-06       
##                                           
##             Sensitivity : 0.4767          
##             Specificity : 0.8843          
##          Pos Pred Value : 0.6384          
##          Neg Pred Value : 0.7977          
##              Prevalence : 0.3000          
##          Detection Rate : 0.1430          
##    Detection Prevalence : 0.2240          
##       Balanced Accuracy : 0.6805          
##                                           
##        'Positive' Class : 0               
##

Bootstrap

Boot strapping is a powerful tool to get an idea on accuracy of the model

library("caret", lib.loc="~/R/win-library/3.3")
train_control <- trainControl(method="boot", number=10)

Tree model on boots straped data

Boot_Strap_model <- train(Creditability~., method="rpart", trControl=train_dat, control=rpart.control(minsplit=10, cp=0.000001),  data=german_credit)
names(german_credit)

##  [1] "Creditability"                    
##  [2] "Account.Balance"                  
##  [3] "Duration.of.Credit..month."       
##  [4] "Payment.Status.of.Previous.Credit"
##  [5] "Purpose"                          
##  [6] "Credit.Amount"                    
##  [7] "Value.Savings.Stocks"             
##  [8] "Length.of.current.employment"     
##  [9] "Instalment.per.cent"              
## [10] "Sex...Marital.Status"             
## [11] "Guarantors"                       
## [12] "Duration.in.Current.address"      
## [13] "Most.valuable.available.asset"    
## [14] "Age..years."                      
## [15] "Concurrent.Credits"               
## [16] "Type.of.apartment"                
## [17] "No.of.Credits.at.this.Bank"       
## [18] "Occupation"                       
## [19] "No.of.dependents"                 
## [20] "Telephone"                        
## [21] "Foreign.Worker"

Boot_Strap_model$finalModel

## n= 1000 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 1000 300 1 (0.3000000 0.7000000)  
##    2) Account.Balance< 2.5 543 240 1 (0.4419890 0.5580110)  
##      4) Duration.of.Credit..month.>=22.5 237 103 0 (0.5654008 0.4345992)  
##        8) Value.Savings.Stocks< 3.5 196  74 0 (0.6224490 0.3775510) *
##        9) Value.Savings.Stocks>=3.5 41  12 1 (0.2926829 0.7073171) *
##      5) Duration.of.Credit..month.< 22.5 306 106 1 (0.3464052 0.6535948)  
##       10) Payment.Status.of.Previous.Credit< 1.5 28   7 0 (0.7500000 0.2500000) *
##       11) Payment.Status.of.Previous.Credit>=1.5 278  85 1 (0.3057554 0.6942446) *
##    3) Account.Balance>=2.5 457  60 1 (0.1312910 0.8687090) *

library("rattle", lib.loc="~/R/win-library/3.3")
fancyRpartPlot(Boot_Strap_model$finalModel)

Boot_Strap_predictions <- predict(Boot_Strap_model)
conf_matrix<-confusionMatrix(Boot_Strap_predictions,german_credit$Creditability)
conf_matrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 143  81
##          1 157 619
##                                           
##                Accuracy : 0.762           
##                  95% CI : (0.7344, 0.7881)
##     No Information Rate : 0.7             
##     P-Value [Acc > NIR] : 7.345e-06       
##                                           
##                   Kappa : 0.3891          
##  Mcnemar's Test P-Value : 1.165e-06       
##                                           
##             Sensitivity : 0.4767          
##             Specificity : 0.8843          
##          Pos Pred Value : 0.6384          
##          Neg Pred Value : 0.7977          
##              Prevalence : 0.3000          
##          Detection Rate : 0.1430          
##    Detection Prevalence : 0.2240          
##       Balanced Accuracy : 0.6805          
##                                           
##        'Positive' Class : 0               
##

conclusion:

n= 1000

node), split, n, loss, yval, (yprob) * denotes terminal node

root 1000 300 1 (0.3000000 0.7000000)
Account.Balance< 2.5 543 240 1 (0.4419890 0.5580110)
1. Duration.of.Credit..month.>=22.5 237 103 0 (0.5654008 0.4345992)
2. Value.Savings.Stocks< 3.5 196 74 0 (0.6224490 0.3775510) *
  1. Value.Savings.Stocks>=3.5 41 12 1 (0.2926829 0.7073171) *
3. Duration.of.Credit..month.< 22.5 306 106 1 (0.3464052 0.6535948)
4. Payment.Status.of.Previous.Credit< 1.5 28 7 0 (0.7500000 0.2500000) *
5. Payment.Status.of.Previous.Credit>=1.5 278 85 1 (0.3057554 0.6942446) *
Account.Balance>=2.5 457 60 1 (0.1312910 0.8687090) *

Root node contains 1000 records.ie., over all records in whole data.root node termed as a good customer, loss in that node is bad customers are 300. 30% customers are bad customers and 70% customers are good customers.

2 nd node is Account.Balance< 2.5. if we calculate 54% for 1000recods we get 543 records.2 nd node also termed as good customer, loss in that node is bad customers are 240.out of 543 loss is 240.55%customers are good customers and 44% customers are bad customers.

4 th node is Duration.of.Credit..month.>=22.5.It has a 237 records.4 th node is termed as a bad customer ,loss in that node is goodcustomers are 103.out of 237 loss is 103.Bad customers are around 237.56% customers are bad customers.43% customers are good customers.

8 th node is Value.Savings.Stocks< 3.5.It has a 196 records.8 th node termed as a bad customers,loss in that node is good customers are 74.out of 196 loss is 74. 62% customers are bad customers.38%customers are good customers.

9 th node is Value.Savings.Stocks>=3.5.It has a 41 records.9 th node is termed as a good customers,loss in that node is bad customers are12.out of 41 loss is 12. 29% customers are bad customers.70%customers are good customers.

5 th node is .Duration.of.Credit..month.< 22.5.it has 306 records.5 th node is termed as a good customers,loss in that node is bad customers are106.out of 306 loss is 106. 34% customers are bad customers.65%customers are good customers.

10 th node is Payment.Status.of.Previous.Credit< 1.5. it has a 28 records. 10 th node is termed as a bad customers,loss in that node is good customers are7.out of 28 loss is 7. 75% customers are bad customers.25%customers are good

11 th node is Payment.Status.of.Previous.Credit>=1.5.It has a 278 records.11 th node is termed as a good customers,loss in that node is bad customers are 85.out of 278 loss is 85. 31% customers are bad customers.69%customers are good customers.

3 rd node is Account.Balance>=2.5.It has a 457 records.3rd node termed as a good customers,loss in that node is bad customers are 60.out of 457 loss is 60. 13% customers are bad customers.87%customers are good customers.

Confusion Matrix and Statistics

      Reference

Prediction 0 1

     0 143  81
     
     1 157 619
                                      
           Accuracy : 0.762           
              95% CI : (0.7344, 0.7881)
No Information Rate : 0.7             
P-Value [Acc > NIR] : 7.345e-06       
                                      
             Kappa : 0.3891

Mcnemar’s Test P-Value : 1.165e-06

        Sensitivity : 0.4767          
         Specificity : 0.8843

If we the change thershold value to increase the sensitivity and specificity.

      Pos Pred Value : 0.6384          
      Neg Pred Value : 0.7977          
          Prevalence : 0.3000          
      Detection Rate : 0.1430          
Detection Prevalence : 0.2240          
   Balanced Accuracy : 0.6805          
                                       
    'Positive' Class : 0

real accuracy of the whole data is 76%.

Credit Card Ratings

You can download the datasets .

Content

Title: “German credit”

Abstract:

Source Information

Data File for this case :

data importing:

data information

Variable1 = “Creditability”

variable2=“Account.Balance”

variable3=“Duration.of.Credit..month”

Univariate Analysis of duration in month

variable4=“Payment.Status.of.Previous.Credit”

variable5=“purpose”

variable6=“Credit.Amount”

Univariate Analysis of Credit.Amount

variable7=“Value.Savings.Stocks”

variable8 = “Length.of.current.employment”

variable9 =“Instalment.per.cent”

variable10 =“Sex…Marital.Status”

variable11 = “Guarantors”

variables12 = “Duration.in.Current.address”

variable13=“Most.valuable.available.asset”

variable14 = “Age..years.”

Univariate Analysis of Age..years.

variable15=“Concurrent.Credits”

variable16 = “Type.of.apartment”

variable17= “No.of.Credits.at.this.Bank”

variable18 =“Occupation”

variable19 =“No.of.dependents”

variable20 =“Telephone”

variable21= “Foreign.Worker”

Model Building

predictors – Qualitative chi-square p -value

predicators – numerical data

Building logistic Regression

classification table

Logistic regression Accuracy

Multicollinearity

Individual Impact of Variables

AIC and BIC

Model Selction

Duration.in.Current.address,Occupation and No.of.dependents are the least impacting variable, lets drop these variables and re build the model

Building Final logistic Regression

Logistic Regression Accuracy

AIC and BIC

decision tree

accuracy

outlier treatment for creditamount

Logistic model (credit amount replace with credit_amount_new)

Accuracy

logistic model1 builds with a with out outlier treatment(credit.amount) and it’s accuracy is 0.77. logistic model3 builds with a outlier treatment(credit_amount_new)and it’s accuracy is 0.77.

Changing Threshold

AIC and BIC

Decision tree(Credit.Amount replace with credit_amount_new )

Prediction using the model

german_tree builds with a with out outlier treatment and its accuracy is 0.76.german_tree1 builds with a outlier treatment and it’s accuracy is 0.76. there is no difference in accuracy with out outlier treatment/with outlier treatment.

Training and Validation data

Overfitting

Model on training data

Validation accuracy

It performs well on training data not in test data.

ROC and AUC

ROC and AUC on decision

k-fold Cross Validation building

K=10

Building the models on K-fold samples

K_fold_tree$finalModel

confusion matrix

Bootstrap

Tree model on boots straped data

conclusion:

Confusion Matrix and Statistics

If we the change thershold value to increase the sensitivity and specificity.

real accuracy of the whole data is 76%.