You can download the datasets .
Content
- Title: “German credit”
- Abstract:
- Source Information
- Data File for this case :
- data importing:
- data information
- Variable1 = “Creditability”
- variable2=“Account.Balance”
- variable3=“Duration.of.Credit..month”
- variable4=“Payment.Status.of.Previous.Credit”
- variable5=“purpose”
- variable6=“Credit.Amount”
- variable7=“Value.Savings.Stocks”
- variable8 = “Length.of.current.employment”
- variable9 =“Instalment.per.cent”
- variable10 =“Sex…Marital.Status”
- variable11 = “Guarantors”
- variables12 = “Duration.in.Current.address”
- variable13=“Most.valuable.available.asset”
- variable14 = “Age..years.”
- Univariate Analysis of Age..years.
- variable15=“Concurrent.Credits”
- variable16 = “Type.of.apartment”
- variable17= “No.of.Credits.at.this.Bank”
- variable18 =“Occupation”
- variable19 =“No.of.dependents”
- variable20 =“Telephone”
- variable21= “Foreign.Worker”
- Model Building
- predictors – Qualitative chi-square p -value
- predicators – numerical data
- Building logistic Regression
- classification table
- Logistic regression Accuracy
- Multicollinearity
- Individual Impact of Variables
- AIC and BIC
- Model Selction
- Duration.in.Current.address,Occupation and No.of.dependents are the least impacting variable, lets drop these variables and re build the model
- Building Final logistic Regression
- Logistic Regression Accuracy
- AIC and BIC
- decision tree
- accuracy
- outlier treatment for creditamount
- Logistic model (credit amount replace with credit_amount_new)
- Accuracy
- logistic model1 builds with a with out outlier treatment(credit.amount) and it’s accuracy is 0.77. logistic model3 builds with a outlier treatment(credit_amount_new)and it’s accuracy is 0.77.
- Changing Threshold
- AIC and BIC
- Decision tree(Credit.Amount replace with credit_amount_new )
- Prediction using the model
- german_tree builds with a with out outlier treatment and its accuracy is 0.76.german_tree1 builds with a outlier treatment and it’s accuracy is 0.76. there is no difference in accuracy with out outlier treatment/with outlier treatment.
- Training and Validation data
- Overfitting
- Model on training data
- Validation accuracy
- It performs well on training data not in test data.
- ROC and AUC
- ROC and AUC on decision
- k-fold Cross Validation building
- K=10
- Building the models on K-fold samples
- K_fold_tree$finalModel
- confusion matrix
- Bootstrap
- Tree model on boots straped data
- conclusion:
- If we the change thershold value to increase the sensitivity and specificity.
- real accuracy of the whole data is 76%.
Title: “German credit”
Abstract:
The objective is to predict the credit ratings of customers
Source Information
Professor Dr. Hans Hofmann
Institut f“ur Statistik und”Okonometrie
Universit“at Hamburg
FB Wirtschaftswissenschaften
Von-Melle-Park 5
2000 Hamburg 13
Data File for this case :
German Credit data – german_credit.csv
The German Credit Data contains data on 20 variables and the classification whether an applicant is considered a Good or a Bad credit risk for 1000 loan applicants.
data importing:
german_credit<-read.csv("C:\Amrita\Datavedi\Case Study\german_credit.csv")
names(german_credit)
## [1] "Creditability"
## [2] "Account.Balance"
## [3] "Duration.of.Credit..month."
## [4] "Payment.Status.of.Previous.Credit"
## [5] "Purpose"
## [6] "Credit.Amount"
## [7] "Value.Savings.Stocks"
## [8] "Length.of.current.employment"
## [9] "Instalment.per.cent"
## [10] "Sex...Marital.Status"
## [11] "Guarantors"
## [12] "Duration.in.Current.address"
## [13] "Most.valuable.available.asset"
## [14] "Age..years."
## [15] "Concurrent.Credits"
## [16] "Type.of.apartment"
## [17] "No.of.Credits.at.this.Bank"
## [18] "Occupation"
## [19] "No.of.dependents"
## [20] "Telephone"
## [21] "Foreign.Worker"
dim(german_credit)
## [1] 1000 21
x<-sum(is.na(german_credit))
x
## [1] 0
data information
Variable1 = “Creditability”
It is a Binary data ,which can take only two possible values.The two values in a binary variable, numerically as 0 and 1
0: No, 1:Yes
It is a type of categorical data, which more generally represents experiments with a fixed number of possible outcomes.
This is the target/response variable
variable2=“Account.Balance”
It contains qualitative data. there are four categories.
1 : … < 0 DM
2 : 0 <= … < 200 DM
3 : … >= 200 DM / salary assignments for at least 1 year
4 : no checking account
DM-Deutsche mark.The basic unit of money in Germany.
Account.Balance contains qualitative data.Central tendencies ,dispersion does not make any sense.frequency table,mode and barplot are calculated for qualitative data.mode gives the maximum value of status of Account.Balance.
frequency table of Account.Balance
tab<-table(german_credit$Account.Balance)
tab
##
## 1 2 3 4
## 274 269 63 394
names(tab)
## [1] "1" "2" "3" "4"
x<-sum(is.na(german_credit$Account.Balance))
x
## [1] 0
1 -stands for zero balance, 2 -stands for below 200 balance, 3 -stands for above 200 balance, 4 -stands for no checking accounts.
mode of Account.Balance.It gives the maximum value.
temp <- table(as.vector(german_credit$Account.Balance))
names(temp)[temp == max(temp)]
## [1] "4"
mode of Status of Account.Balance is 4.
ggplot of Account.Balance
library("ggplot2")
## Warning: package 'ggplot2' was built under R version 3.1.3
qplot(data<-german_credit$Account.Balance,main="Account.Balance", ylab="German_currency-Dm", colour= I("purple"),size=I(5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
output description:
1 stands for 274 people have zero balance.
2 stands for 269 people have below 200 DM balance.
3 stands for 63 people have above 200 DM balance.
4 stands for 394 people have no checking account.
correlation between Account.Balance and creditability
library("ltm")
## Warning: package 'ltm' was built under R version 3.1.3
## Loading required package: MASS
## Loading required package: msm
## Warning: package 'msm' was built under R version 3.1.3
## Loading required package: polycor
## Warning: package 'polycor' was built under R version 3.1.3
## Loading required package: mvtnorm
## Warning: package 'mvtnorm' was built under R version 3.1.3
## Loading required package: sfsmisc
## Warning: package 'sfsmisc' was built under R version 3.1.3
biserial.cor(german_credit$Account.Balance,german_credit$Creditability)
## [1] -0.350672
library(vcd)
## Warning: package 'vcd' was built under R version 3.1.3
## Loading required package: grid
contin_table<-table(german_credit$Account.Balance,german_credit$Creditability)
contin_table
##
## 0 1
## 1 135 139
## 2 105 164
## 3 14 49
## 4 46 348
assocstats(contin_table)
## X^2 df P(> X^2)
## Likelihood Ratio 131.34 3 0
## Pearson 123.72 3 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.332
## Cramer's V : 0.352
correlation is -0.35. Account.Balance and creditability are negatively correlated.
crosstable of Account.Balance and creditability
library("gmodels")
## Warning: package 'gmodels' was built under R version 3.1.3
CrossTable(german_credit$Creditability, german_credit$Account.Balance, digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1000
##
##
## | german_credit$Account.Balance
## german_credit$Creditability | 1 | 2 | 3 | 4 | Row Total |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## 0 | 135 | 105 | 14 | 46 | 300 |
## | 0.5 | 0.4 | 0.2 | 0.1 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## 1 | 139 | 164 | 49 | 348 | 700 |
## | 0.5 | 0.6 | 0.8 | 0.9 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 274 | 269 | 63 | 394 | 1000 |
## | 0.3 | 0.3 | 0.1 | 0.4 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 123.7209 d.f. = 3 p = 1.218902e-26
##
##
##
variable3=“Duration.of.Credit..month”
It is a Numerical data.
head(german_credit$Duration.of.Credit..month.)
## [1] 18 9 12 12 12 10
Univariate Analysis of duration in month
Central tendencies of duration in month
mean of duration in month
mean(german_credit$Duration.of.Credit..month.)
## [1] 20.903
median of duration in month
median(german_credit$Duration.of.Credit..month.)
## [1] 18
Dispersion of duration in month Variance of duration in month
var(german_credit$Duration.of.Credit..month.)
## [1] 145.415
Standard deviation of duration in month
sd(german_credit$Duration.of.Credit..month.)
## [1] 12.05881
summary gives four quartiles of duration in month
summary(german_credit$Duration.of.Credit..month.)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 12.0 18.0 20.9 24.0 72.0
boxplot of duration of credit month
quantile(german_credit$Duration.of.Credit..month.)
## 0% 25% 50% 75% 100%
## 4 12 18 24 72
quantile(german_credit$Duration.of.Credit..month.,c(0.75,0.80,0.90,1))
## 75% 80% 90% 100%
## 24 30 36 72
boxplot(german_credit$Duration.of.Credit..month.)
output description
In this boxplot the minimum is 4 , maximum is 72, and median is 18. first quartile is 12,third quartile is 24.
histogram of Duration.of.Credit..month.
hist(german_credit$Duration.of.Credit..month.)
correlation between Duration.of.Credit..month. and response
library("ltm")
biserial.cor(german_credit$Duration.of.Credit..month.,german_credit$Creditability)
## [1] 0.2148192
correlation is 0.21.Duration.of.Credit..month. and Creditability positively correlated. #t-test
t.test(german_credit$Duration.of.Credit..month.)
##
## One Sample t-test
##
## data: german_credit$Duration.of.Credit..month.
## t = 54.8156, df = 999, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 20.15469 21.65131
## sample estimates:
## mean of x
## 20.903
variable4=“Payment.Status.of.Previous.Credit”
It is a Categorical data.It contains 5 categories.
0: no credits taken
1: all credits at this bank paid back duly
2: existing credits paid back duly till now
3: delay in paying off in the past
4: critical account
Payment.Status.of.Previous.Credit contains qualitative data.Central tendencies ,dispersion does not make any sense.frequency table,mode and barplot are calculated for qualitative data.mode gives the maximum value of status of Account.Balance.
frequency table of Payment.Status.of.Previous.Credit
tab<-table(german_credit$Payment.Status.of.Previous.Credit)
tab
##
## 0 1 2 3 4
## 40 49 530 88 293
names(tab)
## [1] "0" "1" "2" "3" "4"
0: no credits taken
1: all credits at this bank paid back duly
2: existing credits paid back duly till now
3: delay in paying off in the past
4: critical account
mode of Payment.Status.of.Previous.Credit It gives the maximum value.
temp <- table(as.vector(german_credit$Payment.Status.of.Previous.Credit))
names(temp)[temp == max(temp)]
## [1] "2"
mode of Status of Payment.Status.of.Previous.Credit.=2
ggplot of Payment.Status.of.Previous.Credit
library("ggplot2")
qplot(data<-german_credit$Payment.Status.of.Previous.Credit,main="Payment.Status.of.Previous.Credit", ylab="customers", colour= I("purple"),size=I(5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
output description:
0 stands for 40 customers have no credits taken/all credits paid back duly.
1 stands for 49 customers are all credits at this bank paid back duly.
2 stands for 530 customers have existing credits paid back duly till now.
3 stands for 88 customers are delay in paying off in the past.
4 stands for 293 customers have critical account/other credits existing (not at this bank)
correlation between Payment.Status.of.Previous.Credit and creditability
library("ltm")
biserial.cor(german_credit$Payment.Status.of.Previous.Credit,german_credit$Creditability)
## [1] -0.2286703
library(vcd)
contin_table<-table(german_credit$Payment.Status.of.Previous.Credit,german_credit$Creditability)
contin_table
##
## 0 1
## 0 25 15
## 1 28 21
## 2 169 361
## 3 28 60
## 4 50 243
assocstats(contin_table)
## X^2 df P(> X^2)
## Likelihood Ratio 60.467 4 2.3139e-12
## Pearson 61.691 4 1.2792e-12
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.241
## Cramer's V : 0.248
correlation is -0.22. Payment.Status.of.Previous.Credit and creditability are negatively correlated.
library("gmodels")
CrossTable(german_credit$Creditability, german_credit$Payment.Status.of.Previous.Credit, digits=1,prop.r=F, prop.t=F, prop.chisq=F, chisq=T)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1000
##
##
## | german_credit$Payment.Status.of.Previous.Credit
## german_credit$Creditability | 0 | 1 | 2 | 3 | 4 | Row Total |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 0 | 25 | 28 | 169 | 28 | 50 | 300 |
## | 0.6 | 0.6 | 0.3 | 0.3 | 0.2 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 1 | 15 | 21 | 361 | 60 | 243 | 700 |
## | 0.4 | 0.4 | 0.7 | 0.7 | 0.8 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 40 | 49 | 530 | 88 | 293 | 1000 |
## | 0.0 | 0.0 | 0.5 | 0.1 | 0.3 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 61.6914 d.f. = 4 p = 1.279187e-12
##
##
##
variable5=“purpose”
purpose is a qualitative data.It contains 11 categories. 0 : car (new)
1 : car (used)
2 : furniture/equipment
3 : radio/television
4 : domestic appliances
5 : repairs
6 : education
7 : (vacation – does not exist?)
8 : retraining
9 : business
10 : others
purpose contains qualitative data.Central tendencies ,dispersion does not make any sense.frequency table,mode and barplot are calculated for qualitative data.mode gives the maximum value of purpose
frequency table of purpose
tab<-table(german_credit$Purpose)
tab
##
## 0 1 2 3 4 5 6 8 9 10
## 234 103 181 280 12 22 50 9 97 12
names(tab)
## [1] "0" "1" "2" "3" "4" "5" "6" "8" "9" "10"
0 : car (new)
1 : car (used)
2 : furniture/equipment
3 : radio/television
4 : domestic appliances
5 : repairs
6 : education
7 : (vacation – does not exist?)
8 : retraining
9 : business
10 : others
mode of purpose.It gives the maximum value.
temp <- table(as.vector(german_credit$Purpose))
names(temp)[temp == max(temp)]
## [1] "3"
mode of Status of Account.Balance is 3.
ggplot of purpose
library("ggplot2", lib.loc="~/R/win-library/3.3")
qplot(data<-german_credit$Purpose,main="purpose", ylab="customers", colour= I("purple"),size=I(5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
correlation between purpose and creditability
library("ltm", lib.loc="~/R/win-library/3.3")
biserial.cor(german_credit$Purpose,german_credit$Creditability)
## [1] 0.01796988
library(vcd)
contin_table<-table(german_credit$Purpose,german_credit$Creditability)
contin_table
##
## 0 1
## 0 89 145
## 1 17 86
## 2 58 123
## 3 62 218
## 4 4 8
## 5 8 14
## 6 22 28
## 8 1 8
## 9 34 63
## 10 5 7
assocstats(contin_table)
## X^2 df P(> X^2)
## Likelihood Ratio 34.510 9 7.2688e-05
## Pearson 33.356 9 1.1575e-04
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.18
## Cramer's V : 0.183
correlation is 0.017. purpose and creditability are positivevely correlated.
crosstable of purpose and creditability
library("gmodels", lib.loc="~/R/win-library/3.3")
CrossTable(german_credit$Creditability, german_credit$Purpose, digits=1,prop.r=F, prop.t=F, prop.chisq=F, chisq=T)
## Warning in chisq.test(t, correct = FALSE, ...): Chi-squared approximation
## may be incorrect
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1000
##
##
## | german_credit$Purpose
## german_credit$Creditability | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 8 | 9 | 10 | Row Total |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 0 | 89 | 17 | 58 | 62 | 4 | 8 | 22 | 1 | 34 | 5 | 300 |
## | 0.4 | 0.2 | 0.3 | 0.2 | 0.3 | 0.4 | 0.4 | 0.1 | 0.4 | 0.4 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 1 | 145 | 86 | 123 | 218 | 8 | 14 | 28 | 8 | 63 | 7 | 700 |
## | 0.6 | 0.8 | 0.7 | 0.8 | 0.7 | 0.6 | 0.6 | 0.9 | 0.6 | 0.6 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 234 | 103 | 181 | 280 | 12 | 22 | 50 | 9 | 97 | 12 | 1000 |
## | 0.2 | 0.1 | 0.2 | 0.3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 33.35645 d.f. = 9 p = 0.0001157491
##
##
##
variable6=“Credit.Amount”
It is a numerical data.
head(german_credit$Credit.Amount)
## [1] 1049 2799 841 2122 2171 2241
Univariate Analysis of Credit.Amount
Central tendencies of Credit.Amount
mean of Credit.Amount
mean(german_credit$Credit.Amount)
## [1] 3271.248
median of Credit.Amount
median(german_credit$Credit.Amount)
## [1] 2319.5
Dispersion of Credit.Amount Variance of Credit.Amount
var(german_credit$Credit.Amount)
## [1] 7967927
Standard deviation of Credit.Amount
sd(german_credit$Credit.Amount)
## [1] 2822.752
summary gives four quartiles of Credit.Amount
summary(german_credit$Credit.Amount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 250 1366 2320 3271 3972 18420
hist(german_credit$Credit.Amount)
boxplot of Credit.Amount
quantile(german_credit$Credit.Amount)
## 0% 25% 50% 75% 100%
## 250.00 1365.50 2319.50 3972.25 18424.00
quantile(german_credit$Credit.Amount,c(0.75,0.80,0.90,1))
## 75% 80% 90% 100%
## 3972.25 4720.00 7179.40 18424.00
boxplot(german_credit$Credit.Amount)
output description Notethat outliers are discussed later.
histogram of Credit.Amount
hist(german_credit$Credit.Amount)
correlation between Credit.Amount and response
library("ltm", lib.loc="~/R/win-library/3.3")
biserial.cor(german_credit$Credit.Amount,german_credit$Creditability)
## [1] 0.1546628
correlation is 0.15. Credit.Amount and Creditability positively correlated. #t-test
t.test(german_credit$Credit.Amount)
##
## One Sample t-test
##
## data: german_credit$Credit.Amount
## t = 36.6472, df = 999, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 3096.083 3446.413
## sample estimates:
## mean of x
## 3271.248
variable7=“Value.Savings.Stocks”
Average balance in savings account
Average balance in savings account is a qualitative data.It contains 5 categories.
1 : < 100 DM
2 : 100<= … < 500 DM
3 : 500<= … < 1000 DM
4 : =>1000 DM
5 : unknown/ no savings account
DM-Deutsche mark.The basic unit of money in Germany.
Average balance in savings account contains qualitative data.Central tendencies ,dispersion does not make any sense.frequency table,mode and barplot are calculated for qualitative data.mode gives the maximum value of Average balance in savings account
frequency table of Average balance in savings account
tab<-table(german_credit$Value.Savings.Stocks)
tab
##
## 1 2 3 4 5
## 603 103 63 48 183
names(tab)
## [1] "1" "2" "3" "4" "5"
1 : < 100 DM
2 : 100<= … < 500 DM
3 : 500<= … < 1000 DM
4 : =>1000 DM
5 : unknown/ no savings account
mode of Value.Savings.Stocks.It gives the maximum value.
temp <- table(as.vector(german_credit$Value.Savings.Stocks))
names(temp)[temp == max(temp)]
## [1] "1"
mode of Value.Savings.Stocks is 1.
ggplot of Value.Savings.Stocks
library("ggplot2", lib.loc="~/R/win-library/3.3")
qplot(data<-german_credit$Value.Savings.Stocks,main="Value.Savings.Stocks", ylab="German_currency-Dm", colour= I("purple"),size=I(5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
output description: 1 stands for 603 people have below 100 DM balance.
2 stands for 103 people have below 500 DM balance.
3 stands for 63 people have below 1000 DM balance.
4 stands for 48 people have above 1000 DM balance.
5 stands for 183 people have no checking account.
correlation between Value.Savings.Stocks and creditability
library("ltm", lib.loc="~/R/win-library/3.3")
biserial.cor(german_credit$Value.Savings.Stocks,german_credit$Creditability)
## [1] -0.1788532
library(vcd)
contin_table<-table(german_credit$Value.Savings.Stocks,german_credit$Creditability)
contin_table
##
## 0 1
## 1 217 386
## 2 34 69
## 3 11 52
## 4 6 42
## 5 32 151
assocstats(contin_table)
## X^2 df P(> X^2)
## Likelihood Ratio 38.975 4 7.0491e-08
## Pearson 36.099 4 2.7612e-07
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.187
## Cramer's V : 0.19
correlation is -0.17. Value.Savings.Stocks and creditability are negatively correlated.
crosstable of Account.Balance and creditability
library("gmodels", lib.loc="~/R/win-library/3.3")
CrossTable(german_credit$Creditability, german_credit$Value.Savings.Stocks, digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1000
##
##
## | german_credit$Value.Savings.Stocks
## german_credit$Creditability | 1 | 2 | 3 | 4 | 5 | Row Total |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 0 | 217 | 34 | 11 | 6 | 32 | 300 |
## | 0.4 | 0.3 | 0.2 | 0.1 | 0.2 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 1 | 386 | 69 | 52 | 42 | 151 | 700 |
## | 0.6 | 0.7 | 0.8 | 0.9 | 0.8 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 603 | 103 | 63 | 48 | 183 | 1000 |
## | 0.6 | 0.1 | 0.1 | 0.0 | 0.2 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 36.09893 d.f. = 4 p = 2.761214e-07
##
##
##
variable8 = “Length.of.current.employment”
It is a qualitative data.It has 5 categories.
1 : unemployed
2: < 1 year
3 : 1 <= … < 4 years
4 : 4 <=… < 7 years
4 : >= 7 years
Length.of.current.employment contains qualitative data.Central tendencies ,dispersion does not make any sense.frequency table,mode and barplot are calculated for qualitative data.mode gives the maximum value of Length.of.current.employment
frequency table of Length.of.current.employment
tab<-table(german_credit$Length.of.current.employment)
tab
##
## 1 2 3 4 5
## 62 172 339 174 253
names(tab)
## [1] "1" "2" "3" "4" "5"
1 : unemployed
2: < 1 year
3 : 1 <= … < 4 years
4 : 4 <=… < 7 years
4 : >= 7 years
mode of Length.of.current.employment.It gives the maximum value.
temp <- table(as.vector(german_credit$Length.of.current.employment))
names(temp)[temp == max(temp)]
## [1] "3"
mode of Status of Length.of.current.employment is 3.
ggplot of Length.of.current.employment
library("ggplot2", lib.loc="~/R/win-library/3.3")
qplot(data<-german_credit$Length.of.current.employment,main="Length.of.current.employment", ylab="employees", colour= I("purple"),size=I(5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
correlation between Length.of.current.employment and creditability
library("ltm", lib.loc="~/R/win-library/3.3")
biserial.cor(german_credit$Length.of.current.employment,german_credit$Creditability)
## [1] -0.115944
library(vcd)
contin_table<-table(german_credit$Length.of.current.employment,german_credit$Creditability)
contin_table
##
## 0 1
## 1 23 39
## 2 70 102
## 3 104 235
## 4 39 135
## 5 64 189
assocstats(contin_table)
## X^2 df P(> X^2)
## Likelihood Ratio 18.164 4 0.0011464
## Pearson 18.368 4 0.0010455
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.134
## Cramer's V : 0.136
correlation is -0.11. Length.of.current.employment and creditability are negatively correlated.
crosstable of Length.of.current.employment and creditability
library("gmodels", lib.loc="~/R/win-library/3.3")
CrossTable(german_credit$Creditability, german_credit$Length.of.current.employment, digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1000
##
##
## | german_credit$Length.of.current.employment
## german_credit$Creditability | 1 | 2 | 3 | 4 | 5 | Row Total |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 0 | 23 | 70 | 104 | 39 | 64 | 300 |
## | 0.4 | 0.4 | 0.3 | 0.2 | 0.3 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 1 | 39 | 102 | 235 | 135 | 189 | 700 |
## | 0.6 | 0.6 | 0.7 | 0.8 | 0.7 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 62 | 172 | 339 | 174 | 253 | 1000 |
## | 0.1 | 0.2 | 0.3 | 0.2 | 0.3 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 18.36827 d.f. = 4 p = 0.001045452
##
##
##
variable9 =“Instalment.per.cent”
Installment rate as % of disposable income.It is a qualitative data.It has a 4 categories.
frequency table of Instalment.per.cent
tab<-table(german_credit$Instalment.per.cent)
tab
##
## 1 2 3 4
## 136 231 157 476
names(tab)
## [1] "1" "2" "3" "4"
mode of Instalment.per.cent.It gives the maximum value.
temp <- table(as.vector(german_credit$Instalment.per.cent))
names(temp)[temp == max(temp)]
## [1] "4"
mode of Status of Instalment.per.cent is 4.
ggplot of Instalment.per.cent
library("ggplot2", lib.loc="~/R/win-library/3.3")
qplot(data<-german_credit$Instalment.per.cent,main="Instalment.per.cent", ylab="customers", colour= I("purple"),size=I(5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
correlation between Instalment.per.cent and creditability
library("ltm", lib.loc="~/R/win-library/3.3")
biserial.cor(german_credit$Instalment.per.cent,german_credit$Creditability)
## [1] 0.07236773
library(vcd)
contin_table<-table(german_credit$Instalment.per.cent,german_credit$Creditability)
contin_table
##
## 0 1
## 1 34 102
## 2 62 169
## 3 45 112
## 4 159 317
assocstats(contin_table)
## X^2 df P(> X^2)
## Likelihood Ratio 5.5065 3 0.13825
## Pearson 5.4768 3 0.14003
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.074
## Cramer's V : 0.074
correlation is 0.072. Instalment.per.cent and creditability are positively correlated.
crosstable of Instalment.per.cent and creditability
library("gmodels", lib.loc="~/R/win-library/3.3")
CrossTable(german_credit$Creditability, german_credit$Instalment.per.cent, digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1000
##
##
## | german_credit$Instalment.per.cent
## german_credit$Creditability | 1 | 2 | 3 | 4 | Row Total |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## 0 | 34 | 62 | 45 | 159 | 300 |
## | 0.2 | 0.3 | 0.3 | 0.3 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## 1 | 102 | 169 | 112 | 317 | 700 |
## | 0.8 | 0.7 | 0.7 | 0.7 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 136 | 231 | 157 | 476 | 1000 |
## | 0.1 | 0.2 | 0.2 | 0.5 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 5.476792 d.f. = 3 p = 0.1400333
##
##
##
variable10 =“Sex…Marital.Status”
variable 10 is Personal status and sex .It is a qualitative data.There are 4 categories.
1 : male : divorced/separated
2 : female : divorced/separated/married
3 : male : single
4 : male : married/widowed
frequency table of Sex…Marital.Status
tab<-table(german_credit$Sex...Marital.Status)
tab
##
## 1 2 3 4
## 50 310 548 92
names(tab)
## [1] "1" "2" "3" "4"
1 : male : divorced/separated
2 : female : divorced/separated/married
3 : male : single
4 : male : married/widowed
mode of Sex…Marital.Status.It gives the maximum value.
temp <- table(as.vector(german_credit$Sex...Marital.Status))
names(temp)[temp == max(temp)]
## [1] "3"
mode of Status of Sex…Marital.Status is 3.
ggplot of Sex…Marital.Status
library("ggplot2", lib.loc="~/R/win-library/3.3")
qplot(data<-german_credit$Sex...Marital.Status,main="Sex...Marital.Status", ylab="customers", colour= I("purple"),size=I(5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
output description:
1-50 mens are divorced/separated.
2-310 womens aredivorced/separated/married.
3-548 males are single.
4-92 males are married/widowed.
correlation between Sex…Marital.Status and creditability
library("ltm", lib.loc="~/R/win-library/3.3")
biserial.cor(german_credit$Sex...Marital.Status,german_credit$Creditability)
## [1] -0.0881402
library(vcd)
contin_table<-table(german_credit$Sex...Marital.Status,german_credit$Creditability)
contin_table
##
## 0 1
## 1 20 30
## 2 109 201
## 3 146 402
## 4 25 67
assocstats(contin_table)
## X^2 df P(> X^2)
## Likelihood Ratio 9.4414 3 0.023963
## Pearson 9.6052 3 0.022238
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.098
## Cramer's V : 0.098
correlation is -0.088. Sex…Marital.Status and creditability are negatively correlated.
crosstable of Sex…Marital.Status and creditability
library("gmodels", lib.loc="~/R/win-library/3.3")
CrossTable(german_credit$Creditability, german_credit$Sex...Marital.Status, digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1000
##
##
## | german_credit$Sex...Marital.Status
## german_credit$Creditability | 1 | 2 | 3 | 4 | Row Total |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## 0 | 20 | 109 | 146 | 25 | 300 |
## | 0.4 | 0.4 | 0.3 | 0.3 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## 1 | 30 | 201 | 402 | 67 | 700 |
## | 0.6 | 0.6 | 0.7 | 0.7 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 50 | 310 | 548 | 92 | 1000 |
## | 0.0 | 0.3 | 0.5 | 0.1 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 9.605214 d.f. = 3 p = 0.02223801
##
##
##
variable11 = “Guarantors”
It is a qualitative data.It contain 3 categories.
1 : none
2 : co-applicant
3 : guarantor
frequency table of Guarantors
tab<-table(german_credit$Guarantors)
tab
##
## 1 2 3
## 907 41 52
names(tab)
## [1] "1" "2" "3"
1 -stands for none, 2 -stands for co-applicant, 3 -stands for guarantor .
mode of Guarantors.It gives the maximum value.
temp <- table(as.vector(german_credit$Guarantors))
names(temp)[temp == max(temp)]
## [1] "1"
mode of Status of Guarantors 1.
ggplot of Guarantors
library("ggplot2", lib.loc="~/R/win-library/3.3")
qplot(data<-german_credit$Guarantors,main="Guarantors", ylab="customers", colour= I("purple"),size=I(5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
output description:
1 stands for 907 customers have no Guarantors.
2 stands for 41 customers have co-applicants.
3 stands for 52 customers have Guarantors.
correlation between Guarantors and creditability
library("ltm", lib.loc="~/R/win-library/3.3")
biserial.cor(german_credit$Guarantors,german_credit$Creditability)
## [1] -0.0251242
library(vcd)
contin_table<-table(german_credit$Guarantors,german_credit$Creditability)
contin_table
##
## 0 1
## 1 272 635
## 2 18 23
## 3 10 42
assocstats(contin_table)
## X^2 df P(> X^2)
## Likelihood Ratio 6.6501 2 0.035971
## Pearson 6.6454 2 0.036056
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.081
## Cramer's V : 0.082
correlation is -0.025. Guarantors and creditability are negatively correlated.
crosstable of Guarantors and creditability
library("gmodels", lib.loc="~/R/win-library/3.3")
CrossTable(german_credit$Creditability, german_credit$Guarantors, digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1000
##
##
## | german_credit$Guarantors
## german_credit$Creditability | 1 | 2 | 3 | Row Total |
## ----------------------------|-----------|-----------|-----------|-----------|
## 0 | 272 | 18 | 10 | 300 |
## | 0.3 | 0.4 | 0.2 | |
## ----------------------------|-----------|-----------|-----------|-----------|
## 1 | 635 | 23 | 42 | 700 |
## | 0.7 | 0.6 | 0.8 | |
## ----------------------------|-----------|-----------|-----------|-----------|
## Column Total | 907 | 41 | 52 | 1000 |
## | 0.9 | 0.0 | 0.1 | |
## ----------------------------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 6.645367 d.f. = 2 p = 0.03605595
##
##
##
variables12 = “Duration.in.Current.address”
It is a Categorical data.It has a 4 categories.
1: <= 1 year
2: <.<=2 years
3: <.<=3 years
4: >4years
frequency table of Duration.in.Current.address
tab<-table(german_credit$Duration.in.Current.address)
tab
##
## 1 2 3 4
## 130 308 149 413
names(tab)
## [1] "1" "2" "3" "4"
1: <= 1 year
2: <.<=2 years
3: <.<=3 years
4: >4years
mode of Duration.in.Current.address.It gives the maximum value.
temp <- table(as.vector(german_credit$Duration.in.Current.address))
names(temp)[temp == max(temp)]
## [1] "4"
mode of Duration.in.Current.address is 4.
ggplot of Duration.in.Current.address
library("ggplot2", lib.loc="~/R/win-library/3.3")
qplot(data<-german_credit$Duration.in.Current.address,main="Duration.in.Current.address", ylab="customers", colour= I("purple"),size=I(5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
correlation between Duration.in.Current.address and creditability
library("ltm", lib.loc="~/R/win-library/3.3")
biserial.cor(german_credit$Duration.in.Current.address,german_credit$Creditability)
## [1] 0.002965675
library(vcd)
contin_table<-table(german_credit$Duration.in.Current.address,german_credit$Creditability)
contin_table
##
## 0 1
## 1 36 94
## 2 97 211
## 3 43 106
## 4 124 289
assocstats(contin_table)
## X^2 df P(> X^2)
## Likelihood Ratio 0.75207 3 0.86089
## Pearson 0.74930 3 0.86155
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.027
## Cramer's V : 0.027
correlation is 0.002. Duration.in.Current.address and creditability are positively correlated.
crosstable of Duration.in.Current.address and creditability
library("gmodels", lib.loc="~/R/win-library/3.3")
CrossTable(german_credit$Creditability, german_credit$Duration.in.Current.address, digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1000
##
##
## | german_credit$Duration.in.Current.address
## german_credit$Creditability | 1 | 2 | 3 | 4 | Row Total |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## 0 | 36 | 97 | 43 | 124 | 300 |
## | 0.3 | 0.3 | 0.3 | 0.3 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## 1 | 94 | 211 | 106 | 289 | 700 |
## | 0.7 | 0.7 | 0.7 | 0.7 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 130 | 308 | 149 | 413 | 1000 |
## | 0.1 | 0.3 | 0.1 | 0.4 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 0.7492964 d.f. = 3 p = 0.8615521
##
##
##
variable13=“Most.valuable.available.asset”
It is a qualitative data.It contains 4 categories.
1 : real estate
2 : if not A121 : building society savings agreement/life insurance
3 : if not A121/A122 : car or other, not in variable 7
4 : unknown / no property
frequency table Most.valuable.available.asset
tab<-table(german_credit$Most.valuable.available.asset)
tab
##
## 1 2 3 4
## 282 232 332 154
names(tab)
## [1] "1" "2" "3" "4"
1 : real estate
2 : if not A121 : building society savings agreement/life insurance
3 : if not A121/A122 : car or other, not in variable 7
4 : unknown / no property
mode of Most.valuable.available.asset.It gives the maximum value.
temp <- table(as.vector(german_credit$Most.valuable.available.asset))
names(temp)[temp == max(temp)]
## [1] "3"
mode of Most.valuable.available.asset is 3.
ggplot of Most.valuable.available.asset
library("ggplot2", lib.loc="~/R/win-library/3.3")
qplot(data<-german_credit$Most.valuable.available.asset,main="Most.valuable.available.asset", ylab="customers", colour= I("purple"),size=I(5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
correlation between Most.valuable.available.asset and creditability
library("ltm", lib.loc="~/R/win-library/3.3")
biserial.cor(german_credit$Most.valuable.available.asset,german_credit$Creditability)
## [1] 0.1425406
library(vcd)
contin_table<-table(german_credit$Most.valuable.available.asset,german_credit$Creditability)
contin_table
##
## 0 1
## 1 60 222
## 2 71 161
## 3 102 230
## 4 67 87
assocstats(contin_table)
## X^2 df P(> X^2)
## Likelihood Ratio 23.546 3 3.1063e-05
## Pearson 23.720 3 2.8584e-05
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.152
## Cramer's V : 0.154
correlation is 0.14. Most.valuable.available.asset and creditability are positively correlated.
crosstable of Most.valuable.available.asset and creditability
library("gmodels", lib.loc="~/R/win-library/3.3")
CrossTable(german_credit$Creditability, german_credit$Most.valuable.available.asset, digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1000
##
##
## | german_credit$Most.valuable.available.asset
## german_credit$Creditability | 1 | 2 | 3 | 4 | Row Total |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## 0 | 60 | 71 | 102 | 67 | 300 |
## | 0.2 | 0.3 | 0.3 | 0.4 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## 1 | 222 | 161 | 230 | 87 | 700 |
## | 0.8 | 0.7 | 0.7 | 0.6 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 282 | 232 | 332 | 154 | 1000 |
## | 0.3 | 0.2 | 0.3 | 0.2 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 23.71955 d.f. = 3 p = 2.858442e-05
##
##
##
variable14 = “Age..years.”
It is a Numerical data.
head(german_credit$Age..years.)
## [1] 21 36 23 39 38 48
Univariate Analysis of Age..years.
Central tendencies of Age..years.
mean of Age..years.
mean(german_credit$Age..years.)
## [1] 35.542
median of Age..years.
median(german_credit$Age..years.)
## [1] 33
Dispersion of Age..years. Variance of Age..years.
var(german_credit$Age..years.)
## [1] 128.8831
Standard deviation of Age..years.
sd(german_credit$Age..years.)
## [1] 11.35267
summary gives four quartiles of Age..years.
summary(german_credit$Age..years.)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.00 27.00 33.00 35.54 42.00 75.00
boxplot of Age..years.
quantile(german_credit$Age..years.)
## 0% 25% 50% 75% 100%
## 19 27 33 42 75
quantile(german_credit$Age..years.,c(0.75,0.80,0.90,1))
## 75% 80% 90% 100%
## 42 44 52 75
boxplot(german_credit$Age..years.)
output description
In this boxplot the minimum is 19 , maximum is 75, and median is 33. first quartile is 27,third quartile is 42. Notethat outliers are discussed later.
histogram of Age..years.
hist(german_credit$Age..years.)
correlation between Age..years. and Creditability
library("ltm", lib.loc="~/R/win-library/3.3")
biserial.cor(german_credit$Age..years.,german_credit$Creditability)
## [1] -0.0912263
correlation is -0.091.Age..years.and Creditability negatively correlated. #t-test
t.test(german_credit$Age..years.)
##
## One Sample t-test
##
## data: german_credit$Age..years.
## t = 99.002, df = 999, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 34.83751 36.24649
## sample estimates:
## mean of x
## 35.542
variable15=“Concurrent.Credits”
It is a qualitative data.It contains 3 categories.
1 : bank
2 : stores
3 : none
frequency table of Concurrent.Credits
tab<-table(german_credit$Concurrent.Credits)
tab
##
## 1 2 3
## 139 47 814
names(tab)
## [1] "1" "2" "3"
1 : bank
2 : stores
3 : none
mode of Concurrent.Credits.It gives the maximum value.
temp <- table(as.vector(german_credit$Concurrent.Credits))
names(temp)[temp == max(temp)]
## [1] "3"
mode of Concurrent.Credits is 3.
ggplot of Concurrent.Credits
library("ggplot2", lib.loc="~/R/win-library/3.3")
qplot(data<-german_credit$Concurrent.Credits,main="Concurrent.Credits", ylab="customers", colour= I("purple"),size=I(5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
output description:
1 stands for 139 customers are in bank.
2 stands for 47 customers are in store.
3 stands for 814 customers have no concurrent credits.
correlation between Concurrent.Credits and creditability
library("ltm", lib.loc="~/R/win-library/3.3")
biserial.cor(german_credit$Concurrent.Credits,german_credit$Creditability)
## [1] -0.1097892
library(vcd)
contin_table<-table(german_credit$Concurrent.Credits,german_credit$Creditability)
contin_table
##
## 0 1
## 1 57 82
## 2 19 28
## 3 224 590
assocstats(contin_table)
## X^2 df P(> X^2)
## Likelihood Ratio 12.303 2 0.0021298
## Pearson 12.839 2 0.0016293
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.113
## Cramer's V : 0.113
correlation is -0.109. Concurrent.Credits and creditability are negatively correlated.
crosstable of Concurrent.Credits and creditability
library("gmodels", lib.loc="~/R/win-library/3.3")
CrossTable(german_credit$Creditability, german_credit$Concurrent.Credits, digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1000
##
##
## | german_credit$Concurrent.Credits
## german_credit$Creditability | 1 | 2 | 3 | Row Total |
## ----------------------------|-----------|-----------|-----------|-----------|
## 0 | 57 | 19 | 224 | 300 |
## | 0.4 | 0.4 | 0.3 | |
## ----------------------------|-----------|-----------|-----------|-----------|
## 1 | 82 | 28 | 590 | 700 |
## | 0.6 | 0.6 | 0.7 | |
## ----------------------------|-----------|-----------|-----------|-----------|
## Column Total | 139 | 47 | 814 | 1000 |
## | 0.1 | 0.0 | 0.8 | |
## ----------------------------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 12.83919 d.f. = 2 p = 0.001629318
##
##
##
variable16 = “Type.of.apartment”
It is a qualitative data.It has a 3 categories.
1 : rent
2 : own
3 : for free
frequency table of Type.of.apartment
tab<-table(german_credit$Type.of.apartment)
tab
##
## 1 2 3
## 179 714 107
names(tab)
## [1] "1" "2" "3"
1 : rent
2 : own
3 : for free
mode of Type.of.apartment.It gives the maximum value.
temp <- table(as.vector(german_credit$Type.of.apartment))
names(temp)[temp == max(temp)]
## [1] "2"
mode of Status of Type.of.apartment is 2.
ggplot of Type.of.apartment
library("ggplot2", lib.loc="~/R/win-library/3.3")
qplot(data<-german_credit$Type.of.apartment,main="Type.of.apartment", ylab="customers", colour= I("purple"),size=I(5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
output description:
1 stands for 179 customers are staying in rent houses.
2 stands for 714 customers are staying in own houses.
3 stands for 107 customers are staying in free quaters.
correlation between Type.of.apartment and creditability
library("ltm", lib.loc="~/R/win-library/3.3")
biserial.cor(german_credit$Type.of.apartment,german_credit$Creditability)
## [1] -0.01810985
library(vcd)
contin_table<-table(german_credit$Type.of.apartment,german_credit$Creditability)
contin_table
##
## 0 1
## 1 70 109
## 2 186 528
## 3 44 63
assocstats(contin_table)
## X^2 df P(> X^2)
## Likelihood Ratio 18.129 2 1.1573e-04
## Pearson 18.674 2 8.8103e-05
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.135
## Cramer's V : 0.137
correlation is -0.018. Type.of.apartment and creditability are negatively correlated.
crosstable of Type.of.apartment and creditability
library("gmodels", lib.loc="~/R/win-library/3.3")
CrossTable(german_credit$Creditability, german_credit$Type.of.apartment, digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1000
##
##
## | german_credit$Type.of.apartment
## german_credit$Creditability | 1 | 2 | 3 | Row Total |
## ----------------------------|-----------|-----------|-----------|-----------|
## 0 | 70 | 186 | 44 | 300 |
## | 0.4 | 0.3 | 0.4 | |
## ----------------------------|-----------|-----------|-----------|-----------|
## 1 | 109 | 528 | 63 | 700 |
## | 0.6 | 0.7 | 0.6 | |
## ----------------------------|-----------|-----------|-----------|-----------|
## Column Total | 179 | 714 | 107 | 1000 |
## | 0.2 | 0.7 | 0.1 | |
## ----------------------------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 18.67401 d.f. = 2 p = 8.810311e-05
##
##
##
variable17= “No.of.Credits.at.this.Bank”
It is a qualitative data. It has a 4 categories.
frequency table No.of.Credits.at.this.Bank
tab<-table(german_credit$No.of.Credits.at.this.Bank)
tab
##
## 1 2 3 4
## 633 333 28 6
names(tab)
## [1] "1" "2" "3" "4"
mode of No.of.Credits.at.this.Bank.It gives the maximum value.
temp <- table(as.vector(german_credit$No.of.Credits.at.this.Bank))
names(temp)[temp == max(temp)]
## [1] "1"
mode of No.of.Credits.at.this.Bank is 1.
ggplot of No.of.Credits.at.this.Bank
library("ggplot2", lib.loc="~/R/win-library/3.3")
qplot(data<-german_credit$No.of.Credits.at.this.Bank,main="No.of.Credits.at.this.Bank", ylab="customers", colour= I("purple"),size=I(5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
correlation between No.of.Credits.at.this.Bank and creditability
library("ltm", lib.loc="~/R/win-library/3.3")
biserial.cor(german_credit$No.of.Credits.at.this.Bank,german_credit$Creditability)
## [1] -0.04570962
library(vcd)
contin_table<-table(german_credit$No.of.Credits.at.this.Bank,german_credit$Creditability)
contin_table
##
## 0 1
## 1 200 433
## 2 92 241
## 3 6 22
## 4 2 4
assocstats(contin_table)
## X^2 df P(> X^2)
## Likelihood Ratio 2.7425 3 0.43304
## Pearson 2.6712 3 0.44514
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.052
## Cramer's V : 0.052
correlation is -0.045. No.of.Credits.at.this.Bank and creditability are negatively correlated.
crosstable of No.of.Credits.at.this.Bank and creditability
library("gmodels", lib.loc="~/R/win-library/3.3")
CrossTable(german_credit$Creditability, german_credit$No.of.Credits.at.this.Bank, digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)
## Warning in chisq.test(t, correct = FALSE, ...): Chi-squared approximation
## may be incorrect
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1000
##
##
## | german_credit$No.of.Credits.at.this.Bank
## german_credit$Creditability | 1 | 2 | 3 | 4 | Row Total |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## 0 | 200 | 92 | 6 | 2 | 300 |
## | 0.3 | 0.3 | 0.2 | 0.3 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## 1 | 433 | 241 | 22 | 4 | 700 |
## | 0.7 | 0.7 | 0.8 | 0.7 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 633 | 333 | 28 | 6 | 1000 |
## | 0.6 | 0.3 | 0.0 | 0.0 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 2.671198 d.f. = 3 p = 0.4451441
##
##
##
variable18 =“Occupation”
It is a qualitative data.It contains 4 categories.
1 : unemployed/ unskilled – non-resident
2 : unskilled – resident
3 : skilled employee / official
4 : management/ self-employed/highly qualified employee/ officer
frequency table of Occupation
tab<-table(german_credit$Occupation)
tab
##
## 1 2 3 4
## 22 200 630 148
names(tab)
## [1] "1" "2" "3" "4"
1 : unemployed/ unskilled – non-resident
2 : unskilled – resident
3 : skilled employee / official
4 : management/ self-employed/highly qualified employee/ officer
mode of Occupation.It gives the maximum value.
temp <- table(as.vector(german_credit$Occupation))
names(temp)[temp == max(temp)]
## [1] "3"
mode of Occupation is 3.
ggplot of Occupation
library("ggplot2", lib.loc="~/R/win-library/3.3")
qplot(data<-german_credit$Occupation,main="Occupation", ylab="customers", colour= I("purple"),size=I(5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
output description:
1 : 22 customers are unemployed/ unskilled – non-resident
2 : 200 customers are unskilled – resident
3 : 630 customers are skilled employee / official
4 : 148 customers are management/ self-employed/highly qualified employee/ officer
correlation between Occupation and creditability
library("ltm", lib.loc="~/R/win-library/3.3")
biserial.cor(german_credit$Occupation,german_credit$Creditability)
## [1] 0.03271863
library(vcd)
contin_table<-table(german_credit$Occupation,german_credit$Creditability)
contin_table
##
## 0 1
## 1 7 15
## 2 56 144
## 3 186 444
## 4 51 97
assocstats(contin_table)
## X^2 df P(> X^2)
## Likelihood Ratio 1.8540 3 0.60326
## Pearson 1.8852 3 0.59658
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.043
## Cramer's V : 0.043
correlation is 0.032. Occupation and creditability are positively correlated.
crosstable of Occupation and creditability
library("gmodels", lib.loc="~/R/win-library/3.3")
CrossTable(german_credit$Creditability, german_credit$Occupation, digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1000
##
##
## | german_credit$Occupation
## german_credit$Creditability | 1 | 2 | 3 | 4 | Row Total |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## 0 | 7 | 56 | 186 | 51 | 300 |
## | 0.3 | 0.3 | 0.3 | 0.3 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## 1 | 15 | 144 | 444 | 97 | 700 |
## | 0.7 | 0.7 | 0.7 | 0.7 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 22 | 200 | 630 | 148 | 1000 |
## | 0.0 | 0.2 | 0.6 | 0.1 | |
## ----------------------------|-----------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 1.885156 d.f. = 3 p = 0.5965816
##
##
##
variable19 =“No.of.dependents”
It is a qualitative data.
frequency table of No.of.dependents
tab<-table(german_credit$No.of.dependents)
tab
##
## 1 2
## 845 155
names(tab)
## [1] "1" "2"
mode of No.of.dependents.It gives the maximum value.
temp <- table(as.vector(german_credit$No.of.dependents))
names(temp)[temp == max(temp)]
## [1] "1"
mode of Status of No.of.dependents 1.
ggplot of No.of.dependents
library("ggplot2", lib.loc="~/R/win-library/3.3")
qplot(data<-german_credit$No.of.dependents,main="No.of.dependents", ylab="customers", colour= I("purple"),size=I(5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
correlation between No.of.dependents and creditability
library("ltm", lib.loc="~/R/win-library/3.3")
biserial.cor(german_credit$No.of.dependents,german_credit$Creditability)
## [1] -0.003013345
library(vcd)
contin_table<-table(german_credit$No.of.dependents,german_credit$Creditability)
contin_table
##
## 0 1
## 1 254 591
## 2 46 109
assocstats(contin_table)
## X^2 df P(> X^2)
## Likelihood Ratio 0.0091047 1 0.92398
## Pearson 0.0090893 1 0.92405
##
## Phi-Coefficient : 0.003
## Contingency Coeff.: 0.003
## Cramer's V : 0.003
correlation is -0.003. No.of.dependents and creditability are negatively correlated.
crosstable of No.of.dependents and creditability
library("gmodels", lib.loc="~/R/win-library/3.3")
CrossTable(german_credit$Creditability, german_credit$No.of.dependents, digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1000
##
##
## | german_credit$No.of.dependents
## german_credit$Creditability | 1 | 2 | Row Total |
## ----------------------------|-----------|-----------|-----------|
## 0 | 254 | 46 | 300 |
## | 0.3 | 0.3 | |
## ----------------------------|-----------|-----------|-----------|
## 1 | 591 | 109 | 700 |
## | 0.7 | 0.7 | |
## ----------------------------|-----------|-----------|-----------|
## Column Total | 845 | 155 | 1000 |
## | 0.8 | 0.2 | |
## ----------------------------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 0.009089339 d.f. = 1 p = 0.9240463
##
## Pearson's Chi-squared test with Yates' continuity correction
## ------------------------------------------------------------
## Chi^2 = 0 d.f. = 1 p = 1
##
##
variable20 =“Telephone”
It is a qualitative data. it contains 2 categories.
1 : none
2 : yes, registered under the customers name
frequency table of Telephone
tab<-table(german_credit$Telephone)
tab
##
## 1 2
## 596 404
names(tab)
## [1] "1" "2"
1 -stands for none, 2 -stands for yes, registered under the customers name.
mode of Telephone.It gives the maximum value.
temp <- table(as.vector(german_credit$Telephone))
names(temp)[temp == max(temp)]
## [1] "1"
mode of Status of Telephone is 1.
ggplot of Telephone
library("ggplot2", lib.loc="~/R/win-library/3.3")
qplot(data<-german_credit$Telephone,main="Telephone", ylab="customers", colour= I("purple"),size=I(5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
correlation between Telephone and creditability
library("ltm", lib.loc="~/R/win-library/3.3")
biserial.cor(german_credit$Telephone,german_credit$Creditability)
## [1] -0.03644795
library(vcd)
contin_table<-table(german_credit$Telephone,german_credit$Creditability)
contin_table
##
## 0 1
## 1 187 409
## 2 113 291
assocstats(contin_table)
## X^2 df P(> X^2)
## Likelihood Ratio 1.3359 1 0.24776
## Pearson 1.3298 1 0.24884
##
## Phi-Coefficient : 0.036
## Contingency Coeff.: 0.036
## Cramer's V : 0.036
correlation is -0.35. Telephone and creditability are negatively correlated.
crosstable of Telephone and creditability
library("gmodels", lib.loc="~/R/win-library/3.3")
CrossTable(german_credit$Creditability, german_credit$Telephone, digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1000
##
##
## | german_credit$Telephone
## german_credit$Creditability | 1 | 2 | Row Total |
## ----------------------------|-----------|-----------|-----------|
## 0 | 187 | 113 | 300 |
## | 0.3 | 0.3 | |
## ----------------------------|-----------|-----------|-----------|
## 1 | 409 | 291 | 700 |
## | 0.7 | 0.7 | |
## ----------------------------|-----------|-----------|-----------|
## Column Total | 596 | 404 | 1000 |
## | 0.6 | 0.4 | |
## ----------------------------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 1.329783 d.f. = 1 p = 0.2488438
##
## Pearson's Chi-squared test with Yates' continuity correction
## ------------------------------------------------------------
## Chi^2 = 1.172559 d.f. = 1 p = 0.2788762
##
##
variable21= “Foreign.Worker”
It is a qualitative data.It contains a two categories.
1 : yes
2 : no
frequency table of Foreign.Worker
tab<-table(german_credit$Foreign.Worker)
tab
##
## 1 2
## 963 37
names(tab)
## [1] "1" "2"
1 -stands for yes, 2 -stands for no
mode of Foreign.Worker.It gives the maximum value.
temp <- table(as.vector(german_credit$Foreign.Worker))
names(temp)[temp == max(temp)]
## [1] "1"
mode of Foreign.Workere is 1.
ggplot of Foreign.Worker
library("ggplot2", lib.loc="~/R/win-library/3.3")
qplot(data<-german_credit$Foreign.Worker,main="Foreign.Worker", ylab="customers", colour= I("purple"),size=I(5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
correlation between Foreign.Worker and creditability
library("ltm", lib.loc="~/R/win-library/3.3")
a<-biserial.cor(german_credit$Foreign.Worker,german_credit$Creditability)
library(vcd)
contin_table<-table(german_credit$Foreign.Worker,german_credit$Creditability)
contin_table
##
## 0 1
## 1 296 667
## 2 4 33
assocstats(contin_table)
## X^2 df P(> X^2)
## Likelihood Ratio 8.0724 1 0.0044945
## Pearson 6.7370 1 0.0094431
##
## Phi-Coefficient : 0.082
## Contingency Coeff.: 0.082
## Cramer's V : 0.082
correlation is -0.08. Foreign.Worker and creditability are negatively correlated.
crosstable of Foreign.Worker and creditability
library("gmodels", lib.loc="~/R/win-library/3.3")
CrossTable(german_credit$Creditability, german_credit$Foreign.Worker, digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1000
##
##
## | german_credit$Foreign.Worker
## german_credit$Creditability | 1 | 2 | Row Total |
## ----------------------------|-----------|-----------|-----------|
## 0 | 296 | 4 | 300 |
## | 0.3 | 0.1 | |
## ----------------------------|-----------|-----------|-----------|
## 1 | 667 | 33 | 700 |
## | 0.7 | 0.9 | |
## ----------------------------|-----------|-----------|-----------|
## Column Total | 963 | 37 | 1000 |
## | 1.0 | 0.0 | |
## ----------------------------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 6.737044 d.f. = 1 p = 0.009443096
##
## Pearson's Chi-squared test with Yates' continuity correction
## ------------------------------------------------------------
## Chi^2 = 5.821576 d.f. = 1 p = 0.01583075
##
##
Model Building
predictors – Qualitative chi-square p -value
variable2 = “Account.Balance” 1.218902e-26 <0.001
variable4 = “Payment.Status.of.Previous.Credit” 1.279187e-12 <0.001
variable5 = “Purpose” 0.0001157491 <0.001
variable7 = “Value.Savings.Stocks” 2.761214e-07 <0.001
variable8 = “Length.of.current.employment” 0.001045452 <0.001
variable9 = “Instalment.per.cent” 0.1400333 = 0.14
variable10 = “Sex…Marital.Status” 0.02223801 = 0.02
variable11 = “Guarantors” 0.03605595 = 0.03
variable12 = “Duration.in.Current.address” 0.8615521 = 0.86
variable13 = “Most.valuable.available.asset” 2.858442e-05 <0.001
variable15 = “Concurrent.Credits” 0.001629318 = 0.001
variable16 = “Type.of.apartment” 8.810311e-05 < 0.001
variable17 = “No.of.Credits.at.this.Bank” 0.4451441
variable18 = “Occupation” 0.5965816
variable19 = “No.of.dependents” 1
variable20 = “Telephone” 0.2788762 = 0.27
variable21 = “Foreign.Worker” 0.01583075 = 0.01
predicators – numerical data
variable3 = “Duration.of.Credit..month.” p-value < 0.001
variable6 = “Credit.Amount” p-value < 0.001
variable14 = “Age..years.” p-value < 0.001
Building logistic Regression
Logistic Regression is a classification algorithm. It is used to predict a binary outcome (1 / 0, Yes / No, True / False) given a set of independent variables. To represent binary / categorical outcome. logistic regression as a special case of linear regression when the outcome variable is categorical, where we are using log of odds as dependent variable. In simple words, it predicts the probability of occurrence of an event by fitting data to a logit function.
names(german_credit)
## [1] "Creditability"
## [2] "Account.Balance"
## [3] "Duration.of.Credit..month."
## [4] "Payment.Status.of.Previous.Credit"
## [5] "Purpose"
## [6] "Credit.Amount"
## [7] "Value.Savings.Stocks"
## [8] "Length.of.current.employment"
## [9] "Instalment.per.cent"
## [10] "Sex...Marital.Status"
## [11] "Guarantors"
## [12] "Duration.in.Current.address"
## [13] "Most.valuable.available.asset"
## [14] "Age..years."
## [15] "Concurrent.Credits"
## [16] "Type.of.apartment"
## [17] "No.of.Credits.at.this.Bank"
## [18] "Occupation"
## [19] "No.of.dependents"
## [20] "Telephone"
## [21] "Foreign.Worker"
LogisticModel<- glm(german_credit$Creditability~.,family=binomial,data=german_credit )
summary(LogisticModel)
##
## Call:
## glm(formula = german_credit$Creditability ~ ., family = binomial,
## data = german_credit)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5854 -0.7927 0.4512 0.7445 1.9483
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.994e+00 1.024e+00 -3.901 9.58e-05
## Account.Balance 5.799e-01 7.004e-02 8.280 < 2e-16
## Duration.of.Credit..month. -2.457e-02 8.725e-03 -2.816 0.004862
## Payment.Status.of.Previous.Credit 3.822e-01 8.740e-02 4.373 1.23e-05
## Purpose 3.153e-02 3.009e-02 1.048 0.294697
## Credit.Amount -9.340e-05 4.012e-05 -2.328 0.019908
## Value.Savings.Stocks 2.391e-01 5.827e-02 4.104 4.07e-05
## Length.of.current.employment 1.517e-01 7.118e-02 2.132 0.033027
## Instalment.per.cent -2.983e-01 8.276e-02 -3.605 0.000312
## Sex...Marital.Status 2.574e-01 1.157e-01 2.224 0.026131
## Guarantors 3.473e-01 1.777e-01 1.954 0.050681
## Duration.in.Current.address -1.411e-02 7.742e-02 -0.182 0.855335
## Most.valuable.available.asset -1.828e-01 9.101e-02 -2.009 0.044521
## Age..years. 8.917e-03 8.206e-03 1.087 0.277218
## Concurrent.Credits 2.419e-01 1.111e-01 2.178 0.029420
## Type.of.apartment 2.931e-01 1.677e-01 1.748 0.080527
## No.of.Credits.at.this.Bank -2.436e-01 1.610e-01 -1.513 0.130257
## Occupation 1.889e-02 1.367e-01 0.138 0.890081
## No.of.dependents -1.708e-01 2.319e-01 -0.736 0.461567
## Telephone 2.947e-01 1.880e-01 1.567 0.117024
## Foreign.Worker 1.158e+00 6.078e-01 1.906 0.056680
##
## (Intercept) ***
## Account.Balance ***
## Duration.of.Credit..month. **
## Payment.Status.of.Previous.Credit ***
## Purpose
## Credit.Amount *
## Value.Savings.Stocks ***
## Length.of.current.employment *
## Instalment.per.cent ***
## Sex...Marital.Status *
## Guarantors .
## Duration.in.Current.address
## Most.valuable.available.asset *
## Age..years.
## Concurrent.Credits *
## Type.of.apartment .
## No.of.Credits.at.this.Bank
## Occupation
## No.of.dependents
## Telephone
## Foreign.Worker .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1221.73 on 999 degrees of freedom
## Residual deviance: 956.56 on 979 degrees of freedom
## AIC: 998.56
##
## Number of Fisher Scoring iterations: 5
classification table
library("caret")
## Warning: package 'caret' was built under R version 3.1.3
## Loading required package: lattice
library("rattle")
## Warning: package 'rattle' was built under R version 3.1.3
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
threshold=0.5
predicted_values<-ifelse(predict(LogisticModel,type="response")>threshold,1,0)
actual_values<-LogisticModel$y
conf_matrix<-table(predicted_values,actual_values)
conf_matrix
## actual_values
## predicted_values 0 1
## 0 144 77
## 1 156 623
sensitivity(conf_matrix)
## [1] 0.48
specificity(conf_matrix)
## [1] 0.89
Logistic regression Accuracy
accuracy1<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy1
## [1] 0.767
Multicollinearity
library("car")
## Warning: package 'car' was built under R version 3.1.3
summary(LogisticModel)
##
## Call:
## glm(formula = german_credit$Creditability ~ ., family = binomial,
## data = german_credit)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5854 -0.7927 0.4512 0.7445 1.9483
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.994e+00 1.024e+00 -3.901 9.58e-05
## Account.Balance 5.799e-01 7.004e-02 8.280 < 2e-16
## Duration.of.Credit..month. -2.457e-02 8.725e-03 -2.816 0.004862
## Payment.Status.of.Previous.Credit 3.822e-01 8.740e-02 4.373 1.23e-05
## Purpose 3.153e-02 3.009e-02 1.048 0.294697
## Credit.Amount -9.340e-05 4.012e-05 -2.328 0.019908
## Value.Savings.Stocks 2.391e-01 5.827e-02 4.104 4.07e-05
## Length.of.current.employment 1.517e-01 7.118e-02 2.132 0.033027
## Instalment.per.cent -2.983e-01 8.276e-02 -3.605 0.000312
## Sex...Marital.Status 2.574e-01 1.157e-01 2.224 0.026131
## Guarantors 3.473e-01 1.777e-01 1.954 0.050681
## Duration.in.Current.address -1.411e-02 7.742e-02 -0.182 0.855335
## Most.valuable.available.asset -1.828e-01 9.101e-02 -2.009 0.044521
## Age..years. 8.917e-03 8.206e-03 1.087 0.277218
## Concurrent.Credits 2.419e-01 1.111e-01 2.178 0.029420
## Type.of.apartment 2.931e-01 1.677e-01 1.748 0.080527
## No.of.Credits.at.this.Bank -2.436e-01 1.610e-01 -1.513 0.130257
## Occupation 1.889e-02 1.367e-01 0.138 0.890081
## No.of.dependents -1.708e-01 2.319e-01 -0.736 0.461567
## Telephone 2.947e-01 1.880e-01 1.567 0.117024
## Foreign.Worker 1.158e+00 6.078e-01 1.906 0.056680
##
## (Intercept) ***
## Account.Balance ***
## Duration.of.Credit..month. **
## Payment.Status.of.Previous.Credit ***
## Purpose
## Credit.Amount *
## Value.Savings.Stocks ***
## Length.of.current.employment *
## Instalment.per.cent ***
## Sex...Marital.Status *
## Guarantors .
## Duration.in.Current.address
## Most.valuable.available.asset *
## Age..years.
## Concurrent.Credits *
## Type.of.apartment .
## No.of.Credits.at.this.Bank
## Occupation
## No.of.dependents
## Telephone
## Foreign.Worker .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1221.73 on 999 degrees of freedom
## Residual deviance: 956.56 on 979 degrees of freedom
## AIC: 998.56
##
## Number of Fisher Scoring iterations: 5
vif(LogisticModel, scale = FALSE)
## Account.Balance Duration.of.Credit..month.
## 1.064438 1.821553
## Payment.Status.of.Previous.Credit Purpose
## 1.302567 1.087702
## Credit.Amount Value.Savings.Stocks
## 2.227685 1.079491
## Length.of.current.employment Instalment.per.cent
## 1.158789 1.295559
## Sex...Marital.Status Guarantors
## 1.070219 1.063473
## Duration.in.Current.address Most.valuable.available.asset
## 1.169790 1.383774
## Age..years. Concurrent.Credits
## 1.306198 1.092446
## Type.of.apartment No.of.Credits.at.this.Bank
## 1.347264 1.312079
## Occupation No.of.dependents
## 1.331917 1.083981
## Telephone Foreign.Worker
## 1.319990 1.041820
Individual Impact of Variables
library("caret")
varImp(LogisticModel, scale = FALSE)
## Overall
## Account.Balance 8.2802843
## Duration.of.Credit..month. 2.8160232
## Payment.Status.of.Previous.Credit 4.3726880
## Purpose 1.0478728
## Credit.Amount 2.3280721
## Value.Savings.Stocks 4.1036081
## Length.of.current.employment 2.1317598
## Instalment.per.cent 3.6048549
## Sex...Marital.Status 2.2242622
## Guarantors 1.9541747
## Duration.in.Current.address 0.1823156
## Most.valuable.available.asset 2.0091558
## Age..years. 1.0865896
## Concurrent.Credits 2.1778089
## Type.of.apartment 1.7476343
## No.of.Credits.at.this.Bank 1.5130896
## Occupation 0.1382014
## No.of.dependents 0.7362695
## Telephone 1.5673893
## Foreign.Worker 1.9057709
AIC and BIC
library("stats")
AIC(LogisticModel)
## [1] 998.5572
BIC(LogisticModel)
## [1] 1101.62
Model Selction
library("caret")
varImp(LogisticModel, scale = FALSE)
## Overall
## Account.Balance 8.2802843
## Duration.of.Credit..month. 2.8160232
## Payment.Status.of.Previous.Credit 4.3726880
## Purpose 1.0478728
## Credit.Amount 2.3280721
## Value.Savings.Stocks 4.1036081
## Length.of.current.employment 2.1317598
## Instalment.per.cent 3.6048549
## Sex...Marital.Status 2.2242622
## Guarantors 1.9541747
## Duration.in.Current.address 0.1823156
## Most.valuable.available.asset 2.0091558
## Age..years. 1.0865896
## Concurrent.Credits 2.1778089
## Type.of.apartment 1.7476343
## No.of.Credits.at.this.Bank 1.5130896
## Occupation 0.1382014
## No.of.dependents 0.7362695
## Telephone 1.5673893
## Foreign.Worker 1.9057709
Duration.in.Current.address,Occupation and No.of.dependents are the least impacting variable, lets drop these variables and re build the model
Building Final logistic Regression
LogisticModel1<- glm(german_credit$Creditability~german_credit$Account.Balance+german_credit$Duration.of.Credit..month.+ german_credit$Payment.Status.of.Previous.Credit+german_credit$Purpose+german_credit$Credit.Amount+german_credit$Value.Savings.Stocks+german_credit$Length.of.current.employment+german_credit$Instalment.per.cent +german_credit$Sex...Marital.Status+german_credit$Guarantors+german_credit$Most.valuable.available.asset+german_credit$Age..years.+german_credit$Concurrent.Credits +german_credit$Type.of.apartment+german_credit$No.of.Credits.at.this.Bank+german_credit$Telephone+german_credit$Foreign.Worker)
summary(LogisticModel1)
##
## Call:
## glm(formula = german_credit$Creditability ~ german_credit$Account.Balance +
## german_credit$Duration.of.Credit..month. + german_credit$Payment.Status.of.Previous.Credit +
## german_credit$Purpose + german_credit$Credit.Amount + german_credit$Value.Savings.Stocks +
## german_credit$Length.of.current.employment + german_credit$Instalment.per.cent +
## german_credit$Sex...Marital.Status + german_credit$Guarantors +
## german_credit$Most.valuable.available.asset + german_credit$Age..years. +
## german_credit$Concurrent.Credits + german_credit$Type.of.apartment +
## german_credit$No.of.Credits.at.this.Bank + german_credit$Telephone +
## german_credit$Foreign.Worker)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0846 -0.3477 0.1123 0.2952 0.7905
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -7.003e-02 1.361e-01
## german_credit$Account.Balance 9.936e-02 1.079e-02
## german_credit$Duration.of.Credit..month. -4.395e-03 1.481e-03
## german_credit$Payment.Status.of.Previous.Credit 6.626e-02 1.384e-02
## german_credit$Purpose 4.825e-03 4.811e-03
## german_credit$Credit.Amount -1.502e-05 6.700e-06
## german_credit$Value.Savings.Stocks 3.382e-02 8.448e-03
## german_credit$Length.of.current.employment 2.399e-02 1.129e-02
## german_credit$Instalment.per.cent -4.573e-02 1.289e-02
## german_credit$Sex...Marital.Status 4.224e-02 1.853e-02
## german_credit$Guarantors 5.860e-02 2.773e-02
## german_credit$Most.valuable.available.asset -3.238e-02 1.405e-02
## german_credit$Age..years. 9.144e-04 1.251e-03
## german_credit$Concurrent.Credits 3.687e-02 1.879e-02
## german_credit$Type.of.apartment 4.879e-02 2.730e-02
## german_credit$No.of.Credits.at.this.Bank -4.490e-02 2.522e-02
## german_credit$Telephone 5.348e-02 2.791e-02
## german_credit$Foreign.Worker 1.111e-01 7.035e-02
## t value Pr(>|t|)
## (Intercept) -0.514 0.607051
## german_credit$Account.Balance 9.210 < 2e-16 ***
## german_credit$Duration.of.Credit..month. -2.967 0.003078 **
## german_credit$Payment.Status.of.Previous.Credit 4.787 1.95e-06 ***
## german_credit$Purpose 1.003 0.316163
## german_credit$Credit.Amount -2.241 0.025230 *
## german_credit$Value.Savings.Stocks 4.003 6.73e-05 ***
## german_credit$Length.of.current.employment 2.124 0.033903 *
## german_credit$Instalment.per.cent -3.547 0.000407 ***
## german_credit$Sex...Marital.Status 2.279 0.022852 *
## german_credit$Guarantors 2.113 0.034829 *
## german_credit$Most.valuable.available.asset -2.305 0.021359 *
## german_credit$Age..years. 0.731 0.464819
## german_credit$Concurrent.Credits 1.962 0.050000 .
## german_credit$Type.of.apartment 1.787 0.074226 .
## german_credit$No.of.Credits.at.this.Bank -1.781 0.075280 .
## german_credit$Telephone 1.916 0.055624 .
## german_credit$Foreign.Worker 1.579 0.114765
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1632154)
##
## Null deviance: 210.00 on 999 degrees of freedom
## Residual deviance: 160.28 on 982 degrees of freedom
## AIC: 1045
##
## Number of Fisher Scoring iterations: 2
threshold=0.5
predicted_values<-ifelse(predict(LogisticModel1,type="response")>threshold,1,0)
actual_values<-LogisticModel1$y
conf_matrix<-table(predicted_values,actual_values)
conf_matrix
## actual_values
## predicted_values 0 1
## 0 139 68
## 1 161 632
sensitivity(conf_matrix)
## [1] 0.4633333
specificity(conf_matrix)
## [1] 0.9028571
Logistic Regression Accuracy
accuracy2<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy2
## [1] 0.771
AIC and BIC
## df AIC
## LogisticModel 21 998.5572
## LogisticModel1 19 1045.0284
## df BIC
## LogisticModel 21 1101.620
## LogisticModel1 19 1138.276
decision tree
The series of questions and their possible answers can be organized in the form of a decision tree, which is a hierarchical structure consisting of nodes and directed edges.The tree has 3 types of nodes:
Rootnode that has no incoming edges and zero or more outgoing edges.
Internal node,each of which has exactly one incoming edge and two or more outgoing edges.
leaf or terminal node,each of which has exactly one incoming edge and no outgoing edges.
library("rpart")
library("tree")
## Warning: package 'tree' was built under R version 3.1.3
german_tree<-rpart(german_credit$Creditability~.,method="class", data=german_credit)
german_tree
## n= 1000
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 1000 300 1 (0.3000000 0.7000000)
## 2) Account.Balance< 2.5 543 240 1 (0.4419890 0.5580110)
## 4) Duration.of.Credit..month.>=22.5 237 103 0 (0.5654008 0.4345992)
## 8) Value.Savings.Stocks< 3.5 196 74 0 (0.6224490 0.3775510) *
## 9) Value.Savings.Stocks>=3.5 41 12 1 (0.2926829 0.7073171) *
## 5) Duration.of.Credit..month.< 22.5 306 106 1 (0.3464052 0.6535948)
## 10) Payment.Status.of.Previous.Credit< 1.5 28 7 0 (0.7500000 0.2500000) *
## 11) Payment.Status.of.Previous.Credit>=1.5 278 85 1 (0.3057554 0.6942446)
## 22) Credit.Amount>=7491.5 7 1 0 (0.8571429 0.1428571) *
## 23) Credit.Amount< 7491.5 271 79 1 (0.2915129 0.7084871)
## 46) Duration.of.Credit..month.>=11.5 193 67 1 (0.3471503 0.6528497)
## 92) Credit.Amount< 1387.5 73 36 0 (0.5068493 0.4931507)
## 184) Most.valuable.available.asset>=2.5 23 4 0 (0.8260870 0.1739130) *
## 185) Most.valuable.available.asset< 2.5 50 18 1 (0.3600000 0.6400000) *
## 93) Credit.Amount>=1387.5 120 30 1 (0.2500000 0.7500000) *
## 47) Duration.of.Credit..month.< 11.5 78 12 1 (0.1538462 0.8461538) *
## 3) Account.Balance>=2.5 457 60 1 (0.1312910 0.8687090) *
library("rattle")
library("rpart.plot")
## Warning: package 'rpart.plot' was built under R version 3.1.3
fancyRpartPlot(german_tree)
printcp(german_tree)
##
## Classification tree:
## rpart(formula = german_credit$Creditability ~ ., data = german_credit,
## method = "class")
##
## Variables actually used in tree construction:
## [1] Account.Balance Credit.Amount
## [3] Duration.of.Credit..month. Most.valuable.available.asset
## [5] Payment.Status.of.Previous.Credit Value.Savings.Stocks
##
## Root node error: 300/1000 = 0.3
##
## n= 1000
##
## CP nsplit rel error xerror xstd
## 1 0.051667 0 1.00000 1.00000 0.048305
## 2 0.046667 3 0.84000 0.98000 0.048024
## 3 0.016667 4 0.79333 0.86000 0.046120
## 4 0.010000 8 0.72667 0.85333 0.046003
plotcp(german_tree)
#pruning
library("rpart")
library("tree")
german_tree1<-rpart(german_credit$Creditability~.,method="class", control=rpart.control(minsplit=30, cp=0.028), data=german_credit)
german_tree1
## n= 1000
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 1000 300 1 (0.3000000 0.7000000)
## 2) Account.Balance< 2.5 543 240 1 (0.4419890 0.5580110)
## 4) Duration.of.Credit..month.>=22.5 237 103 0 (0.5654008 0.4345992)
## 8) Value.Savings.Stocks< 3.5 196 74 0 (0.6224490 0.3775510) *
## 9) Value.Savings.Stocks>=3.5 41 12 1 (0.2926829 0.7073171) *
## 5) Duration.of.Credit..month.< 22.5 306 106 1 (0.3464052 0.6535948)
## 10) Payment.Status.of.Previous.Credit< 1.5 28 7 0 (0.7500000 0.2500000) *
## 11) Payment.Status.of.Previous.Credit>=1.5 278 85 1 (0.3057554 0.6942446) *
## 3) Account.Balance>=2.5 457 60 1 (0.1312910 0.8687090) *
library("rattle")
library("rpart.plot")
fancyRpartPlot(german_tree1)
printcp(german_tree1)
##
## Classification tree:
## rpart(formula = german_credit$Creditability ~ ., data = german_credit,
## method = "class", control = rpart.control(minsplit = 30,
## cp = 0.028))
##
## Variables actually used in tree construction:
## [1] Account.Balance Duration.of.Credit..month.
## [3] Payment.Status.of.Previous.Credit Value.Savings.Stocks
##
## Root node error: 300/1000 = 0.3
##
## n= 1000
##
## CP nsplit rel error xerror xstd
## 1 0.051667 0 1.00000 1.00000 0.048305
## 2 0.046667 3 0.84000 1.01000 0.048441
## 3 0.028000 4 0.79333 0.85333 0.046003
plotcp(german_tree1)
#Prediction using the model
accuracy
Accuracy=(TP+TN)/(TP+FP+FN+TN) Sensitivity= TP/(TP+FN) or TP/ Overall Positives Specificity = TN/(TN+FP) or TN/ Overall Negatives
library("caret")
sample_pred<-predict(german_tree1, type="class")
conf_matrix<-table(sample_pred,german_credit$Creditability)
conf_matrix
##
## sample_pred 0 1
## 0 143 81
## 1 157 619
accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy
## [1] 0.762
outlier treatment for creditamount
summary(german_credit$Credit.Amount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 250 1366 2320 3271 3972 18420
boxplot(german_credit$Credit.Amount)
quantile(german_credit$Credit.Amount)
## 0% 25% 50% 75% 100%
## 250.00 1365.50 2319.50 3972.25 18424.00
quantile(german_credit$Credit.Amount,c(0.75,0.80,0.90,0.95,1))
## 75% 80% 90% 95% 100%
## 3972.25 4720.00 7179.40 9162.70 18424.00
german_credit$credit_amount_new<-german_credit$Credit.Amount
bench<- 3972 +3*IQR(german_credit$credit_amount_new)
bench
## [1] 11792.25
mean=3271
summary(german_credit$Credit.Amount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 250 1366 2320 3271 3972 18420
german_credit$credit_amount_new<-ifelse((german_credit$Credit.Amount)>bench,mean,german_credit$Credit.Amount)
summary(german_credit$credit_amount_new)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 250 1366 2320 3017 3786 11760
boxplot(german_credit$credit_amount_new)
length(german_credit$credit_amount_new)
## [1] 1000
names(german_credit)
## [1] "Creditability"
## [2] "Account.Balance"
## [3] "Duration.of.Credit..month."
## [4] "Payment.Status.of.Previous.Credit"
## [5] "Purpose"
## [6] "Credit.Amount"
## [7] "Value.Savings.Stocks"
## [8] "Length.of.current.employment"
## [9] "Instalment.per.cent"
## [10] "Sex...Marital.Status"
## [11] "Guarantors"
## [12] "Duration.in.Current.address"
## [13] "Most.valuable.available.asset"
## [14] "Age..years."
## [15] "Concurrent.Credits"
## [16] "Type.of.apartment"
## [17] "No.of.Credits.at.this.Bank"
## [18] "Occupation"
## [19] "No.of.dependents"
## [20] "Telephone"
## [21] "Foreign.Worker"
## [22] "credit_amount_new"
dim(german_credit)
## [1] 1000 22
Logistic model (credit amount replace with credit_amount_new)
LogisticModel3<- glm(german_credit$Creditability~german_credit$Account.Balance+german_credit$Duration.of.Credit..month.+ german_credit$Payment.Status.of.Previous.Credit+german_credit$Purpose+german_credit$credit_amount_new+german_credit$Value.Savings.Stocks+german_credit$Length.of.current.employment+german_credit$Instalment.per.cent +german_credit$Sex...Marital.Status+german_credit$Guarantors+german_credit$Most.valuable.available.asset+german_credit$Age..years.+german_credit$Concurrent.Credits +german_credit$Type.of.apartment+german_credit$No.of.Credits.at.this.Bank+german_credit$Telephone+german_credit$Foreign.Worker)
summary(LogisticModel3)
##
## Call:
## glm(formula = german_credit$Creditability ~ german_credit$Account.Balance +
## german_credit$Duration.of.Credit..month. + german_credit$Payment.Status.of.Previous.Credit +
## german_credit$Purpose + german_credit$credit_amount_new +
## german_credit$Value.Savings.Stocks + german_credit$Length.of.current.employment +
## german_credit$Instalment.per.cent + german_credit$Sex...Marital.Status +
## german_credit$Guarantors + german_credit$Most.valuable.available.asset +
## german_credit$Age..years. + german_credit$Concurrent.Credits +
## german_credit$Type.of.apartment + german_credit$No.of.Credits.at.this.Bank +
## german_credit$Telephone + german_credit$Foreign.Worker)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0890 -0.3366 0.1152 0.2985 0.7536
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -6.652e-02 1.366e-01
## german_credit$Account.Balance 9.966e-02 1.082e-02
## german_credit$Duration.of.Credit..month. -6.623e-03 1.466e-03
## german_credit$Payment.Status.of.Previous.Credit 6.695e-02 1.387e-02
## german_credit$Purpose 5.128e-03 4.844e-03
## german_credit$credit_amount_new 1.268e-06 8.076e-06
## german_credit$Value.Savings.Stocks 3.288e-02 8.473e-03
## german_credit$Length.of.current.employment 2.541e-02 1.131e-02
## german_credit$Instalment.per.cent -3.278e-02 1.292e-02
## german_credit$Sex...Marital.Status 4.173e-02 1.858e-02
## german_credit$Guarantors 5.761e-02 2.780e-02
## german_credit$Most.valuable.available.asset -3.751e-02 1.404e-02
## german_credit$Age..years. 7.464e-04 1.252e-03
## german_credit$Concurrent.Credits 3.722e-02 1.884e-02
## german_credit$Type.of.apartment 4.825e-02 2.738e-02
## german_credit$No.of.Credits.at.this.Bank -4.772e-02 2.536e-02
## german_credit$Telephone 3.928e-02 2.763e-02
## german_credit$Foreign.Worker 1.014e-01 7.040e-02
## t value Pr(>|t|)
## (Intercept) -0.487 0.626386
## german_credit$Account.Balance 9.213 < 2e-16 ***
## german_credit$Duration.of.Credit..month. -4.518 7.01e-06 ***
## german_credit$Payment.Status.of.Previous.Credit 4.825 1.62e-06 ***
## german_credit$Purpose 1.059 0.289982
## german_credit$credit_amount_new 0.157 0.875303
## german_credit$Value.Savings.Stocks 3.881 0.000111 ***
## german_credit$Length.of.current.employment 2.246 0.024927 *
## german_credit$Instalment.per.cent -2.537 0.011327 *
## german_credit$Sex...Marital.Status 2.246 0.024900 *
## german_credit$Guarantors 2.072 0.038505 *
## german_credit$Most.valuable.available.asset -2.671 0.007684 **
## german_credit$Age..years. 0.596 0.551113
## german_credit$Concurrent.Credits 1.976 0.048425 *
## german_credit$Type.of.apartment 1.762 0.078315 .
## german_credit$No.of.Credits.at.this.Bank -1.882 0.060168 .
## german_credit$Telephone 1.421 0.155489
## german_credit$Foreign.Worker 1.440 0.150311
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1640462)
##
## Null deviance: 210.00 on 999 degrees of freedom
## Residual deviance: 161.09 on 982 degrees of freedom
## AIC: 1050.1
##
## Number of Fisher Scoring iterations: 2
library("rattle")
threshold=0.5
predicted_values<-ifelse(predict(LogisticModel3,type="response")>threshold,1,0)
actual_values<-LogisticModel3$y
conf_matrix<-table(predicted_values,actual_values)
conf_matrix
## actual_values
## predicted_values 0 1
## 0 140 69
## 1 160 631
sensitivity(conf_matrix)
## [1] 0.4666667
specificity(conf_matrix)
## [1] 0.9014286
Accuracy
accuracy3<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy3
## [1] 0.771
logistic model1 builds with a with out outlier treatment(credit.amount) and it’s accuracy is 0.77. logistic model3 builds with a outlier treatment(credit_amount_new)and it’s accuracy is 0.77.
Changing Threshold
threshold=0.8
predicted_values<-ifelse(predict(LogisticModel3,type="response")>threshold,1,0)
actual_values<-LogisticModel3$y
conf_matrix<-table(predicted_values,actual_values)
conf_matrix
## actual_values
## predicted_values 0 1
## 0 265 385
## 1 35 315
sensitivity(conf_matrix)
## [1] 0.8833333
specificity(conf_matrix)
## [1] 0.45
accuracy4<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy4
## [1] 0.58
AIC and BIC
AIC(LogisticModel,LogisticModel1,LogisticModel3)
## df AIC
## LogisticModel 21 998.5572
## LogisticModel1 19 1045.0284
## LogisticModel3 19 1050.1058
BIC(LogisticModel,LogisticModel1,LogisticModel3)
## df BIC
## LogisticModel 21 1101.620
## LogisticModel1 19 1138.276
## LogisticModel3 19 1143.353
Decision tree(Credit.Amount replace with credit_amount_new )
library("rpart", lib.loc="~/R/win-library/3.3")
library("tree", lib.loc="~/R/win-library/3.3")
dim(german_credit)
## [1] 1000 22
names(german_credit)
## [1] "Creditability"
## [2] "Account.Balance"
## [3] "Duration.of.Credit..month."
## [4] "Payment.Status.of.Previous.Credit"
## [5] "Purpose"
## [6] "Credit.Amount"
## [7] "Value.Savings.Stocks"
## [8] "Length.of.current.employment"
## [9] "Instalment.per.cent"
## [10] "Sex...Marital.Status"
## [11] "Guarantors"
## [12] "Duration.in.Current.address"
## [13] "Most.valuable.available.asset"
## [14] "Age..years."
## [15] "Concurrent.Credits"
## [16] "Type.of.apartment"
## [17] "No.of.Credits.at.this.Bank"
## [18] "Occupation"
## [19] "No.of.dependents"
## [20] "Telephone"
## [21] "Foreign.Worker"
## [22] "credit_amount_new"
german_tree1<-rpart(german_credit$Creditability~german_credit$Account.Balance+german_credit$Duration.of.Credit..month.+ german_credit$Payment.Status.of.Previous.Credit+german_credit$Purpose+german_credit$credit_amount_new+german_credit$Value.Savings.Stocks+german_credit$Length.of.current.employment+german_credit$Instalment.per.cent +german_credit$Sex...Marital.Status+german_credit$Guarantors+german_credit$Duration.in.Current.address+german_credit$Most.valuable.available.asset+german_credit$Age..years.+german_credit$Concurrent.Credits +german_credit$Type.of.apartment+german_credit$No.of.Credits.at.this.Bank+german_credit$Occupation+german_credit$No.of.dependents+german_credit$Telephone+german_credit$Foreign.Worker,method="class", control=rpart.control(minsplit=30,cp=0.028))
german_tree1
## n= 1000
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 1000 300 1 (0.3000000 0.7000000)
## 2) german_credit$Account.Balance< 2.5 543 240 1 (0.4419890 0.5580110)
## 4) german_credit$Duration.of.Credit..month.>=22.5 237 103 0 (0.5654008 0.4345992)
## 8) german_credit$Value.Savings.Stocks< 3.5 196 74 0 (0.6224490 0.3775510) *
## 9) german_credit$Value.Savings.Stocks>=3.5 41 12 1 (0.2926829 0.7073171) *
## 5) german_credit$Duration.of.Credit..month.< 22.5 306 106 1 (0.3464052 0.6535948)
## 10) german_credit$Payment.Status.of.Previous.Credit< 1.5 28 7 0 (0.7500000 0.2500000) *
## 11) german_credit$Payment.Status.of.Previous.Credit>=1.5 278 85 1 (0.3057554 0.6942446) *
## 3) german_credit$Account.Balance>=2.5 457 60 1 (0.1312910 0.8687090) *
library("rattle", lib.loc="~/R/win-library/3.3")
library("rpart.plot", lib.loc="~/R/win-library/3.3")
fancyRpartPlot(german_tree1)
printcp(german_tree1)
##
## Classification tree:
## rpart(formula = german_credit$Creditability ~ german_credit$Account.Balance +
## german_credit$Duration.of.Credit..month. + german_credit$Payment.Status.of.Previous.Credit +
## german_credit$Purpose + german_credit$credit_amount_new +
## german_credit$Value.Savings.Stocks + german_credit$Length.of.current.employment +
## german_credit$Instalment.per.cent + german_credit$Sex...Marital.Status +
## german_credit$Guarantors + german_credit$Duration.in.Current.address +
## german_credit$Most.valuable.available.asset + german_credit$Age..years. +
## german_credit$Concurrent.Credits + german_credit$Type.of.apartment +
## german_credit$No.of.Credits.at.this.Bank + german_credit$Occupation +
## german_credit$No.of.dependents + german_credit$Telephone +
## german_credit$Foreign.Worker, method = "class", control = rpart.control(minsplit = 30,
## cp = 0.028))
##
## Variables actually used in tree construction:
## [1] german_credit$Account.Balance
## [2] german_credit$Duration.of.Credit..month.
## [3] german_credit$Payment.Status.of.Previous.Credit
## [4] german_credit$Value.Savings.Stocks
##
## Root node error: 300/1000 = 0.3
##
## n= 1000
##
## CP nsplit rel error xerror xstd
## 1 0.051667 0 1.00000 1.00000 0.048305
## 2 0.046667 3 0.84000 0.96000 0.047733
## 3 0.028000 4 0.79333 0.81667 0.045335
plotcp(german_tree1)
Prediction using the model
sample_pred<-predict(german_tree1, type="class")
conf_matrix<-table(sample_pred,german_credit$Creditability)
conf_matrix
##
## sample_pred 0 1
## 0 143 81
## 1 157 619
accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy
## [1] 0.762
german_tree builds with a with out outlier treatment and its accuracy is 0.76.german_tree1 builds with a outlier treatment and it’s accuracy is 0.76. there is no difference in accuracy with out outlier treatment/with outlier treatment.
Training and Validation data
german_credit<-german_credit[-22]
dim(german_credit)
## [1] 1000 21
names(german_credit)
## [1] "Creditability"
## [2] "Account.Balance"
## [3] "Duration.of.Credit..month."
## [4] "Payment.Status.of.Previous.Credit"
## [5] "Purpose"
## [6] "Credit.Amount"
## [7] "Value.Savings.Stocks"
## [8] "Length.of.current.employment"
## [9] "Instalment.per.cent"
## [10] "Sex...Marital.Status"
## [11] "Guarantors"
## [12] "Duration.in.Current.address"
## [13] "Most.valuable.available.asset"
## [14] "Age..years."
## [15] "Concurrent.Credits"
## [16] "Type.of.apartment"
## [17] "No.of.Credits.at.this.Bank"
## [18] "Occupation"
## [19] "No.of.dependents"
## [20] "Telephone"
## [21] "Foreign.Worker"
library("caret", lib.loc="~/R/win-library/3.3")
set.seed(121)
sampledata <- createDataPartition(german_credit$Creditability, p=0.80, list=FALSE)
train_new <- german_credit[sampledata,]
hold_out <- german_credit[-sampledata,]
Overfitting
Model on training data
library("rpart", lib.loc="~/R/win-library/3.3")
library("tree", lib.loc="~/R/win-library/3.3")
german_tree2<-rpart(Creditability~.,method="class", control=rpart.control(minsplit=30,cp=0.028), data=train_new)
german_tree2
## n= 800
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 230 1 (0.2875000 0.7125000)
## 2) Account.Balance< 2.5 426 180 1 (0.4225352 0.5774648)
## 4) Duration.of.Credit..month.>=15.5 260 127 0 (0.5115385 0.4884615)
## 8) Value.Savings.Stocks< 3.5 210 88 0 (0.5809524 0.4190476) *
## 9) Value.Savings.Stocks>=3.5 50 11 1 (0.2200000 0.7800000) *
## 5) Duration.of.Credit..month.< 15.5 166 47 1 (0.2831325 0.7168675)
## 10) Payment.Status.of.Previous.Credit< 1.5 12 2 0 (0.8333333 0.1666667) *
## 11) Payment.Status.of.Previous.Credit>=1.5 154 37 1 (0.2402597 0.7597403) *
## 3) Account.Balance>=2.5 374 50 1 (0.1336898 0.8663102) *
library("rattle", lib.loc="~/R/win-library/3.3")
library("tree", lib.loc="~/R/win-library/3.3")
sample_pred<-predict(german_tree2, type="class")
conf_matrix<-table(sample_pred,train_new$Creditability)
conf_matrix
##
## sample_pred 0 1
## 0 132 90
## 1 98 480
accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy
## [1] 0.765
Validation accuracy
library("rattle", lib.loc="~/R/win-library/3.3")
hold_out$pred<- predict(german_tree2, hold_out,type="class")
conf_matrix_val<-table(hold_out$pred,hold_out$Creditability)
conf_matrix_val
##
## 0 1
## 0 38 26
## 1 32 104
accuracy_val<-(conf_matrix_val[1,1]+conf_matrix_val[2,2])/(sum(conf_matrix_val))
accuracy_val
## [1] 0.71
It performs well on training data not in test data.
ROC and AUC
library("pROC")
## Warning: package 'pROC' was built under R version 3.1.3
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following object is masked from 'package:gmodels':
##
## ci
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
german_credit<-german_credit[-22]
dim(german_credit)
## [1] 1000 21
names(german_credit)
## [1] "Creditability"
## [2] "Account.Balance"
## [3] "Duration.of.Credit..month."
## [4] "Payment.Status.of.Previous.Credit"
## [5] "Purpose"
## [6] "Credit.Amount"
## [7] "Value.Savings.Stocks"
## [8] "Length.of.current.employment"
## [9] "Instalment.per.cent"
## [10] "Sex...Marital.Status"
## [11] "Guarantors"
## [12] "Duration.in.Current.address"
## [13] "Most.valuable.available.asset"
## [14] "Age..years."
## [15] "Concurrent.Credits"
## [16] "Type.of.apartment"
## [17] "No.of.Credits.at.this.Bank"
## [18] "Occupation"
## [19] "No.of.dependents"
## [20] "Telephone"
## [21] "Foreign.Worker"
LogisticModel4<- glm(german_credit$Creditability~.,family=binomial(),data=german_credit)
summary(LogisticModel4)
##
## Call:
## glm(formula = german_credit$Creditability ~ ., family = binomial(),
## data = german_credit)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5854 -0.7927 0.4512 0.7445 1.9483
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.994e+00 1.024e+00 -3.901 9.58e-05
## Account.Balance 5.799e-01 7.004e-02 8.280 < 2e-16
## Duration.of.Credit..month. -2.457e-02 8.725e-03 -2.816 0.004862
## Payment.Status.of.Previous.Credit 3.822e-01 8.740e-02 4.373 1.23e-05
## Purpose 3.153e-02 3.009e-02 1.048 0.294697
## Credit.Amount -9.340e-05 4.012e-05 -2.328 0.019908
## Value.Savings.Stocks 2.391e-01 5.827e-02 4.104 4.07e-05
## Length.of.current.employment 1.517e-01 7.118e-02 2.132 0.033027
## Instalment.per.cent -2.983e-01 8.276e-02 -3.605 0.000312
## Sex...Marital.Status 2.574e-01 1.157e-01 2.224 0.026131
## Guarantors 3.473e-01 1.777e-01 1.954 0.050681
## Duration.in.Current.address -1.411e-02 7.742e-02 -0.182 0.855335
## Most.valuable.available.asset -1.828e-01 9.101e-02 -2.009 0.044521
## Age..years. 8.917e-03 8.206e-03 1.087 0.277218
## Concurrent.Credits 2.419e-01 1.111e-01 2.178 0.029420
## Type.of.apartment 2.931e-01 1.677e-01 1.748 0.080527
## No.of.Credits.at.this.Bank -2.436e-01 1.610e-01 -1.513 0.130257
## Occupation 1.889e-02 1.367e-01 0.138 0.890081
## No.of.dependents -1.708e-01 2.319e-01 -0.736 0.461567
## Telephone 2.947e-01 1.880e-01 1.567 0.117024
## Foreign.Worker 1.158e+00 6.078e-01 1.906 0.056680
##
## (Intercept) ***
## Account.Balance ***
## Duration.of.Credit..month. **
## Payment.Status.of.Previous.Credit ***
## Purpose
## Credit.Amount *
## Value.Savings.Stocks ***
## Length.of.current.employment *
## Instalment.per.cent ***
## Sex...Marital.Status *
## Guarantors .
## Duration.in.Current.address
## Most.valuable.available.asset *
## Age..years.
## Concurrent.Credits *
## Type.of.apartment .
## No.of.Credits.at.this.Bank
## Occupation
## No.of.dependents
## Telephone
## Foreign.Worker .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1221.73 on 999 degrees of freedom
## Residual deviance: 956.56 on 979 degrees of freedom
## AIC: 998.56
##
## Number of Fisher Scoring iterations: 5
predicted_prob<-predict(LogisticModel4,type="response")
roccurve <- roc(LogisticModel4$y, predicted_prob)
plot(roccurve)
##
## Call:
## roc.default(response = LogisticModel4$y, predictor = predicted_prob)
##
## Data: predicted_prob in 300 controls (LogisticModel4$y 0) < 700 cases (LogisticModel4$y 1).
## Area under the curve: 0.8036
auc(roccurve)
## Area under the curve: 0.8036
auc(LogisticModel4$y, predicted_prob)
## Area under the curve: 0.8036
ROC and AUC on decision
names(german_credit)
## [1] "Creditability"
## [2] "Account.Balance"
## [3] "Duration.of.Credit..month."
## [4] "Payment.Status.of.Previous.Credit"
## [5] "Purpose"
## [6] "Credit.Amount"
## [7] "Value.Savings.Stocks"
## [8] "Length.of.current.employment"
## [9] "Instalment.per.cent"
## [10] "Sex...Marital.Status"
## [11] "Guarantors"
## [12] "Duration.in.Current.address"
## [13] "Most.valuable.available.asset"
## [14] "Age..years."
## [15] "Concurrent.Credits"
## [16] "Type.of.apartment"
## [17] "No.of.Credits.at.this.Bank"
## [18] "Occupation"
## [19] "No.of.dependents"
## [20] "Telephone"
## [21] "Foreign.Worker"
library("pROC")
german_credit_decision<-rpart(german_credit$Creditability~.,method="class",control=rpart.control(minsplit=30,cp=0.028),data=german_credit)
summary(german_credit_decision)
## Call:
## rpart(formula = german_credit$Creditability ~ ., data = german_credit,
## method = "class", control = rpart.control(minsplit = 30,
## cp = 0.028))
## n= 1000
##
## CP nsplit rel error xerror xstd
## 1 0.05166667 0 1.0000000 1.0000000 0.04830459
## 2 0.04666667 3 0.8400000 0.9633333 0.04778180
## 3 0.02800000 4 0.7933333 0.8366667 0.04570424
##
## Variable importance
## Account.Balance Payment.Status.of.Previous.Credit
## 46 14
## Value.Savings.Stocks Duration.of.Credit..month.
## 14 12
## Credit.Amount Most.valuable.available.asset
## 5 2
## Type.of.apartment Purpose
## 1 1
## Age..years. Length.of.current.employment
## 1 1
## No.of.Credits.at.this.Bank Occupation
## 1 1
##
## Node number 1: 1000 observations, complexity param=0.05166667
## predicted class=1 expected loss=0.3 P(node) =1
## class counts: 300 700
## probabilities: 0.300 0.700
## left son=2 (543 obs) right son=3 (457 obs)
## Primary splits:
## Account.Balance < 2.5 to the left, improve=47.90962, (0 missing)
## Payment.Status.of.Previous.Credit < 1.5 to the left, improve=17.06212, (0 missing)
## Value.Savings.Stocks < 2.5 to the left, improve=14.80642, (0 missing)
## Duration.of.Credit..month. < 34.5 to the right, improve=13.62155, (0 missing)
## Credit.Amount < 3909.5 to the right, improve=11.08768, (0 missing)
## Surrogate splits:
## Value.Savings.Stocks < 2.5 to the left, agree=0.611, adj=0.149, (0 split)
## Payment.Status.of.Previous.Credit < 3.5 to the left, agree=0.592, adj=0.107, (0 split)
## Length.of.current.employment < 4.5 to the left, agree=0.554, adj=0.024, (0 split)
## Age..years. < 30.5 to the left, agree=0.554, adj=0.024, (0 split)
## No.of.Credits.at.this.Bank < 1.5 to the left, agree=0.554, adj=0.024, (0 split)
##
## Node number 2: 543 observations, complexity param=0.05166667
## predicted class=1 expected loss=0.441989 P(node) =0.543
## class counts: 240 303
## probabilities: 0.442 0.558
## left son=4 (237 obs) right son=5 (306 obs)
## Primary splits:
## Duration.of.Credit..month. < 22.5 to the right, improve=12.810640, (0 missing)
## Payment.Status.of.Previous.Credit < 1.5 to the left, improve= 9.653787, (0 missing)
## Most.valuable.available.asset < 1.5 to the right, improve= 9.181363, (0 missing)
## Value.Savings.Stocks < 2.5 to the left, improve= 8.890786, (0 missing)
## Credit.Amount < 8079 to the right, improve= 6.601270, (0 missing)
## Surrogate splits:
## Credit.Amount < 2805.5 to the right, agree=0.748, adj=0.422, (0 split)
## Most.valuable.available.asset < 2.5 to the right, agree=0.646, adj=0.190, (0 split)
## Type.of.apartment < 2.5 to the right, agree=0.606, adj=0.097, (0 split)
## Purpose < 8.5 to the right, agree=0.604, adj=0.093, (0 split)
## Occupation < 3.5 to the right, agree=0.595, adj=0.072, (0 split)
##
## Node number 3: 457 observations
## predicted class=1 expected loss=0.131291 P(node) =0.457
## class counts: 60 397
## probabilities: 0.131 0.869
##
## Node number 4: 237 observations, complexity param=0.05166667
## predicted class=0 expected loss=0.4345992 P(node) =0.237
## class counts: 134 103
## probabilities: 0.565 0.435
## left son=8 (196 obs) right son=9 (41 obs)
## Primary splits:
## Value.Savings.Stocks < 3.5 to the left, improve=7.374515, (0 missing)
## Credit.Amount < 1381.5 to the left, improve=3.289316, (0 missing)
## Instalment.per.cent < 2.5 to the right, improve=3.067516, (0 missing)
## Duration.of.Credit..month. < 43.5 to the right, improve=2.564920, (0 missing)
## Purpose < 0.5 to the left, improve=2.358685, (0 missing)
##
## Node number 5: 306 observations, complexity param=0.04666667
## predicted class=1 expected loss=0.3464052 P(node) =0.306
## class counts: 106 200
## probabilities: 0.346 0.654
## left son=10 (28 obs) right son=11 (278 obs)
## Primary splits:
## Payment.Status.of.Previous.Credit < 1.5 to the left, improve=10.040510, (0 missing)
## Most.valuable.available.asset < 1.5 to the right, improve= 5.585685, (0 missing)
## Guarantors < 2.5 to the left, improve= 3.782059, (0 missing)
## Duration.of.Credit..month. < 11.5 to the right, improve= 3.766531, (0 missing)
## Purpose < 7 to the left, improve= 3.278937, (0 missing)
##
## Node number 8: 196 observations
## predicted class=0 expected loss=0.377551 P(node) =0.196
## class counts: 122 74
## probabilities: 0.622 0.378
##
## Node number 9: 41 observations
## predicted class=1 expected loss=0.2926829 P(node) =0.041
## class counts: 12 29
## probabilities: 0.293 0.707
##
## Node number 10: 28 observations
## predicted class=0 expected loss=0.25 P(node) =0.028
## class counts: 21 7
## probabilities: 0.750 0.250
##
## Node number 11: 278 observations
## predicted class=1 expected loss=0.3057554 P(node) =0.278
## class counts: 85 193
## probabilities: 0.306 0.694
predicted_prob<-predict(german_credit_decision,type="class")
predicted_prob1<-ifelse(predicted_prob ==1,2,1)
roccurve <- roc(german_credit_decision$y, predicted_prob1)
plot(roccurve)
##
## Call:
## roc.default(response = german_credit_decision$y, predictor = predicted_prob1)
##
## Data: predicted_prob1 in 300 controls (german_credit_decision$y 1) < 700 cases (german_credit_decision$y 2).
## Area under the curve: 0.6805
auc(roccurve)
## Area under the curve: 0.6805
auc(german_credit_decision$y, predicted_prob1)
## Area under the curve: 0.6805
k-fold Cross Validation building
Divide the whole dataset into k equal parts Use kth part of the data as the holdout sample, use remaining k-1 parts of the data as training data.Repeat this K times, build K models. The average error on holdout sample gives us an idea on the testing error
K=10
library("caret", lib.loc="~/R/win-library/3.3")
train_dat <- trainControl(method="cv", number=10)
train_dat
## $method
## [1] "cv"
##
## $number
## [1] 10
##
## $repeats
## [1] 1
##
## $search
## [1] "grid"
##
## $p
## [1] 0.75
##
## $initialWindow
## NULL
##
## $horizon
## [1] 1
##
## $fixedWindow
## [1] TRUE
##
## $verboseIter
## [1] FALSE
##
## $returnData
## [1] TRUE
##
## $returnResamp
## [1] "final"
##
## $savePredictions
## [1] FALSE
##
## $classProbs
## [1] FALSE
##
## $summaryFunction
## function (data, lev = NULL, model = NULL)
## {
## if (is.character(data$obs))
## data$obs <- factor(data$obs, levels = lev)
## postResample(data[, "pred"], data[, "obs"])
## }
## <environment: namespace:caret>
##
## $selectionFunction
## [1] "best"
##
## $preProcOptions
## $preProcOptions$thresh
## [1] 0.95
##
## $preProcOptions$ICAcomp
## [1] 3
##
## $preProcOptions$k
## [1] 5
##
##
## $sampling
## NULL
##
## $index
## NULL
##
## $indexOut
## NULL
##
## $indexFinal
## NULL
##
## $timingSamps
## [1] 0
##
## $predictionBounds
## [1] FALSE FALSE
##
## $seeds
## [1] NA
##
## $adaptive
## $adaptive$min
## [1] 5
##
## $adaptive$alpha
## [1] 0.05
##
## $adaptive$method
## [1] "gls"
##
## $adaptive$complete
## [1] TRUE
##
##
## $trim
## [1] FALSE
##
## $allowParallel
## [1] TRUE
names(german_credit)
## [1] "Creditability"
## [2] "Account.Balance"
## [3] "Duration.of.Credit..month."
## [4] "Payment.Status.of.Previous.Credit"
## [5] "Purpose"
## [6] "Credit.Amount"
## [7] "Value.Savings.Stocks"
## [8] "Length.of.current.employment"
## [9] "Instalment.per.cent"
## [10] "Sex...Marital.Status"
## [11] "Guarantors"
## [12] "Duration.in.Current.address"
## [13] "Most.valuable.available.asset"
## [14] "Age..years."
## [15] "Concurrent.Credits"
## [16] "Type.of.apartment"
## [17] "No.of.Credits.at.this.Bank"
## [18] "Occupation"
## [19] "No.of.dependents"
## [20] "Telephone"
## [21] "Foreign.Worker"
german_credit$Creditability<-as.factor(german_credit$Creditability)
Building the models on K-fold samples
library("e1071")
## Warning: package 'e1071' was built under R version 3.1.3
K_fold_tree<-train(Creditability~., method="rpart", trControl=train_dat, control=rpart.control(minsplit=10, cp=0.000001), data=german_credit)
K_fold_tree
## CART
##
## 1000 samples
## 20 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 900, 900, 900, 900, 900, 900, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.01666667 0.748 0.3496912
## 0.04666667 0.717 0.2059117
## 0.05166667 0.701 0.1253775
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01666667.
K_fold_tree$finalModel
K_fold_tree$finalModel
## n= 1000
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 1000 300 1 (0.3000000 0.7000000)
## 2) Account.Balance< 2.5 543 240 1 (0.4419890 0.5580110)
## 4) Duration.of.Credit..month.>=22.5 237 103 0 (0.5654008 0.4345992)
## 8) Value.Savings.Stocks< 3.5 196 74 0 (0.6224490 0.3775510) *
## 9) Value.Savings.Stocks>=3.5 41 12 1 (0.2926829 0.7073171) *
## 5) Duration.of.Credit..month.< 22.5 306 106 1 (0.3464052 0.6535948)
## 10) Payment.Status.of.Previous.Credit< 1.5 28 7 0 (0.7500000 0.2500000) *
## 11) Payment.Status.of.Previous.Credit>=1.5 278 85 1 (0.3057554 0.6942446) *
## 3) Account.Balance>=2.5 457 60 1 (0.1312910 0.8687090) *
library("rattle", lib.loc="~/R/win-library/3.3")
fancyRpartPlot(K_fold_tree$finalModel)
Kfold_pred<-predict(K_fold_tree)
confusion matrix
conf_matrix<-confusionMatrix(Kfold_pred,german_credit$Creditability)
conf_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 143 81
## 1 157 619
##
## Accuracy : 0.762
## 95% CI : (0.7344, 0.7881)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : 7.345e-06
##
## Kappa : 0.3891
## Mcnemar's Test P-Value : 1.165e-06
##
## Sensitivity : 0.4767
## Specificity : 0.8843
## Pos Pred Value : 0.6384
## Neg Pred Value : 0.7977
## Prevalence : 0.3000
## Detection Rate : 0.1430
## Detection Prevalence : 0.2240
## Balanced Accuracy : 0.6805
##
## 'Positive' Class : 0
##
Bootstrap
Boot strapping is a powerful tool to get an idea on accuracy of the model
library("caret", lib.loc="~/R/win-library/3.3")
train_control <- trainControl(method="boot", number=10)
Tree model on boots straped data
Boot_Strap_model <- train(Creditability~., method="rpart", trControl=train_dat, control=rpart.control(minsplit=10, cp=0.000001), data=german_credit)
names(german_credit)
## [1] "Creditability"
## [2] "Account.Balance"
## [3] "Duration.of.Credit..month."
## [4] "Payment.Status.of.Previous.Credit"
## [5] "Purpose"
## [6] "Credit.Amount"
## [7] "Value.Savings.Stocks"
## [8] "Length.of.current.employment"
## [9] "Instalment.per.cent"
## [10] "Sex...Marital.Status"
## [11] "Guarantors"
## [12] "Duration.in.Current.address"
## [13] "Most.valuable.available.asset"
## [14] "Age..years."
## [15] "Concurrent.Credits"
## [16] "Type.of.apartment"
## [17] "No.of.Credits.at.this.Bank"
## [18] "Occupation"
## [19] "No.of.dependents"
## [20] "Telephone"
## [21] "Foreign.Worker"
Boot_Strap_model$finalModel
## n= 1000
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 1000 300 1 (0.3000000 0.7000000)
## 2) Account.Balance< 2.5 543 240 1 (0.4419890 0.5580110)
## 4) Duration.of.Credit..month.>=22.5 237 103 0 (0.5654008 0.4345992)
## 8) Value.Savings.Stocks< 3.5 196 74 0 (0.6224490 0.3775510) *
## 9) Value.Savings.Stocks>=3.5 41 12 1 (0.2926829 0.7073171) *
## 5) Duration.of.Credit..month.< 22.5 306 106 1 (0.3464052 0.6535948)
## 10) Payment.Status.of.Previous.Credit< 1.5 28 7 0 (0.7500000 0.2500000) *
## 11) Payment.Status.of.Previous.Credit>=1.5 278 85 1 (0.3057554 0.6942446) *
## 3) Account.Balance>=2.5 457 60 1 (0.1312910 0.8687090) *
library("rattle", lib.loc="~/R/win-library/3.3")
fancyRpartPlot(Boot_Strap_model$finalModel)
Boot_Strap_predictions <- predict(Boot_Strap_model)
conf_matrix<-confusionMatrix(Boot_Strap_predictions,german_credit$Creditability)
conf_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 143 81
## 1 157 619
##
## Accuracy : 0.762
## 95% CI : (0.7344, 0.7881)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : 7.345e-06
##
## Kappa : 0.3891
## Mcnemar's Test P-Value : 1.165e-06
##
## Sensitivity : 0.4767
## Specificity : 0.8843
## Pos Pred Value : 0.6384
## Neg Pred Value : 0.7977
## Prevalence : 0.3000
## Detection Rate : 0.1430
## Detection Prevalence : 0.2240
## Balanced Accuracy : 0.6805
##
## 'Positive' Class : 0
##
conclusion:
n= 1000
node), split, n, loss, yval, (yprob) * denotes terminal node
- root 1000 300 1 (0.3000000 0.7000000)
- Account.Balance< 2.5 543 240 1 (0.4419890 0.5580110)
- Duration.of.Credit..month.>=22.5 237 103 0 (0.5654008 0.4345992)
- Value.Savings.Stocks< 3.5 196 74 0 (0.6224490 0.3775510) *
- Value.Savings.Stocks>=3.5 41 12 1 (0.2926829 0.7073171) *
- Duration.of.Credit..month.< 22.5 306 106 1 (0.3464052 0.6535948)
- Payment.Status.of.Previous.Credit< 1.5 28 7 0 (0.7500000 0.2500000) *
- Payment.Status.of.Previous.Credit>=1.5 278 85 1 (0.3057554 0.6942446) *
- Account.Balance>=2.5 457 60 1 (0.1312910 0.8687090) *
Root node contains 1000 records.ie., over all records in whole data.root node termed as a good customer, loss in that node is bad customers are 300. 30% customers are bad customers and 70% customers are good customers.
2 nd node is Account.Balance< 2.5. if we calculate 54% for 1000recods we get 543 records.2 nd node also termed as good customer, loss in that node is bad customers are 240.out of 543 loss is 240.55%customers are good customers and 44% customers are bad customers.
4 th node is Duration.of.Credit..month.>=22.5.It has a 237 records.4 th node is termed as a bad customer ,loss in that node is goodcustomers are 103.out of 237 loss is 103.Bad customers are around 237.56% customers are bad customers.43% customers are good customers.
8 th node is Value.Savings.Stocks< 3.5.It has a 196 records.8 th node termed as a bad customers,loss in that node is good customers are 74.out of 196 loss is 74. 62% customers are bad customers.38%customers are good customers.
9 th node is Value.Savings.Stocks>=3.5.It has a 41 records.9 th node is termed as a good customers,loss in that node is bad customers are12.out of 41 loss is 12. 29% customers are bad customers.70%customers are good customers.
5 th node is .Duration.of.Credit..month.< 22.5.it has 306 records.5 th node is termed as a good customers,loss in that node is bad customers are106.out of 306 loss is 106. 34% customers are bad customers.65%customers are good customers.
10 th node is Payment.Status.of.Previous.Credit< 1.5. it has a 28 records. 10 th node is termed as a bad customers,loss in that node is good customers are7.out of 28 loss is 7. 75% customers are bad customers.25%customers are good
11 th node is Payment.Status.of.Previous.Credit>=1.5.It has a 278 records.11 th node is termed as a good customers,loss in that node is bad customers are 85.out of 278 loss is 85. 31% customers are bad customers.69%customers are good customers.
3 rd node is Account.Balance>=2.5.It has a 457 records.3rd node termed as a good customers,loss in that node is bad customers are 60.out of 457 loss is 60. 13% customers are bad customers.87%customers are good customers.
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 143 81
1 157 619
Accuracy : 0.762
95% CI : (0.7344, 0.7881)
No Information Rate : 0.7
P-Value [Acc > NIR] : 7.345e-06
Kappa : 0.3891
Mcnemar’s Test P-Value : 1.165e-06
Sensitivity : 0.4767
Specificity : 0.8843
If we the change thershold value to increase the sensitivity and specificity.
Pos Pred Value : 0.6384
Neg Pred Value : 0.7977
Prevalence : 0.3000
Detection Rate : 0.1430
Detection Prevalence : 0.2240
Balanced Accuracy : 0.6805
'Positive' Class : 0
real accuracy of the whole data is 76%.


