You can download the datasets .
Problem Statement
Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit.
Credit scoring algorithms, which make a guess at the probability of default, are the methods banks use to determine whether or not a loan should be granted.
The goal is to build a model that borrowers can use to help make the best financial decisions. The data is raw, you may have to spend considerable amount of time for validating and cleaning the data
Business Model
Banks give credit or loans to people for their needs. Banks earn profit by interest. Borrowers have to pay the loan in some installments. Some borrowers defalut their loan. These causes huge loss to the banks. So to reduce that cases, banks check the people to know whether the borrower is good person or not. Using the data given to us about the borrowers we need to build a model which can predict whether a borrower will pay loan or default the loan.
Data Exploration
First get the data set and metadata.Understand metadata thoroughly.
Next import the data set to R. Find the format like .csv,.xlsx etc of the data set. Use suitable function to import it into R. In our case data set is in csv format. So we use the function ‘read.csv()’ to import the data set.
train_data<-read.csv("D:\intern bang\case study 1\Data Set\training data.csv")
We are assigning the data to the variable ‘train_data’
See the dimension of the data.
dim(train_data)
## [1] 150000 12
we have 12 variables, and 150000 observations.
Get the variable names.
names(train_data)
## [1] "X"
## [2] "SeriousDlqin2yrs"
## [3] "RevolvingUtilizationOfUnsecuredLines"
## [4] "age"
## [5] "NumberOfTime30.59DaysPastDueNotWorse"
## [6] "DebtRatio"
## [7] "MonthlyIncome"
## [8] "NumberOfOpenCreditLinesAndLoans"
## [9] "NumberOfTimes90DaysLate"
## [10] "NumberRealEstateLoansOrLines"
## [11] "NumberOfTime60.89DaysPastDueNotWorse"
## [12] "NumberOfDependents"
check these names with the names in given metadata file. For our data, names are same in both. There is an extra variable ‘X’ which is serial number.
see the structure of the data
str(train_data)
## 'data.frame': 150000 obs. of 12 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ SeriousDlqin2yrs : int 1 0 0 0 0 0 0 0 0 0 ...
## $ RevolvingUtilizationOfUnsecuredLines: num 0.766 0.957 0.658 0.234 0.907 ...
## $ age : int 45 40 38 30 49 74 57 39 27 57 ...
## $ NumberOfTime30.59DaysPastDueNotWorse: int 2 0 1 0 1 0 0 0 0 0 ...
## $ DebtRatio : num 0.803 0.1219 0.0851 0.036 0.0249 ...
## $ MonthlyIncome : int 9120 2600 3042 3300 63588 3500 NA 3500 NA 23684 ...
## $ NumberOfOpenCreditLinesAndLoans : int 13 4 2 5 7 3 8 8 2 9 ...
## $ NumberOfTimes90DaysLate : int 0 0 1 0 0 0 0 0 0 0 ...
## $ NumberRealEstateLoansOrLines : int 6 0 0 0 1 1 3 0 0 4 ...
## $ NumberOfTime60.89DaysPastDueNotWorse: int 0 0 0 0 0 0 0 0 0 0 ...
## $ NumberOfDependents : int 2 1 0 0 0 1 0 0 NA 2 ...
In our data some variables are integers and some are numeric.
See first five and last five rows of the data to get better understanding of variables.
head(train_data)
## X SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age
## 1 1 1 0.7661266 45
## 2 2 0 0.9571510 40
## 3 3 0 0.6581801 38
## 4 4 0 0.2338098 30
## 5 5 0 0.9072394 49
## 6 6 0 0.2131787 74
## NumberOfTime30.59DaysPastDueNotWorse DebtRatio MonthlyIncome
## 1 2 0.80298213 9120
## 2 0 0.12187620 2600
## 3 1 0.08511338 3042
## 4 0 0.03604968 3300
## 5 1 0.02492570 63588
## 6 0 0.37560697 3500
## NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate
## 1 13 0
## 2 4 0
## 3 2 1
## 4 5 0
## 5 7 0
## 6 3 0
## NumberRealEstateLoansOrLines NumberOfTime60.89DaysPastDueNotWorse
## 1 6 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 1 0
## 6 1 0
## NumberOfDependents
## 1 2
## 2 1
## 3 0
## 4 0
## 5 0
## 6 1
tail(train_data)
## X SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age
## 149995 149995 0 0.38574226 50
## 149996 149996 0 0.04067386 74
## 149997 149997 0 0.29974515 44
## 149998 149998 0 0.24604392 58
## 149999 149999 0 0.00000000 30
## 150000 150000 0 0.85028295 64
## NumberOfTime30.59DaysPastDueNotWorse DebtRatio MonthlyIncome
## 149995 0 0.4042929 3400
## 149996 0 0.2251309 2100
## 149997 0 0.7165622 5584
## 149998 0 3870.0000000 NA
## 149999 0 0.0000000 5716
## 150000 0 0.2499081 8158
## NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate
## 149995 7 0
## 149996 4 0
## 149997 4 0
## 149998 18 0
## 149999 4 0
## 150000 8 0
## NumberRealEstateLoansOrLines NumberOfTime60.89DaysPastDueNotWorse
## 149995 0 0
## 149996 1 0
## 149997 1 0
## 149998 1 0
## 149999 0 0
## 150000 2 0
## NumberOfDependents
## 149995 0
## 149996 0
## 149997 2
## 149998 0
## 149999 0
## 150000 0
See the summary of whole data
summary(train_data)
## X SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines
## Min. : 1 Min. :0.00000 Min. : 0.00
## 1st Qu.: 37501 1st Qu.:0.00000 1st Qu.: 0.03
## Median : 75001 Median :0.00000 Median : 0.15
## Mean : 75001 Mean :0.06684 Mean : 6.05
## 3rd Qu.:112500 3rd Qu.:0.00000 3rd Qu.: 0.56
## Max. :150000 Max. :1.00000 Max. :50708.00
##
## age NumberOfTime30.59DaysPastDueNotWorse DebtRatio
## Min. : 0.0 Min. : 0.000 Min. : 0.0
## 1st Qu.: 41.0 1st Qu.: 0.000 1st Qu.: 0.2
## Median : 52.0 Median : 0.000 Median : 0.4
## Mean : 52.3 Mean : 0.421 Mean : 353.0
## 3rd Qu.: 63.0 3rd Qu.: 0.000 3rd Qu.: 0.9
## Max. :109.0 Max. :98.000 Max. :329664.0
##
## MonthlyIncome NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate
## Min. : 0 Min. : 0.000 Min. : 0.000
## 1st Qu.: 3400 1st Qu.: 5.000 1st Qu.: 0.000
## Median : 5400 Median : 8.000 Median : 0.000
## Mean : 6670 Mean : 8.453 Mean : 0.266
## 3rd Qu.: 8249 3rd Qu.:11.000 3rd Qu.: 0.000
## Max. :3008750 Max. :58.000 Max. :98.000
## NA's :29731
## NumberRealEstateLoansOrLines NumberOfTime60.89DaysPastDueNotWorse
## Min. : 0.000 Min. : 0.0000
## 1st Qu.: 0.000 1st Qu.: 0.0000
## Median : 1.000 Median : 0.0000
## Mean : 1.018 Mean : 0.2404
## 3rd Qu.: 2.000 3rd Qu.: 0.0000
## Max. :54.000 Max. :98.0000
##
## NumberOfDependents
## Min. : 0.000
## 1st Qu.: 0.000
## Median : 0.000
## Mean : 0.757
## 3rd Qu.: 1.000
## Max. :20.000
## NA's :3924
Summary will show the minimum,maximum,mean,median,1st quartile,3rd quartile of all the variables in the data set. It also shows Whether any variable has NA(missing values) values.In our data, variables ‘MonthlyIncome’ and ‘NumberOfDependents’ have NA values.Summary gives mean of variables having NA values by excluding them.
Missing Values will not be represented by ‘NA’ always. They can be also represented by ‘?’,‘-’,‘.’ etc. To know whether a variable has a missing value or not get the summary of that variable or create a frequency table. After knowing the representation of the missing values we can find number of missing values in each variable.
In our data missing values are represented by ‘NA’. Lets find out number of missing values in each variable.
missing_values=0
for(i in 1:ncol(train_data)){
missing_values[i]=sum(is.na(train_data[,i]))
}
#is.na will check whether a value is 'NA' or not,and gives output TRUE(1),False(0) respectively. sum() will give sum of all the 1's.
We got the number of missing values of all the variables in a vector.
missing_values
## [1] 0 0 0 0 0 0 29731 0 0 0 0
## [12] 3924
Univariate Analysis
In univariate analysis, we validate one variable at a time. We will know what each variable signifies and how it is related to predicting variable. We also look for missing values,outliers etc. After that we sanitize the variable data which will be used for final model building.
“X”
It is the first variable in the data. It is serial number. We can get its summary
library(ggplot2)
summary(train_data$X)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 37500 75000 75000 112500 150000
Its minimum value is 1, maximum value is 15000 which is number of rows in the data. Its mean and median are equal.
SeriousDlqin2yrs
It tells whether a Person experienced 90 days past due delinquency or worse. This is the target variable which we have to predict. Its values are 1(yes) or 0(no).
summary(train_data$SeriousDlqin2yrs)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.06684 0.00000 1.00000
Since it is yes/no type variable we will get frequency table for better understanding of the variable.
count<-table(train_data$SeriousDlqin2yrs)
count
##
## 0 1
## 139974 10026
Out of 150000 only 10026 are bad accounts, which shows ours is imbalanced data. We will discuss more about imbalanced data in model building.
We draw bar plot of the variable for better visualisation.
barplot(count,width=0.2,col = c("red","blue"),ylim = c(0, 140000),xlim=c(0,2),main=" Bar plot of SeriousDlqin2yrs",xlab="Yes(1)/No(0)",ylab="count")
RevolvingUtilizationOfUnsecuredLines
This variable represents total balance on credit cards and personal lines of credit except real estate and no installment debt like car loans divided by the sum of credit limits. It is a ratio. Its’ value should be inbetween 0 and 1.
summary(train_data$RevolvingUtilizationOfUnsecuredLines)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.03 0.15 6.05 0.56 50710.00
Its’ mean is 6.05 which is greater than 1. So variable has some faulty values. Its maximum value is 50710 which is way too high.
Lets look at percentiles to know from where it is exceeding 1.
quantile(train_data$RevolvingUtilizationOfUnsecuredLines,c(0.9,0.91,0.95,0.96,0.97,0.975,0.98,0.99,1))
## 90% 91% 95% 96% 97%
## 9.812777e-01 9.999999e-01 9.999999e-01 9.999999e-01 9.999999e-01
## 97.5% 98% 99% 100%
## 9.999999e-01 1.006199e+00 1.092956e+00 5.070800e+04
Make box plot of the variable.
ggplot(train_data, aes(y=RevolvingUtilizationOfUnsecuredLines, x = 1)) + geom_boxplot()
From the box plot we can see that there are outliers present in the variable.
age
It specifies age of the barrower in years. Its’ an integer lets see the summary of age
summary(train_data$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 41.0 52.0 52.3 63.0 109.0
Minimum age is 0, which is not practical. Maximum age is 109 which is ok. Mean and median are very close which indicates outliers may not be present.
Lets see the percentile distribution.
quantile(train_data$age,c(0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1))
## 0% 1% 3% 5% 7% 9% 10% 20% 30% 40% 50% 60% 70% 80% 90%
## 0 24 27 29 30 32 33 39 44 48 52 56 61 65 72
## 100%
## 109
Check whether any barrower age is less than 20 as banks mostly wont give credit to teenagers.
sum(train_data$age<20)
## [1] 1
There is only one borrower whose age is less than 20. That barrower is of minimum age 0. This is not possible as banks can not give credit to unborn.
Plot the boxplot of the variable
ggplot(train_data, aes(y=age, x = 1)) + geom_boxplot()
We can notice an outlier at the bottom of the boxplot.
NumberOfTime30.59DaysPastDueNotWorse
It shows number of times a borrower has been 30-59 days past due but no worse in the last 2 years.
str(train_data$NumberOfTime30.59DaysPastDueNotWorse)
## int [1:150000] 2 0 1 0 1 0 0 0 0 0 ...
summary(train_data$NumberOfTime30.59DaysPastDueNotWorse)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.421 0.000 98.000
sd(train_data$NumberOfTime30.59DaysPastDueNotWorse)
## [1] 4.192781
It is an integer variable. Minimum value is zero,median is also zero. Mean is 0.421 ,SD is 4.192 and maximum value is 98. These give indication of presence of outliers.
Check the percentile distribution to know the presence of outliers.
quantile(train_data$NumberOfTime30.59DaysPastDueNotWorse,c(0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.85,0.9,0.95,1))
## 0% 1% 3% 5% 7% 9% 10% 20% 30% 40% 50% 60% 70% 80% 85%
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## 90% 95% 100%
## 1 2 98
100 percentile is 98 ,which is an outlier.
This variable range is from 0 to 98.It takes only integers values. Lets see it’s frequency distribution.
freq_table<-table(train_data$NumberOfTime30.59DaysPastDueNotWorse)
freq_table
##
## 0 1 2 3 4 5 6 7 8 9
## 126018 16033 4598 1754 747 342 140 54 25 12
## 10 11 12 13 96 98
## 4 1 2 1 5 264
This variables has values from 0 to 13 and 96,98. Last two are outliers.
Next plot boxplot and barplots to visualize the data.
qplot(data = train_data, x = NumberOfTime30.59DaysPastDueNotWorse)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(train_data, aes(y=NumberOfTime30.59DaysPastDueNotWorse, x = 1)) + geom_boxplot()
In this plot we can see the outliers.
DebtRatio
Debt ratio is obtained by dividing Monthly debt payments, alimony, living costs by monthly gross income.
str(train_data$DebtRatio)
## num [1:150000] 0.803 0.1219 0.0851 0.036 0.0249 ...
summary(train_data$DebtRatio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.2 0.4 353.0 0.9 329700.0
Normally debt ratio should be between 0 to 1. Somtimes it can exceed 1 ,if a person spends more than his income.Here its minimum is 0,mean is 353,median is 0.4. This indicates presence of outliers. Maximum value is 329700, which is not possible.
Lets see the percentile distribution
quantile(train_data$DebtRatio,c(0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.75,0.76,0.78,0.8,0.85,0.9,0.95,1))
## 10% 20% 30% 40% 50%
## 3.087398e-02 1.337729e-01 2.136969e-01 2.874603e-01 3.665078e-01
## 60% 70% 75% 76% 78%
## 4.675064e-01 6.491891e-01 8.682538e-01 9.511839e-01 1.275069e+00
## 80% 85% 90% 95% 100%
## 4.000000e+00 2.691500e+02 1.267000e+03 2.449000e+03 3.296640e+05
Upto 76percentile it is less than 1.
Plot the boxplot.
ggplot(train_data, aes(y=DebtRatio, x = 1)) + geom_boxplot()
There are outlers present in the variable. We have to filter them before we use the data for model building.
MonthlyIncome
It is the monthly income of the barrower.
str(train_data$MonthlyIncome)
## int [1:150000] 9120 2600 3042 3300 63588 3500 NA 3500 NA 23684 ...
summary(train_data$MonthlyIncome)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 3400 5400 6670 8249 3009000 29731
This is an integer variable. It has missing values represented by ‘NA’.
Its minimum value is 0, which is practically impossible. Mean is 6670 and median is 5400 without considering NA values. Maximum value is 3009000.
We have to treat missing values first before going further into this variable analysis.
NumberOfOpenCreditLinesAndLoans
It indicates number of open loans (an installment loan such as car loan or mortgage) and lines of credit (such as credit cards).
str(train_data$NumberOfOpenCreditLinesAndLoans)
## int [1:150000] 13 4 2 5 7 3 8 8 2 9 ...
summary(train_data$NumberOfOpenCreditLinesAndLoans)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 5.000 8.000 8.453 11.000 58.000
It is an integer variable. Its minimum value is 0,maximum value is 58. Its mean is 8.543,median is 8. Mean and median are close, so outliers may not be present.
Lets see percentile distribution to know the outliers presence.
quantile(train_data$NumberOfOpenCreditLinesAndLoans,c(0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.93,0.95,0.97,0.98,0.99,0.995,1))
## 10% 20% 30% 40% 50% 60% 70% 80% 90% 93% 95% 97%
## 3 4 5 6 8 9 10 12 15 17 18 20
## 98% 99% 99.5% 100%
## 22 24 27 58
Highest value is 58 which is possible.
Lets check boxplot
ggplot(train_data, aes(y=NumberOfOpenCreditLinesAndLoans, x = 1)) + geom_boxplot()
No outliers in this variable.
NumberOfTimes90DaysLate
This variable represents number of times borrower has been 90 days or more past due.
str(train_data$NumberOfTimes90DaysLate)
## int [1:150000] 0 0 1 0 0 0 0 0 0 0 ...
summary(train_data$NumberOfTimes90DaysLate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.266 0.000 98.000
It is an integer variable. Minimum value is zero,median is also zero. Mean is 0.266 and maximum value is 98. These give indication of presence of outliers.
Check the percentile distribution to know the presence of outliers.
quantile(train_data$NumberOfTimes90DaysLate,c(0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.85,0.9,0.95,0.97,0.99,1))
## 0% 1% 3% 5% 7% 9% 10% 20% 30% 40% 50% 60% 70% 80% 85%
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 90% 95% 97% 99% 100%
## 0 1 1 3 98
100 percentile is 98 ,which is an outlier.
This variable range is from 0 to 98.It takes only integers values. Lets see it’s frequency distribution.
freq_table<-table(train_data$NumberOfTimes90DaysLate)
freq_table
##
## 0 1 2 3 4 5 6 7 8 9
## 141662 5243 1555 667 291 131 80 38 21 19
## 10 11 12 13 14 15 17 96 98
## 8 5 2 4 2 2 1 5 264
This variables has values from 0 to 15 and 17,96,98. Last two are outliers.
Next plot boxplot and barplots to visualize the data.
barplot(freq_table,width=1,col = c("red","blue"),ylim = c(0, 140000),xlim=c(0,100),main="NumberOfTimes90DaysLate",xlab="Variable values",ylab="count")
ggplot(train_data, aes(y=NumberOfTimes90DaysLate, x = 1)) + geom_boxplot()
This plot shows presence of outliers.
NumberRealEstateLoansOrLines
It shows number of mortgage and real estate loans taken by the barrower including home equity lines of credit.
str(train_data$NumberRealEstateLoansOrLines)
## int [1:150000] 6 0 0 0 1 1 3 0 0 4 ...
summary(train_data$NumberRealEstateLoansOrLines)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.000 1.018 2.000 54.000
It is an integer variable. Minimum value is zero,median is one. Mean is 1.018 and maximum value is 54. Mean and Median are close so there may not be outliers in this variable.
Check the percentile distribution to know the presence of outliers.
quantile(train_data$NumberRealEstateLoansOrLines,c(0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.85,0.9,0.95,0.97,0.99,1))
## 0% 1% 3% 5% 7% 9% 10% 20% 30% 40% 50% 60% 70% 80% 85%
## 0 0 0 0 0 0 0 0 0 1 1 1 1 2 2
## 90% 95% 97% 99% 100%
## 2 3 3 4 54
100 percentile is 54 ,which is a possible value for this variable.
This variable range is from 0 to 54.It takes only integers values. Lets see it’s frequency distribution.
freq_table<-table(train_data$NumberRealEstateLoansOrLines)
freq_table
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 56188 52338 31522 6300 2170 689 320 171 93 78 37 23
## 12 13 14 15 16 17 18 19 20 21 23 25
## 18 15 7 7 4 4 2 2 2 1 2 3
## 26 29 32 54
## 1 1 1 1
This variables has values from 0 to 21 and 23,25,26,29,32,54.
Next plot boxplot and barplots to visualize the data.
barplot(freq_table,width=1,col = c("red","blue"),ylim = c(0, 60000),xlim=c(0,60),main="NumberRealEstateLoansOrLines",xlab="Variable values",ylab="count")
ggplot(train_data, aes(y=NumberRealEstateLoansOrLines, x = 1)) + geom_boxplot()
There are no outliers in this variable.
NumberOfTime60.89DaysPastDueNotWorse
It shows number of times borrower has been 60 to 89 days past due but no worse in the last 2 years.
str(train_data$NumberOfTime60.89DaysPastDueNotWorse)
## int [1:150000] 0 0 0 0 0 0 0 0 0 0 ...
summary(train_data$NumberOfTime60.89DaysPastDueNotWorse)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2404 0.0000 98.0000
It is an integer variable. Minimum value is zero,median is 0. Mean is 0.2404 and maximum value is 98. These give indication of presence of outliers.
Check the percentile distribution to know the presence of outliers.
quantile(train_data$NumberOfTime60.89DaysPastDueNotWorse,c(0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.85,0.9,0.95,0.97,0.99,1))
## 0% 1% 3% 5% 7% 9% 10% 20% 30% 40% 50% 60% 70% 80% 85%
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 90% 95% 97% 99% 100%
## 0 1 1 2 98
100 percentile is 98 ,which is an outlier.
This variable range is from 0 to 98.It takes only integers values. Lets see it’s frequency distribution.
freq_table<-table(train_data$NumberOfTime60.89DaysPastDueNotWorse)
freq_table
##
## 0 1 2 3 4 5 6 7 8 9
## 142396 5731 1118 318 105 34 16 9 2 1
## 11 96 98
## 1 5 264
This variables has values from 0 to 9 and 11,96,98. Last two are outliers.
Next plot boxplot and barplots to visualize the data.
barplot(freq_table,width=1,col = c("red","blue"),ylim = c(0, 150000),xlim=c(0,100),main="NumberOfTime60.89DaysPastDueNotWorse",xlab="Variable values",ylab="count")
ggplot(train_data, aes(y=NumberOfTime60.89DaysPastDueNotWorse, x = 1)) + geom_boxplot()
This plot shows presence of outliers.
NumberOfDependents
It represents number of dependents in the family of borrower excluding himself.
str(train_data$NumberOfDependents)
## int [1:150000] 2 1 0 0 0 1 0 0 NA 2 ...
summary(train_data$NumberOfDependents)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 0.757 1.000 20.000 3924
It is an integer variable.It has missing values represented by ‘NA’.
Its minimum value is 0. Mean is 0.757 and median is 0 without considering NA values. Maximum value is 20.
Lets tabularize what we found in univariate analysis
| Variable | Missing Values | Outliers |
|---|---|---|
| X | Nill | Nill |
| SeriousDlqin2yrs | Nill | Nill |
| RevolvingUtilizationOfUnsecuredLines | Nill | Present(<10%) |
| age | Nill | Present(<10%) |
| NumberOfTime30.59DaysPastDueNotWorse | Nill | Present(<10%) |
| DebtRatio | Nill | Present(23.4%) |
| MonthlyIncome | Present(19.82%) | has to be Analysed |
| NumberOfOpenCreditLinesAndLoans | Nill | Nill |
| NumberOfTimes90DaysLate | Nill | Present(<10%) |
| NumberRealEstateLoansOrLines | Nill | Nill |
| NumberOfTime60.89DaysPastDueNotWorse | Nill | Present(<10%) |
| NumberOfDependents | Present(<10%) | has to be Analysed |
We have to treat missing values first before going further into this variable analysis.
Next we build a model using uncleaned data and later clean the data.
Model Building
Since the predictor variable (SeriousDlqin2yrs) is YES or NO type ,first we will use logistic regression model.
We have only training data set but not test data set. So we divide our data set into two parts, first 125000 rows for training and remaining 25000 rows for testing.
First we build model without cleaning data, then we replace missing values,outliers with mean and build new model. We will check the accuracy of the model in both cases. If accuracy is increasing after treating data then we will keep the model otherwise we will discard it. We do this for other models as well, we will take the model which gives better results.
Logistic Regression
First lets create datasets for training and testing
train_dataset<-train_data[1:125000,]
test_dataset<-train_data[125001:150000,]
Next we will build the model using train_dataset
logmodel<- glm(SeriousDlqin2yrs ~ RevolvingUtilizationOfUnsecuredLines + age + NumberOfTime30.59DaysPastDueNotWorse + DebtRatio + MonthlyIncome+NumberOfOpenCreditLinesAndLoans+NumberOfTimes90DaysLate+ NumberRealEstateLoansOrLines+ NumberOfTime60.89DaysPastDueNotWorse+NumberOfDependents, data = train_dataset, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Using this model we will predict the output of test_dataset
predicted_values<-ifelse(predict(logmodel,test_dataset)>0.5,1,0)
Here we kept threshold as 0.5. Any value greater than 0 is taken as 1. Any value less than or equal to 0 is taken as 0.
After predicting we need to find out the accuracy of the model, for that we calculate confusion matrix.
conf_matrix<-table(test_dataset[,2],predicted_values)
conf_matrix
## predicted_values
## 0 1
## 0 18576 27
## 1 1402 41
Columns are predicted values, rows are actual values.
Accuracy is total true predictions divided by total predictions.We will also find sensitivity and specificity. Sensitivity is true positives/(true positives+false negatives) specificity is true negatives/(true negatives+false positivies)
accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/sum(conf_matrix)
accuracy
## [1] 0.928714
specificity<-conf_matrix[2,2]/(conf_matrix[2,2]+conf_matrix[2,1])
specificity
## [1] 0.02841303
sensitivity<-conf_matrix[1,1]/(conf_matrix[1,1]+conf_matrix[1,2])
sensitivity
## [1] 0.9985486
Our model accuracy is 0.928. This is without treating the data.Specificity is 0.028, which is very low, we need to improve it.
Next we will replace the missing values with their means and will build new model.
Missing values Treatment
MonthlyIncome and NumberOfDependents have missing values. We will replace them by their column mean values.We create new dataset.
In MonthlyIncome missing values are of 19.82%. So we create a new column NA_MonthlyIncome which indicates whether the value of MonthlyIncome in new dataset is origanal one(FALSE) or missing value replaced by the mean(TRUE).
train_data1<-train_data
c1<-c(1:150000)[is.na(train_data[,7])]
train_data1[c1,7]<-6670
#We can get the mean of remaining observations in summary of that variable
train_data1$NA_MonthlyIncome<-is.na(train_data[,7])
In NumberOfDependents missing values are of only 2.616%, so we dont create any new column.We replace missing values by mean of remaining values.
c1<-c(1:150000)[is.na(train_data[,12])]
#We can get the mean of remaining observations in summary of that variable
train_data1[c1,12]<-0.757
Next we divide new dataset for testing and training
train_dataset1<-train_data1[1:125000,]
test_dataset1<-train_data1[125001:150000,]
Create the model and test it. Find its accuracy
logmodel1 <- glm(SeriousDlqin2yrs ~ RevolvingUtilizationOfUnsecuredLines + age + NumberOfTime30.59DaysPastDueNotWorse + DebtRatio + MonthlyIncome+NumberOfOpenCreditLinesAndLoans+NumberOfTimes90DaysLate+ NumberRealEstateLoansOrLines+ NumberOfTime60.89DaysPastDueNotWorse+NumberOfDependents+NA_MonthlyIncome, data = train_dataset1, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
predicted_values1<-ifelse(predict(logmodel1,test_dataset1)>0.5,1,0)
conf_matrix<-table(test_dataset1[,2],predicted_values1)
conf_matrix
## predicted_values1
## 0 1
## 0 23234 34
## 1 1679 53
accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/sum(conf_matrix)
accuracy
## [1] 0.93148
specificity<-conf_matrix[2,2]/(conf_matrix[2,2]+conf_matrix[2,1])
specificity
## [1] 0.03060046
sensitivity<-conf_matrix[1,1]/(conf_matrix[1,1]+conf_matrix[1,2])
sensitivity
## [1] 0.9985388
For this model accuracy is 0.931. There is an increase in accuracy of the model after replacing missing values.But not much improvement in specificity.
Next we treat outliers.
Outliers Removal
We create new dataset train_data2 in which we replace outliers with mean
train_data2<-train_data1
RevolvingUtilizationOfUnsecuredLines
RevolvingUtilizationOfUnsecuredLines has outliers. Since outliers percentage is less than 10 We will replace outliers with mean of reaming data.Outliers are with value greater than 1.
remain_mean<-mean(train_data1$RevolvingUtilizationOfUnsecuredLines[c(1:150000)[train_data1$RevolvingUtilizationOfUnsecuredLines<=1]])
train_data2[c(1:150000)[train_data1$RevolvingUtilizationOfUnsecuredLines>1],3]<-remain_mean
age
Next in age there is an outlier whose value is zero we replace it with other values mean.
remain_mean<-mean(train_data1$age[c(1:150000)[train_data1$age>0]])
train_data2[c(1:150000)[train_data1$age==0],4]<-remain_mean
NumberOfTime30.59DaysPastDueNotWorse
NumberOfTime30.59DaysPastDueNotWorse has values 96,98 as outliers which are of less than 10%. We treat the outliers based on the related variable ‘SeriousDlqin2yrs’. ‘NumberOfTime30.59DaysPastDueNotWorse’ is directly related to ‘SeriousDlqin2yrs’. So we create a frequency table between SeriousDlqin2yrs and NumberOfTime30.59DaysPastDueNotWorse.
cross_table<-table(train_data1$NumberOfTime30.59DaysPastDueNotWorse,train_data1$SeriousDlqin2yrs)
cross_table
##
## 0 1
## 0 120977 5041
## 1 13624 2409
## 2 3379 1219
## 3 1136 618
## 4 429 318
## 5 188 154
## 6 66 74
## 7 26 28
## 8 17 8
## 9 8 4
## 10 1 3
## 11 0 1
## 12 1 1
## 13 0 1
## 96 1 4
## 98 121 143
For all the values in NumberOfTime30.59DaysPastDueNotWorse find the percentage of 0’s in SeriousDlqin2yrs.As both variables are related, We replace 96,98 with the values whose 0’s percentage is same as former values.
percent=0
for(i in 1:nrow(cross_table)){
percent[i]=(cross_table[i,1]/(cross_table[i,1]+cross_table[i,2]))*100
}
percent
## [1] 95.99978 84.97474 73.48847 64.76625 57.42972 54.97076 47.14286
## [8] 48.14815 68.00000 66.66667 25.00000 0.00000 50.00000 0.00000
## [15] 20.00000 45.83333
We can see that 98,6 values have nearly same percent and there are only 5 values of 96 in the variable. So we also replace 98 and also 96 by 6.
train_data2[c(1:150000)[train_data1[,5]>13],5]<-6
DebtRatio
DebtRatio has ooutliers.We take anything greater than 1 as outlier and replace them with remaining values mean. Outliers percentage is 23.4%. So we crete a new row Outlier_DebtRatio to indicate whether that value is outlier and replaced or not.
remain_mean<-mean(train_data1$DebtRatio[c(1:150000)[train_data1$DebtRatio<1]])
train_data2$Outlier_DebtRatio<-train_data1$DebtRatio>1
train_data2[c(1:150000)[train_data1$DebtRatio>1],6]<-remain_mean
MonthlyIncome
In MonthlyIncome we replaced missing values with mean lets check for outliers.
summary(train_data1$MonthlyIncome)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3903 6600 6670 7400 3009000
quantile(train_data1$MonthlyIncome,c(0,0.01,0.02,0.03,0.04,0.05,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.97,0.99,0.995,0.9975,1))
## 0% 1% 2% 3% 4% 5%
## 0.00 0.00 600.00 1000.00 1300.00 1500.00
## 10% 20% 30% 40% 50% 60%
## 2325.00 3400.00 4333.00 5400.00 6600.00 6670.00
## 70% 80% 90% 95% 97% 99%
## 6670.00 8250.00 10750.00 13500.00 16000.00 23000.00
## 99.5% 99.75% 100%
## 31250.00 44395.19 3008750.00
Observing the percentile distribution we will take upper limit as 50000 and lower limit as 1000. Anything out of this range will be considered as outlier.Lets see total number of outliers.
sum(train_data1$MonthlyIncome>50000)
## [1] 301
sum(train_data1$MonthlyIncome<1000)
## [1] 4428
outliers are of 3.152%.
we replace outliers with median of remaining values.
median<-median(c(1:150000)[train_data1$MonthlyIncome<50000&train_data1$MonthlyIncome>1000])
median
## [1] 75010.5
train_data2$MonthlyIncome[c(1:150000)[train_data1$MonthlyIncome<50000&train_data1$MonthlyIncome>1000]]<-median
NumberOfTimes90DaysLate
Outlier treatment for NumberOfTimes90DaysLate is same as NumberOfTime30.59DaysPastDueNotWorse. Outliers are 96,98.
cross_table<-table(train_data1$NumberOfTimes90DaysLate,train_data1$SeriousDlqin2yrs)
cross_table
##
## 0 1
## 0 135108 6554
## 1 3478 1765
## 2 779 776
## 3 282 385
## 4 96 195
## 5 48 83
## 6 32 48
## 7 7 31
## 8 6 15
## 9 5 14
## 10 3 5
## 11 2 3
## 12 1 1
## 13 2 2
## 14 1 1
## 15 2 0
## 17 0 1
## 96 1 4
## 98 121 143
percent=0
for(i in 1:nrow(cross_table)){
percent[i]=(cross_table[i,1]/(cross_table[i,1]+cross_table[i,2]))*100
}
percent
## [1] 95.37349 66.33607 50.09646 42.27886 32.98969 36.64122 40.00000
## [8] 18.42105 28.57143 26.31579 37.50000 40.00000 50.00000 50.00000
## [15] 50.00000 100.00000 0.00000 20.00000 45.83333
Values 98,3 has close percentage. We replace 96,98 with 3.
train_data2[c(1:150000)[train_data1[,9]>17],9]<-3
NumberOfTime60.89DaysPastDueNotWorse
This has 96,98 as outliers. This outlier treatment is same as NumberOfTime30.59DaysPastDueNotWorse.
cross_table<-table(train_data1$NumberOfTime60.89DaysPastDueNotWorse,train_data1$SeriousDlqin2yrs)
cross_table
##
## 0 1
## 0 135140 7256
## 1 3954 1777
## 2 557 561
## 3 138 180
## 4 40 65
## 5 13 21
## 6 4 12
## 7 4 5
## 8 1 1
## 9 1 0
## 11 0 1
## 96 1 4
## 98 121 143
percent=0
for(i in 1:nrow(cross_table)){
percent[i]=(cross_table[i,1]/(cross_table[i,1]+cross_table[i,2]))*100
}
percent
## [1] 94.90435 68.99319 49.82111 43.39623 38.09524 38.23529 25.00000
## [8] 44.44444 50.00000 100.00000 0.00000 20.00000 45.83333
Values 98,7 has close percentage. We replace 96,98 with 7.
train_data2[c(1:150000)[train_data1[,11]>17],11]<-7
All the outliers and missing values are cleaned. We save the final dataset for future use.
write.csv(train_data2,".../cleaned-data.csv")
After Outlier Treatment, we will build the model again using new data set.We divide data set for training and testing
train_dataset2<-train_data2[1:125000,]
test_dataset2<-train_data2[125001:150000,]
Create the model and test it. Find its accuracy
logmodel2 <- glm(SeriousDlqin2yrs ~ RevolvingUtilizationOfUnsecuredLines + age + NumberOfTime30.59DaysPastDueNotWorse + DebtRatio + MonthlyIncome+NumberOfOpenCreditLinesAndLoans+NumberOfTimes90DaysLate+ NumberRealEstateLoansOrLines+ NumberOfTime60.89DaysPastDueNotWorse+NumberOfDependents+ NA_MonthlyIncome + Outlier_DebtRatio, data = train_dataset2, family = "binomial")
predicted_values<-ifelse(predict(logmodel2,test_dataset2)>0.5,1,0)
conf_matrix<-table(test_dataset2[,2],predicted_values)
conf_matrix
## predicted_values
## 0 1
## 0 23163 105
## 1 1566 166
accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/sum(conf_matrix)
accuracy
## [1] 0.93316
specificity<-conf_matrix[2,2]/(conf_matrix[2,2]+conf_matrix[2,1])
specificity
## [1] 0.09584296
sensitivity<-conf_matrix[1,1]/(conf_matrix[1,1]+conf_matrix[1,2])
sensitivity
## [1] 0.9954874
Accuracy of the model is 0.933 which is greater than that of previous model.Specificity is 0.1003, Which is very low.
Decision Tree
Specificity in the logistic regression model is not satisfactory. So we build a new model using decision trees. We need to install package ‘party’ for decision trees.
library("party")
dec_tree<-ctree(SeriousDlqin2yrs ~ RevolvingUtilizationOfUnsecuredLines + age + NumberOfTime30.59DaysPastDueNotWorse + DebtRatio + MonthlyIncome+NumberOfOpenCreditLinesAndLoans+NumberOfTimes90DaysLate+ NumberRealEstateLoansOrLines+ NumberOfTime60.89DaysPastDueNotWorse+NumberOfDependents+ NA_MonthlyIncome + Outlier_DebtRatio, data = train_dataset2)
summary(dec_tree)
## Length Class Mode
## 1 BinaryTree S4
predict_values<-predict(dec_tree, newdata=test_dataset2)
predicted_values<-ifelse(predict_values>0.5,1,0)
conf_matrix<-table(test_dataset2[,2],predicted_values)
conf_matrix
## predicted_values
## 0 1
## 0 23085 183
## 1 1438 294
accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/sum(conf_matrix)
accuracy
## [1] 0.93516
specificity<-conf_matrix[2,2]/(conf_matrix[2,2]+conf_matrix[2,1])
specificity
## [1] 0.169746
sensitivity<-conf_matrix[1,1]/(conf_matrix[1,1]+conf_matrix[1,2])
sensitivity
## [1] 0.9921351
Accuracy is 0.935 and specificity is 0.169 which are greater than the logistic regression model. But we need to increase the specificity of the model
Accuracy paradox
In our data there are way more number of positive cases(0) than negative cases(1).
table(train_data[,2])
##
## 0 1
## 139974 10026
Ratio of zeros to ones is nearly 14:1 which indicates our data is imbalanced data. Imbalanced data typically refers to a problem with classification where the classes are not represented equally.
In our case models are giving more than 90% accuracy, but the accuracy is only reflecting the underlying class distribution. If we notice the confusion matrix, we can see that accuracy is mostly driven by zeros. So even though it is misclassifying most of the one’s we are getting high accuracy. This situation is called Accuracy Paradox.
In our case if we classify a good person as bad, bank wont issue loan to him and bank losses interest. If we classify bad one as good, bank issues loan to him but he wont pay the loan. Bank will have huge loss in this case. It is ok even if we calssify a good one as bad, but we should not classify bad one as good. Which mean specificity should be as high as possible in our case. Our model has only 0.169 specificity which is too low. We need to increase this value.
We Averaging models to tackle the issue of imbalance and increase the specificity.
Averaging Models
The ratio of ones to zeros is 1:14. We divide data into two groups with ones and zeros
zeros_data<-train_data2[-c(1:150000)[train_data2[,2]==1],]
ones_data<-train_data2[-c(1:150000)[train_data2[,2]==0],]
We divide the zeros_data into 14 clusters using K-means clustering.Each time clustering changes when we run the code. So to get same clusters each time we write the line ‘set.seed(42)’ in the code.
set.seed(42)
fit <- kmeans(zeros_data, 14)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 6998700)
zeros_clust<-data.frame(zeros_data, fit$cluster)
table(zeros_clust[,15])
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 9733 9725 15115 1469 1514 15188 9804 15130 13949 11901 15061 10074
## 13 14
## 9746 1565
zeros_clust has new column indicating cluster number.Above table shows number of elements in each cluster.
Next we build 14 models using each cluster and ones_data.
ones_data will be used in all clusters , thus making most of the training sets a balanced ones in building models.
models_list <- vector(mode="list", length=14)
for(j in 1:14){
models_list[[j]]<-ctree(SeriousDlqin2yrs ~ RevolvingUtilizationOfUnsecuredLines + age + NumberOfTime30.59DaysPastDueNotWorse + DebtRatio + MonthlyIncome+NumberOfOpenCreditLinesAndLoans+NumberOfTimes90DaysLate+ NumberRealEstateLoansOrLines+ NumberOfTime60.89DaysPastDueNotWorse+NumberOfDependents+ NA_MonthlyIncome + Outlier_DebtRatio, data = rbind(ones_data,zeros_data[c(1:nrow(zeros_clust))[zeros_clust[,15]==j],]))
}
Next we find accuracy,specificity,sensitivity for all models.
specificity=0
accuracy=0
sensitivity=0
predict_values<-data.frame(matrix(0, ncol = 14, nrow = 150000))
for(j in 1:14){
predict_values[,j]<-predict(models_list[[j]], newdata=train_data2)
predicted_values<-ifelse(predict_values[,j]>0.5,1,0)
conf_matrix1<-table(train_data2[,2],predicted_values)
accuracy[j]<-(conf_matrix1[1,1]+conf_matrix1[2,2])/sum(conf_matrix1)
specificity[j]<-conf_matrix1[2,2]/(conf_matrix1[2,2]+conf_matrix1[2,1])
sensitivity[j]<-conf_matrix1[1,1]/(conf_matrix1[1,1]+conf_matrix1[1,2])
}
accuracy
## [1] 0.7571667 0.7607400 0.8502800 0.0947400 0.0949000 0.8513067 0.7829267
## [8] 0.8454733 0.8068267 0.7847400 0.8475600 0.7821267 0.7776200 0.0950200
sensitivity
## [1] 0.75431151 0.75855516 0.86400331 0.03087002 0.03110578 0.86520354
## [7] 0.78372412 0.85819509 0.81168646 0.78580308 0.86078843 0.78298113
## [13] 0.77784446 0.03129867
specificity
## [1] 0.7970277 0.7912428 0.6586874 0.9864353 0.9855376 0.6572910 0.7717933
## [8] 0.6678636 0.7389787 0.7698983 0.6628765 0.7701975 0.7744863 0.9846399
As accuracy or sensitivity is increasing to a model, its specificity is decreasing. We discard models 4,5,14 as their accuracies are very low. We take average of predicted probabilities of remaining models. We use this averaged probability to get output.
predict_avg<-predict_values[,1]
for(j in c(1,2,3,6,7,8,9,10,11,12,13)){
predict_avg<-predict_avg+predict_values[,j]
}
predict_avg<-predict_avg/11
predicted_values<-ifelse(predict_avg>0.5,1,0)
conf_matrix<-table(train_data2[,2],predicted_values)
accuracy_avg<-(conf_matrix[1,1]+conf_matrix[2,2])/sum(conf_matrix)
accuracy_avg
## [1] 0.78316
specificity_avg<-conf_matrix[2,2]/(conf_matrix[2,2]+conf_matrix[2,1])
specificity_avg
## [1] 0.778077
sensitivity_avg<-conf_matrix[1,1]/(conf_matrix[1,1]+conf_matrix[1,2])
sensitivity_avg
## [1] 0.7835241
Specificity of the averaged model is 0.778 and accuracy is 0.783 which are good values. We can increase the specificity by decreasing the threshold, but buy doing so accuracy,sensitivity decreases.
Lets chaeck specificity,accuracy for 0.4 threshold.
predicted_values<-ifelse(predict_avg>0.4,1,0)
conf_matrix<-table(train_data2[,2],predicted_values)
accuracy_avg<-(conf_matrix[1,1]+conf_matrix[2,2])/sum(conf_matrix)
accuracy_avg
## [1] 0.71504
specificity_avg<-conf_matrix[2,2]/(conf_matrix[2,2]+conf_matrix[2,1])
specificity_avg
## [1] 0.8401157
sensitivity_avg<-conf_matrix[1,1]/(conf_matrix[1,1]+conf_matrix[1,2])
sensitivity_avg
## [1] 0.7060811
Specificity increased to 0.84 and accuracy decreased to 0.71.


