You can download the datasets .

Content

Problem Statement
Data Exploration
Univariate Analysis
Model Building
Decision Tree

Problem Statement

Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit.

Credit scoring algorithms, which make a guess at the probability of default, are the methods banks use to determine whether or not a loan should be granted.

The goal is to build a model that borrowers can use to help make the best financial decisions. The data is raw, you may have to spend considerable amount of time for validating and cleaning the data

Business Model

Banks give credit or loans to people for their needs. Banks earn profit by interest. Borrowers have to pay the loan in some installments. Some borrowers defalut their loan. These causes huge loss to the banks. So to reduce that cases, banks check the people to know whether the borrower is good person or not. Using the data given to us about the borrowers we need to build a model which can predict whether a borrower will pay loan or default the loan.

Data Exploration

First get the data set and metadata.Understand metadata thoroughly.
Next import the data set to R. Find the format like .csv,.xlsx etc of the data set. Use suitable function to import it into R. In our case data set is in csv format. So we use the function ‘read.csv()’ to import the data set.

train_data<-read.csv("D:\intern bang\case study 1\Data Set\training data.csv")

We are assigning the data to the variable ‘train_data’

See the dimension of the data.

dim(train_data)

## [1] 150000     12

we have 12 variables, and 150000 observations.

Get the variable names.

names(train_data)

##  [1] "X"                                   
##  [2] "SeriousDlqin2yrs"                    
##  [3] "RevolvingUtilizationOfUnsecuredLines"
##  [4] "age"                                 
##  [5] "NumberOfTime30.59DaysPastDueNotWorse"
##  [6] "DebtRatio"                           
##  [7] "MonthlyIncome"                       
##  [8] "NumberOfOpenCreditLinesAndLoans"     
##  [9] "NumberOfTimes90DaysLate"             
## [10] "NumberRealEstateLoansOrLines"        
## [11] "NumberOfTime60.89DaysPastDueNotWorse"
## [12] "NumberOfDependents"

check these names with the names in given metadata file. For our data, names are same in both. There is an extra variable ‘X’ which is serial number.

see the structure of the data

str(train_data)

## 'data.frame':    150000 obs. of  12 variables:
##  $ X                                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ SeriousDlqin2yrs                    : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ RevolvingUtilizationOfUnsecuredLines: num  0.766 0.957 0.658 0.234 0.907 ...
##  $ age                                 : int  45 40 38 30 49 74 57 39 27 57 ...
##  $ NumberOfTime30.59DaysPastDueNotWorse: int  2 0 1 0 1 0 0 0 0 0 ...
##  $ DebtRatio                           : num  0.803 0.1219 0.0851 0.036 0.0249 ...
##  $ MonthlyIncome                       : int  9120 2600 3042 3300 63588 3500 NA 3500 NA 23684 ...
##  $ NumberOfOpenCreditLinesAndLoans     : int  13 4 2 5 7 3 8 8 2 9 ...
##  $ NumberOfTimes90DaysLate             : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ NumberRealEstateLoansOrLines        : int  6 0 0 0 1 1 3 0 0 4 ...
##  $ NumberOfTime60.89DaysPastDueNotWorse: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ NumberOfDependents                  : int  2 1 0 0 0 1 0 0 NA 2 ...

In our data some variables are integers and some are numeric.

See first five and last five rows of the data to get better understanding of variables.

head(train_data)

##   X SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age
## 1 1                1                            0.7661266  45
## 2 2                0                            0.9571510  40
## 3 3                0                            0.6581801  38
## 4 4                0                            0.2338098  30
## 5 5                0                            0.9072394  49
## 6 6                0                            0.2131787  74
##   NumberOfTime30.59DaysPastDueNotWorse  DebtRatio MonthlyIncome
## 1                                    2 0.80298213          9120
## 2                                    0 0.12187620          2600
## 3                                    1 0.08511338          3042
## 4                                    0 0.03604968          3300
## 5                                    1 0.02492570         63588
## 6                                    0 0.37560697          3500
##   NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate
## 1                              13                       0
## 2                               4                       0
## 3                               2                       1
## 4                               5                       0
## 5                               7                       0
## 6                               3                       0
##   NumberRealEstateLoansOrLines NumberOfTime60.89DaysPastDueNotWorse
## 1                            6                                    0
## 2                            0                                    0
## 3                            0                                    0
## 4                            0                                    0
## 5                            1                                    0
## 6                            1                                    0
##   NumberOfDependents
## 1                  2
## 2                  1
## 3                  0
## 4                  0
## 5                  0
## 6                  1

tail(train_data)

##             X SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age
## 149995 149995                0                           0.38574226  50
## 149996 149996                0                           0.04067386  74
## 149997 149997                0                           0.29974515  44
## 149998 149998                0                           0.24604392  58
## 149999 149999                0                           0.00000000  30
## 150000 150000                0                           0.85028295  64
##        NumberOfTime30.59DaysPastDueNotWorse    DebtRatio MonthlyIncome
## 149995                                    0    0.4042929          3400
## 149996                                    0    0.2251309          2100
## 149997                                    0    0.7165622          5584
## 149998                                    0 3870.0000000            NA
## 149999                                    0    0.0000000          5716
## 150000                                    0    0.2499081          8158
##        NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate
## 149995                               7                       0
## 149996                               4                       0
## 149997                               4                       0
## 149998                              18                       0
## 149999                               4                       0
## 150000                               8                       0
##        NumberRealEstateLoansOrLines NumberOfTime60.89DaysPastDueNotWorse
## 149995                            0                                    0
## 149996                            1                                    0
## 149997                            1                                    0
## 149998                            1                                    0
## 149999                            0                                    0
## 150000                            2                                    0
##        NumberOfDependents
## 149995                  0
## 149996                  0
## 149997                  2
## 149998                  0
## 149999                  0
## 150000                  0

See the summary of whole data

summary(train_data)

##        X          SeriousDlqin2yrs  RevolvingUtilizationOfUnsecuredLines
##  Min.   :     1   Min.   :0.00000   Min.   :    0.00                    
##  1st Qu.: 37501   1st Qu.:0.00000   1st Qu.:    0.03                    
##  Median : 75001   Median :0.00000   Median :    0.15                    
##  Mean   : 75001   Mean   :0.06684   Mean   :    6.05                    
##  3rd Qu.:112500   3rd Qu.:0.00000   3rd Qu.:    0.56                    
##  Max.   :150000   Max.   :1.00000   Max.   :50708.00                    
##                                                                         
##       age        NumberOfTime30.59DaysPastDueNotWorse   DebtRatio       
##  Min.   :  0.0   Min.   : 0.000                       Min.   :     0.0  
##  1st Qu.: 41.0   1st Qu.: 0.000                       1st Qu.:     0.2  
##  Median : 52.0   Median : 0.000                       Median :     0.4  
##  Mean   : 52.3   Mean   : 0.421                       Mean   :   353.0  
##  3rd Qu.: 63.0   3rd Qu.: 0.000                       3rd Qu.:     0.9  
##  Max.   :109.0   Max.   :98.000                       Max.   :329664.0  
##                                                                         
##  MonthlyIncome     NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate
##  Min.   :      0   Min.   : 0.000                  Min.   : 0.000         
##  1st Qu.:   3400   1st Qu.: 5.000                  1st Qu.: 0.000         
##  Median :   5400   Median : 8.000                  Median : 0.000         
##  Mean   :   6670   Mean   : 8.453                  Mean   : 0.266         
##  3rd Qu.:   8249   3rd Qu.:11.000                  3rd Qu.: 0.000         
##  Max.   :3008750   Max.   :58.000                  Max.   :98.000         
##  NA's   :29731                                                            
##  NumberRealEstateLoansOrLines NumberOfTime60.89DaysPastDueNotWorse
##  Min.   : 0.000               Min.   : 0.0000                     
##  1st Qu.: 0.000               1st Qu.: 0.0000                     
##  Median : 1.000               Median : 0.0000                     
##  Mean   : 1.018               Mean   : 0.2404                     
##  3rd Qu.: 2.000               3rd Qu.: 0.0000                     
##  Max.   :54.000               Max.   :98.0000                     
##                                                                   
##  NumberOfDependents
##  Min.   : 0.000    
##  1st Qu.: 0.000    
##  Median : 0.000    
##  Mean   : 0.757    
##  3rd Qu.: 1.000    
##  Max.   :20.000    
##  NA's   :3924

Summary will show the minimum,maximum,mean,median,1^st quartile,3^rd quartile of all the variables in the data set. It also shows Whether any variable has NA(missing values) values.In our data, variables ‘MonthlyIncome’ and ‘NumberOfDependents’ have NA values.Summary gives mean of variables having NA values by excluding them.

Missing Values will not be represented by ‘NA’ always. They can be also represented by ‘?’,‘-’,‘.’ etc. To know whether a variable has a missing value or not get the summary of that variable or create a frequency table. After knowing the representation of the missing values we can find number of missing values in each variable.

In our data missing values are represented by ‘NA’. Lets find out number of missing values in each variable.

missing_values=0
for(i in 1:ncol(train_data)){
  missing_values[i]=sum(is.na(train_data[,i]))
  }
                   
#is.na will check whether a value is 'NA' or not,and gives output TRUE(1),False(0) respectively. sum() will give sum of all the 1's.

We got the number of missing values of all the variables in a vector.

missing_values

##  [1]     0     0     0     0     0     0 29731     0     0     0     0
## [12]  3924

Univariate Analysis

In univariate analysis, we validate one variable at a time. We will know what each variable signifies and how it is related to predicting variable. We also look for missing values,outliers etc. After that we sanitize the variable data which will be used for final model building.

“X”

It is the first variable in the data. It is serial number. We can get its summary

library(ggplot2)
summary(train_data$X)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1   37500   75000   75000  112500  150000

Its minimum value is 1, maximum value is 15000 which is number of rows in the data. Its mean and median are equal.

SeriousDlqin2yrs

It tells whether a Person experienced 90 days past due delinquency or worse. This is the target variable which we have to predict. Its values are 1(yes) or 0(no).

summary(train_data$SeriousDlqin2yrs)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.06684 0.00000 1.00000

Since it is yes/no type variable we will get frequency table for better understanding of the variable.

count<-table(train_data$SeriousDlqin2yrs)
count

## 
##      0      1 
## 139974  10026

Out of 150000 only 10026 are bad accounts, which shows ours is imbalanced data. We will discuss more about imbalanced data in model building.

We draw bar plot of the variable for better visualisation.

barplot(count,width=0.2,col = c("red","blue"),ylim = c(0, 140000),xlim=c(0,2),main=" Bar plot of SeriousDlqin2yrs",xlab="Yes(1)/No(0)",ylab="count")

RevolvingUtilizationOfUnsecuredLines

This variable represents total balance on credit cards and personal lines of credit except real estate and no installment debt like car loans divided by the sum of credit limits. It is a ratio. Its’ value should be inbetween 0 and 1.

summary(train_data$RevolvingUtilizationOfUnsecuredLines)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00     0.03     0.15     6.05     0.56 50710.00

Its’ mean is 6.05 which is greater than 1. So variable has some faulty values. Its maximum value is 50710 which is way too high.

Lets look at percentiles to know from where it is exceeding 1.

 quantile(train_data$RevolvingUtilizationOfUnsecuredLines,c(0.9,0.91,0.95,0.96,0.97,0.975,0.98,0.99,1))

##          90%          91%          95%          96%          97% 
## 9.812777e-01 9.999999e-01 9.999999e-01 9.999999e-01 9.999999e-01 
##        97.5%          98%          99%         100% 
## 9.999999e-01 1.006199e+00 1.092956e+00 5.070800e+04

Make box plot of the variable.

ggplot(train_data, aes(y=RevolvingUtilizationOfUnsecuredLines, x = 1)) + geom_boxplot()

From the box plot we can see that there are outliers present in the variable.

age

It specifies age of the barrower in years. Its’ an integer lets see the summary of age

summary(train_data$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    41.0    52.0    52.3    63.0   109.0

Minimum age is 0, which is not practical. Maximum age is 109 which is ok. Mean and median are very close which indicates outliers may not be present.

Lets see the percentile distribution.

quantile(train_data$age,c(0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1))

##   0%   1%   3%   5%   7%   9%  10%  20%  30%  40%  50%  60%  70%  80%  90% 
##    0   24   27   29   30   32   33   39   44   48   52   56   61   65   72 
## 100% 
##  109

Check whether any barrower age is less than 20 as banks mostly wont give credit to teenagers.

sum(train_data$age<20)

## [1] 1

There is only one borrower whose age is less than 20. That barrower is of minimum age 0. This is not possible as banks can not give credit to unborn.

Plot the boxplot of the variable

ggplot(train_data, aes(y=age, x = 1)) + geom_boxplot()

We can notice an outlier at the bottom of the boxplot.

NumberOfTime30.59DaysPastDueNotWorse

It shows number of times a borrower has been 30-59 days past due but no worse in the last 2 years.

str(train_data$NumberOfTime30.59DaysPastDueNotWorse)

##  int [1:150000] 2 0 1 0 1 0 0 0 0 0 ...

summary(train_data$NumberOfTime30.59DaysPastDueNotWorse)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.421   0.000  98.000

sd(train_data$NumberOfTime30.59DaysPastDueNotWorse)

## [1] 4.192781

It is an integer variable. Minimum value is zero,median is also zero. Mean is 0.421 ,SD is 4.192 and maximum value is 98. These give indication of presence of outliers.

Check the percentile distribution to know the presence of outliers.

quantile(train_data$NumberOfTime30.59DaysPastDueNotWorse,c(0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.85,0.9,0.95,1))

##   0%   1%   3%   5%   7%   9%  10%  20%  30%  40%  50%  60%  70%  80%  85% 
##    0    0    0    0    0    0    0    0    0    0    0    0    0    0    1 
##  90%  95% 100% 
##    1    2   98

100 percentile is 98 ,which is an outlier.

This variable range is from 0 to 98.It takes only integers values. Lets see it’s frequency distribution.

freq_table<-table(train_data$NumberOfTime30.59DaysPastDueNotWorse)
freq_table

## 
##      0      1      2      3      4      5      6      7      8      9 
## 126018  16033   4598   1754    747    342    140     54     25     12 
##     10     11     12     13     96     98 
##      4      1      2      1      5    264

This variables has values from 0 to 13 and 96,98. Last two are outliers.

Next plot boxplot and barplots to visualize the data.

qplot(data = train_data, x = NumberOfTime30.59DaysPastDueNotWorse)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(train_data, aes(y=NumberOfTime30.59DaysPastDueNotWorse, x = 1)) + geom_boxplot()

In this plot we can see the outliers.

DebtRatio

Debt ratio is obtained by dividing Monthly debt payments, alimony, living costs by monthly gross income.

str(train_data$DebtRatio)

##  num [1:150000] 0.803 0.1219 0.0851 0.036 0.0249 ...

summary(train_data$DebtRatio)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      0.0      0.2      0.4    353.0      0.9 329700.0

Normally debt ratio should be between 0 to 1. Somtimes it can exceed 1 ,if a person spends more than his income.Here its minimum is 0,mean is 353,median is 0.4. This indicates presence of outliers. Maximum value is 329700, which is not possible.

Lets see the percentile distribution

quantile(train_data$DebtRatio,c(0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.75,0.76,0.78,0.8,0.85,0.9,0.95,1))

##          10%          20%          30%          40%          50% 
## 3.087398e-02 1.337729e-01 2.136969e-01 2.874603e-01 3.665078e-01 
##          60%          70%          75%          76%          78% 
## 4.675064e-01 6.491891e-01 8.682538e-01 9.511839e-01 1.275069e+00 
##          80%          85%          90%          95%         100% 
## 4.000000e+00 2.691500e+02 1.267000e+03 2.449000e+03 3.296640e+05

Upto 76percentile it is less than 1.

Plot the boxplot.

ggplot(train_data, aes(y=DebtRatio, x = 1)) + geom_boxplot()

There are outlers present in the variable. We have to filter them before we use the data for model building.

MonthlyIncome

It is the monthly income of the barrower.

str(train_data$MonthlyIncome)

##  int [1:150000] 9120 2600 3042 3300 63588 3500 NA 3500 NA 23684 ...

summary(train_data$MonthlyIncome)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0    3400    5400    6670    8249 3009000   29731

This is an integer variable. It has missing values represented by ‘NA’.
Its minimum value is 0, which is practically impossible. Mean is 6670 and median is 5400 without considering NA values. Maximum value is 3009000.

We have to treat missing values first before going further into this variable analysis.

NumberOfOpenCreditLinesAndLoans

It indicates number of open loans (an installment loan such as car loan or mortgage) and lines of credit (such as credit cards).

str(train_data$NumberOfOpenCreditLinesAndLoans)

##  int [1:150000] 13 4 2 5 7 3 8 8 2 9 ...

summary(train_data$NumberOfOpenCreditLinesAndLoans)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   5.000   8.000   8.453  11.000  58.000

It is an integer variable. Its minimum value is 0,maximum value is 58. Its mean is 8.543,median is 8. Mean and median are close, so outliers may not be present.

Lets see percentile distribution to know the outliers presence.

quantile(train_data$NumberOfOpenCreditLinesAndLoans,c(0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.93,0.95,0.97,0.98,0.99,0.995,1))

##   10%   20%   30%   40%   50%   60%   70%   80%   90%   93%   95%   97% 
##     3     4     5     6     8     9    10    12    15    17    18    20 
##   98%   99% 99.5%  100% 
##    22    24    27    58

Highest value is 58 which is possible.

Lets check boxplot

ggplot(train_data, aes(y=NumberOfOpenCreditLinesAndLoans, x = 1)) + geom_boxplot()

No outliers in this variable.

NumberOfTimes90DaysLate

This variable represents number of times borrower has been 90 days or more past due.

str(train_data$NumberOfTimes90DaysLate)

##  int [1:150000] 0 0 1 0 0 0 0 0 0 0 ...

summary(train_data$NumberOfTimes90DaysLate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.266   0.000  98.000

It is an integer variable. Minimum value is zero,median is also zero. Mean is 0.266 and maximum value is 98. These give indication of presence of outliers.

Check the percentile distribution to know the presence of outliers.

quantile(train_data$NumberOfTimes90DaysLate,c(0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.85,0.9,0.95,0.97,0.99,1))

##   0%   1%   3%   5%   7%   9%  10%  20%  30%  40%  50%  60%  70%  80%  85% 
##    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0 
##  90%  95%  97%  99% 100% 
##    0    1    1    3   98

100 percentile is 98 ,which is an outlier.

This variable range is from 0 to 98.It takes only integers values. Lets see it’s frequency distribution.

freq_table<-table(train_data$NumberOfTimes90DaysLate)
freq_table

## 
##      0      1      2      3      4      5      6      7      8      9 
## 141662   5243   1555    667    291    131     80     38     21     19 
##     10     11     12     13     14     15     17     96     98 
##      8      5      2      4      2      2      1      5    264

This variables has values from 0 to 15 and 17,96,98. Last two are outliers.

Next plot boxplot and barplots to visualize the data.

barplot(freq_table,width=1,col = c("red","blue"),ylim = c(0, 140000),xlim=c(0,100),main="NumberOfTimes90DaysLate",xlab="Variable values",ylab="count")

ggplot(train_data, aes(y=NumberOfTimes90DaysLate, x = 1)) + geom_boxplot()

This plot shows presence of outliers.

NumberRealEstateLoansOrLines

It shows number of mortgage and real estate loans taken by the barrower including home equity lines of credit.

str(train_data$NumberRealEstateLoansOrLines)

##  int [1:150000] 6 0 0 0 1 1 3 0 0 4 ...

summary(train_data$NumberRealEstateLoansOrLines)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.000   1.018   2.000  54.000

It is an integer variable. Minimum value is zero,median is one. Mean is 1.018 and maximum value is 54. Mean and Median are close so there may not be outliers in this variable.

Check the percentile distribution to know the presence of outliers.

quantile(train_data$NumberRealEstateLoansOrLines,c(0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.85,0.9,0.95,0.97,0.99,1))

##   0%   1%   3%   5%   7%   9%  10%  20%  30%  40%  50%  60%  70%  80%  85% 
##    0    0    0    0    0    0    0    0    0    1    1    1    1    2    2 
##  90%  95%  97%  99% 100% 
##    2    3    3    4   54

100 percentile is 54 ,which is a possible value for this variable.

This variable range is from 0 to 54.It takes only integers values. Lets see it’s frequency distribution.

freq_table<-table(train_data$NumberRealEstateLoansOrLines)
freq_table

## 
##     0     1     2     3     4     5     6     7     8     9    10    11 
## 56188 52338 31522  6300  2170   689   320   171    93    78    37    23 
##    12    13    14    15    16    17    18    19    20    21    23    25 
##    18    15     7     7     4     4     2     2     2     1     2     3 
##    26    29    32    54 
##     1     1     1     1

This variables has values from 0 to 21 and 23,25,26,29,32,54.

Next plot boxplot and barplots to visualize the data.

barplot(freq_table,width=1,col = c("red","blue"),ylim = c(0, 60000),xlim=c(0,60),main="NumberRealEstateLoansOrLines",xlab="Variable values",ylab="count")

ggplot(train_data, aes(y=NumberRealEstateLoansOrLines, x = 1)) + geom_boxplot()

There are no outliers in this variable.

NumberOfTime60.89DaysPastDueNotWorse

It shows number of times borrower has been 60 to 89 days past due but no worse in the last 2 years.

str(train_data$NumberOfTime60.89DaysPastDueNotWorse)

##  int [1:150000] 0 0 0 0 0 0 0 0 0 0 ...

summary(train_data$NumberOfTime60.89DaysPastDueNotWorse)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2404  0.0000 98.0000

It is an integer variable. Minimum value is zero,median is 0. Mean is 0.2404 and maximum value is 98. These give indication of presence of outliers.

Check the percentile distribution to know the presence of outliers.

quantile(train_data$NumberOfTime60.89DaysPastDueNotWorse,c(0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.85,0.9,0.95,0.97,0.99,1))

##   0%   1%   3%   5%   7%   9%  10%  20%  30%  40%  50%  60%  70%  80%  85% 
##    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0 
##  90%  95%  97%  99% 100% 
##    0    1    1    2   98

100 percentile is 98 ,which is an outlier.

This variable range is from 0 to 98.It takes only integers values. Lets see it’s frequency distribution.

freq_table<-table(train_data$NumberOfTime60.89DaysPastDueNotWorse)
freq_table

## 
##      0      1      2      3      4      5      6      7      8      9 
## 142396   5731   1118    318    105     34     16      9      2      1 
##     11     96     98 
##      1      5    264

This variables has values from 0 to 9 and 11,96,98. Last two are outliers.

Next plot boxplot and barplots to visualize the data.

barplot(freq_table,width=1,col = c("red","blue"),ylim = c(0, 150000),xlim=c(0,100),main="NumberOfTime60.89DaysPastDueNotWorse",xlab="Variable values",ylab="count")

ggplot(train_data, aes(y=NumberOfTime60.89DaysPastDueNotWorse, x = 1)) + geom_boxplot()

This plot shows presence of outliers.

NumberOfDependents

It represents number of dependents in the family of borrower excluding himself.

str(train_data$NumberOfDependents)

##  int [1:150000] 2 1 0 0 0 1 0 0 NA 2 ...

summary(train_data$NumberOfDependents)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   0.757   1.000  20.000    3924

It is an integer variable.It has missing values represented by ‘NA’.
Its minimum value is 0. Mean is 0.757 and median is 0 without considering NA values. Maximum value is 20.

Lets tabularize what we found in univariate analysis

Variable	Missing Values	Outliers
X	Nill	Nill
SeriousDlqin2yrs	Nill	Nill
RevolvingUtilizationOfUnsecuredLines	Nill	Present(<10%)
age	Nill	Present(<10%)
NumberOfTime30.59DaysPastDueNotWorse	Nill	Present(<10%)
DebtRatio	Nill	Present(23.4%)
MonthlyIncome	Present(19.82%)	has to be Analysed
NumberOfOpenCreditLinesAndLoans	Nill	Nill
NumberOfTimes90DaysLate	Nill	Present(<10%)
NumberRealEstateLoansOrLines	Nill	Nill
NumberOfTime60.89DaysPastDueNotWorse	Nill	Present(<10%)
NumberOfDependents	Present(<10%)	has to be Analysed

We have to treat missing values first before going further into this variable analysis.

Next we build a model using uncleaned data and later clean the data.

Model Building

Since the predictor variable (SeriousDlqin2yrs) is YES or NO type ,first we will use logistic regression model.

We have only training data set but not test data set. So we divide our data set into two parts, first 125000 rows for training and remaining 25000 rows for testing.

First we build model without cleaning data, then we replace missing values,outliers with mean and build new model. We will check the accuracy of the model in both cases. If accuracy is increasing after treating data then we will keep the model otherwise we will discard it. We do this for other models as well, we will take the model which gives better results.

Logistic Regression

First lets create datasets for training and testing

train_dataset<-train_data[1:125000,]
test_dataset<-train_data[125001:150000,]

Next we will build the model using train_dataset

logmodel<- glm(SeriousDlqin2yrs ~ RevolvingUtilizationOfUnsecuredLines + age + NumberOfTime30.59DaysPastDueNotWorse + DebtRatio +   MonthlyIncome+NumberOfOpenCreditLinesAndLoans+NumberOfTimes90DaysLate+ NumberRealEstateLoansOrLines+ NumberOfTime60.89DaysPastDueNotWorse+NumberOfDependents, data = train_dataset, family = "binomial")

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Using this model we will predict the output of test_dataset

predicted_values<-ifelse(predict(logmodel,test_dataset)>0.5,1,0)

Here we kept threshold as 0.5. Any value greater than 0 is taken as 1. Any value less than or equal to 0 is taken as 0.

After predicting we need to find out the accuracy of the model, for that we calculate confusion matrix.

conf_matrix<-table(test_dataset[,2],predicted_values)
conf_matrix

##    predicted_values
##         0     1
##   0 18576    27
##   1  1402    41

Columns are predicted values, rows are actual values.

Accuracy is total true predictions divided by total predictions.We will also find sensitivity and specificity. Sensitivity is true positives/(true positives+false negatives) specificity is true negatives/(true negatives+false positivies)

accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/sum(conf_matrix)
accuracy

## [1] 0.928714

specificity<-conf_matrix[2,2]/(conf_matrix[2,2]+conf_matrix[2,1])
specificity

## [1] 0.02841303

sensitivity<-conf_matrix[1,1]/(conf_matrix[1,1]+conf_matrix[1,2])
sensitivity

## [1] 0.9985486

Our model accuracy is 0.928. This is without treating the data.Specificity is 0.028, which is very low, we need to improve it.

Next we will replace the missing values with their means and will build new model.

Missing values Treatment

MonthlyIncome and NumberOfDependents have missing values. We will replace them by their column mean values.We create new dataset.

In MonthlyIncome missing values are of 19.82%. So we create a new column NA_MonthlyIncome which indicates whether the value of MonthlyIncome in new dataset is origanal one(FALSE) or missing value replaced by the mean(TRUE).

train_data1<-train_data
c1<-c(1:150000)[is.na(train_data[,7])]
train_data1[c1,7]<-6670   
#We can get the mean of remaining observations in summary of that variable
train_data1$NA_MonthlyIncome<-is.na(train_data[,7])

In NumberOfDependents missing values are of only 2.616%, so we dont create any new column.We replace missing values by mean of remaining values.

c1<-c(1:150000)[is.na(train_data[,12])]
#We can get the mean of remaining observations in summary of that variable
train_data1[c1,12]<-0.757

Next we divide new dataset for testing and training

train_dataset1<-train_data1[1:125000,]
test_dataset1<-train_data1[125001:150000,]

Create the model and test it. Find its accuracy

logmodel1 <- glm(SeriousDlqin2yrs ~ RevolvingUtilizationOfUnsecuredLines + age + NumberOfTime30.59DaysPastDueNotWorse + DebtRatio +   MonthlyIncome+NumberOfOpenCreditLinesAndLoans+NumberOfTimes90DaysLate+ NumberRealEstateLoansOrLines+ NumberOfTime60.89DaysPastDueNotWorse+NumberOfDependents+NA_MonthlyIncome, data = train_dataset1, family = "binomial")

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

predicted_values1<-ifelse(predict(logmodel1,test_dataset1)>0.5,1,0)
conf_matrix<-table(test_dataset1[,2],predicted_values1)
conf_matrix

##    predicted_values1
##         0     1
##   0 23234    34
##   1  1679    53

accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/sum(conf_matrix)
accuracy

## [1] 0.93148

specificity<-conf_matrix[2,2]/(conf_matrix[2,2]+conf_matrix[2,1])
specificity

## [1] 0.03060046

sensitivity<-conf_matrix[1,1]/(conf_matrix[1,1]+conf_matrix[1,2])
sensitivity

## [1] 0.9985388

For this model accuracy is 0.931. There is an increase in accuracy of the model after replacing missing values.But not much improvement in specificity.

Next we treat outliers.

Outliers Removal

We create new dataset train_data2 in which we replace outliers with mean

train_data2<-train_data1

RevolvingUtilizationOfUnsecuredLines

RevolvingUtilizationOfUnsecuredLines has outliers. Since outliers percentage is less than 10 We will replace outliers with mean of reaming data.Outliers are with value greater than 1.

remain_mean<-mean(train_data1$RevolvingUtilizationOfUnsecuredLines[c(1:150000)[train_data1$RevolvingUtilizationOfUnsecuredLines<=1]])

train_data2[c(1:150000)[train_data1$RevolvingUtilizationOfUnsecuredLines>1],3]<-remain_mean

age

Next in age there is an outlier whose value is zero we replace it with other values mean.

remain_mean<-mean(train_data1$age[c(1:150000)[train_data1$age>0]])

train_data2[c(1:150000)[train_data1$age==0],4]<-remain_mean

NumberOfTime30.59DaysPastDueNotWorse

NumberOfTime30.59DaysPastDueNotWorse has values 96,98 as outliers which are of less than 10%. We treat the outliers based on the related variable ‘SeriousDlqin2yrs’. ‘NumberOfTime30.59DaysPastDueNotWorse’ is directly related to ‘SeriousDlqin2yrs’. So we create a frequency table between SeriousDlqin2yrs and NumberOfTime30.59DaysPastDueNotWorse.

cross_table<-table(train_data1$NumberOfTime30.59DaysPastDueNotWorse,train_data1$SeriousDlqin2yrs)
cross_table

##     
##           0      1
##   0  120977   5041
##   1   13624   2409
##   2    3379   1219
##   3    1136    618
##   4     429    318
##   5     188    154
##   6      66     74
##   7      26     28
##   8      17      8
##   9       8      4
##   10      1      3
##   11      0      1
##   12      1      1
##   13      0      1
##   96      1      4
##   98    121    143

For all the values in NumberOfTime30.59DaysPastDueNotWorse find the percentage of 0’s in SeriousDlqin2yrs.As both variables are related, We replace 96,98 with the values whose 0’s percentage is same as former values.

percent=0
for(i in 1:nrow(cross_table)){
  percent[i]=(cross_table[i,1]/(cross_table[i,1]+cross_table[i,2]))*100 
}
percent

##  [1] 95.99978 84.97474 73.48847 64.76625 57.42972 54.97076 47.14286
##  [8] 48.14815 68.00000 66.66667 25.00000  0.00000 50.00000  0.00000
## [15] 20.00000 45.83333

We can see that 98,6 values have nearly same percent and there are only 5 values of 96 in the variable. So we also replace 98 and also 96 by 6.

train_data2[c(1:150000)[train_data1[,5]>13],5]<-6

DebtRatio

DebtRatio has ooutliers.We take anything greater than 1 as outlier and replace them with remaining values mean. Outliers percentage is 23.4%. So we crete a new row Outlier_DebtRatio to indicate whether that value is outlier and replaced or not.

remain_mean<-mean(train_data1$DebtRatio[c(1:150000)[train_data1$DebtRatio<1]])
train_data2$Outlier_DebtRatio<-train_data1$DebtRatio>1
train_data2[c(1:150000)[train_data1$DebtRatio>1],6]<-remain_mean

MonthlyIncome

In MonthlyIncome we replaced missing values with mean lets check for outliers.

summary(train_data1$MonthlyIncome)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3903    6600    6670    7400 3009000

quantile(train_data1$MonthlyIncome,c(0,0.01,0.02,0.03,0.04,0.05,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.97,0.99,0.995,0.9975,1))

##         0%         1%         2%         3%         4%         5% 
##       0.00       0.00     600.00    1000.00    1300.00    1500.00 
##        10%        20%        30%        40%        50%        60% 
##    2325.00    3400.00    4333.00    5400.00    6600.00    6670.00 
##        70%        80%        90%        95%        97%        99% 
##    6670.00    8250.00   10750.00   13500.00   16000.00   23000.00 
##      99.5%     99.75%       100% 
##   31250.00   44395.19 3008750.00

Observing the percentile distribution we will take upper limit as 50000 and lower limit as 1000. Anything out of this range will be considered as outlier.Lets see total number of outliers.

sum(train_data1$MonthlyIncome>50000)

## [1] 301

sum(train_data1$MonthlyIncome<1000)

## [1] 4428

outliers are of 3.152%.

we replace outliers with median of remaining values.

median<-median(c(1:150000)[train_data1$MonthlyIncome<50000&train_data1$MonthlyIncome>1000])
median

## [1] 75010.5

train_data2$MonthlyIncome[c(1:150000)[train_data1$MonthlyIncome<50000&train_data1$MonthlyIncome>1000]]<-median

NumberOfTimes90DaysLate

Outlier treatment for NumberOfTimes90DaysLate is same as NumberOfTime30.59DaysPastDueNotWorse. Outliers are 96,98.

cross_table<-table(train_data1$NumberOfTimes90DaysLate,train_data1$SeriousDlqin2yrs)
cross_table

##     
##           0      1
##   0  135108   6554
##   1    3478   1765
##   2     779    776
##   3     282    385
##   4      96    195
##   5      48     83
##   6      32     48
##   7       7     31
##   8       6     15
##   9       5     14
##   10      3      5
##   11      2      3
##   12      1      1
##   13      2      2
##   14      1      1
##   15      2      0
##   17      0      1
##   96      1      4
##   98    121    143

percent=0
for(i in 1:nrow(cross_table)){
  percent[i]=(cross_table[i,1]/(cross_table[i,1]+cross_table[i,2]))*100 
}
percent

##  [1]  95.37349  66.33607  50.09646  42.27886  32.98969  36.64122  40.00000
##  [8]  18.42105  28.57143  26.31579  37.50000  40.00000  50.00000  50.00000
## [15]  50.00000 100.00000   0.00000  20.00000  45.83333

Values 98,3 has close percentage. We replace 96,98 with 3.

train_data2[c(1:150000)[train_data1[,9]>17],9]<-3

NumberOfTime60.89DaysPastDueNotWorse

This has 96,98 as outliers. This outlier treatment is same as NumberOfTime30.59DaysPastDueNotWorse.

cross_table<-table(train_data1$NumberOfTime60.89DaysPastDueNotWorse,train_data1$SeriousDlqin2yrs)
cross_table

##     
##           0      1
##   0  135140   7256
##   1    3954   1777
##   2     557    561
##   3     138    180
##   4      40     65
##   5      13     21
##   6       4     12
##   7       4      5
##   8       1      1
##   9       1      0
##   11      0      1
##   96      1      4
##   98    121    143

percent=0
for(i in 1:nrow(cross_table)){
  percent[i]=(cross_table[i,1]/(cross_table[i,1]+cross_table[i,2]))*100 
}
percent

##  [1]  94.90435  68.99319  49.82111  43.39623  38.09524  38.23529  25.00000
##  [8]  44.44444  50.00000 100.00000   0.00000  20.00000  45.83333

Values 98,7 has close percentage. We replace 96,98 with 7.

train_data2[c(1:150000)[train_data1[,11]>17],11]<-7

All the outliers and missing values are cleaned. We save the final dataset for future use.

write.csv(train_data2,".../cleaned-data.csv")

After Outlier Treatment, we will build the model again using new data set.We divide data set for training and testing

train_dataset2<-train_data2[1:125000,]
test_dataset2<-train_data2[125001:150000,]

Create the model and test it. Find its accuracy

logmodel2 <- glm(SeriousDlqin2yrs ~ RevolvingUtilizationOfUnsecuredLines + age + NumberOfTime30.59DaysPastDueNotWorse + DebtRatio +   MonthlyIncome+NumberOfOpenCreditLinesAndLoans+NumberOfTimes90DaysLate+ NumberRealEstateLoansOrLines+ NumberOfTime60.89DaysPastDueNotWorse+NumberOfDependents+ NA_MonthlyIncome + Outlier_DebtRatio, data = train_dataset2, family = "binomial")

predicted_values<-ifelse(predict(logmodel2,test_dataset2)>0.5,1,0)
conf_matrix<-table(test_dataset2[,2],predicted_values)
conf_matrix

##    predicted_values
##         0     1
##   0 23163   105
##   1  1566   166

accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/sum(conf_matrix)
accuracy

## [1] 0.93316

specificity<-conf_matrix[2,2]/(conf_matrix[2,2]+conf_matrix[2,1])
specificity

## [1] 0.09584296

sensitivity<-conf_matrix[1,1]/(conf_matrix[1,1]+conf_matrix[1,2])
sensitivity

## [1] 0.9954874

Accuracy of the model is 0.933 which is greater than that of previous model.Specificity is 0.1003, Which is very low.

Decision Tree

Specificity in the logistic regression model is not satisfactory. So we build a new model using decision trees. We need to install package ‘party’ for decision trees.

library("party")

dec_tree<-ctree(SeriousDlqin2yrs ~ RevolvingUtilizationOfUnsecuredLines + age + NumberOfTime30.59DaysPastDueNotWorse + DebtRatio +   MonthlyIncome+NumberOfOpenCreditLinesAndLoans+NumberOfTimes90DaysLate+ NumberRealEstateLoansOrLines+ NumberOfTime60.89DaysPastDueNotWorse+NumberOfDependents+ NA_MonthlyIncome + Outlier_DebtRatio, data = train_dataset2)

 summary(dec_tree)

##     Length      Class       Mode 
##          1 BinaryTree         S4

 predict_values<-predict(dec_tree, newdata=test_dataset2)
 predicted_values<-ifelse(predict_values>0.5,1,0)
conf_matrix<-table(test_dataset2[,2],predicted_values)
conf_matrix

##    predicted_values
##         0     1
##   0 23085   183
##   1  1438   294

accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/sum(conf_matrix)
accuracy

## [1] 0.93516

specificity<-conf_matrix[2,2]/(conf_matrix[2,2]+conf_matrix[2,1])
specificity

## [1] 0.169746

sensitivity<-conf_matrix[1,1]/(conf_matrix[1,1]+conf_matrix[1,2])
sensitivity

## [1] 0.9921351

Accuracy is 0.935 and specificity is 0.169 which are greater than the logistic regression model. But we need to increase the specificity of the model

Accuracy paradox

In our data there are way more number of positive cases(0) than negative cases(1).

table(train_data[,2])

## 
##      0      1 
## 139974  10026

Ratio of zeros to ones is nearly 14:1 which indicates our data is imbalanced data. Imbalanced data typically refers to a problem with classification where the classes are not represented equally.

In our case models are giving more than 90% accuracy, but the accuracy is only reflecting the underlying class distribution. If we notice the confusion matrix, we can see that accuracy is mostly driven by zeros. So even though it is misclassifying most of the one’s we are getting high accuracy. This situation is called Accuracy Paradox.

In our case if we classify a good person as bad, bank wont issue loan to him and bank losses interest. If we classify bad one as good, bank issues loan to him but he wont pay the loan. Bank will have huge loss in this case. It is ok even if we calssify a good one as bad, but we should not classify bad one as good. Which mean specificity should be as high as possible in our case. Our model has only 0.169 specificity which is too low. We need to increase this value.

We Averaging models to tackle the issue of imbalance and increase the specificity.

Averaging Models

The ratio of ones to zeros is 1:14. We divide data into two groups with ones and zeros

zeros_data<-train_data2[-c(1:150000)[train_data2[,2]==1],]
ones_data<-train_data2[-c(1:150000)[train_data2[,2]==0],]

We divide the zeros_data into 14 clusters using K-means clustering.Each time clustering changes when we run the code. So to get same clusters each time we write the line ‘set.seed(42)’ in the code.

set.seed(42)
fit <- kmeans(zeros_data, 14)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 6998700)

zeros_clust<-data.frame(zeros_data, fit$cluster)
table(zeros_clust[,15])

## 
##     1     2     3     4     5     6     7     8     9    10    11    12 
##  9733  9725 15115  1469  1514 15188  9804 15130 13949 11901 15061 10074 
##    13    14 
##  9746  1565

zeros_clust has new column indicating cluster number.Above table shows number of elements in each cluster.

Next we build 14 models using each cluster and ones_data.

ones_data will be used in all clusters , thus making most of the training sets a balanced ones in building models.

models_list <- vector(mode="list", length=14)
for(j in 1:14){
  models_list[[j]]<-ctree(SeriousDlqin2yrs ~ RevolvingUtilizationOfUnsecuredLines + age + NumberOfTime30.59DaysPastDueNotWorse + DebtRatio +   MonthlyIncome+NumberOfOpenCreditLinesAndLoans+NumberOfTimes90DaysLate+ NumberRealEstateLoansOrLines+ NumberOfTime60.89DaysPastDueNotWorse+NumberOfDependents+ NA_MonthlyIncome + Outlier_DebtRatio, data = rbind(ones_data,zeros_data[c(1:nrow(zeros_clust))[zeros_clust[,15]==j],]))
  
}

Next we find accuracy,specificity,sensitivity for all models.

specificity=0
accuracy=0
sensitivity=0
predict_values<-data.frame(matrix(0, ncol = 14, nrow = 150000))
for(j in 1:14){
predict_values[,j]<-predict(models_list[[j]], newdata=train_data2)
predicted_values<-ifelse(predict_values[,j]>0.5,1,0)
conf_matrix1<-table(train_data2[,2],predicted_values)

accuracy[j]<-(conf_matrix1[1,1]+conf_matrix1[2,2])/sum(conf_matrix1)
specificity[j]<-conf_matrix1[2,2]/(conf_matrix1[2,2]+conf_matrix1[2,1])
sensitivity[j]<-conf_matrix1[1,1]/(conf_matrix1[1,1]+conf_matrix1[1,2])

}
accuracy

##  [1] 0.7571667 0.7607400 0.8502800 0.0947400 0.0949000 0.8513067 0.7829267
##  [8] 0.8454733 0.8068267 0.7847400 0.8475600 0.7821267 0.7776200 0.0950200

sensitivity

##  [1] 0.75431151 0.75855516 0.86400331 0.03087002 0.03110578 0.86520354
##  [7] 0.78372412 0.85819509 0.81168646 0.78580308 0.86078843 0.78298113
## [13] 0.77784446 0.03129867

specificity

##  [1] 0.7970277 0.7912428 0.6586874 0.9864353 0.9855376 0.6572910 0.7717933
##  [8] 0.6678636 0.7389787 0.7698983 0.6628765 0.7701975 0.7744863 0.9846399

As accuracy or sensitivity is increasing to a model, its specificity is decreasing. We discard models 4,5,14 as their accuracies are very low. We take average of predicted probabilities of remaining models. We use this averaged probability to get output.

predict_avg<-predict_values[,1]
for(j in c(1,2,3,6,7,8,9,10,11,12,13)){
     predict_avg<-predict_avg+predict_values[,j]
      }
predict_avg<-predict_avg/11
predicted_values<-ifelse(predict_avg>0.5,1,0)
 
conf_matrix<-table(train_data2[,2],predicted_values)
accuracy_avg<-(conf_matrix[1,1]+conf_matrix[2,2])/sum(conf_matrix)
accuracy_avg

## [1] 0.78316

specificity_avg<-conf_matrix[2,2]/(conf_matrix[2,2]+conf_matrix[2,1])
specificity_avg

## [1] 0.778077

sensitivity_avg<-conf_matrix[1,1]/(conf_matrix[1,1]+conf_matrix[1,2])
sensitivity_avg

## [1] 0.7835241

Specificity of the averaged model is 0.778 and accuracy is 0.783 which are good values. We can increase the specificity by decreasing the threshold, but buy doing so accuracy,sensitivity decreases.

Lets chaeck specificity,accuracy for 0.4 threshold.

predicted_values<-ifelse(predict_avg>0.4,1,0)
 
conf_matrix<-table(train_data2[,2],predicted_values)
accuracy_avg<-(conf_matrix[1,1]+conf_matrix[2,2])/sum(conf_matrix)
accuracy_avg

## [1] 0.71504

specificity_avg<-conf_matrix[2,2]/(conf_matrix[2,2]+conf_matrix[2,1])
specificity_avg

## [1] 0.8401157

sensitivity_avg<-conf_matrix[1,1]/(conf_matrix[1,1]+conf_matrix[1,2])
sensitivity_avg

## [1] 0.7060811

Specificity increased to 0.84 and accuracy decreased to 0.71.

Consumer Loan Default Prediction

You can download the datasets .

Content

Problem Statement

Business Model

Data Exploration

Univariate Analysis

“X”

SeriousDlqin2yrs

RevolvingUtilizationOfUnsecuredLines

age

NumberOfTime30.59DaysPastDueNotWorse

DebtRatio

MonthlyIncome

NumberOfOpenCreditLinesAndLoans

NumberOfTimes90DaysLate

NumberRealEstateLoansOrLines

NumberOfTime60.89DaysPastDueNotWorse

NumberOfDependents

Model Building

Logistic Regression

Missing values Treatment

Outliers Removal

Decision Tree

Averaging Models