• No products in the cart.

Consumer Loan Default Prediction

Before start our lesson please download the datasets.

CS5 Consumer Loan Default Prediction

Problem Statement:

Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit.

Credit scoring algorithms, which make a guess at the probability of default, are the method banks use to determine whether or not a loan should be granted.

The goal is to build a model that borrowers can use to help make the best financial decisions.The data is raw, you may have to spend considerable amount of time for validating and cleaning the data

Methods:

i used two popular data mining algorithms (decision tree and Naïve Bayesian classifier) along with a most commonly used statistical method (logistic regression) to develop the prediction models using a large dataset (150000 instances).

The problem is to classify borrower as defaulter or non defaulter. It is commonly desired for banks to classify borrower accurately so as to manage their loan risk better and increase business.

Data Importing

In [136]:
import pandas as pd
loan=pd.read_csv("C:\\Users\\Personal\\Google Drive\\cs-training.csv")
loan.shape
Out[136]:
(150000, 12)

Data set has 150000 rows and 12 variables.

In [3]:
loan.columns.values
Out[3]:
array(['Sr_No', 'SeriousDlqin2yrs', 'RevolvingUtilizationOfUnsecuredLines',
       'age', 'NumberOfTime30-59DaysPastDueNotWorse', 'DebtRatio',
       'MonthlyIncome', 'NumberOfOpenCreditLinesAndLoans',
       'NumberOfTimes90DaysLate', 'NumberRealEstateLoansOrLines',
       'NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfDependents'], dtype=object)

Variable Name: Description

1. Sr_No:serial number

2. SeriousDlqin2yrs : Person experienced 90 days past due delinquency or worse 


3. RevolvingUtilizationOfUnsecuredLines :Total balance on credit cards and personal lines of credit except real estate and no installment debt like car loans divided by the sum of credit limits. 

4. age: Age of borrower in years 

5. NumberOfTime30-59DaysPastDueNotWorse: Number of times borrower has been 30-59 days past due but no worse in the last 2 years. 

6. DebtRatio: Monthly debt payments, alimony,living costs divided by monthy gross income

7. MonthlyIncome :Monthly income 

8. NumberOfOpenCreditLinesAndLoans: Number of Open loans (installment like car loan or mortgage) and Lines of credit (e.g. credit cards) 

9. NumberOfTimes90DaysLate: Number of times borrower has been 90 days or more past due. 

10. NumberRealEstateLoansOrLines: Number of mortgage and real estate loans including home equity lines of credit 

11. NumberOfTime60-89DaysPastDueNotWorse: Number of times borrower has been 60-89 days past due but no worse in the last 2 years. 

12. NumberOfDependents: Number of dependents in family excluding themselves (spouse, children etc.)
In [4]:
loan.head()
Out[4]:
Sr_No SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age NumberOfTime30-59DaysPastDueNotWorse DebtRatio MonthlyIncome NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate NumberRealEstateLoansOrLines NumberOfTime60-89DaysPastDueNotWorse NumberOfDependents
0 1 1 0.766127 45 2 0.802982 9120.0 13 0 6 0 2.0
1 2 0 0.957151 40 0 0.121876 2600.0 4 0 0 0 1.0
2 3 0 0.658180 38 1 0.085113 3042.0 2 1 0 0 0.0
3 4 0 0.233810 30 0 0.036050 3300.0 5 0 0 0 0.0
4 5 0 0.907239 49 1 0.024926 63588.0 7 0 1 0 0.0

Data Exploration

In [5]:
import pandas as pd
import sklearn as sk
import math
import numpy as np
from scipy import stats
import matplotlib as matlab
import statsmodels
loan=pd.read_csv("C:\\Users\\Personal\\Google Drive\\cs-training.csv")
loan.shape
loan.columns.values
loan.head(10)
loan.describe()
Out[5]:
Sr_No SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age NumberOfTime30-59DaysPastDueNotWorse DebtRatio MonthlyIncome NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate NumberRealEstateLoansOrLines NumberOfTime60-89DaysPastDueNotWorse NumberOfDependents
count 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 1.202690e+05 150000.000000 150000.000000 150000.000000 150000.000000 146076.000000
mean 75000.500000 0.066840 6.048438 52.295207 0.421033 353.005076 6.670221e+03 8.452760 0.265973 1.018240 0.240387 0.757222
std 43301.414527 0.249746 249.755371 14.771866 4.192781 2037.818523 1.438467e+04 5.145951 4.169304 1.129771 4.155179 1.115086
min 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000
25% 37500.750000 0.000000 0.029867 41.000000 0.000000 0.175074 3.400000e+03 5.000000 0.000000 0.000000 0.000000 0.000000
50% 75000.500000 0.000000 0.154181 52.000000 0.000000 0.366508 5.400000e+03 8.000000 0.000000 1.000000 0.000000 0.000000
75% 112500.250000 0.000000 0.559046 63.000000 0.000000 0.868254 8.249000e+03 11.000000 0.000000 2.000000 0.000000 1.000000
max 150000.000000 1.000000 50708.000000 109.000000 98.000000 329664.000000 3.008750e+06 58.000000 98.000000 54.000000 98.000000 20.000000

describe will show the minimum, maximum, mean, median, 1st quartile, 3rd quartile of all the variables in the data set. It also shows missing values in the data set.In our dataset, variables ‘MonthlyIncome’ and ‘NumberOfDependents’ have NA values.Summary gives mean of variables having NA values by excluding them.

checking missing values

In [6]:
loan.isnull().sum()
Out[6]:
Sr_No                                       0
SeriousDlqin2yrs                            0
RevolvingUtilizationOfUnsecuredLines        0
age                                         0
NumberOfTime30-59DaysPastDueNotWorse        0
DebtRatio                                   0
MonthlyIncome                           29731
NumberOfOpenCreditLinesAndLoans             0
NumberOfTimes90DaysLate                     0
NumberRealEstateLoansOrLines                0
NumberOfTime60-89DaysPastDueNotWorse        0
NumberOfDependents                       3924
dtype: int64

monthlyincome and number of dependents have missing values.

SeriousDlqin2yrs:

Person experienced 90 days past due delinquency or worse.This is the target variable which we have to predict.it is a binary data.

In [7]:
loan['SeriousDlqin2yrs'].describe()
Out[7]:
count    150000.000000
mean          0.066840
std           0.249746
min           0.000000
25%           0.000000
50%           0.000000
75%           0.000000
max           1.000000
Name: SeriousDlqin2yrs, dtype: float64
In [8]:
frequency_table=loan['SeriousDlqin2yrs'].value_counts()
frequency_table
Out[8]:
0    139974
1     10026
Name: SeriousDlqin2yrs, dtype: int64

0 -indicates non-defaulters, 1 -indicates defaulters. Out of 150000 only 10026 are defaulters.

RevolvingUtilizationOfUnsecuredLines

This variable represents total balance on credit cards and personal lines of credit except real estate and no installment debt like car loans divided by the sum of credit limits. It is a ratio. Its’ value should be inbetween 0 and 1

In [9]:
loan['RevolvingUtilizationOfUnsecuredLines'].describe()
Out[9]:
count    150000.000000
mean          6.048438
std         249.755371
min           0.000000
25%           0.029867
50%           0.154181
75%           0.559046
max       50708.000000
Name: RevolvingUtilizationOfUnsecuredLines, dtype: float64
In [10]:
import matplotlib.pyplot as plt
%matplotlib inline
loan.boxplot(column="RevolvingUtilizationOfUnsecuredLines")
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:3: FutureWarning: 
The default value for 'return_type' will change to 'axes' in a future release.
 To use the future behavior now, set return_type='axes'.
 To keep the previous behavior and silence this warning, set return_type='dict'.
  app.launch_new_instance()
Out[10]:
{'boxes': [<matplotlib.lines.Line2D at 0x9ddab70>],
 'caps': [<matplotlib.lines.Line2D at 0x9de6cf0>,
  <matplotlib.lines.Line2D at 0x9de6d90>],
 'fliers': [<matplotlib.lines.Line2D at 0x9defb50>],
 'means': [],
 'medians': [<matplotlib.lines.Line2D at 0x9def270>],
 'whiskers': [<matplotlib.lines.Line2D at 0x9ddae70>,
  <matplotlib.lines.Line2D at 0x9de6830>]}

From the box plot we can see that there are outliers present in the variable.

age

It specifies age of the barrower in years. Its’ an integer lets see the summary of age

In [11]:
loan['age'].describe()
Out[11]:
count    150000.000000
mean         52.295207
std          14.771866
min           0.000000
25%          41.000000
50%          52.000000
75%          63.000000
max         109.000000
Name: age, dtype: float64

Minimum age is 0, which is not practical. Maximum age is 109 which is ok. Mean and median are very close which indicates outliers may not be present.

Lets see the percentile distribution.

In [12]:
loan['age'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1])
Out[12]:
0.00      0.0
0.01     24.0
0.03     27.0
0.05     29.0
0.07     30.0
0.09     32.0
0.10     33.0
0.20     39.0
0.30     44.0
0.40     48.0
0.50     52.0
0.60     56.0
0.70     61.0
0.80     65.0
0.90     72.0
1.00    109.0
Name: age, dtype: float64
In [13]:
import matplotlib.pyplot as plt
%matplotlib inline
loan.boxplot(column="age")
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:3: FutureWarning: 
The default value for 'return_type' will change to 'axes' in a future release.
 To use the future behavior now, set return_type='axes'.
 To keep the previous behavior and silence this warning, set return_type='dict'.
  app.launch_new_instance()
Out[13]:
{'boxes': [<matplotlib.lines.Line2D at 0x4d503b0>],
 'caps': [<matplotlib.lines.Line2D at 0x4d50ef0>,
  <matplotlib.lines.Line2D at 0x4d597f0>],
 'fliers': [<matplotlib.lines.Line2D at 0x4d59cb0>],
 'means': [],
 'medians': [<matplotlib.lines.Line2D at 0x4d59890>],
 'whiskers': [<matplotlib.lines.Line2D at 0x4d50990>,
  <matplotlib.lines.Line2D at 0x4d50e50>]}

We can notice an outlier at the top of the boxplot.

NumberOfTime30-59DaysPastDueNotWorse

It shows number of times a borrower has been 30-59 days past due but no worse in the last 2 years.

In [14]:
loan['NumberOfTime30-59DaysPastDueNotWorse'].describe()
Out[14]:
count    150000.000000
mean          0.421033
std           4.192781
min           0.000000
25%           0.000000
50%           0.000000
75%           0.000000
max          98.000000
Name: NumberOfTime30-59DaysPastDueNotWorse, dtype: float64

It is an integer variable. Minimum value is zero,median is also zero. Mean is 0.421 ,SD is 4.192 and maximum value is 98. These give indication of presence of outliers.

Check the percentile distribution to know the presence of outliers.

In [15]:
loan['NumberOfTime30-59DaysPastDueNotWorse'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.85,0.9,0.95,1])
Out[15]:
0.00     0.0
0.01     0.0
0.03     0.0
0.05     0.0
0.07     0.0
0.09     0.0
0.10     0.0
0.20     0.0
0.30     0.0
0.40     0.0
0.50     0.0
0.60     0.0
0.70     0.0
0.80     0.0
0.85     1.0
0.90     1.0
0.95     2.0
1.00    98.0
Name: NumberOfTime30-59DaysPastDueNotWorse, dtype: float64

100 percentile is 98 ,which is an outlier.

This variable range is from 0 to 98.It takes only integers values. Lets see it’s frequency distribution.

In [16]:
freq_tab=loan['NumberOfTime30-59DaysPastDueNotWorse'].value_counts()
freq_tab
Out[16]:
0     126018
1      16033
2       4598
3       1754
4        747
5        342
98       264
6        140
7         54
8         25
9         12
96         5
10         4
12         2
13         1
11         1
Name: NumberOfTime30-59DaysPastDueNotWorse, dtype: int64

This variables has values from 0 to 13 and 96,98. Last two are outliers.

Next plot boxplot to visualize the data.

In [17]:
import matplotlib.pyplot as plt
%matplotlib inline
loan.boxplot(column="NumberOfTime30-59DaysPastDueNotWorse")
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:3: FutureWarning: 
The default value for 'return_type' will change to 'axes' in a future release.
 To use the future behavior now, set return_type='axes'.
 To keep the previous behavior and silence this warning, set return_type='dict'.
  app.launch_new_instance()
Out[17]:
{'boxes': [<matplotlib.lines.Line2D at 0x4d9f190>],
 'caps': [<matplotlib.lines.Line2D at 0x4d9ffd0>,
  <matplotlib.lines.Line2D at 0x4da3530>],
 'fliers': [<matplotlib.lines.Line2D at 0x4da3eb0>],
 'means': [],
 'medians': [<matplotlib.lines.Line2D at 0x4da35d0>],
 'whiskers': [<matplotlib.lines.Line2D at 0x4d9f6d0>,
  <matplotlib.lines.Line2D at 0x4d9fb90>]}

DebtRatio

Debt ratio is obtained by dividing Monthly debt payments, alimony, living costs by monthly gross income

In [18]:
loan['DebtRatio'].describe()
Out[18]:
count    150000.000000
mean        353.005076
std        2037.818523
min           0.000000
25%           0.175074
50%           0.366508
75%           0.868254
max      329664.000000
Name: DebtRatio, dtype: float64

Normally debt ratio should be between 0 to 1. Somtimes it can exceed 1 ,if a person spends more than his income.Here its minimum is 0,mean is 353,median is 0.4. This indicates presence of outliers. Maximum value is 329700, which is not possible.

Lets see the percentile distribution

In [19]:
loan['DebtRatio'].quantile([0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.75,0.76,0.78,0.8,0.85,0.9,0.95,1])
Out[19]:
0.10         0.030874
0.20         0.133773
0.30         0.213697
0.40         0.287460
0.50         0.366508
0.60         0.467506
0.70         0.649189
0.75         0.868254
0.76         0.951184
0.78         1.275069
0.80         4.000000
0.85       269.150000
0.90      1267.000000
0.95      2449.000000
1.00    329664.000000
Name: DebtRatio, dtype: float64

Upto 76percentile it is less than 1.

Plot the boxplot.

In [20]:
import matplotlib.pyplot as plt
%matplotlib inline
loan.boxplot(column="DebtRatio")
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:3: FutureWarning: 
The default value for 'return_type' will change to 'axes' in a future release.
 To use the future behavior now, set return_type='axes'.
 To keep the previous behavior and silence this warning, set return_type='dict'.
  app.launch_new_instance()
Out[20]:
{'boxes': [<matplotlib.lines.Line2D at 0x4ddf0f0>],
 'caps': [<matplotlib.lines.Line2D at 0x4ddffb0>,
  <matplotlib.lines.Line2D at 0x4de6490>],
 'fliers': [<matplotlib.lines.Line2D at 0x4de6e10>],
 'means': [],
 'medians': [<matplotlib.lines.Line2D at 0x4de6530>],
 'whiskers': [<matplotlib.lines.Line2D at 0x4ddf630>,
  <matplotlib.lines.Line2D at 0x4ddfaf0>]}

There are outlers present in the variable. We have to filter them before we use the data for model building.

MonthlyIncome

It is the monthly income of the barrower.

In [21]:
loan['MonthlyIncome'].describe()
Out[21]:
count    1.202690e+05
mean     6.670221e+03
std      1.438467e+04
min      0.000000e+00
25%      3.400000e+03
50%      5.400000e+03
75%      8.249000e+03
max      3.008750e+06
Name: MonthlyIncome, dtype: float64

This is an integer variable. It has missing values represented by ‘NA’. Its minimum value is 0, which is practically impossible. Mean is 6670 and median is 5400 without considering NA values.

NumberOfOpenCreditLinesAndLoans

It indicates number of open loans (an installment loan such as car loan or mortgage) and lines of credit (such as credit cards)

In [22]:
loan['NumberOfOpenCreditLinesAndLoans'].describe()
Out[22]:
count    150000.000000
mean          8.452760
std           5.145951
min           0.000000
25%           5.000000
50%           8.000000
75%          11.000000
max          58.000000
Name: NumberOfOpenCreditLinesAndLoans, dtype: float64

It is an integer variable. Its minimum value is 0,maximum value is 58. Its mean is 8.543,median is 8. Mean and median are close, so outliers may not be present.

Lets see percentile distribution to know the outliers presence.

In [23]:
loan['NumberOfOpenCreditLinesAndLoans'].quantile([0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.93,0.95,0.97,0.98,0.99,0.995,1])
Out[23]:
0.100     3.0
0.200     4.0
0.300     5.0
0.400     6.0
0.500     8.0
0.600     9.0
0.700    10.0
0.800    12.0
0.900    15.0
0.930    17.0
0.950    18.0
0.970    20.0
0.980    22.0
0.990    24.0
0.995    27.0
1.000    58.0
Name: NumberOfOpenCreditLinesAndLoans, dtype: float64

Highest value is 58 which is possible.

Lets check boxplot

In [24]:
import matplotlib.pyplot as plt
%matplotlib inline
loan.boxplot(column="NumberOfOpenCreditLinesAndLoans")
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:3: FutureWarning: 
The default value for 'return_type' will change to 'axes' in a future release.
 To use the future behavior now, set return_type='axes'.
 To keep the previous behavior and silence this warning, set return_type='dict'.
  app.launch_new_instance()
Out[24]:
{'boxes': [<matplotlib.lines.Line2D at 0x4e20290>],
 'caps': [<matplotlib.lines.Line2D at 0x4e20dd0>,
  <matplotlib.lines.Line2D at 0x4e276d0>],
 'fliers': [<matplotlib.lines.Line2D at 0x4e27fb0>],
 'means': [],
 'medians': [<matplotlib.lines.Line2D at 0x4e27770>],
 'whiskers': [<matplotlib.lines.Line2D at 0x4e20870>,
  <matplotlib.lines.Line2D at 0x4e20d30>]}

NumberOfTimes90DaysLate

This variable represents number of times borrower has been 90 days or more past due.

In [25]:
loan['NumberOfTimes90DaysLate'].describe()
Out[25]:
count    150000.000000
mean          0.265973
std           4.169304
min           0.000000
25%           0.000000
50%           0.000000
75%           0.000000
max          98.000000
Name: NumberOfTimes90DaysLate, dtype: float64

It is an integer variable. Minimum value is zero,median is also zero. Mean is 0.266 and maximum value is 98. These give indication of presence of outliers.

Check the percentile distribution to know the presence of outliers.

In [26]:
loan['NumberOfTimes90DaysLate'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.85,0.9,0.95,0.97,0.99,1])
Out[26]:
0.00     0.0
0.01     0.0
0.03     0.0
0.05     0.0
0.07     0.0
0.09     0.0
0.10     0.0
0.20     0.0
0.30     0.0
0.40     0.0
0.50     0.0
0.60     0.0
0.70     0.0
0.80     0.0
0.85     0.0
0.90     0.0
0.95     1.0
0.97     1.0
0.99     3.0
1.00    98.0
Name: NumberOfTimes90DaysLate, dtype: float64

100 percentile is 98 ,which is an outlier.

This variable range is from 0 to 98.It takes only integers values. Lets see it’s frequency distribution.

In [27]:
freq=loan['NumberOfTimes90DaysLate'].value_counts()
freq
Out[27]:
0     141662
1       5243
2       1555
3        667
4        291
98       264
5        131
6         80
7         38
8         21
9         19
10         8
11         5
96         5
13         4
12         2
14         2
15         2
17         1
Name: NumberOfTimes90DaysLate, dtype: int64

This variables has values from 0 to 15 and 17,96,98. Last two are outliers.

Next plot boxplot to visualize the data.

In [28]:
import matplotlib.pyplot as plt
%matplotlib inline
loan.boxplot(column="NumberOfTimes90DaysLate")
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:3: FutureWarning: 
The default value for 'return_type' will change to 'axes' in a future release.
 To use the future behavior now, set return_type='axes'.
 To keep the previous behavior and silence this warning, set return_type='dict'.
  app.launch_new_instance()
Out[28]:
{'boxes': [<matplotlib.lines.Line2D at 0x4e5f830>],
 'caps': [<matplotlib.lines.Line2D at 0x4e667b0>,
  <matplotlib.lines.Line2D at 0x4e66c70>],
 'fliers': [<matplotlib.lines.Line2D at 0x4e6b570>],
 'means': [],
 'medians': [<matplotlib.lines.Line2D at 0x4e66d10>],
 'whiskers': [<matplotlib.lines.Line2D at 0x4e5fe10>,
  <matplotlib.lines.Line2D at 0x4e5feb0>]}

NumberRealEstateLoansOrLines

It shows number of mortgage and real estate loans taken by the barrower including home equity lines of credit.

In [29]:
loan['NumberRealEstateLoansOrLines'].describe()
Out[29]:
count    150000.000000
mean          1.018240
std           1.129771
min           0.000000
25%           0.000000
50%           1.000000
75%           2.000000
max          54.000000
Name: NumberRealEstateLoansOrLines, dtype: float64

It is an integer variable. Minimum value is zero,median is one. Mean is 1.018 and maximum value is 54. Mean and Median are close so there may not be outliers in this variable.

Check the percentile distribution to know the presence of outliers.

In [30]:
loan['NumberRealEstateLoansOrLines'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.85,0.9,0.95,0.97,0.99,1])
Out[30]:
0.00     0.0
0.01     0.0
0.03     0.0
0.05     0.0
0.07     0.0
0.09     0.0
0.10     0.0
0.20     0.0
0.30     0.0
0.40     1.0
0.50     1.0
0.60     1.0
0.70     1.0
0.80     2.0
0.85     2.0
0.90     2.0
0.95     3.0
0.97     3.0
0.99     4.0
1.00    54.0
Name: NumberRealEstateLoansOrLines, dtype: float64

100 percentile is 54 ,which is a possible value for this variable.

This variable range is from 0 to 54.It takes only integers values. Lets see it’s frequency distribution.

In [31]:
frque=loan['NumberRealEstateLoansOrLines'].value_counts()
frque
Out[31]:
0     56188
1     52338
2     31522
3      6300
4      2170
5       689
6       320
7       171
8        93
9        78
10       37
11       23
12       18
13       15
14        7
15        7
16        4
17        4
25        3
18        2
19        2
20        2
23        2
32        1
21        1
26        1
29        1
54        1
Name: NumberRealEstateLoansOrLines, dtype: int64

This variables has values from 0 to 21 and 23,25,26,29,32,54.

Next plot boxplot to visualize the data.

In [32]:
import matplotlib.pyplot as plt
%matplotlib inline
loan.boxplot(column="NumberRealEstateLoansOrLines")
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:3: FutureWarning: 
The default value for 'return_type' will change to 'axes' in a future release.
 To use the future behavior now, set return_type='axes'.
 To keep the previous behavior and silence this warning, set return_type='dict'.
  app.launch_new_instance()
Out[32]:
{'boxes': [<matplotlib.lines.Line2D at 0x4ea47f0>],
 'caps': [<matplotlib.lines.Line2D at 0x4eaa6d0>,
  <matplotlib.lines.Line2D at 0x4eaab90>],
 'fliers': [<matplotlib.lines.Line2D at 0x4eb0530>],
 'means': [],
 'medians': [<matplotlib.lines.Line2D at 0x4eaac30>],
 'whiskers': [<matplotlib.lines.Line2D at 0x4ea4d30>,
  <matplotlib.lines.Line2D at 0x4ea4dd0>]}

There are no outliers in this variable.

NumberOfTime60-89DaysPastDueNotWorse

It shows number of times borrower has been 60 to 89 days past due but no worse in the last 2 years.

In [33]:
loan['NumberOfTime60-89DaysPastDueNotWorse'].describe()
Out[33]:
count    150000.000000
mean          0.240387
std           4.155179
min           0.000000
25%           0.000000
50%           0.000000
75%           0.000000
max          98.000000
Name: NumberOfTime60-89DaysPastDueNotWorse, dtype: float64

It is an integer variable. Minimum value is zero,median is 0. Mean is 0.2404 and maximum value is 98. These give indication of presence of outliers.

Check the percentile distribution to know the presence of outliers.

In [34]:
loan['NumberOfTime60-89DaysPastDueNotWorse'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.85,0.9,0.95,0.97,0.99,1])
Out[34]:
0.00     0.0
0.01     0.0
0.03     0.0
0.05     0.0
0.07     0.0
0.09     0.0
0.10     0.0
0.20     0.0
0.30     0.0
0.40     0.0
0.50     0.0
0.60     0.0
0.70     0.0
0.80     0.0
0.85     0.0
0.90     0.0
0.95     1.0
0.97     1.0
0.99     2.0
1.00    98.0
Name: NumberOfTime60-89DaysPastDueNotWorse, dtype: float64

100 percentile is 98 ,which is an outlier.

This variable range is from 0 to 98.It takes only integers values. Lets see it’s frequency distribution.

In [35]:
frequency=loan['NumberOfTime60-89DaysPastDueNotWorse'].value_counts()
frequency
Out[35]:
0     142396
1       5731
2       1118
3        318
98       264
4        105
5         34
6         16
7          9
96         5
8          2
11         1
9          1
Name: NumberOfTime60-89DaysPastDueNotWorse, dtype: int64

This variables has values from 0 to 9 and 11,96,98. Last two are outliers.

Next plot boxplot to visualize the data.

In [36]:
import matplotlib.pyplot as plt
%matplotlib inline
loan.boxplot(column="NumberOfTime60-89DaysPastDueNotWorse")
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:3: FutureWarning: 
The default value for 'return_type' will change to 'axes' in a future release.
 To use the future behavior now, set return_type='axes'.
 To keep the previous behavior and silence this warning, set return_type='dict'.
  app.launch_new_instance()
Out[36]:
{'boxes': [<matplotlib.lines.Line2D at 0x4ee2190>],
 'caps': [<matplotlib.lines.Line2D at 0x4ee2cd0>,
  <matplotlib.lines.Line2D at 0x4ee95d0>],
 'fliers': [<matplotlib.lines.Line2D at 0x4ee9eb0>],
 'means': [],
 'medians': [<matplotlib.lines.Line2D at 0x4edced0>],
 'whiskers': [<matplotlib.lines.Line2D at 0x4ee2770>,
  <matplotlib.lines.Line2D at 0x4ee2c30>]}

NumberOfDependents

It represents number of dependents in the family of borrower excluding himself.

In [37]:
loan['NumberOfDependents'].describe()
Out[37]:
count    146076.000000
mean          0.757222
std           1.115086
min           0.000000
25%           0.000000
50%           0.000000
75%           1.000000
max          20.000000
Name: NumberOfDependents, dtype: float64

It is an integer variable.It has missing values represented by ‘NA’. Its minimum value is 0. Mean is 0.757 and median is 0 without considering NA values. Maximum value is 20.

Lets tabularize what we found in univariate analysis

Variable                                       Missing Values                    Outliers 

X                                              Nill                                  Nill

SeriousDlqin2yrs                               Nill                                  Nill

RevolvingUtilizationOfUnsecuredLines           Nill                              Present(<10%)

  age                                          Nill                              Present(<10%)

NumberOfTime30.59DaysPastDueNotWorse           Nill                               Present(<10%)

 DebtRatio                                     Nill                               Present(23.4%)

MonthlyIncome                               Present(19.82%)                      has to be Analysed

NumberOfOpenCreditLinesAndLoans                 Nill                                  Nill

NumberOfTimes90DaysLate                         Nill                              Present(<10%)

NumberRealEstateLoansOrLines                     Nill                                 Nill

NumberOfTime60.89DaysPastDueNotWorse             Nill                              Present(<10%)

NumberOfDependents                             Present(<10%)                       has to be Analysed

Missing values Treatment

MonthlyIncome and NumberOfDependents have missing values. We will replace them by their column mean values.We create new dataset.

In MonthlyIncome missing values are of 19.82%. So we create a new column NA_MonthlyIncome which indicates whether the value of MonthlyIncome in new dataset is origanal one(FALSE) or missing value replaced by the mean(TRUE).

In [38]:
loan1=loan
loan1['MonthlyIncome_new']=loan1['MonthlyIncome']
#to display all the rows which have missing values in 'MonthlyIncome_new' Column:
loan1.ix[loan1['MonthlyIncome_new'].isnull()]
#to get axis=0 index (row index) which have missing values in this column
loan1.ix[loan1['MonthlyIncome_new'].isnull()].index

#Once identified where missing values exist, the next task usually is to fill them (data imputation). Depending upon the context,
#in this case, I am assigning mean value(6670) to all those positions where missing value is present:
loan1.loc[loan1['MonthlyIncome_new'].isnull(),'MonthlyIncome_new']=6670
sum(loan1['MonthlyIncome_new'].isnull())
#and as the output suggests, this column doesn't have any missing values now
Out[38]:
0

In NumberOfDependents missing values are of only 2.616%, so we dont create any new column.We replace missing values by mean of remaining values.

In [39]:
loan1['NumberOfDependents_new']=loan1['NumberOfDependents']
#to display all the rows which have missing values in 'NumberOfDependents_new' Column:
loan1.ix[loan1['NumberOfDependents_new'].isnull()]
#to get axis=0 index (row index) which have missing values in this column
loan1.ix[loan1['NumberOfDependents_new'].isnull()].index

#Once identified where missing values exist, the next task usually is to fill them (data imputation). Depending upon the context,
#in this case, I am assigning mean value(0.757) to all those positions where missing value is present:
loan1.loc[loan1['NumberOfDependents_new'].isnull(),'NumberOfDependents_new']=0.757
sum(loan1['NumberOfDependents_new'].isnull())
#and as the output suggests, this column doesn't have any missing values now
Out[39]:
0

Model Building:Logistic regression

Since the predictor variable (SeriousDlqin2yrs) is YES or NO type ,first we will use logistic regression model.

We have only training data set but not test data set. So we divide our data set into two parts, first 120000 rows for training and remaining 30000 rows for testing.

In [42]:
from sklearn.cross_validation import train_test_split
features=list(loan1[["RevolvingUtilizationOfUnsecuredLines"]+["age"]+["NumberOfTime30-59DaysPastDueNotWorse"]+["DebtRatio"]+["MonthlyIncome_new"]+["NumberOfOpenCreditLinesAndLoans"]+["NumberOfTimes90DaysLate"]+["NumberRealEstateLoansOrLines"]+["NumberOfTime60-89DaysPastDueNotWorse"]+["NumberOfDependents_new"]])
X = loan1[features]
y = loan1['SeriousDlqin2yrs']
X_train, X_test, Y_train, Y_test = train_test_split(X, y, train_size=0.8) 
Y_test.shape, X_test.shape,X_train.shape,Y_train.shape
Out[42]:
((30000,), (30000, 10), (120000, 10), (120000,))
In [43]:
from sklearn.linear_model import LogisticRegression
logistic= LogisticRegression()
logistic.fit(X_train,Y_train)
Out[43]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
In [44]:
predict=logistic.predict(X_test)
In [45]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.cross_validation import train_test_split
from sklearn.metrics import confusion_matrix
cm1 = confusion_matrix(Y_test,predict)
print(cm1)
total1=sum(sum(cm1))
[[28003    30]
 [ 1940    27]]
In [46]:
accuracy1=(cm1[0,0]+cm1[1,1])/total1
accuracy1
Out[46]:
0.93433333333333335
In [47]:
specificity=cm1[1,1]/(cm1[1,1]+cm1[1,0])
specificity
Out[47]:
0.013726487036095577
In [48]:
sensitivity=cm1[0,0]/(cm1[0,0]+cm1[0,1])
sensitivity
Out[48]:
0.99892983269717828

Our model accuracy is 0.93.Specificity is 0.031, which is very low, we need to improve it.

Next we treat outliers.

Outliers Removal

We create new dataset loan2 in which we replace outliers with mean

In [49]:
loan2=loan1
loan2.shape
Out[49]:
(150000, 14)

RevolvingUtilizationOfUnsecuredLines

RevolvingUtilizationOfUnsecuredLines has outliers. Since outliers percentage is less than 10 We will replace outliers with mean of reaming data.Outliers are with value greater than 1.

In [50]:
remain_m=loan2['RevolvingUtilizationOfUnsecuredLines'][loan2['RevolvingUtilizationOfUnsecuredLines']<=1].mean()
remain_m
Out[50]:
0.3037815510208745
In [51]:
loan2['RevolvingUtilizationOfUnsecuredLines_new']=loan2['RevolvingUtilizationOfUnsecuredLines']
loan2['RevolvingUtilizationOfUnsecuredLines_new'][loan2['RevolvingUtilizationOfUnsecuredLines_new']>1]=remain_m
loan2['RevolvingUtilizationOfUnsecuredLines_new'].describe()
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
Out[51]:
count    150000.000000
mean          0.303782
std           0.334130
min           0.000000
25%           0.029867
50%           0.154181
75%           0.506929
max           1.000000
Name: RevolvingUtilizationOfUnsecuredLines_new, dtype: float64

age

Next in age there is an outlier whose value is zero we replace it with other values mean.

In [52]:
remain_mean=loan2['age'][loan2['age']>0].mean()
remain_mean
Out[52]:
52.295555303702024
In [53]:
loan2['age_new']=loan2['age']
loan2['age_new'][loan2['age_new']==0]=remain_m
loan2['age_new'].describe()
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
Out[53]:
count    150000.000000
mean         52.295209
std          14.771859
min           0.303782
25%          41.000000
50%          52.000000
75%          63.000000
max         109.000000
Name: age_new, dtype: float64

NumberOfTime30.59DaysPastDueNotWorse

NumberOfTime30.59DaysPastDueNotWorse has values 96,98 as outliers which are of less than 10%. We treat the outliers based on the related variable ‘SeriousDlqin2yrs’. ‘NumberOfTime30.59DaysPastDueNotWorse’ is directly related to ‘SeriousDlqin2yrs’. So we create a frequency table between SeriousDlqin2yrs and NumberOfTime30.59DaysPastDueNotWorse.

In [54]:
import pandas as pd
cross_table=pd.crosstab(loan2['NumberOfTime30-59DaysPastDueNotWorse'],loan2['SeriousDlqin2yrs'])
cross_table
Out[54]:
SeriousDlqin2yrs 0 1
NumberOfTime30-59DaysPastDueNotWorse
0 120977 5041
1 13624 2409
2 3379 1219
3 1136 618
4 429 318
5 188 154
6 66 74
7 26 28
8 17 8
9 8 4
10 1 3
11 0 1
12 1 1
13 0 1
96 1 4
98 121 143

For all the values in NumberOfTime30.59DaysPastDueNotWorse find the percentage of 0’s in SeriousDlqin2yrs.As both variables are related, We replace 96,98 with the values whose 0’s percentage is same as former values.

In [55]:
cross_table.astype(float).div(cross_table.sum(axis=1), axis=0)
Out[55]:
SeriousDlqin2yrs 0 1
NumberOfTime30-59DaysPastDueNotWorse
0 0.959998 0.040002
1 0.849747 0.150253
2 0.734885 0.265115
3 0.647662 0.352338
4 0.574297 0.425703
5 0.549708 0.450292
6 0.471429 0.528571
7 0.481481 0.518519
8 0.680000 0.320000
9 0.666667 0.333333
10 0.250000 0.750000
11 0.000000 1.000000
12 0.500000 0.500000
13 0.000000 1.000000
96 0.200000 0.800000
98 0.458333 0.541667

so the bad rate(defaluters) in group 98 is 54% and the nearest group with a bad rate is 52.8%. the apt substitution for 98 will be 6, since there is no other group whose bad rate(defaulter) is similar to this group . there are only 5 values in 96. So we also replace 98 and also 96 by 6.

In [56]:
loan2['NumberOfTime30-59DaysPastDueNotWorse_new']=loan2['NumberOfTime30-59DaysPastDueNotWorse']
loan2['NumberOfTime30-59DaysPastDueNotWorse_new'][loan2['NumberOfTime30-59DaysPastDueNotWorse_new']>13]=6
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
In [57]:
freq_tab=loan2['NumberOfTime30-59DaysPastDueNotWorse_new'].value_counts()
freq_tab
Out[57]:
0     126018
1      16033
2       4598
3       1754
4        747
6        409
5        342
7         54
8         25
9         12
10         4
12         2
13         1
11         1
Name: NumberOfTime30-59DaysPastDueNotWorse_new, dtype: int64

DebtRatio

DebtRatio has ooutliers.We take anything greater than 1 as outlier and replace them with remaining values mean. Outliers percentage is 23.4%. So we crete a new row Outlier_DebtRatio to indicate whether that value is outlier and replaced or not.

In [58]:
remain_mn=loan2['DebtRatio'][loan2['DebtRatio']<1].mean()
remain_mn
Out[58]:
0.3016293900012304
In [59]:
loan2['DebtRatio_new']=loan2['DebtRatio']
loan2['DebtRatio_new'][loan2['DebtRatio_new']>1]=remain_mn
loan2['DebtRatio_new'].describe()
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
Out[59]:
count    150000.000000
mean          0.302696
std           0.198018
min           0.000000
25%           0.175074
50%           0.301629
75%           0.380021
max           1.000000
Name: DebtRatio_new, dtype: float64

NumberOfTimes90DaysLate

Outlier treatment for NumberOfTimes90DaysLate is same as NumberOfTime30.59DaysPastDueNotWorse. Outliers are 96,98.

In [60]:
import pandas as pd
cross_table1=pd.crosstab(loan2['NumberOfTimes90DaysLate'],loan2['SeriousDlqin2yrs'])
cross_table1
Out[60]:
SeriousDlqin2yrs 0 1
NumberOfTimes90DaysLate
0 135108 6554
1 3478 1765
2 779 776
3 282 385
4 96 195
5 48 83
6 32 48
7 7 31
8 6 15
9 5 14
10 3 5
11 2 3
12 1 1
13 2 2
14 1 1
15 2 0
17 0 1
96 1 4
98 121 143
In [61]:
cross_table1.astype(float).div(cross_table1.sum(axis=1), axis=0)
Out[61]:
SeriousDlqin2yrs 0 1
NumberOfTimes90DaysLate
0 0.953735 0.046265
1 0.663361 0.336639
2 0.500965 0.499035
3 0.422789 0.577211
4 0.329897 0.670103
5 0.366412 0.633588
6 0.400000 0.600000
7 0.184211 0.815789
8 0.285714 0.714286
9 0.263158 0.736842
10 0.375000 0.625000
11 0.400000 0.600000
12 0.500000 0.500000
13 0.500000 0.500000
14 0.500000 0.500000
15 1.000000 0.000000
17 0.000000 1.000000
96 0.200000 0.800000
98 0.458333 0.541667

Values 98,3 has close percentage. We replace 96,98 with 3.

In [62]:
loan2['NumberOfTimes90DaysLate_new']=loan2['NumberOfTimes90DaysLate']
loan2['NumberOfTimes90DaysLate_new'][loan2['NumberOfTimes90DaysLate_new']>17]=3
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
In [63]:
fr_tab=loan2['NumberOfTimes90DaysLate_new'].value_counts()
fr_tab
Out[63]:
0     141662
1       5243
2       1555
3        936
4        291
5        131
6         80
7         38
8         21
9         19
10         8
11         5
13         4
15         2
12         2
14         2
17         1
Name: NumberOfTimes90DaysLate_new, dtype: int64

NumberOfTime60-89DaysPastDueNotWorse

This has 96,98 as outliers. This outlier treatment is same as NumberOfTime30.59DaysPastDueNotWorse.

In [64]:
import pandas as pd
cross_table2=pd.crosstab(loan2['NumberOfTime60-89DaysPastDueNotWorse'],loan2['SeriousDlqin2yrs'])
cross_table2
Out[64]:
SeriousDlqin2yrs 0 1
NumberOfTime60-89DaysPastDueNotWorse
0 135140 7256
1 3954 1777
2 557 561
3 138 180
4 40 65
5 13 21
6 4 12
7 4 5
8 1 1
9 1 0
11 0 1
96 1 4
98 121 143
In [65]:
cross_table2.astype(float).div(cross_table2.sum(axis=1), axis=0)
Out[65]:
SeriousDlqin2yrs 0 1
NumberOfTime60-89DaysPastDueNotWorse
0 0.949044 0.050956
1 0.689932 0.310068
2 0.498211 0.501789
3 0.433962 0.566038
4 0.380952 0.619048
5 0.382353 0.617647
6 0.250000 0.750000
7 0.444444 0.555556
8 0.500000 0.500000
9 1.000000 0.000000
11 0.000000 1.000000
96 0.200000 0.800000
98 0.458333 0.541667

Values 98,7 has close percentage. We replace 96,98 with 7.

In [66]:
loan2['NumberOfTime60-89DaysPastDueNotWorse_new']=loan2['NumberOfTime60-89DaysPastDueNotWorse']
loan2['NumberOfTime60-89DaysPastDueNotWorse_new'][loan2['NumberOfTime60-89DaysPastDueNotWorse_new']>11]=7
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
In [67]:
freqq_tab=loan2['NumberOfTime60-89DaysPastDueNotWorse_new'].value_counts()
freqq_tab
Out[67]:
0     142396
1       5731
2       1118
3        318
7        278
4        105
5         34
6         16
8          2
11         1
9          1
Name: NumberOfTime60-89DaysPastDueNotWorse_new, dtype: int64

All the outliers and missing values are cleaned. We save the final dataset for future use.

In [68]:
loan2.columns.values
Out[68]:
array(['Sr_No', 'SeriousDlqin2yrs', 'RevolvingUtilizationOfUnsecuredLines',
       'age', 'NumberOfTime30-59DaysPastDueNotWorse', 'DebtRatio',
       'MonthlyIncome', 'NumberOfOpenCreditLinesAndLoans',
       'NumberOfTimes90DaysLate', 'NumberRealEstateLoansOrLines',
       'NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfDependents',
       'MonthlyIncome_new', 'NumberOfDependents_new',
       'RevolvingUtilizationOfUnsecuredLines_new', 'age_new',
       'NumberOfTime30-59DaysPastDueNotWorse_new', 'DebtRatio_new',
       'NumberOfTimes90DaysLate_new',
       'NumberOfTime60-89DaysPastDueNotWorse_new'], dtype=object)
In [69]:
loan2.shape
Out[69]:
(150000, 20)
In [70]:
loan2.isnull().sum()
Out[70]:
Sr_No                                           0
SeriousDlqin2yrs                                0
RevolvingUtilizationOfUnsecuredLines            0
age                                             0
NumberOfTime30-59DaysPastDueNotWorse            0
DebtRatio                                       0
MonthlyIncome                               29731
NumberOfOpenCreditLinesAndLoans                 0
NumberOfTimes90DaysLate                         0
NumberRealEstateLoansOrLines                    0
NumberOfTime60-89DaysPastDueNotWorse            0
NumberOfDependents                           3924
MonthlyIncome_new                               0
NumberOfDependents_new                          0
RevolvingUtilizationOfUnsecuredLines_new        0
age_new                                         0
NumberOfTime30-59DaysPastDueNotWorse_new        0
DebtRatio_new                                   0
NumberOfTimes90DaysLate_new                     0
NumberOfTime60-89DaysPastDueNotWorse_new        0
dtype: int64

After Outlier Treatment, we will build the model again using new data set.We divide data set for training and testing

In [71]:
from sklearn.cross_validation import train_test_split
features=list(loan2[["RevolvingUtilizationOfUnsecuredLines_new"]+["age_new"]+["NumberOfTime30-59DaysPastDueNotWorse_new"]+["DebtRatio_new"]+["MonthlyIncome_new"]+["NumberOfOpenCreditLinesAndLoans"]+["NumberOfTimes90DaysLate_new"]+["NumberRealEstateLoansOrLines"]+["NumberOfTime60-89DaysPastDueNotWorse_new"]+["NumberOfDependents_new"]])
X1 = loan2[features]
y1 = loan2['SeriousDlqin2yrs']
X1_train, X1_test, Y1_train, Y1_test = train_test_split(X1, y1, train_size=0.8) 
Y1_test.shape, X1_test.shape,X1_train.shape,Y1_train.shape
Out[71]:
((30000,), (30000, 10), (120000, 10), (120000,))

Create the model and test it. Find its accuracy

In [72]:
from sklearn.linear_model import LogisticRegression
logistic1= LogisticRegression()
logistic1.fit(X1_train,Y1_train)
Out[72]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
In [216]:
predict1=logistic.predict(X1_test)
In [140]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.cross_validation import train_test_split
from sklearn.metrics import confusion_matrix
cm2 = confusion_matrix(Y1_test,predict1)
print(cm2)
total2=sum(sum(cm2))
[[27962    12]
 [ 2018     8]]
In [75]:
accuracy2=(cm2[0,0]+cm2[1,1])/total2
accuracy2
Out[75]:
0.93233333333333335
In [172]:
specificity1=cm2[1,1]/(cm2[1,1]+cm2[1,0])
specificity1
Out[172]:
0.0039486673247778872
In [174]:
sensitivity1=cm2[0,0]/(cm2[0,0]+cm2[0,1])
sensitivity1
Out[174]:
0.99957103024236793

Accuracy of the model is 0.933 which is greater than that of previous model.Specificity is 0.02, Which is very low.

In [78]:
import statsmodels.formula.api as sm
logistic2=sm.Logit(loan2['SeriousDlqin2yrs'],loan2[["RevolvingUtilizationOfUnsecuredLines_new"]+["age_new"]+["NumberOfTime30-59DaysPastDueNotWorse_new"]+["DebtRatio_new"]+["MonthlyIncome_new"]+["NumberOfOpenCreditLinesAndLoans"]+["NumberOfTimes90DaysLate_new"]+["NumberRealEstateLoansOrLines"]+["NumberOfTime60-89DaysPastDueNotWorse_new"]+["NumberOfDependents_new"]])
logistic2 
result1=logistic2.fit()
summary_1=result1.summary()
summary_1 
Optimization terminated successfully.
         Current function value: 0.206064
         Iterations 8
Out[78]:
Logit Regression Results
Dep. Variable: SeriousDlqin2yrs No. Observations: 150000
Model: Logit Df Residuals: 149990
Method: MLE Df Model: 9
Date: Fri, 18 Nov 2016 Pseudo R-squ.: 0.1602
Time: 08:21:37 Log-Likelihood: -30910.
converged: True LL-Null: -36808.
LLR p-value: 0.000
coef std err z P>|z| [95.0% Conf. Int.]
RevolvingUtilizationOfUnsecuredLines_new 0.8184 0.029 28.407 0.000 0.762 0.875
age_new -0.0611 0.001 -89.965 0.000 -0.062 -0.060
NumberOfTime30-59DaysPastDueNotWorse_new 0.4969 0.011 44.622 0.000 0.475 0.519
DebtRatio_new -0.4813 0.061 -7.906 0.000 -0.601 -0.362
MonthlyIncome_new -6.703e-05 3.62e-06 -18.538 0.000 -7.41e-05 -5.99e-05
NumberOfOpenCreditLinesAndLoans -0.0016 0.003 -0.564 0.573 -0.007 0.004
NumberOfTimes90DaysLate_new 0.6944 0.016 42.383 0.000 0.662 0.727
NumberRealEstateLoansOrLines 0.1742 0.011 15.947 0.000 0.153 0.196
NumberOfTime60-89DaysPastDueNotWorse_new 0.3185 0.022 14.343 0.000 0.275 0.362
NumberOfDependents_new -0.0278 0.010 -2.873 0.004 -0.047 -0.009

ROC AND AUC

In [79]:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
actual = Y1_test
false_positive_rate, true_positive_rate, thresholds = roc_curve(actual, predict1)
plt.title('Receiver Operating Characteristic')
roc_auc = auc(false_positive_rate, true_positive_rate)
plt.plot(false_positive_rate, true_positive_rate,label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

roc_auc = auc(false_positive_rate, true_positive_rate)
roc_auc
Out[79]:
0.50175984878357294

Decision Tree

Specificity in the logistic regression model is not satisfactory. So we build a new model using decision trees. We need to install package ‘party’ for decision trees.

library(“party”)

In [123]:
import pandas as pd
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X1_train,Y1_train)
clf
Out[123]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
In [124]:
predict2 = clf.predict(X1_test)
predict2
from sklearn.metrics import confusion_matrix
cm3=confusion_matrix(Y1_test, predict2)
print (cm3)
[[26408  1566]
 [ 1437   589]]
In [125]:
total3 = sum(sum(cm3))
accuracy3 = (cm3[0,0]+cm3[1,1])/total3
accuracy3
Out[125]:
0.89990000000000003
In [126]:
specificity2=cm3[1,1]/(cm3[1,1]+cm3[1,0])
specificity2
Out[126]:
0.29072063178677199
In [127]:
sensitivity2=cm3[0,0]/(cm3[0,0]+cm3[0,1])
sensitivity2
Out[127]:
0.9440194466290126

Naive Bayes

In [128]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X1_train,Y1_train)
Out[128]:
GaussianNB()
In [129]:
predict3 = model.predict(X1_test)
predict3
from sklearn.metrics import confusion_matrix
cm4=confusion_matrix(Y1_test, predict3)
print (cm4)
[[27168   806]
 [ 1356   670]]
In [130]:
from sklearn import metrics
print(metrics.classification_report(Y1_test, predict3))
print(metrics.confusion_matrix(Y1_test, predict3))
             precision    recall  f1-score   support

          0       0.95      0.97      0.96     27974
          1       0.45      0.33      0.38      2026

avg / total       0.92      0.93      0.92     30000

[[27168   806]
 [ 1356   670]]
In [131]:
total4 = sum(sum(cm4))
accuracy4 = (cm4[0,0]+cm4[1,1])/total4
accuracy4
Out[131]:
0.92793333333333339
In [175]:
specificity3=cm4[1,1]/(cm4[1,1]+cm4[1,0])
specificity3
Out[175]:
0.33070088845014806
In [176]:
sensitivity3=cm4[0,0]/(cm4[0,0]+cm4[0,1])
sensitivity3
Out[176]:
0.97118753127904478

K Fold Cross Validation

In [221]:
import numpy as np
from sklearn.cross_validation import KFold
from sklearn import cross_validation
from sklearn.cross_validation import train_test_split
features=list(loan2[["RevolvingUtilizationOfUnsecuredLines_new"]+["age_new"]+["NumberOfTime30-59DaysPastDueNotWorse_new"]+["DebtRatio_new"]+["MonthlyIncome_new"]+["NumberOfOpenCreditLinesAndLoans"]+["NumberOfTimes90DaysLate_new"]+["NumberRealEstateLoansOrLines"]+["NumberOfTime60-89DaysPastDueNotWorse_new"]+["NumberOfDependents_new"]])
X1 = loan2[features]
y1 = loan2['SeriousDlqin2yrs']
X1_train, X1_test, Y1_train, Y1_test = train_test_split(X1, y1, train_size=0.8) 
Y1_test.shape, X1_test.shape,X1_train.shape,Y1_train.shape
Out[221]:
((30000,), (30000, 10), (120000, 10), (120000,))
In [232]:
from sklearn.cross_validation import KFold
scores = cross_validation.cross_val_score(logistic1, X1, y1, cv=10)
scores
Out[232]:
array([ 0.93540431,  0.9359376 ,  0.93513766,  0.9359376 ,  0.93546667,
        0.9374    ,  0.9374625 ,  0.93739583,  0.93586239,  0.93639576])
In [233]:
print(logistic1.score(X1_test, Y1_test))
print(scores)
0.9371
[ 0.93540431  0.9359376   0.93513766  0.9359376   0.93546667  0.9374
  0.9374625   0.93739583  0.93586239  0.93639576]

K Fold Cross Validation accuracy is 93%

© 2020. All Rights Reserved.