Before we start the lesson, please download the datasets.
Problem Statement:
Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit.
Credit scoring algorithms, which estimate the probability of default, are the method banks use to determine whether or not a loan should be granted.
The goal is to build a model that borrowers can use to help make the best financial decisions. The data is raw, so you may have to spend a considerable amount of time validating and cleaning it.
Methods:
We use two popular data mining algorithms (a decision tree and a naïve Bayes classifier) along with a commonly used statistical method (logistic regression) to develop prediction models on a large dataset (150,000 instances).
The problem is to classify each borrower as a defaulter or a non-defaulter. Banks want to classify borrowers accurately so that they can manage their loan risk better and grow their business.
import pandas as pd
loan=pd.read_csv("C:\\Users\\Personal\\Google Drive\\cs-training.csv")
loan.shape
The dataset has 150,000 rows and 12 variables.
loan.columns.values
Variable Name: Description
1. Sr_No: Serial number
2. SeriousDlqin2yrs: Person experienced 90 days past due delinquency or worse
3. RevolvingUtilizationOfUnsecuredLines: Total balance on credit cards and personal lines of credit (excluding real estate and installment debt like car loans) divided by the sum of credit limits
4. age: Age of borrower in years
5. NumberOfTime30-59DaysPastDueNotWorse: Number of times borrower has been 30-59 days past due but no worse in the last 2 years
6. DebtRatio: Monthly debt payments, alimony, and living costs divided by monthly gross income
7. MonthlyIncome: Monthly income
8. NumberOfOpenCreditLinesAndLoans: Number of open loans (installment, like a car loan or mortgage) and lines of credit (e.g. credit cards)
9. NumberOfTimes90DaysLate: Number of times borrower has been 90 days or more past due
10. NumberRealEstateLoansOrLines: Number of mortgage and real estate loans, including home equity lines of credit
11. NumberOfTime60-89DaysPastDueNotWorse: Number of times borrower has been 60-89 days past due but no worse in the last 2 years
12. NumberOfDependents: Number of dependents in family, excluding themselves (spouse, children etc.)
loan.head()
import pandas as pd
import sklearn as sk
import math
import numpy as np
from scipy import stats
import matplotlib
import statsmodels
loan=pd.read_csv("C:\\Users\\Personal\\Google Drive\\cs-training.csv")
loan.shape
loan.columns.values
loan.head(10)
loan.describe()
describe() shows the count, mean, standard deviation, minimum, maximum, and the 25th, 50th (median), and 75th percentiles of every numeric variable in the dataset. A count lower than the number of rows signals missing values; in our dataset, the variables 'MonthlyIncome' and 'NumberOfDependents' have NA values. The means reported for those variables exclude the NA values.
loan.isnull().sum()
'MonthlyIncome' and 'NumberOfDependents' have missing values.
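As a quick sketch, we can express these counts as percentages of the 150,000 rows; they come to roughly 19.82% for MonthlyIncome and 2.62% for NumberOfDependents, figures we will rely on later when imputing:
# percentage of missing values per column
loan.isnull().sum() / len(loan) * 100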
loan['SeriousDlqin2yrs'].describe()
frequency_table=loan['SeriousDlqin2yrs'].value_counts()
frequency_table
0 indicates non-defaulters and 1 indicates defaulters. Out of 150,000 borrowers, only 10,026 are defaulters.
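The same split as percentages (a quick sketch) shows how imbalanced the classes are, which will matter when we evaluate the models; defaulters are only about 6.7% of the data:
# share of each class in the target variable
loan['SeriousDlqin2yrs'].value_counts(normalize=True) * 100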
loan['RevolvingUtilizationOfUnsecuredLines'].describe()
import matplotlib.pyplot as plt
%matplotlib inline
loan.boxplot(column="RevolvingUtilizationOfUnsecuredLines")
From the box plot we can see that there are outliers in this variable.
loan['age'].describe()
The minimum age is 0, which is not plausible. The maximum age is 109, which is acceptable. The mean and median are very close, which suggests outliers may not be present.
Let's look at the percentile distribution.
loan['age'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1])
import matplotlib.pyplot as plt
%matplotlib inline
loan.boxplot(column="age")
We can notice an outlier at the top of the boxplot.
loan['NumberOfTime30-59DaysPastDueNotWorse'].describe()
It is an integer variable. The minimum value is zero and the median is also zero. The mean is 0.421, the standard deviation is 4.192, and the maximum value is 98. These indicate the presence of outliers.
Check the percentile distribution to confirm the presence of outliers.
loan['NumberOfTime30-59DaysPastDueNotWorse'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.85,0.9,0.95,1])
The 100th percentile is 98, which is an outlier.
The variable ranges from 0 to 98 and takes only integer values. Let's see its frequency distribution.
freq_tab=loan['NumberOfTime30-59DaysPastDueNotWorse'].value_counts()
freq_tab
This variable takes values from 0 to 13, plus 96 and 98; the last two are outliers.
Next, plot a boxplot to visualize the data.
import matplotlib.pyplot as plt
%matplotlib inline
loan.boxplot(column="NumberOfTime30-59DaysPastDueNotWorse")
loan['DebtRatio'].describe()
Normally the debt ratio should be between 0 and 1. Sometimes it can exceed 1, if a person spends more than their income. Here the minimum is 0, the mean is 353, and the median is 0.4, which indicates the presence of outliers. The maximum value is 329,700, which is not possible.
Let's see the percentile distribution.
loan['DebtRatio'].quantile([0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.75,0.76,0.78,0.8,0.85,0.9,0.95,1])
Up to the 76th percentile, the value is less than 1.
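We can verify this fraction directly (a quick sketch):
# fraction of borrowers with DebtRatio at most 1; about 0.76, matching the quantiles above
(loan['DebtRatio'] <= 1).mean()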
Plot the boxplot.
import matplotlib.pyplot as plt
%matplotlib inline
loan.boxplot(column="DebtRatio")
There are outliers present in this variable. We have to treat them before we use the data for model building.
loan['MonthlyIncome'].describe()
This is an integer variable with missing values represented by 'NA'. Its minimum value is 0, which is practically impossible. The mean is 6670 and the median is 5400, excluding the NA values.
loan['NumberOfOpenCreditLinesAndLoans'].describe()
It is an integer variable. Its minimum value is 0 and its maximum is 58. Its mean is 8.543 and its median is 8. The mean and median are close, so outliers may not be present.
Let's look at the percentile distribution to check for outliers.
loan['NumberOfOpenCreditLinesAndLoans'].quantile([0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.93,0.95,0.97,0.98,0.99,0.995,1])
The highest value is 58, which is plausible.
Let's check the boxplot.
import matplotlib.pyplot as plt
%matplotlib inline
loan.boxplot(column="NumberOfOpenCreditLinesAndLoans")
loan['NumberOfTimes90DaysLate'].describe()
It is an integer variable. The minimum value is zero and the median is also zero. The mean is 0.266 and the maximum value is 98. These indicate the presence of outliers.
Check the percentile distribution to confirm the presence of outliers.
loan['NumberOfTimes90DaysLate'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.85,0.9,0.95,0.97,0.99,1])
The 100th percentile is 98, which is an outlier.
The variable ranges from 0 to 98 and takes only integer values. Let's see its frequency distribution.
freq=loan['NumberOfTimes90DaysLate'].value_counts()
freq
This variable takes values from 0 to 15, plus 17, 96, and 98; the last two are outliers.
Next, plot a boxplot to visualize the data.
import matplotlib.pyplot as plt
%matplotlib inline
loan.boxplot(column="NumberOfTimes90DaysLate")
loan['NumberRealEstateLoansOrLines'].describe()
It is an integer variable. The minimum value is zero and the median is one. The mean is 1.018 and the maximum value is 54. The mean and median are close, so there may not be outliers in this variable.
Check the percentile distribution to confirm.
loan['NumberRealEstateLoansOrLines'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.85,0.9,0.95,0.97,0.99,1])
The 100th percentile is 54, which is a possible value for this variable.
The variable ranges from 0 to 54 and takes only integer values. Let's see its frequency distribution.
frque=loan['NumberRealEstateLoansOrLines'].value_counts()
frque
This variable takes values from 0 to 21, plus 23, 25, 26, 29, 32, and 54.
Next, plot a boxplot to visualize the data.
import matplotlib.pyplot as plt
%matplotlib inline
loan.boxplot(column="NumberRealEstateLoansOrLines")
There are no outliers in this variable.
loan['NumberOfTime60-89DaysPastDueNotWorse'].describe()
It is an integer variable. The minimum value is zero and the median is zero. The mean is 0.2404 and the maximum value is 98. These indicate the presence of outliers.
Check the percentile distribution to confirm the presence of outliers.
loan['NumberOfTime60-89DaysPastDueNotWorse'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.85,0.9,0.95,0.97,0.99,1])
The 100th percentile is 98, which is an outlier.
The variable ranges from 0 to 98 and takes only integer values. Let's see its frequency distribution.
frequency=loan['NumberOfTime60-89DaysPastDueNotWorse'].value_counts()
frequency
This variable takes values from 0 to 9, plus 11, 96, and 98; the last two are outliers.
Next, plot a boxplot to visualize the data.
import matplotlib.pyplot as plt
%matplotlib inline
loan.boxplot(column="NumberOfTime60-89DaysPastDueNotWorse")
loan['NumberOfDependents'].describe()
It is an integer variable with missing values represented by 'NA'. Its minimum value is 0, its mean is 0.757, and its median is 0, excluding the NA values. The maximum value is 20.
Let's tabulate what we found in the univariate analysis:
Variable | Missing Values | Outliers
--- | --- | ---
Sr_No | None | None
SeriousDlqin2yrs | None | None
RevolvingUtilizationOfUnsecuredLines | None | Present (<10%)
age | None | Present (<10%)
NumberOfTime30-59DaysPastDueNotWorse | None | Present (<10%)
DebtRatio | None | Present (23.4%)
MonthlyIncome | Present (19.82%) | To be analysed
NumberOfOpenCreditLinesAndLoans | None | None
NumberOfTimes90DaysLate | None | Present (<10%)
NumberRealEstateLoansOrLines | None | None
NumberOfTime60-89DaysPastDueNotWorse | None | Present (<10%)
NumberOfDependents | Present (<10%) | To be analysed
Missing Values Treatment
MonthlyIncome and NumberOfDependents have missing values. We will replace them with their column means, working on a new copy of the dataset.
Since 19.82% of MonthlyIncome values are missing, we also create a column NA_MonthlyIncome that indicates whether each MonthlyIncome value in the new dataset is the original (False) or a missing value replaced by the mean (True); a sketch of this step appears below, after the rows with missing values are identified.
loan1=loan.copy()
loan1['MonthlyIncome_new']=loan1['MonthlyIncome']
#to display all the rows which have missing values in the 'MonthlyIncome_new' column:
loan1.loc[loan1['MonthlyIncome_new'].isnull()]
#to get the row index of the missing values in this column:
loan1.loc[loan1['MonthlyIncome_new'].isnull()].index
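As mentioned above, before filling anything in we can record which rows are about to be imputed (a minimal sketch; NA_MonthlyIncome is the indicator name suggested earlier):
# True where MonthlyIncome is missing (and will be mean-imputed), False where the value is original
loan1['NA_MonthlyIncome'] = loan1['MonthlyIncome'].isnull()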
#Once we have identified where missing values exist, the next task is to fill them (data imputation).
#In this case, we assign the mean value (6670) to every position where a value is missing:
loan1.loc[loan1['MonthlyIncome_new'].isnull(),'MonthlyIncome_new']=6670
sum(loan1['MonthlyIncome_new'].isnull())
#and as the output suggests, this column doesn't have any missing values now
Only 2.616% of NumberOfDependents values are missing, so we do not create an indicator column for it. We replace the missing values with the mean of the remaining values.
loan1['NumberOfDependents_new']=loan1['NumberOfDependents']
#to display all the rows which have missing values in the 'NumberOfDependents_new' column:
loan1.loc[loan1['NumberOfDependents_new'].isnull()]
#to get the row index of the missing values in this column:
loan1.loc[loan1['NumberOfDependents_new'].isnull()].index
#Again we fill the missing positions, this time with the mean value (0.757):
loan1.loc[loan1['NumberOfDependents_new'].isnull(),'NumberOfDependents_new']=0.757
sum(loan1['NumberOfDependents_new'].isnull())
#and as the output suggests, this column doesn't have any missing values now
Since the target variable (SeriousDlqin2yrs) is a yes/no type, we will first use a logistic regression model.
We have only a training dataset and no separate test dataset, so we randomly split our data into two parts: 80% (120,000 rows) for training and 20% (30,000 rows) for testing.
from sklearn.model_selection import train_test_split
features=["RevolvingUtilizationOfUnsecuredLines","age","NumberOfTime30-59DaysPastDueNotWorse","DebtRatio","MonthlyIncome_new","NumberOfOpenCreditLinesAndLoans","NumberOfTimes90DaysLate","NumberRealEstateLoansOrLines","NumberOfTime60-89DaysPastDueNotWorse","NumberOfDependents_new"]
X = loan1[features]
y = loan1['SeriousDlqin2yrs']
X_train, X_test, Y_train, Y_test = train_test_split(X, y, train_size=0.8)
Y_test.shape, X_test.shape,X_train.shape,Y_train.shape
from sklearn.linear_model import LogisticRegression
logistic= LogisticRegression()
logistic.fit(X_train,Y_train)
predict=logistic.predict(X_test)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
cm1 = confusion_matrix(Y_test,predict)
print(cm1)
total1=sum(sum(cm1))
accuracy1=(cm1[0,0]+cm1[1,1])/total1
accuracy1
#here the non-defaulter class (0) is treated as the positive class, so
#"specificity" is the recall for defaulters (1) and "sensitivity" the recall for non-defaulters (0)
specificity=cm1[1,1]/(cm1[1,1]+cm1[1,0])
specificity
sensitivity=cm1[0,0]/(cm1[0,0]+cm1[0,1])
sensitivity
Our model accuracy is 0.93, but the specificity (the fraction of actual defaulters correctly identified) is only 0.031, which is very low; we need to improve it.
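This is a symptom of the heavy class imbalance we saw earlier. One common remedy, shown here as a sketch rather than as the step we take next (we treat outliers first), is to reweight the classes in the logistic regression:
from sklearn.linear_model import LogisticRegression
# class_weight='balanced' upweights the rare defaulter class during fitting,
# typically raising defaulter recall at some cost to overall accuracy
logistic_bal = LogisticRegression(class_weight='balanced')
logistic_bal.fit(X_train, Y_train)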
Next we treat outliers.
loan2=loan1.copy()
loan2.shape
#mean of the valid utilization values (<=1)
remain_m=loan2['RevolvingUtilizationOfUnsecuredLines'][loan2['RevolvingUtilizationOfUnsecuredLines']<=1].mean()
remain_m
loan2['RevolvingUtilizationOfUnsecuredLines_new']=loan2['RevolvingUtilizationOfUnsecuredLines']
#replace values above 1 with that mean, using .loc to avoid chained assignment
loan2.loc[loan2['RevolvingUtilizationOfUnsecuredLines_new']>1,'RevolvingUtilizationOfUnsecuredLines_new']=remain_m
loan2['RevolvingUtilizationOfUnsecuredLines_new'].describe()
#mean of the plausible (non-zero) ages
remain_mean=loan2['age'][loan2['age']>0].mean()
remain_mean
loan2['age_new']=loan2['age']
#replace the impossible zero ages with the mean age
loan2.loc[loan2['age_new']==0,'age_new']=remain_mean
loan2['age_new'].describe()
NumberOfTime30-59DaysPastDueNotWorse
NumberOfTime30-59DaysPastDueNotWorse has the values 96 and 98 as outliers, which make up less than 10% of the data. We treat these outliers using the related variable 'SeriousDlqin2yrs': 'NumberOfTime30-59DaysPastDueNotWorse' is directly related to 'SeriousDlqin2yrs', so we create a frequency table between the two.
import pandas as pd
cross_table=pd.crosstab(loan2['NumberOfTime30-59DaysPastDueNotWorse'],loan2['SeriousDlqin2yrs'])
cross_table
For every value of NumberOfTime30-59DaysPastDueNotWorse, find the percentage of 0's in SeriousDlqin2yrs. Since the two variables are related, we replace 96 and 98 with the value whose 0's percentage is closest to theirs.
cross_table.astype(float).div(cross_table.sum(axis=1), axis=0)
The bad rate (defaulter percentage) in group 98 is 54%, and the nearest group by bad rate is group 6 at 52.8%; no other group has a similar bad rate, so the apt substitution for 98 is 6. Group 96 contains only 5 observations, so we replace both 96 and 98 with 6.
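The nearest group can also be found programmatically (a quick sketch using the crosstab built above):
#defaulter (bad) rate for each past-due count
bad_rate = cross_table[1] / cross_table.sum(axis=1)
#groups ranked by distance from group 98's bad rate; group 98 itself comes first, group 6 next
(bad_rate - bad_rate[98]).abs().sort_values().head()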
loan2['NumberOfTime30-59DaysPastDueNotWorse_new']=loan2['NumberOfTime30-59DaysPastDueNotWorse']
loan2.loc[loan2['NumberOfTime30-59DaysPastDueNotWorse_new']>13,'NumberOfTime30-59DaysPastDueNotWorse_new']=6
freq_tab=loan2['NumberOfTime30-59DaysPastDueNotWorse_new'].value_counts()
freq_tab
#mean of the valid debt ratios (<1)
remain_mn=loan2['DebtRatio'][loan2['DebtRatio']<1].mean()
remain_mn
loan2['DebtRatio_new']=loan2['DebtRatio']
loan2.loc[loan2['DebtRatio_new']>1,'DebtRatio_new']=remain_mn
loan2['DebtRatio_new'].describe()
import pandas as pd
cross_table1=pd.crosstab(loan2['NumberOfTimes90DaysLate'],loan2['SeriousDlqin2yrs'])
cross_table1
cross_table1.astype(float).div(cross_table1.sum(axis=1), axis=0)
Groups 98 and 3 have close defaulter percentages, so we replace 96 and 98 with 3.
loan2['NumberOfTimes90DaysLate_new']=loan2['NumberOfTimes90DaysLate']
loan2.loc[loan2['NumberOfTimes90DaysLate_new']>17,'NumberOfTimes90DaysLate_new']=3
fr_tab=loan2['NumberOfTimes90DaysLate_new'].value_counts()
fr_tab
import pandas as pd
cross_table2=pd.crosstab(loan2['NumberOfTime60-89DaysPastDueNotWorse'],loan2['SeriousDlqin2yrs'])
cross_table2
cross_table2.astype(float).div(cross_table2.sum(axis=1), axis=0)
Groups 98 and 7 have close defaulter percentages, so we replace 96 and 98 with 7.
loan2['NumberOfTime60-89DaysPastDueNotWorse_new']=loan2['NumberOfTime60-89DaysPastDueNotWorse']
loan2.loc[loan2['NumberOfTime60-89DaysPastDueNotWorse_new']>11,'NumberOfTime60-89DaysPastDueNotWorse_new']=7
freqq_tab=loan2['NumberOfTime60-89DaysPastDueNotWorse_new'].value_counts()
freqq_tab
All the outliers and missing values are now treated. After a final check, we save the cleaned dataset for future use (see below).
loan2.columns.values
loan2.shape
loan2.isnull().sum()
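With the checks passed, the cleaned dataset can be written out; the path and file name here are illustrative:
#save the cleaned dataset next to the raw file for future use (path is an assumption)
loan2.to_csv("C:\\Users\\Personal\\Google Drive\\cs-training-clean.csv", index=False)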
After the outlier treatment, we build the model again using the new dataset. We again divide the data into training and test sets.
from sklearn.model_selection import train_test_split
features=["RevolvingUtilizationOfUnsecuredLines_new","age_new","NumberOfTime30-59DaysPastDueNotWorse_new","DebtRatio_new","MonthlyIncome_new","NumberOfOpenCreditLinesAndLoans","NumberOfTimes90DaysLate_new","NumberRealEstateLoansOrLines","NumberOfTime60-89DaysPastDueNotWorse_new","NumberOfDependents_new"]
X1 = loan2[features]
y1 = loan2['SeriousDlqin2yrs']
X1_train, X1_test, Y1_train, Y1_test = train_test_split(X1, y1, train_size=0.8)
Y1_test.shape, X1_test.shape,X1_train.shape,Y1_train.shape
Create the model, test it, and find its accuracy.
from sklearn.linear_model import LogisticRegression
logistic1= LogisticRegression()
logistic1.fit(X1_train,Y1_train)
predict1=logistic1.predict(X1_test)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
cm2 = confusion_matrix(Y1_test,predict1)
print(cm2)
total2=sum(sum(cm2))
accuracy2=(cm2[0,0]+cm2[1,1])/total2
accuracy2
specificity1=cm2[1,1]/(cm2[1,1]+cm2[1,0])
specificity1
sensitivity1=cm2[0,0]/(cm2[0,0]+cm2[0,1])
sensitivity1
The accuracy of the model is 0.933, which is higher than the previous model's. The specificity is 0.02, which is still very low.
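Beyond reweighting, another common lever (again a sketch, not the step we take next; the 0.2 threshold is an arbitrary illustration) is to lower the decision threshold using the predicted probabilities:
import numpy as np
#predicted probability of default for each test borrower
probs = logistic1.predict_proba(X1_test)[:, 1]
#flag a borrower as a defaulter above a lower threshold than the default 0.5
predict_low = (probs > 0.2).astype(int)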
import statsmodels.api as sm
logistic2=sm.Logit(loan2['SeriousDlqin2yrs'],loan2[features])
logistic2
result1=logistic2.fit()
summary_1=result1.summary()
summary_1
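Note that sm.Logit does not add an intercept automatically; a minimal sketch that includes one:
#statsmodels requires the constant term to be added explicitly
X_const = sm.add_constant(loan2[features])
result_const = sm.Logit(loan2['SeriousDlqin2yrs'], X_const).fit()
result_const.summary()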
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
actual = Y1_test
#roc_curve needs continuous scores rather than hard 0/1 predictions to trace a full curve
scores_lr = logistic1.predict_proba(X1_test)[:,1]
false_positive_rate, true_positive_rate, thresholds = roc_curve(actual, scores_lr)
plt.title('Receiver Operating Characteristic')
roc_auc = auc(false_positive_rate, true_positive_rate)
plt.plot(false_positive_rate, true_positive_rate,label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
roc_auc = auc(false_positive_rate, true_positive_rate)
roc_auc
import pandas as pd
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X1_train,Y1_train)
clf
predict2 = clf.predict(X1_test)
predict2
from sklearn.metrics import confusion_matrix
cm3=confusion_matrix(Y1_test, predict2)
print (cm3)
total3 = sum(sum(cm3))
accuracy3 = (cm3[0,0]+cm3[1,1])/total3
accuracy3
specificity2=cm3[1,1]/(cm3[1,1]+cm3[1,0])
specificity2
sensitivity2=cm3[0,0]/(cm3[0,0]+cm3[0,1])
sensitivity2
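A fully grown tree like this tends to memorize the training data. As a sketch (max_depth=5 is an illustrative, untuned choice), limiting the depth usually generalizes better:
from sklearn import tree
#a shallow tree trades training-set fit for better generalization
clf_pruned = tree.DecisionTreeClassifier(max_depth=5)
clf_pruned.fit(X1_train, Y1_train)
clf_pruned.score(X1_test, Y1_test)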
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X1_train,Y1_train)
predict3 = model.predict(X1_test)
predict3
from sklearn.metrics import confusion_matrix
cm4=confusion_matrix(Y1_test, predict3)
print (cm4)
from sklearn import metrics
print(metrics.classification_report(Y1_test, predict3))
print(metrics.confusion_matrix(Y1_test, predict3))
total4 = sum(sum(cm4))
accuracy4 = (cm4[0,0]+cm4[1,1])/total4
accuracy4
specificity3=cm4[1,1]/(cm4[1,1]+cm4[1,0])
specificity3
sensitivity3=cm4[0,0]/(cm4[0,0]+cm4[0,1])
sensitivity3
import numpy as np
from sklearn.model_selection import KFold, train_test_split, cross_val_score
features=["RevolvingUtilizationOfUnsecuredLines_new","age_new","NumberOfTime30-59DaysPastDueNotWorse_new","DebtRatio_new","MonthlyIncome_new","NumberOfOpenCreditLinesAndLoans","NumberOfTimes90DaysLate_new","NumberRealEstateLoansOrLines","NumberOfTime60-89DaysPastDueNotWorse_new","NumberOfDependents_new"]
X1 = loan2[features]
y1 = loan2['SeriousDlqin2yrs']
X1_train, X1_test, Y1_train, Y1_test = train_test_split(X1, y1, train_size=0.8)
Y1_test.shape, X1_test.shape,X1_train.shape,Y1_train.shape
scores = cross_val_score(logistic1, X1, y1, cv=10)
scores
print(logistic1.score(X1_test, Y1_test))
print(scores)
The k-fold cross-validation accuracy is about 93%.
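Given the class imbalance, accuracy alone can be flattering; as a final sketch, the same cross-validation can be scored by AUC instead (StratifiedKFold keeps the roughly 6.7% defaulter rate in every fold):
from sklearn.model_selection import StratifiedKFold, cross_val_score
#AUC is threshold-independent and more informative than accuracy on imbalanced data
skf = StratifiedKFold(n_splits=10)
auc_scores = cross_val_score(logistic1, X1, y1, cv=skf, scoring='roc_auc')
auc_scores.mean()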