Before we start the lesson, please download the datasets.
Problem Statement:
Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit.
Credit scoring algorithms, which estimate the probability of default, are the method banks use to determine whether or not a loan should be granted.
The goal is to build a model that borrowers can use to help make the best financial decisions. The data is raw, so you may have to spend a considerable amount of time validating and cleaning it.
Methods:
We use two popular data mining algorithms (a decision tree and a naïve Bayes classifier) along with a commonly used statistical method (logistic regression) to develop prediction models on a large dataset (150,000 instances).
The problem is to classify each borrower as a defaulter or a non-defaulter. Banks want to classify borrowers accurately so that they can manage their loan risk better and grow their business.
import pandas as pd
loan=pd.read_csv("C:\\Users\\Personal\\Google Drive\\cs-training.csv")
loan.shape
The dataset has 150,000 rows and 12 variables.
loan.columns.values
Variable Name: Description
1. Sr_No: Serial number
2. SeriousDlqin2yrs: Person experienced 90 days past due delinquency or worse
3. RevolvingUtilizationOfUnsecuredLines: Total balance on credit cards and personal lines of credit (excluding real estate and installment debt like car loans) divided by the sum of credit limits
4. age: Age of borrower in years
5. NumberOfTime30-59DaysPastDueNotWorse: Number of times borrower has been 30-59 days past due but no worse in the last 2 years
6. DebtRatio: Monthly debt payments, alimony, and living costs divided by monthly gross income
7. MonthlyIncome: Monthly income
8. NumberOfOpenCreditLinesAndLoans: Number of open loans (installment, like a car loan or mortgage) and lines of credit (e.g. credit cards)
9. NumberOfTimes90DaysLate: Number of times borrower has been 90 days or more past due
10. NumberRealEstateLoansOrLines: Number of mortgage and real estate loans, including home equity lines of credit
11. NumberOfTime60-89DaysPastDueNotWorse: Number of times borrower has been 60-89 days past due but no worse in the last 2 years
12. NumberOfDependents: Number of dependents in family, excluding themselves (spouse, children etc.)
loan.head()
import pandas as pd
import sklearn as sk
import math
import numpy as np
from scipy import stats
import matplotlib
import statsmodels
loan=pd.read_csv("C:\\Users\\Personal\\Google Drive\\cs-training.csv")
loan.shape
loan.columns.values
loan.head(10)
loan.describe()
describe() shows the count, mean, standard deviation, minimum, maximum, and the 25th, 50th (median), and 75th percentiles of every numeric variable in the dataset. A count lower than the number of rows signals missing values; in our dataset, the variables 'MonthlyIncome' and 'NumberOfDependents' have NA values. The means reported for those variables exclude the NA values.
loan.isnull().sum()
'MonthlyIncome' and 'NumberOfDependents' have missing values.
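As a quick sketch, we can express these counts as percentages of the 150,000 rows; they come to roughly 19.82% for MonthlyIncome and 2.62% for NumberOfDependents, figures we will rely on later when imputing:
# percentage of missing values per column
loan.isnull().sum() / len(loan) * 100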
loan['SeriousDlqin2yrs'].describe()
frequency_table=loan['SeriousDlqin2yrs'].value_counts()
frequency_table
0 indicates non-defaulters and 1 indicates defaulters. Out of 150,000 borrowers, only 10,026 are defaulters.
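The same split as percentages (a quick sketch) shows how imbalanced the classes are, which will matter when we evaluate the models; defaulters are only about 6.7% of the data:
# share of each class in the target variable
loan['SeriousDlqin2yrs'].value_counts(normalize=True) * 100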
loan['RevolvingUtilizationOfUnsecuredLines'].describe()
import matplotlib.pyplot as plt
%matplotlib inline
loan.boxplot(column="RevolvingUtilizationOfUnsecuredLines")
From the box plot we can see that there are outliers in this variable.
loan['age'].describe()
The minimum age is 0, which is not plausible. The maximum age is 109, which is acceptable. The mean and median are very close, which suggests outliers may not be present.
Let's look at the percentile distribution.
loan['age'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1])
import matplotlib.pyplot as plt
%matplotlib inline
loan.boxplot(column="age")
We can notice an outlier at the top of the boxplot.
loan['NumberOfTime30-59DaysPastDueNotWorse'].describe()
It is an integer variable. The minimum value is zero and the median is also zero. The mean is 0.421, the standard deviation is 4.192, and the maximum value is 98. These indicate the presence of outliers.
Check the percentile distribution to confirm the presence of outliers.
loan['NumberOfTime30-59DaysPastDueNotWorse'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.85,0.9,0.95,1])
The 100th percentile is 98, which is an outlier.
The variable ranges from 0 to 98 and takes only integer values. Let's see its frequency distribution.
freq_tab=loan['NumberOfTime30-59DaysPastDueNotWorse'].value_counts()
freq_tab
This variable takes values from 0 to 13, plus 96 and 98; the last two are outliers.
Next, plot a boxplot to visualize the data.
import matplotlib.pyplot as plt
%matplotlib inline
loan.boxplot(column="NumberOfTime30-59DaysPastDueNotWorse")
loan['DebtRatio'].describe()
Normally the debt ratio should be between 0 and 1. Sometimes it can exceed 1, if a person spends more than their income. Here the minimum is 0, the mean is 353, and the median is 0.4, which indicates the presence of outliers. The maximum value is 329,700, which is not possible.
Let's see the percentile distribution.
loan['DebtRatio'].quantile([0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.75,0.76,0.78,0.8,0.85,0.9,0.95,1])
Up to the 76th percentile, the value is less than 1.
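We can verify this fraction directly (a quick sketch):
# fraction of borrowers with DebtRatio at most 1; about 0.76, matching the quantiles above
(loan['DebtRatio'] <= 1).mean()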
Plot the boxplot.
import matplotlib.pyplot as plt
%matplotlib inline
loan.boxplot(column="DebtRatio")
There are outliers present in this variable. We have to treat them before we use the data for model building.
loan['MonthlyIncome'].describe()
This is an integer variable with missing values represented by 'NA'. Its minimum value is 0, which is practically impossible. The mean is 6670 and the median is 5400, excluding the NA values.
loan['NumberOfOpenCreditLinesAndLoans'].describe()
It is an integer variable. Its minimum value is 0 and its maximum is 58. Its mean is 8.543 and its median is 8. The mean and median are close, so outliers may not be present.
Let's look at the percentile distribution to check for outliers.
loan['NumberOfOpenCreditLinesAndLoans'].quantile([0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.93,0.95,0.97,0.98,0.99,0.995,1])
The highest value is 58, which is plausible.
Let's check the boxplot.
import matplotlib.pyplot as plt
%matplotlib inline
loan.boxplot(column="NumberOfOpenCreditLinesAndLoans")
loan['NumberOfTimes90DaysLate'].describe()
It is an integer variable. The minimum value is zero and the median is also zero. The mean is 0.266 and the maximum value is 98. These indicate the presence of outliers.
Check the percentile distribution to confirm the presence of outliers.
loan['NumberOfTimes90DaysLate'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.85,0.9,0.95,0.97,0.99,1])
The 100th percentile is 98, which is an outlier.
The variable ranges from 0 to 98 and takes only integer values. Let's see its frequency distribution.
freq=loan['NumberOfTimes90DaysLate'].value_counts()
freq
This variable takes values from 0 to 15, plus 17, 96, and 98; the last two are outliers.
Next, plot a boxplot to visualize the data.
import matplotlib.pyplot as plt
%matplotlib inline
loan.boxplot(column="NumberOfTimes90DaysLate")
loan['NumberRealEstateLoansOrLines'].describe()
It is an integer variable. The minimum value is zero and the median is one. The mean is 1.018 and the maximum value is 54. The mean and median are close, so there may not be outliers in this variable.
Check the percentile distribution to confirm.
loan['NumberRealEstateLoansOrLines'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.85,0.9,0.95,0.97,0.99,1])
The 100th percentile is 54, which is a possible value for this variable.
The variable ranges from 0 to 54 and takes only integer values. Let's see its frequency distribution.
frque=loan['NumberRealEstateLoansOrLines'].value_counts()
frque
This variable takes values from 0 to 21, plus 23, 25, 26, 29, 32, and 54.
Next, plot a boxplot to visualize the data.
import matplotlib.pyplot as plt
%matplotlib inline
loan.boxplot(column="NumberRealEstateLoansOrLines")
There are no outliers in this variable.
loan['NumberOfTime60-89DaysPastDueNotWorse'].describe()
It is an integer variable. The minimum value is zero and the median is zero. The mean is 0.2404 and the maximum value is 98. These indicate the presence of outliers.
Check the percentile distribution to confirm the presence of outliers.
loan['NumberOfTime60-89DaysPastDueNotWorse'].quantile([0,0.01,0.03,0.05,0.07,0.09,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.85,0.9,0.95,0.97,0.99,1])
The 100th percentile is 98, which is an outlier.
The variable ranges from 0 to 98 and takes only integer values. Let's see its frequency distribution.
frequency=loan['NumberOfTime60-89DaysPastDueNotWorse'].value_counts()
frequency
This variable takes values from 0 to 9, plus 11, 96, and 98; the last two are outliers.
Next, plot a boxplot to visualize the data.
import matplotlib.pyplot as plt
%matplotlib inline
loan.boxplot(column="NumberOfTime60-89DaysPastDueNotWorse")
loan['NumberOfDependents'].describe()
It is an integer variable with missing values represented by 'NA'. Its minimum value is 0, its mean is 0.757, and its median is 0, excluding the NA values. The maximum value is 20.
Let's tabulate what we found in the univariate analysis:
Variable | Missing Values | Outliers
--- | --- | ---
Sr_No | None | None
SeriousDlqin2yrs | None | None
RevolvingUtilizationOfUnsecuredLines | None | Present (<10%)
age | None | Present (<10%)
NumberOfTime30-59DaysPastDueNotWorse | None | Present (<10%)
DebtRatio | None | Present (23.4%)
MonthlyIncome | Present (19.82%) | To be analysed
NumberOfOpenCreditLinesAndLoans | None | None
NumberOfTimes90DaysLate | None | Present (<10%)
NumberRealEstateLoansOrLines | None | None
NumberOfTime60-89DaysPastDueNotWorse | None | Present (<10%)
NumberOfDependents | Present (<10%) | To be analysed
Missing Values Treatment
MonthlyIncome and NumberOfDependents have missing values. We will replace them with their column means, working on a new copy of the dataset.
Since 19.82% of MonthlyIncome values are missing, we also create a column NA_MonthlyIncome that indicates whether each MonthlyIncome value in the new dataset is the original (False) or a missing value replaced by the mean (True); a sketch of this step appears below, after the rows with missing values are identified.
loan1=loan.copy()
loan1['MonthlyIncome_new']=loan1['MonthlyIncome']
#to display all the rows which have missing values in the 'MonthlyIncome_new' column:
loan1.loc[loan1['MonthlyIncome_new'].isnull()]
#to get the row index of the missing values in this column:
loan1.loc[loan1['MonthlyIncome_new'].isnull()].index
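As mentioned above, before filling anything in we can record which rows are about to be imputed (a minimal sketch; NA_MonthlyIncome is the indicator name suggested earlier):
# True where MonthlyIncome is missing (and will be mean-imputed), False where the value is original
loan1['NA_MonthlyIncome'] = loan1['MonthlyIncome'].isnull()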
#Once we have identified where missing values exist, the next task is to fill them (data imputation).
#In this case, we assign the mean value (6670) to every position where a value is missing:
loan1.loc[loan1['MonthlyIncome_new'].isnull(),'MonthlyIncome_new']=6670
sum(loan1['MonthlyIncome_new'].isnull())
#and as the output suggests, this column doesn't have any missing values now
Only 2.616% of NumberOfDependents values are missing, so we do not create an indicator column for it. We replace the missing values with the mean of the remaining values.
loan1['NumberOfDependents_new']=loan1['NumberOfDependents']
#to display all the rows which have missing values in the 'NumberOfDependents_new' column:
loan1.loc[loan1['NumberOfDependents_new'].isnull()]
#to get the row index of the missing values in this column:
loan1.loc[loan1['NumberOfDependents_new'].isnull()].index
#Again we fill the missing positions, this time with the mean value (0.757):
loan1.loc[loan1['NumberOfDependents_new'].isnull(),'NumberOfDependents_new']=0.757
sum(loan1['NumberOfDependents_new'].isnull())
#and as the output suggests, this column doesn't have any missing values now
Since the target variable (SeriousDlqin2yrs) is a yes/no type, we will first use a logistic regression model.
We have only a training dataset and no separate test dataset, so we randomly split our data into two parts: 80% (120,000 rows) for training and 20% (30,000 rows) for testing.
from sklearn.model_selection import train_test_split
features=["RevolvingUtilizationOfUnsecuredLines","age","NumberOfTime30-59DaysPastDueNotWorse","DebtRatio","MonthlyIncome_new","NumberOfOpenCreditLinesAndLoans","NumberOfTimes90DaysLate","NumberRealEstateLoansOrLines","NumberOfTime60-89DaysPastDueNotWorse","NumberOfDependents_new"]
X = loan1[features]
y = loan1['SeriousDlqin2yrs']
X_train, X_test, Y_train, Y_test = train_test_split(X, y, train_size=0.8)
Y_test.shape, X_test.shape,X_train.shape,Y_train.shape
from sklearn.linear_model import LogisticRegression
logistic= LogisticRegression()
logistic.fit(X_train,Y_train)
predict=logistic.predict(X_test)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
cm1 = confusion_matrix(Y_test,predict)
print(cm1)
total1=sum(sum(cm1))
accuracy1=(cm1[0,0]+cm1[1,1])/total1
accuracy1
#here the non-defaulter class (0) is treated as the positive class, so
#"specificity" is the recall for defaulters (1) and "sensitivity" the recall for non-defaulters (0)
specificity=cm1[1,1]/(cm1[1,1]+cm1[1,0])
specificity
sensitivity=cm1[0,0]/(cm1[0,0]+cm1[0,1])
sensitivity
Our model accuracy is 0.93, but the specificity (the fraction of actual defaulters correctly identified) is only 0.031, which is very low; we need to improve it.
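This is a symptom of the heavy class imbalance we saw earlier. One common remedy, shown here as a sketch rather than as the step we take next (we treat outliers first), is to reweight the classes in the logistic regression:
from sklearn.linear_model import LogisticRegression
# class_weight='balanced' upweights the rare defaulter class during fitting,
# typically raising defaulter recall at some cost to overall accuracy
logistic_bal = LogisticRegression(class_weight='balanced')
logistic_bal.fit(X_train, Y_train)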
Next we treat outliers.
loan2=loan1.copy()
loan2.shape
#mean of the valid utilization values (<=1)
remain_m=loan2['RevolvingUtilizationOfUnsecuredLines'][loan2['RevolvingUtilizationOfUnsecuredLines']<=1].mean()
remain_m
loan2['RevolvingUtilizationOfUnsecuredLines_new']=loan2['RevolvingUtilizationOfUnsecuredLines']
#replace values above 1 with that mean, using .loc to avoid chained assignment
loan2.loc[loan2['RevolvingUtilizationOfUnsecuredLines_new']>1,'RevolvingUtilizationOfUnsecuredLines_new']=remain_m
loan2['RevolvingUtilizationOfUnsecuredLines_new'].describe()
#mean of the plausible (non-zero) ages
remain_mean=loan2['age'][loan2['age']>0].mean()
remain_mean
loan2['age_new']=loan2['age']
#replace the impossible zero ages with the mean age
loan2.loc[loan2['age_new']==0,'age_new']=remain_mean
loan2['age_new'].describe()
NumberOfTime30-59DaysPastDueNotWorse
NumberOfTime30-59DaysPastDueNotWorse has the values 96 and 98 as outliers, which make up less than 10% of the data. We treat these outliers using the related variable 'SeriousDlqin2yrs': 'NumberOfTime30-59DaysPastDueNotWorse' is directly related to 'SeriousDlqin2yrs', so we create a frequency table between the two.
import pandas as pd
cross_table=pd.crosstab(loan2['NumberOfTime30-59DaysPastDueNotWorse'],loan2['SeriousDlqin2yrs'])
cross_table
For every value of NumberOfTime30-59DaysPastDueNotWorse, find the percentage of 0's in SeriousDlqin2yrs. Since the two variables are related, we replace 96 and 98 with the value whose 0's percentage is closest to theirs.
cross_table.astype(float).div(cross_table.sum(axis=1), axis=0)
The bad rate (defaulter percentage) in group 98 is 54%, and the nearest group by bad rate is group 6 at 52.8%; no other group has a similar bad rate, so the apt substitution for 98 is 6. Group 96 contains only 5 observations, so we replace both 96 and 98 with 6.
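The nearest group can also be found programmatically (a quick sketch using the crosstab built above):
#defaulter (bad) rate for each past-due count
bad_rate = cross_table[1] / cross_table.sum(axis=1)
#groups ranked by distance from group 98's bad rate; group 98 itself comes first, group 6 next
(bad_rate - bad_rate[98]).abs().sort_values().head()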
loan2['NumberOfTime30-59DaysPastDueNotWorse_new']=loan2['NumberOfTime30-59DaysPastDueNotWorse']
loan2.loc[loan2['NumberOfTime30-59DaysPastDueNotWorse_new']>13,'NumberOfTime30-59DaysPastDueNotWorse_new']=6
freq_tab=loan2['NumberOfTime30-59DaysPastDueNotWorse_new'].value_counts()
freq_tab
#mean of the valid debt ratios (<1)
remain_mn=loan2['DebtRatio'][loan2['DebtRatio']<1].mean()
remain_mn
loan2['DebtRatio_new']=loan2['DebtRatio']
loan2.loc[loan2['DebtRatio_new']>1,'DebtRatio_new']=remain_mn
loan2['DebtRatio_new'].describe()
import pandas as pd
cross_table1=pd.crosstab(loan2['NumberOfTimes90DaysLate'],loan2['SeriousDlqin2yrs'])
cross_table1
cross_table1.astype(float).div(cross_table1.sum(axis=1), axis=0)
Groups 98 and 3 have close defaulter percentages, so we replace 96 and 98 with 3.
loan2['NumberOfTimes90DaysLate_new']=loan2['NumberOfTimes90DaysLate']
loan2.loc[loan2['NumberOfTimes90DaysLate_new']>17,'NumberOfTimes90DaysLate_new']=3
fr_tab=loan2['NumberOfTimes90DaysLate_new'].value_counts()
fr_tab
import pandas as pd
cross_table2=pd.crosstab(loan2['NumberOfTime60-89DaysPastDueNotWorse'],loan2['SeriousDlqin2yrs'])
cross_table2
cross_table2.astype(float).div(cross_table2.sum(axis=1), axis=0)
Groups 98 and 7 have close defaulter percentages, so we replace 96 and 98 with 7.
loan2['NumberOfTime60-89DaysPastDueNotWorse_new']=loan2['NumberOfTime60-89DaysPastDueNotWorse']
loan2.loc[loan2['NumberOfTime60-89DaysPastDueNotWorse_new']>11,'NumberOfTime60-89DaysPastDueNotWorse_new']=7
freqq_tab=loan2['NumberOfTime60-89DaysPastDueNotWorse_new'].value_counts()
freqq_tab
All the outliers and missing values are now treated. After a final check, we save the cleaned dataset for future use (see below).
loan2.columns.values
loan2.shape
loan2.isnull().sum()
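With the checks passed, the cleaned dataset can be written out; the path and file name here are illustrative:
#save the cleaned dataset next to the raw file for future use (path is an assumption)
loan2.to_csv("C:\\Users\\Personal\\Google Drive\\cs-training-clean.csv", index=False)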
After the outlier treatment, we build the model again using the new dataset. We again divide the data into training and test sets.
from sklearn.model_selection import train_test_split
features=["RevolvingUtilizationOfUnsecuredLines_new","age_new","NumberOfTime30-59DaysPastDueNotWorse_new","DebtRatio_new","MonthlyIncome_new","NumberOfOpenCreditLinesAndLoans","NumberOfTimes90DaysLate_new","NumberRealEstateLoansOrLines","NumberOfTime60-89DaysPastDueNotWorse_new","NumberOfDependents_new"]
X1 = loan2[features]
y1 = loan2['SeriousDlqin2yrs']
X1_train, X1_test, Y1_train, Y1_test = train_test_split(X1, y1, train_size=0.8)
Y1_test.shape, X1_test.shape,X1_train.shape,Y1_train.shape
Create the model, test it, and find its accuracy.
from sklearn.linear_model import LogisticRegression
logistic1= LogisticRegression()
logistic1.fit(X1_train,Y1_train)
predict1=logistic1.predict(X1_test)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
cm2 = confusion_matrix(Y1_test,predict1)
print(cm2)
total2=sum(sum(cm2))
accuracy2=(cm2[0,0]+cm2[1,1])/total2
accuracy2
specificity1=cm2[1,1]/(cm2[1,1]+cm2[1,0])
specificity1
sensitivity1=cm2[0,0]/(cm2[0,0]+cm2[0,1])
sensitivity1
The accuracy of the model is 0.933, which is higher than the previous model's. The specificity is 0.02, which is still very low.
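Beyond reweighting, another common lever (again a sketch, not the step we take next; the 0.2 threshold is an arbitrary illustration) is to lower the decision threshold using the predicted probabilities:
import numpy as np
#predicted probability of default for each test borrower
probs = logistic1.predict_proba(X1_test)[:, 1]
#flag a borrower as a defaulter above a lower threshold than the default 0.5
predict_low = (probs > 0.2).astype(int)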
import statsmodels.api as sm
logistic2=sm.Logit(loan2['SeriousDlqin2yrs'],loan2[features])
logistic2
result1=logistic2.fit()
summary_1=result1.summary()
summary_1
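Note that sm.Logit does not add an intercept automatically; a minimal sketch that includes one:
#statsmodels requires the constant term to be added explicitly
X_const = sm.add_constant(loan2[features])
result_const = sm.Logit(loan2['SeriousDlqin2yrs'], X_const).fit()
result_const.summary()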
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
actual = Y1_test
#roc_curve needs continuous scores rather than hard 0/1 predictions to trace a full curve
scores_lr = logistic1.predict_proba(X1_test)[:,1]
false_positive_rate, true_positive_rate, thresholds = roc_curve(actual, scores_lr)
plt.title('Receiver Operating Characteristic')
roc_auc = auc(false_positive_rate, true_positive_rate)
plt.plot(false_positive_rate, true_positive_rate,label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
roc_auc = auc(false_positive_rate, true_positive_rate)
roc_auc
import pandas as pd
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X1_train,Y1_train)
clf
predict2 = clf.predict(X1_test)
predict2
from sklearn.metrics import confusion_matrix
cm3=confusion_matrix(Y1_test, predict2)
print (cm3)
total3 = sum(sum(cm3))
accuracy3 = (cm3[0,0]+cm3[1,1])/total3
accuracy3
specificity2=cm3[1,1]/(cm3[1,1]+cm3[1,0])
specificity2
sensitivity2=cm3[0,0]/(cm3[0,0]+cm3[0,1])
sensitivity2
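A fully grown tree like this tends to memorize the training data. As a sketch (max_depth=5 is an illustrative, untuned choice), limiting the depth usually generalizes better:
from sklearn import tree
#a shallow tree trades training-set fit for better generalization
clf_pruned = tree.DecisionTreeClassifier(max_depth=5)
clf_pruned.fit(X1_train, Y1_train)
clf_pruned.score(X1_test, Y1_test)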
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X1_train,Y1_train)
predict3 = model.predict(X1_test)
predict3
from sklearn.metrics import confusion_matrix
cm4=confusion_matrix(Y1_test, predict3)
print (cm4)
from sklearn import metrics
print(metrics.classification_report(Y1_test, predict3))
print(metrics.confusion_matrix(Y1_test, predict3))
total4 = sum(sum(cm4))
accuracy4 = (cm4[0,0]+cm4[1,1])/total4
accuracy4
specificity3=cm4[1,1]/(cm4[1,1]+cm4[1,0])
specificity3
sensitivity3=cm4[0,0]/(cm4[0,0]+cm4[0,1])
sensitivity3
import numpy as np
from sklearn.model_selection import KFold, train_test_split, cross_val_score
features=["RevolvingUtilizationOfUnsecuredLines_new","age_new","NumberOfTime30-59DaysPastDueNotWorse_new","DebtRatio_new","MonthlyIncome_new","NumberOfOpenCreditLinesAndLoans","NumberOfTimes90DaysLate_new","NumberRealEstateLoansOrLines","NumberOfTime60-89DaysPastDueNotWorse_new","NumberOfDependents_new"]
X1 = loan2[features]
y1 = loan2['SeriousDlqin2yrs']
X1_train, X1_test, Y1_train, Y1_test = train_test_split(X1, y1, train_size=0.8)
Y1_test.shape, X1_test.shape,X1_train.shape,Y1_train.shape
scores = cross_val_score(logistic1, X1, y1, cv=10)
scores
print(logistic1.score(X1_test, Y1_test))
print(scores)
The k-fold cross-validation accuracy is about 93%.
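Given the class imbalance, accuracy alone can be flattering; as a final sketch, the same cross-validation can be scored by AUC instead (StratifiedKFold keeps the roughly 6.7% defaulter rate in every fold):
from sklearn.model_selection import StratifiedKFold, cross_val_score
#AUC is threshold-independent and more informative than accuracy on imbalanced data
skf = StratifiedKFold(n_splits=10)
auc_scores = cross_val_score(logistic1, X1, y1, cv=skf, scoring='roc_auc')
auc_scores.mean()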