Before start our lesson please download the datasets.

Give me some credit data

We will try to understand the data exploration, validation and data cleaning using a case study on loans data

Give me some credit data. It is loans data. Historical data are provided on 150,000 borrowers.

The final objective is to build a model that borrowers can use to help make the best financial decisions.

We generally get the data and data dictionary from the data team.

Steps in Data Exploration and Cleaning

Step-1: Basic details of the data
Step-2: Categorical variables exploration
Step-3: Continuous variables exploration
Step-4: Missing Values and Outlier Treatment

Step-1: Basic details of the data

In [ ]:

#Lab: Basic contents of the data

# Import “Give me some Credit\cs-training.csv” 
# What are number of rows and columns
# Are there any suspicious variables?
# Are all the variable names correct?
# Display the variable formats
# Print the first 10 observations
# Do we have any unique identifier?
# Do the text and numeric variables have meaningful data?
# Are there some coded values in the data?
# Do all the variables appear to have data

In [ ]:

# Code: Basic contents of the data

In [47]:

import pandas as pd
loans=pd.read_csv("C:\\Users\\jyothi\\Google Drive\\Datasets\\Give me some Credit\\cs-training.csv")
loans

Out[47]:

	Sr_No	SeriousDlqin2yrs	RevolvingUtilizationOfUnsecuredLines	age	NumberOfTime30-59DaysPastDueNotWorse	DebtRatio	MonthlyIncome	NumberOfOpenCreditLinesAndLoans	NumberOfTimes90DaysLate	NumberRealEstateLoansOrLines	NumberOfTime60-89DaysPastDueNotWorse	NumberOfDependents
0	1	1	0.766127	45	2	0.802982	9120.0	13	0	6	0	2.0
1	2	0	0.957151	40	0	0.121876	2600.0	4	0	0	0	1.0
2	3	0	0.658180	38	1	0.085113	3042.0	2	1	0	0	0.0
3	4	0	0.233810	30	0	0.036050	3300.0	5	0	0	0	0.0
4	5	0	0.907239	49	1	0.024926	63588.0	7	0	1	0	0.0
5	6	0	0.213179	74	0	0.375607	3500.0	3	0	1	0	1.0
6	7	0	0.305682	57	0	5710.000000	NaN	8	0	3	0	0.0
7	8	0	0.754464	39	0	0.209940	3500.0	8	0	0	0	0.0
8	9	0	0.116951	27	0	46.000000	NaN	2	0	0	0	NaN
9	10	0	0.189169	57	0	0.606291	23684.0	9	0	4	0	2.0
10	11	0	0.644226	30	0	0.309476	2500.0	5	0	0	0	0.0
11	12	0	0.018798	51	0	0.531529	6501.0	7	0	2	0	2.0
12	13	0	0.010352	46	0	0.298354	12454.0	13	0	2	0	2.0
13	14	1	0.964673	40	3	0.382965	13700.0	9	3	1	1	2.0
14	15	0	0.019657	76	0	477.000000	0.0	6	0	1	0	0.0
15	16	0	0.548458	64	0	0.209892	11362.0	7	0	1	0	2.0
16	17	0	0.061086	78	0	2058.000000	NaN	10	0	2	0	0.0
17	18	0	0.166284	53	0	0.188274	8800.0	7	0	0	0	0.0
18	19	0	0.221813	43	0	0.527888	3280.0	7	0	1	0	2.0
19	20	0	0.602794	25	0	0.065868	333.0	2	0	0	0	0.0
20	21	0	0.200923	43	0	0.430046	12300.0	10	0	2	0	0.0
21	22	1	0.025656	38	0	0.475841	3000.0	7	0	1	0	2.0
22	23	0	1.000000	39	0	0.241104	2500.0	4	0	0	0	0.0
23	24	0	0.075427	32	0	0.085512	7916.0	6	0	0	0	0.0
24	25	0	0.046560	58	0	0.241622	2416.0	9	0	1	0	0.0
25	26	1	0.392248	50	0	1.595253	4676.0	14	0	3	0	1.0
26	27	0	0.052436	58	0	0.097672	8333.0	22	0	1	0	0.0
27	28	0	0.034421	69	0	0.042383	2500.0	17	0	0	0	1.0
28	29	0	0.452516	24	0	0.011761	3400.0	1	0	0	0	0.0
29	30	0	0.392995	58	2	0.436103	5500.0	15	0	1	0	0.0
…	…	…	…	…	…	…	…	…	…	…	…	…
149970	149971	0	0.025449	58	0	0.253855	15500.0	7	0	2	0	2.0
149971	149972	0	0.058001	83	0	0.013997	5000.0	6	0	0	0	0.0
149972	149973	0	0.071273	42	0	0.008638	6945.0	3	0	0	0	1.0
149973	149974	0	1.026395	44	0	0.494819	5500.0	7	0	1	0	1.0
149974	149975	0	0.962721	61	2	0.603479	5000.0	11	0	1	0	0.0
149975	149976	0	0.022088	58	0	2716.000000	NaN	8	0	2	0	0.0
149976	149977	0	0.000627	76	0	60.000000	NaN	5	0	0	0	0.0
149977	149978	0	0.236450	29	0	349.000000	NaN	3	0	0	0	0.0
149978	149979	0	0.917635	52	2	0.259496	2500.0	4	0	0	0	0.0
149979	149980	1	0.224711	55	0	0.057235	8700.0	7	0	0	0	0.0
149980	149981	0	0.067644	64	0	0.254976	5525.0	12	0	1	0	0.0
149981	149982	0	0.810012	43	0	0.121752	6849.0	4	0	0	0	4.0
149982	149983	0	0.021046	37	0	0.250272	2760.0	8	0	0	0	3.0
149983	149984	0	0.002485	82	0	0.000800	5000.0	5	0	0	0	0.0
149984	149985	0	0.037548	84	0	25.000000	NaN	5	0	0	0	0.0
149985	149986	0	0.954409	26	0	0.324962	1950.0	4	0	0	0	0.0
149986	149987	0	0.168102	49	0	0.080384	5000.0	16	0	0	0	1.0
149987	149988	0	1.000000	28	0	0.055692	3249.0	3	1	0	0	0.0
149988	149989	0	0.902051	31	1	0.347924	7515.0	10	0	1	0	0.0
149989	149990	0	0.013356	62	0	0.001408	9233.0	4	0	0	0	3.0
149990	149991	0	0.055518	46	0	0.609779	4335.0	7	0	1	0	2.0
149991	149992	0	0.104112	59	0	0.477658	10316.0	10	0	2	0	0.0
149992	149993	0	0.871976	50	0	4132.000000	NaN	11	0	1	0	3.0
149993	149994	0	1.000000	22	0	0.000000	820.0	1	0	0	0	0.0
149994	149995	0	0.385742	50	0	0.404293	3400.0	7	0	0	0	0.0
149995	149996	0	0.040674	74	0	0.225131	2100.0	4	0	1	0	0.0
149996	149997	0	0.299745	44	0	0.716562	5584.0	4	0	1	0	2.0
149997	149998	0	0.246044	58	0	3870.000000	NaN	18	0	1	0	0.0
149998	149999	0	0.000000	30	0	0.000000	5716.0	4	0	0	0	0.0
149999	150000	0	0.850283	64	0	0.249908	8158.0	8	0	2	0	0.0

150000 rows × 12 columns

In [48]:

#What are number of rows and columns
loans.shape

Out[48]:

(150000, 12)

In [49]:

#Are there any suspicious variables?
loans.columns.values

Out[49]:

array(['Sr_No', 'SeriousDlqin2yrs',
       'RevolvingUtilizationOfUnsecuredLines', 'age',
       'NumberOfTime30-59DaysPastDueNotWorse', 'DebtRatio',
       'MonthlyIncome', 'NumberOfOpenCreditLinesAndLoans',
       'NumberOfTimes90DaysLate', 'NumberRealEstateLoansOrLines',
       'NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfDependents'],
      dtype=object)

In [50]:

#Display the variable formats
loans.dtypes

Out[50]:

Sr_No                                     int64
SeriousDlqin2yrs                          int64
RevolvingUtilizationOfUnsecuredLines    float64
age                                       int64
NumberOfTime30-59DaysPastDueNotWorse      int64
DebtRatio                               float64
MonthlyIncome                           float64
NumberOfOpenCreditLinesAndLoans           int64
NumberOfTimes90DaysLate                   int64
NumberRealEstateLoansOrLines              int64
NumberOfTime60-89DaysPastDueNotWorse      int64
NumberOfDependents                      float64
dtype: object

In [51]:

#Print the first 10 observations
loans.head(10)

Out[51]:

	Sr_No	SeriousDlqin2yrs	RevolvingUtilizationOfUnsecuredLines	age	NumberOfTime30-59DaysPastDueNotWorse	DebtRatio	MonthlyIncome	NumberOfOpenCreditLinesAndLoans	NumberOfTimes90DaysLate	NumberRealEstateLoansOrLines	NumberOfDependents
0	1	1	0.766127	45	2	0.802982	9120.0	13	0	6	2.0
1	2	0	0.957151	40	0	0.121876	2600.0	4	0	0	1.0
2	3	0	0.658180	38	1	0.085113	3042.0	2	1	0	0.0
3	4	0	0.233810	30	0	0.036050	3300.0	5	0	0	0.0
4	5	0	0.907239	49	1	0.024926	63588.0	7	0	1	0.0
5	6	0	0.213179	74	0	0.375607	3500.0	3	0	1	1.0
6	7	0	0.305682	57	0	5710.000000	NaN	8	0	3	0.0
7	8	0	0.754464	39	0	0.209940	3500.0	8	0	0	0.0
8	9	0	0.116951	27	0	46.000000	NaN	2	0	0	NaN
9	10	0	0.189169	57	0	0.606291	23684.0	9	0	4	2.0

In [52]:

#Do we have any unique identifier?
loans.columns.values

Out[52]:

array(['Sr_No', 'SeriousDlqin2yrs',
       'RevolvingUtilizationOfUnsecuredLines', 'age',
       'NumberOfTime30-59DaysPastDueNotWorse', 'DebtRatio',
       'MonthlyIncome', 'NumberOfOpenCreditLinesAndLoans',
       'NumberOfTimes90DaysLate', 'NumberRealEstateLoansOrLines',
       'NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfDependents'],
      dtype=object)

Step-2: Categorical and Descrete variables exploration

In [ ]:

# Code: Frequencies

# What are the categorical and discrete variables? What are the continues variables.
# Find the frequencies of all class variables in the data 
# Are there any variables with missing values?
# Are there any default values?
# Can you identify the variables with outliers?
# Are there any variables with other issues?

In [57]:

#What are the categorical and discrete variables? What are the continues variables.
loans.dtypes

Out[57]:

Sr_No                                     int64
SeriousDlqin2yrs                          int64
RevolvingUtilizationOfUnsecuredLines    float64
age                                       int64
NumberOfTime30-59DaysPastDueNotWorse      int64
DebtRatio                               float64
MonthlyIncome                           float64
NumberOfOpenCreditLinesAndLoans           int64
NumberOfTimes90DaysLate                   int64
NumberRealEstateLoansOrLines              int64
NumberOfTime60-89DaysPastDueNotWorse      int64
NumberOfDependents                      float64
dtype: object

In [58]:

loans.head()

Out[58]:

	Sr_No	SeriousDlqin2yrs	RevolvingUtilizationOfUnsecuredLines	age	NumberOfTime30-59DaysPastDueNotWorse	DebtRatio	MonthlyIncome	NumberOfOpenCreditLinesAndLoans	NumberOfTimes90DaysLate	NumberRealEstateLoansOrLines	NumberOfDependents
0	1	1	0.766127	45	2	0.802982	9120.0	13	0	6	2.0
1	2	0	0.957151	40	0	0.121876	2600.0	4	0	0	1.0
2	3	0	0.658180	38	1	0.085113	3042.0	2	1	0	0.0
3	4	0	0.233810	30	0	0.036050	3300.0	5	0	0	0.0
4	5	0	0.907239	49	1	0.024926	63588.0	7	0	1	0.0

In [59]:

#Find the frequencies of all class variables in the data 
loans.columns.values

Out[59]:

array(['Sr_No', 'SeriousDlqin2yrs',
       'RevolvingUtilizationOfUnsecuredLines', 'age',
       'NumberOfTime30-59DaysPastDueNotWorse', 'DebtRatio',
       'MonthlyIncome', 'NumberOfOpenCreditLinesAndLoans',
       'NumberOfTimes90DaysLate', 'NumberRealEstateLoansOrLines',
       'NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfDependents'],
      dtype=object)

In [60]:

#Find the frequencies of all class variables in the data 
loans['SeriousDlqin2yrs'].value_counts()

Out[60]:

0    139974
1     10026
Name: SeriousDlqin2yrs, dtype: int64

In [62]:

loans['age'].value_counts(sort=False)

Out[62]:

0         1
21      183
22      434
23      641
24      816
25      953
26     1193
27     1338
28     1560
29     1702
30     1937
31     2038
32     2050
33     2239
34     2155
35     2246
36     2379
37     2521
38     2631
39     2987
40     3093
41     3122
42     3082
43     3208
44     3294
45     3502
46     3714
47     3719
48     3806
49     3837
       ... 
76     1183
77     1099
78     1054
79      981
80      876
81      774
82      647
83      512
84      480
85      483
86      407
87      357
88      313
89      276
90      198
91      154
92       93
93       87
94       47
95       45
96       18
97       17
98        6
99        9
101       3
102       3
103       3
105       1
107       1
109       2
Name: age, Length: 86, dtype: int64

In [63]:

loans['NumberOfTime30-59DaysPastDueNotWorse'].value_counts(sort=False)

Out[63]:

0     126018
1      16033
2       4598
3       1754
4        747
5        342
6        140
7         54
8         25
9         12
10         4
11         1
12         2
13         1
96         5
98       264
Name: NumberOfTime30-59DaysPastDueNotWorse, dtype: int64

In [64]:

loans['NumberOfOpenCreditLinesAndLoans'].value_counts(sort=False)

Out[64]:

0      1888
1      4438
2      6666
3      9058
4     11609
5     12931
6     13614
7     13245
8     12562
9     11355
10     9624
11     8321
12     7005
13     5667
14     4546
15     3645
16     3000
17     2370
18     1874
19     1433
20     1169
21      864
22      685
23      533
24      422
25      337
26      239
27      194
28      150
29      114
30       88
31       74
32       52
33       47
34       35
35       27
36       18
37        7
38       13
39        9
40       10
41        4
42        8
43        8
44        2
45        8
46        3
47        2
48        6
49        4
50        2
51        2
52        3
53        1
54        4
56        2
57        2
58        1
Name: NumberOfOpenCreditLinesAndLoans, dtype: int64

In [65]:

loans['NumberOfTimes90DaysLate'].value_counts(sort=False)

Out[65]:

0     141662
1       5243
2       1555
3        667
4        291
5        131
6         80
7         38
8         21
9         19
10         8
11         5
12         2
13         4
14         2
15         2
17         1
96         5
98       264
Name: NumberOfTimes90DaysLate, dtype: int64

In [66]:

loans['NumberRealEstateLoansOrLines'].value_counts(sort=False)

Out[66]:

0     56188
1     52338
2     31522
3      6300
4      2170
5       689
6       320
7       171
8        93
9        78
10       37
11       23
12       18
13       15
14        7
15        7
16        4
17        4
18        2
19        2
20        2
21        1
23        2
25        3
26        1
29        1
32        1
54        1
Name: NumberRealEstateLoansOrLines, dtype: int64

In [67]:

loans['NumberOfTime60-89DaysPastDueNotWorse'].value_counts(sort=False)

Out[67]:

0     142396
1       5731
2       1118
3        318
4        105
5         34
6         16
7          9
8          2
9          1
11         1
96         5
98       264
Name: NumberOfTime60-89DaysPastDueNotWorse, dtype: int64

In [68]:

loans['NumberOfDependents'].value_counts(sort=False)

Out[68]:

2.0     19522
0.0     86902
10.0        5
3.0      9483
9.0         5
4.0      2862
20.0        1
6.0       158
5.0       746
8.0        24
1.0     26316
13.0        1
7.0        51
Name: NumberOfDependents, dtype: int64

In [69]:

#Are there any   variables with missing values?
loans.isnull().sum()

Out[69]:

Sr_No                                       0
SeriousDlqin2yrs                            0
RevolvingUtilizationOfUnsecuredLines        0
age                                         0
NumberOfTime30-59DaysPastDueNotWorse        0
DebtRatio                                   0
MonthlyIncome                           29731
NumberOfOpenCreditLinesAndLoans             0
NumberOfTimes90DaysLate                     0
NumberRealEstateLoansOrLines                0
NumberOfTime60-89DaysPastDueNotWorse        0
NumberOfDependents                       3924
dtype: int64

Step-3: Continuous variables exploration

In [ ]:

#Code: Continuous variables summary

# List down the continuous variables
# Find summary statistics for each variable. Min, Max, Median, Mean, sd, Var
# Find Quartiles for each of the variables
# Create Box plots and identify outliers
# Find the percentage of missing values
# Find Percentiles and find percentage of outliers, if any P1, p5,p10,q1(p25),q3(p75), p90,p99

In [70]:

#List down the continuous variables
loans.dtypes

Out[70]:

Sr_No                                     int64
SeriousDlqin2yrs                          int64
RevolvingUtilizationOfUnsecuredLines    float64
age                                       int64
NumberOfTime30-59DaysPastDueNotWorse      int64
DebtRatio                               float64
MonthlyIncome                           float64
NumberOfOpenCreditLinesAndLoans           int64
NumberOfTimes90DaysLate                   int64
NumberRealEstateLoansOrLines              int64
NumberOfTime60-89DaysPastDueNotWorse      int64
NumberOfDependents                      float64
dtype: object

In [79]:

loans.head()

Out[79]:

	Sr_No	SeriousDlqin2yrs	RevolvingUtilizationOfUnsecuredLines	age	NumberOfTime30-59DaysPastDueNotWorse	DebtRatio	MonthlyIncome	NumberOfOpenCreditLinesAndLoans	NumberOfTimes90DaysLate	NumberRealEstateLoansOrLines	NumberOfDependents
0	1	1	0.766127	45	2	0.802982	9120.0	13	0	6	2.0
1	2	0	0.957151	40	0	0.121876	2600.0	4	0	0	1.0
2	3	0	0.658180	38	1	0.085113	3042.0	2	1	0	0.0
3	4	0	0.233810	30	0	0.036050	3300.0	5	0	0	0.0
4	5	0	0.907239	49	1	0.024926	63588.0	7	0	1	0.0

In [80]:

#Find summary statistics for each variable. Min, Max, Median, Mean, sd, Var
loans['RevolvingUtilizationOfUnsecuredLines'].describe()

Out[80]:

count    150000.000000
mean          6.048438
std         249.755371
min           0.000000
25%           0.029867
50%           0.154181
75%           0.559046
max       50708.000000
Name: RevolvingUtilizationOfUnsecuredLines, dtype: float64

In [81]:

loans['MonthlyIncome'].describe()

Out[81]:

count    1.202690e+05
mean     6.670221e+03
std      1.438467e+04
min      0.000000e+00
25%      3.400000e+03
50%      5.400000e+03
75%      8.249000e+03
max      3.008750e+06
Name: MonthlyIncome, dtype: float64

In [82]:

import numpy as np
variance = np.var(loans['RevolvingUtilizationOfUnsecuredLines'])
variance

Out[82]:

62377.329304612416

In [86]:

import numpy as np
np.var(loans['MonthlyIncome'])

Out[86]:

206917131.81275052

In [87]:

import numpy as np
np.std(loans['RevolvingUtilizationOfUnsecuredLines'])

Out[87]:

249.75453810614215

In [88]:

import numpy as np
np.std(loans['MonthlyIncome'])

Out[88]:

14384.614413071715

In [89]:

#Find Quartiles for each of the variables
loans['RevolvingUtilizationOfUnsecuredLines'].describe()

Out[89]:

count    150000.000000
mean          6.048438
std         249.755371
min           0.000000
25%           0.029867
50%           0.154181
75%           0.559046
max       50708.000000
Name: RevolvingUtilizationOfUnsecuredLines, dtype: float64

In [90]:

loans['MonthlyIncome'].describe()

Out[90]:

count    1.202690e+05
mean     6.670221e+03
std      1.438467e+04
min      0.000000e+00
25%      3.400000e+03
50%      5.400000e+03
75%      8.249000e+03
max      3.008750e+06
Name: MonthlyIncome, dtype: float64

This is an integer variable. It has missing values represented by ‘NA’. Its minimum value is 0, which is practically impossible. Mean is 6670 and median is 5400 without considering NA values.

In [92]:

#Create Box plots and identify outliers
import matplotlib.pyplot as plt
%matplotlib inline
loans.boxplot(column="RevolvingUtilizationOfUnsecuredLines")

Out[92]:

<matplotlib.axes._subplots.AxesSubplot at 0x20b2c35e278>

In [94]:

import matplotlib.pyplot as plt
%matplotlib inline
loans.boxplot(column="MonthlyIncome")

Out[94]:

<matplotlib.axes._subplots.AxesSubplot at 0x20b2c4830f0>

In [98]:

#Find the missing values in MonthlyIncome variable 
loans['MonthlyIncome'].isnull().sum()

Out[98]:

29731 values are missing in MonthlyIncome variable

In [99]:

#Find the percentage of missing values
loans['MonthlyIncome'].isnull().sum()/len(loans)

Out[99]:

0.19820666666666667

In [100]:

#Find Percentiles and find percentage of outliers, if any P1, p5,p10,q1(p25),q3(p75), p90,p99 
util_percentiles=loans['RevolvingUtilizationOfUnsecuredLines'].quantile([0.05, 0.1, 0.25, 0.5, 0.75, 0.80, 0.9,0.91,0.95,0.96,0.97,0.975,0.98,0.99,1])
round(util_percentiles,2)

Out[100]:

0.050        0.00
0.100        0.00
0.250        0.03
0.500        0.15
0.750        0.56
0.800        0.70
0.900        0.98
0.910        1.00
0.950        1.00
0.960        1.00
0.970        1.00
0.975        1.00
0.980        1.01
0.990        1.09
1.000    50708.00
Name: RevolvingUtilizationOfUnsecuredLines, dtype: float64

Code: Data Cleaning Scenario-1

In [118]:

# What percent are missing values in RevolvingUtilizationOfUnsecuredLines?
# Get the detailed percentile distribution
# Clean the variable, and create a new variable by removing all the issues

In [101]:

# # LAB: Data Cleaning Scenario-1
#What percent are missing values in RevolvingUtilizationOfUnsecuredLines?
#Get the detailed percentile distribution
util_percentiles=loans['RevolvingUtilizationOfUnsecuredLines'].quantile([0.05, 0.1, 0.25, 0.5, 0.75, 0.80, 0.9,0.91,0.95,0.96,0.97,0.975,0.98,0.99,1])
round(util_percentiles,2)

Out[101]:

0.050        0.00
0.100        0.00
0.250        0.03
0.500        0.15
0.750        0.56
0.800        0.70
0.900        0.98
0.910        1.00
0.950        1.00
0.960        1.00
0.970        1.00
0.975        1.00
0.980        1.01
0.990        1.09
1.000    50708.00
Name: RevolvingUtilizationOfUnsecuredLines, dtype: float64

In [102]:

#Clean the variable, and create a new variable by removing all the issues
#If utilization is more than 1 then it can be replaced by median
median_util=loans['RevolvingUtilizationOfUnsecuredLines'].median()
median_util

Out[102]:

0.154180737

In [103]:

loans['util_new']=loans['RevolvingUtilizationOfUnsecuredLines']
loans['util_new'][loans['util_new']>1]=median_util 
loans['util_new']

C:\Users\jyothi\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

Out[103]:

0         0.766127
1         0.957151
2         0.658180
3         0.233810
4         0.907239
5         0.213179
6         0.305682
7         0.754464
8         0.116951
9         0.189169
10        0.644226
11        0.018798
12        0.010352
13        0.964673
14        0.019657
15        0.548458
16        0.061086
17        0.166284
18        0.221813
19        0.602794
20        0.200923
21        0.025656
22        1.000000
23        0.075427
24        0.046560
25        0.392248
26        0.052436
27        0.034421
28        0.452516
29        0.392995
            ...   
149970    0.025449
149971    0.058001
149972    0.071273
149973    0.154181
149974    0.962721
149975    0.022088
149976    0.000627
149977    0.236450
149978    0.917635
149979    0.224711
149980    0.067644
149981    0.810012
149982    0.021046
149983    0.002485
149984    0.037548
149985    0.954409
149986    0.168102
149987    1.000000
149988    0.902051
149989    0.013356
149990    0.055518
149991    0.104112
149992    0.871976
149993    1.000000
149994    0.385742
149995    0.040674
149996    0.299745
149997    0.246044
149998    0.000000
149999    0.850283
Name: util_new, Length: 150000, dtype: float64

In [104]:

# percentile distribution for new variable
util_percentiles1=loans['util_new'].quantile([0.05, 0.1, 0.25, 0.5, 0.75, 0.80, 0.9,0.91,0.95,0.96,0.97,0.975,0.98,0.99,1])
round(util_percentiles1,2)

Out[104]:

0.050    0.00
0.100    0.00
0.250    0.03
0.500    0.15
0.750    0.51
0.800    0.63
0.900    0.93
0.910    0.95
0.950    1.00
0.960    1.00
0.970    1.00
0.975    1.00
0.980    1.00
0.990    1.00
1.000    1.00
Name: util_new, dtype: float64

Code: Data Cleaning Scenario-2

In [ ]:

# What is the issue with NumberOfTime30_59DaysPastDueNotWorse
# Draw a frequency table
# What percent of the values are erroneous?
# Clean the variable- Look at the cross tab of variable vs target. Impute based on target .
# Create frequency table for cleaned variable

In [105]:

# # LAB: Data Cleaning Scenario-2
#What is the issue with NumberOfTime30_59DaysPastDueNotWorse
#Draw a frequency table
freq_table_30dpd=loans['NumberOfTime30-59DaysPastDueNotWorse'].value_counts(sort=False)
freq_table_30dpd

Out[105]:

0     126018
1      16033
2       4598
3       1754
4        747
5        342
6        140
7         54
8         25
9         12
10         4
11         1
12         2
13         1
96         5
98       264
Name: NumberOfTime30-59DaysPastDueNotWorse, dtype: int64

In [106]:

#One month defaults frequency can't be beyond 24 in last 24 months
#What percent of the values are erroneous?
freq_table_30dpd[13:len(freq_table_30dpd)]
freq_table_30dpd[13:len(freq_table_30dpd)].sum()/freq_table_30dpd.sum()

Out[106]:

0.0018

In [107]:

#Clean the variable- Look at the cross tab of variable vs target. Impute based on target .
#Cross tab with target
import pandas as pd
cross_tab_30dpd_target=pd.crosstab(loans['NumberOfTime30-59DaysPastDueNotWorse'],loans['SeriousDlqin2yrs'])
cross_tab_30dpd_target

Out[107]:

SeriousDlqin2yrs	0	1
NumberOfTime30-59DaysPastDueNotWorse
0	120977	5041
1	13624	2409
2	3379	1219
3	1136	618
4	429	318
5	188	154
6	66	74
7	26	28
8	17	8
9	8	4
10	1	3
11	0	1
12	1	1
13	0	1
96	1	4
98	121	143

In [108]:

#Cross tab row Percentages
cross_tab_30dpd_target_percent=cross_tab_30dpd_target.astype(float).div(cross_tab_30dpd_target.sum(axis=1), axis=0)
round(cross_tab_30dpd_target_percent,2)

Out[108]:

SeriousDlqin2yrs	0	1
NumberOfTime30-59DaysPastDueNotWorse
0	0.96	0.04
1	0.85	0.15
2	0.73	0.27
3	0.65	0.35
4	0.57	0.43
5	0.55	0.45
6	0.47	0.53
7	0.48	0.52
8	0.68	0.32
9	0.67	0.33
10	0.25	0.75
11	0.00	1.00
12	0.50	0.50
13	0.00	1.00
96	0.20	0.80
98	0.46	0.54

In [109]:

loans['num_30_59_dpd_new']=loans['NumberOfTime30-59DaysPastDueNotWorse']
loans['num_30_59_dpd_new'][loans['num_30_59_dpd_new']>12]=6
loans['num_30_59_dpd_new']

C:\Users\jyothi\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

Out[109]:

0         2
1         0
2         1
3         0
4         1
5         0
6         0
7         0
8         0
9         0
10        0
11        0
12        0
13        3
14        0
15        0
16        0
17        0
18        0
19        0
20        0
21        0
22        0
23        0
24        0
25        0
26        0
27        0
28        0
29        2
         ..
149970    0
149971    0
149972    0
149973    0
149974    2
149975    0
149976    0
149977    0
149978    2
149979    0
149980    0
149981    0
149982    0
149983    0
149984    0
149985    0
149986    0
149987    0
149988    1
149989    0
149990    0
149991    0
149992    0
149993    0
149994    0
149995    0
149996    0
149997    0
149998    0
149999    0
Name: num_30_59_dpd_new, Length: 150000, dtype: int64

In [110]:

loans['num_30_59_dpd_new'].value_counts(sort=False)

Out[110]:

0     126018
1      16033
2       4598
3       1754
4        747
5        342
6        410
7         54
8         25
9         12
10         4
11         1
12         2
Name: num_30_59_dpd_new, dtype: int64

LAB: Monthly Income

In [ ]:

# Find the missing value percentage in monthly income
# Create an indicator variable for missing and non-missing
# Replace the missing values with median

In [111]:

# # Data Cleaning Scenario-3
#Find the missing value percentage in monthly income
loans['MonthlyIncome'].isnull().sum()
loans['MonthlyIncome'].isnull().sum()/len(loans)

Out[111]:

0.19820666666666667

In [115]:

#Once identified where missing values exist, the next task usually is to fill them (data imputation). Depending upon the context,
#in this case, I am assigning median value to all those positions where missing value is present:

loans['MonthlyIncome_ind']=1
loans['MonthlyIncome_ind'][loans['MonthlyIncome'].isnull()]=0
loans['MonthlyIncome_ind'].value_counts(sort=False)

C:\Users\jyothi\Anaconda3\lib\site-packages\ipykernel_launcher.py:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """

Out[115]:

0     29731
1    120269
Name: MonthlyIncome_ind, dtype: int64

In [116]:

loans['MonthlyIncome_new']=loans['MonthlyIncome']
loans['MonthlyIncome_new'][loans['MonthlyIncome'].isnull()]=loans['MonthlyIncome'].median()
round(loans['MonthlyIncome_new'].describe())

C:\Users\jyothi\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

Out[116]:

count     150000.0
mean        6418.0
std        12890.0
min            0.0
25%         3903.0
50%         5400.0
75%         7400.0
max      3008750.0
Name: MonthlyIncome_new, dtype: float64

In [119]:

#Remaining Variables Imputation

# Debt Ratio: Imputation 
# NumberOfOpenCreditLinesAndLoans : No issues in this variable
# NumberOfTimes90DaysLate: Imputation similar to NumberOfTime30_59DaysPastDueNotW
# NumberRealEstateLoansOrLines: : No issues in this variable
# NumberOfTime60_89DaysPastDueNotW: Imputation similar to NumberOfTime30_59DaysPastDueNotW
# ConclusionNumberOfDependents: Impute based on target variable

In [120]:

#Conclusion
#Data cleaning needs patience, we need to clean for each individual variable

Select Category

Handout – Data Cleaning and Treatement

Before start our lesson please download the datasets.

Give me some credit data

Steps in Data Exploration and Cleaning

Step-1: Basic details of the data

Step-2: Categorical and Descrete variables exploration

Step-3: Continuous variables exploration

Code: Data Cleaning Scenario-1

Code: Data Cleaning Scenario-2

LAB: Monthly Income