
Handout – Data Cleaning and Treatment

Before we start the lesson, please download the datasets.

Give me some credit data

We will work through data exploration, validation and data cleaning using a case study on loans data.

Give Me Some Credit is a loans dataset: historical data are provided on 150,000 borrowers.

The final objective is to build a model that borrowers can use to help make the best financial decisions.

We generally get the data and data dictionary from the data team.

Steps in Data Exploration and Cleaning

Step-1: Basic details of the data
Step-2: Categorical variables exploration
Step-3: Continuous variables exploration
Step-4: Missing Values and Outlier Treatment

Step-1: Basic details of the data

In [ ]:
#Lab: Basic contents of the data

# Import "Give me some Credit\cs-training.csv"
# What are the number of rows and columns?
# Are there any suspicious variables?
# Are all the variable names correct?
# Display the variable formats
# Print the first 10 observations
# Do we have any unique identifier?
# Do the text and numeric variables have meaningful data?
# Are there some coded values in the data?
# Do all the variables appear to have data?
In [ ]:
# Code: Basic contents of the data
In [47]:
import pandas as pd
loans=pd.read_csv("C:\\Users\\jyothi\\Google Drive\\Datasets\\Give me some Credit\\cs-training.csv")
loans
Out[47]:
Sr_No SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age NumberOfTime30-59DaysPastDueNotWorse DebtRatio MonthlyIncome NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate NumberRealEstateLoansOrLines NumberOfTime60-89DaysPastDueNotWorse NumberOfDependents
0 1 1 0.766127 45 2 0.802982 9120.0 13 0 6 0 2.0
1 2 0 0.957151 40 0 0.121876 2600.0 4 0 0 0 1.0
2 3 0 0.658180 38 1 0.085113 3042.0 2 1 0 0 0.0
3 4 0 0.233810 30 0 0.036050 3300.0 5 0 0 0 0.0
4 5 0 0.907239 49 1 0.024926 63588.0 7 0 1 0 0.0
5 6 0 0.213179 74 0 0.375607 3500.0 3 0 1 0 1.0
6 7 0 0.305682 57 0 5710.000000 NaN 8 0 3 0 0.0
7 8 0 0.754464 39 0 0.209940 3500.0 8 0 0 0 0.0
8 9 0 0.116951 27 0 46.000000 NaN 2 0 0 0 NaN
9 10 0 0.189169 57 0 0.606291 23684.0 9 0 4 0 2.0
10 11 0 0.644226 30 0 0.309476 2500.0 5 0 0 0 0.0
11 12 0 0.018798 51 0 0.531529 6501.0 7 0 2 0 2.0
12 13 0 0.010352 46 0 0.298354 12454.0 13 0 2 0 2.0
13 14 1 0.964673 40 3 0.382965 13700.0 9 3 1 1 2.0
14 15 0 0.019657 76 0 477.000000 0.0 6 0 1 0 0.0
15 16 0 0.548458 64 0 0.209892 11362.0 7 0 1 0 2.0
16 17 0 0.061086 78 0 2058.000000 NaN 10 0 2 0 0.0
17 18 0 0.166284 53 0 0.188274 8800.0 7 0 0 0 0.0
18 19 0 0.221813 43 0 0.527888 3280.0 7 0 1 0 2.0
19 20 0 0.602794 25 0 0.065868 333.0 2 0 0 0 0.0
20 21 0 0.200923 43 0 0.430046 12300.0 10 0 2 0 0.0
21 22 1 0.025656 38 0 0.475841 3000.0 7 0 1 0 2.0
22 23 0 1.000000 39 0 0.241104 2500.0 4 0 0 0 0.0
23 24 0 0.075427 32 0 0.085512 7916.0 6 0 0 0 0.0
24 25 0 0.046560 58 0 0.241622 2416.0 9 0 1 0 0.0
25 26 1 0.392248 50 0 1.595253 4676.0 14 0 3 0 1.0
26 27 0 0.052436 58 0 0.097672 8333.0 22 0 1 0 0.0
27 28 0 0.034421 69 0 0.042383 2500.0 17 0 0 0 1.0
28 29 0 0.452516 24 0 0.011761 3400.0 1 0 0 0 0.0
29 30 0 0.392995 58 2 0.436103 5500.0 15 0 1 0 0.0
149970 149971 0 0.025449 58 0 0.253855 15500.0 7 0 2 0 2.0
149971 149972 0 0.058001 83 0 0.013997 5000.0 6 0 0 0 0.0
149972 149973 0 0.071273 42 0 0.008638 6945.0 3 0 0 0 1.0
149973 149974 0 1.026395 44 0 0.494819 5500.0 7 0 1 0 1.0
149974 149975 0 0.962721 61 2 0.603479 5000.0 11 0 1 0 0.0
149975 149976 0 0.022088 58 0 2716.000000 NaN 8 0 2 0 0.0
149976 149977 0 0.000627 76 0 60.000000 NaN 5 0 0 0 0.0
149977 149978 0 0.236450 29 0 349.000000 NaN 3 0 0 0 0.0
149978 149979 0 0.917635 52 2 0.259496 2500.0 4 0 0 0 0.0
149979 149980 1 0.224711 55 0 0.057235 8700.0 7 0 0 0 0.0
149980 149981 0 0.067644 64 0 0.254976 5525.0 12 0 1 0 0.0
149981 149982 0 0.810012 43 0 0.121752 6849.0 4 0 0 0 4.0
149982 149983 0 0.021046 37 0 0.250272 2760.0 8 0 0 0 3.0
149983 149984 0 0.002485 82 0 0.000800 5000.0 5 0 0 0 0.0
149984 149985 0 0.037548 84 0 25.000000 NaN 5 0 0 0 0.0
149985 149986 0 0.954409 26 0 0.324962 1950.0 4 0 0 0 0.0
149986 149987 0 0.168102 49 0 0.080384 5000.0 16 0 0 0 1.0
149987 149988 0 1.000000 28 0 0.055692 3249.0 3 1 0 0 0.0
149988 149989 0 0.902051 31 1 0.347924 7515.0 10 0 1 0 0.0
149989 149990 0 0.013356 62 0 0.001408 9233.0 4 0 0 0 3.0
149990 149991 0 0.055518 46 0 0.609779 4335.0 7 0 1 0 2.0
149991 149992 0 0.104112 59 0 0.477658 10316.0 10 0 2 0 0.0
149992 149993 0 0.871976 50 0 4132.000000 NaN 11 0 1 0 3.0
149993 149994 0 1.000000 22 0 0.000000 820.0 1 0 0 0 0.0
149994 149995 0 0.385742 50 0 0.404293 3400.0 7 0 0 0 0.0
149995 149996 0 0.040674 74 0 0.225131 2100.0 4 0 1 0 0.0
149996 149997 0 0.299745 44 0 0.716562 5584.0 4 0 1 0 2.0
149997 149998 0 0.246044 58 0 3870.000000 NaN 18 0 1 0 0.0
149998 149999 0 0.000000 30 0 0.000000 5716.0 4 0 0 0 0.0
149999 150000 0 0.850283 64 0 0.249908 8158.0 8 0 2 0 0.0

150000 rows × 12 columns

In [48]:
#What are the number of rows and columns?
loans.shape
Out[48]:
(150000, 12)
In [49]:
#Are there any suspicious variables?
loans.columns.values
Out[49]:
array(['Sr_No', 'SeriousDlqin2yrs',
       'RevolvingUtilizationOfUnsecuredLines', 'age',
       'NumberOfTime30-59DaysPastDueNotWorse', 'DebtRatio',
       'MonthlyIncome', 'NumberOfOpenCreditLinesAndLoans',
       'NumberOfTimes90DaysLate', 'NumberRealEstateLoansOrLines',
       'NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfDependents'],
      dtype=object)
In [50]:
#Display the variable formats
loans.dtypes
Out[50]:
Sr_No                                     int64
SeriousDlqin2yrs                          int64
RevolvingUtilizationOfUnsecuredLines    float64
age                                       int64
NumberOfTime30-59DaysPastDueNotWorse      int64
DebtRatio                               float64
MonthlyIncome                           float64
NumberOfOpenCreditLinesAndLoans           int64
NumberOfTimes90DaysLate                   int64
NumberRealEstateLoansOrLines              int64
NumberOfTime60-89DaysPastDueNotWorse      int64
NumberOfDependents                      float64
dtype: object
In [51]:
#Print the first 10 observations
loans.head(10)
Out[51]:
Sr_No SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age NumberOfTime30-59DaysPastDueNotWorse DebtRatio MonthlyIncome NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate NumberRealEstateLoansOrLines NumberOfTime60-89DaysPastDueNotWorse NumberOfDependents
0 1 1 0.766127 45 2 0.802982 9120.0 13 0 6 0 2.0
1 2 0 0.957151 40 0 0.121876 2600.0 4 0 0 0 1.0
2 3 0 0.658180 38 1 0.085113 3042.0 2 1 0 0 0.0
3 4 0 0.233810 30 0 0.036050 3300.0 5 0 0 0 0.0
4 5 0 0.907239 49 1 0.024926 63588.0 7 0 1 0 0.0
5 6 0 0.213179 74 0 0.375607 3500.0 3 0 1 0 1.0
6 7 0 0.305682 57 0 5710.000000 NaN 8 0 3 0 0.0
7 8 0 0.754464 39 0 0.209940 3500.0 8 0 0 0 0.0
8 9 0 0.116951 27 0 46.000000 NaN 2 0 0 0 NaN
9 10 0 0.189169 57 0 0.606291 23684.0 9 0 4 0 2.0
In [52]:
#Do we have any unique identifier?
loans.columns.values
Out[52]:
array(['Sr_No', 'SeriousDlqin2yrs',
       'RevolvingUtilizationOfUnsecuredLines', 'age',
       'NumberOfTime30-59DaysPastDueNotWorse', 'DebtRatio',
       'MonthlyIncome', 'NumberOfOpenCreditLinesAndLoans',
       'NumberOfTimes90DaysLate', 'NumberRealEstateLoansOrLines',
       'NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfDependents'],
      dtype=object)
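Listing the column names alone does not answer the unique-identifier question. A minimal sketch, assuming the data is loaded as loans above, that checks whether Sr_No uniquely identifies each row:

In [ ]:
# Sr_No is a valid unique identifier only if every value occurs exactly once
print(loans['Sr_No'].is_unique)
# The number of distinct values should equal the number of rows
print(loans['Sr_No'].nunique(), len(loans))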

Step-2: Categorical and Discrete variables exploration

In [ ]:
# Code: Frequencies

# What are the categorical and discrete variables? What are the continuous variables?
# Find the frequencies of all class variables in the data 
# Are there any variables with missing values?
# Are there any default values?
# Can you identify the variables with outliers?
# Are there any variables with other issues?
In [57]:
#What are the categorical and discrete variables? What are the continuous variables?
loans.dtypes
Out[57]:
Sr_No                                     int64
SeriousDlqin2yrs                          int64
RevolvingUtilizationOfUnsecuredLines    float64
age                                       int64
NumberOfTime30-59DaysPastDueNotWorse      int64
DebtRatio                               float64
MonthlyIncome                           float64
NumberOfOpenCreditLinesAndLoans           int64
NumberOfTimes90DaysLate                   int64
NumberRealEstateLoansOrLines              int64
NumberOfTime60-89DaysPastDueNotWorse      int64
NumberOfDependents                      float64
dtype: object
In [58]:
loans.head()
Out[58]:
Sr_No SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age NumberOfTime30-59DaysPastDueNotWorse DebtRatio MonthlyIncome NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate NumberRealEstateLoansOrLines NumberOfTime60-89DaysPastDueNotWorse NumberOfDependents
0 1 1 0.766127 45 2 0.802982 9120.0 13 0 6 0 2.0
1 2 0 0.957151 40 0 0.121876 2600.0 4 0 0 0 1.0
2 3 0 0.658180 38 1 0.085113 3042.0 2 1 0 0 0.0
3 4 0 0.233810 30 0 0.036050 3300.0 5 0 0 0 0.0
4 5 0 0.907239 49 1 0.024926 63588.0 7 0 1 0 0.0
In [59]:
#Find the frequencies of all class variables in the data 
loans.columns.values
Out[59]:
array(['Sr_No', 'SeriousDlqin2yrs',
       'RevolvingUtilizationOfUnsecuredLines', 'age',
       'NumberOfTime30-59DaysPastDueNotWorse', 'DebtRatio',
       'MonthlyIncome', 'NumberOfOpenCreditLinesAndLoans',
       'NumberOfTimes90DaysLate', 'NumberRealEstateLoansOrLines',
       'NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfDependents'],
      dtype=object)
In [60]:
#Find the frequencies of all class variables in the data 
loans['SeriousDlqin2yrs'].value_counts()
Out[60]:
0    139974
1     10026
Name: SeriousDlqin2yrs, dtype: int64
In [62]:
loans['age'].value_counts(sort=False)
Out[62]:
0         1
21      183
22      434
23      641
24      816
25      953
26     1193
27     1338
28     1560
29     1702
30     1937
31     2038
32     2050
33     2239
34     2155
35     2246
36     2379
37     2521
38     2631
39     2987
40     3093
41     3122
42     3082
43     3208
44     3294
45     3502
46     3714
47     3719
48     3806
49     3837
       ... 
76     1183
77     1099
78     1054
79      981
80      876
81      774
82      647
83      512
84      480
85      483
86      407
87      357
88      313
89      276
90      198
91      154
92       93
93       87
94       47
95       45
96       18
97       17
98        6
99        9
101       3
102       3
103       3
105       1
107       1
109       2
Name: age, Length: 86, dtype: int64
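The frequency table above shows a single record with age 0, which is not a plausible borrower age. As an illustrative treatment (not part of the original lab; the column name age_new is an assumption), such records could be flagged and replaced with the median age:

In [ ]:
# Illustrative only: count implausible ages and impute them with the median age
print((loans['age'] < 18).sum())
loans['age_new'] = loans['age']
loans.loc[loans['age_new'] < 18, 'age_new'] = loans['age'].median()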
In [63]:
loans['NumberOfTime30-59DaysPastDueNotWorse'].value_counts(sort=False)
Out[63]:
0     126018
1      16033
2       4598
3       1754
4        747
5        342
6        140
7         54
8         25
9         12
10         4
11         1
12         2
13         1
96         5
98       264
Name: NumberOfTime30-59DaysPastDueNotWorse, dtype: int64
In [64]:
loans['NumberOfOpenCreditLinesAndLoans'].value_counts(sort=False)
Out[64]:
0      1888
1      4438
2      6666
3      9058
4     11609
5     12931
6     13614
7     13245
8     12562
9     11355
10     9624
11     8321
12     7005
13     5667
14     4546
15     3645
16     3000
17     2370
18     1874
19     1433
20     1169
21      864
22      685
23      533
24      422
25      337
26      239
27      194
28      150
29      114
30       88
31       74
32       52
33       47
34       35
35       27
36       18
37        7
38       13
39        9
40       10
41        4
42        8
43        8
44        2
45        8
46        3
47        2
48        6
49        4
50        2
51        2
52        3
53        1
54        4
56        2
57        2
58        1
Name: NumberOfOpenCreditLinesAndLoans, dtype: int64
In [65]:
loans['NumberOfTimes90DaysLate'].value_counts(sort=False)
Out[65]:
0     141662
1       5243
2       1555
3        667
4        291
5        131
6         80
7         38
8         21
9         19
10         8
11         5
12         2
13         4
14         2
15         2
17         1
96         5
98       264
Name: NumberOfTimes90DaysLate, dtype: int64
In [66]:
loans['NumberRealEstateLoansOrLines'].value_counts(sort=False)
Out[66]:
0     56188
1     52338
2     31522
3      6300
4      2170
5       689
6       320
7       171
8        93
9        78
10       37
11       23
12       18
13       15
14        7
15        7
16        4
17        4
18        2
19        2
20        2
21        1
23        2
25        3
26        1
29        1
32        1
54        1
Name: NumberRealEstateLoansOrLines, dtype: int64
In [67]:
loans['NumberOfTime60-89DaysPastDueNotWorse'].value_counts(sort=False)
Out[67]:
0     142396
1       5731
2       1118
3        318
4        105
5         34
6         16
7          9
8          2
9          1
11         1
96         5
98       264
Name: NumberOfTime60-89DaysPastDueNotWorse, dtype: int64
In [68]:
loans['NumberOfDependents'].value_counts(sort=False)
Out[68]:
2.0     19522
0.0     86902
10.0        5
3.0      9483
9.0         5
4.0      2862
20.0        1
6.0       158
5.0       746
8.0        24
1.0     26316
13.0        1
7.0        51
Name: NumberOfDependents, dtype: int64
In [69]:
#Are there any variables with missing values?
loans.isnull().sum()
Out[69]:
Sr_No                                       0
SeriousDlqin2yrs                            0
RevolvingUtilizationOfUnsecuredLines        0
age                                         0
NumberOfTime30-59DaysPastDueNotWorse        0
DebtRatio                                   0
MonthlyIncome                           29731
NumberOfOpenCreditLinesAndLoans             0
NumberOfTimes90DaysLate                     0
NumberRealEstateLoansOrLines                0
NumberOfTime60-89DaysPastDueNotWorse        0
NumberOfDependents                       3924
dtype: int64
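Instead of calling value_counts() column by column, the same frequency tables can be produced in one pass. A minimal sketch, assuming the discrete columns identified above:

In [ ]:
# Sketch: frequency tables for the discrete variables in a single loop
discrete_cols = ['SeriousDlqin2yrs', 'NumberOfTime30-59DaysPastDueNotWorse',
                 'NumberOfOpenCreditLinesAndLoans', 'NumberOfTimes90DaysLate',
                 'NumberRealEstateLoansOrLines', 'NumberOfTime60-89DaysPastDueNotWorse',
                 'NumberOfDependents']
for col in discrete_cols:
    # sort_index() orders the table by the value rather than by the count
    print(loans[col].value_counts().sort_index())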

Step-3: Continuous variables exploration

In [ ]:
#Code: Continuous variables summary

# List down the continuous variables
# Find summary statistics for each variable: min, max, median, mean, standard deviation, variance
# Find Quartiles for each of the variables
# Create Box plots and identify outliers
# Find the percentage of missing values
# Find percentiles and the percentage of outliers, if any: P1, P5, P10, Q1 (P25), Q3 (P75), P90, P99
In [70]:
#List down the continuous variables
loans.dtypes
Out[70]:
Sr_No                                     int64
SeriousDlqin2yrs                          int64
RevolvingUtilizationOfUnsecuredLines    float64
age                                       int64
NumberOfTime30-59DaysPastDueNotWorse      int64
DebtRatio                               float64
MonthlyIncome                           float64
NumberOfOpenCreditLinesAndLoans           int64
NumberOfTimes90DaysLate                   int64
NumberRealEstateLoansOrLines              int64
NumberOfTime60-89DaysPastDueNotWorse      int64
NumberOfDependents                      float64
dtype: object
In [79]:
loans.head()
Out[79]:
Sr_No SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age NumberOfTime30-59DaysPastDueNotWorse DebtRatio MonthlyIncome NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate NumberRealEstateLoansOrLines NumberOfTime60-89DaysPastDueNotWorse NumberOfDependents
0 1 1 0.766127 45 2 0.802982 9120.0 13 0 6 0 2.0
1 2 0 0.957151 40 0 0.121876 2600.0 4 0 0 0 1.0
2 3 0 0.658180 38 1 0.085113 3042.0 2 1 0 0 0.0
3 4 0 0.233810 30 0 0.036050 3300.0 5 0 0 0 0.0
4 5 0 0.907239 49 1 0.024926 63588.0 7 0 1 0 0.0
In [80]:
#Find summary statistics for each variable: min, max, median, mean, standard deviation, variance
loans['RevolvingUtilizationOfUnsecuredLines'].describe()
Out[80]:
count    150000.000000
mean          6.048438
std         249.755371
min           0.000000
25%           0.029867
50%           0.154181
75%           0.559046
max       50708.000000
Name: RevolvingUtilizationOfUnsecuredLines, dtype: float64
In [81]:
loans['MonthlyIncome'].describe()
Out[81]:
count    1.202690e+05
mean     6.670221e+03
std      1.438467e+04
min      0.000000e+00
25%      3.400000e+03
50%      5.400000e+03
75%      8.249000e+03
max      3.008750e+06
Name: MonthlyIncome, dtype: float64
In [82]:
import numpy as np
variance = np.var(loans['RevolvingUtilizationOfUnsecuredLines'])
variance        
Out[82]:
62377.329304612416
In [86]:
import numpy as np
np.var(loans['MonthlyIncome'])
Out[86]:
206917131.81275052
In [87]:
import numpy as np
np.std(loans['RevolvingUtilizationOfUnsecuredLines'])
Out[87]:
249.75453810614215
In [88]:
import numpy as np
np.std(loans['MonthlyIncome'])
Out[88]:
14384.614413071715
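The two standard deviations above differ slightly (249.755371 from describe() versus 249.754538 from np.std) because of the degrees-of-freedom convention: pandas' .std() and .var() use ddof=1 (sample estimate), while np.std and np.var use ddof=0 (population). A short sketch making the convention explicit:

In [ ]:
# Sample (ddof=1) versus population (ddof=0) standard deviation
print(loans['RevolvingUtilizationOfUnsecuredLines'].std())        # ddof=1, matches describe()
print(loans['RevolvingUtilizationOfUnsecuredLines'].std(ddof=0))  # ddof=0, matches np.std above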
In [89]:
#Find Quartiles for each of the variables
loans['RevolvingUtilizationOfUnsecuredLines'].describe()
Out[89]:
count    150000.000000
mean          6.048438
std         249.755371
min           0.000000
25%           0.029867
50%           0.154181
75%           0.559046
max       50708.000000
Name: RevolvingUtilizationOfUnsecuredLines, dtype: float64
In [90]:
loans['MonthlyIncome'].describe()
Out[90]:
count    1.202690e+05
mean     6.670221e+03
std      1.438467e+04
min      0.000000e+00
25%      3.400000e+03
50%      5.400000e+03
75%      8.249000e+03
max      3.008750e+06
Name: MonthlyIncome, dtype: float64

MonthlyIncome is an integer-valued variable, but it is read as float64 because it contains missing values (NA). Its minimum value is 0, which is practically impossible for income. Excluding the missing values, the mean is about 6,670 and the median is 5,400.

In [92]:
#Create Box plots and identify outliers
import matplotlib.pyplot as plt
%matplotlib inline
loans.boxplot(column="RevolvingUtilizationOfUnsecuredLines")
Out[92]:
<matplotlib.axes._subplots.AxesSubplot at 0x20b2c35e278>
In [94]:
import matplotlib.pyplot as plt
%matplotlib inline
loans.boxplot(column="MonthlyIncome")
Out[94]:
<matplotlib.axes._subplots.AxesSubplot at 0x20b2c4830f0>
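Box plots flag points beyond 1.5 times the interquartile range from the quartiles as outliers. A minimal sketch of the same rule in code (the 1.5*IQR threshold is the usual box-plot convention, not something specified in this lab):

In [ ]:
# Count MonthlyIncome values above the upper whisker of the box plot (1.5*IQR rule)
q1 = loans['MonthlyIncome'].quantile(0.25)
q3 = loans['MonthlyIncome'].quantile(0.75)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
print((loans['MonthlyIncome'] > upper_fence).sum())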
In [98]:
#Find the missing values in MonthlyIncome variable 
loans['MonthlyIncome'].isnull().sum()
Out[98]:
29731

29,731 values are missing in the MonthlyIncome variable.

In [99]:
#Find the percentage of missing values
loans['MonthlyIncome'].isnull().sum()/len(loans)
Out[99]:
0.19820666666666667
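The same calculation can be done for every column at once, since isnull().mean() returns the fraction of missing values per column. A minimal sketch:

In [ ]:
# Fraction of missing values in each column (multiply by 100 for percentages)
loans.isnull().mean().round(4)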
In [100]:
#Find percentiles and the percentage of outliers, if any: P1, P5, P10, Q1 (P25), Q3 (P75), P90, P99
util_percentiles=loans['RevolvingUtilizationOfUnsecuredLines'].quantile([0.05, 0.1, 0.25, 0.5, 0.75, 0.80, 0.9,0.91,0.95,0.96,0.97,0.975,0.98,0.99,1])
round(util_percentiles,2)
Out[100]:
0.050        0.00
0.100        0.00
0.250        0.03
0.500        0.15
0.750        0.56
0.800        0.70
0.900        0.98
0.910        1.00
0.950        1.00
0.960        1.00
0.970        1.00
0.975        1.00
0.980        1.01
0.990        1.09
1.000    50708.00
Name: RevolvingUtilizationOfUnsecuredLines, dtype: float64
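An equivalent way to get these cut points is describe() with an explicit percentiles argument, which returns the chosen percentiles together with count, mean and std in a single call. A minimal sketch:

In [ ]:
# Same percentile points via describe()
loans['RevolvingUtilizationOfUnsecuredLines'].describe(
    percentiles=[0.05, 0.1, 0.25, 0.5, 0.75, 0.8, 0.9, 0.95, 0.99])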

Code: Data Cleaning Scenario-1

In [118]:
# What percent are missing values in RevolvingUtilizationOfUnsecuredLines?
# Get the detailed percentile distribution
# Clean the variable, and create a new variable by removing all the issues
In [101]:
# LAB: Data Cleaning Scenario-1
#What percent are missing values in RevolvingUtilizationOfUnsecuredLines?
#Get the detailed percentile distribution
util_percentiles=loans['RevolvingUtilizationOfUnsecuredLines'].quantile([0.05, 0.1, 0.25, 0.5, 0.75, 0.80, 0.9,0.91,0.95,0.96,0.97,0.975,0.98,0.99,1])
round(util_percentiles,2)
Out[101]:
0.050        0.00
0.100        0.00
0.250        0.03
0.500        0.15
0.750        0.56
0.800        0.70
0.900        0.98
0.910        1.00
0.950        1.00
0.960        1.00
0.970        1.00
0.975        1.00
0.980        1.01
0.990        1.09
1.000    50708.00
Name: RevolvingUtilizationOfUnsecuredLines, dtype: float64
In [102]:
#Clean the variable, and create a new variable by removing all the issues
#If utilization is more than 1, it can be replaced with the median
median_util=loans['RevolvingUtilizationOfUnsecuredLines'].median()
median_util
Out[102]:
0.154180737
In [103]:
loans['util_new']=loans['RevolvingUtilizationOfUnsecuredLines']
# Use .loc for the conditional assignment to avoid chained-assignment issues
loans.loc[loans['util_new']>1, 'util_new']=median_util
loans['util_new']
Out[103]:
0         0.766127
1         0.957151
2         0.658180
3         0.233810
4         0.907239
5         0.213179
6         0.305682
7         0.754464
8         0.116951
9         0.189169
10        0.644226
11        0.018798
12        0.010352
13        0.964673
14        0.019657
15        0.548458
16        0.061086
17        0.166284
18        0.221813
19        0.602794
20        0.200923
21        0.025656
22        1.000000
23        0.075427
24        0.046560
25        0.392248
26        0.052436
27        0.034421
28        0.452516
29        0.392995
            ...   
149970    0.025449
149971    0.058001
149972    0.071273
149973    0.154181
149974    0.962721
149975    0.022088
149976    0.000627
149977    0.236450
149978    0.917635
149979    0.224711
149980    0.067644
149981    0.810012
149982    0.021046
149983    0.002485
149984    0.037548
149985    0.954409
149986    0.168102
149987    1.000000
149988    0.902051
149989    0.013356
149990    0.055518
149991    0.104112
149992    0.871976
149993    1.000000
149994    0.385742
149995    0.040674
149996    0.299745
149997    0.246044
149998    0.000000
149999    0.850283
Name: util_new, Length: 150000, dtype: float64
In [104]:
# percentile distribution for new variable
util_percentiles1=loans['util_new'].quantile([0.05, 0.1, 0.25, 0.5, 0.75, 0.80, 0.9,0.91,0.95,0.96,0.97,0.975,0.98,0.99,1])
round(util_percentiles1,2)
Out[104]:
0.050    0.00
0.100    0.00
0.250    0.03
0.500    0.15
0.750    0.51
0.800    0.63
0.900    0.93
0.910    0.95
0.950    1.00
0.960    1.00
0.970    1.00
0.975    1.00
0.980    1.00
0.990    1.00
1.000    1.00
Name: util_new, dtype: float64
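Replacing values above 1 with the median is one reasonable choice; capping (winsorising) at 1 is a common alternative. A hedged sketch of that alternative (the column name util_capped is illustrative, not part of the lab):

In [ ]:
# Alternative treatment: cap utilization at 1 instead of replacing with the median
loans['util_capped'] = loans['RevolvingUtilizationOfUnsecuredLines'].clip(upper=1)
loans['util_capped'].describe()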

Code: Data Cleaning Scenario-2

In [ ]:
# What is the issue with NumberOfTime30-59DaysPastDueNotWorse?
# Draw a frequency table
# What percent of the values are erroneous?
# Clean the variable: look at the cross tab of the variable vs the target and impute based on the target
# Create a frequency table for the cleaned variable
In [105]:
# LAB: Data Cleaning Scenario-2
#What is the issue with NumberOfTime30-59DaysPastDueNotWorse?
#Draw a frequency table
freq_table_30dpd=loans['NumberOfTime30-59DaysPastDueNotWorse'].value_counts(sort=False)
freq_table_30dpd
Out[105]:
0     126018
1      16033
2       4598
3       1754
4        747
5        342
6        140
7         54
8         25
9         12
10         4
11         1
12         2
13         1
96         5
98       264
Name: NumberOfTime30-59DaysPastDueNotWorse, dtype: int64
In [106]:
#A 30-59 days past due count cannot exceed 24 in the last 24 months, so 96 and 98 are special codes
#What percent of the values are erroneous?
freq_table_30dpd.iloc[13:]
freq_table_30dpd.iloc[13:].sum()/freq_table_30dpd.sum()
Out[106]:
0.0018
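The same percentage can also be computed with a boolean mask, which avoids positional slicing of the frequency table; the mean of a boolean Series is the proportion of True values. A minimal sketch:

In [ ]:
# Proportion of records where the past-due count exceeds the 24-month maximum
(loans['NumberOfTime30-59DaysPastDueNotWorse'] > 24).mean()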
In [107]:
#Clean the variable: look at the cross tab of the variable vs the target and impute based on the target
#Cross tab with target
import pandas as pd
cross_tab_30dpd_target=pd.crosstab(loans['NumberOfTime30-59DaysPastDueNotWorse'],loans['SeriousDlqin2yrs'])
cross_tab_30dpd_target
Out[107]:
SeriousDlqin2yrs 0 1
NumberOfTime30-59DaysPastDueNotWorse
0 120977 5041
1 13624 2409
2 3379 1219
3 1136 618
4 429 318
5 188 154
6 66 74
7 26 28
8 17 8
9 8 4
10 1 3
11 0 1
12 1 1
13 0 1
96 1 4
98 121 143
In [108]:
#Cross tab row Percentages
cross_tab_30dpd_target_percent=cross_tab_30dpd_target.astype(float).div(cross_tab_30dpd_target.sum(axis=1), axis=0)
round(cross_tab_30dpd_target_percent,2)
Out[108]:
SeriousDlqin2yrs 0 1
NumberOfTime30-59DaysPastDueNotWorse
0 0.96 0.04
1 0.85 0.15
2 0.73 0.27
3 0.65 0.35
4 0.57 0.43
5 0.55 0.45
6 0.47 0.53
7 0.48 0.52
8 0.68 0.32
9 0.67 0.33
10 0.25 0.75
11 0.00 1.00
12 0.50 0.50
13 0.00 1.00
96 0.20 0.80
98 0.46 0.54
In [109]:
loans['num_30_59_dpd_new']=loans['NumberOfTime30-59DaysPastDueNotWorse']
# The special codes (96, 98) show bad rates comparable to the high-delinquency buckets,
# so replace everything above 12 with 6; use .loc to avoid chained assignment
loans.loc[loans['num_30_59_dpd_new']>12, 'num_30_59_dpd_new']=6
loans['num_30_59_dpd_new']
Out[109]:
0         2
1         0
2         1
3         0
4         1
5         0
6         0
7         0
8         0
9         0
10        0
11        0
12        0
13        3
14        0
15        0
16        0
17        0
18        0
19        0
20        0
21        0
22        0
23        0
24        0
25        0
26        0
27        0
28        0
29        2
         ..
149970    0
149971    0
149972    0
149973    0
149974    2
149975    0
149976    0
149977    0
149978    2
149979    0
149980    0
149981    0
149982    0
149983    0
149984    0
149985    0
149986    0
149987    0
149988    1
149989    0
149990    0
149991    0
149992    0
149993    0
149994    0
149995    0
149996    0
149997    0
149998    0
149999    0
Name: num_30_59_dpd_new, Length: 150000, dtype: int64
In [110]:
loans['num_30_59_dpd_new'].value_counts(sort=False)
Out[110]:
0     126018
1      16033
2       4598
3       1754
4        747
5        342
6        410
7         54
8         25
9         12
10         4
11         1
12         2
Name: num_30_59_dpd_new, dtype: int64
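An alternative to the greater-than-12 rule is to map only the special codes explicitly, which makes the intent clearer since 96 and 98 are the coded values. A hedged sketch (the column name num_30_59_dpd_alt is illustrative):

In [ ]:
# Alternative: map only the special codes 96 and 98 to the chosen bucket
loans['num_30_59_dpd_alt'] = loans['NumberOfTime30-59DaysPastDueNotWorse'].replace({96: 6, 98: 6})
loans['num_30_59_dpd_alt'].value_counts().sort_index()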

LAB: Monthly Income

In [ ]:
# Find the missing value percentage in monthly income
# Create an indicator variable for missing and non-missing
# Replace the missing values with median
In [111]:
# Data Cleaning Scenario-3
#Find the missing value percentage in monthly income
loans['MonthlyIncome'].isnull().sum()
loans['MonthlyIncome'].isnull().sum()/len(loans)
Out[111]:
0.19820666666666667
In [115]:
#Once we know where the missing values are, the next task is usually to fill them (data imputation).
#In this case, we first create a missing-value indicator and then, in the next cell, assign the median to the missing positions:

loans['MonthlyIncome_ind']=1
# Use .loc for the conditional assignment to avoid chained-assignment issues
loans.loc[loans['MonthlyIncome'].isnull(), 'MonthlyIncome_ind']=0
loans['MonthlyIncome_ind'].value_counts(sort=False)
Out[115]:
0     29731
1    120269
Name: MonthlyIncome_ind, dtype: int64
In [116]:
loans['MonthlyIncome_new']=loans['MonthlyIncome']
loans.loc[loans['MonthlyIncome'].isnull(), 'MonthlyIncome_new']=loans['MonthlyIncome'].median()
round(loans['MonthlyIncome_new'].describe())
Out[116]:
count     150000.0
mean        6418.0
std        12890.0
min            0.0
25%         3903.0
50%         5400.0
75%         7400.0
max      3008750.0
Name: MonthlyIncome_new, dtype: float64
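The same median imputation can be written in one step with fillna(), which avoids the conditional assignment altogether. A minimal sketch of the equivalent call:

In [ ]:
# Equivalent median imputation using fillna()
loans['MonthlyIncome_new'] = loans['MonthlyIncome'].fillna(loans['MonthlyIncome'].median())
loans['MonthlyIncome_new'].isnull().sum()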
In [119]:
#Remaining Variables Imputation

# DebtRatio: needs imputation
# NumberOfOpenCreditLinesAndLoans: no issues in this variable
# NumberOfTimes90DaysLate: impute similar to NumberOfTime30-59DaysPastDueNotWorse
# NumberRealEstateLoansOrLines: no issues in this variable
# NumberOfTime60-89DaysPastDueNotWorse: impute similar to NumberOfTime30-59DaysPastDueNotWorse
# NumberOfDependents: impute based on the target variable (a sketch follows this cell)
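A hedged sketch of how the remaining variables could be treated along the lines listed above. The new column names and the simple imputation choices (mapping the 96/98 codes, filling NumberOfDependents with its most frequent value) are illustrative assumptions; a target-based imputation, as suggested above, may be preferable.

In [ ]:
# Illustrative treatment of the remaining variables (column names and choices are assumptions)
loans['num_90_dpd_new'] = loans['NumberOfTimes90DaysLate'].replace({96: 6, 98: 6})
loans['num_60_89_dpd_new'] = loans['NumberOfTime60-89DaysPastDueNotWorse'].replace({96: 6, 98: 6})
loans['dependents_new'] = loans['NumberOfDependents'].fillna(loans['NumberOfDependents'].mode()[0])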
In [120]:
#Conclusion
#Data cleaning needs patience; each variable has to be examined and cleaned individually.
