You can download the datasets and R code file for this session here.
Contents
- Raw Data – issues
- Data Exploration
- Data Validation
- Data Sanitization techniques
Raw Data – Issues
Raw data is dirty. Typical issues (illustrated in the small example below):
- Wrong formats: expenses is read as a date
- Missing values: income is missing for some records
- Outliers: number of loans is 25,000
- Erroneous values: age is less than 0
- Default values: account tenure is 999999
- Inconsistent values: age is 25 but year of birth is 1970
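Here is a tiny, purely illustrative data frame (hypothetical values) that exhibits each of these issues; running str() and summary() on it already reveals most of them:

```r
# Hypothetical toy data showing typical raw-data problems
raw <- data.frame(
  cust_id   = 1:5,
  expenses  = c("2010-05-01", "1200", "900", "1500", "800"),  # wrong format: numeric read as text/date
  income    = c(3000, NA, 4500, NA, 5200),                    # missing values
  num_loans = c(2, 1, 25000, 3, 4),                           # outlier
  age       = c(25, -3, 41, 32, 60),                          # erroneous value: negative age
  tenure    = c(12, 999999, 36, 24, 18),                      # default/sentinel value
  yob       = c(1970, 1992, 1981, 1990, 1962)                 # inconsistent: age 25 vs year of birth 1970
)

str(raw)      # shows expenses is not numeric
summary(raw)  # shows the NA count, the negative age and the extreme values
```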
Preparing data for analysis
- We can’t directly start the analysis and model building with raw data.
- Before getting to the core analysis and strategy building, it is very important to:
- Explore the data
- Validate the data
- And finally clean the data and prepare it for analysis
Data Exploration : Case Study
Give me some credit data
- We will walk through data exploration, validation and cleaning using a case study on loans data.
- The "Give Me Some Credit" dataset is loans data; historical data are provided on 150,000 borrowers.
- The final objective is to build a model that borrowers can use to help make the best financial decisions.
- We generally get the data and data dictionary from the data team.
Data Dictionary
No | Variable Name | Short Description | Description | Variable Type |
---|---|---|---|---|
1 | SeriousDlqin2yrs | Target Variable (loan defaulter) | Person experienced 90 days past due delinquency or worse | Y/N |
2 | RevolvingUtilizationOfUnsecuredLines | Credit Utilization | Total balance on credit cards and personal lines of credit except real estate and no installment debt like car loans divided by the sum of credit limits | percentage |
3 | age | Age | Age of borrower in years | integer |
4 | NumberOfTime30-59DaysPastDueNotWorse | One month late frequency | Number of times borrower has been 30-59 days past due but no worse in the last 2 years. | integer |
5 | DebtRatio | Debt to income ratio | Monthly debt payments, alimony, living costs divided by monthly gross income | percentage |
6 | MonthlyIncome | Income | Monthly income | real |
7 | NumberOfOpenCreditLinesAndLoans | Number of loans | Number of Open loans (installment like car loan or mortgage) and Lines of credit (e.g. credit cards) | Integer |
8 | NumberOfTimes90DaysLate | Three months late frequency | Number of times borrower has been 90 days or more past due. | integer |
Steps in Data Exploration and Cleaning
- Step-1: Basic details of the data
- Step-2: Categorical variables exploration
- Step-3: Continuous variables exploration
- Step-4: Missing Values and Outlier Treatment
Step-1: Basic Details of the Data
Check the metadata (metadata is data about data):
- What are the total number of observations and variables?
- Check each field name, field type and field length.
- Are there some unexpected variables, say q9 or r10?
- Are the data types and lengths across variables correct?
- For known variables, is the data type as expected? (For example, if age is in date format, something is suspicious.)

Print the first few records:
- Do we have any unique identifier? Is the unique identifier repeated across different records?
- Do the text variables have meaningful data?
- Are there some coded values in the data?
- Do all the variables appear to have data?
- Are there any missing values?
LAB : Basic Contents of the Data
- Import “Give me some Credit-training.csv”
- How many rows and columns are there?
- Are there any suspicious variables?
- Are all the variable names correct?
- Display the variable formats.
- Print the first 10 observations.
- Do we have any unique identifier?
- Do the text and numeric variables have meaningful data?
- Are there some coded values in the data?
- Do all the variables appear to have data?
Solution 1. Import “Give me some Credit-training.csv”
loans<- read.csv("D:/Dv Analytics/Datasets/Give me some Credit/cs-training.csv")
View(loans)
- How many rows and columns are there?
dim(loans)
## [1] 150000 12
- Are there any suspicious variables?
names(loans)
## [1] "Sr_No"
## [2] "SeriousDlqin2yrs"
## [3] "RevolvingUtilizationOfUnsecuredLines"
## [4] "age"
## [5] "NumberOfTime30.59DaysPastDueNotWorse"
## [6] "DebtRatio"
## [7] "MonthlyIncome"
## [8] "NumberOfOpenCreditLinesAndLoans"
## [9] "NumberOfTimes90DaysLate"
## [10] "NumberRealEstateLoansOrLines"
## [11] "NumberOfTime60.89DaysPastDueNotWorse"
## [12] "NumberOfDependents"
- Are all the variable names correct?
names(loans)
## [1] "Sr_No"
## [2] "SeriousDlqin2yrs"
## [3] "RevolvingUtilizationOfUnsecuredLines"
## [4] "age"
## [5] "NumberOfTime30.59DaysPastDueNotWorse"
## [6] "DebtRatio"
## [7] "MonthlyIncome"
## [8] "NumberOfOpenCreditLinesAndLoans"
## [9] "NumberOfTimes90DaysLate"
## [10] "NumberRealEstateLoansOrLines"
## [11] "NumberOfTime60.89DaysPastDueNotWorse"
## [12] "NumberOfDependents"
- Display the variable formats
str(loans)
## 'data.frame': 150000 obs. of 12 variables:
## $ Sr_No : int 1 2 3 4 5 6 7 8 9 10 ...
## $ SeriousDlqin2yrs : int 1 0 0 0 0 0 0 0 0 0 ...
## $ RevolvingUtilizationOfUnsecuredLines: num 0.766 0.957 0.658 0.234 0.907 ...
## $ age : int 45 40 38 30 49 74 57 39 27 57 ...
## $ NumberOfTime30.59DaysPastDueNotWorse: int 2 0 1 0 1 0 0 0 0 0 ...
## $ DebtRatio : num 0.803 0.1219 0.0851 0.036 0.0249 ...
## $ MonthlyIncome : int 9120 2600 3042 3300 63588 3500 NA 3500 NA 23684 ...
## $ NumberOfOpenCreditLinesAndLoans : int 13 4 2 5 7 3 8 8 2 9 ...
## $ NumberOfTimes90DaysLate : int 0 0 1 0 0 0 0 0 0 0 ...
## $ NumberRealEstateLoansOrLines : int 6 0 0 0 1 1 3 0 0 4 ...
## $ NumberOfTime60.89DaysPastDueNotWorse: int 0 0 0 0 0 0 0 0 0 0 ...
## $ NumberOfDependents : int 2 1 0 0 0 1 0 0 NA 2 ...
- Print the first 10 observations
head(loans, n=10)
## Sr_No SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age
## 1 1 1 0.7661266 45
## 2 2 0 0.9571510 40
## 3 3 0 0.6581801 38
## 4 4 0 0.2338098 30
## 5 5 0 0.9072394 49
## 6 6 0 0.2131787 74
## 7 7 0 0.3056825 57
## 8 8 0 0.7544636 39
## 9 9 0 0.1169506 27
## 10 10 0 0.1891691 57
## NumberOfTime30.59DaysPastDueNotWorse DebtRatio MonthlyIncome
## 1 2 8.029821e-01 9120
## 2 0 1.218762e-01 2600
## 3 1 8.511338e-02 3042
## 4 0 3.604968e-02 3300
## 5 1 2.492570e-02 63588
## 6 0 3.756070e-01 3500
## 7 0 5.710000e+03 NA
## 8 0 2.099400e-01 3500
## 9 0 4.600000e+01 NA
## 10 0 6.062909e-01 23684
## NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate
## 1 13 0
## 2 4 0
## 3 2 1
## 4 5 0
## 5 7 0
## 6 3 0
## 7 8 0
## 8 8 0
## 9 2 0
## 10 9 0
## NumberRealEstateLoansOrLines NumberOfTime60.89DaysPastDueNotWorse
## 1 6 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 1 0
## 6 1 0
## 7 3 0
## 8 0 0
## 9 0 0
## 10 4 0
## NumberOfDependents
## 1 2
## 2 1
## 3 0
## 4 0
## 5 0
## 6 1
## 7 0
## 8 0
## 9 NA
## 10 2
- Do we have any unique identifier?
names(loans)
## [1] "Sr_No"
## [2] "SeriousDlqin2yrs"
## [3] "RevolvingUtilizationOfUnsecuredLines"
## [4] "age"
## [5] "NumberOfTime30.59DaysPastDueNotWorse"
## [6] "DebtRatio"
## [7] "MonthlyIncome"
## [8] "NumberOfOpenCreditLinesAndLoans"
## [9] "NumberOfTimes90DaysLate"
## [10] "NumberRealEstateLoansOrLines"
## [11] "NumberOfTime60.89DaysPastDueNotWorse"
## [12] "NumberOfDependents"
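Sr_No looks like a serial number. A quick way to confirm that it is a genuine unique identifier (a minimal check on the loans data frame loaded above):

```r
# TRUE when every Sr_No value is distinct
length(unique(loans$Sr_No)) == nrow(loans)

# anyDuplicated() returns 0 when there are no duplicate values
anyDuplicated(loans$Sr_No)
```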
Step-2: Categorical Variables Exploration
The frequency table and summary:
- Calculate frequency counts and cross-tabulation frequencies, especially for categorical, discrete and class fields.
- Frequencies help us understand a variable by looking at the values it takes and the data count at each value.
- They also help us analyze relationships between variables by looking at cross-tab frequencies or at measures of association.
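Rather than calling table() on each column separately, the frequencies of several class variables can be produced in one go. A small sketch, assuming we pick the discrete columns by name (the selection below is an assumption based on the variables explored in the LAB that follows):

```r
# Discrete / class variables to tabulate (chosen by inspecting the str() output)
class_vars <- c("SeriousDlqin2yrs",
                "NumberOfTime30.59DaysPastDueNotWorse",
                "NumberOfTimes90DaysLate",
                "NumberOfTime60.89DaysPastDueNotWorse",
                "NumberOfDependents")

# One frequency table per variable; useNA = "ifany" also counts missing values
lapply(loans[class_vars], table, useNA = "ifany")
```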
Check Points
- Are values as expected?
- Variable understanding: distinct values of a particular variable, missing percentages.
- Are there any extreme values or outliers?
- Is there any possibility of creating a new variable with a smaller number of distinct categories by clubbing certain categories together?
LAB: Frequencies
- What are the categorical and discrete variables? What are the continuous variables?
- Find the frequencies of all class variables in the data?
- Are there any variables with missing values?
- Are there any default values?
- Can you identify the variables with outliers?
- Are there any variables with other issues?
Solution 1. What are the categorical and discrete variables? What are the continuous variables?
str(loans)
## 'data.frame': 150000 obs. of 12 variables:
## $ Sr_No : int 1 2 3 4 5 6 7 8 9 10 ...
## $ SeriousDlqin2yrs : int 1 0 0 0 0 0 0 0 0 0 ...
## $ RevolvingUtilizationOfUnsecuredLines: num 0.766 0.957 0.658 0.234 0.907 ...
## $ age : int 45 40 38 30 49 74 57 39 27 57 ...
## $ NumberOfTime30.59DaysPastDueNotWorse: int 2 0 1 0 1 0 0 0 0 0 ...
## $ DebtRatio : num 0.803 0.1219 0.0851 0.036 0.0249 ...
## $ MonthlyIncome : int 9120 2600 3042 3300 63588 3500 NA 3500 NA 23684 ...
## $ NumberOfOpenCreditLinesAndLoans : int 13 4 2 5 7 3 8 8 2 9 ...
## $ NumberOfTimes90DaysLate : int 0 0 1 0 0 0 0 0 0 0 ...
## $ NumberRealEstateLoansOrLines : int 6 0 0 0 1 1 3 0 0 4 ...
## $ NumberOfTime60.89DaysPastDueNotWorse: int 0 0 0 0 0 0 0 0 0 0 ...
## $ NumberOfDependents : int 2 1 0 0 0 1 0 0 NA 2 ...
head(loans)
## Sr_No SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age
## 1 1 1 0.7661266 45
## 2 2 0 0.9571510 40
## 3 3 0 0.6581801 38
## 4 4 0 0.2338098 30
## 5 5 0 0.9072394 49
## 6 6 0 0.2131787 74
## NumberOfTime30.59DaysPastDueNotWorse DebtRatio MonthlyIncome
## 1 2 0.80298213 9120
## 2 0 0.12187620 2600
## 3 1 0.08511338 3042
## 4 0 0.03604968 3300
## 5 1 0.02492570 63588
## 6 0 0.37560697 3500
## NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate
## 1 13 0
## 2 4 0
## 3 2 1
## 4 5 0
## 5 7 0
## 6 3 0
## NumberRealEstateLoansOrLines NumberOfTime60.89DaysPastDueNotWorse
## 1 6 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 1 0
## 6 1 0
## NumberOfDependents
## 1 2
## 2 1
## 3 0
## 4 0
## 5 0
## 6 1
- Find the frequencies of all class variables in the data. Using the str() function we can see the data type of each variable; table() then gives the frequencies.
names(loans)
## [1] "Sr_No"
## [2] "SeriousDlqin2yrs"
## [3] "RevolvingUtilizationOfUnsecuredLines"
## [4] "age"
## [5] "NumberOfTime30.59DaysPastDueNotWorse"
## [6] "DebtRatio"
## [7] "MonthlyIncome"
## [8] "NumberOfOpenCreditLinesAndLoans"
## [9] "NumberOfTimes90DaysLate"
## [10] "NumberRealEstateLoansOrLines"
## [11] "NumberOfTime60.89DaysPastDueNotWorse"
## [12] "NumberOfDependents"
table(loans$SeriousDlqin2yrs)
##
## 0 1
## 139974 10026
table(loans$age)
##
## 0 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## 1 183 434 641 816 953 1193 1338 1560 1702 1937 2038 2050 2239 2155
## 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
## 2246 2379 2521 2631 2987 3093 3122 3082 3208 3294 3502 3714 3719 3806 3837
## 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
## 3753 3627 3609 3648 3561 3416 3589 3375 3443 3280 3258 3522 3568 3719 3058
## 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
## 2594 2494 2503 2235 1954 1777 1646 1649 1520 1451 1241 1183 1099 1054 981
## 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94
## 876 774 647 512 480 483 407 357 313 276 198 154 93 87 47
## 95 96 97 98 99 101 102 103 105 107 109
## 45 18 17 6 9 3 3 3 1 1 2
table(loans$NumberOfTime30.59DaysPastDueNotWorse)
##
## 0 1 2 3 4 5 6 7 8 9
## 126018 16033 4598 1754 747 342 140 54 25 12
## 10 11 12 13 96 98
## 4 1 2 1 5 264
table(loans$NumberOfOpenCreditLinesAndLoans)
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 1888 4438 6666 9058 11609 12931 13614 13245 12562 11355 9624 8321
## 12 13 14 15 16 17 18 19 20 21 22 23
## 7005 5667 4546 3645 3000 2370 1874 1433 1169 864 685 533
## 24 25 26 27 28 29 30 31 32 33 34 35
## 422 337 239 194 150 114 88 74 52 47 35 27
## 36 37 38 39 40 41 42 43 44 45 46 47
## 18 7 13 9 10 4 8 8 2 8 3 2
## 48 49 50 51 52 53 54 56 57 58
## 6 4 2 2 3 1 4 2 2 1
table(loans$NumberOfTimes90DaysLate)
##
## 0 1 2 3 4 5 6 7 8 9
## 141662 5243 1555 667 291 131 80 38 21 19
## 10 11 12 13 14 15 17 96 98
## 8 5 2 4 2 2 1 5 264
table(loans$NumberRealEstateLoansOrLines)
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 56188 52338 31522 6300 2170 689 320 171 93 78 37 23
## 12 13 14 15 16 17 18 19 20 21 23 25
## 18 15 7 7 4 4 2 2 2 1 2 3
## 26 29 32 54
## 1 1 1 1
table(loans$NumberOfTime60.89DaysPastDueNotWorse)
##
## 0 1 2 3 4 5 6 7 8 9
## 142396 5731 1118 318 105 34 16 9 2 1
## 11 96 98
## 1 5 264
table(loans$NumberOfDependents)
##
## 0 1 2 3 4 5 6 7 8 9 10 13
## 86902 26316 19522 9483 2862 746 158 51 24 5 5 1
## 20
## 1
- Are there any variables with missing values? We can use the summary() function to find missing values in the dataset.
summary(loans)
## Sr_No SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines
## Min. : 1 Min. :0.00000 Min. : 0.00
## 1st Qu.: 37501 1st Qu.:0.00000 1st Qu.: 0.03
## Median : 75001 Median :0.00000 Median : 0.15
## Mean : 75001 Mean :0.06684 Mean : 6.05
## 3rd Qu.:112500 3rd Qu.:0.00000 3rd Qu.: 0.56
## Max. :150000 Max. :1.00000 Max. :50708.00
##
## age NumberOfTime30.59DaysPastDueNotWorse DebtRatio
## Min. : 0.0 Min. : 0.000 Min. : 0.0
## 1st Qu.: 41.0 1st Qu.: 0.000 1st Qu.: 0.2
## Median : 52.0 Median : 0.000 Median : 0.4
## Mean : 52.3 Mean : 0.421 Mean : 353.0
## 3rd Qu.: 63.0 3rd Qu.: 0.000 3rd Qu.: 0.9
## Max. :109.0 Max. :98.000 Max. :329664.0
##
## MonthlyIncome NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate
## Min. : 0 Min. : 0.000 Min. : 0.000
## 1st Qu.: 3400 1st Qu.: 5.000 1st Qu.: 0.000
## Median : 5400 Median : 8.000 Median : 0.000
## Mean : 6670 Mean : 8.453 Mean : 0.266
## 3rd Qu.: 8249 3rd Qu.:11.000 3rd Qu.: 0.000
## Max. :3008750 Max. :58.000 Max. :98.000
## NA's :29731
## NumberRealEstateLoansOrLines NumberOfTime60.89DaysPastDueNotWorse
## Min. : 0.000 Min. : 0.0000
## 1st Qu.: 0.000 1st Qu.: 0.0000
## Median : 1.000 Median : 0.0000
## Mean : 1.018 Mean : 0.2404
## 3rd Qu.: 2.000 3rd Qu.: 0.0000
## Max. :54.000 Max. :98.0000
##
## NumberOfDependents
## Min. : 0.000
## 1st Qu.: 0.000
## Median : 0.000
## Mean : 0.757
## 3rd Qu.: 1.000
## Max. :20.000
## NA's :3924
We can see that MonthlyIncome has 29731 missing values and NumberOfDependents has 3924 missing values.
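Instead of scanning the summary() output, the missing counts can be computed for every column at once (a small base-R sketch):

```r
# Number of NA values in each column
colSums(is.na(loans))

# The same counts expressed as a percentage of all rows
round(100 * colSums(is.na(loans)) / nrow(loans), 2)
```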
- Are there any default values? The values 96 and 98 in the past-due count variables look like default/error codes.
- Can you identify the variables with outliers? RevolvingUtilizationOfUnsecuredLines, DebtRatio and MonthlyIncome all have maxima far above their third quartiles.
- Are there any variables with other issues? age has a minimum of 0, which is not a valid borrower age.
Step-3: Continuous Variables Exploration
The summary of continuous variables can be explored using:
- Min, Max, Median, Mean, sd, Var
- Quartiles
- Box plots and identification of outliers
- Percentiles: P1, P5, P10, Q1 (P25), Q3 (P75), P90, P99
Check Points
- Are variable distributions as expected?
- What is the central tendency of the variable? Mean, median and mode across each variable.
- Is the concentration of the variable as expected? What are the quartiles?
- Identify variables which are unary, i.e. sd = 0; these are useless for the current objective (see the sketch after this list).
- Are there any outliers / extreme values for the variable?
- Are the outlier values as expected, or are they abnormally high? For example, for age, if the max and P99 values are 10000, we should investigate whether it is a default value or a data error.
- What is the percentage of missing values for the variable?
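A quick screen for the unary-variable check mentioned above: compute the standard deviation of every numeric column and flag any that are zero (a sketch; in this dataset none are expected to be constant):

```r
# Standard deviation per numeric column; sd = 0 flags a unary (constant) variable
num_cols <- sapply(loans, is.numeric)
sapply(loans[num_cols], sd, na.rm = TRUE)
```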
LAB: Continuous Variables Summary
- List down the continuous variables.
- Find summary statistics for each variable. Min, Max, Median, Mean, sd, Var.
- Find Quartiles for each of the variables.
- Create Box plots and identify outliers.
- Find the percentage of missing values.
- Find the percentiles P1, P5, P10, Q1 (P25), Q3 (P75), P90, P99 and the percentage of outliers, if any.
Solution 1. List down the continuous variables.
str(loans)
## 'data.frame': 150000 obs. of 12 variables:
## $ Sr_No : int 1 2 3 4 5 6 7 8 9 10 ...
## $ SeriousDlqin2yrs : int 1 0 0 0 0 0 0 0 0 0 ...
## $ RevolvingUtilizationOfUnsecuredLines: num 0.766 0.957 0.658 0.234 0.907 ...
## $ age : int 45 40 38 30 49 74 57 39 27 57 ...
## $ NumberOfTime30.59DaysPastDueNotWorse: int 2 0 1 0 1 0 0 0 0 0 ...
## $ DebtRatio : num 0.803 0.1219 0.0851 0.036 0.0249 ...
## $ MonthlyIncome : int 9120 2600 3042 3300 63588 3500 NA 3500 NA 23684 ...
## $ NumberOfOpenCreditLinesAndLoans : int 13 4 2 5 7 3 8 8 2 9 ...
## $ NumberOfTimes90DaysLate : int 0 0 1 0 0 0 0 0 0 0 ...
## $ NumberRealEstateLoansOrLines : int 6 0 0 0 1 1 3 0 0 4 ...
## $ NumberOfTime60.89DaysPastDueNotWorse: int 0 0 0 0 0 0 0 0 0 0 ...
## $ NumberOfDependents : int 2 1 0 0 0 1 0 0 NA 2 ...
head(loans)
## Sr_No SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age
## 1 1 1 0.7661266 45
## 2 2 0 0.9571510 40
## 3 3 0 0.6581801 38
## 4 4 0 0.2338098 30
## 5 5 0 0.9072394 49
## 6 6 0 0.2131787 74
## NumberOfTime30.59DaysPastDueNotWorse DebtRatio MonthlyIncome
## 1 2 0.80298213 9120
## 2 0 0.12187620 2600
## 3 1 0.08511338 3042
## 4 0 0.03604968 3300
## 5 1 0.02492570 63588
## 6 0 0.37560697 3500
## NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate
## 1 13 0
## 2 4 0
## 3 2 1
## 4 5 0
## 5 7 0
## 6 3 0
## NumberRealEstateLoansOrLines NumberOfTime60.89DaysPastDueNotWorse
## 1 6 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 1 0
## 6 1 0
## NumberOfDependents
## 1 2
## 2 1
## 3 0
## 4 0
## 5 0
## 6 1
- Find summary statistics for each variable. Min, Max, Median, Mean, sd, Var.
summary(loans$RevolvingUtilizationOfUnsecuredLines)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.03 0.15 6.05 0.56 50710.00
summary(loans$MonthlyIncome)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 3400 5400 6670 8249 3009000 29731
var(loans$RevolvingUtilizationOfUnsecuredLines)
## [1] 62377.75
var(loans$MonthlyIncome)
## [1] NA
var(loans$MonthlyIncome,na.rm = TRUE)
## [1] 206918852
sd(loans$RevolvingUtilizationOfUnsecuredLines)
## [1] 249.7554
sd(loans$MonthlyIncome)
## [1] NA
sd(loans$MonthlyIncome,na.rm = TRUE)
## [1] 14384.67
To find all of the above-mentioned values in one go, we can use the describe() function from the psych library.
library(psych)
## Warning: package 'psych' was built under R version 3.3.2
describe(loans$RevolvingUtilizationOfUnsecuredLines)
## vars n mean sd median trimmed mad min max range skew
## X1 1 150000 6.05 249.76 0.15 0.27 0.22 0 50708 50708 97.63
## kurtosis se
## X1 14544.03 0.64
describe(loans$MonthlyIncome)
## vars n mean sd median trimmed mad min max range
## X1 1 120269 6670.22 14384.67 5400 5787.56 3435.18 0 3008750 3008750
## skew kurtosis se
## X1 114.04 19503.57 41.48
- Find Quartiles for each of the variables
summary(loans$RevolvingUtilizationOfUnsecuredLines)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.03 0.15 6.05 0.56 50710.00
summary(loans$MonthlyIncome)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 3400 5400 6670 8249 3009000 29731
- Create Box plots and identify outliers.
boxplot(loans$RevolvingUtilizationOfUnsecuredLines)
boxplot(loans$MonthlyIncome)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(loans, aes(y=RevolvingUtilizationOfUnsecuredLines, x = 1)) + geom_boxplot()
ggplot(loans, aes(y=MonthlyIncome, x = 1)) + geom_boxplot()
## Warning: Removed 29731 rows containing non-finite values (stat_boxplot).
- Find the percentage of missing values.
sum(is.na(loans$MonthlyIncome))
## [1] 29731
sum(is.na(loans$MonthlyIncome))/nrow(loans)
## [1] 0.1982067
- Find the percentiles P1, P5, P10, Q1 (P25), Q3 (P75), P90, P99 and the percentage of outliers, if any.
util_percentiles<-round(quantile(loans$RevolvingUtilizationOfUnsecuredLines,c(0.05, 0.1, 0.25, 0.5, 0.75, 0.80, 0.9,0.91,0.95,0.96,0.97,0.975,0.98,0.99,1)),2)
util_percentiles
## 5% 10% 25% 50% 75% 80% 90% 91%
## 0.00 0.00 0.03 0.15 0.56 0.70 0.98 1.00
## 95% 96% 97% 97.5% 98% 99% 100%
## 1.00 1.00 1.00 1.00 1.01 1.09 50708.00
We can say that roughly the last 1% of values (those above 1) are outliers.
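The rounded percentile table shows utilization crossing 1 somewhere near the top of the distribution. To get the exact share of records above 1, we can compute it directly (a quick check on the same variable):

```r
# Proportion of utilization values greater than 1
mean(loans$RevolvingUtilizationOfUnsecuredLines > 1)
```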
Data Cleaning
- Some variables contain outliers.
- Some variables have default values.
- Some variables have missing values.
- These variables contain outliers:
- RevolvingUtilizationOfUnsecuredLines
- NumberOfTime30_59DaysPastDueNotWorse
- These Variables have missing values:
- MonthlyIncome
- NumberOfDependents
Shall we delete them and go ahead with our analysis? – Not really.
Missing Values & Outliers
- Data is not always available; e.g., many tuples have no recorded value for several attributes, such as customer income in sales data.
- Missing data may be due to:
- Equipment malfunction.
- Data being inconsistent with other recorded data and thus deleted.
- Data not entered due to misunderstanding.
- Certain data not being considered important at the time of entry.
- History or changes of the data not being registered.
- Missing data may need to be inferred.
- Missing data can affect values, attributes, entire records or entire sections.
- Missing values and default values are often indistinguishable.
Imputation
Missing Value Imputation : 1
- Standalone imputation
- Mean, median, other point estimates
- Convenient, easy to implement
- Assumption: the distribution of the missing values is the same as that of the non-missing values.
- Does not take inter-relationships between variables into account.
- E.g.: the average of the available values is 11.4. Can we simply replace the missing value with 11.4? (A sketch follows.)
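A minimal sketch of standalone (point-estimate) imputation on a toy vector; the numbers are hypothetical, chosen so that the mean of the available values is 11.4 as in the example above:

```r
# Toy column with one missing value (hypothetical numbers)
x <- c(10, 12, 11, NA, 12.6)

mean(x, na.rm = TRUE)          # 11.4, the mean of the available values

# Replace the missing entry with that mean (median imputation works the same way)
x_imputed <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)
x_imputed
```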
Missing Value Imputation : 2
- Use attribute relationships
- Better imputation
- Two techniques
- Propensity score (nonparametric). Useful for discrete variables.
- Regression (parametric)
- There are two missing values in x2. What are the most appropriate replacements?
Missing Value Imputation : 3
- There are two missing values in x2. Find the most appropriate replacements (a regression-based sketch follows).
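The original x1/x2 table is not reproduced here, so the sketch below uses a hypothetical two-column data frame to illustrate regression (parametric) imputation: fit a model of x2 on x1 using the complete rows, then predict x2 where it is missing.

```r
# Hypothetical data: x2 is roughly twice x1, with two missing entries
d <- data.frame(
  x1 = c(1, 2, 3, 4, 5, 6),
  x2 = c(2.1, 4.0, NA, 8.2, NA, 11.9)
)

# lm() drops the incomplete rows by default, so the fit uses only observed x2
fit <- lm(x2 ~ x1, data = d)

# Predict x2 for the rows where it is missing and fill them in
missing_rows <- is.na(d$x2)
d$x2[missing_rows] <- predict(fit, newdata = d[missing_rows, ])
d
```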
Missing Value Imputation : 4
- What if more than 50% are missing?
- It doesn't make sense to carry out the analysis on the 20% or 30% of the data that remains and draw inferences about the overall data.
- The best approach is to ignore the actual values and use an available / not-available indicator instead.
Step-4: Missing Values and Outlier Treatment
Data Cleaning Scenario-1
RevolvingUtilizationOfUnsecuredLines
- RevolvingUtilizationOfUnsecuredLines has outliers.
- What type of variable is this? What are the possible values?
- Its mean is 6.05, which is greater than 1, so the variable has some faulty values; its maximum value of 50710 is way too high.
- Let's look at percentiles to see from where it starts exceeding 1.
We can check this condition against the outlier treatment approach described above; our cleaning path is:
Data Cleaning
- RevolvingUtilizationOfUnsecuredLines has outliers.
- Since the outlier percentage is small (less than 10%), we will replace the outliers with a typical value (the median) of the remaining data.
- Outliers are the values greater than 1.
LAB: Data Cleaning Scenario-1
- What percentage of values are missing in RevolvingUtilizationOfUnsecuredLines?
- Get the detailed percentile distribution.
- Clean the variable, and create a new variable by removing all the issues.
Solution – What percentage of values are missing in RevolvingUtilizationOfUnsecuredLines? (The earlier summary showed no NA's for this variable; the boxplot below highlights its outliers instead.)
library(ggplot2)
ggplot(loans, aes(y=RevolvingUtilizationOfUnsecuredLines, x = 1)) + geom_boxplot()
- Get the detailed percentile distribution.
util_percentiles<-round(quantile(loans$RevolvingUtilizationOfUnsecuredLines,c(0.05, 0.1, 0.25, 0.5, 0.75, 0.80, 0.9,0.91,0.95,0.96,0.97,0.975,0.98,0.99,0.997,1)),2)
util_percentiles
## 5% 10% 25% 50% 75% 80% 90% 91%
## 0.00 0.00 0.03 0.15 0.56 0.70 0.98 1.00
## 95% 96% 97% 97.5% 98% 99% 99.7% 100%
## 1.00 1.00 1.00 1.00 1.01 1.09 1.76 50708.00
- Clean the variable, and create a new variable by removing all the issues
- If utilization is more than 1, it can be replaced by the median.
median_util<-median(loans$RevolvingUtilizationOfUnsecuredLines)
loans$util_new<-ifelse(loans$RevolvingUtilizationOfUnsecuredLines>1,median_util,loans$RevolvingUtilizationOfUnsecuredLines)
- Box plot of the cleaned variable
ggplot(loans, aes(y=util_new, x = 1)) + geom_boxplot()
- Percentile distribution of the new variable
util_percentiles1<-round(quantile(loans$util_new,c(0.05, 0.1, 0.25, 0.5, 0.75, 0.80, 0.9,0.91,0.95,0.96,0.97,0.975,0.98,0.99,1)),2)
util_percentiles1
## 5% 10% 25% 50% 75% 80% 90% 91% 95% 96% 97% 97.5%
## 0.00 0.00 0.03 0.15 0.51 0.63 0.93 0.95 1.00 1.00 1.00 1.00
## 98% 99% 100%
## 1.00 1.00 1.00
Data Cleaning Scenario-2
NumberOfTime30_59DaysPastDueNotWorse
- Find the bad rate in each category of this variable.
- Replace 96 with _____? Replace 98 with_____?
The cleaning path for this variable is to cross-tabulate it with the target and impute the error codes based on the bad rate, as worked out in the LAB below.
LAB: Data Cleaning Scenario-2
- What is the issue with NumberOfTime30_59DaysPastDueNotWorse
- Draw a frequency table.
- What percent of the values are erroneous?
- Clean the variable: look at the cross-tab of the variable vs the target and impute based on the target.
- Create frequency table for cleaned variable.
Solution 1. What is the issue with NumberOfTime30_59DaysPastDueNotWorse?
- Draw a frequency table
freq_table_30dpd<-table(loans$NumberOfTime30.59DaysPastDueNotWorse)
freq_table_30dpd
##
## 0 1 2 3 4 5 6 7 8 9
## 126018 16033 4598 1754 747 342 140 54 25 12
## 10 11 12 13 96 98
## 4 1 2 1 5 264
The one-month-late frequency cannot exceed 24 in the last 24 months, so values like 96 and 98 are clearly erroneous.
- What percent of the values are erroneous?
freq_table_30dpd[14:length(freq_table_30dpd)]
##
## 13 96 98
## 1 5 264
sum(freq_table_30dpd[14:length(freq_table_30dpd)])/sum(freq_table_30dpd)
## [1] 0.0018
- Clean the variable: look at the cross-tab of the variable vs the target and impute based on the target.
#Cross tab with target
cross_tab_30dpd_target<-table(loans$NumberOfTime30.59DaysPastDueNotWorse,loans$SeriousDlqin2yrs)
cross_tab_30dpd_target
##
## 0 1
## 0 120977 5041
## 1 13624 2409
## 2 3379 1219
## 3 1136 618
## 4 429 318
## 5 188 154
## 6 66 74
## 7 26 28
## 8 17 8
## 9 8 4
## 10 1 3
## 11 0 1
## 12 1 1
## 13 0 1
## 96 1 4
## 98 121 143
#Cross tab row Percentages
cross_tab_30dpd_target_percent<-round(prop.table(cross_tab_30dpd_target, 1),2)
cross_tab_30dpd_target_percent
##
## 0 1
## 0 0.96 0.04
## 1 0.85 0.15
## 2 0.73 0.27
## 3 0.65 0.35
## 4 0.57 0.43
## 5 0.55 0.45
## 6 0.47 0.53
## 7 0.48 0.52
## 8 0.68 0.32
## 9 0.67 0.33
## 10 0.25 0.75
## 11 0.00 1.00
## 12 0.50 0.50
## 13 0.00 1.00
## 96 0.20 0.80
## 98 0.46 0.54
data.frame(Freq_0=cross_tab_30dpd_target[,1],Freq_1=cross_tab_30dpd_target[,2],Pert_0=cross_tab_30dpd_target_percent[,1],Pert_1=cross_tab_30dpd_target_percent[,2])
## Freq_0 Freq_1 Pert_0 Pert_1
## 0 120977 5041 0.96 0.04
## 1 13624 2409 0.85 0.15
## 2 3379 1219 0.73 0.27
## 3 1136 618 0.65 0.35
## 4 429 318 0.57 0.43
## 5 188 154 0.55 0.45
## 6 66 74 0.47 0.53
## 7 26 28 0.48 0.52
## 8 17 8 0.68 0.32
## 9 8 4 0.67 0.33
## 10 1 3 0.25 0.75
## 11 0 1 0.00 1.00
## 12 1 1 0.50 0.50
## 13 0 1 0.00 1.00
## 96 1 4 0.20 0.80
## 98 121 143 0.46 0.54
# The 0/1 percentages for 98 are close to those for 6,
# so we replace the error values with 6.
loans$num_30_59_dpd_new<-ifelse(loans$NumberOfTime30.59DaysPastDueNotWorse>12,6,loans$NumberOfTime30.59DaysPastDueNotWorse)
- Create frequency table for cleaned variable.
table(loans$num_30_59_dpd_new)
##
## 0 1 2 3 4 5 6 7 8 9
## 126018 16033 4598 1754 747 342 410 54 25 12
## 10 11 12
## 4 1 2
Data Cleaning Scenario-3
MonthlyIncome
- Monthly Income has nearly 20% missing values.
- Missing value percentage is significant.
- Simply replacing with mean or median is not sufficient.
- We can create an indicator variable to keep track of missing and non-missing values.
The path to clean MonthlyIncome is to create a missing/non-missing indicator variable and then impute the missing values with the median, as in the LAB below.
LAB: Data Cleaning Scenario-3 : Monthly Income
- Find the missing value percentage in monthly income.
- Create an indicator variable for missing and non-missing.
- Replace the missing values with median.
Solution – Find the missing value percentage in monthly income
sum(is.na(loans$MonthlyIncome))
## [1] 29731
sum(is.na(loans$MonthlyIncome))/nrow(loans)
## [1] 0.1982067
- Create an indicator variable for missing and non-missing
loans$MonthlyIncome_ind<-ifelse(is.na(loans$MonthlyIncome), "Missing", "Non-Missing")
table(loans$MonthlyIncome_ind)
##
## Missing Non-Missing
## 29731 120269
- Replace the missing values with median
median_income<-median(loans$MonthlyIncome, na.rm = TRUE)
median_income
## [1] 5400
loans$MonthlyIncome_new<-ifelse(is.na(loans$MonthlyIncome), median_income, loans$MonthlyIncome)
sum(is.na(loans$MonthlyIncome_new))
## [1] 0
Data Cleaning Other Variables
Remaining Variables Imputation
- DebtRatio: imputation.
- NumberOfOpenCreditLinesAndLoans: no issues in this variable.
- NumberOfTimes90DaysLate: imputation similar to NumberOfTime30_59DaysPastDueNotWorse (see the sketch after this list).
- NumberRealEstateLoansOrLines: no issues in this variable.
- NumberOfTime60_89DaysPastDueNotWorse: imputation similar to NumberOfTime30_59DaysPastDueNotWorse.
- NumberOfDependents: impute based on the target variable.
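As an illustration of "imputation similar to NumberOfTime30_59DaysPastDueNotWorse", the sketch below applies the same recipe to the other two past-due counters: inspect the cross-tab with the target, then replace the error codes 96 and 98 with a category whose bad rate is comparable. The replacement value 6 and the new column names are assumptions carried over from the earlier scenario; in practice the value should be read off each variable's own cross-tab.

```r
# Cross-tabs with the target, as row percentages, to pick a sensible replacement category
round(prop.table(table(loans$NumberOfTimes90DaysLate, loans$SeriousDlqin2yrs), 1), 2)
round(prop.table(table(loans$NumberOfTime60.89DaysPastDueNotWorse, loans$SeriousDlqin2yrs), 1), 2)

# Replace the error codes 96 and 98 (assumed replacement value: 6)
loans$num_90_dpd_new    <- ifelse(loans$NumberOfTimes90DaysLate %in% c(96, 98),
                                  6, loans$NumberOfTimes90DaysLate)
loans$num_60_89_dpd_new <- ifelse(loans$NumberOfTime60.89DaysPastDueNotWorse %in% c(96, 98),
                                  6, loans$NumberOfTime60.89DaysPastDueNotWorse)

# Frequency tables for the cleaned variables
table(loans$num_90_dpd_new)
table(loans$num_60_89_dpd_new)
```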
Conclusion
- Data cleaning is as important as data analysis.
- Sometimes 80% of the overall project time is spent on data cleaning.
- Data cleaning needs patience; we need to clean each variable individually.
- Apart from suggested methods, there are many heuristic ways of cleaning the data.