• No products in the cart.

Handout – Data Cleaning and Treatment in R

You can download the datasets and R code file for this session here.

Contents

  • Raw Data – issues
  • Data Exploration
  • Data Validation
  • Data Sensitization techniques

Raw Data – Issues

Raw data is dirty:

  • Wrong formats : expenses is read as date
  • Might have missing values : Income missing for some records
  • Might have outliers : Number of loans is 25000
  • Erroneous values : Age is less than 0
  • Default values : Account tenure is 999999
  • Inconsistent : Age is 25, year of birth is 1970

Preparing data for analysis

  • We can’t directly start the analysis and model building with raw data.
  • Before getting on to core analysis and strategy building it is very important to
    • Explore the data
    • Validate the data
    • And finally clean the data and prepare it for analysis

Data Exploration : Case Study

Give me some credit data

  • We will try to understand the data exploration, validation and data cleaning using a case study on loans data
  • Give me some credit data. It is loans data. Historical data are provided on 150,000 borrowers.
  • The final objective is to build a model that borrowers can use to help make the best financial decisions.
  • We generally get the data and data dictionary from the data team.

Data Dictionary

No Variable Name Short Description Description Varibale Type
1 SeriousDlqin2yrs Target Variable (loan defaulter) Person experienced 90 days past due delinquency or worse Y/N
2 RevolvingUtilizationOfUnsecuredLines Credit Utilization Total balance on credit cards and personal lines of credit except real estate and no installment debt like car loans divided by the sum of credit limits percentage
3 age Age Age of borrower in years integer
4 NumberOfTime30-59DaysPastDueNotWorse One month late frequency Number of times borrower has been 30-59 days past due but no worse in the last 2 years. integer
5 DebtRatio Debt to income ratio Monthly debt payments, alimony,living costs divided by monthy gross income percentage
6 MonthlyIncome Income Monthly income real
7 NumberOfOpenCreditLinesAndLoans Number of loans Number of Open loans (installment like car loan or mortgage) and Lines of credit (e.g. credit cards) Integer
8 NumberOfTimes90DaysLate Three months late frequency Number of times borrower has been 90 days or more past due. integer

Steps in Data Exploration and Cleaning

  • Step-1: Basic details of the data
  • Step-2: Categorical variables exploration
  • Step-3: Continuous variables exploration
  • Step-4: Missing Values and Outlier Treatment

Step-1: Basic Details of the Data

Check the Metadata – Metadata is data about data. – What are total number of observations and variables. – Check each field name, field type, Length of field. – Are there some variables which are unexpected say q9 r10?
– Are the data types and length across variables correct ? – For known variables is the data type as expected (For example if age is in date format something is suspicious)

Print first few records – Do we have any unique identifier? Is the unique identifier getting repeated in different records? – Do the text variables have meaningful data? – Are there some coded values in the data. – Do all the variables appear to have data? – Are there any missing values.

LAB : Basic Contents of the Data

  1. Import “Give me some Credit-training.csv”
  2. What are number of rows and columns?
  3. Are there any suspicious variables?
  4. Are all the variable names correct?
  5. Display the variable formats.
  6. Print the first 10 observations.
  7. Do we have any unique identifier?
  8. Do the text and numeric variables have meaningful data?
  9. Are there some coded values in the data?
  10. Do all the variables appear to have data?

Solution 1. Import “Give me some Credit-training.csv”

loans<- read.csv("D:/Dv Analytics/Datasets/Give me some Credit/cs-training.csv")
View(loans)
  1. What are number of rows and columns?
dim(loans)
## [1] 150000     12
  1. Are there any suspicious variables?
names(loans)
##  [1] "Sr_No"                               
##  [2] "SeriousDlqin2yrs"                    
##  [3] "RevolvingUtilizationOfUnsecuredLines"
##  [4] "age"                                 
##  [5] "NumberOfTime30.59DaysPastDueNotWorse"
##  [6] "DebtRatio"                           
##  [7] "MonthlyIncome"                       
##  [8] "NumberOfOpenCreditLinesAndLoans"     
##  [9] "NumberOfTimes90DaysLate"             
## [10] "NumberRealEstateLoansOrLines"        
## [11] "NumberOfTime60.89DaysPastDueNotWorse"
## [12] "NumberOfDependents"

4.Are all the variable names correct?

names(loans)
##  [1] "Sr_No"                               
##  [2] "SeriousDlqin2yrs"                    
##  [3] "RevolvingUtilizationOfUnsecuredLines"
##  [4] "age"                                 
##  [5] "NumberOfTime30.59DaysPastDueNotWorse"
##  [6] "DebtRatio"                           
##  [7] "MonthlyIncome"                       
##  [8] "NumberOfOpenCreditLinesAndLoans"     
##  [9] "NumberOfTimes90DaysLate"             
## [10] "NumberRealEstateLoansOrLines"        
## [11] "NumberOfTime60.89DaysPastDueNotWorse"
## [12] "NumberOfDependents"
  1. Display the variable formats
str(loans)
## 'data.frame':    150000 obs. of  12 variables:
##  $ Sr_No                               : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ SeriousDlqin2yrs                    : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ RevolvingUtilizationOfUnsecuredLines: num  0.766 0.957 0.658 0.234 0.907 ...
##  $ age                                 : int  45 40 38 30 49 74 57 39 27 57 ...
##  $ NumberOfTime30.59DaysPastDueNotWorse: int  2 0 1 0 1 0 0 0 0 0 ...
##  $ DebtRatio                           : num  0.803 0.1219 0.0851 0.036 0.0249 ...
##  $ MonthlyIncome                       : int  9120 2600 3042 3300 63588 3500 NA 3500 NA 23684 ...
##  $ NumberOfOpenCreditLinesAndLoans     : int  13 4 2 5 7 3 8 8 2 9 ...
##  $ NumberOfTimes90DaysLate             : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ NumberRealEstateLoansOrLines        : int  6 0 0 0 1 1 3 0 0 4 ...
##  $ NumberOfTime60.89DaysPastDueNotWorse: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ NumberOfDependents                  : int  2 1 0 0 0 1 0 0 NA 2 ...
  1. Print the first 10 observations
head(loans, n=10)
##    Sr_No SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age
## 1      1                1                            0.7661266  45
## 2      2                0                            0.9571510  40
## 3      3                0                            0.6581801  38
## 4      4                0                            0.2338098  30
## 5      5                0                            0.9072394  49
## 6      6                0                            0.2131787  74
## 7      7                0                            0.3056825  57
## 8      8                0                            0.7544636  39
## 9      9                0                            0.1169506  27
## 10    10                0                            0.1891691  57
##    NumberOfTime30.59DaysPastDueNotWorse    DebtRatio MonthlyIncome
## 1                                     2 8.029821e-01          9120
## 2                                     0 1.218762e-01          2600
## 3                                     1 8.511338e-02          3042
## 4                                     0 3.604968e-02          3300
## 5                                     1 2.492570e-02         63588
## 6                                     0 3.756070e-01          3500
## 7                                     0 5.710000e+03            NA
## 8                                     0 2.099400e-01          3500
## 9                                     0 4.600000e+01            NA
## 10                                    0 6.062909e-01         23684
##    NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate
## 1                               13                       0
## 2                                4                       0
## 3                                2                       1
## 4                                5                       0
## 5                                7                       0
## 6                                3                       0
## 7                                8                       0
## 8                                8                       0
## 9                                2                       0
## 10                               9                       0
##    NumberRealEstateLoansOrLines NumberOfTime60.89DaysPastDueNotWorse
## 1                             6                                    0
## 2                             0                                    0
## 3                             0                                    0
## 4                             0                                    0
## 5                             1                                    0
## 6                             1                                    0
## 7                             3                                    0
## 8                             0                                    0
## 9                             0                                    0
## 10                            4                                    0
##    NumberOfDependents
## 1                   2
## 2                   1
## 3                   0
## 4                   0
## 5                   0
## 6                   1
## 7                   0
## 8                   0
## 9                  NA
## 10                  2
  1. Do we have any unique identifier?
names(loans)
##  [1] "Sr_No"                               
##  [2] "SeriousDlqin2yrs"                    
##  [3] "RevolvingUtilizationOfUnsecuredLines"
##  [4] "age"                                 
##  [5] "NumberOfTime30.59DaysPastDueNotWorse"
##  [6] "DebtRatio"                           
##  [7] "MonthlyIncome"                       
##  [8] "NumberOfOpenCreditLinesAndLoans"     
##  [9] "NumberOfTimes90DaysLate"             
## [10] "NumberRealEstateLoansOrLines"        
## [11] "NumberOfTime60.89DaysPastDueNotWorse"
## [12] "NumberOfDependents"

Step-2: Categorical Variables Exploration

The Frequency Table and summary – Calculate frequency counts cross-tabulation frequencies for Especially for categorical, discrete & class fields. – Frequencies + help us understanding the variable by looking at the values it’s taking and data count at each value. + They also helps us in analyzing the relationships between variables by looking at the cross tab frequencies or by looking at association.

Check Points

  1. Are values as expected?
  2. Variable understanding : Distinct values of a particular variable, missing percentages.
  3. Are there any extreme values or outliers?
  4. Any possibility of creating a new variable having small number of distinct category by clubbing certain categories with others.

LAB: Frequencies

  1. What are the categorical and discrete variables? What are the continues variables?
  2. Find the frequencies of all class variables in the data?
  3. Are there any variables with missing values?
  4. Are there any default values?
  5. Can you identify the variables with outliers?
  6. Are there any variables with other issues?

Solution 1. What are the categorical and discrete variables? What are the continues variables?

str(loans)
## 'data.frame':    150000 obs. of  12 variables:
##  $ Sr_No                               : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ SeriousDlqin2yrs                    : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ RevolvingUtilizationOfUnsecuredLines: num  0.766 0.957 0.658 0.234 0.907 ...
##  $ age                                 : int  45 40 38 30 49 74 57 39 27 57 ...
##  $ NumberOfTime30.59DaysPastDueNotWorse: int  2 0 1 0 1 0 0 0 0 0 ...
##  $ DebtRatio                           : num  0.803 0.1219 0.0851 0.036 0.0249 ...
##  $ MonthlyIncome                       : int  9120 2600 3042 3300 63588 3500 NA 3500 NA 23684 ...
##  $ NumberOfOpenCreditLinesAndLoans     : int  13 4 2 5 7 3 8 8 2 9 ...
##  $ NumberOfTimes90DaysLate             : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ NumberRealEstateLoansOrLines        : int  6 0 0 0 1 1 3 0 0 4 ...
##  $ NumberOfTime60.89DaysPastDueNotWorse: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ NumberOfDependents                  : int  2 1 0 0 0 1 0 0 NA 2 ...
head(loans)
##   Sr_No SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age
## 1     1                1                            0.7661266  45
## 2     2                0                            0.9571510  40
## 3     3                0                            0.6581801  38
## 4     4                0                            0.2338098  30
## 5     5                0                            0.9072394  49
## 6     6                0                            0.2131787  74
##   NumberOfTime30.59DaysPastDueNotWorse  DebtRatio MonthlyIncome
## 1                                    2 0.80298213          9120
## 2                                    0 0.12187620          2600
## 3                                    1 0.08511338          3042
## 4                                    0 0.03604968          3300
## 5                                    1 0.02492570         63588
## 6                                    0 0.37560697          3500
##   NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate
## 1                              13                       0
## 2                               4                       0
## 3                               2                       1
## 4                               5                       0
## 5                               7                       0
## 6                               3                       0
##   NumberRealEstateLoansOrLines NumberOfTime60.89DaysPastDueNotWorse
## 1                            6                                    0
## 2                            0                                    0
## 3                            0                                    0
## 4                            0                                    0
## 5                            1                                    0
## 6                            1                                    0
##   NumberOfDependents
## 1                  2
## 2                  1
## 3                  0
## 4                  0
## 5                  0
## 6                  1
  1. Find the frequencies of all class variables in the data. Using function str() we can see the datatype of the variable.
names(loans)
##  [1] "Sr_No"                               
##  [2] "SeriousDlqin2yrs"                    
##  [3] "RevolvingUtilizationOfUnsecuredLines"
##  [4] "age"                                 
##  [5] "NumberOfTime30.59DaysPastDueNotWorse"
##  [6] "DebtRatio"                           
##  [7] "MonthlyIncome"                       
##  [8] "NumberOfOpenCreditLinesAndLoans"     
##  [9] "NumberOfTimes90DaysLate"             
## [10] "NumberRealEstateLoansOrLines"        
## [11] "NumberOfTime60.89DaysPastDueNotWorse"
## [12] "NumberOfDependents"
table(loans$SeriousDlqin2yrs)
## 
##      0      1 
## 139974  10026
table(loans$age)
## 
##    0   21   22   23   24   25   26   27   28   29   30   31   32   33   34 
##    1  183  434  641  816  953 1193 1338 1560 1702 1937 2038 2050 2239 2155 
##   35   36   37   38   39   40   41   42   43   44   45   46   47   48   49 
## 2246 2379 2521 2631 2987 3093 3122 3082 3208 3294 3502 3714 3719 3806 3837 
##   50   51   52   53   54   55   56   57   58   59   60   61   62   63   64 
## 3753 3627 3609 3648 3561 3416 3589 3375 3443 3280 3258 3522 3568 3719 3058 
##   65   66   67   68   69   70   71   72   73   74   75   76   77   78   79 
## 2594 2494 2503 2235 1954 1777 1646 1649 1520 1451 1241 1183 1099 1054  981 
##   80   81   82   83   84   85   86   87   88   89   90   91   92   93   94 
##  876  774  647  512  480  483  407  357  313  276  198  154   93   87   47 
##   95   96   97   98   99  101  102  103  105  107  109 
##   45   18   17    6    9    3    3    3    1    1    2
table(loans$NumberOfTime30.59DaysPastDueNotWorse)
## 
##      0      1      2      3      4      5      6      7      8      9 
## 126018  16033   4598   1754    747    342    140     54     25     12 
##     10     11     12     13     96     98 
##      4      1      2      1      5    264
table(loans$NumberOfOpenCreditLinesAndLoans)
## 
##     0     1     2     3     4     5     6     7     8     9    10    11 
##  1888  4438  6666  9058 11609 12931 13614 13245 12562 11355  9624  8321 
##    12    13    14    15    16    17    18    19    20    21    22    23 
##  7005  5667  4546  3645  3000  2370  1874  1433  1169   864   685   533 
##    24    25    26    27    28    29    30    31    32    33    34    35 
##   422   337   239   194   150   114    88    74    52    47    35    27 
##    36    37    38    39    40    41    42    43    44    45    46    47 
##    18     7    13     9    10     4     8     8     2     8     3     2 
##    48    49    50    51    52    53    54    56    57    58 
##     6     4     2     2     3     1     4     2     2     1
table(loans$NumberOfTimes90DaysLate)
## 
##      0      1      2      3      4      5      6      7      8      9 
## 141662   5243   1555    667    291    131     80     38     21     19 
##     10     11     12     13     14     15     17     96     98 
##      8      5      2      4      2      2      1      5    264
table(loans$NumberRealEstateLoansOrLines)
## 
##     0     1     2     3     4     5     6     7     8     9    10    11 
## 56188 52338 31522  6300  2170   689   320   171    93    78    37    23 
##    12    13    14    15    16    17    18    19    20    21    23    25 
##    18    15     7     7     4     4     2     2     2     1     2     3 
##    26    29    32    54 
##     1     1     1     1
table(loans$NumberOfTime60.89DaysPastDueNotWorse)
## 
##      0      1      2      3      4      5      6      7      8      9 
## 142396   5731   1118    318    105     34     16      9      2      1 
##     11     96     98 
##      1      5    264
table(loans$NumberOfDependents)
## 
##     0     1     2     3     4     5     6     7     8     9    10    13 
## 86902 26316 19522  9483  2862   746   158    51    24     5     5     1 
##    20 
##     1
  1. Are there any variables with missing values? We can use summary() function to find any missing value in the dataset.
summary(loans)
##      Sr_No        SeriousDlqin2yrs  RevolvingUtilizationOfUnsecuredLines
##  Min.   :     1   Min.   :0.00000   Min.   :    0.00                    
##  1st Qu.: 37501   1st Qu.:0.00000   1st Qu.:    0.03                    
##  Median : 75001   Median :0.00000   Median :    0.15                    
##  Mean   : 75001   Mean   :0.06684   Mean   :    6.05                    
##  3rd Qu.:112500   3rd Qu.:0.00000   3rd Qu.:    0.56                    
##  Max.   :150000   Max.   :1.00000   Max.   :50708.00                    
##                                                                         
##       age        NumberOfTime30.59DaysPastDueNotWorse   DebtRatio       
##  Min.   :  0.0   Min.   : 0.000                       Min.   :     0.0  
##  1st Qu.: 41.0   1st Qu.: 0.000                       1st Qu.:     0.2  
##  Median : 52.0   Median : 0.000                       Median :     0.4  
##  Mean   : 52.3   Mean   : 0.421                       Mean   :   353.0  
##  3rd Qu.: 63.0   3rd Qu.: 0.000                       3rd Qu.:     0.9  
##  Max.   :109.0   Max.   :98.000                       Max.   :329664.0  
##                                                                         
##  MonthlyIncome     NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate
##  Min.   :      0   Min.   : 0.000                  Min.   : 0.000         
##  1st Qu.:   3400   1st Qu.: 5.000                  1st Qu.: 0.000         
##  Median :   5400   Median : 8.000                  Median : 0.000         
##  Mean   :   6670   Mean   : 8.453                  Mean   : 0.266         
##  3rd Qu.:   8249   3rd Qu.:11.000                  3rd Qu.: 0.000         
##  Max.   :3008750   Max.   :58.000                  Max.   :98.000         
##  NA's   :29731                                                            
##  NumberRealEstateLoansOrLines NumberOfTime60.89DaysPastDueNotWorse
##  Min.   : 0.000               Min.   : 0.0000                     
##  1st Qu.: 0.000               1st Qu.: 0.0000                     
##  Median : 1.000               Median : 0.0000                     
##  Mean   : 1.018               Mean   : 0.2404                     
##  3rd Qu.: 2.000               3rd Qu.: 0.0000                     
##  Max.   :54.000               Max.   :98.0000                     
##                                                                   
##  NumberOfDependents
##  Min.   : 0.000    
##  1st Qu.: 0.000    
##  Median : 0.000    
##  Mean   : 0.757    
##  3rd Qu.: 1.000    
##  Max.   :20.000    
##  NA's   :3924

We can see variable MonthlyIncome has 29731 missing values and NumberOfDependents has 3924 missing values.

  1. Are there any default values?
  2. Can you identify the variables with outliers?
  3. Are there any variables with other issues?

Step-3: Continuous Variables Exploration

Summary of Continuous variables can be explored by: – Min, Max, Median, Mean, sd, Var – Quartiles – Box plots and identification of outliers – Percentiles- P1, p5,p10,q1(p25),q3(p75), p90,p99

Check Points

  1. Are variable distribution as expected?
  2. What is the central tendency of the variable? Mean, Median and Mode across each variable.
  3. Is the concentration of variables as expected ? What are quartiles?
  4. Indicates variables which are unary I.e stddev=0 ; the variables which are useless for the current objective.
  5. Are there any outliers / extreme values for the variable?
  6. Are outlier values as expected or they have abnormally high values -for ex for Age if max and p99 values are 10000. Then should investigate if it’s the default value or there is some error in data.
  7. What is the % of missing value associated with the variable?

LAB: Continuous Variables Summary

  1. List down the continuous variables.
  2. Find summary statistics for each variable. Min, Max, Median, Mean, sd, Var.
  3. Find Quartiles for each of the variables.
  4. Create Box plots and identify outliers.
  5. Find the percentage of missing values.
  6. Find Percentiles and find percentage of outliers, if any P1, p5,p10,q1(p25),q3(p75), p90,p99.

Solution 1. List down the continuous variables.

str(loans)
## 'data.frame':    150000 obs. of  12 variables:
##  $ Sr_No                               : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ SeriousDlqin2yrs                    : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ RevolvingUtilizationOfUnsecuredLines: num  0.766 0.957 0.658 0.234 0.907 ...
##  $ age                                 : int  45 40 38 30 49 74 57 39 27 57 ...
##  $ NumberOfTime30.59DaysPastDueNotWorse: int  2 0 1 0 1 0 0 0 0 0 ...
##  $ DebtRatio                           : num  0.803 0.1219 0.0851 0.036 0.0249 ...
##  $ MonthlyIncome                       : int  9120 2600 3042 3300 63588 3500 NA 3500 NA 23684 ...
##  $ NumberOfOpenCreditLinesAndLoans     : int  13 4 2 5 7 3 8 8 2 9 ...
##  $ NumberOfTimes90DaysLate             : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ NumberRealEstateLoansOrLines        : int  6 0 0 0 1 1 3 0 0 4 ...
##  $ NumberOfTime60.89DaysPastDueNotWorse: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ NumberOfDependents                  : int  2 1 0 0 0 1 0 0 NA 2 ...
head(loans)
##   Sr_No SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age
## 1     1                1                            0.7661266  45
## 2     2                0                            0.9571510  40
## 3     3                0                            0.6581801  38
## 4     4                0                            0.2338098  30
## 5     5                0                            0.9072394  49
## 6     6                0                            0.2131787  74
##   NumberOfTime30.59DaysPastDueNotWorse  DebtRatio MonthlyIncome
## 1                                    2 0.80298213          9120
## 2                                    0 0.12187620          2600
## 3                                    1 0.08511338          3042
## 4                                    0 0.03604968          3300
## 5                                    1 0.02492570         63588
## 6                                    0 0.37560697          3500
##   NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate
## 1                              13                       0
## 2                               4                       0
## 3                               2                       1
## 4                               5                       0
## 5                               7                       0
## 6                               3                       0
##   NumberRealEstateLoansOrLines NumberOfTime60.89DaysPastDueNotWorse
## 1                            6                                    0
## 2                            0                                    0
## 3                            0                                    0
## 4                            0                                    0
## 5                            1                                    0
## 6                            1                                    0
##   NumberOfDependents
## 1                  2
## 2                  1
## 3                  0
## 4                  0
## 5                  0
## 6                  1
  1. Find summary statistics for each variable. Min, Max, Median, Mean, sd, Var.
summary(loans$RevolvingUtilizationOfUnsecuredLines)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00     0.03     0.15     6.05     0.56 50710.00
summary(loans$MonthlyIncome)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0    3400    5400    6670    8249 3009000   29731
var(loans$RevolvingUtilizationOfUnsecuredLines)
## [1] 62377.75
var(loans$MonthlyIncome)
## [1] NA
var(loans$MonthlyIncome,na.rm = TRUE)
## [1] 206918852
sd(loans$RevolvingUtilizationOfUnsecuredLines)
## [1] 249.7554
sd(loans$MonthlyIncome)
## [1] NA
sd(loans$MonthlyIncome,na.rm = TRUE)
## [1] 14384.67

To find all the above mentioned values at one go we can use describe() function form library psych.

library(psych)
## Warning: package 'psych' was built under R version 3.3.2
describe(loans$RevolvingUtilizationOfUnsecuredLines)
##    vars      n mean     sd median trimmed  mad min   max range  skew
## X1    1 150000 6.05 249.76   0.15    0.27 0.22   0 50708 50708 97.63
##    kurtosis   se
## X1 14544.03 0.64
describe(loans$MonthlyIncome)
##    vars      n    mean       sd median trimmed     mad min     max   range
## X1    1 120269 6670.22 14384.67   5400 5787.56 3435.18   0 3008750 3008750
##      skew kurtosis    se
## X1 114.04 19503.57 41.48
  1. Find Quartiles for each of the variables
summary(loans$RevolvingUtilizationOfUnsecuredLines)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00     0.03     0.15     6.05     0.56 50710.00
summary(loans$MonthlyIncome)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0    3400    5400    6670    8249 3009000   29731
  1. Create Box plots and identify outliers.
boxplot(loans$RevolvingUtilizationOfUnsecuredLines)

boxplot(loans$MonthlyIncome)

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

ggplot(loans, aes(y=RevolvingUtilizationOfUnsecuredLines, x = 1)) + geom_boxplot()

ggplot(loans, aes(y=MonthlyIncome, x = 1)) + geom_boxplot()
## Warning: Removed 29731 rows containing non-finite values (stat_boxplot).

  1. Find the percentage of missing values.
sum(is.na(loans$MonthlyIncome))
## [1] 29731
sum(is.na(loans$MonthlyIncome))/nrow(loans)
## [1] 0.1982067
  1. Find Percentiles and find percentage of outliers, if any P1, p5,p10,q1(p25),q3(p75), p90,p99
util_percentiles<-round(quantile(loans$RevolvingUtilizationOfUnsecuredLines,c(0.05, 0.1, 0.25, 0.5, 0.75, 0.80, 0.9,0.91,0.95,0.96,0.97,0.975,0.98,0.99,1)),2)
util_percentiles
##       5%      10%      25%      50%      75%      80%      90%      91% 
##     0.00     0.00     0.03     0.15     0.56     0.70     0.98     1.00 
##      95%      96%      97%    97.5%      98%      99%     100% 
##     1.00     1.00     1.00     1.00     1.01     1.09 50708.00

We can say last 1% values are outliers.

Data Cleaning

  • Some variables contain outliers.
  • Some variables have default values.
  • Some variables have missing values.
  • These Values contain Outliers:
    • RevolvingUtilizationOfUnsecuredLines
    • NumberOfTime30_59DaysPastDueNotWorse
  • These Variables have missing values:
    • MonthlyIncome
    • NumberOfDependents

Shall we delete them and go ahead with our analysis? – Not really.

Missing Values & Outliers

  • Data is not always available E.g., many tuples have no recorded value for several attributes, such as customer income in sales data.
  • Missing data may be due to :
    • Equipment malfunction.
    • Inconsistent with other recorded data and thus deleted.
    • Data not entered due to misunderstanding.
    • Certain data may not be considered important at the time of entry.
    • Not register history or changes of the data.
  • Missing data may need to be inferred.
  • Missing data – values, attributes, entire records, entire sections.
  • Missing values and defaults are indistinguishable.

Imputation

Missing Value Imputation : 1

  • Standalone imputation
    • Mean, median, other point estimates
    • Convenient, easy to implement
    • Assume: Distribution of the missing values is the same as the non-missing values.
    • Does not take into account inter-relationships
  • Eg: The average of available values is 11.4. Can we replace the missing value in this table by 11.4 ?

Missing Value Imputation : 2

  • Use attribute relationships
  • Better imputation
  • Two techniques
    • Propensity score (nonparametric). Useful for discrete variables.
    • Regression (parametric)
  • There are two missing values in x2. What are the most appropriate replacements?

Missing Value Imputation : 3

-There are two missing values in x2. Find the most appropriate replacements

Missing Value Imputation : 4

  • What if more than 50% are missing?
  • It doesn’t make sense to carry out the analysis on 20% or 30% of the whole data and give inferences on overall data.
  • The best imputation is ignore the actual values and take available or not available info.

Step-4: Missing Values and Outlier Treatment

Data Cleaning Scenario-1

RevolvingUtilizationOfUnsecuredLines

  • RevolvingUtilizationOfUnsecuredLines has outliers.
  • What type of variable is this? What are the possible values?
  • Its’ mean is 6.05 which is greater than 1. So variable has some faulty values. Its maximum value is 50710 which is way too high.
  • Lets look at percentiles to know from where it is exceeding 1.

We can check this contition with the outlier treatement image we have above, and our path would be:

Data Cleaning
  • RevolvingUtilizationOfUnsecuredLines has outliers.
  • Since outliers percentage is less than 10% We will replace outliers with mean of reaming data.
  • Outliers are with value greater than 1.

LAB: Data Cleaning Scenario-1

  • What percent are missing values in RevolvingUtilizationOfUnsecuredLines?
  • Get the detailed percentile distribution.
  • Clean the variable, and create a new variable by removing all the issues.

Solution – What percent are missing values in RevolvingUtilizationOfUnsecuredLines?

library(ggplot2)
ggplot(loans, aes(y=RevolvingUtilizationOfUnsecuredLines, x = 1)) + geom_boxplot()

  • Get the detailed percentile distribution.
util_percentiles<-round(quantile(loans$RevolvingUtilizationOfUnsecuredLines,c(0.05, 0.1, 0.25, 0.5, 0.75, 0.80, 0.9,0.91,0.95,0.96,0.97,0.975,0.98,0.99,0.997,1)),2)
util_percentiles
##       5%      10%      25%      50%      75%      80%      90%      91% 
##     0.00     0.00     0.03     0.15     0.56     0.70     0.98     1.00 
##      95%      96%      97%    97.5%      98%      99%    99.7%     100% 
##     1.00     1.00     1.00     1.00     1.01     1.09     1.76 50708.00
  • Clean the variable, and create a new variable by removing all the issues
  • If utilization is more than 1 then it can be replaced by median
median_util<-median(loans$RevolvingUtilizationOfUnsecuredLines)
loans$util_new<-ifelse(loans$RevolvingUtilizationOfUnsecuredLines>1,median_util,loans$RevolvingUtilizationOfUnsecuredLines)
  • Box_plot on cleaned variable
ggplot(loans, aes(y=util_new, x = 1)) + geom_boxplot()

  • percentile distribution for new variable
util_percentiles1<-round(quantile(loans$util_new,c(0.05, 0.1, 0.25, 0.5, 0.75, 0.80, 0.9,0.91,0.95,0.96,0.97,0.975,0.98,0.99,1)),2)
util_percentiles1
##    5%   10%   25%   50%   75%   80%   90%   91%   95%   96%   97% 97.5% 
##  0.00  0.00  0.03  0.15  0.51  0.63  0.93  0.95  1.00  1.00  1.00  1.00 
##   98%   99%  100% 
##  1.00  1.00  1.00

Data Cleaning Scenario-2

NumberOfTime30_59DaysPastDueNotWorse

  • Find bad rate in each category of this variable
  • Replace 96 with _____? Replace 98 with_____?

Cleaning path for this variable would be:

LAB: Data Cleaning Scenario-2

  1. What is the issue with NumberOfTime30_59DaysPastDueNotWorse
  2. Draw a frequency table.
  3. What percent of the values are erroneous?
  4. Clean the variable- Look at the cross tab of variable vs target. Impute based on target.
  5. Create frequency table for cleaned variable.

Solution 1. What is the issue with NumberOfTime30_59DaysPastDueNotWorse.

  1. Draw a frequency table
freq_table_30dpd<-table(loans$NumberOfTime30.59DaysPastDueNotWorse)
freq_table_30dpd
## 
##      0      1      2      3      4      5      6      7      8      9 
## 126018  16033   4598   1754    747    342    140     54     25     12 
##     10     11     12     13     96     98 
##      4      1      2      1      5    264

One month defaults frequency can’t be beyond 24 in last 24 months.

  1. What percent of the values are erroneous?
freq_table_30dpd[14:length(freq_table_30dpd)]
## 
##  13  96  98 
##   1   5 264
sum(freq_table_30dpd[14:length(freq_table_30dpd)])/sum(freq_table_30dpd)
## [1] 0.0018
  1. Clean the variable- Look at the cross tab of variable vs target. Impute based on target .
#Cross tab with target
cross_tab_30dpd_target<-table(loans$NumberOfTime30.59DaysPastDueNotWorse,loans$SeriousDlqin2yrs)
cross_tab_30dpd_target
##     
##           0      1
##   0  120977   5041
##   1   13624   2409
##   2    3379   1219
##   3    1136    618
##   4     429    318
##   5     188    154
##   6      66     74
##   7      26     28
##   8      17      8
##   9       8      4
##   10      1      3
##   11      0      1
##   12      1      1
##   13      0      1
##   96      1      4
##   98    121    143
#Cross tab row Percentages
cross_tab_30dpd_target_percent<-round(prop.table(cross_tab_30dpd_target, 1),2)
cross_tab_30dpd_target_percent
##     
##         0    1
##   0  0.96 0.04
##   1  0.85 0.15
##   2  0.73 0.27
##   3  0.65 0.35
##   4  0.57 0.43
##   5  0.55 0.45
##   6  0.47 0.53
##   7  0.48 0.52
##   8  0.68 0.32
##   9  0.67 0.33
##   10 0.25 0.75
##   11 0.00 1.00
##   12 0.50 0.50
##   13 0.00 1.00
##   96 0.20 0.80
##   98 0.46 0.54
data.frame(Freq_0=cross_tab_30dpd_target[,1],Freq_1=cross_tab_30dpd_target[,2],Pert_0=cross_tab_30dpd_target_percent[,1],Pert_1=cross_tab_30dpd_target_percent[,2])
##    Freq_0 Freq_1 Pert_0 Pert_1
## 0  120977   5041   0.96   0.04
## 1   13624   2409   0.85   0.15
## 2    3379   1219   0.73   0.27
## 3    1136    618   0.65   0.35
## 4     429    318   0.57   0.43
## 5     188    154   0.55   0.45
## 6      66     74   0.47   0.53
## 7      26     28   0.48   0.52
## 8      17      8   0.68   0.32
## 9       8      4   0.67   0.33
## 10      1      3   0.25   0.75
## 11      0      1   0.00   1.00
## 12      1      1   0.50   0.50
## 13      0      1   0.00   1.00
## 96      1      4   0.20   0.80
## 98    121    143   0.46   0.54
#Percentage of 0 and 1 are of 98 is near to percentages of 6. 
#Replacing error values with 6
loans$num_30_59_dpd_new<-ifelse(loans$NumberOfTime30.59DaysPastDueNotWorse>12,6,loans$NumberOfTime30.59DaysPastDueNotWorse)
  1. Create frequency table for cleaned variable.
table(loans$num_30_59_dpd_new)
## 
##      0      1      2      3      4      5      6      7      8      9 
## 126018  16033   4598   1754    747    342    410     54     25     12 
##     10     11     12 
##      4      1      2

Data Cleaning Scenario-3

MonthlyIncome

  • Monthly Income has nearly 20% missing values.
  • Missing value percentage is significant.
  • Simply replacing with mean or median is not sufficient.
  • We can create an indicator variable to keep track of missing and non-missing values.

This would be the path to clean the variable MonthlyIncome:

LAB: Data Cleaning Scenario-3 : Monthly Income

  1. Find the missing value percentage in monthly income.
  2. Create an indicator variable for missing and non-missing.
  3. Replace the missing values with median.

Solution – Find the missing value percentage in monthly income

sum(is.na(loans$MonthlyIncome))
## [1] 29731
sum(is.na(loans$MonthlyIncome))/nrow(loans)
## [1] 0.1982067
  • Create an indicator variable for missing and non-missing
loans$MonthlyIncome_ind<-ifelse(is.na(loans$MonthlyIncome), "Missing", "Non-Missing")
table(loans$MonthlyIncome_ind)
## 
##     Missing Non-Missing 
##       29731      120269
  • Replace the missing values with median
median_income<-median(loans$MonthlyIncome, na.rm = TRUE)
median_income
## [1] 5400
loans$MonthlyIncome_new<-ifelse(is.na(loans$MonthlyIncome), median_income, loans$MonthlyIncome)
sum(is.na(loans$MonthlyIncome_new))
## [1] 0

Data Cleaning Other Variables

Remaining Variables Imputation

  • Debt Ratio : Imputation
  • NumberOfOpenCreditLinesAndLoans : No issues in this variable.
  • NumberOfTimes90DaysLate : Imputation similar to NumberOfTime30_59DaysPastDueNotWorse.
  • NumberRealEstateLoansOrLines : No issues in this variable.
  • NumberOfTime60_89DaysPastDueNotWorse : Imputation similar to NumberOfTime30_59DaysPastDueNotWorse.
  • NumberOfDependents : Impute based on target variable.

Conclusion

  • Data cleaning is as important as data analysis.
  • Sometimes 80% of the overall project time is spent on data cleaning.
  • Data cleaning needs patience, we need to clean for each individual variable.
  • Apart from suggested methods, there are many heuristic ways of cleaning the data.


 

DV Analytics

DV Data & Analytics is a leading data science,  Cyber Security training and consulting firm, led by industry experts. We are aiming to train and prepare resources to acquire the most in-demand data science job opportunities in India and abroad.

Bangalore Center

DV Data & Analytics Bangalore Private Limited
#52, 2nd Floor:
Malleshpalya Maruthinagar Bengaluru.
Bangalore 560075
India
(+91) 9019 030 033 (+91) 8095 881 188
Email: info@dvanalyticsmds.com

Bhubneshwar Center

DV Data & Analytics Private Limited Bhubaneswar
Plot No A/7 :
Adjacent to Maharaja Cine Complex, Bhoinagar, Acharya Vihar
Bhubaneswar 751022
(+91) 8095 881 188 (+91) 8249 430 414
Email: info@dvanalyticsmds.com

top
© 2020. All Rights Reserved.