You can download the datasets and R code file for this session here.
Contents
- Raw Data – issues
- Data Exploration
- Data Validation
- Data Sanitization techniques
Raw Data – Issues
Raw data is dirty. Typical issues (illustrated in the small example below):
- Wrong formats: expenses is read as a date
- Missing values: income is missing for some records
- Outliers: number of loans is 25,000
- Erroneous values: age is less than 0
- Default values: account tenure is 999999
- Inconsistent values: age is 25 but year of birth is 1970
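Here is a tiny, purely illustrative data frame (hypothetical values) that exhibits each of these issues; running str() and summary() on it already reveals most of them:

```r
# Hypothetical toy data showing typical raw-data problems
raw <- data.frame(
  cust_id   = 1:5,
  expenses  = c("2010-05-01", "1200", "900", "1500", "800"),  # wrong format: numeric read as text/date
  income    = c(3000, NA, 4500, NA, 5200),                    # missing values
  num_loans = c(2, 1, 25000, 3, 4),                           # outlier
  age       = c(25, -3, 41, 32, 60),                          # erroneous value: negative age
  tenure    = c(12, 999999, 36, 24, 18),                      # default/sentinel value
  yob       = c(1970, 1992, 1981, 1990, 1962)                 # inconsistent: age 25 vs year of birth 1970
)

str(raw)      # shows expenses is not numeric
summary(raw)  # shows the NA count, the negative age and the extreme values
```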
Preparing data for analysis
- We can’t directly start the analysis and model building with raw data.
- Before getting to the core analysis and strategy building, it is very important to:
- Explore the data
- Validate the data
- And finally clean the data and prepare it for analysis
Data Exploration : Case Study
Give me some credit data
- We will walk through data exploration, validation and cleaning using a case study on loans data.
- The "Give Me Some Credit" dataset is loans data; historical data are provided on 150,000 borrowers.
- The final objective is to build a model that borrowers can use to help make the best financial decisions.
- We generally get the data and data dictionary from the data team.
Data Dictionary
No | Variable Name | Short Description | Description | Variable Type |
---|---|---|---|---|
1 | SeriousDlqin2yrs | Target Variable (loan defaulter) | Person experienced 90 days past due delinquency or worse | Y/N |
2 | RevolvingUtilizationOfUnsecuredLines | Credit Utilization | Total balance on credit cards and personal lines of credit except real estate and no installment debt like car loans divided by the sum of credit limits | percentage |
3 | age | Age | Age of borrower in years | integer |
4 | NumberOfTime30-59DaysPastDueNotWorse | One month late frequency | Number of times borrower has been 30-59 days past due but no worse in the last 2 years. | integer |
5 | DebtRatio | Debt to income ratio | Monthly debt payments, alimony, living costs divided by monthly gross income | percentage |
6 | MonthlyIncome | Income | Monthly income | real |
7 | NumberOfOpenCreditLinesAndLoans | Number of loans | Number of Open loans (installment like car loan or mortgage) and Lines of credit (e.g. credit cards) | Integer |
8 | NumberOfTimes90DaysLate | Three months late frequency | Number of times borrower has been 90 days or more past due. | integer |
Steps in Data Exploration and Cleaning
- Step-1: Basic details of the data
- Step-2: Categorical variables exploration
- Step-3: Continuous variables exploration
- Step-4: Missing Values and Outlier Treatment
Step-1: Basic Details of the Data
Check the metadata (metadata is data about data):
- What are the total number of observations and variables?
- Check each field name, field type and field length.
- Are there some unexpected variables, say q9 or r10?
- Are the data types and lengths across variables correct?
- For known variables, is the data type as expected? (For example, if age is in date format, something is suspicious.)

Print the first few records:
- Do we have any unique identifier? Is the unique identifier repeated across different records?
- Do the text variables have meaningful data?
- Are there some coded values in the data?
- Do all the variables appear to have data?
- Are there any missing values?
LAB : Basic Contents of the Data
- Import “Give me some Credit-training.csv”
- How many rows and columns are there?
- Are there any suspicious variables?
- Are all the variable names correct?
- Display the variable formats.
- Print the first 10 observations.
- Do we have any unique identifier?
- Do the text and numeric variables have meaningful data?
- Are there some coded values in the data?
- Do all the variables appear to have data?
Solution 1. Import “Give me some Credit-training.csv”
loans<- read.csv("D:/Dv Analytics/Datasets/Give me some Credit/cs-training.csv")
View(loans)
- How many rows and columns are there?
dim(loans)
## [1] 150000 12
- Are there any suspicious variables?
names(loans)
## [1] "Sr_No"
## [2] "SeriousDlqin2yrs"
## [3] "RevolvingUtilizationOfUnsecuredLines"
## [4] "age"
## [5] "NumberOfTime30.59DaysPastDueNotWorse"
## [6] "DebtRatio"
## [7] "MonthlyIncome"
## [8] "NumberOfOpenCreditLinesAndLoans"
## [9] "NumberOfTimes90DaysLate"
## [10] "NumberRealEstateLoansOrLines"
## [11] "NumberOfTime60.89DaysPastDueNotWorse"
## [12] "NumberOfDependents"
- Are all the variable names correct?
names(loans)
## [1] "Sr_No"
## [2] "SeriousDlqin2yrs"
## [3] "RevolvingUtilizationOfUnsecuredLines"
## [4] "age"
## [5] "NumberOfTime30.59DaysPastDueNotWorse"
## [6] "DebtRatio"
## [7] "MonthlyIncome"
## [8] "NumberOfOpenCreditLinesAndLoans"
## [9] "NumberOfTimes90DaysLate"
## [10] "NumberRealEstateLoansOrLines"
## [11] "NumberOfTime60.89DaysPastDueNotWorse"
## [12] "NumberOfDependents"
- Display the variable formats
str(loans)
## 'data.frame': 150000 obs. of 12 variables:
## $ Sr_No : int 1 2 3 4 5 6 7 8 9 10 ...
## $ SeriousDlqin2yrs : int 1 0 0 0 0 0 0 0 0 0 ...
## $ RevolvingUtilizationOfUnsecuredLines: num 0.766 0.957 0.658 0.234 0.907 ...
## $ age : int 45 40 38 30 49 74 57 39 27 57 ...
## $ NumberOfTime30.59DaysPastDueNotWorse: int 2 0 1 0 1 0 0 0 0 0 ...
## $ DebtRatio : num 0.803 0.1219 0.0851 0.036 0.0249 ...
## $ MonthlyIncome : int 9120 2600 3042 3300 63588 3500 NA 3500 NA 23684 ...
## $ NumberOfOpenCreditLinesAndLoans : int 13 4 2 5 7 3 8 8 2 9 ...
## $ NumberOfTimes90DaysLate : int 0 0 1 0 0 0 0 0 0 0 ...
## $ NumberRealEstateLoansOrLines : int 6 0 0 0 1 1 3 0 0 4 ...
## $ NumberOfTime60.89DaysPastDueNotWorse: int 0 0 0 0 0 0 0 0 0 0 ...
## $ NumberOfDependents : int 2 1 0 0 0 1 0 0 NA 2 ...
- Print the first 10 observations
head(loans, n=10)
## Sr_No SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age
## 1 1 1 0.7661266 45
## 2 2 0 0.9571510 40
## 3 3 0 0.6581801 38
## 4 4 0 0.2338098 30
## 5 5 0 0.9072394 49
## 6 6 0 0.2131787 74
## 7 7 0 0.3056825 57
## 8 8 0 0.7544636 39
## 9 9 0 0.1169506 27
## 10 10 0 0.1891691 57
## NumberOfTime30.59DaysPastDueNotWorse DebtRatio MonthlyIncome
## 1 2 8.029821e-01 9120
## 2 0 1.218762e-01 2600
## 3 1 8.511338e-02 3042
## 4 0 3.604968e-02 3300
## 5 1 2.492570e-02 63588
## 6 0 3.756070e-01 3500
## 7 0 5.710000e+03 NA
## 8 0 2.099400e-01 3500
## 9 0 4.600000e+01 NA
## 10 0 6.062909e-01 23684
## NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate
## 1 13 0
## 2 4 0
## 3 2 1
## 4 5 0
## 5 7 0
## 6 3 0
## 7 8 0
## 8 8 0
## 9 2 0
## 10 9 0
## NumberRealEstateLoansOrLines NumberOfTime60.89DaysPastDueNotWorse
## 1 6 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 1 0
## 6 1 0
## 7 3 0
## 8 0 0
## 9 0 0
## 10 4 0
## NumberOfDependents
## 1 2
## 2 1
## 3 0
## 4 0
## 5 0
## 6 1
## 7 0
## 8 0
## 9 NA
## 10 2
- Do we have any unique identifier?
names(loans)
## [1] "Sr_No"
## [2] "SeriousDlqin2yrs"
## [3] "RevolvingUtilizationOfUnsecuredLines"
## [4] "age"
## [5] "NumberOfTime30.59DaysPastDueNotWorse"
## [6] "DebtRatio"
## [7] "MonthlyIncome"
## [8] "NumberOfOpenCreditLinesAndLoans"
## [9] "NumberOfTimes90DaysLate"
## [10] "NumberRealEstateLoansOrLines"
## [11] "NumberOfTime60.89DaysPastDueNotWorse"
## [12] "NumberOfDependents"
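Sr_No looks like a serial number. A quick way to confirm that it is a genuine unique identifier (a minimal check on the loans data frame loaded above):

```r
# TRUE when every Sr_No value is distinct
length(unique(loans$Sr_No)) == nrow(loans)

# anyDuplicated() returns 0 when there are no duplicate values
anyDuplicated(loans$Sr_No)
```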
Step-2: Categorical Variables Exploration
The frequency table and summary:
- Calculate frequency counts and cross-tabulation frequencies, especially for categorical, discrete and class fields.
- Frequencies help us understand a variable by looking at the values it takes and the data count at each value.
- They also help us analyze relationships between variables by looking at cross-tab frequencies or at measures of association.
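Rather than calling table() on each column separately, the frequencies of several class variables can be produced in one go. A small sketch, assuming we pick the discrete columns by name (the selection below is an assumption based on the variables explored in the LAB that follows):

```r
# Discrete / class variables to tabulate (chosen by inspecting the str() output)
class_vars <- c("SeriousDlqin2yrs",
                "NumberOfTime30.59DaysPastDueNotWorse",
                "NumberOfTimes90DaysLate",
                "NumberOfTime60.89DaysPastDueNotWorse",
                "NumberOfDependents")

# One frequency table per variable; useNA = "ifany" also counts missing values
lapply(loans[class_vars], table, useNA = "ifany")
```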
Check Points
- Are values as expected?
- Variable understanding: distinct values of a particular variable, missing percentages.
- Are there any extreme values or outliers?
- Is there any possibility of creating a new variable with a smaller number of distinct categories by clubbing certain categories together?
LAB: Frequencies
- What are the categorical and discrete variables? What are the continuous variables?
- Find the frequencies of all class variables in the data?
- Are there any variables with missing values?
- Are there any default values?
- Can you identify the variables with outliers?
- Are there any variables with other issues?
Solution 1. What are the categorical and discrete variables? What are the continuous variables?
str(loans)
## 'data.frame': 150000 obs. of 12 variables:
## $ Sr_No : int 1 2 3 4 5 6 7 8 9 10 ...
## $ SeriousDlqin2yrs : int 1 0 0 0 0 0 0 0 0 0 ...
## $ RevolvingUtilizationOfUnsecuredLines: num 0.766 0.957 0.658 0.234 0.907 ...
## $ age : int 45 40 38 30 49 74 57 39 27 57 ...
## $ NumberOfTime30.59DaysPastDueNotWorse: int 2 0 1 0 1 0 0 0 0 0 ...
## $ DebtRatio : num 0.803 0.1219 0.0851 0.036 0.0249 ...
## $ MonthlyIncome : int 9120 2600 3042 3300 63588 3500 NA 3500 NA 23684 ...
## $ NumberOfOpenCreditLinesAndLoans : int 13 4 2 5 7 3 8 8 2 9 ...
## $ NumberOfTimes90DaysLate : int 0 0 1 0 0 0 0 0 0 0 ...
## $ NumberRealEstateLoansOrLines : int 6 0 0 0 1 1 3 0 0 4 ...
## $ NumberOfTime60.89DaysPastDueNotWorse: int 0 0 0 0 0 0 0 0 0 0 ...
## $ NumberOfDependents : int 2 1 0 0 0 1 0 0 NA 2 ...
head(loans)
## Sr_No SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age
## 1 1 1 0.7661266 45
## 2 2 0 0.9571510 40
## 3 3 0 0.6581801 38
## 4 4 0 0.2338098 30
## 5 5 0 0.9072394 49
## 6 6 0 0.2131787 74
## NumberOfTime30.59DaysPastDueNotWorse DebtRatio MonthlyIncome
## 1 2 0.80298213 9120
## 2 0 0.12187620 2600
## 3 1 0.08511338 3042
## 4 0 0.03604968 3300
## 5 1 0.02492570 63588
## 6 0 0.37560697 3500
## NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate
## 1 13 0
## 2 4 0
## 3 2 1
## 4 5 0
## 5 7 0
## 6 3 0
## NumberRealEstateLoansOrLines NumberOfTime60.89DaysPastDueNotWorse
## 1 6 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 1 0
## 6 1 0
## NumberOfDependents
## 1 2
## 2 1
## 3 0
## 4 0
## 5 0
## 6 1
- Find the frequencies of all class variables in the data. Using the str() function we can see the data type of each variable; table() then gives the frequencies.
names(loans)
## [1] "Sr_No"
## [2] "SeriousDlqin2yrs"
## [3] "RevolvingUtilizationOfUnsecuredLines"
## [4] "age"
## [5] "NumberOfTime30.59DaysPastDueNotWorse"
## [6] "DebtRatio"
## [7] "MonthlyIncome"
## [8] "NumberOfOpenCreditLinesAndLoans"
## [9] "NumberOfTimes90DaysLate"
## [10] "NumberRealEstateLoansOrLines"
## [11] "NumberOfTime60.89DaysPastDueNotWorse"
## [12] "NumberOfDependents"
table(loans$SeriousDlqin2yrs)
##
## 0 1
## 139974 10026
table(loans$age)
##
## 0 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## 1 183 434 641 816 953 1193 1338 1560 1702 1937 2038 2050 2239 2155
## 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
## 2246 2379 2521 2631 2987 3093 3122 3082 3208 3294 3502 3714 3719 3806 3837
## 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
## 3753 3627 3609 3648 3561 3416 3589 3375 3443 3280 3258 3522 3568 3719 3058
## 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
## 2594 2494 2503 2235 1954 1777 1646 1649 1520 1451 1241 1183 1099 1054 981
## 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94
## 876 774 647 512 480 483 407 357 313 276 198 154 93 87 47
## 95 96 97 98 99 101 102 103 105 107 109
## 45 18 17 6 9 3 3 3 1 1 2
table(loans$NumberOfTime30.59DaysPastDueNotWorse)
##
## 0 1 2 3 4 5 6 7 8 9
## 126018 16033 4598 1754 747 342 140 54 25 12
## 10 11 12 13 96 98
## 4 1 2 1 5 264
table(loans$NumberOfOpenCreditLinesAndLoans)
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 1888 4438 6666 9058 11609 12931 13614 13245 12562 11355 9624 8321
## 12 13 14 15 16 17 18 19 20 21 22 23
## 7005 5667 4546 3645 3000 2370 1874 1433 1169 864 685 533
## 24 25 26 27 28 29 30 31 32 33 34 35
## 422 337 239 194 150 114 88 74 52 47 35 27
## 36 37 38 39 40 41 42 43 44 45 46 47
## 18 7 13 9 10 4 8 8 2 8 3 2
## 48 49 50 51 52 53 54 56 57 58
## 6 4 2 2 3 1 4 2 2 1
table(loans$NumberOfTimes90DaysLate)
##
## 0 1 2 3 4 5 6 7 8 9
## 141662 5243 1555 667 291 131 80 38 21 19
## 10 11 12 13 14 15 17 96 98
## 8 5 2 4 2 2 1 5 264
table(loans$NumberRealEstateLoansOrLines)
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 56188 52338 31522 6300 2170 689 320 171 93 78 37 23
## 12 13 14 15 16 17 18 19 20 21 23 25
## 18 15 7 7 4 4 2 2 2 1 2 3
## 26 29 32 54
## 1 1 1 1
table(loans$NumberOfTime60.89DaysPastDueNotWorse)
##
## 0 1 2 3 4 5 6 7 8 9
## 142396 5731 1118 318 105 34 16 9 2 1
## 11 96 98
## 1 5 264
table(loans$NumberOfDependents)
##
## 0 1 2 3 4 5 6 7 8 9 10 13
## 86902 26316 19522 9483 2862 746 158 51 24 5 5 1
## 20
## 1
- Are there any variables with missing values? We can use the summary() function to find missing values in the dataset.
summary(loans)
## Sr_No SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines
## Min. : 1 Min. :0.00000 Min. : 0.00
## 1st Qu.: 37501 1st Qu.:0.00000 1st Qu.: 0.03
## Median : 75001 Median :0.00000 Median : 0.15
## Mean : 75001 Mean :0.06684 Mean : 6.05
## 3rd Qu.:112500 3rd Qu.:0.00000 3rd Qu.: 0.56
## Max. :150000 Max. :1.00000 Max. :50708.00
##
## age NumberOfTime30.59DaysPastDueNotWorse DebtRatio
## Min. : 0.0 Min. : 0.000 Min. : 0.0
## 1st Qu.: 41.0 1st Qu.: 0.000 1st Qu.: 0.2
## Median : 52.0 Median : 0.000 Median : 0.4
## Mean : 52.3 Mean : 0.421 Mean : 353.0
## 3rd Qu.: 63.0 3rd Qu.: 0.000 3rd Qu.: 0.9
## Max. :109.0 Max. :98.000 Max. :329664.0
##
## MonthlyIncome NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate
## Min. : 0 Min. : 0.000 Min. : 0.000
## 1st Qu.: 3400 1st Qu.: 5.000 1st Qu.: 0.000
## Median : 5400 Median : 8.000 Median : 0.000
## Mean : 6670 Mean : 8.453 Mean : 0.266
## 3rd Qu.: 8249 3rd Qu.:11.000 3rd Qu.: 0.000
## Max. :3008750 Max. :58.000 Max. :98.000
## NA's :29731
## NumberRealEstateLoansOrLines NumberOfTime60.89DaysPastDueNotWorse
## Min. : 0.000 Min. : 0.0000
## 1st Qu.: 0.000 1st Qu.: 0.0000
## Median : 1.000 Median : 0.0000
## Mean : 1.018 Mean : 0.2404
## 3rd Qu.: 2.000 3rd Qu.: 0.0000
## Max. :54.000 Max. :98.0000
##
## NumberOfDependents
## Min. : 0.000
## 1st Qu.: 0.000
## Median : 0.000
## Mean : 0.757
## 3rd Qu.: 1.000
## Max. :20.000
## NA's :3924
We can see that MonthlyIncome has 29731 missing values and NumberOfDependents has 3924 missing values.
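Instead of scanning the summary() output, the missing counts can be computed for every column at once (a small base-R sketch):

```r
# Number of NA values in each column
colSums(is.na(loans))

# The same counts expressed as a percentage of all rows
round(100 * colSums(is.na(loans)) / nrow(loans), 2)
```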
- Are there any default values? The values 96 and 98 in the past-due count variables look like default/error codes.
- Can you identify the variables with outliers? RevolvingUtilizationOfUnsecuredLines, DebtRatio and MonthlyIncome all have maxima far above their third quartiles.
- Are there any variables with other issues? age has a minimum of 0, which is not a valid borrower age.
Step-3: Continuous Variables Exploration
The summary of continuous variables can be explored using:
- Min, Max, Median, Mean, sd, Var
- Quartiles
- Box plots and identification of outliers
- Percentiles: P1, P5, P10, Q1 (P25), Q3 (P75), P90, P99
Check Points
- Are variable distributions as expected?
- What is the central tendency of the variable? Mean, median and mode across each variable.
- Is the concentration of the variable as expected? What are the quartiles?
- Identify variables which are unary, i.e. sd = 0; these are useless for the current objective (see the sketch after this list).
- Are there any outliers / extreme values for the variable?
- Are the outlier values as expected, or are they abnormally high? For example, for age, if the max and P99 values are 10000, we should investigate whether it is a default value or a data error.
- What is the percentage of missing values for the variable?
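A quick screen for the unary-variable check mentioned above: compute the standard deviation of every numeric column and flag any that are zero (a sketch; in this dataset none are expected to be constant):

```r
# Standard deviation per numeric column; sd = 0 flags a unary (constant) variable
num_cols <- sapply(loans, is.numeric)
sapply(loans[num_cols], sd, na.rm = TRUE)
```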
LAB: Continuous Variables Summary
- List down the continuous variables.
- Find summary statistics for each variable. Min, Max, Median, Mean, sd, Var.
- Find Quartiles for each of the variables.
- Create Box plots and identify outliers.
- Find the percentage of missing values.
- Find the percentiles P1, P5, P10, Q1 (P25), Q3 (P75), P90, P99 and the percentage of outliers, if any.
Solution 1. List down the continuous variables.
str(loans)
## 'data.frame': 150000 obs. of 12 variables:
## $ Sr_No : int 1 2 3 4 5 6 7 8 9 10 ...
## $ SeriousDlqin2yrs : int 1 0 0 0 0 0 0 0 0 0 ...
## $ RevolvingUtilizationOfUnsecuredLines: num 0.766 0.957 0.658 0.234 0.907 ...
## $ age : int 45 40 38 30 49 74 57 39 27 57 ...
## $ NumberOfTime30.59DaysPastDueNotWorse: int 2 0 1 0 1 0 0 0 0 0 ...
## $ DebtRatio : num 0.803 0.1219 0.0851 0.036 0.0249 ...
## $ MonthlyIncome : int 9120 2600 3042 3300 63588 3500 NA 3500 NA 23684 ...
## $ NumberOfOpenCreditLinesAndLoans : int 13 4 2 5 7 3 8 8 2 9 ...
## $ NumberOfTimes90DaysLate : int 0 0 1 0 0 0 0 0 0 0 ...
## $ NumberRealEstateLoansOrLines : int 6 0 0 0 1 1 3 0 0 4 ...
## $ NumberOfTime60.89DaysPastDueNotWorse: int 0 0 0 0 0 0 0 0 0 0 ...
## $ NumberOfDependents : int 2 1 0 0 0 1 0 0 NA 2 ...
head(loans)
## Sr_No SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age
## 1 1 1 0.7661266 45
## 2 2 0 0.9571510 40
## 3 3 0 0.6581801 38
## 4 4 0 0.2338098 30
## 5 5 0 0.9072394 49
## 6 6 0 0.2131787 74
## NumberOfTime30.59DaysPastDueNotWorse DebtRatio MonthlyIncome
## 1 2 0.80298213 9120
## 2 0 0.12187620 2600
## 3 1 0.08511338 3042
## 4 0 0.03604968 3300
## 5 1 0.02492570 63588
## 6 0 0.37560697 3500
## NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate
## 1 13 0
## 2 4 0
## 3 2 1
## 4 5 0
## 5 7 0
## 6 3 0
## NumberRealEstateLoansOrLines NumberOfTime60.89DaysPastDueNotWorse
## 1 6 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 1 0
## 6 1 0
## NumberOfDependents
## 1 2
## 2 1
## 3 0
## 4 0
## 5 0
## 6 1
- Find summary statistics for each variable. Min, Max, Median, Mean, sd, Var.
summary(loans$RevolvingUtilizationOfUnsecuredLines)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.03 0.15 6.05 0.56 50710.00
summary(loans$MonthlyIncome)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 3400 5400 6670 8249 3009000 29731
var(loans$RevolvingUtilizationOfUnsecuredLines)
## [1] 62377.75
var(loans$MonthlyIncome)
## [1] NA
var(loans$MonthlyIncome,na.rm = TRUE)
## [1] 206918852
sd(loans$RevolvingUtilizationOfUnsecuredLines)
## [1] 249.7554
sd(loans$MonthlyIncome)
## [1] NA
sd(loans$MonthlyIncome,na.rm = TRUE)
## [1] 14384.67
To find all of the above-mentioned values in one go, we can use the describe() function from the psych library.
library(psych)
## Warning: package 'psych' was built under R version 3.3.2
describe(loans$RevolvingUtilizationOfUnsecuredLines)
## vars n mean sd median trimmed mad min max range skew
## X1 1 150000 6.05 249.76 0.15 0.27 0.22 0 50708 50708 97.63
## kurtosis se
## X1 14544.03 0.64
describe(loans$MonthlyIncome)
## vars n mean sd median trimmed mad min max range
## X1 1 120269 6670.22 14384.67 5400 5787.56 3435.18 0 3008750 3008750
## skew kurtosis se
## X1 114.04 19503.57 41.48
- Find Quartiles for each of the variables
summary(loans$RevolvingUtilizationOfUnsecuredLines)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.03 0.15 6.05 0.56 50710.00
summary(loans$MonthlyIncome)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 3400 5400 6670 8249 3009000 29731
- Create Box plots and identify outliers.
boxplot(loans$RevolvingUtilizationOfUnsecuredLines)
boxplot(loans$MonthlyIncome)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(loans, aes(y=RevolvingUtilizationOfUnsecuredLines, x = 1)) + geom_boxplot()
ggplot(loans, aes(y=MonthlyIncome, x = 1)) + geom_boxplot()
## Warning: Removed 29731 rows containing non-finite values (stat_boxplot).
- Find the percentage of missing values.
sum(is.na(loans$MonthlyIncome))
## [1] 29731
sum(is.na(loans$MonthlyIncome))/nrow(loans)
## [1] 0.1982067
- Find the percentiles P1, P5, P10, Q1 (P25), Q3 (P75), P90, P99 and the percentage of outliers, if any.
util_percentiles<-round(quantile(loans$RevolvingUtilizationOfUnsecuredLines,c(0.05, 0.1, 0.25, 0.5, 0.75, 0.80, 0.9,0.91,0.95,0.96,0.97,0.975,0.98,0.99,1)),2)
util_percentiles
## 5% 10% 25% 50% 75% 80% 90% 91%
## 0.00 0.00 0.03 0.15 0.56 0.70 0.98 1.00
## 95% 96% 97% 97.5% 98% 99% 100%
## 1.00 1.00 1.00 1.00 1.01 1.09 50708.00
We can say that roughly the last 1% of values (those above 1) are outliers.
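The rounded percentile table shows utilization crossing 1 somewhere near the top of the distribution. To get the exact share of records above 1, we can compute it directly (a quick check on the same variable):

```r
# Proportion of utilization values greater than 1
mean(loans$RevolvingUtilizationOfUnsecuredLines > 1)
```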
Data Cleaning
- Some variables contain outliers.
- Some variables have default values.
- Some variables have missing values.
- These variables contain outliers:
- RevolvingUtilizationOfUnsecuredLines
- NumberOfTime30_59DaysPastDueNotWorse
- These Variables have missing values:
- MonthlyIncome
- NumberOfDependents
Shall we delete them and go ahead with our analysis? – Not really.
Missing Values & Outliers
- Data is not always available; e.g., many tuples have no recorded value for several attributes, such as customer income in sales data.
- Missing data may be due to:
- Equipment malfunction.
- Data being inconsistent with other recorded data and thus deleted.
- Data not entered due to misunderstanding.
- Certain data not being considered important at the time of entry.
- History or changes of the data not being registered.
- Missing data may need to be inferred.
- Missing data can affect values, attributes, entire records or entire sections.
- Missing values and default values are often indistinguishable.
Imputation
Missing Value Imputation : 1
- Standalone imputation
- Mean, median, other point estimates
- Convenient, easy to implement
- Assumption: the distribution of the missing values is the same as that of the non-missing values.
- Does not take inter-relationships between variables into account.
- E.g.: the average of the available values is 11.4. Can we simply replace the missing value with 11.4? (A sketch follows.)
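A minimal sketch of standalone (point-estimate) imputation on a toy vector; the numbers are hypothetical, chosen so that the mean of the available values is 11.4 as in the example above:

```r
# Toy column with one missing value (hypothetical numbers)
x <- c(10, 12, 11, NA, 12.6)

mean(x, na.rm = TRUE)          # 11.4, the mean of the available values

# Replace the missing entry with that mean (median imputation works the same way)
x_imputed <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)
x_imputed
```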
Missing Value Imputation : 2
- Use attribute relationships
- Better imputation
- Two techniques
- Propensity score (nonparametric). Useful for discrete variables.
- Regression (parametric)
- There are two missing values in x2. What are the most appropriate replacements?
Missing Value Imputation : 3
- There are two missing values in x2. Find the most appropriate replacements (a regression-based sketch follows).
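The original x1/x2 table is not reproduced here, so the sketch below uses a hypothetical two-column data frame to illustrate regression (parametric) imputation: fit a model of x2 on x1 using the complete rows, then predict x2 where it is missing.

```r
# Hypothetical data: x2 is roughly twice x1, with two missing entries
d <- data.frame(
  x1 = c(1, 2, 3, 4, 5, 6),
  x2 = c(2.1, 4.0, NA, 8.2, NA, 11.9)
)

# lm() drops the incomplete rows by default, so the fit uses only observed x2
fit <- lm(x2 ~ x1, data = d)

# Predict x2 for the rows where it is missing and fill them in
missing_rows <- is.na(d$x2)
d$x2[missing_rows] <- predict(fit, newdata = d[missing_rows, ])
d
```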
Missing Value Imputation : 4
- What if more than 50% are missing?
- It doesn't make sense to carry out the analysis on the 20% or 30% of the data that remains and draw inferences about the overall data.
- The best approach is to ignore the actual values and use an available / not-available indicator instead.
Step-4: Missing Values and Outlier Treatment
Data Cleaning Scenario-1
RevolvingUtilizationOfUnsecuredLines
- RevolvingUtilizationOfUnsecuredLines has outliers.
- What type of variable is this? What are the possible values?
- Its mean is 6.05, which is greater than 1, so the variable has some faulty values; its maximum value of 50710 is way too high.
- Let's look at percentiles to see from where it starts exceeding 1.
We can check this condition against the outlier treatment approach described above; our cleaning path is:
Data Cleaning
- RevolvingUtilizationOfUnsecuredLines has outliers.
- Since the outlier percentage is small (less than 10%), we will replace the outliers with a typical value (the median) of the remaining data.
- Outliers are the values greater than 1.
LAB: Data Cleaning Scenario-1
- What percentage of values are missing in RevolvingUtilizationOfUnsecuredLines?
- Get the detailed percentile distribution.
- Clean the variable, and create a new variable by removing all the issues.
Solution – What percentage of values are missing in RevolvingUtilizationOfUnsecuredLines? (The earlier summary showed no NA's for this variable; the boxplot below highlights its outliers instead.)
library(ggplot2)
ggplot(loans, aes(y=RevolvingUtilizationOfUnsecuredLines, x = 1)) + geom_boxplot()
- Get the detailed percentile distribution.
util_percentiles<-round(quantile(loans$RevolvingUtilizationOfUnsecuredLines,c(0.05, 0.1, 0.25, 0.5, 0.75, 0.80, 0.9,0.91,0.95,0.96,0.97,0.975,0.98,0.99,0.997,1)),2)
util_percentiles
## 5% 10% 25% 50% 75% 80% 90% 91%
## 0.00 0.00 0.03 0.15 0.56 0.70 0.98 1.00
## 95% 96% 97% 97.5% 98% 99% 99.7% 100%
## 1.00 1.00 1.00 1.00 1.01 1.09 1.76 50708.00
- Clean the variable, and create a new variable by removing all the issues
- If utilization is more than 1, it can be replaced by the median.
median_util<-median(loans$RevolvingUtilizationOfUnsecuredLines)
loans$util_new<-ifelse(loans$RevolvingUtilizationOfUnsecuredLines>1,median_util,loans$RevolvingUtilizationOfUnsecuredLines)
- Box plot of the cleaned variable
ggplot(loans, aes(y=util_new, x = 1)) + geom_boxplot()
- Percentile distribution of the new variable
util_percentiles1<-round(quantile(loans$util_new,c(0.05, 0.1, 0.25, 0.5, 0.75, 0.80, 0.9,0.91,0.95,0.96,0.97,0.975,0.98,0.99,1)),2)
util_percentiles1
## 5% 10% 25% 50% 75% 80% 90% 91% 95% 96% 97% 97.5%
## 0.00 0.00 0.03 0.15 0.51 0.63 0.93 0.95 1.00 1.00 1.00 1.00
## 98% 99% 100%
## 1.00 1.00 1.00
Data Cleaning Scenario-2
NumberOfTime30_59DaysPastDueNotWorse
- Find the bad rate in each category of this variable.
- Replace 96 with _____? Replace 98 with_____?
The cleaning path for this variable is to cross-tabulate it with the target and impute the error codes based on the bad rate, as worked out in the LAB below.
LAB: Data Cleaning Scenario-2
- What is the issue with NumberOfTime30_59DaysPastDueNotWorse
- Draw a frequency table.
- What percent of the values are erroneous?
- Clean the variable: look at the cross-tab of the variable vs the target and impute based on the target.
- Create frequency table for cleaned variable.
Solution 1. What is the issue with NumberOfTime30_59DaysPastDueNotWorse?
- Draw a frequency table
freq_table_30dpd<-table(loans$NumberOfTime30.59DaysPastDueNotWorse)
freq_table_30dpd
##
## 0 1 2 3 4 5 6 7 8 9
## 126018 16033 4598 1754 747 342 140 54 25 12
## 10 11 12 13 96 98
## 4 1 2 1 5 264
The one-month-late frequency cannot exceed 24 in the last 24 months, so values like 96 and 98 are clearly erroneous.
- What percent of the values are erroneous?
freq_table_30dpd[14:length(freq_table_30dpd)]
##
## 13 96 98
## 1 5 264
sum(freq_table_30dpd[14:length(freq_table_30dpd)])/sum(freq_table_30dpd)
## [1] 0.0018
- Clean the variable: look at the cross-tab of the variable vs the target and impute based on the target.
#Cross tab with target
cross_tab_30dpd_target<-table(loans$NumberOfTime30.59DaysPastDueNotWorse,loans$SeriousDlqin2yrs)
cross_tab_30dpd_target
##
## 0 1
## 0 120977 5041
## 1 13624 2409
## 2 3379 1219
## 3 1136 618
## 4 429 318
## 5 188 154
## 6 66 74
## 7 26 28
## 8 17 8
## 9 8 4
## 10 1 3
## 11 0 1
## 12 1 1
## 13 0 1
## 96 1 4
## 98 121 143
#Cross tab row Percentages
cross_tab_30dpd_target_percent<-round(prop.table(cross_tab_30dpd_target, 1),2)
cross_tab_30dpd_target_percent
##
## 0 1
## 0 0.96 0.04
## 1 0.85 0.15
## 2 0.73 0.27
## 3 0.65 0.35
## 4 0.57 0.43
## 5 0.55 0.45
## 6 0.47 0.53
## 7 0.48 0.52
## 8 0.68 0.32
## 9 0.67 0.33
## 10 0.25 0.75
## 11 0.00 1.00
## 12 0.50 0.50
## 13 0.00 1.00
## 96 0.20 0.80
## 98 0.46 0.54
data.frame(Freq_0=cross_tab_30dpd_target[,1],Freq_1=cross_tab_30dpd_target[,2],Pert_0=cross_tab_30dpd_target_percent[,1],Pert_1=cross_tab_30dpd_target_percent[,2])
## Freq_0 Freq_1 Pert_0 Pert_1
## 0 120977 5041 0.96 0.04
## 1 13624 2409 0.85 0.15
## 2 3379 1219 0.73 0.27
## 3 1136 618 0.65 0.35
## 4 429 318 0.57 0.43
## 5 188 154 0.55 0.45
## 6 66 74 0.47 0.53
## 7 26 28 0.48 0.52
## 8 17 8 0.68 0.32
## 9 8 4 0.67 0.33
## 10 1 3 0.25 0.75
## 11 0 1 0.00 1.00
## 12 1 1 0.50 0.50
## 13 0 1 0.00 1.00
## 96 1 4 0.20 0.80
## 98 121 143 0.46 0.54
# The 0/1 percentages for 98 are close to those for 6,
# so we replace the error values with 6.
loans$num_30_59_dpd_new<-ifelse(loans$NumberOfTime30.59DaysPastDueNotWorse>12,6,loans$NumberOfTime30.59DaysPastDueNotWorse)
- Create frequency table for cleaned variable.
table(loans$num_30_59_dpd_new)
##
## 0 1 2 3 4 5 6 7 8 9
## 126018 16033 4598 1754 747 342 410 54 25 12
## 10 11 12
## 4 1 2
Data Cleaning Scenario-3
MonthlyIncome
- Monthly Income has nearly 20% missing values.
- Missing value percentage is significant.
- Simply replacing with mean or median is not sufficient.
- We can create an indicator variable to keep track of missing and non-missing values.
The path to clean MonthlyIncome is to create a missing/non-missing indicator variable and then impute the missing values with the median, as in the LAB below.
LAB: Data Cleaning Scenario-3 : Monthly Income
- Find the missing value percentage in monthly income.
- Create an indicator variable for missing and non-missing.
- Replace the missing values with median.
Solution – Find the missing value percentage in monthly income
sum(is.na(loans$MonthlyIncome))
## [1] 29731
sum(is.na(loans$MonthlyIncome))/nrow(loans)
## [1] 0.1982067
- Create an indicator variable for missing and non-missing
loans$MonthlyIncome_ind<-ifelse(is.na(loans$MonthlyIncome), "Missing", "Non-Missing")
table(loans$MonthlyIncome_ind)
##
## Missing Non-Missing
## 29731 120269
- Replace the missing values with median
median_income<-median(loans$MonthlyIncome, na.rm = TRUE)
median_income
## [1] 5400
loans$MonthlyIncome_new<-ifelse(is.na(loans$MonthlyIncome), median_income, loans$MonthlyIncome)
sum(is.na(loans$MonthlyIncome_new))
## [1] 0
Data Cleaning Other Variables
Remaining Variables Imputation
- DebtRatio: imputation.
- NumberOfOpenCreditLinesAndLoans: no issues in this variable.
- NumberOfTimes90DaysLate: imputation similar to NumberOfTime30_59DaysPastDueNotWorse (see the sketch after this list).
- NumberRealEstateLoansOrLines: no issues in this variable.
- NumberOfTime60_89DaysPastDueNotWorse: imputation similar to NumberOfTime30_59DaysPastDueNotWorse.
- NumberOfDependents: impute based on the target variable.
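As an illustration of "imputation similar to NumberOfTime30_59DaysPastDueNotWorse", the sketch below applies the same recipe to the other two past-due counters: inspect the cross-tab with the target, then replace the error codes 96 and 98 with a category whose bad rate is comparable. The replacement value 6 and the new column names are assumptions carried over from the earlier scenario; in practice the value should be read off each variable's own cross-tab.

```r
# Cross-tabs with the target, as row percentages, to pick a sensible replacement category
round(prop.table(table(loans$NumberOfTimes90DaysLate, loans$SeriousDlqin2yrs), 1), 2)
round(prop.table(table(loans$NumberOfTime60.89DaysPastDueNotWorse, loans$SeriousDlqin2yrs), 1), 2)

# Replace the error codes 96 and 98 (assumed replacement value: 6)
loans$num_90_dpd_new    <- ifelse(loans$NumberOfTimes90DaysLate %in% c(96, 98),
                                  6, loans$NumberOfTimes90DaysLate)
loans$num_60_89_dpd_new <- ifelse(loans$NumberOfTime60.89DaysPastDueNotWorse %in% c(96, 98),
                                  6, loans$NumberOfTime60.89DaysPastDueNotWorse)

# Frequency tables for the cleaned variables
table(loans$num_90_dpd_new)
table(loans$num_60_89_dpd_new)
```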
Conclusion
- Data cleaning is as important as data analysis.
- Sometimes 80% of the overall project time is spent on data cleaning.
- Data cleaning needs patience; we need to clean each variable individually.
- Apart from suggested methods, there are many heuristic ways of cleaning the data.