Before start our lesson please download the datasets.
Give me some credit data
We will try to understand the data exploration, validation and data cleaning using a case study on loans data
Give me some credit data. It is loans data. Historical data are provided on 150,000 borrowers.
The final objective is to build a model that borrowers can use to help make the best financial decisions.
We generally get the data and data dictionary from the data team.
Steps in Data Exploration and Cleaning
Step-1: Basic details of the data
Step-2: Categorical variables exploration
Step-3: Continuous variables exploration
Step-4: Missing Values and Outlier Treatment
Step-1: Basic details of the data
#Lab: Basic contents of the data
# Import “Give me some Credit\cs-training.csv”
# What are number of rows and columns
# Are there any suspicious variables?
# Are all the variable names correct?
# Display the variable formats
# Print the first 10 observations
# Do we have any unique identifier?
# Do the text and numeric variables have meaningful data?
# Are there some coded values in the data?
# Do all the variables appear to have data
# Code: Basic contents of the data
import pandas as pd
loans=pd.read_csv("C:\\Users\\jyothi\\Google Drive\\Datasets\\Give me some Credit\\cs-training.csv")
loans
#What are number of rows and columns
loans.shape
#Are there any suspicious variables?
loans.columns.values
#Display the variable formats
loans.dtypes
#Print the first 10 observations
loans.head(10)
#Do we have any unique identifier?
loans.columns.values
Step-2: Categorical and Descrete variables exploration
# Code: Frequencies
# What are the categorical and discrete variables? What are the continues variables.
# Find the frequencies of all class variables in the data
# Are there any variables with missing values?
# Are there any default values?
# Can you identify the variables with outliers?
# Are there any variables with other issues?
#What are the categorical and discrete variables? What are the continues variables.
loans.dtypes
loans.head()
#Find the frequencies of all class variables in the data
loans.columns.values
#Find the frequencies of all class variables in the data
loans['SeriousDlqin2yrs'].value_counts()
loans['age'].value_counts(sort=False)
loans['NumberOfTime30-59DaysPastDueNotWorse'].value_counts(sort=False)
loans['NumberOfOpenCreditLinesAndLoans'].value_counts(sort=False)
loans['NumberOfTimes90DaysLate'].value_counts(sort=False)
loans['NumberRealEstateLoansOrLines'].value_counts(sort=False)
loans['NumberOfTime60-89DaysPastDueNotWorse'].value_counts(sort=False)
loans['NumberOfDependents'].value_counts(sort=False)
#Are there any variables with missing values?
loans.isnull().sum()
Step-3: Continuous variables exploration
#Code: Continuous variables summary
# List down the continuous variables
# Find summary statistics for each variable. Min, Max, Median, Mean, sd, Var
# Find Quartiles for each of the variables
# Create Box plots and identify outliers
# Find the percentage of missing values
# Find Percentiles and find percentage of outliers, if any P1, p5,p10,q1(p25),q3(p75), p90,p99
#List down the continuous variables
loans.dtypes
loans.head()
#Find summary statistics for each variable. Min, Max, Median, Mean, sd, Var
loans['RevolvingUtilizationOfUnsecuredLines'].describe()
loans['MonthlyIncome'].describe()
import numpy as np
variance = np.var(loans['RevolvingUtilizationOfUnsecuredLines'])
variance
import numpy as np
np.var(loans['MonthlyIncome'])
import numpy as np
np.std(loans['RevolvingUtilizationOfUnsecuredLines'])
import numpy as np
np.std(loans['MonthlyIncome'])
#Find Quartiles for each of the variables
loans['RevolvingUtilizationOfUnsecuredLines'].describe()
loans['MonthlyIncome'].describe()
This is an integer variable. It has missing values represented by ‘NA’. Its minimum value is 0, which is practically impossible. Mean is 6670 and median is 5400 without considering NA values.
#Create Box plots and identify outliers
import matplotlib.pyplot as plt
%matplotlib inline
loans.boxplot(column="RevolvingUtilizationOfUnsecuredLines")
import matplotlib.pyplot as plt
%matplotlib inline
loans.boxplot(column="MonthlyIncome")
#Find the missing values in MonthlyIncome variable
loans['MonthlyIncome'].isnull().sum()
29731 values are missing in MonthlyIncome variable
#Find the percentage of missing values
loans['MonthlyIncome'].isnull().sum()/len(loans)
#Find Percentiles and find percentage of outliers, if any P1, p5,p10,q1(p25),q3(p75), p90,p99
util_percentiles=loans['RevolvingUtilizationOfUnsecuredLines'].quantile([0.05, 0.1, 0.25, 0.5, 0.75, 0.80, 0.9,0.91,0.95,0.96,0.97,0.975,0.98,0.99,1])
round(util_percentiles,2)
Code: Data Cleaning Scenario-1
# What percent are missing values in RevolvingUtilizationOfUnsecuredLines?
# Get the detailed percentile distribution
# Clean the variable, and create a new variable by removing all the issues
# # LAB: Data Cleaning Scenario-1
#What percent are missing values in RevolvingUtilizationOfUnsecuredLines?
#Get the detailed percentile distribution
util_percentiles=loans['RevolvingUtilizationOfUnsecuredLines'].quantile([0.05, 0.1, 0.25, 0.5, 0.75, 0.80, 0.9,0.91,0.95,0.96,0.97,0.975,0.98,0.99,1])
round(util_percentiles,2)
#Clean the variable, and create a new variable by removing all the issues
#If utilization is more than 1 then it can be replaced by median
median_util=loans['RevolvingUtilizationOfUnsecuredLines'].median()
median_util
loans['util_new']=loans['RevolvingUtilizationOfUnsecuredLines']
loans['util_new'][loans['util_new']>1]=median_util
loans['util_new']
# percentile distribution for new variable
util_percentiles1=loans['util_new'].quantile([0.05, 0.1, 0.25, 0.5, 0.75, 0.80, 0.9,0.91,0.95,0.96,0.97,0.975,0.98,0.99,1])
round(util_percentiles1,2)
Code: Data Cleaning Scenario-2
# What is the issue with NumberOfTime30_59DaysPastDueNotWorse
# Draw a frequency table
# What percent of the values are erroneous?
# Clean the variable- Look at the cross tab of variable vs target. Impute based on target .
# Create frequency table for cleaned variable
# # LAB: Data Cleaning Scenario-2
#What is the issue with NumberOfTime30_59DaysPastDueNotWorse
#Draw a frequency table
freq_table_30dpd=loans['NumberOfTime30-59DaysPastDueNotWorse'].value_counts(sort=False)
freq_table_30dpd
#One month defaults frequency can't be beyond 24 in last 24 months
#What percent of the values are erroneous?
freq_table_30dpd[13:len(freq_table_30dpd)]
freq_table_30dpd[13:len(freq_table_30dpd)].sum()/freq_table_30dpd.sum()
#Clean the variable- Look at the cross tab of variable vs target. Impute based on target .
#Cross tab with target
import pandas as pd
cross_tab_30dpd_target=pd.crosstab(loans['NumberOfTime30-59DaysPastDueNotWorse'],loans['SeriousDlqin2yrs'])
cross_tab_30dpd_target
#Cross tab row Percentages
cross_tab_30dpd_target_percent=cross_tab_30dpd_target.astype(float).div(cross_tab_30dpd_target.sum(axis=1), axis=0)
round(cross_tab_30dpd_target_percent,2)
loans['num_30_59_dpd_new']=loans['NumberOfTime30-59DaysPastDueNotWorse']
loans['num_30_59_dpd_new'][loans['num_30_59_dpd_new']>12]=6
loans['num_30_59_dpd_new']
loans['num_30_59_dpd_new'].value_counts(sort=False)
LAB: Monthly Income
# Find the missing value percentage in monthly income
# Create an indicator variable for missing and non-missing
# Replace the missing values with median
# # Data Cleaning Scenario-3
#Find the missing value percentage in monthly income
loans['MonthlyIncome'].isnull().sum()
loans['MonthlyIncome'].isnull().sum()/len(loans)
#Once identified where missing values exist, the next task usually is to fill them (data imputation). Depending upon the context,
#in this case, I am assigning median value to all those positions where missing value is present:
loans['MonthlyIncome_ind']=1
loans['MonthlyIncome_ind'][loans['MonthlyIncome'].isnull()]=0
loans['MonthlyIncome_ind'].value_counts(sort=False)
loans['MonthlyIncome_new']=loans['MonthlyIncome']
loans['MonthlyIncome_new'][loans['MonthlyIncome'].isnull()]=loans['MonthlyIncome'].median()
round(loans['MonthlyIncome_new'].describe())
#Remaining Variables Imputation
# Debt Ratio: Imputation
# NumberOfOpenCreditLinesAndLoans : No issues in this variable
# NumberOfTimes90DaysLate: Imputation similar to NumberOfTime30_59DaysPastDueNotW
# NumberRealEstateLoansOrLines: : No issues in this variable
# NumberOfTime60_89DaysPastDueNotW: Imputation similar to NumberOfTime30_59DaysPastDueNotW
# ConclusionNumberOfDependents: Impute based on target variable
#Conclusion
#Data cleaning needs patience, we need to clean for each individual variable