You can download the datasets.
Census Income Prediction
Problem Statement
Abstract of the Problem
The objective of this project is to predict whether a citizen's income exceeds $50K/yr based on census income data.
Source of Information
U.S. Census Bureau, United States Department of Commerce
Donor
Terran Lane and Ronny Kohavi, Data Mining and Visualization, Silicon Graphics (terran@ecn.purdue.edu, ronnyk@sgi.com)
Date Donated: March 7, 2000
Data Exploration
Dataset Information
Income data (income_data): the income_data set contains 32,561 rows and 15 columns.
Data Import
The first step is to import the data set into R. Formats such as .csv and .xlsx are common among data scientists and analysts, and each has a suitable import function. Our data set is in CSV format, so we use the function ‘read.csv()’ to import it.
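# stringsAsFactors = TRUE keeps text columns as factors (the default before R 4.0); levels() below relies on this
census_income_data<-read.csv("E:/R_Census_Project/Satish/Census Income Data/income_data.csv", stringsAsFactors = TRUE)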
dim(census_income_data)
## [1] 32561 15
names(census_income_data)
## [1] "age" "workclass" "fnlwgt" "education"
## [5] "education.num" "marital.status" "occupation" "relationship"
## [9] "race" "sex" "capital.gain" "capital.loss"
## [13] "hours.per.week" "native.country" "Income_band"
levels(census_income_data$Income_band)[1]<-0
levels(census_income_data$Income_band)[2]<-1
table(census_income_data$Income_band)
##
## 0 1
## 24720 7841
head(census_income_data)
## age workclass fnlwgt education education.num
## 1 39 State-gov 77516 Bachelors 13
## 2 50 Self-emp-not-inc 83311 Bachelors 13
## 3 38 Private 215646 HS-grad 9
## 4 53 Private 234721 11th 7
## 5 28 Private 338409 Bachelors 13
## 6 37 Private 284582 Masters 14
## marital.status occupation relationship race sex
## 1 Never-married Adm-clerical Not-in-family White Male
## 2 Married-civ-spouse Exec-managerial Husband White Male
## 3 Divorced Handlers-cleaners Not-in-family White Male
## 4 Married-civ-spouse Handlers-cleaners Husband Black Male
## 5 Married-civ-spouse Prof-specialty Wife Black Female
## 6 Married-civ-spouse Exec-managerial Wife White Female
## capital.gain capital.loss hours.per.week native.country Income_band
## 1 2174 0 40 United-States 0
## 2 0 0 13 United-States 0
## 3 0 0 40 United-States 0
## 4 0 0 40 United-States 0
## 5 0 0 40 Cuba 0
## 6 0 0 40 United-States 0
The census_income_data set contains 32,561 rows and 15 columns.
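A minimal sketch of the import options that matter for the recoding step above, using two made-up rows passed through the text argument instead of the real file. The raw labels <=50K and >50K and the names csv_text, raw and toy are assumptions for illustration; the space after each comma mimics the leading spaces visible in the category names printed later in this report.

```r
# Two made-up rows in a comma-plus-space layout (an assumption about the file).
csv_text <- "age,workclass,Income_band
39, State-gov, <=50K
50, Private, >50K"

# Default whitespace handling: labels keep their leading space.
raw <- read.csv(text = csv_text, stringsAsFactors = TRUE)
print(levels(raw$workclass))

# strip.white = TRUE drops the space after each comma;
# stringsAsFactors = TRUE gives factor columns, which levels() needs.
toy <- read.csv(text = csv_text, stringsAsFactors = TRUE, strip.white = TRUE)
print(levels(toy$workclass))
print(levels(toy$Income_band))
```

Stripping the whitespace at import would change the labels shown in the outputs of this report, so it is offered only as an option.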
Univariate Analysis
Once we have the dataset and its metadata, understanding the metadata thoroughly is a crucial step. Exploration helps us understand each variable, which is necessary for understanding the relationship between the input variables and the variable being predicted. It also gives a first overall picture of the dataset.
Check whether missing values are present
sum(is.na(census_income_data))
## [1] 0
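The census_income_data set has no NA values. Note, however, that unknown entries are coded as " ?" in some categorical columns (1,836 rows of workclass and 1,843 rows of occupation, as the frequency tables below show), and is.na() does not count those.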
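A zero from is.na() only rules out values that R parsed as NA. The following is a minimal sketch, on a small made-up data frame, of counting "?" placeholders per column and optionally recoding them as NA. The names toy_data, count_placeholder and toy_clean are hypothetical and not part of the original analysis.

```r
# Toy data frame standing in for census_income_data (values are made up).
toy_data <- data.frame(
  age        = c(39, 50, 38, 53),
  workclass  = c(" State-gov", " ?", " Private", " ?"),
  occupation = c(" Sales", " ?", " Sales", " Tech-support"),
  stringsAsFactors = TRUE
)

# Count entries equal to "?" after trimming surrounding whitespace.
count_placeholder <- function(column) {
  sum(trimws(as.character(column)) == "?")
}

placeholder_counts <- sapply(toy_data, count_placeholder)
print(placeholder_counts)

# Optionally recode the placeholder as NA so that is.na() sees it.
toy_clean <- toy_data
for (col in names(toy_clean)) {
  if (is.factor(toy_clean[[col]])) {
    values <- trimws(as.character(toy_clean[[col]]))
    values[values == "?"] <- NA
    toy_clean[[col]] <- factor(values)
  }
}
print(sum(is.na(toy_clean)))
```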
Variable_1= “age”
age is a numerical variable.
head(census_income_data$age)
## [1] 39 50 38 53 28 37
Univariate Analysis of age
Central tendencies of age
mean of age
mean(census_income_data$age)
## [1] 38.58165
median of age
median(census_income_data$age)
## [1] 37
Dispersion of age
Variance of age
var(census_income_data$age)
## [1] 186.0614
Standard deviation of age
sd(census_income_data$age)
## [1] 13.64043
summary gives the minimum, quartiles, median, mean, and maximum of age
summary(census_income_data$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.00 28.00 37.00 38.58 48.00 90.00
boxplot of age
quantile(census_income_data$age)
## 0% 25% 50% 75% 100%
## 17 28 37 48 90
quantile(census_income_data$age,c(0.75,0.80,0.90,1))
## 75% 80% 90% 100%
## 48 50 58 90
boxplot(census_income_data$age,main="age")
Output description
In this boxplot the minimum is 17, the maximum is 90, and the median is 37. The first quartile is 28 and the third quartile is 48. Outliers are discussed later.
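By default, boxplot() draws points lying more than 1.5 times the interquartile range beyond the box as individual outliers. Below is a rough sketch of that rule using the quartiles of age reported above (28 and 48). The toy ages are made up, the fence names are illustrative, and boxplot() itself works from hinges, which can differ slightly from quantile() values.

```r
# Quartiles of age as reported by quantile() above.
q1 <- 28
q3 <- 48
iqr <- q3 - q1                      # interquartile range

# Tukey's rule: points beyond 1.5 * IQR from the quartiles are flagged.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Toy ages (made up) to show which values the rule flags.
toy_age <- c(17, 25, 37, 48, 60, 79, 90)
flagged <- toy_age[toy_age < lower_fence | toy_age > upper_fence]

print(c(lower = lower_fence, upper = upper_fence))
print(flagged)
```

With these quartiles the lower fence falls below the observed minimum of 17, so under this rule only ages above the upper fence would be drawn as outliers.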
Histogram of “age” variable:
hist(census_income_data$age)
Correlation between the age variable and the Income_band variable
library(ltm)
## Warning: package 'ltm' was built under R version 3.3.3
## Loading required package: MASS
## Warning: package 'MASS' was built under R version 3.3.3
## Loading required package: msm
## Warning: package 'msm' was built under R version 3.3.3
## Loading required package: polycor
## Warning: package 'polycor' was built under R version 3.3.3
biserial.cor(census_income_data$age,census_income_data$Income_band)
## [1] -0.2340335
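The correlation is -0.23. Note the sign convention: by default biserial.cor() takes the first level of the factor as its reference (level = 1), which here is Income_band = 0 (presumably the group earning $50K or less, since it is the first level and the majority class). The negative value therefore means that the 0 group is younger on average, so age is positively associated with the higher income band (1).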
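The sign of a point-biserial correlation depends on which level of the two-level variable is treated as the reference. The sketch below uses made-up data and a hand-written helper (point_biserial is a hypothetical name, not the ltm implementation) to show that switching the reference level flips the sign, and that the version referenced to level "1" matches Pearson's r with a 0/1 coding.

```r
# Toy data (made up): a numeric variable and a two-level factor.
x <- c(20, 25, 30, 45, 50, 55)
y <- factor(c("0", "0", "0", "1", "1", "1"))

# Point-biserial correlation relative to a chosen reference level.
# This sketch uses the population standard deviation of x.
point_biserial <- function(x, y, level = 1) {
  in_level <- y == levels(y)[level]
  p <- mean(in_level)
  pop_sd <- sqrt(mean((x - mean(x))^2))
  (mean(x[in_level]) - mean(x[!in_level])) * sqrt(p * (1 - p)) / pop_sd
}

r_level1 <- point_biserial(x, y, level = 1)  # reference: first level ("0")
r_level2 <- point_biserial(x, y, level = 2)  # reference: second level ("1")
r_pearson <- cor(x, as.numeric(y == "1"))    # Pearson r with a 0/1 coding

print(c(level1 = r_level1, level2 = r_level2, pearson = r_pearson))
```

In the real analysis, passing level = 2 to biserial.cor() should report the correlation relative to Income_band = 1, which is usually the easier direction to read.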
Variable_2= “workclass”
It is categorical data. There are 9 categories:
Not in universe
Private
Self-employed-not incorporated
Local government
State government
Self-employed-incorporated
Federal government
Never worked
Without pay
The “workclass” variable is qualitative data, so measures of central tendency and dispersion are not meaningful. For this scenario we compute a frequency table, the mode, and a bar chart. The mode is the most frequent category of workclass.
Frequency table of workclass
tab<-table(census_income_data$workclass)
tab
##
## ? Federal-gov Local-gov Never-worked
## 1836 960 2093 7
## Private Self-emp-inc Self-emp-not-inc State-gov
## 22696 1116 2541 1298
## Without-pay
## 14
names(tab)
## [1] " ?" " Federal-gov" " Local-gov"
## [4] " Never-worked" " Private" " Self-emp-inc"
## [7] " Self-emp-not-inc" " State-gov" " Without-pay"
sum(is.na(census_income_data$workclass))
## [1] 0
Mode of “workclass”
temp <- table(as.vector(census_income_data$workclass))
names(temp)[temp==max(temp)]
## [1] " Private"
The mode of workclass is “ Private”.
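The table-then-max pattern above is reused for every categorical variable, so it can be wrapped in a small helper. This is only a sketch on a made-up factor; categorical_mode is a hypothetical name, and trimming the leading spaces is an optional extra rather than something the original analysis does.

```r
# Toy factor (made up) with the same leading-space labels seen in names(tab).
toy_workclass <- factor(c(" Private", " Private", " State-gov", " ?", " Private"))

# Return the most frequent label(s); ties return every tied label.
categorical_mode <- function(values) {
  counts <- table(trimws(as.character(values)))
  names(counts)[counts == max(counts)]
}

mode_label <- categorical_mode(toy_workclass)
print(mode_label)
```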
ggplot of “workclass”
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.3
qplot(census_income_data$workclass,main="workclass",ylab="count",colour= I("purple"),size=I(4))
library(gmodels)
## Warning: package 'gmodels' was built under R version 3.3.3
CrossTable(census_income_data$workclass,census_income_data$Income_band,digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)
## Warning in chisq.test(t, correct = FALSE, ...): Chi-squared approximation
## may be incorrect
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 32561
##
##
## | census_income_data$Income_band
## census_income_data$workclass | 0 | 1 | Row Total |
## -----------------------------|-----------|-----------|-----------|
## ? | 1645 | 191 | 1836 |
## | 0.1 | 0.0 | |
## -----------------------------|-----------|-----------|-----------|
## Federal-gov | 589 | 371 | 960 |
## | 0.0 | 0.0 | |
## -----------------------------|-----------|-----------|-----------|
## Local-gov | 1476 | 617 | 2093 |
## | 0.1 | 0.1 | |
## -----------------------------|-----------|-----------|-----------|
## Never-worked | 7 | 0 | 7 |
## | 0.0 | 0.0 | |
## -----------------------------|-----------|-----------|-----------|
## Private | 17733 | 4963 | 22696 |
## | 0.7 | 0.6 | |
## -----------------------------|-----------|-----------|-----------|
## Self-emp-inc | 494 | 622 | 1116 |
## | 0.0 | 0.1 | |
## -----------------------------|-----------|-----------|-----------|
## Self-emp-not-inc | 1817 | 724 | 2541 |
## | 0.1 | 0.1 | |
## -----------------------------|-----------|-----------|-----------|
## State-gov | 945 | 353 | 1298 |
## | 0.0 | 0.0 | |
## -----------------------------|-----------|-----------|-----------|
## Without-pay | 14 | 0 | 14 |
## | 0.0 | 0.0 | |
## -----------------------------|-----------|-----------|-----------|
## Column Total | 24720 | 7841 | 32561 |
## | 0.8 | 0.2 | |
## -----------------------------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 1045.709 d.f. = 8 p = 2.026505e-220
##
##
##
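With digits=1 the column proportions in the table above mostly round to 0.0, and the warning about the chi-squared approximation points to small expected counts. The sketch below uses three rows copied from that table to show row proportions with prop.table() and expected counts from chisq.test(). Because it uses only a subset of rows, its expected counts and test statistic differ from those of the full table.

```r
# Counts copied from three rows of the CrossTable output above.
counts <- matrix(
  c(17733, 4963,
      494,  622,
        7,    0),
  nrow = 3, byrow = TRUE,
  dimnames = list(workclass = c("Private", "Self-emp-inc", "Never-worked"),
                  Income_band = c("0", "1"))
)

# Row proportions: share of each workclass falling in each income band.
row_share <- prop.table(counts, margin = 1)
print(round(row_share, 3))

# Expected counts under independence; small values trigger the warning.
expected <- suppressWarnings(chisq.test(counts, correct = FALSE)$expected)
print(round(expected, 1))
```

In the full table the expected counts in band 1 for Never-worked and Without-pay are about 7 × 7841 / 32561 ≈ 1.7 and 14 × 7841 / 32561 ≈ 3.4, both below the usual rule-of-thumb threshold of 5.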
Variable_3=“fnlwgt”
The number of people the census takers believe the observation represents. We will be ignoring this variable. It is continuous data.
head(census_income_data$fnlwgt)
## [1] 77516 83311 215646 234721 338409 284582
univariate analysis of fnlwgt
Central tendencies of fnlwgt
mean of fnlwgt
mean(census_income_data$fnlwgt)
## [1] 189778.4
median of fnlwgt
median(census_income_data$fnlwgt)
## [1] 178356
Measures of Dispersion of fnlwgt
Variance of fnlwgt
var(census_income_data$fnlwgt)
## [1] 11140797792
Standard deviation of fnlwgt
sd(census_income_data$fnlwgt)
## [1] 105550
summary gives the minimum, quartiles, median, mean, and maximum of fnlwgt
summary(census_income_data$fnlwgt)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12280 117800 178400 189800 237100 1485000
Boxplot of fnlwgt
quantile(census_income_data$fnlwgt)
## 0% 25% 50% 75% 100%
## 12285 117827 178356 237051 1484705
quantile(census_income_data$fnlwgt,c(0.75,0.80,0.90,1))
## 75% 80% 90% 100%
## 237051 259873 329054 1484705
boxplot(census_income_data$fnlwgt,main="fnlwgt")
output description
In this boxplot the minimum is 12285, the maximum is 1484705, and the median is 178356. The first quartile is 117827 and the third quartile is 237051. Outliers are discussed later.
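The summary() output above shows 117800, 178400, 237100 and 1485000 where quantile() shows 117827, 178356, 237051 and 1484705. That pattern is consistent with the values being rounded to four significant digits for display, which, as far as I know, older R releases such as the 3.3 series did in summary(). A small sketch with signif() follows; the names exact and rounded are illustrative.

```r
# Exact values reported by quantile() for fnlwgt.
exact <- c(q1 = 117827, median = 178356, q3 = 237051, max = 1484705)

# Round to four significant digits, as the summary() display appears to do.
rounded <- signif(exact, 4)
print(rounded)
```

When exact figures matter, the quantile() values are the ones to quote.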
Histogram of fnlwgt
hist(census_income_data$fnlwgt)
Correlation between fnlwgt and Income_band
library(ltm)
biserial.cor(census_income_data$fnlwgt,census_income_data$Income_band)
## [1] 0.009462412
The correlation is 0.009462412, which is effectively zero: fnlwgt shows no meaningful linear association with Income_band.
Variable_4=“education”
The highest level of education achieved by the individual. It is categorical data, with the following categories:
. preschool
. 1st 2nd 3rd or 4th grade
. 5th or 6th grade
. 7th and 8th grade
. 9th grade
. 10th grade
. 11th grade
. 12th grade no diploma
. High school graduate
. Some college but no degree
. Associates degree-academic program
. Associates degree-occup /vocational
. Bachelors degree(BA AB BS)
. Masters degree(MA MS MEng MEd MSW MBA)
. Prof school degree (MD DDS DVM LLB JD)
. Doctorate degree(PhD EdD)
“education” contains qualitative data, so measures of central tendency and dispersion are not meaningful. A frequency table, the mode, and a bar plot are computed instead. The mode is the most frequent education level.
frequency table of education
tab<-table(census_income_data$education)
tab
##
## 10th 11th 12th 1st-4th 5th-6th
## 933 1175 433 168 333
## 7th-8th 9th Assoc-acdm Assoc-voc Bachelors
## 646 514 1067 1382 5355
## Doctorate HS-grad Masters Preschool Prof-school
## 413 10501 1723 51 576
## Some-college
## 7291
names(tab)
## [1] " 10th" " 11th" " 12th" " 1st-4th"
## [5] " 5th-6th" " 7th-8th" " 9th" " Assoc-acdm"
## [9] " Assoc-voc" " Bachelors" " Doctorate" " HS-grad"
## [13] " Masters" " Preschool" " Prof-school" " Some-college"
sum(is.na(census_income_data$education))
## [1] 0
Mode of education
temp <- table(as.vector(census_income_data$education))
names(temp)[temp==max(temp)]
## [1] " HS-grad"
Mode of education is “ HS-grad”
ggplot of education
library(ggplot2)
qplot(census_income_data$education,main="education",ylab="count",colour= I("purple"),size=I(4))
library(gmodels)
CrossTable(census_income_data$education,census_income_data$Income_band,digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 32561
##
##
## | census_income_data$Income_band
## census_income_data$education | 0 | 1 | Row Total |
## -----------------------------|-----------|-----------|-----------|
## 10th | 871 | 62 | 933 |
## | 0.0 | 0.0 | |
## -----------------------------|-----------|-----------|-----------|
## 11th | 1115 | 60 | 1175 |
## | 0.0 | 0.0 | |
## -----------------------------|-----------|-----------|-----------|
## 12th | 400 | 33 | 433 |
## | 0.0 | 0.0 | |
## -----------------------------|-----------|-----------|-----------|
## 1st-4th | 162 | 6 | 168 |
## | 0.0 | 0.0 | |
## -----------------------------|-----------|-----------|-----------|
## 5th-6th | 317 | 16 | 333 |
## | 0.0 | 0.0 | |
## -----------------------------|-----------|-----------|-----------|
## 7th-8th | 606 | 40 | 646 |
## | 0.0 | 0.0 | |
## -----------------------------|-----------|-----------|-----------|
## 9th | 487 | 27 | 514 |
## | 0.0 | 0.0 | |
## -----------------------------|-----------|-----------|-----------|
## Assoc-acdm | 802 | 265 | 1067 |
## | 0.0 | 0.0 | |
## -----------------------------|-----------|-----------|-----------|
## Assoc-voc | 1021 | 361 | 1382 |
## | 0.0 | 0.0 | |
## -----------------------------|-----------|-----------|-----------|
## Bachelors | 3134 | 2221 | 5355 |
## | 0.1 | 0.3 | |
## -----------------------------|-----------|-----------|-----------|
## Doctorate | 107 | 306 | 413 |
## | 0.0 | 0.0 | |
## -----------------------------|-----------|-----------|-----------|
## HS-grad | 8826 | 1675 | 10501 |
## | 0.4 | 0.2 | |
## -----------------------------|-----------|-----------|-----------|
## Masters | 764 | 959 | 1723 |
## | 0.0 | 0.1 | |
## -----------------------------|-----------|-----------|-----------|
## Preschool | 51 | 0 | 51 |
## | 0.0 | 0.0 | |
## -----------------------------|-----------|-----------|-----------|
## Prof-school | 153 | 423 | 576 |
## | 0.0 | 0.1 | |
## -----------------------------|-----------|-----------|-----------|
## Some-college | 5904 | 1387 | 7291 |
## | 0.2 | 0.2 | |
## -----------------------------|-----------|-----------|-----------|
## Column Total | 24720 | 7841 | 32561 |
## | 0.8 | 0.2 | |
## -----------------------------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 4429.653 d.f. = 15 p = 0
##
##
##
Variable_5=“education.num”
It is a numerical variable.
head(census_income_data$education.num)
## [1] 13 13 9 7 13 14
x<-sum(is.na(census_income_data$Income_band))
x
## [1] 0
Univariate analysis of education.num
Central tendencies of education.num
mean of education.num
mean(census_income_data$education.num)
## [1] 10.08068
median of education.num
median(census_income_data$education.num)
## [1] 10
Measures of Dispersion of education.num
Variance of education.num
var(census_income_data$education.num)
## [1] 6.61889
Standard deviation of education.num
sd(census_income_data$education.num)
## [1] 2.57272
summary gives the minimum, quartiles, median, mean, and maximum of education.num
summary(census_income_data$education.num)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 9.00 10.00 10.08 12.00 16.00
boxplot of education.num
quantile(census_income_data$education.num)
## 0% 25% 50% 75% 100%
## 1 9 10 12 16
quantile(census_income_data$education.num,c(0.75,0.80,0.90,1))
## 75% 80% 90% 100%
## 12 13 13 16
boxplot(census_income_data$education.num,main="education.num")
Output description
In this boxplot the minimum is 1, the maximum is 16, and the median is 10. The first quartile is 9 and the third quartile is 12. Outliers are discussed later.
histogram of education.num
hist(census_income_data$education.num)
Correlation between education.num and Income_band
library(ltm)
biserial.cor(census_income_data$education.num,census_income_data$Income_band)
## [1] -0.3351488
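The correlation is -0.3351488. Because biserial.cor() uses the first level of Income_band (0) as its reference by default, the negative sign means education.num is lower on average in band 0; education.num is therefore positively associated with the higher income band (1).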
Variable_6=“marital.status”
Marital status of the individual. It is a categorical variable. The categories are:
Never married
Married-civilian spouse present
Divorced
Widowed
Separated
Married-spouse absent
Married-A F spouse present
“marital.status” contains qualitative data, so measures of central tendency and dispersion are not meaningful. A frequency table, the mode, and a bar plot are computed instead. The mode is the most frequent marital status.
Frequency table of marital.status
tab<-table(census_income_data$marital.status)
tab
##
## Divorced Married-AF-spouse Married-civ-spouse
## 4443 23 14976
## Married-spouse-absent Never-married Separated
## 418 10683 1025
## Widowed
## 993
names(tab)
## [1] " Divorced" " Married-AF-spouse"
## [3] " Married-civ-spouse" " Married-spouse-absent"
## [5] " Never-married" " Separated"
## [7] " Widowed"
sum(is.na(census_income_data$marital.status))
## [1] 0
Mode of marital.status
temp <- table(as.vector(census_income_data$marital.status))
names(temp)[temp==max(temp)]
## [1] " Married-civ-spouse"
Mode of marital status is “ Married-civ-spouse”
ggplot of marital status
library(ggplot2)
qplot(census_income_data$marital.status,main="marital status",ylab="count",colour= I("purple"),size=I(4))
library(gmodels)
CrossTable(census_income_data$marital.status,census_income_data$Income_band,digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 32561
##
##
## | census_income_data$Income_band
## census_income_data$marital.status | 0 | 1 | Row Total |
## ----------------------------------|-----------|-----------|-----------|
## Divorced | 3980 | 463 | 4443 |
## | 0.2 | 0.1 | |
## ----------------------------------|-----------|-----------|-----------|
## Married-AF-spouse | 13 | 10 | 23 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## Married-civ-spouse | 8284 | 6692 | 14976 |
## | 0.3 | 0.9 | |
## ----------------------------------|-----------|-----------|-----------|
## Married-spouse-absent | 384 | 34 | 418 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## Never-married | 10192 | 491 | 10683 |
## | 0.4 | 0.1 | |
## ----------------------------------|-----------|-----------|-----------|
## Separated | 959 | 66 | 1025 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## Widowed | 908 | 85 | 993 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## Column Total | 24720 | 7841 | 32561 |
## | 0.8 | 0.2 | |
## ----------------------------------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 6517.742 d.f. = 6 p = 0
##
##
##
Variable_7 = “occupation”
It is categorical data. The categories are:
?
Adm-clerical
Armed-Forces
Craft-repair
Exec-managerial
Farming-fishing
Handlers-cleaners
Machine-op-inspct
Other-service
Priv-house-serv
Prof-specialty
Protective-serv
Sales
Tech-support
Transport-moving
Frequency table of occupation
tab<-table(census_income_data$occupation)
tab
##
## ? Adm-clerical Armed-Forces
## 1843 3770 9
## Craft-repair Exec-managerial Farming-fishing
## 4099 4066 994
## Handlers-cleaners Machine-op-inspct Other-service
## 1370 2002 3295
## Priv-house-serv Prof-specialty Protective-serv
## 149 4140 649
## Sales Tech-support Transport-moving
## 3650 928 1597
names(tab)
## [1] " ?" " Adm-clerical" " Armed-Forces"
## [4] " Craft-repair" " Exec-managerial" " Farming-fishing"
## [7] " Handlers-cleaners" " Machine-op-inspct" " Other-service"
## [10] " Priv-house-serv" " Prof-specialty" " Protective-serv"
## [13] " Sales" " Tech-support" " Transport-moving"
sum(is.na(census_income_data$occupation))
## [1] 0
Mode of occupation
temp <- table(as.vector(census_income_data$occupation))
names(temp)[temp==max(temp)]
## [1] " Prof-specialty"
Mode of occupation is “ Prof-specialty”
ggplot of occupation
library(ggplot2)
qplot(census_income_data$occupation,main="occupation",ylab="count",colour= I("purple"),size=I(4))
library(gmodels)
CrossTable(census_income_data$occupation,census_income_data$Income_band,digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)
## Warning in chisq.test(t, correct = FALSE, ...): Chi-squared approximation
## may be incorrect
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 32561
##
##
## | census_income_data$Income_band
## census_income_data$occupation | 0 | 1 | Row Total |
## ------------------------------|-----------|-----------|-----------|
## ? | 1652 | 191 | 1843 |
## | 0.1 | 0.0 | |
## ------------------------------|-----------|-----------|-----------|
## Adm-clerical | 3263 | 507 | 3770 |
## | 0.1 | 0.1 | |
## ------------------------------|-----------|-----------|-----------|
## Armed-Forces | 8 | 1 | 9 |
## | 0.0 | 0.0 | |
## ------------------------------|-----------|-----------|-----------|
## Craft-repair | 3170 | 929 | 4099 |
## | 0.1 | 0.1 | |
## ------------------------------|-----------|-----------|-----------|
## Exec-managerial | 2098 | 1968 | 4066 |
## | 0.1 | 0.3 | |
## ------------------------------|-----------|-----------|-----------|
## Farming-fishing | 879 | 115 | 994 |
## | 0.0 | 0.0 | |
## ------------------------------|-----------|-----------|-----------|
## Handlers-cleaners | 1284 | 86 | 1370 |
## | 0.1 | 0.0 | |
## ------------------------------|-----------|-----------|-----------|
## Machine-op-inspct | 1752 | 250 | 2002 |
## | 0.1 | 0.0 | |
## ------------------------------|-----------|-----------|-----------|
## Other-service | 3158 | 137 | 3295 |
## | 0.1 | 0.0 | |
## ------------------------------|-----------|-----------|-----------|
## Priv-house-serv | 148 | 1 | 149 |
## | 0.0 | 0.0 | |
## ------------------------------|-----------|-----------|-----------|
## Prof-specialty | 2281 | 1859 | 4140 |
## | 0.1 | 0.2 | |
## ------------------------------|-----------|-----------|-----------|
## Protective-serv | 438 | 211 | 649 |
## | 0.0 | 0.0 | |
## ------------------------------|-----------|-----------|-----------|
## Sales | 2667 | 983 | 3650 |
## | 0.1 | 0.1 | |
## ------------------------------|-----------|-----------|-----------|
## Tech-support | 645 | 283 | 928 |
## | 0.0 | 0.0 | |
## ------------------------------|-----------|-----------|-----------|
## Transport-moving | 1277 | 320 | 1597 |
## | 0.1 | 0.0 | |
## ------------------------------|-----------|-----------|-----------|
## Column Total | 24720 | 7841 | 32561 |
## | 0.8 | 0.2 | |
## ------------------------------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 4031.974 d.f. = 14 p = 0
##
##
##
Variable_8=“relationship”
It is categorical data. The categories are:
Wife
Own-child
Husband
Not-in-family
Other-relative
Unmarried.
Frequency table of relationship
tab<-table(census_income_data$relationship)
tab
##
## Husband Not-in-family Other-relative Own-child
## 13193 8305 981 5068
## Unmarried Wife
## 3446 1568
names(tab)
## [1] " Husband" " Not-in-family" " Other-relative" " Own-child"
## [5] " Unmarried" " Wife"
sum(is.na(census_income_data$relationship))
## [1] 0
Mode of relationship
temp <- table(as.vector(census_income_data$relationship))
names(temp)[temp==max(temp)]
## [1] " Husband"
Mode of relationship is “ Husband”
ggplot of relationship
library(ggplot2)
qplot(census_income_data$relationship,main="relationship",ylab="count",colour= I("purple"),size=I(4))
library(gmodels)
CrossTable(census_income_data$relationship,census_income_data$Income_band,digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 32561
##
##
## | census_income_data$Income_band
## census_income_data$relationship | 0 | 1 | Row Total |
## --------------------------------|-----------|-----------|-----------|
## Husband | 7275 | 5918 | 13193 |
## | 0.3 | 0.8 | |
## --------------------------------|-----------|-----------|-----------|
## Not-in-family | 7449 | 856 | 8305 |
## | 0.3 | 0.1 | |
## --------------------------------|-----------|-----------|-----------|
## Other-relative | 944 | 37 | 981 |
## | 0.0 | 0.0 | |
## --------------------------------|-----------|-----------|-----------|
## Own-child | 5001 | 67 | 5068 |
## | 0.2 | 0.0 | |
## --------------------------------|-----------|-----------|-----------|
## Unmarried | 3228 | 218 | 3446 |
## | 0.1 | 0.0 | |
## --------------------------------|-----------|-----------|-----------|
## Wife | 823 | 745 | 1568 |
## | 0.0 | 0.1 | |
## --------------------------------|-----------|-----------|-----------|
## Column Total | 24720 | 7841 | 32561 |
## | 0.8 | 0.2 | |
## --------------------------------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 6699.077 d.f. = 5 p = 0
##
##
##
Variable_9=“race”
It is a categorical variable. The categories are:
White
Black
Asian or Pacific Islander
Other
Amer Indian Aleut or Eskimo
Frequency table of race
tab<-table(census_income_data$race)
tab
##
## Amer-Indian-Eskimo Asian-Pac-Islander Black
## 311 1039 3124
## Other White
## 271 27816
names(tab)
## [1] " Amer-Indian-Eskimo" " Asian-Pac-Islander" " Black"
## [4] " Other" " White"
sum(is.na(census_income_data$race))
## [1] 0
Mode of race
temp <- table(as.vector(census_income_data$race))
names(temp)[temp==max(temp)]
## [1] " White"
Mode of race is “white”
ggplot of race
library(ggplot2)
qplot(census_income_data$race,main="race",ylab="count",colour= I("purple"),size=I(4))
library(gmodels)
CrossTable(census_income_data$race,census_income_data$Income_band,digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 32561
##
##
## | census_income_data$Income_band
## census_income_data$race | 0 | 1 | Row Total |
## ------------------------|-----------|-----------|-----------|
## Amer-Indian-Eskimo | 275 | 36 | 311 |
## | 0.0 | 0.0 | |
## ------------------------|-----------|-----------|-----------|
## Asian-Pac-Islander | 763 | 276 | 1039 |
## | 0.0 | 0.0 | |
## ------------------------|-----------|-----------|-----------|
## Black | 2737 | 387 | 3124 |
## | 0.1 | 0.0 | |
## ------------------------|-----------|-----------|-----------|
## Other | 246 | 25 | 271 |
## | 0.0 | 0.0 | |
## ------------------------|-----------|-----------|-----------|
## White | 20699 | 7117 | 27816 |
## | 0.8 | 0.9 | |
## ------------------------|-----------|-----------|-----------|
## Column Total | 24720 | 7841 | 32561 |
## | 0.8 | 0.2 | |
## ------------------------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 330.9204 d.f. = 4 p = 2.305961e-70
##
##
##
Variable_10=“sex”
It is categorical data. The categories are:
Female
Male
Frequency table of sex
tab<-table(census_income_data$sex)
tab
##
## Female Male
## 10771 21790
names(tab)
## [1] " Female" " Male"
sum(is.na(census_income_data$sex))
## [1] 0
Mode of sex
temp <- table(as.vector(census_income_data$sex))
names(temp)[temp==max(temp)]
## [1] " Male"
Mode of sex is “male”
ggplot of sex
library(ggplot2)
qplot(census_income_data$sex,main="sex",ylab="count",colour= I("purple"),size=I(4))
library(gmodels)
CrossTable(census_income_data$sex,census_income_data$Income_band,digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 32561
##
##
## | census_income_data$Income_band
## census_income_data$sex | 0 | 1 | Row Total |
## -----------------------|-----------|-----------|-----------|
## Female | 9592 | 1179 | 10771 |
## | 0.4 | 0.2 | |
## -----------------------|-----------|-----------|-----------|
## Male | 15128 | 6662 | 21790 |
## | 0.6 | 0.8 | |
## -----------------------|-----------|-----------|-----------|
## Column Total | 24720 | 7841 | 32561 |
## | 0.8 | 0.2 | |
## -----------------------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 1518.887 d.f. = 1 p = 0
##
## Pearson's Chi-squared test with Yates' continuity correction
## ------------------------------------------------------------
## Chi^2 = 1517.813 d.f. = 1 p = 0
##
##
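For a 2 x 2 table such as this one, the same statistics can be obtained from base R. The sketch below rebuilds the table from the counts printed above and runs chisq.test() with and without Yates' continuity correction. It also computes the share of each sex in band 1, which the one-digit proportions above hide. The object names are illustrative.

```r
# Counts copied from the CrossTable output above.
sex_by_income <- matrix(
  c( 9592, 1179,
    15128, 6662),
  nrow = 2, byrow = TRUE,
  dimnames = list(sex = c("Female", "Male"), Income_band = c("0", "1"))
)

# Pearson chi-squared test without and with Yates' continuity correction.
plain <- chisq.test(sex_by_income, correct = FALSE)
yates <- chisq.test(sex_by_income, correct = TRUE)

# Share of each sex in income band 1 (row proportions).
share_band1 <- prop.table(sex_by_income, margin = 1)[, "1"]

print(unname(plain$statistic))
print(unname(yates$statistic))
print(round(share_band1, 3))
```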
Variable_11=“capital.gain”
capital.gain is a numerical variable.
head(census_income_data$capital.gain)
## [1] 2174 0 0 0 0 0
univariate analysis of capital.gain
Central tendencies of capital.gain
Mean of capital.gain
mean(census_income_data$capital.gain)
## [1] 1077.649
Median of capital.gain
median(census_income_data$capital.gain)
## [1] 0
Measures of Dispersion of capital.gain
Variance of capital.gain
var(census_income_data$capital.gain)
## [1] 54542539
Standard deviation of capital.gain
sd(census_income_data$capital.gain)
## [1] 7385.292
summary gives the minimum, quartiles, median, mean, and maximum of capital.gain
summary(census_income_data$capital.gain)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 1078 0 100000
boxplot of capital.gain
quantile(census_income_data$capital.gain)
## 0% 25% 50% 75% 100%
## 0 0 0 0 99999
quantile(census_income_data$capital.gain,c(0.75,0.80,0.90,1))
## 75% 80% 90% 100%
## 0 0 0 99999
boxplot(census_income_data$capital.gain,main="capital.gain")
Output description
In this boxplot the minimum is 0, the maximum is 99999, and the median is 0. The first and third quartiles are both 0. (summary() displays the maximum rounded to 100000; quantile() gives the exact value.) Outliers are discussed later.
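Because every quartile is 0, the box collapses onto zero and the plot says little about the non-zero values. One possible way to look at such a variable, sketched here on made-up numbers, is to report the share of zeros separately and view the rest on a log scale. The names toy_gain, zero_share and log_gain are hypothetical, and this step is not part of the original analysis.

```r
# Toy capital gains (made up): mostly zeros with a few large values.
toy_gain <- c(0, 0, 0, 0, 0, 0, 0, 2174, 0, 99999)

# Fraction of observations that are exactly zero.
zero_share <- mean(toy_gain == 0)

# Summary of the non-zero values only.
nonzero_summary <- summary(toy_gain[toy_gain > 0])

# log(1 + x) compresses large values and keeps zeros at 0.
log_gain <- log1p(toy_gain)

print(zero_share)
print(nonzero_summary)
```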
Histogram of capital.gain
hist(census_income_data$capital.gain)
Correlation between capital.gain and Income_band
library(ltm)
biserial.cor(census_income_data$capital.gain,census_income_data$Income_band)
## [1] -0.2233254
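The correlation is -0.2233254. Because biserial.cor() uses the first level of Income_band (0) as its reference by default, the negative sign means capital.gain is lower on average in band 0; capital.gain is therefore positively associated with the higher income band (1).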
Variable_12=“capital.loss”
capital.loss is a numerical variable.
head(census_income_data$capital.loss)
## [1] 0 0 0 0 0 0
univariate analysis of capital.loss
Central tendencies of capital.loss
Mean of capital.loss
mean(census_income_data$capital.loss)
## [1] 87.30383
Median of capital.loss
median(census_income_data$capital.loss)
## [1] 0
Measures of Dispersion of capital.loss
Variance of capital.loss
var(census_income_data$capital.loss)
## [1] 162376.9
Standard deviation of capital.loss
sd(census_income_data$capital.loss)
## [1] 402.9602
summary gives the minimum, quartiles, median, mean, and maximum of capital.loss
summary(census_income_data$capital.loss)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 87.3 0.0 4356.0
Boxplot of capital.loss
quantile(census_income_data$capital.loss)
## 0% 25% 50% 75% 100%
## 0 0 0 0 4356
quantile(census_income_data$capital.loss,c(0.75,0.80,0.90,1))
## 75% 80% 90% 100%
## 0 0 0 4356
boxplot(census_income_data$capital.loss,main="capital.loss")
Output description
In this boxplot the minimum is 0, the maximum is 4356, and the median is 0. The first and third quartiles are both 0. Outliers are discussed later.
Histogram of capital.loss
hist(census_income_data$capital.loss)
Correlation between capital.loss and Income_band
library(ltm)
biserial.cor(census_income_data$capital.loss,census_income_data$Income_band)
## [1] -0.150524
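The correlation is -0.150524. Because biserial.cor() uses the first level of Income_band (0) as its reference by default, the negative sign means capital.loss is lower on average in band 0; capital.loss is therefore positively associated with the higher income band (1).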
variable_13= “hours.per.week”
hours.per.week is a numerical variable.
head(census_income_data$hours.per.week)
## [1] 40 13 40 40 40 40
Univariate analysis of hours.per.week
Central tendencies of hours.per.week
Mean of hours.per.week
mean(census_income_data$hours.per.week)
## [1] 40.43746
Median of hours.per.week
median(census_income_data$hours.per.week)
## [1] 40
Measures of Dispersion of hours.per.week
Variance of hours.per.week
var(census_income_data$hours.per.week)
## [1] 152.459
Standard deviation of hours.per.week
sd(census_income_data$hours.per.week)
## [1] 12.34743
Summary gives the minimum, quartiles, mean, and maximum of hours.per.week
summary(census_income_data$hours.per.week)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 40.00 40.00 40.44 45.00 99.00
Boxplot of hours.per.week
quantile(census_income_data$hours.per.week)
## 0% 25% 50% 75% 100%
## 1 40 40 45 99
quantile(census_income_data$hours.per.week,c(0.75,0.80,0.90,1))
## 75% 80% 90% 100%
## 45 48 55 99
boxplot(census_income_data$hours.per.week,main="hours.per.week")
Output description
In this boxplot the minimum is 1, the maximum is 99, and the median is 40. The first quartile is 40 and the third quartile is 45. Outliers are discussed later.
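As a worked example, the whisker limits can be computed from the reported quartiles, assuming the usual 1.5 * IQR rule that boxplot() applies by default (range = 1.5). Note that boxplot() works from hinges, which can differ slightly from quantile() results, so this is an approximation rather than the exact limits drawn in the plot.

```r
# Quartiles of hours.per.week as reported by quantile() above.
q1 <- 40
q3 <- 45

iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

print(c(lower = lower_fence, upper = upper_fence))
```

Under this rule, values below 32.5 or above 52.5 hours per week are drawn as individual outlier points.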
Histogram of hours.per.week
hist(census_income_data$hours.per.week)
Correlation between hours.per.week and Income_band
library(ltm)
biserial.cor(census_income_data$hours.per.week,census_income_data$Income_band)
## [1] -0.2296855
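The correlation is -0.2296855. By default, biserial.cor() uses the first level of Income_band (0) as the reference (level = 1), so the negative sign means that hours.per.week is lower on average in band 0; longer working hours are therefore associated with band 1.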
Variable_14=“native.country”
“native.country” is a categorical variable. The categories are:
United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
Frequency table of native.country
tab<-table(census_income_data$native.country)
tab
##
## ? Cambodia
## 583 19
## Canada China
## 121 75
## Columbia Cuba
## 59 95
## Dominican-Republic Ecuador
## 70 28
## El-Salvador England
## 106 90
## France Germany
## 29 137
## Greece Guatemala
## 29 64
## Haiti Holand-Netherlands
## 44 1
## Honduras Hong
## 13 20
## Hungary India
## 13 100
## Iran Ireland
## 43 24
## Italy Jamaica
## 73 81
## Japan Laos
## 62 18
## Mexico Nicaragua
## 643 34
## Outlying-US(Guam-USVI-etc) Peru
## 14 31
## Philippines Poland
## 198 60
## Portugal Puerto-Rico
## 37 114
## Scotland South
## 12 80
## Taiwan Thailand
## 51 18
## Trinadad&Tobago United-States
## 19 29170
## Vietnam Yugoslavia
## 67 16
names(tab)
## [1] " ?" " Cambodia"
## [3] " Canada" " China"
## [5] " Columbia" " Cuba"
## [7] " Dominican-Republic" " Ecuador"
## [9] " El-Salvador" " England"
## [11] " France" " Germany"
## [13] " Greece" " Guatemala"
## [15] " Haiti" " Holand-Netherlands"
## [17] " Honduras" " Hong"
## [19] " Hungary" " India"
## [21] " Iran" " Ireland"
## [23] " Italy" " Jamaica"
## [25] " Japan" " Laos"
## [27] " Mexico" " Nicaragua"
## [29] " Outlying-US(Guam-USVI-etc)" " Peru"
## [31] " Philippines" " Poland"
## [33] " Portugal" " Puerto-Rico"
## [35] " Scotland" " South"
## [37] " Taiwan" " Thailand"
## [39] " Trinadad&Tobago" " United-States"
## [41] " Vietnam" " Yugoslavia"
sum(is.na(census_income_data$native.country))
## [1] 0
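The count of NA values is 0 even though the frequency table lists 583 rows under " ?", because the missing entries are stored as a text label rather than as NA. A minimal sketch of one way to recode them, using a small made-up factor (`country` is a hypothetical stand-in for the census column):

```r
# Toy factor (made up) with the same " ?" placeholder and leading spaces
# that appear in the level names above.
country <- factor(c(" United-States", " ?", " Mexico", " ?", " India"))
print(sum(is.na(country)))          # the placeholder is not counted as NA

# Strip the leading space, then turn the placeholder into a real NA.
clean <- trimws(as.character(country))
clean[clean == "?"] <- NA
country_clean <- factor(clean)

print(sum(is.na(country_clean)))
print(levels(country_clean))
```

Whether to drop, impute, or keep these rows as their own category is a separate modelling decision that this sketch does not make.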
Mode of native.country
temp <- table(as.vector(census_income_data$native.country))
names(temp)[temp==max(temp)]
## [1] " United-States"
The mode of native.country is “United-States”.
ggplot of native.country
library(ggplot2)
qplot(census_income_data$native.country,main="native.country",ylab="count",colour= I("purple"),size=I(4))
library(gmodels)
CrossTable(census_income_data$native.country,census_income_data$Income_band,digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)
## Warning in chisq.test(t, correct = FALSE, ...): Chi-squared approximation
## may be incorrect
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 32561
##
##
## | census_income_data$Income_band
## census_income_data$native.country | 0 | 1 | Row Total |
## ----------------------------------|-----------|-----------|-----------|
## ? | 437 | 146 | 583 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## Cambodia | 12 | 7 | 19 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## Canada | 82 | 39 | 121 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## China | 55 | 20 | 75 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## Columbia | 57 | 2 | 59 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## Cuba | 70 | 25 | 95 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## Dominican-Republic | 68 | 2 | 70 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## Ecuador | 24 | 4 | 28 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## El-Salvador | 97 | 9 | 106 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## England | 60 | 30 | 90 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## France | 17 | 12 | 29 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## Germany | 93 | 44 | 137 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## Greece | 21 | 8 | 29 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## Guatemala | 61 | 3 | 64 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## Haiti | 40 | 4 | 44 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## Holand-Netherlands | 1 | 0 | 1 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## Honduras | 12 | 1 | 13 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## Hong | 14 | 6 | 20 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## Hungary | 10 | 3 | 13 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## India | 60 | 40 | 100 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## Iran | 25 | 18 | 43 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## Ireland | 19 | 5 | 24 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## Italy | 48 | 25 | 73 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## Jamaica | 71 | 10 | 81 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## Japan | 38 | 24 | 62 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## Laos | 16 | 2 | 18 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## Mexico | 610 | 33 | 643 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## Nicaragua | 32 | 2 | 34 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## Outlying-US(Guam-USVI-etc) | 14 | 0 | 14 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## Peru | 29 | 2 | 31 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## Philippines | 137 | 61 | 198 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## Poland | 48 | 12 | 60 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## Portugal | 33 | 4 | 37 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## Puerto-Rico | 102 | 12 | 114 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## Scotland | 9 | 3 | 12 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## South | 64 | 16 | 80 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## Taiwan | 31 | 20 | 51 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## Thailand | 15 | 3 | 18 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## Trinadad&Tobago | 17 | 2 | 19 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## United-States | 21999 | 7171 | 29170 |
## | 0.9 | 0.9 | |
## ----------------------------------|-----------|-----------|-----------|
## Vietnam | 62 | 5 | 67 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## Yugoslavia | 10 | 6 | 16 |
## | 0.0 | 0.0 | |
## ----------------------------------|-----------|-----------|-----------|
## Column Total | 24720 | 7841 | 32561 |
## | 0.8 | 0.2 | |
## ----------------------------------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 317.2304 d.f. = 41 p = 2.211386e-44
##
##
##
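The warning "Chi-squared approximation may be incorrect" appears because several countries have very small expected counts. One common response is a Monte Carlo p-value. The sketch below applies it to only four rows copied from the table above, so it illustrates the technique and does not reproduce the full test.

```r
# Counts for four countries, copied from the cross table above
# (rows: country, columns: Income_band 0 and 1).
tab <- matrix(c(610, 33,
                  1,  0,
                 14,  0,
                 60, 40),
              ncol = 2, byrow = TRUE,
              dimnames = list(country = c("Mexico", "Holand-Netherlands",
                                          "Outlying-US(Guam-USVI-etc)", "India"),
                              band = c("0", "1")))

# Expected counts under independence; cells below 5 trigger the warning.
expected <- suppressWarnings(chisq.test(tab)$expected)
print(round(expected, 1))

# Monte Carlo p-value, which does not rely on the large-sample approximation.
set.seed(1)
sim <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)
print(sim$p.value)
```

Another option, not shown here, is to merge rare countries into a single "Other" level before testing.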
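Variable_15=“Income_band”
It is a categorical variable and it is the response (target) variable of the model. The categories are -50000 (income up to $50K, recoded as 0) and 50000+ (income above $50K, recoded as 1).
Frequency table of Income_band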
tab<-table(census_income_data$Income_band)
tab
##
## 0 1
## 24720 7841
names(tab)
## [1] "0" "1"
sum(is.na(census_income_data$Income_band))
## [1] 0
Model Building
Naive Bayes Model
library(e1071)
## Warning: package 'e1071' was built under R version 3.3.3
library(class)
Model<- naiveBayes(census_income_data$Income_band~.,data=census_income_data )
Model
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## 0 1
## 0.7591904 0.2408096
##
## Conditional probabilities:
## age
## Y [,1] [,2]
## 0 36.78374 14.02009
## 1 44.24984 10.51903
##
## workclass
## Y ? Federal-gov Local-gov Never-worked Private
## 0 0.0665453074 0.0238268608 0.0597087379 0.0002831715 0.7173543689
## 1 0.0243591379 0.0473153934 0.0786889427 0.0000000000 0.6329549802
## workclass
## Y Self-emp-inc Self-emp-not-inc State-gov Without-pay
## 0 0.0199838188 0.0735032362 0.0382281553 0.0005663430
## 1 0.0793266165 0.0923351613 0.0450197679 0.0000000000
##
## fnlwgt
## Y [,1] [,2]
## 0 190340.9 106482.3
## 1 188005.0 102541.8
##
## education
## Y 10th 11th 12th 1st-4th 5th-6th
## 0 0.0352346278 0.0451051780 0.0161812298 0.0065533981 0.0128236246
## 1 0.0079071547 0.0076520852 0.0042086469 0.0007652085 0.0020405561
## education
## Y 7th-8th 9th Assoc-acdm Assoc-voc Bachelors
## 0 0.0245145631 0.0197006472 0.0324433657 0.0413025890 0.1267799353
## 1 0.0051013901 0.0034434383 0.0337967096 0.0460400459 0.2832546869
## education
## Y Doctorate HS-grad Masters Preschool Prof-school
## 0 0.0043284790 0.3570388350 0.0309061489 0.0020631068 0.0061893204
## 1 0.0390256345 0.2136207116 0.1223058283 0.0000000000 0.0539472006
## education
## Y Some-college
## 0 0.2388349515
## 1 0.1768907027
##
## education.num
## Y [,1] [,2]
## 0 9.595065 2.436147
## 1 11.611657 2.385129
##
## marital.status
## Y Divorced Married-AF-spouse Married-civ-spouse
## 0 0.161003236 0.000525890 0.335113269
## 1 0.059048591 0.001275348 0.853462569
## marital.status
## Y Married-spouse-absent Never-married Separated Widowed
## 0 0.015533981 0.412297735 0.038794498 0.036731392
## 1 0.004336182 0.062619564 0.008417294 0.010840454
##
## occupation
## Y ? Adm-clerical Armed-Forces Craft-repair
## 0 0.0668284790 0.1319983819 0.0003236246 0.1282362460
## 1 0.0243591379 0.0646601199 0.0001275348 0.1184797857
## occupation
## Y Exec-managerial Farming-fishing Handlers-cleaners
## 0 0.0848705502 0.0355582524 0.0519417476
## 1 0.2509883943 0.0146664966 0.0109679888
## occupation
## Y Machine-op-inspct Other-service Priv-house-serv Prof-specialty
## 0 0.0708737864 0.1277508091 0.0059870550 0.0922734628
## 1 0.0318836883 0.0174722612 0.0001275348 0.2370871062
## occupation
## Y Protective-serv Sales Tech-support Transport-moving
## 0 0.0177184466 0.1078883495 0.0260922330 0.0516585761
## 1 0.0269098329 0.1253666624 0.0360923352 0.0408111210
##
## relationship
## Y Husband Not-in-family Other-relative Own-child Unmarried
## 0 0.294296117 0.301334951 0.038187702 0.202305825 0.130582524
## 1 0.754750670 0.109169749 0.004718786 0.008544828 0.027802576
## relationship
## Y Wife
## 0 0.033292880
## 1 0.095013391
##
## race
## Y Amer-Indian-Eskimo Asian-Pac-Islander Black Other
## 0 0.011124595 0.030865696 0.110720065 0.009951456
## 1 0.004591251 0.035199592 0.049355949 0.003188369
## race
## Y White
## 0 0.837338188
## 1 0.907664839
##
## sex
## Y Female Male
## 0 0.3880259 0.6119741
## 1 0.1503635 0.8496365
##
## capital.gain
## Y [,1] [,2]
## 0 148.7525 963.1393
## 1 4006.1425 14570.3790
##
## capital.loss
## Y [,1] [,2]
## 0 53.14292 310.7558
## 1 195.00153 595.4876
##
## hours.per.week
## Y [,1] [,2]
## 0 38.84021 12.31899
## 1 45.47303 11.01297
##
## native.country
## Y ? Cambodia Canada China Columbia
## 0 1.767799e-02 4.854369e-04 3.317152e-03 2.224919e-03 2.305825e-03
## 1 1.862007e-02 8.927433e-04 4.973855e-03 2.550695e-03 2.550695e-04
## native.country
## Y Cuba Dominican-Republic Ecuador El-Salvador
## 0 2.831715e-03 2.750809e-03 9.708738e-04 3.923948e-03
## 1 3.188369e-03 2.550695e-04 5.101390e-04 1.147813e-03
## native.country
## Y England France Germany Greece Guatemala
## 0 2.427184e-03 6.877023e-04 3.762136e-03 8.495146e-04 2.467638e-03
## 1 3.826043e-03 1.530417e-03 5.611529e-03 1.020278e-03 3.826043e-04
## native.country
## Y Haiti Holand-Netherlands Honduras Hong
## 0 1.618123e-03 4.045307e-05 4.854369e-04 5.663430e-04
## 1 5.101390e-04 0.000000e+00 1.275348e-04 7.652085e-04
## native.country
## Y Hungary India Iran Ireland Italy
## 0 4.045307e-04 2.427184e-03 1.011327e-03 7.686084e-04 1.941748e-03
## 1 3.826043e-04 5.101390e-03 2.295626e-03 6.376738e-04 3.188369e-03
## native.country
## Y Jamaica Japan Laos Mexico Nicaragua
## 0 2.872168e-03 1.537217e-03 6.472492e-04 2.467638e-02 1.294498e-03
## 1 1.275348e-03 3.060834e-03 2.550695e-04 4.208647e-03 2.550695e-04
## native.country
## Y Outlying-US(Guam-USVI-etc) Peru Philippines Poland
## 0 5.663430e-04 1.173139e-03 5.542071e-03 1.941748e-03
## 1 0.000000e+00 2.550695e-04 7.779620e-03 1.530417e-03
## native.country
## Y Portugal Puerto-Rico Scotland South Taiwan
## 0 1.334951e-03 4.126214e-03 3.640777e-04 2.588997e-03 1.254045e-03
## 1 5.101390e-04 1.530417e-03 3.826043e-04 2.040556e-03 2.550695e-03
## native.country
## Y Thailand Trinadad&Tobago United-States Vietnam Yugoslavia
## 0 6.067961e-04 6.877023e-04 8.899272e-01 2.508091e-03 4.045307e-04
## 1 3.826043e-04 2.550695e-04 9.145517e-01 6.376738e-04 7.652085e-04
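For numeric predictors, the tables printed by naiveBayes() hold the per-class mean in column [,1] and the standard deviation in column [,2]. The sketch below shows how a single numeric predictor would shift the class probabilities under the Gaussian assumption, using the hours.per.week values from the output above. The 60-hour person is hypothetical, and a real prediction would multiply in the terms for every other predictor as well.

```r
# Values copied from the naiveBayes() output above.
prior <- c("0" = 0.7591904, "1" = 0.2408096)   # a-priori probabilities
mu    <- c("0" = 38.84021,  "1" = 45.47303)    # hours.per.week, column [,1]
sigma <- c("0" = 12.31899,  "1" = 11.01297)    # hours.per.week, column [,2]

hours <- 60  # hypothetical person working 60 hours per week

# Class-conditional Gaussian density times the prior, then normalise.
joint <- prior * dnorm(hours, mean = mu, sd = sigma)
posterior <- joint / sum(joint)
print(round(posterior, 3))
```

Because 60 hours lies closer to the band 1 mean, the probability of band 1 rises above its prior of about 0.24.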
Logistic Regression Model
Logistic regression is a classification algorithm used to predict a binary outcome (1/0, Yes/No, True/False) from a set of independent variables. It can be seen as an extension of linear regression to a categorical outcome, in which the log of the odds is modeled as a linear function of the predictors. In simple words, it predicts the probability of occurrence of an event by fitting the data to a logit function.
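The link between log-odds and probability can be sketched in a few lines of base R. The function name `inv_logit` and the input values are illustrative assumptions, not part of the original analysis.

```r
# Inverse logit: maps a linear predictor (log-odds) to a probability.
inv_logit <- function(eta) 1 / (1 + exp(-eta))

log_odds <- c(-2, 0, 2)          # hypothetical linear-predictor values
prob <- inv_logit(log_odds)
print(round(prob, 3))

# Base R provides the same pair of functions as plogis() and qlogis().
stopifnot(isTRUE(all.equal(prob, plogis(log_odds))))
stopifnot(isTRUE(all.equal(qlogis(prob), log_odds)))
```

A log-odds of 0 corresponds to a probability of 0.5, which is why 0.5 is a natural default classification threshold.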
LogisticModel<- glm(census_income_data$Income_band~.,family=binomial,data=census_income_data)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(LogisticModel)
##
## Call:
## glm(formula = census_income_data$Income_band ~ ., family = binomial,
## data = census_income_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.0885 -0.5044 -0.1822 -0.0251 3.7656
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error z value
## (Intercept) -9.074e+00 4.405e-01 -20.601
## age 2.552e-02 1.651e-03 15.460
## workclass Federal-gov 1.097e+00 1.538e-01 7.131
## workclass Local-gov 4.118e-01 1.403e-01 2.934
## workclass Never-worked -1.045e+01 2.722e+02 -0.038
## workclass Private 5.944e-01 1.252e-01 4.746
## workclass Self-emp-inc 7.694e-01 1.497e-01 5.140
## workclass Self-emp-not-inc 1.037e-01 1.371e-01 0.756
## workclass State-gov 2.835e-01 1.518e-01 1.868
## workclass Without-pay -1.221e+01 1.985e+02 -0.062
## fnlwgt 7.072e-07 1.720e-07 4.111
## education 11th 8.500e-02 2.107e-01 0.403
## education 12th 4.891e-01 2.644e-01 1.850
## education 1st-4th -5.322e-01 4.895e-01 -1.087
## education 5th-6th -2.386e-01 3.248e-01 -0.735
## education 7th-8th -4.755e-01 2.320e-01 -2.050
## education 9th -1.939e-01 2.612e-01 -0.743
## education Assoc-acdm 1.336e+00 1.763e-01 7.574
## education Assoc-voc 1.352e+00 1.694e-01 7.981
## education Bachelors 1.936e+00 1.575e-01 12.296
## education Doctorate 2.989e+00 2.142e-01 13.954
## education HS-grad 8.134e-01 1.534e-01 5.302
## education Masters 2.289e+00 1.679e-01 13.631
## education Preschool -2.109e+01 3.665e+02 -0.058
## education Prof-school 2.793e+00 2.002e-01 13.955
## education Some-college 1.159e+00 1.556e-01 7.447
## education.num NA NA NA
## marital.status Married-AF-spouse 2.686e+00 5.538e-01 4.849
## marital.status Married-civ-spouse 2.206e+00 2.654e-01 8.312
## marital.status Married-spouse-absent -1.097e-02 2.298e-01 -0.048
## marital.status Never-married -4.825e-01 8.751e-02 -5.513
## marital.status Separated -1.334e-01 1.641e-01 -0.813
## marital.status Widowed 1.284e-01 1.538e-01 0.835
## occupation Adm-clerical 1.095e-01 9.919e-02 1.104
## occupation Armed-Forces -1.061e+00 1.543e+00 -0.688
## occupation Craft-repair 1.816e-01 8.487e-02 2.140
## occupation Exec-managerial 8.965e-01 8.724e-02 10.276
## occupation Farming-fishing -8.826e-01 1.420e-01 -6.214
## occupation Handlers-cleaners -5.698e-01 1.458e-01 -3.907
## occupation Machine-op-inspct -1.724e-01 1.062e-01 -1.624
## occupation Other-service -7.152e-01 1.245e-01 -5.746
## occupation Priv-house-serv -4.018e+00 1.664e+00 -2.415
## occupation Prof-specialty 6.251e-01 9.365e-02 6.675
## occupation Protective-serv 6.864e-01 1.304e-01 5.265
## occupation Sales 3.909e-01 9.015e-02 4.336
## occupation Tech-support 7.657e-01 1.194e-01 6.415
## occupation Transport-moving NA NA NA
## relationship Not-in-family 5.695e-01 2.627e-01 2.168
## relationship Other-relative -3.729e-01 2.427e-01 -1.536
## relationship Own-child -6.601e-01 2.600e-01 -2.539
## relationship Unmarried 4.411e-01 2.786e-01 1.583
## relationship Wife 1.363e+00 1.026e-01 13.282
## race Asian-Pac-Islander 6.650e-01 2.697e-01 2.465
## race Black 3.940e-01 2.332e-01 1.690
## race Other 1.736e-01 3.537e-01 0.491
## race White 5.728e-01 2.217e-01 2.584
## sex Male 8.618e-01 7.918e-02 10.883
## capital.gain 3.193e-04 1.031e-05 30.968
## capital.loss 6.474e-04 3.714e-05 17.431
## hours.per.week 2.970e-02 1.622e-03 18.316
## native.country Cambodia 1.482e+00 6.336e-01 2.338
## native.country Canada 5.170e-01 2.952e-01 1.751
## native.country China -5.080e-01 3.943e-01 -1.288
## native.country Columbia -1.930e+00 8.242e-01 -2.342
## native.country Cuba 5.339e-01 3.373e-01 1.583
## native.country Dominican-Republic -1.643e+00 1.049e+00 -1.566
## native.country Ecuador -9.442e-02 7.292e-01 -0.129
## native.country El-Salvador -4.230e-01 4.952e-01 -0.854
## native.country England 4.954e-01 3.335e-01 1.486
## native.country France 7.730e-01 5.289e-01 1.462
## native.country Germany 6.197e-01 2.843e-01 2.179
## native.country Greece -7.982e-01 5.657e-01 -1.411
## native.country Guatemala -6.358e-02 7.625e-01 -0.083
## native.country Haiti 1.359e-01 6.850e-01 0.198
## native.country Holand-Netherlands -1.024e+01 8.827e+02 -0.012
## native.country Honduras -1.086e+00 2.356e+00 -0.461
## native.country Hong 8.706e-02 6.810e-01 0.128
## native.country Hungary 7.262e-02 7.759e-01 0.094
## native.country India -1.895e-01 3.284e-01 -0.577
## native.country Iran 2.341e-01 4.508e-01 0.519
## native.country Ireland 7.198e-01 6.448e-01 1.116
## native.country Italy 9.944e-01 3.447e-01 2.885
## native.country Jamaica 2.285e-01 4.631e-01 0.493
## native.country Japan 5.794e-01 4.214e-01 1.375
## native.country Laos -4.209e-01 8.630e-01 -0.488
## native.country Mexico -3.643e-01 2.551e-01 -1.428
## native.country Nicaragua -6.151e-01 8.040e-01 -0.765
## native.country Outlying-US(Guam-USVI-etc) -1.208e+01 2.098e+02 -0.058
## native.country Peru -6.498e-01 8.559e-01 -0.759
## native.country Philippines 6.104e-01 2.810e-01 2.173
## native.country Poland 1.820e-01 4.216e-01 0.432
## native.country Portugal 1.542e-01 6.332e-01 0.243
## native.country Puerto-Rico -1.483e-01 4.041e-01 -0.367
## native.country Scotland 1.905e-01 7.892e-01 0.241
## native.country South -8.819e-01 4.414e-01 -1.998
## native.country Taiwan 2.248e-01 4.724e-01 0.476
## native.country Thailand -3.784e-01 8.356e-01 -0.453
## native.country Trinadad&Tobago -1.977e-01 8.709e-01 -0.227
## native.country United-States 3.815e-01 1.380e-01 2.764
## native.country Vietnam -9.593e-01 6.150e-01 -1.560
## native.country Yugoslavia 8.720e-01 6.824e-01 1.278
## Pr(>|z|)
## (Intercept) < 2e-16 ***
## age < 2e-16 ***
## workclass Federal-gov 9.99e-13 ***
## workclass Local-gov 0.00334 **
## workclass Never-worked 0.96936
## workclass Private 2.08e-06 ***
## workclass Self-emp-inc 2.74e-07 ***
## workclass Self-emp-not-inc 0.44954
## workclass State-gov 0.06173 .
## workclass Without-pay 0.95095
## fnlwgt 3.93e-05 ***
## education 11th 0.68670
## education 12th 0.06435 .
## education 1st-4th 0.27696
## education 5th-6th 0.46255
## education 7th-8th 0.04039 *
## education 9th 0.45771
## education Assoc-acdm 3.63e-14 ***
## education Assoc-voc 1.45e-15 ***
## education Bachelors < 2e-16 ***
## education Doctorate < 2e-16 ***
## education HS-grad 1.15e-07 ***
## education Masters < 2e-16 ***
## education Preschool 0.95410
## education Prof-school < 2e-16 ***
## education Some-college 9.52e-14 ***
## education.num NA
## marital.status Married-AF-spouse 1.24e-06 ***
## marital.status Married-civ-spouse < 2e-16 ***
## marital.status Married-spouse-absent 0.96192
## marital.status Never-married 3.52e-08 ***
## marital.status Separated 0.41647
## marital.status Widowed 0.40350
## occupation Adm-clerical 0.26955
## occupation Armed-Forces 0.49174
## occupation Craft-repair 0.03239 *
## occupation Exec-managerial < 2e-16 ***
## occupation Farming-fishing 5.16e-10 ***
## occupation Handlers-cleaners 9.33e-05 ***
## occupation Machine-op-inspct 0.10429
## occupation Other-service 9.12e-09 ***
## occupation Priv-house-serv 0.01572 *
## occupation Prof-specialty 2.46e-11 ***
## occupation Protective-serv 1.40e-07 ***
## occupation Sales 1.45e-05 ***
## occupation Tech-support 1.41e-10 ***
## occupation Transport-moving NA
## relationship Not-in-family 0.03015 *
## relationship Other-relative 0.12442
## relationship Own-child 0.01111 *
## relationship Unmarried 0.11338
## relationship Wife < 2e-16 ***
## race Asian-Pac-Islander 0.01369 *
## race Black 0.09106 .
## race Other 0.62365
## race White 0.00978 **
## sex Male < 2e-16 ***
## capital.gain < 2e-16 ***
## capital.loss < 2e-16 ***
## hours.per.week < 2e-16 ***
## native.country Cambodia 0.01936 *
## native.country Canada 0.07989 .
## native.country China 0.19766
## native.country Columbia 0.01919 *
## native.country Cuba 0.11349
## native.country Dominican-Republic 0.11735
## native.country Ecuador 0.89697
## native.country El-Salvador 0.39301
## native.country England 0.13735
## native.country France 0.14385
## native.country Germany 0.02931 *
## native.country Greece 0.15824
## native.country Guatemala 0.93354
## native.country Haiti 0.84275
## native.country Holand-Netherlands 0.99074
## native.country Honduras 0.64493
## native.country Hong 0.89827
## native.country Hungary 0.92543
## native.country India 0.56390
## native.country Iran 0.60364
## native.country Ireland 0.26424
## native.country Italy 0.00392 **
## native.country Jamaica 0.62170
## native.country Japan 0.16914
## native.country Laos 0.62575
## native.country Mexico 0.15325
## native.country Nicaragua 0.44424
## native.country Outlying-US(Guam-USVI-etc) 0.95407
## native.country Peru 0.44772
## native.country Philippines 0.02981 *
## native.country Poland 0.66608
## native.country Portugal 0.80763
## native.country Puerto-Rico 0.71362
## native.country Scotland 0.80929
## native.country South 0.04573 *
## native.country Taiwan 0.63409
## native.country Thailand 0.65062
## native.country Trinadad&Tobago 0.82041
## native.country United-States 0.00570 **
## native.country Vietnam 0.11884
## native.country Yugoslavia 0.20131
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 35948 on 32560 degrees of freedom
## Residual deviance: 20565 on 32462 degrees of freedom
## AIC: 20763
##
## Number of Fisher Scoring iterations: 13
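The coefficients above are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The following sketch does this for three coefficients copied from the summary, with approximate 95% Wald intervals. It assumes the usual normal approximation and is not a substitute for confint() on the fitted model.

```r
# Estimates and standard errors copied from summary(LogisticModel) above.
est <- c("sex Male" = 0.8618, "hours.per.week" = 0.0297, "age" = 0.02552)
se  <- c("sex Male" = 0.07918, "hours.per.week" = 0.001622, "age" = 0.001651)

odds_ratio <- exp(est)
# Approximate 95% Wald interval on the odds-ratio scale.
ci_low  <- exp(est - 1.96 * se)
ci_high <- exp(est + 1.96 * se)
print(round(cbind(odds_ratio, ci_low, ci_high), 3))
```

For example, the odds ratio for sex Male is about 2.37, meaning the odds of band 1 are a little more than twice as high for men, holding the other predictors fixed. Coefficients with very large standard errors, such as education Preschool, come from levels with no band 1 cases and should not be interpreted this way.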
Classification Table
library(caret)
## Warning: package 'caret' was built under R version 3.3.3
## Loading required package: lattice
library(rattle)
## Warning: package 'rattle' was built under R version 3.3.3
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
threshold=0.5
predicted_values<-ifelse(predict(LogisticModel,type="response")>threshold,1,0)
actual_values<-LogisticModel$y
conf_matrix<-table(predicted_values,actual_values)
conf_matrix
## actual_values
## predicted_values 0 1
## 0 23037 3093
## 1 1683 4748
sensitivity(conf_matrix)
## [1] 0.9319175
specificity(conf_matrix)
## [1] 0.605535
Logistic regression Accuracy
accuracy1<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy1
## [1] 0.8533215
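When given a table, caret's sensitivity() treats the first level ("0") as the positive class by default, so the 0.93 reported above is the rate at which band 0 is correctly identified. If band 1 (income above $50K) is the class of interest, the roles swap. The sketch below recomputes the metrics by hand from the reported confusion matrix; the variable names are hypothetical. These are in-sample figures, since the model was evaluated on the data it was fitted to.

```r
# Confusion matrix at threshold 0.5, copied from the output above
# (rows: predicted, columns: actual).
conf <- matrix(c(23037, 3093,
                  1683, 4748),
               nrow = 2, byrow = TRUE,
               dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

accuracy    <- sum(diag(conf)) / sum(conf)
recall_1    <- conf["1", "1"] / sum(conf[, "1"])  # share of actual band 1 found
recall_0    <- conf["0", "0"] / sum(conf[, "0"])  # share of actual band 0 found
precision_1 <- conf["1", "1"] / sum(conf["1", ])  # share of predicted band 1 that is right

print(round(c(accuracy = accuracy, recall_1 = recall_1,
              recall_0 = recall_0, precision_1 = precision_1), 4))
```

With band 1 as the positive class, sensitivity is about 0.61 and specificity about 0.93. In caret the same result can be requested with the `positive` argument of sensitivity().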
Changing Threshold value
threshold=0.8
predicted_values<-ifelse(predict(LogisticModel,type="response")>threshold,1,0)
actual_values<-LogisticModel$y
conf_matrix<-table(predicted_values,actual_values)
conf_matrix
## actual_values
## predicted_values 0 1
## 0 24495 5663
## 1 225 2178
sensitivity(conf_matrix)
## [1] 0.9908981
specificity(conf_matrix)
## [1] 0.2777707
accuracy2<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy2
## [1] 0.8191702
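Raising the threshold from 0.5 to 0.8 lowers accuracy from 0.853 to 0.819, and the share of actual band 1 cases that are detected falls from about 0.61 to about 0.28. Rather than trying thresholds one at a time, they can be compared in a single table. The sketch below does this on simulated data (not the census data), with a hypothetical helper `metrics_at`.

```r
# Simulated data (made up): one predictor and a 0/1 outcome.
set.seed(42)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 1.5 * x))
fit <- glm(y ~ x, family = binomial)
p_hat <- predict(fit, type = "response")

# Accuracy and per-class recall at a given threshold.
metrics_at <- function(threshold) {
  pred <- as.integer(p_hat > threshold)
  c(threshold = threshold,
    accuracy  = mean(pred == y),
    recall_1  = sum(pred == 1 & y == 1) / sum(y == 1),
    recall_0  = sum(pred == 0 & y == 0) / sum(y == 0))
}

results <- t(sapply(c(0.3, 0.5, 0.8), metrics_at))
print(round(results, 3))
```

A higher threshold can only shrink the set of predicted positives, so recall for class 1 never increases and recall for class 0 never decreases. The choice of threshold therefore depends on which kind of error is more costly.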
Multicollinearity
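The model summary reports "2 not defined because of singularities", and education.num is one of the NA coefficients. This is what happens when a numeric code is fully determined by a factor already in the model. The sketch below reproduces the effect on simulated data; the three education levels and their codes are taken from the first rows of the data set, while the outcome is made up.

```r
# Simulated data: education.num is a one-to-one recoding of education.
set.seed(1)
codes <- c("HS-grad" = 9, "Bachelors" = 13, "Masters" = 14)
education <- factor(sample(names(codes), 200, replace = TRUE))
education_num <- unname(codes[as.character(education)])
y <- rbinom(200, 1, plogis(-3 + 0.25 * education_num))

# The dummy variables for education already span education_num,
# so its coefficient cannot be estimated and is reported as NA.
fit <- glm(y ~ education + education_num, family = binomial)
print(coef(fit))
```

Dropping one of the two redundant variables before fitting removes the NA and leaves the fitted probabilities unchanged.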
library(car)
## Warning: package 'car' was built under R version 3.3.3
summary(LogisticModel)
##
## Call:
## glm(formula = census_income_data$Income_band ~ ., family = binomial,
## data = census_income_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.0885 -0.5044 -0.1822 -0.0251 3.7656
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error z value
## (Intercept) -9.074e+00 4.405e-01 -20.601
## age 2.552e-02 1.651e-03 15.460
## workclass Federal-gov 1.097e+00 1.538e-01 7.131
## workclass Local-gov 4.118e-01 1.403e-01 2.934
## workclass Never-worked -1.045e+01 2.722e+02 -0.038
## workclass Private 5.944e-01 1.252e-01 4.746
## workclass Self-emp-inc 7.694e-01 1.497e-01 5.140
## workclass Self-emp-not-inc 1.037e-01 1.371e-01 0.756
## workclass State-gov 2.835e-01 1.518e-01 1.868
## workclass Without-pay -1.221e+01 1.985e+02 -0.062
## fnlwgt 7.072e-07 1.720e-07 4.111
## education 11th 8.500e-02 2.107e-01 0.403
## education 12th 4.891e-01 2.644e-01 1.850
## education 1st-4th -5.322e-01 4.895e-01 -1.087
## education 5th-6th -2.386e-01 3.248e-01 -0.735
## education 7th-8th -4.755e-01 2.320e-01 -2.050
## education 9th -1.939e-01 2.612e-01 -0.743
## education Assoc-acdm 1.336e+00 1.763e-01 7.574
## education Assoc-voc 1.352e+00 1.694e-01 7.981
## education Bachelors 1.936e+00 1.575e-01 12.296
## education Doctorate 2.989e+00 2.142e-01 13.954
## education HS-grad 8.134e-01 1.534e-01 5.302
## education Masters 2.289e+00 1.679e-01 13.631
## education Preschool -2.109e+01 3.665e+02 -0.058
## education Prof-school 2.793e+00 2.002e-01 13.955
## education Some-college 1.159e+00 1.556e-01 7.447
## education.num NA NA NA
## marital.status Married-AF-spouse 2.686e+00 5.538e-01 4.849
## marital.status Married-civ-spouse 2.206e+00 2.654e-01 8.312
## marital.status Married-spouse-absent -1.097e-02 2.298e-01 -0.048
## marital.status Never-married -4.825e-01 8.751e-02 -5.513
## marital.status Separated -1.334e-01 1.641e-01 -0.813
## marital.status Widowed 1.284e-01 1.538e-01 0.835
## occupation Adm-clerical 1.095e-01 9.919e-02 1.104
## occupation Armed-Forces -1.061e+00 1.543e+00 -0.688
## occupation Craft-repair 1.816e-01 8.487e-02 2.140
## occupation Exec-managerial 8.965e-01 8.724e-02 10.276
## occupation Farming-fishing -8.826e-01 1.420e-01 -6.214
## occupation Handlers-cleaners -5.698e-01 1.458e-01 -3.907
## occupation Machine-op-inspct -1.724e-01 1.062e-01 -1.624
## occupation Other-service -7.152e-01 1.245e-01 -5.746
## occupation Priv-house-serv -4.018e+00 1.664e+00 -2.415
## occupation Prof-specialty 6.251e-01 9.365e-02 6.675
## occupation Protective-serv 6.864e-01 1.304e-01 5.265
## occupation Sales 3.909e-01 9.015e-02 4.336
## occupation Tech-support 7.657e-01 1.194e-01 6.415
## occupation Transport-moving NA NA NA
## relationship Not-in-family 5.695e-01 2.627e-01 2.168
## relationship Other-relative -3.729e-01 2.427e-01 -1.536
## relationship Own-child -6.601e-01 2.600e-01 -2.539
## relationship Unmarried 4.411e-01 2.786e-01 1.583
## relationship Wife 1.363e+00 1.026e-01 13.282
## race Asian-Pac-Islander 6.650e-01 2.697e-01 2.465
## race Black 3.940e-01 2.332e-01 1.690
## race Other 1.736e-01 3.537e-01 0.491
## race White 5.728e-01 2.217e-01 2.584
## sex Male 8.618e-01 7.918e-02 10.883
## capital.gain 3.193e-04 1.031e-05 30.968
## capital.loss 6.474e-04 3.714e-05 17.431
## hours.per.week 2.970e-02 1.622e-03 18.316
## native.country Cambodia 1.482e+00 6.336e-01 2.338
## native.country Canada 5.170e-01 2.952e-01 1.751
## native.country China -5.080e-01 3.943e-01 -1.288
## native.country Columbia -1.930e+00 8.242e-01 -2.342
## native.country Cuba 5.339e-01 3.373e-01 1.583
## native.country Dominican-Republic -1.643e+00 1.049e+00 -1.566
## native.country Ecuador -9.442e-02 7.292e-01 -0.129
## native.country El-Salvador -4.230e-01 4.952e-01 -0.854
## native.country England 4.954e-01 3.335e-01 1.486
## native.country France 7.730e-01 5.289e-01 1.462
## native.country Germany 6.197e-01 2.843e-01 2.179
## native.country Greece -7.982e-01 5.657e-01 -1.411
## native.country Guatemala -6.358e-02 7.625e-01 -0.083
## native.country Haiti 1.359e-01 6.850e-01 0.198
## native.country Holand-Netherlands -1.024e+01 8.827e+02 -0.012
## native.country Honduras -1.086e+00 2.356e+00 -0.461
## native.country Hong 8.706e-02 6.810e-01 0.128
## native.country Hungary 7.262e-02 7.759e-01 0.094
## native.country India -1.895e-01 3.284e-01 -0.577
## native.country Iran 2.341e-01 4.508e-01 0.519
## native.country Ireland 7.198e-01 6.448e-01 1.116
## native.country Italy 9.944e-01 3.447e-01 2.885
## native.country Jamaica 2.285e-01 4.631e-01 0.493
## native.country Japan 5.794e-01 4.214e-01 1.375
## native.country Laos -4.209e-01 8.630e-01 -0.488
## native.country Mexico -3.643e-01 2.551e-01 -1.428
## native.country Nicaragua -6.151e-01 8.040e-01 -0.765
## native.country Outlying-US(Guam-USVI-etc) -1.208e+01 2.098e+02 -0.058
## native.country Peru -6.498e-01 8.559e-01 -0.759
## native.country Philippines 6.104e-01 2.810e-01 2.173
## native.country Poland 1.820e-01 4.216e-01 0.432
## native.country Portugal 1.542e-01 6.332e-01 0.243
## native.country Puerto-Rico -1.483e-01 4.041e-01 -0.367
## native.country Scotland 1.905e-01 7.892e-01 0.241
## native.country South -8.819e-01 4.414e-01 -1.998
## native.country Taiwan 2.248e-01 4.724e-01 0.476
## native.country Thailand -3.784e-01 8.356e-01 -0.453
## native.country Trinadad&Tobago -1.977e-01 8.709e-01 -0.227
## native.country United-States 3.815e-01 1.380e-01 2.764
## native.country Vietnam -9.593e-01 6.150e-01 -1.560
## native.country Yugoslavia 8.720e-01 6.824e-01 1.278
## Pr(>|z|)
## (Intercept) < 2e-16 ***
## age < 2e-16 ***
## workclass Federal-gov 9.99e-13 ***
## workclass Local-gov 0.00334 **
## workclass Never-worked 0.96936
## workclass Private 2.08e-06 ***
## workclass Self-emp-inc 2.74e-07 ***
## workclass Self-emp-not-inc 0.44954
## workclass State-gov 0.06173 .
## workclass Without-pay 0.95095
## fnlwgt 3.93e-05 ***
## education 11th 0.68670
## education 12th 0.06435 .
## education 1st-4th 0.27696
## education 5th-6th 0.46255
## education 7th-8th 0.04039 *
## education 9th 0.45771
## education Assoc-acdm 3.63e-14 ***
## education Assoc-voc 1.45e-15 ***
## education Bachelors < 2e-16 ***
## education Doctorate < 2e-16 ***
## education HS-grad 1.15e-07 ***
## education Masters < 2e-16 ***
## education Preschool 0.95410
## education Prof-school < 2e-16 ***
## education Some-college 9.52e-14 ***
## education.num NA
## marital.status Married-AF-spouse 1.24e-06 ***
## marital.status Married-civ-spouse < 2e-16 ***
## marital.status Married-spouse-absent 0.96192
## marital.status Never-married 3.52e-08 ***
## marital.status Separated 0.41647
## marital.status Widowed 0.40350
## occupation Adm-clerical 0.26955
## occupation Armed-Forces 0.49174
## occupation Craft-repair 0.03239 *
## occupation Exec-managerial < 2e-16 ***
## occupation Farming-fishing 5.16e-10 ***
## occupation Handlers-cleaners 9.33e-05 ***
## occupation Machine-op-inspct 0.10429
## occupation Other-service 9.12e-09 ***
## occupation Priv-house-serv 0.01572 *
## occupation Prof-specialty 2.46e-11 ***
## occupation Protective-serv 1.40e-07 ***
## occupation Sales 1.45e-05 ***
## occupation Tech-support 1.41e-10 ***
## occupation Transport-moving NA
## relationship Not-in-family 0.03015 *
## relationship Other-relative 0.12442
## relationship Own-child 0.01111 *
## relationship Unmarried 0.11338
## relationship Wife < 2e-16 ***
## race Asian-Pac-Islander 0.01369 *
## race Black 0.09106 .
## race Other 0.62365
## race White 0.00978 **
## sex Male < 2e-16 ***
## capital.gain < 2e-16 ***
## capital.loss < 2e-16 ***
## hours.per.week < 2e-16 ***
## native.country Cambodia 0.01936 *
## native.country Canada 0.07989 .
## native.country China 0.19766
## native.country Columbia 0.01919 *
## native.country Cuba 0.11349
## native.country Dominican-Republic 0.11735
## native.country Ecuador 0.89697
## native.country El-Salvador 0.39301
## native.country England 0.13735
## native.country France 0.14385
## native.country Germany 0.02931 *
## native.country Greece 0.15824
## native.country Guatemala 0.93354
## native.country Haiti 0.84275
## native.country Holand-Netherlands 0.99074
## native.country Honduras 0.64493
## native.country Hong 0.89827
## native.country Hungary 0.92543
## native.country India 0.56390
## native.country Iran 0.60364
## native.country Ireland 0.26424
## native.country Italy 0.00392 **
## native.country Jamaica 0.62170
## native.country Japan 0.16914
## native.country Laos 0.62575
## native.country Mexico 0.15325
## native.country Nicaragua 0.44424
## native.country Outlying-US(Guam-USVI-etc) 0.95407
## native.country Peru 0.44772
## native.country Philippines 0.02981 *
## native.country Poland 0.66608
## native.country Portugal 0.80763
## native.country Puerto-Rico 0.71362
## native.country Scotland 0.80929
## native.country South 0.04573 *
## native.country Taiwan 0.63409
## native.country Thailand 0.65062
## native.country Trinadad&Tobago 0.82041
## native.country United-States 0.00570 **
## native.country Vietnam 0.11884
## native.country Yugoslavia 0.20131
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 35948 on 32560 degrees of freedom
## Residual deviance: 20565 on 32462 degrees of freedom
## AIC: 20763
##
## Number of Fisher Scoring iterations: 13
alias(LogisticModel, scale = FALSE)
## Model :
## census_income_data$Income_band ~ age + workclass + fnlwgt + education +
## education.num + marital.status + occupation + relationship +
## race + sex + capital.gain + capital.loss + hours.per.week +
## native.country
##
## Complete :
## (Intercept) age workclass Federal-gov
## education.num 6 0 0
## occupation Transport-moving 0 0 1
## workclass Local-gov workclass Never-worked
## education.num 0 0
## occupation Transport-moving 1 0
## workclass Private workclass Self-emp-inc
## education.num 0 0
## occupation Transport-moving 1 1
## workclass Self-emp-not-inc workclass State-gov
## education.num 0 0
## occupation Transport-moving 1 1
## workclass Without-pay fnlwgt education 11th
## education.num 0 0 1
## occupation Transport-moving 1 0 0
## education 12th education 1st-4th
## education.num 2 -4
## occupation Transport-moving 0 0
## education 5th-6th education 7th-8th
## education.num -3 -2
## occupation Transport-moving 0 0
## education 9th education Assoc-acdm
## education.num -1 6
## occupation Transport-moving 0 0
## education Assoc-voc education Bachelors
## education.num 5 7
## occupation Transport-moving 0 0
## education Doctorate education HS-grad
## education.num 10 3
## occupation Transport-moving 0 0
## education Masters education Preschool
## education.num 8 -5
## occupation Transport-moving 0 0
## education Prof-school education Some-college
## education.num 9 4
## occupation Transport-moving 0 0
## marital.status Married-AF-spouse
## education.num 0
## occupation Transport-moving 0
## marital.status Married-civ-spouse
## education.num 0
## occupation Transport-moving 0
## marital.status Married-spouse-absent
## education.num 0
## occupation Transport-moving 0
## marital.status Never-married
## education.num 0
## occupation Transport-moving 0
## marital.status Separated
## education.num 0
## occupation Transport-moving 0
## marital.status Widowed occupation Adm-clerical
## education.num 0 0
## occupation Transport-moving 0 -1
## occupation Armed-Forces
## education.num 0
## occupation Transport-moving -1
## occupation Craft-repair
## education.num 0
## occupation Transport-moving -1
## occupation Exec-managerial
## education.num 0
## occupation Transport-moving -1
## occupation Farming-fishing
## education.num 0
## occupation Transport-moving -1
## occupation Handlers-cleaners
## education.num 0
## occupation Transport-moving -1
## occupation Machine-op-inspct
## education.num 0
## occupation Transport-moving -1
## occupation Other-service
## education.num 0
## occupation Transport-moving -1
## occupation Priv-house-serv
## education.num 0
## occupation Transport-moving -1
## occupation Prof-specialty
## education.num 0
## occupation Transport-moving -1
## occupation Protective-serv occupation Sales
## education.num 0 0
## occupation Transport-moving -1 -1
## occupation Tech-support
## education.num 0
## occupation Transport-moving -1
## relationship Not-in-family
## education.num 0
## occupation Transport-moving 0
## relationship Other-relative
## education.num 0
## occupation Transport-moving 0
## relationship Own-child relationship Unmarried
## education.num 0 0
## occupation Transport-moving 0 0
## relationship Wife race Asian-Pac-Islander
## education.num 0 0
## occupation Transport-moving 0 0
## race Black race Other race White sex Male
## education.num 0 0 0 0
## occupation Transport-moving 0 0 0 0
## capital.gain capital.loss hours.per.week
## education.num 0 0 0
## occupation Transport-moving 0 0 0
## native.country Cambodia native.country Canada
## education.num 0 0
## occupation Transport-moving 0 0
## native.country China native.country Columbia
## education.num 0 0
## occupation Transport-moving 0 0
## native.country Cuba
## education.num 0
## occupation Transport-moving 0
## native.country Dominican-Republic
## education.num 0
## occupation Transport-moving 0
## native.country Ecuador
## education.num 0
## occupation Transport-moving 0
## native.country El-Salvador
## education.num 0
## occupation Transport-moving 0
## native.country England native.country France
## education.num 0 0
## occupation Transport-moving 0 0
## native.country Germany native.country Greece
## education.num 0 0
## occupation Transport-moving 0 0
## native.country Guatemala native.country Haiti
## education.num 0 0
## occupation Transport-moving 0 0
## native.country Holand-Netherlands
## education.num 0
## occupation Transport-moving 0
## native.country Honduras native.country Hong
## education.num 0 0
## occupation Transport-moving 0 0
## native.country Hungary native.country India
## education.num 0 0
## occupation Transport-moving 0 0
## native.country Iran native.country Ireland
## education.num 0 0
## occupation Transport-moving 0 0
## native.country Italy native.country Jamaica
## education.num 0 0
## occupation Transport-moving 0 0
## native.country Japan native.country Laos
## education.num 0 0
## occupation Transport-moving 0 0
## native.country Mexico native.country Nicaragua
## education.num 0 0
## occupation Transport-moving 0 0
## native.country Outlying-US(Guam-USVI-etc)
## education.num 0
## occupation Transport-moving 0
## native.country Peru native.country Philippines
## education.num 0 0
## occupation Transport-moving 0 0
## native.country Poland native.country Portugal
## education.num 0 0
## occupation Transport-moving 0 0
## native.country Puerto-Rico
## education.num 0
## occupation Transport-moving 0
## native.country Scotland native.country South
## education.num 0 0
## occupation Transport-moving 0 0
## native.country Taiwan native.country Thailand
## education.num 0 0
## occupation Transport-moving 0 0
## native.country Trinadad&Tobago
## education.num 0
## occupation Transport-moving 0
## native.country United-States
## education.num 0
## occupation Transport-moving 0
## native.country Vietnam
## education.num 0
## occupation Transport-moving 0
## native.country Yugoslavia
## education.num 0
## occupation Transport-moving 0
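The NA rows for education.num and occupation Transport-moving in the coefficient table come from exact linear dependence among the predictors, which is what the alias() output above reports. Below is a minimal sketch of the same effect on hypothetical toy data (the names toy, grade and grade_num are made up for illustration). A numeric recoding of a factor is a linear combination of the intercept and the factor's dummy columns, so glm() cannot estimate it separately.

```r
# Hypothetical toy data: a 3-level factor plus a numeric recoding of it,
# analogous to education and education.num in the census data.
toy <- data.frame(
  grade = factor(rep(c("low", "mid", "high"), each = 20),
                 levels = c("low", "mid", "high"))
)
toy$grade_num <- as.integer(toy$grade)   # low = 1, mid = 2, high = 3
toy$y <- c(rep(c(1, 0, 0, 0), 5),        # 25% positives in "low"
           rep(c(1, 0), 10),             # 50% positives in "mid"
           rep(c(1, 1, 1, 0), 5))        # 75% positives in "high"

fit <- glm(y ~ grade + grade_num, family = binomial(), data = toy)

coef(fit)            # the coefficient of grade_num is reported as NA
alias(fit)$Complete  # grade_num expressed through the other columns
```

Dropping one of the two redundant columns (for example, fitting without education.num) leaves the fitted probabilities unchanged and removes the NA row.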
Individual Impact of Variables
library("caret", lib.loc="C:/Program Files/R/R-3.3.1/library")
varImp(LogisticModel, scale = FALSE)
## Overall
## age 15.45978788
## workclass Federal-gov 7.13059046
## workclass Local-gov 2.93418870
## workclass Never-worked 0.03841036
## workclass Private 4.74589511
## workclass Self-emp-inc 5.14025091
## workclass Self-emp-not-inc 0.75618502
## workclass State-gov 1.86821835
## workclass Without-pay 0.06151125
## fnlwgt 4.11138462
## education 11th 0.40333184
## education 12th 1.84977469
## education 1st-4th 1.08718108
## education 5th-6th 0.73466116
## education 7th-8th 2.04970529
## education 9th 0.74262221
## education Assoc-acdm 7.57369082
## education Assoc-voc 7.98077808
## education Bachelors 12.29618486
## education Doctorate 13.95408957
## education HS-grad 5.30193981
## education Masters 13.63108622
## education Preschool 0.05755434
## education Prof-school 13.95472810
## education Some-college 7.44740391
## marital.status Married-AF-spouse 4.84934535
## marital.status Married-civ-spouse 8.31190433
## marital.status Married-spouse-absent 0.04774463
## marital.status Never-married 5.51319049
## marital.status Separated 0.81255767
## marital.status Widowed 0.83538341
## occupation Adm-clerical 1.10410539
## occupation Armed-Forces 0.68754204
## occupation Craft-repair 2.13953789
## occupation Exec-managerial 10.27577676
## occupation Farming-fishing 6.21422641
## occupation Handlers-cleaners 3.90733480
## occupation Machine-op-inspct 1.62441813
## occupation Other-service 5.74635490
## occupation Priv-house-serv 2.41532499
## occupation Prof-specialty 6.67544897
## occupation Protective-serv 5.26524365
## occupation Sales 4.33622786
## occupation Tech-support 6.41476040
## relationship Not-in-family 2.16813179
## relationship Other-relative 1.53649487
## relationship Own-child 2.53917050
## relationship Unmarried 1.58317029
## relationship Wife 13.28213390
## race Asian-Pac-Islander 2.46542508
## race Black 1.68985002
## race Other 0.49068506
## race White 2.58353199
## sex Male 10.88310426
## capital.gain 30.96796328
## capital.loss 17.43065618
## hours.per.week 18.31600453
## native.country Cambodia 2.33844906
## native.country Canada 1.75133472
## native.country China 1.28823300
## native.country Columbia 2.34173064
## native.country Cuba 1.58270581
## native.country Dominican-Republic 1.56598250
## native.country Ecuador 0.12948876
## native.country El-Salvador 0.85417585
## native.country England 1.48574758
## native.country France 1.46160799
## native.country Germany 2.17930460
## native.country Greece 1.41099951
## native.country Guatemala 0.08338756
## native.country Haiti 0.19838015
## native.country Holand-Netherlands 0.01160294
## native.country Honduras 0.46082025
## native.country Hong 0.12784747
## native.country Hungary 0.09359322
## native.country India 0.57705903
## native.country Iran 0.51917871
## native.country Ireland 1.11642112
## native.country Italy 2.88471317
## native.country Jamaica 0.49343756
## native.country Japan 1.37495795
## native.country Laos 0.48771250
## native.country Mexico 1.42813057
## native.country Nicaragua 0.76505483
## native.country Outlying-US(Guam-USVI-etc) 0.05760036
## native.country Peru 0.75922390
## native.country Philippines 2.17262061
## native.country Poland 0.43154102
## native.country Portugal 0.24348140
## native.country Puerto-Rico 0.36700309
## native.country Scotland 0.24133728
## native.country South 1.99791145
## native.country Taiwan 0.47598183
## native.country Thailand 0.45289950
## native.country Trinadad&Tobago 0.22701988
## native.country United-States 2.76431028
## native.country Vietnam 1.55965850
## native.country Yugoslavia 1.27784231
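For a glm fit, the importance scores printed above line up with the absolute z values in the coefficient table (capital.gain is 30.968 in both). Below is a sketch of that computation in base R, using the built-in mtcars data as a stand-in (an assumption for illustration, not the census model).

```r
# Stand-in logistic model on a built-in dataset (illustration only).
fit <- glm(am ~ hp + wt, family = binomial(), data = mtcars)

# Coefficient table with columns: Estimate, Std. Error, z value, Pr(>|z|)
coef_table <- summary(fit)$coefficients

# Importance as |z|, dropping the intercept, sorted from largest to smallest
importance <- sort(abs(coef_table[-1, "z value"]), decreasing = TRUE)
importance
```

Because the score is a test statistic, it reflects how precisely a coefficient is estimated as well as how large it is; rare levels such as some native.country values score low partly because few rows support them.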
AIC and BIC
library("stats", lib.loc="C:/Program Files/R/R-3.3.1/library")
AIC(LogisticModel)
## [1] 20763.02
BIC(LogisticModel)
## [1] 21593.72
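Both criteria can be reproduced from the summary output above. For a binary (0/1) response the residual deviance equals -2 times the log-likelihood, so AIC = deviance + 2k and BIC = deviance + k log(n), where k is the number of estimated coefficients. The numbers below are copied from the output; the residual deviance is rounded there, so the results agree only to that precision.

```r
n         <- 32561   # observations
null_df   <- 32560   # degrees of freedom of the null model
resid_df  <- 32462   # residual degrees of freedom
resid_dev <- 20565   # residual deviance as printed (rounded)

# Estimated coefficients, including the intercept
k <- null_df - resid_df + 1

aic <- resid_dev + 2 * k        # penalty of 2 per coefficient
bic <- resid_dev + k * log(n)   # penalty of log(n) per coefficient

c(k = k, AIC = aic, BIC = bic)
```

BIC is larger because log(32561) is about 10.4, so each of the 99 coefficients is penalized roughly five times as heavily as under AIC.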
SVM Model
SVM is another black-box method in machine learning. Compared to the other algorithms used here, it takes a quite different approach to learning: it looks for the boundary that separates the two classes with the widest margin.
library(e1071)
svm<- svm(census_income_data$Income_band~.,type="C", kernel="linear",data=census_income_data)
summary(svm)
##
## Call:
## svm(formula = census_income_data$Income_band ~ ., data = census_income_data,
## type = "C", kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
## gamma: 0.00990099
##
## Number of Support Vectors: 11152
##
## ( 5585 5567 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
There are 11,152 support vectors (5,585 and 5,567 in the two classes), and the SVM type is C-classification.
Confusion Matrix
library(caret)
svm_predicted<-predict(svm)
confusionMatrix(svm_predicted,census_income_data$Income_band)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 23189 3266
## 1 1531 4575
##
## Accuracy : 0.8527
## 95% CI : (0.8488, 0.8565)
## No Information Rate : 0.7592
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5642
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9381
## Specificity : 0.5835
## Pos Pred Value : 0.8765
## Neg Pred Value : 0.7493
## Prevalence : 0.7592
## Detection Rate : 0.7122
## Detection Prevalence : 0.8125
## Balanced Accuracy : 0.7608
##
## 'Positive' Class : 0
##
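The statistics reported by confusionMatrix() follow directly from the four cell counts. Below is a base R sketch using the counts printed above, with class 0 as the positive class as in the output.

```r
# Rows are predictions, columns are the reference (true) classes
cm <- matrix(c(23189, 1531, 3266, 4575), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

n           <- sum(cm)
accuracy    <- sum(diag(cm)) / n
sensitivity <- cm["0", "0"] / sum(cm[, "0"])   # true 0s predicted as 0
specificity <- cm["1", "1"] / sum(cm[, "1"])   # true 1s predicted as 1
balanced    <- (sensitivity + specificity) / 2

# Cohen's kappa: agreement beyond what the marginals alone would give
expected   <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_stat <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced = balanced,
        kappa = kappa_stat), 4)
```

A specificity of about 0.58 means that only about 58% of the >50K earners are identified, which the overall accuracy alone does not reveal.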
Decision Tree Model
A series of questions and their possible answers can be organized as a decision tree: a hierarchical structure consisting of nodes and directed edges.
library("rpart", lib.loc="C:/Program Files/R/R-3.3.1/library")
## Warning: package 'rpart' was built under R version 3.3.3
library("tree", lib.loc="C:/Program Files/R/R-3.3.1/library")
## Warning: package 'tree' was built under R version 3.3.3
names(census_income_data)
## [1] "age" "workclass" "fnlwgt" "education"
## [5] "education.num" "marital.status" "occupation" "relationship"
## [9] "race" "sex" "capital.gain" "capital.loss"
## [13] "hours.per.week" "native.country" "Income_band"
income_tree<-rpart(census_income_data$Income_band~.,method="class", control=rpart.control(minsplit=30), data=census_income_data)
income_tree
## n= 32561
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 32561 7841 0 (0.75919044 0.24080956)
## 2) relationship= Not-in-family, Other-relative, Own-child, Unmarried 17800 1178 0 (0.93382022 0.06617978)
## 4) capital.gain< 7073.5 17482 872 0 (0.95012012 0.04987988) *
## 5) capital.gain>=7073.5 318 12 1 (0.03773585 0.96226415) *
## 3) relationship= Husband, Wife 14761 6663 0 (0.54860782 0.45139218)
## 6) education= 10th, 11th, 12th, 1st-4th, 5th-6th, 7th-8th, 9th, Assoc-acdm, Assoc-voc, HS-grad, Preschool, Some-college 10329 3456 0 (0.66540807 0.33459193)
## 12) capital.gain< 5095.5 9807 2944 0 (0.69980626 0.30019374) *
## 13) capital.gain>=5095.5 522 10 1 (0.01915709 0.98084291) *
## 7) education= Bachelors, Doctorate, Masters, Prof-school 4432 1225 1 (0.27639892 0.72360108) *
library("rattle", lib.loc="C:/Program Files/R/R-3.3.1/library")
library("rpart.plot", lib.loc="C:/Program Files/R/R-3.3.1/library")
## Warning: package 'rpart.plot' was built under R version 3.3.3
fancyRpartPlot(income_tree)
printcp(income_tree)
##
## Classification tree:
## rpart(formula = census_income_data$Income_band ~ ., data = census_income_data,
## method = "class", control = rpart.control(minsplit = 30))
##
## Variables actually used in tree construction:
## [1] capital.gain education relationship
##
## Root node error: 7841/32561 = 0.24081
##
## n= 32561
##
## CP nsplit rel error xerror xstd
## 1 0.126387 0 1.00000 1.00000 0.0098399
## 2 0.064022 2 0.74723 0.74723 0.0088402
## 3 0.037495 3 0.68320 0.68320 0.0085321
## 4 0.010000 4 0.64571 0.64571 0.0083394
plotcp(income_tree)
Prediction using the model
library("caret", lib.loc="C:/Program Files/R/R-3.3.1/library")
sample_pred<-predict(income_tree, type="class")
conf_matrix<-table(sample_pred,census_income_data$Income_band)
conf_matrix
##
## sample_pred 0 1
## 0 23473 3816
## 1 1247 4025
accuracy3<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy3
## [1] 0.8445072
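The rel error column of printcp() is expressed relative to the root node error, so the training accuracy can be recovered from it: absolute error = rel error × root node error. Here is a quick check with the numbers printed above.

```r
root_error <- 7841 / 32561   # error rate when every record is predicted as 0
rel_error  <- 0.64571        # last row of the CP table (4 splits)

abs_error <- rel_error * root_error   # misclassification rate of the tree
accuracy  <- 1 - abs_error

round(accuracy, 4)
```

The result agrees with accuracy3 computed from the confusion matrix, up to the rounding of rel error in the printed table.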
Prune the Decision Tree
library("rpart", lib.loc="C:/Program Files/R/R-3.3.1/library")
library("tree", lib.loc="C:/Program Files/R/R-3.3.1/library")
income_tree1<-rpart(census_income_data$Income_band~.,method="class", control=rpart.control(minsplit=30, cp=0.037), data=census_income_data)
income_tree1
## n= 32561
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 32561 7841 0 (0.75919044 0.24080956)
## 2) relationship= Not-in-family, Other-relative, Own-child, Unmarried 17800 1178 0 (0.93382022 0.06617978)
## 4) capital.gain< 7073.5 17482 872 0 (0.95012012 0.04987988) *
## 5) capital.gain>=7073.5 318 12 1 (0.03773585 0.96226415) *
## 3) relationship= Husband, Wife 14761 6663 0 (0.54860782 0.45139218)
## 6) education= 10th, 11th, 12th, 1st-4th, 5th-6th, 7th-8th, 9th, Assoc-acdm, Assoc-voc, HS-grad, Preschool, Some-college 10329 3456 0 (0.66540807 0.33459193)
## 12) capital.gain< 5095.5 9807 2944 0 (0.69980626 0.30019374) *
## 13) capital.gain>=5095.5 522 10 1 (0.01915709 0.98084291) *
## 7) education= Bachelors, Doctorate, Masters, Prof-school 4432 1225 1 (0.27639892 0.72360108) *
library("rattle", lib.loc="C:/Program Files/R/R-3.3.1/library")
library("rpart.plot", lib.loc="C:/Program Files/R/R-3.3.1/library")
fancyRpartPlot(income_tree1)
printcp(income_tree1)
##
## Classification tree:
## rpart(formula = census_income_data$Income_band ~ ., data = census_income_data,
## method = "class", control = rpart.control(minsplit = 30,
## cp = 0.037))
##
## Variables actually used in tree construction:
## [1] capital.gain education relationship
##
## Root node error: 7841/32561 = 0.24081
##
## n= 32561
##
## CP nsplit rel error xerror xstd
## 1 0.126387 0 1.00000 1.00000 0.0098399
## 2 0.064022 2 0.74723 0.74723 0.0088402
## 3 0.037495 3 0.68320 0.68320 0.0085321
## 4 0.037000 4 0.64571 0.65553 0.0083908
plotcp(income_tree1)
library("caret", lib.loc="C:/Program Files/R/R-3.3.1/library")
sample_pred1<-predict(income_tree1, type="class")
conf_matrix<-table(sample_pred1,census_income_data$Income_band)
conf_matrix
##
## sample_pred1 0 1
## 0 23473 3816
## 1 1247 4025
accuracy4<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy4
## [1] 0.8445072
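Refitting with a larger cp is one way to get a smaller tree; rpart also offers prune(), which cuts back a tree that has already been grown. The sketch below assumes the rpart package is installed and uses its bundled kyphosis data as a stand-in for the census data. It grows a deliberately large tree, takes the cp value with the lowest cross-validated error from the CP table, and prunes to it.

```r
library(rpart)

set.seed(123)  # the xerror column comes from random cross-validation folds

# Grow a deliberately large tree (small cp, small minsplit)
full_tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                   method = "class",
                   control = rpart.control(minsplit = 5, cp = 0.001))

# Pick the cp with the lowest cross-validated error
cp_table <- full_tree$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Cut the tree back to that complexity
pruned_tree <- prune(full_tree, cp = best_cp)

# Number of nodes before and after pruning
c(full = nrow(full_tree$frame), pruned = nrow(pruned_tree$frame))
```

In the census example above, cp = 0.037 lies just below the 0.037495 threshold in the CP table, so all four splits survive and the tree and its accuracy are unchanged.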
Train and Validation datasets
library("caret", lib.loc="C:/Program Files/R/R-3.3.1/library")
sampledata <- createDataPartition(census_income_data$Income_band, p=0.80, list=FALSE)
train_new <- census_income_data[sampledata,]
hold_out <- census_income_data[-sampledata,]
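createDataPartition() samples within each class, so the class proportions of Income_band are roughly preserved in both parts. Below is a base R sketch of the same idea on a hypothetical outcome vector (the 760/240 counts are made up to mimic the 76%/24% class balance).

```r
set.seed(42)

# Hypothetical outcome with a 76% / 24% class balance
y <- factor(rep(c("0", "1"), times = c(760, 240)))

# Row indices grouped by class
by_class <- split(seq_along(y), y)

# Draw 80% of the rows from each class separately
train_idx <- unlist(
  lapply(by_class, function(i) sample(i, size = round(0.8 * length(i)))),
  use.names = FALSE
)

train_y   <- y[train_idx]
holdout_y <- y[-train_idx]

prop.table(table(train_y))    # class shares in the training part
prop.table(table(holdout_y))  # class shares in the hold-out part
```

Calling set.seed() before the split makes the partition reproducible, which also applies to createDataPartition().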
Overfitting
Model on training data
library("rpart", lib.loc="C:/Program Files/R/R-3.3.1/library")
library("tree", lib.loc="C:/Program Files/R/R-3.3.1/library")
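# Note: data=census_income_data fits the tree on all rows, including those in hold_out; pass data=train_new to fit on the training split only
income_tree<-rpart(Income_band~.,method="class", control=rpart.control(minsplit=30,cp=0.05), data=census_income_data)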
income_tree
## n= 32561
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 32561 7841 0 (0.75919044 0.24080956)
## 2) relationship= Not-in-family, Other-relative, Own-child, Unmarried 17800 1178 0 (0.93382022 0.06617978) *
## 3) relationship= Husband, Wife 14761 6663 0 (0.54860782 0.45139218)
## 6) education= 10th, 11th, 12th, 1st-4th, 5th-6th, 7th-8th, 9th, Assoc-acdm, Assoc-voc, HS-grad, Preschool, Some-college 10329 3456 0 (0.66540807 0.33459193)
## 12) capital.gain< 5095.5 9807 2944 0 (0.69980626 0.30019374) *
## 13) capital.gain>=5095.5 522 10 1 (0.01915709 0.98084291) *
## 7) education= Bachelors, Doctorate, Masters, Prof-school 4432 1225 1 (0.27639892 0.72360108) *
library("rattle", lib.loc="C:/Program Files/R/R-3.3.1/library")
library("tree", lib.loc="C:/Program Files/R/R-3.3.1/library")
sample_pred<-predict(income_tree, train_new,type="class")
conf_matrix<-table(sample_pred,train_new$Income_band)
conf_matrix
##
## sample_pred 0 1
## 0 18790 3300
## 1 986 2973
accuracy5<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy5
## [1] 0.8354639
Model Validation
Validation accuracy
library("rattle", lib.loc="C:/Program Files/R/R-3.3.1/library")
hold_out$pred<- predict(income_tree, hold_out,type="class")
conf_matrix_val<-table(hold_out$pred,hold_out$Income_band)
conf_matrix_val
##
## 0 1
## 0 4695 822
## 1 249 746
accuracy_val<-(conf_matrix_val[1,1]+conf_matrix_val[2,2])/(sum(conf_matrix_val))
accuracy_val
## [1] 0.8355344
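Overfitting shows up as a gap between training and hold-out accuracy. Using the two confusion matrices printed above, the gap can be checked directly; acc() is a small helper defined here for illustration.

```r
# Accuracy = correctly classified (diagonal) / all records
acc <- function(cm) sum(diag(cm)) / sum(cm)

# Counts copied from the output above (filled column by column)
train_cm <- matrix(c(18790, 986, 3300, 2973), nrow = 2)
val_cm   <- matrix(c(4695, 249, 822, 746), nrow = 2)

train_acc <- acc(train_cm)
val_acc   <- acc(val_cm)

round(c(train = train_acc, validation = val_acc,
        gap = train_acc - val_acc), 4)
```

The two accuracies are nearly identical, which suggests that this shallow tree is not overfitting on this split.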
ROC and AUC on the logistic model
library("pROC", lib.loc="C:/Program Files/R/R-3.3.1/library")
## Warning: package 'pROC' was built under R version 3.3.3
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following object is masked from 'package:gmodels':
##
## ci
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# The model fitted here is a logistic regression (glm), so the ROC curve and AUC below describe the logistic model
logistic_fit<-glm(census_income_data$Income_band~.,family=binomial(),data=census_income_data)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
predicted_prob<-predict(logistic_fit,type="response")
roccurve <- roc(logistic_fit$y, predicted_prob)
plot(roccurve)
auc(roccurve)
## Area under the curve: 0.9089
auc(logistic_fit$y, predicted_prob)
## Area under the curve: 0.9089
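The AUC can be read as the probability that a randomly chosen >50K record receives a higher predicted probability than a randomly chosen <=50K record. Below is a base R sketch of that rank-based (Mann-Whitney) computation on a tiny made-up example; auc_rank() is a hypothetical helper, not part of pROC.

```r
# AUC from ranks: share of (positive, negative) pairs ordered correctly,
# with ties counted as half via average ranks.
auc_rank <- function(labels, scores) {
  pos   <- labels == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r     <- rank(scores)
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

labels <- c(0, 0, 1, 1)              # made-up true classes
scores <- c(0.1, 0.4, 0.35, 0.8)     # made-up predicted probabilities

auc_rank(labels, scores)  # 3 of the 4 positive/negative pairs are ordered correctly: 0.75
```

An AUC of 0.9089 therefore means that about 91% of such pairs are ranked correctly by the fitted model.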
k-fold Cross Validation building
Divide the whole dataset into k equal parts. Use one part as the holdout sample and the remaining k-1 parts as training data. Repeat this k times, holding out a different part each time, so that k models are built. The average error on the holdout samples gives an estimate of the testing error.
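The procedure can be sketched as an explicit loop in base R. The example below uses simulated data and a one-variable logistic model as stand-ins (both are assumptions for illustration), with k = 5 to keep it short.

```r
set.seed(1)

# Simulated stand-in data: one predictor, noisy binary outcome
n   <- 200
sim <- data.frame(x = rnorm(n))
sim$y <- rbinom(n, 1, plogis(sim$x))

k <- 5
fold_id  <- sample(rep(1:k, length.out = n))  # random, equally sized folds
fold_acc <- numeric(k)

for (i in 1:k) {
  train <- sim[fold_id != i, ]   # k-1 parts for fitting
  hold  <- sim[fold_id == i, ]   # one part held out

  fit  <- glm(y ~ x, family = binomial(), data = train)
  prob <- predict(fit, newdata = hold, type = "response")
  pred <- as.integer(prob > 0.5)

  fold_acc[i] <- mean(pred == hold$y)   # hold-out accuracy of this fold
}

fold_acc        # one accuracy per fold
mean(fold_acc)  # cross-validated accuracy estimate
```

Every row is held out exactly once, so the averaged accuracy uses all of the data for testing without ever testing a model on rows it was fitted to.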
K=10
library("caret", lib.loc="C:/Program Files/R/R-3.3.1/library")
train_dat <- trainControl(method="cv", number=10)
train_dat
## $method
## [1] "cv"
##
## $number
## [1] 10
##
## $repeats
## [1] 1
##
## $search
## [1] "grid"
##
## $p
## [1] 0.75
##
## $initialWindow
## NULL
##
## $horizon
## [1] 1
##
## $fixedWindow
## [1] TRUE
##
## $skip
## [1] 0
##
## $verboseIter
## [1] FALSE
##
## $returnData
## [1] TRUE
##
## $returnResamp
## [1] "final"
##
## $savePredictions
## [1] FALSE
##
## $classProbs
## [1] FALSE
##
## $summaryFunction
## function (data, lev = NULL, model = NULL)
## {
## if (is.character(data$obs))
## data$obs <- factor(data$obs, levels = lev)
## postResample(data[, "pred"], data[, "obs"])
## }
## <environment: namespace:caret>
##
## $selectionFunction
## [1] "best"
##
## $preProcOptions
## $preProcOptions$thresh
## [1] 0.95
##
## $preProcOptions$ICAcomp
## [1] 3
##
## $preProcOptions$k
## [1] 5
##
## $preProcOptions$freqCut
## [1] 19
##
## $preProcOptions$uniqueCut
## [1] 10
##
## $preProcOptions$cutoff
## [1] 0.9
##
##
## $sampling
## NULL
##
## $index
## NULL
##
## $indexOut
## NULL
##
## $indexFinal
## NULL
##
## $timingSamps
## [1] 0
##
## $predictionBounds
## [1] FALSE FALSE
##
## $seeds
## [1] NA
##
## $adaptive
## $adaptive$min
## [1] 5
##
## $adaptive$alpha
## [1] 0.05
##
## $adaptive$method
## [1] "gls"
##
## $adaptive$complete
## [1] TRUE
##
##
## $trim
## [1] FALSE
##
## $allowParallel
## [1] TRUE
names(census_income_data)
## [1] "age" "workclass" "fnlwgt" "education"
## [5] "education.num" "marital.status" "occupation" "relationship"
## [9] "race" "sex" "capital.gain" "capital.loss"
## [13] "hours.per.week" "native.country" "Income_band"
census_income_data$Income_band<-as.factor(census_income_data$Income_band)
Building the models on K-fold samples
library("e1071", lib.loc="C:/Program Files/R/R-3.3.1/library")
K_fold_tree<-train(Income_band~., method="rpart", trControl=train_dat, control=rpart.control(minsplit=10, cp=0.000001), data=census_income_data)
K_fold_tree
## CART
##
## 32561 samples
## 14 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 29305, 29305, 29305, 29305, 29305, 29305, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.03685754 0.8379656 0.4988152
## 0.06453259 0.8240837 0.4422865
## 0.12492029 0.7870774 0.2043005
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.03685754.
K_fold_tree$finalModel
## n= 32561
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 32561 7841 0 (0.75919044 0.24080956)
## 2) marital.status Married-civ-spouse< 0.5 17585 1149 0 (0.93466022 0.06533978) *
## 3) marital.status Married-civ-spouse>=0.5 14976 6692 0 (0.55315171 0.44684829)
## 6) education.num< 12.5 10507 3478 0 (0.66898258 0.33101742)
## 12) capital.gain< 5095.5 9979 2961 0 (0.70327688 0.29672312) *
## 13) capital.gain>=5095.5 528 11 1 (0.02083333 0.97916667) *
## 7) education.num>=12.5 4469 1255 1 (0.28082345 0.71917655) *
library("rattle", lib.loc="C:/Program Files/R/R-3.3.1/library")
fancyRpartPlot(K_fold_tree$finalModel)
Kfold_pred<-predict(K_fold_tree)
The caret package provides a confusionMatrix() function.
conf_matrix<-confusionMatrix(Kfold_pred,census_income_data$Income_band)
conf_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 23454 4110
## 1 1266 3731
##
## Accuracy : 0.8349
## 95% CI : (0.8308, 0.8389)
## No Information Rate : 0.7592
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4846
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9488
## Specificity : 0.4758
## Pos Pred Value : 0.8509
## Neg Pred Value : 0.7466
## Prevalence : 0.7592
## Detection Rate : 0.7203
## Detection Prevalence : 0.8465
## Balanced Accuracy : 0.7123
##
## 'Positive' Class : 0
##
Bootstrap
Bootstrapping is a powerful way to estimate the accuracy of a model: the model is repeatedly fitted on samples drawn with replacement from the data.
library("caret", lib.loc="C:/Program Files/R/R-3.3.1/library")
train_control <- trainControl(method="boot", number=10)
Tree model on bootstrapped data
Boot_Strap_model <- train(Income_band~., method="rpart", trControl=train_control, control=rpart.control(minsplit=10, cp=0.000001), data=census_income_data)
names(census_income_data)
## [1] "age" "workclass" "fnlwgt" "education"
## [5] "education.num" "marital.status" "occupation" "relationship"
## [9] "race" "sex" "capital.gain" "capital.loss"
## [13] "hours.per.week" "native.country" "Income_band"
Boot_Strap_model$finalModel
## n= 32561
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 32561 7841 0 (0.75919044 0.24080956)
## 2) marital.status Married-civ-spouse< 0.5 17585 1149 0 (0.93466022 0.06533978) *
## 3) marital.status Married-civ-spouse>=0.5 14976 6692 0 (0.55315171 0.44684829)
## 6) education.num< 12.5 10507 3478 0 (0.66898258 0.33101742)
## 12) capital.gain< 5095.5 9979 2961 0 (0.70327688 0.29672312) *
## 13) capital.gain>=5095.5 528 11 1 (0.02083333 0.97916667) *
## 7) education.num>=12.5 4469 1255 1 (0.28082345 0.71917655) *
Boot_Strap_predictions <- predict(Boot_Strap_model)
conf_matrix<-confusionMatrix(Boot_Strap_predictions,census_income_data$Income_band)
conf_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 23454 4110
## 1 1266 3731
##
## Accuracy : 0.8349
## 95% CI : (0.8308, 0.8389)
## No Information Rate : 0.7592
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4846
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9488
## Specificity : 0.4758
## Pos Pred Value : 0.8509
## Neg Pred Value : 0.7466
## Prevalence : 0.7592
## Detection Rate : 0.7203
## Detection Prevalence : 0.8465
## Balanced Accuracy : 0.7123
##
## 'Positive' Class : 0
##
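The resampling behind a bootstrap estimate can be sketched in base R: draw n rows with replacement, fit on that sample, and score the rows that were not drawn (the out-of-bag rows). This is broadly the idea behind method="boot"; the simulated data and the one-variable logistic model below are stand-ins chosen for illustration.

```r
set.seed(1)

# Simulated stand-in data: one predictor, noisy binary outcome
n   <- 200
sim <- data.frame(x = rnorm(n))
sim$y <- rbinom(n, 1, plogis(sim$x))

B <- 10                    # number of bootstrap resamples
boot_acc <- numeric(B)

for (b in 1:B) {
  in_bag <- sample(n, size = n, replace = TRUE)   # rows drawn with replacement
  oob    <- setdiff(seq_len(n), in_bag)           # rows never drawn

  fit  <- glm(y ~ x, family = binomial(), data = sim[in_bag, ])
  prob <- predict(fit, newdata = sim[oob, ], type = "response")

  boot_acc[b] <- mean(as.integer(prob > 0.5) == sim$y[oob])
}

mean(boot_acc)  # bootstrap estimate of accuracy
sd(boot_acc)    # spread across resamples
```

The spread across resamples gives a rough sense of how much the accuracy estimate would vary with a different sample.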
Conclusion
n= 32561
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 32561 7841 0 (0.75919044 0.24080956)
2) marital.status Married-civ-spouse< 0.5 17585 1149 0 (0.93466022 0.06533978) *
3) marital.status Married-civ-spouse>=0.5 14976 6692 0 (0.55315171 0.44684829)
6) education.num< 12.5 10507 3478 0 (0.66898258 0.33101742)
12) capital.gain< 5095.5 9979 2961 0 (0.70327688 0.29672312) *
13) capital.gain>=5095.5 528 11 1 (0.02083333 0.97916667) *
7) education.num>=12.5 4469 1255 1 (0.28082345 0.71917655) *
The root node contains all 32,561 records. It is labeled <=50K, and its loss is the 7,841 records that actually earn >50K. About 76% of the population earn $50,000 or less (<=50K) and 24% earn more (>50K).
Node 2 is marital.status Married-civ-spouse< 0.5. It has 17,585 records and is labeled <=50K; its loss is the 1,149 records that earn >50K. About 93% of this group earn <=50K and 7% earn >50K.
Node 3 is marital.status Married-civ-spouse>=0.5. It has 14,976 records and is labeled <=50K; its loss is the 6,692 records that earn >50K. About 55% of this group earn <=50K and 45% earn >50K.
Node 6 is education.num< 12.5. It has 10,507 records and is labeled <=50K; its loss is the 3,478 records that earn >50K. About 67% of this group earn <=50K and 33% earn >50K.
Node 12 is capital.gain< 5095.5. It has 9,979 records and is labeled <=50K; its loss is the 2,961 records that earn >50K. About 70% of this group earn <=50K and 30% earn >50K.
Node 13 is capital.gain>=5095.5. It has 528 records and is labeled >50K; its loss is the 11 records that earn <=50K. About 2% of this group earn <=50K and 98% earn >50K.
Node 7 is education.num>=12.5. It has 4,469 records and is labeled >50K; its loss is the 1,255 records that earn <=50K. About 28% of this group earn <=50K and 72% earn >50K.
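The percentages quoted for each node follow from its n and loss columns: the minority share is loss / n and the majority share is 1 - loss / n. Here is a base R sketch with the values copied from the tree above; the column names of the nodes data frame are chosen here for illustration.

```r
# n, loss and predicted class (yval) per node, copied from the printed tree
nodes <- data.frame(
  node = c(1, 2, 3, 6, 12, 13, 7),
  n    = c(32561, 17585, 14976, 10507, 9979, 528, 4469),
  loss = c(7841, 1149, 6692, 3478, 2961, 11, 1255),
  yval = c(0, 0, 0, 0, 0, 1, 1)
)

# Share of records that match / do not match the node's predicted class
nodes$majority_share <- round(1 - nodes$loss / nodes$n, 4)
nodes$minority_share <- round(nodes$loss / nodes$n, 4)
nodes

# Terminal nodes (marked * in the tree) partition the data,
# so their losses add up to the tree's total misclassifications
leaves <- nodes[nodes$node %in% c(2, 12, 13, 7), ]
tree_accuracy <- 1 - sum(leaves$loss) / sum(leaves$n)
round(tree_accuracy, 4)
```

The summed leaf losses (5,376) equal the off-diagonal cells of the confusion matrix (4,110 + 1,266), which ties the tree printout to the reported accuracy.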
Confusion Matrix and Accuracy
Reference Dataset
Prediction 0 1
0 23454 4110
1 1266 3731
Accuracy : 0.8349
95% CI : (0.8308, 0.8389)
No Information Rate : 0.7592
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.4846
Mcnemar’s Test P-Value : < 2.2e-16
Sensitivity : 0.9488
Specificity : 0.4758
Pos Pred Value : 0.8509
Neg Pred Value : 0.7466
Prevalence : 0.7592
Detection Rate : 0.7203
Detection Prevalence : 0.8465
Balanced Accuracy : 0.7123
'Positive' Class : 0
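Predicting on the whole dataset, the final tree reaches an accuracy of about 83% (0.8349), in line with the 10-fold cross-validated accuracy of 0.838.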


