You can download the datasets.

Census Income Prediction

Problem Statement
- Abstract of the Problem
- Source of Information
Data Exploration
Model Building
Conclusion
- Confusion Matrix and Accuracy

Problem Statement

Abstract of the Problem

The objective of this project is to predict whether a citizen's income exceeds $50K/yr based on census income data.

Source of Information

U.S. Census Bureau, United States Department of Commerce

Donor: Terran Lane and Ronny Kohavi, Data Mining and Visualization, Silicon Graphics. terran@ecn.purdue.edu, ronnyk@sgi.com

Date Donated: March 7, 2000

Data Exploration

Dataset Information

Income data (income_data): the data set contains 32,561 rows and 15 columns.

Data Import

The first step is to import the data set into R. Formats such as .csv and .xlsx are the common data formats used by data scientists and analysts, and each has a suitable import function. In our case the data set is in csv format, so we use the function ‘read.csv()’ to import it.

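# stringsAsFactors=TRUE keeps text columns as factors (no longer the default since R 4.0.0)
census_income_data<-read.csv("E:/R_Census_Project/Satish/Census Income Data/income_data.csv", stringsAsFactors=TRUE)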
dim(census_income_data)

## [1] 32561    15

names(census_income_data)

##  [1] "age"            "workclass"      "fnlwgt"         "education"     
##  [5] "education.num"  "marital.status" "occupation"     "relationship"  
##  [9] "race"           "sex"            "capital.gain"   "capital.loss"  
## [13] "hours.per.week" "native.country" "Income_band"

levels(census_income_data$Income_band)[1]<-0
levels(census_income_data$Income_band)[2]<-1
table(census_income_data$Income_band)

## 
##     0     1 
## 24720  7841
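
The two levels() assignments above rely on the factor levels happening to be in the expected order. A more explicit recoding could look like the sketch below. The raw labels " <=50K" and " >50K" are an assumption, since the original labels are not shown in this report, and income_raw is a hypothetical toy vector rather than the real column.

```r
# Toy vector standing in for the raw Income_band column.
# The labels " <=50K" and " >50K" are assumed, not taken from this report.
income_raw <- c(" <=50K", " >50K", " <=50K", " <=50K")

# Strip the leading space, then map each label to 0/1 explicitly
income_band <- factor(trimws(income_raw),
                      levels = c("<=50K", ">50K"),
                      labels = c("0", "1"))

table(income_band)
```

Naming the levels explicitly means the mapping does not silently change if the level order in the file changes.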

head(census_income_data)

##   age         workclass fnlwgt  education education.num
## 1  39         State-gov  77516  Bachelors            13
## 2  50  Self-emp-not-inc  83311  Bachelors            13
## 3  38           Private 215646    HS-grad             9
## 4  53           Private 234721       11th             7
## 5  28           Private 338409  Bachelors            13
## 6  37           Private 284582    Masters            14
##        marital.status         occupation   relationship   race     sex
## 1       Never-married       Adm-clerical  Not-in-family  White    Male
## 2  Married-civ-spouse    Exec-managerial        Husband  White    Male
## 3            Divorced  Handlers-cleaners  Not-in-family  White    Male
## 4  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male
## 5  Married-civ-spouse     Prof-specialty           Wife  Black  Female
## 6  Married-civ-spouse    Exec-managerial           Wife  White  Female
##   capital.gain capital.loss hours.per.week native.country Income_band
## 1         2174            0             40  United-States           0
## 2            0            0             13  United-States           0
## 3            0            0             40  United-States           0
## 4            0            0             40  United-States           0
## 5            0            0             40           Cuba           0
## 6            0            0             40  United-States           0

The census_income_data data frame contains 32,561 rows and 15 columns.

Univariate Analysis

Once we have the dataset and metadata, understanding the metadata thoroughly is a crucial step. Exploration helps us understand each variable, which is necessary to understand the relation between the input variables and the variable to be predicted. It also gives a first impression of what is going on in the dataset.

Check whether missing values are present

sum(is.na(census_income_data))

## [1] 0

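census_income_data has no NA values. Note, however, that the frequency tables of workclass and occupation further down contain a " ?" category, which appears to stand for unknown values that is.na() does not detect.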
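
Because is.na() only detects real NA values, placeholder strings have to be counted separately. The following is a minimal sketch on a hypothetical toy data frame; the helper name count_placeholder and the toy values are illustrative assumptions, not part of the original analysis.

```r
# Toy data frame standing in for census_income_data (hypothetical values)
toy <- data.frame(
  workclass  = c(" Private", " ?", " State-gov", " ?"),
  occupation = c(" Sales", " ?", " Adm-clerical", " Tech-support"),
  age        = c(39, 50, 38, 53),
  stringsAsFactors = TRUE
)

# Count cells equal to the placeholder (ignoring surrounding whitespace)
# in each column of the data frame
count_placeholder <- function(df, placeholder = "?") {
  sapply(df, function(col) sum(trimws(as.character(col)) == placeholder))
}

count_placeholder(toy)
```

Applied to the real data, the same call would be count_placeholder(census_income_data).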

Variable_1= “age”

age is numerical data.

head(census_income_data$age)

## [1] 39 50 38 53 28 37

Univariate Analysis of age

Central tendencies of age

mean of age

mean(census_income_data$age)

## [1] 38.58165

median of age

median(census_income_data$age)

## [1] 37

Dispersion of age

Variance of age

var(census_income_data$age)

## [1] 186.0614

Standard deviation of age

sd(census_income_data$age)

## [1] 13.64043

summary() gives the minimum, the quartiles, the mean and the maximum of age

summary(census_income_data$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.00   28.00   37.00   38.58   48.00   90.00

boxplot of age

quantile(census_income_data$age)

##   0%  25%  50%  75% 100% 
##   17   28   37   48   90

quantile(census_income_data$age,c(0.75,0.80,0.90,1))

##  75%  80%  90% 100% 
##   48   50   58   90

boxplot(census_income_data$age,main="age")

Output description

In this boxplot the minimum is 17, the maximum is 90 and the median is 37. The first quartile is 28 and the third quartile is 48. Note that outliers are discussed later.
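
As a worked example of how a boxplot flags outliers: by default boxplot() draws whiskers out to the most extreme points within 1.5 times the interquartile range of the box. The sketch below applies that rule to the quartiles of age printed above. It uses the quantile() values rather than the hinges that boxplot() computes internally, on the assumption that the two are practically identical for a sample this large.

```r
# Quartiles of age as printed by quantile() above
q1 <- 28
q3 <- 48

iqr <- q3 - q1                 # interquartile range
lower_fence <- q1 - 1.5 * iqr  # values below this would be flagged
upper_fence <- q3 + 1.5 * iqr  # values above this would be flagged

c(lower = lower_fence, upper = upper_fence)
```

The fences come out at -2 and 78, so under this rule only ages above 78 would be drawn as outliers; the minimum of 17 is well inside the lower fence.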

Histogram of “age” variable:

hist(census_income_data$age)

Correlation between age and Income_band

library(ltm)

## Warning: package 'ltm' was built under R version 3.3.3

## Loading required package: MASS

## Warning: package 'MASS' was built under R version 3.3.3

## Loading required package: msm

## Warning: package 'msm' was built under R version 3.3.3

## Loading required package: polycor

## Warning: package 'polycor' was built under R version 3.3.3

biserial.cor(census_income_data$age,census_income_data$Income_band)

## [1] -0.2340335

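The coefficient is -0.23. By default biserial.cor() takes the first level of Income_band (here 0) as its reference level, so the negative sign means that age tends to be lower in band 0. In other words, higher age goes with the higher income band.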
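
The point-biserial coefficient is Pearson's correlation between the numeric variable and a 0/1 coding of the binary one. The sketch below shows this in base R without ltm. The vectors x and y are hypothetical toy data, and coding the second level as 1 is a choice made here so that a positive value means "larger x goes with band 1".

```r
# Hypothetical toy data, not taken from the census file
x <- c(25, 32, 47, 51, 38, 60)                # numeric variable (e.g. age)
y <- factor(c("0", "0", "1", "1", "0", "1"))  # binary band

# Code the second level as 1, everything else as 0, then use Pearson's r
y01  <- as.numeric(y == levels(y)[2])
r_pb <- cor(x, y01)

# Same quantity from the textbook formula with the population standard deviation
p         <- mean(y01)
sd_pop    <- sqrt(mean((x - mean(x))^2))
r_formula <- (mean(x[y01 == 1]) - mean(x[y01 == 0])) * sqrt(p * (1 - p)) / sd_pop

c(r_pb = r_pb, r_formula = r_formula)
```

Both numbers agree. Flipping which level is coded as 1 flips only the sign, which is why the reference level matters when interpreting the result.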

Variable_2= “workclass”

It is categorical data. There are 9 categories:

Not in universe

Private

Self-employed-not incorporated

Local government

State government

Self-employed-incorporated

Federal government

Never worked

Without pay

“workclass” is qualitative data, so measures of central tendency (other than the mode) and measures of dispersion do not make sense. For this scenario we compute a frequency table, the mode and a bar chart. The mode gives the most frequent category of workclass.

Frequency table of workclass

tab<-table(census_income_data$workclass)
tab

## 
##                 ?       Federal-gov         Local-gov      Never-worked 
##              1836               960              2093                 7 
##           Private      Self-emp-inc  Self-emp-not-inc         State-gov 
##             22696              1116              2541              1298 
##       Without-pay 
##                14

names(tab)

## [1] " ?"                " Federal-gov"      " Local-gov"       
## [4] " Never-worked"     " Private"          " Self-emp-inc"    
## [7] " Self-emp-not-inc" " State-gov"        " Without-pay"

sum(is.na(census_income_data$workclass))

## [1] 0

Mode of “workclass”

temp <- table(as.vector(census_income_data$workclass))
names(temp)[temp==max(temp)]

## [1] " Private"

The mode of workclass is " Private".
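
The same two lines are repeated for every categorical variable in this report, so they could be wrapped in a small helper. This is a sketch only; the function name stat_mode is a hypothetical choice and does not come from any package.

```r
# Return the most frequent value(s) of a vector; ties yield several values
stat_mode <- function(x) {
  counts <- table(as.vector(x))
  names(counts)[counts == max(counts)]
}

stat_mode(c(" Private", " State-gov", " Private"))
```

With the real data the call would be stat_mode(census_income_data$workclass).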

ggplot of “workclass”

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.3.3

qplot(census_income_data$workclass,main="workclass",ylab="count",colour= I("purple"),size=I(4))

library(gmodels)

## Warning: package 'gmodels' was built under R version 3.3.3

CrossTable(census_income_data$workclass,census_income_data$Income_band,digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)

## Warning in chisq.test(t, correct = FALSE, ...): Chi-squared approximation
## may be incorrect

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  32561 
## 
##  
##                              | census_income_data$Income_band 
## census_income_data$workclass |         0 |         1 | Row Total | 
## -----------------------------|-----------|-----------|-----------|
##                            ? |      1645 |       191 |      1836 | 
##                              |       0.1 |       0.0 |           | 
## -----------------------------|-----------|-----------|-----------|
##                  Federal-gov |       589 |       371 |       960 | 
##                              |       0.0 |       0.0 |           | 
## -----------------------------|-----------|-----------|-----------|
##                    Local-gov |      1476 |       617 |      2093 | 
##                              |       0.1 |       0.1 |           | 
## -----------------------------|-----------|-----------|-----------|
##                 Never-worked |         7 |         0 |         7 | 
##                              |       0.0 |       0.0 |           | 
## -----------------------------|-----------|-----------|-----------|
##                      Private |     17733 |      4963 |     22696 | 
##                              |       0.7 |       0.6 |           | 
## -----------------------------|-----------|-----------|-----------|
##                 Self-emp-inc |       494 |       622 |      1116 | 
##                              |       0.0 |       0.1 |           | 
## -----------------------------|-----------|-----------|-----------|
##             Self-emp-not-inc |      1817 |       724 |      2541 | 
##                              |       0.1 |       0.1 |           | 
## -----------------------------|-----------|-----------|-----------|
##                    State-gov |       945 |       353 |      1298 | 
##                              |       0.0 |       0.0 |           | 
## -----------------------------|-----------|-----------|-----------|
##                  Without-pay |        14 |         0 |        14 | 
##                              |       0.0 |       0.0 |           | 
## -----------------------------|-----------|-----------|-----------|
##                 Column Total |     24720 |      7841 |     32561 | 
##                              |       0.8 |       0.2 |           | 
## -----------------------------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  1045.709     d.f. =  8     p =  2.026505e-220 
## 
## 
##
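
The warning "Chi-squared approximation may be incorrect" usually means that some expected cell counts are small. The sketch below rebuilds the table from the counts printed above and lists the rows with an expected count below 5, a common rule of thumb. The object names are illustrative.

```r
# Counts copied from the CrossTable output above (columns: band 0, band 1)
workclass_tab <- matrix(
  c(1645, 191,  589, 371,  1476, 617,  7, 0,  17733, 4963,
    494, 622,  1817, 724,  945, 353,  14, 0),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("?", "Federal-gov", "Local-gov", "Never-worked", "Private",
      "Self-emp-inc", "Self-emp-not-inc", "State-gov", "Without-pay"),
    c("0", "1")
  )
)

# The test itself; the small-count warning is expected and suppressed here
res <- suppressWarnings(chisq.test(workclass_tab, correct = FALSE))

# Cells whose expected count is below 5
low_cells <- which(res$expected < 5, arr.ind = TRUE)
rownames(workclass_tab)[low_cells[, "row"]]
```

Only the two very small categories, Never-worked and Without-pay, fall below the threshold, so the overall conclusion of a strong association is unlikely to hinge on the approximation.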

Variable_3=“fnlwgt”

The number of people the census takers believe that the observation represents. It is continuous data. We will be ignoring this variable.

head(census_income_data$fnlwgt)

## [1]  77516  83311 215646 234721 338409 284582

univariate analysis of fnlwgt

Central tendencies of fnlwgt

mean of fnlwgt

mean(census_income_data$fnlwgt)

## [1] 189778.4

median of fnlwgt

median(census_income_data$fnlwgt)

## [1] 178356

Measures of Dispersion of fnlwgt

Variance of fnlwgt

var(census_income_data$fnlwgt)

## [1] 11140797792

Standard deviation of fnlwgt

sd(census_income_data$fnlwgt)

## [1] 105550

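summary() gives the minimum, the quartiles, the mean and the maximum of fnlwgt. The printed values are rounded to four significant digits, which is why they differ slightly from the quantile() output.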

summary(census_income_data$fnlwgt)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12280  117800  178400  189800  237100 1485000

Boxplot of fnlwgt

quantile(census_income_data$fnlwgt)

##      0%     25%     50%     75%    100% 
##   12285  117827  178356  237051 1484705

quantile(census_income_data$fnlwgt,c(0.75,0.80,0.90,1))

##     75%     80%     90%    100% 
##  237051  259873  329054 1484705

boxplot(census_income_data$fnlwgt,main="fnlwgt")

output description

In this boxplot the minimum is 12285, the maximum is 1484705 and the median is 178356. The first quartile is 117827 and the third quartile is 237051. Note that outliers are discussed later.

Histogram of fnlwgt

hist(census_income_data$fnlwgt)

Correlation between fnlwgt and Income_band

library(ltm)
biserial.cor(census_income_data$fnlwgt,census_income_data$Income_band)

## [1] 0.009462412

The coefficient is 0.009462412, which is practically zero: fnlwgt and Income_band are essentially uncorrelated.

Variable_4=“education”

The highest level of education achieved by the individual. It is categorical data with the following categories:

Preschool

1st 2nd 3rd or 4th grade

5th or 6th grade

7th and 8th grade

9th grade

10th grade

11th grade

12th grade no diploma

High school graduate

Some college but no degree

Associates degree-academic program

Associates degree-occup /vocational

Bachelors degree(BA AB BS)

Masters degree(MA MS MEng MEd MSW MBA)

Prof school degree (MD DDS DVM LLB JD)

Doctorate degree(PhD EdD)

“education” contains qualitative data, so measures of central tendency (other than the mode) and measures of dispersion do not make sense. A frequency table, the mode and a bar chart are computed instead. The mode gives the most frequent education level.

frequency table of education

tab<-table(census_income_data$education)
tab

## 
##          10th          11th          12th       1st-4th       5th-6th 
##           933          1175           433           168           333 
##       7th-8th           9th    Assoc-acdm     Assoc-voc     Bachelors 
##           646           514          1067          1382          5355 
##     Doctorate       HS-grad       Masters     Preschool   Prof-school 
##           413         10501          1723            51           576 
##  Some-college 
##          7291

names(tab)

##  [1] " 10th"         " 11th"         " 12th"         " 1st-4th"     
##  [5] " 5th-6th"      " 7th-8th"      " 9th"          " Assoc-acdm"  
##  [9] " Assoc-voc"    " Bachelors"    " Doctorate"    " HS-grad"     
## [13] " Masters"      " Preschool"    " Prof-school"  " Some-college"

sum(is.na(census_income_data$education))

## [1] 0

Mode of education

temp <- table(as.vector(census_income_data$education))
names(temp)[temp==max(temp)]

## [1] " HS-grad"

The mode of education is " HS-grad".

ggplot of education

library(ggplot2)
qplot(census_income_data$education,main="education",ylab="count",colour= I("purple"),size=I(4))

library(gmodels)
CrossTable(census_income_data$education,census_income_data$Income_band,digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  32561 
## 
##  
##                              | census_income_data$Income_band 
## census_income_data$education |         0 |         1 | Row Total | 
## -----------------------------|-----------|-----------|-----------|
##                         10th |       871 |        62 |       933 | 
##                              |       0.0 |       0.0 |           | 
## -----------------------------|-----------|-----------|-----------|
##                         11th |      1115 |        60 |      1175 | 
##                              |       0.0 |       0.0 |           | 
## -----------------------------|-----------|-----------|-----------|
##                         12th |       400 |        33 |       433 | 
##                              |       0.0 |       0.0 |           | 
## -----------------------------|-----------|-----------|-----------|
##                      1st-4th |       162 |         6 |       168 | 
##                              |       0.0 |       0.0 |           | 
## -----------------------------|-----------|-----------|-----------|
##                      5th-6th |       317 |        16 |       333 | 
##                              |       0.0 |       0.0 |           | 
## -----------------------------|-----------|-----------|-----------|
##                      7th-8th |       606 |        40 |       646 | 
##                              |       0.0 |       0.0 |           | 
## -----------------------------|-----------|-----------|-----------|
##                          9th |       487 |        27 |       514 | 
##                              |       0.0 |       0.0 |           | 
## -----------------------------|-----------|-----------|-----------|
##                   Assoc-acdm |       802 |       265 |      1067 | 
##                              |       0.0 |       0.0 |           | 
## -----------------------------|-----------|-----------|-----------|
##                    Assoc-voc |      1021 |       361 |      1382 | 
##                              |       0.0 |       0.0 |           | 
## -----------------------------|-----------|-----------|-----------|
##                    Bachelors |      3134 |      2221 |      5355 | 
##                              |       0.1 |       0.3 |           | 
## -----------------------------|-----------|-----------|-----------|
##                    Doctorate |       107 |       306 |       413 | 
##                              |       0.0 |       0.0 |           | 
## -----------------------------|-----------|-----------|-----------|
##                      HS-grad |      8826 |      1675 |     10501 | 
##                              |       0.4 |       0.2 |           | 
## -----------------------------|-----------|-----------|-----------|
##                      Masters |       764 |       959 |      1723 | 
##                              |       0.0 |       0.1 |           | 
## -----------------------------|-----------|-----------|-----------|
##                    Preschool |        51 |         0 |        51 | 
##                              |       0.0 |       0.0 |           | 
## -----------------------------|-----------|-----------|-----------|
##                  Prof-school |       153 |       423 |       576 | 
##                              |       0.0 |       0.1 |           | 
## -----------------------------|-----------|-----------|-----------|
##                 Some-college |      5904 |      1387 |      7291 | 
##                              |       0.2 |       0.2 |           | 
## -----------------------------|-----------|-----------|-----------|
##                 Column Total |     24720 |      7841 |     32561 | 
##                              |       0.8 |       0.2 |           | 
## -----------------------------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  4429.653     d.f. =  15     p =  0 
## 
## 
##
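
The printed "p = 0" presumably reflects floating-point underflow rather than a probability of exactly zero. A minimal sketch using the statistic and degrees of freedom printed above shows how the p-value can still be inspected on the log scale with pchisq().

```r
# Test statistic and degrees of freedom as printed above
chi2 <- 4429.653
df   <- 15

# Upper-tail probability: too small to represent in double precision
p_value <- pchisq(chi2, df = df, lower.tail = FALSE)

# The same probability on the natural-log scale stays finite
log_p <- pchisq(chi2, df = df, lower.tail = FALSE, log.p = TRUE)

c(p_value = p_value, log_p = log_p)
```

A finite but hugely negative log p-value confirms that the association between education and Income_band is far beyond any conventional significance threshold.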

Variable_5=“education.num”

It is numerical data.

head(census_income_data$education.num)

## [1] 13 13  9  7 13 14

x<-sum(is.na(census_income_data$Income_band))
x

## [1] 0

Univariate analysis of education.num

Central tendencies of education.num

mean of education.num

mean(census_income_data$education.num)

## [1] 10.08068

median of education.num

median(census_income_data$education.num)

## [1] 10

Measures of Dispersion of education.num

Variance of education.num

var(census_income_data$education.num)

## [1] 6.61889

Standard deviation of education.num

sd(census_income_data$education.num)

## [1] 2.57272

summary() gives the minimum, the quartiles, the mean and the maximum of education.num

summary(census_income_data$education.num)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    9.00   10.00   10.08   12.00   16.00

boxplot of education.num

quantile(census_income_data$education.num)

##   0%  25%  50%  75% 100% 
##    1    9   10   12   16

quantile(census_income_data$education.num,c(0.75,0.80,0.90,1))

##  75%  80%  90% 100% 
##   12   13   13   16

boxplot(census_income_data$education.num,main="education.num")

Output description

In this boxplot the minimum is 1, the maximum is 16 and the median is 10. The first quartile is 9 and the third quartile is 12. Note that outliers are discussed later.

histogram of education.num

hist(census_income_data$education.num)

Correlation between education.num and Income_band

library(ltm)
biserial.cor(census_income_data$education.num,census_income_data$Income_band)

## [1] -0.3351488

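The coefficient is -0.3351488. By default biserial.cor() takes the first level of Income_band (here 0) as its reference level, so the negative sign means that education.num tends to be lower in band 0. In other words, more years of education go with the higher income band.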

Variable_6 = “marital.status”

Marital status of the individual. It is a categorical variable. The categories are

Never married

Married-civilian spouse present

Divorced

Widowed

Separated

Married-spouse absent

Married-A F spouse present

“marital.status” contains qualitative data, so measures of central tendency (other than the mode) and measures of dispersion do not make sense. A frequency table, the mode and a bar chart are computed instead. The mode gives the most frequent marital status.

Frequency table of marital.status

tab<-table(census_income_data$marital.status)
tab

## 
##               Divorced      Married-AF-spouse     Married-civ-spouse 
##                   4443                     23                  14976 
##  Married-spouse-absent          Never-married              Separated 
##                    418                  10683                   1025 
##                Widowed 
##                    993

names(tab)

## [1] " Divorced"              " Married-AF-spouse"    
## [3] " Married-civ-spouse"    " Married-spouse-absent"
## [5] " Never-married"         " Separated"            
## [7] " Widowed"

sum(is.na(census_income_data$marital.status))

## [1] 0

Mode of marital.status

temp <- table(as.vector(census_income_data$marital.status))
names(temp)[temp==max(temp)]

## [1] " Married-civ-spouse"

The mode of marital status is " Married-civ-spouse".

ggplot of marital status

library(ggplot2)
qplot(census_income_data$marital.status,main="marital status",ylab="count",colour= I("purple"),size=I(4))

library(gmodels)
CrossTable(census_income_data$marital.status,census_income_data$Income_band,digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  32561 
## 
##  
##                                   | census_income_data$Income_band 
## census_income_data$marital.status |         0 |         1 | Row Total | 
## ----------------------------------|-----------|-----------|-----------|
##                          Divorced |      3980 |       463 |      4443 | 
##                                   |       0.2 |       0.1 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                 Married-AF-spouse |        13 |        10 |        23 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                Married-civ-spouse |      8284 |      6692 |     14976 | 
##                                   |       0.3 |       0.9 |           | 
## ----------------------------------|-----------|-----------|-----------|
##             Married-spouse-absent |       384 |        34 |       418 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                     Never-married |     10192 |       491 |     10683 | 
##                                   |       0.4 |       0.1 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                         Separated |       959 |        66 |      1025 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                           Widowed |       908 |        85 |       993 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                      Column Total |     24720 |      7841 |     32561 | 
##                                   |       0.8 |       0.2 |           | 
## ----------------------------------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  6517.742     d.f. =  6     p =  0 
## 
## 
##
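
The CrossTable() call above switches row proportions off (prop.r=F) and rounds the column proportions to one digit, which hides most of the pattern. As a sketch, the share of each income band within every marital status can be recomputed from the printed counts with prop.table(). The object names are illustrative.

```r
# Counts copied from the CrossTable output above (columns: band 0, band 1)
marital_tab <- matrix(
  c(3980, 463,  13, 10,  8284, 6692,  384, 34,
    10192, 491,  959, 66,  908, 85),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("Divorced", "Married-AF-spouse", "Married-civ-spouse",
      "Married-spouse-absent", "Never-married", "Separated", "Widowed"),
    c("0", "1")
  )
)

# Share of each income band within every marital status (rows sum to 1)
row_share <- prop.table(marital_tab, margin = 1)
round(row_share[, "1"], 3)
```

Roughly 45% of the Married-civ-spouse respondents fall in band 1, against under 5% of the Never-married respondents.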

Variable_7 = “occupation”

It is categorical data. The categories are

Adm-clerical

Armed-Forces

Craft-repair

Exec-managerial

Farming-fishing

Handlers-cleaners

Machine-op-inspct

Other-service

Priv-house-serv

Prof-specialty

Protective-serv

Sales

Tech-support

Transport-moving

Frequency table of occupation

tab<-table(census_income_data$occupation)
tab

## 
##                  ?       Adm-clerical       Armed-Forces 
##               1843               3770                  9 
##       Craft-repair    Exec-managerial    Farming-fishing 
##               4099               4066                994 
##  Handlers-cleaners  Machine-op-inspct      Other-service 
##               1370               2002               3295 
##    Priv-house-serv     Prof-specialty    Protective-serv 
##                149               4140                649 
##              Sales       Tech-support   Transport-moving 
##               3650                928               1597

names(tab)

##  [1] " ?"                 " Adm-clerical"      " Armed-Forces"     
##  [4] " Craft-repair"      " Exec-managerial"   " Farming-fishing"  
##  [7] " Handlers-cleaners" " Machine-op-inspct" " Other-service"    
## [10] " Priv-house-serv"   " Prof-specialty"    " Protective-serv"  
## [13] " Sales"             " Tech-support"      " Transport-moving"

sum(is.na(census_income_data$occupation))

## [1] 0

Mode of occupation

temp <- table(as.vector(census_income_data$occupation))
names(temp)[temp==max(temp)]

## [1] " Prof-specialty"

The mode of occupation is " Prof-specialty".
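
The mode alone can hide how close the runners-up are. A sorted frequency table makes this visible; the sketch below uses the counts printed in the frequency table above, with illustrative object names.

```r
# Counts copied from the frequency table of occupation above
occupation_counts <- c(
  "?" = 1843, "Adm-clerical" = 3770, "Armed-Forces" = 9,
  "Craft-repair" = 4099, "Exec-managerial" = 4066, "Farming-fishing" = 994,
  "Handlers-cleaners" = 1370, "Machine-op-inspct" = 2002,
  "Other-service" = 3295, "Priv-house-serv" = 149, "Prof-specialty" = 4140,
  "Protective-serv" = 649, "Sales" = 3650, "Tech-support" = 928,
  "Transport-moving" = 1597
)

# Largest categories first
top3 <- head(sort(occupation_counts, decreasing = TRUE), 3)
top3
```

Prof-specialty, Craft-repair and Exec-managerial are within about 75 observations of each other, so the mode is not a dominant category here. On the real data, sort(table(census_income_data$occupation), decreasing = TRUE) gives the full ranking.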

ggplot of occupation

library(ggplot2)
qplot(census_income_data$occupation,main="occupation",ylab="count",colour= I("purple"),size=I(4))

library(gmodels)
CrossTable(census_income_data$occupation,census_income_data$Income_band,digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)

## Warning in chisq.test(t, correct = FALSE, ...): Chi-squared approximation
## may be incorrect

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  32561 
## 
##  
##                               | census_income_data$Income_band 
## census_income_data$occupation |         0 |         1 | Row Total | 
## ------------------------------|-----------|-----------|-----------|
##                             ? |      1652 |       191 |      1843 | 
##                               |       0.1 |       0.0 |           | 
## ------------------------------|-----------|-----------|-----------|
##                  Adm-clerical |      3263 |       507 |      3770 | 
##                               |       0.1 |       0.1 |           | 
## ------------------------------|-----------|-----------|-----------|
##                  Armed-Forces |         8 |         1 |         9 | 
##                               |       0.0 |       0.0 |           | 
## ------------------------------|-----------|-----------|-----------|
##                  Craft-repair |      3170 |       929 |      4099 | 
##                               |       0.1 |       0.1 |           | 
## ------------------------------|-----------|-----------|-----------|
##               Exec-managerial |      2098 |      1968 |      4066 | 
##                               |       0.1 |       0.3 |           | 
## ------------------------------|-----------|-----------|-----------|
##               Farming-fishing |       879 |       115 |       994 | 
##                               |       0.0 |       0.0 |           | 
## ------------------------------|-----------|-----------|-----------|
##             Handlers-cleaners |      1284 |        86 |      1370 | 
##                               |       0.1 |       0.0 |           | 
## ------------------------------|-----------|-----------|-----------|
##             Machine-op-inspct |      1752 |       250 |      2002 | 
##                               |       0.1 |       0.0 |           | 
## ------------------------------|-----------|-----------|-----------|
##                 Other-service |      3158 |       137 |      3295 | 
##                               |       0.1 |       0.0 |           | 
## ------------------------------|-----------|-----------|-----------|
##               Priv-house-serv |       148 |         1 |       149 | 
##                               |       0.0 |       0.0 |           | 
## ------------------------------|-----------|-----------|-----------|
##                Prof-specialty |      2281 |      1859 |      4140 | 
##                               |       0.1 |       0.2 |           | 
## ------------------------------|-----------|-----------|-----------|
##               Protective-serv |       438 |       211 |       649 | 
##                               |       0.0 |       0.0 |           | 
## ------------------------------|-----------|-----------|-----------|
##                         Sales |      2667 |       983 |      3650 | 
##                               |       0.1 |       0.1 |           | 
## ------------------------------|-----------|-----------|-----------|
##                  Tech-support |       645 |       283 |       928 | 
##                               |       0.0 |       0.0 |           | 
## ------------------------------|-----------|-----------|-----------|
##              Transport-moving |      1277 |       320 |      1597 | 
##                               |       0.1 |       0.0 |           | 
## ------------------------------|-----------|-----------|-----------|
##                  Column Total |     24720 |      7841 |     32561 | 
##                               |       0.8 |       0.2 |           | 
## ------------------------------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  4031.974     d.f. =  14     p =  0 
## 
## 
##

Variable_8=“relationship”

It is categorical data. The categories are

Wife

Own-child

Husband

Not-in-family

Other-relative

Unmarried.

Frequency table of relationship

tab<-table(census_income_data$relationship)
tab

## 
##         Husband   Not-in-family  Other-relative       Own-child 
##           13193            8305             981            5068 
##       Unmarried            Wife 
##            3446            1568

names(tab)

## [1] " Husband"        " Not-in-family"  " Other-relative" " Own-child"     
## [5] " Unmarried"      " Wife"

sum(is.na(census_income_data$relationship))

## [1] 0

Mode of relationship

temp <- table(as.vector(census_income_data$relationship))
names(temp)[temp==max(temp)]

## [1] " Husband"

The mode of relationship is " Husband".

ggplot of relationship

library(ggplot2)
qplot(census_income_data$relationship,main="relationship",ylab="count",colour= I("purple"),size=I(4))

library(gmodels)
CrossTable(census_income_data$relationship,census_income_data$Income_band,digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  32561 
## 
##  
##                                 | census_income_data$Income_band 
## census_income_data$relationship |         0 |         1 | Row Total | 
## --------------------------------|-----------|-----------|-----------|
##                         Husband |      7275 |      5918 |     13193 | 
##                                 |       0.3 |       0.8 |           | 
## --------------------------------|-----------|-----------|-----------|
##                   Not-in-family |      7449 |       856 |      8305 | 
##                                 |       0.3 |       0.1 |           | 
## --------------------------------|-----------|-----------|-----------|
##                  Other-relative |       944 |        37 |       981 | 
##                                 |       0.0 |       0.0 |           | 
## --------------------------------|-----------|-----------|-----------|
##                       Own-child |      5001 |        67 |      5068 | 
##                                 |       0.2 |       0.0 |           | 
## --------------------------------|-----------|-----------|-----------|
##                       Unmarried |      3228 |       218 |      3446 | 
##                                 |       0.1 |       0.0 |           | 
## --------------------------------|-----------|-----------|-----------|
##                            Wife |       823 |       745 |      1568 | 
##                                 |       0.0 |       0.1 |           | 
## --------------------------------|-----------|-----------|-----------|
##                    Column Total |     24720 |      7841 |     32561 | 
##                                 |       0.8 |       0.2 |           | 
## --------------------------------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  6699.077     d.f. =  5     p =  0 
## 
## 
##
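The Pearson statistic reported by CrossTable() can be cross-checked with base R's chisq.test(). The sketch below rebuilds the contingency table from the counts printed above and also shows the column proportions with three decimals, since one decimal rounds most cells to 0.0. The object names are illustrative.

```r
# Counts copied from the CrossTable output above
# (rows: relationship, columns: Income_band)
rel_tab <- matrix(
  c(7275, 5918,
    7449,  856,
     944,   37,
    5001,   67,
    3228,  218,
     823,  745),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    relationship = c("Husband", "Not-in-family", "Other-relative",
                     "Own-child", "Unmarried", "Wife"),
    Income_band  = c("0", "1")
  )
)

# Pearson chi-squared test of independence
rel_test <- chisq.test(rel_tab)
rel_test$statistic   # test statistic
rel_test$parameter   # degrees of freedom

# Share of each relationship category within each income band
round(prop.table(rel_tab, margin = 2), 3)
```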

Variable_9=“race”

It is a categorical variable. The categories are:

White

Black

Asian or Pacific Islander

Other

Amer Indian Aleut or Eskimo

Frequency table of race

tab<-table(census_income_data$race)
tab

## 
##  Amer-Indian-Eskimo  Asian-Pac-Islander               Black 
##                 311                1039                3124 
##               Other               White 
##                 271               27816

names(tab)

## [1] " Amer-Indian-Eskimo" " Asian-Pac-Islander" " Black"             
## [4] " Other"              " White"

sum(is.na(census_income_data$race))

## [1] 0

Mode of race

temp <- table(as.vector(census_income_data$race))
names(temp)[temp==max(temp)]

## [1] " White"

The mode of race is “White”.

ggplot of race

library("ggplot2", lib.loc="C:/Program Files/R/R-3.3.1/library")
qplot(census_income_data$race,main="race",ylab="count",colour= I("purple"),size=I(4))

library("gmodels", lib.loc="C:/Program Files/R/R-3.3.1/library")
CrossTable(census_income_data$race,census_income_data$Income_band,digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  32561 
## 
##  
##                         | census_income_data$Income_band 
## census_income_data$race |         0 |         1 | Row Total | 
## ------------------------|-----------|-----------|-----------|
##      Amer-Indian-Eskimo |       275 |        36 |       311 | 
##                         |       0.0 |       0.0 |           | 
## ------------------------|-----------|-----------|-----------|
##      Asian-Pac-Islander |       763 |       276 |      1039 | 
##                         |       0.0 |       0.0 |           | 
## ------------------------|-----------|-----------|-----------|
##                   Black |      2737 |       387 |      3124 | 
##                         |       0.1 |       0.0 |           | 
## ------------------------|-----------|-----------|-----------|
##                   Other |       246 |        25 |       271 | 
##                         |       0.0 |       0.0 |           | 
## ------------------------|-----------|-----------|-----------|
##                   White |     20699 |      7117 |     27816 | 
##                         |       0.8 |       0.9 |           | 
## ------------------------|-----------|-----------|-----------|
##            Column Total |     24720 |      7841 |     32561 | 
##                         |       0.8 |       0.2 |           | 
## ------------------------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  330.9204     d.f. =  4     p =  2.305961e-70 
## 
## 
##

Variable_10=“sex”

It is a categorical variable. The categories are:

Female

Male

Frequency table of sex

tab<-table(census_income_data$sex)
tab

## 
##  Female    Male 
##   10771   21790

names(tab)

## [1] " Female" " Male"

sum(is.na(census_income_data$sex))

## [1] 0

Mode of sex

temp <- table(as.vector(census_income_data$sex))
names(temp)[temp==max(temp)]

## [1] " Male"

The mode of sex is “Male”.

ggplot of sex

library("ggplot2", lib.loc="C:/Program Files/R/R-3.3.1/library")
qplot(census_income_data$sex,main="sex",ylab="count",colour= I("purple"),size=I(4))

library("gmodels", lib.loc="C:/Program Files/R/R-3.3.1/library")
CrossTable(census_income_data$sex,census_income_data$Income_band,digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  32561 
## 
##  
##                        | census_income_data$Income_band 
## census_income_data$sex |         0 |         1 | Row Total | 
## -----------------------|-----------|-----------|-----------|
##                 Female |      9592 |      1179 |     10771 | 
##                        |       0.4 |       0.2 |           | 
## -----------------------|-----------|-----------|-----------|
##                   Male |     15128 |      6662 |     21790 | 
##                        |       0.6 |       0.8 |           | 
## -----------------------|-----------|-----------|-----------|
##           Column Total |     24720 |      7841 |     32561 | 
##                        |       0.8 |       0.2 |           | 
## -----------------------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  1518.887     d.f. =  1     p =  0 
## 
## Pearson's Chi-squared test with Yates' continuity correction 
## ------------------------------------------------------------
## Chi^2 =  1517.813     d.f. =  1     p =  0 
## 
##
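For a 2 x 2 table the output lists two statistics: the plain Pearson test and the version with Yates' continuity correction. A sketch reproducing both from the printed counts with chisq.test(), plus the row-wise share of each income band, which the column proportions above do not show (object names are illustrative):

```r
# Counts copied from the CrossTable output above
sex_tab <- matrix(c( 9592, 1179,
                    15128, 6662),
                  ncol = 2, byrow = TRUE,
                  dimnames = list(sex = c("Female", "Male"),
                                  Income_band = c("0", "1")))

# correct = FALSE gives the plain Pearson statistic;
# correct = TRUE (the default) applies Yates' continuity correction
pearson <- chisq.test(sex_tab, correct = FALSE)$statistic
yates   <- chisq.test(sex_tab, correct = TRUE)$statistic

# Proportion of each sex falling in each income band (rows sum to 1)
row_share <- prop.table(sex_tab, margin = 1)
round(row_share, 3)
```

With these counts, roughly 10.9% of Female records and 30.6% of Male records fall in band 1.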

Variable_11=“capital.gain”

capital.gain is a numerical variable.

head(census_income_data$capital.gain)

## [1] 2174    0    0    0    0    0

univariate analysis of capital.gain

Central tendencies of capital.gain

Mean of capital.gain

mean(census_income_data$capital.gain)

## [1] 1077.649

Median of capital.gain

median(census_income_data$capital.gain)

## [1] 0

Measures of Dispersion of capital.gain

Variance of capital.gain

var(census_income_data$capital.gain)

## [1] 54542539

Standard deviation of capital.gain

sd(census_income_data$capital.gain)

## [1] 7385.292

summary() gives the minimum, the quartiles, the mean and the maximum of capital.gain

summary(census_income_data$capital.gain)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0    1078       0  100000

boxplot of capital.gain

quantile(census_income_data$capital.gain)

##    0%   25%   50%   75%  100% 
##     0     0     0     0 99999

quantile(census_income_data$capital.gain,c(0.75,0.80,0.90,1))

##   75%   80%   90%  100% 
##     0     0     0 99999

boxplot(census_income_data$capital.gain,main="capital.gain")

Output description

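In this boxplot the minimum is 0, the maximum is 99999 (shown rounded as 100000 in the summary output), and the median is 0. The first quartile is 0 and the third quartile is 0. Because even the 90th percentile is 0, the box collapses to a line at 0 and every non-zero value is drawn as a point above it. Outliers are discussed later.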
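The summary() output above prints a maximum of 100000 while quantile() prints 99999. A likely explanation, assuming the R 3.3.1 behaviour of rounding summary values to four significant digits, is plain rounding. The sketch below shows the effect with signif():

```r
max_gain <- 99999            # maximum reported by quantile() above

# Rounding to four significant digits turns 99999 into 100000
rounded_max <- signif(max_gain, digits = 4)
rounded_max

# The mean is affected the same way: 1077.649 is shown as 1078
rounded_mean <- signif(1077.649, digits = 4)
rounded_mean
```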

Histogram of capital.gain

hist(census_income_data$capital.gain)

Correlation between capital.gain and Income_band

library("ltm", lib.loc="C:/Program Files/R/R-3.3.1/library")
biserial.cor(census_income_data$capital.gain,census_income_data$Income_band)

## [1] -0.2233254

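The correlation is -0.2233254. The sign is relative to the first level of Income_band ("0"), so the negative value means that capital.gain tends to be higher in the "1" group. This agrees with the class means in the Naive Bayes output (148.75 for "0" against 4006.14 for "1").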
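To see where the sign comes from, the coefficient can be recomputed by hand from figures that appear elsewhere in this report: the class means of capital.gain and the class proportions from the Naive Bayes output, and the overall standard deviation. This is a sketch under the assumption that biserial.cor() takes the first factor level ("0") as the reference group (its level argument defaults to 1) and divides by the sample standard deviation; under those assumptions the reported value is reproduced.

```r
mean_0 <- 148.7525     # mean capital.gain when Income_band is 0
mean_1 <- 4006.1425    # mean capital.gain when Income_band is 1
p_0    <- 0.7591904    # proportion of rows with Income_band 0
sd_all <- 7385.292     # standard deviation of capital.gain over all rows

# Point-biserial correlation with level "0" as the reference group
r_ref0 <- (mean_0 - mean_1) * sqrt(p_0 * (1 - p_0)) / sd_all

# Same quantity with level "1" as the reference group: only the sign flips
r_ref1 <- (mean_1 - mean_0) * sqrt(p_0 * (1 - p_0)) / sd_all

round(c(reference_0 = r_ref0, reference_1 = r_ref1), 4)
```

A negative value with reference level "0" therefore says that capital.gain is higher on average in the "1" group. Passing level = 2 to biserial.cor() should return the positive version.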

Variable_12=“capital.loss”

capital.loss is a numerical variable.

head(census_income_data$capital.loss)

## [1] 0 0 0 0 0 0

univariate analysis of capital.loss

Central tendencies of capital.loss

Mean of capital.loss

mean(census_income_data$capital.loss)

## [1] 87.30383

Median of capital.loss

median(census_income_data$capital.loss)

## [1] 0

Measures of Dispersion of capital.loss

Variance of capital.loss

var(census_income_data$capital.loss)

## [1] 162376.9

Standard deviation of capital.loss

sd(census_income_data$capital.loss)

## [1] 402.9602

summary() gives the minimum, the quartiles, the mean and the maximum of capital.loss

summary(census_income_data$capital.loss)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     0.0    87.3     0.0  4356.0

Boxplot of capital.loss

quantile(census_income_data$capital.loss)

##   0%  25%  50%  75% 100% 
##    0    0    0    0 4356

quantile(census_income_data$capital.loss,c(0.75,0.80,0.90,1))

##  75%  80%  90% 100% 
##    0    0    0 4356

boxplot(census_income_data$capital.loss,main="capital.loss")

Output description

In this boxplot the minimum is 0, the maximum is 4356, and the median is 0. The first quartile is 0 and the third quartile is 0. Outliers are discussed later.

Histogram of capital.loss

hist(census_income_data$capital.loss)

Correlation between capital.loss and Income_band

library("ltm", lib.loc="C:/Program Files/R/R-3.3.1/library")
biserial.cor(census_income_data$capital.loss,census_income_data$Income_band)

## [1] -0.150524

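The correlation is -0.150524. The sign is relative to the first level of Income_band ("0"), so the negative value means that capital.loss tends to be higher in the "1" group. This agrees with the class means in the Naive Bayes output (53.14 for "0" against 195.00 for "1").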

Variable_13=“hours.per.week”

hours.per.week is a numerical variable.

head(census_income_data$hours.per.week)

## [1] 40 13 40 40 40 40

univariate analysis of hours.per.week

Central tendencies of hours.per.week

Mean of hours.per.week

mean(census_income_data$hours.per.week)

## [1] 40.43746

Median of hours.per.week

median(census_income_data$hours.per.week)

## [1] 40

Measures of Dispersion of hours.per.week

Variance of hours.per.week

var(census_income_data$hours.per.week)

## [1] 152.459

Standard deviation of hours.per.week

sd(census_income_data$hours.per.week)

## [1] 12.34743

summary() gives the minimum, the quartiles, the mean and the maximum of hours.per.week

summary(census_income_data$hours.per.week)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   40.00   40.00   40.44   45.00   99.00

boxplot of hours.per.week

quantile(census_income_data$hours.per.week)

##   0%  25%  50%  75% 100% 
##    1   40   40   45   99

quantile(census_income_data$hours.per.week,c(0.75,0.80,0.90,1))

##  75%  80%  90% 100% 
##   45   48   55   99

boxplot(census_income_data$hours.per.week,main="hours.per.week")

Output description

In this boxplot the minimum is 1, the maximum is 99, and the median is 40. The first quartile is 40 and the third quartile is 45. Outliers are discussed later.
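As a pointer for the outlier discussion: by default boxplot() draws its whiskers to the most extreme points lying within 1.5 times the interquartile range of the box and plots anything beyond as individual points. A sketch of those fences from the quartiles printed above (boxplot() works with hinges, which can differ slightly from quantile() results, so treat the numbers as approximate):

```r
q1 <- 40   # first quartile of hours.per.week
q3 <- 45   # third quartile of hours.per.week

iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

c(lower = lower_fence, upper = upper_fence)
```

Under this rule, values below 32.5 or above 52.5 hours would be drawn as outliers.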

Histogram of hours.per.week

hist(census_income_data$hours.per.week)

Correlation between hours.per.week and Income_band

library("ltm", lib.loc="C:/Program Files/R/R-3.3.1/library")
biserial.cor(census_income_data$hours.per.week,census_income_data$Income_band)

## [1] -0.2296855

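The correlation is -0.2296855. The sign is relative to the first level of Income_band ("0"), so the negative value means that hours.per.week tends to be higher in the "1" group. This agrees with the class means in the Naive Bayes output (38.84 for "0" against 45.47 for "1").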

Variable_14=“native.country”

“native.country” is a categorical variable. The categories are:

United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

Frequency table of native.country

tab<-table(census_income_data$native.country)
tab

## 
##                           ?                    Cambodia 
##                         583                          19 
##                      Canada                       China 
##                         121                          75 
##                    Columbia                        Cuba 
##                          59                          95 
##          Dominican-Republic                     Ecuador 
##                          70                          28 
##                 El-Salvador                     England 
##                         106                          90 
##                      France                     Germany 
##                          29                         137 
##                      Greece                   Guatemala 
##                          29                          64 
##                       Haiti          Holand-Netherlands 
##                          44                           1 
##                    Honduras                        Hong 
##                          13                          20 
##                     Hungary                       India 
##                          13                         100 
##                        Iran                     Ireland 
##                          43                          24 
##                       Italy                     Jamaica 
##                          73                          81 
##                       Japan                        Laos 
##                          62                          18 
##                      Mexico                   Nicaragua 
##                         643                          34 
##  Outlying-US(Guam-USVI-etc)                        Peru 
##                          14                          31 
##                 Philippines                      Poland 
##                         198                          60 
##                    Portugal                 Puerto-Rico 
##                          37                         114 
##                    Scotland                       South 
##                          12                          80 
##                      Taiwan                    Thailand 
##                          51                          18 
##             Trinadad&Tobago               United-States 
##                          19                       29170 
##                     Vietnam                  Yugoslavia 
##                          67                          16

names(tab)

##  [1] " ?"                          " Cambodia"                  
##  [3] " Canada"                     " China"                     
##  [5] " Columbia"                   " Cuba"                      
##  [7] " Dominican-Republic"         " Ecuador"                   
##  [9] " El-Salvador"                " England"                   
## [11] " France"                     " Germany"                   
## [13] " Greece"                     " Guatemala"                 
## [15] " Haiti"                      " Holand-Netherlands"        
## [17] " Honduras"                   " Hong"                      
## [19] " Hungary"                    " India"                     
## [21] " Iran"                       " Ireland"                   
## [23] " Italy"                      " Jamaica"                   
## [25] " Japan"                      " Laos"                      
## [27] " Mexico"                     " Nicaragua"                 
## [29] " Outlying-US(Guam-USVI-etc)" " Peru"                      
## [31] " Philippines"                " Poland"                    
## [33] " Portugal"                   " Puerto-Rico"               
## [35] " Scotland"                   " South"                     
## [37] " Taiwan"                     " Thailand"                  
## [39] " Trinadad&Tobago"            " United-States"             
## [41] " Vietnam"                    " Yugoslavia"

sum(is.na(census_income_data$native.country))

## [1] 0
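Note that is.na() reports 0 even though the frequency table lists 583 rows under "?": the file encodes unknown values as the string " ?" rather than as NA. A minimal sketch of converting that marker to NA at import time, using read.csv()'s na.strings and strip.white arguments on a small inline sample (the three rows are made up for illustration):

```r
# Tiny made-up sample in the same layout as the real file
# (note the space after each comma)
csv_text <- "age,native.country
39, United-States
50, ?
38, Mexico"

sample_data <- read.csv(text = csv_text,
                        strip.white = TRUE,  # drop the leading space in each field
                        na.strings = "?")    # treat "?" as missing

sum(is.na(sample_data$native.country))
```

The same two arguments could be added to the original read.csv() call so that missing countries, work classes and occupations are counted by is.na().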

Mode of native.country

temp <- table(as.vector(census_income_data$native.country))
names(temp)[temp==max(temp)]

## [1] " United-States"

The mode of native.country is “United-States”.

ggplot of native.country

library("ggplot2", lib.loc="C:/Program Files/R/R-3.3.1/library")
qplot(census_income_data$native.country,main="native.country",ylab="count",colour= I("purple"),size=I(4))

library("gmodels", lib.loc="C:/Program Files/R/R-3.3.1/library")
CrossTable(census_income_data$native.country,census_income_data$Income_band,digits=1, prop.r=F, prop.t=F, prop.chisq=F, chisq=T)

## Warning in chisq.test(t, correct = FALSE, ...): Chi-squared approximation
## may be incorrect

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  32561 
## 
##  
##                                   | census_income_data$Income_band 
## census_income_data$native.country |         0 |         1 | Row Total | 
## ----------------------------------|-----------|-----------|-----------|
##                                 ? |       437 |       146 |       583 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                          Cambodia |        12 |         7 |        19 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                            Canada |        82 |        39 |       121 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                             China |        55 |        20 |        75 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                          Columbia |        57 |         2 |        59 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                              Cuba |        70 |        25 |        95 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                Dominican-Republic |        68 |         2 |        70 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                           Ecuador |        24 |         4 |        28 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                       El-Salvador |        97 |         9 |       106 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                           England |        60 |        30 |        90 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                            France |        17 |        12 |        29 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                           Germany |        93 |        44 |       137 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                            Greece |        21 |         8 |        29 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                         Guatemala |        61 |         3 |        64 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                             Haiti |        40 |         4 |        44 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                Holand-Netherlands |         1 |         0 |         1 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                          Honduras |        12 |         1 |        13 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                              Hong |        14 |         6 |        20 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                           Hungary |        10 |         3 |        13 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                             India |        60 |        40 |       100 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                              Iran |        25 |        18 |        43 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                           Ireland |        19 |         5 |        24 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                             Italy |        48 |        25 |        73 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                           Jamaica |        71 |        10 |        81 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                             Japan |        38 |        24 |        62 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                              Laos |        16 |         2 |        18 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                            Mexico |       610 |        33 |       643 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                         Nicaragua |        32 |         2 |        34 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##        Outlying-US(Guam-USVI-etc) |        14 |         0 |        14 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                              Peru |        29 |         2 |        31 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                       Philippines |       137 |        61 |       198 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                            Poland |        48 |        12 |        60 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                          Portugal |        33 |         4 |        37 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                       Puerto-Rico |       102 |        12 |       114 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                          Scotland |         9 |         3 |        12 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                             South |        64 |        16 |        80 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                            Taiwan |        31 |        20 |        51 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                          Thailand |        15 |         3 |        18 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                   Trinadad&Tobago |        17 |         2 |        19 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                     United-States |     21999 |      7171 |     29170 | 
##                                   |       0.9 |       0.9 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                           Vietnam |        62 |         5 |        67 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                        Yugoslavia |        10 |         6 |        16 | 
##                                   |       0.0 |       0.0 |           | 
## ----------------------------------|-----------|-----------|-----------|
##                      Column Total |     24720 |      7841 |     32561 | 
##                                   |       0.8 |       0.2 |           | 
## ----------------------------------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  317.2304     d.f. =  41     p =  2.211386e-44 
## 
## 
##
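The warning "Chi-squared approximation may be incorrect" is raised when some expected cell counts are small, which is the case here because many countries have only a handful of rows. The sketch below uses four rows copied from the table above: it inspects the expected counts and, as one possible alternative, requests a Monte Carlo p-value through chisq.test()'s simulate.p.value argument. The subset and the object names are illustrative only.

```r
country_tab <- matrix(c(    1,    0,
                           14,    0,
                           12,    1,
                        21999, 7171),
                      ncol = 2, byrow = TRUE,
                      dimnames = list(
                        native.country = c("Holand-Netherlands",
                                           "Outlying-US(Guam-USVI-etc)",
                                           "Honduras", "United-States"),
                        Income_band = c("0", "1")))

# Expected counts under independence; small values trigger the warning
expected <- suppressWarnings(chisq.test(country_tab))$expected
round(expected, 2)
any(expected < 5)

# Monte Carlo p-value, which does not rely on the large-sample approximation
set.seed(1)
sim_test <- chisq.test(country_tab, simulate.p.value = TRUE, B = 2000)
sim_test$p.value
```

Another common option is to merge the rare countries into a single "Other" level before testing.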

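Variable_15=“Income_band”

It is a categorical variable and the target (response) variable of the model, not a predictor. The categories are -50000 (income of at most $50K, recoded as 0) and 50000+ (income above $50K, recoded as 1).

Frequency table of Income_band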

tab<-table(census_income_data$Income_band)
tab

## 
##     0     1 
## 24720  7841

names(tab)

## [1] "0" "1"

sum(is.na(census_income_data$Income_band))

## [1] 0

Model Building

Naive Bayes Model

library("e1071", lib.loc="C:/Program Files/R/R-3.3.1/library")

## Warning: package 'e1071' was built under R version 3.3.3

library(class)
# Bare column name on the left-hand side, so that "." expands to the other 14 columns only
Model <- naiveBayes(Income_band ~ ., data = census_income_data)
Model

## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##         0         1 
## 0.7591904 0.2408096 
## 
## Conditional probabilities:
##    age
## Y       [,1]     [,2]
##   0 36.78374 14.02009
##   1 44.24984 10.51903
## 
##    workclass
## Y              ?  Federal-gov    Local-gov  Never-worked      Private
##   0 0.0665453074 0.0238268608 0.0597087379  0.0002831715 0.7173543689
##   1 0.0243591379 0.0473153934 0.0786889427  0.0000000000 0.6329549802
##    workclass
## Y    Self-emp-inc  Self-emp-not-inc    State-gov  Without-pay
##   0  0.0199838188      0.0735032362 0.0382281553 0.0005663430
##   1  0.0793266165      0.0923351613 0.0450197679 0.0000000000
## 
##    fnlwgt
## Y       [,1]     [,2]
##   0 190340.9 106482.3
##   1 188005.0 102541.8
## 
##    education
## Y           10th         11th         12th      1st-4th      5th-6th
##   0 0.0352346278 0.0451051780 0.0161812298 0.0065533981 0.0128236246
##   1 0.0079071547 0.0076520852 0.0042086469 0.0007652085 0.0020405561
##    education
## Y        7th-8th          9th   Assoc-acdm    Assoc-voc    Bachelors
##   0 0.0245145631 0.0197006472 0.0324433657 0.0413025890 0.1267799353
##   1 0.0051013901 0.0034434383 0.0337967096 0.0460400459 0.2832546869
##    education
## Y      Doctorate      HS-grad      Masters    Preschool  Prof-school
##   0 0.0043284790 0.3570388350 0.0309061489 0.0020631068 0.0061893204
##   1 0.0390256345 0.2136207116 0.1223058283 0.0000000000 0.0539472006
##    education
## Y    Some-college
##   0  0.2388349515
##   1  0.1768907027
## 
##    education.num
## Y        [,1]     [,2]
##   0  9.595065 2.436147
##   1 11.611657 2.385129
## 
##    marital.status
## Y      Divorced  Married-AF-spouse  Married-civ-spouse
##   0 0.161003236        0.000525890         0.335113269
##   1 0.059048591        0.001275348         0.853462569
##    marital.status
## Y    Married-spouse-absent  Never-married   Separated     Widowed
##   0            0.015533981    0.412297735 0.038794498 0.036731392
##   1            0.004336182    0.062619564 0.008417294 0.010840454
## 
##    occupation
## Y              ?  Adm-clerical  Armed-Forces  Craft-repair
##   0 0.0668284790  0.1319983819  0.0003236246  0.1282362460
##   1 0.0243591379  0.0646601199  0.0001275348  0.1184797857
##    occupation
## Y    Exec-managerial  Farming-fishing  Handlers-cleaners
##   0     0.0848705502     0.0355582524       0.0519417476
##   1     0.2509883943     0.0146664966       0.0109679888
##    occupation
## Y    Machine-op-inspct  Other-service  Priv-house-serv  Prof-specialty
##   0       0.0708737864   0.1277508091     0.0059870550    0.0922734628
##   1       0.0318836883   0.0174722612     0.0001275348    0.2370871062
##    occupation
## Y    Protective-serv        Sales  Tech-support  Transport-moving
##   0     0.0177184466 0.1078883495  0.0260922330      0.0516585761
##   1     0.0269098329 0.1253666624  0.0360923352      0.0408111210
## 
##    relationship
## Y       Husband  Not-in-family  Other-relative   Own-child   Unmarried
##   0 0.294296117    0.301334951     0.038187702 0.202305825 0.130582524
##   1 0.754750670    0.109169749     0.004718786 0.008544828 0.027802576
##    relationship
## Y          Wife
##   0 0.033292880
##   1 0.095013391
## 
##    race
## Y    Amer-Indian-Eskimo  Asian-Pac-Islander       Black       Other
##   0         0.011124595         0.030865696 0.110720065 0.009951456
##   1         0.004591251         0.035199592 0.049355949 0.003188369
##    race
## Y         White
##   0 0.837338188
##   1 0.907664839
## 
##    sex
## Y      Female      Male
##   0 0.3880259 0.6119741
##   1 0.1503635 0.8496365
## 
##    capital.gain
## Y        [,1]       [,2]
##   0  148.7525   963.1393
##   1 4006.1425 14570.3790
## 
##    capital.loss
## Y        [,1]     [,2]
##   0  53.14292 310.7558
##   1 195.00153 595.4876
## 
##    hours.per.week
## Y       [,1]     [,2]
##   0 38.84021 12.31899
##   1 45.47303 11.01297
## 
##    native.country
## Y              ?     Cambodia       Canada        China     Columbia
##   0 1.767799e-02 4.854369e-04 3.317152e-03 2.224919e-03 2.305825e-03
##   1 1.862007e-02 8.927433e-04 4.973855e-03 2.550695e-03 2.550695e-04
##    native.country
## Y           Cuba  Dominican-Republic      Ecuador  El-Salvador
##   0 2.831715e-03        2.750809e-03 9.708738e-04 3.923948e-03
##   1 3.188369e-03        2.550695e-04 5.101390e-04 1.147813e-03
##    native.country
## Y        England       France      Germany       Greece    Guatemala
##   0 2.427184e-03 6.877023e-04 3.762136e-03 8.495146e-04 2.467638e-03
##   1 3.826043e-03 1.530417e-03 5.611529e-03 1.020278e-03 3.826043e-04
##    native.country
## Y          Haiti  Holand-Netherlands     Honduras         Hong
##   0 1.618123e-03        4.045307e-05 4.854369e-04 5.663430e-04
##   1 5.101390e-04        0.000000e+00 1.275348e-04 7.652085e-04
##    native.country
## Y        Hungary        India         Iran      Ireland        Italy
##   0 4.045307e-04 2.427184e-03 1.011327e-03 7.686084e-04 1.941748e-03
##   1 3.826043e-04 5.101390e-03 2.295626e-03 6.376738e-04 3.188369e-03
##    native.country
## Y        Jamaica        Japan         Laos       Mexico    Nicaragua
##   0 2.872168e-03 1.537217e-03 6.472492e-04 2.467638e-02 1.294498e-03
##   1 1.275348e-03 3.060834e-03 2.550695e-04 4.208647e-03 2.550695e-04
##    native.country
## Y    Outlying-US(Guam-USVI-etc)         Peru  Philippines       Poland
##   0                5.663430e-04 1.173139e-03 5.542071e-03 1.941748e-03
##   1                0.000000e+00 2.550695e-04 7.779620e-03 1.530417e-03
##    native.country
## Y       Portugal  Puerto-Rico     Scotland        South       Taiwan
##   0 1.334951e-03 4.126214e-03 3.640777e-04 2.588997e-03 1.254045e-03
##   1 5.101390e-04 1.530417e-03 3.826043e-04 2.040556e-03 2.550695e-03
##    native.country
## Y       Thailand  Trinadad&Tobago  United-States      Vietnam   Yugoslavia
##   0 6.067961e-04     6.877023e-04   8.899272e-01 2.508091e-03 4.045307e-04
##   1 3.826043e-04     2.550695e-04   9.145517e-01 6.376738e-04 7.652085e-04

Logistic Regression Model

Logistic regression is a classification algorithm used to predict a binary outcome (1 / 0, Yes / No, True / False) from a set of independent variables. It can be viewed as an extension of linear regression to a categorical outcome: the log of the odds of the outcome, rather than the outcome itself, is modeled as a linear function of the predictors. In simple words, it predicts the probability that an event occurs by fitting the data to a logit function.
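
To make the link between log-odds and probability concrete, here is a minimal sketch in base R. The function names `logit` and `inv_logit` and the values in `eta` are chosen only for illustration and do not come from the census model.

```r
# Logit (log-odds) of a probability and its inverse, using base R only
logit     <- function(p)   log(p / (1 - p))
inv_logit <- function(eta) 1 / (1 + exp(-eta))

# Hypothetical linear predictor values (log-odds), chosen for illustration
eta  <- c(-2, 0, 2)
prob <- inv_logit(eta)
prob

# Converting the probabilities back recovers the log-odds
logit(prob)
```

A log-odds value of 0 corresponds to a probability of 0.5. Base R also ships these two functions as `qlogis()` and `plogis()`.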

LogisticModel<- glm(census_income_data$Income_band~.,family=binomial,data=census_income_data)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(LogisticModel)

## 
## Call:
## glm(formula = census_income_data$Income_band ~ ., family = binomial, 
##     data = census_income_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.0885  -0.5044  -0.1822  -0.0251   3.7656  
## 
## Coefficients: (2 not defined because of singularities)
##                                             Estimate Std. Error z value
## (Intercept)                               -9.074e+00  4.405e-01 -20.601
## age                                        2.552e-02  1.651e-03  15.460
## workclass Federal-gov                      1.097e+00  1.538e-01   7.131
## workclass Local-gov                        4.118e-01  1.403e-01   2.934
## workclass Never-worked                    -1.045e+01  2.722e+02  -0.038
## workclass Private                          5.944e-01  1.252e-01   4.746
## workclass Self-emp-inc                     7.694e-01  1.497e-01   5.140
## workclass Self-emp-not-inc                 1.037e-01  1.371e-01   0.756
## workclass State-gov                        2.835e-01  1.518e-01   1.868
## workclass Without-pay                     -1.221e+01  1.985e+02  -0.062
## fnlwgt                                     7.072e-07  1.720e-07   4.111
## education 11th                             8.500e-02  2.107e-01   0.403
## education 12th                             4.891e-01  2.644e-01   1.850
## education 1st-4th                         -5.322e-01  4.895e-01  -1.087
## education 5th-6th                         -2.386e-01  3.248e-01  -0.735
## education 7th-8th                         -4.755e-01  2.320e-01  -2.050
## education 9th                             -1.939e-01  2.612e-01  -0.743
## education Assoc-acdm                       1.336e+00  1.763e-01   7.574
## education Assoc-voc                        1.352e+00  1.694e-01   7.981
## education Bachelors                        1.936e+00  1.575e-01  12.296
## education Doctorate                        2.989e+00  2.142e-01  13.954
## education HS-grad                          8.134e-01  1.534e-01   5.302
## education Masters                          2.289e+00  1.679e-01  13.631
## education Preschool                       -2.109e+01  3.665e+02  -0.058
## education Prof-school                      2.793e+00  2.002e-01  13.955
## education Some-college                     1.159e+00  1.556e-01   7.447
## education.num                                     NA         NA      NA
## marital.status Married-AF-spouse           2.686e+00  5.538e-01   4.849
## marital.status Married-civ-spouse          2.206e+00  2.654e-01   8.312
## marital.status Married-spouse-absent      -1.097e-02  2.298e-01  -0.048
## marital.status Never-married              -4.825e-01  8.751e-02  -5.513
## marital.status Separated                  -1.334e-01  1.641e-01  -0.813
## marital.status Widowed                     1.284e-01  1.538e-01   0.835
## occupation Adm-clerical                    1.095e-01  9.919e-02   1.104
## occupation Armed-Forces                   -1.061e+00  1.543e+00  -0.688
## occupation Craft-repair                    1.816e-01  8.487e-02   2.140
## occupation Exec-managerial                 8.965e-01  8.724e-02  10.276
## occupation Farming-fishing                -8.826e-01  1.420e-01  -6.214
## occupation Handlers-cleaners              -5.698e-01  1.458e-01  -3.907
## occupation Machine-op-inspct              -1.724e-01  1.062e-01  -1.624
## occupation Other-service                  -7.152e-01  1.245e-01  -5.746
## occupation Priv-house-serv                -4.018e+00  1.664e+00  -2.415
## occupation Prof-specialty                  6.251e-01  9.365e-02   6.675
## occupation Protective-serv                 6.864e-01  1.304e-01   5.265
## occupation Sales                           3.909e-01  9.015e-02   4.336
## occupation Tech-support                    7.657e-01  1.194e-01   6.415
## occupation Transport-moving                       NA         NA      NA
## relationship Not-in-family                 5.695e-01  2.627e-01   2.168
## relationship Other-relative               -3.729e-01  2.427e-01  -1.536
## relationship Own-child                    -6.601e-01  2.600e-01  -2.539
## relationship Unmarried                     4.411e-01  2.786e-01   1.583
## relationship Wife                          1.363e+00  1.026e-01  13.282
## race Asian-Pac-Islander                    6.650e-01  2.697e-01   2.465
## race Black                                 3.940e-01  2.332e-01   1.690
## race Other                                 1.736e-01  3.537e-01   0.491
## race White                                 5.728e-01  2.217e-01   2.584
## sex Male                                   8.618e-01  7.918e-02  10.883
## capital.gain                               3.193e-04  1.031e-05  30.968
## capital.loss                               6.474e-04  3.714e-05  17.431
## hours.per.week                             2.970e-02  1.622e-03  18.316
## native.country Cambodia                    1.482e+00  6.336e-01   2.338
## native.country Canada                      5.170e-01  2.952e-01   1.751
## native.country China                      -5.080e-01  3.943e-01  -1.288
## native.country Columbia                   -1.930e+00  8.242e-01  -2.342
## native.country Cuba                        5.339e-01  3.373e-01   1.583
## native.country Dominican-Republic         -1.643e+00  1.049e+00  -1.566
## native.country Ecuador                    -9.442e-02  7.292e-01  -0.129
## native.country El-Salvador                -4.230e-01  4.952e-01  -0.854
## native.country England                     4.954e-01  3.335e-01   1.486
## native.country France                      7.730e-01  5.289e-01   1.462
## native.country Germany                     6.197e-01  2.843e-01   2.179
## native.country Greece                     -7.982e-01  5.657e-01  -1.411
## native.country Guatemala                  -6.358e-02  7.625e-01  -0.083
## native.country Haiti                       1.359e-01  6.850e-01   0.198
## native.country Holand-Netherlands         -1.024e+01  8.827e+02  -0.012
## native.country Honduras                   -1.086e+00  2.356e+00  -0.461
## native.country Hong                        8.706e-02  6.810e-01   0.128
## native.country Hungary                     7.262e-02  7.759e-01   0.094
## native.country India                      -1.895e-01  3.284e-01  -0.577
## native.country Iran                        2.341e-01  4.508e-01   0.519
## native.country Ireland                     7.198e-01  6.448e-01   1.116
## native.country Italy                       9.944e-01  3.447e-01   2.885
## native.country Jamaica                     2.285e-01  4.631e-01   0.493
## native.country Japan                       5.794e-01  4.214e-01   1.375
## native.country Laos                       -4.209e-01  8.630e-01  -0.488
## native.country Mexico                     -3.643e-01  2.551e-01  -1.428
## native.country Nicaragua                  -6.151e-01  8.040e-01  -0.765
## native.country Outlying-US(Guam-USVI-etc) -1.208e+01  2.098e+02  -0.058
## native.country Peru                       -6.498e-01  8.559e-01  -0.759
## native.country Philippines                 6.104e-01  2.810e-01   2.173
## native.country Poland                      1.820e-01  4.216e-01   0.432
## native.country Portugal                    1.542e-01  6.332e-01   0.243
## native.country Puerto-Rico                -1.483e-01  4.041e-01  -0.367
## native.country Scotland                    1.905e-01  7.892e-01   0.241
## native.country South                      -8.819e-01  4.414e-01  -1.998
## native.country Taiwan                      2.248e-01  4.724e-01   0.476
## native.country Thailand                   -3.784e-01  8.356e-01  -0.453
## native.country Trinadad&Tobago            -1.977e-01  8.709e-01  -0.227
## native.country United-States               3.815e-01  1.380e-01   2.764
## native.country Vietnam                    -9.593e-01  6.150e-01  -1.560
## native.country Yugoslavia                  8.720e-01  6.824e-01   1.278
##                                           Pr(>|z|)    
## (Intercept)                                < 2e-16 ***
## age                                        < 2e-16 ***
## workclass Federal-gov                     9.99e-13 ***
## workclass Local-gov                        0.00334 ** 
## workclass Never-worked                     0.96936    
## workclass Private                         2.08e-06 ***
## workclass Self-emp-inc                    2.74e-07 ***
## workclass Self-emp-not-inc                 0.44954    
## workclass State-gov                        0.06173 .  
## workclass Without-pay                      0.95095    
## fnlwgt                                    3.93e-05 ***
## education 11th                             0.68670    
## education 12th                             0.06435 .  
## education 1st-4th                          0.27696    
## education 5th-6th                          0.46255    
## education 7th-8th                          0.04039 *  
## education 9th                              0.45771    
## education Assoc-acdm                      3.63e-14 ***
## education Assoc-voc                       1.45e-15 ***
## education Bachelors                        < 2e-16 ***
## education Doctorate                        < 2e-16 ***
## education HS-grad                         1.15e-07 ***
## education Masters                          < 2e-16 ***
## education Preschool                        0.95410    
## education Prof-school                      < 2e-16 ***
## education Some-college                    9.52e-14 ***
## education.num                                   NA    
## marital.status Married-AF-spouse          1.24e-06 ***
## marital.status Married-civ-spouse          < 2e-16 ***
## marital.status Married-spouse-absent       0.96192    
## marital.status Never-married              3.52e-08 ***
## marital.status Separated                   0.41647    
## marital.status Widowed                     0.40350    
## occupation Adm-clerical                    0.26955    
## occupation Armed-Forces                    0.49174    
## occupation Craft-repair                    0.03239 *  
## occupation Exec-managerial                 < 2e-16 ***
## occupation Farming-fishing                5.16e-10 ***
## occupation Handlers-cleaners              9.33e-05 ***
## occupation Machine-op-inspct               0.10429    
## occupation Other-service                  9.12e-09 ***
## occupation Priv-house-serv                 0.01572 *  
## occupation Prof-specialty                 2.46e-11 ***
## occupation Protective-serv                1.40e-07 ***
## occupation Sales                          1.45e-05 ***
## occupation Tech-support                   1.41e-10 ***
## occupation Transport-moving                     NA    
## relationship Not-in-family                 0.03015 *  
## relationship Other-relative                0.12442    
## relationship Own-child                     0.01111 *  
## relationship Unmarried                     0.11338    
## relationship Wife                          < 2e-16 ***
## race Asian-Pac-Islander                    0.01369 *  
## race Black                                 0.09106 .  
## race Other                                 0.62365    
## race White                                 0.00978 ** 
## sex Male                                   < 2e-16 ***
## capital.gain                               < 2e-16 ***
## capital.loss                               < 2e-16 ***
## hours.per.week                             < 2e-16 ***
## native.country Cambodia                    0.01936 *  
## native.country Canada                      0.07989 .  
## native.country China                       0.19766    
## native.country Columbia                    0.01919 *  
## native.country Cuba                        0.11349    
## native.country Dominican-Republic          0.11735    
## native.country Ecuador                     0.89697    
## native.country El-Salvador                 0.39301    
## native.country England                     0.13735    
## native.country France                      0.14385    
## native.country Germany                     0.02931 *  
## native.country Greece                      0.15824    
## native.country Guatemala                   0.93354    
## native.country Haiti                       0.84275    
## native.country Holand-Netherlands          0.99074    
## native.country Honduras                    0.64493    
## native.country Hong                        0.89827    
## native.country Hungary                     0.92543    
## native.country India                       0.56390    
## native.country Iran                        0.60364    
## native.country Ireland                     0.26424    
## native.country Italy                       0.00392 ** 
## native.country Jamaica                     0.62170    
## native.country Japan                       0.16914    
## native.country Laos                        0.62575    
## native.country Mexico                      0.15325    
## native.country Nicaragua                   0.44424    
## native.country Outlying-US(Guam-USVI-etc)  0.95407    
## native.country Peru                        0.44772    
## native.country Philippines                 0.02981 *  
## native.country Poland                      0.66608    
## native.country Portugal                    0.80763    
## native.country Puerto-Rico                 0.71362    
## native.country Scotland                    0.80929    
## native.country South                       0.04573 *  
## native.country Taiwan                      0.63409    
## native.country Thailand                    0.65062    
## native.country Trinadad&Tobago             0.82041    
## native.country United-States               0.00570 ** 
## native.country Vietnam                     0.11884    
## native.country Yugoslavia                  0.20131    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 35948  on 32560  degrees of freedom
## Residual deviance: 20565  on 32462  degrees of freedom
## AIC: 20763
## 
## Number of Fisher Scoring iterations: 13
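
The estimates in this summary are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. The sketch below uses three estimates copied by hand from the output above. In a live session with the fitted model, `exp(coef(LogisticModel))` should give the same quantities for every term.

```r
# Estimates copied from the summary output above (log-odds scale)
log_odds <- c(age              = 2.552e-02,
              "sex Male"       = 8.618e-01,
              hours.per.week   = 2.970e-02)

# Exponentiate to obtain odds ratios
odds_ratios <- exp(log_odds)
round(odds_ratios, 3)
```

Read this way, and holding the other predictors fixed, the odds of being in class 1 for `sex Male` are roughly 2.4 times those of the reference level.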

Classification Table

library(caret)  # provides sensitivity() and specificity()

## Warning: package 'caret' was built under R version 3.3.3

## Loading required package: lattice

library(rattle)

## Warning: package 'rattle' was built under R version 3.3.3

## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.

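# Predict class 1 when the fitted probability exceeds the threshold
threshold <- 0.5
predicted_values <- ifelse(predict(LogisticModel, type = "response") > threshold, 1, 0)
actual_values <- LogisticModel$y
# Rows are predicted classes, columns are actual classes
conf_matrix <- table(predicted_values, actual_values)
conf_matrix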

##                 actual_values
## predicted_values     0     1
##                0 23037  3093
##                1  1683  4748

sensitivity(conf_matrix)

## [1] 0.9319175

specificity(conf_matrix)

## [1] 0.605535
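
As far as I know, caret's `sensitivity()` and `specificity()` treat the first row level of the table as the positive class unless `positive` is supplied, so the 0.93 reported above is the recall for class 0, not class 1. The sketch below rebuilds the confusion matrix from the printed counts and computes the per-class recall in base R to make that explicit. The names `recall_class0` and `recall_class1` are illustrative.

```r
# Confusion matrix rebuilt from the counts printed above (threshold = 0.5)
conf_matrix <- as.table(matrix(
  c(23037, 1683, 3093, 4748), nrow = 2,
  dimnames = list(predicted_values = c("0", "1"),
                  actual_values    = c("0", "1"))
))

# Recall per actual class: correct predictions divided by the column total
recall_class0 <- conf_matrix["0", "0"] / sum(conf_matrix[, "0"])
recall_class1 <- conf_matrix["1", "1"] / sum(conf_matrix[, "1"])
round(c(class0 = recall_class0, class1 = recall_class1), 4)
```

If class 1 is the class of interest, calling `sensitivity(conf_matrix, positive = "1")` should report the second figure instead.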

Logistic Regression Accuracy

accuracy1<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy1

## [1] 0.8533215
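
Accuracy is easier to judge next to a baseline. The sketch below, again built from the printed counts, compares the model's accuracy with the accuracy of always predicting the majority class. The `baseline` name is an illustrative addition, not part of the original analysis.

```r
# Confusion matrix rebuilt from the counts printed above (threshold = 0.5)
conf_matrix <- as.table(matrix(
  c(23037, 1683, 3093, 4748), nrow = 2,
  dimnames = list(predicted_values = c("0", "1"),
                  actual_values    = c("0", "1"))
))

# Accuracy: correctly classified cases (the diagonal) over all cases
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)

# Baseline: always predict the most frequent actual class
baseline <- max(colSums(conf_matrix)) / sum(conf_matrix)

round(c(accuracy = accuracy, baseline = baseline), 4)
```

Because about 76% of the rows belong to class 0, the model's 85% accuracy is a gain of roughly nine percentage points over the trivial classifier.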

Changing the Threshold Value

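# A stricter threshold: predict class 1 only when the fitted probability exceeds 0.8
threshold <- 0.8
predicted_values <- ifelse(predict(LogisticModel, type = "response") > threshold, 1, 0)
actual_values <- LogisticModel$y
conf_matrix <- table(predicted_values, actual_values)
conf_matrix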

##                 actual_values
## predicted_values     0     1
##                0 24495  5663
##                1   225  2178

sensitivity(conf_matrix)

## [1] 0.9908981

specificity(conf_matrix)

## [1] 0.2777707

accuracy2<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy2

## [1] 0.8191702
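
Raising the threshold trades recall on class 1 for recall on class 0. To compare several thresholds at once, the steps above can be wrapped in a small helper. This is only a sketch: `threshold_metrics` is a hypothetical function, and `toy_prob` and `toy_actual` are invented values, not predictions from the census model.

```r
# Compute accuracy and per-class recall for one threshold
threshold_metrics <- function(prob, actual, threshold) {
  # Fix the factor levels so the table is always 2 x 2
  predicted <- factor(ifelse(prob > threshold, 1, 0), levels = c(0, 1))
  actual    <- factor(actual, levels = c(0, 1))
  cm <- table(predicted, actual)
  c(threshold = threshold,
    accuracy  = sum(diag(cm)) / sum(cm),
    recall_0  = cm["0", "0"] / sum(cm[, "0"]),
    recall_1  = cm["1", "1"] / sum(cm[, "1"]))
}

# Toy data, invented for illustration only
toy_prob   <- c(0.05, 0.20, 0.35, 0.55, 0.60, 0.75, 0.85, 0.95)
toy_actual <- c(0,    0,    0,    0,    1,    1,    1,    1)

# One row of metrics per threshold
results <- t(sapply(c(0.3, 0.5, 0.8), threshold_metrics,
                    prob = toy_prob, actual = toy_actual))
results
```

With the real model, `predict(LogisticModel, type = "response")` and `LogisticModel$y` would take the place of the toy vectors.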

Multicollinearity

library(car)

## Warning: package 'car' was built under R version 3.3.3
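
The model summary reports two coefficients as `NA` "because of singularities". For `education.num` the likely reason is that it is fully determined by `education`, so it adds no information once the education dummies are in the model. The sketch below reproduces that situation on toy data. The data frame `toy` and its random outcome are invented for illustration, and only the three education codes are taken from the `head()` output earlier in the document.

```r
set.seed(1)  # reproducible toy outcome

# Toy data: education.num is a one-to-one recoding of education
edu_codes <- c("11th" = 7, "HS-grad" = 9, "Bachelors" = 13)
toy <- data.frame(education = factor(rep(names(edu_codes), each = 20)))
toy$education.num <- unname(edu_codes[as.character(toy$education)])
toy$y <- rbinom(nrow(toy), size = 1, prob = 0.5)  # random 0/1 outcome

fit <- glm(y ~ education + education.num, family = binomial, data = toy)

coef(fit)["education.num"]  # NA: not estimable once education is in the model
alias(fit)$Complete         # expresses education.num through the other terms
```

Dropping one of the two columns before fitting removes the singularity without changing the fitted probabilities.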

alias(LogisticModel, scale = FALSE)

## Model :
## census_income_data$Income_band ~ age + workclass + fnlwgt + education + 
##     education.num + marital.status + occupation + relationship + 
##     race + sex + capital.gain + capital.loss + hours.per.week + 
##     native.country
## 
## Complete :
##                             (Intercept) age workclass Federal-gov
## education.num                6           0   0                   
## occupation Transport-moving  0           0   1                   
##                             workclass Local-gov workclass Never-worked
## education.num                0                   0                    
## occupation Transport-moving  1                   0                    
##                             workclass Private workclass Self-emp-inc
## education.num                0                 0                    
## occupation Transport-moving  1                 1                    
##                             workclass Self-emp-not-inc workclass State-gov
## education.num                0                          0                 
## occupation Transport-moving  1                          1                 
##                             workclass Without-pay fnlwgt education 11th
## education.num                0                     0      1            
## occupation Transport-moving  1                     0      0            
##                             education 12th education 1st-4th
## education.num                2             -4               
## occupation Transport-moving  0              0               
##                             education 5th-6th education 7th-8th
## education.num               -3                -2               
## occupation Transport-moving  0                 0               
##                             education 9th education Assoc-acdm
## education.num               -1             6                  
## occupation Transport-moving  0             0                  
##                             education Assoc-voc education Bachelors
## education.num                5                   7                 
## occupation Transport-moving  0                   0                 
##                             education Doctorate education HS-grad
## education.num               10                   3               
## occupation Transport-moving  0                   0               
##                             education Masters education Preschool
## education.num                8                -5                 
## occupation Transport-moving  0                 0                 
##                             education Prof-school education Some-college
## education.num                9                     4                    
## occupation Transport-moving  0                     0                    
##                             marital.status Married-AF-spouse
## education.num                0                              
## occupation Transport-moving  0                              
##                             marital.status Married-civ-spouse
## education.num                0                               
## occupation Transport-moving  0                               
##                             marital.status Married-spouse-absent
## education.num                0                                  
## occupation Transport-moving  0                                  
##                             marital.status Never-married
## education.num                0                          
## occupation Transport-moving  0                          
##                             marital.status Separated
## education.num                0                      
## occupation Transport-moving  0                      
##                             marital.status Widowed occupation Adm-clerical
## education.num                0                      0                     
## occupation Transport-moving  0                     -1                     
##                             occupation Armed-Forces
## education.num                0                     
## occupation Transport-moving -1                     
##                             occupation Craft-repair
## education.num                0                     
## occupation Transport-moving -1                     
##                             occupation Exec-managerial
## education.num                0                        
## occupation Transport-moving -1                        
##                             occupation Farming-fishing
## education.num                0                        
## occupation Transport-moving -1                        
##                             occupation Handlers-cleaners
## education.num                0                          
## occupation Transport-moving -1                          
##                             occupation Machine-op-inspct
## education.num                0                          
## occupation Transport-moving -1                          
##                             occupation Other-service
## education.num                0                      
## occupation Transport-moving -1                      
##                             occupation Priv-house-serv
## education.num                0                        
## occupation Transport-moving -1                        
##                             occupation Prof-specialty
## education.num                0                       
## occupation Transport-moving -1                       
##                             occupation Protective-serv occupation Sales
## education.num                0                          0              
## occupation Transport-moving -1                         -1              
##                             occupation Tech-support
## education.num                0                     
## occupation Transport-moving -1                     
##                             relationship Not-in-family
## education.num                0                        
## occupation Transport-moving  0                        
##                             relationship Other-relative
## education.num                0                         
## occupation Transport-moving  0                         
##                             relationship Own-child relationship Unmarried
## education.num                0                      0                    
## occupation Transport-moving  0                      0                    
##                             relationship Wife race Asian-Pac-Islander
## education.num                0                 0                     
## occupation Transport-moving  0                 0                     
##                             race Black race Other race White sex Male
## education.num                0          0          0          0      
## occupation Transport-moving  0          0          0          0      
##                             capital.gain capital.loss hours.per.week
## education.num                0            0            0            
## occupation Transport-moving  0            0            0            
##                             native.country Cambodia native.country Canada
## education.num                0                       0                   
## occupation Transport-moving  0                       0                   
##                             native.country China native.country Columbia
## education.num                0                    0                     
## occupation Transport-moving  0                    0                     
##                             native.country Cuba
## education.num                0                 
## occupation Transport-moving  0                 
##                             native.country Dominican-Republic
## education.num                0                               
## occupation Transport-moving  0                               
##                             native.country Ecuador
## education.num                0                    
## occupation Transport-moving  0                    
##                             native.country El-Salvador
## education.num                0                        
## occupation Transport-moving  0                        
##                             native.country England native.country France
## education.num                0                      0                   
## occupation Transport-moving  0                      0                   
##                             native.country Germany native.country Greece
## education.num                0                      0                   
## occupation Transport-moving  0                      0                   
##                             native.country Guatemala native.country Haiti
## education.num                0                        0                  
## occupation Transport-moving  0                        0                  
##                             native.country Holand-Netherlands
## education.num                0                               
## occupation Transport-moving  0                               
##                             native.country Honduras native.country Hong
## education.num                0                       0                 
## occupation Transport-moving  0                       0                 
##                             native.country Hungary native.country India
## education.num                0                      0                  
## occupation Transport-moving  0                      0                  
##                             native.country Iran native.country Ireland
## education.num                0                   0                    
## occupation Transport-moving  0                   0                    
##                             native.country Italy native.country Jamaica
## education.num                0                    0                    
## occupation Transport-moving  0                    0                    
##                             native.country Japan native.country Laos
## education.num                0                    0                 
## occupation Transport-moving  0                    0                 
##                             native.country Mexico native.country Nicaragua
## education.num                0                     0                      
## occupation Transport-moving  0                     0                      
##                             native.country Outlying-US(Guam-USVI-etc)
## education.num                0                                       
## occupation Transport-moving  0                                       
##                             native.country Peru native.country Philippines
## education.num                0                   0                        
## occupation Transport-moving  0                   0                        
##                             native.country Poland native.country Portugal
## education.num                0                     0                     
## occupation Transport-moving  0                     0                     
##                             native.country Puerto-Rico
## education.num                0                        
## occupation Transport-moving  0                        
##                             native.country Scotland native.country South
## education.num                0                       0                  
## occupation Transport-moving  0                       0                  
##                             native.country Taiwan native.country Thailand
## education.num                0                     0                     
## occupation Transport-moving  0                     0                     
##                             native.country Trinadad&Tobago
## education.num                0                            
## occupation Transport-moving  0                            
##                             native.country United-States
## education.num                0                          
## occupation Transport-moving  0                          
##                             native.country Vietnam
## education.num                0                    
## occupation Transport-moving  0                    
##                             native.country Yugoslavia
## education.num                0                       
## occupation Transport-moving  0
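
The Complete block above lists the coefficients that could not be estimated because their columns are exact linear combinations of other columns. The education.num row, for example, reads as 6 + 1*(education 11th) + 2*(education 12th) - 4*(education 1st-4th) and so on, which means the numeric code is fully determined by the education factor. A minimal sketch of the same effect on made-up data (all variable names and values below are hypothetical, not taken from the census file):

```r
# Toy data: a factor and a numeric code that is fully determined by it.
# All names and values here are illustrative assumptions.
set.seed(1)
edu     <- factor(rep(c("HS", "BSc", "MSc"), each = 4))
edu_num <- unname(c(HS = 9, BSc = 13, MSc = 14)[as.character(edu)])
y       <- rnorm(length(edu))

fit <- lm(y ~ edu + edu_num)

# edu_num adds no information beyond the edu dummies, so its coefficient is NA
coef(fit)

# alias() reports how the dropped column is built from the retained ones
alias(fit)$Complete
```

Dropping one of the two redundant predictors (here, either education or education.num) would remove the aliasing without changing the fitted values.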

Individual Impact of Variables

library(caret)
varImp(LogisticModel, scale = FALSE)

##                                               Overall
## age                                       15.45978788
## workclass Federal-gov                      7.13059046
## workclass Local-gov                        2.93418870
## workclass Never-worked                     0.03841036
## workclass Private                          4.74589511
## workclass Self-emp-inc                     5.14025091
## workclass Self-emp-not-inc                 0.75618502
## workclass State-gov                        1.86821835
## workclass Without-pay                      0.06151125
## fnlwgt                                     4.11138462
## education 11th                             0.40333184
## education 12th                             1.84977469
## education 1st-4th                          1.08718108
## education 5th-6th                          0.73466116
## education 7th-8th                          2.04970529
## education 9th                              0.74262221
## education Assoc-acdm                       7.57369082
## education Assoc-voc                        7.98077808
## education Bachelors                       12.29618486
## education Doctorate                       13.95408957
## education HS-grad                          5.30193981
## education Masters                         13.63108622
## education Preschool                        0.05755434
## education Prof-school                     13.95472810
## education Some-college                     7.44740391
## marital.status Married-AF-spouse           4.84934535
## marital.status Married-civ-spouse          8.31190433
## marital.status Married-spouse-absent       0.04774463
## marital.status Never-married               5.51319049
## marital.status Separated                   0.81255767
## marital.status Widowed                     0.83538341
## occupation Adm-clerical                    1.10410539
## occupation Armed-Forces                    0.68754204
## occupation Craft-repair                    2.13953789
## occupation Exec-managerial                10.27577676
## occupation Farming-fishing                 6.21422641
## occupation Handlers-cleaners               3.90733480
## occupation Machine-op-inspct               1.62441813
## occupation Other-service                   5.74635490
## occupation Priv-house-serv                 2.41532499
## occupation Prof-specialty                  6.67544897
## occupation Protective-serv                 5.26524365
## occupation Sales                           4.33622786
## occupation Tech-support                    6.41476040
## relationship Not-in-family                 2.16813179
## relationship Other-relative                1.53649487
## relationship Own-child                     2.53917050
## relationship Unmarried                     1.58317029
## relationship Wife                         13.28213390
## race Asian-Pac-Islander                    2.46542508
## race Black                                 1.68985002
## race Other                                 0.49068506
## race White                                 2.58353199
## sex Male                                  10.88310426
## capital.gain                              30.96796328
## capital.loss                              17.43065618
## hours.per.week                            18.31600453
## native.country Cambodia                    2.33844906
## native.country Canada                      1.75133472
## native.country China                       1.28823300
## native.country Columbia                    2.34173064
## native.country Cuba                        1.58270581
## native.country Dominican-Republic          1.56598250
## native.country Ecuador                     0.12948876
## native.country El-Salvador                 0.85417585
## native.country England                     1.48574758
## native.country France                      1.46160799
## native.country Germany                     2.17930460
## native.country Greece                      1.41099951
## native.country Guatemala                   0.08338756
## native.country Haiti                       0.19838015
## native.country Holand-Netherlands          0.01160294
## native.country Honduras                    0.46082025
## native.country Hong                        0.12784747
## native.country Hungary                     0.09359322
## native.country India                       0.57705903
## native.country Iran                        0.51917871
## native.country Ireland                     1.11642112
## native.country Italy                       2.88471317
## native.country Jamaica                     0.49343756
## native.country Japan                       1.37495795
## native.country Laos                        0.48771250
## native.country Mexico                      1.42813057
## native.country Nicaragua                   0.76505483
## native.country Outlying-US(Guam-USVI-etc)  0.05760036
## native.country Peru                        0.75922390
## native.country Philippines                 2.17262061
## native.country Poland                      0.43154102
## native.country Portugal                    0.24348140
## native.country Puerto-Rico                 0.36700309
## native.country Scotland                    0.24133728
## native.country South                       1.99791145
## native.country Taiwan                      0.47598183
## native.country Thailand                    0.45289950
## native.country Trinadad&Tobago             0.22701988
## native.country United-States               2.76431028
## native.country Vietnam                     1.55965850
## native.country Yugoslavia                  1.27784231
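
For linear and generalized linear models, caret's documentation describes the importance score as the absolute value of each coefficient's test statistic, so the table above is another view of the coefficient summary rather than a separate measure. A quick check with two values copied from the outputs above (the helper name p_from_z is hypothetical):

```r
# Two-sided p-value implied by a z statistic (hypothetical helper name)
p_from_z <- function(z) 2 * pnorm(-abs(z))

# "Overall" importances copied from the varImp() output above
z_germany <- 2.17930460
z_italy   <- 2.88471317

p_germany <- p_from_z(z_germany)
p_italy   <- p_from_z(z_italy)

# Compare with the Pr(>|z|) column of the model summary
# (0.02931 for Germany, 0.00392 for Italy)
round(c(Germany = p_germany, Italy = p_italy), 5)
```

Because the scores are per-coefficient, a factor such as native.country appears once per level rather than once per variable.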

AIC and BIC

library(stats)
AIC(LogisticModel)

## [1] 20763.02

BIC(LogisticModel)

## [1] 21593.72
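
Both criteria penalise the same fit differently: AIC = deviance + 2k and BIC = deviance + k*log(n), where k is the number of estimated coefficients and n the number of rows. For a 0/1 response the residual deviance equals minus twice the log-likelihood, so the two values can be reproduced by hand. A sketch using only numbers printed in the model summary above:

```r
n        <- 32561          # rows in census_income_data
resid_df <- 32462          # residual degrees of freedom from summary()
k        <- n - resid_df   # estimated coefficients, intercept included
aic      <- 20763.02       # value returned by AIC(LogisticModel)

deviance_hat <- aic - 2 * k                # backs out the residual deviance
bic          <- deviance_hat + k * log(n)  # BIC uses log(n) instead of 2

c(k = k, deviance = deviance_hat, BIC = bic)
```

The result agrees with the reported BIC up to rounding. Since log(32561) is roughly 10.4, BIC charges about five times more per coefficient than AIC on this dataset.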

SVM Model

SVM is another black-box method in machine learning. Compared with the other algorithms used here, it takes a quite different approach to learning: it looks for the boundary that separates the two classes with the largest margin.

library(e1071)
svm<- svm(census_income_data$Income_band~.,type="C", kernel="linear",data=census_income_data)
summary(svm)

## 
## Call:
## svm(formula = census_income_data$Income_band ~ ., data = census_income_data, 
##     type = "C", kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
##       gamma:  0.00990099 
## 
## Number of Support Vectors:  11152
## 
##  ( 5585 5567 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

There are 11,152 support vectors (5,585 and 5,567 for the two classes), and the SVM type is C-classification.

Confusion Matrix

library(caret)
svm_predicted<-predict(svm)
confusionMatrix(svm_predicted,census_income_data$Income_band)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 23189  3266
##          1  1531  4575
##                                           
##                Accuracy : 0.8527          
##                  95% CI : (0.8488, 0.8565)
##     No Information Rate : 0.7592          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5642          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9381          
##             Specificity : 0.5835          
##          Pos Pred Value : 0.8765          
##          Neg Pred Value : 0.7493          
##              Prevalence : 0.7592          
##          Detection Rate : 0.7122          
##    Detection Prevalence : 0.8125          
##       Balanced Accuracy : 0.7608          
##                                           
##        'Positive' Class : 0               
##
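
The summary statistics follow directly from the four cell counts. Because the positive class is 0, sensitivity describes the lower income band and specificity describes the higher one. A base-R sketch that recomputes them from the counts printed above (variable names are ours):

```r
# Counts copied from the SVM confusion matrix above
# rows = prediction, columns = reference (matrix() fills column by column)
cm <- matrix(c(23189, 1531, 3266, 4575), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

n           <- sum(cm)
accuracy    <- sum(diag(cm)) / n
sensitivity <- cm["0", "0"] / sum(cm[, "0"])   # positive class is 0
specificity <- cm["1", "1"] / sum(cm[, "1"])
balanced    <- (sensitivity + specificity) / 2

# Cohen's kappa: agreement beyond what the marginal totals alone would give
expected  <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_hat <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced = balanced,
        kappa = kappa_hat), 4)
```

Read this way, the model labels about 94% of the lower band correctly but identifies only about 58% of the higher band, which the overall accuracy alone does not show.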

Decision Tree Model

A decision tree organizes a series of questions and their possible answers into a hierarchical structure of nodes and directed edges.

library(rpart)

## Warning: package 'rpart' was built under R version 3.3.3

library(tree)

## Warning: package 'tree' was built under R version 3.3.3

names(census_income_data)

##  [1] "age"            "workclass"      "fnlwgt"         "education"     
##  [5] "education.num"  "marital.status" "occupation"     "relationship"  
##  [9] "race"           "sex"            "capital.gain"   "capital.loss"  
## [13] "hours.per.week" "native.country" "Income_band"

income_tree<-rpart(census_income_data$Income_band~.,method="class", control=rpart.control(minsplit=30), data=census_income_data)
income_tree

## n= 32561 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 32561 7841 0 (0.75919044 0.24080956)  
##    2) relationship= Not-in-family, Other-relative, Own-child, Unmarried 17800 1178 0 (0.93382022 0.06617978)  
##      4) capital.gain< 7073.5 17482  872 0 (0.95012012 0.04987988) *
##      5) capital.gain>=7073.5 318   12 1 (0.03773585 0.96226415) *
##    3) relationship= Husband, Wife 14761 6663 0 (0.54860782 0.45139218)  
##      6) education= 10th, 11th, 12th, 1st-4th, 5th-6th, 7th-8th, 9th, Assoc-acdm, Assoc-voc, HS-grad, Preschool, Some-college 10329 3456 0 (0.66540807 0.33459193)  
##       12) capital.gain< 5095.5 9807 2944 0 (0.69980626 0.30019374) *
##       13) capital.gain>=5095.5 522   10 1 (0.01915709 0.98084291) *
##      7) education= Bachelors, Doctorate, Masters, Prof-school 4432 1225 1 (0.27639892 0.72360108) *
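
The printed tree can be read as a set of nested if/else rules. The sketch below writes those rules out by hand to make the structure explicit; the function name is hypothetical, and the trimws() calls assume the raw factor levels carry a leading space, as the printout suggests.

```r
# Hand-written version of the printed rpart rules (illustrative only).
# Returns 1 for the higher income band and 0 otherwise.
predict_income_rule <- function(relationship, education, capital_gain) {
  relationship <- trimws(relationship)  # assumed: levels have a leading space
  education    <- trimws(education)
  degree <- c("Bachelors", "Doctorate", "Masters", "Prof-school")

  if (!relationship %in% c("Husband", "Wife")) {
    # node 2: split on capital.gain at 7073.5 (nodes 4 and 5)
    return(as.integer(capital_gain >= 7073.5))
  }
  if (education %in% degree) {
    return(1L)                          # node 7
  }
  as.integer(capital_gain >= 5095.5)    # nodes 12 and 13
}

predict_income_rule(" Husband", " Masters", 0)
predict_income_rule(" Own-child", " HS-grad", 0)
```

Only three of the fourteen predictors appear in these rules, which matches the "Variables actually used in tree construction" line reported by printcp().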

library(rattle)
library(rpart.plot)

## Warning: package 'rpart.plot' was built under R version 3.3.3

fancyRpartPlot(income_tree)

printcp(income_tree)

## 
## Classification tree:
## rpart(formula = census_income_data$Income_band ~ ., data = census_income_data, 
##     method = "class", control = rpart.control(minsplit = 30))
## 
## Variables actually used in tree construction:
## [1] capital.gain education    relationship
## 
## Root node error: 7841/32561 = 0.24081
## 
## n= 32561 
## 
##         CP nsplit rel error  xerror      xstd
## 1 0.126387      0   1.00000 1.00000 0.0098399
## 2 0.064022      2   0.74723 0.74723 0.0088402
## 3 0.037495      3   0.68320 0.68320 0.0085321
## 4 0.010000      4   0.64571 0.64571 0.0083394

plotcp(income_tree)

Prediction using the model

library(caret)
sample_pred<-predict(income_tree, type="class")
conf_matrix<-table(sample_pred,census_income_data$Income_band)
conf_matrix

##            
## sample_pred     0     1
##           0 23473  3816
##           1  1247  4025

accuracy3<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy3

## [1] 0.8445072
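
The same accuracy can be read off the printcp() table: rel error is the training misclassification count relative to the root node, so accuracy = 1 - root node error * rel error. A quick check with the numbers printed above:

```r
root_error <- 7841 / 32561   # root node error from printcp()
rel_error  <- 0.64571        # last row of the CP table (4 splits)

accuracy_from_cp <- 1 - root_error * rel_error

# The same figure from the confusion matrix: off-diagonal cells are errors
errors           <- 3816 + 1247
accuracy_from_cm <- 1 - errors / 32561

c(from_cp_table = accuracy_from_cp, from_conf_matrix = accuracy_from_cm)
```

The two values differ only in the last digits because rel error is printed to five decimals.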

Prune the Decision Tree

library(rpart)
library(tree)
income_tree1<-rpart(census_income_data$Income_band~.,method="class", control=rpart.control(minsplit=30, cp=0.037), data=census_income_data)
income_tree1

## n= 32561 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 32561 7841 0 (0.75919044 0.24080956)  
##    2) relationship= Not-in-family, Other-relative, Own-child, Unmarried 17800 1178 0 (0.93382022 0.06617978)  
##      4) capital.gain< 7073.5 17482  872 0 (0.95012012 0.04987988) *
##      5) capital.gain>=7073.5 318   12 1 (0.03773585 0.96226415) *
##    3) relationship= Husband, Wife 14761 6663 0 (0.54860782 0.45139218)  
##      6) education= 10th, 11th, 12th, 1st-4th, 5th-6th, 7th-8th, 9th, Assoc-acdm, Assoc-voc, HS-grad, Preschool, Some-college 10329 3456 0 (0.66540807 0.33459193)  
##       12) capital.gain< 5095.5 9807 2944 0 (0.69980626 0.30019374) *
##       13) capital.gain>=5095.5 522   10 1 (0.01915709 0.98084291) *
##      7) education= Bachelors, Doctorate, Masters, Prof-school 4432 1225 1 (0.27639892 0.72360108) *

library(rattle)
library(rpart.plot)
fancyRpartPlot(income_tree1)

printcp(income_tree1)

## 
## Classification tree:
## rpart(formula = census_income_data$Income_band ~ ., data = census_income_data, 
##     method = "class", control = rpart.control(minsplit = 30, 
##         cp = 0.037))
## 
## Variables actually used in tree construction:
## [1] capital.gain education    relationship
## 
## Root node error: 7841/32561 = 0.24081
## 
## n= 32561 
## 
##         CP nsplit rel error  xerror      xstd
## 1 0.126387      0   1.00000 1.00000 0.0098399
## 2 0.064022      2   0.74723 0.74723 0.0088402
## 3 0.037495      3   0.68320 0.68320 0.0085321
## 4 0.037000      4   0.64571 0.65553 0.0083908

plotcp(income_tree1)

library(caret)
sample_pred1<-predict(income_tree1, type="class")
conf_matrix<-table(sample_pred1,census_income_data$Income_band)
conf_matrix

##             
## sample_pred1     0     1
##            0 23473  3816
##            1  1247  4025

accuracy4<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy4

## [1] 0.8445072
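
In this run, cp = 0.037 sits just below the CP value of 0.037495 in the table, so the refit keeps all four splits and the confusion matrix and accuracy are unchanged. An alternative to refitting is rpart's prune(), which cuts back an existing tree at a chosen cp. A minimal sketch on the kyphosis data that ships with rpart (the dataset and the rule of picking the cp with the lowest xerror are assumptions for illustration, not the procedure used above):

```r
library(rpart)

set.seed(123)  # xerror comes from internal cross-validation, so fix the seed
full_tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                   method = "class",
                   control = rpart.control(minsplit = 10, cp = 0.001))

# Pick the cp with the lowest cross-validated error from the CP table
cp_table <- full_tree$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Cut the existing tree back instead of refitting it
pruned_tree <- prune(full_tree, cp = best_cp)

c(nodes_before = nrow(full_tree$frame), nodes_after = nrow(pruned_tree$frame))
```

With the census tree, a cp above 0.037495 would be needed before any split is removed.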

Train and Validation datasets

library(caret)
sampledata <- createDataPartition(census_income_data$Income_band, p=0.80, list=FALSE)
train_new <- census_income_data[sampledata,]
hold_out <- census_income_data[-sampledata,]

Overfitting

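Model on training data

Note that the call below passes data=census_income_data, so the tree is fit on all 32,561 rows and train_new is used only for prediction. The hold-out rows are therefore not unseen by the model, and the validation accuracy reported later is not a true out-of-sample estimate.

library(rpart)
library(tree)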
income_tree<-rpart(Income_band~.,method="class", control=rpart.control(minsplit=30,cp=0.05), data=census_income_data)
income_tree

## n= 32561 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 32561 7841 0 (0.75919044 0.24080956)  
##    2) relationship= Not-in-family, Other-relative, Own-child, Unmarried 17800 1178 0 (0.93382022 0.06617978) *
##    3) relationship= Husband, Wife 14761 6663 0 (0.54860782 0.45139218)  
##      6) education= 10th, 11th, 12th, 1st-4th, 5th-6th, 7th-8th, 9th, Assoc-acdm, Assoc-voc, HS-grad, Preschool, Some-college 10329 3456 0 (0.66540807 0.33459193)  
##       12) capital.gain< 5095.5 9807 2944 0 (0.69980626 0.30019374) *
##       13) capital.gain>=5095.5 522   10 1 (0.01915709 0.98084291) *
##      7) education= Bachelors, Doctorate, Masters, Prof-school 4432 1225 1 (0.27639892 0.72360108) *

library(rattle)
library(tree)
sample_pred<-predict(income_tree, train_new,type="class")

conf_matrix<-table(sample_pred,train_new$Income_band)
conf_matrix

##            
## sample_pred     0     1
##           0 18790  3300
##           1   986  2973

accuracy5<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy5

## [1] 0.8354639

Model Validation

Validation accuracy

library(rattle)
hold_out$pred<- predict(income_tree, hold_out,type="class")
conf_matrix_val<-table(hold_out$pred,hold_out$Income_band)
conf_matrix_val

##    
##        0    1
##   0 4695  822
##   1  249  746

accuracy_val<-(conf_matrix_val[1,1]+conf_matrix_val[2,2])/(sum(conf_matrix_val))
accuracy_val

## [1] 0.8355344
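
For a hold-out accuracy to measure generalisation, the tree has to be fit on the training partition only. A sketch of that pattern, using base R's sample() for the split and the kyphosis data from rpart as a stand-in (both are assumptions; with the census data the same idea would mean passing train_new as the data argument):

```r
library(rpart)

set.seed(2024)
n_rows    <- nrow(kyphosis)
train_idx <- sample(n_rows, size = floor(0.8 * n_rows))
train_set <- kyphosis[train_idx, ]
hold_set  <- kyphosis[-train_idx, ]

# Fit on the training rows only, so the hold-out rows stay unseen
tree_fit <- rpart(Kyphosis ~ Age + Number + Start, data = train_set,
                  method = "class", control = rpart.control(minsplit = 10))

# Share of rows whose predicted class matches the observed class
acc <- function(fit, newdata) {
  pred <- predict(fit, newdata, type = "class")
  mean(pred == newdata$Kyphosis)
}

c(train = acc(tree_fit, train_set), hold_out = acc(tree_fit, hold_set))
```

A training accuracy well above the hold-out accuracy is the usual sign of overfitting; near-identical values suggest the tree is not overfit.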

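ROC and AUC

The model fitted below is a logistic regression (glm), so the ROC curve and the AUC of 0.9089 describe the logistic model, not the rpart tree.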

library(pROC)

## Warning: package 'pROC' was built under R version 3.3.3

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following object is masked from 'package:gmodels':
## 
##     ci

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

income_glm<-glm(census_income_data$Income_band~.,family=binomial(),data=census_income_data)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

predicted_prob<-predict(income_glm,type="response")
roccurve <- roc(income_glm$y, predicted_prob)
plot(roccurve)

auc(roccurve)

## Area under the curve: 0.9089

auc(income_glm$y, predicted_prob)

## Area under the curve: 0.9089
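
The AUC has a direct reading: it is the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative one. A small base-R sketch of that definition via ranks (the function name and the four-point toy data are hypothetical; pROC's auc() remains the tool used above):

```r
# AUC via the rank-sum (Mann-Whitney) identity; ties get average ranks
auc_by_rank <- function(y, score) {
  r     <- rank(score)
  n_pos <- sum(y == 1)
  n_neg <- sum(y == 0)
  (sum(r[y == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

y     <- c(0, 0, 1, 1)
score <- c(0.1, 0.4, 0.35, 0.8)

# 3 of the 4 positive/negative pairs are ordered correctly
auc_by_rank(y, score)
```

On that reading, an AUC of 0.9089 means the model ranks a higher-income case above a lower-income one in roughly 91% of such pairs.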

k-fold Cross Validation building

Divide the whole dataset into k equal parts. Use the kth part as the holdout sample and the remaining k-1 parts as training data. Repeat this k times, building k models. The average error on the holdout samples gives an estimate of the testing error.
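
The procedure can be written out by hand in a few lines. The sketch below is a minimal base-R version on a two-class subset of the built-in iris data with a one-predictor logistic model; the dataset, model and variable names are assumptions chosen to keep the example small, and caret automates the same loop.

```r
set.seed(42)

# Two-class toy problem: versicolor vs virginica
dat   <- subset(iris, Species != "setosa")
dat$y <- as.integer(dat$Species == "virginica")

k        <- 10
fold_id  <- sample(rep(seq_len(k), length.out = nrow(dat)))  # random fold labels
fold_acc <- numeric(k)

for (i in seq_len(k)) {
  train_rows <- dat[fold_id != i, ]   # k-1 parts for training
  test_rows  <- dat[fold_id == i, ]   # 1 part held out

  fit  <- glm(y ~ Sepal.Length, family = binomial(), data = train_rows)
  prob <- predict(fit, newdata = test_rows, type = "response")

  fold_acc[i] <- mean((prob > 0.5) == (test_rows$y == 1))
}

cv_accuracy <- mean(fold_acc)  # average hold-out accuracy over the k folds
cv_accuracy
```

Every row is held out exactly once, so the averaged accuracy uses all of the data for testing without ever scoring a row with a model that was trained on it.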

K=10

library(caret)
train_dat <- trainControl(method="cv", number=10)
train_dat

## $method
## [1] "cv"
## 
## $number
## [1] 10
## 
## $repeats
## [1] 1
## 
## $search
## [1] "grid"
## 
## $p
## [1] 0.75
## 
## $initialWindow
## NULL
## 
## $horizon
## [1] 1
## 
## $fixedWindow
## [1] TRUE
## 
## $skip
## [1] 0
## 
## $verboseIter
## [1] FALSE
## 
## $returnData
## [1] TRUE
## 
## $returnResamp
## [1] "final"
## 
## $savePredictions
## [1] FALSE
## 
## $classProbs
## [1] FALSE
## 
## $summaryFunction
## function (data, lev = NULL, model = NULL) 
## {
##     if (is.character(data$obs)) 
##         data$obs <- factor(data$obs, levels = lev)
##     postResample(data[, "pred"], data[, "obs"])
## }
## <environment: namespace:caret>
## 
## $selectionFunction
## [1] "best"
## 
## $preProcOptions
## $preProcOptions$thresh
## [1] 0.95
## 
## $preProcOptions$ICAcomp
## [1] 3
## 
## $preProcOptions$k
## [1] 5
## 
## $preProcOptions$freqCut
## [1] 19
## 
## $preProcOptions$uniqueCut
## [1] 10
## 
## $preProcOptions$cutoff
## [1] 0.9
## 
## 
## $sampling
## NULL
## 
## $index
## NULL
## 
## $indexOut
## NULL
## 
## $indexFinal
## NULL
## 
## $timingSamps
## [1] 0
## 
## $predictionBounds
## [1] FALSE FALSE
## 
## $seeds
## [1] NA
## 
## $adaptive
## $adaptive$min
## [1] 5
## 
## $adaptive$alpha
## [1] 0.05
## 
## $adaptive$method
## [1] "gls"
## 
## $adaptive$complete
## [1] TRUE
## 
## 
## $trim
## [1] FALSE
## 
## $allowParallel
## [1] TRUE

names(census_income_data)

##  [1] "age"            "workclass"      "fnlwgt"         "education"     
##  [5] "education.num"  "marital.status" "occupation"     "relationship"  
##  [9] "race"           "sex"            "capital.gain"   "capital.loss"  
## [13] "hours.per.week" "native.country" "Income_band"

census_income_data$Income_band<-as.factor(census_income_data$Income_band)

Building the models on K-fold samples

library(e1071)
K_fold_tree<-train(Income_band~., method="rpart", trControl=train_dat, control=rpart.control(minsplit=10, cp=0.000001),  data=census_income_data)
K_fold_tree

## CART 
## 
## 32561 samples
##    14 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 29305, 29305, 29305, 29305, 29305, 29305, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa    
##   0.03685754  0.8379656  0.4988152
##   0.06453259  0.8240837  0.4422865
##   0.12492029  0.7870774  0.2043005
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was cp = 0.03685754.
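
The "Summary of sample sizes" line above follows from how 10-fold cross-validation splits the rows: each model is trained on nine folds and tested on the remaining one. The base R sketch below is only an illustration under simple assumptions; it uses plain random folds and an arbitrary seed, whereas caret builds its own folds internally, so the exact fold membership will differ.

```r
# Illustration only: plain random 10-fold assignment in base R.
# caret creates its folds internally; this is not its implementation.
set.seed(1)   # arbitrary seed, chosen for this sketch
n <- 32561    # number of rows in census_income_data
k <- 10       # number of folds

# Give every row a fold label 1..k, in shuffled order
fold_id <- sample(rep(seq_len(k), length.out = n))

# Each fold is held out once; the model is trained on all other rows
fold_sizes  <- as.vector(table(fold_id))
train_sizes <- n - fold_sizes
train_sizes
```

Because 32561 is not divisible by 10, one fold holds 3257 rows and the others 3256, so every training set has 29304 or 29305 rows, in line with the sizes reported by caret.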

K_fold_tree$finalModel

## n= 32561 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 32561 7841 0 (0.75919044 0.24080956)  
##    2) marital.status Married-civ-spouse< 0.5 17585 1149 0 (0.93466022 0.06533978) *
##    3) marital.status Married-civ-spouse>=0.5 14976 6692 0 (0.55315171 0.44684829)  
##      6) education.num< 12.5 10507 3478 0 (0.66898258 0.33101742)  
##       12) capital.gain< 5095.5 9979 2961 0 (0.70327688 0.29672312) *
##       13) capital.gain>=5095.5 528   11 1 (0.02083333 0.97916667) *
##      7) education.num>=12.5 4469 1255 1 (0.28082345 0.71917655) *
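
The printed tree has only four terminal nodes, so it can be read as a short chain of rules. The sketch below writes those rules as a plain R function. The function name and its arguments are hypothetical and are not part of caret or rpart; `married_civ_spouse` stands for the dummy variable `marital.status Married-civ-spouse` being 1.

```r
# Hypothetical helper that mirrors the splits of the printed tree.
# Returns "0" (<=50K) or "1" (>50K) for each input record.
predict_income_band <- function(married_civ_spouse, education_num, capital_gain) {
  ifelse(!married_civ_spouse, "0",            # node 2
    ifelse(education_num >= 12.5, "1",        # node 7
      ifelse(capital_gain >= 5095.5, "1",     # node 13
             "0")))                           # node 12
}

# Four made-up records, one for each terminal node
example_pred <- predict_income_band(
  married_civ_spouse = c(FALSE, TRUE, TRUE, TRUE),
  education_num      = c(16, 13, 9, 9),
  capital_gain       = c(0, 0, 7000, 0)
)
example_pred
```

The four made-up records fall into nodes 2, 7, 13 and 12, so the result is "0", "1", "1", "0".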

library(rattle)
fancyRpartPlot(K_fold_tree$finalModel)

Kfold_pred<-predict(K_fold_tree)

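The caret package provides the confusionMatrix() function, which compares the predicted classes with the actual ones. Here the predictions are made for the same records the model was trained on.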

conf_matrix<-confusionMatrix(Kfold_pred,census_income_data$Income_band)
conf_matrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 23454  4110
##          1  1266  3731
##                                           
##                Accuracy : 0.8349          
##                  95% CI : (0.8308, 0.8389)
##     No Information Rate : 0.7592          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4846          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9488          
##             Specificity : 0.4758          
##          Pos Pred Value : 0.8509          
##          Neg Pred Value : 0.7466          
##              Prevalence : 0.7592          
##          Detection Rate : 0.7203          
##    Detection Prevalence : 0.8465          
##       Balanced Accuracy : 0.7123          
##                                           
##        'Positive' Class : 0               
##
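
The headline statistics in this output can be recomputed by hand from the four cell counts. The following base R sketch does that, treating class 0 as the positive class as caret does here. The object name `cm` and the other variable names are chosen for this example only.

```r
# Confusion matrix counts copied from the output above
# (rows = predicted class, columns = actual class)
cm <- matrix(c(23454, 1266, 4110, 3731), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)          # correct predictions / all records
sensitivity <- cm["0", "0"] / sum(cm[, "0"])    # actual 0 predicted as 0
specificity <- cm["1", "1"] / sum(cm[, "1"])    # actual 1 predicted as 1
pos_pred    <- cm["0", "0"] / sum(cm["0", ])    # predicted 0 that are really 0
neg_pred    <- cm["1", "1"] / sum(cm["1", ])    # predicted 1 that are really 1
balanced    <- (sensitivity + specificity) / 2

round(c(Accuracy = accuracy, Sensitivity = sensitivity,
        Specificity = specificity, PosPredValue = pos_pred,
        NegPredValue = neg_pred, BalancedAccuracy = balanced), 4)
```

The low specificity (0.4758) shows that the tree finds fewer than half of the people who actually earn more than 50K, even though overall accuracy is about 83%.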

Bootstrap

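Bootstrapping is another resampling method for estimating the accuracy of a model. Each resample draws rows from the data with replacement, and the model is evaluated on the rows that were left out of that resample.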

library(caret)
train_control <- trainControl(method="boot", number=10)
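
To make the idea of a bootstrap resample concrete, here is a minimal base R sketch of a single resample. It is an illustration under simple assumptions (an arbitrary seed and only row indices, no model fitting), not what caret runs internally.

```r
# Illustration only: one bootstrap resample of row indices
set.seed(42)   # arbitrary seed, chosen for this sketch
n <- 32561     # number of rows in census_income_data

# Draw n row numbers with replacement: some rows repeat, others are never drawn
boot_idx <- sample(n, size = n, replace = TRUE)

# Rows never drawn are "out of bag" and can be used to test the model
oob_idx <- setdiff(seq_len(n), boot_idx)

share_in_bag <- length(unique(boot_idx)) / n
share_in_bag
```

For a data set of this size, roughly 63% of the distinct rows end up in a resample and the remaining 37% or so are left out of it.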

Tree model on bootstrapped data

Boot_Strap_model <- train(Income_band~., method="rpart", trControl=train_control, control=rpart.control(minsplit=10, cp=0.000001),  data=census_income_data)
names(census_income_data)

##  [1] "age"            "workclass"      "fnlwgt"         "education"     
##  [5] "education.num"  "marital.status" "occupation"     "relationship"  
##  [9] "race"           "sex"            "capital.gain"   "capital.loss"  
## [13] "hours.per.week" "native.country" "Income_band"

Boot_Strap_model$finalModel

## n= 32561 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 32561 7841 0 (0.75919044 0.24080956)  
##    2) marital.status Married-civ-spouse< 0.5 17585 1149 0 (0.93466022 0.06533978) *
##    3) marital.status Married-civ-spouse>=0.5 14976 6692 0 (0.55315171 0.44684829)  
##      6) education.num< 12.5 10507 3478 0 (0.66898258 0.33101742)  
##       12) capital.gain< 5095.5 9979 2961 0 (0.70327688 0.29672312) *
##       13) capital.gain>=5095.5 528   11 1 (0.02083333 0.97916667) *
##      7) education.num>=12.5 4469 1255 1 (0.28082345 0.71917655) *

Boot_Strap_predictions <- predict(Boot_Strap_model)
conf_matrix<-confusionMatrix(Boot_Strap_predictions,census_income_data$Income_band)
conf_matrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 23454  4110
##          1  1266  3731
##                                           
##                Accuracy : 0.8349          
##                  95% CI : (0.8308, 0.8389)
##     No Information Rate : 0.7592          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4846          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9488          
##             Specificity : 0.4758          
##          Pos Pred Value : 0.8509          
##          Neg Pred Value : 0.7466          
##              Prevalence : 0.7592          
##          Detection Rate : 0.7203          
##    Detection Prevalence : 0.8465          
##       Balanced Accuracy : 0.7123          
##                                           
##        'Positive' Class : 0               
##

Conclusion

n= 32561

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 32561 7841 0 (0.75919044 0.24080956)
   2) marital.status Married-civ-spouse< 0.5 17585 1149 0 (0.93466022 0.06533978) *
   3) marital.status Married-civ-spouse>=0.5 14976 6692 0 (0.55315171 0.44684829)
     6) education.num< 12.5 10507 3478 0 (0.66898258 0.33101742)
      12) capital.gain< 5095.5 9979 2961 0 (0.70327688 0.29672312) *
      13) capital.gain>=5095.5 528 11 1 (0.02083333 0.97916667) *
     7) education.num>=12.5 4469 1255 1 (0.28082345 0.71917655) *

The root node (node 1) contains all 32,561 records in the data. It is labelled 0 (<=50K), and its loss, the number of records that do not match the label, is 7,841 (the records that are >50K). So 76% of the population earn $50,000 or less (<=50K) and 24% earn more than $50,000 (>50K).

Node 2 is marital.status Married-civ-spouse< 0.5, that is, people whose marital status is not Married-civ-spouse. It has 17,585 records and is labelled <=50K; the loss is the 1,149 records that are >50K. In this node 93% earn <=50K and 7% earn >50K.

Node 3 is marital.status Married-civ-spouse>=0.5. It has 14,976 records and is labelled <=50K; the loss is the 6,692 records that are >50K. In this node 55% earn <=50K and 45% earn >50K.

Node 6 is education.num< 12.5. It has 10,507 records and is labelled <=50K; the loss is the 3,478 records that are >50K. In this node 67% earn <=50K and 33% earn >50K.

Node 12 is capital.gain< 5095.5. It has 9,979 records and is labelled <=50K; the loss is the 2,961 records that are >50K. In this node 70% earn <=50K and 30% earn >50K.

Node 13 is capital.gain>=5095.5. It has 528 records and is labelled >50K; the loss is the 11 records that are <=50K. In this node 2% earn <=50K and 98% earn >50K.

Node 7 is education.num>=12.5. It has 4,469 records and is labelled >50K; the loss is the 1,255 records that are <=50K. In this node 28% earn <=50K and 72% earn >50K.
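
The percentages quoted for each node come directly from its n and loss columns: the share of records matching the node label is 1 - loss/n. As a quick check, the sketch below recomputes them in base R from the numbers in the tree printout. The data frame and column names are chosen for this example only.

```r
# n, loss and label (yval) of every node, copied from the tree printout
nodes <- data.frame(
  node = c(1, 2, 3, 6, 12, 13, 7),
  n    = c(32561, 17585, 14976, 10507, 9979, 528, 4469),
  loss = c(7841, 1149, 6692, 3478, 2961, 11, 1255),
  yval = c("0", "0", "0", "0", "0", "1", "1")
)

# Share of records earning >50K: loss/n when the label is 0, 1 - loss/n when it is 1
nodes$share_over_50k <- ifelse(nodes$yval == "1",
                               1 - nodes$loss / nodes$n,
                               nodes$loss / nodes$n)
nodes$share_upto_50k <- 1 - nodes$share_over_50k

# The terminal nodes (2, 12, 13, 7) together cover every record exactly once
terminal <- nodes[nodes$node %in% c(2, 12, 13, 7), ]
sum(terminal$n)

round(nodes$share_over_50k, 2)
```

The losses of the terminal nodes also add up to the misclassified records in the confusion matrix: 1149 + 2961 = 4110 records that are >50K but predicted 0, and 11 + 1255 = 1266 records that are <=50K but predicted 1.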

Confusion Matrix and Accuracy

          Reference
Prediction     0     1
         0 23454  4110
         1  1266  3731

               Accuracy : 0.8349
                 95% CI : (0.8308, 0.8389)
    No Information Rate : 0.7592
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.4846
 Mcnemar's Test P-Value : < 2.2e-16

            Sensitivity : 0.9488
            Specificity : 0.4758
         Pos Pred Value : 0.8509
         Neg Pred Value : 0.7466
             Prevalence : 0.7592
         Detection Rate : 0.7203
   Detection Prevalence : 0.8465
      Balanced Accuracy : 0.7123

       'Positive' Class : 0

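On the whole data set the tree classifies about 83% of the 32,561 records correctly (accuracy 0.8349), compared with a no-information rate of 76% from always predicting <=50K.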
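
Accuracy is easier to judge next to two other numbers from the same table: the no-information rate (the accuracy of always predicting the larger class) and Kappa (agreement beyond what chance would give). As a rough sketch, both can be recomputed in base R from the cell counts; the variable names are chosen for this example only.

```r
# Confusion matrix counts (rows = predicted class, columns = actual class)
cm <- matrix(c(23454, 1266, 4110, 3731), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))
total <- sum(cm)

# Accuracy of always predicting the most frequent actual class
no_information_rate <- max(colSums(cm)) / total

# Cohen's kappa: observed agreement compared with agreement expected by chance
observed    <- sum(diag(cm)) / total
expected    <- sum(rowSums(cm) * colSums(cm)) / total^2
cohen_kappa <- (observed - expected) / (1 - expected)

round(c(NoInformationRate = no_information_rate, Kappa = cohen_kappa), 4)
```

The accuracy of 0.8349 is about 7.6 percentage points above the no-information rate of 0.7592, and the Kappa of about 0.48 indicates that the agreement is well short of perfect once chance is taken into account.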