You can download the datasets and R code file for this session here.
Pre-requisites: R Programming, Machine learning
Problem Statement
The car company need to come up with the best pricing strategy for their new car such that, the price is neither too low nor too high. Build a model to predict the optimal price of a car using the car features.
Data Exploration
Loading the Data File
Since, our dataset is in .txt format, we should use the following code to read the file. In this file, the header information is not provided. Therefore, header=FALSE.
automobileData <- read.csv("D:\\Dv Analytics\\Datasets\\Automobile Data Set\\AutoDataset.csv")
If the data file is in web, we can directly read it from web using the following command
webData <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"), header=FALSE)
dim() function is used to find out the dimension of the dataset.
dim(automobileData)
## [1] 205 26
So, we have 205 samples of data and 26 variables. Use names() function to find out the variable names.
names(automobileData)
## [1] "symboling" "normalized.losses" "make"
## [4] "fuel.type" "aspiration" "num.of.doors"
## [7] "body.style" "drive.wheels" "engine.location"
## [10] "wheel.base" "length" "width"
## [13] "height" "curb.weight" "engine.type"
## [16] "num.of.cylinders" "engine.size" "fuel.system"
## [19] "bore" "stroke" "compression.ratio"
## [22] "horsepower" "peak.rpm" "city.mpg"
## [25] "highway.mpg" "price"
To know the structure(class and levels) of each variable, str() function is used.
str(automobileData)
## 'data.frame': 205 obs. of 26 variables:
## $ symboling : int 3 3 1 2 2 2 1 1 1 0 ...
## $ normalized.losses: int NA NA NA 164 164 NA 158 NA 158 NA ...
## $ make : Factor w/ 22 levels "alfa-romero",..: 1 1 1 2 2 2 2 2 2 2 ...
## $ fuel.type : Factor w/ 2 levels "diesel","gas": 2 2 2 2 2 2 2 2 2 2 ...
## $ aspiration : Factor w/ 2 levels "std","turbo": 1 1 1 1 1 1 1 1 2 2 ...
## $ num.of.doors : Factor w/ 3 levels "","four","two": 3 3 3 2 2 3 2 2 2 3 ...
## $ body.style : Factor w/ 5 levels "convertible",..: 1 1 3 4 4 4 4 5 4 3 ...
## $ drive.wheels : Factor w/ 3 levels "4wd","fwd","rwd": 3 3 3 2 1 2 2 2 2 1 ...
## $ engine.location : Factor w/ 2 levels "front","rear": 1 1 1 1 1 1 1 1 1 1 ...
## $ wheel.base : num 88.6 88.6 94.5 99.8 99.4 ...
## $ length : num 169 169 171 177 177 ...
## $ width : num 64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
## $ height : num 48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
## $ curb.weight : int 2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
## $ engine.type : Factor w/ 7 levels "dohc","dohcv",..: 1 1 6 4 4 4 4 4 4 4 ...
## $ num.of.cylinders : Factor w/ 7 levels "eight","five",..: 3 3 4 3 2 2 2 2 2 2 ...
## $ engine.size : int 130 130 152 109 136 136 136 136 131 131 ...
## $ fuel.system : Factor w/ 8 levels "1bbl","2bbl",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ bore : num 3.47 3.47 2.68 3.19 3.19 3.19 3.19 3.19 3.13 3.13 ...
## $ stroke : num 2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 3.4 ...
## $ compression.ratio: num 9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
## $ horsepower : int 111 111 154 102 115 110 110 110 140 160 ...
## $ peak.rpm : int 5000 5000 5000 5500 5500 5500 5500 5500 5500 5500 ...
## $ city.mpg : int 21 21 19 24 18 19 19 19 17 16 ...
## $ highway.mpg : int 27 27 26 30 22 25 25 25 20 22 ...
## $ price : num 13495 16500 16500 13950 17450 ...
To have a short look at the dataset, we can use head() or tail() functions to see a few data samples.
head(automobileData)
## symboling normalized.losses make fuel.type aspiration
## 1 3 NA alfa-romero gas std
## 2 3 NA alfa-romero gas std
## 3 1 NA alfa-romero gas std
## 4 2 164 audi gas std
## 5 2 164 audi gas std
## 6 2 NA audi gas std
## num.of.doors body.style drive.wheels engine.location wheel.base length
## 1 two convertible rwd front 88.6 168.8
## 2 two convertible rwd front 88.6 168.8
## 3 two hatchback rwd front 94.5 171.2
## 4 four sedan fwd front 99.8 176.6
## 5 four sedan 4wd front 99.4 176.6
## 6 two sedan fwd front 99.8 177.3
## width height curb.weight engine.type num.of.cylinders engine.size
## 1 64.1 48.8 2548 dohc four 130
## 2 64.1 48.8 2548 dohc four 130
## 3 65.5 52.4 2823 ohcv six 152
## 4 66.2 54.3 2337 ohc four 109
## 5 66.4 54.3 2824 ohc five 136
## 6 66.3 53.1 2507 ohc five 136
## fuel.system bore stroke compression.ratio horsepower peak.rpm city.mpg
## 1 mpfi 3.47 2.68 9.0 111 5000 21
## 2 mpfi 3.47 2.68 9.0 111 5000 21
## 3 mpfi 2.68 3.47 9.0 154 5000 19
## 4 mpfi 3.19 3.40 10.0 102 5500 24
## 5 mpfi 3.19 3.40 8.0 115 5500 18
## 6 mpfi 3.19 3.40 8.5 110 5500 19
## highway.mpg price
## 1 27 13495
## 2 27 16500
## 3 26 16500
## 4 30 13950
## 5 22 17450
## 6 25 15250
To have an overiew of our dataset, we can use the function summary().
summary(automobileData)
## symboling normalized.losses make fuel.type
## Min. :-2.0000 Min. : 65 toyota : 32 diesel: 20
## 1st Qu.: 0.0000 1st Qu.: 94 nissan : 18 gas :185
## Median : 1.0000 Median :115 mazda : 17
## Mean : 0.8341 Mean :122 honda : 13
## 3rd Qu.: 2.0000 3rd Qu.:150 mitsubishi: 13
## Max. : 3.0000 Max. :256 subaru : 12
## NA's :41 (Other) :100
## aspiration num.of.doors body.style drive.wheels engine.location
## std :168 : 2 convertible: 6 4wd: 9 front:202
## turbo: 37 four:114 hardtop : 8 fwd:120 rear : 3
## two : 89 hatchback :70 rwd: 76
## sedan :96
## wagon :25
##
##
## wheel.base length width height
## Min. : 86.60 Min. :141.1 Min. :60.30 Min. :47.80
## 1st Qu.: 94.50 1st Qu.:166.3 1st Qu.:64.10 1st Qu.:52.00
## Median : 97.00 Median :173.2 Median :65.50 Median :54.10
## Mean : 98.76 Mean :174.0 Mean :65.91 Mean :53.72
## 3rd Qu.:102.40 3rd Qu.:183.1 3rd Qu.:66.90 3rd Qu.:55.50
## Max. :120.90 Max. :208.1 Max. :72.30 Max. :59.80
##
## curb.weight engine.type num.of.cylinders engine.size fuel.system
## Min. :1488 dohc : 12 eight : 5 Min. : 61.0 mpfi :94
## 1st Qu.:2145 dohcv: 1 five : 11 1st Qu.: 97.0 2bbl :66
## Median :2414 l : 12 four :159 Median :120.0 idi :20
## Mean :2556 ohc :148 six : 24 Mean :126.9 1bbl :11
## 3rd Qu.:2935 ohcf : 15 three : 1 3rd Qu.:141.0 spdi : 9
## Max. :4066 ohcv : 13 twelve: 1 Max. :326.0 4bbl : 3
## rotor: 4 two : 4 (Other): 2
## bore stroke compression.ratio horsepower
## Min. :2.54 Min. :2.070 Min. : 7.00 Min. : 48.0
## 1st Qu.:3.15 1st Qu.:3.110 1st Qu.: 8.60 1st Qu.: 70.0
## Median :3.31 Median :3.290 Median : 9.00 Median : 95.0
## Mean :3.33 Mean :3.255 Mean :10.14 Mean :104.3
## 3rd Qu.:3.59 3rd Qu.:3.410 3rd Qu.: 9.40 3rd Qu.:116.0
## Max. :3.94 Max. :4.170 Max. :23.00 Max. :288.0
## NA's :4 NA's :4 NA's :2
## peak.rpm city.mpg highway.mpg price
## Min. :4150 Min. :13.00 Min. :16.00 Min. : 5118
## 1st Qu.:4800 1st Qu.:19.00 1st Qu.:25.00 1st Qu.: 7788
## Median :5200 Median :24.00 Median :30.00 Median :10595
## Mean :5125 Mean :25.22 Mean :30.75 Mean :13207
## 3rd Qu.:5500 3rd Qu.:30.00 3rd Qu.:34.00 3rd Qu.:16500
## Max. :6600 Max. :49.00 Max. :54.00 Max. :45400
## NA's :2
From the overview of our data, we can see how the missing values are named if there are any. Usually, datasets have some missing values. They can be either in the form of ?, NA etc. is.na() can be used to find out how many NA values are there in the dataset. It is recommended to view seperately for each variable. Similarly, dataset$variable==“?” can be used to find out whether there are any missing values.
is.na(automobileData$symboling)
automobileData$symboling=="?"
The above commands compares the values in the columns to NA’s and ‘?’ and returns boolian if the value is NA or ‘?’. We can use sum(is.na()) or sum(automobileData$symboling=="?") functions to count the number of missing values. If the output is zero, then there are no missing values.
sum(is.na(automobileData$symboling))
## [1] 0
sum(automobileData$symboling=="?")
## [1] 0
We can see there are no missing values in the column ‘Symboling’.
To find the correlation between two numeric variables, we can use the function cor().
cor(automobileData$wheel.base,automobileData$price)
To find the correlation between a numeric and discrete variable, we can use the function biserial.cor() if the discrete variable is dichotomous. If the catagorical variable is multi level use function polyserial(). Note: you need “ltm” library for executing this function.
library("ltm")
biserial.cor(automobileData$price, automobileData$engine.location)
To find the correlation between two categorical variables, we should first create a table of the two variables and then using assocstats() function, we can find the correlation between them. Note: we need “vcd” library for that.
library("vcd")
stat_table<-table(automobileData$aspiration, automobileData$price)
assocstats(stat_table)
We can draw a plot between two variables using qplot() function, part of ‘ggplot2’ package.
library("ggplot2")
qplot(automobileData$symboling, automobileData$price, colour=I("red"), size=I(4))
We can draw a 3D plot using plot3d function.
library("plot3D")
library("rgl")
plot3d(automobileData$symboling,automobileData$price,automobileData$length,col="blue")
Univariate Analysis:
Now that we have seen the summary of whole dataset, we can move into univariate analysis, where we take one variable at a time and try to find any missing values or underlying outliers.
Symboling:
This variable denotes the risk score of the vehicle. +3 indicates high rish vehicle and -3 indicates probably safe vehicle.
summary(automobileData$symboling)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.0000 0.0000 1.0000 0.8341 2.0000 3.0000
library("ggplot2")
ggplot(automobileData, aes(y=symboling, x = 1)) + geom_boxplot(outlier.color = "red")
summary() function shows no missing values and boxplot is clean, which indicates no outliers.
normallized.losses
It’s relative average loss payment per insured vehicle year for insurance company. This value is normalized for all autos within a particular size classification and represents the average loss per car per year. Also this variable has 41 missing values.
summary(automobileData$normalized.losses)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 65 94 115 122 150 256 41
Summary shows mean and medians are close enough, meaning a good distribution of data in this field.
ggplot(automobileData, aes(y=normalized.losses, x = 1)) + geom_boxplot(outlier.color = "red")
## Warning: Removed 41 rows containing non-finite values (stat_boxplot).
From plot we can see one value sticking out of the distribution.
Make:
This variable shows the name of the maker company of the car.
summary(automobileData$make)
## alfa-romero audi bmw chevrolet dodge
## 3 7 8 3 9
## honda isuzu jaguar mazda mercedes-benz
## 13 4 3 17 8
## mercury mitsubishi nissan peugot plymouth
## 1 13 18 11 7
## porsche renault saab subaru toyota
## 5 2 6 12 32
## volkswagen volvo
## 12 11
str(automobileData$make)
## Factor w/ 22 levels "alfa-romero",..: 1 1 1 2 2 2 2 2 2 2 ...
Summary and structure shows there are 22 maker companies in the dataset.
qplot(data = automobileData, x = make)
Fuel Type:
summary(automobileData$fuel.type)
## diesel gas
## 20 185
dichotomous variable : diesel and gas, representing the fuel type of the vehicle.
qplot(data = automobileData, x = fuel.type)
Clearly 90% vehicles work on Gasoline. No missing values.
Aspiration:
Represents if the vehicle is turbocharged or standard(normally) aspireted.
summary(automobileData$aspiration)
## std turbo
## 168 37
qplot(data = automobileData, x = aspiration)
18% vehicles are turbo charged rest follows standard air intake.
Num-of-Doors:
Catagorical variable with 2 levels: four and two
summary(automobileData$num.of.doors)
## four two
## 2 114 89
qplot(data = automobileData, x = num.of.doors)
This variable contains 2 missing values.
Body-Style:
summary(automobileData$body.style)
## convertible hardtop hatchback sedan wagon
## 6 8 70 96 25
qplot(data = automobileData, x = body.style)
Catagorical variable showing the type of body: convertible, hardtop, hatchback, sedan, wagon. Clearly Sedan and hatchback body styles dominates others.
Drive-wheels:
3 catagories of the drive wheels type: 4wd, fwd, rwd
summary(automobileData$drive.wheels)
## 4wd fwd rwd
## 9 120 76
qplot(data = automobileData, x = drive.wheels)
Engine-location:
If the engine is placed in front or the rear.
qplot(data = automobileData, x = engine.location)
Wheel-Base:
summary(automobileData$wheel.base)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 86.60 94.50 97.00 98.76 102.40 120.90
ggplot(automobileData, aes(y=wheel.base, x = 1)) + geom_boxplot(outlier.color = "red")
Length:
Continuous variable representing the length of the vehicle.
summary(automobileData$length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 141.1 166.3 173.2 174.0 183.1 208.1
ggplot(automobileData, aes(y=length, x = 1)) + geom_boxplot(outlier.color = "red")
At the low length one variable seem to be out of distribution which can be avoided.
Width:
Continuous variable representing width of the vehicle.
summary(automobileData$width)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 60.30 64.10 65.50 65.91 66.90 72.30
Mean and medians are nearby, chances of outliers are low.
ggplot(automobileData, aes(y=width, x = 1)) + geom_boxplot(outlier.color = "red")
4 values seem to be out of the distribution.
height:
Continuous variable, representing height of the vehicle.
summary(automobileData$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 47.80 52.00 54.10 53.72 55.50 59.80
Mean and medians are close, seem to have continuous distribution.
ggplot(automobileData, aes(y=height, x = 1)) + geom_boxplot(outlier.color = "red")
No outlier in this variable.
Curb-weight:
continuous variable Representing the total weight of the vehicle without the occupents and luggage.
summary(automobileData$curb.weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1488 2145 2414 2556 2935 4066
ggplot(automobileData, aes(y=curb.weight, x = 1)) + geom_boxplot(outlier.color = "red")
Both summary and boxplot proves no sign of outliers.
Engine-type:
Catagorical variable showing they engine type used. 7 catagories in the field.
summary(automobileData$engine.type)
## dohc dohcv l ohc ohcf ohcv rotor
## 12 1 12 148 15 13 4
qplot(data = automobileData, x = engine.type)
Looks like a large number of vehicles has OHC(Over Head Cam-Shaft) engines.
num-of-cylinders:
Again a catagorical variable stating the number of cylinder the car engine have.
summary(automobileData$num.of.cylinders)
## eight five four six three twelve two
## 5 11 159 24 1 1 4
qplot(data =automobileData, x = num.of.cylinders)
77.5% vehicles have 4-cylinder engines in use.
Engine-size:
Continuous variable for the size of the engine.
summary(automobileData$engine.size)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 61.0 97.0 120.0 126.9 141.0 326.0
ggplot(automobileData, aes(y = engine.size, x = 1)) + geom_boxplot(outlier.color = "red")
Boxplot shows 6 values as outliers, which are not following the distribution of the data.
Fuel System:
Catagorical variable showing the fuel system used in the car with 8 levels.
summary(automobileData$fuel.system)
## 1bbl 2bbl 4bbl idi mfi mpfi spdi spfi
## 11 66 3 20 1 94 9 1
qplot(data=automobileData, x = fuel.system)
plot shows MPFI and 2BBL fuel system are dominating.
Bore:
Continuous variable presenting the engine bore size:
summary(automobileData$bore)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.54 3.15 3.31 3.33 3.59 3.94 4
mean and medians are nearby but there are 4 missing values.
ggplot(automobileData, aes(y = bore, x = 1)) + geom_boxplot(outlier.color = "red")
## Warning: Removed 4 rows containing non-finite values (stat_boxplot).
Boxplot seem clear, no outliers.
Stroke:
Continuous variable for the stroke length of the engine.
summary(automobileData$stroke)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.070 3.110 3.290 3.255 3.410 4.170 4
Summary shows mean and medians to be close, but this variable contains a 4 missing values.
ggplot(automobileData, aes(y=stroke, x=1)) + geom_boxplot(outlier.color = "red")
## Warning: Removed 4 rows containing non-finite values (stat_boxplot).
Boxplot shows few outliers but these can be avoided.
Compression-Ratio:
Continuous variable giving the compression ratio of each car-engine.
summary(automobileData$compression.ratio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 8.60 9.00 10.14 9.40 23.00
ggplot(automobileData, aes(y = compression.ratio, x = 1)) + geom_boxplot(outlier.color = "red")
boxplot shows outliers in the variable.
Horsepower:
Continuous variable ranging from 48 to 288.
summary(automobileData$horsepower)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 48.0 70.0 95.0 104.3 116.0 288.0 2
Summary shows 2 missing values.
ggplot(automobileData, aes(y=horsepower, x = 1)) + geom_boxplot(outlier.color = "red")
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
Boxplot shows 2 values above 200 to be outliers.
Peak-rpm:
Continuous variable for the peak rpm value of the car engine.
summary(automobileData$peak.rpm)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 4150 4800 5200 5125 5500 6600 2
Ranges from 4150 to 6600, with 2 missing values.
ggplot(automobileData, aes(y=peak.rpm, x=1)) + geom_boxplot(outlier.color = "red")
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
The max value 6600 seem to be fall out of the distribution.
City-mpg:
Continuous variable for mileage per gallon fuel in when driven in city.
summary(automobileData$city.mpg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 19.00 24.00 25.22 30.00 49.00
ggplot(automobileData, aes(y=city.mpg, x=1)) + geom_boxplot(outlier.color="red")
Boxplot shows 2 values above 45 as outliers.
Highway-mpg:
Continuous variable for mileage per gallon fuel in when driven on highway.
summary(automobileData$highway.mpg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16.00 25.00 30.00 30.75 34.00 54.00
ggplot(automobileData, aes(y = highway.mpg, x=1)) + geom_boxplot(outlier.color = "red")
Boxplot shows values above 47mpg to be outliers.
Price:
Price is our output/predictive variable. A continuous variable showing the price of the car/vehicle.
summary(automobileData$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5118 7788 10600 13210 16500 45400
price ranges form 5118 to 45400.
ggplot(automobileData, aes(y=price, x=1)) + geom_boxplot(outlier.color = "red")
According to boxplot, cars pricier than 30000 goes beyond the regular distribution of this variable.
Training Set
Now, I have removed all the variables which has missing values except the target variable in which we have replaced them with mean.
traindata<-VehicleData[,c(-2,-6,-19,-20,-22,-23)]
traindata<-automobileData[,c(-2,-6,-19,-20,-22,-23)]
st_table<-table(traindata$make)
st_table
barplot(st_table)
Data Cleaning
NOTE: The dataset has very few observation managing the outliers could make us lose some of the information in the data. There are very few outliers, we have left them untouched.
We will create a new dataframe keeping the original unharmed.
automobileData2 <- automobileData
Managing Missing Values
We have 7 variables with missing values:
| Variable | Missing Values |
|---|---|
| normalized.losses | 41 (20%) |
| num.of.doors | 2 (0.9%) |
| bore | 4 (1.9%) |
| stroke | 4 (1.9%) |
| horsepower | 2 (0.9%) |
| peak.rpm | 2 (09.%) |
normalized.losses
This variable has around 20% missing values. Textbook method to treat missing values is to create another indicator variable stating if the values are missing or non-missing and replace the original with mean or median.
First we create an indicator variable indicating missing and non-missing values.
automobileData2$normalized.losses_ind <- as.factor(ifelse(is.na(automobileData$normalized.losses),"missing", "non-missing"))
summary(automobileData2$normalized.losses_ind)
## missing non-missing
## 41 164
Next we can replace missing values with mean in the original column.
automobileData2$normalized.losses <- ifelse(is.na(automobileData$normalized.losses),122, automobileData$normalized.losses)
summary(automobileData2$normalized.losses)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 65 101 122 122 137 256
num.of.doors
This variable contains just two missing values, we can just replace the missing values with either of the dichtomouns value. As the missing value is denoted by Factor level : “”, we can replace “”, with “two” or “four” using levels function.
levels(automobileData2$num.of.doors)[levels(automobileData2$num.of.doors)==""] <- "two"
summary(automobileData2$num.of.doors)
## two four
## 91 114
bore
It’s a continuous variable we can replace the missing values with mean:3.33
automobileData2$bore <- ifelse(is.na(automobileData$bore),3.33, automobileData$bore)
summary(automobileData2$bore)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.54 3.15 3.31 3.33 3.58 3.94
Stroke
We can replace 4 missing values with the mean : 3.255
automobileData2$stroke <- ifelse(is.na(automobileData$stroke),3.225, automobileData$stroke)
summary(automobileData2$stroke)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.070 3.110 3.290 3.255 3.410 4.170
Horsepower
Two missing values can be replaced with mean: 104.3
automobileData2$horsepower <- ifelse(is.na(automobileData$horsepower),104.3, automobileData$horsepower)
summary(automobileData2$horsepower)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 48.0 70.0 95.0 104.3 116.0 288.0
Peak-rpm
2 missing values can be replaced with mean : 5125
automobileData2$peak.rpm <- ifelse(is.na(automobileData$peak.rpm),5125, automobileData$peak.rpm)
summary(automobileData2$peak.rpm)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4150 4800 5200 5125 5500 6600
Combining Levels
Some catagorical variables have high number of levels. While training a model each level will have a dummy variable again and we don’t have enough observations/row to fit a model with that many variables.
To overcome this problem, we will combine the levels which are less than 5 in number to a same level in all the variables which have high number of level.
Variables make, engine.type, num.of.cylinders and fuel.system has more than 5 levels.
Make
levels that represent low frequency can be merged into single value. We will take levels that represents last ~10 percentile of the distribution.
summary(automobileData2$make)
## alfa-romero audi bmw chevrolet dodge
## 3 7 8 3 9
## honda isuzu jaguar mazda mercedes-benz
## 13 4 3 17 8
## mercury mitsubishi nissan peugot plymouth
## 1 13 18 11 7
## porsche renault saab subaru toyota
## 5 2 6 12 32
## volkswagen volvo
## 12 11
sort(prop.table(table(automobileData$make))*100)
##
## mercury renault alfa-romero chevrolet jaguar
## 0.4878049 0.9756098 1.4634146 1.4634146 1.4634146
## isuzu porsche saab audi plymouth
## 1.9512195 2.4390244 2.9268293 3.4146341 3.4146341
## bmw mercedes-benz dodge peugot volvo
## 3.9024390 3.9024390 4.3902439 5.3658537 5.3658537
## subaru volkswagen honda mitsubishi mazda
## 5.8536585 5.8536585 6.3414634 6.3414634 8.2926829
## nissan toyota
## 8.7804878 15.6097561
levels(automobileData2$make)[levels(automobileData$make)%in%c("mercury", "renault", "alfa-romero", "chevrolet", "jaguar", "isuzu", "porsche", "saab", "audi", "plymouth")] <- "other"
summary(automobileData2$make)
## other bmw dodge honda mazda
## 41 8 9 13 17
## mercedes-benz mitsubishi nissan peugot subaru
## 8 13 18 11 12
## toyota volkswagen volvo
## 32 12 11
Engine-Type
dohcv, rotor, dohc and l represents around 12% of the values, we will merge these values into one.
summary(automobileData2$engine.type)
## dohc dohcv l ohc ohcf ohcv rotor
## 12 1 12 148 15 13 4
sort(prop.table(table(automobileData2$engine.type))*100)
##
## dohcv rotor dohc l ohcv ohcf
## 0.4878049 1.9512195 5.8536585 5.8536585 6.3414634 7.3170732
## ohc
## 72.1951220
levels(automobileData2$engine.type)[levels(automobileData2$engine.type)%in%c("dohcv", "rotor", "dohc", "l")] <- "other"
summary(automobileData2$engine.type)
## other ohc ohcf ohcv
## 29 148 15 13
num-of-cylinder We can convert this variable into a numeric variable.
library(plyr)
automobileData2$num.of.cylinders <- as.character(automobileData2$num.of.cylinders)
automobileData2$num.of.cylinders <- mapvalues(automobileData2$num.of.cylinders, from=c("eight", "five", "four", "six", "three", "twelve", "two"), to=c(8,5,4,6,3,12,2))
automobileData2$num.of.cylinders <- as.numeric(automobileData2$num.of.cylinders)
Fuel-system out of 8 levels, 4 levels represents only 12% of the data, we can combine those 4 levels into one.
summary(automobileData2$fuel.system)
## 1bbl 2bbl 4bbl idi mfi mpfi spdi spfi
## 11 66 3 20 1 94 9 1
levels(automobileData2$fuel.system)[levels(automobileData2$fuel.system)%in%c("1bbl","4bbl", "mfi", "spdi","spfi")] <- "other"
summary(automobileData2$fuel.system)
## other 2bbl idi mpfi
## 25 66 20 94
Model Building
Now that we have curated the data, we can move to model building part. Our predictive variable is a continuous variable, we can approach this problem with regression algorithms.
Linear Regression Model.
We will train our data on simple linear regression algorithm to predict our price variable.
Vif gives the importance of a variable. In general, it should be less than 5.
summary(automobileData2)
## symboling normalized.losses make fuel.type
## Min. :-2.0000 Min. : 65 other :41 diesel: 20
## 1st Qu.: 0.0000 1st Qu.:101 toyota :32 gas :185
## Median : 1.0000 Median :122 nissan :18
## Mean : 0.8341 Mean :122 mazda :17
## 3rd Qu.: 2.0000 3rd Qu.:137 honda :13
## Max. : 3.0000 Max. :256 mitsubishi:13
## (Other) :71
## aspiration num.of.doors body.style drive.wheels engine.location
## std :168 two : 91 convertible: 6 4wd: 9 front:202
## turbo: 37 four:114 hardtop : 8 fwd:120 rear : 3
## hatchback :70 rwd: 76
## sedan :96
## wagon :25
##
##
## wheel.base length width height
## Min. : 86.60 Min. :141.1 Min. :60.30 Min. :47.80
## 1st Qu.: 94.50 1st Qu.:166.3 1st Qu.:64.10 1st Qu.:52.00
## Median : 97.00 Median :173.2 Median :65.50 Median :54.10
## Mean : 98.76 Mean :174.0 Mean :65.91 Mean :53.72
## 3rd Qu.:102.40 3rd Qu.:183.1 3rd Qu.:66.90 3rd Qu.:55.50
## Max. :120.90 Max. :208.1 Max. :72.30 Max. :59.80
##
## curb.weight engine.type num.of.cylinders engine.size fuel.system
## Min. :1488 other: 29 Min. : 2.00 Min. : 61.0 other:25
## 1st Qu.:2145 ohc :148 1st Qu.: 4.00 1st Qu.: 97.0 2bbl :66
## Median :2414 ohcf : 15 Median : 4.00 Median :120.0 idi :20
## Mean :2556 ohcv : 13 Mean : 4.38 Mean :126.9 mpfi :94
## 3rd Qu.:2935 3rd Qu.: 4.00 3rd Qu.:141.0
## Max. :4066 Max. :12.00 Max. :326.0
##
## bore stroke compression.ratio horsepower
## Min. :2.54 Min. :2.070 Min. : 7.00 Min. : 48.0
## 1st Qu.:3.15 1st Qu.:3.110 1st Qu.: 8.60 1st Qu.: 70.0
## Median :3.31 Median :3.290 Median : 9.00 Median : 95.0
## Mean :3.33 Mean :3.255 Mean :10.14 Mean :104.3
## 3rd Qu.:3.58 3rd Qu.:3.410 3rd Qu.: 9.40 3rd Qu.:116.0
## Max. :3.94 Max. :4.170 Max. :23.00 Max. :288.0
##
## peak.rpm city.mpg highway.mpg price
## Min. :4150 Min. :13.00 Min. :16.00 Min. : 5118
## 1st Qu.:4800 1st Qu.:19.00 1st Qu.:25.00 1st Qu.: 7788
## Median :5200 Median :24.00 Median :30.00 Median :10595
## Mean :5125 Mean :25.22 Mean :30.75 Mean :13207
## 3rd Qu.:5500 3rd Qu.:30.00 3rd Qu.:34.00 3rd Qu.:16500
## Max. :6600 Max. :49.00 Max. :54.00 Max. :45400
##
## normalized.losses_ind
## missing : 41
## non-missing:164
##
##
##
##
##
str(automobileData2)
## 'data.frame': 205 obs. of 27 variables:
## $ symboling : int 3 3 1 2 2 2 1 1 1 0 ...
## $ normalized.losses : num 122 122 122 164 164 122 158 122 158 122 ...
## $ make : Factor w/ 13 levels "other","bmw",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ fuel.type : Factor w/ 2 levels "diesel","gas": 2 2 2 2 2 2 2 2 2 2 ...
## $ aspiration : Factor w/ 2 levels "std","turbo": 1 1 1 1 1 1 1 1 2 2 ...
## $ num.of.doors : Factor w/ 2 levels "two","four": 1 1 1 2 2 1 2 2 2 1 ...
## $ body.style : Factor w/ 5 levels "convertible",..: 1 1 3 4 4 4 4 5 4 3 ...
## $ drive.wheels : Factor w/ 3 levels "4wd","fwd","rwd": 3 3 3 2 1 2 2 2 2 1 ...
## $ engine.location : Factor w/ 2 levels "front","rear": 1 1 1 1 1 1 1 1 1 1 ...
## $ wheel.base : num 88.6 88.6 94.5 99.8 99.4 ...
## $ length : num 169 169 171 177 177 ...
## $ width : num 64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
## $ height : num 48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
## $ curb.weight : int 2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
## $ engine.type : Factor w/ 4 levels "other","ohc",..: 1 1 4 2 2 2 2 2 2 2 ...
## $ num.of.cylinders : num 4 4 6 4 5 5 5 5 5 5 ...
## $ engine.size : int 130 130 152 109 136 136 136 136 131 131 ...
## $ fuel.system : Factor w/ 4 levels "other","2bbl",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ bore : num 3.47 3.47 2.68 3.19 3.19 3.19 3.19 3.19 3.13 3.13 ...
## $ stroke : num 2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 3.4 ...
## $ compression.ratio : num 9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
## $ horsepower : num 111 111 154 102 115 110 110 110 140 160 ...
## $ peak.rpm : num 5000 5000 5000 5500 5500 5500 5500 5500 5500 5500 ...
## $ city.mpg : int 21 21 19 24 18 19 19 19 17 16 ...
## $ highway.mpg : int 27 27 26 30 22 25 25 25 20 22 ...
## $ price : num 13495 16500 16500 13950 17450 ...
## $ normalized.losses_ind: Factor w/ 2 levels "missing","non-missing": 1 1 1 2 2 1 2 1 2 1 ...
traindata1<-automobileData2[,c(-2,-3,-13,-14,-16,-22)]
dim(automobileData2)
## [1] 205 27
LMmodel<-lm(price~.,data=automobileData2)
summary(LMmodel)
In the summary table we see 2 variables ‘engine.typeohcf’, ‘fuel.systemidi’ shows “NA” in all the columns. It’s because we ran into perfect collinearity between these variables and some other. In other words, these variables are fully correlated to (or say representing exact impact as) one or more predictor variables. Using alias() function will help us to find out the variables these tow are highly correlated to.
alias(LMmodel)
## Model :
## price ~ symboling + normalized.losses + make + fuel.type + aspiration +
## num.of.doors + body.style + drive.wheels + engine.location +
## wheel.base + length + width + height + curb.weight + engine.type +
## num.of.cylinders + engine.size + fuel.system + bore + stroke +
## compression.ratio + horsepower + peak.rpm + city.mpg + highway.mpg +
## normalized.losses_ind
##
## Complete :
## (Intercept) symboling normalized.losses makebmw makedodge
## engine.typeohcf 0 0 0 0 0
## fuel.systemidi 1 0 0 0 0
## makehonda makemazda makemercedes-benz makemitsubishi
## engine.typeohcf 0 0 0 0
## fuel.systemidi 0 0 0 0
## makenissan makepeugot makesubaru maketoyota makevolkswagen
## engine.typeohcf 0 0 1 0 0
## fuel.systemidi 0 0 0 0 0
## makevolvo fuel.typegas aspirationturbo num.of.doorsfour
## engine.typeohcf 0 0 0 0
## fuel.systemidi 0 -1 0 0
## body.stylehardtop body.stylehatchback body.stylesedan
## engine.typeohcf 0 0 0
## fuel.systemidi 0 0 0
## body.stylewagon drive.wheelsfwd drive.wheelsrwd
## engine.typeohcf 0 0 0
## fuel.systemidi 0 0 0
## engine.locationrear wheel.base length width height
## engine.typeohcf 1 0 0 0 0
## fuel.systemidi 0 0 0 0 0
## curb.weight engine.typeohc engine.typeohcv
## engine.typeohcf 0 0 0
## fuel.systemidi 0 0 0
## num.of.cylinders engine.size fuel.system2bbl
## engine.typeohcf 0 0 0
## fuel.systemidi 0 0 0
## fuel.systemmpfi bore stroke compression.ratio horsepower
## engine.typeohcf 0 0 0 0 0
## fuel.systemidi 0 0 0 0 0
## peak.rpm city.mpg highway.mpg
## engine.typeohcf 0 0 0
## fuel.systemidi 0 0 0
## normalized.losses_indnon-missing
## engine.typeohcf 0
## fuel.systemidi 0
Improving model by chunking insignificant variables Variable which shows P value more than 5% will impact our Adj. \(R^2\) value in negative way. We will chunk those variables and build a new model with rest of the variables.
LMmodelNew1<-lm(price~make+aspiration+engine.location+wheel.base +width+curb.weight +num.of.cylinders+engine.size+bore+stroke+peak.rpm, data=automobileData2)
summary(LMmodelNew1)
We see that there is a very small loss in Adjusted R-squared value after removing the insignificant variables based on their P-values. This insignificant loss in R-squared proves we eliminated variables correctly.
To check further collinearity we can use vif() function form package: car
library(car)
vif(LMmodelNew1)
## GVIF Df GVIF^(1/(2*Df))
## make 33.297023 12 1.157267
## aspiration 1.745709 1 1.321253
## engine.location 1.523864 1 1.234449
## wheel.base 5.820991 1 2.412673
## width 6.516676 1 2.552778
## curb.weight 11.488619 1 3.389486
## num.of.cylinders 10.541836 1 3.246819
## engine.size 25.004334 1 5.000433
## bore 5.441816 1 2.332770
## stroke 2.816223 1 1.678161
## peak.rpm 2.075670 1 1.440718
Linear Regression Model Validation
So far we built a model and tried improving We will use caTools to divides the data into two subsets. train_new is our train data. hold_out is our test data. rmpeTrain is the training error. rmpeTest is the testing error.
library(caTools)
set.seed(90) # to get the same split everytime
spl = sample.split(automobileData2$price, SplitRatio = 0.8)
train_data = subset(automobileData2, spl==TRUE)
test_data = subset(automobileData2, spl==FALSE)
LMmodel2<-lm(price ~ aspiration+engine.location+wheel.base +width +engine.size+bore+stroke+peak.rpm, data=train_data)
summary(LMmodel2)
##
## Call:
## lm(formula = price ~ aspiration + engine.location + wheel.base +
## width + engine.size + bore + stroke + peak.rpm, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12773.3 -1502.0 -294.2 1254.6 15037.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.857e+04 1.226e+04 -3.960 0.000114 ***
## aspirationturbo 2.248e+03 6.941e+02 3.238 0.001471 **
## engine.locationrear 1.455e+04 2.519e+03 5.775 4.08e-08 ***
## wheel.base 2.917e+02 7.179e+01 4.063 7.68e-05 ***
## width 3.434e+02 2.315e+02 1.484 0.139904
## engine.size 1.497e+02 1.129e+01 13.257 < 2e-16 ***
## bore -2.782e+03 1.222e+03 -2.276 0.024225 *
## stroke -2.924e+03 8.794e+02 -3.325 0.001104 **
## peak.rpm 1.856e+00 5.816e-01 3.191 0.001716 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3071 on 155 degrees of freedom
## Multiple R-squared: 0.8538, Adjusted R-squared: 0.8463
## F-statistic: 113.2 on 8 and 155 DF, p-value: < 2.2e-16
With training data this new model is giving R-squared value of 0.86% We can add few variables back into the model and try to improve the R-squared value.
Let’s make predictions on this new model for both training and testing data sets.
Trprediction<-predict(LMmodel2,train_data)
Trtarget<-train_data$price
rmpeTrain<-mean(abs(Trtarget-Trprediction)/Trtarget)*100
print(100-rmpeTrain)
## [1] 84.17113
Tsprediction<-predict(LMmodel2,test_data)
Tstarget<-test_data$price
rmpeTest<-mean(abs(Tstarget-Tsprediction)/Tstarget)*100
print(100-rmpeTest)
## [1] 76.5475
From above we can see we got 84.2% accuracy on testing set, and 76.5% accuracy on the test data.
We can also perform cross validation with 10 folds to get more precise performance of our model.
library(DAAG)
## Loading required package: lattice
##
## Attaching package: 'DAAG'
## The following object is masked from 'package:car':
##
## vif
## The following object is masked from 'package:plyr':
##
## ozone
crossVal<-cv.lm(automobileData2, LMmodel2, 10)
## Analysis of Variance Table
##
## Response: price
## Df Sum Sq Mean Sq F value Pr(>F)
## aspiration 1 3.97e+08 3.97e+08 37.72 4.5e-09 ***
## engine.location 1 1.47e+09 1.47e+09 140.14 < 2e-16 ***
## wheel.base 1 4.98e+09 4.98e+09 472.96 < 2e-16 ***
## width 1 1.71e+09 1.71e+09 162.76 < 2e-16 ***
## engine.size 1 1.75e+09 1.75e+09 166.07 < 2e-16 ***
## bore 1 3.29e+07 3.29e+07 3.13 0.07854 .
## stroke 1 1.01e+08 1.01e+08 9.59 0.00224 **
## peak.rpm 1 1.24e+08 1.24e+08 11.79 0.00073 ***
## Residuals 196 2.06e+09 1.05e+07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## fold 1
## Observations in test set: 20
## 2 52 70 72 93 95 105 107 118 125 128
## Predicted 10468 6418 22989 33325 7083 7083 18735 20746 19248 15269 34528
## cvpred 10302 6360 22786 33189 7120 7120 18597 20684 19103 15229 34778
## price 16500 6095 28176 34184 6849 7299 17199 18399 18150 12764 34028
## CV residual 6198 -265 5390 995 -271 179 -1398 -2285 -953 -2465 -750
## 136 138 139 151 154 156 169 183 197
## Predicted 12526 14687 7384 6785 6785 6785 13018 7983 16334
## cvpred 12489 14481 7227 6736 6736 6736 13111 7964 16378
## price 15510 18620 5118 5348 6918 8778 9639 7775 15985
## CV residual 3021 4139 -2109 -1388 182 2042 -3472 -189 -393
##
## Sum of squares = 1.32e+08 Mean square = 6621678 n = 20
##
## fold 2
## Observations in test set: 21
## 1 28 29 41 42 44 53 61 63 64 74
## Predicted 10468 9288 11746 8985 10390 6850 6418 11279 11279 11001 42414
## cvpred 10968 8746 11071 8078 10424 5986 6493 11649 11649 11371 42494
## price 13495 8558 8921 10295 12945 6785 6795 8495 10245 10795 40960
## CV residual 2527 -188 -2150 2217 2521 799 302 -3154 -1404 -576 -1534
## 85 88 121 123 130 135 137 143 158 165
## Predicted 15269 11840 7098 8139 25936 17403 14687 9106 7655 7142
## cvpred 15292 11633 6957 7980 28338 17879 14778 9471 7638 7080
## price 14489 9279 6229 7609 13207 15040 18150 7775 7198 8238
## CV residual -803 -2354 -728 -371 -15131 -2839 3372 -1696 -440 1158
##
## Sum of squares = 2.97e+08 Mean square = 14157321 n = 21
##
## fold 3
## Observations in test set: 21
## 8 11 14 40 50 51 60 99 103 119
## Predicted 19175 12286 16147 10390 42411 6418 11279 7235 20323 7098
## cvpred 19363 12250 16532 10510 44911 6498 11285 7229 21028 7207
## price 18920 16430 21105 8845 36000 5195 8845 8249 14399 5572
## CV residual -443 4180 4573 -1665 -8911 -1303 -2440 1020 -6629 -1635
## 120 132 149 162 181 186 192 196 198 204
## Predicted 9288 11089 9773 7655 20235 9962 15459 16334 16334 20701
## cvpred 9278 11186 9774 7652 20711 10004 15760 16329 16329 20657
## price 7957 9895 8013 8358 15690 8195 13295 13415 16515 22470
## CV residual -1321 -1291 -1761 706 -5021 -1809 -2465 -2914 186 1813
## 205
## Predicted 20136
## cvpred 19841
## price 22625
## CV residual 2784
##
## Sum of squares = 2.37e+08 Mean square = 11295531 n = 21
##
## fold 4
## Observations in test set: 21
## 19 24 25 32 33 47 59 65 73 82
## Predicted 57.4 9288 7098 6214 6317 10185 7294 11279 27864 10357
## cvpred 119.7 9483 7039 5882 6201 10016 7042 11084 27179 10217
## price 5151.0 7957 6229 6855 5399 11048 15645 11245 35056 8499
## CV residual 5031.3 -1526 -810 973 -802 1032 8603 161 7877 -1718
## 83 97 102 106 117 166 173 175 176 200
## Predicted 15292 7083 20323 20431 17924 10243 13018 12153 10885 16413
## cvpred 15323 7040 20302 20040 18114 10235 12949 12318 10841 16586
## price 12629 7499 13499 19699 17950 9298 17669 10698 9988 18950
## CV residual -2694 459 -6803 -341 -164 -937 4720 -1620 -853 2364
## 201
## Predicted 18440
## cvpred 18405
## price 16845
## CV residual -1560
##
## Sum of squares = 2.58e+08 Mean square = 12305150 n = 21
##
## fold 5
## Observations in test set: 21
## 9 15 21 22 26 38 48 56 57 62
## Predicted 20360 17825 7177 7098 7098 10390 31575 5992 5992 11279
## cvpred 19615 17888 7278 7129 7129 10359 31939 5629 5629 11085
## price 23875 24565 6575 5572 6692 7895 32250 10945 11845 10595
## CV residual 4260 6677 -703 -1557 -437 -2464 311 5316 6216 -490
## 75 79 84 104 111 141 157 177 182 199
## Predicted 39784 7670 15269 20323 19528 7281 7655 10885.4 18933 16413
## cvpred 39719 7579 15339 20445 19879 7328 7653 10835.7 19149 16506
## price 45400 6669 14869 13499 13860 7603 6938 10898.0 15750 18420
## CV residual 5681 -910 -470 -6946 -6019 275 -715 62.3 -3399 1914
## 203
## Predicted 23971
## cvpred 23907
## price 21485
## CV residual -2422
##
## Sum of squares = 2.79e+08 Mean square = 13268277 n = 21
##
## fold 6
## Observations in test set: 21
## 3 7 31 34 35 39 49 58 67 87
## Predicted 15394 19175 3984 8073 8073 10390 31575 5992 12342 10357
## cvpred 15190 18870 3959 7972 7972 10227 31057 6012 12057 10286
## price 16500 17710 6479 6529 7129 9095 35550 13645 18344 8189
## CV residual 1310 -1160 2520 -1443 -843 -1132 4493 7633 6287 -2097
## 101 113 116 140 142 145 147 167 170 172
## Predicted 10615 17924 15041 7279 9849 9798 9798 10243 13018 13018
## cvpred 10527 17848 14884 7665 10165 10117 10117 10379 12979 12979
## price 9549 16900 16630 7053 7126 9233 7463 9538 9989 11549
## CV residual -978 -948 1746 -612 -3039 -884 -2654 -841 -2990 -1430
## 187
## Predicted 9962
## cvpred 9852
## price 8495
## CV residual -1357
##
## Sum of squares = 1.72e+08 Mean square = 8178094 n = 21
##
## fold 7
## Observations in test set: 20
## 4 10 17 37 45 55 71 77 94 112 122
## Predicted 11427 16935 25109 8710 7177 6301 25983 7670 7083 17586 7098
## cvpred 11333 16739 24597 8517 7145 6396 25684 7652 7075 17561 7059
## price 13950 13207 41315 7295 13207 7395 31600 5389 7349 15580 6692
## CV residual 2617 -3532 16718 -1222 6062 999 5916 -2263 274 -1981 -367
## 129 164 168 171 174 178 189 191 193
## Predicted 34528 7142 13018 13018 10885 10885 10426 8933 10639
## cvpred 33278 7221 12917 12917 10933 10933 10343 8824 10709
## price 37028 8058 8449 11199 8948 11248 9995 9980 13845
## CV residual 3750 837 -4468 -1718 -1985 315 -348 1156 3136
##
## Sum of squares = 4.35e+08 Mean square = 21766366 n = 20
##
## fold 8
## Observations in test set: 20
## 12 36 54 81 86 90 98 100 115 126
## Predicted 12286 8786 6418 11840 10357 7083 7083 10615 19528 15628
## cvpred 12097 9047 6545 12021 10378 7191 7191 10662 19921 14868
## price 16925 7295 6695 9959 6989 5499 7999 8949 17075 22018
## CV residual 4828 -1752 150 -2062 -3389 -1692 808 -1713 -2846 7150
## 148 152 153 161 179 180 184 185 188 194
## Predicted 10541.9 6785 6785 7655.50 20452 20452 9962 7983 9121.3 11944
## cvpred 10140.3 7012 7012 7735.32 20660 20660 10065 8268 9550.6 12027
## price 10198.0 6338 6488 7738.00 16558 15998 7975 7995 9495.0 12290
## CV residual 57.7 -674 -524 2.68 -4102 -4662 -2090 -273 -55.6 263
##
## Sum of squares = 1.52e+08 Mean square = 7580236 n = 20
##
## fold 9
## Observations in test set: 20
## 5 16 20 27 43 66 69 76 78 80
## Predicted 14944 24589 7177 7098 10248.4 15062 23829 17241 7670 9110
## cvpred 14860 24158 7172 7073 10255.3 15089 23945 17435 7682 9030
## price 17450 30760 6295 7609 10345.0 18280 28248 16503 6189 7689
## CV residual 2590 6602 -877 536 89.7 3191 4303 -932 -1493 -1341
## 96 108 109 114 127 131 134 146 155 195
## Predicted 7083 15041 17924 19190 34528 11037 12526 11494 6785 16334
## cvpred 7059 15355 18123 19662 35528 10973 12649 11731 6850 16401
## price 7799 11900 13200 16695 32528 9295 12170 11259 7898 12940
## CV residual 740 -3455 -4923 -2967 -3000 -1678 -479 -472 1048 -3461
##
## Sum of squares = 1.56e+08 Mean square = 7791166 n = 20
##
## fold 10
## Observations in test set: 20
## 6 13 18 23 30 46 68 89 91 92 110
## Predicted 14994 16147 28325 7098 15144 7177 23829 10144 7035 7083 16645
## cvpred 14853 15894 28081 7013 15177 7103 23779 10091 6927 7042 16736
## price 15250 20970 36880 6377 12964 13207 25552 9279 7099 6649 12440
## CV residual 397 5076 8799 -636 -2213 6104 1773 -812 172 -393 -4296
## 124 133 144 150 159 160 163 190 202
## Predicted 11723 12526 10593 11469 7659 7659 7655 9037 19898
## cvpred 11738 12518 10599 11546 7606 7606 7585 8981 20056
## price 8921 11850 9960 11694 7898 7788 9258 11595 19045
## CV residual -2817 -668 -639 148 292 182 1673 2614 -1011
##
## Sum of squares = 1.88e+08 Mean square = 9396303 n = 20
##
## Overall (Sum over all 20 folds)
## ms
## 11250765
rmpe_cv<-mean(abs(automobileData2$price-crossVal$Predicted)/automobileData2$price)*100
100-rmpe_cv
## [1] 82.6
A 10-fold cross validation is giving us accuracy of 82.6
Regression tree
A regression problem can also be solved using a decision tree, We just have to tweak the parameters to make the tree work on regression problems.
We wil use rpart library to build our model on
library(rpart)
library(rattle)
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(rpart.plot)
Tree1 <- rpart(price ~., method="anova", data=automobileData2)
printcp(Tree1)
##
## Regression tree:
## rpart(formula = price ~ ., data = automobileData2, method = "anova")
##
## Variables actually used in tree construction:
## [1] curb.weight engine.size width
##
## Root node error: 1e+10/205 = 6e+07
##
## n= 205
##
## CP nsplit rel error xerror xstd
## 1 0.63 0 1.0 1.0 0.16
## 2 0.19 1 0.4 0.4 0.05
## 3 0.03 2 0.2 0.2 0.04
## 4 0.01 3 0.2 0.2 0.04
## 5 0.01 4 0.1 0.2 0.04
summary(Tree1)
## Call:
## rpart(formula = price ~ ., data = automobileData2, method = "anova")
## n= 205
##
## CP nsplit rel error xerror xstd
## 1 0.6283 0 1.000 1.010 0.1614
## 2 0.1876 1 0.372 0.382 0.0471
## 3 0.0272 2 0.184 0.223 0.0420
## 4 0.0115 3 0.157 0.201 0.0415
## 5 0.0100 4 0.145 0.200 0.0413
##
## Variable importance
## engine.size curb.weight width city.mpg horsepower make
## 23 20 15 13 12 9
## length highway.mpg wheel.base
## 4 4 1
##
## Node number 1: 205 observations, complexity param=0.628
## mean=1.32e+04, MSE=6.16e+07
## left son=2 (187 obs) right son=3 (18 obs)
## Primary splits:
## engine.size < 182 to the left, improve=0.628, (0 missing)
## num.of.cylinders < 4.5 to the left, improve=0.521, (0 missing)
## curb.weight < 2700 to the left, improve=0.509, (0 missing)
## horsepower < 118 to the left, improve=0.507, (0 missing)
## highway.mpg < 28.5 to the right, improve=0.480, (0 missing)
## Surrogate splits:
## curb.weight < 3330 to the left, agree=0.971, adj=0.667, (0 split)
## horsepower < 176 to the left, agree=0.966, adj=0.611, (0 split)
## width < 69.2 to the left, agree=0.961, adj=0.556, (0 split)
## make splits as LLLLLRLLLLLLL, agree=0.951, adj=0.444, (0 split)
## city.mpg < 16.5 to the right, agree=0.951, adj=0.444, (0 split)
##
## Node number 2: 187 observations, complexity param=0.188
## mean=1.13e+04, MSE=2.06e+07
## left son=4 (126 obs) right son=5 (61 obs)
## Primary splits:
## curb.weight < 2660 to the left, improve=0.615, (0 missing)
## highway.mpg < 28.5 to the right, improve=0.581, (0 missing)
## horsepower < 94.5 to the left, improve=0.516, (0 missing)
## engine.size < 126 to the left, improve=0.503, (0 missing)
## city.mpg < 22 to the right, improve=0.493, (0 missing)
## Surrogate splits:
## engine.size < 126 to the left, agree=0.893, adj=0.672, (0 split)
## highway.mpg < 28.5 to the right, agree=0.893, adj=0.672, (0 split)
## length < 178 to the left, agree=0.877, adj=0.623, (0 split)
## city.mpg < 22 to the right, agree=0.866, adj=0.590, (0 split)
## width < 66 to the left, agree=0.861, adj=0.574, (0 split)
##
## Node number 3: 18 observations
## mean=3.33e+04, MSE=4.66e+07
##
## Node number 4: 126 observations, complexity param=0.0272
## mean=8.8e+03, MSE=6.37e+06
## left son=8 (73 obs) right son=9 (53 obs)
## Primary splits:
## curb.weight < 2290 to the left, improve=0.428, (0 missing)
## horsepower < 89 to the left, improve=0.360, (0 missing)
## highway.mpg < 29.5 to the right, improve=0.331, (0 missing)
## length < 168 to the left, improve=0.317, (0 missing)
## fuel.system splits as LLLR, improve=0.311, (0 missing)
## Surrogate splits:
## length < 169 to the left, agree=0.897, adj=0.755, (0 split)
## width < 64.5 to the left, agree=0.881, adj=0.717, (0 split)
## horsepower < 77 to the left, agree=0.849, adj=0.642, (0 split)
## wheel.base < 95.9 to the left, agree=0.841, adj=0.623, (0 split)
## city.mpg < 27.5 to the right, agree=0.841, adj=0.623, (0 split)
##
## Node number 5: 61 observations, complexity param=0.0115
## mean=1.64e+04, MSE=1.12e+07
## left son=10 (53 obs) right son=11 (8 obs)
## Primary splits:
## width < 68.6 to the left, improve=0.213, (0 missing)
## make splits as LRL-R-LLL-LLR, improve=0.195, (0 missing)
## fuel.system splits as LLRR, improve=0.175, (0 missing)
## wheel.base < 101 to the left, improve=0.154, (0 missing)
## length < 188 to the left, improve=0.117, (0 missing)
## Surrogate splits:
## wheel.base < 109 to the left, agree=0.885, adj=0.125, (0 split)
##
## Node number 8: 73 observations
## mean=7.39e+03, MSE=2.27e+06
##
## Node number 9: 53 observations
## mean=1.07e+04, MSE=5.54e+06
##
## Node number 10: 53 observations
## mean=1.58e+04, MSE=9.27e+06
##
## Node number 11: 8 observations
## mean=2.04e+04, MSE=5.8e+06
plotcp(Tree1)
fancyRpartPlot(Tree1)
We can improve the tree by adding the optimal complexity parameter from the CP functions and from the CP plot.
Validating our Tree Model
Let’s validate our tree model by predicting the values on training and testing sets by comparing the accuracy or erros.
First we will build a new tree model just on the training set and then find the accuracy on both training and testing sets.
Tree2 <- rpart(price ~., method="anova", data=train_data)
Trpred_tree<-predict(Tree2,train_data)
Trtaget_tree<-train_data$price
rmpeTr_tree<-mean(abs(Trtaget_tree-Trpred_tree)/Trtaget_tree)*100
100 - rmpeTr_tree
## [1] 84.3
Tspred_tree<-predict(Tree2,test_data)
Tstarget_tree<-test_data$price
rmpeTs_tree<-mean(abs(Tstarget_tree-Tspred_tree)/Tstarget_tree)*100
100 - rmpeTs_tree
## [1] 83.1
This new tree model Tree2 was trained on training set which uses all the variables. This model gives accuarcy of 84.3% on the training data predictions and 83.1% on the test set predictions.
Cross validation the tree Tree made with Tree package can be cross validated to find best tree size to prune the tree.
Conclusion
In this project, we have developed two models for estimating the price. One is Linear regression and the other one is Regression tree. With Linear regression, we are able to get an accuracy of 84.2% for training and 76.5% for testing. With cross validation accuracy of 82.3%. With Regression tree, we are able to get an accuracy of 84.3% for training and 83.1% for testing.


