Automobile Pricing Strategy

Problem Statement
Data Exploration
Data Cleaning
- Managing Missing Values
Model Building
- Linear Regression Model.
- Regression tree
Conclusion

Pre-requisites: R Programming, Machine learning

Problem Statement

The car company need to come up with the best pricing strategy for their new car such that, the price is neither too low nor too high. Build a model to predict the optimal price of a car using the car features.

Data Exploration

Loading the Data File

Since, our dataset is in .txt format, we should use the following code to read the file. In this file, the header information is not provided. Therefore, header=FALSE.

automobileData <- read.csv("D:\\Dv Analytics\\Datasets\\Automobile Data Set\\AutoDataset.csv")

If the data file is in web, we can directly read it from web using the following command

webData <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"), header=FALSE)

dim() function is used to find out the dimension of the dataset.

dim(automobileData)

## [1] 205  26

So, we have 205 samples of data and 26 variables. Use names() function to find out the variable names.

names(automobileData)

##  [1] "symboling"         "normalized.losses" "make"             
##  [4] "fuel.type"         "aspiration"        "num.of.doors"     
##  [7] "body.style"        "drive.wheels"      "engine.location"  
## [10] "wheel.base"        "length"            "width"            
## [13] "height"            "curb.weight"       "engine.type"      
## [16] "num.of.cylinders"  "engine.size"       "fuel.system"      
## [19] "bore"              "stroke"            "compression.ratio"
## [22] "horsepower"        "peak.rpm"          "city.mpg"         
## [25] "highway.mpg"       "price"

To know the structure(class and levels) of each variable, str() function is used.

str(automobileData)

## 'data.frame':    205 obs. of  26 variables:
##  $ symboling        : int  3 3 1 2 2 2 1 1 1 0 ...
##  $ normalized.losses: int  NA NA NA 164 164 NA 158 NA 158 NA ...
##  $ make             : Factor w/ 22 levels "alfa-romero",..: 1 1 1 2 2 2 2 2 2 2 ...
##  $ fuel.type        : Factor w/ 2 levels "diesel","gas": 2 2 2 2 2 2 2 2 2 2 ...
##  $ aspiration       : Factor w/ 2 levels "std","turbo": 1 1 1 1 1 1 1 1 2 2 ...
##  $ num.of.doors     : Factor w/ 3 levels "","four","two": 3 3 3 2 2 3 2 2 2 3 ...
##  $ body.style       : Factor w/ 5 levels "convertible",..: 1 1 3 4 4 4 4 5 4 3 ...
##  $ drive.wheels     : Factor w/ 3 levels "4wd","fwd","rwd": 3 3 3 2 1 2 2 2 2 1 ...
##  $ engine.location  : Factor w/ 2 levels "front","rear": 1 1 1 1 1 1 1 1 1 1 ...
##  $ wheel.base       : num  88.6 88.6 94.5 99.8 99.4 ...
##  $ length           : num  169 169 171 177 177 ...
##  $ width            : num  64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
##  $ height           : num  48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
##  $ curb.weight      : int  2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
##  $ engine.type      : Factor w/ 7 levels "dohc","dohcv",..: 1 1 6 4 4 4 4 4 4 4 ...
##  $ num.of.cylinders : Factor w/ 7 levels "eight","five",..: 3 3 4 3 2 2 2 2 2 2 ...
##  $ engine.size      : int  130 130 152 109 136 136 136 136 131 131 ...
##  $ fuel.system      : Factor w/ 8 levels "1bbl","2bbl",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ bore             : num  3.47 3.47 2.68 3.19 3.19 3.19 3.19 3.19 3.13 3.13 ...
##  $ stroke           : num  2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 3.4 ...
##  $ compression.ratio: num  9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
##  $ horsepower       : int  111 111 154 102 115 110 110 110 140 160 ...
##  $ peak.rpm         : int  5000 5000 5000 5500 5500 5500 5500 5500 5500 5500 ...
##  $ city.mpg         : int  21 21 19 24 18 19 19 19 17 16 ...
##  $ highway.mpg      : int  27 27 26 30 22 25 25 25 20 22 ...
##  $ price            : num  13495 16500 16500 13950 17450 ...

To have a short look at the dataset, we can use head() or tail() functions to see a few data samples.

head(automobileData)

##   symboling normalized.losses        make fuel.type aspiration
## 1         3                NA alfa-romero       gas        std
## 2         3                NA alfa-romero       gas        std
## 3         1                NA alfa-romero       gas        std
## 4         2               164        audi       gas        std
## 5         2               164        audi       gas        std
## 6         2                NA        audi       gas        std
##   num.of.doors  body.style drive.wheels engine.location wheel.base length
## 1          two convertible          rwd           front       88.6  168.8
## 2          two convertible          rwd           front       88.6  168.8
## 3          two   hatchback          rwd           front       94.5  171.2
## 4         four       sedan          fwd           front       99.8  176.6
## 5         four       sedan          4wd           front       99.4  176.6
## 6          two       sedan          fwd           front       99.8  177.3
##   width height curb.weight engine.type num.of.cylinders engine.size
## 1  64.1   48.8        2548        dohc             four         130
## 2  64.1   48.8        2548        dohc             four         130
## 3  65.5   52.4        2823        ohcv              six         152
## 4  66.2   54.3        2337         ohc             four         109
## 5  66.4   54.3        2824         ohc             five         136
## 6  66.3   53.1        2507         ohc             five         136
##   fuel.system bore stroke compression.ratio horsepower peak.rpm city.mpg
## 1        mpfi 3.47   2.68               9.0        111     5000       21
## 2        mpfi 3.47   2.68               9.0        111     5000       21
## 3        mpfi 2.68   3.47               9.0        154     5000       19
## 4        mpfi 3.19   3.40              10.0        102     5500       24
## 5        mpfi 3.19   3.40               8.0        115     5500       18
## 6        mpfi 3.19   3.40               8.5        110     5500       19
##   highway.mpg price
## 1          27 13495
## 2          27 16500
## 3          26 16500
## 4          30 13950
## 5          22 17450
## 6          25 15250

To have an overiew of our dataset, we can use the function summary().

summary(automobileData)

##    symboling       normalized.losses         make      fuel.type  
##  Min.   :-2.0000   Min.   : 65       toyota    : 32   diesel: 20  
##  1st Qu.: 0.0000   1st Qu.: 94       nissan    : 18   gas   :185  
##  Median : 1.0000   Median :115       mazda     : 17               
##  Mean   : 0.8341   Mean   :122       honda     : 13               
##  3rd Qu.: 2.0000   3rd Qu.:150       mitsubishi: 13               
##  Max.   : 3.0000   Max.   :256       subaru    : 12               
##                    NA's   :41        (Other)   :100               
##  aspiration  num.of.doors       body.style drive.wheels engine.location
##  std  :168       :  2     convertible: 6   4wd:  9      front:202      
##  turbo: 37   four:114     hardtop    : 8   fwd:120      rear :  3      
##              two : 89     hatchback  :70   rwd: 76                     
##                           sedan      :96                               
##                           wagon      :25                               
##                                                                        
##                                                                        
##    wheel.base         length          width           height     
##  Min.   : 86.60   Min.   :141.1   Min.   :60.30   Min.   :47.80  
##  1st Qu.: 94.50   1st Qu.:166.3   1st Qu.:64.10   1st Qu.:52.00  
##  Median : 97.00   Median :173.2   Median :65.50   Median :54.10  
##  Mean   : 98.76   Mean   :174.0   Mean   :65.91   Mean   :53.72  
##  3rd Qu.:102.40   3rd Qu.:183.1   3rd Qu.:66.90   3rd Qu.:55.50  
##  Max.   :120.90   Max.   :208.1   Max.   :72.30   Max.   :59.80  
##                                                                  
##   curb.weight   engine.type num.of.cylinders  engine.size     fuel.system
##  Min.   :1488   dohc : 12   eight :  5       Min.   : 61.0   mpfi   :94  
##  1st Qu.:2145   dohcv:  1   five  : 11       1st Qu.: 97.0   2bbl   :66  
##  Median :2414   l    : 12   four  :159       Median :120.0   idi    :20  
##  Mean   :2556   ohc  :148   six   : 24       Mean   :126.9   1bbl   :11  
##  3rd Qu.:2935   ohcf : 15   three :  1       3rd Qu.:141.0   spdi   : 9  
##  Max.   :4066   ohcv : 13   twelve:  1       Max.   :326.0   4bbl   : 3  
##                 rotor:  4   two   :  4                       (Other): 2  
##       bore          stroke      compression.ratio   horsepower   
##  Min.   :2.54   Min.   :2.070   Min.   : 7.00     Min.   : 48.0  
##  1st Qu.:3.15   1st Qu.:3.110   1st Qu.: 8.60     1st Qu.: 70.0  
##  Median :3.31   Median :3.290   Median : 9.00     Median : 95.0  
##  Mean   :3.33   Mean   :3.255   Mean   :10.14     Mean   :104.3  
##  3rd Qu.:3.59   3rd Qu.:3.410   3rd Qu.: 9.40     3rd Qu.:116.0  
##  Max.   :3.94   Max.   :4.170   Max.   :23.00     Max.   :288.0  
##  NA's   :4      NA's   :4                         NA's   :2      
##     peak.rpm       city.mpg      highway.mpg        price      
##  Min.   :4150   Min.   :13.00   Min.   :16.00   Min.   : 5118  
##  1st Qu.:4800   1st Qu.:19.00   1st Qu.:25.00   1st Qu.: 7788  
##  Median :5200   Median :24.00   Median :30.00   Median :10595  
##  Mean   :5125   Mean   :25.22   Mean   :30.75   Mean   :13207  
##  3rd Qu.:5500   3rd Qu.:30.00   3rd Qu.:34.00   3rd Qu.:16500  
##  Max.   :6600   Max.   :49.00   Max.   :54.00   Max.   :45400  
##  NA's   :2

From the overview of our data, we can see how the missing values are named if there are any. Usually, datasets have some missing values. They can be either in the form of ?, NA etc. is.na() can be used to find out how many NA values are there in the dataset. It is recommended to view seperately for each variable. Similarly, dataset$variable==“?” can be used to find out whether there are any missing values.

is.na(automobileData$symboling)

automobileData$symboling=="?"

The above commands compares the values in the columns to NA’s and ‘?’ and returns boolian if the value is NA or ‘?’. We can use sum(is.na()) or sum(automobileData$symboling=="?") functions to count the number of missing values. If the output is zero, then there are no missing values.

sum(is.na(automobileData$symboling))

## [1] 0

sum(automobileData$symboling=="?")

## [1] 0

We can see there are no missing values in the column ‘Symboling’.

To find the correlation between two numeric variables, we can use the function cor().

cor(automobileData$wheel.base,automobileData$price)

To find the correlation between a numeric and discrete variable, we can use the function biserial.cor() if the discrete variable is dichotomous. If the catagorical variable is multi level use function polyserial(). Note: you need “ltm” library for executing this function.

library("ltm")

biserial.cor(automobileData$price, automobileData$engine.location)

To find the correlation between two categorical variables, we should first create a table of the two variables and then using assocstats() function, we can find the correlation between them. Note: we need “vcd” library for that.

library("vcd")
stat_table<-table(automobileData$aspiration, automobileData$price)
assocstats(stat_table)

We can draw a plot between two variables using qplot() function, part of ‘ggplot2’ package.

library("ggplot2")
qplot(automobileData$symboling, automobileData$price, colour=I("red"), size=I(4))

We can draw a 3D plot using plot3d function.

library("plot3D")
library("rgl")
plot3d(automobileData$symboling,automobileData$price,automobileData$length,col="blue")

Univariate Analysis:

Now that we have seen the summary of whole dataset, we can move into univariate analysis, where we take one variable at a time and try to find any missing values or underlying outliers.

Symboling:

This variable denotes the risk score of the vehicle. +3 indicates high rish vehicle and -3 indicates probably safe vehicle.

summary(automobileData$symboling)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.0000  0.0000  1.0000  0.8341  2.0000  3.0000

library("ggplot2")
ggplot(automobileData, aes(y=symboling, x = 1)) + geom_boxplot(outlier.color = "red")

summary() function shows no missing values and boxplot is clean, which indicates no outliers.

normallized.losses

It’s relative average loss payment per insured vehicle year for insurance company. This value is normalized for all autos within a particular size classification and represents the average loss per car per year. Also this variable has 41 missing values.

summary(automobileData$normalized.losses)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      65      94     115     122     150     256      41

Summary shows mean and medians are close enough, meaning a good distribution of data in this field.

ggplot(automobileData, aes(y=normalized.losses, x = 1)) + geom_boxplot(outlier.color = "red")

## Warning: Removed 41 rows containing non-finite values (stat_boxplot).

From plot we can see one value sticking out of the distribution.

Make:

This variable shows the name of the maker company of the car.

summary(automobileData$make)

##   alfa-romero          audi           bmw     chevrolet         dodge 
##             3             7             8             3             9 
##         honda         isuzu        jaguar         mazda mercedes-benz 
##            13             4             3            17             8 
##       mercury    mitsubishi        nissan        peugot      plymouth 
##             1            13            18            11             7 
##       porsche       renault          saab        subaru        toyota 
##             5             2             6            12            32 
##    volkswagen         volvo 
##            12            11

str(automobileData$make)

##  Factor w/ 22 levels "alfa-romero",..: 1 1 1 2 2 2 2 2 2 2 ...

Summary and structure shows there are 22 maker companies in the dataset.

qplot(data = automobileData, x = make)

Fuel Type:

summary(automobileData$fuel.type)

## diesel    gas 
##     20    185

dichotomous variable : diesel and gas, representing the fuel type of the vehicle.

qplot(data = automobileData, x = fuel.type)

Clearly 90% vehicles work on Gasoline. No missing values.

Aspiration:

Represents if the vehicle is turbocharged or standard(normally) aspireted.

summary(automobileData$aspiration)

##   std turbo 
##   168    37

qplot(data = automobileData, x = aspiration)

18% vehicles are turbo charged rest follows standard air intake.

Num-of-Doors:

Catagorical variable with 2 levels: four and two

summary(automobileData$num.of.doors)

##      four  two 
##    2  114   89

qplot(data = automobileData, x = num.of.doors)

This variable contains 2 missing values.

Body-Style:

summary(automobileData$body.style)

## convertible     hardtop   hatchback       sedan       wagon 
##           6           8          70          96          25

qplot(data = automobileData, x = body.style)

Catagorical variable showing the type of body: convertible, hardtop, hatchback, sedan, wagon. Clearly Sedan and hatchback body styles dominates others.

Drive-wheels:

3 catagories of the drive wheels type: 4wd, fwd, rwd

summary(automobileData$drive.wheels)

## 4wd fwd rwd 
##   9 120  76

qplot(data = automobileData, x = drive.wheels)

Engine-location:

If the engine is placed in front or the rear.

qplot(data = automobileData, x = engine.location)

Wheel-Base:

summary(automobileData$wheel.base)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   86.60   94.50   97.00   98.76  102.40  120.90

ggplot(automobileData, aes(y=wheel.base, x = 1)) + geom_boxplot(outlier.color = "red")

Length:

Continuous variable representing the length of the vehicle.

summary(automobileData$length)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   141.1   166.3   173.2   174.0   183.1   208.1

ggplot(automobileData, aes(y=length, x = 1)) + geom_boxplot(outlier.color = "red")

At the low length one variable seem to be out of distribution which can be avoided.

Width:

Continuous variable representing width of the vehicle.

summary(automobileData$width)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   60.30   64.10   65.50   65.91   66.90   72.30

Mean and medians are nearby, chances of outliers are low.

ggplot(automobileData, aes(y=width, x = 1)) + geom_boxplot(outlier.color = "red")

4 values seem to be out of the distribution.

height:

Continuous variable, representing height of the vehicle.

summary(automobileData$height)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   47.80   52.00   54.10   53.72   55.50   59.80

Mean and medians are close, seem to have continuous distribution.

ggplot(automobileData, aes(y=height, x = 1)) + geom_boxplot(outlier.color = "red")

No outlier in this variable.

Curb-weight:

continuous variable Representing the total weight of the vehicle without the occupents and luggage.

summary(automobileData$curb.weight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1488    2145    2414    2556    2935    4066

ggplot(automobileData, aes(y=curb.weight, x = 1)) + geom_boxplot(outlier.color = "red")

Both summary and boxplot proves no sign of outliers.

Engine-type:

Catagorical variable showing they engine type used. 7 catagories in the field.

summary(automobileData$engine.type)

##  dohc dohcv     l   ohc  ohcf  ohcv rotor 
##    12     1    12   148    15    13     4

qplot(data = automobileData, x = engine.type)

Looks like a large number of vehicles has OHC(Over Head Cam-Shaft) engines.

num-of-cylinders:

Again a catagorical variable stating the number of cylinder the car engine have.

summary(automobileData$num.of.cylinders)

##  eight   five   four    six  three twelve    two 
##      5     11    159     24      1      1      4

qplot(data =automobileData, x = num.of.cylinders)

77.5% vehicles have 4-cylinder engines in use.

Engine-size:

Continuous variable for the size of the engine.

summary(automobileData$engine.size)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    61.0    97.0   120.0   126.9   141.0   326.0

ggplot(automobileData, aes(y = engine.size, x = 1)) + geom_boxplot(outlier.color = "red")

Boxplot shows 6 values as outliers, which are not following the distribution of the data.

Fuel System:

Catagorical variable showing the fuel system used in the car with 8 levels.

summary(automobileData$fuel.system)

## 1bbl 2bbl 4bbl  idi  mfi mpfi spdi spfi 
##   11   66    3   20    1   94    9    1

qplot(data=automobileData, x = fuel.system)

plot shows MPFI and 2BBL fuel system are dominating.

Bore:

Continuous variable presenting the engine bore size:

summary(automobileData$bore)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2.54    3.15    3.31    3.33    3.59    3.94       4

mean and medians are nearby but there are 4 missing values.

ggplot(automobileData, aes(y = bore, x = 1)) + geom_boxplot(outlier.color = "red")

## Warning: Removed 4 rows containing non-finite values (stat_boxplot).

Boxplot seem clear, no outliers.

Stroke:

Continuous variable for the stroke length of the engine.

summary(automobileData$stroke)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   2.070   3.110   3.290   3.255   3.410   4.170       4

Summary shows mean and medians to be close, but this variable contains a 4 missing values.

ggplot(automobileData, aes(y=stroke, x=1)) + geom_boxplot(outlier.color = "red")

## Warning: Removed 4 rows containing non-finite values (stat_boxplot).

Boxplot shows few outliers but these can be avoided.

Compression-Ratio:

Continuous variable giving the compression ratio of each car-engine.

summary(automobileData$compression.ratio)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00    8.60    9.00   10.14    9.40   23.00

ggplot(automobileData, aes(y = compression.ratio, x = 1)) + geom_boxplot(outlier.color = "red")

boxplot shows outliers in the variable.

Horsepower:

Continuous variable ranging from 48 to 288.

summary(automobileData$horsepower)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    48.0    70.0    95.0   104.3   116.0   288.0       2

Summary shows 2 missing values.

ggplot(automobileData, aes(y=horsepower, x = 1)) + geom_boxplot(outlier.color = "red")

## Warning: Removed 2 rows containing non-finite values (stat_boxplot).

Boxplot shows 2 values above 200 to be outliers.

Peak-rpm:

Continuous variable for the peak rpm value of the car engine.

summary(automobileData$peak.rpm)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    4150    4800    5200    5125    5500    6600       2

Ranges from 4150 to 6600, with 2 missing values.

ggplot(automobileData, aes(y=peak.rpm, x=1)) + geom_boxplot(outlier.color = "red")

## Warning: Removed 2 rows containing non-finite values (stat_boxplot).

The max value 6600 seem to be fall out of the distribution.

City-mpg:

Continuous variable for mileage per gallon fuel in when driven in city.

summary(automobileData$city.mpg)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   19.00   24.00   25.22   30.00   49.00

ggplot(automobileData, aes(y=city.mpg, x=1)) + geom_boxplot(outlier.color="red")

Boxplot shows 2 values above 45 as outliers.

Highway-mpg:

Continuous variable for mileage per gallon fuel in when driven on highway.

summary(automobileData$highway.mpg)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16.00   25.00   30.00   30.75   34.00   54.00

ggplot(automobileData, aes(y = highway.mpg, x=1)) + geom_boxplot(outlier.color = "red")

Boxplot shows values above 47mpg to be outliers.

Price:

Price is our output/predictive variable. A continuous variable showing the price of the car/vehicle.

summary(automobileData$price)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5118    7788   10600   13210   16500   45400

price ranges form 5118 to 45400.

ggplot(automobileData, aes(y=price, x=1)) + geom_boxplot(outlier.color = "red")

According to boxplot, cars pricier than 30000 goes beyond the regular distribution of this variable.

Training Set

Now, I have removed all the variables which has missing values except the target variable in which we have replaced them with mean.

traindata<-VehicleData[,c(-2,-6,-19,-20,-22,-23)]

traindata<-automobileData[,c(-2,-6,-19,-20,-22,-23)]
st_table<-table(traindata$make)
st_table
barplot(st_table)

Data Cleaning

NOTE: The dataset has very few observation managing the outliers could make us lose some of the information in the data. There are very few outliers, we have left them untouched.

We will create a new dataframe keeping the original unharmed.

automobileData2 <- automobileData

Managing Missing Values

We have 7 variables with missing values:

Variable	Missing Values
normalized.losses	41 (20%)
num.of.doors	2 (0.9%)
bore	4 (1.9%)
stroke	4 (1.9%)
horsepower	2 (0.9%)
peak.rpm	2 (09.%)

normalized.losses

This variable has around 20% missing values. Textbook method to treat missing values is to create another indicator variable stating if the values are missing or non-missing and replace the original with mean or median.

First we create an indicator variable indicating missing and non-missing values.

automobileData2$normalized.losses_ind <- as.factor(ifelse(is.na(automobileData$normalized.losses),"missing", "non-missing"))

summary(automobileData2$normalized.losses_ind)

##     missing non-missing 
##          41         164

Next we can replace missing values with mean in the original column.

automobileData2$normalized.losses <- ifelse(is.na(automobileData$normalized.losses),122, automobileData$normalized.losses)

summary(automobileData2$normalized.losses)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      65     101     122     122     137     256

num.of.doors

This variable contains just two missing values, we can just replace the missing values with either of the dichtomouns value. As the missing value is denoted by Factor level : “”, we can replace “”, with “two” or “four” using levels function.

levels(automobileData2$num.of.doors)[levels(automobileData2$num.of.doors)==""] <- "two"
summary(automobileData2$num.of.doors)

##  two four 
##   91  114

bore

It’s a continuous variable we can replace the missing values with mean:3.33

automobileData2$bore <- ifelse(is.na(automobileData$bore),3.33, automobileData$bore)
summary(automobileData2$bore)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.54    3.15    3.31    3.33    3.58    3.94

Stroke

We can replace 4 missing values with the mean : 3.255

automobileData2$stroke <- ifelse(is.na(automobileData$stroke),3.225, automobileData$stroke)
summary(automobileData2$stroke)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.070   3.110   3.290   3.255   3.410   4.170

Horsepower

Two missing values can be replaced with mean: 104.3

automobileData2$horsepower <- ifelse(is.na(automobileData$horsepower),104.3, automobileData$horsepower)

summary(automobileData2$horsepower)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    48.0    70.0    95.0   104.3   116.0   288.0

Peak-rpm

2 missing values can be replaced with mean : 5125

automobileData2$peak.rpm <- ifelse(is.na(automobileData$peak.rpm),5125, automobileData$peak.rpm)

summary(automobileData2$peak.rpm)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4150    4800    5200    5125    5500    6600

Combining Levels

Some catagorical variables have high number of levels. While training a model each level will have a dummy variable again and we don’t have enough observations/row to fit a model with that many variables.

To overcome this problem, we will combine the levels which are less than 5 in number to a same level in all the variables which have high number of level.

Variables make, engine.type, num.of.cylinders and fuel.system has more than 5 levels.

Make

levels that represent low frequency can be merged into single value. We will take levels that represents last ~10 percentile of the distribution.

summary(automobileData2$make)

##   alfa-romero          audi           bmw     chevrolet         dodge 
##             3             7             8             3             9 
##         honda         isuzu        jaguar         mazda mercedes-benz 
##            13             4             3            17             8 
##       mercury    mitsubishi        nissan        peugot      plymouth 
##             1            13            18            11             7 
##       porsche       renault          saab        subaru        toyota 
##             5             2             6            12            32 
##    volkswagen         volvo 
##            12            11

sort(prop.table(table(automobileData$make))*100)

## 
##       mercury       renault   alfa-romero     chevrolet        jaguar 
##     0.4878049     0.9756098     1.4634146     1.4634146     1.4634146 
##         isuzu       porsche          saab          audi      plymouth 
##     1.9512195     2.4390244     2.9268293     3.4146341     3.4146341 
##           bmw mercedes-benz         dodge        peugot         volvo 
##     3.9024390     3.9024390     4.3902439     5.3658537     5.3658537 
##        subaru    volkswagen         honda    mitsubishi         mazda 
##     5.8536585     5.8536585     6.3414634     6.3414634     8.2926829 
##        nissan        toyota 
##     8.7804878    15.6097561

levels(automobileData2$make)[levels(automobileData$make)%in%c("mercury", "renault", "alfa-romero", "chevrolet", "jaguar", "isuzu", "porsche", "saab", "audi", "plymouth")] <- "other"

summary(automobileData2$make)

##         other           bmw         dodge         honda         mazda 
##            41             8             9            13            17 
## mercedes-benz    mitsubishi        nissan        peugot        subaru 
##             8            13            18            11            12 
##        toyota    volkswagen         volvo 
##            32            12            11

Engine-Type

dohcv, rotor, dohc and l represents around 12% of the values, we will merge these values into one.

summary(automobileData2$engine.type)

##  dohc dohcv     l   ohc  ohcf  ohcv rotor 
##    12     1    12   148    15    13     4

sort(prop.table(table(automobileData2$engine.type))*100)

## 
##      dohcv      rotor       dohc          l       ohcv       ohcf 
##  0.4878049  1.9512195  5.8536585  5.8536585  6.3414634  7.3170732 
##        ohc 
## 72.1951220

levels(automobileData2$engine.type)[levels(automobileData2$engine.type)%in%c("dohcv", "rotor", "dohc", "l")] <- "other"

summary(automobileData2$engine.type)

## other   ohc  ohcf  ohcv 
##    29   148    15    13

num-of-cylinder We can convert this variable into a numeric variable.

library(plyr)
automobileData2$num.of.cylinders <- as.character(automobileData2$num.of.cylinders)

automobileData2$num.of.cylinders <- mapvalues(automobileData2$num.of.cylinders, from=c("eight", "five", "four", "six", "three", "twelve", "two"), to=c(8,5,4,6,3,12,2))

automobileData2$num.of.cylinders <- as.numeric(automobileData2$num.of.cylinders)

Fuel-system out of 8 levels, 4 levels represents only 12% of the data, we can combine those 4 levels into one.

summary(automobileData2$fuel.system)

## 1bbl 2bbl 4bbl  idi  mfi mpfi spdi spfi 
##   11   66    3   20    1   94    9    1

levels(automobileData2$fuel.system)[levels(automobileData2$fuel.system)%in%c("1bbl","4bbl", "mfi", "spdi","spfi")] <- "other"

summary(automobileData2$fuel.system)

## other  2bbl   idi  mpfi 
##    25    66    20    94

Model Building

Now that we have curated the data, we can move to model building part. Our predictive variable is a continuous variable, we can approach this problem with regression algorithms.

Linear Regression Model.

We will train our data on simple linear regression algorithm to predict our price variable.

Vif gives the importance of a variable. In general, it should be less than 5.

summary(automobileData2)

##    symboling       normalized.losses         make     fuel.type  
##  Min.   :-2.0000   Min.   : 65       other     :41   diesel: 20  
##  1st Qu.: 0.0000   1st Qu.:101       toyota    :32   gas   :185  
##  Median : 1.0000   Median :122       nissan    :18               
##  Mean   : 0.8341   Mean   :122       mazda     :17               
##  3rd Qu.: 2.0000   3rd Qu.:137       honda     :13               
##  Max.   : 3.0000   Max.   :256       mitsubishi:13               
##                                      (Other)   :71               
##  aspiration  num.of.doors       body.style drive.wheels engine.location
##  std  :168   two : 91     convertible: 6   4wd:  9      front:202      
##  turbo: 37   four:114     hardtop    : 8   fwd:120      rear :  3      
##                           hatchback  :70   rwd: 76                     
##                           sedan      :96                               
##                           wagon      :25                               
##                                                                        
##                                                                        
##    wheel.base         length          width           height     
##  Min.   : 86.60   Min.   :141.1   Min.   :60.30   Min.   :47.80  
##  1st Qu.: 94.50   1st Qu.:166.3   1st Qu.:64.10   1st Qu.:52.00  
##  Median : 97.00   Median :173.2   Median :65.50   Median :54.10  
##  Mean   : 98.76   Mean   :174.0   Mean   :65.91   Mean   :53.72  
##  3rd Qu.:102.40   3rd Qu.:183.1   3rd Qu.:66.90   3rd Qu.:55.50  
##  Max.   :120.90   Max.   :208.1   Max.   :72.30   Max.   :59.80  
##                                                                  
##   curb.weight   engine.type num.of.cylinders  engine.size    fuel.system
##  Min.   :1488   other: 29   Min.   : 2.00    Min.   : 61.0   other:25   
##  1st Qu.:2145   ohc  :148   1st Qu.: 4.00    1st Qu.: 97.0   2bbl :66   
##  Median :2414   ohcf : 15   Median : 4.00    Median :120.0   idi  :20   
##  Mean   :2556   ohcv : 13   Mean   : 4.38    Mean   :126.9   mpfi :94   
##  3rd Qu.:2935               3rd Qu.: 4.00    3rd Qu.:141.0              
##  Max.   :4066               Max.   :12.00    Max.   :326.0              
##                                                                         
##       bore          stroke      compression.ratio   horsepower   
##  Min.   :2.54   Min.   :2.070   Min.   : 7.00     Min.   : 48.0  
##  1st Qu.:3.15   1st Qu.:3.110   1st Qu.: 8.60     1st Qu.: 70.0  
##  Median :3.31   Median :3.290   Median : 9.00     Median : 95.0  
##  Mean   :3.33   Mean   :3.255   Mean   :10.14     Mean   :104.3  
##  3rd Qu.:3.58   3rd Qu.:3.410   3rd Qu.: 9.40     3rd Qu.:116.0  
##  Max.   :3.94   Max.   :4.170   Max.   :23.00     Max.   :288.0  
##                                                                  
##     peak.rpm       city.mpg      highway.mpg        price      
##  Min.   :4150   Min.   :13.00   Min.   :16.00   Min.   : 5118  
##  1st Qu.:4800   1st Qu.:19.00   1st Qu.:25.00   1st Qu.: 7788  
##  Median :5200   Median :24.00   Median :30.00   Median :10595  
##  Mean   :5125   Mean   :25.22   Mean   :30.75   Mean   :13207  
##  3rd Qu.:5500   3rd Qu.:30.00   3rd Qu.:34.00   3rd Qu.:16500  
##  Max.   :6600   Max.   :49.00   Max.   :54.00   Max.   :45400  
##                                                                
##  normalized.losses_ind
##  missing    : 41      
##  non-missing:164      
##                       
##                       
##                       
##                       
##

str(automobileData2)

## 'data.frame':    205 obs. of  27 variables:
##  $ symboling            : int  3 3 1 2 2 2 1 1 1 0 ...
##  $ normalized.losses    : num  122 122 122 164 164 122 158 122 158 122 ...
##  $ make                 : Factor w/ 13 levels "other","bmw",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ fuel.type            : Factor w/ 2 levels "diesel","gas": 2 2 2 2 2 2 2 2 2 2 ...
##  $ aspiration           : Factor w/ 2 levels "std","turbo": 1 1 1 1 1 1 1 1 2 2 ...
##  $ num.of.doors         : Factor w/ 2 levels "two","four": 1 1 1 2 2 1 2 2 2 1 ...
##  $ body.style           : Factor w/ 5 levels "convertible",..: 1 1 3 4 4 4 4 5 4 3 ...
##  $ drive.wheels         : Factor w/ 3 levels "4wd","fwd","rwd": 3 3 3 2 1 2 2 2 2 1 ...
##  $ engine.location      : Factor w/ 2 levels "front","rear": 1 1 1 1 1 1 1 1 1 1 ...
##  $ wheel.base           : num  88.6 88.6 94.5 99.8 99.4 ...
##  $ length               : num  169 169 171 177 177 ...
##  $ width                : num  64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
##  $ height               : num  48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
##  $ curb.weight          : int  2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
##  $ engine.type          : Factor w/ 4 levels "other","ohc",..: 1 1 4 2 2 2 2 2 2 2 ...
##  $ num.of.cylinders     : num  4 4 6 4 5 5 5 5 5 5 ...
##  $ engine.size          : int  130 130 152 109 136 136 136 136 131 131 ...
##  $ fuel.system          : Factor w/ 4 levels "other","2bbl",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ bore                 : num  3.47 3.47 2.68 3.19 3.19 3.19 3.19 3.19 3.13 3.13 ...
##  $ stroke               : num  2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 3.4 ...
##  $ compression.ratio    : num  9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
##  $ horsepower           : num  111 111 154 102 115 110 110 110 140 160 ...
##  $ peak.rpm             : num  5000 5000 5000 5500 5500 5500 5500 5500 5500 5500 ...
##  $ city.mpg             : int  21 21 19 24 18 19 19 19 17 16 ...
##  $ highway.mpg          : int  27 27 26 30 22 25 25 25 20 22 ...
##  $ price                : num  13495 16500 16500 13950 17450 ...
##  $ normalized.losses_ind: Factor w/ 2 levels "missing","non-missing": 1 1 1 2 2 1 2 1 2 1 ...

traindata1<-automobileData2[,c(-2,-3,-13,-14,-16,-22)]
dim(automobileData2)

## [1] 205  27

LMmodel<-lm(price~.,data=automobileData2)

summary(LMmodel)

In the summary table we see 2 variables ‘engine.typeohcf’, ‘fuel.systemidi’ shows “NA” in all the columns. It’s because we ran into perfect collinearity between these variables and some other. In other words, these variables are fully correlated to (or say representing exact impact as) one or more predictor variables. Using alias() function will help us to find out the variables these tow are highly correlated to.

alias(LMmodel)

## Model :
## price ~ symboling + normalized.losses + make + fuel.type + aspiration + 
##     num.of.doors + body.style + drive.wheels + engine.location + 
##     wheel.base + length + width + height + curb.weight + engine.type + 
##     num.of.cylinders + engine.size + fuel.system + bore + stroke + 
##     compression.ratio + horsepower + peak.rpm + city.mpg + highway.mpg + 
##     normalized.losses_ind
## 
## Complete :
##                 (Intercept) symboling normalized.losses makebmw makedodge
## engine.typeohcf  0           0         0                 0       0       
## fuel.systemidi   1           0         0                 0       0       
##                 makehonda makemazda makemercedes-benz makemitsubishi
## engine.typeohcf  0         0         0                 0            
## fuel.systemidi   0         0         0                 0            
##                 makenissan makepeugot makesubaru maketoyota makevolkswagen
## engine.typeohcf  0          0          1          0          0            
## fuel.systemidi   0          0          0          0          0            
##                 makevolvo fuel.typegas aspirationturbo num.of.doorsfour
## engine.typeohcf  0         0            0               0              
## fuel.systemidi   0        -1            0               0              
##                 body.stylehardtop body.stylehatchback body.stylesedan
## engine.typeohcf  0                 0                   0             
## fuel.systemidi   0                 0                   0             
##                 body.stylewagon drive.wheelsfwd drive.wheelsrwd
## engine.typeohcf  0               0               0             
## fuel.systemidi   0               0               0             
##                 engine.locationrear wheel.base length width height
## engine.typeohcf  1                   0          0      0     0    
## fuel.systemidi   0                   0          0      0     0    
##                 curb.weight engine.typeohc engine.typeohcv
## engine.typeohcf  0           0              0             
## fuel.systemidi   0           0              0             
##                 num.of.cylinders engine.size fuel.system2bbl
## engine.typeohcf  0                0           0             
## fuel.systemidi   0                0           0             
##                 fuel.systemmpfi bore stroke compression.ratio horsepower
## engine.typeohcf  0               0    0      0                 0        
## fuel.systemidi   0               0    0      0                 0        
##                 peak.rpm city.mpg highway.mpg
## engine.typeohcf  0        0        0         
## fuel.systemidi   0        0        0         
##                 normalized.losses_indnon-missing
## engine.typeohcf  0                              
## fuel.systemidi   0

Improving model by chunking insignificant variables Variable which shows P value more than 5% will impact our Adj. $R^2$ value in negative way. We will chunk those variables and build a new model with rest of the variables.

LMmodelNew1<-lm(price~make+aspiration+engine.location+wheel.base +width+curb.weight +num.of.cylinders+engine.size+bore+stroke+peak.rpm, data=automobileData2)

summary(LMmodelNew1)

We see that there is a very small loss in Adjusted R-squared value after removing the insignificant variables based on their P-values. This insignificant loss in R-squared proves we eliminated variables correctly.

To check further collinearity we can use vif() function form package: car

library(car)
vif(LMmodelNew1)

##                       GVIF Df GVIF^(1/(2*Df))
## make             33.297023 12        1.157267
## aspiration        1.745709  1        1.321253
## engine.location   1.523864  1        1.234449
## wheel.base        5.820991  1        2.412673
## width             6.516676  1        2.552778
## curb.weight      11.488619  1        3.389486
## num.of.cylinders 10.541836  1        3.246819
## engine.size      25.004334  1        5.000433
## bore              5.441816  1        2.332770
## stroke            2.816223  1        1.678161
## peak.rpm          2.075670  1        1.440718

Linear Regression Model Validation

So far we built a model and tried improving We will use caTools to divides the data into two subsets. train_new is our train data. hold_out is our test data. rmpeTrain is the training error. rmpeTest is the testing error.

library(caTools)
set.seed(90) # to get the same split everytime
spl = sample.split(automobileData2$price, SplitRatio = 0.8)
train_data = subset(automobileData2, spl==TRUE)
test_data = subset(automobileData2, spl==FALSE)

LMmodel2<-lm(price ~ aspiration+engine.location+wheel.base +width +engine.size+bore+stroke+peak.rpm, data=train_data)
summary(LMmodel2)

## 
## Call:
## lm(formula = price ~ aspiration + engine.location + wheel.base + 
##     width + engine.size + bore + stroke + peak.rpm, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12773.3  -1502.0   -294.2   1254.6  15037.9 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -4.857e+04  1.226e+04  -3.960 0.000114 ***
## aspirationturbo      2.248e+03  6.941e+02   3.238 0.001471 ** 
## engine.locationrear  1.455e+04  2.519e+03   5.775 4.08e-08 ***
## wheel.base           2.917e+02  7.179e+01   4.063 7.68e-05 ***
## width                3.434e+02  2.315e+02   1.484 0.139904    
## engine.size          1.497e+02  1.129e+01  13.257  < 2e-16 ***
## bore                -2.782e+03  1.222e+03  -2.276 0.024225 *  
## stroke              -2.924e+03  8.794e+02  -3.325 0.001104 ** 
## peak.rpm             1.856e+00  5.816e-01   3.191 0.001716 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3071 on 155 degrees of freedom
## Multiple R-squared:  0.8538, Adjusted R-squared:  0.8463 
## F-statistic: 113.2 on 8 and 155 DF,  p-value: < 2.2e-16

With training data this new model is giving R-squared value of 0.86% We can add few variables back into the model and try to improve the R-squared value.

Let’s make predictions on this new model for both training and testing data sets.

Trprediction<-predict(LMmodel2,train_data)
Trtarget<-train_data$price
rmpeTrain<-mean(abs(Trtarget-Trprediction)/Trtarget)*100
print(100-rmpeTrain)

## [1] 84.17113

Tsprediction<-predict(LMmodel2,test_data)
Tstarget<-test_data$price
rmpeTest<-mean(abs(Tstarget-Tsprediction)/Tstarget)*100
print(100-rmpeTest)

## [1] 76.5475

From above we can see we got 84.2% accuracy on testing set, and 76.5% accuracy on the test data.

We can also perform cross validation with 10 folds to get more precise performance of our model.

library(DAAG)

## Loading required package: lattice

## 
## Attaching package: 'DAAG'

## The following object is masked from 'package:car':
## 
##     vif

## The following object is masked from 'package:plyr':
## 
##     ozone

crossVal<-cv.lm(automobileData2, LMmodel2, 10)

## Analysis of Variance Table
## 
## Response: price
##                  Df   Sum Sq  Mean Sq F value  Pr(>F)    
## aspiration        1 3.97e+08 3.97e+08   37.72 4.5e-09 ***
## engine.location   1 1.47e+09 1.47e+09  140.14 < 2e-16 ***
## wheel.base        1 4.98e+09 4.98e+09  472.96 < 2e-16 ***
## width             1 1.71e+09 1.71e+09  162.76 < 2e-16 ***
## engine.size       1 1.75e+09 1.75e+09  166.07 < 2e-16 ***
## bore              1 3.29e+07 3.29e+07    3.13 0.07854 .  
## stroke            1 1.01e+08 1.01e+08    9.59 0.00224 ** 
## peak.rpm          1 1.24e+08 1.24e+08   11.79 0.00073 ***
## Residuals       196 2.06e+09 1.05e+07                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## 
## fold 1 
## Observations in test set: 20 
##                 2   52    70    72   93   95   105   107   118   125   128
## Predicted   10468 6418 22989 33325 7083 7083 18735 20746 19248 15269 34528
## cvpred      10302 6360 22786 33189 7120 7120 18597 20684 19103 15229 34778
## price       16500 6095 28176 34184 6849 7299 17199 18399 18150 12764 34028
## CV residual  6198 -265  5390   995 -271  179 -1398 -2285  -953 -2465  -750
##               136   138   139   151  154  156   169  183   197
## Predicted   12526 14687  7384  6785 6785 6785 13018 7983 16334
## cvpred      12489 14481  7227  6736 6736 6736 13111 7964 16378
## price       15510 18620  5118  5348 6918 8778  9639 7775 15985
## CV residual  3021  4139 -2109 -1388  182 2042 -3472 -189  -393
## 
## Sum of squares = 1.32e+08    Mean square = 6621678    n = 20 
## 
## fold 2 
## Observations in test set: 21 
##                 1   28    29    41    42   44   53    61    63    64    74
## Predicted   10468 9288 11746  8985 10390 6850 6418 11279 11279 11001 42414
## cvpred      10968 8746 11071  8078 10424 5986 6493 11649 11649 11371 42494
## price       13495 8558  8921 10295 12945 6785 6795  8495 10245 10795 40960
## CV residual  2527 -188 -2150  2217  2521  799  302 -3154 -1404  -576 -1534
##                85    88  121  123    130   135   137   143  158  165
## Predicted   15269 11840 7098 8139  25936 17403 14687  9106 7655 7142
## cvpred      15292 11633 6957 7980  28338 17879 14778  9471 7638 7080
## price       14489  9279 6229 7609  13207 15040 18150  7775 7198 8238
## CV residual  -803 -2354 -728 -371 -15131 -2839  3372 -1696 -440 1158
## 
## Sum of squares = 2.97e+08    Mean square = 14157321    n = 21 
## 
## fold 3 
## Observations in test set: 21 
##                 8    11    14    40    50    51    60   99   103   119
## Predicted   19175 12286 16147 10390 42411  6418 11279 7235 20323  7098
## cvpred      19363 12250 16532 10510 44911  6498 11285 7229 21028  7207
## price       18920 16430 21105  8845 36000  5195  8845 8249 14399  5572
## CV residual  -443  4180  4573 -1665 -8911 -1303 -2440 1020 -6629 -1635
##               120   132   149  162   181   186   192   196   198   204
## Predicted    9288 11089  9773 7655 20235  9962 15459 16334 16334 20701
## cvpred       9278 11186  9774 7652 20711 10004 15760 16329 16329 20657
## price        7957  9895  8013 8358 15690  8195 13295 13415 16515 22470
## CV residual -1321 -1291 -1761  706 -5021 -1809 -2465 -2914   186  1813
##               205
## Predicted   20136
## cvpred      19841
## price       22625
## CV residual  2784
## 
## Sum of squares = 2.37e+08    Mean square = 11295531    n = 21 
## 
## fold 4 
## Observations in test set: 21 
##                 19    24   25   32   33    47    59    65    73    82
## Predicted     57.4  9288 7098 6214 6317 10185  7294 11279 27864 10357
## cvpred       119.7  9483 7039 5882 6201 10016  7042 11084 27179 10217
## price       5151.0  7957 6229 6855 5399 11048 15645 11245 35056  8499
## CV residual 5031.3 -1526 -810  973 -802  1032  8603   161  7877 -1718
##                83   97   102   106   117   166   173   175   176   200
## Predicted   15292 7083 20323 20431 17924 10243 13018 12153 10885 16413
## cvpred      15323 7040 20302 20040 18114 10235 12949 12318 10841 16586
## price       12629 7499 13499 19699 17950  9298 17669 10698  9988 18950
## CV residual -2694  459 -6803  -341  -164  -937  4720 -1620  -853  2364
##               201
## Predicted   18440
## cvpred      18405
## price       16845
## CV residual -1560
## 
## Sum of squares = 2.58e+08    Mean square = 12305150    n = 21 
## 
## fold 5 
## Observations in test set: 21 
##                 9    15   21    22   26    38    48    56    57    62
## Predicted   20360 17825 7177  7098 7098 10390 31575  5992  5992 11279
## cvpred      19615 17888 7278  7129 7129 10359 31939  5629  5629 11085
## price       23875 24565 6575  5572 6692  7895 32250 10945 11845 10595
## CV residual  4260  6677 -703 -1557 -437 -2464   311  5316  6216  -490
##                75   79    84   104   111  141  157     177   182   199
## Predicted   39784 7670 15269 20323 19528 7281 7655 10885.4 18933 16413
## cvpred      39719 7579 15339 20445 19879 7328 7653 10835.7 19149 16506
## price       45400 6669 14869 13499 13860 7603 6938 10898.0 15750 18420
## CV residual  5681 -910  -470 -6946 -6019  275 -715    62.3 -3399  1914
##               203
## Predicted   23971
## cvpred      23907
## price       21485
## CV residual -2422
## 
## Sum of squares = 2.79e+08    Mean square = 13268277    n = 21 
## 
## fold 6 
## Observations in test set: 21 
##                 3     7   31    34   35    39    49    58    67    87
## Predicted   15394 19175 3984  8073 8073 10390 31575  5992 12342 10357
## cvpred      15190 18870 3959  7972 7972 10227 31057  6012 12057 10286
## price       16500 17710 6479  6529 7129  9095 35550 13645 18344  8189
## CV residual  1310 -1160 2520 -1443 -843 -1132  4493  7633  6287 -2097
##               101   113   116  140   142   145   147   167   170   172
## Predicted   10615 17924 15041 7279  9849  9798  9798 10243 13018 13018
## cvpred      10527 17848 14884 7665 10165 10117 10117 10379 12979 12979
## price        9549 16900 16630 7053  7126  9233  7463  9538  9989 11549
## CV residual  -978  -948  1746 -612 -3039  -884 -2654  -841 -2990 -1430
##               187
## Predicted    9962
## cvpred       9852
## price        8495
## CV residual -1357
## 
## Sum of squares = 1.72e+08    Mean square = 8178094    n = 21 
## 
## fold 7 
## Observations in test set: 20 
##                 4    10    17    37    45   55    71    77   94   112  122
## Predicted   11427 16935 25109  8710  7177 6301 25983  7670 7083 17586 7098
## cvpred      11333 16739 24597  8517  7145 6396 25684  7652 7075 17561 7059
## price       13950 13207 41315  7295 13207 7395 31600  5389 7349 15580 6692
## CV residual  2617 -3532 16718 -1222  6062  999  5916 -2263  274 -1981 -367
##               129  164   168   171   174   178   189  191   193
## Predicted   34528 7142 13018 13018 10885 10885 10426 8933 10639
## cvpred      33278 7221 12917 12917 10933 10933 10343 8824 10709
## price       37028 8058  8449 11199  8948 11248  9995 9980 13845
## CV residual  3750  837 -4468 -1718 -1985   315  -348 1156  3136
## 
## Sum of squares = 4.35e+08    Mean square = 21766366    n = 20 
## 
## fold 8 
## Observations in test set: 20 
##                12    36   54    81    86    90   98   100   115   126
## Predicted   12286  8786 6418 11840 10357  7083 7083 10615 19528 15628
## cvpred      12097  9047 6545 12021 10378  7191 7191 10662 19921 14868
## price       16925  7295 6695  9959  6989  5499 7999  8949 17075 22018
## CV residual  4828 -1752  150 -2062 -3389 -1692  808 -1713 -2846  7150
##                 148  152  153     161   179   180   184  185    188   194
## Predicted   10541.9 6785 6785 7655.50 20452 20452  9962 7983 9121.3 11944
## cvpred      10140.3 7012 7012 7735.32 20660 20660 10065 8268 9550.6 12027
## price       10198.0 6338 6488 7738.00 16558 15998  7975 7995 9495.0 12290
## CV residual    57.7 -674 -524    2.68 -4102 -4662 -2090 -273  -55.6   263
## 
## Sum of squares = 1.52e+08    Mean square = 7580236    n = 20 
## 
## fold 9 
## Observations in test set: 20 
##                 5    16   20   27      43    66    69    76    78    80
## Predicted   14944 24589 7177 7098 10248.4 15062 23829 17241  7670  9110
## cvpred      14860 24158 7172 7073 10255.3 15089 23945 17435  7682  9030
## price       17450 30760 6295 7609 10345.0 18280 28248 16503  6189  7689
## CV residual  2590  6602 -877  536    89.7  3191  4303  -932 -1493 -1341
##               96   108   109   114   127   131   134   146  155   195
## Predicted   7083 15041 17924 19190 34528 11037 12526 11494 6785 16334
## cvpred      7059 15355 18123 19662 35528 10973 12649 11731 6850 16401
## price       7799 11900 13200 16695 32528  9295 12170 11259 7898 12940
## CV residual  740 -3455 -4923 -2967 -3000 -1678  -479  -472 1048 -3461
## 
## Sum of squares = 1.56e+08    Mean square = 7791166    n = 20 
## 
## fold 10 
## Observations in test set: 20 
##                 6    13    18   23    30    46    68    89   91   92   110
## Predicted   14994 16147 28325 7098 15144  7177 23829 10144 7035 7083 16645
## cvpred      14853 15894 28081 7013 15177  7103 23779 10091 6927 7042 16736
## price       15250 20970 36880 6377 12964 13207 25552  9279 7099 6649 12440
## CV residual   397  5076  8799 -636 -2213  6104  1773  -812  172 -393 -4296
##               124   133   144   150  159  160  163   190   202
## Predicted   11723 12526 10593 11469 7659 7659 7655  9037 19898
## cvpred      11738 12518 10599 11546 7606 7606 7585  8981 20056
## price        8921 11850  9960 11694 7898 7788 9258 11595 19045
## CV residual -2817  -668  -639   148  292  182 1673  2614 -1011
## 
## Sum of squares = 1.88e+08    Mean square = 9396303    n = 20 
## 
## Overall (Sum over all 20 folds) 
##       ms 
## 11250765

rmpe_cv<-mean(abs(automobileData2$price-crossVal$Predicted)/automobileData2$price)*100
100-rmpe_cv

## [1] 82.6

A 10-fold cross validation is giving us accuracy of 82.6

Regression tree

A regression problem can also be solved using a decision tree, We just have to tweak the parameters to make the tree work on regression problems.

We wil use rpart library to build our model on

library(rpart)
library(rattle)

## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.

library(rpart.plot)
Tree1 <- rpart(price ~., method="anova", data=automobileData2)
printcp(Tree1)

## 
## Regression tree:
## rpart(formula = price ~ ., data = automobileData2, method = "anova")
## 
## Variables actually used in tree construction:
## [1] curb.weight engine.size width      
## 
## Root node error: 1e+10/205 = 6e+07
## 
## n= 205 
## 
##     CP nsplit rel error xerror xstd
## 1 0.63      0       1.0    1.0 0.16
## 2 0.19      1       0.4    0.4 0.05
## 3 0.03      2       0.2    0.2 0.04
## 4 0.01      3       0.2    0.2 0.04
## 5 0.01      4       0.1    0.2 0.04

summary(Tree1)

## Call:
## rpart(formula = price ~ ., data = automobileData2, method = "anova")
##   n= 205 
## 
##       CP nsplit rel error xerror   xstd
## 1 0.6283      0     1.000  1.010 0.1614
## 2 0.1876      1     0.372  0.382 0.0471
## 3 0.0272      2     0.184  0.223 0.0420
## 4 0.0115      3     0.157  0.201 0.0415
## 5 0.0100      4     0.145  0.200 0.0413
## 
## Variable importance
## engine.size curb.weight       width    city.mpg  horsepower        make 
##          23          20          15          13          12           9 
##      length highway.mpg  wheel.base 
##           4           4           1 
## 
## Node number 1: 205 observations,    complexity param=0.628
##   mean=1.32e+04, MSE=6.16e+07 
##   left son=2 (187 obs) right son=3 (18 obs)
##   Primary splits:
##       engine.size      < 182  to the left,  improve=0.628, (0 missing)
##       num.of.cylinders < 4.5  to the left,  improve=0.521, (0 missing)
##       curb.weight      < 2700 to the left,  improve=0.509, (0 missing)
##       horsepower       < 118  to the left,  improve=0.507, (0 missing)
##       highway.mpg      < 28.5 to the right, improve=0.480, (0 missing)
##   Surrogate splits:
##       curb.weight < 3330 to the left,  agree=0.971, adj=0.667, (0 split)
##       horsepower  < 176  to the left,  agree=0.966, adj=0.611, (0 split)
##       width       < 69.2 to the left,  agree=0.961, adj=0.556, (0 split)
##       make        splits as  LLLLLRLLLLLLL, agree=0.951, adj=0.444, (0 split)
##       city.mpg    < 16.5 to the right, agree=0.951, adj=0.444, (0 split)
## 
## Node number 2: 187 observations,    complexity param=0.188
##   mean=1.13e+04, MSE=2.06e+07 
##   left son=4 (126 obs) right son=5 (61 obs)
##   Primary splits:
##       curb.weight < 2660 to the left,  improve=0.615, (0 missing)
##       highway.mpg < 28.5 to the right, improve=0.581, (0 missing)
##       horsepower  < 94.5 to the left,  improve=0.516, (0 missing)
##       engine.size < 126  to the left,  improve=0.503, (0 missing)
##       city.mpg    < 22   to the right, improve=0.493, (0 missing)
##   Surrogate splits:
##       engine.size < 126  to the left,  agree=0.893, adj=0.672, (0 split)
##       highway.mpg < 28.5 to the right, agree=0.893, adj=0.672, (0 split)
##       length      < 178  to the left,  agree=0.877, adj=0.623, (0 split)
##       city.mpg    < 22   to the right, agree=0.866, adj=0.590, (0 split)
##       width       < 66   to the left,  agree=0.861, adj=0.574, (0 split)
## 
## Node number 3: 18 observations
##   mean=3.33e+04, MSE=4.66e+07 
## 
## Node number 4: 126 observations,    complexity param=0.0272
##   mean=8.8e+03, MSE=6.37e+06 
##   left son=8 (73 obs) right son=9 (53 obs)
##   Primary splits:
##       curb.weight < 2290 to the left,  improve=0.428, (0 missing)
##       horsepower  < 89   to the left,  improve=0.360, (0 missing)
##       highway.mpg < 29.5 to the right, improve=0.331, (0 missing)
##       length      < 168  to the left,  improve=0.317, (0 missing)
##       fuel.system splits as  LLLR,     improve=0.311, (0 missing)
##   Surrogate splits:
##       length     < 169  to the left,  agree=0.897, adj=0.755, (0 split)
##       width      < 64.5 to the left,  agree=0.881, adj=0.717, (0 split)
##       horsepower < 77   to the left,  agree=0.849, adj=0.642, (0 split)
##       wheel.base < 95.9 to the left,  agree=0.841, adj=0.623, (0 split)
##       city.mpg   < 27.5 to the right, agree=0.841, adj=0.623, (0 split)
## 
## Node number 5: 61 observations,    complexity param=0.0115
##   mean=1.64e+04, MSE=1.12e+07 
##   left son=10 (53 obs) right son=11 (8 obs)
##   Primary splits:
##       width       < 68.6 to the left,  improve=0.213, (0 missing)
##       make        splits as  LRL-R-LLL-LLR, improve=0.195, (0 missing)
##       fuel.system splits as  LLRR, improve=0.175, (0 missing)
##       wheel.base  < 101  to the left,  improve=0.154, (0 missing)
##       length      < 188  to the left,  improve=0.117, (0 missing)
##   Surrogate splits:
##       wheel.base < 109  to the left,  agree=0.885, adj=0.125, (0 split)
## 
## Node number 8: 73 observations
##   mean=7.39e+03, MSE=2.27e+06 
## 
## Node number 9: 53 observations
##   mean=1.07e+04, MSE=5.54e+06 
## 
## Node number 10: 53 observations
##   mean=1.58e+04, MSE=9.27e+06 
## 
## Node number 11: 8 observations
##   mean=2.04e+04, MSE=5.8e+06

plotcp(Tree1)

fancyRpartPlot(Tree1)

We can improve the tree by adding the optimal complexity parameter from the CP functions and from the CP plot.

Validating our Tree Model

Let’s validate our tree model by predicting the values on training and testing sets by comparing the accuracy or erros.

First we will build a new tree model just on the training set and then find the accuracy on both training and testing sets.

Tree2 <- rpart(price ~., method="anova", data=train_data)

Trpred_tree<-predict(Tree2,train_data)
Trtaget_tree<-train_data$price
rmpeTr_tree<-mean(abs(Trtaget_tree-Trpred_tree)/Trtaget_tree)*100
100 - rmpeTr_tree

## [1] 84.3

Tspred_tree<-predict(Tree2,test_data)
Tstarget_tree<-test_data$price
rmpeTs_tree<-mean(abs(Tstarget_tree-Tspred_tree)/Tstarget_tree)*100
100 - rmpeTs_tree

## [1] 83.1

This new tree model Tree2 was trained on training set which uses all the variables. This model gives accuarcy of 84.3% on the training data predictions and 83.1% on the test set predictions.

Cross validation the tree Tree made with Tree package can be cross validated to find best tree size to prune the tree.

Conclusion

In this project, we have developed two models for estimating the price. One is Linear regression and the other one is Regression tree. With Linear regression, we are able to get an accuracy of 84.2% for training and 76.5% for testing. With cross validation accuracy of 82.3%. With Regression tree, we are able to get an accuracy of 84.3% for training and 83.1% for testing.

You can download the datasets and R code file for this session here.