You can download the datasets and R code file for this session here.
Neural Networks
Contents
- Neural network Intuition
- Neural network and vocabulary
- Neural network algorithm
- Math behind neural network algorithm
- Building the neural networks
- Validating the neural network model
- Neural network applications
- Image recognition using neural networks
Recap of Logistic Regression
- When there is a categorical output, yes/no or 1/0 (a binary output), and the predictor variable is continuous, we cannot fit a linear regression line.
- If we try fitting several lines, a logistic regression curve fits this kind of dataset better than a linear regression line.
- Thus we need a logistic regression line for this data.
- We use the predictor variables to predict the categorical output.
- Before moving from logistic regression to neural networks, let us do a quick recap of how we build a logistic regression and how a neural network can be built from a combination of several logistic regressions.
LAB: Logistic Regression
- Dataset: Emp_Productivity/Emp_Productivity.csv
- Filter the data and take a subset from above dataset. Filter condition is Sample_Set<3.
- Draw a scatter plot that shows Age on X-axis and Experience on Y-axis. Try to distinguish the two classes with colors or shapes (visualizing the classes).
- Build a logistic regression model to predict Productivity using age and experience.
- Finally draw the decision boundary for this logistic regression model.
- Create the confusion matrix.
- Calculate the accuracy and error rates.
Solution
- Dataset: Emp_Productivity/Emp_Productivity.csv
Emp_Productivity_raw <- read.csv("R dataset/Emp_Productivity/Emp_Productivity.csv")
- Filter the data and take a subset from above dataset. Filter condition is Sample_Set<3.
Emp_Productivity1<-Emp_Productivity_raw[Emp_Productivity_raw$Sample_Set<3,]
dim(Emp_Productivity1)
## [1] 74 4
names(Emp_Productivity1)
## [1] "Age" "Experience" "Productivity" "Sample_Set"
head(Emp_Productivity1)
## Age Experience Productivity Sample_Set
## 1 20.0 2.3 0 1
## 2 16.2 2.2 0 1
## 3 20.2 1.8 0 1
## 4 18.8 1.4 0 1
## 5 18.9 3.2 0 1
## 6 16.7 3.9 0 1
table(Emp_Productivity1$Productivity)
##
## 0 1
## 33 41
- Draw a scatter plot that shows Age on X axis and Experience on Y-axis. Try to distinguish the two classes with colors or shapes (visualizing the classes).
library(ggplot2)
ggplot(Emp_Productivity1)+geom_point(aes(x=Age,y=Experience,color=factor(Productivity),shape=factor(Productivity)),size=5)
- Class 1 is shown as a triangle and class 0 as a dot.
- Age is on the X-axis and Experience is on the Y-axis.
- There are some 1’s and some 0’s, and it looks like they can be classified very easily.
- We will try to fit a logistic regression line, i.e., the classifier, and see whether it can help us separate the 0’s from the 1’s by putting a decision boundary between them.
- We fit a logistic regression of Productivity on Age and Experience from Emp_Productivity1, with family=binomial.
- Build a logistic regression model to predict Productivity using age and experience.
Emp_Productivity_logit<-glm(Productivity~Age+Experience,data=Emp_Productivity1, family=binomial())
Emp_Productivity_logit
##
## Call: glm(formula = Productivity ~ Age + Experience, family = binomial(),
## data = Emp_Productivity1)
##
## Coefficients:
## (Intercept) Age Experience
## -8.9361 0.2763 0.5923
##
## Degrees of Freedom: 73 Total (i.e. Null); 71 Residual
## Null Deviance: 101.7
## Residual Deviance: 46.77 AIC: 52.77
coef(Emp_Productivity_logit)
## (Intercept) Age Experience
## -8.9361114 0.2762749 0.5923444
slope1 <- coef(Emp_Productivity_logit)[2]/(-coef(Emp_Productivity_logit)[3])
intercept1 <- coef(Emp_Productivity_logit)[1]/(-coef(Emp_Productivity_logit)[3])
- To create the decision boundary we have to find slope and intercept.
- Finally draw the decision boundary for this logistic regression model.
library(ggplot2)
base<-ggplot(Emp_Productivity1)+geom_point(aes(x=Age,y=Experience,color=factor(Productivity),shape=factor(Productivity)),size=5)
base+geom_abline(intercept = intercept1 , slope = slope1, color = "red", size = 2) #Base is the scatter plot. Then we are adding the decision boundary
- Create the confusion matrix.
predicted_values<-round(predict(Emp_Productivity_logit,type="response"),0)
conf_matrix<-table(predicted_values,Emp_Productivity_logit$y)
conf_matrix
##
## predicted_values 0 1
## 0 31 2
## 1 2 39
- Calculate the accuracy and error rates.
accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy
## [1] 0.9459459
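The exercise also asks for the error rate; it is simply the complement of the accuracy computed above (a small addition, not part of the original lab code):

#Error rate is the complement of accuracy
error_rate <- 1 - accuracy
error_rate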
Decision Boundary
Decision Boundary – Logistic Regression
- The line or margin that separates the classes is called Decision Boundary.
- Classification algorithms are all about finding the decision boundaries.
- It need not always be a straight line (i.e., linear).
- The final function of our decision boundary looks like: $Y=1$ if $w_0 + w_1x_1 + w_2x_2 > 0$; else $Y=0$.
- In logistic regression, the boundary can be derived from the logistic regression coefficients and the threshold.
- Imagine the logistic regression line $p(y)=\frac{1}{1+e^{-(w_0+w_1x_1+w_2x_2)}}$.
- Suppose if $p(y)>0.5$ then class-1, or else class-0; then $w_0+w_1x_1+w_2x_2=0$ is the boundary line.
- Rewriting it in $x_2 = mx_1 + c$ form: $x_2 = -\frac{w_1}{w_2}x_1 - \frac{w_0}{w_2}$.
- Anything above this line is class-1, below this line is class-0:
  - $w_0+w_1x_1+w_2x_2 > 0$ is class-1,
  - $w_0+w_1x_1+w_2x_2 < 0$ is class-0,
  - $w_0+w_1x_1+w_2x_2 = 0$ is a tie, with probability 0.5.
- We can change the decision boundary by changing the threshold value (here 0.5), as illustrated in the sketch below.
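As a small illustration (not part of the original lab code), the boundary line for an arbitrary probability threshold t can be computed directly from the fitted coefficients: for p(y) > t the boundary is w0 + w1*x1 + w2*x2 = log(t/(1-t)), so only the intercept of the line shifts.

#Sketch: decision boundary for an arbitrary threshold t (t = 0.5 gives the line above)
w <- coef(Emp_Productivity_logit)                     # (Intercept), Age, Experience
boundary_slope     <- -w[2]/w[3]                      # same slope for every threshold
boundary_intercept <- function(t) (log(t/(1-t)) - w[1])/w[3]
boundary_intercept(0.5)                               # equals intercept1 computed earlier
boundary_intercept(0.8)                               # a stricter threshold shifts the line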
LAB: Decision Boundary
- Draw a scatter plot that shows Age on X-axis and Experience on Y-axis. Try to distinguish the two classes with colors or shapes (visualizing the classes).
- Build a logistic regression model to predict Productivity using age and experience.
- Finally draw the decision boundary for this logistic regression model.
- Create the confusion matrix.
- Calculate the accuracy and error rates.
Solution
- Drawing the Decision boundary for the logistic regression model
library(ggplot2)
base<-ggplot(Emp_Productivity1)+geom_point(aes(x=Age,y=Experience,color=factor(Productivity),shape=factor(Productivity)),size=5)
base+geom_abline(intercept = intercept1 , slope = slope1, color = "red", size = 2)
#Base is the scatter plot. Then we are adding the decision boundary
- This is the logistic regression line, or decision boundary: anything above the line is classified as 1 and anything below the line is classified as 0.
- It looks like a fairly accurate model, except for the 1 or 2 misclassified points shown in the graph.
- Create the confusion matrix.
predicted_values<-round(predict(Emp_Productivity_logit,type="response"),0)
conf_matrix<-table(predicted_values,Emp_Productivity_logit$y)
conf_matrix
##
## predicted_values 0 1
## 0 31 2
## 1 2 39
- Calculate the accuracy and error rates.
accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy
- We get very high accuracy because most of the values in the confusion matrix lie on the diagonal; only two 0’s and two 1’s are wrongly classified.
- With an accuracy of about 94.6%, this is a very good model.
New Representation for Logistic Regression
- We are trying to understand neural networks here through logistic regression.
- Moving forward we will see a new representation for logistic regression line that we just built and see how it transforms to neural networks later.
- This is the logistic regression line that we have built: $p(y)=\frac{1}{1+e^{-(-8.94+0.28\,Age+0.59\,Experience)}}$.
- That can be rewritten in general form as: $p(y)=\frac{1}{1+e^{-(w_0+w_1x_1+w_2x_2)}}$.
- Thus this particular line can be taken as one equation and written as $y=g(w_0+w_1x_1+w_2x_2)$, where $g(z)=\frac{1}{1+e^{-z}}$, $x_1$ is Age and $x_2$ is Experience.
- This can be displayed in a diagram as follows: the inputs $x_1$ and $x_2$ feed into the output node with weights $w_1$ and $w_2$, while the bias term $w_0$ has no prior input (its input is the constant 1).
- We can simply say $w_0+w_1x_1+w_2x_2$ is the line equation that goes through the logistic function $g$.
- This is how we can represent the same logistic regression line that we just built (a quick numerical check is sketched below).
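As a quick check of this representation (an illustrative sketch, not part of the original lab), we can compute g(w0 + w1*Age + w2*Experience) manually and compare it with the fitted probabilities:

#Sketch: the logistic regression written explicitly as y = g(w0 + w1*x1 + w2*x2)
g <- function(z) 1/(1 + exp(-z))                      # the logistic (sigmoid) function
w <- coef(Emp_Productivity_logit)
p_manual <- g(w[1] + w[2]*Emp_Productivity1$Age + w[3]*Emp_Productivity1$Experience)
p_glm    <- predict(Emp_Productivity_logit, type = "response")
all.equal(as.numeric(p_manual), as.numeric(p_glm))    # should be TRUE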
Finding the weights in logistic regression
- The output is a non-linear function of a linear combination of inputs – a typical multiple logistic regression line.
- For this line, we have to find the coefficients (weights) $w$.
- We find $w_0, w_1, w_2$ to minimize the error $\sum_i \left(y_i - g(w_0+w_1x_{1i}+w_2x_{2i})\right)^2$, as illustrated in the sketch below.
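To illustrate the idea of "finding w to minimize an error" (glm() itself fits by maximum likelihood, so this is only a sketch of the minimization idea), we can minimize the squared error numerically:

#Sketch: finding weights by numerically minimizing the squared error
g <- function(z) 1/(1 + exp(-z))
sq_error <- function(w, x1, x2, y) sum((y - g(w[1] + w[2]*x1 + w[3]*x2))^2)
fit <- optim(par = c(0, 0, 0), fn = sq_error,
             x1 = Emp_Productivity1$Age,
             x2 = Emp_Productivity1$Experience,
             y  = Emp_Productivity1$Productivity)
fit$par    # weights that (approximately) minimize the squared error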
LAB: Non-Linear Decision Boundaries
- Dataset: “Emp_Productivity/ Emp_Productivity.csv”
- Draw a scatter plot that shows Age on X axis and Experience on Y-axis. Try to distinguish the two classes with colors or shapes (visualizing the classes).
- Build a logistic regression model to predict Productivity using age and experience.
- Finally draw the decision boundary for this logistic regression model.
- Create the confusion matrix.
- Calculate the accuracy and error rates.
Note – Here we are considering the entire dataset, not the subset.
####The classification graph on overall data
library(ggplot2)
ggplot(Emp_Productivity_raw)+geom_point(aes(x=Age,y=Experience,color=factor(Productivity),shape=factor(Productivity)),size=5)
- As age and experience increase, there are a lot of 0’s at the beginning of the graph and then a lot of 1’s after that.
- So it looks like a logistic regression line could separate the 0’s and 1’s in this portion.
- But when we go beyond this point, there are again a lot of 0’s.
- Thus it looks like fitting one logistic regression line, or one linear decision boundary, might not be sufficient.
- Nevertheless, let us go ahead and force-fit one linear decision boundary.
###Logistic Regression model for overall data
Emp_Productivity_logit_overall<-glm(Productivity~Age+Experience,data=Emp_Productivity_raw, family=binomial())
Emp_Productivity_logit_overall
##
## Call: glm(formula = Productivity ~ Age + Experience, family = binomial(),
## data = Emp_Productivity_raw)
##
## Coefficients:
## (Intercept) Age Experience
## 0.44784 -0.01755 -0.06324
##
## Degrees of Freedom: 118 Total (i.e. Null); 116 Residual
## Null Deviance: 155.7
## Residual Deviance: 150.5 AIC: 156.5
slope2 <- coef(Emp_Productivity_logit_overall)[2]/(-coef(Emp_Productivity_logit_overall)[3])
intercept2 <- coef(Emp_Productivity_logit_overall)[1]/(-coef(Emp_Productivity_logit_overall)[3])
####Drawing the Decision boundary
library(ggplot2)
base<-ggplot(Emp_Productivity_raw)+geom_point(aes(x=Age,y=Experience,color=factor(Productivity),shape=factor(Productivity)),size=5)
base+geom_abline(intercept = intercept2 , slope = slope2, colour = "blue", size = 2)
- Obviously, this decision boundary is nowhere close to a good one.
- There are many misclassifications; in fact, every triangle (class 1) has been incorrectly classified as 0.
- Let us look at the overall accuracy; the confusion matrix will not look as good as it did earlier.
- There are many incorrect, off-diagonal values.
- And the accuracy falls drastically.
####Accuracy of the overall model
predicted_values<-round(predict(Emp_Productivity_logit_overall,type="response"),0)
conf_matrix<-table(predicted_values,Emp_Productivity_logit_overall$y)
conf_matrix
##
## predicted_values 0 1
## 0 69 43
## 1 7 0
accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy
## [1] 0.5798319
- Accuracy is only about 0.58, which is not at all acceptable.
- A single linear decision boundary will not work for this data.
- We have to build a non-linear decision boundary, which in cases like this might look like a “U” shape.
Non-Linear Decision Boundaries-Issue
- Basically, when there is no linear separation between two classes, i.e., when a single straight line cannot divide the two classes, we have to go for a non-linear decision boundary.
- With a non-linear decision boundary we cannot fit just one line and say the right-hand side of the line is 1’s and the left-hand side is 0’s.
- Thus one line might not be sufficient.
- Therefore this is an issue, and logistic regression does not seem to be a good option when we have non-linear decision boundaries.
Non-Linear Decision Boundaries
Non-Linear Decision Boundaries-Solution
- We need to find a solution for non-linear decision boundaries.
- One idea: use multiple logistic regression lines together – construct the decision boundary by fitting 2 or 3 logistic regression lines and then use them to build the final classifier.
- As of now, a single logistic regression line cannot work in a scenario where there is a non-linear separating boundary between the two classes.
- Since the classes cannot be separated by one linear line or classifier, why don’t we fit two models?
- Model-1 separates the first portion and model-2 takes care of the other portion; instead of directly finding the final output, we then have one combined classifier, which is non-linear, that tells us where class-1 and class-2 are.
- We get an intermediate output, say $h_1$, coming out of model-1 and another intermediate output, $h_2$, coming out of model-2.
- We can then use $h_1$ and $h_2$ to find the final classifier.
(Figures: Intermediate Output 1 and Intermediate Output 2)
The Intermediate output
- Using $x_1$ and $x_2$ directly to predict $y$ is challenging; the independent variables we have are $x_1$ (Age) and $x_2$ (Experience).
- Predicting $y$ directly from the independent variables $x_1$ and $x_2$ is challenging because $y$ is non-linearly dependent on $x_1$ and $x_2$, so a linear classifier does not work.
- The idea is to first predict the intermediate outputs $h_1$ and $h_2$, and then let the intermediate outputs predict $y$.
- Instead of going directly from $(x_1, x_2)$ to $y$, we predict $h_1$ using $x_1$ and $x_2$, then $h_2$ again using $x_1$ and $x_2$, and then use $h_1$ and $h_2$ to predict $y$.
Finding the Weights for Intermediate Outputs
- How do we find the weights of the intermediate outputs?
- We try to predict $h_1$ using $x_1$ and $x_2$, and $h_2$ using $x_1$ and $x_2$; then $h_1$ and $h_2$ are used for predicting $y$.
- $h_1$ is a non-linear function of a linear combination of $x_1$ and $x_2$: $h_1 = g(w_{01} + w_{11}x_1 + w_{21}x_2)$.
- $h_2$ is a non-linear function $g$ of a linear combination of $x_1$ and $x_2$: $h_2 = g(w_{02} + w_{12}x_1 + w_{22}x_2)$.
- $y$ is a non-linear function of a linear combination of $h_1$ and $h_2$: $y = g(W_0 + W_1 h_1 + W_2 h_2)$.
- Thus $h_1$ and $h_2$ are the intermediate outputs.
LAB: Intermediate output
- Dataset: Emp_Productivity/ Emp_Productivity_All_Sites.csv
- Filter the data and take first 74 observations from above dataset. Filter condition is Sample_Set<3.
- Build a logistic regression model to predict Productivity using age and experience.
- Calculate the prediction probabilities for all the inputs. Store the probabilities in inter1 variable.
- Filter the data and take observations from row 34 onwards. Filter condition is Sample_Set>1.
- Build a logistic regression model to predict Productivity using age and experience.
- Calculate the prediction probabilities for all the inputs. Store the probabilities in inter2 variable.
- Build a consolidated model to predict productivity using the inter1 and inter2 variables.
- Create the confusion matrix and find the accuracy and error rates for the consolidated model.
Our sampled data Emp_Productivity1 has the first 74 observations. Let’s build the model on this sample data (sample-1).
####The classification graph Sample-1
library(ggplot2)
ggplot(Emp_Productivity1)+geom_point(aes(x=Age,y=Experience,color=factor(Productivity),shape=factor(Productivity)),size=5)
- The overall data is arranged in a particular manner: there are some 0’s first, then 1’s, and then again 0’s.
- So the first class appears, then the second class, and then the first class again; we cannot have one linear decision boundary or classifier that separates these two classes.
- We take sample-1, the initial portion with the 0’s and then the 1’s, and try to draw a decision boundary.
- The initial portion looks like this:
- We have the original data like this:
- Then for sample-1, we took this particular subset of data.
- And then we fit a logistic regression line which fits very well.
- Thus we found a perfect classifier.
###Logistic Regression model1
Emp_Productivity_logit<-glm(Productivity~Age+Experience,data=Emp_Productivity1, family=binomial())
coef(Emp_Productivity_logit)
## (Intercept) Age Experience
## -8.9361114 0.2762749 0.5923444
slope1 <- coef(Emp_Productivity_logit)[2]/(-coef(Emp_Productivity_logit)[3])
intercept1 <- coef(Emp_Productivity_logit)[1]/(-coef(Emp_Productivity_logit)[3])
####Decision boundary for model1 built on Sample-1
library(ggplot2)
base<-ggplot(Emp_Productivity1)+geom_point(aes(x=Age,y=Experience,color=factor(Productivity),shape=factor(Productivity)),size=5)
base+geom_abline(intercept = intercept1 , slope = slope1, color = "red", size = 2)
#Base is the scatter plot. Then we are adding the decision boundary
- This model has high accuracy.
- We can take sample-2, which is the other portion.
- As with classifier-1, we now consider sample-2 and fit another intermediate model.
- Sample-2 is taken based on the condition Sample_Set>1.
#Filter the data and take observations from row 34 onwards.
Emp_Productivity2<-Emp_Productivity_raw[Emp_Productivity_raw$Sample_Set>1,]
####The classification graph
library(ggplot2)
ggplot(Emp_Productivity2)+geom_point(aes(x=Age,y=Experience,color=factor(Productivity),shape=factor(Productivity)),size=5)
- There are a lot of 1’s and a lot of 0’s.
- We then fit a logistic regression line for this data and expect to find a good separating line.
###Logistic Regression model2 built on Sample2
Emp_Productivity_logit2<-glm(Productivity~Age+Experience, data=Emp_Productivity2, family=binomial())
Emp_Productivity_logit2
##
## Call: glm(formula = Productivity ~ Age + Experience, family = binomial(),
## data = Emp_Productivity2)
##
## Coefficients:
## (Intercept) Age Experience
## 16.3184 -0.3994 -0.2440
##
## Degrees of Freedom: 85 Total (i.e. Null); 83 Residual
## Null Deviance: 119
## Residual Deviance: 34.08 AIC: 40.08
- We can find a decision boundary that separates the 0’s from the 1’s.
coef(Emp_Productivity_logit2)
## (Intercept) Age Experience
## 16.3183916 -0.3994172 -0.2439643
slope3 <- coef(Emp_Productivity_logit2)[2]/(-coef(Emp_Productivity_logit2)[3])
intercept3 <- coef(Emp_Productivity_logit2)[1]/(-coef(Emp_Productivity_logit2)[3])
####Drawing the Decision boundary
library(ggplot2)
base<-ggplot(Emp_Productivity2)+geom_point(aes(x=Age,y=Experience,color=factor(Productivity),shape=factor(Productivity)),size=5)
base+geom_abline(intercept = intercept3 , slope = slope3, color = "red", size = 2)
- We can see that the 1’s are on one side of the decision boundary and the 0’s on the other side.
####Accuracy of the model2
predicted_values<-round(predict(Emp_Productivity_logit2,type="response"),0)
conf_matrix<-table(predicted_values,Emp_Productivity_logit2$y)
conf_matrix
##
## predicted_values 0 1
## 0 43 2
## 1 2 39
accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy
## [1] 0.9534884
- The accuracy level should be fairly high for this particular model.
- As found, the accuracy is about 95%.
- Emp_Productivity_logit was built on the initial portion of the data and Emp_Productivity_logit2 on the second portion.
- We will create two more columns in the dataset.
- Thus we create two new variables that hold the intermediate outputs, i.e., the outputs of the two models that we have built.
- We use logistic regression model-1 to predict a variable called inter1.
- We create inter1 manually: since we cannot directly predict $y$, we first predict inter1 and then inter2.
- inter1 and inter2 act as the two new variables $h_1$ and $h_2$, and they will in turn be used for predicting $y$.
- inter1 is predicted by logistic regression-1.
- inter2 is predicted by logistic regression-2.
#Calculate the prediction probabilities for all the inputs. Store the probabilities in inter1 variable
Emp_Productivity_raw$inter1<-predict(Emp_Productivity_logit,type="response", newdata=Emp_Productivity_raw)
#Calculate the prediction probabilities for all the inputs. Store the probabilities in inter2 variable
Emp_Productivity_raw$inter2<-predict(Emp_Productivity_logit2,type="response", newdata=Emp_Productivity_raw)
head(Emp_Productivity_raw)
## Age Experience Productivity Sample_Set inter1 inter2
## 1 20.0 2.3 0 1 0.11423230 0.9995775
## 2 16.2 2.2 0 1 0.04080461 0.9999096
## 3 20.2 1.8 0 1 0.09202657 0.9995949
## 4 18.8 1.4 0 1 0.05152147 0.9997899
## 5 18.9 3.2 0 1 0.13955234 0.9996608
## 6 16.7 3.9 0 1 0.11793035 0.9998329
- The idea is that these two variables, created using the separate logistic regressions, have a much better chance of predicting $y$, and can do it quite easily.
- We now have two new variables called inter1 and inter2.
- The graph for predicting $y$ using inter1 and inter2 looks quite different.
- In the classification graph below, the inter1 output is on the X-axis and the inter2 output on the Y-axis.
####Classification graph with the two new columns
library(ggplot2)
ggplot(Emp_Productivity_raw)+geom_point(aes(x=inter1,y=inter2,color=factor(Productivity),shape=factor(Productivity)),size=5)
- We can clearly see a good linear separating boundary between class-1 and class-0.
- A straight line going diagonally, as shown in the graph, can separate them.
- Let us go ahead and predict the probability of Productivity using inter1 and inter2.
- So instead of using $x_1$ and $x_2$ directly, we are using inter1 and inter2, which are derived from the two logistic regression models.
- Thus we create a new model called Emp_Productivity_logit_combined.
###Logistic Regression model with Intermediate outputs as input
Emp_Productivity_logit_combined<-glm(Productivity~inter1+inter2,data=Emp_Productivity_raw, family=binomial())
Emp_Productivity_logit_combined
##
## Call: glm(formula = Productivity ~ inter1 + inter2, family = binomial(),
## data = Emp_Productivity_raw)
##
## Coefficients:
## (Intercept) inter1 inter2
## -12.213 8.019 8.598
##
## Degrees of Freedom: 118 Total (i.e. Null); 116 Residual
## Null Deviance: 155.7
## Residual Deviance: 49.74 AIC: 55.74
- Now that we have the new model, let us look at its decision boundary.
slope4 <- coef(Emp_Productivity_logit_combined)[2]/(-coef(Emp_Productivity_logit_combined)[3])
intercept4<- coef(Emp_Productivity_logit_combined)[1]/(-coef(Emp_Productivity_logit_combined)[3])
####Drawing the Decision boundary
library(ggplot2)
base<-ggplot(Emp_Productivity_raw)+geom_point(aes(x=inter1,y=inter2,color=factor(Productivity),shape=factor(Productivity)),size=5)
base+geom_abline(intercept = intercept4 , slope = slope4, colour = "red", size = 2)
- From the decision boundary it is very clear that most class-1 values lie on one side and most class-0 values on the other.
####Accuracy of the combined model
predicted_values<-round(predict(Emp_Productivity_logit_combined,type="response"),0)
conf_matrix<-table(predicted_values,Emp_Productivity_logit_combined$y)
conf_matrix
##
## predicted_values 0 1
## 0 74 4
## 1 2 39
accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy
## [1] 0.9495798
- We get a high accuracy level of about 95%.
- We have built three models: Emp_Productivity_logit and Emp_Productivity_logit2 on subsets of the data, from which we derived two intermediate output variables, and then a third model, Emp_Productivity_logit_combined, built on those intermediate outputs.
- Finally, we have classified the two classes into 0’s and 1’s.
- There is a linear decision boundary that we can see at the end, but not on the direct input variables: we transform the inputs by building logistic regression lines and finding the intermediate outputs.
- Then we use those intermediate outputs to come up with the final decision boundary.
Neural Network Intuition
Final Output
- So $h$ is a non-linear function of a linear combination of inputs – a multiple logistic regression line: $h = g\left(\sum_k w_k x_k\right)$.
- $y$ is a non-linear function of a linear combination of the outputs of the logistic regressions: $y = g\left(\sum_k W_k h_k\right)$.
- So $y$ is a non-linear function of a linear combination of non-linear functions of linear combinations of inputs.
- We find $W$ to minimize $\sum_i \left(y_i - g\left(\sum_k W_k h_{ki}\right)\right)^2$.
- We find $w$ and $W$ to minimize $\sum_i \left(y_i - g\left(\sum_k W_k\, g\left(\sum_j w_{kj} x_{ij}\right)\right)\right)^2$.
- Neural networks are all about finding the sets of weights $w$ and $W$ using the gradient descent method (a composed forward pass is sketched below).
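Putting the intermediate-output idea together, here is a sketch (using the three models built earlier, not a new fit) of the final prediction as a non-linear function of a linear combination of non-linear functions of the inputs:

#Sketch: y = g(W0 + W1*g(w.x) + W2*g(w.x)) composed from the three fitted models
g  <- function(z) 1/(1 + exp(-z))
w1 <- coef(Emp_Productivity_logit)             # weights feeding intermediate output h1
w2 <- coef(Emp_Productivity_logit2)            # weights feeding intermediate output h2
W  <- coef(Emp_Productivity_logit_combined)    # weights of the output layer

forward <- function(age, experience) {
  h1 <- g(w1[1] + w1[2]*age + w1[3]*experience)
  h2 <- g(w2[1] + w2[2]*age + w2[3]*experience)
  g(W[1] + W[2]*h1 + W[3]*h2)                  # final predicted probability
}
head(round(forward(Emp_Productivity_raw$Age, Emp_Productivity_raw$Experience), 3))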
The Neural Networks
- The neural networks methodology is similar to the intermediate output method explained above.
- But we will not manually subset the data to create the different models.
- The neural network technique automatically takes care of all the intermediate outputs using hidden layers.
- It works very well for the data with non-linear decision boundaries.
- The intermediate output layer in the network is known as hidden layer.
- In simple terms, neural networks are multilayer non-linear regression models.
- If we have sufficient number of hidden layers, then we can estimate any complex non-linear function.
Neural Network and Vocabulary
- Here, the two hidden layer nodes, $h_1$ and $h_2$, are derived from the two inputs, $x_1$ and $x_2$.
Why are they called hidden layers?
- A hidden layer “hides” the desired output.
- Instead of predicting the actual output with a single model, we build multiple models that predict intermediate outputs.
- There is no standard way of deciding the number of hidden layers; with experience, and by looking at the complexity of the problem and the final accuracy of the model, we can experiment with the number of hidden layers.
- It may seem like the more the merrier, but we also have to avoid overfitting.
- So this is the overall intuition of the neural network.
Algorithm for Finding Weights
- The algorithm is all about finding the weights/coefficients.
- We randomly initialize some weights.
- We calculate the output by supplying the training input.
- With those values of $x$ we calculate the values of $h$, and from them the predicted value of $y$.
- For the given values of $x$ we already know the actual value of $y$ from the training data.
- So we try to predict the value of $y$ using these weights.
- Whatever is the error between the predicted and actual, we try to adjust the weight to reduce that error.
- And finally we will find those weights after adjustments that will give us minimum amount of error.
- Let us see what are the steps involved in the neural network algorithm.
The Neural Network Algorithm
- Step 1 : Initialization of weights: Randomly select some weights.
- Step 2 : Training & Activation: Input the training values and perform the calculations forward.
- We have the dataset with us, so we put in the values of $x$, then find the values of $h$, and with them we calculate forward.
- Once we calculate the final output, those are the predicted values of $y$.
- In the training dataset itself we also have the actual values of $y$.
- Step 3 : Error Calculation: Calculate the error at the outputs. Use the output error to calculate error fractions at each hidden layer based on the final layer.
- Step 4: Weight training: Update the weights to reduce the error, recalculate and repeat the process of training & updating the weights for all the examples.
- Step 5: Stopping criteria: Stop the training and weights updating process when the minimum error criteria is met.
- So we start with some weights and then calculate the errors and then adjust the weights to reduce the error, once there is very minimum error we stop it at that point.
Randomly Initialize Weights
- Step-1 is the initialization of some random weights.
- So we take any random weights.
- As these are not the final weights, we can adjust them later.
Training & Activation
- Training and activation is the next step.
- We input the values of $x$ to predict $y$, because we already have (random) weights.
- We substitute the values of $x$ and the random weights into the equations and find $h_1$ and $h_2$.
- So, by giving the training values as input, we can perform the calculations forward.
- This forward pass of training inputs and calculations is called feed forward.
Error Calculation at Output
- In the previous step we computed the predicted value of $y$: the output of the final $g$ function is the predicted $y$.
- We also know the actual value of $y$.
- So we calculate the error at the final layer, and from it we can find the error fractions at each hidden layer.
- With this formula we can find out what fraction of the overall error each hidden node contributes.
- Calculating these error signals backwards, layer by layer, is called back propagation.
Error Calculation at hidden layers
- This is the overall error, i.e. Err.
- Once we have the errors, we can calculate the weight corrections that will reduce those errors.
Calculate weight corrections
- Here, the corrections on the input-to-hidden weights (the $\Delta w$ terms) reduce the errors at $h_1$ and $h_2$, and the corrections on the hidden-to-output weights (the $\Delta W$ terms) reduce the overall output error.
- These are the weight corrections.
Update Weights
- In the update-weights step, each new weight is the sum of the previous weight and its weight correction: $w_{new} = w_{old} + \Delta w$.
- Update the weights to reduce the error based on the weight corrections, recalculate, and repeat the process.
- With the new weights the error is reduced.
- This is one iteration, and we repeat it again and again.
- With the new weights we again find the error at the output.
- Again we find the error at each hidden layer.
- And again we do the weight corrections and update the weights.
- So with each set of new weights, the error is reduced slightly.
- We repeat the process again and again.
Stopping Criteria
- We stop training and updating the weights when the error is at its least.
- When no further error reduction is happening we can stop, or we can set a minimum error criterion, i.e., if the error falls below a particular value we stop training.
- The final weights are taken from the final iteration.
- Once the minimum error criterion is met, we come out of the algorithm.
Once Again ..Neural network Algorithm
- Step 1 : Initialization of weights: Randomly select some weights.
- Step 2 : Training & Activation: Input the training values and perform the calculations forward.
- Step 3 : Error Calculation: Calculate the error at the outputs. Use the output error to calculate error fractions at each hidden layer.
- Step 4 : Weight training: Update the weights to reduce the error, recalculate and repeat the process of training & updating the weights for all the examples.
- Step 5 : Stopping criteria: Stop the training and weights updating process when the minimum error criteria is met.
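As an illustration only (this is not the internals of any particular package), the five steps above can be written out for a tiny 2-2-1 network on the XOR data, trained with the delta rule:

#Illustrative sketch: the five steps for a tiny 2-2-1 network on XOR data
g <- function(z) 1/(1 + exp(-z))                      # activation function
X <- cbind(1, c(1, 1, 0, 0), c(1, 0, 1, 0))           # bias column + input1 + input2
y <- c(0, 1, 1, 0)                                    # XOR outputs

set.seed(1)
Wh  <- matrix(runif(6, -1, 1), nrow = 3)              # Step 1: random hidden-layer weights
Wo  <- matrix(runif(3, -1, 1), ncol = 1)              # Step 1: random output-layer weights
eta <- 0.5                                            # learning rate

for (epoch in 1:50000) {                              # Steps 2-5: repeat until error is small
  H     <- g(X %*% Wh)                                # Step 2: feed forward (hidden outputs)
  y_hat <- g(cbind(1, H) %*% Wo)                      # Step 2: feed forward (final output)
  err   <- y - y_hat                                  # Step 3: error at the output
  d_out <- err * y_hat * (1 - y_hat)                  # Step 3: output delta
  d_hid <- (d_out %*% t(Wo[-1, , drop = FALSE])) * H * (1 - H)  # Step 3: back-propagated deltas
  Wo <- Wo + eta * t(cbind(1, H)) %*% d_out           # Step 4: update output weights
  Wh <- Wh + eta * t(X) %*% d_hid                     # Step 4: update hidden weights
  if (sum(err^2) < 1e-4) break                        # Step 5: stopping criterion
}
round(y_hat, 3)   # should approach 0,1,1,0 (a different seed may be needed if stuck)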
Neural network Algorithm-Demo
- This demo shows how exactly the neural network algorithm works, with actual numbers.
- Let us consider a simple classification example: the XOR gate dataset.
- It is a dataset that cannot be separated using a single linear decision boundary/perceptron.
- Observing the dataset, there are two classes: 1 and 0.
- Clearly there cannot be one single straight decision boundary that separates the two classes.
- So we need to build a non-linear decision boundary, and we use a neural network to classify this data into 1’s and 0’s.
- The XOR gate diagram is similar to the example above.
- The XOR gate truth table is:
  - When $x_1$ = 1 and $x_2$ = 1, the output is 0.
  - When $x_1$ = 1 and $x_2$ = 0, the output is 1.
  - When $x_1$ = 0 and $x_2$ = 1, the output is 1.
  - When $x_1$ = 0 and $x_2$ = 0, the output is 0.
- We will use the neural network algorithm to build a non-linear decision boundary that segregates the two classes, 1’s and 0’s.
- One line is not sufficient, so we need to build a non-linear decision boundary (the dataset itself can also be created directly in R, as shown below).
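If the xor.csv file used in the lab later is not at hand, the same four-row XOR dataset can be created directly; the column names input1, input2 and output match the file read later:

#Creating the XOR dataset directly (equivalent to the Gates/xor.csv file read later)
xor_data <- data.frame(input1 = c(1, 1, 0, 0),
                       input2 = c(1, 0, 1, 0),
                       output = c(0, 1, 1, 0))
xor_data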
Randomly Initialize Weights
- Step 1 is to randomly initialize the weights.
- Suppose we pick some random weights; they can be anything.
- There is no need to worry about the exact values we take.
- So we randomly initialize the weights like this.
Activation
- We need to take the input data, this is directly taken from the dataset that we have.
- This is the XOR dataset, with the same truth table as shown above.
- We take the first data point and pass it through the network; this is the first training example of the first epoch.
- Inputting the values $x_1$ = 1 and $x_2$ = 1, we can find the values of $h_1$ and $h_2$.
- Substituting the inputs and the random weights into the sigmoid, $h_1$ comes out as 0.818 and $h_2$ as 0.731.
- Using $h_1$, $h_2$ and the output-layer weights in the same kind of equation, we calculate the final value of $y$.
- When $x_1$ is 1 and $x_2$ is 1 the actual output is 0, but with these weights the predicted output comes out as 0.713.
- Obviously there is an error: instead of 0 we got 0.713.
- The error is simply Error = Actual – Predicted, i.e., ($y_{target}$ – $y_{observed}$).
- Since the predicted value is 0.713 and the actual value is 0, the error magnitude at the output is 0.713.
Back-Propagate Errors
- Similarly, we can calculate the expected error, or error fraction, at $h_1$ and at $h_2$.
- Based on the delta (error) at the output, the error fraction at $h_1$ works out to 0.021 and at $h_2$ to -0.028.
Calculate Weight Corrections
- Based on the error fraction that we have calculated earlier, from that the weight adjustment will be derived.
- Thus the error fraction and the corresponding node’s value help us find the weight adjustment.
- We calculate the weight adjustment, based on the weight adjustment we will update the weights.
Updated Weights
- Earlier one weight was 0.5 and another was -1; the calculated weight adjustments come out to about 0.002175 and -0.002867 respectively.
- So we update 0.5 to 0.502175, and -1 becomes -1.002867.
- Having adjusted these, we do the same thing for all the weights.
- With the new weights we again calculate the predicted value of $y$, compare it with the actual $y$, and find the error.
Updated Weights contd…
Iterations and Stopping Criteria
- This iteration is just for one training example (1,1,0). This is just the first epoch.
- We repeat the same process of training and updating the weights for all the data points.
- We continue updating the weights until there is no significant change in the error, or until the maximum permissible error criterion is met.
- By updating the weights in this way, we reduce the error slightly each time. When the error reaches its minimum, the iterations stop and the weights at that point are considered optimal for this training set.
XOR Gate final NN Model
- Finally, we find the decision boundaries using the neural network.
- This is how the final neural network model for the XOR gate looks.
- These were the manual calculations of a neural network.
- We really do not need to do the error calculation, back propagation, weight updating, etc. by hand.
- If we have a tool, it takes care of everything automatically.
- All we need to do is supply the dataset, the independent variables, the dependent variable, etc., and everything else is taken care of.
- We did this exercise just to get an idea; in general we do not need to calculate the errors manually.
- Thus we can use any tool to build Neural Networks.
Building the Neural Network
- We do not really need to calculate the weights manually like that.
- If we use R or Python or any tool with prewritten routines, we just need to supply the dataset and the right input values of $x$.
- The gradient descent method is not very easy to understand for non-mathematics students, so it is not easy to write the program from scratch.
- Neural network tools do not expect the user to write code for the full-length back propagation algorithm, at least not for beginner and intermediate users.
- We do not really need to know the overall coding of the algorithm for finding out weights of neural network.
- Thus, we can use tool like R.
- We will try to use R to build the neural network and we just need to be slightly careful while setting out the parameters in this neural network function where everything else will be taken care of.
- We will try to build a neural network model for XOR data.
- We will also do a neural network weights finding exercise on Emp_Productivity.csv data.
- We will also use this neural network to predict the values.
- We will also find out what the final model will look like.
- Earlier we have built two logistic regressions, now we will directly try to build one neural network equation.
- Now we will build the neural network; the function for building it is neuralnet().
- We will fit an XOR neural network model.
The good news is…
- We do not need to write the code for weights calculation and updating.
- There are ready-made functions, libraries, and packages available in R.
- The gradient descent method is not very easy to understand for non-mathematics students.
- Neural network tools do not expect the user to write the code for the full length back propagation algorithm.
Building the Neural Network in R
- We have a couple of packages available in R.
- We need to mention the dataset, input, output & number of hidden layers as input.
- Neural network calculations are very complex; the algorithm may take some time to produce the results.
- One needs to be careful while setting the parameters; the runtime changes based on the input parameter values.
LAB: Building the neural network in R
- Build a neural network for XOR data.
- Dataset: Emp_Productivity/Emp_Productivity.csv
- Draw a 2D graph between age, experience and productivity.
- Build neural network algorithm to predict the productivity based on age and experience.
- Plot the neural network with final weights.
#Build a neural network for XOR data
xor_data <- read.csv("R dataset/Gates/xor.csv")
library(neuralnet)
## Warning: package 'neuralnet' was built under R version 3.3.2
xor_nn_model<-neuralnet(output~input1+input2,data=xor_data,hidden=2, linear.output = FALSE, threshold = 0.0000001)
plot(xor_nn_model)
- Sometimes it may take a while to produce this plot, depending on the seed value supplied and the starting weights that were chosen.
- The error is essentially zero, and these are the final weights for the XOR model.
- We can see the overall model.
xor_nn_model
## $call
## neuralnet(formula = output ~ input1 + input2, data = xor_data,
## hidden = 2, threshold = 0.0000001, linear.output = FALSE)
##
## $response
## output
## 1 0
## 2 1
## 3 1
## 4 0
##
## $covariate
## [,1] [,2]
## [1,] 1 1
## [2,] 1 0
## [3,] 0 1
## [4,] 0 0
##
## $model.list
## $model.list$response
## [1] "output"
##
## $model.list$variables
## [1] "input1" "input2"
##
##
## $err.fct
## function (x, y)
## {
## 1/2 * (y - x)^2
## }
## <environment: 0x000000001725bbf0>
## attr(,"type")
## [1] "sse"
##
## $act.fct
## function (x)
## {
## 1/(1 + exp(-x))
## }
## <environment: 0x000000001725bbf0>
## attr(,"type")
## [1] "logistic"
##
## $linear.output
## [1] FALSE
##
## $data
## input1 input2 output
## 1 1 1 0
## 2 1 0 1
## 3 0 1 1
## 4 0 0 0
##
## $net.result
## $net.result[[1]]
## [,1]
## 1 0.0003253483014
## 2 0.9996353029148
## 3 0.9996313468905
## 4 0.0003253955548
##
##
## $weights
## $weights[[1]]
## $weights[[1]][[1]]
## [,1] [,2]
## [1,] 12.57613311 11.37500633
## [2,] 26.09709905 -23.34083293
## [3,] -25.27406355 24.32173217
##
## $weights[[1]][[2]]
## [,1]
## [1,] 23.85189617
## [2,] -15.93571609
## [3,] -15.94656173
##
##
##
## $startweights
## $startweights[[1]]
## $startweights[[1]][[1]]
## [,1] [,2]
## [1,] 1.34136916425 1.2805342043
## [2,] 0.09709905088 0.2695670659
## [3,] -0.66886355387 0.2526550529
##
## $startweights[[1]][[2]]
## [,1]
## [1,] 1.8831711692
## [2,] -0.2234252985
## [3,] -0.8890847239
##
##
##
## $generalized.weights
## $generalized.weights[[1]]
## [,1] [,2]
## 1 0.0009714231652 -0.001058638259
## 2 0.0023663851028 -0.002465832511
## 3 -0.0012715107572 0.001231410573
## 4 0.0028361883865 -0.003061029869
##
##
## $result.matrix
## 1
## error 0.00000024032143171
## reached.threshold 0.00000007870849299
## steps 261.00000000000000000
## Intercept.to.1layhid1 12.57613310993247069
## input1.to.1layhid1 26.09709905087747828
## input2.to.1layhid1 -25.27406355387023140
## Intercept.to.1layhid2 11.37500633477892364
## input1.to.1layhid2 -23.34083293408322390
## input2.to.1layhid2 24.32173217293616219
## Intercept.to.output 23.85189616804580126
## 1layhid.1.to.output -15.93571609448825122
## 1layhid.2.to.output -15.94656172786419646
##
## attr(,"class")
## [1] "nn"
- We can also draw the decision boundaries for the XOR model.
- If you remember, the XOR data has 0’s and 1’s arranged so that a single line cannot separate them.
#Decision Boundaries
m1_slope <- xor_nn_model$weights[[1]][[1]][2]/(-xor_nn_model$weights[[1]][[1]][3])
m1_intercept <- xor_nn_model$weights[[1]][[1]][1]/(-xor_nn_model$weights[[1]][[1]][3])
m2_slope <- xor_nn_model$weights[[1]][[1]][5]/(-xor_nn_model$weights[[1]][[1]][6])
m2_intercept <- xor_nn_model$weights[[1]][[1]][4]/(-xor_nn_model$weights[[1]][[1]][6])
####Drawing the Decision boundary
library(ggplot2)
base<-ggplot(xor_data)+geom_point(aes(x=input1,y=input2,color=factor(output),shape=factor(output)),size=5)
base+geom_abline(intercept = m1_intercept , slope = m1_slope, colour = "blue", size = 2) +geom_abline(intercept = m2_intercept , slope = m2_slope, colour = "blue", size = 2)
- These are the decision boundaries for the XOR model: the neural network has built a non-linear decision boundary, made of two lines, that helps us identify the 1’s and 0’s.
- Thus anything between these lines is 0, while beyond these lines is 1.
- Similarly we will try to build a neural network model on employee productivity data.
#Build neural network algorithm to predict the productivity based on age and experience
library(neuralnet)
Emp_Productivity_nn_model1<-neuralnet(Productivity~Age+Experience,data=Emp_Productivity_raw )
plot(Emp_Productivity_nn_model1)
- If we do not set the linear.output option, the model treats the output as a linear (regression) output, which is wrong here.
- So we include the option linear.output=FALSE, since the output is not linear but a binary 1 or 0.
#Including the option Linear.output
Emp_Productivity_nn_model1<-neuralnet(Productivity~Age+Experience,data=Emp_Productivity_raw, linear.output = FALSE)
plot(Emp_Productivity_nn_model1)
- So this is the final model, with an error of about 13; the algorithm took about 40,867 steps with the default number of hidden nodes.
- We did not specify the hidden layer earlier, so let us try specifying it now.
#Including the option Hidden layers
Emp_Productivity_nn_model1<-neuralnet(Productivity~Age+Experience,data=Emp_Productivity_raw, hidden=2,linear.output = FALSE)
plot(Emp_Productivity_nn_model1)
- We have specified two hidden nodes; the error is still about 13, after about 40,000 steps.
- So we do some in-time validation and find out the error.
- We can do the plotting of actual verses predicted.
####Results and Intime validation
actual_values<-Emp_Productivity_raw$Productivity
Predicted<-Emp_Productivity_nn_model1$net.result[[1]]
head(Predicted)
## [,1]
## 1 0.3611875923
## 2 0.3611875923
## 3 0.3611875923
## 4 0.3611875923
## 5 0.3611875923
## 6 0.3611875923
#The root mean square error
sqr_err<-(actual_values-Predicted)^2
sum(sqr_err)
## [1] 27.46218781
mean(sqr_err)
## [1] 0.2307746874
sqrt(mean(sqr_err))
## [1] 0.4803901409
#Plotting Actual and Predicted
plot(actual_values)
points(Predicted, col=2)
- It looks like this is not a very good model.
- So we can build one more neural network model on the employee productivity data.
#Plotting Actual and Predicted using ggplot
library(ggplot2)
library(reshape2)
act_pred_df<-data.frame(actual_values,Predicted)
act_pred_df$id<-rownames(act_pred_df)
act_pred_df_melt = melt(act_pred_df, id.vars ="id")
ggplot(act_pred_df_melt,aes(id, value, colour = variable)) + geom_point()
##Plotting Actual and Predicted using ggplot on classification graph
Emp_Productivity_pred_act<-data.frame(Emp_Productivity_raw,Predicted=round(Predicted,0))
library(ggplot2)
#Graph without predictions
ggplot(Emp_Productivity_pred_act)+geom_point(aes(x=Age,y=Experience,color=factor(Productivity)),size=5)
#Graph with predictions
ggplot(Emp_Productivity_pred_act)+geom_point(aes(x=Age,y=Experience,color=factor(Productivity),shape=factor(Predicted)),size=5)
Emp_Productivity_nn_model1<-neuralnet(Productivity~Age+Experience,data=Emp_Productivity_raw, hidden=2,linear.output = FALSE)
plot(Emp_Productivity_nn_model1)
- This time the error is again about 13.
- Similarly, we can rerun the model as many times as we want and find the best one by doing validation.
R Code Options
- The neuralnet() call has input parameters that we need to take care of, such as hidden: the number of nodes in the hidden layer. Passing a vector adds more hidden layers.
- Make sure there are sufficient hidden nodes/layers to capture the overall variance in the objective space.
- stepmax: while executing the algorithm, the neural network can sometimes run in a long loop, iterating without ever converging. We can allow, say, 100,000 steps for the algorithm to converge.
- Sometimes we may get an error that the algorithm did not converge within the default stepmax; increase the stepmax parameter value in such cases.
- threshold is connected to the error calculation and is used as a stopping criterion.
- The stopping criterion is based on the minimum error that is calculated: if the error measure drops below the threshold, the algorithm stops.
- We can fix the threshold depending on our requirements; it is often taken as 0.001.
- In R’s neuralnet() function, the output is expected to be linear by default.
- For classification problems the output is not linear, so we need to specifically set linear.output = FALSE.
- So this is how we build a neural network in R; a sketch of such a call is given below.
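For reference, a sketch of a neuralnet() call with the options discussed above (the parameter values here are only illustrative, not tuned):

#Sketch: a neuralnet() call showing the main options discussed above
library(neuralnet)
nn_model <- neuralnet(Productivity ~ Age + Experience,
                      data          = Emp_Productivity_raw,
                      hidden        = 2,       # nodes in the hidden layer; a vector adds layers
                      stepmax       = 1e+05,   # maximum steps before a "did not converge" error
                      threshold     = 0.001,   # stopping threshold on the error-function derivatives
                      linear.output = FALSE)   # FALSE for classification (binary) outputs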
Output: Building the neural network in R
- Depending on the run, we check that the error is very close to 0; if it is much higher, we may have to re-run the algorithm.
- Then we can do a kind of in-time validation.
- How many 0s are there, and how many of them are actually classified as 0s?
- How many 1s are there, and how many of them are actually classified as 1s? A confusion matrix, as sketched below, answers these questions.
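For example, a small sketch reusing the in-time predictions computed earlier:

#Sketch: confusion matrix for the neural network's in-time predictions
nn_predicted <- round(Emp_Productivity_nn_model1$net.result[[1]], 0)
table(nn_predicted, Emp_Productivity_raw$Productivity)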
Code- Prediction using NN
new_data<-data.frame(Age=40, Experience=12)
compute(Emp_Productivity_nn_model1, new_data)
## $neurons
## $neurons[[1]]
## 1 Age Experience
## [1,] 1 40 12
##
## $neurons[[2]]
## [,1] [,2] [,3]
## [1,] 1 0.9999999917 0.9943805052
##
##
## $net.result
## [,1]
## [1,] 0.03814890608
- We can predict using the compute function.
- We can use the neural network model, give it the values of the inputs (Age and Experience), and find the value of $y$ using the compute function.
- In fact, there can be many solutions for a given neural network, because the gradient descent algorithm searches for a local minimum, not the global minimum.
- There is whole lot of theory behind that.
- At this point we just need to consider that the neural network output that we get is not unique, it might have multiple solutions.
There can be many solutions
- This is one solution, for example a set of weights like 11, 8, 7, 19.
- There can be several such combinations.
- Here the error is 0 and the algorithm took 191 steps.
- In another run the error is again 0, but the set of weights is different.
- Since there are so many weights feeding into the overall network, if some values change, the remaining values adjust automatically; the overall error is still minimal but with different weights.
- This is set-3, with slightly different weights again.
- Thus there can be many sets of weights, because all we are trying to do is find the best weights, i.e., a solution.
- The optimal solution gives us the least error, and we may end up with different sets of weights that achieve it (a reproducibility sketch follows below).
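Because the starting weights are random, refitting can land on a different but similarly good solution; fixing the random seed before the call makes a particular run reproducible. A small sketch, reusing the same formula as above:

#Sketch: fixing the seed so the random starting weights (and hence the solution) repeat
set.seed(123)
nn_run1 <- neuralnet(Productivity ~ Age + Experience, data = Emp_Productivity_raw,
                     hidden = 2, linear.output = FALSE)
set.seed(123)
nn_run2 <- neuralnet(Productivity ~ Age + Experience, data = Emp_Productivity_raw,
                     hidden = 2, linear.output = FALSE)
all.equal(nn_run1$weights, nn_run2$weights)    # identical starting point, identical fit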
Local vs. Global Minimum
- There is an issue of local vs. global minimum.
- We need to know this detail of how neural networks are trained.
- The question is what exactly these multiple solutions are, and what local vs. global minimum means.
- There can be multiple solutions for a given neural network because there are many weights, and many weight combinations can lead to a small error.
- The gradient descent method that we use to find the weights in a neural network does not necessarily find the global minimum; most of the time it finds the nearest local minimum.
- The graph illustrates what the global minimum is, compared with a local minimum.
- The algorithm tends to find a local minimum rather than the global minimum, which is why we may see multiple solutions for a given neural network problem.
- That is somewhat uncomfortable, but we can perform cross-validation checks to find the final optimized solution.
- So there can be multiple optimal solutions for a neural network.
Conclusion
- Neural networks are a vast subject; many data scientists focus solely on neural network techniques.
- In this session we practiced the introductory concepts only. Neural networks have many more advanced techniques, and there are many algorithms other than back propagation.
- Neural networks work particularly well on certain classes of problems, such as image recognition.
- Neural network algorithms are very calculation intensive. They require highly efficient computing machines, and large datasets take a significant amount of runtime in R, so we need to try different options and packages.
- Currently there is a lot of exciting research going on around neural networks.
- After gaining sufficient knowledge in this basic session, you may want to explore reinforcement learning, deep learning, etc.; the neural network concepts discussed in this topic are a prerequisite for all of them.
Appendix
Math- How to update the weights?
- We update the weights backwards, iteratively calculating the error.
- The weight update uses the gradient descent method, or delta rule, also known as the Widrow-Hoff rule.
- First we calculate the weight corrections for the output layer, then we take care of the hidden layer.
- The update has the form $w_{new} = w_{old} + \eta\,\delta\,x$, where $\eta$ is the learning parameter and $\delta$ is the error term: at the output layer $\delta = \hat{y}(1-\hat{y})(y-\hat{y})$, and for hidden layers $\delta = h(1-h)\sum_k \delta_k W_k$, the back-propagated error fraction.
- The weight corrections are calculated based on the error function.
- The new weights are chosen in such a way that the final error in the network is minimized.
Math-How does the delta rule work?
- Let us consider a simple example to understand weight updating using the delta rule.
- Suppose we are building a simple logistic regression line $y = g(wx)$ and would like to find the weight $w$ using the weight update rule.
- We are searching for the optimal $w$ for our data.
- Let the initial $w$ be 1, so $y = g(1 \cdot x)$ is the initial equation.
- The error at this initial step is 3.59.
- To reduce the error we add a small $\delta$ to $w$ and make it 1.5.
- Now $w$ is 1.5 (the blue line), and $y = g(1.5x)$ is the updated equation.
- With the updated weight, the error is 1.57.
- We can further reduce the error by increasing $w$ by another $\delta$.
- If we repeat the same process of adding a delta and updating the weight, we finally end up with the minimum error.
- The weight at that final step is the optimal weight.
- In this example the final weight is 8 and the error is 0, so $y = g(8x)$ is the final equation.
- In this example we manually changed the weights to reduce the error. This is just for intuition; manual updating is not feasible for complex optimization problems.
- Gradient descent is a systematic optimization method: we update the weights by calculating the gradient of the error function.
How does gradient descent work?
- Gradient descent is one of the well-known ways to find a local minimum of a function.
- By changing the weights we move towards the minimum value of the error function: the weights are changed by taking steps in the negative direction of the function’s gradient (derivative), as in the sketch below.
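A minimal sketch of this idea on a made-up one-weight error curve (purely illustrative; the minimum is placed at w = 8 only to echo the delta-rule example above):

#Sketch: gradient descent on a toy error function E(w) with its minimum at w = 8
E     <- function(w) (w - 8)^2          # illustrative error curve (not from the session data)
dE_dw <- function(w) 2*(w - 8)          # its gradient (derivative)
w   <- 1                                # initial weight
eta <- 0.1                              # learning rate (step size)
for (i in 1:100) w <- w - eta*dE_dw(w)  # step in the negative gradient direction
c(w = w, error = E(w))                  # w has moved very close to 8, error close to 0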
Does this method really work?
- We changed the weights; did it reduce the overall error?
- Let’s calculate the error with the new weights and see the change.
Gradient Descent Method Validation
- With our initial set of weights the overall error was 0.7137 (Y actual is 0, Y predicted is 0.7137, so error = 0.7137).
- The new weights give us a predicted value of 0.70655.
- So in one iteration we reduced the error from 0.7137 to 0.70655.
- The error is reduced by about 1%. If we repeat the same process over multiple epochs and training examples, we can reduce the error further.