You can download the datasets and R code file for this session here.
Neural Networks
Contents
- Neural network Intuition
- Neural network and vocabulary
- Neural network algorithm
- Math behind neural network algorithm
- Building the neural networks
- Validating the neural network model
- Neural network applications
- Image recognition using neural networks
Recap of Logistic Regression
- When there is a categorical output, yes/no or 1/0 (a binary output), and the predictor variable is continuous, we cannot fit a linear regression line.
- If we try fitting several lines, a logistic regression curve fits this kind of dataset better than a linear regression line.
- Thus we need a logistic regression line for this data.
- We use the predictor variables to predict the categorical output.
- Before moving from logistic regression to neural networks, let us do a quick recap of how we build a logistic regression and how a neural network can be built from a combination of several logistic regressions.
LAB: Logistic Regression
- Dataset: Emp_Productivity/Emp_Productivity.csv
- Filter the data and take a subset from above dataset. Filter condition is Sample_Set<3.
- Draw a scatter plot that shows Age on X-axis and Experience on Y-axis. Try to distinguish the two classes with colors or shapes (visualizing the classes).
- Build a logistic regression model to predict Productivity using age and experience.
- Finally draw the decision boundary for this logistic regression model.
- Create the confusion matrix.
- Calculate the accuracy and error rates.
Solution
- Dataset: Emp_Productivity/Emp_Productivity.csv
Emp_Productivity_raw <- read.csv("R dataset/Emp_Productivity/Emp_Productivity.csv")
- Filter the data and take a subset from above dataset. Filter condition is Sample_Set<3.
Emp_Productivity1<-Emp_Productivity_raw[Emp_Productivity_raw$Sample_Set<3,]
dim(Emp_Productivity1)
## [1] 74 4
names(Emp_Productivity1)
## [1] "Age" "Experience" "Productivity" "Sample_Set"
head(Emp_Productivity1)
## Age Experience Productivity Sample_Set
## 1 20.0 2.3 0 1
## 2 16.2 2.2 0 1
## 3 20.2 1.8 0 1
## 4 18.8 1.4 0 1
## 5 18.9 3.2 0 1
## 6 16.7 3.9 0 1
table(Emp_Productivity1$Productivity)
##
## 0 1
## 33 41
- Draw a scatter plot that shows Age on X axis and Experience on Y-axis. Try to distinguish the two classes with colors or shapes (visualizing the classes).
library(ggplot2)
ggplot(Emp_Productivity1)+geom_point(aes(x=Age,y=Experience,color=factor(Productivity),shape=factor(Productivity)),size=5)
- Class 1 is shown as a triangle and class 0 as a dot.
- Age is on the X-axis and Experience is on the Y-axis.
- There are some 1’s and some 0’s, and it looks like they can be classified very easily.
- We will try to fit a logistic regression line, i.e., the classifier, and see whether it can help us separate the 0’s from the 1’s by putting a decision boundary between them.
- We fit a logistic regression of Productivity on Age and Experience from Emp_Productivity1, with family=binomial.
- Build a logistic regression model to predict Productivity using age and experience.
Emp_Productivity_logit<-glm(Productivity~Age+Experience,data=Emp_Productivity1, family=binomial())
Emp_Productivity_logit
##
## Call: glm(formula = Productivity ~ Age + Experience, family = binomial(),
## data = Emp_Productivity1)
##
## Coefficients:
## (Intercept) Age Experience
## -8.9361 0.2763 0.5923
##
## Degrees of Freedom: 73 Total (i.e. Null); 71 Residual
## Null Deviance: 101.7
## Residual Deviance: 46.77 AIC: 52.77
coef(Emp_Productivity_logit)
## (Intercept) Age Experience
## -8.9361114 0.2762749 0.5923444
slope1 <- coef(Emp_Productivity_logit)[2]/(-coef(Emp_Productivity_logit)[3])
intercept1 <- coef(Emp_Productivity_logit)[1]/(-coef(Emp_Productivity_logit)[3])
- To create the decision boundary we have to find slope and intercept.
- Finally draw the decision boundary for this logistic regression model.
library(ggplot2)
base<-ggplot(Emp_Productivity1)+geom_point(aes(x=Age,y=Experience,color=factor(Productivity),shape=factor(Productivity)),size=5)
base+geom_abline(intercept = intercept1 , slope = slope1, color = "red", size = 2) #Base is the scatter plot. Then we are adding the decision boundary
- Create the confusion matrix.
predicted_values<-round(predict(Emp_Productivity_logit,type="response"),0)
conf_matrix<-table(predicted_values,Emp_Productivity_logit$y)
conf_matrix
##
## predicted_values 0 1
## 0 31 2
## 1 2 39
- Calculate the accuracy and error rates.
accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy
## [1] 0.9459459
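The exercise also asks for the error rate; it is simply the complement of the accuracy computed above (a small addition, not part of the original lab code):

#Error rate is the complement of accuracy
error_rate <- 1 - accuracy
error_rate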
Decision Boundary
Decision Boundary – Logistic Regression
- The line or margin that separates the classes is called Decision Boundary.
- Classification algorithms are all about finding the decision boundaries.
- It need not always be a straight line (i.e., linear).
- The final function of our decision boundary looks like: $Y=1$ if $w_0 + w_1x_1 + w_2x_2 > 0$; else $Y=0$.
- In logistic regression, the boundary can be derived from the logistic regression coefficients and the threshold.
- Imagine the logistic regression line $p(y)=\frac{1}{1+e^{-(w_0+w_1x_1+w_2x_2)}}$.
- Suppose if $p(y)>0.5$ then class-1, or else class-0; then $w_0+w_1x_1+w_2x_2=0$ is the boundary line.
- Rewriting it in $x_2 = mx_1 + c$ form: $x_2 = -\frac{w_1}{w_2}x_1 - \frac{w_0}{w_2}$.
- Anything above this line is class-1, below this line is class-0:
  - $w_0+w_1x_1+w_2x_2 > 0$ is class-1,
  - $w_0+w_1x_1+w_2x_2 < 0$ is class-0,
  - $w_0+w_1x_1+w_2x_2 = 0$ is a tie, with probability 0.5.
- We can change the decision boundary by changing the threshold value (here 0.5), as illustrated in the sketch below.
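As a small illustration (not part of the original lab code), the boundary line for an arbitrary probability threshold t can be computed directly from the fitted coefficients: for p(y) > t the boundary is w0 + w1*x1 + w2*x2 = log(t/(1-t)), so only the intercept of the line shifts.

#Sketch: decision boundary for an arbitrary threshold t (t = 0.5 gives the line above)
w <- coef(Emp_Productivity_logit)                     # (Intercept), Age, Experience
boundary_slope     <- -w[2]/w[3]                      # same slope for every threshold
boundary_intercept <- function(t) (log(t/(1-t)) - w[1])/w[3]
boundary_intercept(0.5)                               # equals intercept1 computed earlier
boundary_intercept(0.8)                               # a stricter threshold shifts the line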
LAB: Decision Boundary
- Draw a scatter plot that shows Age on X-axis and Experience on Y-axis. Try to distinguish the two classes with colors or shapes (visualizing the classes).
- Build a logistic regression model to predict Productivity using age and experience.
- Finally draw the decision boundary for this logistic regression model.
- Create the confusion matrix.
- Calculate the accuracy and error rates.
Solution
- Drawing the Decision boundary for the logistic regression model
library(ggplot2)
base<-ggplot(Emp_Productivity1)+geom_point(aes(x=Age,y=Experience,color=factor(Productivity),shape=factor(Productivity)),size=5)
base+geom_abline(intercept = intercept1 , slope = slope1, color = "red", size = 2)
#Base is the scatter plot. Then we are adding the decision boundary
- This is the logistic regression line, or decision boundary: anything above the line is classified as 1 and anything below the line is classified as 0.
- It looks like a fairly accurate model, except for the 1 or 2 misclassified points shown in the graph.
- Create the confusion matrix.
predicted_values<-round(predict(Emp_Productivity_logit,type="response"),0)
conf_matrix<-table(predicted_values,Emp_Productivity_logit$y)
conf_matrix
##
## predicted_values 0 1
## 0 31 2
## 1 2 39
- Calculate the accuracy and error rates.
accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy
- We get very high accuracy because most of the values in the confusion matrix lie on the diagonal; only two 0’s and two 1’s are wrongly classified.
- With an accuracy of about 94.6%, this is a very good model.
New Representation for Logistic Regression
- We are trying to understand neural networks here through logistic regression.
- Moving forward we will see a new representation for logistic regression line that we just built and see how it transforms to neural networks later.
- This is the logistic regression line that we have built: $p(y)=\frac{1}{1+e^{-(-8.94+0.28\,Age+0.59\,Experience)}}$.
- That can be rewritten in general form as: $p(y)=\frac{1}{1+e^{-(w_0+w_1x_1+w_2x_2)}}$.
- Thus this particular line can be taken as one equation and written as $y=g(w_0+w_1x_1+w_2x_2)$, where $g(z)=\frac{1}{1+e^{-z}}$, $x_1$ is Age and $x_2$ is Experience.
- This can be displayed in a diagram as follows: the inputs $x_1$ and $x_2$ feed into the output node with weights $w_1$ and $w_2$, while the bias term $w_0$ has no prior input (its input is the constant 1).
- We can simply say $w_0+w_1x_1+w_2x_2$ is the line equation that goes through the logistic function $g$.
- This is how we can represent the same logistic regression line that we just built (a quick numerical check is sketched below).
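As a quick check of this representation (an illustrative sketch, not part of the original lab), we can compute g(w0 + w1*Age + w2*Experience) manually and compare it with the fitted probabilities:

#Sketch: the logistic regression written explicitly as y = g(w0 + w1*x1 + w2*x2)
g <- function(z) 1/(1 + exp(-z))                      # the logistic (sigmoid) function
w <- coef(Emp_Productivity_logit)
p_manual <- g(w[1] + w[2]*Emp_Productivity1$Age + w[3]*Emp_Productivity1$Experience)
p_glm    <- predict(Emp_Productivity_logit, type = "response")
all.equal(as.numeric(p_manual), as.numeric(p_glm))    # should be TRUE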
Finding the weights in logistic regression
- The output is a non-linear function of a linear combination of inputs – a typical multiple logistic regression line.
- For this line, we have to find the coefficients (weights) $w$.
- We find $w_0, w_1, w_2$ to minimize the error $\sum_i \left(y_i - g(w_0+w_1x_{1i}+w_2x_{2i})\right)^2$, as illustrated in the sketch below.
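To illustrate the idea of "finding w to minimize an error" (glm() itself fits by maximum likelihood, so this is only a sketch of the minimization idea), we can minimize the squared error numerically:

#Sketch: finding weights by numerically minimizing the squared error
g <- function(z) 1/(1 + exp(-z))
sq_error <- function(w, x1, x2, y) sum((y - g(w[1] + w[2]*x1 + w[3]*x2))^2)
fit <- optim(par = c(0, 0, 0), fn = sq_error,
             x1 = Emp_Productivity1$Age,
             x2 = Emp_Productivity1$Experience,
             y  = Emp_Productivity1$Productivity)
fit$par    # weights that (approximately) minimize the squared error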
LAB: Non-Linear Decision Boundaries
- Dataset: “Emp_Productivity/ Emp_Productivity.csv”
- Draw a scatter plot that shows Age on X axis and Experience on Y-axis. Try to distinguish the two classes with colors or shapes (visualizing the classes).
- Build a logistic regression model to predict Productivity using age and experience.
- Finally draw the decision boundary for this logistic regression model.
- Create the confusion matrix.
- Calculate the accuracy and error rates.
Note – Here we are considering the entire dataset, not the subset.
####The classification graph on overall data
library(ggplot2)
ggplot(Emp_Productivity_raw)+geom_point(aes(x=Age,y=Experience,color=factor(Productivity),shape=factor(Productivity)),size=5)
- As age and experience increase, there are a lot of 0’s at the beginning of the graph and then a lot of 1’s after that.
- So it looks like a logistic regression line could separate the 0’s and 1’s in this portion.
- But when we go beyond this point, there are again a lot of 0’s.
- Thus it looks like fitting one logistic regression line, or one linear decision boundary, might not be sufficient.
- Nevertheless, let us go ahead and force-fit one linear decision boundary.
###Logistic Regression model for overall data
Emp_Productivity_logit_overall<-glm(Productivity~Age+Experience,data=Emp_Productivity_raw, family=binomial())
Emp_Productivity_logit_overall
##
## Call: glm(formula = Productivity ~ Age + Experience, family = binomial(),
## data = Emp_Productivity_raw)
##
## Coefficients:
## (Intercept) Age Experience
## 0.44784 -0.01755 -0.06324
##
## Degrees of Freedom: 118 Total (i.e. Null); 116 Residual
## Null Deviance: 155.7
## Residual Deviance: 150.5 AIC: 156.5
slope2 <- coef(Emp_Productivity_logit_overall)[2]/(-coef(Emp_Productivity_logit_overall)[3])
intercept2 <- coef(Emp_Productivity_logit_overall)[1]/(-coef(Emp_Productivity_logit_overall)[3])
####Drawing the Decision boundary
library(ggplot2)
base<-ggplot(Emp_Productivity_raw)+geom_point(aes(x=Age,y=Experience,color=factor(Productivity),shape=factor(Productivity)),size=5)
base+geom_abline(intercept = intercept2 , slope = slope2, colour = "blue", size = 2)
- Obviously, this decision boundary is nowhere close to a good one.
- There are many misclassifications; in fact, every triangle (class 1) has been incorrectly classified as 0.
- Let us look at the overall accuracy; the confusion matrix will not look as good as it did earlier.
- There are many incorrect, off-diagonal values.
- And the accuracy falls drastically.
####Accuracy of the overall model
predicted_values<-round(predict(Emp_Productivity_logit_overall,type="response"),0)
conf_matrix<-table(predicted_values,Emp_Productivity_logit_overall$y)
conf_matrix
##
## predicted_values 0 1
## 0 69 43
## 1 7 0
accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy
## [1] 0.5798319
- Accuracy is only about 0.58, which is not at all acceptable.
- A single linear decision boundary will not work for this data.
- We have to build a non-linear decision boundary, which in cases like this might look like a “U” shape.
Non-Linear Decision Boundaries-Issue
- Basically, when there is no linear separation between two classes, i.e., when a single straight line cannot divide the two classes, we have to go for a non-linear decision boundary.
- With a non-linear decision boundary we cannot fit just one line and say the right-hand side of the line is 1’s and the left-hand side is 0’s.
- Thus one line might not be sufficient.
- Therefore this is an issue, and logistic regression does not seem to be a good option when we have non-linear decision boundaries.
Non-Linear Decision Boundaries
Non-Linear Decision Boundaries-Solution
- We need to find a solution for non-linear decision boundaries.
- One idea: use multiple logistic regression lines together – construct the decision boundary by fitting 2 or 3 logistic regression lines and then use them to build the final classifier.
- As of now, a single logistic regression line cannot work in a scenario where there is a non-linear separating boundary between the two classes.
- Since the classes cannot be separated by one linear line or classifier, why don’t we fit two models?
- Model-1 separates the first portion and model-2 takes care of the other portion; instead of directly finding the final output, we then have one combined classifier, which is non-linear, that tells us where class-1 and class-2 are.
- We get an intermediate output, say $h_1$, coming out of model-1 and another intermediate output, $h_2$, coming out of model-2.
- We can then use $h_1$ and $h_2$ to find the final classifier.
(Figures: Intermediate Output 1 and Intermediate Output 2)
The Intermediate output
- Using $x_1$ and $x_2$ directly to predict $y$ is challenging; the independent variables we have are $x_1$ (Age) and $x_2$ (Experience).
- Predicting $y$ directly from the independent variables $x_1$ and $x_2$ is challenging because $y$ is non-linearly dependent on $x_1$ and $x_2$, so a linear classifier does not work.
- The idea is to first predict the intermediate outputs $h_1$ and $h_2$, and then let the intermediate outputs predict $y$.
- Instead of going directly from $(x_1, x_2)$ to $y$, we predict $h_1$ using $x_1$ and $x_2$, then $h_2$ again using $x_1$ and $x_2$, and then use $h_1$ and $h_2$ to predict $y$.
Finding the Weights for Intermediate Outputs
- How do we find the weights of the intermediate outputs?
- We try to predict $h_1$ using $x_1$ and $x_2$, and $h_2$ using $x_1$ and $x_2$; then $h_1$ and $h_2$ are used for predicting $y$.
- $h_1$ is a non-linear function of a linear combination of $x_1$ and $x_2$: $h_1 = g(w_{01} + w_{11}x_1 + w_{21}x_2)$.
- $h_2$ is a non-linear function $g$ of a linear combination of $x_1$ and $x_2$: $h_2 = g(w_{02} + w_{12}x_1 + w_{22}x_2)$.
- $y$ is a non-linear function of a linear combination of $h_1$ and $h_2$: $y = g(W_0 + W_1 h_1 + W_2 h_2)$.
- Thus $h_1$ and $h_2$ are the intermediate outputs.
LAB: Intermediate output
- Dataset: Emp_Productivity/ Emp_Productivity_All_Sites.csv
- Filter the data and take first 74 observations from above dataset. Filter condition is Sample_Set<3.
- Build a logistic regression model to predict Productivity using age and experience.
- Calculate the prediction probabilities for all the inputs. Store the probabilities in inter1 variable.
- Filter the data and take observations from row 34 onwards. Filter condition is Sample_Set>1.
- Build a logistic regression model to predict Productivity using age and experience.
- Calculate the prediction probabilities for all the inputs. Store the probabilities in inter2 variable.
- Build a consolidated model to predict productivity using the inter1 and inter2 variables.
- Create the confusion matrix and find the accuracy and error rates for the consolidated model.
Our sampled data Emp_Productivity1 has the first 74 observations. Let’s build the model on this sample data (sample-1).
####The classification graph Sample-1
library(ggplot2)
ggplot(Emp_Productivity1)+geom_point(aes(x=Age,y=Experience,color=factor(Productivity),shape=factor(Productivity)),size=5)
- The overall data is arranged in a particular manner: there are some 0’s first, then 1’s, and then again 0’s.
- So the first class appears, then the second class, and then the first class again; we cannot have one linear decision boundary or classifier that separates these two classes.
- We take sample-1, the initial portion with the 0’s and then the 1’s, and try to draw a decision boundary.
- The initial portion looks like this:
- We have the original data like this:
- Then for sample-1, we took this particular subset of data.
- And then we fit a logistic regression line which fits very well.
- Thus we found a perfect classifier.
###Logistic Regression model1
Emp_Productivity_logit<-glm(Productivity~Age+Experience,data=Emp_Productivity1, family=binomial())
coef(Emp_Productivity_logit)
## (Intercept) Age Experience
## -8.9361114 0.2762749 0.5923444
slope1 <- coef(Emp_Productivity_logit)[2]/(-coef(Emp_Productivity_logit)[3])
intercept1 <- coef(Emp_Productivity_logit)[1]/(-coef(Emp_Productivity_logit)[3])
####Decision boundary for model1 built on Sample-1
library(ggplot2)
base<-ggplot(Emp_Productivity1)+geom_point(aes(x=Age,y=Experience,color=factor(Productivity),shape=factor(Productivity)),size=5)
base+geom_abline(intercept = intercept1 , slope = slope1, color = "red", size = 2)
#Base is the scatter plot. Then we are adding the decision boundary
- This model has high accuracy.
- We can take sample-2, which is the other portion.
- As with classifier-1, we now consider sample-2 and fit another intermediate model.
- Sample-2 is taken based on the condition Sample_Set>1.
#Filter the data and take observations from row 34 onwards.
Emp_Productivity2<-Emp_Productivity_raw[Emp_Productivity_raw$Sample_Set>1,]
####The classification graph
library(ggplot2)
ggplot(Emp_Productivity2)+geom_point(aes(x=Age,y=Experience,color=factor(Productivity),shape=factor(Productivity)),size=5)
- There are a lot of 1’s and a lot of 0’s.
- We then fit a logistic regression line for this data and expect to find a good separating line.
###Logistic Regression model2 built on Sample2
Emp_Productivity_logit2<-glm(Productivity~Age+Experience, data=Emp_Productivity2, family=binomial())
Emp_Productivity_logit2
##
## Call: glm(formula = Productivity ~ Age + Experience, family = binomial(),
## data = Emp_Productivity2)
##
## Coefficients:
## (Intercept) Age Experience
## 16.3184 -0.3994 -0.2440
##
## Degrees of Freedom: 85 Total (i.e. Null); 83 Residual
## Null Deviance: 119
## Residual Deviance: 34.08 AIC: 40.08
- We can find a decision boundary that separates the 0’s from the 1’s.
coef(Emp_Productivity_logit2)
## (Intercept) Age Experience
## 16.3183916 -0.3994172 -0.2439643
slope3 <- coef(Emp_Productivity_logit2)[2]/(-coef(Emp_Productivity_logit2)[3])
intercept3 <- coef(Emp_Productivity_logit2)[1]/(-coef(Emp_Productivity_logit2)[3])
####Drawing the Decision boundary
library(ggplot2)
base<-ggplot(Emp_Productivity2)+geom_point(aes(x=Age,y=Experience,color=factor(Productivity),shape=factor(Productivity)),size=5)
base+geom_abline(intercept = intercept3 , slope = slope3, color = "red", size = 2)
- We can see that the 1’s are on one side of the decision boundary and the 0’s on the other side.
####Accuracy of the model2
predicted_values<-round(predict(Emp_Productivity_logit2,type="response"),0)
conf_matrix<-table(predicted_values,Emp_Productivity_logit2$y)
conf_matrix
##
## predicted_values 0 1
## 0 43 2
## 1 2 39
accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy
## [1] 0.9534884
- The accuracy level should be fairly high for this particular model.
- As found, the accuracy is about 95%.
- Emp_Productivity_logit was built on the initial portion of the data and Emp_Productivity_logit2 on the second portion.
- We will create two more columns in the dataset.
- Thus we create two new variables that hold the intermediate outputs, i.e., the outputs of the two models that we have built.
- We use logistic regression model-1 to predict a variable called inter1.
- We create inter1 manually: since we cannot directly predict $y$, we first predict inter1 and then inter2.
- inter1 and inter2 act as the two new variables $h_1$ and $h_2$, and they will in turn be used for predicting $y$.
- inter1 is predicted by logistic regression-1.
- inter2 is predicted by logistic regression-2.
#Calculate the prediction probabilities for all the inputs. Store the probabilities in inter1 variable
Emp_Productivity_raw$inter1<-predict(Emp_Productivity_logit,type="response", newdata=Emp_Productivity_raw)
#Calculate the prediction probabilities for all the inputs. Store the probabilities in inter2 variable
Emp_Productivity_raw$inter2<-predict(Emp_Productivity_logit2,type="response", newdata=Emp_Productivity_raw)
head(Emp_Productivity_raw)
## Age Experience Productivity Sample_Set inter1 inter2
## 1 20.0 2.3 0 1 0.11423230 0.9995775
## 2 16.2 2.2 0 1 0.04080461 0.9999096
## 3 20.2 1.8 0 1 0.09202657 0.9995949
## 4 18.8 1.4 0 1 0.05152147 0.9997899
## 5 18.9 3.2 0 1 0.13955234 0.9996608
## 6 16.7 3.9 0 1 0.11793035 0.9998329
- The idea is that these two variables, created using the separate logistic regressions, have a much better chance of predicting $y$, and can do it quite easily.
- We now have two new variables called inter1 and inter2.
- The graph for predicting $y$ using inter1 and inter2 looks quite different.
- In the classification graph below, the inter1 output is on the X-axis and the inter2 output on the Y-axis.
####Classification graph with the two new columns
library(ggplot2)
ggplot(Emp_Productivity_raw)+geom_point(aes(x=inter1,y=inter2,color=factor(Productivity),shape=factor(Productivity)),size=5)
- We can clearly see a good linear separating boundary between class-1 and class-0.
- A straight line going diagonally, as shown in the graph, can separate them.
- Let us go ahead and predict the probability of Productivity using inter1 and inter2.
- So instead of using $x_1$ and $x_2$ directly, we are using inter1 and inter2, which are derived from the two logistic regression models.
- Thus we create a new model called Emp_Productivity_logit_combined.
###Logistic Regression model with Intermediate outputs as input
Emp_Productivity_logit_combined<-glm(Productivity~inter1+inter2,data=Emp_Productivity_raw, family=binomial())
Emp_Productivity_logit_combined
##
## Call: glm(formula = Productivity ~ inter1 + inter2, family = binomial(),
## data = Emp_Productivity_raw)
##
## Coefficients:
## (Intercept) inter1 inter2
## -12.213 8.019 8.598
##
## Degrees of Freedom: 118 Total (i.e. Null); 116 Residual
## Null Deviance: 155.7
## Residual Deviance: 49.74 AIC: 55.74
- Now that we have the new model, let us look at its decision boundary.
slope4 <- coef(Emp_Productivity_logit_combined)[2]/(-coef(Emp_Productivity_logit_combined)[3])
intercept4<- coef(Emp_Productivity_logit_combined)[1]/(-coef(Emp_Productivity_logit_combined)[3])
####Drawing the Decision boundary
library(ggplot2)
base<-ggplot(Emp_Productivity_raw)+geom_point(aes(x=inter1,y=inter2,color=factor(Productivity),shape=factor(Productivity)),size=5)
base+geom_abline(intercept = intercept4 , slope = slope4, colour = "red", size = 2)
- From the decision boundary it is very clear that most class-1 values lie on one side and most class-0 values on the other.
####Accuracy of the combined model
predicted_values<-round(predict(Emp_Productivity_logit_combined,type="response"),0)
conf_matrix<-table(predicted_values,Emp_Productivity_logit_combined$y)
conf_matrix
##
## predicted_values 0 1
## 0 74 4
## 1 2 39
accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy
## [1] 0.9495798
- We get a high accuracy level of about 95%.
- We have built three models: Emp_Productivity_logit and Emp_Productivity_logit2 on subsets of the data, from which we derived two intermediate output variables, and then a third model, Emp_Productivity_logit_combined, built on those intermediate outputs.
- Finally, we have classified the two classes into 0’s and 1’s.
- There is a linear decision boundary that we can see at the end, but not on the direct input variables: we transform the inputs by building logistic regression lines and finding the intermediate outputs.
- Then we use those intermediate outputs to come up with the final decision boundary.
Neural Network Intuition
Final Output
- So $h$ is a non-linear function of a linear combination of inputs – a multiple logistic regression line: $h = g\left(\sum_k w_k x_k\right)$.
- $y$ is a non-linear function of a linear combination of the outputs of the logistic regressions: $y = g\left(\sum_k W_k h_k\right)$.
- So $y$ is a non-linear function of a linear combination of non-linear functions of linear combinations of inputs.
- We find $W$ to minimize $\sum_i \left(y_i - g\left(\sum_k W_k h_{ki}\right)\right)^2$.
- We find $w$ and $W$ to minimize $\sum_i \left(y_i - g\left(\sum_k W_k\, g\left(\sum_j w_{kj} x_{ij}\right)\right)\right)^2$.
- Neural networks are all about finding the sets of weights $w$ and $W$ using the gradient descent method (a composed forward pass is sketched below).
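Putting the intermediate-output idea together, here is a sketch (using the three models built earlier, not a new fit) of the final prediction as a non-linear function of a linear combination of non-linear functions of the inputs:

#Sketch: y = g(W0 + W1*g(w.x) + W2*g(w.x)) composed from the three fitted models
g  <- function(z) 1/(1 + exp(-z))
w1 <- coef(Emp_Productivity_logit)             # weights feeding intermediate output h1
w2 <- coef(Emp_Productivity_logit2)            # weights feeding intermediate output h2
W  <- coef(Emp_Productivity_logit_combined)    # weights of the output layer

forward <- function(age, experience) {
  h1 <- g(w1[1] + w1[2]*age + w1[3]*experience)
  h2 <- g(w2[1] + w2[2]*age + w2[3]*experience)
  g(W[1] + W[2]*h1 + W[3]*h2)                  # final predicted probability
}
head(round(forward(Emp_Productivity_raw$Age, Emp_Productivity_raw$Experience), 3))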
The Neural Networks
- The neural networks methodology is similar to the intermediate output method explained above.
- But we will not manually subset the data to create the different models.
- The neural network technique automatically takes care of all the intermediate outputs using hidden layers.
- It works very well for the data with non-linear decision boundaries.
- The intermediate output layer in the network is known as hidden layer.
- In simple terms, neural networks are multilayer non-linear regression models.
- If we have sufficient number of hidden layers, then we can estimate any complex non-linear function.
Neural Network and Vocabulary
- Here, the two hidden layer nodes, $h_1$ and $h_2$, are derived from the two inputs, $x_1$ and $x_2$.
Why are they called hidden layers?
- A hidden layer “hides” the desired output.
- Instead of predicting the actual output with a single model, we build multiple models that predict intermediate outputs.
- There is no standard way of deciding the number of hidden layers; with experience, and by looking at the complexity of the problem and the final accuracy of the model, we can experiment with the number of hidden layers.
- It may seem like the more the merrier, but we also have to avoid overfitting.
- So this is the overall intuition of the neural network.
Algorithm for Finding Weights
- The algorithm is all about finding the weights/coefficients.
- We randomly initialize some weights.
- We calculate the output by supplying the training input.
- With those values of $x$ we calculate the values of $h$, and from them the predicted value of $y$.
- For the given values of $x$ we already know the actual value of $y$ from the training data.
- So we try to predict the value of $y$ using these weights.
- Whatever is the error between the predicted and actual, we try to adjust the weight to reduce that error.
- And finally we will find those weights after adjustments that will give us minimum amount of error.
- Let us see what are the steps involved in the neural network algorithm.
The Neural Network Algorithm
- Step 1 : Initialization of weights: Randomly select some weights.
- Step 2 : Training & Activation: Input the training values and perform the calculations forward.
- We have the dataset with us, so we put in the values of $x$, then find the values of $h$, and with them we calculate forward.
- Once we calculate the final output, those are the predicted values of $y$.
- In the training dataset itself we also have the actual values of $y$.
- Step 3 : Error Calculation: Calculate the error at the outputs. Use the output error to calculate error fractions at each hidden layer based on the final layer.
- Step 4: Weight training: Update the weights to reduce the error, recalculate and repeat the process of training & updating the weights for all the examples.
- Step 5: Stopping criteria: Stop the training and weights updating process when the minimum error criteria is met.
- So we start with some weights and then calculate the errors and then adjust the weights to reduce the error, once there is very minimum error we stop it at that point.
Randomly Initialize Weights
- Step-1 is the initialization of some random weights.
- So we take any random weights.
- As these are not the final weights, we can adjust them later.
Training & Activation
- Training and activation is the next step.
- We input the values of $x$ to predict $y$, because we already have (random) weights.
- We substitute the values of $x$ and the random weights into the equations and find $h_1$ and $h_2$.
- So, by giving the training values as input, we can perform the calculations forward.
- This forward pass of training inputs and calculations is called feed forward.
Error Calculation at Output
- In the previous step we computed the predicted value of $y$: the output of the final $g$ function is the predicted $y$.
- We also know the actual value of $y$.
- So we calculate the error at the final layer, and from it we can find the error fractions at each hidden layer.
- With this formula we can find out what fraction of the overall error each hidden node contributes.
- Calculating these error signals backwards, layer by layer, is called back propagation.
Error Calculation at hidden layers
- This is the overall error, i.e. Err.
- Once we have the errors, we can calculate the weight corrections that will reduce those errors.
Calculate weight corrections
- Here, the corrections on the input-to-hidden weights (the $\Delta w$ terms) reduce the errors at $h_1$ and $h_2$, and the corrections on the hidden-to-output weights (the $\Delta W$ terms) reduce the overall output error.
- These are the weight corrections.
Update Weights
- In the update-weights step, each new weight is the sum of the previous weight and its weight correction: $w_{new} = w_{old} + \Delta w$.
- Update the weights to reduce the error based on the weight corrections, recalculate, and repeat the process.
- With the new weights the error is reduced.
- This is one iteration, and we repeat it again and again.
- With the new weights we again find the error at the output.
- Again we find the error at each hidden layer.
- And again we do the weight corrections and update the weights.
- So with each set of new weights, the error is reduced slightly.
- We repeat the process again and again.
Stopping Criteria
- We stop training and updating the weights when the error is at its least.
- When no further error reduction is happening we can stop, or we can set a minimum error criterion, i.e., if the error falls below a particular value we stop training.
- The final weights are taken from the final iteration.
- Once the minimum error criterion is met, we come out of the algorithm.
Once Again ..Neural network Algorithm
- Step 1 : Initialization of weights: Randomly select some weights.
- Step 2 : Training & Activation: Input the training values and perform the calculations forward.
- Step 3 : Error Calculation: Calculate the error at the outputs. Use the output error to calculate error fractions at each hidden layer.
- Step 4 : Weight training: Update the weights to reduce the error, recalculate and repeat the process of training & updating the weights for all the examples.
- Step 5 : Stopping criteria: Stop the training and weights updating process when the minimum error criteria is met.
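As an illustration only (this is not the internals of any particular package), the five steps above can be written out for a tiny 2-2-1 network on the XOR data, trained with the delta rule:

#Illustrative sketch: the five steps for a tiny 2-2-1 network on XOR data
g <- function(z) 1/(1 + exp(-z))                      # activation function
X <- cbind(1, c(1, 1, 0, 0), c(1, 0, 1, 0))           # bias column + input1 + input2
y <- c(0, 1, 1, 0)                                    # XOR outputs

set.seed(1)
Wh  <- matrix(runif(6, -1, 1), nrow = 3)              # Step 1: random hidden-layer weights
Wo  <- matrix(runif(3, -1, 1), ncol = 1)              # Step 1: random output-layer weights
eta <- 0.5                                            # learning rate

for (epoch in 1:50000) {                              # Steps 2-5: repeat until error is small
  H     <- g(X %*% Wh)                                # Step 2: feed forward (hidden outputs)
  y_hat <- g(cbind(1, H) %*% Wo)                      # Step 2: feed forward (final output)
  err   <- y - y_hat                                  # Step 3: error at the output
  d_out <- err * y_hat * (1 - y_hat)                  # Step 3: output delta
  d_hid <- (d_out %*% t(Wo[-1, , drop = FALSE])) * H * (1 - H)  # Step 3: back-propagated deltas
  Wo <- Wo + eta * t(cbind(1, H)) %*% d_out           # Step 4: update output weights
  Wh <- Wh + eta * t(X) %*% d_hid                     # Step 4: update hidden weights
  if (sum(err^2) < 1e-4) break                        # Step 5: stopping criterion
}
round(y_hat, 3)   # should approach 0,1,1,0 (a different seed may be needed if stuck)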
Neural network Algorithm-Demo
- This demo shows how exactly the neural network algorithm works, with actual numbers.
- Let us consider a simple classification example: the XOR gate dataset.
- It is a dataset that cannot be separated using a single linear decision boundary/perceptron.
- Observing the dataset, there are two classes: 1 and 0.
- Clearly there cannot be one single straight decision boundary that separates the two classes.
- So we need to build a non-linear decision boundary, and we use a neural network to classify this data into 1’s and 0’s.
- The XOR gate diagram is similar to the example above.
- The XOR gate truth table is:
  - When $x_1$ = 1 and $x_2$ = 1, the output is 0.
  - When $x_1$ = 1 and $x_2$ = 0, the output is 1.
  - When $x_1$ = 0 and $x_2$ = 1, the output is 1.
  - When $x_1$ = 0 and $x_2$ = 0, the output is 0.
- We will use the neural network algorithm to build a non-linear decision boundary that segregates the two classes, 1’s and 0’s.
- One line is not sufficient, so we need to build a non-linear decision boundary (the dataset itself can also be created directly in R, as shown below).
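If the xor.csv file used in the lab later is not at hand, the same four-row XOR dataset can be created directly; the column names input1, input2 and output match the file read later:

#Creating the XOR dataset directly (equivalent to the Gates/xor.csv file read later)
xor_data <- data.frame(input1 = c(1, 1, 0, 0),
                       input2 = c(1, 0, 1, 0),
                       output = c(0, 1, 1, 0))
xor_data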
Randomly Initialize Weights
- Step 1 is to randomly initialize the weights.
- Suppose we pick some random weights; they can be anything.
- There is no need to worry about the exact values we take.
- So we randomly initialize the weights like this.
Activation
- We need to take the input data, this is directly taken from the dataset that we have.
- This is the XOR dataset, with the same truth table as shown above.
- We take the first data point and pass it through the network; this is the first training example of the first epoch.
- Inputting the values $x_1$ = 1 and $x_2$ = 1, we can find the values of $h_1$ and $h_2$.
- Substituting the inputs and the random weights into the sigmoid, $h_1$ comes out as 0.818 and $h_2$ as 0.731.
- Using $h_1$, $h_2$ and the output-layer weights in the same kind of equation, we calculate the final value of $y$.
- When $x_1$ is 1 and $x_2$ is 1 the actual output is 0, but with these weights the predicted output comes out as 0.713.
- Obviously there is an error: instead of 0 we got 0.713.
- The error is simply Error = Actual – Predicted, i.e., ($y_{target}$ – $y_{observed}$).
- Since the predicted value is 0.713 and the actual value is 0, the error magnitude at the output is 0.713.
Back-Propagate Errors
- Similarly, we can calculate the expected error, or error fraction, at $h_1$ and at $h_2$.
- Based on the delta (error) at the output, the error fraction at $h_1$ works out to 0.021 and at $h_2$ to -0.028.
Calculate Weight Corrections
- Based on the error fraction that we have calculated earlier, from that the weight adjustment will be derived.
- Thus the error fraction and the corresponding node’s value help us find the weight adjustment.
- We calculate the weight adjustment, based on the weight adjustment we will update the weights.
Updated Weights
- Earlier one weight was 0.5 and another was -1; the calculated weight adjustments come out to about 0.002175 and -0.002867 respectively.
- So we update 0.5 to 0.502175, and -1 becomes -1.002867.
- Having adjusted these, we do the same thing for all the weights.
- With the new weights we again calculate the predicted value of $y$, compare it with the actual $y$, and find the error.
Updated Weights contd…
Iterations and Stopping Criteria
- This iteration is just for one training example (1,1,0). This is just the first epoch.
- We repeat the same process of training and updating the weights for all the data points.
- We continue updating the weights until there is no significant change in the error, or until the maximum permissible error criterion is met.
- By updating the weights in this way, we reduce the error slightly each time. When the error reaches its minimum, the iterations stop and the weights at that point are considered optimal for this training set.
XOR Gate final NN Model
- Finally, we find the decision boundaries using the neural network.
- This is how the final neural network model for the XOR gate looks.
- These were the manual calculations of a neural network.
- We really do not need to do the error calculation, back propagation, weight updating, etc. by hand.
- If we have a tool, it takes care of everything automatically.
- All we need to do is supply the dataset, the independent variables, the dependent variable, etc., and everything else is taken care of.
- We did this exercise just to get an idea; in general we do not need to calculate the errors manually.
- Thus we can use any tool to build Neural Networks.
Building the Neural Network
- We do not really need to calculate the weights manually like that.
- If we use R or Python or any tool with prewritten routines, we just need to supply the dataset and the right input values of $x$.
- The gradient descent method is not very easy to understand for non-mathematics students, so it is not easy to write the program from scratch.
- Neural network tools do not expect the user to write code for the full-length back propagation algorithm, at least not for beginner and intermediate users.
- We do not really need to know the overall coding of the algorithm for finding out weights of neural network.
- Thus, we can use tool like R.
- We will try to use R to build the neural network and we just need to be slightly careful while setting out the parameters in this neural network function where everything else will be taken care of.
- We will try to build a neural network model for XOR data.
- We will also do a neural network weights finding exercise on Emp_Productivity.csv data.
- We will also use this neural network to predict the values.
- We will also find out what the final model will look like.
- Earlier we have built two logistic regressions, now we will directly try to build one neural network equation.
- Now we will build the neural network; the function for building it is neuralnet().
- We will fit an XOR neural network model.
The good news is…
- We do not need to write the code for weights calculation and updating.
- There are ready-made functions, libraries, and packages available in R.
- The gradient descent method is not very easy to understand for non-mathematics students.
- Neural network tools do not expect the user to write the code for the full length back propagation algorithm.
Building the Neural Network in R
- We have a couple of packages available in R.
- We need to mention the dataset, input, output & number of hidden layers as input.
- Neural network calculations are very complex; the algorithm may take some time to produce the results.
- One needs to be careful while setting the parameters; the runtime changes based on the input parameter values.
LAB: Building the neural network in R
- Build a neural network for XOR data.
- Dataset: Emp_Productivity/Emp_Productivity.csv
- Draw a 2D graph between age, experience and productivity.
- Build neural network algorithm to predict the productivity based on age and experience.
- Plot the neural network with final weights.
#Build a neural network for XOR data
xor_data <- read.csv("R dataset/Gates/xor.csv")
library(neuralnet)
## Warning: package 'neuralnet' was built under R version 3.3.2
xor_nn_model<-neuralnet(output~input1+input2,data=xor_data,hidden=2, linear.output = FALSE, threshold = 0.0000001)
plot(xor_nn_model)
- Sometimes it may take a while to produce this plot, depending on the seed value supplied and the starting weights that were chosen.
- The error is essentially zero, and these are the final weights for the XOR model.
- We can see the overall model.
xor_nn_model
## $call
## neuralnet(formula = output ~ input1 + input2, data = xor_data,
## hidden = 2, threshold = 0.0000001, linear.output = FALSE)
##
## $response
## output
## 1 0
## 2 1
## 3 1
## 4 0
##
## $covariate
## [,1] [,2]
## [1,] 1 1
## [2,] 1 0
## [3,] 0 1
## [4,] 0 0
##
## $model.list
## $model.list$response
## [1] "output"
##
## $model.list$variables
## [1] "input1" "input2"
##
##
## $err.fct
## function (x, y)
## {
## 1/2 * (y - x)^2
## }
## <environment: 0x000000001725bbf0>
## attr(,"type")
## [1] "sse"
##
## $act.fct
## function (x)
## {
## 1/(1 + exp(-x))
## }
## <environment: 0x000000001725bbf0>
## attr(,"type")
## [1] "logistic"
##
## $linear.output
## [1] FALSE
##
## $data
## input1 input2 output
## 1 1 1 0
## 2 1 0 1
## 3 0 1 1
## 4 0 0 0
##
## $net.result
## $net.result[[1]]
## [,1]
## 1 0.0003253483014
## 2 0.9996353029148
## 3 0.9996313468905
## 4 0.0003253955548
##
##
## $weights
## $weights[[1]]
## $weights[[1]][[1]]
## [,1] [,2]
## [1,] 12.57613311 11.37500633
## [2,] 26.09709905 -23.34083293
## [3,] -25.27406355 24.32173217
##
## $weights[[1]][[2]]
## [,1]
## [1,] 23.85189617
## [2,] -15.93571609
## [3,] -15.94656173
##
##
##
## $startweights
## $startweights[[1]]
## $startweights[[1]][[1]]
## [,1] [,2]
## [1,] 1.34136916425 1.2805342043
## [2,] 0.09709905088 0.2695670659
## [3,] -0.66886355387 0.2526550529
##
## $startweights[[1]][[2]]
## [,1]
## [1,] 1.8831711692
## [2,] -0.2234252985
## [3,] -0.8890847239
##
##
##
## $generalized.weights
## $generalized.weights[[1]]
## [,1] [,2]
## 1 0.0009714231652 -0.001058638259
## 2 0.0023663851028 -0.002465832511
## 3 -0.0012715107572 0.001231410573
## 4 0.0028361883865 -0.003061029869
##
##
## $result.matrix
## 1
## error 0.00000024032143171
## reached.threshold 0.00000007870849299
## steps 261.00000000000000000
## Intercept.to.1layhid1 12.57613310993247069
## input1.to.1layhid1 26.09709905087747828
## input2.to.1layhid1 -25.27406355387023140
## Intercept.to.1layhid2 11.37500633477892364
## input1.to.1layhid2 -23.34083293408322390
## input2.to.1layhid2 24.32173217293616219
## Intercept.to.output 23.85189616804580126
## 1layhid.1.to.output -15.93571609448825122
## 1layhid.2.to.output -15.94656172786419646
##
## attr(,"class")
## [1] "nn"
- We can also draw the decision boundaries for the XOR model.
- If you remember, the XOR data has 0’s and 1’s arranged so that a single line cannot separate them.
#Decision Boundaries
m1_slope <- xor_nn_model$weights[[1]][[1]][2]/(-xor_nn_model$weights[[1]][[1]][3])
m1_intercept <- xor_nn_model$weights[[1]][[1]][1]/(-xor_nn_model$weights[[1]][[1]][3])
m2_slope <- xor_nn_model$weights[[1]][[1]][5]/(-xor_nn_model$weights[[1]][[1]][6])
m2_intercept <- xor_nn_model$weights[[1]][[1]][4]/(-xor_nn_model$weights[[1]][[1]][6])
####Drawing the Decision boundary
library(ggplot2)
base<-ggplot(xor_data)+geom_point(aes(x=input1,y=input2,color=factor(output),shape=factor(output)),size=5)
base+geom_abline(intercept = m1_intercept , slope = m1_slope, colour = "blue", size = 2) +geom_abline(intercept = m2_intercept , slope = m2_slope, colour = "blue", size = 2)
- These are the decision boundaries for the XOR model: the neural network has built a non-linear decision boundary, made of two lines, that helps us identify the 1’s and 0’s.
- Thus anything between these lines is 0, while beyond these lines is 1.
- Similarly we will try to build a neural network model on employee productivity data.
#Build neural network algorithm to predict the productivity based on age and experience
library(neuralnet)
Emp_Productivity_nn_model1<-neuralnet(Productivity~Age+Experience,data=Emp_Productivity_raw )
plot(Emp_Productivity_nn_model1)
- If we do not set the linear.output option, the model treats the output as a linear (regression) output, which is wrong here.
- So we include the option linear.output=FALSE, since the output is not linear but a binary 1 or 0.
#Including the option Linear.output
Emp_Productivity_nn_model1<-neuralnet(Productivity~Age+Experience,data=Emp_Productivity_raw, linear.output = FALSE)
plot(Emp_Productivity_nn_model1)
- So this is the final model, with an error of about 13; the algorithm took about 40,867 steps with the default number of hidden nodes.
- We did not specify the hidden layer earlier, so let us try specifying it now.
#Including the option Hidden layers
Emp_Productivity_nn_model1<-neuralnet(Productivity~Age+Experience,data=Emp_Productivity_raw, hidden=2,linear.output = FALSE)
plot(Emp_Productivity_nn_model1)
- We have specified two hidden nodes; the error is still about 13, after about 40,000 steps.
- So we do some in-time validation and find out the error.
- We can do the plotting of actual verses predicted.
####Results and Intime validation
actual_values<-Emp_Productivity_raw$Productivity
Predicted<-Emp_Productivity_nn_model1$net.result[[1]]
head(Predicted)
## [,1]
## 1 0.3611875923
## 2 0.3611875923
## 3 0.3611875923
## 4 0.3611875923
## 5 0.3611875923
## 6 0.3611875923
#The root mean square error
sqr_err<-(actual_values-Predicted)^2
sum(sqr_err)
## [1] 27.46218781
mean(sqr_err)
## [1] 0.2307746874
sqrt(mean(sqr_err))
## [1] 0.4803901409
#Plotting Actual and Predicted
plot(actual_values)
points(Predicted, col=2)
- It looks like this is not a very good model.
- So we can build one more neural network model on the employee productivity data.
#Plotting Actual and Predicted using ggplot
library(ggplot2)
library(reshape2)
act_pred_df<-data.frame(actual_values,Predicted)
act_pred_df$id<-rownames(act_pred_df)
act_pred_df_melt = melt(act_pred_df, id.vars ="id")
ggplot(act_pred_df_melt,aes(id, value, colour = variable)) + geom_point()
##Plotting Actual and Predicted using ggplot on classification graph
Emp_Productivity_pred_act<-data.frame(Emp_Productivity_raw,Predicted=round(Predicted,0))
library(ggplot2)
#Graph without predictions
ggplot(Emp_Productivity_pred_act)+geom_point(aes(x=Age,y=Experience,color=factor(Productivity)),size=5)
#Graph with predictions
ggplot(Emp_Productivity_pred_act)+geom_point(aes(x=Age,y=Experience,color=factor(Productivity),shape=factor(Predicted)),size=5)
Emp_Productivity_nn_model1<-neuralnet(Productivity~Age+Experience,data=Emp_Productivity_raw, hidden=2,linear.output = FALSE)
plot(Emp_Productivity_nn_model1)
- This time the error is again about 13.
- Similarly, we can rerun the model as many times as we want and find the best one by doing validation.
R Code Options
- The neuralnet() call has input parameters that we need to take care of, such as hidden: the number of nodes in the hidden layer. Passing a vector adds more hidden layers.
- Make sure there are sufficient hidden nodes/layers to capture the overall variance in the objective space.
- stepmax: while executing the algorithm, the neural network can sometimes run in a long loop, iterating without ever converging. We can allow, say, 100,000 steps for the algorithm to converge.
- Sometimes we may get an error that the algorithm did not converge within the default stepmax; increase the stepmax parameter value in such cases.
- threshold is connected to the error calculation and is used as a stopping criterion.
- The stopping criterion is based on the minimum error that is calculated: if the error measure drops below the threshold, the algorithm stops.
- We can fix the threshold depending on our requirements; it is often taken as 0.001.
- In R’s neuralnet() function, the output is expected to be linear by default.
- For classification problems the output is not linear, so we need to specifically set linear.output = FALSE.
- So this is how we build a neural network in R; a sketch of such a call is given below.
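For reference, a sketch of a neuralnet() call with the options discussed above (the parameter values here are only illustrative, not tuned):

#Sketch: a neuralnet() call showing the main options discussed above
library(neuralnet)
nn_model <- neuralnet(Productivity ~ Age + Experience,
                      data          = Emp_Productivity_raw,
                      hidden        = 2,       # nodes in the hidden layer; a vector adds layers
                      stepmax       = 1e+05,   # maximum steps before a "did not converge" error
                      threshold     = 0.001,   # stopping threshold on the error-function derivatives
                      linear.output = FALSE)   # FALSE for classification (binary) outputs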
Output: Building the neural network in R
- Depending on the run, we check that the error is very close to 0; if it is much higher, we may have to re-run the algorithm.
- Then we can do a kind of in-time validation.
- How many 0s are there, and how many of them are actually classified as 0s?
- How many 1s are there, and how many of them are actually classified as 1s? A confusion matrix, as sketched below, answers these questions.
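For example, a small sketch reusing the in-time predictions computed earlier:

#Sketch: confusion matrix for the neural network's in-time predictions
nn_predicted <- round(Emp_Productivity_nn_model1$net.result[[1]], 0)
table(nn_predicted, Emp_Productivity_raw$Productivity)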
Code- Prediction using NN
new_data<-data.frame(Age=40, Experience=12)
compute(Emp_Productivity_nn_model1, new_data)
## $neurons
## $neurons[[1]]
## 1 Age Experience
## [1,] 1 40 12
##
## $neurons[[2]]
## [,1] [,2] [,3]
## [1,] 1 0.9999999917 0.9943805052
##
##
## $net.result
## [,1]
## [1,] 0.03814890608
- We can predict using the compute function.
- We can use the neural network model, give it the values of the inputs (Age and Experience), and find the value of $y$ using the compute function.
- In fact, there can be many solutions for a given neural network, because the gradient descent algorithm searches for a local minimum, not the global minimum.
- There is whole lot of theory behind that.
- At this point we just need to consider that the neural network output that we get is not unique, it might have multiple solutions.
There can be many solutions
- This is one solution, for example a set of weights like 11, 8, 7, 19.
- There can be several such combinations.
- Here the error is 0 and the algorithm took 191 steps.
- In another run the error is again 0, but the set of weights is different.
- Since there are so many weights feeding into the overall network, if some values change, the remaining values adjust automatically; the overall error is still minimal but with different weights.
- This is set-3, with slightly different weights again.
- Thus there can be many sets of weights, because all we are trying to do is find the best weights, i.e., a solution.
- The optimal solution gives us the least error, and we may end up with different sets of weights that achieve it (a reproducibility sketch follows below).
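Because the starting weights are random, refitting can land on a different but similarly good solution; fixing the random seed before the call makes a particular run reproducible. A small sketch, reusing the same formula as above:

#Sketch: fixing the seed so the random starting weights (and hence the solution) repeat
set.seed(123)
nn_run1 <- neuralnet(Productivity ~ Age + Experience, data = Emp_Productivity_raw,
                     hidden = 2, linear.output = FALSE)
set.seed(123)
nn_run2 <- neuralnet(Productivity ~ Age + Experience, data = Emp_Productivity_raw,
                     hidden = 2, linear.output = FALSE)
all.equal(nn_run1$weights, nn_run2$weights)    # identical starting point, identical fit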
Local vs. Global Minimum
- There is an issue of local vs. global minimum.
- We need to know this detail of how neural networks are trained.
- The question is what exactly these multiple solutions are, and what local vs. global minimum means.
- There can be multiple solutions for a given neural network because there are many weights, and many weight combinations can lead to a small error.
- The gradient descent method that we use to find the weights in a neural network does not necessarily find the global minimum; most of the time it finds the nearest local minimum.
- The graph illustrates what the global minimum is, compared with a local minimum.
- The algorithm tends to find a local minimum rather than the global minimum, which is why we may see multiple solutions for a given neural network problem.
- That is somewhat uncomfortable, but we can perform cross-validation checks to find the final optimized solution.
- So there can be multiple optimal solutions for a neural network.
Conclusion
- Neural networks are a vast subject; many data scientists focus solely on neural network techniques.
- In this session we practiced the introductory concepts only. Neural networks have many more advanced techniques, and there are many algorithms other than back propagation.
- Neural networks work particularly well on certain classes of problems, such as image recognition.
- Neural network algorithms are very calculation intensive. They require highly efficient computing machines, and large datasets take a significant amount of runtime in R, so we need to try different options and packages.
- Currently there is a lot of exciting research going on around neural networks.
- After gaining sufficient knowledge in this basic session, you may want to explore reinforcement learning, deep learning, etc.; the neural network concepts discussed in this topic are a prerequisite for all of them.
Appendix
Math- How to update the weights?
- We update the weights backwards, iteratively calculating the error.
- The weight update uses the gradient descent method, or delta rule, also known as the Widrow-Hoff rule.
- First we calculate the weight corrections for the output layer, then we take care of the hidden layer.
- The update has the form $w_{new} = w_{old} + \eta\,\delta\,x$, where $\eta$ is the learning parameter and $\delta$ is the error term: at the output layer $\delta = \hat{y}(1-\hat{y})(y-\hat{y})$, and for hidden layers $\delta = h(1-h)\sum_k \delta_k W_k$, the back-propagated error fraction.
- The weight corrections are calculated based on the error function.
- The new weights are chosen in such a way that the final error in the network is minimized.
Math-How does the delta rule work?
- Let us consider a simple example to understand weight updating using the delta rule.
- Suppose we are building a simple logistic regression line $y = g(wx)$ and would like to find the weight $w$ using the weight update rule.
- We are searching for the optimal $w$ for our data.
- Let the initial $w$ be 1, so $y = g(1 \cdot x)$ is the initial equation.
- The error at this initial step is 3.59.
- To reduce the error we add a small $\delta$ to $w$ and make it 1.5.
- Now $w$ is 1.5 (the blue line), and $y = g(1.5x)$ is the updated equation.
- With the updated weight, the error is 1.57.
- We can further reduce the error by increasing $w$ by another $\delta$.
- If we repeat the same process of adding a delta and updating the weight, we finally end up with the minimum error.
- The weight at that final step is the optimal weight.
- In this example the final weight is 8 and the error is 0, so $y = g(8x)$ is the final equation.
- In this example we manually changed the weights to reduce the error. This is just for intuition; manual updating is not feasible for complex optimization problems.
- Gradient descent is a systematic optimization method: we update the weights by calculating the gradient of the error function.
How does gradient descent work?
- Gradient descent is one of the well-known ways to find a local minimum of a function.
- By changing the weights we move towards the minimum value of the error function: the weights are changed by taking steps in the negative direction of the function’s gradient (derivative), as in the sketch below.
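A minimal sketch of this idea on a made-up one-weight error curve (purely illustrative; the minimum is placed at w = 8 only to echo the delta-rule example above):

#Sketch: gradient descent on a toy error function E(w) with its minimum at w = 8
E     <- function(w) (w - 8)^2          # illustrative error curve (not from the session data)
dE_dw <- function(w) 2*(w - 8)          # its gradient (derivative)
w   <- 1                                # initial weight
eta <- 0.1                              # learning rate (step size)
for (i in 1:100) w <- w - eta*dE_dw(w)  # step in the negative gradient direction
c(w = w, error = E(w))                  # w has moved very close to 8, error close to 0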
Does this method really work?
- We changed the weights; did it reduce the overall error?
- Let’s calculate the error with the new weights and see the change.
Gradient Descent Method Validation
- With our initial set of weights the overall error was 0.7137 (Y actual is 0, Y predicted is 0.7137, so error = 0.7137).
- The new weights give us a predicted value of 0.70655.
- So in one iteration we reduced the error from 0.7137 to 0.70655.
- The error is reduced by about 1%. If we repeat the same process over multiple epochs and training examples, we can reduce the error further.