Before we start the lesson, please download the datasets.

Neural network Intuition
Neural network and vocabulary
Neural network algorithm
Math behind neural network algorithm
Building the neural networks
Validating the neural network model
Neural network applications
Image recognition using neural networks

Recap of Logistic Regression

When the output is categorical (yes/no, 1/0, i.e. binary) and the predictor variable is continuous, a linear regression line is not a suitable fit.
Among the candidate fits, the S-shaped logistic regression curve suits such data far better than a straight line.
Thus we need a logistic regression line for this data, using the predictor variables to predict the categorical output.

Before moving from logistic regression to neural networks, let us do a quick recap of how we build a logistic regression model and how a neural network can be built from a combination of several logistic regressions.

LAB: Logistic Regression

Dataset: Emp_Productivity/Emp_Productivity.csv
Filter the data and take a subset of the above dataset. The filter condition is Sample_Set<3.
Draw a scatter plot that shows Age on X-axis and Experience on Y-axis. Try to distinguish the two classes with colors or shapes (visualizing the classes).
Build a logistic regression model to predict Productivity using age and experience.
Finally draw the decision boundary for this logistic regression model.
Create the confusion matrix.
Calculate the accuracy and error rates.

Solution

In [1]:

import pandas as pd
Emp_Productivity_raw = pd.read_csv("datasets/Emp_Productivity/Emp_Productivity.csv")
Emp_Productivity_raw.head(10)

Out[1]:

	Age	Experience	Sample_Set
0	20.0	2.3	1
1	16.2	2.2	1
2	20.2	1.8	1
3	18.8	1.4	1
4	18.9	3.2	1
5	16.7	3.9	1
6	16.3	1.4	1
7	20.0	1.4	1
8	18.0	3.6	1
9	21.2	4.3	1

In [2]:

#Filter the data and take a subset of the above dataset. Filter condition is Sample_Set<3
Emp_Productivity1=Emp_Productivity_raw[Emp_Productivity_raw.Sample_Set<3]
Emp_Productivity1.shape

Out[2]:

(74, 4)

In [3]:

#frequency table of Productivity variable
Emp_Productivity1.Productivity.value_counts()

Out[3]:

1    41
0    33
Name: Productivity, dtype: int64

In [4]:

####The classification graph
#Draw a scatter plot that shows Age on X axis and Experience on Y-axis. Try to distinguish the two classes with colors or shapes.
import matplotlib.pyplot as plt
%matplotlib inline

fig = plt.figure()
ax1 = fig.add_subplot(111)

ax1.scatter(Emp_Productivity1.Age[Emp_Productivity1.Productivity==0],Emp_Productivity1.Experience[Emp_Productivity1.Productivity==0], s=10, c='b', marker="o", label='Productivity 0')
ax1.scatter(Emp_Productivity1.Age[Emp_Productivity1.Productivity==1],Emp_Productivity1.Experience[Emp_Productivity1.Productivity==1], s=10, c='r', marker="+", label='Productivity 1')
plt.legend(loc='upper left');
plt.show()

In [5]:

#predict Productivity using age and experience
###Logistic Regression model1
import statsmodels.formula.api as sm
model1 = sm.logit(formula='Productivity ~ Age+Experience', data=Emp_Productivity1)
fitted1 = model1.fit()
fitted1.summary()

Optimization terminated successfully.
         Current function value: 0.315987
         Iterations 7

Out[5]:

Logit Regression Results
Dep. Variable:	Productivity	No. Observations:	74
Model:	Logit	Df Residuals:	71
Method:	MLE	Df Model:	2
Date:	Tue, 14 Feb 2017	Pseudo R-squ.:	0.5402
Time:	15:58:30	Log-Likelihood:	-23.383
converged:	True	LL-Null:	-50.860
		LLR p-value:	1.167e-12

	coef	std err	z	P>\|z\|	[95.0% Conf. Int.]
Intercept	-8.9361	2.061	-4.335	0.000	-12.976 -4.896
Age	0.2763	0.105	2.620	0.009	0.070 0.483
Experience	0.5923	0.298	1.988	0.047	0.008 1.176

In [6]:

#normalized covariance matrix of the estimates (the coefficients themselves are in fitted1.params)
coef=fitted1.normalized_cov_params
print(coef)

            Intercept       Age  Experience
Intercept    4.249138 -0.184321    0.030957
Age         -0.184321  0.011118   -0.017256
Experience   0.030957 -0.017256    0.088759

In [7]:

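# getting slope and intercept of the decision boundary from the fitted coefficients
# (fitted1.params holds the coefficients; normalized_cov_params is their covariance matrix)
slope1=fitted1.params['Age']/(-fitted1.params['Experience'])
intercept1=fitted1.params['Intercept']/(-fitted1.params['Experience'])
print('Slope :', slope1)          # about -0.47 with the coefficients reported above
print('Intercept :', intercept1)  # about 15.09 with the coefficients reported above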

In [8]:

#Finally draw the decision boundary for this logistic regression model
import matplotlib.pyplot as plt

fig = plt.figure()
ax1 = fig.add_subplot(111)

ax1.scatter(Emp_Productivity1.Age[Emp_Productivity1.Productivity==0],Emp_Productivity1.Experience[Emp_Productivity1.Productivity==0], s=10, c='b', marker="o", label='Productivity 0')
ax1.scatter(Emp_Productivity1.Age[Emp_Productivity1.Productivity==1],Emp_Productivity1.Experience[Emp_Productivity1.Productivity==1], s=10, c='r', marker="+", label='Productivity 1')
plt.legend(loc='upper left');

x_min, x_max = ax1.get_xlim()
ax1.plot([0, x_max], [intercept1, x_max*slope1+intercept1])
ax1.set_xlim([15,35])
ax1.set_ylim([0,10])
plt.show()

Accuracy of the model

In [9]:

#Predicting classes
predicted_values=fitted1.predict(Emp_Productivity1[["Age"]+["Experience"]])
predicted_values[1:10]

threshold=0.5
threshold

import numpy as np
predicted_class=np.zeros(predicted_values.shape)
predicted_class[predicted_values>threshold]=1

predicted_class

Out[9]:

array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])

In [10]:

#Confusion Matrix, Accuracy and Error
from sklearn.metrics import confusion_matrix as cm
ConfusionMatrix = cm(Emp_Productivity1[['Productivity']],predicted_class)
print('Confusion Matrix :', ConfusionMatrix)
accuracy=(ConfusionMatrix[0,0]+ConfusionMatrix[1,1])/sum(sum(ConfusionMatrix))
print('Accuracy : ',accuracy)
error=1-accuracy
print('Error: ',error)

Confusion Matrix : [[31  2]
 [ 2 39]]
Accuracy :  0.945945945946
Error:  0.0540540540541
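
The accuracy and error arithmetic above is repeated in the later labs, so it may help to see it as a small helper. This is only a sketch: `accuracy_and_error` is a hypothetical name, and it assumes the matrix is a nested list with actual classes in rows and predicted classes in columns, which is the layout returned by sklearn's `confusion_matrix`.

```python
def accuracy_and_error(confusion):
    """Return (accuracy, error) for a square confusion matrix.

    Rows are actual classes and columns are predicted classes,
    so the diagonal holds the correctly classified counts.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    accuracy = correct / total
    return accuracy, 1 - accuracy


# Confusion matrix reported by the lab above
acc, err = accuracy_and_error([[31, 2], [2, 39]])
print("Accuracy:", round(acc, 4))
print("Error:", round(err, 4))
```

Here 31 + 39 = 70 of the 74 observations sit on the diagonal, which is where the accuracy of about 0.946 comes from.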

Decision Boundary

Decision Boundary – Logistic Regression

The decision boundary is the line or margin that separates the classes.
Classification algorithms are all about finding decision boundaries.
The boundary need not always be a straight line.
The final function of our decision boundary looks like:
- Y=1 if $w^Tx+w_0>0$; else Y=0
In logistic regression, the boundary can be derived from the logistic regression coefficients and the threshold.
- Imagine the logistic regression line $p(y)=\frac{e^{(b_0+b_1x_1+b_2x_2)}}{1+e^{(b_0+b_1x_1+b_2x_2)}}$
- Suppose that if $p(y)>0.5$ we predict class-1, or else class-0. On the boundary $p(y)=0.5$, so taking the log-odds:
  - $\log\left(\frac{p(y)}{1-p(y)}\right)=b_0+b_1x_1+b_2x_2$
  - $\log\left(\frac{0.5}{0.5}\right)=b_0+b_1x_1+b_2x_2$
  - $0=b_0+b_1x_1+b_2x_2$
  - $b_0+b_1x_1+b_2x_2=0$ is the line.
- Rewriting it in mx+c form:
  - $x_2=(-b_1/b_2)x_1+(-b_0/b_2)$
  - When $b_2>0$, anything above this line is class-1 and anything below it is class-0 (the sides swap when $b_2<0$)
  - $x_2>(-b_1/b_2)x_1+(-b_0/b_2)$ is class-1
  - $x_2<(-b_1/b_2)x_1+(-b_0/b_2)$ is class-0
  - $x_2=(-b_1/b_2)x_1+(-b_0/b_2)$ is a tie, with a probability of exactly 0.5
- We can change the decision boundary by changing the threshold value (here 0.5).
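
The derivation can be checked numerically. The sketch below uses the rounded coefficients from the model summary in the first lab; `boundary_line` is a hypothetical helper, and the generalisation to a threshold other than 0.5 (replacing 0 by the log-odds of the threshold) is an addition for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def boundary_line(b0, b1, b2, threshold=0.5):
    """Slope and intercept of the line x2 = slope*x1 + intercept on which p(y) equals `threshold`."""
    logit = math.log(threshold / (1.0 - threshold))  # 0 when the threshold is 0.5
    slope = -b1 / b2
    intercept = (logit - b0) / b2
    return slope, intercept


# Coefficients as rounded in the model summary (Intercept, Age, Experience)
b0, b1, b2 = -8.9361, 0.2763, 0.5923

slope, intercept = boundary_line(b0, b1, b2)
x1 = 25.0
x2 = slope * x1 + intercept                  # a point exactly on the boundary
p_on_line = sigmoid(b0 + b1 * x1 + b2 * x2)  # should be 0.5
print(slope, intercept, p_on_line)

# Raising the threshold to 0.7 shifts the line upwards here, because b2 > 0
_, intercept_07 = boundary_line(b0, b1, b2, threshold=0.7)
print(intercept_07)
```

The slope does not depend on the threshold; only the intercept moves, so changing the threshold slides the boundary parallel to itself.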

LAB: Decision Boundary

Draw a scatter plot that shows Age on X-axis and Experience on Y-axis. Try to distinguish the two classes with colors or shapes (visualizing the classes).
Build a logistic regression model to predict Productivity using age and experience.
Finally draw the decision boundary for this logistic regression model.
Create the confusion matrix.
Calculate the accuracy and error rates.

Solution : Drawing the decision boundary of the logistic regression model which was built in last lab.

In [11]:

import matplotlib.pyplot as plt

fig = plt.figure()
ax1 = fig.add_subplot(111)

ax1.scatter(Emp_Productivity1.Age[Emp_Productivity1.Productivity==0],Emp_Productivity1.Experience[Emp_Productivity1.Productivity==0], s=10, c='b', marker="o", label='Productivity 0')
ax1.scatter(Emp_Productivity1.Age[Emp_Productivity1.Productivity==1],Emp_Productivity1.Experience[Emp_Productivity1.Productivity==1], s=10, c='r', marker="+", label='Productivity 1')
plt.legend(loc='upper left');

x_min, x_max = ax1.get_xlim()
ax1.plot([0, x_max], [intercept1, x_max*slope1+intercept1])
ax1.set_xlim([15,35])
ax1.set_ylim([0,10])
plt.show()

New Representation for Logistic Regression

We are trying to understand neural networks through logistic regression.
Moving forward, we will see a new representation of the logistic regression line that we just built, and later see how it transforms into a neural network.
This is the logistic regression line that we have built:

$y=\frac{e^{(b_0+b_1x_1+b_2x_2)}}{1+e^{(b_0+b_1x_1+b_2x_2)}}$

That can be rewritten as:

$y=\frac{1}{1+e^{-(b_0+b_1x_1+b_2x_2)}}$

Treating this as a single equation and renaming the coefficients $b_k$ as weights $w_k$, we can write it as:

$y=g(w_0+w_1x_1+w_2x_2)$, where $g(x)=\frac{1}{1+e^{-x}}$, or more compactly $y=g(\sum w_kx_k)$

This can be displayed in a diagram as follows:
- The inputs are x1 and x2, whose weights are w1 and w2.
- $w_0$ is the intercept (bias) weight; it is not multiplied by any input variable (equivalently, $x_0=1$).
- We can simply say that $w_0+w_1x_1+w_2x_2$ is the line equation, which is passed through the logistic function to give $y=g(\sum{w_kx_k})$
- This is how we can represent the same logistic regression line that we just built.
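
A quick numeric check that the two forms of the logistic equation agree, and that a single unit is simply g applied to a weighted sum. The weights and inputs are made-up values for illustration, and `logistic_unit` is a hypothetical helper name.

```python
import math


def g(z):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_unit(weights, inputs):
    """y = g(w0 + w1*x1 + ...); weights[0] is w0 and pairs with a constant input of 1."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return g(z)


# Hypothetical weights and inputs
w = [-1.0, 0.5, 0.25]
x = [2.0, 4.0]

z = w[0] + w[1] * x[0] + w[2] * x[1]         # -1 + 1 + 1 = 1.0
form_1 = math.exp(z) / (1.0 + math.exp(z))   # e^z / (1 + e^z)
form_2 = logistic_unit(w, x)                 # 1 / (1 + e^-z)
print(form_1, form_2)
```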

Finding the weights in logistic regression

$out(x)=y=g(\sum w_kx_k)$

The above output is a non-linear function of a linear combination of the inputs: a typical multiple logistic regression line.
For this line, we have to find the weights w.
We find w to minimize $\sum_{i=1}^n [y_i - g(\sum w_kx_k)]^2$
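
The objective can be written as a small function of the weights. This is a sketch on made-up rows, not the lab data. Note that the statsmodels fit in the lab reports "Method: MLE", so that library maximizes a likelihood rather than minimizing this squared error; the squared-error form is the one this text carries forward to neural networks.

```python
import math


def g(z):
    return 1.0 / (1.0 + math.exp(-z))


def sum_squared_error(w, rows):
    """Sum over rows of (y - g(w0 + w1*x1 + w2*x2))^2; each row is (x1, x2, y)."""
    return sum((y - g(w[0] + w[1] * x1 + w[2] * x2)) ** 2 for x1, x2, y in rows)


# Hypothetical toy rows: (x1, x2, y)
rows = [(1.0, 1.0, 0), (2.0, 1.0, 0), (4.0, 3.0, 1), (5.0, 4.0, 1)]

flat = sum_squared_error([0.0, 0.0, 0.0], rows)     # every prediction is 0.5
better = sum_squared_error([-6.0, 1.0, 1.0], rows)  # weights that separate the two groups
print(flat, better)
```

Finding the weights means searching for the w that makes this number as small as possible.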

LAB: Non-Linear Decision Boundaries

Dataset: Emp_Productivity/Emp_Productivity.csv
Draw a scatter plot that shows Age on X-axis and Experience on Y-axis. Try to distinguish the two classes with colors or shapes (visualizing the classes).
Build a logistic regression model to predict Productivity using age and experience.
Finally draw the decision boundary for this logistic regression model.
Create the confusion matrix.
Calculate the accuracy and error rates.

Here we consider the entire dataset, not just the subset.

In [12]:

Emp_Productivity_raw = pd.read_csv("datasets/Emp_Productivity/Emp_Productivity.csv")

In [13]:

#plotting the overall data
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(111)

ax.scatter(Emp_Productivity_raw.Age[Emp_Productivity_raw.Productivity==0],Emp_Productivity_raw.Experience[Emp_Productivity_raw.Productivity==0], s=10, c='b', marker="o", label='Productivity 0')
ax.scatter(Emp_Productivity_raw.Age[Emp_Productivity_raw.Productivity==1],Emp_Productivity_raw.Experience[Emp_Productivity_raw.Productivity==1], s=10, c='r', marker="+", label='Productivity 1')
plt.legend(loc='upper left');
plt.show()

In [14]:

###Logistic Regression model on the full data
import statsmodels.formula.api as sm
model = sm.logit(formula='Productivity ~ Age+Experience', data=Emp_Productivity_raw)
fitted = model.fit()
fitted.summary()

Optimization terminated successfully.
         Current function value: 0.632202
         Iterations 5

Out[14]:

Logit Regression Results
Dep. Variable:	Productivity	No. Observations:	119
Model:	Logit	Df Residuals:	116
Method:	MLE	Df Model:	2
Date:	Tue, 14 Feb 2017	Pseudo R-squ.:	0.03361
Time:	15:58:56	Log-Likelihood:	-75.232
converged:	True	LL-Null:	-77.848
		LLR p-value:	0.07307

	coef	std err	z	P>\|z\|	[95.0% Conf. Int.]
Intercept	0.4478	0.699	0.641	0.522	-0.921 1.817
Age	-0.0176	0.038	-0.459	0.646	-0.092 0.057
Experience	-0.0632	0.091	-0.698	0.485	-0.241 0.114

In [15]:

#normalized covariance matrix of the estimates (the coefficients themselves are in fitted.params)
coef=fitted.normalized_cov_params
coef

Out[15]:

	Intercept	Age	Experience
Intercept	0.488120	-0.022329	0.030775
Age	-0.022329	0.001461	-0.002995
Experience	0.030775	-0.002995	0.008210

In [16]:

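# getting slope and intercept of the decision boundary from the fitted coefficients
# (fitted.params holds the coefficients; normalized_cov_params is their covariance matrix)
slope=fitted.params['Age']/(-fitted.params['Experience'])
intercept=fitted.params['Intercept']/(-fitted.params['Experience'])
print('Slope :', slope)          # about -0.28 with the coefficients reported above
print('Intercept :', intercept)  # about 7.1 with the coefficients reported above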

In [17]:

#Finally draw the decision boundary for this logistic regression model
fig = plt.figure()
ax = fig.add_subplot(111)

ax.scatter(Emp_Productivity_raw.Age[Emp_Productivity_raw.Productivity==0],Emp_Productivity_raw.Experience[Emp_Productivity_raw.Productivity==0], s=10, c='b', marker="o", label='Productivity 0')
ax.scatter(Emp_Productivity_raw.Age[Emp_Productivity_raw.Productivity==1],Emp_Productivity_raw.Experience[Emp_Productivity_raw.Productivity==1], s=10, c='r', marker="+", label='Productivity 1')
plt.legend(loc='upper left');

x_min, x_max = ax.get_xlim()
ax.plot([0, x_max], [intercept, x_max*slope+intercept])
plt.show()

We can see above that a linear decision boundary does a poor job of distinguishing the classes.

Accuracy and Error

In [18]:

#Create the confusion matrix
#predicting values
predicted_values=fitted.predict(Emp_Productivity_raw[["Age"]+["Experience"]])
predicted_values[1:10]

#Let's convert them to classes using a threshold
threshold=0.5
threshold

import numpy as np
predicted_class=np.zeros(predicted_values.shape)
predicted_class[predicted_values>threshold]=1

#Predicted Classes
predicted_class[1:10]

from sklearn.metrics import confusion_matrix as cm
ConfusionMatrix = cm(Emp_Productivity_raw[['Productivity']],predicted_class)
ConfusionMatrix

Out[18]:

array([[69,  7],
       [43,  0]])

In [19]:

#Accuracy and Error
accuracy=(ConfusionMatrix[0,0]+ConfusionMatrix[1,1])/sum(sum(ConfusionMatrix))
print('Accuracy : ', accuracy)

error=1-accuracy
print('Error: ',error)

Accuracy :  0.579831932773
Error:  0.420168067227

The accuracy is only about 0.58, which is not acceptable.
A single linear decision boundary cannot separate these classes.
We have to build a non-linear decision boundary, which for data like this might look like the shape of a “U”.

Non-Linear Decision Boundaries-Issue

When there is no linear separation between two classes, i.e. when a single straight line cannot divide them, we have to go for a non-linear decision boundary.
With a non-linear decision boundary, we cannot fit just one line and say that the right-hand side of the line is 1’s and the left-hand side is 0’s.
One line is not sufficient.
This is an issue, because logistic regression is not a good option when the decision boundary is non-linear.

Non-Linear Decision Boundaries-Solution

We need a solution for non-linear decision boundaries.
One idea is as follows:
- Use multiple logistic regression lines together: construct the decision boundary by fitting 2 or 3 logistic regression lines and then combine them into a final classifier.
A single logistic regression line cannot work where the boundary separating the 2 classes is non-linear, because the classes cannot be separated by one straight line.
So why don’t we fit two models?
Model-1 separates the first portion of the data and model-2 takes care of the other portion, instead of looking for one non-linear classifier that directly tells us where the 1’s and 0’s are.
We get an intermediate output, say h1, from model-1 and another intermediate output, say h2, from model-2.
We can then use h1 and h2 to build the final classifier.

Intermediate Output1	Intermediate Output2
$out(x)= g(\sum{w_kx_k})$ say h1	$out(x)= g(\sum{w_kx_k})$ say h2

The Intermediate output

Predicting y directly from x is challenging. We have the independent variables x1 and x2, but y depends on them non-linearly, so a linear classifier does not work.
The idea is therefore to predict the intermediate outputs h first, and then use the intermediate outputs to predict y.
Instead of going directly from x to y, we predict h1 using x1 and x2, then h2 again using x1 and x2, and finally use h1 and h2 to predict y.

Finding the Weights for Intermediate Outputs

How do we find the weights of the intermediate outputs?
We predict h1 using x1 and x2, h2 using x1 and x2, and then use h1 and h2 to predict y.
h1 is a non-linear function g of a linear combination of x1 and x2, and so is h2, whereas y is a non-linear function g of a linear combination of h1 and h2.
That is what we mean by an intermediate output h.

LAB: Intermediate output

Dataset: Emp_Productivity/Emp_Productivity_All_Sites.csv
Filter the data and take the first 74 observations from the above dataset. The filter condition is Sample_Set<3.
Build a logistic regression model to predict Productivity using age and experience.
Calculate the prediction probabilities for all the inputs. Store the probabilities in inter1 variable.
Filter the data and take observations from row 34 onwards. The filter condition is Sample_Set>1.
Build a logistic regression model to predict Productivity using age and experience.
Calculate the prediction probabilities for all the inputs. Store the probabilities in inter2 variable.
Build a consolidated model to predict productivity using inter-1 and inter-2 variables.
Create the confusion matrix and find the accuracy and error rates for the consolidated model.

Our sample Emp_Productivity1 holds the first 74 observations. Let’s build the model on this sample (sample-1).

In [20]:

#Filter the data and take a subset of the whole dataset. Filter condition is Sample_Set<3
Emp_Productivity1=Emp_Productivity_raw[Emp_Productivity_raw.Sample_Set<3]
Emp_Productivity1.shape

Out[20]:

(74, 4)

In [21]:

#Building a Logistic regression model1 to predict Productivity using age and experience
import statsmodels.formula.api as sm
model1 = sm.logit(formula='Productivity ~ Age+Experience', data=Emp_Productivity1)
fitted1 = model1.fit()
fitted1.summary()

Optimization terminated successfully.
         Current function value: 0.315987
         Iterations 7

Out[21]:

Logit Regression Results
Dep. Variable:	Productivity	No. Observations:	74
Model:	Logit	Df Residuals:	71
Method:	MLE	Df Model:	2
Date:	Tue, 14 Feb 2017	Pseudo R-squ.:	0.5402
Time:	15:59:08	Log-Likelihood:	-23.383
converged:	True	LL-Null:	-50.860
		LLR p-value:	1.167e-12

	coef	std err	z	P>\|z\|	[95.0% Conf. Int.]
Intercept	-8.9361	2.061	-4.335	0.000	-12.976 -4.896
Age	0.2763	0.105	2.620	0.009	0.070 0.483
Experience	0.5923	0.298	1.988	0.047	0.008 1.176

In [22]:

#Drawing the decision boundary for this logistic regression model
import matplotlib.pyplot as plt

fig = plt.figure()
ax1 = fig.add_subplot(111)

ax1.scatter(Emp_Productivity1.Age[Emp_Productivity1.Productivity==0],Emp_Productivity1.Experience[Emp_Productivity1.Productivity==0], s=10, c='b', marker="o", label='Productivity 0')
ax1.scatter(Emp_Productivity1.Age[Emp_Productivity1.Productivity==1],Emp_Productivity1.Experience[Emp_Productivity1.Productivity==1], s=10, c='r', marker="+", label='Productivity 1')

plt.xlim(min(Emp_Productivity1.Age), max(Emp_Productivity1.Age))
plt.ylim(min(Emp_Productivity1.Experience), max(Emp_Productivity1.Experience))
plt.legend(loc='upper left');

x_min, x_max = ax1.get_xlim()
ax1.plot([0, x_max], [intercept1, x_max*slope1+intercept1])
plt.show()

In [23]:

# Calculating and Storing prediction probabilities in inter1 variable for data Emp_Productivity1
Emp_Productivity_raw['inter1'] = fitted1.predict(Emp_Productivity_raw[["Age"]+["Experience"]])

For Sample_Set > 1:

In [24]:

# Filter the data and take observations from row 34 onwards. Filter condition is Sample_Set>1
Emp_Productivity2=Emp_Productivity_raw[Emp_Productivity_raw.Sample_Set>1]
Emp_Productivity2.shape

Out[24]:

(86, 5)

In [25]:

Emp_Productivity2.head()

Out[25]:

	Age	Experience	Productivity	Sample_Set	inter1
33	33.9	6.2	1	2	0.983732
34	29.3	5.5	1	2	0.918087
35	27.8	3.4	1	2	0.680985
36	30.7	8.6	1	2	0.990432
37	28.4	8.2	1	2	0.977408

In [26]:

####The classification graph
#Draw a scatter plot that shows Age on X axis and Experience on Y-axis. Try to distinguish the two classes with colors or shapes.
fig = plt.figure()
ax2 = fig.add_subplot(111)

ax2.scatter(Emp_Productivity2.Age[Emp_Productivity2.Productivity==0],Emp_Productivity2.Experience[Emp_Productivity2.Productivity==0], s=10, c='b', marker="o", label='Productivity 0')
ax2.scatter(Emp_Productivity2.Age[Emp_Productivity2.Productivity==1],Emp_Productivity2.Experience[Emp_Productivity2.Productivity==1], s=10, c='r', marker="+", label='Productivity 1')
plt.xlim(min(Emp_Productivity2.Age), max(Emp_Productivity2.Age))
plt.ylim(min(Emp_Productivity2.Experience), max(Emp_Productivity2.Experience))
plt.legend(loc='upper left');
plt.show()

In [27]:

#Build a logistic regression model to predict Productivity using age and experience of data Emp_Productivity2
###Logistic Regression model2
import statsmodels.formula.api as sm
model2 = sm.logit(formula='Productivity ~ Age+Experience', data=Emp_Productivity2)
fitted2 = model2.fit(method="bfgs")
fitted2.summary()

Optimization terminated successfully.
         Current function value: 0.198139
         Iterations: 24
         Function evaluations: 27
         Gradient evaluations: 27

Out[27]:

Logit Regression Results
Dep. Variable:	Productivity	No. Observations:	86
Model:	Logit	Df Residuals:	83
Method:	MLE	Df Model:	2
Date:	Tue, 14 Feb 2017	Pseudo R-squ.:	0.7137
Time:	15:59:18	Log-Likelihood:	-17.040
converged:	True	LL-Null:	-59.518
		LLR p-value:	3.566e-19

	coef	std err	z	P>\|z\|	[95.0% Conf. Int.]
Intercept	16.3184	3.966	4.114	0.000	8.545 24.092
Age	-0.3994	0.135	-2.949	0.003	-0.665 -0.134
Experience	-0.2440	0.189	-1.288	0.198	-0.615 0.127

In [28]:

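#normalized covariance matrix of the coefficient estimates
coef=fitted2.normalized_cov_params
print(coef)
#getting slope and intercept of the line from the fitted coefficients (selected by name)
slope2=fitted2.params['Age']/(-fitted2.params['Experience'])
intercept2=fitted2.params['Intercept']/(-fitted2.params['Experience'])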

            Intercept       Age  Experience
Intercept   15.730470 -0.497397    0.183390
Age         -0.497397  0.018339   -0.014873
Experience   0.183390 -0.014873    0.035860

In [29]:

#Finally draw the decision boundary for this logistic regression model
import matplotlib.pyplot as plt

fig = plt.figure()
ax2 = fig.add_subplot(111)

ax2.scatter(Emp_Productivity2.Age[Emp_Productivity2.Productivity==0],Emp_Productivity2.Experience[Emp_Productivity2.Productivity==0], s=10, c='b', marker="o", label='Productivity 0')
ax2.scatter(Emp_Productivity2.Age[Emp_Productivity2.Productivity==1],Emp_Productivity2.Experience[Emp_Productivity2.Productivity==1], s=10, c='r', marker="+", label='Productivity 1')
plt.xlim(min(Emp_Productivity2.Age), max(Emp_Productivity2.Age))
plt.ylim(min(Emp_Productivity2.Experience), max(Emp_Productivity2.Experience))
plt.legend(loc='upper left');

x_min, x_max = ax2.get_xlim()
y_min,y_max=ax2.get_ylim()
ax2.plot([x_min, x_max], [x_min*slope2+intercept2, x_max*slope2+intercept2])
plt.show()

In [30]:

#Calculate the prediction probabilities for all the inputs. Store the probabilities in inter2 variable for data Emp_Productivity2
Emp_Productivity_raw['inter2']=fitted2.predict(Emp_Productivity_raw[["Age"]+["Experience"]])

In [31]:

###Confusion matrix, Accuracy and error of the model2

#Predicting Values
predicted_values=fitted2.predict(Emp_Productivity2[["Age"]+["Experience"]])
predicted_values[1:10]

#Let's convert them to classes using a threshold
threshold=0.5
threshold

import numpy as np
predicted_class=np.zeros(predicted_values.shape)
predicted_class[predicted_values>threshold]=1

#Predicted Classes
predicted_class[1:10]

from sklearn.metrics import confusion_matrix as cm
ConfusionMatrix = cm(Emp_Productivity2[['Productivity']],predicted_class)
print('Confusion Matrix : ',ConfusionMatrix)
accuracy=(ConfusionMatrix[0,0]+ConfusionMatrix[1,1])/sum(sum(ConfusionMatrix))
print('Accuracy : ', accuracy)

error=1-accuracy
print('Error : ', error)

Confusion Matrix :  [[43  2]
 [ 2 39]]
Accuracy :  0.953488372093
Error :  0.046511627907

Now that both models have been created, let’s try to combine them.

In [32]:

#plotting the new columns
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(111)

ax.scatter(Emp_Productivity_raw.inter1[Emp_Productivity_raw.Productivity==0],Emp_Productivity_raw.inter2[Emp_Productivity_raw.Productivity==0], s=50, c='b', marker="o", label='Productivity 0')
ax.scatter(Emp_Productivity_raw.inter1[Emp_Productivity_raw.Productivity==1],Emp_Productivity_raw.inter2[Emp_Productivity_raw.Productivity==1], s=50, c='r', marker="+", label='Productivity 1')

plt.xlim(min(Emp_Productivity_raw.inter1), max(Emp_Productivity_raw.inter1)+0.2)
plt.ylim(min(Emp_Productivity_raw.inter2), max(Emp_Productivity_raw.inter2)+0.2)


plt.legend(loc='lower left');
plt.show()

In [33]:

###Logistic Regression model with Intermediate outputs as input
import statsmodels.formula.api as sm

model_combined = sm.logit(formula='Productivity ~ inter1+inter2', data=Emp_Productivity_raw)
fitted_combined = model_combined.fit(method="bfgs")
fitted_combined.summary()

Optimization terminated successfully.
         Current function value: 0.208985
         Iterations: 26
         Function evaluations: 27
         Gradient evaluations: 27

Out[33]:

Logit Regression Results
Dep. Variable:	Productivity	No. Observations:	119
Model:	Logit	Df Residuals:	116
Method:	MLE	Df Model:	2
Date:	Tue, 14 Feb 2017	Pseudo R-squ.:	0.6805
Time:	15:59:27	Log-Likelihood:	-24.869
converged:	True	LL-Null:	-77.848
		LLR p-value:	9.805e-24

	coef	std err	z	P>\|z\|	[95.0% Conf. Int.]
Intercept	-12.2134	1.907	-6.405	0.000	-15.951 -8.476
inter1	8.0193	1.409	5.693	0.000	5.258 10.780
inter2	8.5983	1.509	5.697	0.000	5.640 11.556

In [34]:

#normalized covariance matrix of the coefficient estimates
coef=fitted_combined.normalized_cov_params
print(coef)
# getting slope and intercept of the line from the fitted coefficients (selected by name)
slope_combined=fitted_combined.params['inter1']/(-fitted_combined.params['inter2'])
intercept_combined=fitted_combined.params['Intercept']/(-fitted_combined.params['inter2'])

           Intercept    inter1    inter2
Intercept   3.635572 -2.326775 -2.637054
inter1     -2.326775  1.984539  1.413297
inter2     -2.637054  1.413297  2.277541

In [35]:

#Finally draw the decision boundary for this logistic regression model
import matplotlib.pyplot as plt

fig = plt.figure()
ax2 = fig.add_subplot(111)

ax2.scatter(Emp_Productivity_raw.inter1[Emp_Productivity_raw.Productivity==0],Emp_Productivity_raw.inter2[Emp_Productivity_raw.Productivity==0], s=10, c='b', marker="o", label='Productivity 0')
ax2.scatter(Emp_Productivity_raw.inter1[Emp_Productivity_raw.Productivity==1],Emp_Productivity_raw.inter2[Emp_Productivity_raw.Productivity==1], s=10, c='r', marker="+", label='Productivity 1')

plt.xlim(min(Emp_Productivity_raw.inter1), max(Emp_Productivity_raw.inter1)+0.2)
plt.ylim(min(Emp_Productivity_raw.inter2), max(Emp_Productivity_raw.inter2)+0.2)

plt.legend(loc='lower left');

x_min, x_max = ax2.get_xlim()
y_min,y_max=ax2.get_ylim()
ax2.plot([x_min, x_max], [x_min*slope_combined+intercept_combined, x_max*slope_combined+intercept_combined])
plt.show()

In [36]:

#### Confusion Matrix, Accuracy and Error of the combined (intermediate output) model
#Predicting Values
predicted_values=fitted_combined.predict(Emp_Productivity_raw[["inter1"]+["inter2"]])
predicted_values[1:10]

#Let's convert them to classes using a threshold
threshold=0.5
threshold

import numpy as np
predicted_class=np.zeros(predicted_values.shape)
predicted_class[predicted_values>threshold]=1

#Predicted Classes
predicted_class[1:10]

from sklearn.metrics import confusion_matrix as cm
ConfusionMatrix = cm(Emp_Productivity_raw[['Productivity']],predicted_class)
print('Confusion Matrix : ',ConfusionMatrix)

accuracy=(ConfusionMatrix[0,0]+ConfusionMatrix[1,1])/sum(sum(ConfusionMatrix))
print('Accuracy : ', accuracy)

error=1-accuracy
print('Error : ', error)

Confusion Matrix :  [[74  2]
 [ 4 39]]
Accuracy :  0.949579831933
Error :  0.0504201680672

We get an accuracy of 94.95% with the intermediate-output model, compared with about 58% for the single logistic regression on the same data.
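
The three fitted models chain together into one function from (Age, Experience) to a predicted probability. The sketch below copies the rounded coefficients from the three model summaries in this lab, so its outputs only approximate what the fitted models would give; `unit` and `predict` are hypothetical helper names.

```python
import math


def g(z):
    return 1.0 / (1.0 + math.exp(-z))


# Coefficients copied (rounded) from the three model summaries in this lab
MODEL1 = (-8.9361, 0.2763, 0.5923)     # Intercept, Age, Experience -> inter1
MODEL2 = (16.3184, -0.3994, -0.2440)   # Intercept, Age, Experience -> inter2
COMBINED = (-12.2134, 8.0193, 8.5983)  # Intercept, inter1, inter2 -> Productivity


def unit(coefs, a, b):
    """One logistic regression: g(c0 + c1*a + c2*b)."""
    return g(coefs[0] + coefs[1] * a + coefs[2] * b)


def predict(age, experience):
    h1 = unit(MODEL1, age, experience)   # intermediate output 1
    h2 = unit(MODEL2, age, experience)   # intermediate output 2
    y = unit(COMBINED, h1, h2)           # final output built from h1 and h2
    return h1, h2, y


# Row 33 of the data shown earlier: Age 33.9, Experience 6.2, Productivity 1
h1, h2, y = predict(33.9, 6.2)
print(round(h1, 4), round(h2, 4), round(y, 4))
```

The value of h1 for this row agrees with the inter1 column printed earlier (0.983732) up to the rounding of the coefficients, and the final probability is above the 0.5 threshold, matching the actual class.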

Neural Network Intuition

Final output: $y = out(h) = g(\sum W_jh_j)$
Intermediate outputs: $h_j = out(x) = g(\sum w_{jk}x_k)$
Combined: $y = out(h) = g(\sum W_j g(\sum w_{jk}x_k))$

Here, h is a non-linear function of a linear combination of the inputs: a multiple logistic regression line.
Y is a non-linear function of a linear combination of the outputs of the logistic regressions.
In other words, Y is a non-linear function of a linear combination of non-linear functions of linear combinations of the inputs.
We find W to minimize $\sum_{i=1}^n [y_i - g(\sum W_jh_j)]^2$.
Equivalently, we find $W_j$ and $w_{jk}$ to minimize $\sum_{i=1}^n [y_i - g(\sum W_j g(\sum w_{jk} x_k))]^2$.
A neural network is all about finding the sets of weights $W_j$ and $w_{jk}$ using the gradient descent method.
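
As a hedged sketch of what gradient descent does here, consider a single observation. The symbols $E$ (squared error), $\hat{y}$ (predicted output) and $\eta$ (learning rate) are notation introduced for this illustration and are not part of the original text. The derivation relies on the identity $g'(z)=g(z)(1-g(z))$ for the logistic function.

$E = [y - \hat{y}]^2$, with $\hat{y} = g(\sum W_j h_j)$ and $h_j = g(\sum w_{jk}x_k)$

$\frac{\partial E}{\partial W_j} = -2(y-\hat{y})\,\hat{y}(1-\hat{y})\,h_j$

$\frac{\partial E}{\partial w_{jk}} = -2(y-\hat{y})\,\hat{y}(1-\hat{y})\,W_j\,h_j(1-h_j)\,x_k$

$W_j := W_j - \eta\frac{\partial E}{\partial W_j}$, $w_{jk} := w_{jk} - \eta\frac{\partial E}{\partial w_{jk}}$

Each step moves the weights a little in the direction that lowers $E$; summing these gradients over all n observations gives the gradient of the full objective.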

The Neural Networks

The neural network methodology is similar to the intermediate output method explained above.
But we do not manually subset the data to create the different models.
The neural network technique automatically takes care of all the intermediate outputs using hidden layers.
It works very well for data with non-linear decision boundaries.
The intermediate output layer in the network is known as the hidden layer.
In simple terms, neural networks are multi-layer non-linear regression models.
With a sufficient number of hidden nodes and layers, we can approximate very complex non-linear functions.

Neural Network and Vocabulary

Why are they called hidden layers?

A hidden layer “hides” the desired output.
Instead of predicting the actual output using a single model, we build multiple models that predict intermediate outputs.
There is no standard way of deciding the number of hidden layers, but with experience, and by looking at the complexity of the problem or the final accuracy of the model, we can experiment with the number of hidden layers.
More hidden layers let the model capture more complex patterns, but they also increase the risk of overfitting.
Thus, we have to guard against overfitting as well.
This is the overall intuition of the neural network.

Algorithm for Finding Weights

The algorithm is all about finding the weights/coefficients.
We randomly initialize some weights.
We calculate the output by supplying the training input.
With the values of x and these weights, we calculate the predicted values of y.
For the given values of x, we already know the actual values of y.
Whatever the error between the predicted and actual values, we adjust the weights to reduce that error.
Finally, after the adjustments, we arrive at the weights that give us the minimum error.
Let us see the steps involved in the neural network algorithm.

The Neural Network Algorithm

Step 1: Initialization of weights: Randomly select some weights.
Step 2: Training & Activation: Input the training values and perform the calculations forward.
- We have the dataset with us, so we put in the values of x and calculate forward to find the values of y.
- The values of y calculated this way are the predicted values of y.
- In the training dataset itself, we also have the actual values of y.
Step 3: Error Calculation: Calculate the error at the outputs. Use the output error to calculate error fractions at each hidden layer.
Step 4: Weight training: Update the weights to reduce the error, recalculate and repeat the process of training & updating the weights for all the examples.
Step 5: Stopping criteria: Stop the training and weight updating process when the minimum error criterion is met.
- So we start with some weights, calculate the errors, and adjust the weights to reduce the error; once the error is very small, we stop at that point.

Randomly Initialize Weights

Step 1 is the initialization of some random weights.
We can take any random weights.
These are not the final weights; we will adjust them later.

Training & Activation

Training and activation is the next step.
We now input the values of x to predict h1 and h2, because we already have the weights.
We simply substitute the values of x and the random weights into the equations for h1 and h2, and then substitute h1 and h2 into the equation for y.
Thus, by inputting the training values, we can perform the calculations forward.
Feeding in the training input and calculating forward is called feed forward.

Error Calculation at Output

From the previous step we have the predicted values of y; the function $g(x)$ gives the predicted values of y.
We also know the actual value of y.
We calculate the error at the final layer, and from it we can find the error fractions at each hidden layer.
With this formula, we can find out what fraction of the overall error each node contributes.
Calculating the error at each layer in this way is called Back Propagation, because the error signals are calculated backwards.

Error Calculation at hidden layers

This is the overall error, i.e. Err.
Once we have the errors, we can calculate the weight corrections that will reduce them.

Calculate weight corrections

Here, the corrections to $w_{11}$, $w_{12}$, etc. reduce the errors at $h_1$ and $h_2$, and the corrections to $w_0$, $w_1$ and $w_2$ reduce the overall error.
These are the weight corrections.

Update Weights

In the Update Weights step, each new weight is the sum of the previous weight and its weight correction: $w_{11} := w_{11} + \Delta w_{11}$
Update the weights based on the weight corrections, recalculate, and repeat the process.
With the new weights, the error is reduced.
This is one iteration, which we repeat again and again.
With the new weights, we again calculate the error at the output.
We again find the error at each hidden layer.
And again we calculate the weight corrections and update the weights.
With each new set of weights, the error is reduced slightly.
We repeat the process again and again.

Stopping Criteria

We stop training and updating the weights when the error is at its minimum.
When no further error reduction is happening, we can stop; alternatively, we can set a minimum error criterion, i.e., stop training once the error falls below a chosen value.
The final weights are taken from the final iteration.
Thus, once the minimum error criterion is met, we exit the algorithm.
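The two stopping rules described above (error below a chosen value, or no further reduction) can be written as a small check that runs after every epoch. The function name, the default thresholds and the extra epoch limit are assumptions made for this sketch; they are not taken from any particular library.

```python
def should_stop(history, goal=0.01, min_improvement=1e-6, max_epochs=1000):
    """Return the reason to stop training, or None to keep going.

    history is the list of errors recorded so far, one value per epoch.
    """
    if not history:
        return None
    if history[-1] <= goal:
        return "goal reached"            # minimum error criterion met
    if len(history) >= 2 and history[-2] - history[-1] < min_improvement:
        return "no further improvement"  # error has stopped reducing
    if len(history) >= max_epochs:
        return "epoch limit"             # safety net (an assumption of this sketch)
    return None

print(should_stop([0.5, 0.3]))      # still improving, so None
print(should_stop([0.5, 0.005]))    # below the goal
```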

Once Again: The Neural Network Algorithm

Step 1: Initialization of weights: Randomly select some weights.
Step 2: Training & Activation: Input the training values and perform the calculations forward.
Step 3: Error Calculation: Calculate the error at the outputs. Use the output error to calculate error fractions at each hidden layer.
Step 4: Weight training: Update the weights to reduce the error, recalculate and repeat the process of training & updating the weights for all the examples.
Step 5: Stopping criteria: Stop the training and weight-updating process when the minimum error criterion is met.

Neural network Algorithm-Demo

This demo shows how the neural network algorithm works with actual numbers.
Some datasets cannot be separated by a single linear decision boundary (perceptron).
Let us consider a similar but simpler classification example.
- XOR Gate Dataset
Observing the dataset, there are two classes: 1 and 0.
Clearly, no single decision boundary can separate them into two classes.
Thus, we need a non-linear decision boundary, and we use a neural network to classify the points into 1’s and 0’s.
As an example, we take the XOR Gate dataset.
The diagram of the XOR gate looks similar to the example above.
In the XOR gate, the output is 1 only when exactly one of the inputs is 1:
- When x1 is 1 and x2 is 1, the output is 0.
- When x1 is 1 and x2 is 0, the output is 1.
- When x1 is 0 and x2 is 1, the output is 1.
- When x1 is 0 and x2 is 0, the output is 0.
We will use the neural network algorithm to build a non-linear decision boundary that separates the two classes, i.e., 1’s and 0’s.
One decision line is not sufficient, so we need a non-linear decision boundary.

Randomly Initialize Weights

Step 1 is to randomly initialize the weights.
Suppose these are the randomly initialized weights; they can be anything.
There is no need to worry about the exact values we take.
So we randomly initialize the weights like this.

Activation

We take the input data directly from the dataset.
This is the XOR dataset:
- When x1 is 1 and x2 is 1, the output is 0.
- When x1 is 1 and x2 is 0, the output is 1.
We take the first data point and pass it through the network.
This is the start of the first epoch.
Inputting x1 = 1 and x2 = 1, we can find the values of h1 and h2.
Substituting the inputs and the weights, h1 is 0.818 and h2 is 0.731.
Using h1, h2 and the output weights, we use this equation to calculate the final value of y.
When x1 is 1 and x2 is 1, the actual output is 0, but with these weights the predicted output is 0.713.
Clearly there is an error, since we got 0.713 instead of 0.
The error is Error = Actual - Predicted, i.e. $(Y_{Target} - Y_{Observed})$.
The predicted value is 0.713 and the actual value is 0.
So the magnitude of the error is the output itself, 0.713.
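The feed-forward step for this first training example can be checked numerically. The weights below are hypothetical: they are not given in the text and were chosen only so that the results agree with the quoted values of h1, h2 and y.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Assumed weights, chosen only to reproduce the figures quoted in the text.
# Each hidden node: (bias, weight from x1, weight from x2).
w_h1 = (0.5, 0.5, 0.5)
w_h2 = (1.0, -1.0, 1.0)
# Output node: (bias, weight from h1, weight from h2).
w_y = (1.0, -1.0, 1.0)

x1, x2, target = 1, 1, 0   # first XOR training example

h1 = sigmoid(w_h1[0] + w_h1[1] * x1 + w_h1[2] * x2)
h2 = sigmoid(w_h2[0] + w_h2[1] * x1 + w_h2[2] * x2)
y = sigmoid(w_y[0] + w_y[1] * h1 + w_y[2] * h2)
error = target - y          # Error = Actual - Predicted

print(round(h1, 4), round(h2, 4), round(y, 4), round(error, 4))
# prints: 0.8176 0.7311 0.7137 -0.7137
```

The sign of the error is negative because the prediction is above the target; its magnitude is the 0.713 discussed above.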

Back-Propagate Errors

Similarly, we can calculate the error fraction at h1 and at h2.
Based on the delta at y (the error fraction at the output), the error fraction at h1 is 0.021 and at h2 is -0.028.

Calculate Weight Corrections

The weight adjustments are derived from the error fractions calculated earlier.
Thus, the error fraction and the value of y help us find the weight adjustment.
We calculate the weight adjustments and then update the weights with them.

Updated Weights

Earlier the weight was 0.5 at h1 and -1 at h2; the calculated weight adjustments come to about 0.00217501 at h1 and -0.002867 at h2.
The weight 0.5 becomes 0.502175 and -1 becomes -1.002867.
So we have adjusted these weights.
We do the same for all the other weights.
With the new weights, we again calculate the predicted y, compare it with the actual y, and find the error.
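The back-propagation and update arithmetic for this example can be reproduced as a short sketch. The learning rate of 0.1 and the output-layer weights (bias 1, -1 from h1, +1 from h2) are assumptions: they are not stated in the text, but with them the calculation reproduces the updated weights 0.502175 and -1.002867 quoted above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

eta = 0.1                                  # assumed learning rate
x1, target = 1, 0                          # input x1 and target of example (1, 1, 0)
h1, h2 = sigmoid(1.5), sigmoid(1.0)        # hidden outputs, about 0.818 and 0.731
bias_y, w_y_h1, w_y_h2 = 1.0, -1.0, 1.0    # assumed output-layer weights
y = sigmoid(bias_y + w_y_h1 * h1 + w_y_h2 * h2)

# Error fraction (delta) at the output node.
delta_y = y * (1 - y) * (target - y)

# Error fractions at the hidden nodes, propagated back through the output weights.
delta_h1 = h1 * (1 - h1) * w_y_h1 * delta_y
delta_h2 = h2 * (1 - h2) * w_y_h2 * delta_y

# Weight corrections for the x1 -> h1 and x1 -> h2 weights, then the update.
dw_h1 = eta * x1 * delta_h1
dw_h2 = eta * x1 * delta_h2
new_w_h1 = 0.5 + dw_h1
new_w_h2 = -1.0 + dw_h2

print(round(delta_h1, 5), round(delta_h2, 5))
print(round(new_w_h1, 6), round(new_w_h2, 6))
```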

Iterations and Stopping Criteria

This iteration is just for one training example (1,1,0). This is just the first epoch.
We repeat the same process of training and updating the weights for all the data points.
We continue updating the weights until there is no significant change in the error or until the maximum permissible error criterion is met.
By updating the weights in this way, we reduce the error slightly each time. When the error reaches its minimum, the iterations stop and the weights are considered optimal for this training set.

XOR Gate final NN Model

Finally, we find the decision boundaries using the neural network.
This is how the final neural network model for the XOR gate looks.
These were the manual calculations for a neural network.
In practice, we do not need to do the error calculation, back propagation, weight updates, etc. by hand.
A software tool takes care of all of this automatically.
All we need to supply is the dataset, the independent variables, the dependent variable, etc.
We did this exercise only to get the idea; in general, we do not calculate the errors manually.
Thus, we can use any tool to build neural networks.

Building the Neural Network

We do not need to calculate the weights manually like that.
With R, Python, or any other tool with a prewritten implementation, we just need to supply the dataset and the right values of x.
The gradient descent method is not easy to understand for students without a mathematics background, so it is not easy to write the program from scratch.
Neural network tools do not expect the user, at least beginners and intermediate students like us, to write the code for the full back propagation algorithm.
We do not need to know the complete code of the algorithm that finds the weights of a neural network.
Thus, we can use a tool like Python.
We will use Python to build the neural network; we just need to be careful while setting the parameters of the neural network function, and everything else is taken care of.
We will build a neural network model for the XOR data.
We will also do a weight-finding exercise on the Emp_Productivity.csv data.
We will use this neural network to predict values.
We will also see what the final model looks like.
Earlier we built two logistic regressions; now we directly build one neural network.
To build the neural network, we use the nl.net.newff function from the neurolab package.
We will fit an XOR neural network model.

The good news is…

We don’t need to write the code for calculating and updating the weights.
There are ready-made libraries and packages available in Python.
The gradient descent method is not easy to understand for students without a mathematics background.
Neural network tools don’t expect the user to write the code for the full back propagation algorithm.

Building the Neural Network in Python

A couple of packages are available in Python.
We need to specify the dataset, the input, the output and the number of hidden layers.
Neural network calculations are very complex. The algorithm may take some time to produce results.
One needs to be careful while setting the parameters. The runtime changes with the input parameter values.

Python Code Options

Step 1: Import neurolab.
Step 2: Create an empty list, error = [], to store the error value of each iteration (epoch).
Step 3: Create a network:
- Here nl.net.newff([[0, 1],[0,1]],[4,1],transf=[nl.trans.LogSig()] * 2) creates a feed-forward neural network.
- The 1st argument is the min and max values of the predictor variables; it can be a list of lists.
- The 2nd argument is the number of nodes in each layer, i.e., 4 in the hidden layer and 1 in the output layer.
- transf is the transfer function applied in each layer.
Step 4: Train the network:
- net.train() returns the error, which is appended to the error variable; it takes a few parameters.
Step 5: Simulate the network:
- net.sim(input) gives the output of the network.

LAB: Building the neural network in Python

Build a neural network for XOR data
Dataset: Emp_Productivity/Emp_Productivity.csv
Draw a 2D graph of age and experience, distinguishing the productivity classes.
Build a neural network to predict productivity based on age and experience.
Plot the neural network with final weights.

Manual Calculation of Weights XOR Dataset

In [37]:

## Importing the dataset
xor_data = pd.read_csv("datasets/Gates/xor.csv")
xor_data.shape

Out[37]:

(4, 3)

In [38]:

xor_data.head()

Out[38]:

	input1	input2	output
0	1	1	0
1	1	0	1
2	0	1	1
3	0	0	0

In [39]:

#Neural network building
#We use the 'neurolab' package for implementing neural nets in Python
#neurolab needs to be downloaded and installed manually
import neurolab as nl
import numpy as np
import pylab as pl

In [40]:

error = []        #stores the error value of each iteration (epoch)

# Create a network with 1 hidden layer and randomly initialized weights
#nl.net.newff() creates a feed-forward neural network
#1st argument is the min and max values of the predictor variables
#2nd argument is the number of nodes in each layer, i.e. 4 in hidden and 1 in output
#transf is the transfer function applied in each layer
net = nl.net.newff([[0, 1],[0,1]],[4,1],transf=[nl.trans.LogSig()] * 2)
net.trainf = nl.train.train_rprop

In [41]:

# Training the network
#net.train returns the error of each epoch, which is appended to the error list
error.append(net.train(xor_data[["input1"]+["input2"]], xor_data[["output"]], show=0, epochs = 100,goal=0.001))

In [42]:

#plotting epochs vs. error
#we can use this plot to choose the number of epochs in training to reduce time
pl.figure(1)
pl.plot(error[0])
pl.xlabel('Number of epochs')
pl.ylabel('Training error')
pl.grid()
pl.show()

In [43]:

# Simulate network(predicting)
predicted_values = net.sim(xor_data[["input1"]+["input2"]])

In [44]:

#converting predicted values into classes by using a threshold of 0.5
#(a new array is built so that predicted_values is not overwritten)
predicted_class = (predicted_values > 0.5).astype(float)

In [45]:

#predicted classes
predicted_class

Out[45]:

array([[ 0.],
       [ 1.],
       [ 1.],
       [ 0.]])

In [46]:

#confusion matrix
from sklearn.metrics import confusion_matrix as cm
ConfusionMatrix = cm(xor_data[['output']],predicted_class)
print('ConfusionMatrix : ', ConfusionMatrix)

#accuracy
accuracy=(ConfusionMatrix[0,0]+ConfusionMatrix[1,1])/sum(sum(ConfusionMatrix))
print('Accuracy : ', accuracy)

#accuracy is 100%: the model classifies every training observation correctly
error=1-accuracy
print('Error :', error)

ConfusionMatrix :  [[2 0]
 [0 2]]
Accuracy :  1.0
Error : 0.0

Building the neural network in Python on EMP_Productivity datasets

In [47]:

Emp_Productivity_raw = pd.read_csv("datasets/Emp_Productivity/Emp_Productivity.csv")
Emp_Productivity_raw.head(10)

Out[47]:

	Age	Experience	Sample_Set
0	20.0	2.3	1
1	16.2	2.2	1
2	20.2	1.8	1
3	18.8	1.4	1
4	18.9	3.2	1
5	16.7	3.9	1
6	16.3	1.4	1
7	20.0	1.4	1
8	18.0	3.6	1
9	21.2	4.3	1

In [48]:

#Draw a scatter plot that shows Age on X axis and Experience on Y-axis. Try to distinguish the two classes with colors or shapes.
import matplotlib.pyplot as plt

fig = plt.figure()
ax1 = fig.add_subplot(111)

ax1.scatter(Emp_Productivity_raw.Age[Emp_Productivity_raw.Productivity==0],Emp_Productivity_raw.Experience[Emp_Productivity_raw.Productivity==0], s=10, c='b', marker="o", label='Productivity 0')
ax1.scatter(Emp_Productivity_raw.Age[Emp_Productivity_raw.Productivity==1],Emp_Productivity_raw.Experience[Emp_Productivity_raw.Productivity==1], s=10, c='r', marker="+", label='Productivity 1')
plt.legend(loc='upper left');
plt.show()

In [49]:

####Building neural net
import neurolab as nl
import numpy as np
import pylab as pl

error = []
# Create a network with 1 hidden layer (6 nodes) and randomly initialized weights
net = nl.net.newff([[15, 60],[1,20]],[6,1],transf=[nl.trans.LogSig()] * 2)
net.trainf = nl.train.train_rprop

In [50]:

# Train network
error.append(net.train(Emp_Productivity_raw[["Age"]+["Experience"]], Emp_Productivity_raw[["Productivity"]], show=0, epochs = 500,goal=0.02))

# Simulate network
predicted_values = net.sim(Emp_Productivity_raw[["Age"]+["Experience"]])

In [51]:

#Converting predicted values into predicted classes with a threshold of 0.5
#(a new array is built so that predicted_values is not overwritten)
predicted_class = (predicted_values > 0.5).astype(float)

#Predicted classes
predicted_class[0:10]

Out[51]:

array([[ 0.],
       [ 0.],
       [ 0.],
       [ 0.],
       [ 0.],
       [ 0.],
       [ 0.],
       [ 0.],
       [ 0.],
       [ 0.]])

In [52]:

#confusion matrix
from sklearn.metrics import confusion_matrix as cm
ConfusionMatrix = cm(Emp_Productivity_raw[['Productivity']],predicted_class)
print('Confusion Matrix : ', ConfusionMatrix)

#accuracy
accuracy=(ConfusionMatrix[0,0]+ConfusionMatrix[1,1])/sum(sum(ConfusionMatrix))
print('Accuracy : ', accuracy)

error=1-accuracy
print('Error : ', error)

Confusion Matrix :  [[74  2]
 [ 4 39]]
Accuracy :  0.949579831933
Error :  0.0504201680672
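As a quick check of the figures above, the accuracy and error rate can be recomputed by hand from the confusion matrix. The sketch below simply copies the four counts from the output; rows are the actual classes and columns are the predicted classes, which is the layout scikit-learn's confusion_matrix uses.

```python
confusion = [[74, 2],    # actual 0: predicted 0, predicted 1
             [4, 39]]    # actual 1: predicted 0, predicted 1

correct = confusion[0][0] + confusion[1][1]      # diagonal = correct predictions
total = sum(sum(row) for row in confusion)       # all observations
accuracy = correct / total
error_rate = 1 - accuracy

print(correct, total)                            # prints: 113 119
print(round(accuracy, 4), round(error_rate, 4))  # prints: 0.9496 0.0504
```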

In [53]:

#plotting actual and predicted classes
Emp_Productivity_raw['predicted_class']=pd.DataFrame(predicted_class)

import matplotlib.pyplot as plt

fig = plt.figure()
ax1 = fig.add_subplot(111)

ax1.scatter(Emp_Productivity_raw.Age[Emp_Productivity_raw.Productivity==0],Emp_Productivity_raw.Experience[Emp_Productivity_raw.Productivity==0], s=40, c='g', marker="o", label='Productivity 0')
ax1.scatter(Emp_Productivity_raw.Age[Emp_Productivity_raw.Productivity==1],Emp_Productivity_raw.Experience[Emp_Productivity_raw.Productivity==1], s=40, c='g', marker="x", label='Productivity 1')
ax1.scatter(Emp_Productivity_raw.Age[Emp_Productivity_raw.predicted_class==0],Emp_Productivity_raw.Experience[Emp_Productivity_raw.predicted_class==0], s=30, c='b', marker="1", label='Predicted 0')
ax1.scatter(Emp_Productivity_raw.Age[Emp_Productivity_raw.predicted_class==1],Emp_Productivity_raw.Experience[Emp_Productivity_raw.predicted_class==1], s=30, c='r', marker="2", label='Predicted 1')

plt.legend(loc='upper left');
plt.show()

In [54]:

# Simulate network with 'Age': 40; 'experience':12
x = np.array([[40,12]])
net.sim(x)

Out[54]:

array([[ 0.04617027]])

With the threshold 0.5, the above output will be ‘0’.

Local vs. Global Minimum

Neural networks have an issue of local vs. global minimum.
To understand it, we need to look at the details of how the weights are found.
The question is what these multiple solutions are and what local and global minima are.
There can be multiple solutions for a given neural network, because there are many weights and many weight combinations can lead to a small error.
The gradient descent method that we use to find the weights is not guaranteed to find the global minimum; it finds the nearest local minimum, which is what we get most of the time.
In the graph, the global minimum is the lowest point of the whole error curve, while a local minimum is only the lowest point in its neighbourhood.
Because the algorithm settles in a local minimum rather than the global minimum, we may see multiple solutions for a given neural network problem.
That is an uncomfortable situation, but we can perform cross-validation checks to choose the final solution.
So there can be multiple optimal solutions for a neural network.
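The effect of the starting point can be seen on a one-dimensional toy curve. The function below is a made-up error surface with two valleys, not the error function of any network in this handout; the learning rate and number of steps are also assumptions of the sketch.

```python
def error_surface(w):
    # Hypothetical error curve with one local and one global minimum.
    return w ** 4 - 2 * w ** 2 + 0.5 * w

def gradient(w):
    # Derivative of error_surface with respect to w.
    return 4 * w ** 3 - 4 * w + 0.5

def gradient_descent(w, learning_rate=0.01, steps=1000):
    for _ in range(steps):
        w -= learning_rate * gradient(w)   # step against the gradient
    return w

w_from_right = gradient_descent(2.0)    # ends in the valley near w = 0.93
w_from_left = gradient_descent(-2.0)    # ends in the valley near w = -1.06

print(round(w_from_right, 2), round(error_surface(w_from_right), 2))
print(round(w_from_left, 2), round(error_surface(w_from_left), 2))
```

Both runs stop at a point where the gradient is zero, but only the run that started on the left reaches the lower of the two valleys. Random initial weights play the role of the starting point in a real network.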

Hidden layers and their role

Now we will try to understand what hidden layers are and what role they play in the neural network.
Consider a multi-layer neural network.

Multi Layer Neural Network

As shown in the figure, Layer-1 has two nodes while Layer-2 has three nodes.
We can have multiple layers as well: if the objective space is complex and strongly non-linear, a multi-layer neural network can capture the overall variation and the classification details in it.
This is where the role of the hidden layers comes in.

The Role of Hidden Layers

The first hidden layer:
- The first layer is nothing but the linear decision boundaries.
- These are the simple logistic regression line outputs.
- We can see them as multiple lines on the decision space.
The second hidden layer:
- The second layer combines these lines and forms simple decision boundary shapes.
The third hidden layer:
- The third hidden layer forms even more complex shapes within the boundaries generated by the second layer.
You can imagine all these layers together dividing the whole objective space into multiple decision boundary shapes; the cases within a shape are class-1 and those outside the shape are class-2.
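A small hand-built network shows how lines combine into a shape. The weights below are picked by hand purely for illustration (they are not trained and not taken from the lab): two hidden nodes each draw one line, and in this tiny sketch the output node plays the combining role by firing only in the band between the two lines, which is exactly the XOR region.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def xor_network(x1, x2):
    # Hidden layer: two linear decision boundaries (hand-picked weights).
    h1 = sigmoid(20 * x1 + 20 * x2 - 10)    # fires above the line x1 + x2 = 0.5
    h2 = sigmoid(-20 * x1 - 20 * x2 + 30)   # fires below the line x1 + x2 = 1.5
    # Output node: fires only when both hidden nodes fire,
    # i.e. for points in the band between the two lines.
    return sigmoid(20 * h1 + 20 * h2 - 30)

for x1, x2 in [(1, 1), (1, 0), (0, 1), (0, 0)]:
    print(x1, x2, round(xor_network(x1, x2)))
# prints:
# 1 1 0
# 1 0 1
# 0 1 1
# 0 0 0
```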

The Number of Hidden Layers

There is no concrete rule for choosing the right number. We need to choose by trial and error with validation.
Too few hidden layers might result in an imperfect model. The error rate will be high.
A high number of hidden layers might lead to over-fitting, but this can be identified by using validation techniques.
The final number depends on the number of predictor variables, the training data size and the complexity of the target.
When in doubt, it is better to go with more hidden nodes than fewer. This tends to give higher accuracy on the training data, though the training process will be slower.
Cross validation and testing error can help us determine the model with the optimal number of hidden layers.

LAB: Digit Recognizer

Take an image of a handwritten single digit, and determine what that digit is.
Normalized handwritten digits, automatically scanned from envelopes by the U.S. Postal Service. The original scanned digits are binary and of different sizes and orientations; the images here have been de-slanted and size normalized, resulting in 16×16 grayscale images (Le Cun et al., 1990).
The data are in two gzipped files, and each line consists of the digit id (0-9) followed by the 256 grayscale values.
Build a neural network model that can be used as the digit recognizer.
Use the test dataset to validate the true classification power of the model.
What is the final accuracy of the model?

In [55]:

#Importing test and training data
import numpy as np
digits_train = np.loadtxt("datasetsDigit RecognizerUSPSzip.train.txt")

In [56]:

#digits_train is numpy array. we convert it into dataframe for better handling
train_data=pd.DataFrame(digits_train)
train_data.shape

Out[56]:

(7291, 257)

In [57]:

digits_test = np.loadtxt("datasetsDigit RecognizerUSPSzip.test.txt")
#digits_test is numpy array. we convert it into dataframe for better handling
test_data=pd.DataFrame(digits_test)
test_data.shape

Out[57]:

(2007, 257)

In [58]:

train_data[0].value_counts()     #To get labels of the images

Out[58]:

0.0    1194
1.0    1005
2.0     731
6.0     664
3.0     658
4.0     652
7.0     645
9.0     644
5.0     556
8.0     542
Name: 0, dtype: int64

In [59]:

import matplotlib.pyplot as plt

#Let's have a look at some images.
plt.figure(figsize=(10,10))      #one figure that holds all the sample images
for i in range(0,5):
    data_row=digits_train[i][1:]              #drop the label, keep the 256 pixel values
    pixels=np.array(data_row).reshape(16,16)  #reshape into a 16x16 image
    plt.subplot(3,3,i+1)
    plt.imshow(pixels)
plt.show()

In [60]:

#Creating multiple columns for multiple outputs
#####We need these variables while building the model
digit_labels=pd.DataFrame()
digit_labels['label']=train_data[0:][0]
label_names=['I0','I1','I2','I3','I4','I5','I6','I7','I8','I9']
for i in range(0,10):
    digit_labels[label_names[i]]=digit_labels.label==i
#see our newly created labels data
digit_labels.head(10)

Out[60]:

	label	I0	I1	I2	I3	I4	I5	I6	I7	I8	I9
0	6.0	False	False	False	False	False	False	True	False	False	False
1	5.0	False	False	False	False	False	True	False	False	False	False
2	4.0	False	False	False	False	True	False	False	False	False	False
3	7.0	False	False	False	False	False	False	False	True	False	False
4	3.0	False	False	False	True	False	False	False	False	False	False
5	6.0	False	False	False	False	False	False	True	False	False	False
6	3.0	False	False	False	True	False	False	False	False	False	False
7	1.0	False	True	False	False	False	False	False	False	False	False
8	0.0	True	False	False	False	False	False	False	False	False	False
9	1.0	False	True	False	False	False	False	False	False	False	False

In [61]:

#Update the training dataset
train_data1=pd.concat([train_data,digit_labels],axis=1)
print(train_data1.shape)
train_data1.head(5)

(7291, 268)

Out[61]:

	0	1	2	3	4	5	6	7	8	9	…	I0	I1	I2	I3	I4	I5	I6	I7	I8	I9
0	6.0	-1.0	-1.0	-1.0	-1.000	-1.000	-1.000	-1.000	-0.631	0.862	…	False	False	False	False	False	False	True	False	False	False
1	5.0	-1.0	-1.0	-1.0	-0.813	-0.671	-0.809	-0.887	-0.671	-0.853	…	False	False	False	False	False	True	False	False	False	False
2	4.0	-1.0	-1.0	-1.0	-1.000	-1.000	-1.000	-1.000	-1.000	-1.000	…	False	False	False	False	True	False	False	False	False	False
3	7.0	-1.0	-1.0	-1.0	-1.000	-1.000	-0.273	0.684	0.960	0.450	…	False	False	False	False	False	False	False	True	False	False
4	3.0	-1.0	-1.0	-1.0	-1.000	-1.000	-0.928	-0.204	0.751	0.466	…	False	False	False	True	False	False	False	False	False	False

5 rows × 268 columns

In [62]:

#########Neural network building
import neurolab as nl
import numpy as np
import pylab as pl

x_train=train_data.drop(train_data.columns[[0]], axis=1)
y_train=digit_labels.drop(digit_labels.columns[[0]], axis=1)

In [63]:

#getting minimum and maximum of each column of x_train into a list
def minMax(x):
    return pd.Series(index=['min','max'],data=[x.min(),x.max()])

In [64]:

listvalues = x_train.apply(minMax).T.values.tolist()

error = []

In [65]:

# Create a network with 1 hidden layer (20 nodes) and 10 output nodes, randomly initialized
net = nl.net.newff(listvalues,[20,10],transf=[nl.trans.LogSig()] * 2)
net.trainf = nl.train.train_rprop

In [66]:

# Train network
import time
start_time = time.time()
error.append(net.train(x_train, y_train, show=0, epochs = 250,goal=0.02))
print("--- %s seconds ---" % (time.time() - start_time))

--- 284.80046010017395 seconds ---

In [67]:

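# Prediction on the testing data
x_test=test_data.drop(test_data.columns[[0]], axis=1)
y_test=test_data[0:][0]

#as_matrix() was removed in pandas 1.0; .values returns the same NumPy array
predicted_values = net.sim(x_test.values)
predict=pd.DataFrame(predicted_values)

#the predicted digit is the index of the output node with the largest value
index=predict.idxmax(axis=1)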

In [68]:

#confusion matrix
from sklearn.metrics import confusion_matrix as cm
ConfusionMatrix = cm(y_test,index)
print('Confusion Matrix : ', ConfusionMatrix)

#accuracy
accuracy=np.trace(ConfusionMatrix)/sum(sum(ConfusionMatrix))
print('Accuracy : ', accuracy)

error=1-accuracy
print('Error : ', error)

Confusion Matrix :  [[344   0   1   2   2   3   5   0   2   0]
 [  0 249   2   1   3   0   4   4   1   0]
 [  2   1 175   7   4   1   1   2   4   1]
 [  2   0   7 137   1  15   0   0   3   1]
 [  2   1   3   0 178   1   2   3   3   7]
 [  2   0   0  10   3 138   0   0   4   3]
 [  1   0   3   0   3   3 158   0   2   0]
 [  0   0   1   2   7   0   0 132   1   4]
 [  3   0   5   6   4   7   2   1 133   5]
 [  0   1   3   0   4   1   0   4   1 163]]
Accuracy :  0.900348779273
Error :  0.0996512207275

Real World Applications

Self driving car by taking the video as input
Speech recognition
Face recognition
Cancer cell analysis
Heart attack predictions
Currency predictions and stock price predictions
Credit card default and loan predictions
Marketing and advertising by predicting the response probability
Weather forecasting and rainfall prediction

Some examples

Face recognition:
- https://www.youtube.com/watch?v=57VkfXqJ1LU
- https://www.youtube.com/watch?v=xVQLBbXdVUY
Autonomous car software:
- https://www.youtube.com/watch?v=gG72-SjwxAM

Drawbacks of Neural Networks

A neural network is a computation-intensive algorithm.
There is no real theory that explains how to choose the number of hidden layers.
It takes a lot of time when the input data is large and needs powerful computing machines.
The results are difficult to interpret. It is very hard to interpret and measure the impact of individual predictors.
It is not easy to choose the right training sample size and learning rate.
The local minimum issue: the gradient descent algorithm produces the optimal weights for a local minimum; the global minimum of the error function is not guaranteed.

Why the name neural network?

The neural network algorithm for solving complex learning problems is inspired by the human brain.
Our brain is a huge network of processing elements. It contains a network of billions of neurons.
In our brain, a neuron receives input from other neurons. The inputs are combined and sent to the next neuron.
The artificial neural network algorithm is built on the same logic.
If we look at a particular neuron, it receives signals through its dendrites, processes them, and sends the output along its axon.

Conclusion

Neural networks are a vast subject. Many data scientists focus solely on neural network techniques.
In this session, we practiced only the introductory concepts. Neural networks have many more advanced techniques, and there are many algorithms other than back propagation.
Neural networks work particularly well on certain classes of problems, such as image recognition.
Neural network algorithms are very calculation intensive. They require highly efficient computing machines. Large datasets take a significant amount of runtime. We need to try different options and packages.
Currently, there is a lot of exciting research going on around neural networks.
After gaining sufficient knowledge from this basic session, you may want to explore reinforcement learning, deep learning, etc.

Appendix

Math- How to update the weights?

We update the weights backwards by iteratively calculating the error.
The weights are updated using the gradient descent method, or delta rule, also known as the Widrow-Hoff rule.
First we calculate the weight corrections for the output layer, then we take care of the hidden layers.
- The weight correction is $\Delta W_{jk} = \eta \, y_j \, \delta_k$.
- $\eta$ is the learning parameter.
- $\delta_k = y_k (1 - y_k) \cdot Err$ (for hidden layers, $\Delta W_{jk} = \eta \, y_j \, \delta_k$ as well)
- Err = Expected output - Actual output
The weight corrections are calculated based on the error function.
The new weights are chosen in such a way that the final error of the network is minimized.

Math-How does the delta rule work?

Let's consider a simple example to understand weight updating with the delta rule.
Suppose we are building a simple logistic regression line and would like to find the weights using the weight update rule.
$Y = \frac{1}{1+e^{-wx}}$ is the equation.
We are searching for the optimal w for our data.
Let w be 1.
$Y = \frac{1}{1+e^{-x}}$ is the initial equation.
The error in our initial step is 3.59.
To reduce the error, we add a delta to w and make it 1.5.
Now w is 1.5 (blue line).
$Y = \frac{1}{1+e^{-1.5x}}$ is the updated equation.
With the updated weight, the error is 1.57.
We can further reduce the error by increasing w by delta.
If we repeat the same process of adding delta and updating the weight, we finally end up with the minimum error.
The weight at that final step is the optimal weight.
In this example, the final weight is 8, and the error is 0.
$Y = \frac{1}{1+e^{-8x}}$ is the final equation.
In this example, we changed the weights manually to reduce the error. This is just for intuition; manual updating is not feasible for complex optimization problems.
Gradient descent is a systematic optimization method. We update the weights by calculating the gradient of the error function.
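The manual trial of w = 1, 1.5 and 8 can be mimicked on a small made-up dataset. The data points below are hypothetical (the data behind the errors 3.59 and 1.57 is not shown in this handout), so the printed errors will differ from those figures; the point is only that the squared error shrinks as w grows for data of this kind.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical data: negative x belongs to class 0, positive x to class 1.
xs = [-3, -2, -1, 1, 2, 3]
ys = [0, 0, 0, 1, 1, 1]

def squared_error(w):
    # Sum of squared differences between actual y and Y = 1 / (1 + e^(-w x)).
    return sum((y - sigmoid(w * x)) ** 2 for x, y in zip(xs, ys))

for w in (1.0, 1.5, 8.0):
    print(w, round(squared_error(w), 4))
```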

How does gradient descent work?

Gradient descent is one of the well-known ways to find a local minimum.
By changing the weights, we move towards the minimum value of the error function. The weights are changed by taking steps in the negative direction of the function's gradient (derivative). Does this method really work?
After changing the weights, did the overall error reduce?
Let’s calculate the error with the new weights and see the change.

Gradient Descent Method Validation

With our initial set of weights, Y Actual is 0 and Y Predicted is 0.7137, so the overall error was 0.7137.
The new weights give us a predicted value of 0.70655.
In one iteration, we reduced the error from 0.7137 to 0.70655.
The error is reduced by about 1%. By repeating the same process over multiple epochs and training examples, we can reduce the error further.
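The same check, that each gradient step lowers the error, can be run on a one-weight logistic model. This is a sketch under assumptions: the six data points, the starting weight, the learning rate `eta` and the use of half the summed squared error are all chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

xs = [-3, -2, -1, 1, 2, 3]   # hypothetical inputs
ys = [0, 0, 0, 1, 1, 1]      # hypothetical targets

def error(w):
    return 0.5 * sum((y - sigmoid(w * x)) ** 2 for x, y in zip(xs, ys))

def error_gradient(w):
    # dE/dw = -sum((y - p) * p * (1 - p) * x), with p = sigmoid(w * x)
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x)
        total += (y - p) * p * (1 - p) * x
    return -total

w, eta = 1.0, 0.5
errors = [error(w)]
for _ in range(20):
    w -= eta * error_gradient(w)   # step in the negative gradient direction
    errors.append(error(w))

print(round(w, 3), round(errors[0], 4), round(errors[-1], 4))
```

Each entry in `errors` is smaller than the one before it, which is the behaviour the validation above illustrates with a single step.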

