Transfer Function on Machine Learning Notebook
/tags/transfer-function/index.xml
A Simple Neural Network - Simple Performance Improvements
/post/nn-python-tweaks/
Fri, 17 Mar 2017 08:53:55 +0000
<p>The 5th installment of our tutorial on implementing a neural network (NN) in Python. By the end of this tutorial, our NN should perform much more efficiently, giving good results with fewer iterations. We will do this by implementing “momentum” into our network. We will also put in the other transfer functions for each layer.</p>
<div id="toctop"></div>
<ol>
<li><a href="#intro">Introduction</a></li>
<li><a href="#momentum">Momentum</a>
<ol>
<li><a href="#momentumbackground">Background</a></li>
<li><a href="#momentumpython">Momentum in Python</a></li>
<li><a href="#momentumtesting">Testing</a></li>
</ol></li>
<li><a href="#transferfunctions">Transfer Functions</a></li>
</ol>
<h2 id="intro"> Introduction </h2>
<p><a href="#toctop">To contents</a></p>
<p>We’ve come so far! The initial <a href="/post/neuralnetwork">maths</a> was a bit of a slog, as was the <a href="/post/nn-more-maths">vectorisation</a> of that maths, but it was important to be able to implement our NN in Python which we did in our <a href="/post/nn-in-python">previous post</a>. So what now? Well, you may have noticed when running the NN as it stands that it isn’t overly quick. Depending on the randomly initialised weights, it may take the network the full number of <code>maxIterations</code> to converge, and then it may not converge at all! But there is something we can do about it. Let’s learn about, and implement, ‘momentum’.</p>
<h2 id="momentum"> Momentum </h2>
<h3 id="momentumbackground"> Background </h3>
<p><a href="#toctop">To contents</a></p>
<p>Let’s revisit our equation for error in the NN:</p>
<div id="eqerror">$$
\text{E} = \frac{1}{2} \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right)^{2}
$$</div>
<p>This isn’t the only error function that could be used. In fact, there’s a whole field of study in NN about the best error or ‘optimisation’ function that should be used. This one tries to look at the sum of the squared-residuals between the outputs and the expected values at the end of each forward pass (the so-called $l_{2}$-norm). Others e.g. $l_{1}$-norm, look at minimising the sum of the absolute differences between the values themselves. There are more complex error (a.k.a. optimisation or cost) functions, for example those that look at the cross-entropy in the data. There may well be a post in the future about different cost-functions, but for now we will still focus on the equation above.</p>
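<p>To make the difference between the two norms concrete, here is a minimal sketch. The function names <code>l2_error</code> and <code>l1_error</code> and the example numbers are made up for illustration; they are not part of the NN class.</p>

```python
import numpy as np

def l2_error(output, target):
    """Half the sum of squared residuals, as in the equation above."""
    return 0.5 * np.sum((output - target) ** 2)

def l1_error(output, target):
    """Sum of the absolute residuals."""
    return np.sum(np.abs(output - target))

# Made-up outputs and targets for three output nodes
output = np.array([0.9, 0.2, 0.4])
target = np.array([1.0, 0.0, 0.0])

print(l2_error(output, target))
print(l1_error(output, target))
```

<p>For these numbers the squared error comes out at roughly 0.105 and the absolute error at roughly 0.7: squaring shrinks residuals smaller than 1, so the largest residual dominates the $l_{2}$ result.</p>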
<p>Now this function is described as a ‘convex’ function. This is an important property if we are to make our NN converge to the correct answer. Take a look at the two functions below:</p>
<div id="fig1" class="figure_container">
<div class="figure_images">
<img title="convex" src="/img/simpleNN/convex.png" width="35%" hspace="10px"><img title="non-convex" src="/img/simpleNN/non-convex.png" width="35%" hspace="10px">
</div>
<div class="figure_caption">
<font color="blue">Figure 1</font>: A convex (left) and non-convex (right) cost function
</div>
</div>
<p>Let’s say that our current error was represented by the green ball. Our NN will calculate the gradient of its cost function at this point then look for the direction which is going to <em>minimise</em> the error i.e. go down a slope. The NN will feed the result into the back-propagation algorithm which will hopefully mean that on the next iteration, the error will have decreased. For a <em>convex</em> function, this is very straightforward: the NN just needs to keep going in the direction it found on the first run. But look at the <em>non-convex</em> function: our current error (green ball) sits at a point where either direction will take it to a lower error i.e. the function slopes downwards on both sides. If the error goes to the left, it will hit <strong>one</strong> of the possible minima of the function, but this will be a higher minimum (higher final error) than if the error had followed the slope to the right. Clearly the starting point for the error here has a big impact on the final result. Looking down at the 2D perspective (remembering that these are complex multi-dimensional functions), the non-convex case is clearly more ambiguous in terms of the location of the minimum and direction of descent. The convex function, however, nicely guides the error to the minimum with little care of the starting point.</p>
<div id="fig2" class="figure_container">
<div class="figure_images">
<img title="convexcontour" src="/img/simpleNN/convexcontourarrows.png" width="35%" hspace="10px"><img title="non-convexcontour" src="/img/simpleNN/nonconvexcontourarrows.png" width="35%" hspace="10px">
</div>
<div class="figure_caption">
<font color="blue">Figure 2</font>: Contours for a portion of the convex (left) and non-convex (right) cost function
</div>
</div>
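<p>The dependence on the starting point can be seen with a one-dimensional sketch. The cost function below is an arbitrary non-convex polynomial chosen for illustration (it is not the function plotted in the figures), and <code>descend_from</code> is a hypothetical helper that just follows the slope downhill.</p>

```python
def cost(x):
    """An arbitrary non-convex cost with two minima."""
    return x**4 - 3 * x**2 + x

def gradient(x):
    """Derivative of the cost above."""
    return 4 * x**3 - 6 * x + 1

def descend_from(x, learning_rate=0.01, steps=2000):
    """Plain gradient descent: keep stepping against the gradient."""
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x

left = descend_from(-0.5)
right = descend_from(0.5)
print(left, cost(left))
print(right, cost(right))
```

<p>The two runs settle in different minima with different final costs, even though their starting points are only 1 apart.</p>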
<p>So let’s focus on the convex case and explain what <em>momentum</em> is and why it works. Most back propagation implementations you’ll come across include momentum in some form. In its simplest form, it modifies the weight-update equation:</p>
<div>$$
\mathbf{ \Delta W_{JK} = -\eta \vec{\delta}_{K} \vec{ \mathcal{O}_{J}}}
$$</div>
<p>by adding an extra <em>momentum</em> term:</p>
<div>$$
\mathbf{ \Delta W_{JK}\left(t\right) = -\eta \vec{\delta}_{K} \vec{ \mathcal{O}_{J}}} + m \mathbf{\Delta W_{JK}\left(t-1\right)}
$$</div>
<p>The weight delta (the update amount to the weights after BP) now relies on its <em>previous</em> value i.e. the weight delta at iteration $t$ requires the value of itself from $t-1$. The $m$ or momentum term, like the learning rate $\eta$, is just a small number between 0 and 1. What effect does this have?</p>
<p>Using prior information about the network is beneficial as it stops the network firing wildly into the unknown. If it knows the previous weight delta, i.e. the step that led to the current error, it can keep the descent to the minimum pointing in roughly the same direction as before. The effect is that each iteration does not jump around as much as it otherwise would. In this respect momentum acts like the learning rate: it scales how far each update moves the weights. We should be careful though: a large value for $m$ may cause the result to jump past the minimum and back again if combined with a large learning rate. We can think of momentum as changing the path taken to the optimum.</p>
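<p>A minimal sketch of the momentum update on a single weight, assuming a toy cost $E = \frac{1}{2} w^{2}$ (so the gradient is just $w$). The function <code>descend</code> and its starting value are made up for illustration; only the update rule itself comes from the equation above.</p>

```python
def descend(learning_rate, momentum, steps=100, w=5.0):
    """Run gradient descent with momentum on E = 0.5 * w**2."""
    previous_delta = 0.0
    for _ in range(steps):
        gradient = w  # dE/dw for the toy cost
        # Delta W(t) = -eta * gradient + m * Delta W(t-1)
        delta = -learning_rate * gradient + momentum * previous_delta
        w += delta
        previous_delta = delta
    return w

w_plain = descend(learning_rate=0.01, momentum=0.0)
w_momentum = descend(learning_rate=0.01, momentum=0.9)
print(abs(w_plain), abs(w_momentum))
```

<p>With the same small learning rate and the same number of steps, the run with momentum ends much closer to the minimum at $w = 0$.</p>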
<h3 id="momentumpython"> Momentum in Python </h3>
<p><a href="#toctop">To contents</a></p>
<p>So, implementing momentum into our NN should be pretty easy. We will need to provide a momentum term to the <code>backProp</code> method of the NN and also create a new matrix in which to store the weight deltas from the current epoch for use in the subsequent one.</p>
<p>In the <code>__init__</code> method of the NN, we need to create the list of previous weight deltas and then give them some values - they’ll start with zeros:</p>
<pre><code class="language-python">def __init__(self, numNodes):
    """Initialise the NN - setup the layers and initial weights"""
    # Layer info
    self.numLayers = len(numNodes) - 1
    self.shape = numNodes
    # Input/Output data from last run
    self._layerInput = []
    self._layerOutput = []
    # Weight deltas from the previous iteration (used by the momentum term)
    self._previousWeightDelta = []
    # Give each instance its own weight list rather than sharing the class-level one
    self.weights = []
    # Create the weight arrays
    for (l1,l2) in zip(numNodes[:-1],numNodes[1:]):
        self.weights.append(np.random.normal(scale=0.1,size=(l2,l1+1)))
        self._previousWeightDelta.append(np.zeros((l2,l1+1)))
</code></pre>
<p>The only other part of the NN that needs to change is the definition of <code>backProp</code>: we add momentum to its inputs and update the weight equation. Finally, we make sure to save the current weight deltas into the previous-weight-delta matrix:</p>
<pre><code class="language-python">def backProp(self, input, target, trainingRate = 0.2, momentum=0.5):
    """Get the error, deltas and back propagate to update the weights"""
    ...
    # For the layer at `index`: blend the new step with the previous one
    weightDelta = trainingRate * thisWeightDelta + momentum * self._previousWeightDelta[index]
    self.weights[index] -= weightDelta
    # Remember this step for the next call
    self._previousWeightDelta[index] = weightDelta
</code></pre>
<h3 id="momentumtesting"> Testing </h3>
<p><a href="#toctop">To contents</a></p>
<p>Our default values for learning rate and momentum are 0.2 and 0.5 respectively. We can change either of these by passing <code>trainingRate</code> and <code>momentum</code> in the call to <code>backProp</code>. This is the only change to the iteration process:</p>
<pre><code class="language-python">for i in range(maxIterations + 1):
    Error = NN.backProp(Input, Target, trainingRate=0.2, momentum=0.5)
    if i % 2500 == 0:
        print("Iteration {0}\tError: {1:0.6f}".format(i,Error))
    if Error <= minError:
        print("Minimum error reached at iteration {0}".format(i))
        break

# Output from the run:
# Iteration 100000 Error: 0.000076
# Input Output Target
# [0 0] [ 0.00491572] [ 0.]
# [1 1] [ 0.00421318] [ 0.]
# [0 1] [ 0.99586268] [ 1.]
# [1 0] [ 0.99586257] [ 1.]
</code></pre>
<p>Feel free to play around with these numbers; however, it is unlikely that much would change right now. I say this because there is a limit to how good a result we can get when using only the sigmoid function as our activation function. If you go back and read the post on <a href="/post/transfer-functions">transfer functions</a> you’ll see that it’s more common to use <em>linear</em> functions for the output layer. As it stands, the sigmoid function is unable to output a 1 or a 0 because it is asymptotic at these values. Therefore, no matter what learning rate or momentum we use, the network will never be able to get the best output.</p>
<p>This seems like a good time to implement the other transfer functions.</p>
<h2 id="transferfunctions"> Transfer Functions </h2>
<p><a href="#toctop">To contents</a></p>
<p>We’ve already gone through writing the transfer functions in Python in the <a href="/post/transfer-functions">transfer functions</a> post. We’ll just put these under the sigmoid function we defined earlier. I’m going to use <code>sigmoid</code>, <code>linear</code>, <code>gaussian</code> and <code>tanh</code> here.</p>
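<p>As a reminder, here is one way the extra functions might look. This is a sketch that assumes the same <code>Derivative</code> flag used by <code>sigmoid</code>; the exact definitions in the transfer functions post may differ, in particular the form of the Gaussian.</p>

```python
import numpy as np

def linear(x, Derivative=False):
    """Identity function; its gradient is 1 everywhere."""
    if not Derivative:
        return x
    return np.ones_like(x, dtype=float)

def gaussian(x, Derivative=False):
    """Assumed form exp(-x^2), with derivative -2x * exp(-x^2)."""
    if not Derivative:
        return np.exp(-x**2)
    return -2 * x * np.exp(-x**2)

def tanh(x, Derivative=False):
    """Hyperbolic tangent, with derivative 1 - tanh(x)^2."""
    if not Derivative:
        return np.tanh(x)
    return 1.0 - np.tanh(x)**2

x = np.array([-1.0, 0.0, 0.5])
print(linear(x), gaussian(x), tanh(x))
```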
<p>To modify the network, we need to assign each layer its own activation function, so let’s put that in the ‘layer information’ part of the <code>__init__</code> method:</p>
<pre><code class="language-python">def __init__(self, numNodes, transferFunctions=None):
    """Initialise the Network"""
    # Layer information
    self.numLayers = len(numNodes) - 1
    self.shape = numNodes
    if transferFunctions is None:
        # Default: sigmoid for the hidden layers, linear for the output layer
        layerTFs = []
        for i in range(self.numLayers):
            if i == self.numLayers - 1:
                layerTFs.append(linear)
            else:
                layerTFs.append(sigmoid)
    else:
        if len(numNodes) != len(transferFunctions):
            raise ValueError("Number of transfer functions must match the number of layers (including the input layer)")
        elif transferFunctions[0] is not None:
            raise ValueError("The input layer doesn't need a transfer function: give it [None,...]")
        else:
            layerTFs = transferFunctions[1:]
    self.tFunctions = layerTFs
</code></pre>
<p>Let’s go through this. We input into the initialisation a parameter called <code>transferFunctions</code> with a default value of <code>None</code>. If the default is taken, or if the parameter is omitted, we set some defaults. For each layer, we use the <code>sigmoid</code> function, unless it’s the output layer where we will use the <code>linear</code> function. If a list of <code>transferFunctions</code> is given, first, check that it’s a ‘legal’ input. If the number of functions in the list is not the same as the number of layers (given by <code>numNodes</code>) then throw an error. Also, if the first function in the list is not <code>None</code> throw an error, because the first layer shouldn’t have an activation function (it is the input layer). If those two things are fine, go ahead and store the list of functions as <code>layerTFs</code> without the first (element 0) one.</p>
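<p>The same checks can be tried outside the class. The helper <code>resolve_transfer_functions</code> below is hypothetical (it is not part of <code>backPropNN</code>) and the two activation functions are simplified stand-ins; it only mirrors the logic described above so the behaviour is easy to test.</p>

```python
import numpy as np

def sigmoid(x):
    """Stand-in sigmoid, without the derivative flag."""
    return 1 / (1 + np.exp(-x))

def linear(x):
    """Stand-in linear function."""
    return x

def resolve_transfer_functions(numNodes, transferFunctions=None):
    """Return one transfer function per layer, excluding the input layer."""
    numLayers = len(numNodes) - 1
    if transferFunctions is None:
        # Defaults: sigmoid everywhere except a linear output layer
        return [sigmoid] * (numLayers - 1) + [linear]
    if len(numNodes) != len(transferFunctions):
        raise ValueError("Need one entry per layer, including the input layer")
    if transferFunctions[0] is not None:
        raise ValueError("The input layer entry must be None")
    return list(transferFunctions[1:])

print(resolve_transfer_functions((2, 2, 1)))
print(resolve_transfer_functions((2, 2, 1), [None, sigmoid, sigmoid]))
```

<p>Passing a list without the leading <code>None</code>, such as <code>[sigmoid, linear]</code> for a <code>(2, 2, 1)</code> network, raises a <code>ValueError</code> because the length check fails first.</p>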
<p>We next need to replace all of our direct calls to <code>sigmoid</code> and its derivative. These should now refer to the list of functions via an <code>index</code> that depends on the number of the current layer. There are 3 instances of this in our NN: 1 in the forward pass where we call <code>sigmoid</code> directly, and 2 in the <code>backProp</code> method where we call the derivative at the output and hidden layers. So <code>sigmoid(layerInput)</code>, for example, should become:</p>
<pre><code class="language-python">self.tFunctions[index](layerInput)
</code></pre>
<p>Check the updated code <a href="/docs/simpleNN-improvements.py">here</a> if that’s confusing.</p>
<p>Let’s test this out! We’ll modify the call to initialising the NN by adding a list of functions like so:</p>
<pre><code class="language-python">Input = np.array([[0,0],[1,1],[0,1],[1,0]])
Target = np.array([[0.0],[0.0],[1.0],[1.0]])
transferFunctions = [None, sigmoid, linear]
NN = backPropNN((2,2,1), transferFunctions)
</code></pre>
<p>Running the NN like this with the default learning rate and momentum should provide you with an immediate performance boost simply because with the <code>linear</code> function we’re now able to get closer to the target values, reducing the error.</p>
<pre><code class="language-python">Iteration 0 Error: 1.550211
Iteration 2500 Error: 1.000000
Iteration 5000 Error: 0.999999
Iteration 7500 Error: 0.999999
Iteration 10000 Error: 0.999995
Iteration 12500 Error: 0.999969
Minimum error reached at iteration 14543
Input Output Target
[0 0] [ 0.0021009] [ 0.]
[1 1] [ 0.00081154] [ 0.]
[0 1] [ 0.9985881] [ 1.]
[1 0] [ 0.99877479] [ 1.]
</code></pre>
<p>Play around with the number of layers and different combinations of transfer functions as well as tweaking the learning rate and momentum. You’ll soon get a feel for how each changes the performance of the NN.</p>
A Simple Neural Network - With Numpy in Python
/post/nn-in-python/
Wed, 15 Mar 2017 09:55:00 +0000
<p>Part 4 of our tutorial series on Simple Neural Networks. We’re ready to write our Python script! Having gone through the maths, vectorisation and activation functions, we’re now ready to put it all together and write it up. By the end of this tutorial, you will have a working NN in Python, using only numpy, which can be used to learn the output of logic gates (e.g. XOR).</p>
<div id="toctop"></div>
<ol>
<li><a href="#intro">Introduction</a></li>
<li><a href="#transferfunction">Transfer Function</a></li>
<li><a href="#backpropclass">Back Propagation Class</a>
<ol>
<li><a href="#initialisation">Initialisation</a></li>
<li><a href="#forwardpass">Forward Pass</a></li>
<li><a href="#backprop">Back Propagation</a></li>
</ol></li>
<li><a href="#testing">Testing</a></li>
<li><a href="#iterating">Iterating</a></li>
</ol>
<h2 id="intro"> Introduction </h2>
<p><a href="#toctop">To contents</a></p>
<p>We’ve <a href="/post/neuralnetwork">ploughed through the maths</a>, then <a href="/post/nn-more-maths">some more</a>, now we’re finally here! This tutorial will run through the coding up of a simple neural network (NN) in Python. We’re not going to use any fancy packages (though they obviously have their advantages in tools, speed, efficiency…); we’re only going to use numpy!</p>
<p>By the end of this tutorial, we will have built an algorithm which will create a neural network with as many layers (and nodes) as we want. It will be trained by taking in multiple training examples and running the back propagation algorithm many times.</p>
<p>Here are the things we’re going to need to code:</p>
<ul>
<li>The transfer functions</li>
<li>The forward pass</li>
<li>The back propagation algorithm</li>
<li>The update function</li>
</ul>
<p>To keep things nice and contained, the forward pass and back propagation algorithms should be coded into a class. We’re going to expect that we can build a NN by creating an instance of this class which has some internal functions (forward pass, delta calculation, back propagation, weight updates).</p>
<p>First things first… let’s import numpy:</p>
<pre><code class="language-python">import numpy as np
</code></pre>
<p>Now let’s go ahead and get the first bit done:</p>
<h2 id="transferfunction"> Transfer Function </h2>
<p><a href="#toctop">To contents</a></p>
<p>To begin with, we’ll focus on getting the network working with just one transfer function: the sigmoid function. As we discussed in a <a href="/post/transfer-functions">previous post</a> this is very easy to code up because of its simple derivative:</p>
<div >$$
\sigma\left(x_{i} \right) = \frac{1}{1 + e^{ - x_{i} }}, \ \ \ \
\sigma^{\prime}\left( x_{i} \right) = \sigma(x_{i}) \left( 1 - \sigma(x_{i}) \right)
$$</div>
<pre><code class="language-python">def sigmoid(x, Derivative=False):
    if not Derivative:
        return 1 / (1 + np.exp(-x))
    else:
        out = sigmoid(x)
        return out * (1 - out)
</code></pre>
<p>This is a succinct expression which actually calls itself in order to get a value to use in its derivative. We’ve used numpy’s exponential function to create the sigmoid function and created an <code>out</code> variable to hold this in the derivative. Whenever we want to use this function, we can supply the parameter <code>True</code> to get the derivative. We can omit this, or enter <code>False</code>, to just get the output of the sigmoid. This is the same function I used to get the graphs in the <a href="/post/transfer-functions">post on transfer functions</a>.</p>
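<p>A quick sanity check of the function and its flag, using a few made-up input values (the function is repeated here so the snippet runs on its own):</p>

```python
import numpy as np

def sigmoid(x, Derivative=False):
    if not Derivative:
        return 1 / (1 + np.exp(-x))
    else:
        out = sigmoid(x)
        return out * (1 - out)

x = np.array([-2.0, 0.0, 2.0])
values = sigmoid(x)           # outputs squashed into (0, 1)
slopes = sigmoid(x, True)     # gradient of the sigmoid at each x
print(values)
print(slopes)
```

<p>At <code>x = 0</code> the sigmoid returns 0.5 and its derivative is 0.25, which is the steepest point of the curve; the slope falls away symmetrically on either side.</p>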
<h2 id="backpropclass"> Back Propagation Class </h2>
<p><a href="#toctop">To contents</a></p>
<p>I’m fairly new to building my own classes in Python, but for this tutorial, I really relied on the videos of <a href="https://www.youtube.com/playlist?list=PLRyu4ecIE9tibdzuhJr94uQeKnOFkkbq6">Ryan on YouTube</a>. Some of his hacks were very useful so I’ve taken some of those on board, but I’ve made a lot of the variables more self-explanatory.</p>
<p>First we’re going to get the skeleton of the class setup. This means that whenever we create a new variable with the class of <code>backPropNN</code>, it will be able to access all of the functions and variables within itself.</p>
<p>It looks like this:</p>
<pre><code class="language-python">class backPropNN:
    """Class defining a NN using Back Propagation"""
    # Class Members (internal variables that are accessed with backPropNN.member)
    numLayers = 0
    shape = None
    weights = []
    # Class Methods (internal functions that can be called)
    def __init__(self):
        """Initialise the NN - setup the layers and initial weights"""
    # Forward Pass method
    def FP(self):
        """Get the input data and run it through the NN"""
    # TrainEpoch method
    def backProp(self):
        """Get the error, deltas and back propagate to update the weights"""
</code></pre>
<p>We’ve not added any detail to the functions (or methods) yet, but we know there needs to be an <code>__init__</code> method for any class, plus we’re going to want to be able to do a forward pass and then back propagate the error.</p>
<p>We’ve also added a few class members, variables which can be called from an instance of the <code>backPropNN</code> class. <code>numLayers</code> is just that, a count of the number of layers in the network, initialised to <code>0</code>. The <code>shape</code> of the network will return the size of each layer of the network in an array and the <code>weights</code> will return an array of the weights across the network.</p>
<h3 id="initialisation"> Initialisation </h3>
<p><a href="#toctop">To contents</a></p>
<p>We’re going to make the user supply an input variable which is the size of the layers in the network i.e. the number of nodes in each layer: <code>numNodes</code>. This will be an array which is the length of the number of layers (including the input and output layers) where each element is the number of nodes in that layer.</p>
<pre><code class="language-python">def __init__(self, numNodes):
    """Initialise the NN - setup the layers and initial weights"""
    # Layer information
    self.numLayers = len(numNodes) - 1
    self.shape = numNodes
</code></pre>
<p>We’ve told our network to ignore the input layer when counting the number of layers (common practice) and that the shape of the network should be returned as the input array <code>numNodes</code>.</p>
<p>Let’s also initialise the weights. We will take the approach of initialising all of the weights to small, random numbers. To keep the code succinct, we’ll use a neat function, <code>zip</code>. <code>zip</code> is a function which takes two sequences and pairs up the elements in corresponding locations (like a zip). For example:</p>
<pre><code class="language-python">A = [1, 2, 3]
B = [4, 5, 6]
list(zip(A,B))  # list() is needed in Python 3, where zip returns an iterator
# [(1, 4), (2, 5), (3, 6)]
</code></pre>
<p>Why might this be useful? Well, when we talk about weights we’re talking about the connections between layers. Let’s say we have <code>numNodes=(2, 2, 1)</code> i.e. a 2-layer network with 2 inputs, 1 output and 2 nodes in the hidden layer. Then we need to let the algorithm know that we expect two input nodes to send weights to 2 hidden nodes. Then 2 hidden nodes to send weights to 1 output node, or <code>[(2,2), (2,1)]</code>. Note that overall we will have 4 weights from the input to the hidden layer, and 2 weights from the hidden to the output layer.</p>
<p>What is our <code>A</code> and <code>B</code> in the code above that will give us <code>[(2,2), (2,1)]</code>? It’s this:</p>
<pre><code class="language-python">numNodes = (2,2,1)
A = numNodes[:-1]
B = numNodes[1:]
A
# (2, 2)
B
# (2, 1)
list(zip(A,B))  # list() is needed in Python 3, where zip returns an iterator
# [(2, 2), (2, 1)]
</code></pre>
<p>Great! So each pair represents the nodes between which we need to initialise some weights. In fact, the shape of each pair <code>(2,2)</code> is the clue to how many weights we are going to need between each layer e.g. between the input and hidden layers we are going to need <code>(2 x 2) = 4</code> weights.</p>
<p>So <code>for</code> each pair <code>in zip(A,B)</code> (hint hint) we need to <code>append</code> some weights into that empty weight matrix we initialised earlier.</p>
<pre><code class="language-python"># Initialise the weight arrays
for (l1,l2) in zip(numNodes[:-1],numNodes[1:]):
    self.weights.append(np.random.normal(scale=0.1,size=(l2,l1+1)))
</code></pre>
<p>We use <code>self.weights</code> as we’re appending to the class member initialised earlier (in practice it is safer to reset it with <code>self.weights = []</code> inside <code>__init__</code> first, otherwise every instance of the class shares the same class-level list). We’re using the numpy random number generator from a <code>normal</code> distribution. The <code>scale</code> is the standard deviation of that distribution, so numpy draws small numbers centred on 0, and the <code>size</code> says that we want a matrix of results with the shape of the tuple <code>(l2,l1+1)</code>. Huh, <code>+1</code>? Don’t think we’re getting away without including the <em>bias</em> term! We want a random starting point even for the weight connecting the bias node (<code>=1</code>) to the next layer. Ok, but why this way and not <code>(l1+1,l2)</code>? Well, we’re looking for <code>l2</code> connections from each of the <code>l1+1</code> nodes in the previous layer - think of it as (number of observations x number of features). We’re creating a matrix of weights which goes across the nodes and down the weights from each node, or as we’ve seen in our maths tutorial:</p>
<div>$$
W_{ij} = \begin{pmatrix} w_{11} & w_{21} & w_{31} \\ w_{12} &w_{22} & w_{32} \end{pmatrix}, \ \ \ \
W_{jk} = \begin{pmatrix} w_{11} & w_{21} & w_{31} \end{pmatrix}
$$</div>
<p>These are the weights between the first two layers and the last two layers respectively, with node 3 being the bias node.</p>
<p>Before we move on, let’s also put in some placeholders in <code>__init__</code> for the input and output values to each layer:</p>
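<p>To check the shapes for the <code>(2, 2, 1)</code> example, here is a small standalone sketch that builds the weights in a plain list rather than on the class:</p>

```python
import numpy as np

numNodes = (2, 2, 1)

# One weight matrix per pair of adjacent layers, with an extra column for the bias
weights = [np.random.normal(scale=0.1, size=(l2, l1 + 1))
           for (l1, l2) in zip(numNodes[:-1], numNodes[1:])]

shapes = [w.shape for w in weights]
print(shapes)  # [(2, 3), (1, 3)]
```

<p>The shapes match the two matrices above: 2 rows by 3 columns between the input and hidden layers, and 1 row by 3 columns between the hidden and output layers.</p>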
<pre><code class="language-python">self._layerInput = []
self._layerOutput = []
</code></pre>
<h3 id="forwardpass"> Forward Pass </h3>
<p><a href="#toctop">To contents</a></p>
<p>We’ve now initialised our network enough to be able to focus on the forward pass (FP).</p>
<p>Our <code>FP</code> function needs to have the input data. It needs to know how many training examples it’s going to have to go through, and it will need to reassign the inputs and outputs at each layer, so let’s clear those at the beginning:</p>
<pre><code class="language-python">def FP(self,input):
    numExamples = input.shape[0]
    # Clear out the values stored from the previous run
    self._layerInput = []
    self._layerOutput = []
</code></pre>
<p>So let’s propagate. We already have a matrix of (randomly initialised) weights. We just need to know what the input is to each of the layers. We’ll separate this into the first hidden layer, and subsequent hidden layers.</p>
<p>For the first hidden layer we will write:</p>
<pre><code class="language-python">layerInput = self.weights[0].dot(np.vstack([input.T, np.ones([1, numExamples])]))
</code></pre>
<p>Let’s break this down:</p>
<p>Our training example inputs need to match the weights that we’ve already created. We expect that our examples will come in rows of an array with columns acting as features, something like <code>[(0,0), (0,1),(1,1),(1,0)]</code>. We can use numpy’s <code>vstack</code> to put each of these examples one on top of the other.</p>
<p>Each of the input examples is a vector which will be multiplied by the weight matrix to get the input to the current layer:</p>
<div>$$
\mathbf{x_{J}} = \mathbf{W_{IJ} \vec{\mathcal{O}}_{I}}
$$</div>
<p>where $\mathbf{x_{J}}$ are the inputs to the layer $J$ and $\mathbf{\vec{\mathcal{O}}_{I}}$ is the output from the previous layer (the input examples in this case).</p>
<p>So given a set of $n$ input examples we <code>vstack</code> them so we just have <code>(n x numInputNodes)</code>. We want to transpose this, <code>(numInputNodes x n)</code> such that we can multiply by the weight matrix which is <code>(numOutputNodes x numInputNodes)</code>. This gives an input to the layer which is <code>(numOutputNodes x n)</code> as we expect.</p>
<p><strong>Note</strong> we’re actually going to do the transposition first before doing the <code>vstack</code> - this does exactly the same thing, but it also allows us to more easily add the bias nodes in to each input.</p>
<p>Bias! Let’s not forget this: we add a bias node which always has the value <code>1</code> to each input (including the input layer). So our actual method is:</p>
<ol>
<li>Transpose the inputs <code>input.T</code></li>
<li>Add a row of ones to the bottom (one bias node for each input) <code>[input.T, np.ones([1,numExamples])]</code></li>
<li><code>vstack</code> this to compact the array <code>np.vstack(...)</code></li>
<li>Multiply with the weights connecting from the previous to the current layer <code>self.weights[0].dot(...)</code></li>
</ol>
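<p>The four steps can be traced on the XOR-style inputs with a short sketch. The variable names <code>inputs</code>, <code>with_bias</code> and <code>weights0</code> are made up for this example, and the weights are drawn fresh rather than taken from the class.</p>

```python
import numpy as np

# Four training examples (rows) with two features each (columns)
inputs = np.array([[0, 0], [0, 1], [1, 1], [1, 0]])
numExamples = inputs.shape[0]

# Steps 1-3: transpose, then stack a row of ones (the bias node) underneath
with_bias = np.vstack([inputs.T, np.ones([1, numExamples])])
print(with_bias.shape)  # (3, 4): two features plus the bias, one column per example

# Step 4: weights from 2 inputs (+ bias) to 2 hidden nodes
weights0 = np.random.normal(scale=0.1, size=(2, 3))
layerInput = weights0.dot(with_bias)
print(layerInput.shape)  # (2, 4): one row per hidden node, one column per example
```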
<p>But what about the subsequent hidden layers? We’re not using the input examples in these layers, we are using the output from the previous layer <code>[self._layerOutput[-1]]</code> (multiplied by the weights).</p>
<pre><code class="language-python">for index in range(self.numLayers):
    # Get input to the layer
    if index == 0:
        layerInput = self.weights[0].dot(np.vstack([input.T, np.ones([1, numExamples])]))
    else:
        layerInput = self.weights[index].dot(np.vstack([self._layerOutput[-1],np.ones([1,numExamples])]))
</code></pre>
<p>Make sure to save this input, but also to now calculate the output of the current layer i.e.:</p>
<div>$$
\mathbf{ \vec{ \mathcal{O}}_{J}} = \sigma(\mathbf{x_{J}})
$$</div>
<pre><code class="language-python">self._layerInput.append(layerInput)
self._layerOutput.append(sigmoid(layerInput))
</code></pre>
<p>Finally, make sure that we’re returning the data from our output layer the same way that we got it:</p>
<pre><code class="language-python">return self._layerOutput[-1].T
</code></pre>
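<p>Putting the pieces of the forward pass together, here is a sketch written as a standalone function rather than a class method, so it can be run on its own. The name <code>forward_pass</code> and its return values are choices made for this example; the steps inside follow the snippets above.</p>

```python
import numpy as np

def sigmoid(x, Derivative=False):
    if not Derivative:
        return 1 / (1 + np.exp(-x))
    out = sigmoid(x)
    return out * (1 - out)

def forward_pass(weights, inputs):
    """Run the examples (rows of `inputs`) through every layer in turn."""
    numExamples = inputs.shape[0]
    layerInputs, layerOutputs = [], []
    for index in range(len(weights)):
        # The first layer sees the raw examples, later layers see the previous output
        previous = inputs.T if index == 0 else layerOutputs[-1]
        layerInput = weights[index].dot(np.vstack([previous, np.ones([1, numExamples])]))
        layerInputs.append(layerInput)
        layerOutputs.append(sigmoid(layerInput))
    # Return the output in the same orientation as the input: one row per example
    return layerOutputs[-1].T, layerInputs, layerOutputs

numNodes = (2, 2, 1)
weights = [np.random.normal(scale=0.1, size=(l2, l1 + 1))
           for (l1, l2) in zip(numNodes[:-1], numNodes[1:])]
inputs = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])

output, _, _ = forward_pass(weights, inputs)
print(output.shape)  # (4, 1): one output value per training example
```

<p>With untrained weights this close to zero, every output sits near 0.5; it is the back propagation step that moves them towards the targets.</p>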
<h3 id="backprop">Back Propagation</h3>
<p><a href="#toctop">To contents</a></p>
<p>We’ve successfully sent the data from the input layer to the output layer using some initially randomised weights <strong>and</strong> we’ve included the bias term (a kind of threshold on the activation functions). Our vectorised equations from the previous post will now come into play:</p>
<div>$$
\begin{align}
\mathbf{\vec{\delta}_{K}} &= \sigma^{\prime}\left( \mathbf{W_{JK}}\mathbf{\vec{\mathcal{O}}_{J}} \right) * \left( \mathbf{\vec{\mathcal{O}}_{K}} - \mathbf{T_{K}}\right) \\[0.5em]
\mathbf{ \vec{ \delta }_{J}} &= \sigma^{\prime} \left( \mathbf{ W_{IJ} \mathcal{O}_{I} } \right) * \mathbf{ W^{\intercal}_{JK}} \mathbf{ \vec{\delta}_{K}}
\end{align}
$$</div>
<div>$$
\begin{align}
\mathbf{W_{JK}} + \Delta \mathbf{W_{JK}} &\rightarrow \mathbf{W_{JK}}, \ \ \ \Delta \mathbf{W_{JK}} = -\eta \mathbf{ \vec{ \delta }_{K}} \mathbf{ \vec { \mathcal{O} }_{J}} \\[0.5em]
\vec{\theta} + \Delta \vec{\theta} &\rightarrow \vec{\theta}, \ \ \ \Delta \vec{\theta} = -\eta \mathbf{ \vec{ \delta }_{K}}
\end{align}
$$</div>
<p>With $*$ representing an elementwise multiplication between the matrices.</p>
<p>First, let’s initialise some variables and get the error on the output of the output layer. We assume that the target values have been formatted in the same way as the input values i.e. they are a row-vector per input example. In our forward propagation method, the outputs are stored as column-vectors, thus the targets have to be transposed. We will need to supply the input data, the target data and $\eta$, the learning rate, which we will set at some small number for default. So we start back propagation by first initialising a placeholder for the deltas and getting the number of training examples before running them through the <code>FP</code> method:</p>
<pre><code class="language-python">def backProp(self, input, target, trainingRate = 0.2):
    """Get the error, deltas and back propagate to update the weights"""
    delta = []
    numExamples = input.shape[0]
    # Do the forward pass
    self.FP(input)
    # Start at the output layer (the last index)
    index = self.numLayers - 1
    output_delta = self._layerOutput[index] - target.T
    error = np.sum(output_delta**2)
</code></pre>
<p>We know from previous posts that the error is squared to get rid of the negatives. From this we compute the deltas for the output layer:</p>
<pre><code class="language-python">delta.append(output_delta * sigmoid(self._layerInput[index], True))
</code></pre>
<p>We now have the error but need to know what direction to alter the weights in, thus the gradient at the inputs to the layer needs to be known. So, we get the gradient of the activation function at the input to the layer and get the product with the error. Notice we’ve supplied <code>True</code> to the sigmoid function to get its derivative.</p>
<p>This is the delta for the output layer. So this calculation is only done when we’re considering the index at the end of the network. We should be careful that when telling the algorithm that this is the “last layer” we take account of the zero-indexing in Python i.e. the last layer is <code>self.numLayers - 1</code> i.e. in a network with 2 layers, <code>layer[2]</code> does not exist.</p>
<p>We also need to get the deltas of the intermediate hidden layers. To do this, (according to our equations above) we have to ‘pull back’ the delta from the output layer first. More accurately, for any hidden layer, we pull back the delta from the <em>next</em> layer, which may well be another hidden layer. These deltas from the <em>next</em> layer are multiplied by the weights from the <em>next</em> layer <code>[index + 1]</code>, before getting the product with the sigmoid derivative evaluated at the <em>current</em> layer.</p>
<p><strong>Note</strong>: this is <em>back</em> propagation. We have to start at the end and work back to the beginning. We use Python’s built-in <code>reversed</code> function in our loop to ensure that the algorithm considers the layers in reverse order.</p>
<p>Combining this into one method:</p>
<pre><code class="language-python"># Calculate the deltas
for index in reversed(range(self.numLayers)):
    if index == self.numLayers - 1:
        # If the output layer, then compare to the target values
        output_delta = self._layerOutput[index] - target.T
        error = np.sum(output_delta**2)
        delta.append(output_delta * sigmoid(self._layerInput[index], True))
    else:
        # If a hidden layer, compare to the following layer's delta
        delta_pullback = self.weights[index + 1].T.dot(delta[-1])
        delta.append(delta_pullback[:-1,:] * sigmoid(self._layerInput[index], True))
</code></pre>
<p>Pick this piece of code apart. This is an important snippet as it calculates all of the deltas for all of the nodes in the network. Be sure that we understand:</p>
<ol>
<li>This is a <code>reversed</code> loop because we want to deal with the last layer first</li>
<li>The delta of the output layer is the residual between the output and target multiplied with the gradient (derivative) of the activation function <em>at the current layer</em>.</li>
<li>The delta of a hidden layer first needs the product of the <em>subsequent</em> layer’s delta with the <em>subsequent</em> layer’s weights. This is then multiplied with the gradient of the activation function evaluated at the <em>current</em> layer.</li>
</ol>
<p>Double check that this matches up with the equations above too! We can double check the matrix multiplication. For the output layer:</p>
<p><code>output_delta</code> = (numOutputNodes x 1) - (1 x numOutputNodes).T = (numOutputNodes x 1)<br />
<code>error</code> = sum( (numOutputNodes x 1) **2 ) = a single number<br />
<code>delta</code> = (numOutputNodes x 1) * sigmoid( (numOutputNodes x 1) ) = (numOutputNodes x 1)</p>
<p>For the hidden layers (take the one previous to the output as example), remembering that the weights carry an extra column for the bias, which is why we drop the last row with <code>[:-1,:]</code>:</p>
<p><code>delta_pullback</code> = (numOutputNodes x (numHiddenNodes + 1)).T.dot(numOutputNodes x 1) = ((numHiddenNodes + 1) x 1)<br />
<code>delta</code> = (numHiddenNodes x 1) * sigmoid ( (numHiddenNodes x 1) ) = (numHiddenNodes x 1)</p>
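<p>If you’d rather let numpy do the checking, here is a rough sketch of the same shapes for a single example. The layer sizes, random values and variable names are made up for illustration, and it assumes the weight matrix carries an extra bias column (which is why the code strips the last row of the pull-back).</p>

```python
import numpy as np

def sigmoid(x, derivative=False):
    out = 1.0 / (1.0 + np.exp(-x))
    return out * (1.0 - out) if derivative else out

num_hidden, num_output = 2, 1           # hypothetical layer sizes
rng = np.random.default_rng(0)

hidden_input = rng.normal(size=(num_hidden, 1))
output_input = rng.normal(size=(num_output, 1))
# One row per output node, one column per hidden node plus one for the bias
output_weights = rng.normal(size=(num_output, num_hidden + 1))
target = np.array([[1.0]])              # (1 x numOutputNodes)

# Output layer
output_delta = sigmoid(output_input) - target.T
error = np.sum(output_delta ** 2)
delta_out = output_delta * sigmoid(output_input, True)

# Hidden layer: pull back, drop the bias row, apply the gradient
delta_pullback = output_weights.T.dot(delta_out)
delta_hidden = delta_pullback[:-1, :] * sigmoid(hidden_input, True)

print(output_delta.shape, delta_out.shape, delta_pullback.shape, delta_hidden.shape)
```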
<p>Hurray! We have the delta at each node in our network. We can use them to update the weights for each layer in the network. Remember, to update the weights between layer $J$ and $K$ we need to use the output of layer $J$ and the deltas of layer $K$. This means we need to keep track of the index of the layer we’re currently working on ($J$) and the index of the delta layer ($K$) - not forgetting about the zero-indexing in Python:</p>
<pre><code class="language-python">for index in range(self.numLayers):
    delta_index = self.numLayers - 1 - index
</code></pre>
<p>Let’s first get the outputs from each layer:</p>
<pre><code class="language-python">    if index == 0:
        layerOutput = np.vstack([input.T, np.ones([1, numExamples])])
    else:
        layerOutput = np.vstack([self._layerOutput[index - 1], np.ones([1, self._layerOutput[index - 1].shape[1]])])
</code></pre>
<p>The output of the input layer is just the input examples (which we’ve <code>vstack</code>-ed again), and the output from the other layers we take from the calculation in the forward pass (making sure to add the bias term on the end).</p>
<p>For the current <code>index</code> (layer) let’s use this <code>layerOutput</code> to get the change in weight. We will use a few neat tricks to make this succinct:</p>
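<p>Here’s a quick sketch of what that <code>vstack</code> call produces, using four 2-value examples. The variable names are just for illustration.</p>

```python
import numpy as np

# Four examples with two input values each: (numExamples x numInputs)
inputs = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])
num_examples = inputs.shape[0]

# Transpose so each column is one example, then stack a row of ones underneath
layer_output = np.vstack([inputs.T, np.ones([1, num_examples])])

print(layer_output.shape)  # one row per input node plus one bias row
print(layer_output)
```

<p>Each column is now one example with a 1 tacked on the end, ready to be multiplied with a weight matrix that includes the bias column.</p>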
<pre><code class="language-python">    thisWeightDelta = np.sum(
        layerOutput[None,:,:].transpose(2,0,1) * delta[delta_index][None,:,:].transpose(2,1,0),
        axis = 0)
</code></pre>
<p>Break it down. We’re looking for $\mathbf{ \vec{ \delta }_{K}} \mathbf{ \vec { \mathcal{O} }_{J}} $ so it’s the delta at <code>delta_index</code>, the next layer along.</p>
<p>We want to be able to deal with all of the input training examples simultaneously. This requires a bit of fancy slicing and transposing of the matrices. Take a look: by calling <code>vstack</code> we made all of the input data and bias terms live in the same numpy array. When we slice this array with the <code>[None,:,:]</code> argument, it tells numpy to take all (<code>:</code>) the data in the rows and columns, shift it to the 2nd and 3rd dimensions and put a new, empty dimension of length 1 (<code>None</code>) at the front. We do this to create the three dimensions which we can now transpose into. Calling <code>transpose(2,0,1)</code> instructs numpy to move the dimensions of the data around so that the examples come first. This creates an array where each example now lives in its own plane. The same is done for the deltas of the subsequent layer, but being careful to transpose them in the opposite direction (<code>transpose(2,1,0)</code>) so that, for each example, every delta is multiplied with every output. The <code>axis = 0</code> tells <code>np.sum</code> to add the results up over the examples, leaving a single matrix with the same shape as the weights of this layer.</p>
<p>This looks incredibly complicated. It can be broken down into a for-loop over the input examples, but this reduces the efficiency of the network. Taking advantage of the numpy array like this keeps our calculations fast. In reality, if you’re struggling with this particular part, just copy and paste it, forget about it and be happy with yourself for understanding the maths behind back propagation, even if this random bit of Python is perplexing.</p>
<p>Anyway. Let’s take this set of weight deltas and put back the $\eta$. In the code we’ll call this the <code>trainingRate</code>. It’s called a lot of things, and ‘learning rate’ seems to be the most common. We’ll update the weights by making sure to include the <code>-</code> from the $-\eta$.</p>
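<p>If you want to convince yourself that the one-liner really is the same as looping over the examples, here is a small sketch comparing the two. The sizes and random values are made up; the only thing taken from the network is the slicing and transposing expression itself.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
num_prev, num_next, num_examples = 3, 2, 4   # hypothetical sizes

# Outputs of layer J (bias row included) and deltas of layer K,
# both stored with one column per example
layer_output = rng.normal(size=(num_prev, num_examples))
delta = rng.normal(size=(num_next, num_examples))

# The vectorised version from the network
vectorised = np.sum(
    layer_output[None, :, :].transpose(2, 0, 1) * delta[None, :, :].transpose(2, 1, 0),
    axis=0)

# The same thing as an explicit loop: one outer product per example, summed
looped = np.zeros((num_next, num_prev))
for n in range(num_examples):
    looped += np.outer(delta[:, n], layer_output[:, n])

print(np.allclose(vectorised, looped))
```

<p>Both give a (numNextNodes x numPrevNodes) matrix, which is exactly the shape of the weights being updated. The same result also falls out of <code>delta.dot(layer_output.T)</code>, if you prefer plain matrix multiplication.</p>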
<pre><code class="language-python">    weightDelta = trainingRate * thisWeightDelta
    self.weights[index] -= weightDelta
</code></pre>
<p>The <code>-=</code> is Python shorthand for: take the current value and subtract the value of <code>weightDelta</code>.</p>
<p>To finish up, we want our back propagation to return the current error in the network, so:</p>
<pre><code class="language-python">return error
</code></pre>
<h2 id="testing"> A Toy Example</h2>
<p><a href="#toctop">To contents</a></p>
<p>Believe it or not, that’s it! The fundamentals of forward and back propagation have now been implemented in Python. If you want to double check your code, have a look at my completed .py <a href="/docs/simpleNN.py">here</a>.</p>
<p>Let’s test it!</p>
<pre><code class="language-python">Input = np.array([[0,0],[1,1],[0,1],[1,0]])
Target = np.array([[0.0],[0.0],[1.0],[1.0]])
NN = backPropNN((2,2,1))
Error = NN.backProp(Input, Target)
Output = NN.FP(Input)
print('Input \tOutput \t\tTarget')
for i in range(Input.shape[0]):
    print('{0}\t {1} \t{2}'.format(Input[i], Output[i], Target[i]))
</code></pre>
<p>This will provide 4 input examples and the expected targets. We create an instance of the network called <code>NN</code> with 2 layers (2 nodes in the hidden and 1 node in the output layer). We make <code>NN</code> do <code>backProp</code> with the input and target data and then get the output from the final layer by running our input through the network with a <code>FP</code>. The printout is self-explanatory. Give it a try!</p>
<pre><code>Input Output Target
[0 0] [ 0.51624448] [ 0.]
[1 1] [ 0.51688469] [ 0.]
[0 1] [ 0.51727559] [ 1.]
[1 0] [ 0.51585529] [ 1.]
</code></pre>
<p>We can see that the network has taken our inputs, and we have some outputs too. They’re not great, and all seem to live around the same value. This is because we initialised the weights across the network to similarly small random values and have only updated them once. We need to repeat the <code>FP</code> and <code>backProp</code> process many times in order to keep updating the weights.</p>
<h2 id="iterating"> Iterating </h2>
<p><a href="#toctop">To contents</a></p>
<p>Iteration is very straightforward. We just tell our algorithm to repeat a maximum of <code>maxIterations</code> times or until the <code>Error</code> is below <code>minError</code> (whichever comes first). As the weights are stored internally within <code>NN</code>, every time we call the <code>backProp</code> method it uses the latest, internally stored weights and doesn’t start again - the weights are only initialised once upon creation of <code>NN</code>.</p>
<pre><code class="language-python">maxIterations = 100000
minError = 1e-5
for i in range(maxIterations + 1):
    Error = NN.backProp(Input, Target)
    if i % 2500 == 0:
        print("Iteration {0}\tError: {1:0.6f}".format(i,Error))
    if Error <= minError:
        print("Minimum error reached at iteration {0}".format(i))
        break
</code></pre>
<p>Here’s the end of my output from the first run:</p>
<pre><code>Iteration 100000 Error: 0.000291
Input Output Target
[0 0] [ 0.00780385] [ 0.]
[1 1] [ 0.00992829] [ 0.]
[0 1] [ 0.99189799] [ 1.]
[1 0] [ 0.99189943] [ 1.]
</code></pre>
<p>Much better! The error is very small and the outputs are very close to the correct value. However, they’re not completely right. We can do better by implementing different activation functions, which we will do in the next tutorial.</p>
<p><strong>Please</strong> let me know if anything is unclear, or there are mistakes. Let me know how you get on!</p>