Neural Network on Machine Learning Notebook
/tags/neural-network/index.xml
Recent content in Neural Network on Machine Learning Notebook
Data Augmentations for n-Dimensional Image Input to CNNs
/post/dataaug/
Thu, 04 Jan 2018 10:13:20 +0000
<p>One of the greatest limiting factors for training effective deep learning frameworks is the availability, quality and organisation of the <em>training data</em>. To be good at classification tasks, we need to show our CNNs <em>etc.</em> as many examples as we possibly can. However, this is not always possible, especially in situations where the training data is hard to collect, e.g. medical image data. In this post, we will learn how to apply <em>data augmentation</em> strategies to n-Dimensional images to get the most out of our limited number of examples.</p>
<p></p>
<h2 id="intro"> Introduction </h2>
<p>If we take any image, like our little Android below, and we shift all of the data in the image to the right by a single pixel, you may struggle to see any difference visually. However, numerically, this may as well be a completely different image! Imagine taking a stack of 10 of these images, each shifted by a single pixel compared to the previous one. Now consider the pixels in the images at [20, 25] or some arbitrary location. Focusing on that point, each pixel has a different colour, different average surrounding intensity etc. A CNN takes these values into account when performing convolutions and deciding upon weights. If we supplied this set of 10 images to a CNN, we would effectively be teaching it that it should be invariant to these kinds of translations.</p>
<div style="width:100%; text-align:center;">
<div style="text-align:center; display:inline-block; width:29%; margin:auto;min-width:325px;">
<img title="Natural Image RGB" style="border: 2px solid black;" height=300 src="/img/augmentation/android.jpg" ><br>
<b>Android</b>
</div>
<div style="text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;">
<img title="Natural Image Grayscale" style="border: 2px solid black;" height=300 src="/img/augmentation/android1px.png"><br>
<b>Shifted 1 pixel right</b>
</div>
<div style="text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;">
<img title="Natural Image Grayscale" style="border: 2px solid black;" height=300 src="/img/augmentation/android10px.png"><br>
<b>Shifted 10 pixels right</b>
</div>
</div>
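<p>To make the "numerically different" point concrete, here is a minimal sketch. It uses a small random array as a stand-in for the Android picture (an assumption, not the real image) and counts how many pixel positions change value after a one-pixel shift to the right.</p>

```python
import numpy as np

# Hypothetical 8x8 grayscale image with random intensities (stand-in for a real picture)
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(8, 8))

# Shift every column one pixel to the right, repeating the left-most column to fill the gap
shifted = np.empty_like(image)
shifted[:, 1:] = image[:, :-1]
shifted[:, 0] = image[:, 0]

# Fraction of pixel positions whose value differs between the two images
changed = np.mean(image != shifted)
print(f"{changed:.0%} of pixel positions changed value")
```

<p>Natural images are smoother than random noise, so the individual differences are smaller there, but the principle is the same: the values the network sees at a fixed location are different.</p>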
<p>Of course, translations are not the only way in which an image can change, but still <em>visually</em> be the same image. Consider rotating the image by even a single degree, or 5 degrees. It’s still an Android. Training a CNN without including translated and rotated versions of the image may cause the CNN to <strong>overfit</strong> and assume that all images of Androids have to be perfectly upright and centered.</p>
<p>Providing deep learning frameworks with images that are translated, rotated, scaled, intensified and flipped is what we mean when we talk about <em>data augmentation</em>.</p>
<p>In this post we’ll look at how to apply these transformations to an image, even in 3D, and see how it affects the performance of a deep learning framework. We will use an image from <em>flickr</em> user <a href="https://www.flickr.com/photos/andy_emcee/6416366321" title="Cat and Dog Image">andy_emcee</a> as an example of a 2D natural image. As this is an RGB (colour) image it has shape [512, 640, 3], one layer for each colour channel. We could take one layer to make this grayscale and truly 2D, but most images we deal with will be colour so let’s leave it. For 3D we will use a 3D MRI scan.</p>
<div style="width:100%; text-align:center;">
<div style="text-align:center; display:inline-block; width:49%; margin:auto;min-width:350px;">
<img title="Natural Image RGB" height=300 src="/img/augmentation/naturalimg.jpg"><br>
<b>RGB Image shape=[512, 640, 3]</b>
</div>
</div>
<h2 id="augs"> Augmentations </h2>
<p>As usual, we are going to write our augmentation functions in Python. We’ll just be using simple functions from <code>numpy</code> and <code>scipy</code>.</p>
<h3 id="translate"> Translation </h3>
<p>In our functions, <code>image</code> is a 2D or 3D array. If it’s a 3D array, we need to be careful about specifying our translation directions in the argument called <code>offset</code>. We don’t really want to move images in the <code>z</code> direction for a couple of reasons. Firstly, if it’s a 2D colour image, the third dimension will be the colour channel: moving the image through this dimension by <code>-2</code>, <code>2</code> or more than these makes it all red, all blue or all black respectively. Secondly, in a full 3D image, the third dimension is often the smallest, e.g. in most medical scans. In our translation function below, the <code>offset</code> is given as a length-2 array defining the shift in the <code>y</code> and <code>x</code> directions respectively (don’t forget that index 0 selects the horizontal row in Python). We hard-code the z-direction shift to <code>0</code> but you’re welcome to change this if your use-case demands it. To ensure we get integer-pixel shifts, we enforce type <code>int</code> too.</p>
<pre><code class="language-python">from scipy import ndimage
def translateit(image, offset, isseg=False):
    # nearest-neighbour for label maps, order-5 spline for intensity images
    order = 0 if isseg else 5
    return ndimage.shift(image, (int(offset[0]), int(offset[1]), 0), order=order, mode='nearest')
</code></pre>
<p>Here we have also provided the option for what kind of interpolation we want to perform: <code>order = 0</code> means to just use the nearest-neighbour pixel intensity and <code>order = 5</code> means to perform B-spline interpolation with order 5 (taking into account many pixels around the target). This is triggered with a Boolean argument to the <code>translateit</code> function called <code>isseg</code>, so named because when dealing with image segmentations, we want to keep their integer class numbers and not get a result which is a float with a value between two classes. This is not a problem with the actual image as we want to retain as much visual smoothness as possible (though there is an argument that we’re introducing data which didn’t exist in the original image). Similarly, when we move our image, we will leave a gap along the edges it has moved away from. We need a way to fill in this gap: by default <code>shift</code> will use a constant value set to <code>0</code>. This may not be helpful in some cases, so it’s best to set the <code>mode</code> to <code>'nearest'</code>, which takes the closest pixel value and replicates it. It’s barely noticeable with small shifts but looks wrong at larger offsets. We need to be careful and only apply small translations to our data.</p>
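<p>The two choices discussed above can be seen on a tiny array. This is only a sketch with made-up 1D data: whole-pixel shifts hide most of the interpolation effect, so a half-pixel shift is used here to make the blending of labels visible (the same thing happens when rotating or scaling a segmentation).</p>

```python
import numpy as np
from scipy import ndimage

# Hypothetical 1D "segmentation" containing the class labels 0 and 2
seg = np.array([0.0, 0.0, 2.0, 2.0, 0.0, 0.0])

# Half-pixel shift: nearest-neighbour keeps the labels, a cubic spline blends them
nearest = ndimage.shift(seg, 0.5, order=0, mode='nearest')
spline = ndimage.shift(seg, 0.5, order=3, mode='nearest')
print(np.unique(nearest))    # only values that were already in seg
print(np.round(spline, 2))   # contains in-between values that are not valid labels

# Filling the gap left behind by a 2-pixel shift
row = np.array([5.0, 6.0, 7.0, 8.0])
const_fill = ndimage.shift(row, 2, order=0, mode='constant')  # gap filled with 0
near_fill = ndimage.shift(row, 2, order=0, mode='nearest')    # gap filled with the edge value
print(const_fill, near_fill)
```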
<div style="width:100%; text-align:center;">
<div style="text-align:center; display:inline-block; width:29%; margin:auto;min-width:325px;">
<img title="Natural Image RGB" style="border: 2px solid black;" height=300 src="/img/augmentation/naturalimg.jpg" ><br>
<b>Original Image</b>
</div>
<div style="text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;">
<img title="Natural Image Grayscale" style="border: 2px solid black;" height=300 src="/img/augmentation/naturalimgtrans5px.png"><br>
<b>Shifted 5 pixels right</b>
</div>
<div style="text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;">
<img title="Natural Image Grayscale" style="border: 2px solid black;" height=300 src="/img/augmentation/naturalimgtrans25px.png"><br>
<b>Shifted 25 pixels right</b>
</div>
</div>
<div style="width:100%; text-align:center;">
<div style="text-align:center; display:inline-block; width:29%; margin:auto;min-width:325px;">
<img title="CMR Image" height=300 src="/img/augmentation/cmrimg.png" >
<img title="CMR Segmentation" height=300 src="/img/augmentation/cmrseg.png" ><br>
<b>Original Image and Segmentation</b>
</div>
<div style="text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;">
<img title="CMR Image" height=300 src="/img/augmentation/cmrimgtrans1.png">
<img title="CMR Segmentation" height=300 src="/img/augmentation/cmrsegtrans1.png"><br>
<b>Shifted [-3, 1] pixels</b>
</div>
<div style="text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;">
<img title="CMR Image" height=300 src="/img/augmentation/cmrimgtrans2.png">
<img title="CMR Segmentation" height=300 src="/img/augmentation/cmrsegtrans2.png"><br>
<b>Shifted [4, -5] pixels</b>
</div>
</div>
<h3 id="scale"> Scaling </h3>
<p>When scaling an image, i.e. zooming in and out, we want to increase or decrease the area our image takes up whilst keeping the image dimensions the same. We scale our image by a certain <code>factor</code>. A <code>factor > 1.0</code> means the image scales up, and <code>factor < 1.0</code> scales the image down. Note that we should provide a factor for each dimension: if we want to keep the same number of layers or slices in our image, we should set the last value to <code>1.0</code>. To determine the intensity of the resulting image at each pixel, we take the lattice (grid) on which each pixel sits and use it to perform <em>interpolation</em> of the surrounding pixel intensities. <code>scipy</code> provides a handy function for this called <code>zoom</code>.</p>
<p>The definition is probably more complex than one would think:</p>
<pre><code class="language-python">import numpy as np
from scipy import ndimage

def scaleit(image, factor, isseg=False):
    # nearest-neighbour for label maps, cubic spline for intensity images
    order = 0 if isseg else 3
    factor = float(factor)
    height, width, depth = image.shape
    if factor < 1.0:
        # zoom out: paste the shrunken image into the centre of a blank canvas
        zheight = int(np.round(factor * height))
        zwidth = int(np.round(factor * width))
        newimg = np.zeros_like(image)
        row = (height - zheight) // 2
        col = (width - zwidth) // 2
        newimg[row:row+zheight, col:col+zwidth, :] = ndimage.zoom(image, (factor, factor, 1.0), order=order, mode='nearest')[0:zheight, 0:zwidth, :]
        return newimg
    elif factor > 1.0:
        # zoom in: enlarge only the central region that fills the frame after scaling
        cheight = int(np.ceil(height / factor))
        cwidth = int(np.ceil(width / factor))
        row = (height - cheight) // 2
        col = (width - cwidth) // 2
        newimg = ndimage.zoom(image[row:row+cheight, col:col+cwidth, :], (factor, factor, 1.0), order=order, mode='nearest')
        # trim any extra rows and columns introduced by rounding
        extrah = (newimg.shape[0] - height) // 2
        extraw = (newimg.shape[1] - width) // 2
        return newimg[extrah:extrah+height, extraw:extraw+width, :]
    else:
        return image
</code></pre>
<p>There are three possibilities that we need to consider: we are scaling up, scaling down or not scaling at all. In each case, we want to return an array that is <em>equal in size</em> to the input <code>image</code>. For the scaling down case, this involves making a blank image the same shape as the input, and finding the corresponding box in the resulting scaled image. For scaling up, it’s unnecessary to perform the scaling on the whole image, just the portion that will be ‘zoomed’, so we pass only part of the array to the <code>zoom</code> function. There may also be some error in the final shape due to rounding, so we do some trimming of the extra rows and columns before passing it back. When no scaling is done, we just return the original image.</p>
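<p>The "same size in, same size out" requirement can also be illustrated separately from the zooming itself. The sketch below is an alternative formulation, not the function above: it zooms a whole (small, made-up) 2D array and then uses a hypothetical helper, <code>fit_to_shape</code>, to centre-crop or zero-pad the result back to the original shape.</p>

```python
import numpy as np
from scipy import ndimage

def fit_to_shape(arr, shape):
    """Centre-crop or zero-pad a 2D array so that it has the given shape."""
    out = np.zeros(shape, dtype=arr.dtype)
    # size of the region that is copied across
    h = min(arr.shape[0], shape[0])
    w = min(arr.shape[1], shape[1])
    # top-left corners of that region in the source and destination arrays
    src_r, src_c = (arr.shape[0] - h) // 2, (arr.shape[1] - w) // 2
    dst_r, dst_c = (shape[0] - h) // 2, (shape[1] - w) // 2
    out[dst_r:dst_r + h, dst_c:dst_c + w] = arr[src_r:src_r + h, src_c:src_c + w]
    return out

image = np.arange(100, dtype=np.float32).reshape(10, 10)
for factor in (0.8, 1.0, 1.25):
    zoomed = ndimage.zoom(image, factor, order=1, mode='nearest')
    result = fit_to_shape(zoomed, image.shape)
    print(factor, zoomed.shape, result.shape)
```

<p>Zooming the whole array is wasteful when scaling up, which is why the function in this post only zooms the central region; the helper is just the simplest way to show the padding and trimming step.</p>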
<div style="width:100%; text-align:center;">
<div style="text-align:center; display:inline-block; width:29%; margin:auto;min-width:325px;">
<img title="Natural Image RGB" style="border: 2px solid black;" height=300 src="/img/augmentation/naturalimg.jpg" ><br>
<b>Original Image</b>
</div>
<div style="text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;">
<img title="Natural Image Grayscale" style="border: 2px solid black;" height=300 src="/img/augmentation/naturalimgscale075.png"><br>
<b>Scale-factor 0.75</b>
</div>
<div style="text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;">
<img title="Natural Image Grayscale" style="border: 2px solid black;" height=300 src="/img/augmentation/naturalimgscale125.png"><br>
<b>Scale-factor 1.25</b>
</div>
</div>
<div style="width:100%; text-align:center;">
<div style="text-align:center; display:inline-block; width:29%; margin:auto;min-width:325px;">
<img title="CMR Image" height=300 src="/img/augmentation/cmrimg.png" >
<img title="CMR Segmentation" height=300 src="/img/augmentation/cmrseg.png" ><br>
<b>Original Image and Segmentation</b>
</div>
<div style="text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;">
<img title="CMR Image" height=300 src="/img/augmentation/cmrimgscale1.png">
<img title="CMR Segmentation" height=300 src="/img/augmentation/cmrsegscale1.png"><br>
<b>Scale-factor 1.07</b>
</div>
<div style="text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;">
<img title="CMR Image" height=300 src="/img/augmentation/cmrimgscale2.png">
<img title="CMR Segmentation" height=300 src="/img/augmentation/cmrsegscale2.png"><br>
<b>Scale-factor 0.95</b>
</div>
</div>
<h3 id='resample'> Resampling </h3>
<p>It may be the case that we want to change the dimensions of our image such that they fit nicely into the input of our CNN. For example, most images and photographs have one dimension larger than the other or may be of different resolutions. Our training set may already be uniform, but in general most CNNs prefer inputs that are square and of identical sizes. We can use the same <code>scipy</code> function, <code>zoom</code>, to do this:</p>
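<pre><code class="language-python">import numpy as np
from scipy import ndimage

def resampleit(image, dims, isseg=False):
    # nearest-neighbour for label maps, order-5 spline for intensity images
    order = 0 if isseg else 5
    # per-axis zoom factor needed to reach the target dims
    factors = np.array(dims) / np.array(image.shape, dtype=np.float32)
    image = ndimage.zoom(image, factors, order=order, mode='nearest')
    if image.shape[-1] == 3:  # RGB image: leave the intensities alone
        return image
    # rescale intensities to [0, 1]; segmentations keep their labels
    return image if isseg else (image - image.min()) / (image.max() - image.min())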
</code></pre>
<p>The key part here is that we’ve replaced the <code>factor</code> argument with <code>dims</code> of type <code>list</code>. <code>dims</code> should have length equal to the number of dimensions of our image i.e. 2 or 3. We are calculating the factor that each dimension needs to change by in order to change the image to the target <code>dims</code>. We’ve forced the denominator of the scaling factor to be of type <code>float</code> so that the resulting factor is also <code>float</code>.</p>
<p>In this step, we are also changing the intensities of the image to use the full range from <code>0.0</code> to <code>1.0</code>. This ensures that all of our image intensities fall over the same range: one fewer thing for the network to be biased against. Again, note that we don’t want to do this for our segmentations as the pixel ‘intensities’ are actually labels. We could do this in a separate function, but I want this to happen to all of my images at this point. There’s no difference to the visual display of the images because they are automatically rescaled to use the full range of display colours.</p>
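<p>The intensity rescaling on its own is a plain min-max normalisation. A minimal sketch follows; the function name <code>rescale_intensity</code> and the guard for a constant image (where the maximum equals the minimum and the division would fail) are my own additions for illustration.</p>

```python
import numpy as np

def rescale_intensity(image, eps=1e-8):
    """Map intensities linearly onto [0, 1]; a constant image maps to all zeros."""
    image = image.astype(np.float32)
    lo, hi = image.min(), image.max()
    if hi - lo < eps:
        # avoid dividing by zero when every pixel has the same value
        return np.zeros_like(image)
    return (image - lo) / (hi - lo)

scan = np.array([[10, 20], [30, 50]])
print(rescale_intensity(scan))
```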
<h3 id="rotate"> Rotation </h3>
<p>This function utilises another <code>scipy</code> function called <code>rotate</code>. It takes a <code>float</code> for the <code>theta</code> argument which specifies the number of degrees of the rotation (negative numbers rotate anti-clockwise). We want the returned image to be of the same shape as the input <code>image</code> so <code>reshape = False</code> is used. Again we need to specify the <code>order</code> of the interpolation on the new lattice. The rotate function handles 3D images by rotating each slice by the same <code>theta</code>.</p>
<pre><code class="language-python">from scipy import ndimage

def rotateit(image, theta, isseg=False):
    order = 0 if isseg else 5
    return ndimage.rotate(image, float(theta), reshape=False, order=order, mode='nearest')
</code></pre>
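<p>As a quick check of the two points made above (the output keeps its shape with <code>reshape=False</code>, and a segmentation keeps its labels with <code>order=0</code>), here is a small sketch on a made-up two-slice label volume.</p>

```python
import numpy as np
from scipy import ndimage

# Hypothetical segmentation: 9x9 pixels, 2 slices, one rectangular structure labelled 1
seg = np.zeros((9, 9, 2), dtype=np.int32)
seg[3:6, 2:7, :] = 1

# Same shape out as in, and only the original labels survive
rot = ndimage.rotate(seg, 10.0, reshape=False, order=0, mode='nearest')
print(rot.shape, np.unique(rot))

# With reshape=True the in-plane size grows to hold the rotated corners
full = ndimage.rotate(seg, 10.0, reshape=True, order=0)
print(full.shape)
```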
<div style="width:100%; text-align:center;">
<div style="text-align:center; display:inline-block; width:29%; margin:auto;min-width:325px;">
<img title="Natural Image RGB" style="border: 2px solid black;" height=300 src="/img/augmentation/naturalimg.jpg" ><br>
<b>Original Image</b>
</div>
<div style="text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;">
<img title="Natural Image Grayscale" style="border: 2px solid black;" height=300 src="/img/augmentation/naturalimgrotate-10.png"><br>
<b>Theta = -10.0 </b>
</div>
<div style="text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;">
<img title="Natural Image Grayscale" style="border: 2px solid black;" height=300 src="/img/augmentation/naturalimgrotate10.png"><br>
<b>Theta = 10.0</b>
</div>
</div>
<div style="width:100%; text-align:center;">
<div style="text-align:center; display:inline-block; width:29%; margin:auto;min-width:325px;">
<img title="CMR Image" height=300 src="/img/augmentation/cmrimg.png" >
<img title="CMR Segmentation" height=300 src="/img/augmentation/cmrseg.png" ><br>
<b>Original Image and Segmentation</b>
</div>
<div style="text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;">
<img title="CMR Image" height=300 src="/img/augmentation/cmrimgrotate1.png">
<img title="CMR Segmentation" height=300 src="/img/augmentation/cmrsegrotate1.png"><br>
<b>Theta = 6.18</b>
</div>
<div style="text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;">
<img title="CMR Image" height=300 src="/img/augmentation/cmrimgrotate2.png">
<img title="CMR Segmentation" height=300 src="/img/augmentation/cmrsegrotate2.png"><br>
<b>Theta = -1.91</b>
</div>
</div>
<h3 id="intensify"> Intensity Changes </h3>
<p>The final augmentation we can perform is a scaling in the intensity of the pixels. This effectively brightens or dims the image by applying a blanket increase or decrease across all pixels. We specify the amount by a factor: <code>factor < 1.0</code> will dim the image, and <code>factor > 1.0</code> will brighten it. Note that we don’t want a <code>factor = 0.0</code> as this will blank the image.</p>
<pre><code class="language-python">def intensifyit(image, factor):
    return image * float(factor)
</code></pre>
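<p>If the image has already been normalised to the range 0.0 to 1.0, brightening pushes some values above 1.0. Whether that matters depends on your pipeline; one optional variation, sketched here with a hypothetical function name, is to clip the result back into the expected range.</p>

```python
import numpy as np

def intensify_clipped(image, factor, lo=0.0, hi=1.0):
    """Scale intensities by factor, then clip to [lo, hi] (assumes a normalised image)."""
    return np.clip(image * float(factor), lo, hi)

image = np.array([[0.2, 0.5], [0.9, 1.0]])
brighter = intensify_clipped(image, 1.2)   # values above 1.0 are capped
dimmer = intensify_clipped(image, 0.8)     # nothing to clip when dimming
print(brighter)
print(dimmer)
```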
<h3 id="flip"> Flipping </h3>
<p>One of the most common image augmentation procedures for natural images (dogs, cats, landscapes etc.) is flipping. The premise is that a dog is a dog no matter which way it’s facing. Or it doesn’t matter if a tree is on the right or the left of an image, it’s still a tree.</p>
<p>We can do horizontal flipping (left-to-right) or vertical flipping (up and down). It may make sense to do only one of these (if we know that dogs don’t walk on their heads, for example). We therefore pass a <code>list</code> of 2 boolean values, one per flip: if both are <code>1</code> then both flips are performed. We use the <code>numpy</code> functions <code>fliplr</code> and <code>flipud</code> for these.</p>
<p>As with resampling, the intensity changes are modified to take the range of the display so there won’t be a noticeable difference in the images. The maximum value for display is 255 so increasing this will just scale it back down.</p>
<pre><code class="language-python">def flipit(image, axes):
    # axes[0]: flip left-to-right, axes[1]: flip up-down
    if axes[0]:
        image = np.fliplr(image)
    if axes[1]:
        image = np.flipud(image)
    return image
</code></pre>
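<p>To see what the two <code>numpy</code> calls do, here is a small sketch on a made-up 2 x 3 array. For 3D arrays both functions only touch the first two axes, so colour channels or slices stay in their original order.</p>

```python
import numpy as np

image = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

lr = np.fliplr(image)   # reverses the column order (axis 1)
ud = np.flipud(image)   # reverses the row order (axis 0)
print(lr)
print(ud)

# Hypothetical RGB image: the channel axis is left untouched by fliplr
rgb = np.zeros((2, 3, 3))
rgb[:, 0, 0] = 1.0                    # red stripe in the first column
flipped = np.fliplr(rgb)
print(flipped[:, -1, 0])              # the red stripe is now in the last column
```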
<h3 id="cropping"> Cropping </h3>
<p>This may be a very niche function, but it’s important in my case. Often in natural image processing, random crops are taken from the image to give patches. These patches often contain most of the image data, e.g. a 224 x 224 patch rather than the 299 x 299 image. This is just another way of showing the network a very similar but also entirely different image. Central crops are also done. What’s different in my case is that I always want my segmentation to be fully visible in the image that I show to the network (I’m working with 3D cardiac MRI segmentations).</p>
<p>So this function looks at the segmentation and creates a bounding box using the outermost pixels. We’re producing ‘square’ crops with side-length equal to the width of the image (the shortest side not including the depth). In this case, the bounding box is created and, if necessary, the window is moved up and down the image to make sure the full segmentation is visible. It also makes sure that the output is always square in the case that the bounding box moves off the image array.</p>
<pre><code class="language-python">import numpy as np

def cropit(image, seg=None, margin=5):
    fixedaxes = np.argmin(image.shape[:2])   # shorter in-plane axis sets the side length
    trimaxes = 0 if fixedaxes == 1 else 1    # longer in-plane axis gets trimmed
    trim = image.shape[fixedaxes]
    center = image.shape[trimaxes] // 2
    if seg is not None and np.any(seg):
        # bounding box of the segmentation: min and max index along each axis
        hits = np.where(seg != 0)
        mins = np.min(hits, axis=1)
        maxs = np.max(hits, axis=1)
        # slide the window until the segmentation (plus margin) is inside it
        if center - (trim // 2) > mins[trimaxes]:
            while center - (trim // 2) > mins[trimaxes]:
                center = center - 1
            center = center - margin
        if center + (trim // 2) < maxs[trimaxes]:
            while center + (trim // 2) < maxs[trimaxes]:
                center = center + 1
            center = center + margin
    # clamp the window to the image so the crop is always trim pixels long
    top = max(0, center - (trim // 2))
    bottom = top + trim
    if bottom > image.shape[trimaxes]:
        bottom = image.shape[trimaxes]
        top = image.shape[trimaxes] - trim
    if trimaxes == 0:
        image = image[top:bottom, :, :]
    else:
        image = image[:, top:bottom, :]
    if seg is not None:
        if trimaxes == 0:
            seg = seg[top:bottom, :, :]
        else:
            seg = seg[:, top:bottom, :]
        return image, seg
    else:
        return image
</code></pre>
<p>Note that this function will work to square an image even when there is no segmentation given. We also have to be careful about which axes we take as the ‘fixed’ length for the square and which one to trim.</p>
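<p>The bounding-box step is the part most worth checking in isolation. A minimal sketch on a made-up label volume: <code>np.where</code> returns one index array per axis, and taking the minimum and maximum of each gives the outermost labelled pixels.</p>

```python
import numpy as np

# Hypothetical segmentation volume with one labelled block
seg = np.zeros((10, 16, 4), dtype=np.int32)
seg[3:7, 5:12, 1:3] = 1

hits = np.where(seg != 0)        # tuple of index arrays, one per axis
mins = np.min(hits, axis=1)      # smallest labelled index along each axis
maxs = np.max(hits, axis=1)      # largest labelled index along each axis
print(mins, maxs)                # [3 5 1] [ 6 11  2]
```

<p>Note that <code>np.argmin</code> and <code>np.argmax</code> would return positions within the list of hits rather than pixel coordinates, so they are not interchangeable with <code>np.min</code> and <code>np.max</code> here.</p>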
<div style="width:100%; text-align:center;">
<div style="text-align:center; display:inline-block; width:29%; margin:auto;min-width:325px;">
<img title="Natural Image RGB" style="border: 2px solid black;" height=300 src="/img/augmentation/naturalimg.jpg" ><br>
<b>Original Image</b>
</div>
<div style="text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;">
<img title="Natural Image Grayscale" style="border: 2px solid black;" height=300 src="/img/augmentation/naturalimgcrop.png"><br>
<b> Cropped </b>
</div>
</div>
<div style="width:100%; text-align:center;">
<div style="text-align:center; display:inline-block; width:29%; margin:auto;min-width:325px;">
<img title="CMR Image" height=300 src="/img/augmentation/cmrimg.png" >
<img title="CMR Segmentation" height=300 src="/img/augmentation/cmrseg.png" ><br>
<b>Original Image and Segmentation</b>
</div>
<div style="text-align:center; min-width:325px;display:inline-block; width:29%;margin:auto;">
<img title="CMR Image" height=300 src="/img/augmentation/cmrimgcrop.png">
<img title="CMR Segmentation" height=300 src="/img/augmentation/cmrsegcrop.png"><br>
<b>Cropped</b>
</div>
</div>
<h2 id="application"> Application </h2>
<p>We should be careful about how we apply our transformations. For example, if we apply multiple transformations to the same image we need to make sure that we don’t apply ‘resampling’ after ‘intensity changes’ because this will reset the range of the image, defeating the point of the intensification. However, as we will generally want our data to span the same range, wholesale intensity shifts are less often seen. We also want to make sure that we are not being overzealous with the augmentations: we need to set limits for our factors and other arguments.</p>
<p>When I implement data augmentation, I put all of these transforms into one script which can be downloaded here: <a href="/docs/transforms.py" title="transforms.py"><code>transforms.py</code></a>. I then call the transforms that I want from another script.</p>
<p>We create a set of cases, one for each transformation, which draw random (but controlled) parameters for our augmentations. Remember, we don’t want anything too extreme. We don’t want to apply all of these transformations every time, so we also create an array of random length (the number of transformations) with randomly assigned elements (the transformations to apply).</p>
<pre><code class="language-python">np.random.seed()
numTrans = np.random.randint(1, 6, size=1)
allowedTrans = [0, 1, 2, 3, 4]
whichTrans = np.random.choice(allowedTrans, numTrans, replace=False)
</code></pre>
<p>We assign a new <code>random.seed</code> every time to ensure that each pass is different to the last. There are 5 possible transformations so <code>numTrans</code> is a single random integer between 1 and 5. We then take a <code>random.choice</code> of the <code>allowedTrans</code> up to <code>numTrans</code>. We don’t want to apply the same transformation more than once, so <code>replace=False</code>.</p>
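<p>The same selection can be written with NumPy's newer <code>Generator</code> interface. This is only a sketch of an equivalent approach, not the code used in the script; the variable names are my own.</p>

```python
import numpy as np

# An unseeded generator draws fresh entropy, so every run differs
rng = np.random.default_rng()

allowed = [0, 1, 2, 3, 4]
# integers() excludes the upper bound, so this yields a count from 1 to 5
num = int(rng.integers(1, len(allowed) + 1))
# choose num distinct transformations
which = rng.choice(allowed, size=num, replace=False)
print(num, which)
```

<p>Passing a fixed seed to <code>default_rng</code> instead makes a run reproducible, which is handy when debugging a particular augmentation.</p>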
<p>After some trial and error, I’ve found that the following parameters are good:</p>
<ul>
<li>rotations - <code>theta</code> $ \in [-10.0, 10.0] $ degrees</li>
<li>scaling - <code>factor</code> $ \in [0.9, 1.1] $ i.e. 10% zoom-in or zoom-out</li>
<li>intensity - <code>factor</code> $ \in [0.8, 1.2] $ i.e. 20% increase or decrease</li>
<li>translation - <code>offset</code> $ \in [-5, 5] $ pixels</li>
<li>margin - I tend to set at either 5 or 10 pixels.</li>
</ul>
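<p>Drawing one set of parameters within the ranges listed above might look like the following sketch. The function name <code>draw_parameters</code> and the dictionary keys are hypothetical; only the ranges come from the list.</p>

```python
import numpy as np

def draw_parameters(rng):
    """Draw one set of augmentation parameters within the ranges listed above."""
    return {
        "theta": round(float(rng.uniform(-10.0, 10.0)), 2),      # degrees
        "scale": round(float(rng.uniform(0.9, 1.1)), 2),         # zoom factor
        "intensity": round(float(rng.uniform(0.8, 1.2)), 2),     # brightness factor
        # endpoint=True makes the upper bound inclusive, so offsets span -5..5
        "offset": rng.integers(-5, 5, size=2, endpoint=True).tolist(),
    }

rng = np.random.default_rng(42)   # fixed seed so the draw can be repeated
params = draw_parameters(rng)
print(params)
```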
<p>For an image called <code>thisim</code> and segmentation called <code>thisseg</code>, the cases I use are:</p>
<pre><code class="language-python">if 0 in whichTrans:
    theta = round(float(np.random.uniform(-10.0, 10.0)), 2)
    thisim = rotateit(thisim, theta)
    thisseg = rotateit(thisseg, theta, isseg=True) if withseg else np.zeros_like(thisim)
if 1 in whichTrans:
    scalefactor = round(float(np.random.uniform(0.9, 1.1)), 2)
    thisim = scaleit(thisim, scalefactor)
    thisseg = scaleit(thisseg, scalefactor, isseg=True) if withseg else np.zeros_like(thisim)
if 2 in whichTrans:
    factor = round(float(np.random.uniform(0.8, 1.2)), 2)
    thisim = intensifyit(thisim, factor)
    # no intensity change on segmentation
if 3 in whichTrans:
    axes = list(np.random.choice(2, 1, replace=True))
    thisim = flipit(thisim, axes+[0])
    thisseg = flipit(thisseg, axes+[0]) if withseg else np.zeros_like(thisim)
if 4 in whichTrans:
    # randint excludes the upper bound, so 6 gives offsets in [-5, 5]
    offset = list(np.random.randint(-5, 6, size=2))
    thisim = translateit(thisim, offset)
    thisseg = translateit(thisseg, offset, isseg=True) if withseg else np.zeros_like(thisim)
</code></pre>
<p>In each case, a random set of parameters is found and passed to the transform functions. The image and segmentation are passed separately to each one. In my case, I only choose to flip horizontally by randomly choosing 0 or 1 and appending <code>[0]</code> such that the transform ignores the second axis. We’ve also added a boolean variable called <code>withseg</code>. When <code>True</code> the segmentation is augmented, otherwise a blank image is returned.</p>
<p>Finally, we crop the image to make it square before resampling it to the desired <code>dims</code>.</p>
<pre><code class="language-python">thisim, thisseg = cropit(thisim, thisseg)
thisim = resampleit(thisim, dims)
thisseg = resampleit(thisseg, dims, isseg=True) if withseg else np.zeros_like(thisim)
</code></pre>
<p>Putting this together in a script makes testing the augmenter easier: you can download the script <a href="/docs/augmenter.py" title="augmenter.py">here</a>. Some things in the code to note:</p>
<ul>
<li>The script takes one mandatory argument (image filename) and an optional segmentation filename</li>
<li>There’s a bit of error checking: are the files able to be loaded? Is it an RGB or full 3D image (3rd dimension greater than 3)?</li>
<li>We specify the final image dimensions, [224, 224, 8] in this case</li>
<li>We also declare some default values for the parameters so that we can…</li>
<li>…print out the applied transformations and their parameters at the end</li>
<li>There’s a definition for a <code>plotit</code> function that just creates a 2 x 2 matrix where the top 2 images are the originals and the bottom two are the augmented images.</li>
<li>There’s a commented out part which is what I used to save the images created in this post</li>
</ul>
<p>In a live setting where we want to do data-augmentation on the fly, we would essentially call this script with the filenames or image arrays to augment and create as many augmentations of the images as we wish. We’ll take a look at this as an example in the next post.</p>
<p><strong>Edit: 15/05/2018</strong></p>
<ul>
<li>Added a <code>sliceshift</code> function to <code>transforms.py</code>. This takes in a 3D image and randomly shifts a <code>fraction</code> of the slices using our <code>translateit</code> function (which I’ve also updated slightly). This allows us to simulate motion in medical images.</li>
</ul>
Convolutional Neural Networks - TensorFlow (Basics)
/post/tensorflow-basics/
Mon, 03 Jul 2017 09:44:24 +0100
<p>We’ve looked at the principles behind how a CNN works, but how do we actually implement this in Python? This tutorial will look at the basic idea behind Google’s TensorFlow: an efficient way to build a CNN using purpose-built Python libraries.</p>
<p></p>
<div style="text-align:center;"><img width=30% title="TensorFlow" src="/img/CNN/TF_logo.png"></div>
<h2 id="intro"> Introduction </h2>
<p>Building a CNN from scratch in Python is perfectly possible, but very memory intensive. It can also lead to very long pieces of code. Several libraries have been developed by the community to solve this problem by wrapping the most common parts of CNNs into special methods called from their own libraries. Theano, Keras and PyTorch are notable libraries being used today that are all open source. However, since TensorFlow was released and Google announced their machine-learning-specific hardware, the Tensor Processing Unit (TPU), TensorFlow has quickly become a much-used tool in the field. If any applications being built today are intended for use on mobile devices, TensorFlow is the way to go as the mobile TPU in the upcoming Google phones will be able to perform inference from machine learning models in the user’s hand. Of course, being a relative newcomer, with updates still very much controlled by Google, TensorFlow may not have the huge body of support that has built up with Theano, say.</p>
<p>Nevertheless, TensorFlow is powerful and quick to set up so long as you know how: read on to find out. Much of this tutorial is based around the documentation provided by Google, but gives a lot more information that may be useful to less experienced users.</p>
<h2 id="install"> Installation </h2>
<p>TensorFlow is just another set of Python libraries distributed by Google via the website: <a href="https://www.tensorflow.org/install" title="TensorFlow Installation">https://www.tensorflow.org/install</a>. There’s the option to install the version for use on GPUs but that’s not necessary for this tutorial: we’ll be using the MNIST dataset, which is not too memory intensive.</p>
<p>Go ahead and install the TensorFlow libraries. I would say that even though they suggest using TF in a virtual environment, we will be coding up our CNN in a Python script so don’t worry about that if you’re not comfortable with it.</p>
<p>One of the most frustrating things you will find with TF is that much of the documentation on various websites is already out-of-date. Some of the commands have been re-written or renamed since the support was put in place. Even some of Google’s own tutorials are now old and require tweaking. At the time of writing, the code here works on the current TensorFlow 1.x releases but may throw some ‘deprecation’ warnings; note that it relies on <code>tf.contrib</code>, which was removed in TensorFlow 2.x.</p>
<h2 id="structure"> TensorFlow Structure </h2>
<p>The idea of ‘flow’ is central to TF’s organisation. The actual CNN is written as a ‘graph’. A graph is simply a list of the different layers in your network, each with their own input and output. Whatever data we input at the top will ‘flow’ through the graph and output some values. We also handle these output values with TensorFlow, which automatically takes care of updating any internal weights via whatever optimization method and loss function we prefer.</p>
<p>The graph is called by some initial functions in the script that create the classifier, run the training and output whatever evaluation metrics we like.</p>
<p>Before writing any functions, let’s import the necessary libraries and tell TF to limit any program logging:</p>
<pre><code class="language-python">import numpy as np
import os
import tensorflow as tf
from tensorflow.contrib import learn
from tensorflow.contrib.learn.python.learn.estimators import model_fn as model_fn_lib
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
</code></pre>
<p>We’ve included multiple TF lines to save on the typing later.</p>
<h3 id="graph"> The Graph </h3>
<p>Let’s get straight to it and start to build our graph. We will keep it simple:</p>
<ul>
<li>2 convolutional layers learning 16 filters (or kernels) of [3 x 3]</li>
<li>2 max-pooling layers that half the size of the image using [2 x 2] kernel</li>
<li>A fully connected layer at the end.</li>
</ul>
<pre><code class="language-python">#Hyperparameters
numK = 16 #number of kernels in each conv layer
sizeConvK = 3 #size of the kernels in each conv layer [n x n]
sizePoolK = 2 #size of the kernels in each pool layer [m x m]
inputSize = 28 #size of the input image
numChannels = 1 #number of channels to the input image grayscale=1, RGB=3
def convNet(inputs, labels, mode):
    #reshape the input from a vector to a 2D image
    input_layer = tf.reshape(inputs, [-1, inputSize, inputSize, numChannels])
    #perform convolution and pooling
    conv1 = doConv(input_layer)
    pool1 = doPool(conv1)
    conv2 = doConv(pool1)
    pool2 = doPool(conv2)
    #flatten the result back to a vector for the FC layer
    flatPool = tf.reshape(pool2, [-1, 7 * 7 * numK])
    dense = tf.layers.dense(inputs=flatPool, units=1024, activation=tf.nn.relu)
</code></pre>
<p>So what’s going on here? First we’ve defined some parameters for the CNN such as kernel sizes, the height of the input image (assuming it’s square) and the number of channels for the image. The number of channels is <code>1</code> for both Black and White with intensity values of either 0 or 1, and grayscale images with intensities in the range [0 255]. Colour images have <code>3</code> channels, Red, Green and Blue.</p>
<p>You’ll notice that we’ve barely used TF so far: we use it to reshape the data. This is important: when we run our script, TF will take our raw data and turn it into its own data type i.e. a <code>tensor</code>. That means our normal <code>numpy</code> operations won’t work on them, so we should use the in-built <code>tf.reshape</code> function which works in the same way as the one in numpy - it takes the input data and an output shape as arguments.</p>
<p>But why are we reshaping at all? Well, the data that is input into the network will be in the form of vectors. The image will have been saved along with lots of other images as single lines of a larger file. This is the case with the MNIST dataset and is common in machine learning. So we need to put it back into image-form so that we can perform convolutions.</p>
<p>“Where are those random 7s and the -1 from?”… good question. In this example, we are going to be using the MNIST dataset whose images are 28 x 28. If we put this through 2 pooling layers we will halve (14 x 14) and halve again (7 x 7) the width. Thus the layer needs to know what it is expecting the output to look like based upon the input which will be a 7 x 7 x <code>numK</code> tensor, one 7 x 7 for each kernel. Keep in mind that we will be running the network with more than one input image at a time, so in reality when we get to this stage, there will be <code>n</code> images here which all have 7 x 7 x <code>numK</code> values associated with them. The -1 simply tells TensorFlow to work out that dimension (the number of images) for itself and do the same to each image. It’s shorthand for “do this for the whole batch”.</p>
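<p>The arithmetic behind the 7s and the -1 can be sketched in plain Python. This is only an illustration: the helper names <code>halve_twice</code> and <code>infer_batch_dim</code> are made up for this example and are not part of TensorFlow, and the batch of 100 images is an assumed value.</p>

```python
from math import prod

# Hypothetical helpers for illustration; only 28, 2 and 16 come from the tutorial.

def halve_twice(size, num_pools=2, pool_stride=2):
    """Spatial size left after num_pools pooling layers with the given stride."""
    for _ in range(num_pools):
        size = size // pool_stride
    return size


def infer_batch_dim(total_values, known_dims):
    """What a -1 in a reshape stands for: total element count / product of known dims."""
    known = prod(known_dims)
    if total_values % known != 0:
        raise ValueError("the requested shape does not divide the data evenly")
    return total_values // known


side = halve_twice(28)                    # 28 -> 14 -> 7
features_per_image = side * side * 16     # 7 * 7 * numK with numK = 16
# Pretend 100 images reached the flatten step (assumed batch size).
batch = infer_batch_dim(100 * features_per_image, [features_per_image])
print(side, features_per_image, batch)
```

<p>The same inference happens for the first reshape: a block of 5 flattened MNIST images holds 5 x 784 values, so asking for [-1, 28, 28, 1] gives a leading dimension of 5.</p>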
<p>There’s also a <code>tf.layers.dense</code> method at the end here. This is one of TF’s in-built layer types that is very handy. We just tell it what to take as input, how many units we want it to have and what non-linearity we would prefer at the end. Instead of typing this all separately, it’s combined into a single line. Neat!</p>
<p>But what about the <code>conv</code> and <code>pool</code> layers? Well, to keep the code nice and tidy, I like to write the convolution and pooling layers in separate functions. This means that if I want to add more <code>conv</code> or <code>pool</code> layers, I can just write them in underneath the current ones and the code will still look clean (not that the functions are very long). Here they are:</p>
<pre><code class="language-python">def doConv(inputs):
    convOut = tf.layers.conv2d(inputs=inputs, filters=numK, kernel_size=[sizeConvK, sizeConvK],
                               padding="SAME", activation=tf.nn.relu)
    return convOut

def doPool(inputs):
    poolOut = tf.layers.max_pooling2d(inputs=inputs, pool_size=[sizePoolK, sizePoolK], strides=2)
    return poolOut
</code></pre>
<p>Again, both the <code>conv</code> and <code>pool</code> layers are simple one-liners. They both take in some input data and need to know the size of the kernel you want them to use (which we defined earlier on). The <code>conv</code> layer needs to know how many <code>filters</code> to learn too. Alongside this, we need to take care of any mismatch between the image size and the size of the kernels to ensure that we’re not changing the size of the image when we get the output. This is easily done in TF by setting the <code>padding</code> attribute to <code>"SAME"</code>. We’ve got our non-linearity at the end here too. We’ve hard-coded that the pooling layer will have <code>strides=2</code>, so the image will halve in size at each pooling layer.</p>
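<p>To see why <code>"SAME"</code> padding keeps the image size and a stride of 2 halves it, here is a small sketch of the output-size rules as I understand them from the TensorFlow documentation. The function <code>output_size</code> is a hypothetical helper written for this example, and I am assuming the pooling layer uses its default <code>"VALID"</code> padding since we did not set it.</p>

```python
from math import ceil


def output_size(in_size, kernel, stride, padding):
    """Output width (or height) of a conv/pool layer along one dimension."""
    if padding == "SAME":
        # Zero-padding is added so only the stride changes the size.
        return ceil(in_size / stride)
    if padding == "VALID":
        # No padding: the kernel must fit entirely inside the image.
        return ceil((in_size - kernel + 1) / stride)
    raise ValueError("padding must be 'SAME' or 'VALID'")


conv_same = output_size(28, kernel=3, stride=1, padding="SAME")    # size is preserved
conv_valid = output_size(28, kernel=3, stride=1, padding="VALID")  # loses a 1-pixel border
pool_once = output_size(28, kernel=2, stride=2, padding="VALID")   # first pooling layer
pool_twice = output_size(pool_once, kernel=2, stride=2, padding="VALID")
print(conv_same, conv_valid, pool_once, pool_twice)
```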
<p>Now we have the main part of our network coded-up. But it won’t do very much unless we ask TF to give us some outputs and compare them to some training data.</p>
<p>As the MNIST data is used for image-classification problems, we’ll be trying to get the network to output probabilities that the image it is given belongs to a specific class i.e. a number 0-9. The MNIST dataset provides its labels as the numbers 0-9. If we gave these to the network directly, it would start to output guesses as decimal values such as 0.143, 4.765, 8.112 or whatever. We need to change this data so that each class can have its own specific box to which the network can assign a probability. We use the idea of ‘one-hot’ labels for this. For example, class 3 becomes [0 0 0 1 0 0 0 0 0 0] and class 9 becomes [0 0 0 0 0 0 0 0 0 1]. This way we’re not asking the network to predict the number associated with each class but rather how likely the test-image is to be in each class.</p>
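<p>As a quick illustration of the idea (not of how TF implements it), a one-hot encoder can be written in a few lines of plain Python. The function name <code>one_hot</code> and its error handling are my own choices for this sketch.</p>

```python
def one_hot(label, depth=10):
    """Return a list of length depth with a single 1 at position label."""
    if not 0 <= label < depth:
        raise ValueError("label must lie in the range [0, depth)")
    return [1 if i == label else 0 for i in range(depth)]


print(one_hot(3))
print(one_hot(9))
```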
<p>TF has a very handy function for changing class labels into ‘one-hot’ labels. Let’s continue coding our graph in the <code>convNet</code> function.</p>
<pre><code class="language-python">    #Get the output in the form of one-hot labels with x units
    logits = tf.layers.dense(inputs=dense, units=10)
    loss = None
    train_op = None
    #At the end of the network, check how well we did
    if mode != learn.ModeKeys.INFER:
        #create one-hot labels from the training-labels
        onehot_labels = tf.one_hot(indices=tf.cast(labels, tf.int32), depth=10)
        #check how close the output is to the training-labels
        loss = tf.losses.softmax_cross_entropy(onehot_labels=onehot_labels, logits=logits)
    #After checking the loss, use it to train the network weights
    if mode == learn.ModeKeys.TRAIN:
        train_op = tf.contrib.layers.optimize_loss(loss=loss, global_step=tf.contrib.framework.get_global_step(),
                                                   learning_rate=learning_rate, optimizer="SGD")
</code></pre>
<p><code>logits</code> here is the output of the network which corresponds to the 10 classes of the training labels. The next two sections check whether we should be training the weights right now, or checking how well we’ve done. First we check our progress: we use <code>tf.one_hot</code> to create the one-hot labels from the numeric training labels given to the network in <code>labels</code>. We’ve performed a <code>tf.cast</code> operation to make sure that the data is of the correct type before doing the conversion.</p>
<p>Our loss-function is an important part of a CNN (or any machine learning algorithm). There are many different loss functions already built-in with TensorFlow from simple <code>absolute_difference</code> to more complex functions like our <code>softmax_cross_entropy</code>. We won’t delve into how this is calculated, just know that we can pick any loss function. More advanced users can write their own loss-functions. The loss function takes in the output of the network <code>logits</code> and compares it to our <code>onehot_labels</code>.</p>
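<p>For the curious, the calculation can be sketched for a single image in plain Python. This is a minimal illustration of the standard softmax and cross-entropy formulas, not TF’s actual implementation (which works on whole batches); the function names are my own.</p>

```python
import math


def softmax(logits):
    """Turn raw network outputs into probabilities that sum to 1."""
    largest = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(onehot, logits):
    """Negative log of the probability assigned to the true class."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(onehot, probs) if t)


true_class_3 = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
undecided = [0.0] * 10                          # every class equally likely
confident = [0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(cross_entropy(true_class_3, undecided))   # equals ln(10)
print(cross_entropy(true_class_3, confident))   # much smaller loss
```

<p>The loss shrinks as the network puts more probability on the correct class, which is exactly what the optimiser will try to achieve.</p>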
<p>When this is done, we ask TF to perform some updating or ‘optimization’ of the network based on the loss that we just calculated. The <code>train_op</code> in TF is the name given in support documents to the function that performs any background changes to the fundamentals of the network or updates values. Our <code>train_op</code> here is a simple loss-optimiser that tries to find the minimum loss for our data. As with all machine learning algorithms, the parameters of this optimiser are subject to much research. Using a pre-built optimiser such as those included with TF will ensure that your network performs efficiently and trains as quickly as possible. The <code>learning_rate</code> can be set as a variable at the beginning of our script along with the other parameters. We tend to stick with <code>0.001</code> to begin with and move in orders of magnitude if we need to e.g. <code>0.01</code> or <code>0.0001</code>. Just like the loss functions, there are a number of optimisers to use; some will take longer than others if they are more complex. For our purposes on the MNIST dataset, simple stochastic gradient descent (<code>SGD</code>) will suffice.</p>
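<p>To get a feel for what the learning rate does, here is a toy sketch of the gradient-descent update that SGD applies to each mini-batch. It minimises a made-up one-parameter loss, (w - 3)^2, rather than a real network, and the name <code>sgd_minimise</code> is hypothetical.</p>

```python
def sgd_minimise(grad_fn, w, learning_rate, steps):
    """Repeatedly step against the gradient: w <- w - learning_rate * gradient."""
    for _ in range(steps):
        w = w - learning_rate * grad_fn(w)
    return w


def toy_gradient(w):
    """Gradient of the toy loss (w - 3)^2, whose minimum sits at w = 3."""
    return 2.0 * (w - 3.0)


for lr in (0.001, 0.01, 0.1):
    print(lr, sgd_minimise(toy_gradient, w=0.0, learning_rate=lr, steps=1000))
```

<p>With the same number of steps, the smallest learning rate has not yet reached the minimum while the larger ones have, which is why moving the rate in orders of magnitude is a sensible first experiment.</p>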
<p>Notice that we are just giving TF some instructions: take my network, calculate the loss and do some optimisation based on that loss.</p>
<p>We are going to want to show what the network has learned, so we output the current predictions by defining a dictionary of data: the predicted classes (the index of the largest logit) and the associated probabilities (found by taking the softmax of the logits tensor).</p>
<pre><code>predictions ={"classes": tf.argmax(input=logits, axis=1), "probabilities": tf.nn.softmax(logits, name="softmax_tensor")}
</code></pre>
<p>We can finish off our graph by making sure it returns the data:</p>
<pre><code>return model_fn_lib.ModelFnOps(mode=mode, predictions=predictions, loss=loss, train_op=train_op)
</code></pre>
<p>A <code>ModelFnOps</code> object is returned that contains the current mode of the network (training, evaluation or inference), the current predictions, the loss and the <code>train_op</code> that we use to train the network.</p>
<h3 id="setup">Setting up the Script</h3>
<p>Now that the graph has been constructed, we need to call it and tell TF to do the training. First, let’s take a moment to load the data that we will be using. The MNIST dataset has its own loading method within TF (handy!). Let’s define the main body of our script:</p>
<pre><code class="language-python">def main(unused_argv):
    # Load training and eval data
    mnist = learn.datasets.load_dataset("mnist")
    train_data = mnist.train.images # Returns np.array
    train_labels = np.asarray(mnist.train.labels, dtype=np.int32)
    eval_data = mnist.test.images # Returns np.array
    eval_labels = np.asarray(mnist.test.labels, dtype=np.int32)
</code></pre>
<p>Next, we create the classifier that will hold the network and all of its data. We have to tell it what our graph is called under <code>model_fn</code> and where we would like our output stored.</p>
<p><strong>Note:</strong> If you use the <code>/tmp</code> directory in Linux you will probably find that the model will no longer be there if you restart your computer. If you intend to reload and use your model later on, be sure to save it in a more convenient place.</p>
<pre><code class="language-python"> mnistClassifier = learn.Estimator(model_fn=convNet, model_dir="/tmp/mln_MNIST")
</code></pre>
<p>We will want to get some information out of our network that tells us about the training performance. For example, we can create a dictionary that will hold the probabilities from the tensor that we named ‘softmax_tensor’ in the graph. How often this information is logged is controlled with the <code>every_n_iter</code> attribute. We add this to the <code>tf.train.LoggingTensorHook</code>.</p>
<pre><code class="language-python">    tensors2log = {"probabilities": "softmax_tensor"}
    logging_hook = tf.train.LoggingTensorHook(tensors=tensors2log, every_n_iter=100)
</code></pre>
<p>Finally! Let’s get TF to actually train the network. We call the <code>.fit</code> method of the classifier that we created earlier. We pass it the training data and the labels along with the batch size (i.e. how much of the training data we want to use in each iteration). Bear in mind that even though the MNIST images are very small, there are 60,000 of them and this may not do well for your RAM. We also need to say what the maximum number of iterations we’d like TF to perform is, and add that we want to <code>monitor</code> the training by outputting the data we’ve requested in <code>logging_hook</code>.</p>
<pre><code class="language-python"> mnistClassifier.fit(x=train_data, y=train_labels, batch_size=100, steps=1000, monitors=[logging_hook])
</code></pre>
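<p>It helps to keep the difference between a step and an epoch in mind: a step processes one batch, while an epoch is one full pass over the training set. The sketch below uses the 60,000 images mentioned above and the <code>batch_size</code> and <code>steps</code> from our call; the helper names are hypothetical, and the simple sequential slicing in <code>batches</code> is an assumption for illustration rather than a description of how TF feeds the data.</p>

```python
def epochs_seen(steps, batch_size, num_examples):
    """How many passes over the training set a number of steps corresponds to."""
    return steps * batch_size / num_examples


def batches(data, batch_size):
    """Yield consecutive slices of data, each at most batch_size long."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


print(epochs_seen(steps=1000, batch_size=100, num_examples=60000))

toy_data = list(range(250))
print([len(b) for b in batches(toy_data, 100)])
```

<p>So 1000 steps with a batch size of 100 means the network sees each training image fewer than two times.</p>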
<p>When the training is complete, we’d like TF to take some test-data and tell us how well the network performs. So we create a special metrics dictionary that TF will populate by calling the <code>.evaluate</code> method of the classifier.</p>
<pre><code class="language-python">    metrics = {"accuracy": learn.MetricSpec(metric_fn=tf.metrics.accuracy, prediction_key="classes")}
    eval_results = mnistClassifier.evaluate(x=eval_data, y=eval_labels, metrics=metrics)
    print(eval_results)
</code></pre>
<p>In this case, we’ve chosen to find the accuracy of the classifier by using the <code>tf.metrics.accuracy</code> value for the <code>metric_fn</code>. We also need to tell the evaluator that it’s the ‘classes’ key we’re looking at in the graph. This is then passed to the evaluator along with the test data.</p>
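<p>Accuracy itself is a simple quantity: the fraction of test images whose predicted class matches the true label. A plain-Python sketch of the idea is below; <code>argmax</code> and <code>accuracy</code> are illustrative helpers of my own, not the TF functions.</p>

```python
def argmax(values):
    """Index of the largest value, i.e. the predicted class for one image."""
    return max(range(len(values)), key=values.__getitem__)


def accuracy(predicted_classes, true_labels):
    """Fraction of predictions that match the true labels."""
    if len(predicted_classes) != len(true_labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(1 for p, t in zip(predicted_classes, true_labels) if p == t)
    return correct / len(true_labels)


probabilities = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1], [0.3, 0.3, 0.4], [0.2, 0.5, 0.3]]
predicted = [argmax(p) for p in probabilities]
print(predicted, accuracy(predicted, [1, 0, 0, 1]))
```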
<h3 id="running">Running the Network</h3>
<p>Adding the final main function to the script and making sure we’ve done all the necessary includes, we can run the program. The full script can be found <a href="/docs/tfCNNMNIST.py" title="TFCNNMNIST.py">here</a>.</p>
<p>In the current configuration, running the network for 1000 steps (iterations) gave me an output of:</p>
<pre><code class="language-python">{'loss': 1.9025836, 'global_step': 1000, 'accuracy': 0.64929998}
</code></pre>
<p>Definitely not a great accuracy for the MNIST dataset! We could just run this for longer and would likely see an increase in accuracy. Instead, let’s make some of the easy tweaks to our network that we’ve described before: dropout and batch normalisation.</p>
<p>In our graph, we want to add:</p>
<pre><code class="language-python">    dense = tf.contrib.layers.batch_norm(dense, decay=0.99, is_training=(mode == learn.ModeKeys.TRAIN))
    dense = tf.layers.dropout(inputs=dense, rate=keepProb, training=(mode == learn.ModeKeys.TRAIN))
</code></pre>
<p>The batch normalisation layer <a href="https://www.tensorflow.org/api_docs/python/tf/contrib/layers/batch_norm" title="tf.contrib.layers.batch_norm">has many different attributes</a>. Its functionality is taken from <a href="https://arxiv.org/abs/1502.03167" title="Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift">the paper by Ioffe and Szegedy (2015)</a>.</p>
<p>The dropout layer’s <code>keepProb</code> is defined in the hyperparameter preamble to the script; it is another value that can be changed to improve the performance of the network. Note that the <code>rate</code> argument of <code>tf.layers.dropout</code> is the fraction of units that are dropped rather than kept, so despite its name <code>keepProb</code> acts as the drop rate here. Both of these lines are in the final script <a href="/docs/tfCNNMNIST.py" title="tffCNNMNIST.py">available here</a>, just uncomment them.</p>
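<p>What these two layers do to a vector of activations can be sketched in plain Python. This is a simplified illustration under stated assumptions: the batch normalisation here only standardises the values and leaves out the learned scale and shift and the moving averages (the <code>decay</code> argument) that the real layer keeps, and both function names are my own.</p>

```python
import random


def batch_norm(values, epsilon=1e-5):
    """Standardise values to roughly zero mean and unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / (variance + epsilon) ** 0.5 for v in values]


def dropout(values, rate, training, rng=random):
    """Zero a fraction `rate` of the values during training and rescale the rest."""
    if not training or rate == 0.0:
        return list(values)                 # dropout is switched off outside training
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


activations = [1.0, 2.0, 3.0, 4.0]
print(batch_norm(activations))
random.seed(0)
print(dropout(activations, rate=0.5, training=True))
print(dropout(activations, rate=0.5, training=False))
```

<p>Rescaling the surviving values by 1 / (1 - rate) keeps the expected total activation the same whether or not dropout is active, which is why nothing needs to change at evaluation time.</p>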
<p>If we re-run the script, it will automatically load the most recent state of the network (clever TensorFlow!) but… it will fail because the checkpoint does not include the two new layers in its graph. So we must either delete our <code>/tmp/mln_MNIST</code> folder, or give the classifier a new <code>model_dir</code>.</p>
<p>Doing this and rerunning for the same 1000 steps, I get an instant jump in accuracy from about 65% to about 92% (a relative increase of roughly 41%):</p>
<pre><code class="language-python">{'loss': 0.29391664, 'global_step': 1000, 'accuracy': 0.91680002}
</code></pre>
<p>Simply changing the optimiser to use the “Adam” rather than “SGD” optimiser yields:</p>
<pre><code class="language-python">{'loss': 0.040745325, 'global_step': 1000, 'accuracy': 0.98500001}
</code></pre>
<p>And running for slightly longer (20,000 iterations):</p>
<pre><code class="language-python">{'loss': 0.046967514, 'global_step': 20000, 'accuracy': 0.99129999}
</code></pre>
<h2 id="conclusion"> Conclusion </h2>
<p>TensorFlow takes away the tedium of having to write out the full code for each individual layer and is able to perform optimisation and evaluation with minimal effort.</p>
<p>If you look around online, you will see many methods for using TF that will get you similar results. I actually prefer some methods that are a little more explicit. The tutorial on Google for example has some room to allow us to include more logging features.</p>
<p>In future posts, we will look more into logging and TensorBoard, but for now, happy coding!</p>Convolutional Neural Networks - Basics
/post/CNN1/
Fri, 07 Apr 2017 09:46:56 +0100/post/CNN1/<p>This series will give some background to CNNs, their architecture, coding and tuning. In particular, this tutorial covers some of the background to CNNs and Deep Learning. We won’t go over any coding in this session, but that will come in the next one. What’s the big deal about CNNs? What do they look like? Why do they work? Find out in this tutorial.</p>
<p></p>
<h2 id="intro"> Introduction </h2>
<p>A convolutional neural network (CNN) is very much related to the standard NN we’ve previously encountered. I found that when I searched for the link between the two, there seemed to be no natural progression from one to the other in terms of tutorials. It would seem that CNNs were developed in the late 1980s and then forgotten about due to the lack of processing power. In fact, it wasn’t until the advent of cheap, but powerful GPUs (graphics cards) that the research on CNNs and Deep Learning in general was given new life. Thus you’ll find an explosion of papers on CNNs in the last 3 or 4 years.</p>
<p>Nonetheless, the research that has been churned out is <em>powerful</em>. CNNs are used in so many applications now:</p>
<ul>
<li>Object recognition in images and videos (think image-search in Google, tagging friends’ faces in Facebook, adding filters in Snapchat and tracking movement in Kinect)</li>
<li>Natural language processing (speech recognition in Google Assistant or Amazon’s Alexa)</li>
<li>Playing games (the recent <a href="https://en.wikipedia.org/wiki/AlphaGo" title="AlphaGo on Wiki">defeat of the world ‘Go’ champion</a> by DeepMind at Google)</li>
<li>Medical innovation (from drug discovery to prediction of disease)</li>
</ul>
<p>Despite the differences between these applications and the ever-increasing sophistication of CNNs, they all start out in the same way. Let’s take a look.</p>
<h2 id="deep"> CNN or Deep Learning? </h2>
<p>
The "deep" part of deep learning comes in a couple of places: the number of layers and the number of features. Firstly, as one may expect, there are usually more layers in a deep learning framework than in your average multi-layer perceptron or standard neural network. We have some architectures that are 150 layers deep. Secondly, each layer of a CNN will learn multiple 'features' (multiple sets of weights) that connect it to the previous layer; so in this sense it's much deeper than a normal neural net too. In fact, some powerful neural networks, even CNNs, only consist of a few layers. So the 'deep' in DL acknowledges that each layer of the network learns multiple features. More on this later.
</p><p>
Often you may see a conflation of CNNs with DL, but the concept of DL comes some time before CNNs were first introduced. Connecting multiple neural networks together, altering the directionality of their weights and stacking such machines all gave rise to the increasing power and popularity of DL.
</p><p>
We won't delve too deeply into history or mathematics in this tutorial, but if you want to know the timeline of DL in more detail, I'd suggest the paper "On the Origin of Deep Learning" (Wang and Raj 2016) available <a href="https://t.co/aAw4rEpZEt" title="On the Origin of Deep Learning">here</a>. It's a lengthy read - 72 pages including references - but shows the logic between progressive steps in DL.
</p><p>
As with the study of neural networks, the inspiration for CNNs came from nature: specifically, the visual cortex. It drew upon the idea that the neurons in the visual cortex focus upon different sized patches of an image getting different levels of information in different layers. If a computer could be programmed to work in this way, it may be able to mimic the image-recognition power of the brain. So how can this be done?
</p>
<p>A CNN takes as input an array, or image (2D or 3D, grayscale or colour) and tries to learn the relationship between this image and some target data e.g. a classification. By ‘learn’ we are still talking about weights just like in a regular neural network. The difference in CNNs is that these weights connect small subsections of the input to each of the different neurons in the first layer. Fundamentally, there are multiple neurons in a single layer that each have their own weights to the same subsection of the input. These different sets of weights are called ‘kernels’.</p>
<p>It’s important at this stage to make sure we understand this weight or kernel business, because it’s the whole point of the ‘convolution’ bit of the CNN.</p>
<h2 id="kernels"> Convolution and Kernels </h2>
<p>Convolution is something that should be taught in schools along with addition, and multiplication - it’s <a href="https://en.wikipedia.org/wiki/Convolution" title="Convolution on Wiki">just another mathematical operation</a>. Perhaps the reason it’s not, is because it’s a little more difficult to visualise.</p>
<p>Let’s say we have a pattern or a stamp that we want to repeat at regular intervals on a sheet of paper. A very convenient way to do this is to perform a convolution of the pattern with a regular grid on the paper. Think about hovering the stamp (or kernel) above the paper and moving it along a grid before pushing it into the page at each interval.</p>
<p>This idea of wanting to repeat a pattern (kernel) across some domain comes up a lot in the realm of signal processing and computer vision. In fact, if you’ve ever used a graphics package such as Photoshop, Inkscape or GIMP, you’ll have seen many kernels before. The list of ‘filters’ such as ‘blur’, ‘sharpen’ and ‘edge-detection’ are all done with a convolution of a kernel or filter with the image that you’re looking at.</p>
<p>For example, let’s find the outline (edges) of the image ‘A’.</p>
<div style="text-align:center; display:inline-block; width:100%; margin:auto;">
<img title="Android" src="/img/CNN/android.png"><br>
<b>A</b>
</div>
<p>We can use a kernel, or set of weights, like the ones below.</p>
<div style="width:100%; text-align:center;">
<div style="text-align:center; display:inline-block; width:49%; margin:auto;min-width:155px;">
<img title="Horizontal Filter" height=150 src="/img/CNN/horizFilter.png"><br>
<b>Finds horizontals</b>
</div>
<div style="text-align:center; min-width:150px;display:inline-block; width:49%;margin:auto;">
<img title="Vertical Filter" height=150 src="/img/CNN/vertFilter.png"><br>
<b>Finds verticals</b>
</div>
</div>
<p>A kernel is placed in the top-left corner of the image. The pixel values covered by the kernel are multiplied with the corresponding kernel values and the products are summed. The result is placed in the new image at the point corresponding to the centre of the kernel. An example for this first step is shown in the diagram below. This takes the vertical Sobel filter (used for edge-detection) and applies it to the pixels of the image.</p>
<div style="text-align:center; display:inline-block; width:100%;margin:auto;">
<img title="Conv Example" height="350" src="/img/CNN/convExample.png"><br>
<b>A step in the Convolution Process.</b>
</div>
<p>The kernel is moved over by one pixel and this process is repeated until all of the possible locations in the image are filtered as below, this time for the horizontal Sobel filter. Notice that there is a border of empty values around the convolved image. This is because the result of convolution is placed at the centre of the kernel. To deal with this, a process called ‘padding’ or more commonly ‘zero-padding’ is used. This simply means that a border of zeros is placed around the original image to make it a pixel wider all around. The convolution is then done as normal, but the convolution result will now produce an image that is of equal size to the original.</p>
<div style="width:100%;margin:auto; text-align:center;">
<div style="text-align:center; display:inline-block; width:45%;min-width:455px;margin:auto;">
<img title="Sobel Conv Gif" height="450" src="/img/CNN/convSobel.gif"><br>
<b>The kernel is moved over the image performing the convolution as it goes.</b>
</div>
<div style="text-align:center; display:inline-block; width:45%;min-width:450px;margin:auto;">
<img title="Zero Padding Conv" height="450" src="/img/CNN/convZeros.png"><br>
<b>Zero-padding is used so that the resulting image doesn't shrink.</b>
</div>
</div>
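<p>The multiply-and-sum procedure and the effect of zero-padding can be sketched in plain Python. The 4 x 4 test image is made up, the kernel uses the standard vertical Sobel values (I am assuming these match the figure), and the function names are my own. Like most CNN libraries, the sketch slides the kernel without flipping it, which strictly speaking is cross-correlation.</p>

```python
def zero_pad(image, border=1):
    """Surround the image with a border of zeros."""
    width = len(image[0]) + 2 * border
    padded = [[0] * width for _ in range(border)]
    for row in image:
        padded.append([0] * border + list(row) + [0] * border)
    padded.extend([0] * width for _ in range(border))
    return padded


def convolve(image, kernel):
    """Slide a square kernel over the image, multiplying and summing at each position."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(k) for j in range(k))
             for c in range(out_w)]
            for r in range(out_h)]


SOBEL_VERTICAL = [[-1, 0, 1],
                  [-2, 0, 2],
                  [-1, 0, 1]]

# Dark left half, bright right half: one vertical edge down the middle.
image = [[0, 0, 10, 10] for _ in range(4)]

shrunk = convolve(image, SOBEL_VERTICAL)               # 2 x 2: the border is lost
same_size = convolve(zero_pad(image), SOBEL_VERTICAL)  # 4 x 4: same size as the input
print(shrunk)
print(len(same_size), len(same_size[0]))
```

<p>Without padding the 4 x 4 image shrinks to 2 x 2; with a one-pixel border of zeros the output is 4 x 4 again.</p>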
<p>Now that we have our convolved image, we can use a colourmap to visualise the result. Here, I’ve just normalised the values between 0 and 255 so that I can apply a grayscale visualisation:</p>
<div style="text-align:center; display:inline-block; width:100%;margin:auto;">
<img title="Conv Result" height="150" src="/img/CNN/convResult.png"><br>
<b>Result of the convolution</b>
</div>
<p>This dummy example could represent the very bottom left edge of the Android’s head and doesn’t really look like it’s detected anything. To see the proper effect, we need to scale this up so that we’re not looking at individual pixels. Performing the horizontal and vertical Sobel filtering on the full 264 x 264 image gives:</p>
<div style="width:100%;margin:auto; text-align:center;">
<div style="text-align:center; display:inline-block; min-width:100px;margin:auto;">
<img title="Horizontal Sobel" src="/img/CNN/horiz.png"><br>
<b>Horizontal Sobel</b>
</div>
<div style="text-align:center; display:inline-block; margin:auto;min-width:100px">
<img title="Vertical Sobel" src="/img/CNN/vert.png"><br>
<b>Vertical Sobel</b>
</div>
<div style="text-align:center; display:inline-block;margin:auto;min-width:100px">
<img title="Full Sobel" src="/img/CNN/both.png"><br>
<b>Combined Sobel</b>
</div>
</div>
<p>The third image adds together the results from both filters, so it shows the horizontal and the vertical edges at once.</p>
<h3 id="relationship"> How does this feed into CNNs? </h3>
<p>Clearly, convolution is powerful in finding the features of an image <strong>if</strong> we already know the right kernel to use. Kernel design is an artform and has been refined over the last few decades to do some pretty amazing things with images (just look at the huge list in your graphics software!). But the important question is, what if we don’t know the features we’re looking for? Or what if we <strong>do</strong> know, but we don’t know what the kernel should look like?</p>
<p>Well, first we should recognise that every pixel in an image is a <strong>feature</strong> and that means it represents an <strong>input node</strong>. The result from each convolution is placed into the next layer in a <strong>hidden node</strong>. Each feature or pixel of the convolved image is a node in the hidden layer.</p>
<p>We’ve already said that each of these numbers in the kernel is a weight, and that weight is the connection between the feature of the input image and the node of the hidden layer. The kernel is swept across the image and so there must be as many hidden nodes as there are input nodes (or slightly fewer if we don’t add zero-padding to the input image). This means that the hidden layer is also 2D like the input image. Sometimes, instead of moving the kernel over one pixel at a time, the <strong>stride</strong>, as it’s called, can be increased. This will result in fewer nodes or fewer pixels in the convolved image. Consider it like this:</p>
<div style="width:100%;margin:auto; text-align:center;">
<div style="text-align:center; display:inline-block;margin:auto;min-width:300px;">
<img title="Hidden Layer Nodes" height=300 src="/img/CNN/hiddenLayer.png"><br>
<b>Hidden Layer Nodes in a CNN</b>
</div>
<div style="text-align:center; display:inline-block;margin:auto;min-width:300px">
<img title="Hidden Layer after Increased Stride" height=225 src="/img/CNN/strideHidden.png"><br>
<b>Increased stride means fewer hidden-layer nodes</b>
</div>
</div>
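<p>Counting the places where the kernel can sit makes the effect of the stride concrete. In this sketch the 7 x 7 image and 3 x 3 kernel are arbitrary example sizes, no zero-padding is used, and the helper names are hypothetical.</p>

```python
def kernel_positions(image_size, kernel_size, stride=1):
    """Top-left offsets (along one axis) at which the kernel fits inside the image."""
    return list(range(0, image_size - kernel_size + 1, stride))


def hidden_nodes(image_size, kernel_size, stride=1):
    """Number of nodes in the 2D hidden layer for a square image and kernel."""
    per_axis = len(kernel_positions(image_size, kernel_size, stride))
    return per_axis * per_axis


print(kernel_positions(7, 3, stride=1), hidden_nodes(7, 3, stride=1))
print(kernel_positions(7, 3, stride=2), hidden_nodes(7, 3, stride=2))
```

<p>Doubling the stride here cuts the hidden layer from 25 nodes to 9.</p>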
<p>These weights that connect to the nodes need to be learned in exactly the same way as in a regular neural network. The image is passed through these nodes (by being convolved with the weights a.k.a the kernel) and the result is compared to some output (the error of which is then backpropagated and optimised).</p>
<p>In reality, it isn’t just the weights or the kernel for one 2D set of nodes that has to be learned, there is a whole array of nodes which all look at the same area of the image (sometimes, but possibly incorrectly, called the <strong>receptive field</strong>*). Each of the nodes in this row (or <strong>fibre</strong>) tries to learn different kernels (different weights) that will show up some different features of the image, like edges. So the hidden-layer may look something more like this:</p>
<p>* <em>Note: we’ll talk more about the receptive field after looking at the pooling layer below</em></p>
<div style="width:100%;margin:auto; text-align:center;">
<div style="text-align:center; display:inline-block;margin:auto;min-width:100px">
<img title="Multiple Kernel Hidden Layer" height=350 src="/img/CNN/deepConv.png"><br>
<b>For a 2D image learning a set of kernels.</b>
</div>
<div style="text-align:center; display:inline-block;margin:auto;min-width:100px">
<img title="3 Channel Image" height=350 src="/img/CNN/deepConv3.png"><br>
<b>For a 3 channel RGB image the kernel becomes 3D.</b>
</div>
</div>
<p>Now <strong>this</strong> is why deep learning is called <strong>deep</strong> learning. Each hidden layer of the convolutional neural network is capable of learning a large number of kernels. The output from this hidden-layer is passed to more layers which are able to learn their own kernels based on the <em>convolved</em> image output from this layer (after some pooling operation to reduce the size of the convolved output). This is what gives the CNN the ability to see the edges of an image and build them up into larger features.</p>
<h2 id="CNN Architecture"> CNN Architecture </h2>
<p>It is the <em>architecture</em> of a CNN that gives it its power. A lot of papers that are published on CNNs tend to be about a new architecture i.e. the number and ordering of different layers and how many kernels are learnt. Though often it’s the clever tricks applied to older architectures that really give the network power. Let’s take a look at the other layers in a CNN.</p>
<h2 id='layers'> Layers </h2>
<h3 id="input"> Input Layer </h3>
<p>The input image is placed into this layer. It can be a single-channel 2D image (grayscale), a 2D 3-channel image (RGB colour) or 3D. The main difference between how the inputs are arranged comes in the formation of the expected kernel shapes. Kernels need to be learned that are the same depth as the input i.e. a 5 x 5 kernel becomes 5 x 5 x 3 for a 2D RGB image.</p>
<p>Inputs to a CNN seem to work best when they’re of certain dimensions. This is because of the behaviour of the convolution. Depending on the <em>stride</em> of the kernel and the subsequent <em>pooling layers</em> the outputs may become an “illegal” size including half-pixels. We’ll look at this in the <em>pooling layer</em> section.</p>
<h3 id="convolution"> Convolutional Layer </h3>
<p>We’ve <a href="#kernels" title="Convolution and Kernels">already looked at what the conv layer does</a>. Just remember that it takes in an image e.g. [56 x 56 x 3] and, assuming a stride of 1 and zero-padding, will produce an output of [56 x 56 x 32] if 32 kernels are being learnt. It’s important to note that the order of these dimensions can be important during the implementation of a CNN in Python. This is because there’s a lot of matrix multiplication going on!</p>
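<p>As a rough check on those sizes, here is a minimal sketch of the usual output-size arithmetic along one axis. The function name <code>conv_output_size</code> and the 3-pixel kernel are assumptions made for illustration; the post doesn’t fix a kernel size for the [56 x 56 x 3] example.</p>

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial size of a convolution (or pooling) output along one axis."""
    span = input_size - kernel_size + 2 * padding
    if span % stride != 0:
        # This is the "illegal" half-pixel case: the kernel does not tile the input
        raise ValueError("output would have a fractional size")
    return span // stride + 1


# 56-pixel input, 3-pixel kernel, stride 1, one pixel of zero-padding on each side
same = conv_output_size(56, 3, stride=1, padding=1)
# Without padding the output shrinks by kernel_size - 1
valid = conv_output_size(56, 3)
print(same, valid)
```

<p>With padding the spatial size is preserved, which is why the depth (the number of kernels) is the only dimension that changes in the example above.</p>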
<h3 id="nonlinear"> Non-linearity</h3>
<p>The ‘non-linearity’ here isn’t its own distinct layer of the CNN, but comes as part of the convolution layer as it is done on the output of the neurons (just like a normal NN). By this, we mean “don’t take the data forwards as it is (linearity), let’s do something to it (non-linearity) that will help us later on”.</p>
<p>In our neural network tutorials we looked at different <a href="/post/transfer-functions" title="Transfer Functions">activation functions</a>. These each provide a different mapping of the input to an output, either to [-1 1], [0 1] or some other range e.g. the Rectified Linear Unit thresholds the data at 0: max(0,x). The <em>ReLU</em> is very popular as it doesn’t require any expensive computation and it’s been <a href="http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf" title="Krizhevsky et al 2012">shown to speed up the convergence of stochastic gradient descent algorithms</a>.</p>
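<p>The ReLU really is as cheap as it sounds. A minimal numpy sketch (the name <code>relu</code> and the sample values are just for illustration):</p>

```python
import numpy as np


def relu(x):
    """Rectified Linear Unit: threshold the data at 0, i.e. max(0, x) element-wise."""
    return np.maximum(0, x)


x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
activated = relu(x)
print(activated)
```

<p>Negative values are clipped to zero and everything else passes through unchanged.</p>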
<h3 id="pool"> Pooling Layer </h3>
<p>The pooling layer is key to making sure that the subsequent layers of the CNN are able to pick up larger-scale detail than just edges and curves. It does this by merging pixel regions in the convolved image together (shrinking the image) before attempting to learn kernels on it. Effectively, this stage takes another kernel, say [2 x 2], and passes it over the entire image, just like in convolution. It is common to have the stride and kernel size equal i.e. a [2 x 2] kernel has a stride of 2. This example will <em>halve</em> the size of the convolved image. The number of feature-maps produced by the learned kernels will remain the same as <strong>pooling</strong> is done on each one in turn. Thus the pooling layer returns an array with the same depth as the convolution layer. The figure below shows the principle.</p>
<div style="text-align:center; display:inline-block; width:100%;margin:auto;">
<img title="Pooling" height=350 src="/img/CNN/poolfig.gif"><br>
<b>Max-pooling: Pooling using a "max" filter with stride equal to the kernel size</b>
</div>
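<p>The max-pooling in the figure can be sketched in a few lines of numpy. This is a minimal version that assumes the stride equals the kernel size and that the feature map divides evenly; <code>max_pool2d</code> is a name made up for this example.</p>

```python
import numpy as np


def max_pool2d(feature_map, size=2):
    """Non-overlapping max-pooling (stride == size) on a single 2D feature map."""
    h, w = feature_map.shape
    if h % size or w % size:
        raise ValueError("feature map must divide evenly by the pool size")
    # Split the map into (size x size) blocks, then take the max of each block
    blocks = feature_map.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))


fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 5],
                 [0, 1, 7, 2],
                 [3, 2, 4, 6]])
pooled = max_pool2d(fmap)
print(pooled)  # [[4 5]
               #  [3 7]]
```

<p>In a full network the same operation would simply be repeated for every feature map, which is why the depth is unchanged.</p>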
<h3 id="receptiveField"> A Note on the Receptive Field </h3>
<p>This is quite an important, but sometimes neglected, concept. We said that the receptive field of a single neuron can be taken to mean the area of the image which it can ‘see’. Each neuron therefore has a different receptive field. While this is true, the full impact of it can only be understood when we see what happens after pooling.</p>
<p>Let’s take an image of size [12 x 12] and a kernel size in the first conv layer of [3 x 3]. The output of the conv layer (assuming zero-padding and stride of 1) is going to be [12 x 12 x 10] if we’re learning 10 kernels. After pooling with a [3 x 3] kernel, we get an output of [4 x 4 x 10]. This gets fed into the next conv layer. Suppose the kernel in the second conv layer is [2 x 2], would we say that the receptive field here is also [2 x 2]? Well, some people do but, actually, no it’s not. In fact, a neuron in this layer is not just seeing the [2 x 2] area of the <em>pooled</em> image, it is actually seeing a much larger area of the <em>original</em> image too. Ignoring the pooling for a moment, a [2 x 2] kernel stacked on a [3 x 3] kernel with a stride of 1 covers a [4 x 4] area of the original image. With the [3 x 3] pooling in between, each pooled pixel already summarises a [5 x 5] patch of the original and neighbouring pooled pixels sit 3 pixels apart, so the [2 x 2] kernel is really seeing an [8 x 8] area. Continuing this through the rest of the network, it is possible to end up with a final layer with a receptive field equal to the size of the original image. Understanding this gives us the real insight to how the CNN works, building up the image as it goes.</p>
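<p>The bookkeeping for this example can be sketched with the standard receptive-field recurrence: each layer adds (kernel - 1) times the current spacing between neighbouring neurons, and each stride multiplies that spacing. The function name and the list of (kernel, stride) pairs are illustrative and not part of the original post. Stacking only the [3 x 3] and [2 x 2] kernels gives a field of 4 pixels per side; counting the [3 x 3] pooling (stride 3) between them gives 8.</p>

```python
def receptive_field(layers):
    """Receptive field (pixels per side) after a stack of layers.

    layers: list of (kernel_size, stride) pairs, ordered from input to output.
    """
    rf, jump = 1, 1  # a single input pixel; neighbouring neurons are 1 pixel apart
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


conv_only = receptive_field([(3, 1), (2, 1)])          # conv 3x3 -> conv 2x2
with_pool = receptive_field([(3, 1), (3, 3), (2, 1)])  # conv 3x3 -> pool 3x3 -> conv 2x2
print(conv_only, with_pool)  # 4 8
```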
<h3 id="dense"> Fully-connected (Dense) Layer</h3>
<p>So this layer took me a while to figure out, despite its simplicity. If I take all of the say [3 x 3 x 64] featuremaps of my final pooling layer I have 3 x 3 x 64 = 576 different values to connect to the next layer. I need to make sure that my training labels match with the outputs from my output layer. We may only have 10 possibilities in our output layer (say the digits 0 - 9 in the classic MNIST number classification task). Thus we want the final numbers in our output layer to be [10,] and the weights leading into it to be [? x 10] where the ? represents the number of nodes in the layer before: the fully-connected (FC) layer. If there was only 1 node in this layer, it would have 576 weights attached to it - one for each of the values coming from the previous pooling layer. This is not very useful as it won’t allow us to learn any combinations of these outputs. Increasing the number of neurons to say 1,000 will allow the FC layer to provide many different combinations of features and learn a more complex non-linear function that represents the feature space. The number of nodes in this layer can be whatever we want it to be and isn’t constrained by any previous dimensions - this is the thing that kept confusing me when I looked at other CNNs. Sometimes it’s also seen that there are two FC layers together, this just increases the possibility of learning a complex function. FC layers are 1D vectors.</p>
<p>However, FC layers act as ‘black boxes’ and are notoriously uninterpretable. They’re also prone to overfitting so <strong>dropout</strong> is often performed (discussed below).</p>
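<p>If the shapes are what’s confusing, it may help to push some random numbers through them. This is only a sketch of the dimensions described above: the random values, the ReLU on the hidden layer and all of the variable names are assumptions made for the example.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Output of the final pooling layer: 64 feature maps of 3 x 3
pooled = rng.normal(size=(3, 3, 64))
flat = pooled.reshape(-1)                      # 3 * 3 * 64 = 576 values

# FC layer: we are free to choose the number of nodes (1000 here)
W_fc = rng.normal(scale=0.1, size=(1000, flat.size))
b_fc = np.zeros(1000)
hidden = np.maximum(0, W_fc @ flat + b_fc)     # each node has 576 weights

# Output layer: one node per class (10 digits)
W_out = rng.normal(scale=0.1, size=(10, 1000))
b_out = np.zeros(10)
scores = W_out @ hidden + b_out

print(flat.shape, hidden.shape, scores.shape)  # (576,) (1000,) (10,)
```

<p>Only the 576 and the 10 are dictated by the rest of the network; the 1000 in the middle is our choice.</p>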
<h4 id = "fcConv"> Fully-connected as a Convolutional Layer </h4>
<p>If the idea above doesn’t help you, let’s remove the FC layer and replace it with another convolutional layer. This is very simple - take the output from the pooling layer as before and apply a convolution to it with a kernel that is the same size as a featuremap in the pooling layer. For this to be of use, the input to the conv should be down to around [5 x 5] or [3 x 3] by making sure there have been enough pooling layers in the network. What does this achieve? By convolving a [3 x 3] image with a [3 x 3] kernel we get a 1 pixel output. There is no striding, just one convolution per featuremap. So our output from this layer will be a [1 x k] vector where <em>k</em> is the number of featuremaps. This is very similar to the FC layer, except that the output from the conv is only created from an individual featuremap rather than being connected to all of the featuremaps.</p>
<p>But, isn’t this more weights to learn? Yes, so it isn’t done. Instead, we perform either <em>global average pooling</em> or <em>global max pooling</em> where the <em>global</em> refers to a whole single feature map (not the whole set of feature maps). So we’re taking the average of all points in the feature and repeating this for each feature to get the [1 x k] vector as before. Note that the number of channels (kernels/features) in the last conv layer has to be equal to the number of outputs we want, or else we have to include an FC layer to change the [1 x k] vector to what we need.</p>
<p>This can be powerful as we have represented a very large receptive field by a single pixel and also removed some spatial information that allows us to try and take into account translations of the input. We’re able to say, if the value of the output is high, that all of the featuremaps visible to this output have activated enough to represent a ‘cat’ or whatever it is we are training our network to learn.</p>
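<p>Global pooling is a one-liner in numpy. A small sketch, assuming the feature maps are stored as (height, width, channels) and using made-up values so the result is easy to check by eye:</p>

```python
import numpy as np

# 4 feature maps of 3 x 3, stored as (height, width, channels)
feature_maps = np.arange(3 * 3 * 4, dtype=float).reshape(3, 3, 4)

# "Global" means over the whole of each feature map, one map at a time
gap = feature_maps.mean(axis=(0, 1))  # global average pooling -> [1 x k]
gmp = feature_maps.max(axis=(0, 1))   # global max pooling     -> [1 x k]

print(gap)  # [16. 17. 18. 19.]
print(gmp)  # [32. 33. 34. 35.]
```

<p>Each [3 x 3] map collapses to one number, giving a vector with one entry per feature map and no extra weights to learn.</p>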
<h3 id="dropout"> Dropout Layer </h3>
<p>The previously mentioned fully-connected layer is connected to every output of the previous layer - this can be a very large number of weights. As such, an FC layer is prone to <em>overfitting</em> meaning that the network won’t generalise well to new data. There are a number of techniques that can be used to reduce overfitting though the most commonly seen in CNNs is the dropout layer, proposed by Hinton and colleagues. As the name suggests, this causes the network to ‘drop’ some nodes on each iteration with a particular probability. The <em>dropout probability</em> is between 0 and 1, most commonly around 0.2-0.5 it seems. This is the probability that a particular node is dropped during training (some libraries ask instead for the <em>keep probability</em>, which is 1 minus this value). When back propagation occurs, the weights connected to these nodes are not updated. They are re-added for the next iteration before another set is chosen for dropout.</p>
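<p>A minimal sketch of the idea in numpy, taking the probability that a node is dropped as <code>drop_prob</code>. The rescaling by <code>1 / keep_prob</code> (so-called inverted dropout) is a common variant that isn’t described above; it is included here as an assumption so that nothing needs to change at test time.</p>

```python
import numpy as np


def dropout(activations, drop_prob, rng, training=True):
    """Randomly zero nodes with probability drop_prob during training."""
    if not training or drop_prob == 0.0:
        return activations  # nothing is dropped when we are just predicting
    keep_prob = 1.0 - drop_prob
    # A fresh random mask is drawn on every call, i.e. on every iteration
    mask = rng.random(activations.shape) < keep_prob
    # Assumption: rescale the survivors so the expected activation is unchanged
    return activations * mask / keep_prob


rng = np.random.default_rng(42)
activations = np.ones(10000)
dropped = dropout(activations, 0.5, rng)
print("fraction dropped:", np.mean(dropped == 0))
```

<p>Because the dropped nodes output zero, no gradient flows through them and their weights are left alone for that iteration.</p>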
<h3 id="output"> Output Layer </h3>
<p>Of course depending on the purpose of your CNN, the output layer will be slightly different. In general, the output layer consists of a number of nodes which have a high value if they are ‘true’ or activated. Consider a classification problem where a CNN is given a set of images containing cats, dogs and elephants. If we’re asking the CNN to learn what a cat, dog and elephant looks like, the output layer is going to be a set of three nodes, one for each ‘class’ or animal. We’d expect that when the CNN finds an image of a cat, the value at the node representing ‘cat’ is higher than the other two. This is the same idea as in a regular neural network. In fact, the FC layer and the output layer can be considered as a traditional NN where we also usually include a softmax activation function. Some output layers are probabilities and as such will sum to 1, whilst others will just achieve a value which could be a pixel intensity in the range 0-255. The output can also consist of a single node if we’re doing regression or deciding if an image belongs to a specific class or not e.g. diseased or healthy. Commonly, however, even binary classification is proposed with 2 nodes in the output and trained with labels that are ‘one-hot’ encoded i.e. [1,0] for class 0 and [0,1] for class 1.</p>
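<p>For the three-animal example, a softmax output and a one-hot label might look like the sketch below. The raw scores are invented for illustration, and subtracting the maximum before exponentiating is a standard numerical-stability trick rather than something from this post.</p>

```python
import numpy as np


def softmax(scores):
    """Turn raw output-node values into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # avoids overflow in exp, result is unchanged
    exps = np.exp(shifted)
    return exps / exps.sum()


classes = ["cat", "dog", "elephant"]
scores = np.array([2.0, 0.5, -1.0])    # made-up values at the three output nodes
probs = softmax(scores)
predicted = classes[int(np.argmax(probs))]

# One-hot training label for an image of a cat (class 0)
one_hot = np.eye(len(classes))[0]

print(probs, predicted, one_hot)
```

<p>The node with the highest score keeps the highest probability, so the prediction here is ‘cat’.</p>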
<h3 id="backProp"> A Note on Back Propagation </h3>
<p>I’ve found it helpful to consider CNNs in reverse. It didn’t sit properly in my mind that the CNN first learns all different types of edges, curves etc. and then builds them up into large features e.g. a face. It came up in a discussion with a colleague that we could consider the CNN working in reverse, and in fact this is effectively what happens - back propagation updates the weights from the final layer <em>back</em> towards the first. In fact, the error (or loss) minimisation occurs firstly at the final layer and as such, this is where the network is ‘seeing’ the bigger picture. The gradient (updates to the weights) vanishes towards the input layer and is greatest at the output layer. We can effectively think that the CNN is learning “face - has eyes, nose, mouth” at the output layer, then “I don’t know what a face is, but here are some eyes, noses, mouths” in the previous one, then “What are eyes? I’m only seeing circles, some white bits and a black hole” followed by “woohoo! round things!” and initially by “I think that’s what a line looks like”. Possibly we could think of the CNN as being less sure about itself at the first layers and being more advanced at the end.</p>
<p>CNNs can be used for segmentation, classification, regression and all manner of other processes. On the whole, they only differ by four things:</p>
<ul>
<li>architecture (number and order of conv, pool and fc layers plus the size and number of the kernels)</li>
<li>output (probabilistic etc.)</li>
<li>training method (cost or loss function, regularisation and optimiser)</li>
<li>hyperparameters (learning rate, regularisation weights, batch size, iterations…)</li>
</ul>
<p>There may well be other posts which consider these kinds of things in more detail, but for now I hope you have some insight into how CNNs function. Now, let’s code it up…</p>A Simple Neural Network - Simple Performance Improvements
/post/nn-python-tweaks/
Fri, 17 Mar 2017 08:53:55 +0000/post/nn-python-tweaks/<p>The 5th installment of our tutorial on implementing a neural network (NN) in Python. By the end of this tutorial, our NN should perform much more efficiently giving good results with fewer iterations. We will do this by implementing “momentum” into our network. We will also put in the other transfer functions for each layer.</p>
<p></p>
<div id="toctop"></div>
<ol>
<li><a href="#intro">Introduction</a></li>
<li><a href="#momentum">Momentum</a>
<ol>
<li><a href="#momentumbackground">Background</a></li>
<li><a href="#momentumpython">Momentum in Python</a></li>
<li><a href="#momentumtesting">Testing</a></li>
</ol></li>
<li><a href="#transferfunctions">Transfer Functions</a></li>
</ol>
<h2 id="intro"> Introduction </h2>
<p><a href="#toctop">To contents</a></p>
<p>We’ve come so far! The initial <a href="/post/neuralnetwork">maths</a> was a bit of a slog, as was the <a href="/post/nn-more-maths">vectorisation</a> of that maths, but it was important to be able to implement our NN in Python which we did in our <a href="/post/nn-in-python">previous post</a>. So what now? Well, you may have noticed when running the NN as it stands that it isn’t overly quick. Depending on the randomly initialised weights, it may take the network the full number of <code>maxIterations</code> to converge, and then it may not converge at all! But there is something we can do about it. Let’s learn about, and implement, ‘momentum’.</p>
<h2 id="momentum"> Momentum </h2>
<h3 id="momentumbackground"> Background </h3>
<p><a href="#toctop">To contents</a></p>
<p>Let’s revisit our equation for error in the NN:</p>
<div id="eqerror">$$
\text{E} = \frac{1}{2} \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right)^{2}
$$</div>
<p>This isn’t the only error function that could be used. In fact, there’s a whole field of study in NN about the best error or ‘optimisation’ function that should be used. This one tries to look at the sum of the squared-residuals between the outputs and the expected values at the end of each forward pass (the so-called $l_{2}$-norm). Others e.g. $l_{1}$-norm, look at minimising the sum of the absolute differences between the values themselves. There are more complex error (a.k.a. optimisation or cost) functions, for example those that look at the cross-entropy in the data. There may well be a post in the future about different cost-functions, but for now we will still focus on the equation above.</p>
<p>Now this function is described as a ‘convex’ function. This is an important property if we are to make our NN converge to the correct answer. Take a look at the two functions below:</p>
<div id="fig1" class="figure_container">
<div class="figure_images">
<img title="convex" src="/img/simpleNN/convex.png" width="35%" hspace="10px"><img title="non-convex" src="/img/simpleNN/non-convex.png" width="35%" hspace="10px">
</div>
<div class="figure_caption">
<font color="blue">Figure 1</font>: A convex (left) and non-convex (right) cost function
</div>
</div>
<p>Let’s say that our current error was represented by the green ball. Our NN will calculate the gradient of its cost function at this point then look for the direction which is going to <em>minimise</em> the error i.e. go down a slope. The NN will feed the result into the back-propagation algorithm which will hopefully mean that on the next iteration, the error will have decreased. For a <em>convex</em> function, this is very straightforward, the NN just needs to keep going in the direction it found on the first run. But, look at the <em>non-convex</em> function: our current error (green ball) sits at a point where either direction will take it to a lower error i.e. the error decreases on both sides. If the error goes to the left, it will hit <strong>one</strong> of the possible minima of the function, but this will be a higher minimum (higher final error) than if the error had chosen the gradient to the right. Clearly the starting point for the error here has a big impact on the final result. Looking down at the 2D perspective (remembering that these are complex multi-dimensional functions), the non-convex case is clearly more ambiguous in terms of the location of the minimum and direction of descent. The convex function, however, nicely guides the error to the minimum with little care of the starting point.</p>
<div id="fig2" class="figure_container">
<div class="figure_images">
<img title="convexcontour" src="/img/simpleNN/convexcontourarrows.png" width="35%" hspace="10px"><img title="non-convexcontour" src="/img/simpleNN/nonconvexcontourarrows.png" width="35%" hspace="10px">
</div>
<div class="figure_caption">
<font color="blue">Figure 2</font>: Contours for a portion of the convex (left) and non-convex (right) cost function
</div>
</div>
<p>So let’s focus on the convex case and explain what <em>momentum</em> is and why it works. I don’t think you’ll ever see a back propagation algorithm without momentum implemented in some way. In its simplest form, it modifies the weight-update equation:</p>
<div>$$
\mathbf{ \Delta W_{JK} = -\eta \vec{\delta}_{K} \vec{ \mathcal{O}_{J}}}
$$</div>
<p>by adding an extra <em>momentum</em> term:</p>
<div>$$
\mathbf{ \Delta W_{JK}\left(t\right) = -\eta \vec{\delta}_{K} \vec{ \mathcal{O}_{J}}} + m \mathbf{\Delta W_{JK}\left(t-1\right)}
$$</div>
<p>The weight delta (the update amount to the weights after BP) now relies on its <em>previous</em> value i.e. the weight delta now at iteration $t$ requires the value of itself from $t-1$. The $m$ or momentum term, like the learning rate $\eta$ is just a small number between 0 and 1. What effect does this have?</p>
<p>Using prior information about the network is beneficial as it stops the network firing wildly into the unknown. If it can know the previous weights that have given the current error, it can keep the descent to the minimum roughly pointing in the same direction as it was before. The effect is that each iteration does not jump around so much as it would otherwise. In effect, the result is similar to that of the learning rate. We should be careful though, a large value for $m$ may cause the result to jump past the minimum and back again if combined with a large learning rate. We can think of momentum as changing the path taken to the optimum.</p>
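<p>To get a feel for the effect before touching the network, here is a toy sketch of the update rule above on a one-dimensional convex error E(w) = w², whose gradient is 2w. Everything in it (the function <code>descend</code>, the starting point, the learning rate and the number of steps) is invented for illustration and is not part of our NN code.</p>

```python
def descend(grad, w0, eta, m, steps):
    """Gradient descent where each update also carries m times the previous update."""
    w, prev_delta = w0, 0.0
    path = [w]
    for _ in range(steps):
        delta = -eta * grad(w) + m * prev_delta  # Delta W(t) = -eta * grad + m * Delta W(t-1)
        w += delta
        prev_delta = delta
        path.append(w)
    return path


def grad(w):
    """Gradient of the toy error E(w) = w**2."""
    return 2 * w


plain = descend(grad, w0=5.0, eta=0.01, m=0.0, steps=50)
with_momentum = descend(grad, w0=5.0, eta=0.01, m=0.5, steps=50)
print("no momentum:   w =", plain[-1])
print("with momentum: w =", with_momentum[-1])
```

<p>With these settings both runs head steadily towards the minimum at 0, but the run with momentum gets closer in the same number of steps because each update is topped up by the previous one.</p>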
<h3 id="momentumpython"> Momentum in Python </h3>
<p><a href="#toctop">To contents</a></p>
<p>So, implementing momentum into our NN should be pretty easy. We will need to provide a momentum term to the <code>backProp</code> method of the NN and also create a new matrix in which to store the weight deltas from the current epoch for use in the subsequent one.</p>
<p>In the <code>__init__</code> method of the NN, we need to initialise the previous weight matrix and then give them some values - they’ll start with zeros:</p>
<pre><code class="language-python">def __init__(self, numNodes):
    """Initialise the NN - setup the layers and initial weights"""
    # Layer info
    self.numLayers = len(numNodes) - 1
    self.shape = numNodes
    # Input/Output data from last run
    self._layerInput = []
    self._layerOutput = []
    self._previousWeightDelta = []
    # Create the weight arrays
    for (l1,l2) in zip(numNodes[:-1],numNodes[1:]):
        self.weights.append(np.random.normal(scale=0.1,size=(l2,l1+1)))
        self._previousWeightDelta.append(np.zeros((l2,l1+1)))
</code></pre>
<p>The only other part of the NN that needs to change is the definition of <code>backProp</code>, adding momentum to the inputs and updating the weight equation. Finally, we make sure to save the current weight deltas into the previous-weight-delta matrix:</p>
<pre><code class="language-python">def backProp(self, input, target, trainingRate = 0.2, momentum=0.5):
    """Get the error, deltas and back propagate to update the weights"""
    ...
    weightDelta = trainingRate * thisWeightDelta + momentum * self._previousWeightDelta[index]
    self.weights[index] -= weightDelta
    self._previousWeightDelta[index] = weightDelta
</code></pre>
<h3 id="momentumtesting"> Testing </h3>
<p><a href="#toctop">To contents</a></p>
<p>Our default values for learning rate and momentum are 0.2 and 0.5 respectively. We can change either of these by including them in the call to <code>backProp</code>. This is the only change to the iteration process:</p>
<pre><code class="language-python">for i in range(maxIterations + 1):
    # The keyword must match the backProp signature (trainingRate)
    Error = NN.backProp(Input, Target, trainingRate=0.2, momentum=0.5)
    if i % 2500 == 0:
        print("Iteration {0}\tError: {1:0.6f}".format(i,Error))
    if Error <= minError:
        print("Minimum error reached at iteration {0}".format(i))
        break

Iteration 100000 Error: 0.000076
Input Output Target
[0 0] [ 0.00491572] [ 0.]
[1 1] [ 0.00421318] [ 0.]
[0 1] [ 0.99586268] [ 1.]
[1 0] [ 0.99586257] [ 1.]
</code></pre>
<p>Feel free to play around with these numbers, however, it would be unlikely that much would change right now. I say this because there is only so good that we can get when using only the sigmoid function as our activation function. If you go back and read the post on <a href="/post/transfer-functions">transfer functions</a> you’ll see that it’s more common to use <em>linear</em> functions for the output layer. As it stands, the sigmoid function is unable to output a 1 or a 0 because it is asymptotic at these values. Therefore, no matter what learning rate or momentum we use, the network will never be able to get the best output.</p>
<p>This seems like a good time to implement the other transfer functions.</p>
<h3 id="transferfunctions"> Transfer Functions </h3>
<p><a href="#toctop">To contents</a></p>
<p>We’ve already gone through writing the transfer functions in Python in the <a href="/post/transfer-functions">transfer functions</a> post. We’ll just put these under the sigmoid function we defined earlier. I’m going to use <code>sigmoid</code>, <code>linear</code>, <code>gaussian</code> and <code>tanh</code> here.</p>
<p>To modify the network, we need to assign each layer its own activation function, so let’s put that in the ‘layer information’ part of the <code>__init__</code> method:</p>
<pre><code class="language-python">def __init__(self, numNodes, transferFunctions=None):
    """Initialise the Network"""
    # Layer information
    self.numLayers = len(numNodes) - 1
    self.shape = numNodes
    if transferFunctions is None:
        # Defaults: sigmoid for hidden layers, linear for the output layer
        layerTFs = []
        for i in range(self.numLayers):
            if i == self.numLayers - 1:
                layerTFs.append(linear)
            else:
                layerTFs.append(sigmoid)
    else:
        if len(numNodes) != len(transferFunctions):
            raise ValueError("Number of transfer functions must match the number of layers (give None for the input layer)")
        elif transferFunctions[0] is not None:
            raise ValueError("The Input layer doesn't need a transfer function: give it [None,...]")
        else:
            # Drop the None placeholder for the input layer
            layerTFs = transferFunctions[1:]
    self.tFunctions = layerTFs
</code></pre>
<p>Let’s go through this. We input into the initialisation a parameter called <code>transferFunctions</code> with a default value of <code>None</code>. If the default is taken, or if the parameter is omitted, we set some defaults. For each layer, we use the <code>sigmoid</code> function, unless it’s the output layer where we will use the <code>linear</code> function. If a list of <code>transferFunctions</code> is given, first, check that it’s a ‘legal’ input. If the number of functions in the list is not the same as the number of layers (given by <code>numNodes</code>) then throw an error. Also, if the first function in the list is not <code>None</code> throw an error, because the first layer shouldn’t have an activation function (it is the input layer). If those two things are fine, go ahead and store the list of functions as <code>layerTFs</code> without the first (element 0) one.</p>
<p>We next need to replace all of our calls directly to <code>sigmoid</code> and its derivative. These should now refer to the list of functions via an <code>index</code> that depends on the number of the current layer. There are 3 instances of this in our NN: 1 in the forward pass where we call <code>sigmoid</code> directly, and 2 in the <code>backProp</code> method where we call the derivative at the output and hidden layers. So <code>sigmoid(layerInput)</code> for example should become:</p>
<pre><code class="language-python">self.tFunctions[index](layerInput)
</code></pre>
<p>Check the updated code <a href="/docs/simpleNN-improvements.py">here</a> if that’s confusing.</p>
<p>Let’s test this out! We’ll modify the call to initialising the NN by adding a list of functions like so:</p>
<pre><code class="language-python">Input = np.array([[0,0],[1,1],[0,1],[1,0]])
Target = np.array([[0.0],[0.0],[1.0],[1.0]])
transferFunctions = [None, sigmoid, linear]
NN = backPropNN((2,2,1), transferFunctions)
</code></pre>
<p>Running the NN like this with the default learning rate and momentum should provide you with an immediate performance boost simply because with the <code>linear</code> function we’re now able to get closer to the target values, reducing the error.</p>
<pre><code class="language-python">Iteration 0 Error: 1.550211
Iteration 2500 Error: 1.000000
Iteration 5000 Error: 0.999999
Iteration 7500 Error: 0.999999
Iteration 10000 Error: 0.999995
Iteration 12500 Error: 0.999969
Minimum error reached at iteration 14543
Input Output Target
[0 0] [ 0.0021009] [ 0.]
[1 1] [ 0.00081154] [ 0.]
[0 1] [ 0.9985881] [ 1.]
[1 0] [ 0.99877479] [ 1.]
</code></pre>
<p>Play around with the number of layers and different combinations of transfer functions as well as tweaking the learning rate and momentum. You’ll soon get a feel for how each changes the performance of the NN.</p>A Simple Neural Network - With Numpy in Python
/post/nn-in-python/
Wed, 15 Mar 2017 09:55:00 +0000/post/nn-in-python/<p>Part 4 of our tutorial series on Simple Neural Networks. We’re ready to write our Python script! Having gone through the maths, vectorisation and activation functions, we’re now ready to put it all together and write it up. By the end of this tutorial, you will have a working NN in Python, using only numpy, which can be used to learn the output of logic gates (e.g. XOR)
</p>
<div id="toctop"></div>
<ol>
<li><a href="#intro">Introduction</a></li>
<li><a href="#transferfunction">Transfer Function</a></li>
<li><a href="#backpropclass">Back Propagation Class</a>
<ol>
<li><a href="#initialisation">Initialisation</a></li>
<li><a href="#forwardpass">Forward Pass</a></li>
<li><a href="#backprop">Back Propagation</a></li>
</ol></li>
<li><a href="#testing">Testing</a></li>
<li><a href="#iterating">Iterating</a></li>
</ol>
<h3 id="intro"> Introduction </h3>
<p><a href="#toctop">To contents</a></p>
<p>We’ve <a href="/post/neuralnetwork">ploughed through the maths</a>, then <a href="/post/nn-more-maths">some more</a>, now we’re finally here! This tutorial will run through the coding up of a simple neural network (NN) in Python. We’re not going to use any fancy packages (though they obviously have their advantages in tools, speed, efficiency…) we’re only going to use numpy!</p>
<p>By the end of this tutorial, we will have built an algorithm which will create a neural network with as many layers (and nodes) as we want. It will be trained by taking in multiple training examples and running the back propagation algorithm many times.</p>
<p>Here are the things we’re going to need to code:</p>
<ul>
<li>The transfer functions</li>
<li>The forward pass</li>
<li>The back propagation algorithm</li>
<li>The update function</li>
</ul>
<p>To keep things nice and contained, the forward pass and back propagation algorithms should be coded into a class. We’re going to expect that we can build a NN by creating an instance of this class which has some internal functions (forward pass, delta calculation, back propagation, weight updates).</p>
<p>First things first… let’s import numpy:</p>
<pre><code class="language-python">import numpy as np
</code></pre>
<p>Now let’s go ahead and get the first bit done:</p>
<h2 id="transferfunction"> Transfer Function </h2>
<p><a href="#toctop">To contents</a></p>
<p>To begin with, we’ll focus on getting the network working with just one transfer function: the sigmoid function. As we discussed in a <a href="/post/transfer-functions">previous post</a> this is very easy to code up because of its simple derivative:</p>
<div >$$
f\left(x_{i} \right) = \frac{1}{1 + e^{ - x_{i} }} \ \ \ \
f^{\prime}\left( x_{i} \right) = \sigma(x_{i}) \left( 1 - \sigma(x_{i}) \right)
$$</div>
<pre><code class="language-python">def sigmoid(x, Derivative=False):
    if not Derivative:
        return 1 / (1 + np.exp(-x))
    else:
        out = sigmoid(x)
        return out * (1 - out)
</code></pre>
<p>This is a succinct expression which actually calls itself in order to get a value to use in its derivative. We’ve used numpy’s exponential function to create the sigmoid function and created an <code>out</code> variable to hold this in the derivative. Whenever we want to use this function, we can supply the parameter <code>True</code> to get the derivative. We can omit this, or enter <code>False</code>, to just get the output of the sigmoid. This is the same function I used to get the graphs in the <a href="/post/transfer-functions">post on transfer functions</a>.</p>
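<p>A quick sanity check never hurts. The sketch below repeats the function so that it runs on its own, then compares the analytic derivative with a finite-difference estimate; the test points and the step size <code>h</code> are arbitrary choices made for this example.</p>

```python
import numpy as np


def sigmoid(x, Derivative=False):
    if not Derivative:
        return 1 / (1 + np.exp(-x))
    else:
        out = sigmoid(x)
        return out * (1 - out)


x = np.array([-2.0, 0.0, 2.0])
h = 1e-6

# Central finite difference: (f(x + h) - f(x - h)) / 2h
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
analytic = sigmoid(x, True)

print(sigmoid(x))
print(analytic)
print(np.allclose(numeric, analytic))
```

<p>At x = 0 the sigmoid is exactly 0.5 and its slope is 0.25, which is the steepest it ever gets.</p>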
<h2 id="backpropclass"> Back Propagation Class </h2>
<p><a href="#toctop">To contents</a></p>
<p>I’m fairly new to building my own classes in Python, but for this tutorial, I really relied on the videos of <a href="https://www.youtube.com/playlist?list=PLRyu4ecIE9tibdzuhJr94uQeKnOFkkbq6">Ryan on YouTube</a>. Some of his hacks were very useful so I’ve taken some of those on board, but I’ve made a lot of the variables more self-explanatory.</p>
<p>First we’re going to get the skeleton of the class set up. This means that whenever we create a new variable with the class of <code>backPropNN</code>, it will be able to access all of the functions and variables within itself.</p>
<p>It looks like this:</p>
<pre><code class="language-python">class backPropNN:
    """Class defining a NN using Back Propagation"""
    # Class Members (internal variables that are accessed with backPropNN.member)
    numLayers = 0
    shape = None
    weights = []

    # Class Methods (internal functions that can be called)
    def __init__(self):
        """Initialise the NN - setup the layers and initial weights"""

    # Forward Pass method
    def FP(self):
        """Get the input data and run it through the NN"""

    # TrainEpoch method
    def backProp(self):
        """Get the error, deltas and back propagate to update the weights"""
</code></pre>
<p>We’ve not added any detail to the functions (or methods) yet, but we know there needs to be an <code>__init__</code> method for any class, plus we’re going to want to be able to do a forward pass and then back propagate the error.</p>
<p>We’ve also added a few class members, variables which can be called from an instance of the <code>backPropNN</code> class. <code>numLayers</code> is just that, a count of the number of layers in the network, initialised to <code>0</code>. The <code>shape</code> of the network will return the size of each layer of the network in an array and the <code>weights</code> will return an array of the weights across the network.</p>
<h3 id="initialisation"> Initialisation </h3>
<p><a href="#toctop">To contents</a></p>
<p>We’re going to make the user supply an input variable which is the size of the layers in the network i.e. the number of nodes in each layer: <code>numNodes</code>. This will be an array which is the length of the number of layers (including the input and output layers) where each element is the number of nodes in that layer.</p>
<pre><code class="language-python">def __init__(self, numNodes):
    """Initialise the NN - setup the layers and initial weights"""
    # Layer information
    self.numLayers = len(numNodes) - 1
    self.shape = numNodes
</code></pre>
<p>We’ve told our network to ignore the input layer when counting the number of layers (common practice) and that the shape of the network should be returned as the input array <code>numNodes</code>.</p>
<p>Let’s also initialise the weights. We will take the approach of initialising all of the weights to small, random numbers. To keep the code succinct, we’ll use a neat function, <code>zip</code>. <code>zip</code> takes two sequences and pairs up the elements in corresponding locations (like a zip). For example:</p>
<pre><code class="language-python">A = [1, 2, 3]
B = [4, 5, 6]
list(zip(A, B))  # list() is needed in Python 3, where zip returns an iterator
# [(1, 4), (2, 5), (3, 6)]
</code></pre>
<p>Why might this be useful? Well, when we talk about weights we’re talking about the connections between layers. Let’s say we have <code>numNodes=(2, 2, 1)</code> i.e. a 2 layer network with 2 inputs, 1 output and 2 nodes in the hidden layer. Then we need to let the algorithm know that we expect 2 input nodes to send weights to 2 hidden nodes, then 2 hidden nodes to send weights to 1 output node, or <code>[(2,2), (2,1)]</code>. Note that overall we will have 4 weights from the input to the hidden layer, and 2 weights from the hidden to the output layer (not counting the bias weights, which we’ll come to shortly).</p>
<p>What is our <code>A</code> and <code>B</code> in the code above that will give us <code>[(2,2), (2,1)]</code>? It’s this:</p>
<pre><code class="language-python">numNodes = (2, 2, 1)
A = numNodes[:-1]    # (2, 2)
B = numNodes[1:]     # (2, 1)
list(zip(A, B))      # [(2, 2), (2, 1)]
</code></pre>
<p>Great! So each pair represents the nodes between which we need to initialise some weights. In fact, each pair such as <code>(2,2)</code> is the clue to how many weights we are going to need between each layer e.g. between the input and hidden layers we are going to need <code>(2 x 2) = 4</code> weights.</p>
<p>So <code>for</code> each pair <code>in zip(A,B)</code> (hint hint) we need to <code>append</code> some weights into that empty weight list we initialised earlier.</p>
<pre><code class="language-python"># Initialise the weight arrays
self.weights = []  # give this instance its own list rather than sharing the class-level one
for (l1, l2) in zip(numNodes[:-1], numNodes[1:]):
    self.weights.append(np.random.normal(scale=0.1, size=(l2, l1+1)))
</code></pre>
<p>We use <code>self.weights</code> as we’re appending to the <code>weights</code> member of the class. We’re using the numpy random number generator to draw from a <code>normal</code> distribution. The <code>scale</code> is the standard deviation, so numpy chooses small numbers centred on 0 that mostly lie within a few tenths of it, and <code>size</code> says that we want a matrix of results with the shape of the tuple <code>(l2,l1+1)</code>. Huh, <code>+1</code>? Don’t think we’re getting away without including the <em>bias</em> term! We want a random starting point even for the weight connecting the bias node (<code>=1</code>) to the next layer. Ok, but why this way and not <code>(l1+1,l2)</code>? Well, each of the <code>l2</code> nodes in the next layer receives a connection from each of the <code>l1+1</code> nodes in the previous layer, so we give each receiving node a row and each sending node a column. We’re creating a matrix of weights which goes across the sending nodes and down the receiving nodes, or as we’ve seen in our maths tutorial:</p>
<div>$$
W_{ij} = \begin{pmatrix} w_{11} & w_{21} & w_{31} \\ w_{12} &w_{22} & w_{32} \end{pmatrix}, \ \ \ \
W_{jk} = \begin{pmatrix} w_{11} & w_{21} & w_{31} \end{pmatrix}
$$</div>
<p>These are the weights between the first two layers and between the last two layers respectively, with node 3 being the bias node.</p>
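<p>To see these shapes come out of the initialisation loop, here is a minimal standalone sketch. It assumes the <code>(2, 2, 1)</code> network from above and uses a seeded numpy generator (<code>rng</code>, an assumption made only so that the run is repeatable) in place of the class.</p>

```python
import numpy as np

# Hypothetical layer sizes: 2 inputs, 2 hidden nodes, 1 output node
numNodes = (2, 2, 1)

rng = np.random.default_rng(0)  # seeded generator, assumed here for repeatability
weights = []
for l1, l2 in zip(numNodes[:-1], numNodes[1:]):
    # One row per node in the next layer, one column per node in the
    # previous layer plus one extra column for the bias node
    weights.append(rng.normal(scale=0.1, size=(l2, l1 + 1)))

weight_shapes = [w.shape for w in weights]
print(weight_shapes)  # [(2, 3), (1, 3)]
```

<p>The two shapes match the two matrices written out above: 2 rows of 3 weights into the hidden layer and 1 row of 3 weights into the output layer.</p>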
<p>Before we move on, let’s also put in some placeholders in <code>__init__</code> for the input and output values to each layer:</p>
<pre><code class="language-python">self._layerInput = []
self._layerOutput = []
</code></pre>
<h3 id="forwardpass"> Forward Pass </h3>
<p><a href="#toctop">To contents</a></p>
<p>We’ve now initialised our network enough to be able to focus on the forward pass (FP).</p>
<p>Our <code>FP</code> function needs to have the input data. It needs to know how many training examples it’s going to have to go through, and it will need to reassign the inputs and outputs at each layer, so let’s clean those at the beginning:</p>
<pre><code class="language-python">def FP(self, input):
    numExamples = input.shape[0]
    # Clean away the values from the previous forward pass
    self._layerInput = []
    self._layerOutput = []
</code></pre>
<p>So let’s propagate. We already have a matrix of (randomly initialised) weights. We just need to know what the input is to each of the layers. We’ll separate this into the first hidden layer, and subsequent hidden layers.</p>
<p>For the first hidden layer we will write:</p>
<pre><code class="language-python">layerInput = self.weights[0].dot(np.vstack([input.T, np.ones([1, numExamples])]))
</code></pre>
<p>Let’s break this down:</p>
<p>Our training example inputs need to match the weights that we’ve already created. We expect that our examples will come in rows of an array with columns acting as features, something like <code>[(0,0), (0,1),(1,1),(1,0)]</code>. We can use numpy’s <code>vstack</code> to put each of these examples one on top of the other.</p>
<p>Each of the input examples is a vector which will be multiplied by the weight matrix to get the input to the current layer:</p>
<div>$$
\mathbf{x_{J}} = \mathbf{W_{IJ} \vec{\mathcal{O}}_{I}}
$$</div>
<p>where $\mathbf{x_{J}}$ are the inputs to the layer $J$ and $\mathbf{\vec{\mathcal{O}}_{I}}$ is the output from the previous layer (the input examples in this case).</p>
<p>So given a set of $n$ input examples we <code>vstack</code> them so we just have <code>(n x numInputNodes)</code>. We want to transpose this, <code>(numInputNodes x n)</code> such that we can multiply by the weight matrix which is <code>(numOutputNodes x numInputNodes)</code>. This gives an input to the layer which is <code>(numOutputNodes x n)</code> as we expect.</p>
<p><strong>Note</strong> we’re actually going to do the transposition first before doing the <code>vstack</code> - this does exactly the same thing, but it also allows us to more easily add the bias nodes in to each input.</p>
<p>Bias! Let’s not forget this: we add a bias node which always has the value <code>1</code> to each input (including the input layer). So our actual method is:</p>
<ol>
<li>Transpose the inputs <code>input.T</code></li>
<li>Add a row of ones to the bottom (one bias node for each input) <code>[input.T, np.ones([1,numExamples])]</code></li>
<li><code>vstack</code> this to compact the array <code>np.vstack(...)</code></li>
<li>Multiply with the weights connecting from the previous to the current layer <code>self.weights[0].dot(...)</code></li>
</ol>
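<p>The four steps above can be sketched on their own, outside of the class. This is only an illustration: the weight matrix <code>W</code> is a made-up constant matrix rather than the randomly initialised one, chosen so that the result is easy to check by hand.</p>

```python
import numpy as np

# Four two-feature examples, one example per row
X = np.array([[0, 0], [0, 1], [1, 1], [1, 0]])
numExamples = X.shape[0]

# Transpose so that each example is a column, then stack a row of ones
# underneath to act as the bias node for every example
withBias = np.vstack([X.T, np.ones([1, numExamples])])

# Hypothetical weights for 2 hidden nodes: shape (2, 2 inputs + 1 bias)
W = np.full((2, 3), 0.1)
layerInput = W.dot(withBias)

print(withBias.shape)    # (3, 4)
print(layerInput.shape)  # (2, 4)
```

<p>Each column of <code>layerInput</code> holds the inputs to the two hidden nodes for one training example.</p>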
<p>But what about the subsequent hidden layers? We’re not using the input examples in these layers, we are using the output from the previous layer <code>[self._layerOutput[-1]]</code> (multiplied by the weights).</p>
<pre><code class="language-python">for index in range(self.numLayers):
    # Get input to the layer
    if index == 0:
        layerInput = self.weights[0].dot(np.vstack([input.T, np.ones([1, numExamples])]))
    else:
        layerInput = self.weights[index].dot(np.vstack([self._layerOutput[-1], np.ones([1, numExamples])]))
</code></pre>
<p>Make sure to save this input, but also to now calculate (and save) the output of the current layer i.e.:</p>
<div>$$
\mathbf{ \vec{ \mathcal{O}}_{J}} = \sigma(\mathbf{x_{J}})
$$</div>
<pre><code class="language-python">self._layerInput.append(layerInput)
self._layerOutput.append(sigmoid(layerInput))
</code></pre>
<p>Finally, make sure that we’re returning the data from our output layer the same way that we got it:</p>
<pre><code class="language-python">return self._layerOutput[-1].T
</code></pre>
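<p>Putting the pieces of the forward pass together, a standalone version might look like the sketch below. The names <code>forward_pass</code> and <code>derivative</code> are illustrative only, and the <code>sigmoid</code> here is a stand-in that is assumed to behave like the one used by the class.</p>

```python
import numpy as np

def sigmoid(x, derivative=False):
    """Logistic function; an assumed stand-in for the sigmoid used by the class."""
    out = 1.0 / (1.0 + np.exp(-x))
    return out * (1.0 - out) if derivative else out

def forward_pass(weights, X):
    """Run the examples (one per row of X) through the layers described by weights."""
    numExamples = X.shape[0]
    layerInputs, layerOutputs = [], []
    previous = X.T  # examples as columns
    for W in weights:
        # Append the bias row, then multiply by this layer's weights
        layerInput = W.dot(np.vstack([previous, np.ones([1, numExamples])]))
        layerInputs.append(layerInput)
        layerOutputs.append(sigmoid(layerInput))
        previous = layerOutputs[-1]
    return layerOutputs[-1].T, layerInputs, layerOutputs

rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.1, size=(2, 3)), rng.normal(scale=0.1, size=(1, 3))]
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])
output, _, _ = forward_pass(weights, X)
print(output.shape)  # (4, 1): one row per example, the same way the inputs were supplied
```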
<h3 id="backprop">Back Propagation</h3>
<p><a href="#toctop">To contents</a></p>
<p>We’ve successfully sent the data from the input layer to the output layer using some initially randomised weights <strong>and</strong> we’ve included the bias term (a kind of threshold on the activation functions). Our vectorised equations from the previous post will now come into play:</p>
<div>$$
\begin{align}
\mathbf{\vec{\delta}_{K}} &= \sigma^{\prime}\left( \mathbf{W_{JK}}\mathbf{\vec{\mathcal{O}}_{J}} \right) * \left( \mathbf{\vec{\mathcal{O}}_{K}} - \mathbf{T_{K}}\right) \\[0.5em]
\mathbf{ \vec{ \delta }_{J}} &= \sigma^{\prime} \left( \mathbf{ W_{IJ} \mathcal{O}_{I} } \right) * \mathbf{ W^{\intercal}_{JK}} \mathbf{ \vec{\delta}_{K}}
\end{align}
$$</div>
<div>$$
\begin{align}
\mathbf{W_{JK}} + \Delta \mathbf{W_{JK}} &\rightarrow \mathbf{W_{JK}}, \ \ \ \Delta \mathbf{W_{JK}} = -\eta \mathbf{ \vec{ \delta }_{K}} \mathbf{ \vec { \mathcal{O} }_{J}} \\[0.5em]
\vec{\theta} + \Delta \vec{\theta} &\rightarrow \vec{\theta}, \ \ \ \Delta \vec{\theta} = -\eta \mathbf{ \vec{ \delta }_{K}}
\end{align}
$$</div>
<p>With $*$ representing an elementwise multiplication between the matrices.</p>
<p>First, let’s initialise some variables and get the error on the output of the output layer. We assume that the target values have been formatted in the same way as the input values i.e. they are a row-vector per input example. In our forward propagation method, the outputs are stored as column-vectors, thus the targets have to be transposed. We will need to supply the input data, the target data and $\eta$, the learning rate, which we will set to some small number by default. So we start back propagation by first initialising a placeholder for the deltas and getting the number of training examples before running them through the <code>FP</code> method:</p>
<pre><code class="language-python">def backProp(self, input, target, trainingRate = 0.2):
    """Get the error, deltas and back propagate to update the weights"""
    delta = []
    numExamples = input.shape[0]
    # Do the forward pass
    self.FP(input)
    # [-1] picks out the output layer
    output_delta = self._layerOutput[-1] - target.T
    error = np.sum(output_delta**2)
</code></pre>
<p>We know from previous posts that the error is squared to get rid of the negatives. From this we compute the deltas for the output layer:</p>
<pre><code class="language-python">delta.append(output_delta * sigmoid(self._layerInput[-1], True))
</code></pre>
<p>We now have the error but need to know what direction to alter the weights in, so we need the gradient at the input to the layer. We get the gradient of the activation function at the input to the layer and multiply it with the error. Notice we’ve supplied <code>True</code> to the sigmoid function to get its derivative.</p>
<p>This is the delta for the output layer, so this calculation is only done when we’re considering the index at the end of the network. We should be careful that when telling the algorithm that this is the “last layer” we take account of the zero-indexing in Python: the last layer is <code>self.numLayers - 1</code>, so in a network with 2 layers, <code>layer[2]</code> does not exist.</p>
<p>We also need to get the deltas of the intermediate hidden layers. To do this, (according to our equations above) we have to ‘pull back’ the delta from the output layer first. More accurately, for any hidden layer, we pull back the delta from the <em>next</em> layer, which may well be another hidden layer. These deltas from the <em>next</em> layer are multiplied by the weights from the <em>next</em> layer <code>[index + 1]</code>, before getting the product with the sigmoid derivative evaluated at the <em>current</em> layer.</p>
<p><strong>Note</strong>: this is <em>back</em> propagation. We have to start at the end and work back to the beginning. We use the built-in <code>reversed</code> function in our loop to ensure that the algorithm considers the layers in reverse order.</p>
<p>Combining this into one method:</p>
<pre><code class="language-python"># Calculate the deltas
for index in reversed(range(self.numLayers)):
    if index == self.numLayers - 1:
        # If the output layer, then compare to the target values
        output_delta = self._layerOutput[index] - target.T
        error = np.sum(output_delta**2)
        delta.append(output_delta * sigmoid(self._layerInput[index], True))
    else:
        # If a hidden layer, compare to the following layer's delta
        delta_pullback = self.weights[index + 1].T.dot(delta[-1])
        delta.append(delta_pullback[:-1,:] * sigmoid(self._layerInput[index], True))
</code></pre>
<p>Pick this piece of code apart. This is an important snippet as it calculates all of the deltas for all of the nodes in the network. Be sure that we understand:</p>
<ol>
<li>This is a <code>reversed</code> loop because we want to deal with the last layer first</li>
<li>The delta of the output layer is the residual between the output and target multiplied with the gradient (derivative) of the activation function <em>at the current layer</em>.</li>
<li>The delta of a hidden layer first needs the product of the <em>subsequent</em> layer’s delta with the <em>subsequent</em> layer’s weights. This is then multiplied with the gradient of the activation function evaluated at the <em>current</em> layer.</li>
</ol>
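<p>The three points above can be followed step by step in a small standalone sketch for a <code>(2, 2, 1)</code> network. The helper <code>add_bias</code>, the seeded weights and the local <code>sigmoid</code> are assumptions made for illustration and are not part of the class.</p>

```python
import numpy as np

def sigmoid(x, derivative=False):
    """Logistic function; an assumed stand-in for the sigmoid used by the class."""
    out = 1.0 / (1.0 + np.exp(-x))
    return out * (1.0 - out) if derivative else out

def add_bias(a):
    """Stack a row of ones under a (nodes x examples) array."""
    return np.vstack([a, np.ones([1, a.shape[1]])])

rng = np.random.default_rng(1)
weights = [rng.normal(scale=0.1, size=(2, 3)), rng.normal(scale=0.1, size=(1, 3))]
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])
T = np.array([[0.0], [0.0], [1.0], [1.0]])

# Forward pass, keeping the input to every layer
hiddenInput = weights[0].dot(add_bias(X.T))
outputInput = weights[1].dot(add_bias(sigmoid(hiddenInput)))

# Output layer: residual times the sigmoid gradient at the output layer
deltaOutput = (sigmoid(outputInput) - T.T) * sigmoid(outputInput, True)

# Hidden layer: pull the delta back through the next layer's weights,
# drop the bias row, then multiply by the gradient at the hidden layer
pullback = weights[1].T.dot(deltaOutput)
deltaHidden = pullback[:-1, :] * sigmoid(hiddenInput, True)

print(deltaOutput.shape, deltaHidden.shape)  # (1, 4) (2, 4)
```

<p>Each delta array has one row per node in its layer and one column per training example.</p>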
<p>Double check that this matches up with the equations above too! We can also check the shapes in the matrix multiplication (for a single training example). For the output layer:</p>
<p><code>output_delta</code> = (numOutputNodes x 1) - (1 x numOutputNodes).T = (numOutputNodes x 1)<br>
<code>error</code> = np.sum( (numOutputNodes x 1) **2 ) = a single number<br>
<code>delta</code> = (numOutputNodes x 1) * sigmoid( (numOutputNodes x 1), True ) = (numOutputNodes x 1)</p>
<p>For the hidden layers (take the one previous to the output as example):</p>
<p><code>delta_pullback</code> = (numOutputNodes x numHiddenNodes + 1).T.dot(numOutputNodes x 1) = (numHiddenNodes + 1 x 1), and dropping the bias row with <code>[:-1,:]</code> leaves (numHiddenNodes x 1)<br>
<code>delta</code> = (numHiddenNodes x 1) * sigmoid( (numHiddenNodes x 1), True ) = (numHiddenNodes x 1)</p>
<p>Hurray! We have the delta at each node in our network. We can use them to update the weights for each layer in the network. Remember, to update the weights between layer $J$ and $K$ we need to use the output of layer $J$ and the deltas of layer $K$. This means we need to keep track of the index of the layer we’re currently working on ($J$) and the index of the delta layer ($K$) - not forgetting about the zero-indexing in Python:</p>
<pre><code class="language-python">for index in range(self.numLayers):
    delta_index = self.numLayers - 1 - index
</code></pre>
<p>Let’s first get the outputs from each layer:</p>
<pre><code class="language-python">    if index == 0:
        layerOutput = np.vstack([input.T, np.ones([1, numExamples])])
    else:
        layerOutput = np.vstack([self._layerOutput[index - 1], np.ones([1, self._layerOutput[index - 1].shape[1]])])
</code></pre>
<p>The output of the input layer is just the input examples (which we’ve <code>vstack</code>-ed again) and the output from the other layers we take from the calculation in the forward pass (making sure to add the bias term on the end).</p>
<p>For the current <code>index</code> (layer) let’s use this <code>layerOutput</code> to get the change in weight. We will use a few neat tricks to make this succinct:</p>
<pre><code class="language-python">    thisWeightDelta = np.sum(
        layerOutput[None,:,:].transpose(2,0,1) * delta[delta_index][None,:,:].transpose(2,1,0),
        axis = 0)
</code></pre>
<p>Break it down. We’re looking for $\mathbf{ \vec{ \delta }_{K}} \mathbf{ \vec { \mathcal{O} }_{J}} $ so it’s the delta at <code>delta_index</code>, the next layer along.</p>
<p>We want to be able to deal with all of the input training examples simultaneously. This requires a bit of fancy slicing and transposing of the matrices. Take a look: by calling <code>vstack</code> we made all of the input data and bias terms live in the same 2-dimensional numpy array. When we index this array with <code>[None,:,:]</code>, it tells numpy to take all (<code>:</code>) of the data in the rows and columns, shift it to the 2nd and 3rd dimensions and put a new, empty first dimension (<code>None</code>) in front. We do this to create the three dimensions which we can now transpose into. Calling <code>transpose(2,0,1)</code> instructs numpy to move around the dimensions of the data so that the examples come first. This creates an array where each example now lives in its own plane. The same is done for the deltas of the subsequent layer, but being careful to transpose them in the opposite direction so that, for each example, multiplying the two gives every combination of delta and output. The <code>axis = 0</code> supplied to <code>np.sum</code> then adds these products up over all of the examples.</p>
<p>This looks incredibly complicated. It can be broken down into a for-loop over the input examples, but this reduces the efficiency of the network. Taking advantage of the numpy array like this keeps our calculations fast. In reality, if you’re struggling with this particular part, just copy and paste it, forget about it and be happy with yourself for understanding the maths behind back propagation, even if this random bit of Python is perplexing.</p>
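<p>If it helps, the sketch below compares the slicing-and-transposing version against the for-loop over examples and against a single matrix product. The arrays are random stand-ins with the right shapes (3 rows for 2 node outputs plus the bias row, and 2 rows of deltas), not values taken from a trained network.</p>

```python
import numpy as np

rng = np.random.default_rng(2)
numExamples = 4
# Random stand-ins: 3 rows play the role of 2 node outputs plus the bias row,
# 2 rows play the role of the deltas of the next layer
layerOutput = rng.normal(size=(3, numExamples))
delta = rng.normal(size=(2, numExamples))

# The slicing-and-transposing version
broadcast = np.sum(
    layerOutput[None, :, :].transpose(2, 0, 1) * delta[None, :, :].transpose(2, 1, 0),
    axis=0)

# Explicit loop: one outer product per example, added together
looped = np.zeros((2, 3))
for n in range(numExamples):
    looped += np.outer(delta[:, n], layerOutput[:, n])

# A single matrix product gives the same result
product = delta.dot(layerOutput.T)

print(np.allclose(broadcast, looped), np.allclose(broadcast, product))
```

<p>All three give the same <code>(2 x 3)</code> array, which is the shape of the weight matrix being updated.</p>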
<p>Anyway. Let’s take this set of weight deltas and put back the $\eta$. We’ve called this the <code>trainingRate</code> in our code, though it’s more commonly known as the learning rate. We’ll update the weights by making sure to include the <code>-</code> from the $-\eta$.</p>
<pre><code class="language-python">    weightDelta = trainingRate * thisWeightDelta
    self.weights[index] -= weightDelta
</code></pre>
<p>The <code>-=</code> is Python shorthand for: take the current value and subtract the value of <code>weightDelta</code>.</p>
<p>To finish up, we want our back propagation to return the current error in the network, so:</p>
<pre><code class="language-python">return error
</code></pre>
<h2 id="testing"> A Toy Example</h2>
<p><a href="#toctop">To contents</a></p>
<p>Believe it or not, that’s it! The fundamentals of forward and back propagation have now been implemented in Python. If you want to double check your code, have a look at my completed .py <a href="/docs/simpleNN.py">here</a>.</p>
<p>Let’s test it!</p>
<pre><code class="language-python">Input = np.array([[0,0],[1,1],[0,1],[1,0]])
Target = np.array([[0.0],[0.0],[1.0],[1.0]])
NN = backPropNN((2,2,1))
Error = NN.backProp(Input, Target)
Output = NN.FP(Input)
print('Input \tOutput \t\tTarget')
for i in range(Input.shape[0]):
    print('{0}\t {1} \t{2}'.format(Input[i], Output[i], Target[i]))
</code></pre>
<p>This will provide 4 input examples and the expected targets. We create an instance of the network called <code>NN</code> with 2 layers (2 nodes in the hidden and 1 node in the output layer). We make <code>NN</code> do <code>backProp</code> with the input and target data and then get the output from the final layer by running our input through the network with a <code>FP</code>. The printout is self-explanatory. Give it a try!</p>
<pre><code>Input Output Target
[0 0] [ 0.51624448] [ 0.]
[1 1] [ 0.51688469] [ 0.]
[0 1] [ 0.51727559] [ 1.]
[1 0] [ 0.51585529] [ 1.]
</code></pre>
<p>We can see that the network has taken our inputs, and we have some outputs too. They’re not great, and all seem to live around the same value. This is because we initialised the weights across the network to a similarly small random value. We need to repeat the <code>FP</code> and <code>backProp</code> process many times in order to keep updating the weights.</p>
<h2 id="iterating"> Iterating </h2>
<p><a href="#toctop">To contents</a></p>
<p>Iteration is very straightforward. We just tell our algorithm to repeat a maximum of <code>maxIterations</code> times or until the <code>Error</code> is below <code>minError</code> (whichever comes first). As the weights are stored internally within <code>NN</code>, every time we call the <code>backProp</code> method it uses the latest, internally stored weights and doesn’t start again - the weights are only initialised once upon creation of <code>NN</code>.</p>
<pre><code class="language-python">maxIterations = 100000
minError = 1e-5
for i in range(maxIterations + 1):
    Error = NN.backProp(Input, Target)
    if i % 2500 == 0:
        print("Iteration {0}\tError: {1:0.6f}".format(i,Error))
    if Error <= minError:
        print("Minimum error reached at iteration {0}".format(i))
        break
</code></pre>
<p>Here’s the end of my output from the first run:</p>
<pre><code>Iteration 100000 Error: 0.000291
Input Output Target
[0 0] [ 0.00780385] [ 0.]
[1 1] [ 0.00992829] [ 0.]
[0 1] [ 0.99189799] [ 1.]
[1 0] [ 0.99189943] [ 1.]
</code></pre>
<p>Much better! The error is very small and the outputs are very close to the correct value. However, they’re not completely right. We can do better by implementing different activation functions, which we will do in the next tutorial.</p>
<p><strong>Please</strong> let me know if anything is unclear, or there are mistakes. Let me know how you get on!</p>A Simple Neural Network - Vectorisation
/post/nn-more-maths/
Mon, 13 Mar 2017 10:33:08 +0000/post/nn-more-maths/<p>The third in our series of tutorials on Simple Neural Networks. This time, we’re looking a bit deeper into the maths, specifically focusing on vectorisation. This is an important step before we can translate our maths into a functioning script in Python.</p>
<p></p>
<p>So we’ve <a href="/post/neuralnetwork">been through the maths</a> of a neural network (NN) using back propagation and taken a look at the <a href="/post/transfer-functions">different activation functions</a> that we could implement. This post will put the mathematics into a vectorised form which we can then translate into Python and piece together into a functioning NN!</p>
<h2 id="forwardprop"> Forward Propagation </h2>
<p>Let’s remind ourselves of our notation from our 2 layer network in the <a href="/post/neuralnetwork">maths tutorial</a>:</p>
<ul>
<li>I is our input layer</li>
<li>J is our hidden layer</li>
<li>$w_{ij}$ is the weight connecting the $i^{\text{th}}$ node in $I$ to the $j^{\text{th}}$ node in $J$</li>
<li>$x_{j}$ is the total input to the $j^{\text{th}}$ node in $J$</li>
</ul>
<p>So, assuming that we have three features (nodes) in the input layer, the input to the first node in the hidden layer is given by:</p>
<div>$$
x_{1} = \mathcal{O}_{1}^{I} w_{11} + \mathcal{O}_{2}^{I} w_{21} + \mathcal{O}_{3}^{I} w_{31}
$$</div>
<p>Let’s generalise this for any connected nodes in any layer: the input to node $j$ in layer $l$ is:</p>
<div>$$
x_{j} = \mathcal{O}_{1}^{l-1} w_{1j} + \mathcal{O}_{2}^{l-1} w_{2j} + \mathcal{O}_{3}^{l-1} w_{3j}
$$</div>
<p>But we need to be careful and remember to put in our <em>bias</em> term $\theta$. In our maths tutorial, we said that the bias term was always equal to 1; now we can try to understand why.</p>
<p>We could just add the bias term onto the end of the previous equation to get:</p>
<div>$$
x_{j} = \mathcal{O}_{1}^{l-1} w_{1j} + \mathcal{O}_{2}^{l-1} w_{2j} + \mathcal{O}_{3}^{l-1} w_{3j} + \theta_{j}
$$</div>
<p>If we think more carefully about this, what we are really saying is that “an extra node in the previous layer, which always outputs the value 1, is connected to the node $j$ in the current layer by some weight $w_{4j}$”, i.e. the bias contributes $1 \cdot w_{4j}$:</p>
<div>$$
x_{j} = \mathcal{O}_{1}^{l-1} w_{1j} + \mathcal{O}_{2}^{l-1} w_{2j} + \mathcal{O}_{3}^{l-1} w_{3j} + 1 \cdot w_{4j}
$$</div>
<p>By the magic of matrix multiplication, we should be able to convince ourselves that:</p>
<div>$$
x_{j} = \begin{pmatrix} w_{1j} &w_{2j} &w_{3j} &w_{4j} \end{pmatrix}
\begin{pmatrix} \mathcal{O}_{1}^{l-1} \\
\mathcal{O}_{2}^{l-1} \\
\mathcal{O}_{3}^{l-1} \\
1
\end{pmatrix}
$$</div>
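<p>A quick numerical check may help to convince ourselves. The sketch below uses made-up values for the three outputs, the three weights and the bias (all hypothetical) and compares adding the bias explicitly with folding it in as a fourth weight on a node that always outputs 1.</p>

```python
import numpy as np

outputs = np.array([0.2, 0.5, 0.9])   # hypothetical outputs of the previous layer
w = np.array([0.1, -0.3, 0.4])        # hypothetical w_1j, w_2j, w_3j
theta = 0.25                          # hypothetical bias term for node j

# Bias added explicitly to the weighted sum
explicit = outputs.dot(w) + theta

# Bias folded in as a fourth weight w_4j on a node that always outputs 1
w_ext = np.append(w, theta)
outputs_ext = np.append(outputs, 1.0)
folded = w_ext.dot(outputs_ext)

print(np.isclose(explicit, folded))
```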
<p>Now, let’s be a little more explicit and consider the input $x$ to the first two nodes of the layer $J$:</p>
<div>$$
\begin{align}
x_{1} &= \begin{pmatrix} w_{11} &w_{21} &w_{31} &w_{41} \end{pmatrix}
\begin{pmatrix} \mathcal{O}_{1}^{l-1} \\
\mathcal{O}_{2}^{l-1} \\
\mathcal{O}_{3}^{l-1} \\
1
\end{pmatrix}
\\[0.5em]
x_{2} &= \begin{pmatrix} w_{12} &w_{22} &w_{32} &w_{42} \end{pmatrix}
\begin{pmatrix} \mathcal{O}_{1}^{l-1} \\
\mathcal{O}_{2}^{l-1} \\
\mathcal{O}_{3}^{l-1} \\
1
\end{pmatrix}
\end{align}
$$</div>
<p>Note that the second matrix is constant between the input calculations as it is only the output values of the previous layer (including the bias term). This means (again by the magic of matrix multiplication) that we can construct a single vector containing the input values $x$ to the current layer:</p>
<div> $$
\begin{pmatrix} x_{1} \\ x_{2} \end{pmatrix}
= \begin{pmatrix} w_{11} & w_{21} & w_{31} & w_{41} \\
w_{12} & w_{22} & w_{32} & w_{42}
\end{pmatrix}
\begin{pmatrix} \mathcal{O}_{1}^{l-1} \\
\mathcal{O}_{2}^{l-1} \\
\mathcal{O}_{3}^{l-1} \\
1
\end{pmatrix}
$$</div>
<p>This is an $\left(n \times (m+1) \right)$ matrix multiplied with an $\left((m+1) \times 1 \right)$ vector where:</p>
<ul>
<li>$n$ is the number of nodes in the current layer $l$</li>
<li>$m$ is the number of nodes in the previous layer $l-1$</li>
</ul>
<p>Let’s generalise - the vector of inputs to the $n$ nodes in the current layer from the $m$ nodes in the previous layer is:</p>
<div> $$
\begin{pmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{pmatrix}
= \begin{pmatrix} w_{11} & w_{21} & \cdots & w_{(m+1)1} \\
w_{12} & w_{22} & \cdots & w_{(m+1)2} \\
\vdots & \vdots & \ddots & \vdots \\
w_{1n} & w_{2n} & \cdots & w_{(m+1)n} \\
\end{pmatrix}
\begin{pmatrix} \mathcal{O}_{1}^{l-1} \\
\mathcal{O}_{2}^{l-1} \\
\vdots \\
\mathcal{O}_{m}^{l-1} \\
1
\end{pmatrix}
$$</div>
<p>or:</p>
<div>$$
\mathbf{x_{J}} = \mathbf{W_{IJ}} \mathbf{\vec{\mathcal{O}}_{I}}
$$</div>
<p>In this notation, the output from the current layer $J$ is easily written as:</p>
<div>$$
\mathbf{\vec{\mathcal{O}}_{J}} = \sigma \left( \mathbf{W_{IJ}} \mathbf{\vec{\mathcal{O}}_{I}} \right)
$$</div>
<p>Where $\sigma$ is the activation or transfer function chosen for this layer which is applied elementwise to the product of the matrices.</p>
<p>This notation allows us to very efficiently calculate the output of a layer which reduces computation time. Additionally, we are now able to extend this efficiency by making our network consider <strong>all</strong> of our input examples at once.</p>
<p>Remember that our network requires training (many epochs of forward propagation followed by back propagation) and as such needs training data (preferably a lot of it!). Rather than consider each training example individually, we vectorise each example into a large matrix of inputs.</p>
<p>Our weights $\mathbf{W_{IJ}}$ connecting the layer $I$ to layer $J$ are the same no matter which input example we put into the network: this is fundamental as we expect that the network would act the same way for similar inputs i.e. we expect the same neurons (nodes) to fire based on the similar features in the input.</p>
<p>If 2 input examples gave the outputs $ \mathbf{\vec{\mathcal{O}}_{I_{1}}} $ and $ \mathbf{\vec{\mathcal{O}}_{I_{2}}} $ from the nodes in layer $I$ to a layer $J$ then the outputs from layer $J$, $\mathbf{\vec{\mathcal{O}}_{J_{1}}}$ and $\mathbf{\vec{\mathcal{O}}_{J_{2}}}$, can be written:</p>
<div>$$
\begin{pmatrix}
\mathbf{\vec{\mathcal{O}}_{J_{1}}} &
\mathbf{\vec{\mathcal{O}}_{J_{2}}}
\end{pmatrix}
=
\sigma \left(\mathbf{W_{IJ}}\begin{pmatrix}
\mathbf{\vec{\mathcal{O}}_{I_{1}}} &
\mathbf{\vec{\mathcal{O}}_{I_{2}}}
\end{pmatrix}
\right)
=
\sigma \left(\mathbf{W_{IJ}}\begin{pmatrix}
\begin{bmatrix}\mathcal{O}_{I_{1}}^{1} \\ \vdots \\ \mathcal{O}_{I_{1}}^{m}
\end{bmatrix} &
\begin{bmatrix}\mathcal{O}_{I_{2}}^{1} \\ \vdots \\ \mathcal{O}_{I_{2}}^{m}
\end{bmatrix}
\end{pmatrix}
\right)
= \sigma \left(\begin{pmatrix} \mathbf{W_{IJ}}\begin{bmatrix}\mathcal{O}_{I_{1}}^{1} \\ \vdots \\ \mathcal{O}_{I_{1}}^{m}
\end{bmatrix} &
\mathbf{W_{IJ}} \begin{bmatrix}\mathcal{O}_{I_{2}}^{1} \\ \vdots \\ \mathcal{O}_{I_{2}}^{m}
\end{bmatrix}
\end{pmatrix}
\right)
$$</div>
<p>This is for the $m$ nodes in the input layer. It may look hideous, but the point is that all of the training examples that are input to the network can be dealt with simultaneously because each example becomes another column in the input matrix and a corresponding column in the output matrix.</p>
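<p>As a small numerical illustration of examples becoming columns, the sketch below pushes two made-up examples through one layer, first one at a time and then together as a two-column matrix. The weights, the example values and the local <code>sigmoid</code> are assumptions for illustration only.</p>

```python
import numpy as np

def sigmoid(x):
    """Logistic activation, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
W = rng.normal(scale=0.1, size=(2, 4))  # 2 nodes in J; 3 nodes in I plus the bias

# Two hypothetical examples, each with the bias value 1 appended
O_1 = np.array([0.2, 0.5, 0.9, 1.0])
O_2 = np.array([0.7, 0.1, 0.4, 1.0])

# One example at a time
separate = np.column_stack([sigmoid(W.dot(O_1)), sigmoid(W.dot(O_2))])

# Both at once: each example is a column of the input matrix
batched = sigmoid(W.dot(np.column_stack([O_1, O_2])))

print(np.allclose(separate, batched))
```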
<div class="highlight_section">
In summary, for forward propagation:
<ul>
<li> All $n$ training examples with $m$ features (input nodes) are put into column vectors to build the input matrix $I$, taking care to add the bias term to the end of each.</li>
<li> All weight vectors that connect $m +1$ nodes in the layer $I$ to the $n$ nodes in layer $J$ are put together in a weight-matrix</li>
<div>$$
\mathbf{I} = \left(
\begin{bmatrix}
\mathcal{O}_{I_{1}}^{1} \\ \vdots \\ \mathcal{O}_{I_{1}}^{m} \\ 1 \end{bmatrix}
\begin{bmatrix}
\mathcal{O}_{I_{2}}^{1} \\ \vdots \\ \mathcal{O}_{I_{2}}^{m} \\ 1
\end{bmatrix}
\begin{bmatrix}
\cdots \\ \cdots \\ \ddots \\ \cdots
\end{bmatrix}
\begin{bmatrix}
\mathcal{O}_{I_{n}}^{1} \\ \vdots \\ \mathcal{O}_{I_{n}}^{m} \\ 1
\end{bmatrix}
\right)
\ \ \ \
\mathbf{W_{IJ}} =
\begin{pmatrix} w_{11} & w_{21} & \cdots & w_{(m+1)1} \\
w_{12} & w_{22} & \cdots & w_{(m+1)2} \\
\vdots & \vdots & \ddots & \vdots \\
w_{1n} & w_{2n} & \cdots & w_{(m+1)n} \\
\end{pmatrix}
$$</div>
<p><li> We perform $ \sigma \left( \mathbf{W_{IJ}} \mathbf{I} \right)$ to get $\mathbf{\vec{\mathcal{O}}_{J}}$, which is the output from each of the $n$ nodes in layer $J$, with one column per training example </li>
</ul>
</div></p>
<h2 id="backprop"> Back Propagation </h2>
<p>To perform back propagation there are a couple of things that we need to vectorise. The first is the error on the weights when we compare the output of the network $\mathbf{\vec{\mathcal{O}}_{K}}$ with the known target values:</p>
<div>$$
\mathbf{T_{K}} = \begin{bmatrix} t_{1} \\ \vdots \\ t_{k} \end{bmatrix}
$$</div>
<p>A reminder of the formulae:</p>
<div>$$
\delta_{k} = \mathcal{O}_{k} \left( 1 - \mathcal{O}_{k} \right) \left( \mathcal{O}_{k} - t_{k} \right),
\ \ \ \
\delta_{j} = \mathcal{O}_{j} \left( 1 - \mathcal{O}_{j} \right) \sum_{k \in K} \delta_{k} W_{jk}
$$</div>
<p>Where $\delta_{k}$ is the error on the weights to the output layer and $\delta_{j}$ is the error on the weights to the hidden layers. We also need to vectorise the update formulae for the weights and bias:</p>
<div>$$
W + \Delta W \rightarrow W, \ \ \ \
\theta + \Delta\theta \rightarrow \theta
$$</div>
<h3 id="outputdeltas"> Vectorising the Output Layer Deltas </h3>
<p>Let’s look at the output layer delta: we need a subtraction between the outputs and the target which is multiplied by the derivative of the transfer function (sigmoid). Well, the subtraction between two matrices is straightforward:</p>
<div>$$
\mathbf{\vec{\mathcal{O}}_{K}} - \mathbf{T_{K}}
$$</div>
<p>but we need to consider the derivative. Remember that the output of the final layer is:</p>
<div>$$
\mathbf{\vec{\mathcal{O}}_{K}} = \sigma \left( \mathbf{W_{JK}}\mathbf{\vec{\mathcal{O}}_{J}} \right)
$$</div>
<p>and the derivative can be written:</p>
<div>$$
\sigma ^{\prime} \left( \mathbf{W_{JK}}\mathbf{\vec{\mathcal{O}}_{J}} \right) = \mathbf{\vec{\mathcal{O}}_{K}}\left( 1 - \mathbf{\vec{\mathcal{O}}_{K}} \right)
$$</div>
<p><strong>Note</strong>: This is the derivative of the sigmoid as evaluated at each of the nodes in the layer $K$. It is acting <em>elementwise</em> on the inputs to layer $K$. Thus it is a column vector with the same length as the number of nodes in layer $K$.</p>
<p>Put the derivative and subtraction terms together and we get:</p>
<div class="highlight_section">$$
\mathbf{\vec{\delta}_{K}} = \sigma^{\prime}\left( \mathbf{W_{JK}}\mathbf{\vec{\mathcal{O}}_{J}} \right) * \left( \mathbf{\vec{\mathcal{O}}_{K}} - \mathbf{T_{K}}\right)
$$</div>
<p>Again, the derivatives are being multiplied elementwise with the results of the subtraction. Now we have a vector of deltas for the output layer $K$! Things aren’t so straightforward for the deltas in the hidden layers.</p>
<p>Let’s visualise what we’ve seen:</p>
<div id="fig1" class="figure_container">
<div class="figure_images">
<img title="NN Vectorisation" src="/img/simpleNN/nn_vectors1.png" width="30%">
</div>
<div class="figure_caption">
<font color="blue">Figure 1</font>: NN showing the weights and outputs in vector form along with the target values for layer $K$
</div>
</div>
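<p>The output layer delta can be checked numerically with a short sketch. The weights, outputs and targets below are made-up values, and the comparison is between the vectorised elementwise product and the node-by-node formula from the reminder above.</p>

```python
import numpy as np

def sigmoid(x):
    """Logistic activation, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical weights: 2 nodes in K; 2 nodes in J plus the bias
W_JK = np.array([[0.1, -0.2, 0.3],
                 [0.4, 0.5, -0.6]])
O_J = np.array([0.6, 0.3, 1.0])   # outputs of layer J with the bias value appended
T_K = np.array([1.0, 0.0])        # target values

O_K = sigmoid(W_JK.dot(O_J))

# Vectorised: elementwise product of the sigmoid derivative and the residual
delta_K = O_K * (1 - O_K) * (O_K - T_K)

# Node-by-node version of the scalar formula
delta_loop = np.array([O_K[k] * (1 - O_K[k]) * (O_K[k] - T_K[k]) for k in range(2)])

print(np.allclose(delta_K, delta_loop))
```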
<h3 id="hiddendeltas"> Vectorising the Hidden Layer Deltas </h3>
<p>We need to vectorise:</p>
<div>$$
\delta_{j} = \mathcal{O}_{j} \left( 1 - \mathcal{O}_{j} \right) \sum_{k \in K} \delta_{k} W_{jk}
$$</div>
<p>Let’s deal with the summation. We’re multiplying each of the deltas $\delta_{k}$ in the output layer (or more generally, the subsequent layer could be another hidden layer) by the weight $w_{jk}$ that pulls them back to the node $j$ in the current layer before adding the results. For the first node in the hidden layer:</p>
<div>$$
\sum_{k \in K} \delta_{k} W_{jk} = \delta_{k}^{1}w_{11} + \delta_{k}^{2}w_{12} + \delta_{k}^{3}w_{13}
= \begin{pmatrix} w_{11} & w_{12} & w_{13} \end{pmatrix} \begin{pmatrix} \delta_{k}^{1} \\ \delta_{k}^{2} \\ \delta_{k}^{3}\end{pmatrix}
$$</div>
<p>Notice the weights? They pull the delta from each output layer node back to the first node of the hidden layer. In forward propagation we grouped together the weights arriving at a single node from every node in the previous layer; here we group the weights leaving a single node towards every node in the next layer.</p>
<p>Combine this summation with the multiplication by the activation function derivative:</p>
<div>$$
\delta_{j}^{1} = \sigma^{\prime} \left( x_{j}^{1} \right)
\begin{pmatrix} w_{11} & w_{12} & w_{13} \end{pmatrix} \begin{pmatrix} \delta_{k}^{1} \\ \delta_{k}^{2} \\ \delta_{k}^{3} \end{pmatrix}
$$</div>
<p>remembering that the input to the $\text{1}^\text{st}$ node in the layer $J$ is:</p>
<div>$$
x_{j}^{1} = \mathbf{W_{I1}}\mathbf{\vec{\mathcal{O}}_{I}}
$$</div>
<p>What about the $\text{2}^\text{nd}$ node in the hidden layer?</p>
<div>$$
\delta_{j}^{2} = \sigma^{\prime} \left( x_{j}^{2} \right)
\begin{pmatrix} w_{21} & w_{22} & w_{23} \end{pmatrix} \begin{pmatrix} \delta_{k}^{1} \\ \delta_{k}^{2} \\ \delta_{k}^{3} \end{pmatrix}
$$</div>
<p>This is looking familiar. Hopefully we can be confident, based upon what we’ve done before, to say that:</p>
<div>$$
\begin{pmatrix}
\delta_{j}^{1} \\ \delta_{j}^{2}
\end{pmatrix}
=
\begin{pmatrix}
\sigma^{\prime} \left( x_{j}^{1} \right) \\ \sigma^{\prime} \left( x_{j}^{2} \right)
\end{pmatrix}
*
\begin{pmatrix}
w_{11} & w_{12} & w_{13} \\
w_{21} & w_{22} & w_{23}
\end{pmatrix}
\begin{pmatrix}\delta_{k}^{1} \\ \delta_{k}^{2} \\ \delta_{k}^{3} \end{pmatrix}
$$</div>
<p>We’ve seen a version of this weights matrix before when we did the forward propagation vectorisation. In this case though, look carefully - as we mentioned, the weights are not in the same places, in fact, the weight matrix has been <em>transposed</em> from the one we used in forward propagation. This makes sense because we’re going backwards through the network now! This is useful because it means there is very little extra calculation needed here - the matrix we need is already available from the forward pass, but just needs transposing. We can call the weights in back propagation here $ \mathbf{ W_{KJ}} $ as we’re pulling the deltas from $K$ to $J$.</p>
<div>$$
\begin{align}
\mathbf{W_{KJ}} &=
\begin{pmatrix}
w_{11} & w_{12} & \cdots & w_{1n} \\
w_{21} & w_{22} & \cdots & w_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
w_{(m+1)1} & w_{(m+1)2} & \cdots & w_{(m+1)n}
\end{pmatrix} , \ \ \
\mathbf{W_{JK}} =
\begin{pmatrix} w_{11} & w_{21} & \cdots & w_{(m+1)1} \\
w_{12} & w_{22} & \cdots & w_{(m+1)2} \\
\vdots & \vdots & \ddots & \vdots \\
w_{1n} & w_{2n} & \cdots & w_{(m+1)n} \\
\end{pmatrix} \\[0.5em]
\mathbf{W_{KJ}} &= \mathbf{W^{\intercal}_{JK}}
\end{align}
$$</div>
<div class="highlight_section">
And so, the vectorised equations for the output layer and hidden layer deltas are:
<div>$$
\begin{align}
\mathbf{\vec{\delta}_{K}} &= \sigma^{\prime}\left( \mathbf{W_{JK}}\mathbf{\vec{\mathcal{O}}_{J}} \right) * \left( \mathbf{\vec{\mathcal{O}}_{K}} - \mathbf{T_{K}}\right) \\[0.5em]
\mathbf{ \vec{ \delta }_{J}} &= \sigma^{\prime} \left( \mathbf{ W_{IJ} \mathcal{O}_{I} } \right) * \mathbf{ W^{\intercal}_{JK}} \mathbf{ \vec{\delta}_{K}}
\end{align}
$$</div>
<p></div></p>
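<p>As a rough check on the hidden-layer equation, the sketch below computes $\vec{\delta}_{J}$ twice: once with the transposed weight matrix and once node-by-node using the summation we started from. The sizes match the small example above (2 nodes in $J$, 3 in $K$); the numbers and variable names are invented for illustration.</p>
<pre><code class="language-python">import numpy as np

# Hypothetical values: 2 nodes in layer J, 3 nodes in layer K
W_JK = np.array([[0.1, 0.4],
                 [-0.2, 0.5],
                 [0.3, -0.6]])             # forward-pass matrix, W_JK[k, j] = w_jk
delta_K = np.array([0.05, -0.02, 0.01])    # deltas already found for layer K
O_J = np.array([0.6, 0.3])                 # sigmoid outputs of layer J

# For the sigmoid, sigma'(x_j) = O_j (1 - O_j); W_JK.T pulls the deltas back to J
delta_J = O_J * (1 - O_J) * (W_JK.T @ delta_K)

# The same result one node at a time, following the summation form
delta_J_loop = np.array([
    O_J[j] * (1 - O_J[j]) * sum(delta_K[k] * W_JK[k, j] for k in range(3))
    for j in range(2)
])
</code></pre>
<p>Both routes should agree to within floating-point rounding; the transposed version simply avoids the explicit loops.</p>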
<p>Let’s visualise what we’ve seen:</p>
<div id="fig2" class="figure_container">
<div class="figure_images">
<img title="NN Vectorisation 2" src="/img/simpleNN/nn_vectors2.png" width="20%">
</div>
<div class="figure_caption">
<font color="blue">Figure 2</font>: The NN showing the delta vectors
</div>
</div>
<h3 id="updates"> Vectorising the Update Equations </h3>
<p>Finally, now that we have the vectorised equations for the deltas (which required us to get the vectorised equations for the forward pass) we’re ready to get the update equations in vector form. Let’s recall the update equations:</p>
<div>$$
\begin{align}
\Delta W &= -\eta \ \delta_{l} \ \mathcal{O}_{l-1} \\
\Delta\theta &= -\eta \ \delta_{l}
\end{align}
$$</div>
<p>Ignoring the $-\eta$ for now, we need to get a vector form for $\delta_{l} \ \mathcal{O}_{l-1}$ in order to get the update to the weights. We have the matrix of weights:</p>
<div>$$
\mathbf{W_{JK}} =
\begin{pmatrix} w_{11} & w_{21} & w_{31} \\
w_{12} & w_{22} & w_{32} \\
\end{pmatrix}
$$</div>
<p>Suppose we are updating the weight $w_{21}$ in the matrix. We’re looking to find the product of the output from the second node in $J$ with the delta from the first node in $K$.</p>
<div>$$
\Delta w_{21} = \delta_{K}^{1} \mathcal{O}_{J}^{2}
$$</div>
<p>Considering this example, we can write the matrix for the weight updates as:</p>
<div>$$
\Delta \mathbf{W_{JK}} =
\begin{pmatrix} \delta_{K}^{1} \mathcal{O}_{J}^{1} & \delta_{K}^{1} \mathcal{O}_{J}^{2} & \delta_{K}^{1} \mathcal{O}_{J}^{3} \\
\delta_{K}^{2} \mathcal{O}_{J}^{1} & \delta_{K}^{2} \mathcal{O}_{J}^{2} & \delta_{K}^{2} \mathcal{O}_{J}^{3}
\end{pmatrix}
=
\begin{pmatrix} \delta_{K}^{1} \\ \delta_{K}^{2}\end{pmatrix}
\begin{pmatrix} \mathcal{O}_{J}^{1} & \mathcal{O}_{J}^{2}& \mathcal{O}_{J}^{3}
\end{pmatrix}
$$</div>
<p>Generalising this into vector notation and including the <em>learning rate</em> $\eta$, the update for the weights between layers $J$ and $K$ is the column vector of deltas multiplied by the row vector (the transpose) of outputs:</p>
<div>$$
\Delta \mathbf{W_{JK}} = -\eta \mathbf{ \vec{ \delta }_{K}} \mathbf{ \vec { \mathcal{O} }^{\intercal}_{J}}
$$</div>
<p>Similarly, we have the update to the bias term:</p>
<div>$$
\Delta \vec{\theta} = -\eta \mathbf{ \vec{ \delta }_{K}}
$$</div>
<p>So the bias term is updated just by taking the deltas straight from the nodes in the subsequent layer (with the negative factor of learning rate).</p>
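<p>A minimal sketch of these two updates in NumPy, assuming the same layout as the weight matrix above (one row per node in $K$, one column per node in $J$). The learning rate, the values and the names <code>theta_K</code>, <code>dW_JK</code> and <code>dtheta_K</code> are placeholders chosen for this example only.</p>
<pre><code class="language-python">import numpy as np

eta = 0.1                                  # learning rate (hypothetical value)
delta_K = np.array([0.05, -0.02])          # deltas for the 2 nodes in K
O_J = np.array([0.6, 0.3, 0.9])            # outputs of the 3 nodes in J
W_JK = np.zeros((2, 3))                    # same shape as the weight matrix above
theta_K = np.zeros(2)                      # biases of the nodes in K

dW_JK = -eta * np.outer(delta_K, O_J)      # column of deltas times row of outputs
dtheta_K = -eta * delta_K

W_JK = W_JK + dW_JK
theta_K = theta_K + dtheta_K
</code></pre>
<p>Here <code>np.outer</code> builds the whole matrix of products in one go, so the entry in the first row and second column is $-\eta \, \delta_{K}^{1} \mathcal{O}_{J}^{2}$, the update to $w_{21}$ from the example.</p>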
<div class="highlight_section">
In summary, for back propagation, the equations we need in vector form are:
<div>$$
\begin{align}
\mathbf{\vec{\delta}_{K}} &= \sigma^{\prime}\left( \mathbf{W_{JK}}\mathbf{\vec{\mathcal{O}}_{J}} \right) * \left( \mathbf{\vec{\mathcal{O}}_{K}} - \mathbf{T_{K}}\right) \\[0.5em]
\mathbf{ \vec{ \delta }_{J}} &= \sigma^{\prime} \left( \mathbf{ W_{IJ} \mathcal{O}_{I} } \right) * \mathbf{ W^{\intercal}_{JK}} \mathbf{ \vec{\delta}_{K}}
\end{align}
$$</div>
<div>$$
\begin{align}
\mathbf{W_{JK}} + \Delta \mathbf{W_{JK}} &\rightarrow \mathbf{W_{JK}}, \ \ \ \Delta \mathbf{W_{JK}} = -\eta \mathbf{ \vec{ \delta }_{K}} \mathbf{ \vec { \mathcal{O} }^{\intercal}_{J}} \\[0.5em]
\vec{\theta} + \Delta \vec{\theta} &\rightarrow \vec{\theta}, \ \ \ \Delta \vec{\theta} = -\eta \mathbf{ \vec{ \delta }_{K}}
\end{align}
$$</div>
<p>With $*$ representing an elementwise multiplication between the matrices.</p>
<p></div></p>
<h2 id="nextsteps"> What's next? </h2>
<p>Although this kind of mathematics can be tedious and sometimes hard to follow (and probably with numerous notation mistakes… please let me know if you find them!), it is necessary in order to write a quick, efficient NN. Our next step is to implement this setup in Python.</p>A Simple Neural Network - Transfer Functions
/post/transfer-functions/
Wed, 08 Mar 2017 10:43:07 +0000/post/transfer-functions/<p>We’re going to write a little bit of Python in this tutorial on Simple Neural Networks (Part 2). It will focus on the different types of activation (or transfer) functions, their properties and how to write each of them (and their derivatives) in Python.</p>
<p></p>
<p>As promised in the previous post, we’ll take a look at some of the different activation functions that could be used in our nodes. Again <strong>please</strong> let me know if there’s anything I’ve gotten totally wrong - I’m very much learning too.</p>
<div id="toctop"></div>
<ol>
<li><a href="#linear">Linear Function</a></li>
<li><a href="#sigmoid">Sigmoid Function</a></li>
<li><a href="#tanh">Hyperbolic Tangent Function</a></li>
<li><a href="#gaussian">Gaussian Function</a></li>
<li><a href="#step">Heaviside (step) Function</a></li>
<li><a href="#ramp">Ramp Function</a>
<ol>
<li><a href="#relu">Rectified Linear Unit (ReLU)</a></li>
</ol></li>
</ol>
<h2 id="linear"> Linear (Identity) Function </h2>
<p><a href="#toctop">To contents</a></p>
<h3 id="what-does-it-look-like">What does it look like?</h3>
<div id="fig1" class="figure_container">
<div class="figure_images">
<img title="Simple NN" src="/img/transferFunctions/linear.png" width="40%"><img title="Simple NN" src="/img/transferFunctions/dlinear.png" width="40%">
</div>
<div class="figure_caption">
<font color="blue">Figure 1</font>: The linear function (left) and its derivative (right)
</div>
</div>
<h3 id="formulae">Formulae</h3>
<div>$$
f \left( x_{i} \right) = x_{i}, \ \
f^{\prime}\left( x_{i} \right) = 1
$$</div>
<h3 id="python-code">Python Code</h3>
<pre><code class="language-python">def linear(x, Derivative=False):
    # Identity function; its derivative is 1 everywhere
    if not Derivative:
        return x
    else:
        return 1.0
</code></pre>
<h3 id="why-is-it-used">Why is it used?</h3>
<p>If there’s a situation where we want a node to give its output without applying any thresholds, then the identity (or linear) function is the way to go.</p>
<p>Hopefully you can see why it is used in the final output layer nodes: we only want these nodes to do the $ \text{input} \times \text{weight}$ operations before giving us their answer, without any further modification.</p>
<p><font color="blue"></p>
<p><strong>Note:</strong> The linear function is not used in the hidden layers. We must use non-linear transfer functions in the hidden layer nodes or else the output will only ever end up being a linearly separable solution.</p>
<p></font></p>
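<p>One way to convince ourselves of this note: with the linear (identity) function in the hidden layer, two layers of weights collapse into a single weight matrix, so the hidden layer adds nothing. The sketch below uses random, made-up weights and ignores the bias.</p>
<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))     # input -> hidden weights (hypothetical)
W2 = rng.normal(size=(2, 3))     # hidden -> output weights (hypothetical)
x = rng.normal(size=2)           # a made-up input vector

two_layers = W2 @ (W1 @ x)       # linear activation in the hidden layer
one_layer = (W2 @ W1) @ x        # a single equivalent weight matrix

same = np.allclose(two_layers, one_layer)
</code></pre>
<p>Because matrix multiplication is associative, <code>same</code> comes out true whatever weights we pick; a non-linear function between the two products is what breaks this equivalence.</p>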
<p><br></p>
<hr />
<h2 id="sigmoid"> The Sigmoid (or Fermi) Function </h2>
<p><a href="#toctop">To contents</a></p>
<h3 id="what-does-it-look-like-1">What does it look like?</h3>
<div id="fig2" class="figure_container">
<div class="figure_images">
<img title="Simple NN" src="/img/transferFunctions/sigmoid.png" width="40%"><img title="Simple NN" src="/img/transferFunctions/dsigmoid.png" width="40%">
</div>
<div class="figure_caption">
<font color="blue">Figure 2</font>: The sigmoid function (left) and its derivative (right)
</div>
</div>
<h3 id="formulae-1">Formulae</h3>
<div >$$
f\left(x_{i} \right) = \sigma\left(x_{i}\right) = \frac{1}{1 + e^{ - x_{i} }}, \ \
f^{\prime}\left( x_{i} \right) = \sigma(x_{i}) \left( 1 - \sigma(x_{i}) \right)
$$</div>
<h3 id="python-code-1">Python Code</h3>
<pre><code class="language-python">import numpy as np

def sigmoid(x, Derivative=False):
    if not Derivative:
        return 1 / (1 + np.exp(-x))
    else:
        # The derivative is written in terms of the sigmoid itself
        out = sigmoid(x)
        return out * (1 - out)
</code></pre>
<h3 id="why-is-it-used-1">Why is it used?</h3>
<p>This function maps the input to a value between 0 and 1 (but not equal to 0 or 1). This means the output from the node will be a high signal (if the input is positive) or a low one (if the input is negative). This function is often chosen as it is one of the easiest to hard-code in terms of its derivative. The simplicity of its derivative allows us to efficiently perform back propagation without using any fancy packages or approximations. The fact that this function is smooth, continuous (differentiable), monotonic and bounded means that back propagation will work well.</p>
<p>The sigmoid’s natural threshold is 0.5, meaning that any input that maps to a value above 0.5 will be considered high (or 1) in binary terms.</p>
<p><br></p>
<hr />
<h2 id="tanh"> Hyperbolic Tangent Function ( $\tanh(x)$ ) </h2>
<p><a href="#toctop">To contents</a></p>
<h3 id="what-does-it-look-like-2">What does it look like?</h3>
<div id="fig3" class="figure_container">
<div class="figure_images">
<img title="Simple NN" src="/img/transferFunctions/tanh.png" width="40%"><img title="Simple NN" src="/img/transferFunctions/dtanh.png" width="40%">
</div>
<div class="figure_caption">
<font color="blue">Figure 3</font>: The hyperbolic tangent function (left) and its derivative (right)
</div>
</div>
<h3 id="formulae-2">Formulae</h3>
<div >$$
f\left(x_{i} \right) = \tanh\left(x_{i}\right),
f^{\prime}\left(x_{i} \right) = 1 - \tanh\left(x_{i}\right)^{2}
$$</div>
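<p>A minimal sketch of this function in Python, written in the same style as the other functions in this post. It leans on NumPy’s built-in <code>np.tanh</code>; the function name and the <code>Derivative</code> flag just follow the convention used here.</p>
<pre><code class="language-python">import numpy as np

def tanh(x, Derivative=False):
    if not Derivative:
        return np.tanh(x)
    else:
        # 1 - tanh(x)^2, as in the formula above
        return 1.0 - np.tanh(x)**2
</code></pre>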
<h3 id="why-is-it-used-2">Why is it used?</h3>
<p>This is a very similar function to the previous sigmoid function and has many of the same properties: even its derivative is straightforward to compute. However, this function allows us to map the input to any value between -1 and 1 (but not inclusive of those). In effect, this allows us to apply a penalty to the node (negative) rather than just have the node not fire at all. It also gives us a larger range of output to play with in the positive end of the scale meaning finer adjustments can be made.</p>
<p>This function has a natural threshold of 0, meaning that any input which maps to a value greater than 0 is considered high (or 1) in binary terms.</p>
<p>Again, the fact that this function is smooth, continuous (differentiable), monotonic and bounded means that back propagation will work well. The subsequent functions don’t all have these properties which makes them more difficult to use in back propagation (though it is done).
<br></p>
<hr />
<h2 id="what-s-the-difference-between-the-sigmoid-and-hyperbolic-tangent">What’s the difference between the sigmoid and hyperbolic tangent?</h2>
<p>They both achieve a similar mapping, are both continuous, smooth, monotonic and differentiable, but give out different values. For a sigmoid function, a large negative input generates an almost zero output. This lack of output will affect all subsequent weights in the network which may not be desirable - effectively stopping the next nodes from learning. In contrast, the $\tanh$ function approaches -1 for large negative inputs, maintaining the output of the node and allowing subsequent nodes to learn from it.</p>
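<p>The two functions are in fact the same shape: $\tanh(x) = 2\sigma(2x) - 1$, i.e. a sigmoid stretched to cover $(-1, 1)$ and squeezed along the input axis. The short sketch below checks this numerically on a handful of made-up input values; the variable names are only for illustration.</p>
<pre><code class="language-python">import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 11)                 # a few sample inputs
rescaled = 2.0 * sigmoid(2.0 * x) - 1.0    # sigmoid stretched to (-1, 1)
max_gap = np.max(np.abs(rescaled - np.tanh(x)))
</code></pre>
<p>The largest gap between the two should be at the level of floating-point rounding, so the practical difference is the output range rather than the shape of the curve.</p>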
<hr />
<h2 id="gaussian"> Gaussian Function </h2>
<p><a href="#toctop">To contents</a></p>
<h3 id="what-does-it-look-like-3">What does it look like?</h3>
<div id="fig4" class="figure_container">
<div class="figure_images">
<img title="Simple NN" src="/img/transferFunctions/gaussian.png" width="40%"><img title="Simple NN" src="/img/transferFunctions/dgaussian.png" width="40%">
</div>
<div class="figure_caption">
<font color="blue">Figure 4</font>: The gaussian function (left) and its derivative (right)
</div>
</div>
<h3 id="formulae-3">Formulae</h3>
<div >$$
f\left( x_{i}\right ) = e^{ -x_{i}^{2}}, \ \
f^{\prime}\left( x_{i}\right ) = - 2x_{i} e^{ - x_{i}^{2}}
$$</div>
<h3 id="python-code-2">Python Code</h3>
<pre><code class="language-python">import numpy as np

def gaussian(x, Derivative=False):
    if not Derivative:
        return np.exp(-x**2)
    else:
        # Chain rule: d/dx exp(-x^2) = -2x exp(-x^2)
        return -2 * x * np.exp(-x**2)
</code></pre>
<h3 id="why-is-it-used-3">Why is it used?</h3>
<p>The gaussian function is an even function, thus it gives the same output for equally positive and negative values of input. It gives its maximal output when there is no input and has decreasing output with increasing distance from zero. We can perhaps imagine this function is used in a node where the input feature is less likely to contribute to the final result.</p>
<p><br></p>
<hr />
<h2 id="step"> Step (or Heaviside) Function </h2>
<p><a href="#toctop">To contents</a></p>
<h3 id="what-does-it-look-like-4">What does it look like?</h3>
<div id="fig5" class="figure_container">
<div class="figure_images">
<img title="Simple NN" src="/img/transferFunctions/step.png" width="40%">
</div>
<div class="figure_caption">
<font color="blue">Figure 5</font>: The Heaviside function (left) and its derivative (right)
</div>
</div>
<h3 id="formulae-4">Formulae</h3>
<div>$$
f(x)=
\begin{cases}
\begin{align}
0 \ &: \ x_{i} \leq T\\
1 \ &: \ x_{i} > T\\
\end{align}
\end{cases}
$$</div>
<h3 id="why-is-it-used-4">Why is it used?</h3>
<p>Some cases call for a function which applies a hard threshold: either the output is precisely a single value, or not. The other functions we’ve looked at have an intrinsic probabilistic output to them i.e. a higher output in decimal format implying a greater probability of being 1 (or a high output). The step function does away with this, opting for a definite high or low output depending on some threshold $T$ on the input.</p>
<p>However, the step function is discontinuous at the threshold and therefore not differentiable there; everywhere else its derivative is zero (formally, the derivative is the Dirac delta function). Because back propagation relies on a useful gradient, this function is not used with it in practice.</p>
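<p>For completeness, a minimal sketch of the step function in Python, following the formula above. The function name <code>step</code> and the default threshold <code>T=0.0</code> are assumptions for this example, and there is no derivative branch because the function is not used with back propagation.</p>
<pre><code class="language-python">import numpy as np

def step(x, T=0.0):
    # 1 where the input is strictly above the threshold T, otherwise 0
    return np.where(np.asarray(x) > T, 1.0, 0.0)
</code></pre>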
<p><br></p>
<hr />
<h2 id="ramp"> Ramp Function </h2>
<p><a href="#toctop">To contents</a></p>
<h3 id="what-does-it-look-like-5">What does it look like?</h3>
<div id="fig6" class="figure_container">
<div class="figure_images">
<img title="Simple NN" src="/img/transferFunctions/ramp.png" width="40%"><img title="Simple NN" src="/img/transferFunctions/dramp.png" width="40%">
</div>
<div class="figure_caption">
<font color="blue">Figure 6</font>: The ramp function (left) and its derivative (right) with $T1=-2$ and $T2=3$.
</div>
</div>
<h3 id="formulae-5">Formulae</h3>
<div>$$
f(x)=
\begin{cases}
\begin{align}
0 \ &: \ x_{i} \leq T_{1}\\[0.5em]
\frac{\left( x_{i} - T_{1} \right)}{\left( T_{2} - T_{1} \right)} \ &: \ T_{1} \leq x_{i} \leq T_{2}\\[0.5em]
1 \ &: \ x_{i} > T_{2}\\
\end{align}
\end{cases}
$$</div>
<h3 id="python-code-3">Python Code</h3>
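<pre><code class="language-python">import numpy as np

def ramp(x, Derivative=False, T1=0, T2=None):
    # A default argument cannot refer to x, so T2 falls back to max(x) here
    x = np.asarray(x, dtype=float)
    if T2 is None:
        T2 = np.max(x)
    if not Derivative:
        out = (x - T1) / (T2 - T1)
        out[(x < T1)] = 0
        out[(x > T2)] = 1
        return out
    else:
        # The slope is 1/(T2 - T1) on the ramp and 0 outside it
        out = np.full(x.shape, 1.0 / (T2 - T1))
        out[(x < T1) | (x > T2)] = 0
        return out
</code></pre>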
<h3 id="why-is-it-used-5">Why is it used?</h3>
<p>The ramp function is a truncated version of the linear function. From its shape, the ramp function looks like a more definitive version of the sigmoid function in that it maps a range of inputs to outputs over the range $[0, 1]$ but this time with definitive cut-off points $T1$ and $T2$. This gives the function the ability to fire the node very definitively above a threshold, but still have some uncertainty in the lower regions. It may not be common to see $T1$ in the negative region unless the ramp is equally distributed about $0$.</p>
<h3 id="relu"> 6.1 Rectified Linear Unit (ReLU) </h3>
<p>There is a popular, special case of the ramp function in use in the powerful <em>convolutional neural network</em> (CNN) architecture called a <em><strong>Re</strong>ctified <strong>L</strong>inear <strong>U</strong>nit</em> (ReLU). In a ReLU, $T1=0$ and $T2$ is the maximum of the input giving a linear function with no negative values as below:</p>
<div id="fig7" class="figure_container">
<div class="figure_images">
<img title="Simple NN" src="/img/transferFunctions/relu.png" width="40%"><img title="Simple NN" src="/img/transferFunctions/drelu.png" width="40%">
</div>
<div class="figure_caption">
<font color="blue">Figure 7</font>: The Rectified Linear Unit (ReLU) (left) with its derivative (right).
</div>
</div>
<p>and in Python:</p>
<pre><code class="language-python">import numpy as np

def relu(x, Derivative=False):
    if not Derivative:
        return np.maximum(0, x)
    else:
        # Gradient is 1 for non-negative inputs and 0 for negative inputs
        out = np.ones(x.shape)
        out[(x < 0)] = 0
        return out
</code></pre>A Simple Neural Network - Mathematics
/post/neuralnetwork/
Mon, 06 Mar 2017 17:04:53 +0000/post/neuralnetwork/<p>This is the first part of a series of tutorials on Simple Neural Networks (NN). Tutorials on neural networks can be found all over the internet. Though many of them are the same, each is written (or recorded) slightly differently. This means that I always feel like I learn something new or get a better understanding of things with every tutorial I see. I’d like to make this tutorial as clear as I can, so sometimes the maths may be simplistic, but hopefully it’ll give you a good understanding of what’s going on. <strong>Please</strong> let me know if any of the notation is incorrect or there are any mistakes - either comment or use the contact page on the left.</p>
<div id="toctop"></div>
<ol>
<li><a href="#nnarchitecture">Neural Network Architecture</a></li>
<li><a href="#transferFunction">Transfer Function</a></li>
<li><a href="#feedforward">Feed-forward</a></li>
<li><a href="#error">Error</a></li>
<li><a href="#backPropagationGrads">Back Propagation - the Gradients</a></li>
<li><a href="#bias">Bias</a></li>
<li><a href="#backPropagationAlgorithm">Back Propagation - the Algorithm</a></li>
</ol>
<h2 id="nnarchitecture">1. Neural Network Architecture </h2>
<p><a href="#toctop">To contents</a></p>
<p>By now, you may well have come across diagrams which look very similar to the one below. It shows some input node, connected to some output node via an intermediate node in what is called a ‘hidden layer’ - ‘hidden’ because in the use of NN only the input and output is of concern to the user, the ‘under-the-hood’ stuff may not be interesting to them. In real, high-performing NN there are usually more hidden layers.</p>
<div class="figure_container">
<div class="figure_images">
<img title="Simple NN" width=40% src="/img/simpleNN/simpleNN.png">
</div>
<div class="figure_caption">
<font color="blue">Figure 1</font>: A simple 2-layer NN with 2 features in the input layer, 3 nodes in the hidden layer and two nodes in the output layer.
</div>
</div>
<p>When we train our network, the nodes in the hidden layer each perform a calculation using the values from the input nodes. The output of this is passed on to the nodes of the next layer. When the output hits the final layer, the ‘output layer’, the results are compared to the real, known outputs and some tweaking of the network is done to make the output more similar to the real results. This is done with an algorithm called <em>back propagation</em>. Before we get there, let’s take a closer look at these calculations being done by the nodes.</p>
<h2 id="transferFunction">2. Transfer Function </h2>
<p><a href="#toctop">To contents</a></p>
<p>At each node in the hidden and output layers of the NN, an <em>activation</em> or <em>transfer</em> function is executed. This function takes in the output of the previous node, and multiplies it by some <em>weight</em>. These weights are the lines which connect the nodes. The weights that come out of one node can all be different, that is they will <em>activate</em> different neurons. There can be many forms of the transfer function, we will first look at the <em>sigmoid</em> transfer function as it seems traditional.</p>
<div class="figure_container">
<div class="figure_images">
<img title="The sigmoid function" width=50% src="/img/simpleNN/sigmoid.png">
</div>
<div class="figure_caption">
<font color="blue">Figure 2</font>: The sigmoid function.
</div>
</div>
<p>As you can see from the figure, the sigmoid function takes any real-valued input and maps it to a real number in the range $(0, 1)$ - i.e. between, but not equal to, 0 and 1. We can think of this almost like saying ‘if the value we have maps to an output near 1, this node fires, if it maps to an output near 0, the node does not fire’. The equation for this sigmoid function is:</p>
<div id="eqsigmoidFunction">$$
\sigma ( x ) = \frac{1}{1 + e^{-x}}
$$</div>
<p>We need to have the derivative of this transfer function so that we can perform back propagation later on. This is the process whereby the connections in the network are updated to tune the performance of the NN. We’ll talk about this in more detail later, but let’s find the derivative now.</p>
<div>
$$
\begin{align*}
\frac{d}{dx}\sigma ( x ) &= \frac{d}{dx} \left( 1 + e^{ -x }\right)^{-1}\\
&= -1 \times -e^{-x} \times \left(1 + e^{-x}\right)^{-2}= \frac{ e^{-x} }{ \left(1 + e^{-x}\right)^{2} } \\
&= \frac{\left(1 + e^{-x}\right) - 1}{\left(1 + e^{-x}\right)^{2}}
= \frac{\left(1 + e^{-x}\right) }{\left(1 + e^{-x}\right)^{2}} - \frac{1}{\left(1 + e^{-x}\right)^{2}}
= \frac{1}{\left(1 + e^{-x}\right)} - \left( \frac{1}{\left(1 + e^{-x}\right)} \right)^{2} \\[0.5em]
&= \sigma ( x ) - \sigma ( x ) ^ {2}
\end{align*}
$$</div>
<p>Therefore, we can write the derivative of the sigmoid function as:</p>
<div id="eqdsigmoid">$$
\sigma^{\prime}( x ) = \sigma (x ) \left( 1 - \sigma ( x ) \right)
$$</div>
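<p>If you’d rather not take the algebra on trust, a quick numerical check is to compare this formula against a finite-difference estimate of the slope. This is only a sketch: the sample points and the step size <code>h</code> are arbitrary choices.</p>
<pre><code class="language-python">import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4, 4, 9)                  # a few sample inputs
h = 1e-5                                   # finite-difference step (arbitrary)
numerical = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
analytical = sigmoid(x) * (1 - sigmoid(x))
max_gap = np.max(np.abs(numerical - analytical))
</code></pre>
<p>The two estimates should agree closely at every sample point, with <code>max_gap</code> many orders of magnitude smaller than the derivative itself.</p>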
<p>The sigmoid function has the nice property that its derivative is very simple: a bonus when we want to hard-code this into our NN later on. Now that we have our activation or transfer function selected, what do we do with it?</p>
<h2 id="feedforward">3. Feed-forward </h2>
<p><a href="#toctop">To contents</a></p>
<p>During a feed-forward pass, the network takes in the input values and gives us some output values. To see how this is done, let’s first consider a 2-layer neural network like the one in Figure 1. Here we are going to refer to:</p>
<ul>
<li>$i$ - the $i^{\text{th}}$ node of the input layer $I$</li>
<li>$j$ - the $j^{\text{th}}$ node of the hidden layer $J$</li>
<li>$k$ - the $k^{\text{th}}$ node of the output layer $K$</li>
</ul>
<p>The input to the activation function at a node $j$ in the hidden layer is:</p>
<div>$$
\begin{align}
x_{j} &= \xi_{1} w_{1j} + \xi_{2} w_{2j} \\[0.5em]
&= \sum_{i \in I} \xi_{i} w_{i j}
\end{align}
$$</div>
<p>where $\xi_{i}$ is the value of the $i^{\text{th}}$ input node and $w_{i j}$ is the weight of the connection between $i^{\text{th}}$ input node and the $j^{\text{th}}$ hidden node. <strong>In short:</strong> at each hidden layer node, multiply each input value by the weight of the connection received by that node and add them together.</p>
<p><strong>Note:</strong> the weights are initialised when the network is set up. Sometimes they are all set to 1, or often they’re set to some small random value.</p>
<p>We apply the activation function on $x_{j}$ at the $j^{\text{th}}$ hidden node and get:</p>
<div>$$
\begin{align}
\mathcal{O}_{j} &= \sigma(x_{j}) \\
&= \sigma( \xi_{1} w_{1j} + \xi_{2} w_{2j})
\end{align}
$$</div>
<p>$\mathcal{O}_{j}$ is the output of the $j^{\text{th}}$ hidden node. This is calculated for each of the $j$ nodes in the hidden layer. The resulting outputs now become the input for the next layer in the network. In our case, this is the final output layer. So for each of the $k$ nodes in $K$:</p>
<div>$$
\begin{align}
\mathcal{O}_{k} &= \sigma(x_{k}) \\
&= \sigma \left( \sum_{j \in J} \mathcal{O}_{j} w_{jk} \right)
\end{align}
$$</div>
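<p>The whole feed-forward pass for the network in Figure 1 (2 inputs, 3 hidden nodes, 2 outputs) can be sketched in a few lines of NumPy. The input values, the random weights and the array names are all assumptions made for this example, and there is no bias yet.</p>
<pre><code class="language-python">import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
xi = np.array([0.5, -1.0])                  # the two input values (hypothetical)
W_ij = rng.normal(scale=0.1, size=(2, 3))   # W_ij[i, j]: input i to hidden j
W_jk = rng.normal(scale=0.1, size=(3, 2))   # W_jk[j, k]: hidden j to output k

x_j = xi @ W_ij          # sum over i of xi_i * w_ij, one value per hidden node
O_j = sigmoid(x_j)       # outputs of the hidden layer
x_k = O_j @ W_jk         # sum over j of O_j * w_jk, one value per output node
O_k = sigmoid(x_k)       # outputs of the network
</code></pre>
<p>Storing the weights with the sending node as the row index keeps the code close to the $w_{ij}$ subscripts used in the sums above.</p>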
<p>As we’ve reached the end of the network, this is also the end of the feed-forward pass. So how well did our network do at getting the correct result $\mathcal{O}_{k}$? As this is the training phase of our network, the true results will be known and we can calculate the error.</p>
<h2 id="error">4. Error </h2>
<p><a href="#toctop">To contents</a></p>
<p>We measure error at the end of each forward pass. This allows us to quantify how well our network has performed in getting the correct output. Let’s define $t_{k}$ as the expected or <em>target</em> value of the $k^{\text{th}}$ node of the output layer $K$. Then the error $E$ on the entire output is:</p>
<div id="eqerror">$$
\text{E} = \frac{1}{2} \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right)^{2}
$$</div>
<p>Don’t be put off by the random <sup>1</sup>⁄<sub>2</sub> in front there, it’s been manufactured that way to make the upcoming maths easier. The rest of this should be easy enough: get the residual (difference between the target and output values), square this to get rid of any negatives and sum this over all of the nodes in the output layer.</p>
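<p>As a tiny worked example, with made-up outputs and targets for a two-node output layer:</p>
<pre><code class="language-python">import numpy as np

O_k = np.array([0.8, 0.3])     # network outputs (hypothetical values)
t_k = np.array([1.0, 0.0])     # target values (hypothetical values)

# Residuals are -0.2 and 0.3, so E = 0.5 * (0.04 + 0.09) = 0.065
E = 0.5 * np.sum((O_k - t_k) ** 2)
</code></pre>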
<p>Good! Now how does this help us? Our aim here is to find a way to tune our network such that when we do a forward pass of the input data, the output is exactly what we know it should be. But we can’t change the input data, so there are only two other things we can change:</p>
<ol>
<li>the weights going into the activation function</li>
<li>the activation function itself</li>
</ol>
<p>We will indeed consider the second case in another post, but the magic of NN is all about the <em>weights</em>. Getting each weight i.e. each connection between nodes, to be just the perfect value, is what back propagation is all about. The back propagation algorithm we will look at in the next section, but let’s go ahead and set it up by considering the following: how much of this error $E$ has come from each of the weights in the network?</p>
<p>We’re asking, what is the proportion of the error coming from each of the $W_{jk}$ connections between the nodes in layer $J$ and the output layer $K$. Or in mathematical terms:</p>
<div>$$
\frac{\partial{\text{E}}}{\partial{W_{jk}}} = \frac{\partial{}}{\partial{W_{jk}}} \frac{1}{2} \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right)^{2}
$$</div>
<p>If you’re not concerned with working out the derivative, skip this highlighted section.</p>
<div class="highlight_section">
To tackle this we can use the following bits of knowledge: the derivative of the sum is equal to the sum of the derivatives i.e. we can move the derivative term inside of the summation:
<div>$$ \frac{\partial{\text{E}}}{\partial{W_{jk}}} = \frac{1}{2} \sum_{k \in K} \frac{\partial{}}{\partial{W_{jk}}} \left( \mathcal{O}_{k} - t_{k} \right)^{2}$$</div>
<ul>
<li>the weight $W_{jk}$ only feeds into output node $k$, so it has no effect on the output of any other node in $K$. The derivative of every other term in the summation with respect to $W_{jk}$ is therefore zero, and the summation goes away:</li>
</ul>
<div>$$ \frac{\partial{\text{E}}}{\partial{W_{jk}}} = \frac{1}{2} \frac{\partial{}}{\partial{W_{jk}}} \left( \mathcal{O}_{k} - t_{k} \right)^{2}$$</div>
<ul>
<li>apply the power rule knowing that $t_{k}$ is a constant:</li>
</ul>
<div>$$
\begin{align}
\frac{\partial{\text{E}}}{\partial{W_{jk}}} &= \frac{1}{2} \times 2 \times \left( \mathcal{O}_{k} - t_{k} \right) \frac{\partial{}}{\partial{W_{jk}}} \left( \mathcal{O}_{k}\right) \\
&= \left( \mathcal{O}_{k} - t_{k} \right) \frac{\partial{}}{\partial{W_{jk}}} \left( \mathcal{O}_{k}\right)
\end{align}
$$</div>
<ul>
<li>the leftover derivative is the change in the output values with respect to the weights. Substituting $ \mathcal{O}_{k} = \sigma(x_{k}) $ and the sigmoid derivative $\sigma^{\prime}( x ) = \sigma (x ) \left( 1 - \sigma ( x ) \right)$:</li>
</ul>
<div>$$
\frac{\partial{\text{E}}}{\partial{W_{jk}}} = \left( \mathcal{O}_{k} - t_{k} \right) \sigma (x_{k} ) \left( 1 - \sigma ( x_{k} ) \right) \frac{\partial{}}{\partial{W_{jk}}} \left( x_{k}\right)
$$</div>
<ul>
<li>the final derivative: the input value $x_{k}$ is the sum of $\mathcal{O}_{j} W_{jk}$ over the nodes in $J$, i.e. each output of the previous layer times the weight to this layer. Only one term of that sum contains $W_{jk}$, so the change in $x_{k}$ with respect to $W_{jk}$ just gives us the output value of the previous layer $ \mathcal{O}_{j} $ and so the full derivative becomes:</li>
</ul>
<div>$$
\begin{align}
\frac{\partial{\text{E}}}{\partial{W_{jk}}} &= \left( \mathcal{O}_{k} - t_{k} \right) \sigma (x_{k} ) \left( 1 - \sigma ( x_{k} ) \right) \frac{\partial{}}{\partial{W_{jk}}} \left( \mathcal{O}_{j} W_{jk} \right) \\[0.5em]
&=\left( \mathcal{O}_{k} - t_{k} \right) \sigma (x_{k} ) \left( 1 - \sigma ( x_{k} ) \right) \mathcal{O}_{j}
\end{align}
$$</div>
<p>Finally, we can replace the sigmoid function $\sigma(x_{k})$ with the output of the layer, $\mathcal{O}_{k}$.
</div></p>
<p>The derivative of the error function with respect to the weights is then:</p>
<div id="derror">$$
\frac{\partial{\text{E}}}{\partial{W_{jk}}} =\left( \mathcal{O}_{k} - t_{k} \right) \mathcal{O}_{k} \left( 1 - \mathcal{O}_{k} \right) \mathcal{O}_{j}
$$</div>
<p>We group the terms involving $k$ and define:</p>
<div>$$
\delta_{k} = \mathcal{O}_{k} \left( 1 - \mathcal{O}_{k} \right) \left( \mathcal{O}_{k} - t_{k} \right)
$$</div>
<p>And therefore:</p>
<div id="derrorjk">$$
\frac{\partial{\text{E}}}{\partial{W_{jk}}} = \mathcal{O}_{j} \delta_{k}
$$</div>
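<p>A useful sanity check on this result is to nudge a single weight and see how much the error actually changes. The sketch below compares that finite-difference estimate with $\mathcal{O}_{j} \delta_{k}$ for a layer with 3 hidden nodes and 2 output nodes; all of the values, the helper <code>error</code> and the choice of which weight to nudge are assumptions made for illustration.</p>
<pre><code class="language-python">import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def error(W_jk, O_j, t_k):
    # E = 1/2 * sum over k of (O_k - t_k)^2
    O_k = sigmoid(O_j @ W_jk)
    return 0.5 * np.sum((O_k - t_k) ** 2)

O_j = np.array([0.6, 0.3, 0.9])            # hidden outputs (hypothetical)
t_k = np.array([1.0, 0.0])                 # targets (hypothetical)
W_jk = np.array([[0.1, 0.4],
                 [-0.2, 0.5],
                 [0.3, -0.6]])             # W_jk[j, k]: hidden j to output k

O_k = sigmoid(O_j @ W_jk)
delta_k = O_k * (1 - O_k) * (O_k - t_k)
grad = np.outer(O_j, delta_k)              # grad[j, k] = O_j * delta_k

# Nudge a single weight up and down and measure the change in the error
j, k, h = 2, 1, 1e-6
W_plus, W_minus = W_jk.copy(), W_jk.copy()
W_plus[j, k] += h
W_minus[j, k] -= h
numerical = (error(W_plus, O_j, t_k) - error(W_minus, O_j, t_k)) / (2 * h)
</code></pre>
<p>If the derivation holds, <code>numerical</code> and <code>grad[j, k]</code> agree to many decimal places for any choice of <code>j</code> and <code>k</code>.</p>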
<p>So we have an expression for the amount of error, called ‘delta’ ($\delta_{k}$), on the weights from the nodes in $J$ to each node $k$ in $K$. But how does this help us to improve our network? We need to back propagate the error.</p>
<h2 id="backPropagationGrads">5. Back Propagation - the gradients</h2>
<p><a href="#toctop">To contents</a></p>
<p>Back propagation takes the error function we found in the previous section, uses it to calculate the error on the current layer and updates the weights to that layer by some amount.</p>
<p>So far we’ve only looked at the error on the output layer, what about the hidden layer? This also has an error, but the error here depends on the output layer’s error too (because this is where the difference between the target $t_{k}$ and output $\mathcal{O}_{k}$ can be calculated). Let’s have a look at the error on the weights of the hidden layer $W_{ij}$:</p>
<div>$$ \frac{\partial{\text{E}}}{\partial{W_{ij}}} = \frac{\partial{}}{\partial{W_{ij}}} \frac{1}{2} \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right)^{2}$$</div>
<p>Now, unlike before, we cannot just drop the summation, as the derivative is not directly acting on a subscript $k$ in the summation. We should be careful to note that the output from every node in $J$ is actually connected to each of the nodes in $K$, so the summation should stay. But we can still use the same tricks as before: let’s use the power rule again and move the derivative inside (because the summation is finite). The target $t_{k}$ is a constant, so only the derivative of $\mathcal{O}_{k}$ survives:</p>
<div>$$
\begin{align}
\frac{\partial{\text{E}}}{\partial{W_{ij}}} &= \frac{1}{2} \times 2 \times \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right) \frac{\partial{}}{\partial{W_{ij}}} \left( \mathcal{O}_{k} - t_{k} \right) \\
&= \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right) \frac{\partial{}}{\partial{W_{ij}}} \mathcal{O}_{k}
\end{align}
$$</div>
<p>Again, we substitute $\mathcal{O}_{k} = \sigma( x_{k})$ and its derivative and revert back to our output notation:</p>
<div>$$
\begin{align}
\frac{\partial{\text{E}}}{\partial{W_{ij}}} &= \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right) \frac{\partial{}}{\partial{W_{ij}}} (\sigma(x_{k}) )\\
&= \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right) \sigma(x_{k}) \left( 1 - \sigma(x_{k}) \right) \frac{\partial{}}{\partial{W_{ij}}} (x_{k}) \\
&= \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right) \mathcal{O}_{k} \left( 1 - \mathcal{O}_{k} \right) \frac{\partial{}}{\partial{W_{ij}}} (x_{k})
\end{align}
$$</div>
<p>This still looks familiar from the output layer derivative, but now we’re struggling with the derivative of the input to $k$, i.e. $x_{k}$, with respect to the weights from $I$ to $J$. Let’s use the chain rule to break apart this derivative in terms of the output from $J$:</p>
<div> $$
\frac{\partial{ x_{k}}}{\partial{W_{ij}}} = \frac{\partial{ x_{k}}}{\partial{\mathcal{O}_{j}}}\frac{\partial{\mathcal{O}_{j}}}{\partial{W_{ij}}}
$$</div>
<p>The change of the input to the $k^{\text{th}}$ node with respect to the output from the $j^{\text{th}}$ node is down to a product with the weights, therefore this derivative just becomes the weights $W_{jk}$. The final derivative has nothing to do with the subscript $k$ anymore, so we’re free to move this around. Let’s put it at the beginning:</p>
<div>$$
\begin{align}
\frac{\partial{\text{E}}}{\partial{W_{ij}}} &= \frac{\partial{\mathcal{O}_{j}}}{\partial{W_{ij}}} \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right) \mathcal{O}_{k} \left( 1 - \mathcal{O}_{k} \right) W_{jk}
\end{align}
$$</div>
<p>Let’s finish the derivatives, remembering that the output of the node $j$ is just $\mathcal{O}_{j} = \sigma(x_{j}) $ and we know the derivative of this function too:</p>
<div>$$
\begin{align}
\frac{\partial{\text{E}}}{\partial{W_{ij}}} &= \frac{\partial{}}{\partial{W_{ij}}}\sigma(x_{j}) \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right) \mathcal{O}_{k} \left( 1 - \mathcal{O}_{k} \right) W_{jk} \\
&= \sigma(x_{j}) \left( 1 - \sigma(x_{j}) \right) \frac{\partial{x_{j} }}{\partial{W_{ij}}} \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right) \mathcal{O}_{k} \left( 1 - \mathcal{O}_{k} \right) W_{jk} \\
&= \mathcal{O}_{j} \left( 1 - \mathcal{O}_{j} \right) \frac{\partial{x_{j} }}{\partial{W_{ij}}} \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right) \mathcal{O}_{k} \left( 1 - \mathcal{O}_{k} \right) W_{jk}
\end{align}
$$</div>
<p>The final derivative is straightforward too: the derivative of the input to $j$ with respect to the weights is just the output of the previous layer, which in our case is $\mathcal{O}_{i}$,</p>
<div>$$
\begin{align}
\frac{\partial{\text{E}}}{\partial{W_{ij}}} &= \mathcal{O}_{j} \left( 1 - \mathcal{O}_{j} \right) \mathcal{O}_{i} \sum_{k \in K} \left( \mathcal{O}_{k} - t_{k} \right) \mathcal{O}_{k} \left( 1 - \mathcal{O}_{k} \right) W_{jk}
\end{align}
$$</div>
<p>Almost there! Recall that we defined $\delta_{k}$ earlier, so let’s sub that in:</p>
<div>$$
\begin{align}
\frac{\partial{\text{E}}}{\partial{W_{ij}}} &= \mathcal{O}_{j} \left( 1 - \mathcal{O}_{j} \right) \mathcal{O}_{i} \sum_{k \in K} \delta_{k} W_{jk}
\end{align}
$$</div>
<p>To clean this up, we now define the ‘delta’ for our hidden layer:</p>
<div>$$
\delta_{j} = \mathcal{O}_{j} \left( 1 - \mathcal{O}_{j} \right) \sum_{k \in K} \delta_{k} W_{jk}
$$</div>
<p>Thus, the amount of error on each of the weights going into our hidden layer:</p>
<div id="derrorij">$$
\frac{\partial{\text{E}}}{\partial{W_{ij}}} = \mathcal{O}_{i} \delta_{j}
$$</div>
<p><strong>Note:</strong> the reason for the name <em>back</em> propagation is that we must calculate the errors at the far end of the network and work backwards to be able to calculate the errors on the weights at the front.</p>
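<p>The hidden-layer result can be checked in the same way. The sketch below assumes a tiny 2-2-2 network with no bias terms and made-up weights. It computes $\delta_{k}$ first, then $\delta_{j} = \mathcal{O}_{j} \left( 1 - \mathcal{O}_{j} \right) \sum_{k \in K} \delta_{k} W_{jk}$, and finally $\mathcal{O}_{i} \delta_{j}$, which it compares against a finite-difference estimate. The helper names (<code>forward</code>, <code>error</code>, <code>hidden_gradient</code>) are hypothetical.</p>

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(o_i, w_ij, w_jk):
    """Feed-forward through one hidden layer; w[a][b] links node a to node b."""
    o_j = [sigmoid(sum(o_i[i] * w_ij[i][j] for i in range(len(o_i))))
           for j in range(len(w_ij[0]))]
    o_k = [sigmoid(sum(o_j[j] * w_jk[j][k] for j in range(len(o_j))))
           for k in range(len(w_jk[0]))]
    return o_j, o_k

def error(o_i, w_ij, w_jk, targets):
    """E = 1/2 * sum_k (O_k - t_k)^2."""
    _, o_k = forward(o_i, w_ij, w_jk)
    return 0.5 * sum((o - t) ** 2 for o, t in zip(o_k, targets))

def hidden_gradient(o_i, w_ij, w_jk, targets):
    """Analytic dE/dW_ij = O_i * delta_j, working backwards from the output."""
    o_j, o_k = forward(o_i, w_ij, w_jk)
    delta_k = [o * (1 - o) * (o - t) for o, t in zip(o_k, targets)]
    delta_j = [o_j[j] * (1 - o_j[j])
               * sum(delta_k[k] * w_jk[j][k] for k in range(len(delta_k)))
               for j in range(len(o_j))]
    return [[o_i[i] * delta_j[j] for j in range(len(delta_j))]
            for i in range(len(o_i))]

# Hypothetical inputs, weights and targets, for illustration only.
o_i = [0.5, 0.9]
w_ij = [[0.1, -0.3], [0.2, 0.4]]
w_jk = [[0.7, -0.6], [0.05, 0.8]]
targets = [1.0, 0.0]

analytic = hidden_gradient(o_i, w_ij, w_jk, targets)

# Central finite difference on every hidden-layer weight.
eps = 1e-6
numeric = [[0.0, 0.0], [0.0, 0.0]]
for i in range(2):
    for j in range(2):
        plus = [row[:] for row in w_ij]
        minus = [row[:] for row in w_ij]
        plus[i][j] += eps
        minus[i][j] -= eps
        numeric[i][j] = (error(o_i, plus, w_jk, targets)
                         - error(o_i, minus, w_jk, targets)) / (2 * eps)

max_diff = max(abs(analytic[i][j] - numeric[i][j])
               for i in range(2) for j in range(2))
print(max_diff)
```

<p>Note that <code>delta_k</code> has to be computed before <code>delta_j</code>, which is exactly the backwards ordering the name refers to.</p>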
<h2 id="bias">6. Bias </h2>
<p><a href="#toctop">To contents</a></p>
<p>Let’s remind ourselves what happens inside our hidden layer nodes:</p>
<div class="figure_container">
<div class="figure_images">
<img title="Simple NN" width=50% src="/img/simpleNN/nodeInsideNoBias.png">
</div>
<div class="figure_caption">
<font color="blue">Figure 3</font>: The insides of a hidden layer node, $j$.
</div>
</div>
<ol>
<li>Each feature $\xi_{i}$ from the input layer $I$ is multiplied by some weight $w_{ij}$</li>
<li>These are added together to get $x_{i}$, the total weighted input from the nodes in $I$</li>
<li>$x_{i}$ is passed through the activation, or transfer, function $\sigma(x_{i})$</li>
<li>This gives the output $\mathcal{O}_{j}$ for each of the $j$ nodes in hidden layer $J$</li>
<li>$\mathcal{O}_{j}$ from each of the $J$ nodes becomes $\xi_{j}$ for the next layer</li>
</ol>
<p>When we talk about the <em>bias</em> term in NN, we are talking about an additional parameter that is included in the summation of step 2 above. The bias term is usually denoted with the symbol $\theta$ (theta). Its function is to act as a threshold for the activation (transfer) function. It is given the value of 1 and is not connected to anything else. As such, any derivative of the node’s input with respect to the bias term just gives a constant, 1. This allows us to think of the bias term as an output from a node with the value of 1. This will be updated later during backpropagation to change the threshold at which the node fires.</p>
<p>Let’s update the equation for $x_{i}$:</p>
<div>$$
\begin{align}
x_{i} &= \xi_{1j} w_{1j} + \xi_{2j} w_{2j} + \theta_{j} \\[0.5em]
\sigma( x_{i} ) &= \sigma \left( \sum_{i \in I} \left( \xi_{ij} w_{ij} \right) + \theta_{j} \right)
\end{align}
$$</div>
<p>and put it on the diagram:</p>
<div class="figure_container">
<div class="figure_images">
<img title="Simple NN" width=50% src="/img/simpleNN/nodeInside.png">
</div>
<div class="figure_caption">
<font color="blue">Figure 3</font>: The insides of a hidden layer node, $j$, now including the bias term $\theta_{j}$.
</div>
</div>
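<p>To make the role of the bias concrete, here is a small sketch of a single hidden node: the weighted sum of the features, plus $\theta_{j}$, passed through the sigmoid. The function name <code>node_output</code> and all of the numbers are assumptions made for this example.</p>

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def node_output(features, weights, theta=0.0):
    """Output O_j of one node: sigma(sum_i xi_i * w_ij + theta_j)."""
    x = sum(xi * w for xi, w in zip(features, weights)) + theta
    return sigmoid(x)

# Hypothetical features and weights for a node with two inputs.
features = [0.5, -1.0]
weights = [0.8, 0.3]

no_bias = node_output(features, weights)               # weighted sum is about 0.1
with_bias = node_output(features, weights, theta=2.0)  # same inputs, shifted by theta

print(no_bias, with_bias)
```

<p>With the same inputs and weights, a positive $\theta_{j}$ pushes the node’s output closer to 1 and a negative one pushes it closer to 0. This is the sense in which the bias shifts the threshold at which the node fires.</p>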
<h2 id="backPropagationAlgorithm">7. Back Propagation - the algorithm</h2>
<p><a href="#toctop">To contents</a></p>
<p>Now we have all of the pieces! We’ve got the initial outputs after our feed-forward, we have the equations for the delta terms (how much of the error is attributable to each set of weights) and we know we need to update our bias term too. So what does it look like?</p>
<ol>
<li>Input the data into the network and feed-forward</li>
<li><p>For each of the <em>output</em> nodes calculate:</p>
<div>$$
\delta_{k} = \mathcal{O}_{k} \left( 1 - \mathcal{O}_{k} \right) \left( \mathcal{O}_{k} - t_{k} \right)
$$</div></li>
<li><p>For each of the <em>hidden layer</em> nodes calculate:</p>
<div>$$
\delta_{j} = \mathcal{O}_{j} \left( 1 - \mathcal{O}_{j} \right) \sum_{k \in K} \delta_{k} W_{jk}
$$</div>
</li>
<li><p>Calculate the changes that need to be made to the weights and bias terms:</p>
<div>$$
\begin{align}
\Delta W &= -\eta \ \delta_{l} \ \mathcal{O}_{l-1} \\
\Delta\theta &= -\eta \ \delta_{l}
\end{align}
$$</div>
</li>
<li><p>Update the weights and biases across the network:</p>
<div>$$
\begin{align}
W + \Delta W &\rightarrow W \\
\theta + \Delta\theta &\rightarrow \theta
\end{align}
$$</div>
</li>
</ol>
<p>Here, $\eta$ is the learning rate: a small number that limits the size of the updates we make, because we don’t want the network jumping around everywhere. The $l$ subscript denotes the deltas and outputs for layer $l$. That is, we compute the delta for each of the nodes in a layer and collect them into a vector. Multiplying each delta by each output value of the previous layer (an outer product of the two vectors) gives our update $\Delta W$ for the weights of the current layer. The bias update works the same way, with the previous output replaced by 1.</p>
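<p>Steps 4 and 5 for a single layer might look something like the sketch below. One way to read $\Delta W$ is as an outer product: every output of the previous layer is multiplied by every delta of the current layer. The function names and the values of <code>eta</code>, <code>prev_out</code> and <code>delta</code> are hypothetical.</p>

```python
def weight_update(prev_out, delta, eta):
    """Delta W[a][b] = -eta * O_a * delta_b, one entry per weight."""
    return [[-eta * o * d for d in delta] for o in prev_out]

def bias_update(delta, eta):
    """Delta theta[b] = -eta * delta_b."""
    return [-eta * d for d in delta]

def apply_update(weights, delta_w):
    """W + Delta W -> W, returned as a new nested list."""
    return [[w + dw for w, dw in zip(row, d_row)]
            for row, d_row in zip(weights, delta_w)]

# Hypothetical values for a layer with two inputs and two nodes.
eta = 0.5
prev_out = [1.0, 0.5]                  # O_{l-1}
delta = [0.2, -0.4]                    # delta_l
weights = [[0.3, 0.3], [0.0, -0.2]]    # W, indexed [previous node][current node]

delta_w = weight_update(prev_out, delta, eta)
delta_theta = bias_update(delta, eta)
new_w = apply_update(weights, delta_w)

print(delta_w)
print(delta_theta)
print(new_w)
```

<p>A node with a positive delta has its incoming weights reduced, and one with a negative delta has them increased, in proportion to how active each input was.</p>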
<p>This algorithm is looped over and over until the error between the output and the target values is below some set threshold. Depending on the size of the network, i.e. the number of layers and the number of nodes per layer, it can take a long time to complete one ‘epoch’, or run-through, of this algorithm.</p>
<p><em>Some of the ideas and notation in this tutorial come from the good videos by <a href="https://www.youtube.com/playlist?list=PL29C61214F2146796" title=" NN Videos">Ryan Harris</a></em></p>
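<p>Putting the five steps together, a minimal sketch of the whole loop could look like this. It trains a 2-2-1 network on the logical OR function, updating the weights after every sample. The starting weights, the learning rate, the threshold and the cap on the number of epochs are all arbitrary choices for this illustration, and whether a given threshold is reached depends on them.</p>

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, thetas):
    """Outputs of one layer; weights[a][b] links input a to node b."""
    return [sigmoid(sum(inputs[a] * weights[a][b] for a in range(len(inputs)))
                    + thetas[b])
            for b in range(len(thetas))]

def train(samples, w_ij, th_j, w_jk, th_k, eta=0.5, threshold=0.01, max_epochs=5000):
    """Back propagation with per-sample updates; returns the error per epoch."""
    history = []
    for _ in range(max_epochs):
        total = 0.0
        for o_i, t in samples:
            # Step 1: feed-forward.
            o_j = layer(o_i, w_ij, th_j)
            o_k = layer(o_j, w_jk, th_k)
            total += 0.5 * sum((o - tk) ** 2 for o, tk in zip(o_k, t))
            # Step 2: deltas for the output nodes.
            d_k = [o * (1 - o) * (o - tk) for o, tk in zip(o_k, t)]
            # Step 3: deltas for the hidden nodes (uses W_jk before its update).
            d_j = [o_j[j] * (1 - o_j[j])
                   * sum(d_k[k] * w_jk[j][k] for k in range(len(d_k)))
                   for j in range(len(o_j))]
            # Steps 4 and 5: compute the changes and apply them in place.
            for j in range(len(o_j)):
                for k in range(len(d_k)):
                    w_jk[j][k] -= eta * o_j[j] * d_k[k]
            for k in range(len(d_k)):
                th_k[k] -= eta * d_k[k]
            for i in range(len(o_i)):
                for j in range(len(d_j)):
                    w_ij[i][j] -= eta * o_i[i] * d_j[j]
            for j in range(len(d_j)):
                th_j[j] -= eta * d_j[j]
        history.append(total)
        if total < threshold:
            break  # error is below the set threshold
    return history

# Logical OR as a toy training set: (inputs, targets).
samples = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
           ([1.0, 0.0], [1.0]), ([1.0, 1.0], [1.0])]

# Hypothetical starting weights and biases.
w_ij = [[0.1, -0.2], [0.3, 0.4]]
th_j = [0.0, 0.0]
w_jk = [[0.5], [-0.5]]
th_k = [0.0]

history = train(samples, w_ij, th_j, w_jk, th_k)
print(len(history), history[0], history[-1])
```

<p>Printing the first and last entries of <code>history</code> shows how far the total error has fallen over the epochs that were run.</p>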