A starter guide to neural networks: Part 1

Gregory Bonaert (@gbonaert)

In this series of posts, you’ll learn what neural networks are, why they work and how you can use them in your projects. To understand how they work, we’ll use a neural network to simulate the XOR function (an absurdly over-engineered solution). Thankfully, applying fancy tech to trivial problems is fun.

Then we’ll create an optical character recognition (OCR) system to go more in depth (this is already a lot more advanced, but it’s still far below what a 4-year-old can do… that gushy thing in your skull is really powerful).

Why should I even care about neural networks?

First, you’ll look like a really smart person. Next time you’re with a tech-savvy friend, you can boast by mentioning offhandedly how simple they are, as if you created them on a sleep-deprived plane ride with a toddler kicking the seat behind you.

At the end of this series, you’ll have the right to wear this “I’m better than you” badge

The better reason, though, is that neural networks can simulate very complicated functions: the kind you can’t express through simple equations or rules. For example, it’s very difficult to create a function by hand that takes in an image, applies some human-made rules, and then outputs the number it represents.

You could try to do it by hand and waste 3 days with the only achievable result of wanting to shoot yourself in the head, or you could get a good database of images, feed them into the neural network and rejoice in your library-backed brilliance.

Neural nets are beautiful because they learn by themselves: you don’t have to teach them the rules. You feed them a lot of training examples, and the network adjusts itself to get more accurate results over time. Add a couple of validation checks for good measure, and you’re good to go.

Project 1: the XOR function

Let’s start our absurd project: teaching our neural network the XOR function. The XOR function is basically a binary “are they different” function: the output is 1 if the inputs are different, 0 if they’re equal.

The XOR function
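As a quick sanity check, here’s the function we’re asking the network to learn, written as ordinary Python (no neural network involved yet):

    def xor(a: int, b: int) -> int:
        # Output 1 if the two binary inputs differ, 0 if they're equal.
        return int(a != b)

    # The four possible cases:
    for a in (0, 1):
        for b in (0, 1):
            print(f"xor({a}, {b}) = {xor(a, b)}")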

Enough talk, what are neural networks?

Here’s the fundamental structure of a neural network:

A high-level view of the structure of a neural network

There are 3 main parts:

1. The input layer, which receives the inputs of the function you’re trying to simulate (in an OCR system, the input would be the pixels of the picture; for XOR, it’s two binary values)
2. The hidden layers, where the magic happens (note: there may be more than 1 hidden layer)
3. The output layer, whose values are the prediction of the neural network
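To make that structure concrete, here’s a minimal sketch in Python. The layer sizes (2 inputs, 3 hidden nodes, 1 output) match the example network pictured below; the random initialization is just an assumption for illustration, not part of the worked example:

    import numpy as np

    rng = np.random.default_rng(0)

    # One weight matrix per pair of consecutive layers:
    # 2 inputs -> 3 hidden nodes -> 1 output node.
    layer_sizes = [2, 3, 1]
    weights = [rng.uniform(size=(n_in, n_out))
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

    for i, w in enumerate(weights):
        print(f"weights from layer {i} to layer {i + 1}: shape {w.shape}")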

The simple rules of a neural network

A neural net is made of layers. We compute the nodes’ values layer by layer, until we reach the output layer. But how do we compute the value of a node N in a layer? We need to introduce a couple of concepts to understand it:

1. Weights

The value of a node depends on every node of the previous layer. However, some nodes in the previous layer might be more important than others. Some might have a positive effect, while others have a negative one. So how do we capture this idea in maths?

Neural net with some weights (taken from this amazing post by Steven Miller)

The idea is simple: we associate a weight with each connection between node N and the nodes in the previous layer. Let’s look at the first node of the hidden layer. We’ll call that node N, and the two input nodes A and B.

The weight between nodes A and N is 0.8, and between nodes B and N it’s 0.2. To find the effect of A and B on the net input of N, we multiply each node’s value by the weight of its connection to node N.

  • Node A:  (value of A) x (weight between A and N) = 1 x 0.8 = 0.8
  • Node B:  (value of B) x (weight between B and N) = 1 x 0.2  = 0.2
  • Net input of N: 0.8 + 0.2 = 1.0
The net inputs of the hidden layer (taken again from this amazing post by Steven Miller)
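In code, the net input of N is just a weighted sum over the previous layer. A minimal sketch using the numbers above:

    inputs = [1, 1]            # values of nodes A and B
    weights_to_N = [0.8, 0.2]  # weights of the connections A->N and B->N

    net_input_N = sum(value * weight
                      for value, weight in zip(inputs, weights_to_N))
    print(net_input_N)  # 0.8 + 0.2 = 1.0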

If that were all we did, this would be an absurdly complicated way to compute linear combinations. Unfortunately, linear combinations aren’t enough to do OCR (or more generally, to simulate any non-linear function). We need something more powerful.

2. Activation functions

To simulate more complicated functions, we need to add non-linearity. We have to introduce something new: activation functions.

A very common one is the sigmoid function, \( a(x) = \frac{1}{1 + e^{-x}} \).

The sigmoid function

It has some really nice properties:

  • The output is between 0 and 1 (so we don’t get absurdly high values)
  • Low inputs give outputs very close to 0, and high inputs give outputs very close to 1 (it acts like a continuous binary switch)
  • It’s differentiable everywhere (you’ll see later why we need this)

After getting the net input of a node, we apply our activation function, getting the value of the node. Let’s apply this to our example.

The values of the hidden layer (credit: this amazing post by Steven Miller)
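In code, that’s one extra function applied to the net input. For our first hidden node (net input 1.0), we get the 0.73 that shows up again later:

    import math

    def sigmoid(x: float) -> float:
        # Squashes any real number into the range (0, 1).
        return 1 / (1 + math.exp(-x))

    value_of_N = sigmoid(1.0)
    print(round(value_of_N, 2))  # 0.73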

Repeating this process for the last layer, we get the following result:

The prediction of our neural network (credit: this amazing post by Steven Miller)
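Putting the two rules together, the whole forward pass fits in a few lines of numpy. Only the first hidden node’s weights (0.8 and 0.2) and the output weights (0.3, 0.5, 0.9) appear explicitly in this post; the remaining hidden weights below are borrowed from Steven Miller’s example, so treat them as an assumption:

    import numpy as np

    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    x = np.array([1.0, 1.0])              # the two inputs

    # Weights into the 3 hidden nodes (one column per hidden node).
    w_hidden = np.array([[0.8, 0.4, 0.3],
                         [0.2, 0.9, 0.5]])
    w_output = np.array([0.3, 0.5, 0.9])  # weights into the output node

    hidden = sigmoid(x @ w_hidden)        # approx. [0.73, 0.79, 0.69]
    prediction = sigmoid(hidden @ w_output)
    print(round(float(prediction), 2))    # 0.77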

How do we improve the network?

To improve the network, we can either change the structure of the network or the weights.

The structure is usually chosen at the start by humans, after some divine inspiration (common) or after trying lots of possibilities randomly and seeing which one works best. Generally, once the structure is set, it doesn’t change anymore.

Instead, neural networks learn by changing their weights with every training example until the predictions become accurate. But how do we know how to change the weights?

Note: brace yourselves for some maths. It isn’t hard (it’s just the chain rule, basically), but it may take a bit to grok.

Gradient descent
The gradient allows us to find the minimum

The key idea is that we can define an error function, which measures how good (or bad) our prediction is. Obviously, we’re trying to minimize this function, so we look at how changing a weight impacts that error function. Yes, that means good old calculus and derivatives/gradients.

If increasing a certain weight increases the error, then we do the opposite: we decrease the weight. It boils down to “let’s see how I should change this weight to reduce the error”. This is the fundamental idea behind backpropagation, the most commonly used technique to improve neural networks.
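Before applying this to the network, here’s the bare idea on a toy problem, a minimal sketch that assumes a simple one-weight quadratic error instead of the network’s actual error function:

    def error_gradient(w):
        # Derivative of the toy error (w - 3)^2, which is minimized at w = 3.
        return 2 * (w - 3)

    w = 0.0                  # arbitrary starting weight
    learning_rate = 0.1
    for _ in range(50):
        # Move against the gradient: downhill on the error surface.
        w -= learning_rate * error_gradient(w)

    print(round(w, 4))       # very close to 3.0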

First, let’s define our cost function:

$$ E = \frac{1}{2}(Actual - Predicted)^2 $$

We put the \(\frac{1}{2}\) there because when we differentiate later on, it will cancel out the 2 from the exponent.

We need to see how changing weights impacts that function. The wonderful chain rule makes it surprisingly easy to figure out.

Let’s start with the weights of the last layer (currently \(0.3, 0.5\) and \(0.9\)), because it’s simpler than the hidden layers. Since I’m lazy, I’ll only do it for the top weight (we’ll call it \(w_1\)):

$$\frac{\partial E}{\partial w_1} = \frac{\partial E}{\partial output} * \frac{\partial output}{\partial input_{net}} * \frac{\partial input_{net}}{\partial w_1}$$

Here,

* \(input_{net}\): the net input of the node in the last layer
* \(output\): the value of the node (the result of applying the activation function to the net input).

Let’s then figure out the value of each of the three terms in our example.

Component 1

$$ \frac{\partial E}{\partial output} = \frac{\partial}{\partial output}\left(\frac{1}{2}(Actual - output)^2\right) = -(Actual - output) = -(0 - 0.77) = 0.77 $$

You can see that the \(\frac{1}{2}\) term cancelled the \(2\) from differentiation, making everything simpler.

Component 2

Then, we have to compute $$ \frac{\partial output}{\partial input_{net}} $$ Since \(output = a(input_{net})\), where \(a\) is our activation function, \(\frac{\partial output}{\partial input_{net}}\) is actually really simple: it’s the derivative of the activation function evaluated at \(input_{net}\).

The cool thing about the sigmoid function is that its derivative is really simple:
$$ a'(x) = a(x) * (1 - a(x)) = output * (1 - output) $$

In our example,

$$ \frac{\partial output}{\partial input_{net}} = a'(input_{net}) = output * (1 - output) = 0.77 * (1 - 0.77) = 0.17 $$
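You can sanity-check that identity numerically by comparing it against a finite-difference estimate of the derivative. The net input of 1.23 below is an assumption (roughly the output node’s net input given the forward-pass numbers above):

    import math

    def sigmoid(x):
        return 1 / (1 + math.exp(-x))

    x = 1.23   # assumed net input of the output node
    h = 1e-6   # small step for the finite-difference estimate

    analytic = sigmoid(x) * (1 - sigmoid(x))
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)

    print(round(analytic, 4), round(numeric, 4))  # both ~0.175, i.e. the 0.17 above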

Component 3

Last, we need to figure out the value of $$ \frac{\partial input_{net}}{\partial w_1} $$ Again, this is surprisingly easy if we think about it for a minute.

What is \(input_{net}\)? It’s the sum of \(weight * input\) over every node in the previous layer. For the output node, then:

$$ input_{net} = w_1 * input_1 + w_2 * input_2 + w_3 * input_3$$

Therefore, the derivative for a certain weight is simply the associated input. In other words,

$$ \frac{\partial input_{net}}{\partial w_1} = input_1 = 0.73 $$


Bringing it together…

we then have that

$$\frac{\partial E}{\partial w_1} = \frac{\partial E}{\partial output} * \frac{\partial output}{\partial input_{net}} * \frac{\partial input_{net}}{\partial w_1} = 0.77 * 0.17 * 0.73 = 0.095$$

We can then see that if we increase the weight, the error will increase. Obviously, we have to do the opposite: decrease the weight.

But by how much? In backpropagation, we multiply that derivative by a learning rate, and then subtract the result from our weight (some algorithms think this is a stupid idea and do something fancier). In other words,

$$ weight_{new} = weight_{old}\, -\, \alpha * \frac{\partial E}{\partial w_1}$$

where \(\alpha\) is the learning rate. Let’s take \(\alpha = 0.2\) as our learning rate and apply it on the weight. Then,

$$w_{1, new} = w_{1, old}\, -\, \alpha * \frac{\partial E}{\partial w_1} = 0.3\, -\, 0.2 * 0.095 = 0.281$$
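Here’s the whole update for \(w_1\) in code, using the article’s rounded intermediate values (the small differences from 0.095 come purely from rounding):

    actual = 0.0       # XOR(1, 1) should be 0
    predicted = 0.77   # what our network currently outputs

    dE_doutput = -(actual - predicted)  # component 1: 0.77
    doutput_dnet = 0.17                 # component 2: output * (1 - output), rounded
    dnet_dw1 = 0.73                     # component 3: the input flowing through w1

    gradient = dE_doutput * doutput_dnet * dnet_dw1
    print(round(gradient, 4))           # 0.0956, i.e. the 0.095 above

    learning_rate = 0.2
    w1_new = 0.3 - learning_rate * gradient
    print(round(w1_new, 3))             # 0.281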

This might seem like a small change, but it actually isn’t, since neural networks need to train over thousands of examples to get good results. If each example made the weights bounce all over the place, then you’d have a hard time converging them to the optimum. Choosing a good learning rate (and a good structure for the network) is called hyperparameter tuning and is a magic art by itself.

One little problem…

We were able to figure out how to improve the weights of the last layer. What about the weights of previous layers? Let’s look at our fundamental equation again:

$$\frac{\partial E}{\partial w_{hidden}} = \frac{\partial E}{\partial output_{hidden}} * \frac{\partial output_{hidden}}{\partial input_{net}} * \frac{\partial input_{net}}{\partial w_{hidden}}$$

The 2nd and 3rd terms stay the same and are still easy to compute. Unfortunately, the first term becomes difficult to compute! Imagine we’re trying to figure out

$$ \frac{\partial E}{\partial output_{hidden}} $$

for a hidden node’s output that’s 20 layers away from the output layer. How could we figure it out?
The answer to that question also reveals why the algorithm is called backpropagation… if you want to figure it out, check out the next post.
