Machine Learning - Recurrent Neural Networks Maths

Published 7 December 2015 by Paul Tero of Existor Ltd

This article is a follow-up to a previous article about feed forward neural networks. It looks at the logical and mathematical differences between recurrent neural networks and feed forward neural networks. The maths equations are matched to lines of code in a Python implementation of a minimal character level language model (as referenced in this article about the "unreasonable effectiveness" of RNNs) and to a basic C version as well.

Extra hidden layer connections

The main difference between a feed forward neural network and an RNN is that in an RNN the hidden layer is additionally connected to itself. The graph below includes nine extra little black circles and arcs, which indicate the connections from each hidden node back to itself and to the other hidden nodes. Some of these connections are doubled up, because the connections have different weights in different directions. In other words, using the terminology of the previous article, \(w^{hh}_{12} \ne w^{hh}_{21}\).

RNNs can learn sequences

The impact of those little extra lines is significant. In feed forward networks, the output depends only on the inputs and the weights. In RNNs, the output also depends on the previous inputs (via those extra connections). It means that RNNs can learn about sequences in data.

For example, the C code for the previous article implemented a neural network which accepted the letter A, B or C as input. It learned to output the next letter in the alphabet: B, C or A respectively. The inputs and outputs were encoded in 1-of-k encoding (also known as one-hot encoding) as A=100, B=010, C=001. So it learned 100=>010, 010=>001, 001=>100.

How about if we wanted it to learn something about sequences instead? Instead of treating the outputs as letters, we could treat them as numbers, and ask the network to tell us how many times in a row it has seen the current letter. So if we send in the inputs "ABCC" in sequence, we should get the outputs "1112", because it saw the A once, the B once, the C once, then the C again (twice).
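
As a concrete illustration (my own minimal sketch in Python/numpy, not code from either implementation), here is how that training sequence could be encoded with 1-of-k vectors; the variable names are just for illustration:

import numpy as np

letters = "ABC"                                                   # input vocabulary
counts = "123"                                                    # output vocabulary (times seen in a row)
inputs = "ABCC"                                                   # the example input sequence
targets = "1112"                                                  # how many times each letter has just been seen
xs = [np.eye(len(letters))[letters.index(c)] for c in inputs]     # A=100, B=010, C=001
ys = [np.eye(len(counts))[counts.index(c)] for c in targets]      # 1=100, 2=010, 3=001
print(xs[3], ys[3])                                               # the second C: input 001, target 010 (meaning "2")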

Our feed forward network can never learn this, no matter how many neurons or hidden layers it has, and no matter how many training examples it sees. It can only learn a relationship between the current input and output. We need the extra connections to teach it about sequences in the data. That's what an RNN can do.

This is a simple example, but it can be easily extended once the code is in place. In particular, it can learn the sequences in letters and words. Instead of giving it "ABCC" and "1112", we could give it "particularl" and "articularly". The first time it runs, this network would try to learn that an input of the letter "p" (which could be represented in 1-of-k as 00000000000000010000000000) should give an output of "a". Next it will learn that "p" followed by "a" ought to produce an "r". And then "p", "a", "r" should output a "t". If we trained it on this single word thousands of times, it would learn the sequences in that word. Then we could just give it "p" and it would spell the rest of the word all by itself. Things would get particularly interesting with the "r"s and "l"s, because the first "r" is followed by a "t" but the second one by an "l". The RNN needs the extra connections to be able to differentiate between the two "r"s and learn the sequence properly.
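
In code, that amounts to pairing each character with the one that follows it, exactly as a character level language model does. Here is a hedged sketch of my own (like the Python code, it builds its vocabulary from the data rather than the full alphabet; the names are illustrative only):

word = "particularly"
vocab = sorted(set(word))                        # unique characters: ['a','c','i','l','p','r','t','u','y']
char_to_ix = {c: i for i, c in enumerate(vocab)} # character to 1-of-k index
inputs  = [char_to_ix[c] for c in word[:-1]]     # "particularl" as indices
targets = [char_to_ix[c] for c in word[1:]]      # "articularly" as indices
print(list(zip(word[:-1], word[1:])))            # [('p','a'), ('a','r'), ('r','t'), ...]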

Learning a single word is not particularly useful, but this could be extended to all the words in a language, such as this RNN which speaks in the style of Shakespeare.

Unrolling an RNN

The maths behind feed forward neural networks is already pretty intimidating. It seems like those circular connections will surely make it much more complicated. But they don't really, because of a neat little trick.

If we only look at sequences of a fixed length, then we can unroll or unfold the RNN. For example, if we arbitrarily say that we are only interested in sequences up to 3 long, then we can unroll our RNN into three feed forward networks. Each of the unrolled networks has the same architecture as the original RNN, with an input, hidden and output layer. But there are now extra connections between the hidden layers of consecutive unrolled networks. For instance, when training with the input "ABC", the first unrolled network receives the A=>100, the second one the B=>010 and the third one the C=>001.

[Figure: the RNN unrolled into three feed forward networks, the first receiving the input 100 (A), the second 010 (B) and the third 001 (C)]

Feed forward stage

As with traditional neural networks, RNNs have a feed forward stage and a backwards propagation stage. With RNNs we do feed forward on each unrolled network separately. And we do all the feed forwards first, before thinking about back propagation.

So the first thing our RNN receives is the letter A=>100. It takes the inputs and computes the weighted sums. There is a weighted sum for each of the 3 hidden nodes (corresponding to j=1,2,3). Each weighted sum adds up values from the \(n_x=3\) input nodes plus the \(n_h=3\) hidden nodes. The weights between the input and hidden nodes are given by \(w^{xh}\) and the weights from hidden-to-hidden by \(w^{hh}\). At the beginning all those hidden nodes are set to zeros (so the second part of the equation below is actually 0). The sum looks like this:

\(z_{hj} = \sum_{i=1}^{n_x} x_i w^{xh}_{ij} + \sum_{i=1}^{n_h} h_i w^{hh}_{ij}\)

From the weighted sums we compute the activation values as before:

\(h_j = \tanh\ (z_{hj}) = \tanh\ (\sum_{i=1}^{n_x} x_i w^{xh}_{ij} + \sum_{i=1}^{n_h} h_i w^{hh}_{ij}) \)

And then the weighted sums for the output layer:

\(z_{yj} = \sum_{i=1}^{n_h} h_i w^{hy}_{ij} \)

Then we compute the probable answer using softmax over all of the weighted output sums. This uses the softmax function shown here in all its gory mathematical detail:

\(p_j = \text{softmax}\ (z_{yj}) = \frac {e^{z_{yj}}} {\sum_{k=1}^{n_y} {e^{z_{yk}}}} \)

And finally we calculate the loss by adding up those probabilities according to the cross-entropy log loss formula for more than 2 outputs:

\(J = - \sum_{j=1}^{n_y} y_j \log p_j \)
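
To tie those five equations together, here is a numpy sketch of my own of a single feed forward step, with made-up sizes and small random weights (the hidden state starts at zero, as described above):

import numpy as np
np.random.seed(0)
nx, nh, ny = 3, 3, 3                              # input, hidden and output sizes
Wxh = np.random.randn(nh, nx) * 0.01              # input-to-hidden weights
Whh = np.random.randn(nh, nh) * 0.01              # hidden-to-hidden weights
Why = np.random.randn(ny, nh) * 0.01              # hidden-to-output weights
x = np.array([1.0, 0.0, 0.0])                     # the letter A as 1-of-k
h_prev = np.zeros(nh)                             # hidden state, all zeros at the beginning
y = np.array([0.0, 1.0, 0.0])                     # expected output (the letter B)
zh = Wxh @ x + Whh @ h_prev                       # weighted sums for the hidden layer
h = np.tanh(zh)                                   # hidden activation values
zy = Why @ h                                      # weighted sums for the output layer
p = np.exp(zy) / np.sum(np.exp(zy))               # softmax probabilities
J = -np.sum(y * np.log(p))                        # cross-entropy loss
print(p, J)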

Feed forward on the rest of the inputs

We have now done feed forward on the first unrolled network. In a traditional neural network, we would now continue to the back propagation stage. But not here. Instead we will go back and do it all again for the second unrolled network, using the second input B=>010. But now we must introduce another letter into the equations to denote the time step. It seems logical to use the letter t. We'll label everything in the first unrolled network with t=1, everything in the second with t=2 and so on. With labels the whole shebang now looks like this:

[Figure: the three unrolled networks with every node labelled by its time step, from \(x^{t=1}_1\), \(h^{t=1}_1\) and \(y^{t=1}_1\) through to \(x^{t=3}_3\), \(h^{t=3}_3\) and \(y^{t=3}_3\)]

Mathematically our weighted sum in the second time step t=2 will refer to our second set of inputs (also t=2) but using the hidden node values from the old days when t=1:

\(z^{t=2}_{hj} = \sum_{i=1}^{n_x} x^{t=2}_i w^{xh}_{ij} + \sum_{i=1}^{n_h} h^{t=1}_i w^{hh}_{ij}\)

That t=1 on \(h^{t=1}\) is very important. That is the recurrent step that allows sequences to be remembered. And now we will compute the hidden activation values in this second time step, and only they will get the privilege of being t=2:

\(h^{t=2}_j = \tanh\ (z^{t=2}_{hj}) \)

The corresponding line of Python code for the previous two equations in a minimal RNN implementation is given below. np is the numpy object for performing matrix and math operations. hs is the state of the hidden nodes at different time steps, t for the current time step, t-1 for the previous one. xs[t] is the input at the current time step. Note the code also deals with the bias in variables bh and by but I am ignoring them here for clarity (and for some reason the Python code has no input bias bx):

hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]))  # hidden state

And here are the corresponding lines of C code from a C implementation I have written. This code does include the bias. t is the time step and numinputswb is the number of inputs with the bias. The inputs are stored in one long 1-dimensional array, rather than a 2-dimensional array, because then the memory can be allocated in a single block. The 1-dimensional array is divided into chunks numinputswb long, and &inputs[t * numinputswb] is a pointer to the inputs for time step t. The function MatrixMultiply is included in the code for multiplying two matrices. It does the same thing as Python's np.dot. It uses the fast CBLAS linear algebra library where available. The numhidden results are stored in &hidden[t * numhiddenwb + 1], which is a pointer to the hidden sums for time step t, with the +1 to avoid overwriting the bias. The second line adds in the contribution from the previous time step, and the third line applies the tanh, turning each sum (except the 0th bias node) into an activation value:

MatrixMultiply (&inputs[t * numinputswb], Wxh, 1, numinputswb, numhidden, 0, &hidden[t * numhiddenwb + 1]);
if (t>0) MatrixMultiply (&hidden[(t-1) * numhiddenwb], Whh, 1, numhiddenwb, numhidden, 1, &hidden[t * numhiddenwb + 1]);
for (h=1; h<numhiddenwb; h++) hidden[t * numhiddenwb + h] = tanh (hidden[t * numhiddenwb + h]);

The rest of the network follows as before, but with the t=2. In the Python code the ys variable stores the weighted sums. In C the probabilities are stored in the probs array:

\(z^{t=2}_{yj} = \sum_{i=1}^{n_h} h^{t=2}_i w^{hy}_{ij} \)

ys[t] = np.dot(Why, hs[t]) + by # unnormalized log probabilities for next chars
MatrixMultiply (&hidden[t * numhiddenwb], Why, 1, numhiddenwb, numoutputs, 0, &probs[t * numoutputs]);

The probabilities are computed using softmax. These are stored in the ps variable in the Python code, and kept in the same probs array in C:

\(p^{t=2}_j = \text{softmax}\ (z^{t=2}_{yj}) \)

ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars
for (sum=0, y=0; y<numoutputs; y++) {probs[t * numoutputs + y] = exp (probs[t * numoutputs + y]); sum += probs[t * numoutputs + y];}
if (sum>0) for (y=0; y<numoutputs; y++) probs[t * numoutputs + y] /= sum;

Then we compute the loss for this time step. It is the same as the cross-entropy loss function described for the feed-forward network for more than two outputs. Since only one of the y values is 1 and the rest are 0, the Python only adds in the single y=1 value (for targets[t]). Likewise in the C the for loop isn't really necessary, but I want to make the C code as close to the maths as possible.

\(J^{t=2} = - \sum_{j=1}^{n_y} y^{t=2}_j \log p^{t=2}_j \)

loss += -np.log(ps[t][targets[t],0])
for (y=0; y<numoutputs; y++) loss -= outputs[t * numoutputs + y] * log (probs[t * numoutputs + y]);

We continue like this for each input. In our case, we do it three times for t=1,2,3. At each time step we get an output p which we compare to the expected output y to see how well we are doing. We add up the total loss from each step, and we remember the values of the hidden nodes, output nodes and probabilities, because we will use them next in back propagation. This is implemented by a loop around the above steps:

for t in xrange(len(inputs)):
for (t=0; t<numtimesteps; t++)

Backwards propagation

So let's review what we have so far. It's like we have trained 3 separate neural networks, one for each time step. Now we need to go back and figure out how to change all the weights so it can do better next time.

To do backprop we need to figure out how the loss J depends on the various weights. This involves computing all the partial derivatives. The derivatives for the last unrolled network t=3 are pretty much the same as for the feed forward network. This equation describes how the loss J changes in relation to each of the hidden-to-output weights in the last unrolled network:

\(\frac {\partial J^{t=3}} {\partial w^{hy}_{ij}} = (p^{t=3}_j - y^{t=3}_j) h^{t=3}_i = h^{t=3}_i e^{t=3}_{yj} \)

In the Python code, the result of this equation is stored in the variable dWhy. The first and second lines implement dy = p - y, because the single y value which equals 1 is subtracted from p. So dy actually represents e in the sums above. This is multiplied by the hidden activation values hs:

dy = np.copy(ps[t])
dy[targets[t]] -= 1 # backprop into y
dWhy += np.dot(dy, hs[t].T)

In the C code, p - y is stored in the variable outputerror and the addition to dWhy is calculated by the MatrixMultiply function:

for (y=0; y<numoutputs; y++) outputerror[t * numoutputs + y] = probs[t * numoutputs + y] - outputs[t * numoutputs + y];
MatrixMultiply (&hidden[t * numhiddenwb], &outputerror[t * numoutputs], numhiddenwb, 1, numoutputs, 1, dWhy);

Now we need to figure out how J depends on the input-to-hidden weights. Let's take just one of those input-to-hidden weights to begin with, \(w^{xh}_{11}\), shown in purple in the picture below. That weight influences the loss J in three different ways, through the three different output nodes. I've added J to the picture as if it were another output layer:

[Figure: the network with \(w^{xh}_{11}\) highlighted in purple and the loss J drawn as if it were another output layer]

Using the chain rule for calculating partial derivatives, the influence of that first hidden weight is a sum of three different paths:

\(\frac {\partial J} {\partial w^{xh}_{11}} = \frac {\partial J} {\partial p_1} \frac {\partial p_1} {\partial z_{y1}} \frac {\partial z_{y1}} {\partial h_1} \frac {\partial h_1} {\partial z_{h1}} \frac {\partial z_{h1}} {\partial w^{xh}_{11}} + \frac {\partial J} {\partial p_2} \frac {\partial p_2} {\partial z_{y2}} \frac {\partial z_{y2}} {\partial h_1} \frac {\partial h_1} {\partial z_{h1}} \frac {\partial z_{h1}} {\partial w^{xh}_{11}} + \frac {\partial J} {\partial p_3} \frac {\partial p_3} {\partial z_{y3}} \frac {\partial z_{y3}} {\partial h_1} \frac {\partial h_1} {\partial z_{h1}} \frac {\partial z_{h1}} {\partial w^{xh}_{11}} \)

This can all be wrapped up in a sum and applied to any of the input-to-hidden weights with:

\(\frac {\partial J} {\partial w^{xh}_{mi}} = \sum_{j=1}^{n_y} (\frac {\partial J} {\partial p_j} \frac {\partial p_j} {\partial z_{yj}} \frac {\partial z_{yj}} {\partial h_i}) \frac {\partial h_i} {\partial z_{hi}} \frac {\partial z_{hi}} {\partial w^{xh}_{mi}} \)

The first two terms in this sum are just \(e_{yj}\), as computed for the hidden-to-output weights above. The third term is how the weighted sum \(z_{yj}\) depends on the hidden node \(h_i\) it is attached to. This is just the weight which connects them:

\(\frac {\partial J} {\partial w^{xh}_{mi}} = \sum_{j=1}^{n_y} (e_{yj} w^{hy}_{ij} ) \frac {\partial h_i} {\partial z_{hi}} \frac {\partial z_{hi}} {\partial w^{xh}_{mi}} \)

That whole sum at the beginning can be written more compactly as the backward propagated hidden node error:

\(e_{hi} = \sum_{j=1}^{n_y} (e_{yj} w^{hy}_{ij} ) \)

In the Python code, this sum is stored in the variable dh. The code also includes the variable dhnext which initially equals zero and will be explained later. In C, hiddenerror takes the place of both dh and dhnext:

dh = np.dot(Why.T, dy) + dhnext # backprop into h
MatrixMultiply (Why, &outputerror[t * numoutputs], numhiddenwb, numoutputs, 1, 1, hiddenerror);

The fourth term in the sum above is the partial derivative of the tanh operation, which is \(1 - \tanh^2 z\). Since \(h = \tanh z\) this can be made more succinct:

\( \frac {\partial h_i} {\partial z_{hi}} = 1 - \tanh^2 (z_{hi}) = 1 - h^2_i \)
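
A quick numerical check of that derivative (a sketch of my own, not from either implementation):

import numpy as np
z = 0.7
h = np.tanh(z)
numerical = (np.tanh(z + 1e-6) - np.tanh(z - 1e-6)) / 2e-6   # finite difference approximation
analytic = 1 - h * h                                         # 1 - tanh^2(z)
print(numerical, analytic)                                   # both about 0.6347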

The final term is just the weight that connects the input to the hidden node in question, so (with the time t=3) the whole sum boils down to:

\(\frac {\partial J^{t=3}} {\partial w^{xh}_{mi}} = e^{t=3}_{hi} (1 - (h^{t=3}_i)^2) x^{t=3}_m \)

The Python code stores the back propagated error \(e * (1 - h^2) \) in the variable dhraw. This is then multiplied by xs to give the partial derivatives dWxh. In C, hiddenerror (which holds e) is multiplied by \(1 - h^2\) and stored back into hiddenerror. This is then multiplied by the inputs array inputs (starting at index 1 in hiddenerror[1] to skip the bias node) to compute dWxh:

dhraw = (1 - hs[t] * hs[t]) * dh # backprop through tanh nonlinearity
dWxh += np.dot(dhraw, xs[t].T)
for (h=1; h<numhiddenwb; h++) hiddenerror[h] *= 1 - hidden[t * numhiddenwb + h] * hidden[t * numhiddenwb + h];
MatrixMultiply (&inputs[t * numinputswb], &hiddenerror[1], numinputswb, 1, numhidden, 1, dWxh);

That is still exactly the same as going further back in a feed forward network. But now there is a new bit, which is how J changes in relation to the hidden-to-hidden weights. This is worked out in exactly the same way by starting with a long sum:

\(\frac {\partial J^{t=3}} {\partial w^{hh}_{mi}} = \sum_{j=1}^{n_y} (\frac {\partial J^{t=3}} {\partial p^{t=3}_j} \frac {\partial p^{t=3}_j} {\partial z^{t=3}_{yj}} \frac {\partial z^{t=3}_{yj}} {\partial h^{t=3}_i}) \frac {\partial h^{t=3}_i} {\partial z^{t=3}_{hi}} \frac {\partial z^{t=3}_{hi}} {\partial w^{hh}_{mi}} \)

Using exactly the same process as above, this reduces to a dependence on the hidden node from the previous time step t=2:

\(\frac {\partial J^{t=3}} {\partial w^{hh}_{mi}} = e^{t=3}_{hi} (1 - (h^{t=3}_i)^2) h^{t=2}_m \)

The Python code has already stored \(e * (1 - h^2) \) in the variable dhraw. So this is multiplied by the hidden activation values from the previous time step t-1 to give dWhh:

dWhh += np.dot(dhraw, hs[t-1].T)
MatrixMultiply (&hidden[(t-1) * numhiddenwb], &hiddenerror[1], numhiddenwb, 1, numhidden, 1, dWhh);

Backwards propagation through time

That's not all though, because that only caters for the input-to-hidden and hidden-to-hidden weights which connect directly to the third unrolled network. What about the input-to-hidden and hidden-to-hidden weights connected to the second network? In other words, we have only dealt with the \(\partial w^{xh}_{mi}\) connected to \(h^{t=3}_i\). We also need to deal with \(\partial w^{xh}_{mi}\) connected to \(h^{t=2}_i\). That is complicated because the input and hidden nodes in the second network are connected to all the hidden nodes in the third network. And they are all connected to all the output nodes. So in fact there are 3x3=9 distinct paths back to that weight.

Let's try to compute how J changes with respect to the very first input weight \(w^{xh}_{11}\) in the second network. This is highlighted in purple in the image below. And let's follow the first of nine paths to the end of the network, from \(h^{t=2}_1\) to \(h^{t=3}_1\) and then \(y^{t=3}_1\). The path is highlighted in solid blue. Also included is the loss J as an imaginary node on the right:

So note in the equation below that we're only considering \(w^{xh}_{11}\) and only via one of the 9 possible paths so there are no sums (we'll add them back in later). This equation has 7 terms as opposed to 5 for the input-to-hidden weights. In fact, for every step back we add two more terms (one for how z changes with respect to h from the previous time step and the other for how h changes with respect to its weighted sum z):

\(\frac {\partial J^{t=3}} {\partial w^{xh}_{11}} = \frac {\partial J^{t=3}} {\partial p^{t=3}_1} \frac {\partial p^{t=3}_1} {\partial z^{t=3}_{y1}} \frac {\partial z^{t=3}_{y1}} {\partial h^{t=3}_1} \frac {\partial h^{t=3}_1} {\partial z^{t=3}_{h1}} \frac {\partial z^{t=3}_{h1}} {\partial h^{t=2}_1} \frac {\partial h^{t=2}_1} {\partial z^{t=2}_{h1}} \frac {\partial z^{t=2}_{h1}} {\partial w^{xh}_{11}} \)

Now we consider the second of nine paths which starts and ends in the same place but goes through the second hidden node in the third unrolled network, from \(h^{t=2}_1\) to \(h^{t=3}_2\) and finishing again at \(y^{t=3}_1\). This path is indicated with dashed blue lines:

\(\frac {\partial J^{t=3}} {\partial w^{xh}_{11}} += \frac {\partial J^{t=3}} {\partial p^{t=3}_1} \frac {\partial p^{t=3}_1} {\partial z^{t=3}_{y1}} \frac {\partial z^{t=3}_{y1}} {\partial h^{t=3}_2} \frac {\partial h^{t=3}_2} {\partial z^{t=3}_{h2}} \frac {\partial z^{t=3}_{h2}} {\partial h^{t=2}_1} \frac {\partial h^{t=2}_1} {\partial z^{t=2}_{h1}} \frac {\partial z^{t=2}_{h1}} {\partial w^{xh}_{11}} \)

We can do this for each of the 9 paths. We'll end up with a sum (this is all still just for the first weight):

\(\frac {\partial J^{t=3}} {\partial w^{xh}_{11}} = \sum_{j=1}^{n_y} \sum_{k=1}^{n_h} (\frac {\partial J^{t=3}} {\partial p^{t=3}_j} \frac {\partial p^{t=3}_j} {\partial z^{t=3}_{yj}} \frac {\partial z^{t=3}_{yj}} {\partial h^{t=3}_k} \frac {\partial h^{t=3}_k} {\partial z^{t=3}_{hk}} \frac {\partial z^{t=3}_{hk}} {\partial h^{t=2}_1} \frac {\partial h^{t=2}_1} {\partial z^{t=2}_{h1}} \frac {\partial z^{t=2}_{h1}} {\partial w^{xh}_{11}}) \)

And we need to do this for all weights so we replace the 11 with the indices mi as above:

\(\frac {\partial J^{t=3}} {\partial w^{xh}_{mi}} = \sum_{j=1}^{n_y} \sum_{k=1}^{n_h} (\frac {\partial J^{t=3}} {\partial p^{t=3}_j} \frac {\partial p^{t=3}_j} {\partial z^{t=3}_{yj}} \frac {\partial z^{t=3}_{yj}} {\partial h^{t=3}_k} \frac {\partial h^{t=3}_k} {\partial z^{t=3}_{hk}} \frac {\partial z^{t=3}_{hk}} {\partial h^{t=2}_i} \frac {\partial h^{t=2}_i} {\partial z^{t=2}_{hi}} \frac {\partial z^{t=2}_{hi}} {\partial w^{xh}_{mi}}) \)

We can reorganise this sum:

\(\frac {\partial J^{t=3}} {\partial w^{xh}_{mi}} = \sum_{k=1}^{n_h} ( \sum_{j=1}^{n_y} (\frac {\partial J^{t=3}} {\partial p^{t=3}_j} \frac {\partial p^{t=3}_j} {\partial z^{t=3}_{yj}} \frac {\partial z^{t=3}_{yj}} {\partial h^{t=3}_k}) \frac {\partial h^{t=3}_k} {\partial z^{t=3}_{hk}} \frac {\partial z^{t=3}_{hk}} {\partial h^{t=2}_i} ) \frac {\partial h^{t=2}_i} {\partial z^{t=2}_{hi}} \frac {\partial z^{t=2}_{hi}} {\partial w^{xh}_{mi}} \)

The first thing to observe is that we've already dealt with the first three terms. They are the error back propagated to the hidden nodes:

\(\frac {\partial J^{t=3}} {\partial w^{xh}_{mi}} = \sum_{k=1}^{n_h} (e^{t=3}_{hk} \frac {\partial h^{t=3}_k} {\partial z^{t=3}_{hk}} \frac {\partial z^{t=3}_{hk}} {\partial h^{t=2}_i} ) \frac {\partial h^{t=2}_i} {\partial z^{t=2}_{hi}} \frac {\partial z^{t=2}_{hi}} {\partial w^{xh}_{mi}} \)

The next two terms are calculated as:

\(\frac {\partial J^{t=3}} {\partial w^{xh}_{mi}} = \sum_{k=1}^{n_h} (e^{t=3}_{hk} (1 - (h^{t=3}_k)^2) w^{hh}_{ik}) \frac {\partial h^{t=2}_i} {\partial z^{t=2}_{hi}} \frac {\partial z^{t=2}_{hi}} {\partial w^{xh}_{mi}} \)

We can consider that sum as the error back propagated to the second network's hidden nodes:

\(e^{t=2}_{hi} = \sum_{k=1}^{n_h} (e^{t=3}_{hk} (1 - (h^{t=3}_k)^2) w^{hh}_{ik}) \)

Before we address the last two terms, let's look at what we have so far. The loss J has three separate influences from the input-to-hidden weights, from each of the unrolled networks above. So far we've only considered the first two of those. Combining all three, it looks like this (where all the e terms are actually exactly the same just with different summation indices on them):

\(\frac {\partial J^{t=3}} {\partial w^{xh}_{mi}} = e^{t=3}_{hi} \frac {\partial h^{t=3}_i} {\partial z^{t=3}_{hi}} \frac {\partial z^{t=3}_{hi}} {\partial w^{xh}_{mi}} + \sum_{k=1}^{n_h} ( e^{t=3}_{hk} \frac {\partial h^{t=3}_k} {\partial z^{t=3}_{hk}} \frac {\partial z^{t=3}_{hk}} {\partial h^{t=2}_i} ) \frac {\partial h^{t=2}_i} {\partial z^{t=2}_{hi}} \frac {\partial z^{t=2}_{hi}} {\partial w^{xh}_{mi}} + \sum_{l=1}^{n_h} (\sum_{k=1}^{n_h} ( e^{t=3}_{hk} \frac {\partial h^{t=3}_k} {\partial z^{t=3}_{hk}} \frac {\partial z^{t=3}_{hk}} {\partial h^{t=2}_l} ) \frac {\partial h^{t=2}_l} {\partial z^{t=2}_{hl}} \frac {\partial z^{t=2}_{hl}} {\partial h^{t=1}_i} ) \frac {\partial h^{t=1}_i} {\partial z^{t=1}_{hi}} \frac {\partial z^{t=1}_{hi}} {\partial w^{xh}_{mi}} \)

And we can simplify that to this. I've put in an extra marker such as \(t=2f3\) to indicate that these are errors computed for the inputs and outputs of the third unrolled network. You'll see why in the next step:

\(\frac {\partial J^{t=3}} {\partial w^{xh}_{mi}} = e^{t=3f3}_{hi} \frac {\partial h^{t=3}_i} {\partial z^{t=3}_{hi}} \frac {\partial z^{t=3}_{hi}} {\partial w^{xh}_{mi}} + e^{t=2f3}_{hi} \frac {\partial h^{t=2}_i} {\partial z^{t=2}_{hi}} \frac {\partial z^{t=2}_{hi}} {\partial w^{xh}_{mi}} + e^{t=1f3}_{hi} \frac {\partial h^{t=1}_i} {\partial z^{t=1}_{hi}} \frac {\partial z^{t=1}_{hi}} {\partial w^{xh}_{mi}} \)

All the e terms in that sum are computed based on the inputs and outputs to the third unrolled network. But backward propagation also goes in a loop, and we'll eventually compute the back propagated errors and changes to weights based on the second sets of inputs and outputs:

\(\frac {\partial J^{t=2}} {\partial w^{xh}_{mi}} = e^{t=2f2}_{hi} \frac {\partial h^{t=2}_i} {\partial z^{t=2}_{hi}} \frac {\partial z^{t=2}_{hi}} {\partial w^{xh}_{mi}} + e^{t=1f2}_{hi} \frac {\partial h^{t=1}_i} {\partial z^{t=1}_{hi}} \frac {\partial z^{t=1}_{hi}} {\partial w^{xh}_{mi}} \)

So the thing to note is that we can delay adding in the backward propagated errors until we get further into the loop. In other words, we can initially compute the derivatives of J with respect to the third unrolled network with only the first term:

\(\frac {\partial J^{t=3}} {\partial w^{xh}_{mi}} = e^{t=3f3}_{hi} \frac {\partial h^{t=3}_i} {\partial z^{t=3}_{hi}} \frac {\partial z^{t=3}_{hi}} {\partial w^{xh}_{mi}}\)

And then add in the other term only when we get to the second unrolled network:

\(\frac {\partial J^{t=2}} {\partial w^{xh}_{mi}} = (e^{t=2f3}_{hi} + e^{t=2f2}_{hi}) \frac {\partial h^{t=2}_i} {\partial z^{t=2}_{hi}} \frac {\partial z^{t=2}_{hi}} {\partial w^{xh}_{mi}} \)

And when we get to the first unrolled network we'll have:

\(\frac {\partial J^{t=1}} {\partial w^{xh}_{mi}} = (e^{t=1f3}_{hi} + e^{t=1f2}_{hi} + e^{t=1f1}_{hi}) \frac {\partial h^{t=1}_i} {\partial z^{t=1}_{hi}} \frac {\partial z^{t=1}_{hi}} {\partial w^{xh}_{mi}} \)

In the Python code this is what dhnext does. It stores the back-propagated-through-time error. In C we store this matrix multiplication in hiddentemp and then copy back to hiddenerror to prepare for next time:

dhnext = np.dot(Whh.T, dhraw)
MatrixMultiply (Whh, &hiddenerror[1], numhiddenwb, numhidden, 1, 0, hiddentemp);
memcpy (hiddenerror, hiddentemp, numhiddenwb * sizeof (hiddenerror[0]));

And it is added to the e at each time step with:

dh = np.dot(Why.T, dy) + dhnext # backprop into h
MatrixMultiply (Why, &outputerror[t * numoutputs], numhiddenwb, numoutputs, 1, 1, hiddenerror);

The hidden-to-hidden weights are computed in exactly the same way using the same dhnext variable:

\(\frac {\partial J^{t=1}} {\partial w^{hh}_{mi}} = (e^{t=1f3}_{hi} + e^{t=1f2}_{hi} + e^{t=1f1}_{hi}) \frac {\partial h^{t=1}_i} {\partial z^{t=1}_{hi}} \frac {\partial z^{t=1}_{hi}} {\partial w^{hh}_{mi}} \)

Finally, in both Python and C, all of the above is wrapped in a back propagation loop which takes t from 3 down to 1 with:

for t in reversed(xrange(len(inputs))):
for (t=numtimesteps-1; t>=0; t--)
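
Putting all of the above together, here is a hedged numpy sketch of my own of the full forward pass and the backward pass through time for one short sequence, following the equations above (biases are left out, as in the maths, and the names and sizes are illustrative only):

import numpy as np
np.random.seed(1)
nx, nh, ny, T = 3, 3, 3, 3                         # sizes and number of time steps
Wxh = np.random.randn(nh, nx) * 0.01               # input-to-hidden weights
Whh = np.random.randn(nh, nh) * 0.01               # hidden-to-hidden weights
Why = np.random.randn(ny, nh) * 0.01               # hidden-to-output weights
xs = [np.eye(nx)[i] for i in (0, 1, 2)]            # inputs A, B, C as 1-of-k
ys = [np.eye(ny)[i] for i in (1, 2, 0)]            # targets B, C, A as 1-of-k

# feed forward through every time step, remembering the hidden states and probabilities
hs, ps, loss = {-1: np.zeros(nh)}, {}, 0.0
for t in range(T):
    hs[t] = np.tanh(Wxh @ xs[t] + Whh @ hs[t - 1])  # hidden activations
    zy = Why @ hs[t]                                # output weighted sums
    ps[t] = np.exp(zy) / np.sum(np.exp(zy))         # softmax probabilities
    loss += -np.sum(ys[t] * np.log(ps[t]))          # cross-entropy loss

# back propagation through time
dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
dhnext = np.zeros(nh)                               # error carried back from later time steps
for t in reversed(range(T)):
    dy = ps[t] - ys[t]                              # output error e_y
    dWhy += np.outer(dy, hs[t])                     # gradient for the hidden-to-output weights
    dh = Why.T @ dy + dhnext                        # hidden error e_h plus the delayed error
    dhraw = (1 - hs[t] * hs[t]) * dh                # back propagate through the tanh
    dWxh += np.outer(dhraw, xs[t])                  # gradient for the input-to-hidden weights
    dWhh += np.outer(dhraw, hs[t - 1])              # gradient for the hidden-to-hidden weights
    dhnext = Whh.T @ dhraw                          # error to carry back to the previous time step
print(loss, dWxh)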

RNN C code

That's it for the maths of an RNN. To understand this better, I programmed a basic RNN in C, based on the equations above, using a feed forward neural network C code starting point and guided by the Python code. As with the Python code, it creates a basic character based language model.

The C program takes an input text file and trains an RNN as described above. It treats each individual line in the file as a complete sequence, taking each letter as an input, and the next letter as its output (eg "p" is the first input and "a" is the output for the word "particularly"). Even with a basic RNN like this there are lots of adjustable parameters. The program takes seven of them as command line options, which appear in the examples below.

As for the code itself, a few choice lines were included above, but the full code involves much more. There are several functions at the top of the file: GetTimeNow and GetTime are for timing on Windows and Linux/Mac platforms, PrintMatrix outputs a matrix in Matlab/Octave format for debugging purposes, MatrixMultiply multiplies two matrices together (using the fast linear algebra library CBLAS if available) and GetRandom and GetRandomNormal get random numbers. RunRNN runs the RNN to generate text and TrainRNN trains it. Finally main reads command line parameters and sets it all going.

To compile the code you'll need a C compiler. If you want to use the CBLAS library, you'll have to install it first and then uncomment the following line at the top of the code:

#include "cblas.h" //include CBLAS library

Run this on the command line to compile with the CBLAS library. Leave off the -lcblas if you are not using it:

gcc rnn.c -o rnn.out -lm -lcblas

That will give you the executable rnn.out. The next section runs some tests. Note that this code is not efficient. A real RNN has lots of shortcuts and optimisations. This code is to demonstrate as simply as possible how it works. It is also not particularly well written. I've avoided using too many functions, objects and options to try to keep it simple.

Testing it

To try the example mentioned above, first save the word "particularly" into a text file called "testinput.txt". Then run it with:

./rnn.out --input testinput.txt --attributes 0 --hidden 10 --learningrate 0.1

This sets up a feed-forward network (because of attributes=0) with 10 hidden nodes, reads the text file 10 times with a learning rate of 0.1, and then tests. Testing involves running the network using the first input character "p" as a seed. Following the seed, characters are printed probabilistically. So if there is a 60% chance for "a" to follow "p", then there will be a 60% chance of it outputting an "a":

After 10 epochs in 1ms, loss was 2.30246, generated text:
plllp...
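
Printing "probabilistically" just means drawing each next character from the softmax distribution rather than always taking the most likely one. A small sketch of the idea (made-up numbers, not the article's actual generation code):

import numpy as np
vocab = ['a', 'l', 'r', 't']                      # a made-up vocabulary
p = np.array([0.6, 0.2, 0.1, 0.1])                # softmax output after seeing "p"
next_char = np.random.choice(vocab, p=p)          # 'a' about 60% of the time
print(next_char)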

The default setting is 10 epochs (times through the file), which isn't much at all. You can try --epochs 10000 instead, which produces something like:

After 10000 epochs in 288ms, loss was 0.23128, generated text:
parly
arlarticuly
ticularlarly...

This has only learned connections between single letters. So "r" is always the most probable letter after "a". But our training word "particularly" has two "r"s, one followed by a "t" and the other by an "l". This network considers "l" as more probable and so chooses it more often. You can add --debug 2 to see it generate text after every epoch. You may notice that it gets to "parly" after about 100 epochs but can't get much further.

Now change the attributes to 1 to turn on the recurrent hidden-to-hidden connections to make this into an RNN:

./rnn.out --input testinput.txt --attributes 1 --hidden 10 --learningrate 0.1 --epochs 150 --debug 2

It will now learn the correct full sequence with "a" following the first "l" and "y" following the second one:

After 150 epochs in 7ms, loss was 2.08143, generated text:
particularly...

Now you can experiment with all the different parameters to see how each affects the network. Try varying the hidden nodes, epochs and learning rate. One striking change is to use Adagrad learning by setting --attributes 3. Then the network can learn "particularly" within about 40 epochs.

More data

You can also use this C code to compare with the output from this great RNN post. The C is an implementation of the Python minimal RNN, using 100,000 characters of Shakespeare as input. Get this text with:

wget http://cs.stanford.edu/people/karpathy/char-rnn/shakespear.txt

To match the Python code as closely as possible, run the C code with:

./rnn.out --input shakespear.txt --stop 0 --maxsequence 25 --hidden 100 --learningrate 0.1 --attributes 3 --epochs 50 --generate 200 --debug 3

This tells it to not treat newlines as end-of-sequence markers. Instead it feeds the network sequences of 25 characters at a time and carries over the hidden state between sequences. As with the Python, the C code generates some text every 100 iterations (eg every 100th 25 character sequence it sees, so every 2500 characters). An epoch (one pass through the Shakespeare file) therefore encompasses 4000 iterations.
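
The chunking itself is straightforward: step through the file 25 characters at a time, use each chunk as the inputs and the chunk shifted by one as the targets, and keep the hidden state from the end of one chunk as the starting state of the next. A rough sketch of that loop (train_sequence is a hypothetical training call, not a function from either implementation):

seq_length = 25
with open("shakespear.txt") as f:
    data = f.read()
hprev = None                                            # hidden state carried between chunks
for start in range(0, len(data) - seq_length, seq_length):
    inputs  = data[start : start + seq_length]          # 25 characters
    targets = data[start + 1 : start + seq_length + 1]  # the same 25 characters shifted by one
    # hprev = train_sequence(inputs, targets, hprev)    # hypothetical: train and carry the hidden state over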

On my computer, Python took about 15 seconds to do one epoch. The C code without CBLAS takes almost double that, about 30 seconds. With CBLAS it's much faster, about 5.5 seconds per epoch. Here are the results after 50 epochs or 200,000 iterations. The Python code output this after about 10 minutes:

Sise and shoul dearffted;

BANTIAGH:
Sell fandenigh? shal net rethonk,
As husself make.

LATVOLIO:
Now these you'd the pors,
Aldiel frove in foud sootee.

Prot? Preak not geake o, the ravemarth, Loea

And C produced this after 5 minutes:

TLLOPrS:
My deepret wordst:
He.

ONCELIO:

QUEEN MACESTERTH:
Your bears.
Op, you thker from p meEveS by all not lerd he dony, and ang
Wast, sele in Sofef fear gifue.

Secont MaS of for our fare pripiin

It looks Shakespeare-ish, but doesn't really make sense, and unfortunately it doesn't get much better. The output from the RNN effectiveness post used more data, more time, more nodes and more layers. Still, quite impressive for a few minutes and a small amount of code (even if it did take ages to get it working fully).

For comparison, without the recurrent hidden-to-hidden nodes, the network took 3 minutes and came up with:

TUKIGSY: far spul ps.;; betavtod

LUSI crelounoat blyct ts!

The pobourak st I,

BRYo tall go; tat

Efficiencies

This C RNN could be made more efficient. The first step of the RNN is to multiply the sparse (almost all 0s) input vector by the input-to-hidden weight matrix. This matrix multiplication isn't necessary for sparse vectors - we could just copy the relevant column from the weights. However, I've left the matrix multiplication as it corresponds more closely to the maths above. The same applies to the outputs.
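
For example, if the input is a 1-of-k vector with its single 1 at index k, then multiplying it by the weight matrix just picks out column k. A small sketch of the shortcut (sizes roughly matching the Shakespeare example above):

import numpy as np
np.random.seed(2)
nh, nx = 100, 62                                  # hidden nodes, vocabulary size
Wxh = np.random.randn(nh, nx) * 0.01              # input-to-hidden weights
x = np.zeros(nx)
x[5] = 1                                          # 1-of-k input with the 1 at index 5
full = Wxh @ x                                    # full matrix multiplication
shortcut = Wxh[:, 5]                              # just copy column 5 of the weights
print(np.allclose(full, shortcut))                # True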

Both the Python and C code create a vocabulary from the input file. For example the Shakespeare text above has only 62 unique characters in it, so the network has 62 inputs (where eg 0=newline, 1=space, 2=!, etc). This is a small number, but if this network used words instead of letters, the vocabulary (eg a=0, aardvark=1, able=2, etc) can easily get up to 50,000 or 100,000. When these are turned into inputs, they become vectors with 49,999 zeros and a single 1. Algorithms like word2vec mitigate this by compressing the 50,000 long sparse vectors into 1,000 long dense vectors (with numbers from -1 to 1 instead of all 0s and a single 1).
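
In practice that compression is just a lookup into an embedding matrix, so the long sparse vector never needs to be built at all. A tiny sketch of the idea (the matrix here is random rather than trained, and the sizes are shrunk from the 50,000 by 1,000 mentioned above):

import numpy as np
np.random.seed(3)
vocab_size, embed_size = 5000, 100                       # would be about 50,000 by 1,000 in practice
E = np.random.uniform(-1, 1, (vocab_size, embed_size))   # one dense vector per word
word_index = 42                                          # the 1-of-k index of some word
dense = E[word_index]                                    # 100 numbers between -1 and 1
print(dense.shape)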

Every step in the feed forward loop calculates the probability of every letter in the vocabulary. Techniques like hierarchical softmax mean computing these probabilities less often in a mathematically reliable way (which I don't yet fully understand). In practice, neural networks don't just run for the number of iterations you request. They usually stop early if specified conditions are met - for example if the loss stabilises and is no longer decreasing.

There are many other optimisation techniques, such as limiting how far back the propagation goes, training several inputs at once, running on a GPU and more.

Conclusion

I hope this has given a mathematical and intuitive feel for the differences between feed-forward and recurrent neural networks. One of the main problems with the basic RNN presented here is vanishing gradients. In practice, RNNs are not very good at learning relationships and sequences more than a few time steps back, because the error propagated back in time keeps getting multiplied by very small numbers and so vanishes to nothing. The next step would be to extend the RNN to use GRU or LSTM units in place of the simple tanh hidden layer; these architectures avoid the vanishing gradient and are now widely used.
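
The vanishing can be seen numerically: the back propagated error is repeatedly multiplied by the hidden-to-hidden weights and by the tanh derivative, and if those factors are small the error shrinks towards zero within a few time steps. A purely illustrative sketch of my own with random weights:

import numpy as np
np.random.seed(4)
nh = 100
Whh = np.random.randn(nh, nh) * 0.01              # small hidden-to-hidden weights
h = np.tanh(np.random.randn(nh))                  # some hidden activation values
e = np.ones(nh)                                   # an error at the last time step
for t in range(5):
    e = Whh.T @ ((1 - h * h) * e)                 # one step further back in time
    print(t, np.linalg.norm(e))                   # the size of the error shrinks rapidly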

© Existor Ltd 2007-2016