# Recurrent Neural Network Tutorial for Beginners

A recurrent neural network (RNN) is a sequence model used mainly for Natural Language Processing (NLP) tasks. Broadly speaking, in deep learning a CNN (convolutional neural network) is used mainly for images, while an RNN is used mainly for NLP. In this recurrent neural network tutorial we will see how an RNN works with real numbers in Excel, and we will look at some applications of recurrent neural networks.

## Recurrent Neural Network Applications

RNNs have shown great success in various NLP tasks. Here are some of the most common applications where RNNs are used.

• Machine Translation: A great example of a translator is Google Translate. This application can be handled by a “many to many” type of RNN
• Speech Recognition: Transcribing audio into text is usually a “many to many” problem, while classifying a short spoken command can be handled by a “many to one” type of recurrent neural network
• Generating Image Descriptions: Combined with convolutional neural networks, RNNs can generate descriptions for unlabelled images
• Sentiment Analysis: Predict whether a sentence is positive or negative. This application can be handled by a “many to one” type of RNN
• Text Autocompletion: A great example is Gmail.
• Named Entity Recognition (NER): Extract important information, such as person names, from text

## Advantage of Recurrent Neural Network (RNN)

Now you may be wondering: why can’t we solve the above applications using a simple neural network (shallow neural network)? Why should we use a recurrent neural network instead?

All of the above are sequence modeling applications. When it comes to human language, sequence is very important. For example, “I like NLP” is an understandable sentence but “like NLP I” is not. So word order matters when writing or speaking. Now let’s see why a simple neural network fails at sequence modeling:

Let’s say we want to build a language translator where the input is an English sentence and the output is a Bengali sentence.

Once this network is built, it cannot handle a change in sentence length, because a simple neural network architecture has a fixed number of input neurons and a fixed number of output neurons.

To work around this, you can build a large vocabulary from all your text documents and create a word vector for each word. Let’s say your vocabulary has 10000 words, so each word vector has length 10000; then the input layer of your simple neural network will have 10000 neurons. Since the output is in a different language, the number of output neurons can be different, let’s say 15000. Handling all of this is a huge computational task.
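To get a feel for the size of that task, here is a rough back-of-the-envelope count. It assumes (this is not stated above) a single fully connected layer going straight from the 10000 input neurons to the 15000 output neurons, with one bias per output neuron; any hidden layer would add more.

```python
# Rough parameter count for one dense layer, using the sizes from the text.
# ASSUMPTION: inputs connect directly to outputs (no hidden layer).
n_inputs = 10_000    # input vocabulary size (from the text)
n_outputs = 15_000   # output vocabulary size (from the text)

n_weights = n_inputs * n_outputs   # one weight per input-output pair
n_biases = n_outputs               # one bias per output neuron
n_params = n_weights + n_biases

print(f"{n_params:,} parameters")  # 150,015,000 parameters
```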

Another issue with a simple vanilla neural network is that, when translating, two different English sentences can sometimes map to a single Bengali sentence. For example:

“I will go to Kolkata on Monday”
“On Monday I will go to Kolkata”

The two sentences above have the same meaning and the same translation in Bengali.

Now let’s train a simple artificial neural network model on the first sentence, “I will go to Kolkata on Monday”.

Let’s say training adjusts the weights of one particular set of edges (highlighted in red).

Now suppose we train the same model on the sentence “On Monday I will go to Kolkata”. The meaning of “on Monday” is the same, but because the words now sit at different positions, the neural network has to learn a different set of edges (in red).

In both cases, parameters are not shared for the same words appearing at different positions in the sentence.
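The lack of sharing can be illustrated with a small sketch. It assumes, purely for illustration, that the feed-forward network receives the sentence as one hot vectors concatenated position by position, so every (position, word) pair has its own input neuron and its own outgoing weights. The helper name `active_input_neurons` is made up for this example.

```python
# Sketch: which input neuron does each word switch on when one hot
# vectors are concatenated position by position? (hypothetical setup)
s1 = "I will go to Kolkata on Monday".lower().split()
s2 = "On Monday I will go to Kolkata".lower().split()

vocab = sorted(set(s1 + s2))
word_index = {word: i for i, word in enumerate(vocab)}

def active_input_neurons(sentence):
    """Map each word to the index of the input neuron it activates."""
    size = len(vocab)
    return {word: pos * size + word_index[word]
            for pos, word in enumerate(sentence)}

n1 = active_input_neurons(s1)
n2 = active_input_neurons(s2)

for word in ("on", "monday"):
    print(word, "-> neuron", n1[word], "in sentence 1,", n2[word], "in sentence 2")
```

The same word lands on a different input neuron in each sentence, so whatever weights were learned for “Monday” in one position are not reused in the other.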

To summarize, there are three major problems with a simple artificial neural network in natural language processing (sequence problems):

• Variable input and output sizes (a fixed number of neurons cannot handle them)
• Too much computation
• No parameter sharing

## Types of Recurrent Neural Network

There are four types, or modes, of recurrent neural network:

• One to one RNN: number of inputs = 1; number of outputs = 1
• One to many RNN: number of inputs = 1; number of outputs > 1
• Many to one RNN: number of inputs > 1; number of outputs = 1
• Many to many RNN: number of inputs > 1; number of outputs > 1

Let’s see what those modes of RNN look like:
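As a rough sketch in terms of sequence lengths, the four modes can be told apart just by counting inputs and outputs. The function name `rnn_mode` is hypothetical and only encodes the definitions listed above.

```python
def rnn_mode(n_inputs, n_outputs):
    """Name the RNN mode for given input and output sequence lengths."""
    if n_inputs < 1 or n_outputs < 1:
        raise ValueError("need at least one input and one output")
    left = "one" if n_inputs == 1 else "many"
    right = "one" if n_outputs == 1 else "many"
    return f"{left} to {right}"

# Sentiment analysis: a 5-word sentence in, one label out.
print(rnn_mode(5, 1))   # many to one
# NER: one label per word of a 5-word sentence.
print(rnn_mode(5, 5))   # many to many
```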

## Recurrent Neural Network Architecture

The most generic architecture of an RNN looks like this:

From the image above you can see that there is only one hidden layer. We are just passing different inputs to the same hidden layer at different time steps.

Note: The total number of time steps for a sentence is the number of words in that sentence.

Let me break down these time steps with an example.

Let’s understand the architecture of a recurrent neural network with a Named Entity Recognition (NER) problem. It is an example of a many to many recurrent neural network. Let’s say we have the text: Mark is son of Edward. In this sentence the words Mark and Edward are person names.

The whole objective of Named Entity Recognition here is to find, or predict, the names (Mark/Edward) in a given text. To make training data for the RNN model you can mark a word as 1 if it is a person name, else 0. For our sentence that gives Mark = 1, is = 0, son = 0, of = 0, Edward = 1.
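As a small sketch, the labelled training example for this sentence could be built like this. The variable names are illustrative, and splitting on whitespace is a simplifying assumption.

```python
# Tokens and person/not-person labels for the example sentence.
tokens = "Mark is son of Edward".split()
person_names = {"Mark", "Edward"}   # the known names in this example

# 1 if the word is a person name, else 0.
labels = [1 if token in person_names else 0 for token in tokens]

for token, label in zip(tokens, labels):
    print(f"{token}: {label}")
```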

Now let’s see how a recurrent neural network (RNN) works for NER.

First you have to convert your input text into vectors. There are many ways to convert a word into a vector, such as word embeddings (for example word2vec, trained with skip-gram or CBOW), bag of words, TF-IDF, one hot encoding of the word vocabulary, etc. In this example I am using one hot encoding to vectorize each word.

So for our text: Mark is son of Edward

Total vocabulary = number of unique words in all the text = 5 (no word is repeated in our example text)

Note: A sentence can have any number of words.

So the vector form of each word is:

• Mark = [1,0,0,0,0]
• is = [0,1,0,0,0]
• son = [0,0,1,0,0]
• of = [0,0,0,1,0]
• Edward = [0,0,0,0,1]
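These vectors can be produced with a few lines of NumPy. This is a minimal sketch; it assumes the vocabulary is ordered by first appearance in the sentence, which matches the list above.

```python
import numpy as np

tokens = "Mark is son of Edward".split()

# Vocabulary in order of first appearance (no repeats in this sentence).
vocab = list(dict.fromkeys(tokens))

# Row i of the identity matrix is the one hot vector of word i.
identity = np.eye(len(vocab), dtype=int)
one_hot = {word: identity[i] for i, word in enumerate(vocab)}

for word in tokens:
    print(word, one_hot[word].tolist())
```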

These word vectors are the input to the RNN, and you need to pass them in one word at a time, in order. Let me break this down into sequential steps to make each one clear.

#### Step 1: time = 1

At the very first time step you pass in the first word of our example text (“Mark”) and produce an output, predicting 1/0 (person or not):

There are three layers:

1. Input layer: provides the word vector of the word “Mark”
2. Hidden layer: for this RNN tutorial I have taken four neurons, but it can be any number
3. Output layer: produces the output ($Y_{\text{Mark}}$) as 1 or 0 (person or not)

#### Step 2: time = 2

Now I have to supply the second word, “is”, along with the previous output ($Y_{\text{Mark}}$) as input to the network.

There are the same three layers:

1. Input layer: provides the word vector of the word “is” and the output of the previous step ($Y_{\text{Mark}}$)
2. Hidden layer: the same hidden layer used in the previous step
3. Output layer: produces the output as 1 or 0 (person or not)

This is how an RNN carries the context of a sentence. To understand context, let’s say we have two sentences:

• I love to eat Apple
• I love to join Apple

In the first sentence Apple is a fruit, and in the second sentence Apple is an organization. If we don’t know the previous word (eat, join), we miss the context. A recurrent neural network gives you memory to carry the context of a text.

#### Step 3: time = 3

In this step we pass the third word of our sentence (“son”) along with the previous step’s output ($Y_{\text{is}}$) as input to the network.

There are the same three layers:

1. Input layer: provides the word vector of the word “son” and the output of the previous step ($Y_{\text{is}}$)
2. Hidden layer: the same hidden layer used in the previous step
3. Output layer: produces the output as 1 or 0 (person or not)

The benefit of this setup is that when processing the word “son” we also provide $Y_{\text{is}}$. This $Y_{\text{is}}$ carries the previous state, or memory, of “Mark is”. If we go on, $Y_{\text{son}}$ will carry the memory of “Mark is son”, and so on. This is how the network keeps storing the important context of the whole sentence.

#### Step 4: time = 4

In this step we pass the fourth word of our sentence (“of”) along with the previous output ($Y_{\text{son}}$) as input to the network.

There are the same three layers:

1. Input layer: provides the word vector of the word “of” and the output of the previous step ($Y_{\text{son}}$)
2. Hidden layer: the same hidden layer used in the previous step
3. Output layer: produces the output as 1 or 0 (person or not)

#### Step 5: time = 5

In this step we pass the fifth word of our sentence (“Edward”) along with the previous output ($Y_{\text{of}}$) as input to the network.

There are the same three layers:

1. Input layer: provides the word vector of the word “Edward” and the output of the previous step ($Y_{\text{of}}$)
2. Hidden layer: the same hidden layer used in the previous step
3. Output layer: produces the output as 1 or 0 (person or not)

This is how we process words one by one, sequentially. Now let me show you an animated version of the recurrent neural network architecture to make all the steps above clear and show how they actually happen inside the RNN. In particular, note that the architecture stays the same; we are just passing in different inputs sequentially and predicting an output each time.

Now let me expand the animated version above so that the whole RNN architecture fits in a single image.

This is how a recurrent neural network handles sequence data. If the number of words in a sentence increases, the number of time steps, and therefore the number of RNN steps, increases with it.

## Recurrent Neural Network Forward Propagation With Time in Excel

Hopefully at this point the recurrent neural network architecture and how an RNN works are clear. Now let’s look at the forward propagation of an RNN over time, with the mathematics.

Before starting, note that each step of a recurrent neural network is nothing but a basic artificial neural network. So it has:

• Input Layer: Vector form of each word
• Hidden Layer: The same hidden layer is shared by every step. As you know, hidden neurons have an activation function (ReLU, tanh, sigmoid, etc.)
• Output Layer: Produces the output by applying the Softmax function to obtain a vector of normalized probabilities
• Weights: There are two types of weights.
  • $W$: input to hidden layer
  • $W_1$: recurrent weight, applied to the previous step’s output ($a_0$ at the first step)

Note: For the first step you need to provide an initial activation ($a_0$). It can be an array of zeros or of random numbers, the same size as the step output. From step 2 onwards this activation is the previous step’s output.

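So the equations for the output at each step are:

• time = 1: $Y_{\text{Mark}} = \text{Softmax}(\tanh(X_{\text{Mark}} \times W + a_0 \times W_1 + b))$
• time = 2: $Y_{\text{is}} = \text{Softmax}(\tanh(X_{\text{is}} \times W + Y_{\text{Mark}} \times W_1 + b))$
• time = 3: $Y_{\text{son}} = \text{Softmax}(\tanh(X_{\text{son}} \times W + Y_{\text{is}} \times W_1 + b))$
• time = 4: $Y_{\text{of}} = \text{Softmax}(\tanh(X_{\text{of}} \times W + Y_{\text{son}} \times W_1 + b))$
• time = 5: $Y_{\text{Edward}} = \text{Softmax}(\tanh(X_{\text{Edward}} \times W + Y_{\text{of}} \times W_1 + b))$

Here $b$ is the bias.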
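The five equations follow one pattern. Writing $t$ for the time step and numbering the words in sentence order (the index $t$ is notation introduced here for convenience), they can be summarized as:

$$
Y_t = \text{Softmax}\big(\tanh(X_t \times W + Y_{t-1} \times W_1 + b)\big),
\qquad Y_0 = a_0, \quad t = 1, \dots, 5
$$

Here $X_1$ is the vector for “Mark”, $X_2$ the vector for “is”, and so on, and the same $W$, $W_1$ and $b$ are used at every step.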

Now let’s calculate forward propagation with real numbers:

Let’s take a look at the inputs first:

The inputs are one hot encoded as discussed earlier. Our entire vocabulary is {Mark, is, son, of, Edward} and hence we can easily one hot encode the inputs.

We have randomly initialized the input-to-hidden-layer weights ($W$) as a 3×5 matrix.

#### Step 1 calculation: time = 1

$X_{\text{Mark}} = [1,0,0,0,0]$

Now moving to the recurrent connection, we take the weight $W_1$ as a 1×1 matrix with value 0.334043, and the bias $b$, also a 1×1 matrix, as 0.019882.

Since there is no word before “Mark” (time = 1), we take the previous state ($a_0$) as [0,0].

So now we calculate $a_0 \times W_1 + b$. Because $a_0$ is all zeros, this is just the bias, 0.019882, in each position.

Now let’s calculate the full expression:

$\tanh(X_{\text{Mark}} \times W + a_0 \times W_1 + b)$

Now the probability of a word being a person or not can be calculated by applying the Softmax function:

$Y_{\text{Mark}} = \text{Softmax}(\tanh(X_{\text{Mark}} \times W + a_0 \times W_1 + b))$

The first row of the Softmax output is the probability for 1 (person) and the second row is the probability for 0 (not person). In our output the first row value is 0.62, which means our model is predicting the word “Mark” as a person.
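The same step can be sketched in NumPy. The values of $W_1$, $b$ and $a_0$ are the ones given above, but the spreadsheet values of $W$ are not reproduced in the text, so this sketch assumes a random 5×2 matrix instead of the 3×5 one mentioned earlier, chosen only so that the shapes line up with the two-element $a_0$. Because of that assumption the printed probabilities will not match the 0.62 from the spreadsheet.

```python
import numpy as np

def softmax(z):
    """Normalize a vector into probabilities that sum to 1."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Values taken from the text.
x_mark = np.array([1, 0, 0, 0, 0])   # one hot vector for "Mark"
a0 = np.array([0.0, 0.0])            # previous state at time = 1
W1 = 0.334043                        # recurrent weight (1x1)
b = 0.019882                         # bias (1x1)

# ASSUMPTION: a random 5x2 input weight matrix, chosen only so that
# x_mark @ W has two entries like a0. These are not the article's
# numbers, so the result will differ from 0.62.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 2))

hidden = np.tanh(x_mark @ W + a0 * W1 + b)
y_mark = softmax(hidden)

print("P(person), P(not person):", y_mark)
```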

Now let’s look at the step 2 calculation.

#### Step 2 calculation: time = 2

$Y_{\text{is}} = \text{Softmax}(\tanh(X_{\text{is}} \times W + Y_{\text{Mark}} \times W_1 + b))$

$X_{\text{is}} = [0,1,0,0,0]$

Since a recurrent neural network is a one-layer model, the weights and bias are shared across all steps.

Here is the difference between the first step and every other step. In the first step there is no word before “Mark” (time = 1), so we took the previous state ($a_0$) as [0,0]. From the second step on we do have a previous word, so instead of $a_0$ we use the previous step’s output.

So $a_0 \times W_1 + b$ becomes $Y_{\text{Mark}} \times W_1 + b$.

So now let’s calculate $Y_{\text{Mark}} \times W_1 + b$:

Now let’s calculate the full expression: $\tanh(X_{\text{is}} \times W + Y_{\text{Mark}} \times W_1 + b)$

Now the probability score with the Softmax function:

$Y_{\text{is}} = \text{Softmax}(\tanh(X_{\text{is}} \times W + Y_{\text{Mark}} \times W_1 + b))$

You can calculate the other steps (times) the same way as step 2. Only the first step uses the initial value $a_0$, since there is no prior word or output.
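Putting all five steps together, the forward pass is a single loop that reuses the same weights and feeds each output into the next step. This is a sketch under assumptions: the function name `rnn_forward` is made up, and $W$ is a random 5×2 matrix (the spreadsheet values are not available in the text), so the printed probabilities are illustrative only.

```python
import numpy as np

def softmax(z):
    """Normalize a vector into probabilities that sum to 1."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_forward(inputs, W, W1, b, a0):
    """Run the tutorial's recurrence over a list of one hot vectors."""
    outputs = []
    prev = a0                      # step 1 uses the initial state
    for x in inputs:
        y = softmax(np.tanh(x @ W + prev * W1 + b))
        outputs.append(y)
        prev = y                   # later steps reuse the previous output
    return outputs

tokens = "Mark is son of Edward".split()
inputs = list(np.eye(len(tokens)))   # one hot vector per word, in order

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 2))          # ASSUMED shape and values
outputs = rnn_forward(inputs, W, W1=0.334043, b=0.019882, a0=np.zeros(2))

for token, y in zip(tokens, outputs):
    print(f"{token}: P(person) = {y[0]:.2f}")
```

The loop runs once per word, so a longer sentence simply means more iterations with the same `W`, `W1` and `b`.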

## Error Calculation of RNNs

As you know, after forward propagation the next step in model building is to calculate the error. This error calculation is required to make the model more accurate by minimizing the error over time using back propagation.

As you know, to train any algorithm you must have x (the statement) and y (the label to predict) variables. For our example sentence, since this is training data for NER, the value for a word is 1 if the word is a person name, else 0.

To calculate the error of a recurrent neural network, you subtract the predicted value from the actual value. Then, to calculate the total loss, you just sum the loss at each time step.

For example, the error for time = 1:

The actual output for the word “Mark” is 1 (as “Mark” is a person name), so:

loss_time1 = (Actual output - Predicted output) = (1 - $Y_{\text{Mark}}$)

So, Total loss = loss_time1 + loss_time2 + loss_time3 + loss_time4 + loss_time5
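The total loss can be computed in a few lines. In this sketch only the first predicted value (0.62) comes from the text; the other four predictions are made-up placeholders so that the sum has something to add up.

```python
# Actual labels for "Mark is son of Edward" (1 = person name).
actual = [1, 0, 0, 0, 1]

# Predicted probability of "person" at each time step.
# Only the first value (0.62) comes from the text; the other four
# are made-up placeholders for illustration.
predicted = [0.62, 0.30, 0.25, 0.20, 0.70]

# Per-step loss as defined above: actual minus predicted.
step_losses = [a - p for a, p in zip(actual, predicted)]
total_loss = sum(step_losses)

print(step_losses)
print(total_loss)
```

Note that signed differences can cancel each other out when summed, which is why squared error or cross-entropy is usually preferred in practice; the sketch keeps the simple difference used in this tutorial.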

## Conclusion

In this recurrent neural network tutorial I have explained the following points:

• Different types of recurrent neural network applications
• Advantage of Recurrent Neural Network (RNN)
• Types of Recurrent Neural Network
• Recurrent neural network architecture
• Recurrent Neural Network Forward Propagation With Time in Excel with Real Numbers
• Error Calculation of RNNs

If you have any questions or suggestions regarding this post, please let me know in the comment section below.
