Have you ever seen the movies Memento or Ghajini? Basic RNNs are like the heroes of those movies, who suffer from short-term memory loss. LSTM is a special kind of RNN (Recurrent Neural Network) that addresses this short-term memory issue. In this tutorial, we will understand Long Short-Term Memory (LSTM) from scratch and work through the calculations with real numbers.

Before we start learning about the LSTM neural network model, I recommend reading this tutorial: **Recurrent Neural Network tutorial for Beginners**. You should understand the RNN architecture before learning LSTM, because LSTM is a member of the RNN family.

## Application of LSTM models

Nowadays, LSTMs are used for various NLP tasks such as language translation, sentiment analysis, and text summarization, as well as for time series forecasting.

For example, Google's Neural Machine Translation system (GNMT), introduced for Google Translate in 2016, was built on an LSTM-based architecture to translate text from one language to another.

LSTM models are also used to improve the accuracy of speech recognition systems. For example, voice assistants such as Apple’s Siri and Google Assistant have used LSTM-based models to understand and respond to users’ voice commands.

LSTM models have also become popular for time series forecasting. You can use an LSTM to forecast stock prices, weather, energy demand, and other time-series data. For example, the Australian Energy Market Operator (AEMO) uses LSTM-based models to forecast electricity demand.

## Why is LSTM better than RNN?

RNNs are interesting because they can use past knowledge to help with current tasks. For example, if you want to predict the next word in a sentence, an RNN can use the previous words to do it better. But RNNs have some limitations. Let’s explore those.

In theory, RNNs can remember things from a long time ago. In practice, however, they do not do this very well.

RNNs perform well on short sentences. But on long sentences they struggle because of the vanishing gradient problem: the training signal from distant words shrinks as it is propagated back through many time steps, so the network cannot learn to remember things for a long time.

For example, suppose we want to predict the final word in the sentence “*the stars are in the ____*”. Here the gap between the context word (“stars”) and the place where we need it (“sky”) is small.

In this kind of case, an RNN will work properly.

Now let’s take another example. Suppose our goal is to predict the final word in this sentence: “*Rome is a modern and cosmopolitan city that attracts millions of visitors. It is known for its delicious thin-crust _____*”

In this case, the gap between the context word (“Rome”) and the place where we need it (“pizza”) is much larger.

As the gap increases, RNNs lose their ability to connect the dots and learn from it. In theory an RNN can handle such long dependencies, but in practice it struggles to learn them because of the vanishing gradient problem.
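To make the vanishing gradient a little more concrete, here is a rough numeric sketch under simplifying assumptions: a scalar RNN `h_t = tanh(w * h_{t-1})` with no input and a hypothetical recurrent weight `w = 0.9`. Backpropagating through T steps multiplies T factors of `w * (1 - h_t^2)`, each smaller than 1 in magnitude, so the product shrinks quickly as the gap grows. The function name and the numbers are illustrative only.

```python
import numpy as np

def gradient_through_time(w, h0, steps):
    """Return |d h_T / d h_0| for the scalar RNN h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = np.tanh(w * h)
        grad *= w * (1.0 - h ** 2)   # chain rule: d tanh(w * h_prev) / d h_prev
    return abs(grad)

# The longer the gap, the smaller the gradient that reaches the early time step.
for steps in (5, 20, 50):
    print(steps, gradient_through_time(w=0.9, h0=0.5, steps=steps))
```

With a gap of 50 steps, the gradient is already far too small to teach the network much about the early input.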

LSTMs are designed specifically to **tackle this long-term dependency problem**. Unlike basic RNNs, they can retain information over much longer spans of a sequence.

## Types of LSTM model

There are several types of LSTM models that differ in terms of their architecture and the way they handle input and output.

**Vanilla LSTM**

This is the most basic type of LSTM model that includes a memory cell, input gate, forget gate, and output gate. The memory cell stores information about the sequence and the gates control the flow of information in and out of the cell.

In this tutorial, I am going to explain this type of LSTM neural network model.

**Bidirectional LSTM**

In this type of LSTM model, the input sequence is processed in both forward and backward directions using two separate LSTM layers. This allows the model to learn from both past and future contexts and is useful for tasks such as speech recognition and language translation.

**Stacked LSTM**

This type of LSTM model includes multiple LSTM layers, where the output of one layer serves as the input for the next layer. Stacked LSTMs are useful for learning hierarchical representations of sequential data.

**Convolutional LSTM**

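In a Convolutional LSTM (ConvLSTM), the matrix multiplications inside the gates are replaced by convolutions, so the cell keeps the spatial structure of its input. A related design, often called CNN-LSTM, instead processes the input with convolutional layers first and feeds the result into ordinary LSTM layers. Both are useful for tasks such as video analysis and motion prediction.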

**Attention LSTM**

This LSTM model adds something special called an attention mechanism. It helps the model concentrate on the important parts of the input sequence based on what is needed for the task. This feature is very helpful for tasks like understanding people’s feelings (sentiment analysis) and describing pictures in words (image captioning).

## LSTM Network Architecture

LSTMs or Long Short Term Memory networks are a type of Recurrent Neural Network or RNN that can learn and remember long-term dependencies. Hochreiter & Schmidhuber (1997) first introduced this deep learning algorithm, which gained popularity over time.

LSTMs are created to tackle the long-term dependency problem. Unlike basic RNNs, they can remember information for long periods of time.

In Recurrent Neural Networks, there is a repeating module of a neural network. But this module is quite simple, like a single **tanh layer**.

**Here H is the hidden layer**

LSTMs share the same chain-like structure as other Recurrent Neural Networks, but their repeating module has a unique structure. Rather than having just one neural network layer, LSTMs have four, and they interact in a highly specialized way.

These four layers are the forget gate, the input gate, the candidate (cell) layer, and the output gate. Together they control what is stored in the memory of the LSTM model.

**Gates of LSTM**

Now let me explain the different gates (layers) of the LSTM neural network model and what each one is used for.

**1. Forget Gate**

The forget gate is a neural network layer that decides what information to keep or discard from the cell state. It takes the previous hidden state (**Y_{t-1}**) and the current input (**X_{t}**) as input. Then a **sigmoid function** generates an output value between 0 and 1. A value of 0 means to forget the previous state, and a value of 1 means to keep it.

**Forget Gate Equation**

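**f_{t} = sigmoid(W_{f}[x_{t}, Y_{t-1}] + b_{f})**

Here **[x_{t}, Y_{t-1}]** is the current input and the previous hidden state stacked into one vector. If we instead use separate weight matrices for the input and for the hidden state, the equation looks like below:

**f_{t} = sigmoid(W_{f} * x_{t} + U_{f} * Y_{t-1} + b_{f})**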

In this equation,

- **W_{f}** is a weight matrix that maps the input **x_{t}** to the forget gate values.
- **U_{f}** is a weight matrix that maps the previous hidden state **Y_{t-1}** to the forget gate values.
- **b_{f}** is a bias vector that is added to the weighted sum of **W_{f} * x_{t}** and **U_{f} * Y_{t-1}**.
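The two ways of writing the forget gate equation above describe the same computation. The sketch below checks this numerically with random values, assuming an input size of 5 and a hidden size of 3. The variable names and the random seed are illustrative and not part of the tutorial's worked example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)            # arbitrary seed
input_size, hidden_size = 5, 3

W_f = rng.normal(size=(hidden_size, input_size))    # acts on x_t
U_f = rng.normal(size=(hidden_size, hidden_size))   # acts on Y_{t-1}
b_f = rng.normal(size=hidden_size)

x_t = rng.normal(size=input_size)
Y_prev = rng.normal(size=hidden_size)

# Split form: separate matrices for the input and the hidden state.
f_split = sigmoid(W_f @ x_t + U_f @ Y_prev + b_f)

# Concatenated form: one (3, 8) matrix applied to the stacked vector [x_t, Y_{t-1}].
W_concat = np.hstack([W_f, U_f])
f_concat = sigmoid(W_concat @ np.concatenate([x_t, Y_prev]) + b_f)

print(np.allclose(f_split, f_concat))
```

Placing `W_f` and `U_f` side by side gives exactly the single matrix used in the concatenated form, which is why the two results agree.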

Let’s take an example of a conversation between two friends, Ria and Priya.

*Ria said, “I am going to the mall to buy some clothes.” Priya replied, “That’s great, I am working on my project at ____”*

In this conversation, the forget gate helps the model drop the context of Ria’s sentence and focus only on Priya’s response. The value of **f_{t}** lies between 0 and 1, depending on how much information we want to remember: close to 0 when the information should be forgotten, and close to 1 when it should be kept completely.

**2. Input Gate**

The input gate decides what new information we can add to the current cell state. First, it takes the current input **X_{t}** and the previous hidden state **Y_{t-1}** and puts them into the **second sigmoid function**. The sigmoid function squashes the values into the range between 0 and 1 and produces **i_{t}**.

After that, the same hidden state and current input go through the **tanh function**. This helps regulate the network and creates a candidate cell vector (**g_{t}**) whose values lie between -1 and 1.

Finally, point-wise multiplication happens between **i_{t}** and **g_{t}**. The result encodes how much of each candidate value should be stored in the cell state.

Here the **sigmoid function decides** which values we need to update, and the **tanh function helps** to determine the content (magnitude and sign) of the update.

**Input Gate Equation**

**i_{t} = sigmoid(W_{i}[x_{t}, Y_{t-1}] + b_{i})**

**g_{t} = tanh(W_{g}[x_{t}, Y_{t-1}] + b_{g})**

Similarly, with separate weight matrices for the input and the hidden state, we can write:

**i_{t} = sigmoid(W_{i} * x_{t} + U_{i} * Y_{t-1} + b_{i})**

**g_{t} = tanh(W_{g} * x_{t} + U_{g} * Y_{t-1} + b_{g})**

Here,

- **W_{i}** is the weight matrix that maps the input **x_{t}** to the input gate values.
- **U_{i}** is the weight matrix that maps the previous hidden state **Y_{t-1}** to the input gate values.
- **b_{i}** is the bias vector that is added to the weighted sum of **W_{i} * x_{t}** and **U_{i} * Y_{t-1}**.
- **W_{g}** is a weight matrix that maps the input **x_{t}** to the candidate values.
- **U_{g}** is a weight matrix that maps the previous hidden state **Y_{t-1}** to the candidate values.
- **b_{g}** is a bias vector that is added to the weighted sum of **W_{g} * x_{t}** and **U_{g} * Y_{t-1}**.

During training, the values of **W_{i}**, **U_{i}**, **b_{i}**, **W_{g}**, **U_{g}**, and **b_{g}** are learned through backpropagation and gradient descent, so that the model can learn to selectively add new information to the cell state based on the input sequence and the previous hidden state.

For example, consider the sentence: “*I love to eat pizza, but I am allergic to ____*”

To predict the next word, an LSTM needs to selectively forget and remember information from the previous words. The input gate is responsible for deciding what new information to let into the cell state.

In this example, the input gate would determine how much weight to give to the word “**pizza**” and how much weight to give to the word that follows “**allergic to**“.

If the next word is “**tomatoes**” for instance, the input gate would need to remember information about “**tomatoes**” without overwriting the information about “**pizza**” that was previously stored in the cell state.
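A small sketch of how the sigmoid gate and the tanh candidate work together may help here. The pre-activation numbers below are hypothetical, chosen only to show a gate that is almost closed, half open, and almost open while the candidate content stays the same.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical pre-activations for a 3-unit cell (not the tutorial's worked numbers).
i_pre = np.array([-6.0, 0.0, 6.0])   # gate: almost closed, half open, almost open
g_pre = np.array([2.0, 2.0, 2.0])    # the same candidate content for every unit

i_t = sigmoid(i_pre)                 # values in (0, 1): how much to write
g_t = np.tanh(g_pre)                 # values in (-1, 1): what to write
write = i_t * g_t                    # what actually gets added to the cell state

print(np.round(i_t, 3))
print(np.round(g_t, 3))
print(np.round(write, 3))
```

The first unit writes almost nothing even though its candidate value is large, while the third unit writes nearly the full candidate. This is how the input gate can let new information in without it having to overwrite everything else.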

**3. Cell State**

After calculating the forget gate and the input gate, we need to update the cell state of the LSTM neural network model. The cell state acts like a conveyor belt running through the whole network. It is crucial because it lets information flow along the sequence with only small, gate-controlled changes.

Cell State is also known as the memory of LSTM.

The cell state is where the results of the gates are combined. First, the previous cell state **C_{t-1}** is multiplied point-wise by the forget vector **f_{t}**. If a value in **f_{t}** is close to 0, the corresponding information is considered less important and is dropped from the cell state.

After that, the output of the input gate (**i_{t} * g_{t}**) is added point-wise, which updates the cell state and creates the new cell state (**C_{t}**).

**Cell State Equation**

**C_{t} = f_{t} * C_{t-1} + i_{t} * g_{t}**
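As a minimal illustration of this update, the sketch below applies the equation element-wise to hand-picked, hypothetical values: one unit that keeps its memory, one that erases it and writes a new value, and one that does a bit of both. The helper name `update_cell_state` is an assumption for this example.

```python
import numpy as np

def update_cell_state(C_prev, f_t, i_t, g_t):
    """C_t = f_t * C_{t-1} + i_t * g_t, with every operation element-wise."""
    return f_t * C_prev + i_t * g_t

C_prev = np.array([2.0, -1.0, 0.5])   # previous memory
f_t = np.array([1.0, 0.0, 0.5])       # keep, erase, halve
i_t = np.array([0.0, 1.0, 0.5])       # write nothing, write fully, write half
g_t = np.array([0.9, 0.9, -0.8])      # candidate values

print(update_cell_state(C_prev, f_t, i_t, g_t))
```

The first unit carries its old value through unchanged, the second is replaced by the candidate, and the third blends the two.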

LSTMs have three gates, which are used to control and protect the cell state. The last one is the output gate.

**4. Output Gate**

The output gate is the third and final gate in the LSTM architecture. It generates the output (**Y_{t}**) of the hidden layer.

First, we pass the current input **X_{t}** and the previous hidden state **Y_{t-1}** into the third sigmoid function and produce **o_{t}**. Then we apply the **tanh function** to the new cell state **C_{t}**. Finally, we apply point-wise multiplication to those two results and produce the final output **Y_{t}**.

**Output Gate Equation**

**o_{t} = sigmoid(W_{o} * x_{t} + U_{o} * Y_{t-1} + b_{o})**

**Y_{t} = o_{t} * tanh(C_{t})**
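Putting the equations of all the gates together, one LSTM time step can be sketched as a single function. This is a minimal NumPy sketch, not a production implementation. The function name `lstm_step` and the parameter dictionary with keys such as `W_f` are naming choices made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, Y_prev, C_prev, p):
    """One LSTM time step. `p` is a dict holding W_*, U_*, b_* for f, i, g, o."""
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ Y_prev + p["b_f"])   # forget gate
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ Y_prev + p["b_i"])   # input gate
    g_t = np.tanh(p["W_g"] @ x_t + p["U_g"] @ Y_prev + p["b_g"])   # candidate values
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ Y_prev + p["b_o"])   # output gate
    C_t = f_t * C_prev + i_t * g_t                                 # new cell state
    Y_t = o_t * np.tanh(C_t)                                       # new hidden state
    return Y_t, C_t

# Sanity check with all-zero parameters: every gate is sigmoid(0) = 0.5 and g_t = tanh(0) = 0.
input_size, hidden_size = 5, 3
zeros = {}
for gate in "figo":
    zeros[f"W_{gate}"] = np.zeros((hidden_size, input_size))
    zeros[f"U_{gate}"] = np.zeros((hidden_size, hidden_size))
    zeros[f"b_{gate}"] = np.zeros(hidden_size)

Y_t, C_t = lstm_step(np.eye(input_size)[0], np.zeros(hidden_size), np.ones(hidden_size), zeros)
print(C_t)   # half of the previous cell state survives
print(Y_t)
```

The zero-parameter check is just a quick way to see the equations at work. With real, trained weights each gate would take a different value for each unit.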

## Forward Propagation of LSTM from Scratch

Now let’s calculate forward propagation of LSTM to clear up our understanding. Let’s walk through a single iteration of an LSTM for next word prediction using the example sentence “*The cat sat on the ____*“.

**Input Encoding**

Like other NLP models, we first encode the input sequence “The cat sat on the” as a sequence of word vectors. You can use any technique such as **one-hot encoding** or **word embeddings**.

For simplicity, let’s assume we are using **one-hot encoding**, and we have a vocabulary size of 5 words “the”, “cat”, “sat”, “on”, and “dog”. The one-hot encoded input sequence would look like below:

[1 0 0 0 0] # “the”

[0 1 0 0 0] # “cat”

[0 0 1 0 0] # “sat”

[0 0 0 1 0] # “on”

The vocabulary holds only unique words, so “the” gets a single one-hot vector even though it appears twice in the sentence. The fifth input, the second “the”, simply reuses [1 0 0 0 0].
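The encoding above can be produced with a few lines of NumPy. This is a small sketch that assumes the five-word vocabulary from this example. The helper name `one_hot` is made up for illustration.

```python
import numpy as np

vocabulary = ["the", "cat", "sat", "on", "dog"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's vocabulary index."""
    vec = np.zeros(len(vocabulary), dtype=int)
    vec[word_to_index[word]] = 1
    return vec

sentence = "the cat sat on the".split()
X = np.stack([one_hot(w) for w in sentence])   # shape (5, 5): one row per time step
print(X)
```

The first and last rows of `X` are identical because both correspond to the word "the".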

We’ll assume that our LSTM has a hidden state size of 3, so our weight matrices and biases have the following shapes:

- W_f, U_f, b_f: (3, 5), (3, 3), (3,)
- W_i, U_i, b_i: (3, 5), (3, 3), (3,)
- W_o, U_o, b_o: (3, 5), (3, 3), (3,)
- W_g, U_g, b_g: (3, 5), (3, 3), (3,)
- W_p, b_p: (5, 3), (5,)
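To see these shapes in code, here is a short sketch that creates randomly initialized parameters with exactly the shapes listed above. The nested dictionary layout, the uniform range of -0.4 to 0.4, and the seed are arbitrary choices for this illustration and are not the values used in the hand calculation.

```python
import numpy as np

input_size, hidden_size, vocab_size = 5, 3, 5
rng = np.random.default_rng(42)   # arbitrary seed

def init_gate():
    """Random W (3, 5) and U (3, 3) plus a zero bias (3,) for one gate."""
    return {
        "W": rng.uniform(-0.4, 0.4, size=(hidden_size, input_size)),
        "U": rng.uniform(-0.4, 0.4, size=(hidden_size, hidden_size)),
        "b": np.zeros(hidden_size),
    }

params = {name: init_gate() for name in ("f", "i", "o", "g")}
params["p"] = {
    "W": rng.uniform(-0.4, 0.4, size=(vocab_size, hidden_size)),   # output layer
    "b": np.zeros(vocab_size),
}

for name, gate in params.items():
    print(name, {key: value.shape for key, value in gate.items()})
```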

We’ll also assume that our initial hidden state **Y_{0}** and cell state **C_{0}** are both **zero vectors of size 3**.

**Time step = 1**

At the first time step t=1, we feed the input word vector of **“the” [1 0 0 0 0]** into the LSTM, along with the previous hidden state **Y_{0}** and cell state **C_{0}**:

**Forget Gate Calculation**

We have randomly initialized the weight **W_{f}** as a 3×5 matrix, **U_{f}** as a 3×3 matrix, and the bias **b_{f}** as a 3×1 vector. The first hidden state **Y_{0}** is initialized as a zero vector of size 3.

We already know the equation of the forget gate:

**f_{t} = sigmoid(W_{f} * x_{t} + U_{f} * Y_{t-1} + b_{f})**

For time step = 1 the equation becomes:

**f_{1} = sigmoid(W_{f} * x_{the} + U_{f} * Y_{0} + b_{f})**

After putting all values:

`f_1 = sigmoid([[-0.1, 0.2, -0.3, 0.1, -0.2], [-0.3, 0.1, 0.2, 0.1, 0.3], [0.2, 0.1, -0.2, -0.3, -0.1]] @ [1 0 0 0 0] + [[0.2, -0.1, 0.1], [-0.2, -0.1, -0.2], [0.1, -0.3, 0.2]] @ [0 0 0] + [0.1, -0.2, 0.3])`

`= sigmoid([0.0, -0.5, 0.5])`

`= [0.500, 0.378, 0.622]`

Multiplying by the one-hot vector for “the” simply selects the first column of **W_{f}**, and the **U_{f}** term vanishes because **Y_{0}** is a zero vector. All results in this walkthrough are rounded to three decimals, but each step uses the unrounded values from the previous one, so the last digit can differ slightly if you redo the arithmetic with the rounded numbers.

**Input Gate Calculation**

Equations of the input gate:

**i_{t} = sigmoid(W_{i} * x_{t} + U_{i} * Y_{t-1} + b_{i})**

**g_{t} = tanh(W_{g} * x_{t} + U_{g} * Y_{t-1} + b_{g})**

Similarly, let’s define our weight matrices and first calculate i_{t}. For time step = 1 we can write the equation like below:

**i_{1} = sigmoid(W_{i} * x_{the} + U_{i} * Y_{0} + b_{i})**

After putting all values:

`i_1 = sigmoid([[-0.4, 0.2, 0.1, -0.2, 0.3], [-0.3, 0.2, 0.1, -0.3, 0.2], [0.1, -0.3, -0.1, 0.3, 0.2]] @ [1 0 0 0 0] + [[0.2, -0.1, 0.1], [-0.2, -0.1, -0.2], [0.1, -0.3, 0.2]] @ [0 0 0] + [-0.1, -0.2, 0.3])`

`= sigmoid([-0.5, -0.5, 0.4])`

`= [0.378, 0.378, 0.599]`

Now let’s calculate the value of g_{t}. For time step = 1 the equation will look like below:

**g_{1} = tanh(W_{g} * x_{the} + U_{g} * Y_{0} + b_{g})**

After putting all values:

`g_1 = tanh([[0.1, 0.3, 0.2, 0.4, -0.2], [0.2, 0.4, -0.1, -0.2, 0.1], [-0.2, 0.1, -0.4, -0.1, 0.3]] @ [1 0 0 0 0] + [[-0.3, 0.2, -0.1], [0.2, -0.2, 0.1], [-0.1, -0.1, -0.3]] @ [0 0 0] + [0.1, -0.1, 0.2])`

`= tanh([0.2, 0.1, 0.0])`

`= [0.197, 0.100, 0.000]`

**Cell State Calculation**

As we know, the formula for the cell state is:

**C_{t} = f_{t} * C_{t-1} + i_{t} * g_{t}**

For time step = 1 we can write it like below:

**C_{1} = f_{1} * C_{0} + i_{1} * g_{1}**

In the previous steps we have already calculated the values of **f_{1}**, **i_{1}**, and **g_{1}**, and the initial cell state **C_{0}** is a zero vector of size 3. So putting the values in the equation will look like below:

`c_1 = f_1 * c_0 + i_1 * g_1`

`= [0.500, 0.378, 0.622] * [0, 0, 0] + [0.378, 0.378, 0.599] * [0.197, 0.100, 0.000]`

`= [0.075, 0.038, 0.000]`

**Output Gate Calculation**

Finally, we can calculate the values of the output gate. In this gate we need to calculate two values:

**o_{t} = sigmoid(W_{o} * x_{t} + U_{o} * Y_{t-1} + b_{o})**

**Y_{t} = o_{t} * tanh(C_{t})**

Before jumping into the calculation, let’s define our weight matrices in the same manner as for the other gates. Let’s first calculate o_{t}. For time step = 1 we can write the equation of o_{t} like below:

**o_{1} = sigmoid(W_{o} * x_{the} + U_{o} * Y_{0} + b_{o})**

Now let’s put all the values into the equation:

`o_1 = sigmoid([[0.3, 0.1, -0.2, 0.1, -0.4], [0.2, 0.1, -0.3, 0.2, -0.1], [0.1, -0.1, 0.2, 0.1, 0.3]] @ [1 0 0 0 0] + [[0.1, 0.1, -0.1], [0.2, -0.3, -0.2], [0.1, -0.1, 0.2]] @ [0 0 0] + [0.2, 0.3, 0.1])`

`= sigmoid([0.5, 0.5, 0.2])`

`= [0.622, 0.622, 0.550]`

Now finally we can calculate the hidden layer output (for time step = 1) of the LSTM neural network model. The equation for this output is:

**Y_{t} = o_{t} * tanh(C_{t})**

For time step = 1 we can write it like this:

**Y_{1} = o_{1} * tanh(C_{1})**

We already know the values of **o_{1}** and **C_{1}**. Let’s put those values into this equation and produce the output.

`Y_1 = o_1 * tanh(c_1)`

`= [0.622, 0.622, 0.550] * tanh([0.075, 0.038, 0.000])`

`= [0.622, 0.622, 0.550] * [0.074, 0.038, 0.000]`

`= [0.046, 0.023, 0.000]`

**Time step = 2**

At the second time step t=2, we feed the input vector for the word “cat” [0 1 0 0 0] into the LSTM neural network model, along with the previous hidden state **Y_{1}** and cell state **C_{1}**.

**Note:** The weight and bias matrices are the same at every time step within one iteration.

`f_2 = sigmoid([[-0.1, 0.2, -0.3, 0.1, -0.2], [-0.3, 0.1, 0.2, 0.1, 0.3], [0.2, 0.1, -0.2, -0.3, -0.1]] @ [0 1 0 0 0] + [[0.2, -0.1, 0.1], [-0.2, -0.1, -0.2], [0.1, -0.3, 0.2]] @ [0.046, 0.023, 0.000] + [0.1, -0.2, 0.3])`

`= sigmoid([0.307, -0.112, 0.398])`

`= [0.576, 0.472, 0.598]`

`i_2 = sigmoid([[-0.4, 0.2, 0.1, -0.2, 0.3], [-0.3, 0.2, 0.1, -0.3, 0.2], [0.1, -0.3, -0.1, 0.3, 0.2]] @ [0 1 0 0 0] + [[0.2, -0.1, 0.1], [-0.2, -0.1, -0.2], [0.1, -0.3, 0.2]] @ [0.046, 0.023, 0.000] + [-0.1, -0.2, 0.3])`

`= sigmoid([0.107, -0.012, -0.002])`

`= [0.527, 0.497, 0.499]`

`g_2 = tanh([[0.1, 0.3, 0.2, 0.4, -0.2], [0.2, 0.4, -0.1, -0.2, 0.1], [-0.2, 0.1, -0.4, -0.1, 0.3]] @ [0 1 0 0 0] + [[-0.3, 0.2, -0.1], [0.2, -0.2, 0.1], [-0.1, -0.1, -0.3]] @ [0.046, 0.023, 0.000] + [0.1, -0.1, 0.2])`

`= tanh([0.391, 0.305, 0.293])`

`= [0.372, 0.295, 0.285]`

`c_2 = f_2 * c_1 + i_2 * g_2`

`= [0.576, 0.472, 0.598] * [0.075, 0.038, 0.000] + [0.527, 0.497, 0.499] * [0.372, 0.295, 0.285]`

`= [0.239, 0.165, 0.142]`

`o_2 = sigmoid([[0.3, 0.1, -0.2, 0.1, -0.4], [0.2, 0.1, -0.3, 0.2, -0.1], [0.1, -0.1, 0.2, 0.1, 0.3]] @ [0 1 0 0 0] + [[0.1, 0.1, -0.1], [0.2, -0.3, -0.2], [0.1, -0.1, 0.2]] @ [0.046, 0.023, 0.000] + [0.2, 0.3, 0.1])`

`= sigmoid([0.307, 0.402, 0.002])`

`= [0.576, 0.599, 0.501]`

`Y_2 = o_2 * tanh(c_2)`

`= [0.576, 0.599, 0.501] * tanh([0.239, 0.165, 0.142])`

`= [0.576, 0.599, 0.501] * [0.234, 0.163, 0.141]`

`= [0.135, 0.098, 0.071]`

**Till time step = 5**

This goes on until the last input word vector, the second “the”. In our example, the input words are “the”, “cat”, “sat”, “on”, and “the”, so the total number of time steps is 5:

- “the” => time step = 1, input: x_{the}, Y_{0}, C_{0} (calculated values: f_{1}, i_{1}, g_{1}, C_{1}, o_{1}, and Y_{1})
- “cat” => time step = 2, input: x_{cat}, Y_{1}, C_{1} (calculated values: f_{2}, i_{2}, g_{2}, C_{2}, o_{2}, and Y_{2})
- “sat” => time step = 3, input: x_{sat}, Y_{2}, C_{2} (calculated values: f_{3}, i_{3}, g_{3}, C_{3}, o_{3}, and Y_{3})
- “on” => time step = 4, input: x_{on}, Y_{3}, C_{3} (calculated values: f_{4}, i_{4}, g_{4}, C_{4}, o_{4}, and Y_{4})
- “the” => time step = 5, input: x_{the}, Y_{4}, C_{4} (calculated values: f_{5}, i_{5}, g_{5}, C_{5}, o_{5}, and Y_{5})

**Next word Prediction**

We can now use the final hidden state **Y_{5}** to predict the next word in the sequence. This is typically done by applying a softmax activation function to the output of a linear layer that takes **Y_{5}** as input.

**p_{5} = softmax(W_{p} * Y_{5} + b_{p})**

For this we need a weight matrix **W_{p}** with random values. In the calculation below it is stored with shape (`output_size X vocabulary_size`) and multiplied from the right as `Y @ W_p`. This is the transpose of the (5, 3) layout listed with the other shapes, and both layouts describe the same linear layer.

For example, suppose we want to predict the next word after time step = 2.

For this, we have **Y_{2} = [0.135, 0.098, 0.071]**. The length of the **Y_{2}** vector is 3 and our **vocabulary size** is 5.

So the shape of **W_{p}** will be 3X5, and the shape of **b_{p}** will be 5X1. So now we have:

`Y_2 = [0.135, 0.098, 0.071]`

```
W_p = [[-0.007, 0.366, -0.155, -0.223, -0.110],
[ 0.126, -0.009, -0.130, 0.246, 0.155],
[-0.057, 0.170, -0.070, -0.212, -0.091]]
```

`b_p = [0.083, -0.139, 0.017, -0.022, 0.061]`

**Note:** Weight and bias matrix values are random values.

So now, let’s calculate the next word prediction probability after time step = 2.

**p_{2} = softmax(W_{p} * Y_{2} + b_{p})**

`= softmax([0.135, 0.098, 0.071] @ [[-0.007, 0.366, -0.155, -0.223, -0.110], [ 0.126, -0.009, -0.130, 0.246, 0.155], [-0.057, 0.170, -0.070, -0.212, -0.091]] + [0.083, -0.139, 0.017, -0.022, 0.061])`

`= softmax([0.090, -0.078, -0.022, -0.043, 0.055])`

`= [0.218, 0.184, 0.195, 0.191, 0.211]`

We can see that the **first element (0.218)** of our softmax output vector has the highest probability. So the predicted next word will be the first word in our vocabulary (vocabulary[0] with zero-based indexing).

Our vocabulary is: `["the", "cat", "sat", "on", "dog"]`. **So the next predicted word will be** **vocabulary[0] = “the”**.

With randomly initialized weights and no training, you cannot expect a meaningful prediction at this point. My goal was only to walk you through the entire path of a long short-term memory model for one forward propagation pass.
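The whole forward pass can also be run in code. The sketch below copies the gate matrices written out in the time step 1 calculations and reuses them at every step, following the note that weights are shared across time steps. It applies the output layer as `Y @ W_p + b_p` with `W_p` in the 3×5 layout from the prediction step. The helper names and the `history` list are assumptions made for this example. Running it is a handy way to cross-check the hand-worked numbers above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

vocabulary = ["the", "cat", "sat", "on", "dog"]

# Gate parameters as written in the time step 1 calculations of this tutorial.
W_f = np.array([[-0.1, 0.2, -0.3, 0.1, -0.2], [-0.3, 0.1, 0.2, 0.1, 0.3], [0.2, 0.1, -0.2, -0.3, -0.1]])
U_f = np.array([[0.2, -0.1, 0.1], [-0.2, -0.1, -0.2], [0.1, -0.3, 0.2]])
b_f = np.array([0.1, -0.2, 0.3])
W_i = np.array([[-0.4, 0.2, 0.1, -0.2, 0.3], [-0.3, 0.2, 0.1, -0.3, 0.2], [0.1, -0.3, -0.1, 0.3, 0.2]])
U_i = np.array([[0.2, -0.1, 0.1], [-0.2, -0.1, -0.2], [0.1, -0.3, 0.2]])
b_i = np.array([-0.1, -0.2, 0.3])
W_g = np.array([[0.1, 0.3, 0.2, 0.4, -0.2], [0.2, 0.4, -0.1, -0.2, 0.1], [-0.2, 0.1, -0.4, -0.1, 0.3]])
U_g = np.array([[-0.3, 0.2, -0.1], [0.2, -0.2, 0.1], [-0.1, -0.1, -0.3]])
b_g = np.array([0.1, -0.1, 0.2])
W_o = np.array([[0.3, 0.1, -0.2, 0.1, -0.4], [0.2, 0.1, -0.3, 0.2, -0.1], [0.1, -0.1, 0.2, 0.1, 0.3]])
U_o = np.array([[0.1, 0.1, -0.1], [0.2, -0.3, -0.2], [0.1, -0.1, 0.2]])
b_o = np.array([0.2, 0.3, 0.1])

# Output layer, stored as (hidden_size, vocabulary_size) so that logits = Y @ W_p + b_p.
W_p = np.array([[-0.007, 0.366, -0.155, -0.223, -0.110],
                [0.126, -0.009, -0.130, 0.246, 0.155],
                [-0.057, 0.170, -0.070, -0.212, -0.091]])
b_p = np.array([0.083, -0.139, 0.017, -0.022, 0.061])

def lstm_step(x, Y_prev, C_prev):
    f = sigmoid(W_f @ x + U_f @ Y_prev + b_f)
    i = sigmoid(W_i @ x + U_i @ Y_prev + b_i)
    g = np.tanh(W_g @ x + U_g @ Y_prev + b_g)
    o = sigmoid(W_o @ x + U_o @ Y_prev + b_o)
    C = f * C_prev + i * g
    Y = o * np.tanh(C)
    return Y, C

Y, C = np.zeros(3), np.zeros(3)   # Y_0 and C_0
history = []
for word in ["the", "cat"]:       # time steps 1 and 2
    x = np.eye(len(vocabulary))[vocabulary.index(word)]   # one-hot input
    Y, C = lstm_step(x, Y, C)
    history.append((Y.copy(), C.copy()))
    print(word, "Y =", np.round(Y, 3), "C =", np.round(C, 3))

p_2 = softmax(Y @ W_p + b_p)
print("p_2 =", np.round(p_2, 3), "->", vocabulary[int(np.argmax(p_2))])
```

Extending the list in the loop to all five input words gives **Y_{5}**, which can be passed through the same softmax layer.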

## Limitations of LSTM model

While LSTMs are a powerful tool for modeling sequential data, they do have some limitations. Here are some of the limitations of LSTMs:

**Computationally expensive**

For LSTMs to work well, they require a lot of data and computer resources, especially for large-scale applications. They are also computationally expensive to train.

**Limited interpretability**

Because LSTMs are black boxes, it is challenging to understand how and why particular decisions are made by the model.

**Overfitting**

When using smaller datasets, LSTMs might be vulnerable to overfitting. This means that the model may perform well on the training data but poorly on unseen data.

**Training time**

Due to their more complex architecture, LSTMs may take longer to train than simpler models. Every time step requires computing all four gate and candidate layers, and the time steps of a sequence have to be processed one after another.
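A quick back-of-the-envelope count shows where the extra cost comes from. The sketch below assumes one weight matrix for the input, one for the hidden state, and one bias vector per layer, as in the equations of this tutorial. Exact counts in a particular library can differ slightly, for example if it keeps separate bias vectors. The function names and the larger layer size are hypothetical.

```python
def simple_rnn_parameter_count(input_size, hidden_size):
    """Parameters of a single tanh layer: W (h x d), U (h x h), and b (h)."""
    return hidden_size * input_size + hidden_size * hidden_size + hidden_size

def lstm_parameter_count(input_size, hidden_size):
    """An LSTM has four such layers: forget, input, candidate, and output."""
    return 4 * simple_rnn_parameter_count(input_size, hidden_size)

print(lstm_parameter_count(5, 3))        # this tutorial's toy sizes
print(lstm_parameter_count(300, 512))    # a hypothetical larger layer
```

For the same input and hidden sizes, an LSTM layer therefore has four times as many parameters to compute and update as a basic RNN layer.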

## End Note

So in summary RNN is good for small sentences. But for long sentences where the dependency gap is big, RNN fails to perform because of vanishing gradient problem.

LSTM (long short-term memory) is designed to tackle this long-dependency issue. It uses several gates to manage its memory, so that it can remember important things for a long period of time.

In this tutorial, I explained the LSTM neural network model from scratch by calculating one iteration of LSTM forward propagation to predict the next word for a given sentence.

This is it for this tutorial. If you have any questions or suggestions regarding this post, please let me know in the comment section below.

Hi there, I’m Anindya Naskar, Data Science Engineer. I created this website to show you what I believe is the best possible way to get your start in the field of Data Science.