Understand LSTM Neural Network Model from Scratch

Have you ever seen the movies Memento or Ghajni? Our basic RNNs are like the heroes in those movies who suffer from short-term memory problems. LSTM is a special version of RNN (Recurrent Neural Network) model that solves this short-term memory issue. In this tutorial, we will understand Long Short Term Memory (LSTM) from scratch and calculate with real numbers.

Before we start learning about the LSTM neural network model, I recommend reading this tutorial: Recurrent Neural Network tutorial for Beginners. You should understand the RNN architecture before learning LSTM, because LSTM is a member of the RNN family.

Application of LSTM models

Nowadays LSTM is used for many sequence tasks, including NLP tasks such as language translation, sentiment analysis, and text summarization, as well as time series forecasting.

For example, Google Translate has used an LSTM-based neural machine translation architecture to translate text from one language to another.

LSTM models are used to improve the accuracy of speech recognition systems. For example, Apple’s Siri and Google Assistant use LSTM-based models to understand and respond to user’s voice commands.

LSTM models have also become popular for time series forecasting. You can use an LSTM to forecast stock prices, weather, energy demand, and other time-series data. For example, the Australian Energy Market Operator (AEMO) uses LSTM-based models to forecast electricity demand.

Why is LSTM better than RNN?

RNNs are interesting because they can use past knowledge to help with current tasks. For example, if you want to predict the next word in a sentence, an RNN can use the previous words to do it better. But RNNs have some limitations. Let’s explore those.

In theory, RNNs (Recurrent Neural Networks) can remember things from a long time ago. However, when it comes to practical usage, they fail to do this very well.

RNNs perform well on short sentences. On long sentences, however, they perform poorly because of the vanishing gradient problem: the training signal coming from distant words shrinks toward zero, so the network cannot learn to remember things for a long time.

For example, suppose we want to predict the final word in the sentence “the stars are in the ____”. In this case, the gap between the context word (“stars”) and the place where we need it (“sky”) is small.

In this kind of case, RNN will work properly.

Now let’s take another example. Suppose our goal is to predict the final word in this text: “Rome is a modern and cosmopolitan city that attracts millions of visitors. It is known for its delicious thin-crust _____”

In this case, the gap between the context word (“Rome”) and the place where we need it (“pizza”) is much longer.

advantage-of-long-short-term-memory-over-recurrent-neural-network-to-solve-gradient-discent-issue

As the gap increases, RNNs lose their ability to connect the dots and learn from distant context. In theory an RNN can handle such long dependencies, but in practice it struggles to learn them because of the vanishing gradient problem: the gradient shrinks a little at every time step it is propagated back through, so very little training signal reaches the distant words.
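To make the shrinking gradient concrete, here is a minimal sketch using a one-unit RNN, h_t = tanh(w * h_t-1 + x). The function name `gradient_through_time` and the values w = 0.9 and x = 0.5 are assumptions chosen only for illustration. The gradient of the last hidden state with respect to the first one is a product with one factor per time step, and every factor here is smaller than 1.

```python
import math

def gradient_through_time(steps, w=0.9, x=0.5):
    """Return |d h_T / d h_0| for a scalar RNN h_t = tanh(w * h_{t-1} + x)."""
    h = 0.0
    grad = 1.0
    for _ in range(steps):
        h = math.tanh(w * h + x)
        # Chain rule: each time step contributes the factor w * (1 - h_t^2).
        grad *= w * (1.0 - h * h)
    return abs(grad)

# Compare a short gap with longer ones.
for gap in (5, 20, 50):
    print(f"gap = {gap:2d}  gradient magnitude = {gradient_through_time(gap):.3e}")
```

The printed magnitude drops quickly as the gap grows, which is why a word far back in the sentence has almost no influence on what the network learns.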

LSTMs are designed specifically to tackle this long-term dependency problem. Compared with basic RNNs, they can retain information over much longer spans of a sequence.

Types of LSTM model

There are several types of LSTM models that differ in terms of their architecture and the way they handle input and output.

Vanilla LSTM

This is the most basic type of LSTM model that includes a memory cell, input gate, forget gate, and output gate. The memory cell stores information about the sequence and the gates control the flow of information in and out of the cell.

In this tutorial, I am going to explain this type of LSTM neural network model.

Bidirectional LSTM

In this type of LSTM model, the input sequence is processed in both forward and backward directions using two separate LSTM layers. This allows the model to learn from both past and future contexts and is useful for tasks such as speech recognition and language translation.

Stacked LSTM

This type of LSTM model includes multiple LSTM layers, where the output of one layer serves as the input for the next layer. Stacked LSTMs are useful for learning hierarchical representations of sequential data.

Convolutional LSTM

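In this type of LSTM model, convolution operations are combined with the LSTM. Either the input sequence is processed by convolutional layers before being fed into the LSTM layers (often called CNN-LSTM), or the matrix multiplications inside the gates are replaced by convolutions (ConvLSTM). This is useful for tasks such as video analysis and motion prediction.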

Attention LSTM

This LSTM model has something special called attention mechanism. It helps the model to concentrate on important parts of the input sequence based on what is needed for the task. This feature is very helpful for tasks like understanding people’s feelings (sentiment analysis) and describing pictures in words (image captioning).

LSTM Network Architecture

LSTMs or Long Short Term Memory networks are a type of Recurrent Neural Network or RNN that can learn and remember long-term dependencies. Hochreiter & Schmidhuber (1997) first introduced this deep learning algorithm, which gained popularity over time.

LSTMs are built to tackle the long-term dependency problem. Unlike basic RNNs, they can retain information over long stretches of a sequence.

In Recurrent Neural Networks, there is a repeating module of a neural network. But this module is quite simple, like a single tanh layer.

simple-recurrent-neural-network-architecture-diagram — **Recurrent Neural Network**

Here H is the hidden layer

LSTMs share the same chain-like structure as other Recurrent Neural Networks, but their repeating module has a unique structure. Rather than having just one neural network layer, LSTMs have four, and they interact in a highly specialized way.

lstm-neural-network-model-architecture-explained — **Long Short Term Memory (LSTM)**

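These four layers are the forget gate, the input gate, the cell (candidate) layer, and the output gate. The three gates use a sigmoid activation and control the flow of information, while the candidate layer uses tanh to propose new values for the cell state, which is the memory of the LSTM model.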

Gates of LSTM

Now let me explain the different gates, or layers, of the LSTM neural network model and what each one does.

1. Forget Gate

The forget gate is a neural network layer that decides what information to keep or discard from the cell state. It takes the previous hidden state (Y_t-1) and the current input (X_t) as input. A sigmoid function then produces an output value between 0 and 1 for each element of the cell state. A value of 0 means forget that part of the previous cell state completely, and a value of 1 means keep it completely.

Forget Gate Equation

f_t = sigmoid(W_f[x_t, Y_t-1] + b_f)

If we use separate weight matrices for the input and for the previous hidden state, the equation looks like this:

f_t = sigmoid(W_f * x_t + U_f * Y_t-1 + b_f)

In this equation,

W_f is a weight matrix that maps the input x_t to the forget gate values.
U_f is a weight matrix that maps the previous hidden state Y_t-1 to the forget gate values.
b_f is a bias vector that is added to the weighted sum of W_f * x_t and U_f * Y_t-1.
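The two forms of the forget gate equation give the same result when the concatenated matrix is simply W_f and U_f placed side by side. The sketch below checks this with NumPy. The sizes (input size 5, hidden size 3), the random values, and variable names such as `W_concat` and `y_prev` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
input_size, hidden_size = 5, 3

# Randomly chosen parameters and inputs, only for the comparison.
W_f = rng.normal(size=(hidden_size, input_size))
U_f = rng.normal(size=(hidden_size, hidden_size))
b_f = rng.normal(size=hidden_size)
x_t = rng.normal(size=input_size)
y_prev = rng.normal(size=hidden_size)  # plays the role of Y_t-1

# Separate matrices: f_t = sigmoid(W_f * x_t + U_f * Y_t-1 + b_f)
f_split = sigmoid(W_f @ x_t + U_f @ y_prev + b_f)

# Concatenated form: f_t = sigmoid(W[x_t, Y_t-1] + b_f) with W = [W_f | U_f]
W_concat = np.hstack([W_f, U_f])
f_concat = sigmoid(W_concat @ np.concatenate([x_t, y_prev]) + b_f)

print(np.allclose(f_split, f_concat))  # prints True
```

Every entry of the result lies strictly between 0 and 1, so it can be read as "how much of each cell state element to keep".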

Let’s take an example of a conversation between two friends, Ria and Priya.

Ria said, “I am going to the mall to buy some clothes.” Priya replied, “That’s great, I am working on my project at ____”

In this conversation, the forget gate will help us forget the previous context of Ria’s sentence and focus only on Priya’s response. The value of ‘f_t‘ can be between 0 and 1, depending on the amount of information we want to remember. If we need to forget, the value will be 0, and if complete information is needed, the value will be 1.

2. Input Gate

input-gate-of-long-short-term-memory-model — Input Gate

The input gate decides what new information we can add to the current cell state. First, it takes the current input X_t and the previous hidden state Y_t-1 and passes them through the second sigmoid function. The sigmoid function squashes the values into the range 0 to 1 and produces i_t.

After that, the same hidden state and current state info go through the tanh function. This helps regulate the network and creates a candidate cell vector (g_t) with all possible values between -1 and 1.

Finally, point-wise multiplication happens between i_t and g_t. The result holds the new information, scaled by how much of it should be stored in the cell state.

Here the sigmoid function decides which values we need to update, and the tanh function proposes the candidate values to write.

Input Gate Equation

i_t = sigmoid(W_i[x_t, Y_t-1] + b_i)

g_t = tanh(W_g[x_t, Y_t-1] + b_g)

Similarly, for different sizes of weight matrices, we can write,

i_t = sigmoid(W_i*x_t + U_i*Y_t-1 + b_i)

g_t = tanh(W_g*x_t + U_g*Y_t-1 + b_g)

Here,

W_i is a weight matrix that maps the input x_t to the input gate values.
U_i is a weight matrix that maps the previous hidden state Y_t-1 to the input gate values.
b_i is a bias vector that is added to the weighted sum of W_i * x_t and U_i * Y_t-1.
W_g is a weight matrix that maps the input x_t to the candidate values.
U_g is a weight matrix that maps the previous hidden state Y_t-1 to the candidate values.
b_g is a bias vector that is added to the weighted sum of W_g * x_t and U_g * Y_t-1.

During training, the values of W_i, U_i, b_i, W_g, U_g, and b_g are learned through backpropagation and gradient descent, so that the model learns to selectively add new information to the cell state based on the current input and the previous hidden state.
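The input gate equations can be sketched in a few lines of NumPy. The parameter values below are random and untrained, and the sizes (input size 5, hidden size 3) and names such as `update` and `y_prev` are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
input_size, hidden_size = 5, 3

# Untrained, randomly chosen parameters.
W_i = rng.uniform(-0.5, 0.5, size=(hidden_size, input_size))
U_i = rng.uniform(-0.5, 0.5, size=(hidden_size, hidden_size))
b_i = rng.uniform(-0.5, 0.5, size=hidden_size)
W_g = rng.uniform(-0.5, 0.5, size=(hidden_size, input_size))
U_g = rng.uniform(-0.5, 0.5, size=(hidden_size, hidden_size))
b_g = rng.uniform(-0.5, 0.5, size=hidden_size)

x_t = np.array([0.0, 1.0, 0.0, 0.0, 0.0])  # a one-hot input word
y_prev = np.array([0.1, -0.2, 0.3])         # plays the role of Y_t-1

i_t = sigmoid(W_i @ x_t + U_i @ y_prev + b_i)  # how much to write, in (0, 1)
g_t = np.tanh(W_g @ x_t + U_g @ y_prev + b_g)  # what to write, in (-1, 1)
update = i_t * g_t                             # point-wise product added to the cell state

print("i_t   :", np.round(i_t, 3))
print("g_t   :", np.round(g_t, 3))
print("update:", np.round(update, 3))
```

Because i_t is always below 1, each element of `update` is smaller in magnitude than the corresponding candidate value in g_t.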

For example, consider the sentence: “I love to eat pizza, but I am allergic to ____”

To predict the next word, an LSTM needs to selectively forget and remember information from the previous words. The input gate is responsible for deciding what new information to let into the cell state.

In this example, the input gate would determine how much weight to give to the word “pizza” and how much weight to give to the word that follows “allergic to“.

If the next word is “tomatoes” for instance, the input gate would need to remember information about “tomatoes” without overwriting the information about “pizza” that was previously stored in the cell state.

3. Cell State

cell-gate-layer-of-lstm-neural-network-model — Cell Gate

After calculating the forget gate and the input gate, we need to update the cell state of the LSTM neural network model. The cell state acts like a conveyor belt running throughout the network. It is crucial because it lets information flow along the network with only small, gate-controlled changes.

Cell State is also known as the memory of LSTM.

The cell state is where the decisions of the gates are applied. First, the previous cell state C_t-1 is multiplied point-wise by the forget vector f_t. Where a value of f_t is close to 0, the corresponding information is treated as less important and is dropped from the cell state.

After that, the output of the input gate (i_t * g_t) is added point-wise, which updates the cell state and creates the new cell state (C_t).

Cell Gate Equation

C_t = f_t * C_t-1 + i_t * g_t
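A tiny numeric sketch of this update is shown below. The gate values are picked by hand (they are assumptions, not outputs of a trained model) so that each element shows one behaviour: keep, replace, or blend.

```python
import numpy as np

C_prev = np.array([2.0, -1.0, 4.0])  # previous cell state C_t-1
f_t = np.array([1.0, 0.0, 0.5])      # forget gate: keep all, drop all, keep half
i_t = np.array([0.0, 1.0, 0.5])      # input gate: write nothing, write all, write half
g_t = np.array([0.9, -0.5, 0.2])     # candidate values

# C_t = f_t * C_t-1 + i_t * g_t (all operations are point-wise)
C_t = f_t * C_prev + i_t * g_t
print(C_t)  # values: 2.0, -0.5, 2.1
```

The first element is carried over unchanged, the second is fully replaced by the candidate value, and the third is a mix of the old memory and the new candidate.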

LSTMs have three gates, which are used to control and protect the cell state. The last one is output gate.

4. Output Gate

output-gate-or-layer-of-long-short-term-memory-model-of-nlp — Output Gate

The output gate is the third and final gate in the LSTM architecture. It generates the output (Y_t) of the hidden layer.

First, we pass the current input X_t and the previous hidden state Y_t-1 into the third sigmoid function to produce o_t. Then we apply the tanh function to the new cell state C_t. Finally, we apply point-wise multiplication to those two results to produce the final output Y_t.

Output Gate Equation

o_t = sigmoid(W_o * x_t + U_o * Y_t-1 + b_o)

Y_t = o_t * tanh(C_t)
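The output gate step can be sketched as follows. The parameters are random and untrained, and the input, previous hidden state, and cell state values are assumptions chosen only to have something to compute with.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
input_size, hidden_size = 5, 3

# Untrained, randomly chosen output gate parameters.
W_o = rng.uniform(-0.5, 0.5, size=(hidden_size, input_size))
U_o = rng.uniform(-0.5, 0.5, size=(hidden_size, hidden_size))
b_o = rng.uniform(-0.5, 0.5, size=hidden_size)

x_t = np.array([0.0, 1.0, 0.0, 0.0, 0.0])  # a one-hot input word
y_prev = np.array([0.1, -0.2, 0.3])         # plays the role of Y_t-1
C_t = np.array([2.0, -0.5, 2.1])            # an already updated cell state

o_t = sigmoid(W_o @ x_t + U_o @ y_prev + b_o)  # how much of the memory to expose
Y_t = o_t * np.tanh(C_t)                       # new hidden state / output

print("o_t:", np.round(o_t, 3))
print("Y_t:", np.round(Y_t, 3))
```

Note that tanh squashes the cell state into the range -1 to 1 before the gate scales it, so the hidden state always stays inside that range even when the cell state itself grows larger.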

Forward Propagation of LSTM from Scratch

Now let’s calculate forward propagation of LSTM to clear up our understanding. Let’s walk through a single iteration of an LSTM for next word prediction using the example sentence “The cat sat on the ____“.

next-word-prediction-architecture-of-long-short-term-memory-model-lstm-neural-network-in-short

Input Encoding

Like other NLP models, we first encode the input sequence “The cat sat on the” as a sequence of word vectors. You can use any technique such as one-hot encoding or word embeddings.

For simplicity, let’s assume we are using one-hot encoding, and we have a vocabulary size of 5 words “the”, “cat”, “sat”, “on”, and “dog”. The one-hot encoded input sequence would look like below:

[1 0 0 0 0] # “the”

[0 1 0 0 0] # “cat”

[0 0 1 0 0] # “sat”

[0 0 0 1 0] # “on”

Whatever encoding technique we use, the vocabulary contains only unique words. This is why “the” appears only once in the list above: its second occurrence at the end of the sentence reuses the same vector [1 0 0 0 0].
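A minimal sketch of this one-hot encoding in NumPy is shown below. The names `word_to_index` and `one_hot` are chosen for illustration; the vocabulary and sentence are the ones used in this tutorial.

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "dog"]
word_to_index = {word: idx for idx, word in enumerate(vocab)}

sentence = ["the", "cat", "sat", "on", "the"]

# Row k of the identity matrix is the one-hot vector for vocabulary index k.
one_hot = np.eye(len(vocab), dtype=int)[[word_to_index[w] for w in sentence]]

for word, vector in zip(sentence, one_hot):
    print(vector, "#", word)
```

The first and last rows are identical because both encode the word "the".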

We’ll assume that our LSTM has a hidden state size of 3, so our weight matrices and biases have the following shapes:

W_f, U_f, b_f: (3, 5), (3, 3), (3,)
W_i, U_i, b_i: (3, 5), (3, 3), (3,)
W_o, U_o, b_o: (3, 5), (3, 3), (3,)
W_g, U_g, b_g: (3, 5), (3, 3), (3,)
W_p, b_p: (5, 3), (5,)

We’ll also assume that our initial hidden state Y₀ and cell state C₀ are both zero vectors of size 3.
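As a rough sketch, parameters with these shapes could be created as shown below. The helper name `init_params`, the uniform range, and the seed are assumptions for illustration; the tutorial itself uses hand-picked values.

```python
import numpy as np

vocab_size, hidden_size = 5, 3
rng = np.random.default_rng(42)

def init_params(vocab_size, hidden_size, rng, scale=0.5):
    """Create randomly initialized LSTM parameters with the shapes listed above."""
    params = {}
    for gate in ("f", "i", "o", "g"):
        params[f"W_{gate}"] = rng.uniform(-scale, scale, (hidden_size, vocab_size))
        params[f"U_{gate}"] = rng.uniform(-scale, scale, (hidden_size, hidden_size))
        params[f"b_{gate}"] = rng.uniform(-scale, scale, hidden_size)
    # Prediction layer: maps a hidden state of size 3 to 5 vocabulary scores.
    params["W_p"] = rng.uniform(-scale, scale, (vocab_size, hidden_size))
    params["b_p"] = rng.uniform(-scale, scale, vocab_size)
    return params

params = init_params(vocab_size, hidden_size, rng)
Y0 = np.zeros(hidden_size)  # initial hidden state
C0 = np.zeros(hidden_size)  # initial cell state

for name, value in params.items():
    print(name, value.shape)
```

Printing the shapes is a quick way to confirm that every matrix product in the gate equations is well defined before doing any arithmetic.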

Time step = 1

At the first time step t=1, we feed the input word vector of “the” [1 0 0 0 0] into the LSTM, along with the previous hidden state Y₀ and cell state C₀:

Forget Gate Calculation

We have randomly initialized the weight matrix W_f as a 3×5 matrix, the bias b_f as a vector of size 3, and U_f as a 3×3 matrix. The first hidden state Y₀ is initialized as a zero vector of size 3.

initialize-weight-matrices-for-forget-gate-of-lstm-model

We already know the equation of the forget gate:

f_t = sigmoid(W_f * x_t + U_f * Y_t-1 + b_f)

For time step = 1, the equation becomes:

f₁ = sigmoid(W_f * x_the + U_f * Y₀ + b_f)

After putting all values:

f_1 = sigmoid([[-0.1, 0.2, -0.3, 0.1, -0.2], [-0.3, 0.1, 0.2, 0.1, 0.3], [0.2, 0.1, -0.2, -0.3, -0.1]] @ [1 0 0 0 0] + [[0.2, -0.1, 0.1], [-0.2, -0.1, -0.2], [0.1, -0.3, 0.2]] @ [0 0 0] + [0.1, -0.2, 0.3])

= sigmoid([-0.1, -0.3, 0.2] + [0, 0, 0] + [0.1, -0.2, 0.3])

= sigmoid([0.0, -0.5, 0.5])

= [0.500, 0.378, 0.622]

Note: Multiplying a matrix by a one-hot vector simply picks one column of that matrix (here the first column of W_f). All values are shown rounded to three decimals, but each step carries the unrounded values forward, so the last digit can differ slightly if you recompute from the rounded numbers.

Input Gate Calculation

Equation of input gate:

i_t = sigmoid(W_i*x_t + U_i*Y_t-1 + b_i)

g_t = tanh(W_g*x_t + U_g*Y_t-1 + b_g)

Similarly, let’s define our weight matrices.

first-weight-matrices-for-input-gate-of-lstm-model — **Weight matrices for i_t**

second-weight-matrices-for-input-gate-of-long-sort-term-memory-model — **Weight matrices for g_t**

Let’s first calculate i_t:

i_t = sigmoid(W_i*x_t + U_i*Y_t-1 + b_i)

For time step = 1, we can write the equation like below:

i₁ = sigmoid(W_i*x_the + U_i*Y₀ + b_i)

After putting all values:

i_1 = sigmoid([[-0.4, 0.2, 0.1, -0.2, 0.3], [-0.3, 0.2, 0.1, -0.3, 0.2], [0.1, -0.3, -0.1, 0.3, 0.2]] @ [1 0 0 0 0] + [[0.2, -0.1, 0.1], [-0.2, -0.1, -0.2], [0.1, -0.3, 0.2]] @ [0 0 0] + [-0.1, -0.2, 0.3])

= sigmoid([-0.5, -0.5, 0.4])

= [0.378, 0.378, 0.599]

Now let’s calculate the value of g_t:

g_t = tanh(W_g*x_t + U_g*Y_t-1 + b_g)

For time step = 1, the equation will look like below:

g₁ = tanh(W_g*x_the + U_g*Y₀ + b_g)

After putting all values:

g_1 = tanh([[0.1, 0.3, 0.2, 0.4, -0.2], [0.2, 0.4, -0.1, -0.2, 0.1], [-0.2, 0.1, -0.4, -0.1, 0.3]] @ [1 0 0 0 0] + [[-0.3, 0.2, -0.1], [0.2, -0.2, 0.1], [-0.1, -0.1, -0.3]] @ [0 0 0] + [0.1, -0.1, 0.2])

= tanh([0.2, 0.1, 0.0])

= [0.197, 0.100, 0.000]

Cell Gate Calculation

As we know, the formula for the cell gate is:

C_t = f_t * C_t-1 + i_t * g_t

For time step = 1, we can write it like below:

C₁ = f₁ * C₀ + i₁ * g₁

In the previous steps we have already calculated the values of f₁, i₁, and g₁. The initial hidden state Y₀ and cell state C₀ are both zero vectors of size 3.

So putting the values into the equation looks like below:

c_1 = f_1 * c_0 + i_1 * g_1

= [0.500, 0.378, 0.622] * [0, 0, 0] + [0.378, 0.378, 0.599] * [0.197, 0.100, 0.000]

= [0.075, 0.038, 0.000]

Output Gate Calculation

Finally, we can calculate the values of the output gate. In this gate we need to calculate two values:

o_t = sigmoid(W_o * x_t + U_o * Y_t-1 + b_o)

Y_t = o_t * tanh(c_t)

Before jumping into the calculation, let’s define our weight matrices in the same manner as for the other gates.

output-gate-weight-matrices-for-long-short-term-memory-model — **Output gate weight matrices**

Let’s first calculate o_t. For time step = 1, we can write the equation of o_t like below:

o₁ = sigmoid(W_o * x_the + U_o * Y₀ + b_o)

Now let’s put all the values into the equation:

o_1 = sigmoid([[0.3, 0.1, -0.2, 0.1, -0.4], [0.2, 0.1, -0.3, 0.2, -0.1], [0.1, -0.1, 0.2, 0.1, 0.3]] @ [1 0 0 0 0] + [[0.1, 0.1, -0.1], [0.2, -0.3, -0.2], [0.1, -0.1, 0.2]] @ [0 0 0] + [0.2, 0.3, 0.1])

= sigmoid([0.5, 0.5, 0.2])

= [0.622, 0.622, 0.550]

Now we can finally calculate the hidden layer output (for time = 1) of the LSTM neural network model. The equation for this output is:

Y_t = o_t * tanh(c_t)

For time = 1, we can write it like this:

Y₁ = o₁ * tanh(c₁)

We already know the values of o₁ and c₁. Let’s put those values into this equation and produce the output.

Y₁ = o₁ * tanh(c₁)

= [0.622, 0.622, 0.550] * tanh([0.075, 0.038, 0.000])

= [0.622, 0.622, 0.550] * [0.074, 0.038, 0.000]

= [0.046, 0.023, 0.000]

Time step = 2

At the second time step t=2, we feed the input vector for the word “cat” [0 1 0 0 0] into the LSTM neural network model, along with the previous hidden state Y₁ = [0.046, 0.023, 0.000] and cell state c₁ = [0.075, 0.038, 0.000].

Note: The weight and bias matrices are the same at every time step within one iteration.

f_2 = sigmoid([[-0.1, 0.2, -0.3, 0.1, -0.2], [-0.3, 0.1, 0.2, 0.1, 0.3], [0.2, 0.1, -0.2, -0.3, -0.1]] @ [0 1 0 0 0] + [[0.2, -0.1, 0.1], [-0.2, -0.1, -0.2], [0.1, -0.3, 0.2]] @ [0.046, 0.023, 0.000] + [0.1, -0.2, 0.3])

= sigmoid([0.307, -0.112, 0.398])

= [0.576, 0.472, 0.598]

i_2 = sigmoid([[-0.4, 0.2, 0.1, -0.2, 0.3], [-0.3, 0.2, 0.1, -0.3, 0.2], [0.1, -0.3, -0.1, 0.3, 0.2]] @ [0 1 0 0 0] + [[0.2, -0.1, 0.1], [-0.2, -0.1, -0.2], [0.1, -0.3, 0.2]] @ [0.046, 0.023, 0.000] + [-0.1, -0.2, 0.3])

= sigmoid([0.107, -0.012, -0.002])

= [0.527, 0.497, 0.499]

g_2 = tanh([[0.1, 0.3, 0.2, 0.4, -0.2], [0.2, 0.4, -0.1, -0.2, 0.1], [-0.2, 0.1, -0.4, -0.1, 0.3]] @ [0 1 0 0 0] + [[-0.3, 0.2, -0.1], [0.2, -0.2, 0.1], [-0.1, -0.1, -0.3]] @ [0.046, 0.023, 0.000] + [0.1, -0.1, 0.2])

= tanh([0.391, 0.305, 0.293])

= [0.372, 0.295, 0.285]

c_2 = f_2 * c_1 + i_2 * g_2

= [0.576, 0.472, 0.598] * [0.075, 0.038, 0.000] + [0.527, 0.497, 0.499] * [0.372, 0.295, 0.285]

= [0.239, 0.165, 0.142]

o_2 = sigmoid([[0.3, 0.1, -0.2, 0.1, -0.4], [0.2, 0.1, -0.3, 0.2, -0.1], [0.1, -0.1, 0.2, 0.1, 0.3]] @ [0 1 0 0 0] + [[0.1, 0.1, -0.1], [0.2, -0.3, -0.2], [0.1, -0.1, 0.2]] @ [0.046, 0.023, 0.000] + [0.2, 0.3, 0.1])

= sigmoid([0.307, 0.402, 0.002])

= [0.576, 0.599, 0.501]

Y_2 = o_2 * tanh(c_2)

= [0.576, 0.599, 0.501] * tanh([0.239, 0.165, 0.142])

= [0.576, 0.599, 0.501] * [0.234, 0.163, 0.141]

= [0.135, 0.098, 0.071]

Till time step = 5

This will go on till the last input word vector, for the word “the”. In our example, the input words are “the”, “cat”, “sat”, “on”, and “the”.

So the total number of time steps is 5.

“the” => time step = 1 input: x_the, y₀, c₀ , (calculated values: f₁, i₁, g₁, c₁, o₁, and y₁)
“cat” => time step = 2 input: x_cat, y₁, c₁ , (calculated values: f₂, i₂, g₂, c₂, o₂, and y₂)
“sat” => time step = 3 input: x_sat, y₂, c₂ , (calculated values: f₃, i₃, g₃, c₃, o₃, and y₃)
“on” => time step = 4 input: x_on, y₃, c₃ , (calculated values: f₄, i₄, g₄, c₄, o₄, and y₄)
“the” => time step = 5 input: x_the, y₄, c₄ , (calculated values: f₅, i₅, g₅, c₅, o₅, and y₅)

Next word Prediction

We can now use the output hidden state Y₅ (the final output) to predict the next word in the sequence. This is typically done by applying a softmax activation function to the output of a linear layer that takes Y₅ as input.

p_5 = softmax(W_p * Y₅ + b_p)

Now we need the W_p weight matrix, filled with random values. In the numbers below it is written with shape (output_size X vocabulary_size), so the product is computed as Y @ W_p. This is the transpose of the (5, 3) shape listed earlier and gives the same result as W_p * Y with that shape.

For example, say we want to predict the next word after time step = 2.

For this, we have Y₂ = [0.135, 0.098, 0.071]. The length of the Y₂ vector = 3 and our vocabulary size = 5.

So the shape of W_p will be 3X5. And the shape of b_p will be 5X1.

So now we have:

Y_2 = [0.135, 0.098, 0.071]

W_p = [[-0.007, 0.366, -0.155, -0.223, -0.110],
[ 0.126, -0.009, -0.130, 0.246, 0.155],
[-0.057, 0.170, -0.070, -0.212, -0.091]]

b_p = [0.083, -0.139, 0.017, -0.022, 0.061]

Note: Weight and bias matrix values are random values.

So now, let’s calculate the next word prediction probability after time step = 2.

p_2 = softmax(Y₂ @ W_p + b_p)

= softmax([0.135, 0.098, 0.071] @ [[-0.007, 0.366, -0.155, -0.223, -0.110], [ 0.126, -0.009, -0.130, 0.246, 0.155], [-0.057, 0.170, -0.070, -0.212, -0.091]] + [0.083, -0.139, 0.017, -0.022, 0.061])

= softmax([0.090, -0.078, -0.022, -0.043, 0.055])

= [0.218, 0.184, 0.195, 0.191, 0.211]

We can see that the first element (0.218) of our softmax output vector has the highest probability. So the predicted next word is the first word in our vocabulary (index 0).

Now our vocabulary is: ["the", "cat", "sat", "on", "dog"]. So the next predicted word will be vocabulary[0] = “the”.

With randomly chosen weights and only one forward iteration, you cannot expect a sensible prediction. The weights only become useful after training. My goal here was to walk you through the entire path of a long short term memory model for a single forward propagation iteration.
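The whole walk-through can be reproduced with a short NumPy sketch, which is also a convenient way to check the hand arithmetic. The weight values are the ones listed in the time step 1 calculations above and are reused unchanged at every time step. The function names `lstm_step` and `softmax` and the `history` list are assumptions chosen for illustration, and W_p is used in its 3X5 layout, so the scores are computed as y @ W_p + b_p.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

vocab = ["the", "cat", "sat", "on", "dog"]

W_f = np.array([[-0.1, 0.2, -0.3, 0.1, -0.2], [-0.3, 0.1, 0.2, 0.1, 0.3], [0.2, 0.1, -0.2, -0.3, -0.1]])
U_f = np.array([[0.2, -0.1, 0.1], [-0.2, -0.1, -0.2], [0.1, -0.3, 0.2]])
b_f = np.array([0.1, -0.2, 0.3])

W_i = np.array([[-0.4, 0.2, 0.1, -0.2, 0.3], [-0.3, 0.2, 0.1, -0.3, 0.2], [0.1, -0.3, -0.1, 0.3, 0.2]])
U_i = np.array([[0.2, -0.1, 0.1], [-0.2, -0.1, -0.2], [0.1, -0.3, 0.2]])
b_i = np.array([-0.1, -0.2, 0.3])

W_g = np.array([[0.1, 0.3, 0.2, 0.4, -0.2], [0.2, 0.4, -0.1, -0.2, 0.1], [-0.2, 0.1, -0.4, -0.1, 0.3]])
U_g = np.array([[-0.3, 0.2, -0.1], [0.2, -0.2, 0.1], [-0.1, -0.1, -0.3]])
b_g = np.array([0.1, -0.1, 0.2])

W_o = np.array([[0.3, 0.1, -0.2, 0.1, -0.4], [0.2, 0.1, -0.3, 0.2, -0.1], [0.1, -0.1, 0.2, 0.1, 0.3]])
U_o = np.array([[0.1, 0.1, -0.1], [0.2, -0.3, -0.2], [0.1, -0.1, 0.2]])
b_o = np.array([0.2, 0.3, 0.1])

W_p = np.array([[-0.007, 0.366, -0.155, -0.223, -0.110],
                [0.126, -0.009, -0.130, 0.246, 0.155],
                [-0.057, 0.170, -0.070, -0.212, -0.091]])
b_p = np.array([0.083, -0.139, 0.017, -0.022, 0.061])

def lstm_step(x, y_prev, c_prev):
    """One forward step: returns the new hidden state and the new cell state."""
    f = sigmoid(W_f @ x + U_f @ y_prev + b_f)   # forget gate
    i = sigmoid(W_i @ x + U_i @ y_prev + b_i)   # input gate
    g = np.tanh(W_g @ x + U_g @ y_prev + b_g)   # candidate values
    c = f * c_prev + i * g                      # cell state update
    o = sigmoid(W_o @ x + U_o @ y_prev + b_o)   # output gate
    y = o * np.tanh(c)                          # hidden state
    return y, c

y, c = np.zeros(3), np.zeros(3)  # Y0 and C0
history = []
for word in ["the", "cat", "sat", "on", "the"]:
    x = np.eye(len(vocab))[vocab.index(word)]  # one-hot input vector
    y, c = lstm_step(x, y, c)
    p = softmax(y @ W_p + b_p)                 # next-word probabilities
    history.append((word, y.copy(), c.copy(), p))
    print(word, "Y =", np.round(y, 3), "-> predicted:", vocab[int(np.argmax(p))])
```

Each loop iteration feeds the hidden state and cell state from the previous word back into `lstm_step`, which is exactly the chain described in the "Till time step = 5" section.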

Limitations of LSTM model

While LSTMs are a powerful tool for modeling sequential data, they do have some limitations. Here are some of the limitations of LSTMs:

Computationally expensive

For LSTMs to work well, they require a lot of data and computer resources, especially for large-scale applications. They are also computationally expensive to train.

Limited interpretability

Because LSTMs are black boxes, it is challenging to understand how and why particular decisions are made by the model.

Overfitting

When using smaller datasets, LSTMs can be vulnerable to overfitting. This means the model may perform well on the training data but poorly on unseen data.

Training time

Due to their complex architecture, LSTMs may require more time to train than simpler models. This is because all four layers (the three gates and the candidate layer) must be calculated at every time step, and the time steps have to be processed one after another.

End Note

So in summary RNN is good for small sentences. But for long sentences where the dependency gap is big, RNN fails to perform because of vanishing gradient problem.

LSTM (long short-term memory) is designed to tackle this long-dependency issue. It uses a cell state as its memory, and several gates that control it, to remember important things for a long period of time.

In this tutorial, I explained the LSTM neural network model from scratch by calculating one iteration of LSTM forward propagation to predict the next word for a given sentence.

This is it for this tutorial. If you have any questions or suggestions regarding this post, please let me know in the comment section below.

Anindya

Hi there, I’m Anindya Naskar, Data Science Engineer. I created this website to show you what I believe is the best possible way to get your start in the field of Data Science.
