Understand LSTM Neural Network Model from Scratch


Have you ever seen the movies Memento or Ghajini? Basic RNNs are like the heroes in those movies, who suffer from short-term memory loss. LSTM is a special kind of RNN (Recurrent Neural Network) that addresses this short-term memory issue. In this tutorial, we will understand Long Short Term Memory (LSTM) from scratch and work through the calculations with real numbers.

Before we start learning about the LSTM neural network model, I recommend reading this tutorial: Recurrent Neural Network tutorial for Beginners. You should understand the RNN architecture before learning LSTM, because LSTM is a member of the RNN family.

Application of LSTM models

LSTMs are used for many sequence tasks. These include NLP tasks such as language translation, sentiment analysis, and text summarization, as well as time series forecasting.

For example, Google Translate’s neural machine translation system (GNMT), introduced in 2016, was built on an LSTM-based architecture to translate text from one language to another.

LSTM models are also used to improve the accuracy of speech recognition systems. For example, voice assistants such as Apple’s Siri and Google Assistant have used LSTM-based models to understand and respond to users’ voice commands.

LSTMs have become popular for time series forecasting as well. You can use an LSTM to predict stock prices, weather, energy demand, and other time-series data. For example, the Australian Energy Market Operator (AEMO) uses LSTM-based models to forecast electricity demand.

Why is LSTM better than RNN?

RNNs are interesting because they can use past knowledge to help with current tasks. For example, if you want to predict the next word in a sentence, an RNN can use the previous words to do it better. But RNNs have some limitations, so let’s explore those.

In theory, RNNs (Recurrent Neural Networks) can remember things from a long time ago. However, when it comes to practical usage, they fail to do this very well.

RNNs perform well on short sentences. On long sentences, however, they struggle because of the vanishing gradient problem: an RNN cannot retain information across many time steps.

For example, suppose we want to predict the final word in the sentence “the stars are in the ____”. Here the gap between the context word (“stars”) and the place where we need it (“sky”) is small.

In this kind of case, RNN will work properly.


Now let’s take another example. Suppose our goal is to predict the final word in this passage: “Rome is a modern and cosmopolitan city that attracts millions of visitors. It is known for its delicious thin-crust _____”

In this case, the gap between the context word (“Rome”) and the place where we need it (“pizza”) is much longer.


As the gap increases, RNNs lose their ability to connect the dots. In theory an RNN can handle such long dependencies, but in practice it struggles to learn them because of the vanishing gradient problem: the gradient signal shrinks as it is propagated back through many time steps.
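To get a feel for how quickly the gradient can shrink, here is a minimal sketch for a scalar RNN of the form h_t = tanh(w * h_(t-1) + x_t). The recurrent weight (0.9), the constant pre-activation (0.5), and the function name are assumptions I picked for illustration; they are not values from a trained model.

```python
import numpy as np

def gradient_through_time(w, pre_activations):
    """Running product of d h_t / d h_(t-1) for h_t = tanh(w * h_(t-1) + x_t)."""
    grad = 1.0
    history = []
    for a in pre_activations:
        # Chain rule: d tanh(a) / d h_(t-1) = w * (1 - tanh(a)^2)
        grad *= w * (1.0 - np.tanh(a) ** 2)
        history.append(grad)
    return history

# Hypothetical setup: 30 time steps, each with pre-activation 0.5
history = gradient_through_time(w=0.9, pre_activations=[0.5] * 30)
for step in (1, 5, 10, 30):
    print(f"gradient after {step:2d} steps: {history[step - 1]:.2e}")
```

Each step multiplies the gradient by a factor smaller than 1 here, so the contribution of an early word to the loss fades roughly exponentially with the gap.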

LSTMs are designed specifically to tackle this long-term dependency problem. Unlike basic RNNs, they can retain information over much longer spans.

Types of LSTM model

There are several types of LSTM models that differ in terms of their architecture and the way they handle input and output.

Vanilla LSTM

This is the most basic type of LSTM model that includes a memory cell, input gate, forget gate, and output gate. The memory cell stores information about the sequence and the gates control the flow of information in and out of the cell.

In this tutorial, I am going to explain this type of LSTM neural network model.

Bidirectional LSTM

In this type of LSTM model, the input sequence is processed in both forward and backward directions using two separate LSTM layers. This allows the model to learn from both past and future contexts and is useful for tasks such as speech recognition and language translation.

Stacked LSTM

This type of LSTM model includes multiple LSTM layers, where the output of one layer serves as the input for the next layer. Stacked LSTMs are useful for learning hierarchical representations of sequential data.

Convolutional LSTM

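In a convolutional LSTM (ConvLSTM), the matrix multiplications inside the gates are replaced by convolutions, so the model can capture spatial structure as well as temporal structure. A closely related design, often called CNN-LSTM, processes each input with convolutional layers before feeding it into the LSTM layers. Both are useful for tasks such as video analysis and motion prediction.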

Attention LSTM

This LSTM model adds an attention mechanism, which helps the model concentrate on the parts of the input sequence that matter most for the task. This is very helpful for tasks like sentiment analysis and image captioning (describing pictures in words).

LSTM Network Architecture

LSTMs or Long Short Term Memory networks are a type of Recurrent Neural Network or RNN that can learn and remember long-term dependencies. Hochreiter & Schmidhuber (1997) first introduced this deep learning algorithm, which gained popularity over time.

LSTMs are created to tackle the long-term dependency problem. Unlike basic RNNs, they can retain information over long periods of time.

In Recurrent Neural Networks, there is a repeating module of a neural network. But this module is quite simple, like a single tanh layer.

Recurrent Neural Network

Here H is the hidden layer

LSTMs share the same chain-like structure as other Recurrent Neural Networks, but their repeating module has a unique structure. Rather than having just one neural network layer, LSTMs have four, and they interact in a highly specialized way.

Long Short Term Memory (LSTM)

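These four layers are the forget gate, the input gate, the cell (candidate) layer, and the output gate. Strictly speaking, three of them are gates that control the flow of information; the cell layer proposes the candidate values that can be written into the LSTM’s memory.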

4 Gates of LSTM neural network model

Gates of LSTM

Now let me explain each gate or layer of the LSTM neural network model and what it is used for.


1. Forget Gate

Forget Gate

The forget gate is a neural network layer that decides what information to keep or discard from the cell state. It takes the previous hidden state (Yt-1) and the current input (Xt) as input. A sigmoid function then produces output values between 0 and 1. A value of 0 means forget the previous state, and a value of 1 means keep it.

Forget Gate Equation

ft = sigmoid(Wf[xt, Yt-1] + bf)

If we use separate weight matrices for the input and for the previous hidden state instead of one matrix applied to their concatenation, the equation looks like this:

ft = sigmoid(Wf * xt + Uf * Yt-1 + bf)

In this equation,

  • Wf is a weight matrix that maps the input sequence xt to the forget gate values
  • Uf is a weight matrix that maps the previous hidden state Yt-1 to the forget gate values.
  • bf is a bias vector that is added to the weighted sum of Wf * xt and Uf * Yt-1
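The two forms of the forget gate equation compute the same thing. Here is a small NumPy sketch that checks this; the sizes (input 5, hidden 3), the random values, and the variable names are my own assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)      # arbitrary seed
input_size, hidden_size = 5, 3      # assumed sizes

W_f = rng.normal(size=(hidden_size, input_size))   # maps x_t to the gate
U_f = rng.normal(size=(hidden_size, hidden_size))  # maps Y_(t-1) to the gate
b_f = rng.normal(size=hidden_size)
x_t = rng.normal(size=input_size)
y_prev = rng.normal(size=hidden_size)

# Separate-matrix form: ft = sigmoid(Wf * xt + Uf * Yt-1 + bf)
f_separate = sigmoid(W_f @ x_t + U_f @ y_prev + b_f)

# Concatenated form: ft = sigmoid(Wf[xt, Yt-1] + bf), with one matrix [W_f | U_f]
W_concat = np.hstack([W_f, U_f])
f_concat = sigmoid(W_concat @ np.concatenate([x_t, y_prev]) + b_f)

print("forget gate:", f_separate)
print("forms agree:", np.allclose(f_separate, f_concat))
```

Every element of the result lies strictly between 0 and 1, which is what lets it act as a soft "keep or forget" switch.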

Let’s take an example of a conversation between two friends, Ria and Priya.

Ria said, “I am going to the mall to buy some clothes.” Priya replied, “That’s great, I am working on my project at ____”

In this conversation, the forget gate helps the model drop the earlier context from Ria’s sentence (the mall and the clothes) and focus on Priya’s response. Each element of ft lies between 0 and 1, depending on how much of the corresponding information we want to keep: a value close to 0 means forget it, and a value close to 1 means keep it completely.

2. Input Gate

Input Gate

The input gate decides what new information we can add to the current cell state. First, it takes the current input Xt and the previous hidden state Yt-1 and passes them through a second sigmoid function. The sigmoid squashes the values to between 0 and 1 and produces the input gate vector it.

After that, the same hidden state and current input go through a tanh function. This helps regulate the network and creates a candidate cell vector (gt) with values between -1 and 1.

Finally, point-wise multiplication happens between the vectors it and gt. The result stores how much of each candidate value should actually be written to the cell state.

Here the sigmoid function decides which values we need to update, and the tanh function proposes the new values themselves.

Input Gate Equation

it = sigmoid(Wi[xt, Yt-1] + bi)

gt = tanh(Wg[xt, Yt-1] + bg)

Similarly, with separate weight matrices for the input and the hidden state, we can write:

it = sigmoid(Wi*xt + Ui*Yt-1 + bi)

gt = tanh(Wg*xt + Ug*Yt-1 + bg)


  • Wi is a weight matrix that maps the input xt to the input gate values
  • Ui is a weight matrix that maps the previous hidden state Yt-1 to the input gate values
  • bi is a bias vector that is added to the weighted sum of Wi * xt and Ui * Yt-1
  • Wg is a weight matrix that maps the input xt to the candidate values
  • Ug is a weight matrix that maps the previous hidden state Yt-1 to the candidate values
  • bg is a bias vector that is added to the weighted sum of Wg * xt and Ug * Yt-1

During training, the values of Wi, Ui, bi, Wg, Ug, and bg are learned through backpropagation and gradient descent, so that the model learns to selectively add new information to the cell state based on the current input and the previous hidden state.
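The input gate equations can be sketched in a few lines of NumPy. The function name `input_gate`, the sizes, and the random parameter values below are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gate(x_t, y_prev, W_i, U_i, b_i, W_g, U_g, b_g):
    """Return (it, gt, it * gt) for one time step."""
    i_t = sigmoid(W_i @ x_t + U_i @ y_prev + b_i)  # how much to write, in (0, 1)
    g_t = np.tanh(W_g @ x_t + U_g @ y_prev + b_g)  # what to write, in (-1, 1)
    return i_t, g_t, i_t * g_t

rng = np.random.default_rng(1)      # arbitrary seed
input_size, hidden_size = 5, 3      # assumed sizes
W_i, W_g = (rng.normal(size=(hidden_size, input_size)) for _ in range(2))
U_i, U_g = (rng.normal(size=(hidden_size, hidden_size)) for _ in range(2))
b_i, b_g = np.zeros(hidden_size), np.zeros(hidden_size)

x_t = np.array([0.0, 1.0, 0.0, 0.0, 0.0])   # a one-hot input
y_prev = rng.normal(size=hidden_size)

i_t, g_t, update = input_gate(x_t, y_prev, W_i, U_i, b_i, W_g, U_g, b_g)
print("it          :", i_t)
print("gt          :", g_t)
print("it * gt     :", update)
```

Because it scales gt element by element, a candidate value only reaches the cell state when the corresponding input gate value is well above 0.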

For example, consider the sentence: “I love to eat pizza, but I am allergic to ____”

To predict the next word, an LSTM needs to selectively forget and remember information from the previous words. The input gate is responsible for deciding what new information to let into the cell state.

In this example, the input gate would determine how much weight to give to the word “pizza” and how much weight to give to the word that follows “allergic to“.

If the next word is “tomatoes” for instance, the input gate would need to remember information about “tomatoes” without overwriting the information about “pizza” that was previously stored in the cell state.

3. Cell State

Cell Gate

After calculating the forget gate and the input gate, we need to update the cell state of the LSTM. The cell state acts like a conveyor belt running through the whole network. It is crucial because it lets information flow along the sequence with only small, element-wise changes.

Cell State is also known as the memory of LSTM.

All of the gate decisions come together in the cell state. First, the previous cell state Ct-1 is multiplied element-wise by the forget vector ft. Where ft is close to 0, the corresponding information is treated as less important and is dropped from the cell state.

After that, the output of the input gate (it * gt) is added point-wise, which updates the cell state and creates the new cell state (Ct).

Cell Gate Equation

Ct = ft * Ct-1 + it * gt
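A tiny example with hand-picked numbers makes the update easy to follow. All values below are hypothetical and chosen so that each element shows a different case: keep the old memory, overwrite it, or blend the two.

```python
import numpy as np

c_prev = np.array([2.0, 2.0, 2.0])   # previous cell state Ct-1
f_t = np.array([1.0, 0.0, 0.5])      # forget gate: keep, forget, keep half
i_t = np.array([0.0, 1.0, 0.5])      # input gate: block, accept, accept half
g_t = np.array([0.5, -0.5, 1.0])     # candidate values

c_t = f_t * c_prev + i_t * g_t       # Ct = ft * Ct-1 + it * gt
print(c_t.tolist())                  # [2.0, -0.5, 1.5]
```

The first element keeps the old memory untouched, the second replaces it with the candidate, and the third mixes half of each.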

LSTMs have three gates, which are used to control and protect the cell state. The last one is the output gate.

4. Output Gate

Output Gate

The output gate is the third and final gate in the LSTM architecture. It generates the output (Yt) of the hidden layer.


First, we pass the current input Xt and the previous hidden state Yt-1 into the third sigmoid function to produce ot. Then we apply the tanh function to the new cell state Ct. Finally, we apply point-wise multiplication to those two results to produce the final output Yt.

Output Gate Equation

ot = sigmoid(Wo * xt + Uo * Yt-1 + bo)

Yt = ot * tanh(Ct)
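Putting the four equations together, one LSTM time step can be sketched as a single function. This is a minimal sketch: the `params` dictionary layout (gate name mapped to a `(W, U, b)` tuple), the function name `lstm_step`, and the random demo values are my own assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, y_prev, c_prev, params):
    """One LSTM time step. `params` maps "f", "i", "g", "o" to (W, U, b)."""
    def pre_activation(name):
        W, U, b = params[name]
        return W @ x_t + U @ y_prev + b

    f_t = sigmoid(pre_activation("f"))   # forget gate
    i_t = sigmoid(pre_activation("i"))   # input gate
    g_t = np.tanh(pre_activation("g"))   # candidate values
    c_t = f_t * c_prev + i_t * g_t       # new cell state
    o_t = sigmoid(pre_activation("o"))   # output gate
    y_t = o_t * np.tanh(c_t)             # new hidden state / output
    return y_t, c_t

# Demo with small random parameters (input size 5, hidden size 3)
rng = np.random.default_rng(0)
input_size, hidden_size = 5, 3
params = {
    name: (rng.normal(scale=0.1, size=(hidden_size, input_size)),
           rng.normal(scale=0.1, size=(hidden_size, hidden_size)),
           np.zeros(hidden_size))
    for name in ("f", "i", "g", "o")
}
x_t = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
y_t, c_t = lstm_step(x_t, np.zeros(hidden_size), np.zeros(hidden_size), params)
print("Yt:", y_t)
print("Ct:", c_t)
```

The function returns both Yt and Ct because the next time step needs both of them as inputs.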

Forward Propagation of LSTM from Scratch

Now let’s calculate forward propagation of LSTM to clear up our understanding. Let’s walk through a single iteration of an LSTM for next word prediction using the example sentence “The cat sat on the ____“.


Input Encoding

Like other NLP models, we first encode the input sequence “The cat sat on the” as a sequence of word vectors. You can use any technique such as one-hot encoding or word embeddings.

For simplicity, let’s assume we are using one-hot encoding, and we have a vocabulary size of 5 words “the”, “cat”, “sat”, “on”, and “dog”. The one-hot encoded input sequence would look like below:

[1 0 0 0 0] # “the”

[0 1 0 0 0] # “cat”

[0 0 1 0 0] # “sat”

[0 0 0 1 0] # “on”

With one-hot encoding, each unique word in the vocabulary gets exactly one vector. That is why “the” is listed only once above, even though it appears twice in the sentence: its second occurrence reuses the same vector [1 0 0 0 0].
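Here is a short sketch of this encoding step in NumPy. The helper name `one_hot` and the `word_to_index` dictionary are my own choices for illustration.

```python
import numpy as np

vocabulary = ["the", "cat", "sat", "on", "dog"]
word_to_index = {word: idx for idx, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's vocabulary index."""
    vec = np.zeros(len(vocabulary), dtype=int)
    vec[word_to_index[word]] = 1
    return vec

sentence = ["the", "cat", "sat", "on", "the"]
X = np.stack([one_hot(word) for word in sentence])   # one row per time step
print(X)
```

The matrix has five rows, one per time step, and the first and last rows are identical because both encode “the”.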

We’ll assume that our LSTM has a hidden state size of 3, so our weight matrices and biases have the following shapes:

  • W_f, U_f, b_f: (3, 5), (3, 3), (3,)
  • W_i, U_i, b_i: (3, 5), (3, 3), (3,)
  • W_o, U_o, b_o: (3, 5), (3, 3), (3,)
  • W_g, U_g, b_g: (3, 5), (3, 3), (3,)
  • W_p, b_p: (5, 3), (5,)

We’ll also assume that our initial hidden state Y0 and cell state C0 are both zero vectors of size 3.
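If you want to set this up in code, a minimal sketch could look like the one below. The uniform range (-0.5 to 0.5), the seed, and the helper name `init_gate` are arbitrary assumptions; only the shapes come from the list above.

```python
import numpy as np

vocab_size, hidden_size = 5, 3
rng = np.random.default_rng(42)   # arbitrary seed

def init_gate():
    """Random W (3, 5) and U (3, 3), plus a zero bias b (3,), for one gate."""
    W = rng.uniform(-0.5, 0.5, size=(hidden_size, vocab_size))
    U = rng.uniform(-0.5, 0.5, size=(hidden_size, hidden_size))
    b = np.zeros(hidden_size)
    return W, U, b

params = {name: init_gate() for name in ("f", "i", "o", "g")}

# Prediction layer: maps the hidden state (3) to one score per word (5)
W_p = rng.uniform(-0.5, 0.5, size=(vocab_size, hidden_size))
b_p = np.zeros(vocab_size)

# Initial hidden state and cell state
Y0 = np.zeros(hidden_size)
C0 = np.zeros(hidden_size)

for name, (W, U, b) in params.items():
    print(f"W_{name}: {W.shape}, U_{name}: {U.shape}, b_{name}: {b.shape}")
print(f"W_p: {W_p.shape}, b_p: {b_p.shape}")
```

With these shapes, W @ x + U @ Y + b always produces a vector of size 3, and W_p @ Y + b_p produces a vector of size 5.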

Time step = 1

At the first time step t=1, we feed the input word vector of “the” [1 0 0 0 0] into the LSTM, along with the previous hidden state Y0 and cell state C0:

Forget Gate Calculation

We have randomly initialized the weight matrix Wf as a 3×5 matrix, Uf as a 3×3 matrix, and the bias bf as a vector of size 3. The first hidden state Y0 is initialized as a zero vector of size 3.


We already know the equation of the forget gate:

ft = sigmoid(Wf * xt + Uf * Yt-1 + bf)

for time step =1 the equation would be:

f1 = sigmoid(Wf * xthe + Uf * Y0 + bf)

After putting all values:

f_1 = sigmoid([[-0.1, 0.2, -0.3, 0.1, -0.2], [-0.3, 0.1, 0.2, 0.1, 0.3], [0.2, 0.1, -0.2, -0.3, -0.1]] @ [1 0 0 0 0] + [[0.2, -0.1, 0.1], [-0.2, -0.1, -0.2], [0.1, -0.3, 0.2]] @ [0 0 0] + [0.1, -0.2, 0.3])

= sigmoid([-0.1, -0.3, 0.2] + [0, 0, 0] + [0.1, -0.2, 0.3])

= sigmoid([0.0, -0.5, 0.5])

= [0.500, 0.378, 0.622]

Because the input is one-hot, Wf @ xthe simply picks out the first column of Wf, and because Y0 is a zero vector, the Uf term vanishes.

Note: All results are rounded to three decimals for display. The calculations carry full precision, so the last digit can differ slightly from what you get by multiplying the rounded values.

Input Gate Calculation

Equation of input gate:

it = sigmoid(Wi*xt + Ui*Yt-1 + bi)

gt = tanh(Wg*xt + Ug*Yt-1 + bg)

Similarly, let’s define our weight matrices.

Weight matrices for it
Weight matrices for gt

Let’s first calculate it

it = sigmoid(Wi*xt + Ui*Yt-1 + bi)

For time step = 1 we can write the equation like below:

i1 = sigmoid(Wi*xthe + Ui*Y0 + bi)

After putting all values:

i_1 = sigmoid([[-0.4, 0.2, 0.1, -0.2, 0.3], [-0.3, 0.2, 0.1, -0.3, 0.2], [0.1, -0.3, -0.1, 0.3, 0.2]] @ [1 0 0 0 0] + [[0.2, -0.1, 0.1], [-0.2, -0.1, -0.2], [0.1, -0.3, 0.2]] @ [0 0 0] + [-0.1, -0.2, 0.3])

= sigmoid([-0.4, -0.3, 0.1] + [0, 0, 0] + [-0.1, -0.2, 0.3])

= sigmoid([-0.5, -0.5, 0.4])

= [0.378, 0.378, 0.599]

Now let’s calculate the value of gt:

gt = tanh(Wg*xt + Ug*Yt-1 + bg)

For time step = 1 the equation will look like below:

g1 = tanh(Wg*xthe + Ug*Y0 + bg)

After putting all values:

g_1 = tanh([[0.1, 0.3, 0.2, 0.4, -0.2], [0.2, 0.4, -0.1, -0.2, 0.1], [-0.2, 0.1, -0.4, -0.1, 0.3]] @ [1 0 0 0 0] + [[-0.3, 0.2, -0.1], [0.2, -0.2, 0.1], [-0.1, -0.1, -0.3]] @ [0 0 0] + [0.1, -0.1, 0.2])

= tanh([0.1, 0.2, -0.2] + [0, 0, 0] + [0.1, -0.1, 0.2])

= tanh([0.2, 0.1, 0.0])

= [0.197, 0.100, 0.000]

Cell Gate Calculation

As we know, the formula for the cell gate is:

Ct = ft * Ct-1 + it * gt

For time step = 1 we can write it like below:

C1 = f1 * C0 + i1 * g1

In the previous steps we have already calculated the values of f1, i1, and g1. The initial cell state C0, like the initial hidden state Y0, is a zero vector of size 3.

So putting the values into the equation looks like below:

c_1 = f_1 * c_0 + i_1 * g_1

= [0.500, 0.378, 0.622] * [0, 0, 0] + [0.378, 0.378, 0.599] * [0.197, 0.100, 0.000]

= [0.075, 0.038, 0.000]

Output Gate Calculation

Finally, we can calculate the values of the output gate. In this gate we need to calculate two values:

ot = sigmoid(Wo * xt + Uo * Yt-1 + bo)

Yt = ot * tanh(Ct)

Before jumping into the calculation, let’s define our weight matrices in the same manner as for the other gates.

Output gate weight matrices

Let’s first calculate ot. For time step = 1 we can write the equation of ot like below:

o1 = sigmoid(Wo * xthe + Uo * Y0 + bo)

Now let’s put all the values into the equation:

o_1 = sigmoid([[0.3, 0.1, -0.2, 0.1, -0.4], [0.2, 0.1, -0.3, 0.2, -0.1], [0.1, -0.1, 0.2, 0.1, 0.3]] @ [1 0 0 0 0] + [[0.1, 0.1, -0.1], [0.2, -0.3, -0.2], [0.1, -0.1, 0.2]] @ [0 0 0] + [0.2, 0.3, 0.1])

= sigmoid([0.3, 0.2, 0.1] + [0, 0, 0] + [0.2, 0.3, 0.1])

= sigmoid([0.5, 0.5, 0.2])

= [0.622, 0.622, 0.550]

Now finally we can calculate the hidden layer output (for time = 1) of the LSTM neural network model. The equation for this output is:

Yt = ot * tanh(Ct)

For time = 1 we can write it like this:

Y1 = o1 * tanh(c1)

We already know the values of o1 and c1. Let’s put those values into this equation and produce the output.

Y1 = o1 * tanh(c1)

= [0.622, 0.622, 0.550] * tanh([0.075, 0.038, 0.000])

= [0.622, 0.622, 0.550] * [0.074, 0.038, 0.000]

= [0.046, 0.023, 0.000]

Time step = 2

At the second time step t=2, we feed the input vector for the word “cat” [0 1 0 0 0] into the LSTM neural network model, along with the previous hidden state Y1 and cell state c1.

Note: The weight and bias matrices are the same at every time step within one iteration.

f_2 = sigmoid([[-0.1, 0.2, -0.3, 0.1, -0.2], [-0.3, 0.1, 0.2, 0.1, 0.3], [0.2, 0.1, -0.2, -0.3, -0.1]] @ [0 1 0 0 0] + [[0.2, -0.1, 0.1], [-0.2, -0.1, -0.2], [0.1, -0.3, 0.2]] @ [0.046, 0.023, 0.000] + [0.1, -0.2, 0.3])

= sigmoid([0.2, 0.1, 0.1] + [0.007, -0.012, -0.002] + [0.1, -0.2, 0.3])

= sigmoid([0.307, -0.112, 0.398])

= [0.576, 0.472, 0.598]

i_2 = sigmoid([[-0.4, 0.2, 0.1, -0.2, 0.3], [-0.3, 0.2, 0.1, -0.3, 0.2], [0.1, -0.3, -0.1, 0.3, 0.2]] @ [0 1 0 0 0] + [[0.2, -0.1, 0.1], [-0.2, -0.1, -0.2], [0.1, -0.3, 0.2]] @ [0.046, 0.023, 0.000] + [-0.1, -0.2, 0.3])

= sigmoid([0.2, 0.2, -0.3] + [0.007, -0.012, -0.002] + [-0.1, -0.2, 0.3])

= sigmoid([0.107, -0.012, -0.002])

= [0.527, 0.497, 0.499]

g_2 = tanh([[0.1, 0.3, 0.2, 0.4, -0.2], [0.2, 0.4, -0.1, -0.2, 0.1], [-0.2, 0.1, -0.4, -0.1, 0.3]] @ [0 1 0 0 0] + [[-0.3, 0.2, -0.1], [0.2, -0.2, 0.1], [-0.1, -0.1, -0.3]] @ [0.046, 0.023, 0.000] + [0.1, -0.1, 0.2])

= tanh([0.3, 0.4, 0.1] + [-0.009, 0.005, -0.007] + [0.1, -0.1, 0.2])

= tanh([0.391, 0.305, 0.293])

= [0.372, 0.295, 0.285]

c_2 = f_2 * c_1 + i_2 * g_2

= [0.576, 0.472, 0.598] * [0.075, 0.038, 0.000] + [0.527, 0.497, 0.499] * [0.372, 0.295, 0.285]

= [0.239, 0.165, 0.142]

o_2 = sigmoid([[0.3, 0.1, -0.2, 0.1, -0.4], [0.2, 0.1, -0.3, 0.2, -0.1], [0.1, -0.1, 0.2, 0.1, 0.3]] @ [0 1 0 0 0] + [[0.1, 0.1, -0.1], [0.2, -0.3, -0.2], [0.1, -0.1, 0.2]] @ [0.046, 0.023, 0.000] + [0.2, 0.3, 0.1])

= sigmoid([0.1, 0.1, -0.1] + [0.007, 0.002, 0.002] + [0.2, 0.3, 0.1])

= sigmoid([0.307, 0.402, 0.002])

= [0.576, 0.599, 0.501]

Y_2 = o_2 * tanh(c_2)

= [0.576, 0.599, 0.501] * tanh([0.239, 0.165, 0.142])

= [0.576, 0.599, 0.501] * [0.234, 0.163, 0.141]

= [0.135, 0.098, 0.071]

Till time step = 5

This will go on until the last input word vector, for the word “the”. In our example, the input words are “the”, “cat”, “sat”, “on”, and “the”.

So the total number of time steps is 5.

  • “the” => time step = 1 input: xthe, y0, c0 , (calculated values: f1, i1, g1, c1, o1, and y1)
  • “cat” => time step = 2 input: xcat, y1, c1 , (calculated values: f2, i2, g2, c2, o2, and y2)
  • “sat” => time step = 3 input: xsat, y2, c2 , (calculated values: f3, i3, g3, c3, o3, and y3)
  • “on” => time step = 4 input: xon, y3, c3 , (calculated values: f4, i4, g4, c4, o4, and y4)
  • “the” => time step = 5 input: xthe, y4, c4 , (calculated values: f5, i5, g5, c5, o5, and y5)

Next word Prediction

We can now use the output hidden state Y5 (the final output) to predict the next word in the sequence. This is typically done by applying a softmax activation function to the output of a linear layer that takes Y5 as input.

p_5 = softmax(Wp * Y5 + bp)

The linear layer maps the hidden state (size 3) to one score per vocabulary word (size 5), so we need a weight matrix Wp and a bias bp, again initialized with random values.

To keep the example short, let’s predict the next word after time step = 2 instead of time step = 5.

For this, we have Y2 = [0.135, 0.098, 0.071]. The length of the Y2 vector is 3 and our vocabulary size is 5.

Below, Wp is written out as a 3×5 matrix (hidden size × vocabulary size) and applied as Y2 @ Wp. This is the transpose of the (5, 3) layout and gives the same scores. bp is a vector of size 5.

So now we have:

Y_2 = [0.135, 0.098, 0.071]
W_p = [[-0.007, 0.366, -0.155, -0.223, -0.110],
[ 0.126, -0.009, -0.130, 0.246, 0.155],
[-0.057, 0.170, -0.070, -0.212, -0.091]]
b_p = [0.083, -0.139, 0.017, -0.022, 0.061]

Note: Weight and bias matrix values are random values.

So now, let’s calculate the next word prediction probability after time step = 2.

p_2 = softmax(Y2 @ Wp + bp)

= softmax([0.135, 0.098, 0.071] @ [[-0.007, 0.366, -0.155, -0.223, -0.110], [ 0.126, -0.009, -0.130, 0.246, 0.155], [-0.057, 0.170, -0.070, -0.212, -0.091]] + [0.083, -0.139, 0.017, -0.022, 0.061])

= softmax([0.007, 0.061, -0.039, -0.021, -0.006] + [0.083, -0.139, 0.017, -0.022, 0.061])

= softmax([0.090, -0.078, -0.022, -0.043, 0.055])

= [0.218, 0.184, 0.195, 0.191, 0.211]

We can see that the first element (0.218) of our softmax output vector has the highest probability. So the predicted next word is the first word in our vocabulary (vocabulary[0] with zero-based indexing).

Our vocabulary is: ["the", "cat", "sat", "on", "dog"]. So the predicted next word is vocabulary[0] = “the”.

At this point, after only one forward pass with random weights, you cannot expect a meaningful prediction. My goal here was just to walk you through the entire path of one forward propagation iteration of a long short term memory model.
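If you want to check the arithmetic yourself, the walk-through can be reproduced with a short NumPy script. This is a sketch under a few assumptions: it reuses the weight values written out in the time step 1 expressions for every time step, uses the same U matrix for the forget and input gates as written there, and applies W_p as a 3×5 matrix via y @ W_p, as in the softmax expression above. The function and variable names are my own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

vocabulary = ["the", "cat", "sat", "on", "dog"]

# Weights copied from the time step 1 expressions in this tutorial
W_f = np.array([[-0.1, 0.2, -0.3, 0.1, -0.2], [-0.3, 0.1, 0.2, 0.1, 0.3], [0.2, 0.1, -0.2, -0.3, -0.1]])
U_f = np.array([[0.2, -0.1, 0.1], [-0.2, -0.1, -0.2], [0.1, -0.3, 0.2]])
b_f = np.array([0.1, -0.2, 0.3])
W_i = np.array([[-0.4, 0.2, 0.1, -0.2, 0.3], [-0.3, 0.2, 0.1, -0.3, 0.2], [0.1, -0.3, -0.1, 0.3, 0.2]])
U_i = U_f.copy()                # same values as written in the i_1 expression
b_i = np.array([-0.1, -0.2, 0.3])
W_g = np.array([[0.1, 0.3, 0.2, 0.4, -0.2], [0.2, 0.4, -0.1, -0.2, 0.1], [-0.2, 0.1, -0.4, -0.1, 0.3]])
U_g = np.array([[-0.3, 0.2, -0.1], [0.2, -0.2, 0.1], [-0.1, -0.1, -0.3]])
b_g = np.array([0.1, -0.1, 0.2])
W_o = np.array([[0.3, 0.1, -0.2, 0.1, -0.4], [0.2, 0.1, -0.3, 0.2, -0.1], [0.1, -0.1, 0.2, 0.1, 0.3]])
U_o = np.array([[0.1, 0.1, -0.1], [0.2, -0.3, -0.2], [0.1, -0.1, 0.2]])
b_o = np.array([0.2, 0.3, 0.1])
W_p = np.array([[-0.007, 0.366, -0.155, -0.223, -0.110],
                [0.126, -0.009, -0.130, 0.246, 0.155],
                [-0.057, 0.170, -0.070, -0.212, -0.091]])
b_p = np.array([0.083, -0.139, 0.017, -0.022, 0.061])

def lstm_step(x, y_prev, c_prev):
    f = sigmoid(W_f @ x + U_f @ y_prev + b_f)
    i = sigmoid(W_i @ x + U_i @ y_prev + b_i)
    g = np.tanh(W_g @ x + U_g @ y_prev + b_g)
    c = f * c_prev + i * g
    o = sigmoid(W_o @ x + U_o @ y_prev + b_o)
    y = o * np.tanh(c)
    return y, c

y, c = np.zeros(3), np.zeros(3)   # Y0 and C0
states = []
for word in ["the", "cat"]:       # time steps 1 and 2
    x = np.zeros(len(vocabulary))
    x[vocabulary.index(word)] = 1.0
    y, c = lstm_step(x, y, c)
    states.append((y.copy(), c.copy()))
    print(f"{word}: Y = {np.round(y, 3)}, c = {np.round(c, 3)}")

p = softmax(y @ W_p + b_p)        # prediction after time step 2
predicted = vocabulary[int(np.argmax(p))]
print("p =", np.round(p, 3), "->", predicted)
```

To continue to time step 5, extend the word list in the loop to ["the", "cat", "sat", "on", "the"] and apply the softmax to the final hidden state.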

Limitations of LSTM model

While LSTMs are a powerful tool for modeling sequential data, they do have some limitations. Here are some of the limitations of LSTMs:

Computationally expensive

For LSTMs to work well, they require a lot of data and computer resources, especially for large-scale applications. They are also computationally expensive to train.

Limited interpretability

Because LSTMs are black boxes, it is challenging to understand how and why particular decisions are made by the model.


Overfitting

When using smaller datasets, LSTMs might be vulnerable to overfitting. This means that the model may perform well on the training data but poorly on unseen data.

Training time

Due to their complex architecture, LSTMs may require more time to train than simpler models. This is because the model needs to calculate all four gates (or states) at every time step of each input sequence.

End Note

So in summary, an RNN is good for short sentences. But for long sentences where the dependency gap is big, an RNN fails to perform because of the vanishing gradient problem.

LSTM (long short-term memory) is designed to tackle this long-dependency issue. It uses several gates to control its memory, so it can remember important things for a long period of time.

In this tutorial, I explained the LSTM neural network model from scratch by calculating one iteration of LSTM forward propagation to predict the next word for a given sentence.

This is it for this tutorial. If you have any questions or suggestions regarding this post, please let me know in the comment section below.
