Continuous Bag of Words (CBOW) – Single Word Model – How It Works

 

To implement Word2Vec, there are two flavors: Continuous Bag-of-Words (CBOW) and continuous Skip-gram (SG).
 
 
In this post I will explain only the Continuous Bag of Words (CBOW) model, with a one-word window, so that CBOW is easy to understand. Once you understand the single-word CBOW model, the multi-word CBOW model will be easy for you.
 
 
While explaining, I will present a few small examples with a text containing a few words. However, keep in mind that word2vec is typically trained with billions of words.

 

Continuous Bag of Words (CBOW):

It attempts to guess the output (target word) from its neighboring words (context words). You can think of it as a fill-in-the-blank task, where you need to guess the missing word by observing the nearby words.

 

Continuous Bag of Words (CBOW) single-word model:

In this section we will implement the single-word CBOW architecture of Word2Vec. The content is broken down into the following steps:
 
Data Preparation: Defining corpus by tokenizing text.
 
Generate Training Data: Build vocabulary of words, one-hot encoding for words, word index.
 
Train Model: Pass the one-hot encoded words through the forward pass, compute the loss to measure the error, and adjust the weights using back propagation.
 
Output: Use the trained model to calculate word vectors and find similar words.

I will explain the CBOW steps without full code, but if you want complete working code for CBOW with numpy from scratch, I have a separate post for that, and you can always jump into it.


1. Data Preparation:

Let’s say we have a text like below:
 
i like natural language processing
 
To keep it simple I have chosen a sentence without capitalization or punctuation. I will also not remove any stop words (“and”, “the”, etc.), but for a real-world implementation you should do plenty of cleaning tasks such as stop-word removal, replacing digits, and removing punctuation.
 
After pre-processing we will convert the text to a list of tokenized words.
 
[“i”, “like”, “natural”, “language”, “processing”]
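As a minimal sketch of this step (the variable names below are my own, not from the post), the tokenization could be done like this:

```python
# Lower-case the text and split on whitespace to get the token list.
text = "i like natural language processing"
tokens = text.lower().split()
print(tokens)  # ['i', 'like', 'natural', 'language', 'processing']
```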

2. Generate training data:

Unique vocabulary: Find the list of unique words. As we don’t have any duplicate words in our example text, the unique vocabulary will be:
 
[“i”, “like”, “natural”, “language”, “processing”]
 
Now, to prepare training data for the single-word CBOW model, we define the “target word” as the word which follows a given word in the text (that given word is our “context word”). That means we will be predicting the next word for a given word.
 

Now let’s construct our training examples: scanning through the text with a window, we pair up a context word and a target word, like so:

For example, for the context word “i” the target word will be “like”. For our example text the full training data will look like this:

Context word -> Target word
i            -> like
like         -> natural
natural      -> language
language     -> processing

One-hot encoding: We need to convert the text to one-hot encoding, as the algorithm can only understand numeric values.
 
For example, the encoded value of the word “i”, which appears first in the vocabulary, will be the vector [1, 0, 0, 0, 0]. The word “like”, which appears second in the vocabulary, will be encoded as the vector [0, 1, 0, 0, 0].
 
 

 

So let’s see the overall set of context–target words in one-hot encoded form:

Context word (X)                 Target word (Y)
i        [1, 0, 0, 0, 0]         like        [0, 1, 0, 0, 0]
like     [0, 1, 0, 0, 0]         natural     [0, 0, 1, 0, 0]
natural  [0, 0, 1, 0, 0]         language    [0, 0, 0, 1, 0]
language [0, 0, 0, 1, 0]         processing  [0, 0, 0, 0, 1]
 
 
As you can see, the above table is our final training data, where the encoded target word is the Y variable and the encoded context word is the X variable for our model.
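A minimal numpy sketch of this training-data generation step (the variable names are my own, not the post's):

```python
import numpy as np

tokens = ["i", "like", "natural", "language", "processing"]
vocab = list(dict.fromkeys(tokens))               # unique words, order preserved
word_to_index = {w: i for i, w in enumerate(vocab)}
V = len(vocab)                                    # vocabulary size = 5

def one_hot(word):
    """Return a one-hot vector of length V for the given word."""
    vec = np.zeros(V)
    vec[word_to_index[word]] = 1.0
    return vec

# context word = current word (X), target word = next word (Y)
X = np.array([one_hot(w) for w in tokens[:-1]])   # shape (4, 5)
Y = np.array([one_hot(w) for w in tokens[1:]])    # shape (4, 5)
```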
 
Now we will move on to train our model.


3. Training Model:

 
 
 

So far so good, right? Now we need to pass this data into a basic neural network with one hidden layer and train it. One thing to note is that the desired vector dimension of any word will be the number of hidden nodes.
 
For this tutorial and demo purposes my desired vector dimension is 3. For example:
 
“i” => [0.001, 0.896, 0.763], so the number of hidden-layer nodes will be 3.
 
Dimension (N): This is the dimension of the word embedding. You can treat each embedding dimension as a feature or attribute, such as organization, name, gender, etc. It can be 10, 20, 100 and so on. Increasing the embedding dimension describes a word or token in more detail. As an example, Google's pre-trained word2vec has a dimension of 300.
 
Now, as you know, basic neural network training is divided into a few steps:

1. Create model Architecture
2. Forward Propagation
3. Error Calculation
4. Weight tuning using backward pass

Before going into forward propagation we need to understand the model architecture in vectorized form.

 



3.1 Model Architecture:

 
 
 
 

Let’s look back at our example text “i like natural language processing”. Suppose we are training the model against the first training data point, where the context word is “i” and the target word is “like”.
 
Ideally, the values of the weights should be such that when the input x = (1, 0, 0, 0, 0) for the word “i” is given to the model, it produces an output close to y = (0, 1, 0, 0, 0), which represents the word “like”.
 
So now let’s create the weight matrix for the input-to-hidden layer (w).

The picture above shows how I have created the weight matrix for the first two input nodes. In a similar manner, if we create the weight matrix for all input nodes, it will look like the one below, whose dimension is [3 x 5], or in other words
 
[N x V]
 
Where: 
 
N: Number of embedding dimensions / hidden units
V: Size of the unique vocabulary
 
 

[Figure: the full input-to-hidden weight matrix w, of shape N x V = 3 x 5]
Now let’s create the weight matrix for the hidden-to-output layer (w′).
In a similar manner (as shown in the picture above), if we create the weight matrix for all hidden nodes, it will look like the one below, whose dimension is [5 x 3], or in other words [V x N].
 
 
[Figure: the full hidden-to-output weight matrix w′, of shape V x N = 5 x 3, drawn by hand]

So the final vectorized form of the CBOW model should look like below:

CBOW Vectorized form:

h = w · x            (shape [N x 1] = [N x V] · [V x 1])
u = w′ · h           (shape [V x 1] = [V x N] · [N x 1])
y = softmax(u)       (shape [V x 1])

where V is the size of the unique vocabulary and N is the number of embedding dimensions (the number of hidden units).
Now we can start forward propagation.
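A minimal sketch of setting up these two weight matrices with the shapes above (random uniform initialization is my assumption; the post does not specify an initialization scheme):

```python
import numpy as np

V, N = 5, 3                                    # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
w = rng.uniform(-0.8, 0.8, size=(N, V))        # input-to-hidden weights,  [N x V]
w_prime = rng.uniform(-0.8, 0.8, size=(V, N))  # hidden-to-output weights, [V x N]
```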

 


3.2 Forward Propagation:

Please refer to the CBOW vectorized form above if you have any confusion about the calculations below.

Calculate hidden layer matrix (h):

 
So now:

h = w · x

For the first training example, x is the one-hot vector for “i”, so h is simply the column of w that corresponds to “i”.

Calculate output layer matrix (u):

 
So now:

u = w′ · h

Calculate final Softmax output (y):

So:

y1 = Softmax (u1)
y2 = Softmax (u2)
y3 = Softmax (u3)
y4 = Softmax (u4)
y5 = Softmax (u5)
 
Now, as you know, softmax calculates a probability for every possible class. The softmax function uses exponentials because we want the outputs to lie between 0 and 1 and sum to 1.
 
Let’s have a look at only one (the first) output of the softmax function:
 
y_1 = exp(u_1) / ( exp(u_1) + exp(u_2) + exp(u_3) + exp(u_4) + exp(u_5) )
So now, if we want to generalize the above equation (which was written for the first softmax output), we can write:
y_j = exp(u_j) / Σ_{k=1..V} exp(u_k)
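A minimal numpy sketch of the forward pass for one training example (the function names are my own, not the post's):

```python
import numpy as np

def softmax(u):
    e = np.exp(u - np.max(u))   # subtract the max for numerical stability
    return e / e.sum()

def forward(x, w, w_prime):
    h = w @ x                   # hidden layer,            shape (N,)
    u = w_prime @ h             # output layer scores,     shape (V,)
    y = softmax(u)              # predicted probabilities, shape (V,)
    return h, u, y
```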

3.3 Error Calculation:

Having finished forward propagation, we now need to calculate the error to know how accurately our model is performing, and update the weights (w and w′) accordingly.
 
As you know, to calculate the error we must have an actual value to compare against the predicted value.
 
For the single-word model, the next word is the target word for the previous word.
Here we will use the log loss function to calculate the error. First, let’s have a look at the generalized equation of the error / log loss that we want to minimize:


E = -log P(w_t | w_c)

Where 
 
w_t = target word
w_c = context word
 
So now let’s calculate the error / loss for the first iteration only. For the first iteration, “like” is the target word and its position is 2.
 
So,

E = -log P(“like” | “i”) = -log(y_2) = -u_2 + log( exp(u_1) + exp(u_2) + exp(u_3) + exp(u_4) + exp(u_5) )
So now, if we want to generalize the above equation, we can write it like this:
 
E = -log(y_{j*}) = -u_{j*} + log Σ_{k=1..V} exp(u_k)
where j* is the index of the actual word (the target word) in the output layer. For the first iteration the index of the target word is 2, so j* will be 2, since the position of the word “like” is 2.
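A minimal sketch of this loss for one example (names are mine; note that Python indexing is 0-based, so the position 2 above becomes index 1):

```python
import numpy as np

def loss(y, target_index):
    """Negative log of the probability the model assigns to the true target word."""
    return -np.log(y[target_index])

# e.g. for the first training pair ("i" -> "like"), target_index = 1
```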



3.4 Back Propagation:

Having finished forward propagation and error calculation, we now need to tune the weight matrices through back propagation.
Back propagation is required to update the weight matrices (w and w′). To update a weight we calculate the derivative of the loss with respect to that weight and subtract the derivative from the corresponding weight. This is the gradient descent technique for tuning weights.
Let’s take a look at just one example to update the second weight matrix (w′).
In this back propagation phase we will update the weights of all connections from the hidden layer to the output layer (w′11, w′12, w′13, and so on).

Step 1: Gradient of E with respect to w′11:

∂E/∂w′11 = (∂E/∂u_1) · (∂u_1/∂w′11)

Now, as we know, the total loss is

E = -u_{j*} + log Σ_{k=1..V} exp(u_k)

Taking the derivative of the log term with the chain rule,

∂E/∂u_1 = -t_1 + exp(u_1) / Σ_{k=1..V} exp(u_k) = y_1 - t_1

We can generalize the above derivative by taking the derivative of E with respect to any u_j:

∂E/∂u_j = y_j - t_j = e_j

Note: t_j = 1 if j = j*, else t_j = 0.
 
That means t_j is the actual output and y_j is the predicted output, so e_j is the error.
So, for the first iteration (where j* = 2):

e_2 = y_2 - 1, and e_j = y_j for every other j.

Now, finally coming back to the main derivative we were trying to solve: since u_1 = Σ_i w′1i · h_i, we have ∂u_1/∂w′11 = h_1, and therefore

∂E/∂w′11 = e_1 · h_1

So the generalized form will look like below:

∂E/∂w′ji = e_j · h_i

Step 2: Updating the weight w′11:

w′11(new) = w′11(old) - e_1 · h_1

Note: To keep it simple I am not considering any learning rate here.

 
Similarly we can update the remaining hidden-to-output weights (w′12, w′13, and so on).
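A minimal numpy sketch of this hidden-to-output update for one training example, with no learning rate, matching the derivation above (the function name is my own):

```python
import numpy as np

def update_w_prime(w_prime, y, t, h):
    e = y - t                  # error vector e_j = y_j - t_j, shape (V,)
    dw_prime = np.outer(e, h)  # dE/dw' = outer product of e and h, shape (V, N)
    return w_prime - dw_prime  # no learning rate, as in the post
```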
Now let’s update the first weight matrix (w).

Step 1: Gradient of E with respect to w11:
In this back propagation phase we will update the weights of all connections from the input layer to the hidden layer (w11, w12, and so on).

Again I will take you through only the first weight (w11):

∂E/∂w11 = (∂E/∂h_1) · (∂h_1/∂w11)

Note that a change in h_1 affects E through all of u_1 to u_5.

So,

∂E/∂h_1 = Σ_{j=1..5} (∂E/∂u_j) · (∂u_j/∂h_1) = Σ_{j=1..5} e_j · w′j1

As for u_j and h_1,

u_j = Σ_i w′ji · h_i   so   ∂u_j/∂h_1 = w′j1
h_1 = Σ_k w1k · x_k    so   ∂h_1/∂w11 = x_1

Now finally:

∂E/∂w11 = ( Σ_{j=1..5} e_j · w′j1 ) · x_1
Step 2: Updating the weight w11:

w11(new) = w11(old) - ( Σ_{j=1..5} e_j · w′j1 ) · x_1
Note: Again to keep it simple I am not considering any learning rate here.
 
Similarly we can update the remaining input-to-hidden weights (w12, w13, and so on).
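A minimal numpy sketch of this input-to-hidden update for one training example, again with no learning rate (the function name is my own):

```python
import numpy as np

def update_w(w, w_prime, y, t, x):
    e = y - t              # error vector, shape (V,)
    dh = w_prime.T @ e     # dE/dh = w'^T e, shape (N,)
    dw = np.outer(dh, x)   # dE/dw, shape (N, V); only the column for x is non-zero
    return w - dw          # no learning rate, as in the post
```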

Final CBOW Word2Vec:

After repeating the above training process for many iterations, the final trained model ends up with properly tuned weights. At the end of the entire training process we take the first weight matrix (w) as the word vectors for our given sentence.
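Tying the steps together, here is a minimal end-to-end training-loop sketch under the same setup (all names, the epoch count, and the small learning rate are my own assumptions; the post itself omits the learning rate):

```python
import numpy as np

tokens = ["i", "like", "natural", "language", "processing"]
vocab = list(dict.fromkeys(tokens))
idx = {w: i for i, w in enumerate(vocab)}
V, N = len(vocab), 3

X = np.eye(V)[[idx[w] for w in tokens[:-1]]]   # one-hot context words
T = np.eye(V)[[idx[w] for w in tokens[1:]]]    # one-hot target words

rng = np.random.default_rng(0)
w = rng.uniform(-0.8, 0.8, (N, V))
w_prime = rng.uniform(-0.8, 0.8, (V, N))
lr = 0.05                                      # small learning rate (my assumption)

for epoch in range(2000):
    for x, t in zip(X, T):
        h = w @ x                              # forward pass
        u = w_prime @ h
        y = np.exp(u - u.max()); y /= y.sum()
        e = y - t                              # error
        dh = w_prime.T @ e                     # gradient flowing back to the hidden layer
        w_prime -= lr * np.outer(e, h)         # update hidden-to-output weights
        w -= lr * np.outer(dh, x)              # update input-to-hidden weights
```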
 
For example:
 
Our sentence or document was “i like natural language processing”.
And let’s say the first weight matrix of our trained model is:
[Figure: the trained input-to-hidden weight matrix w]
So the word2vec vector of the word “i” will be [w11, w12, w13], i.e. the part of w that corresponds to “i”.
 
Similarly, the word2vec vectors for all the other words come from the corresponding parts of w.
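A minimal sketch of reading out a word vector from the trained w and finding the most similar word by cosine similarity (the helper names and the column-lookup convention are my own assumptions):

```python
import numpy as np

def word_vector(word, w, word_to_index):
    return w[:, word_to_index[word]]          # the slice of w for this word

def most_similar(word, w, word_to_index):
    v = word_vector(word, w, word_to_index)
    best, best_sim = None, -1.0
    for other, i in word_to_index.items():
        if other == word:
            continue
        u = w[:, i]
        sim = v @ u / (np.linalg.norm(v) * np.linalg.norm(u))  # cosine similarity
        if sim > best_sim:
            best, best_sim = other, sim
    return best, best_sim
```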

Conclusion:

Now, to conclude, let’s recall our generalized equations:
 

Forward Propagation:

h = w · x
u = w′ · h
y_j = exp(u_j) / Σ_{k=1..V} exp(u_k)
E = -log(y_{j*})

Back Propagation:

1. Update the second weight matrix (w′), element by element:

w′ji(new) = w′ji(old) - e_j · h_i

General Vector Notation:

w′(new) = w′(old) - e ⊗ h
Note: The symbol ⊗ denotes the outer product. In Python this can be obtained using the numpy.outer method.
 

2. Update the first weight matrix (w), element by element:

w_ik(new) = w_ik(old) - ( Σ_{j=1..V} e_j · w′ji ) · x_k

General Vector Notation:

w(new) = w(old) - (w′ᵀ · e) ⊗ x

Note: The symbol ⊗ denotes the outer product. In Python this can be obtained using the numpy.outer method.
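As a tiny usage sketch of that note (the arrays here are made-up values, not from the post):

```python
import numpy as np

e = np.array([0.1, -0.9, 0.3, 0.2, 0.3])   # error vector, length V
h = np.array([0.5, 0.1, 0.4])              # hidden layer, length N
print(np.outer(e, h).shape)                # (5, 3), the same shape as w'
```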

If you have any questions or suggestions regarding this topic, see you in the comment section. I will try my best to answer.
