**Continuous Bag-Of-Words (CBOW)** and **continuous Skip-gram (SG)**.

**Continuous Bag of Words (CBOW):**

CBOW attempts to predict the output (target word) from its neighboring words (context words). You can think of it as a fill-in-the-blank task, where you guess the missing word by looking at the nearby words.

__Data Preparation__**:** Define the corpus by tokenizing the text.

__Generate Training Data__**:** Build the vocabulary of words, the word index, and the one-hot encoding of each word.

__Train Model__**:** Pass the one-hot encoded words through a **forward pass**, calculate the error by computing the loss, and adjust the weights using **back propagation**.

This post focuses on explaining the CBOW steps rather than on a complete implementation. If you want full working code for **CBOW** with numpy from scratch, I have a separate post for that, and you can jump to it at any time.


**1. Data Preparation:**
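As a rough illustration of this step, here is a minimal tokenization sketch. The example sentence and the `tokenize` helper are my own assumptions (a stand-in text with five unique words), not necessarily the text used in the rest of the post.

```python
import re

# Hypothetical example text (an assumption: any short sentence
# with five unique words would do for this walkthrough)
text = "I like natural language processing"

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

corpus = tokenize(text)
print(corpus)
```

A regular expression is only one simple way to tokenize; for real text you would probably want a proper tokenizer.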

**2. Generate training data:**

__Unique vocabulary__**:** Find the list of unique words. Since our example text has no duplicate words, the unique vocabulary is simply the list of words in the text.

Now let’s construct our training examples. Scanning through the text with a window gives us pairs of a context word and a target word. For example, for the context word “**i**” the target word will be “**like**”. The full training data for our example text is built the same way, one (context, target) pair per window position.
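The vocabulary and the training pairs can be sketched roughly like this. The token list is a hypothetical stand-in, and I am assuming the simplest possible window here: the previous word is the single context word for each target. A wider window would produce more pairs per target.

```python
# Assumed tokens (hypothetical corpus with five unique words)
corpus = ["i", "like", "natural", "language", "processing"]

# Unique vocabulary, kept in order of first appearance
vocab = list(dict.fromkeys(corpus))
word_to_index = {word: index for index, word in enumerate(vocab)}

def make_pairs(tokens):
    """Return (context, target) pairs, using the previous word as context.

    This one-word window is an assumption made to keep the sketch small.
    """
    return [(tokens[i - 1], tokens[i]) for i in range(1, len(tokens))]

training_pairs = make_pairs(corpus)
for context, target in training_pairs:
    print(context, "->", target)
```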

__One-hot encoding__**:** We need to convert the text to one-hot encoding because the algorithm can only work with numeric values.

For example, the word “**i**”, which appears first in the vocabulary, will be encoded as the vector [1, 0, 0, 0, 0]. The word “like”, which appears second in the vocabulary, will be encoded as the vector [0, 1, 0, 0, 0].

So the encoded target word is the **Y variable** for our model and the encoded context word is the **X variable** for our model.
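Here is a small sketch of one-hot encoding for a five-word vocabulary. The word list beyond “i” and “like” and the `one_hot` helper are assumptions for illustration.

```python
# Assumed vocabulary (only "i" and "like" come from the text above)
vocab = ["i", "like", "natural", "language", "processing"]

def one_hot(word, vocab):
    """Return a list with 1 at the word's vocabulary position and 0 elsewhere."""
    vector = [0] * len(vocab)
    vector[vocab.index(word)] = 1
    return vector

x = one_hot("i", vocab)     # encoded context word (X variable)
y = one_hot("like", vocab)  # encoded target word (Y variable)
print(x)
print(y)
```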

**3. Training Model:**

__Dimension (N)__**:** This is the dimension of the word embedding. You can think of each dimension as a feature or attribute of a word, such as organization, name, or gender. It can be 10, 20, 100, etc. Increasing the embedding dimension lets the model describe a word or token in more detail. For example, Google's pre-trained word2vec model has a dimension of 300.


**3.1 Model Architecture:**

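Let's look at the first training example, where the context word is “**i**” and the target word is “like”.

When the one-hot vector for “**i**” is given to the model, it should produce an output close to y = (0, 1, 0, 0, 0), which is the word “like”.

The input layer is connected to the hidden layer by the first weight matrix (**w**), with dimension **[N x V]**

*N: Embedding dimension (number of hidden units)*

*V: Vocabulary size (number of unique words)*

The hidden layer is connected to the output layer by the second weight matrix (**w^{/}**), with dimension **[V x N]**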

So the final vectorized form of the CBOW model looks like this:

**CBOW Vectorized form:**

Here **V** is the vocabulary size (number of unique words) and **N** is the embedding dimension (number of hidden units).
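To make the shapes concrete, here is a hedged sketch of initializing the two weight matrices with numpy. The sizes V = 5 and N = 3, the uniform initialization, and the storage layout are my assumptions: `w` is stored with one row per vocabulary word (V x N) and `w_prime` with one row per hidden unit (N x V), so their transposes act as the N x V and V x N maps of the model.

```python
import numpy as np

V = 5  # vocabulary size (assumed)
N = 3  # embedding dimension / hidden units (assumed)

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# First weight matrix: row i holds the weights leaving input word i
w = rng.uniform(-1, 1, size=(V, N))

# Second weight matrix: row i holds the weights leaving hidden unit i
w_prime = rng.uniform(-1, 1, size=(N, V))

print(w.shape, w_prime.shape)
```

If you prefer to store the matrices directly as N x V and V x N, the transposes in the later steps simply disappear.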


**3.2 Forward Propagation:**

**Calculate hidden layer matrix (h):**

**Calculate output layer matrix (u):**

**Calculate final Softmax output (y):**

So:

y_{1} = Softmax(u_{1})

y_{2} = Softmax(u_{2})

y_{3} = Softmax(u_{3})

y_{4} = Softmax(u_{4})

y_{5} = Softmax(u_{5})

y_{j}=\frac{e^{u_j}}{\sum_{j=1}^{V}e^{u_j}}
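The three forward-propagation steps can be sketched in numpy as follows. The random weights, the sizes V = 5 and N = 3, and the storage layout (`w` as V x N, `w_prime` as N x V, applied through their transposes) are assumptions of this sketch. The softmax subtracts the maximum score first, which does not change the result but avoids overflow.

```python
import numpy as np

def softmax(u):
    """Softmax over the output scores, shifted by max(u) for stability."""
    exp_u = np.exp(u - np.max(u))
    return exp_u / exp_u.sum()

def forward(x, w, w_prime):
    """Return hidden layer h, output scores u and probabilities y."""
    h = w.T @ x          # hidden layer, shape (N,)
    u = w_prime.T @ h    # output layer scores, shape (V,)
    y = softmax(u)       # predicted probabilities, shape (V,)
    return h, u, y

V, N = 5, 3                                   # assumed sizes
rng = np.random.default_rng(0)
w = rng.uniform(-1, 1, size=(V, N))           # assumed random weights
w_prime = rng.uniform(-1, 1, size=(N, V))

x = np.array([1.0, 0.0, 0.0, 0.0, 0.0])       # one-hot context word "i"
h, u, y = forward(x, w, w_prime)
print(y, y.sum())
```

Because x is one-hot, the hidden layer is just the row of `w` that belongs to the context word.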

**3.3 Error Calculation:**

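Next we calculate the error (loss) so that we can update the weights (**w and w^{/}**) accordingly.

E=-log\,p(w_t|w_c)

**w_{t}** = Target word

**w_{c}** = Context word

Let's calculate the error **for the first iteration** only. For the first iteration, “**like**” is the target word and its position is 2.

E=-u_{j^{*}}+log\sum_{j=1}^{V}e^{u_{j}}

Here j^{*} is the index of the target word. The index of “**like**” is 2.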
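As a small sketch, the loss formula can be computed like this. The score vector `u` is made up for illustration, and the helper name `cbow_loss` is my own. Note that position 2 in the text becomes index 1 in zero-based Python.

```python
import numpy as np

def cbow_loss(u, target_index):
    """E = -u_{j*} + log(sum_j exp(u_j)), with a shift by max(u) for stability."""
    shift = np.max(u)
    log_sum_exp = shift + np.log(np.sum(np.exp(u - shift)))
    return -u[target_index] + log_sum_exp

u = np.array([0.1, 0.4, -0.2, 0.3, 0.0])  # hypothetical output scores
target_index = 1                          # "like" is at position 2 (index 1)

E = cbow_loss(u, target_index)
print(E)
```

This is the same value as the negative log of the softmax probability assigned to the target word.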

**3.4 Back Propagation:**

Now we update the weights (**w and w^{/}**). To update a weight, we calculate the derivative of the loss with respect to that weight and subtract the derivative from the weight. This technique for tuning the weights is called **gradient descent**.

Let's start with the second weight (**w^{/}**).

In back propagation we update all the weights from the hidden layer to the output layer (w^{/}_{11}, w^{/}_{12}, w^{/}_{13} …. w^{/}_{15})

**Step1: Gradient of E with respect to w ^{/}_{11}:**

So,

\frac{dE(y_1)}{du_1}=-\frac{du_1}{du_1}+\frac{d[log(e^{u_1}+e^{u_2}+e^{u_3}+e^{u_4}+e^{u_5})]}{du_1}\\\\ =-1+\frac{d[log(e^{u_1}+e^{u_2}+e^{u_3}+e^{u_4}+e^{u_5})]}{d(e^{u_1}+e^{u_2}+e^{u_3}+e^{u_4}+e^{u_5})}\cdot\frac{d(e^{u_1}+e^{u_2}+e^{u_3}+e^{u_4}+e^{u_5})}{du_1}

In general, the gradient of **E** with respect to **u_{j}** is:

\frac{dE}{du_j}=y_j-t_j=e_j

**Note:** t_{j} = 1 if j = j^{*}, else t_{j} = 0

t_{j} is the actual output and y_{j} is the predicted output. So e_{j} is the error.
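If you want to convince yourself that the error term really is the gradient of the loss with respect to the output scores, a quick numerical check helps. The sketch below compares e = y - t with a finite-difference estimate; the score values are hypothetical.

```python
import numpy as np

def softmax(u):
    exp_u = np.exp(u - np.max(u))
    return exp_u / exp_u.sum()

def loss(u, target_index):
    """E = -u_{j*} + log(sum_j exp(u_j))"""
    return -u[target_index] + np.log(np.sum(np.exp(u)))

u = np.array([0.1, 0.4, -0.2, 0.3, 0.0])  # hypothetical output scores
target_index = 1                          # target word at position 2

t = np.zeros_like(u)
t[target_index] = 1.0
e = softmax(u) - t                        # analytic gradient: y_j - t_j

# Central finite differences, one output score at a time
eps = 1e-6
numeric = np.zeros_like(u)
for j in range(len(u)):
    step = np.zeros_like(u)
    step[j] = eps
    numeric[j] = (loss(u + step, target_index) - loss(u - step, target_index)) / (2 * eps)

print(e)
print(numeric)
```

The two printed vectors should agree to several decimal places.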

And,

\frac{du_1}{dw^{'}_{11}}=\frac{d(w^{'}_{11}h_1+w^{'}_{21}h_2+w^{'}_{31}h_3)}{dw^{'}_{11}}=h_1

So the generalized form looks like this:

\frac{dE}{dw^{'}}=e*h

**Step2: Updating the weight of w ^{/}_{11}:**

new(w^{'}_{11})=w^{'}_{11}-\frac{dE(y_1)}{dw^{'}_{11}}=(w^{'}_{11}-e_1h_1)

**Note:** To keep it simple, I am not using a learning rate here.

Similarly, we can update the weights w^{/}_{12}, w^{/}_{13} …. w^{/}_{35}.
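All the second-layer weights can be updated in one shot instead of one element at a time. In this sketch the hidden values, the softmax output, and the starting weights are hypothetical, and `w_prime` is assumed to be stored as an N x V array so that entry [i, j] connects hidden unit i to output j.

```python
import numpy as np

# Hypothetical values for one training example (N = 3, V = 5)
h = np.array([0.2, -0.1, 0.5])               # hidden layer output
y = np.array([0.1, 0.3, 0.2, 0.25, 0.15])    # softmax output (sums to 1)
t = np.array([0.0, 1.0, 0.0, 0.0, 0.0])      # one-hot target "like"
w_prime = np.full((3, 5), 0.5)               # assumed current weights

e = y - t                                    # error per output unit

# Entry [i, j] of the gradient is h_i * e_j, e.g. [0, 0] is e_1 * h_1
grad_w_prime = np.outer(h, e)

# Plain gradient descent step without a learning rate, as in the text
w_prime_new = w_prime - grad_w_prime
print(w_prime_new)
```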

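Now let's update the 1^{st} weight (w).

**Step1: Gradient of E with respect to w_{11}:**

In back propagation we update all the weights from the input layer to the hidden layer (w_{11}, w_{12}, ….., w_{53}). Let's take the first weight (w_{11}) as an example. w_{11} affects the error only through h_{1}, and h_{1} feeds into every output unit, so u_{1} to u_{5} all are involved.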

So,

\frac{dE}{dh_{1}}=(\frac{dE}{du_1}\cdot\frac{du_1}{dh_1})+(\frac{dE}{du_2}\cdot\frac{du_2}{dh_1})+(\frac{dE}{du_3}\cdot\frac{du_3}{dh_1})+(\frac{dE}{du_4}\cdot\frac{du_4}{dh_1})+(\frac{dE}{du_5}\cdot\frac{du_5}{dh_1})\\\\ =e_1w^{'}_{11}+e_2w^{'}_{12}+e_3w^{'}_{13}+e_4w^{'}_{14}+e_5w^{'}_{15}

And,

\frac{dh_1}{dw_{11}}=\frac{d(w_{11} x_1+w_{21} x_2+w_{31} x_3+w_{41} x_4+w_{51} x_5)}{dw_{11}}=x_1

Now finally:

\frac{dE}{dw_{11}}=\frac{dE}{dh_1}\cdot\frac{dh_1}{dw_{11}}\\\\ =(e_1w^{'}_{11}+e_2w^{'}_{12}+e_3w^{'}_{13}+e_4w^{'}_{14}+e_5w^{'}_{15})*x_1

**Step2: Updating the weight of w_{11}:**

new(w_{11})=w_{11}-\frac{dE}{dw_{11}}

**Note:** Again, to keep it simple, I am not using a learning rate here.

Similarly, we can update the weights w_{12}, w_{13} … w_{53}.
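The same update can be written for the whole first weight matrix at once. The error vector and the random weights below are hypothetical, and I am again assuming `w` is stored as V x N and `w_prime` as N x V.

```python
import numpy as np

x = np.array([1.0, 0.0, 0.0, 0.0, 0.0])      # one-hot context word
e = np.array([0.1, -0.7, 0.2, 0.25, 0.15])   # hypothetical error y - t

rng = np.random.default_rng(0)
w = rng.uniform(-1, 1, size=(5, 3))          # assumed first weights (V x N)
w_prime = rng.uniform(-1, 1, size=(3, 5))    # assumed second weights (N x V)

# dE/dh_i = sum over j of e_j * w'_{ij}
grad_h = w_prime @ e

# dE/dw_{ki} = x_k * dE/dh_i
grad_w = np.outer(x, grad_h)

# Gradient descent step without a learning rate, as in the text
w_new = w - grad_w
print(grad_w)
```

Because x is one-hot, only the row of `w` that belongs to the context word actually changes.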

**Final CBOW Word2Vec:**

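After training, the first weight matrix (w) holds the word vectors. For example, the vector for the first word in the vocabulary is [w_{11}, w_{12}, w_{13}]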
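Looking up a word vector is then just a row lookup. In this sketch the random matrix stands in for a trained `w`, and the vocabulary and the `word_vector` helper are assumptions for illustration.

```python
import numpy as np

# Assumed vocabulary and a random stand-in for the trained matrix w (V x N)
vocab = ["i", "like", "natural", "language", "processing"]
word_to_index = {word: index for index, word in enumerate(vocab)}

rng = np.random.default_rng(0)
w = rng.uniform(-1, 1, size=(len(vocab), 3))

def word_vector(word):
    """Return the embedding of a word: its row in the first weight matrix."""
    return w[word_to_index[word]]

vec_i = word_vector("i")   # corresponds to [w_11, w_12, w_13]
print(vec_i)
```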

**Conclusion:**

**Forward Propagation:**

h=wx\\ u=w^{'}h\\\\ y_j=softmax(u_j)=\frac{e^{u_j}}{\sum_{j=1}^{V}e^{u_j}}\\ E=-log\,p(w_t|w_c)=-u_{j^{*}}+log\sum_{j=1}^{V}e^{u_j}

**Back Propagation:**

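**1. Update second weight (w^{/}_{11}):**

**General Vector Notation:**

\frac{dE}{dw^{'}}=h\otimes e\\\\ new(w^{'})=w^{'}_{old}-\frac{dE}{dw^{'}}

**Note:** Symbol ⊗ denotes the outer product. In Python this can be obtained using the **numpy.outer** method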

**2. Update first weight (w_{11}):**

**General Vector Notation:**

\frac{dE}{dw}=x\otimes (w^{'}e)\\\\ new(w)=w_{old}-\frac{dE}{dw}

**Note:** Symbol ⊗ denotes the outer product. In Python this can be obtained using the **numpy.outer** method
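Putting the forward pass, the loss, and both updates together, here is a compact training sketch with numpy. Several things are my assumptions rather than part of the walkthrough: the (context index, target index) pairs, the sizes, the random initialization, the number of epochs, and the `learning_rate` parameter (the text above uses none, which corresponds to a value of 1). The matrices are stored as V x N and N x V, so the forward pass applies their transposes.

```python
import numpy as np

def softmax(u):
    exp_u = np.exp(u - np.max(u))
    return exp_u / exp_u.sum()

def train_cbow(pairs, V, N, epochs=200, learning_rate=0.1, seed=0):
    """Train on (context_index, target_index) pairs; return w, w', epoch losses."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(-1, 1, size=(V, N))        # first weights
    w_prime = rng.uniform(-1, 1, size=(N, V))  # second weights
    losses = []
    for _ in range(epochs):
        total = 0.0
        for context_index, target_index in pairs:
            x = np.zeros(V)
            x[context_index] = 1.0
            t = np.zeros(V)
            t[target_index] = 1.0

            # Forward propagation
            h = w.T @ x
            u = w_prime.T @ h
            y = softmax(u)
            total += -np.log(y[target_index])

            # Back propagation (both gradients use the weights before the update)
            e = y - t
            grad_w_prime = np.outer(h, e)
            grad_w = np.outer(x, w_prime @ e)
            w_prime -= learning_rate * grad_w_prime
            w -= learning_rate * grad_w
        losses.append(total)
    return w, w_prime, losses

# Hypothetical training pairs for a five-word vocabulary
pairs = [(0, 1), (1, 2), (2, 3), (3, 4)]
w, w_prime, losses = train_cbow(pairs, V=5, N=3)
print(f"loss: first epoch {losses[0]:.3f}, last epoch {losses[-1]:.3f}")
```

The summed loss should shrink over the epochs; the rows of `w` are then the learned word vectors.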

If you have any questions or suggestions regarding this topic, let me know in the comment section. I will try my best to answer.
