
**1. Tokenize text**

**2. Assign word IDs to each unique word**

**3. Replace words from documents with word IDs**
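The three preprocessing steps above can be sketched in a few lines of Python. This is only a minimal illustration: the two documents are made up, and the tokenizer is a plain lowercase-and-split, whereas a real pipeline would usually also remove punctuation and stop words.

```python
# Toy corpus; these documents are invented for illustration only.
docs = [
    "play football on holiday",
    "I like to watch football on holiday",
]

# Step 1: tokenize (lowercase + whitespace split).
tokenized = [doc.lower().split() for doc in docs]

# Step 2: assign an integer ID to each unique word, in order of first appearance.
word_to_id = {}
for tokens in tokenized:
    for token in tokens:
        if token not in word_to_id:
            word_to_id[token] = len(word_to_id)

# Step 3: replace every word in every document with its ID.
docs_as_ids = [[word_to_id[t] for t in tokens] for tokens in tokenized]

print(word_to_id)
print(docs_as_ids)
```

Words that occur in more than one document (such as "football") share a single ID, which is what later lets us count topic assignments per word.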

**4. Count Matrices Calculation:**

In this step we **randomly assign topics to each word/token** in each document, and then calculate a word-to-topic count matrix and a document-to-topic count matrix.

For this example we will keep the number of topics equal to 2. That means that, when a topic is randomly assigned to each word, it will be either topic 1 or topic 2.
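A minimal sketch of the random assignment, assuming the documents have already been converted to word IDs. The ID lists and the seed are hypothetical and only there to make the example reproducible.

```python
import random

random.seed(0)  # fixed seed so the sketch gives the same result on every run

NUM_TOPICS = 2  # T = 2, as in this example

# Hypothetical documents, already converted to word IDs.
docs_as_ids = [[0, 1, 2, 3], [4, 5, 6, 7, 1, 2, 3]]

# One random topic (1 or 2) per token, with the same shape as docs_as_ids.
topic_assignments = [
    [random.randint(1, NUM_TOPICS) for _ in doc] for doc in docs_as_ids
]

print(topic_assignments)
```

Note that the same word can receive different topics in different positions, because the assignment is made per token and not per unique word.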

**Word to topic Count Matrices:**

Let’s look at the total count of Topic 1 and Topic 2 over all words. We get it by summing the corresponding columns of the matrix: 31 for Topic 1 and 32 for Topic 2. This means that, across all documents, Topic 1 appears 31 times and Topic 2 appears 32 times.

We will use this count later.
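As a rough sketch, the word-to-topic count matrix and its column totals could be built as follows. The tiny corpus and its topic assignments are hypothetical, so the totals here are much smaller than the 31 and 32 of the full example.

```python
from collections import defaultdict

NUM_TOPICS = 2

# Hypothetical tokens and their current (randomly assigned) topics.
docs = [["play", "football", "holiday"], ["watch", "football", "holiday"]]
topics = [[2, 1, 1], [1, 2, 2]]

# word_topic[word][j] = number of times `word` is assigned to topic j + 1.
word_topic = defaultdict(lambda: [0] * NUM_TOPICS)
for doc, doc_topics in zip(docs, topics):
    for word, topic in zip(doc, doc_topics):
        word_topic[word][topic - 1] += 1

# Column sums: how many tokens in the whole corpus carry each topic.
topic_totals = [
    sum(counts[j] for counts in word_topic.values()) for j in range(NUM_TOPICS)
]

for word, counts in word_topic.items():
    print(word, counts)
print("totals", topic_totals)
```

Each row is one unique word and each column is one topic, so summing a column gives the number of tokens currently assigned to that topic.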

**Document-topic count matrix:**
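The document-to-topic count matrix can be sketched the same way: one row per document, one column per topic. The topic assignments below are hypothetical.

```python
NUM_TOPICS = 2

# Hypothetical topic assignments, one inner list per document.
topics = [[2, 1, 1], [1, 2, 2]]

# doc_topic[d][j] = number of tokens in document d assigned to topic j + 1.
doc_topic = [
    [doc_topics.count(j + 1) for j in range(NUM_TOPICS)] for doc_topics in topics
]

# Row sums: every token has exactly one topic, so these are the document lengths.
doc_totals = [sum(row) for row in doc_topic]

print(doc_topic)
print(doc_totals)
```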

**Collapsed Gibbs Sampling**.

**Formula for collapsed Gibbs Sampling:**
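In the notation of the parameter explanation that follows, the collapsed Gibbs sampling update for LDA is usually written as below. This is the standard textbook form, assuming symmetric priors α and β; the index $i$ for the current token is an addition for readability.

$$
P(z_i = j \mid z_{-i}, w_i, d_i) \;\propto\;
\frac{C_{w_i,j}^{WT} + \beta}{\sum_{w=1}^{W} C_{w,j}^{WT} + W\beta}
\times
\frac{C_{d_i,j}^{DT} + \alpha}{\sum_{t=1}^{T} C_{d_i,t}^{DT} + T\alpha}
$$

The first fraction measures how strongly word $w_i$ is associated with topic $j$, and the second how strongly document $d_i$ is associated with topic $j$.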

**Parameter Explanation:**

$C_{w,j}^{WT}$ = at the start of an iteration, the number of times word $w$ is assigned to topic $j$ (here, topic 1 or topic 2). This is read from the word-to-topic count matrix.

For example, the word “football” is assigned to topic 1 once and to topic 2 once.

$C_{d,j}^{DT}$ = at the start of an iteration, the number of tokens in document $d$ that are assigned to topic $j$. This is read from the document-to-topic count matrix.

For example, document 4 has one token assigned to topic 1 and nine tokens assigned to topic 2.

T: number of topics (here T = 2).
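To make the roles of these parameters concrete, here is a hedged sketch of the probability for a single token. The function name `topic_probability` and the small count matrices are hypothetical. Like the walkthrough in this article, it uses the counts as they stand at the start of the iteration; many implementations first remove the current token's own assignment from the counts.

```python
def topic_probability(word, doc, topic, word_topic, doc_topic, alpha, beta):
    """Unnormalised probability of `topic` (0-based) for one token of `word` in `doc`."""
    num_words = len(word_topic)        # W: number of unique words
    num_topics = len(doc_topic[doc])   # T: number of topics

    # Word side: (C_wj^WT + beta) / (sum over words of C_wj^WT + W * beta)
    topic_total = sum(counts[topic] for counts in word_topic.values())
    word_term = (word_topic[word][topic] + beta) / (topic_total + num_words * beta)

    # Document side: (C_dj^DT + alpha) / (sum over topics of C_dt^DT + T * alpha)
    doc_term = (doc_topic[doc][topic] + alpha) / (sum(doc_topic[doc]) + num_topics * alpha)

    return word_term * doc_term


# Hypothetical count matrices for a tiny corpus with 4 unique words and 2 documents.
word_topic = {"play": [0, 1], "football": [1, 1], "holiday": [1, 1], "watch": [1, 0]}
doc_topic = [[2, 1], [1, 2]]

p_topic1 = topic_probability("play", 0, 0, word_topic, doc_topic, alpha=1.0, beta=0.001)
p_topic2 = topic_probability("play", 0, 1, word_topic, doc_topic, alpha=1.0, beta=0.001)
print(p_topic1, p_topic2)
```

The two values do not sum to 1; they only need to be compared with each other, or normalised before a topic is drawn.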

**Latent Dirichlet Allocation under the hood (LDA Steps):**

Let’s observe one iteration closely to understand what Gibbs sampling is doing in LDA and what kind of output we get from it. The steps are listed below.

**Probability Calculation:**

Let’s calculate the probability of **Topic 1** for the very first word, **“play”**, in document 1.

*General parameter initialization:* First, let’s initialize **α = 1 and β = 0.001**.

We already know the total number of unique words: **(W) = 44**.

*Parameters from word-topic count matrix:* At the start of the iteration, the number of times the word “play” is assigned to topic 1 is ($C_{w,j}^{WT}$) = 0.

From the word-to-topic count matrix we also know that, across all documents, Topic 1 appears 31 times. So $\sum_{w=1}^{W} C_{w,j}^{WT} = 31$.
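The word-side factor of the formula for “play” and Topic 1 can be evaluated directly from the numbers given so far (β = 0.001, W = 44, a count of 0 for “play” under Topic 1, and a Topic 1 total of 31). The variable names are only illustrative.

```python
beta = 0.001             # β from the parameter initialization
num_unique_words = 44    # W
count_play_topic1 = 0    # times "play" is assigned to topic 1
total_topic1 = 31        # column sum for topic 1 in the word-topic matrix

# (C_wj^WT + beta) / (sum over words of C_wj^WT + W * beta)
word_term = (count_play_topic1 + beta) / (total_topic1 + num_unique_words * beta)

print(f"{word_term:.3e}")
```

Because “play” has never been assigned to Topic 1, only the small smoothing value β keeps this factor above zero.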

Please refer to the word-to-topic count matrix if anything above is unclear.

*Parameters from document-topic count matrix:* At the start of the iteration, the number of times document 1 is assigned to topic 1 is ($C_{d,j}^{DT}$) = 2.

The total number of times document 1 appears as topic 1 and topic 2 is the sum of the document 1 row of the document-topic count matrix, $\sum_{t=1}^{T} C_{d,t}^{DT}$.

Please refer to the document-topic count matrix if anything in the above calculation is unclear.

And finally, the total number of topics is **(T) = 2**.

Similarly, you can calculate the Topic 2 probability for the word “play” in document 1.

Now let’s see the topic probabilities for all tokens (calculated in the same way as shown above).

In this table, the last two columns are the output of the first stage (probability calculation).

**Final Topic Calculation:**

Let’s find the output of this stage:

And if you are still confused about how the probability is calculated, please have another look at the **collapsed Gibbs sampling equation** shown earlier.
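One common way to turn the two probabilities of a token into its final topic is to normalise them and draw a topic at random in proportion to them. The sketch below assumes that approach; the helper name `sample_topic` and the probability values are hypothetical and not taken from the table above.

```python
import random


def sample_topic(probabilities, rng=random):
    """Draw a 0-based topic index with chance proportional to its probability."""
    total = sum(probabilities)
    weights = [p / total for p in probabilities]  # normalise so the weights sum to 1
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


rng = random.Random(42)  # seeded generator so the sketch is reproducible

# Hypothetical unnormalised probabilities of topic 1 and topic 2 for one token.
probs = [0.0002, 0.1333]

new_topic = sample_topic(probs, rng) + 1  # back to the 1-based topic numbers used here
print(new_topic)
```

After a token receives its new topic, both count matrices would be updated (decrement the old topic, increment the new one) before moving on to the next token.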

__Also Read:__

- Latent Dirichlet Allocation for beginners: A HIGH LEVEL OVERVIEW
- Guide to build best LDA MODEL using GENSIM PYTHON

**Conclusion:**

- What is Latent Dirichlet Allocation
- How Latent Dirichlet Allocation works (LDA under the hood) from scratch
- Steps for Latent Dirichlet Allocation
- Theoretical Explanation of Latent Dirichlet Allocation