Latent Dirichlet Allocation (LDA) is a generative probabilistic model: it describes how observed documents can be generated from latent (hidden) topic variables.
In this post I will show you how Latent Dirichlet Allocation works under the hood.
Let's say we have some comments (listed below) and we want to cluster those comments based on the topics they cover.
‘play football on holiday’
‘I like to watch football on holiday’
‘la liga match is on this holiday’
‘the vehicle is being rightly advertised as smooth and silent’
‘the vehicle has good pickup and is very comfortable to drive’
‘mileage of this vehicle is around 14kmpl’
‘the vehicle is a 7 seater MPV and it generates 300 Nm torque with a diesel engine’
To apply LDA to any corpus, every document first needs to pass through a few pre-processing steps:
1. Tokenize the text
2. Assign a word ID to each unique word
3. Replace the words in each document with their word IDs
4. Calculate the count matrices
After the above pre-processing steps we need to calculate the count matrices. To do that we first randomly assign a topic to each word/token in each document, and then build a word-topic count matrix and a document-topic count matrix; a code sketch of these steps follows below.
For this example we will keep the maximum number of topics equal to 2. That means while randomly assigning a topic to each word, every token falls under either Topic 1 or Topic 2.
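To make these pre-processing steps concrete, here is a minimal sketch in plain Python (this is my own illustration, not library code; the tokenization is a simple whitespace split and the initial topics are assigned at random, so the resulting counts will differ from the particular random assignment used in the tables below):

```python
import random

random.seed(42)  # any seed works; the tables below come from one particular random assignment

documents = [
    "play football on holiday",
    "I like to watch football on holiday",
    "la liga match is on this holiday",
    "the vehicle is being rightly advertised as smooth and silent",
    "the vehicle has good pickup and is very comfortable to drive",
    "mileage of this vehicle is around 14kmpl",
    "the vehicle is a 7 seater MPV and it generates 300 Nm torque with a diesel engine",
]

T = 2  # maximum number of topics

# 1. Tokenize text (simple whitespace split)
tokenized = [doc.split() for doc in documents]

# 2. Assign word IDs to each unique word
word2id = {}
for doc in tokenized:
    for word in doc:
        if word not in word2id:
            word2id[word] = len(word2id)
W = len(word2id)  # vocabulary size (44 for this corpus)

# 3. Replace words with their word IDs
docs_as_ids = [[word2id[w] for w in doc] for doc in tokenized]

# 4. Randomly assign a topic to every token and build the two count matrices
topic_assignment = [[random.randrange(T) for _ in doc] for doc in docs_as_ids]

word_topic = [[0] * T for _ in range(W)]                 # word-topic count matrix (W x T)
doc_topic = [[0] * T for _ in range(len(docs_as_ids))]   # document-topic count matrix (D x T)

for d, doc in enumerate(docs_as_ids):
    for pos, w in enumerate(doc):
        t = topic_assignment[d][pos]
        word_topic[w][t] += 1
        doc_topic[d][t] += 1

topic_totals = [sum(word_topic[w][t] for w in range(W)) for t in range(T)]
print("Vocabulary size:", W)
print("Tokens per topic:", topic_totals)
```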
Word-topic count matrix:
| Token/Word | Topic 1 Count | Topic 2 Count |
|---|---|---|
| play | 0 | 1 |
| football | 1 | 1 |
| on | 1 | 2 |
| holiday | 3 | 0 |
| I | 1 | 0 |
| like | 0 | 1 |
| to | 0 | 2 |
| watch | 1 | 0 |
| la | 0 | 1 |
| liga | 1 | 0 |
| match | 1 | 0 |
| is | 2 | 3 |
| this | 1 | 1 |
| the | 1 | 2 |
| vehicle | 2 | 2 |
| being | 0 | 1 |
| rightly | 0 | 1 |
| advertised | 0 | 1 |
| as | 0 | 1 |
| smooth | 0 | 1 |
| and | 1 | 2 |
| silent | 1 | 0 |
| has | 0 | 1 |
| good | 1 | 0 |
| pickup | 1 | 0 |
| very | 1 | 0 |
| comfortable | 1 | 0 |
| drive | 0 | 1 |
| mileage | 1 | 0 |
| of | 0 | 1 |
| around | 0 | 1 |
| 14kmpl | 1 | 0 |
| a | 1 | 1 |
| 7 | 1 | 0 |
| seater | 0 | 1 |
| MPV | 0 | 1 |
| it | 1 | 0 |
| generates | 0 | 1 |
| 300 | 1 | 0 |
| Nm | 0 | 1 |
| torque | 1 | 0 |
| with | 1 | 0 |
| diesel | 1 | 0 |
| engine | 1 | 0 |
Let's also compute the total count of Topic 1 and Topic 2 across all words by summing the corresponding columns: 31 for Topic 1 and 32 for Topic 2. That means across the whole corpus Topic 1 is assigned 31 times and Topic 2 is assigned 32 times.
We will use these counts later.
Now we will generate a document-topic count matrix, where the counts correspond to the number of tokens assigned to each topic for each document.
Document-topic count matrix:
| Document | Topic 1 Count | Topic 2 Count |
|---|---|---|
| 1 | 2 | 2 |
| 2 | 4 | 3 |
| 3 | 5 | 2 |
| 4 | 1 | 9 |
| 5 | 7 | 4 |
| 6 | 2 | 5 |
| 7 | 10 | 7 |
Now it's time for the main part of LDA, which is collapsed Gibbs sampling.
The formula for collapsed Gibbs sampling is:

P(z_i = j | z_{-i}, w_i, d_i) ∝ (C^{WT}_{w_i,j} + β) / (Σ_{w=1}^{W} C^{WT}_{w,j} + Wβ) × (C^{DT}_{d_i,j} + α) / (Σ_{t=1}^{T} C^{DT}_{d_i,t} + Tα)

Here z_i is the topic assigned to the i-th token, w_i is that token's word, and d_i is the document it belongs to.
Let me explain the parameters of the above formula.
Parameter Explanation:
a. C^{WT}_{w,j} = the number of times word w is currently assigned to topic j at the start of the iteration. This comes from the word-topic count matrix.
For example, before starting iteration 1 we can see from the word-topic count matrix that, across all documents/comments, the word "holiday" is assigned to Topic 1 three times and to Topic 2 zero times.
Similarly, the word "football" is assigned to Topic 1 one time and to Topic 2 one time.
| Token/Word | Topic 1 Count | Topic 2 Count |
|---|---|---|
| play | 0 | 1 |
| football | 1 | 1 |
| on | 1 | 2 |
| holiday | 3 | 0 |
b. β = concentration parameter of the per-topic word distribution.
c. W = size of the vocabulary (the number of unique tokens/words across all documents).
d. C^{DT}_{d,j} = the number of tokens in document d currently assigned to topic j at the start of the iteration. This comes from the document-topic count matrix.
For example, before starting iteration 1 we can see from the document-topic count matrix that document 2 has four tokens assigned to Topic 1 and three tokens assigned to Topic 2.
Similarly, document 4 has one token assigned to Topic 1 and nine tokens assigned to Topic 2.
| Document | Topic 1 Count | Topic 2 Count |
|---|---|---|
| 1 | 2 | 2 |
| 2 | 4 | 3 |
| 3 | 5 | 2 |
| 4 | 1 | 9 |
e. α = concentration parameter of the per-document topic distribution.
f. T = number of topics (here T = 2).
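To make the formula concrete, here is a minimal sketch of the per-topic probability calculation as this post uses it, with the counts taken directly from the matrices above (the function and variable names are my own):

```python
def topic_probability(word_topic_count, topic_total, doc_topic_count, doc_total,
                      alpha, beta, W, T):
    """Unnormalized probability of one topic for one (word, document) pair.

    word_topic_count: C^WT_{w,j}, times word w is assigned to topic j
    topic_total:      sum over the vocabulary of C^WT_{w,j} for topic j
    doc_topic_count:  C^DT_{d,j}, tokens of document d assigned to topic j
    doc_total:        total number of tokens in document d
    """
    word_part = (word_topic_count + beta) / (topic_total + W * beta)
    doc_part = (doc_topic_count + alpha) / (doc_total + T * alpha)
    return word_part * doc_part
```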
Latent Dirichlet Allocation under the hood (LDA Steps):
Gibbs sampling has to go through many iterations to arrive at a good result.
Let's observe one iteration closely to understand what Gibbs sampling is doing in LDA and what kind of output it produces. The steps are listed below.
Probability Calculation:
Calculate the probability of each topic for each word in each document using the collapsed Gibbs sampling equation shown above. After doing this we will have a table like the one below.
Let's walk through the Topic 1 probability for the very first word, "play", in document 1.
General parameter initialization:
First let’s initialize α = 1 and β = 0.001
And we already know the total number of unique words: W = 44.
Parameters from word-topic count matrix:
At the start of the iteration, the number of times the word "play" is assigned to Topic 1 is C^{WT}_{play,1} = 0.
And from the word-topic count matrix we know that, across the whole corpus, Topic 1 appears 31 times.
So the word part of the equation for Topic 1 is:
(C^{WT}_{play,1} + β) / (Σ_w C^{WT}_{w,1} + Wβ) = (0 + 0.001) / (31 + 44 × 0.001) ≈ 0.0000322
Please refer to the word-topic count matrix if you have any confusion about the above.
Parameters from document-topic count matrix:
At the start of the iteration, the number of tokens in document 1 assigned to Topic 1 is C^{DT}_{1,1} = 2.
And the total number of tokens in document 1 assigned to Topic 1 and Topic 2 combined is 2 + 2 = 4.
Please refer to the document-topic count matrix if you have any confusion about the above calculation.
And finally, the total number of topics is T = 2, so the document part of the equation is:
(C^{DT}_{1,1} + α) / (Σ_t C^{DT}_{1,t} + Tα) = (2 + 1) / (4 + 2 × 1) = 0.5
So, putting both parts of the sampling equation together:
P(Topic 1 | "play", document 1) ≈ 0.0000322 × 0.5 ≈ 0.0000161
Similarly you can calculate the Topic 2 probability for the word "play" in document 1.
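Plugging the numbers from the two count matrices into the sketch above reproduces the values you will see in the table below:

```python
alpha, beta, W, T = 1.0, 0.001, 44, 2

# Word "play" in document 1
p_topic1 = topic_probability(word_topic_count=0, topic_total=31,
                             doc_topic_count=2, doc_total=4,
                             alpha=alpha, beta=beta, W=W, T=T)
p_topic2 = topic_probability(word_topic_count=1, topic_total=32,
                             doc_topic_count=2, doc_total=4,
                             alpha=alpha, beta=beta, W=W, T=T)
print(round(p_topic1, 10), round(p_topic2, 10))  # ~0.0000161062 and ~0.0156191487
```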
Now let's see the topic probabilities for all tokens (calculated in the same way as shown above):
| Document | Token Position in Doc | Word/Token | Previous Iteration Topic | Probability for Topic 1 | Probability for Topic 2 |
|---|---|---|---|---|---|
| 1 | 1 | play | Topic 2 | 0.0000161062 | 0.0156191487 |
| 1 | 2 | football | Topic 2 | 0.0161222781 | 0.0156191487 |
| 1 | 3 | on | Topic 1 | 0.0161222781 | 0.0312226938 |
| 1 | 4 | holiday | Topic 1 | 0.0483346218 | 0.0000156035 |
| 2 | 1 | I | Topic 1 | 0.0179136423 | 0.0000138698 |
| 2 | 2 | like | Topic 2 | 0.0000178957 | 0.0138836877 |
| 2 | 3 | to | Topic 2 | 0.0000178957 | 0.0277535056 |
| 2 | 4 | watch | Topic 1 | 0.0179136423 | 0.0000138698 |
| 2 | 5 | football | Topic 1 | 0.0179136423 | 0.0138836877 |
| 2 | 6 | on | Topic 2 | 0.0179136423 | 0.0277535056 |
| 2 | 7 | holiday | Topic 1 | 0.0537051354 | 0.0000138698 |
| 3 | 1 | la | Topic 2 | 0.0000214749 | 0.0104127658 |
| 3 | 2 | liga | Topic 1 | 0.0214963707 | 0.0000104024 |
| 3 | 3 | match | Topic 1 | 0.0214963707 | 0.0000104024 |
| 3 | 4 | is | Topic 1 | 0.0429712666 | 0.0312174926 |
| 3 | 5 | on | Topic 2 | 0.0214963707 | 0.0208151292 |
| 3 | 6 | this | Topic 1 | 0.0214963707 | 0.0104127658 |
| 3 | 7 | holiday | Topic 1 | 0.0644461624 | 0.0000104024 |
| 4 | 1 | the | Topic 2 | 0.0053740927 | 0.0520378230 |
| 4 | 2 | vehicle | Topic 2 | 0.0107428166 | 0.0520378230 |
| 4 | 3 | is | Topic 2 | 0.0107428166 | 0.0780437315 |
| 4 | 4 | being | Topic 2 | 0.0000053687 | 0.0260319145 |
| 4 | 5 | rightly | Topic 2 | 0.0000053687 | 0.0260319145 |
| 4 | 6 | advertised | Topic 2 | 0.0000053687 | 0.0260319145 |
| 4 | 7 | as | Topic 2 | 0.0000053687 | 0.0260319145 |
| 4 | 8 | smooth | Topic 2 | 0.0000053687 | 0.0260319145 |
| 4 | 9 | and | Topic 2 | 0.0053740927 | 0.0520378230 |
| 4 | 10 | silent | Topic 1 | 0.0053740927 | 0.0000260059 |
| 5 | 1 | the | Topic 1 | 0.0198428038 | 0.0240174568 |
| 5 | 2 | vehicle | Topic 1 | 0.0396657845 | 0.0240174568 |
| 5 | 3 | has | Topic 2 | 0.0000198230 | 0.0120147297 |
| 5 | 4 | good | Topic 1 | 0.0198428038 | 0.0000120027 |
| 5 | 5 | pickup | Topic 1 | 0.0198428038 | 0.0000120027 |
| 5 | 6 | and | Topic 2 | 0.0198428038 | 0.0240174568 |
| 5 | 7 | is | Topic 1 | 0.0396657845 | 0.0360201838 |
| 5 | 8 | very | Topic 1 | 0.0198428038 | 0.0000120027 |
| 5 | 9 | comfortable | Topic 1 | 0.0198428038 | 0.0000120027 |
| 5 | 10 | to | Topic 2 | 0.0000198230 | 0.0240174568 |
| 5 | 11 | drive | Topic 2 | 0.0000198230 | 0.0120147297 |
| 6 | 1 | mileage | Topic 1 | 0.0107481854 | 0.0000208047 |
| 6 | 2 | of | Topic 2 | 0.0000107374 | 0.0208255316 |
| 6 | 3 | this | Topic 2 | 0.0107481854 | 0.0208255316 |
| 6 | 4 | vehicle | Topic 2 | 0.0214856333 | 0.0416302584 |
| 6 | 5 | is | Topic 2 | 0.0214856333 | 0.0624349852 |
| 6 | 6 | around | Topic 2 | 0.0000107374 | 0.0208255316 |
| 6 | 7 | 14kmpl | Topic 1 | 0.0107481854 | 0.0000208047 |
| 7 | 1 | the | Topic 2 | 0.0186679009 | 0.0262927948 |
| 7 | 2 | vehicle | Topic 1 | 0.0373171526 | 0.0262927948 |
| 7 | 3 | is | Topic 2 | 0.0373171526 | 0.0394326222 |
| 7 | 4 | a | Topic 2 | 0.0186679009 | 0.0131529673 |
| 7 | 5 | 7 | Topic 1 | 0.0186679009 | 0.0000131398 |
| 7 | 6 | seater | Topic 2 | 0.0000186493 | 0.0131529673 |
| 7 | 7 | MPV | Topic 2 | 0.0000186493 | 0.0131529673 |
| 7 | 8 | and | Topic 1 | 0.0186679009 | 0.0262927948 |
| 7 | 9 | it | Topic 1 | 0.0186679009 | 0.0000131398 |
| 7 | 10 | generates | Topic 2 | 0.0000186493 | 0.0131529673 |
| 7 | 11 | 300 | Topic 1 | 0.0186679009 | 0.0000131398 |
| 7 | 12 | Nm | Topic 2 | 0.0000186493 | 0.0131529673 |
| 7 | 13 | torque | Topic 1 | 0.0186679009 | 0.0000131398 |
| 7 | 14 | with | Topic 1 | 0.0186679009 | 0.0000131398 |
| 7 | 15 | a | Topic 1 | 0.0186679009 | 0.0131529673 |
| 7 | 16 | diesel | Topic 1 | 0.0186679009 | 0.0000131398 |
| 7 | 17 | engine | Topic 1 | 0.0186679009 | 0.0000131398 |
In this table the last two columns are the output of the first stage (probability calculation).
Final Topic Calculation:
This stage is quite easy: for each word in each document, LDA picks the topic with the higher of the two probabilities as the final topic for that word in that document.
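In code this final-topic decision is just picking the index of the larger probability, for example:

```python
# Using the two probabilities computed above for "play" in document 1
probabilities = [p_topic1, p_topic2]
final_topic = probabilities.index(max(probabilities)) + 1  # 1-based topic number
print("Final topic:", final_topic)  # Topic 2, since p_topic2 > p_topic1
```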
Let’s find the output of this stage:
| Document | Token Position in Doc | Word/Token | Previous Iteration Topic | Probability for Topic 1 | Probability for Topic 2 | Final Topic |
|---|---|---|---|---|---|---|
| 1 | 1 | play | Topic 2 | 0.0000161062 | 0.0156191487 | Topic 2 |
| 1 | 2 | football | Topic 2 | 0.0161222781 | 0.0156191487 | Topic 1 |
| 1 | 3 | on | Topic 1 | 0.0161222781 | 0.0312226938 | Topic 2 |
| 1 | 4 | holiday | Topic 1 | 0.0483346218 | 0.0000156035 | Topic 1 |
| 2 | 1 | I | Topic 1 | 0.0179136423 | 0.0000138698 | Topic 1 |
| 2 | 2 | like | Topic 2 | 0.0000178957 | 0.0138836877 | Topic 2 |
| 2 | 3 | to | Topic 2 | 0.0000178957 | 0.0277535056 | Topic 2 |
| 2 | 4 | watch | Topic 1 | 0.0179136423 | 0.0000138698 | Topic 1 |
| 2 | 5 | football | Topic 1 | 0.0179136423 | 0.0138836877 | Topic 1 |
| 2 | 6 | on | Topic 2 | 0.0179136423 | 0.0277535056 | Topic 2 |
| 2 | 7 | holiday | Topic 1 | 0.0537051354 | 0.0000138698 | Topic 1 |
| 3 | 1 | la | Topic 2 | 0.0000214749 | 0.0104127658 | Topic 2 |
| 3 | 2 | liga | Topic 1 | 0.0214963707 | 0.0000104024 | Topic 1 |
| 3 | 3 | match | Topic 1 | 0.0214963707 | 0.0000104024 | Topic 1 |
| 3 | 4 | is | Topic 1 | 0.0429712666 | 0.0312174926 | Topic 1 |
| 3 | 5 | on | Topic 2 | 0.0214963707 | 0.0208151292 | Topic 1 |
| 3 | 6 | this | Topic 1 | 0.0214963707 | 0.0104127658 | Topic 1 |
| 3 | 7 | holiday | Topic 1 | 0.0644461624 | 0.0000104024 | Topic 1 |
| 4 | 1 | the | Topic 2 | 0.0053740927 | 0.0520378230 | Topic 2 |
| 4 | 2 | vehicle | Topic 2 | 0.0107428166 | 0.0520378230 | Topic 2 |
| 4 | 3 | is | Topic 2 | 0.0107428166 | 0.0780437315 | Topic 2 |
| 4 | 4 | being | Topic 2 | 0.0000053687 | 0.0260319145 | Topic 2 |
| 4 | 5 | rightly | Topic 2 | 0.0000053687 | 0.0260319145 | Topic 2 |
| 4 | 6 | advertised | Topic 2 | 0.0000053687 | 0.0260319145 | Topic 2 |
| 4 | 7 | as | Topic 2 | 0.0000053687 | 0.0260319145 | Topic 2 |
| 4 | 8 | smooth | Topic 2 | 0.0000053687 | 0.0260319145 | Topic 2 |
| 4 | 9 | and | Topic 2 | 0.0053740927 | 0.0520378230 | Topic 2 |
| 4 | 10 | silent | Topic 1 | 0.0053740927 | 0.0000260059 | Topic 1 |
| 5 | 1 | the | Topic 1 | 0.0198428038 | 0.0240174568 | Topic 2 |
| 5 | 2 | vehicle | Topic 1 | 0.0396657845 | 0.0240174568 | Topic 1 |
| 5 | 3 | has | Topic 2 | 0.0000198230 | 0.0120147297 | Topic 2 |
| 5 | 4 | good | Topic 1 | 0.0198428038 | 0.0000120027 | Topic 1 |
| 5 | 5 | pickup | Topic 1 | 0.0198428038 | 0.0000120027 | Topic 1 |
| 5 | 6 | and | Topic 2 | 0.0198428038 | 0.0240174568 | Topic 2 |
| 5 | 7 | is | Topic 1 | 0.0396657845 | 0.0360201838 | Topic 1 |
| 5 | 8 | very | Topic 1 | 0.0198428038 | 0.0000120027 | Topic 1 |
| 5 | 9 | comfortable | Topic 1 | 0.0198428038 | 0.0000120027 | Topic 1 |
| 5 | 10 | to | Topic 2 | 0.0000198230 | 0.0240174568 | Topic 2 |
| 5 | 11 | drive | Topic 2 | 0.0000198230 | 0.0120147297 | Topic 2 |
| 6 | 1 | mileage | Topic 1 | 0.0107481854 | 0.0000208047 | Topic 1 |
| 6 | 2 | of | Topic 2 | 0.0000107374 | 0.0208255316 | Topic 2 |
| 6 | 3 | this | Topic 2 | 0.0107481854 | 0.0208255316 | Topic 2 |
| 6 | 4 | vehicle | Topic 2 | 0.0214856333 | 0.0416302584 | Topic 2 |
| 6 | 5 | is | Topic 2 | 0.0214856333 | 0.0624349852 | Topic 2 |
| 6 | 6 | around | Topic 2 | 0.0000107374 | 0.0208255316 | Topic 2 |
| 6 | 7 | 14kmpl | Topic 1 | 0.0107481854 | 0.0000208047 | Topic 1 |
| 7 | 1 | the | Topic 2 | 0.0186679009 | 0.0262927948 | Topic 2 |
| 7 | 2 | vehicle | Topic 1 | 0.0373171526 | 0.0262927948 | Topic 1 |
| 7 | 3 | is | Topic 2 | 0.0373171526 | 0.0394326222 | Topic 2 |
| 7 | 4 | a | Topic 2 | 0.0186679009 | 0.0131529673 | Topic 1 |
| 7 | 5 | 7 | Topic 1 | 0.0186679009 | 0.0000131398 | Topic 1 |
| 7 | 6 | seater | Topic 2 | 0.0000186493 | 0.0131529673 | Topic 2 |
| 7 | 7 | MPV | Topic 2 | 0.0000186493 | 0.0131529673 | Topic 2 |
| 7 | 8 | and | Topic 1 | 0.0186679009 | 0.0262927948 | Topic 2 |
| 7 | 9 | it | Topic 1 | 0.0186679009 | 0.0000131398 | Topic 1 |
| 7 | 10 | generates | Topic 2 | 0.0000186493 | 0.0131529673 | Topic 2 |
| 7 | 11 | 300 | Topic 1 | 0.0186679009 | 0.0000131398 | Topic 1 |
| 7 | 12 | Nm | Topic 2 | 0.0000186493 | 0.0131529673 | Topic 2 |
| 7 | 13 | torque | Topic 1 | 0.0186679009 | 0.0000131398 | Topic 1 |
| 7 | 14 | with | Topic 1 | 0.0186679009 | 0.0000131398 | Topic 1 |
| 7 | 15 | a | Topic 1 | 0.0186679009 | 0.0131529673 | Topic 1 |
| 7 | 16 | diesel | Topic 1 | 0.0186679009 | 0.0000131398 | Topic 1 |
| 7 | 17 | engine | Topic 1 | 0.0186679009 | 0.0000131398 | Topic 1 |
Here you can see that the topic of some words has changed after iteration one.
That's it. The last column shows the final topic of each word in each document at the end of one iteration. If we run, say, three iterations, the output of one iteration is provided to the next iteration as input, and it goes on like this for many more iterations.
And if you are still confused about how the probability is calculated, please have another look at the collapsed Gibbs sampling equation shown above.
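If you want to see the whole loop in one place, here is a minimal sketch of one full iteration that ties the earlier snippets together; it reuses topic_probability and the count matrices built above, and, like this post, it simply picks the highest-probability topic for each token (a simplified illustration, not an optimized implementation):

```python
def run_iteration(docs_as_ids, word_topic, doc_topic, alpha, beta, W, T):
    """One iteration as described in this post: compute every token's topic
    probabilities from the start-of-iteration counts, then reassign topics
    and rebuild the count matrices for the next iteration."""
    topic_totals = [sum(word_topic[w][t] for w in range(W)) for t in range(T)]
    doc_totals = [sum(row) for row in doc_topic]

    # Stage 1: probability calculation and final-topic choice for every token
    new_assignment = []
    for d, doc in enumerate(docs_as_ids):
        new_topics = []
        for w in doc:
            probs = [
                topic_probability(word_topic[w][t], topic_totals[t],
                                  doc_topic[d][t], doc_totals[d],
                                  alpha, beta, W, T)
                for t in range(T)
            ]
            new_topics.append(probs.index(max(probs)))  # highest-probability topic
        new_assignment.append(new_topics)

    # Stage 2: the new assignment is the input of the next iteration,
    # so rebuild both count matrices from it
    for w in range(W):
        for t in range(T):
            word_topic[w][t] = 0
    for d in range(len(docs_as_ids)):
        for t in range(T):
            doc_topic[d][t] = 0
    for d, doc in enumerate(docs_as_ids):
        for pos, w in enumerate(doc):
            t = new_assignment[d][pos]
            word_topic[w][t] += 1
            doc_topic[d][t] += 1
    return new_assignment

# Run a few iterations; the output of one iteration feeds the next
for _ in range(3):
    topic_assignment = run_iteration(docs_as_ids, word_topic, doc_topic,
                                     alpha=1.0, beta=0.001, W=W, T=2)
```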
Conclusion:
In this post I have discussed:
- What is Latent Dirichlet Allocation
- How Latent Dirichlet Allocation works (LDA under the hood) from scratch
- Steps for Latent Dirichlet Allocation
- Theoretical Explanation of Latent Dirichlet Allocation
If you have any question about this topic, please let me know in the comment section and I will try my best to answer.