Latent Dirichlet Allocation explained

Latent Dirichlet Allocation is a generative probability model, which means it provide distribution of outputs and inputs based on latent variables.

In this post I will show you how Latent Dirichlet Allocation works, the inner view.

Let’s say we have some comments (listed below) and we want to cluster those comments based on topics those documents cover.

Also Read:

‘play football on holiday’

‘I like to watch football on holiday’

‘la liga match is on this holiday’

‘the vehicle is being rightly advertised as smooth and silent’

‘the vehicle has good pickup and is very comfortable to drive’

‘mileage of this vehicle is around 14kmpl’

‘the vehicle is a 7 seater MPV and it generates 300 Nm torque with a diesel engine’

To apply LDA on any document, all documents need to pass through some pre-processing steps.

1. Tokenize text

2. Assign word IDs to each unique word

3. Replace words from documents with word IDs

4. Count Matrices Calculation:

After doing all above steps of pre-processing, now we need to calculate count matrices. To do that first we randomly assign topics to each word/tokenin each document, and then we will calculate a word to topic count matrix and document to topic count matrix.

For this example we will keep maximum topic number equal to 2. That means while randomly assigning topic to each words, they will fall under either topic 1 or topic 2.

Word to topic Count Matrices:

Token/ Word	Topic 1 Count	Topic 2 Count
play	0	1
football	1	1
on	1	2
holiday	3	0
I	1	0
like	0	1
to	0	2
watch	1	0
la	0	1
liga	1	0
match	1	0
is	2	3
this	1	1
the	1	2
vehicle	2	2
being	0	1
rightly	0	1
advertised	0	1
as	0	1
smooth	0	1
and	1	2
silent	1	0
has	0	1
good	1	0
pickup	1	0
very	1	0
comfortable	1	0
drive	0	1
mileage	1	0
of	0	1
around	0	1
14kmpl	1	0
a	1	1
7	1	0
seater	0	1
MPV	0	1
it	1	0
generates	0	1
300	1	0
Nm	0	1
torque	1	0
with	1	0
diesel	1	0
engine	1	0

Let’s see Total count of Topic 1 and Topic 2 for all words. We can calculate it by summing corresponding columns which is 31 for Topic 1 and 32 for Topic 2. Which means for full document Topic 1 appears 31 times and Topic 2 appears 32 times.

We will use this count later.

Now we will generate a document-topic count matrix, where the counts correspond to the number of tokens assigned to each topic for each document.

Document-topic count matrix:

Document	Topic 1 Count	Topic 2 Count
1	2	2
2	4	3
3	5	2
4	1	9
5	7	4
6	2	5
7	10	7

Now it’s time to main part of LDA which is Collapse Gibbs Sampling.

Formula for collapsed Gibbs Sampling:

Let me explain above formula for individual topics.

Parameter Explanation:

a. C_w,j^WT= While starting a iteration, number of times a word appeared as topic 1 and topic 2. This is done by word to topic count matrix.

For example before starting iteration 1, (From word to topic matrix we can see) throughout all document/ comment word “holiday” comes under topic 1 three times and comes under topic 2 zero times.

Similarly word “football” appeared as topic 1 one time and as topic 2 one times.

Also Read: Generate Synthetic Text Data with Faker in Python

Token/ Word	Topic 1 Count	Topic 2 Count
play	0	1
football	1	1
on	1	2
holiday	3	0

b. β = Per topic word distribution(concentration parameter)

c. W = Length of vocabulary (No. of unique token/ word in full document)

d. C_d,j^DT= While starting a iteration number of times a document appeared as topic 1 and topic 2.

For example before starting iteration 1, (From document to topic matrix we can see) document 2 appeared as topic 1 four times and as topic 2 three times.

Similarly document 4 appeared as topic 1 one time and as topic 2 nine times.

Document	Topic 1 Count	Topic 2 Count
1	2	2
2	4	3
3	5	2
4	1	9

e. α : Per document topic distribution.

f. T: Number of topic. (here T = 2)

Latent Dirichlet Allocation under the hood (LDA Steps):

Gibbs sampling should go through many more iterations to come up with optimum best result.

Let’s observe one iteration closely to understand what Gibbs Sampling is doing in LDA; what kind of output we are getting after Gibbs sampling is. Steps are listed below.

Probability Calculation:

Calculate probability of each word in each document. After calculating we will have a table like this. (i.e. probability is calculated by Gibbs sampling equation shown above)

Let’s calculate probability of Topic 1 for very first word “play” for document 1 in hand.

Now for word “play” let’s calculate probability for Topic 1 for document 1:

General parameter initialization:

First let’s initialize α = 1 and β = 0.001
And we already know Total no. of unique words (W) = 44

Parameters from word-topic count matrix:

While starting iteration, number of times a word “play” appeared as topic 1(C_w,j^WT) = 0

Now from word to topic matrix we know for full document Topic 1 appears 31 times.
So;

$small small {sum_{w=1}^{W}C_{w,j}^{WT}} = 31$

Please refer to word to topic count matrix if you have any confusion in above.

Parameters from document-topic count matrix:

While starting iteration number of times document 1 appeared as topic 1 (C_d,j^DT) = 2

And Total Number of times document 1 appears as topic 1 and topic 2:

$small small {sum_{w=1}^{T}C_{d,j}^{DT}} = 2+2=4$

Please refer to document-topic count matrix if you have any confusion in above calculation.

And Finally Total number of unique topic (T) =2

So lets recall the sampling equation again:

Similarly yo can calculate Topic 2 probability for word “play” for document 1.

Also Read: Convert Text to Speech with Deep Learning in Python

Now let’s see topic probability for all tokens (calculated in the same way shown above)

Document	Token Position in Doc	Word/ Token	Previous Iteration Topic	Probability for Topic 1	Probability for Topic 2
1	1	play	Topic 2	0.0000161062	0.0156191487
1	2	football	Topic 2	0.0161222781	0.0156191487
1	3	on	Topic 1	0.0161222781	0.0312226938
1	4	holiday	Topic 1	0.0483346218	0.0000156035
2	1	I	Topic 1	0.0179136423	0.0000138698
2	2	like	Topic 2	0.0000178957	0.0138836877
2	3	to	Topic 2	0.0000178957	0.0277535056
2	4	watch	Topic 1	0.0179136423	0.0000138698
2	5	football	Topic 1	0.0179136423	0.0138836877
2	6	on	Topic 2	0.0179136423	0.0277535056
2	7	holiday	Topic 1	0.0537051354	0.0000138698
3	1	la	Topic 2	0.0000214749	0.0104127658
3	2	liga	Topic 1	0.0214963707	0.0000104024
3	3	match	Topic 1	0.0214963707	0.0000104024
3	4	is	Topic 1	0.0429712666	0.0312174926
3	5	on	Topic 2	0.0214963707	0.0208151292
3	6	this	Topic 1	0.0214963707	0.0104127658
3	7	holiday	Topic 1	0.0644461624	0.0000104024
4	1	the	Topic 2	0.0053740927	0.0520378230
4	2	vehicle	Topic 2	0.0107428166	0.0520378230
4	3	is	Topic 2	0.0107428166	0.0780437315
4	4	being	Topic 2	0.0000053687	0.0260319145
4	5	rightly	Topic 2	0.0000053687	0.0260319145
4	6	advertised	Topic 2	0.0000053687	0.0260319145
4	7	as	Topic 2	0.0000053687	0.0260319145
4	8	smooth	Topic 2	0.0000053687	0.0260319145
4	9	and	Topic 2	0.0053740927	0.0520378230
4	10	silent	Topic 1	0.0053740927	0.0000260059
5	1	the	Topic 1	0.0198428038	0.0240174568
5	2	vehicle	Topic 1	0.0396657845	0.0240174568
5	3	has	Topic 2	0.0000198230	0.0120147297
5	4	good	Topic 1	0.0198428038	0.0000120027
5	5	pickup	Topic 1	0.0198428038	0.0000120027
5	6	and	Topic 2	0.0198428038	0.0240174568
5	7	is	Topic 1	0.0396657845	0.0360201838
5	8	very	Topic 1	0.0198428038	0.0000120027
5	9	comfortable	Topic 1	0.0198428038	0.0000120027
5	10	to	Topic 2	0.0000198230	0.0240174568
5	11	drive	Topic 2	0.0000198230	0.0120147297
6	1	mileage	Topic 1	0.0107481854	0.0000208047
6	2	of	Topic 2	0.0000107374	0.0208255316
6	3	this	Topic 2	0.0107481854	0.0208255316
6	4	vehicle	Topic 2	0.0214856333	0.0416302584
6	5	is	Topic 2	0.0214856333	0.0624349852
6	6	around	Topic 2	0.0000107374	0.0208255316
6	7	14kmpl	Topic 1	0.0107481854	0.0000208047
7	1	the	Topic 2	0.0186679009	0.0262927948
7	2	vehicle	Topic 1	0.0373171526	0.0262927948
7	3	is	Topic 2	0.0373171526	0.0394326222
7	4	a	Topic 2	0.0186679009	0.0131529673
7	5	7	Topic 1	0.0186679009	0.0000131398
7	6	seater	Topic 2	0.0000186493	0.0131529673
7	7	MPV	Topic 2	0.0000186493	0.0131529673
7	8	and	Topic 1	0.0186679009	0.0262927948
7	9	it	Topic 1	0.0186679009	0.0000131398
7	10	generates	Topic 2	0.0000186493	0.0131529673
7	11	300	Topic 1	0.0186679009	0.0000131398
7	12	Nm	Topic 2	0.0000186493	0.0131529673
7	13	torque	Topic 1	0.0186679009	0.0000131398
7	14	with	Topic 1	0.0186679009	0.0000131398
7	15	a	Topic 1	0.0186679009	0.0131529673
7	16	diesel	Topic 1	0.0186679009	0.0000131398
7	17	engine	Topic 1	0.0186679009	0.0000131398

In this table last two columns are the output of first stage (Probability calculation)

Final Topic Calculation:

This stage is quite easy. Based on highest probability of two topics for a word LDA will provide final topic for that particular word in that particular document.

Let’s find the output of this stage:

Document	Token Position in Doc	Word/ Token	Previous Iteration Topic	Probability for Topic 1	Probability for Topic 2	Final Topic
1	1	play	Topic 2	0.0000161062	0.0156191487	Topic 2
1	2	football	Topic 2	0.0161222781	0.0156191487	Topic 1
1	3	on	Topic 1	0.0161222781	0.0312226938	Topic 2
1	4	holiday	Topic 1	0.0483346218	0.0000156035	Topic 1
2	1	I	Topic 1	0.0179136423	0.0000138698	Topic 1
2	2	like	Topic 2	0.0000178957	0.0138836877	Topic 2
2	3	to	Topic 2	0.0000178957	0.0277535056	Topic 2
2	4	watch	Topic 1	0.0179136423	0.0000138698	Topic 1
2	5	football	Topic 1	0.0179136423	0.0138836877	Topic 1
2	6	on	Topic 2	0.0179136423	0.0277535056	Topic 2
2	7	holiday	Topic 1	0.0537051354	0.0000138698	Topic 1
3	1	la	Topic 2	0.0000214749	0.0104127658	Topic 2
3	2	liga	Topic 1	0.0214963707	0.0000104024	Topic 1
3	3	match	Topic 1	0.0214963707	0.0000104024	Topic 1
3	4	is	Topic 1	0.0429712666	0.0312174926	Topic 1
3	5	on	Topic 2	0.0214963707	0.0208151292	Topic 1
3	6	this	Topic 1	0.0214963707	0.0104127658	Topic 1
3	7	holiday	Topic 1	0.0644461624	0.0000104024	Topic 1
4	1	the	Topic 2	0.0053740927	0.0520378230	Topic 2
4	2	vehicle	Topic 2	0.0107428166	0.0520378230	Topic 2
4	3	is	Topic 2	0.0107428166	0.0780437315	Topic 2
4	4	being	Topic 2	0.0000053687	0.0260319145	Topic 2
4	5	rightly	Topic 2	0.0000053687	0.0260319145	Topic 2
4	6	advertised	Topic 2	0.0000053687	0.0260319145	Topic 2
4	7	as	Topic 2	0.0000053687	0.0260319145	Topic 2
4	8	smooth	Topic 2	0.0000053687	0.0260319145	Topic 2
4	9	and	Topic 2	0.0053740927	0.0520378230	Topic 2
4	10	silent	Topic 1	0.0053740927	0.0000260059	Topic 1
5	1	the	Topic 1	0.0198428038	0.0240174568	Topic 2
5	2	vehicle	Topic 1	0.0396657845	0.0240174568	Topic 1
5	3	has	Topic 2	0.0000198230	0.0120147297	Topic 2
5	4	good	Topic 1	0.0198428038	0.0000120027	Topic 1
5	5	pickup	Topic 1	0.0198428038	0.0000120027	Topic 1
5	6	and	Topic 2	0.0198428038	0.0240174568	Topic 2
5	7	is	Topic 1	0.0396657845	0.0360201838	Topic 1
5	8	very	Topic 1	0.0198428038	0.0000120027	Topic 1
5	9	comfortable	Topic 1	0.0198428038	0.0000120027	Topic 1
5	10	to	Topic 2	0.0000198230	0.0240174568	Topic 2
5	11	drive	Topic 2	0.0000198230	0.0120147297	Topic 2
6	1	mileage	Topic 1	0.0107481854	0.0000208047	Topic 1
6	2	of	Topic 2	0.0000107374	0.0208255316	Topic 2
6	3	this	Topic 2	0.0107481854	0.0208255316	Topic 2
6	4	vehicle	Topic 2	0.0214856333	0.0416302584	Topic 2
6	5	is	Topic 2	0.0214856333	0.0624349852	Topic 2
6	6	around	Topic 2	0.0000107374	0.0208255316	Topic 2
6	7	14kmpl	Topic 1	0.0107481854	0.0000208047	Topic 1
7	1	the	Topic 2	0.0186679009	0.0262927948	Topic 2
7	2	vehicle	Topic 1	0.0373171526	0.0262927948	Topic 1
7	3	is	Topic 2	0.0373171526	0.0394326222	Topic 2
7	4	a	Topic 2	0.0186679009	0.0131529673	Topic 1
7	5	7	Topic 1	0.0186679009	0.0000131398	Topic 1
7	6	seater	Topic 2	0.0000186493	0.0131529673	Topic 2
7	7	MPV	Topic 2	0.0000186493	0.0131529673	Topic 2
7	8	and	Topic 1	0.0186679009	0.0262927948	Topic 2
7	9	it	Topic 1	0.0186679009	0.0000131398	Topic 1
7	10	generates	Topic 2	0.0000186493	0.0131529673	Topic 2
7	11	300	Topic 1	0.0186679009	0.0000131398	Topic 1
7	12	Nm	Topic 2	0.0000186493	0.0131529673	Topic 2
7	13	torque	Topic 1	0.0186679009	0.0000131398	Topic 1
7	14	with	Topic 1	0.0186679009	0.0000131398	Topic 1
7	15	a	Topic 1	0.0186679009	0.0131529673	Topic 1
7	16	diesel	Topic 1	0.0186679009	0.0000131398	Topic 1
7	17	engine	Topic 1	0.0186679009	0.0000131398	Topic 1

Here you can see topic for some word is modified after iteration one.

This is it. Last column is showing final topic of each word for each document after end of one iteration. Now if we have three iterations this output will be provided to the next iteration as an input. It will go on like this for many more iterations.

And if you still confused about how probability is calculated then please have a look at Collapse Gibb’s sampling equation again, which I have already shown.

Also Read:

Also Read: Convert Natural Language Text to SQL Query with LLM

Conclusion:

In this post I have discussed about

What is Latent Dirichlet Allocation
How Latent Dirichlet Allocation works (LDA under the hood) from scratch
Steps for Latent Dirichlet Allocation
Theoretical Explanation of Latent Dirichlet Allocation

If you have any question in mind regarding this topic please let me know in comment section, will try my best to answer.

Anindya Naskar

8 thoughts on “Latent Dirichlet Allocation explained”

Japan

February 13, 2021 at 7:58 am

Aw, this was a really great post. In theory Id like to write like this also – taking time and real effort to make a good article.
Education XMC

February 20, 2021 at 11:19 am

Many thanks for the article, Keep up the good work and thank you again.
www.xmc.pl

March 15, 2021 at 11:50 am

You have noted very interesting points ! ps decent site.
Madhavan Kayambu

April 1, 2021 at 6:36 pm

Thanks a lot for the explanation, i got things cleared very much.

Keep up the good effort, your really helping lot of people by sharing your knowledge
rain

April 2, 2021 at 6:15 pm

please give me the content. the step of this paper is not clearly , for example, we can saw any figures. at the location 1. Tokenize text. the following text is ………………………。can you send me a clearly text. thans
- Anindya
  
  April 5, 2021 at 11:30 am
  
  Not able to understand your point.
Japonia

September 7, 2021 at 1:46 pm

Thanks for the useful post! I wouldnt have discovered this otherwise!
Filozofia

September 9, 2021 at 11:50 pm

Wow, awesome blog structure! How long have you been blogging for? you made blogging look easy. The entire glance of your website is magnificent, as neatly as the content!

<img fetchpriority="high" decoding="async" src="https://2.bp.blogspot.com/-f5Gdfd6EfC0/XFGjkYGuEvI/AAAAAAAABm0/Sve1z_5OFO88aaBAsE6pchxm4uKelxaTgCLcBGAs/s640/image001-min.png" width="640" height="300" border="0" data-original-height="210" data-original-width="445" />