Latent Dirichlet Allocation explained


Latent Dirichlet Allocation is a generative probability model, which means it provide distribution of outputs and inputs based on latent variables.

In this post I will show you how Latent Dirichlet Allocation works, the inner view.

Let’s say we have some comments (listed below) and we want to cluster those comments based on topics those documents cover.

Also Read:



‘play football on holiday’
‘I like to watch football on holiday’
‘la liga match is on this holiday’
‘the vehicle is being rightly advertised as smooth and silent’
‘the vehicle has good pickup and is very comfortable to drive’
‘mileage of this vehicle is around 14kmpl’
‘the vehicle is a 7 seater MPV and it generates 300 Nm torque with a diesel engine’

To apply LDA on any document, all documents need to pass through some pre-processing steps.


1. Tokenize text


2. Assign word IDs to each unique word

3. Replace words from documents with word IDs

4. Count Matrices Calculation:

After doing all above steps of pre-processing, now we need to calculate count matrices. To do that first we randomly assign topics to each word/tokenin each document, and then we will calculate a word to topic count matrix and document to topic count matrix.

For this example we will keep maximum topic number equal to 2. That means while randomly assigning topic to each words, they will fall under either topic 1 or topic 2.


Word to topic Count Matrices:

Token/ Word Topic 1 Count Topic 2 Count
play 0 1
football 1 1
on 1 2
holiday 3 0
I 1 0
like 0 1
to 0 2
watch 1 0
la 0 1
liga 1 0
match 1 0
is 2 3
this 1 1
the 1 2
vehicle 2 2
being 0 1
rightly 0 1
advertised 0 1
as 0 1
smooth 0 1
and 1 2
silent 1 0
has 0 1
good 1 0
pickup 1 0
very 1 0
comfortable 1 0
drive 0 1
mileage 1 0
of 0 1
around 0 1
14kmpl 1 0
a 1 1
7 1 0
seater 0 1
MPV 0 1
it 1 0
generates 0 1
300 1 0
Nm 0 1
torque 1 0
with 1 0
diesel 1 0
engine 1 0

Let’s see Total count of Topic 1 and Topic 2 for all words. We can calculate it by summing corresponding columns which is 31 for Topic 1 and 32 for Topic 2. Which means for full document Topic 1 appears 31 times and Topic 2 appears 32 times.

We will use this count later.

Now we will generate a document-topic count matrix, where the counts correspond to the number of tokens assigned to each topic for each document.

Document-topic count matrix:

Document Topic 1 Count Topic 2 Count
1 2 2
2 4 3
3 5 2
4 1 9
5 7 4
6 2 5
7 10 7

Now it’s time to main part of LDA which is Collapse Gibbs Sampling.

Formula for collapsed Gibbs Sampling:

Let me explain above formula for individual topics.
Parameter Explanation:

a.      Cw,jWT  = While starting a iteration, number of times a word appeared as topic 1 and topic 2. This is done by word to topic count matrix.
For example before starting iteration 1, (From word to topic matrix we can see) throughout all document/ comment word “holiday” comes under topic 1 three times and comes under topic 2 zero times.

                     Similarly word “football” appeared as topic 1 one time and as topic 2 one                           times.

Token/ Word Topic 1 Count Topic 2 Count
play 0 1
football 1 1
on 1 2
holiday 3 0
b.      β = Per topic word distribution(concentration parameter)
c.     W = Length of vocabulary (No. of unique token/ word in full document)
d.      Cd,jDT =  While starting a iteration number of times a document appeared as topic 1 and topic 2.
For example before starting iteration 1, (From document to topic matrix we can see) document 2 appeared as topic 1 four times and as topic 2 three times.

                     Similarly document 4 appeared as topic 1 one time and as topic 2 nine times.

Document Topic 1 Count Topic 2 Count
1 2 2
2 4 3
3 5 2
4 1 9
e.      α : Per document topic distribution.

                f.    T: Number of topic. (here T = 2)


Latent Dirichlet Allocation under the hood (LDA Steps):

Gibbs sampling should go through many more iterations to come up with optimum best result.

Let’s observe one iteration closely to understand what Gibbs Sampling is doing in LDA; what kind of output we are getting after Gibbs sampling is. Steps are listed below.

Probability Calculation:

Calculate probability of each word in each document. After calculating we will have a table like this. (i.e. probability is calculated by Gibbs sampling equation shown above)

Let’s calculate probability of Topic 1 for very first word “play” for document 1 in hand.


Now for word “play” let’s calculate probability for Topic 1 for document 1:

General parameter initialization:

First let’s initialize α = 1 and β = 0.001
And we already know Total no. of unique words (W) = 44

Parameters from word-topic count matrix:

While starting iteration, number of times a word “play” appeared as topic 1(Cw,jWT) = 0

Now from word to topic matrix we know for full document Topic 1 appears 31 times.
So;

Please refer to word to topic count matrix if you have any confusion in above.

Parameters from document-topic count matrix:

While starting iteration number of times document 1 appeared as topic 1 (Cd,jDT) = 2

And Total Number of times document 1 appears as topic 1 and topic 2:

  

Please refer to document-topic count matrix if you have any confusion in above calculation.

And Finally Total number of unique topic (T) =2

So lets recall the sampling equation again:

Similarly yo can calculate Topic 2 probability for word “play” for document 1.

Now let’s see topic probability for all tokens (calculated in the same way shown above)

Document Token Position in Doc Word/ Token Previous Iteration Topic Probability for Topic 1 Probability for Topic 2
1 1 play Topic 2 0.0000161062 0.0156191487
1 2 football Topic 2 0.0161222781 0.0156191487
1 3 on Topic 1 0.0161222781 0.0312226938
1 4 holiday Topic 1 0.0483346218 0.0000156035
2 1 I Topic 1 0.0179136423 0.0000138698
2 2 like Topic 2 0.0000178957 0.0138836877
2 3 to Topic 2 0.0000178957 0.0277535056
2 4 watch Topic 1 0.0179136423 0.0000138698
2 5 football Topic 1 0.0179136423 0.0138836877
2 6 on Topic 2 0.0179136423 0.0277535056
2 7 holiday Topic 1 0.0537051354 0.0000138698
3 1 la Topic 2 0.0000214749 0.0104127658
3 2 liga Topic 1 0.0214963707 0.0000104024
3 3 match Topic 1 0.0214963707 0.0000104024
3 4 is Topic 1 0.0429712666 0.0312174926
3 5 on Topic 2 0.0214963707 0.0208151292
3 6 this Topic 1 0.0214963707 0.0104127658
3 7 holiday Topic 1 0.0644461624 0.0000104024
4 1 the Topic 2 0.0053740927 0.0520378230
4 2 vehicle Topic 2 0.0107428166 0.0520378230
4 3 is Topic 2 0.0107428166 0.0780437315
4 4 being Topic 2 0.0000053687 0.0260319145
4 5 rightly Topic 2 0.0000053687 0.0260319145
4 6 advertised Topic 2 0.0000053687 0.0260319145
4 7 as Topic 2 0.0000053687 0.0260319145
4 8 smooth Topic 2 0.0000053687 0.0260319145
4 9 and Topic 2 0.0053740927 0.0520378230
4 10 silent Topic 1 0.0053740927 0.0000260059
5 1 the Topic 1 0.0198428038 0.0240174568
5 2 vehicle Topic 1 0.0396657845 0.0240174568
5 3 has Topic 2 0.0000198230 0.0120147297
5 4 good Topic 1 0.0198428038 0.0000120027
5 5 pickup Topic 1 0.0198428038 0.0000120027
5 6 and Topic 2 0.0198428038 0.0240174568
5 7 is Topic 1 0.0396657845 0.0360201838
5 8 very Topic 1 0.0198428038 0.0000120027
5 9 comfortable Topic 1 0.0198428038 0.0000120027
5 10 to Topic 2 0.0000198230 0.0240174568
5 11 drive Topic 2 0.0000198230 0.0120147297
6 1 mileage Topic 1 0.0107481854 0.0000208047
6 2 of Topic 2 0.0000107374 0.0208255316
6 3 this Topic 2 0.0107481854 0.0208255316
6 4 vehicle Topic 2 0.0214856333 0.0416302584
6 5 is Topic 2 0.0214856333 0.0624349852
6 6 around Topic 2 0.0000107374 0.0208255316
6 7 14kmpl Topic 1 0.0107481854 0.0000208047
7 1 the Topic 2 0.0186679009 0.0262927948
7 2 vehicle Topic 1 0.0373171526 0.0262927948
7 3 is Topic 2 0.0373171526 0.0394326222
7 4 a Topic 2 0.0186679009 0.0131529673
7 5 7 Topic 1 0.0186679009 0.0000131398
7 6 seater Topic 2 0.0000186493 0.0131529673
7 7 MPV Topic 2 0.0000186493 0.0131529673
7 8 and Topic 1 0.0186679009 0.0262927948
7 9 it Topic 1 0.0186679009 0.0000131398
7 10 generates Topic 2 0.0000186493 0.0131529673
7 11 300 Topic 1 0.0186679009 0.0000131398
7 12 Nm Topic 2 0.0000186493 0.0131529673
7 13 torque Topic 1 0.0186679009 0.0000131398
7 14 with Topic 1 0.0186679009 0.0000131398
7 15 a Topic 1 0.0186679009 0.0131529673
7 16 diesel Topic 1 0.0186679009 0.0000131398
7 17 engine Topic 1 0.0186679009 0.0000131398


In this table last two columns are the output of first stage (Probability calculation)

Final Topic Calculation:

This stage is quite easy. Based on highest probability of two topics for a word LDA will provide final topic for that particular word in that particular document.

Let’s find the output of this stage:

Document Token Position in Doc Word/ Token Previous Iteration Topic Probability for Topic 1 Probability for Topic 2 Final Topic
1 1 play Topic 2 0.0000161062 0.0156191487 Topic 2
1 2 football Topic 2 0.0161222781 0.0156191487 Topic 1
1 3 on Topic 1 0.0161222781 0.0312226938 Topic 2
1 4 holiday Topic 1 0.0483346218 0.0000156035 Topic 1
2 1 I Topic 1 0.0179136423 0.0000138698 Topic 1
2 2 like Topic 2 0.0000178957 0.0138836877 Topic 2
2 3 to Topic 2 0.0000178957 0.0277535056 Topic 2
2 4 watch Topic 1 0.0179136423 0.0000138698 Topic 1
2 5 football Topic 1 0.0179136423 0.0138836877 Topic 1
2 6 on Topic 2 0.0179136423 0.0277535056 Topic 2
2 7 holiday Topic 1 0.0537051354 0.0000138698 Topic 1
3 1 la Topic 2 0.0000214749 0.0104127658 Topic 2
3 2 liga Topic 1 0.0214963707 0.0000104024 Topic 1
3 3 match Topic 1 0.0214963707 0.0000104024 Topic 1
3 4 is Topic 1 0.0429712666 0.0312174926 Topic 1
3 5 on Topic 2 0.0214963707 0.0208151292 Topic 1
3 6 this Topic 1 0.0214963707 0.0104127658 Topic 1
3 7 holiday Topic 1 0.0644461624 0.0000104024 Topic 1
4 1 the Topic 2 0.0053740927 0.0520378230 Topic 2
4 2 vehicle Topic 2 0.0107428166 0.0520378230 Topic 2
4 3 is Topic 2 0.0107428166 0.0780437315 Topic 2
4 4 being Topic 2 0.0000053687 0.0260319145 Topic 2
4 5 rightly Topic 2 0.0000053687 0.0260319145 Topic 2
4 6 advertised Topic 2 0.0000053687 0.0260319145 Topic 2
4 7 as Topic 2 0.0000053687 0.0260319145 Topic 2
4 8 smooth Topic 2 0.0000053687 0.0260319145 Topic 2
4 9 and Topic 2 0.0053740927 0.0520378230 Topic 2
4 10 silent Topic 1 0.0053740927 0.0000260059 Topic 1
5 1 the Topic 1 0.0198428038 0.0240174568 Topic 2
5 2 vehicle Topic 1 0.0396657845 0.0240174568 Topic 1
5 3 has Topic 2 0.0000198230 0.0120147297 Topic 2
5 4 good Topic 1 0.0198428038 0.0000120027 Topic 1
5 5 pickup Topic 1 0.0198428038 0.0000120027 Topic 1
5 6 and Topic 2 0.0198428038 0.0240174568 Topic 2
5 7 is Topic 1 0.0396657845 0.0360201838 Topic 1
5 8 very Topic 1 0.0198428038 0.0000120027 Topic 1
5 9 comfortable Topic 1 0.0198428038 0.0000120027 Topic 1
5 10 to Topic 2 0.0000198230 0.0240174568 Topic 2
5 11 drive Topic 2 0.0000198230 0.0120147297 Topic 2
6 1 mileage Topic 1 0.0107481854 0.0000208047 Topic 1
6 2 of Topic 2 0.0000107374 0.0208255316 Topic 2
6 3 this Topic 2 0.0107481854 0.0208255316 Topic 2
6 4 vehicle Topic 2 0.0214856333 0.0416302584 Topic 2
6 5 is Topic 2 0.0214856333 0.0624349852 Topic 2
6 6 around Topic 2 0.0000107374 0.0208255316 Topic 2
6 7 14kmpl Topic 1 0.0107481854 0.0000208047 Topic 1
7 1 the Topic 2 0.0186679009 0.0262927948 Topic 2
7 2 vehicle Topic 1 0.0373171526 0.0262927948 Topic 1
7 3 is Topic 2 0.0373171526 0.0394326222 Topic 2
7 4 a Topic 2 0.0186679009 0.0131529673 Topic 1
7 5 7 Topic 1 0.0186679009 0.0000131398 Topic 1
7 6 seater Topic 2 0.0000186493 0.0131529673 Topic 2
7 7 MPV Topic 2 0.0000186493 0.0131529673 Topic 2
7 8 and Topic 1 0.0186679009 0.0262927948 Topic 2
7 9 it Topic 1 0.0186679009 0.0000131398 Topic 1
7 10 generates Topic 2 0.0000186493 0.0131529673 Topic 2
7 11 300 Topic 1 0.0186679009 0.0000131398 Topic 1
7 12 Nm Topic 2 0.0000186493 0.0131529673 Topic 2
7 13 torque Topic 1 0.0186679009 0.0000131398 Topic 1
7 14 with Topic 1 0.0186679009 0.0000131398 Topic 1
7 15 a Topic 1 0.0186679009 0.0131529673 Topic 1
7 16 diesel Topic 1 0.0186679009 0.0000131398 Topic 1
7 17 engine Topic 1 0.0186679009 0.0000131398 Topic 1
Here you can see topic for some word is modified after iteration one.
This is it. Last column is showing final topic of each word for each document after end of one iteration. Now if we have three iterations this output will be provided to the next iteration as an input. It will go on like this for many more iterations.

And if you still confused about how probability is calculated then please have a look at Collapse Gibb’s sampling equation again, which I have already shown.

Also Read:

Conclusion:

In this post I have discussed about
  • What is Latent Dirichlet Allocation
  • How Latent Dirichlet Allocation works (LDA under the hood) from scratch
  • Steps for Latent Dirichlet Allocation
  • Theoretical Explanation of Latent Dirichlet Allocation


If you have any question in mind regarding this topic please let me know in comment section, will try my best to answer.

Leave a Comment

Your email address will not be published. Required fields are marked *