‘the vehicle is being rightly advertised as smooth and silent’
‘the vehicle has good pickup and is very comfortable to drive’
‘mileage of this vehicle is around 14kmpl’
‘the vehicle is a 7 seater MPV and it generates 300 Nm torque with a diesel engine’
To apply LDA to a set of documents, every document first needs to pass through a few pre-processing steps:
1. Tokenize text
2. Assign word IDs to each unique word
3. Replace words from documents with word IDs
4. Calculate count matrices
After the first three pre-processing steps, we need to calculate the count matrices. To do that, we first randomly assign a topic to each word/token in each document, and then we build a word-topic count matrix and a document-topic count matrix.
For this example we will fix the number of topics at 2, which means that while randomly assigning a topic to each word, every word falls under either Topic 1 or Topic 2.
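Steps 1-4 above can be sketched in a few lines of Python; this is a minimal sketch using documents 1 and 2 of the example corpus, with illustrative variable names:

```python
import random
from collections import defaultdict

# Documents 1 and 2 from the example corpus
docs = [
    "play football on holiday",
    "I like to watch football on holiday",
]

# Step 1: tokenize
tokenized = [d.split() for d in docs]

# Step 2: assign a word ID to each unique token
vocab = {}
for doc in tokenized:
    for tok in doc:
        vocab.setdefault(tok, len(vocab))

# Step 3: replace tokens in each document with their word IDs
doc_ids = [[vocab[tok] for tok in doc] for doc in tokenized]

# Step 4: randomly assign one of T=2 topics to every token, then build
# the word-topic (C^WT) and document-topic (C^DT) count matrices
T = 2
random.seed(0)
assignments = [[random.randrange(T) for _ in doc] for doc in doc_ids]

word_topic = defaultdict(lambda: [0] * T)   # word_topic[word_id][topic]
doc_topic = [[0] * T for _ in docs]         # doc_topic[doc_index][topic]
for d, (ids, topics) in enumerate(zip(doc_ids, assignments)):
    for w, z in zip(ids, topics):
        word_topic[w][z] += 1
        doc_topic[d][z] += 1
```

Because the initial topics are random, the exact counts will differ from run to run (and from the matrices shown below), but the row and column totals always add up to the number of tokens.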
Word-topic count matrix:

| Token/Word | Topic 1 Count | Topic 2 Count |
| --- | --- | --- |
| play | 0 | 1 |
| football | 1 | 1 |
| on | 1 | 2 |
| holiday | 3 | 0 |
| I | 1 | 0 |
| like | 0 | 1 |
| to | 0 | 2 |
| watch | 1 | 0 |
| la | 0 | 1 |
| liga | 1 | 0 |
| match | 1 | 0 |
| is | 2 | 3 |
| this | 1 | 1 |
| the | 1 | 2 |
| vehicle | 2 | 2 |
| being | 0 | 1 |
| rightly | 0 | 1 |
| advertised | 0 | 1 |
| as | 0 | 1 |
| smooth | 0 | 1 |
| and | 1 | 2 |
| silent | 1 | 0 |
| has | 0 | 1 |
| good | 1 | 0 |
| pickup | 1 | 0 |
| very | 1 | 0 |
| comfortable | 1 | 0 |
| drive | 0 | 1 |
| mileage | 1 | 0 |
| of | 0 | 1 |
| around | 0 | 1 |
| 14kmpl | 1 | 0 |
| a | 1 | 1 |
| 7 | 1 | 0 |
| seater | 0 | 1 |
| MPV | 0 | 1 |
| it | 1 | 0 |
| generates | 0 | 1 |
| 300 | 1 | 0 |
| Nm | 0 | 1 |
| torque | 1 | 0 |
| with | 1 | 0 |
| diesel | 1 | 0 |
| engine | 1 | 0 |
Let's look at the total counts of Topic 1 and Topic 2 over all words. Summing the corresponding columns gives 31 for Topic 1 and 32 for Topic 2, which means that across the whole corpus Topic 1 is assigned 31 times and Topic 2 is assigned 32 times. We will use these totals later.
Now we will generate a document-topic count matrix, where the counts correspond to the number of tokens assigned to each topic for each document.
Document-topic count matrix:

| Document | Topic 1 Count | Topic 2 Count |
| --- | --- | --- |
| 1 | 2 | 2 |
| 2 | 4 | 3 |
| 3 | 5 | 2 |
| 4 | 1 | 9 |
| 5 | 7 | 4 |
| 6 | 2 | 5 |
| 7 | 10 | 7 |
Now it's time for the main part of LDA, which is collapsed Gibbs sampling.
The collapsed Gibbs sampling formula for assigning topic $j$ to word $w$ in document $d$ is:

$$P(z_i = j \mid z_{-i}) \propto \frac{C^{WT}_{w,j} + \beta}{\sum_{w'} C^{WT}_{w',j} + W\beta} \times \frac{C^{DT}_{d,j} + \alpha}{\sum_{t} C^{DT}_{d,t} + T\alpha}$$

Let me explain the parameters of this formula for the individual topics.
Parameter explanation:
a. C^{WT}_{w,j} = at the start of an iteration, the number of times word w is assigned to each topic (Topic 1 and Topic 2). This comes from the word-topic count matrix.
For example, before iteration 1 we can see from the word-topic matrix that, across all documents/comments, the word "holiday" falls under Topic 1 three times and under Topic 2 zero times.
Similarly, the word "football" falls under Topic 1 one time and under Topic 2 one time.
b. β = per-topic word distribution (concentration) parameter.
c. W = length of the vocabulary (number of unique tokens/words across all documents).
d. C^{DT}_{d,j} = at the start of an iteration, the number of tokens in document d assigned to each topic.
For example, before iteration 1 we can see from the document-topic matrix that document 2 has four tokens under Topic 1 and three under Topic 2.
Similarly, document 4 has one token under Topic 1 and nine under Topic 2.
| Document | Topic 1 Count | Topic 2 Count |
| --- | --- | --- |
| 1 | 2 | 2 |
| 2 | 4 | 3 |
| 3 | 5 | 2 |
| 4 | 1 | 9 |
e. α = per-document topic distribution (concentration) parameter.
f. T = number of topics (here T = 2).
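The sampling equation with the parameters above can be sketched as a small function (a simplified sketch; the defaults α = 1, β = 0.001, W = 44, T = 2 are the values used in this walkthrough):

```python
def topic_probability(word_topic_count, topic_total, doc_topic_count, doc_total,
                      beta=0.001, W=44, alpha=1.0, T=2):
    """Unnormalized collapsed-Gibbs probability of one topic for one token.

    word_topic_count: C^WT_{w,j}, times word w is assigned to topic j
    topic_total:      total assignments to topic j over all words (31 or 32 here)
    doc_topic_count:  C^DT_{d,j}, tokens of document d assigned to topic j
    doc_total:        total tokens in document d (over both topics)
    """
    word_term = (word_topic_count + beta) / (topic_total + W * beta)
    doc_term = (doc_topic_count + alpha) / (doc_total + T * alpha)
    return word_term * doc_term
```

Plugging in the counts for "play" in document 1 (C^WT = 0, topic total 31, C^DT = 2, document length 4) reproduces the Topic 1 probability shown in the table further below.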
Latent Dirichlet Allocation under the hood (LDA steps):
Gibbs sampling has to run for many iterations to arrive at a good result. Let's observe one iteration closely to understand what Gibbs sampling is doing inside LDA and what kind of output it produces. The steps are listed below.
Probability calculation:
Calculate the probability of each topic for each word in each document, using the Gibbs sampling equation shown above. After that we will have a table like the one below. Let's first calculate the Topic 1 probability for the very first word, "play", in document 1.
General parameter initialization: first let's initialize α = 1 and β = 0.001. We also already know the total number of unique words: W = 44.

Parameters from the word-topic count matrix: at the start of the iteration, the number of times the word "play" is assigned to Topic 1 is C^{WT}_{play,1} = 0, and from the word-topic matrix we know Topic 1 appears 31 times across the whole corpus. So the word term is:

(0 + 0.001) / (31 + 44 × 0.001) = 0.001 / 31.044 ≈ 0.0000322

Please refer to the word-topic count matrix if anything above is unclear.

Parameters from the document-topic count matrix: at the start of the iteration, the number of tokens in document 1 assigned to Topic 1 is C^{DT}_{1,1} = 2, and the total number of tokens in document 1 (Topic 1 plus Topic 2) is 2 + 2 = 4. So the document term is:

(2 + 1) / (4 + 2 × 1) = 3 / 6 = 0.5

Please refer to the document-topic count matrix if anything in the above calculation is unclear.
Finally, the total number of topics is T = 2.
So, recalling the sampling equation, the Topic 1 probability for "play" in document 1 is:

P(Topic 1) ∝ 0.0000322 × 0.5 ≈ 0.0000161
Similarly, you can calculate the Topic 2 probability for the word "play" in document 1.
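The arithmetic for "play" can be checked in a few lines (α, β, W and the counts are taken from the matrices above):

```python
alpha, beta = 1.0, 0.001
W, T = 44, 2          # vocabulary size and number of topics

# Word "play" in document 1: 0 Topic-1 and 1 Topic-2 assignments for "play",
# topic totals 31 and 32, and document 1 has 2 Topic-1 tokens out of 4.
p_topic1 = ((0 + beta) / (31 + W * beta)) * ((2 + alpha) / (4 + T * alpha))
p_topic2 = ((1 + beta) / (32 + W * beta)) * ((2 + alpha) / (4 + T * alpha))

print(round(p_topic1, 10))   # ≈ 0.0000161062
print(round(p_topic2, 10))   # ≈ 0.0156191487
```

Both values match the first row of the table below.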
Now let's see the topic probabilities for all tokens, calculated in the same way as shown above:
| Document | Token Position in Doc | Word/Token | Previous Iteration Topic | Probability for Topic 1 | Probability for Topic 2 |
| --- | --- | --- | --- | --- | --- |
| 1 | 1 | play | Topic 2 | 0.0000161062 | 0.0156191487 |
| 1 | 2 | football | Topic 2 | 0.0161222781 | 0.0156191487 |
| 1 | 3 | on | Topic 1 | 0.0161222781 | 0.0312226938 |
| 1 | 4 | holiday | Topic 1 | 0.0483346218 | 0.0000156035 |
| 2 | 1 | I | Topic 1 | 0.0179136423 | 0.0000138698 |
| 2 | 2 | like | Topic 2 | 0.0000178957 | 0.0138836877 |
| 2 | 3 | to | Topic 2 | 0.0000178957 | 0.0277535056 |
| 2 | 4 | watch | Topic 1 | 0.0179136423 | 0.0000138698 |
| 2 | 5 | football | Topic 1 | 0.0179136423 | 0.0138836877 |
| 2 | 6 | on | Topic 2 | 0.0179136423 | 0.0277535056 |
| 2 | 7 | holiday | Topic 1 | 0.0537051354 | 0.0000138698 |
| 3 | 1 | la | Topic 2 | 0.0000214749 | 0.0104127658 |
| 3 | 2 | liga | Topic 1 | 0.0214963707 | 0.0000104024 |
| 3 | 3 | match | Topic 1 | 0.0214963707 | 0.0000104024 |
| 3 | 4 | is | Topic 1 | 0.0429712666 | 0.0312174926 |
| 3 | 5 | on | Topic 2 | 0.0214963707 | 0.0208151292 |
| 3 | 6 | this | Topic 1 | 0.0214963707 | 0.0104127658 |
| 3 | 7 | holiday | Topic 1 | 0.0644461624 | 0.0000104024 |
| 4 | 1 | the | Topic 2 | 0.0053740927 | 0.0520378230 |
| 4 | 2 | vehicle | Topic 2 | 0.0107428166 | 0.0520378230 |
| 4 | 3 | is | Topic 2 | 0.0107428166 | 0.0780437315 |
| 4 | 4 | being | Topic 2 | 0.0000053687 | 0.0260319145 |
| 4 | 5 | rightly | Topic 2 | 0.0000053687 | 0.0260319145 |
| 4 | 6 | advertised | Topic 2 | 0.0000053687 | 0.0260319145 |
| 4 | 7 | as | Topic 2 | 0.0000053687 | 0.0260319145 |
| 4 | 8 | smooth | Topic 2 | 0.0000053687 | 0.0260319145 |
| 4 | 9 | and | Topic 2 | 0.0053740927 | 0.0520378230 |
| 4 | 10 | silent | Topic 1 | 0.0053740927 | 0.0000260059 |
| 5 | 1 | the | Topic 1 | 0.0198428038 | 0.0240174568 |
| 5 | 2 | vehicle | Topic 1 | 0.0396657845 | 0.0240174568 |
| 5 | 3 | has | Topic 2 | 0.0000198230 | 0.0120147297 |
| 5 | 4 | good | Topic 1 | 0.0198428038 | 0.0000120027 |
| 5 | 5 | pickup | Topic 1 | 0.0198428038 | 0.0000120027 |
| 5 | 6 | and | Topic 2 | 0.0198428038 | 0.0240174568 |
| 5 | 7 | is | Topic 1 | 0.0396657845 | 0.0360201838 |
| 5 | 8 | very | Topic 1 | 0.0198428038 | 0.0000120027 |
| 5 | 9 | comfortable | Topic 1 | 0.0198428038 | 0.0000120027 |
| 5 | 10 | to | Topic 2 | 0.0000198230 | 0.0240174568 |
| 5 | 11 | drive | Topic 2 | 0.0000198230 | 0.0120147297 |
| 6 | 1 | mileage | Topic 1 | 0.0107481854 | 0.0000208047 |
| 6 | 2 | of | Topic 2 | 0.0000107374 | 0.0208255316 |
| 6 | 3 | this | Topic 2 | 0.0107481854 | 0.0208255316 |
| 6 | 4 | vehicle | Topic 2 | 0.0214856333 | 0.0416302584 |
| 6 | 5 | is | Topic 2 | 0.0214856333 | 0.0624349852 |
| 6 | 6 | around | Topic 2 | 0.0000107374 | 0.0208255316 |
| 6 | 7 | 14kmpl | Topic 1 | 0.0107481854 | 0.0000208047 |
| 7 | 1 | the | Topic 2 | 0.0186679009 | 0.0262927948 |
| 7 | 2 | vehicle | Topic 1 | 0.0373171526 | 0.0262927948 |
| 7 | 3 | is | Topic 2 | 0.0373171526 | 0.0394326222 |
| 7 | 4 | a | Topic 2 | 0.0186679009 | 0.0131529673 |
| 7 | 5 | 7 | Topic 1 | 0.0186679009 | 0.0000131398 |
| 7 | 6 | seater | Topic 2 | 0.0000186493 | 0.0131529673 |
| 7 | 7 | MPV | Topic 2 | 0.0000186493 | 0.0131529673 |
| 7 | 8 | and | Topic 1 | 0.0186679009 | 0.0262927948 |
| 7 | 9 | it | Topic 1 | 0.0186679009 | 0.0000131398 |
| 7 | 10 | generates | Topic 2 | 0.0000186493 | 0.0131529673 |
| 7 | 11 | 300 | Topic 1 | 0.0186679009 | 0.0000131398 |
| 7 | 12 | Nm | Topic 2 | 0.0000186493 | 0.0131529673 |
| 7 | 13 | torque | Topic 1 | 0.0186679009 | 0.0000131398 |
| 7 | 14 | with | Topic 1 | 0.0186679009 | 0.0000131398 |
| 7 | 15 | a | Topic 1 | 0.0186679009 | 0.0131529673 |
| 7 | 16 | diesel | Topic 1 | 0.0186679009 | 0.0000131398 |
| 7 | 17 | engine | Topic 1 | 0.0186679009 | 0.0000131398 |
In this table, the last two columns are the output of the first stage (probability calculation).
Final topic calculation:
This stage is quite easy: for each word in each document, LDA assigns as the final topic whichever of the two topics has the higher probability for that word.
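This selection rule can be sketched in one line (note that full Gibbs sampling would draw the topic at random from the two probabilities; the walkthrough here deterministically picks the more probable one):

```python
def final_topic(p_topic1, p_topic2):
    # Assign whichever topic has the higher probability for this token.
    return "Topic 1" if p_topic1 >= p_topic2 else "Topic 2"
```

For example, for "play" in document 1, `final_topic(0.0000161062, 0.0156191487)` gives "Topic 2".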
Let’s find the output of this stage:
| Document | Token Position in Doc | Word/Token | Previous Iteration Topic | Probability for Topic 1 | Probability for Topic 2 | Final Topic |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | play | Topic 2 | 0.0000161062 | 0.0156191487 | Topic 2 |
| 1 | 2 | football | Topic 2 | 0.0161222781 | 0.0156191487 | Topic 1 |
| 1 | 3 | on | Topic 1 | 0.0161222781 | 0.0312226938 | Topic 2 |
| 1 | 4 | holiday | Topic 1 | 0.0483346218 | 0.0000156035 | Topic 1 |
| 2 | 1 | I | Topic 1 | 0.0179136423 | 0.0000138698 | Topic 1 |
| 2 | 2 | like | Topic 2 | 0.0000178957 | 0.0138836877 | Topic 2 |
| 2 | 3 | to | Topic 2 | 0.0000178957 | 0.0277535056 | Topic 2 |
| 2 | 4 | watch | Topic 1 | 0.0179136423 | 0.0000138698 | Topic 1 |
| 2 | 5 | football | Topic 1 | 0.0179136423 | 0.0138836877 | Topic 1 |
| 2 | 6 | on | Topic 2 | 0.0179136423 | 0.0277535056 | Topic 2 |
| 2 | 7 | holiday | Topic 1 | 0.0537051354 | 0.0000138698 | Topic 1 |
| 3 | 1 | la | Topic 2 | 0.0000214749 | 0.0104127658 | Topic 2 |
| 3 | 2 | liga | Topic 1 | 0.0214963707 | 0.0000104024 | Topic 1 |
| 3 | 3 | match | Topic 1 | 0.0214963707 | 0.0000104024 | Topic 1 |
| 3 | 4 | is | Topic 1 | 0.0429712666 | 0.0312174926 | Topic 1 |
| 3 | 5 | on | Topic 2 | 0.0214963707 | 0.0208151292 | Topic 1 |
| 3 | 6 | this | Topic 1 | 0.0214963707 | 0.0104127658 | Topic 1 |
| 3 | 7 | holiday | Topic 1 | 0.0644461624 | 0.0000104024 | Topic 1 |
| 4 | 1 | the | Topic 2 | 0.0053740927 | 0.0520378230 | Topic 2 |
| 4 | 2 | vehicle | Topic 2 | 0.0107428166 | 0.0520378230 | Topic 2 |
| 4 | 3 | is | Topic 2 | 0.0107428166 | 0.0780437315 | Topic 2 |
| 4 | 4 | being | Topic 2 | 0.0000053687 | 0.0260319145 | Topic 2 |
| 4 | 5 | rightly | Topic 2 | 0.0000053687 | 0.0260319145 | Topic 2 |
| 4 | 6 | advertised | Topic 2 | 0.0000053687 | 0.0260319145 | Topic 2 |
| 4 | 7 | as | Topic 2 | 0.0000053687 | 0.0260319145 | Topic 2 |
| 4 | 8 | smooth | Topic 2 | 0.0000053687 | 0.0260319145 | Topic 2 |
| 4 | 9 | and | Topic 2 | 0.0053740927 | 0.0520378230 | Topic 2 |
| 4 | 10 | silent | Topic 1 | 0.0053740927 | 0.0000260059 | Topic 1 |
| 5 | 1 | the | Topic 1 | 0.0198428038 | 0.0240174568 | Topic 2 |
| 5 | 2 | vehicle | Topic 1 | 0.0396657845 | 0.0240174568 | Topic 1 |
| 5 | 3 | has | Topic 2 | 0.0000198230 | 0.0120147297 | Topic 2 |
| 5 | 4 | good | Topic 1 | 0.0198428038 | 0.0000120027 | Topic 1 |
| 5 | 5 | pickup | Topic 1 | 0.0198428038 | 0.0000120027 | Topic 1 |
| 5 | 6 | and | Topic 2 | 0.0198428038 | 0.0240174568 | Topic 2 |
| 5 | 7 | is | Topic 1 | 0.0396657845 | 0.0360201838 | Topic 1 |
| 5 | 8 | very | Topic 1 | 0.0198428038 | 0.0000120027 | Topic 1 |
| 5 | 9 | comfortable | Topic 1 | 0.0198428038 | 0.0000120027 | Topic 1 |
| 5 | 10 | to | Topic 2 | 0.0000198230 | 0.0240174568 | Topic 2 |
| 5 | 11 | drive | Topic 2 | 0.0000198230 | 0.0120147297 | Topic 2 |
| 6 | 1 | mileage | Topic 1 | 0.0107481854 | 0.0000208047 | Topic 1 |
| 6 | 2 | of | Topic 2 | 0.0000107374 | 0.0208255316 | Topic 2 |
| 6 | 3 | this | Topic 2 | 0.0107481854 | 0.0208255316 | Topic 2 |
| 6 | 4 | vehicle | Topic 2 | 0.0214856333 | 0.0416302584 | Topic 2 |
| 6 | 5 | is | Topic 2 | 0.0214856333 | 0.0624349852 | Topic 2 |
| 6 | 6 | around | Topic 2 | 0.0000107374 | 0.0208255316 | Topic 2 |
| 6 | 7 | 14kmpl | Topic 1 | 0.0107481854 | 0.0000208047 | Topic 1 |
| 7 | 1 | the | Topic 2 | 0.0186679009 | 0.0262927948 | Topic 2 |
| 7 | 2 | vehicle | Topic 1 | 0.0373171526 | 0.0262927948 | Topic 1 |
| 7 | 3 | is | Topic 2 | 0.0373171526 | 0.0394326222 | Topic 2 |
| 7 | 4 | a | Topic 2 | 0.0186679009 | 0.0131529673 | Topic 1 |
| 7 | 5 | 7 | Topic 1 | 0.0186679009 | 0.0000131398 | Topic 1 |
| 7 | 6 | seater | Topic 2 | 0.0000186493 | 0.0131529673 | Topic 2 |
| 7 | 7 | MPV | Topic 2 | 0.0000186493 | 0.0131529673 | Topic 2 |
| 7 | 8 | and | Topic 1 | 0.0186679009 | 0.0262927948 | Topic 2 |
| 7 | 9 | it | Topic 1 | 0.0186679009 | 0.0000131398 | Topic 1 |
| 7 | 10 | generates | Topic 2 | 0.0000186493 | 0.0131529673 | Topic 2 |
| 7 | 11 | 300 | Topic 1 | 0.0186679009 | 0.0000131398 | Topic 1 |
| 7 | 12 | Nm | Topic 2 | 0.0000186493 | 0.0131529673 | Topic 2 |
| 7 | 13 | torque | Topic 1 | 0.0186679009 | 0.0000131398 | Topic 1 |
| 7 | 14 | with | Topic 1 | 0.0186679009 | 0.0000131398 | Topic 1 |
| 7 | 15 | a | Topic 1 | 0.0186679009 | 0.0131529673 | Topic 1 |
| 7 | 16 | diesel | Topic 1 | 0.0186679009 | 0.0000131398 | Topic 1 |
| 7 | 17 | engine | Topic 1 | 0.0186679009 | 0.0000131398 | Topic 1 |
Here you can see that the topic of some words changed after iteration one.
That's it. The last column shows the final topic of each word in each document at the end of one iteration. This output is then fed into the next iteration as its input, and the process continues for many more iterations.
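Putting the pieces together, one iteration as described above (all probabilities computed from the counts frozen at the start of the iteration, then an argmax topic assignment) might look like this sketch; function and variable names are illustrative:

```python
def gibbs_iteration(doc_ids, word_topic, doc_topic, alpha=1.0, beta=0.001, T=2):
    """One simplified iteration: score every token against the counts
    frozen at the start of the iteration, then pick the argmax topic.

    doc_ids:    list of documents, each a list of word IDs
    word_topic: C^WT counts, word_topic[w][j]
    doc_topic:  C^DT counts, doc_topic[d][j]
    """
    W = len(word_topic)                                   # vocabulary size
    topic_totals = [sum(word_topic[w][j] for w in range(W)) for j in range(T)]
    new_assignments = []
    for d, ids in enumerate(doc_ids):
        doc_total = len(ids)                              # tokens in document d
        row = []
        for w in ids:
            probs = [
                (word_topic[w][j] + beta) / (topic_totals[j] + W * beta)
                * (doc_topic[d][j] + alpha) / (doc_total + T * alpha)
                for j in range(T)
            ]
            row.append(probs.index(max(probs)))           # argmax topic for this token
        new_assignments.append(row)
    # The count matrices are rebuilt from new_assignments before the next iteration.
    return new_assignments
```

A full Gibbs sampler would draw each topic from `probs` (after normalizing) and update the counts token by token, but this sketch mirrors the deterministic walkthrough in the text.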
And if you are still confused about how the probabilities are calculated, please have another look at the collapsed Gibbs sampling equation shown earlier.