Guide to Building the Best LDA Model Using Gensim in Python

In recent years the amount of data being generated, mostly unstructured, has grown enormously, and it is difficult to extract relevant and useful information from it. In text mining (a field of Natural Language Processing), topic modeling is a technique for extracting the hidden topics from a large volume of text.
 
There are many algorithms for topic modeling, and Latent Dirichlet Allocation (LDA) is one of the most popular. In a previous tutorial I explained how Latent Dirichlet Allocation (LDA) works. In this tutorial I am going to implement LDA using Python's Gensim package.


Prerequisites to implement LDA with Gensim Python

You need two extra resources to follow this tutorial:
  • NLTK stopwords: Gensim has its own stopword list, but to enlarge it we will also use the NLTK stopwords.
  • spaCy model: we will use a spaCy model for lemmatization only.
Run the following commands in a terminal to install spaCy and download the small English model.
 
pip install -U spacy
python -m spacy download en_core_web_sm
## Download NLTK stopwords in case you don't have them already
import nltk
nltk.download('stopwords')

Import packages for LDA

import gensim, spacy
import gensim.corpora as corpora
from nltk.corpus import stopwords

import pandas as pd
import re
from tqdm import tqdm
import time


import pyLDAvis
import pyLDAvis.gensim  # don't skip this
# import matplotlib.pyplot as plt
# %matplotlib inline

## Setup nlp for spacy
nlp = spacy.load("en_core_web_sm")

# Load NLTK stopwords
stop_words = stopwords.words('english')
# Add some extra words in it if required
stop_words.extend(['from', 'subject', 'use','pron'])

Newsgroup Data for LDA Topic Modeling

We will be using the 20-Newsgroups dataset for this tutorial. The dataset contains about 11k newsgroup posts and is available as newsgroups.json.
# Import Dataset
df = pd.read_json('https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json')
## Or you can download the data from that link and load it locally with the same function
# View data
df.head()

Cleaning and Pre-processing for LDA

Cleaning and pre-processing is a common first step for any kind of text analysis. There are many ways to do it, depending on your data and the type of analysis you are doing. For our data and analysis I have divided this stage into the following steps:
  • Remove emails: email addresses are not important for our analysis
  • Remove newline characters and extra spaces
  • Remove single quotes
  • Lemmatization: using spaCy
  • Tokenization: split the text into words (including Gensim stopword removal)
  • Stopword removal: final stopword removal using the NLTK stopword list
# Convert into list
data = df.content.values.tolist()

### Cleaning data

# Remove Emails
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]
# Remove new line characters and extra space
data = [re.sub(r'\s+', ' ', sent) for sent in data]
# Remove single quotes
data = [re.sub("'", "", sent) for sent in data]

### Lemmatization
data_lemma = []
for txt in tqdm(data):
    lis = []
    doc = nlp(txt)
    for token in doc:
        lis.append(token.lemma_)
    data_lemma.append(' '.join(lis))

### Tokenization and gensim stopword removal

# You can look for all gensim stopwords by running -> 'gensim.parsing.preprocessing.STOPWORDS'

# Function to tokenize
# Also remove words of 3 characters or fewer (you can change this)
def tokenization_with_gen_stop(text):
    result=[]
    for token in gensim.utils.simple_preprocess(text) :
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(token)

    return result

## Apply tokenization function
data_words = []
for txt in tqdm(data_lemma):
    data_words.append(tokenization_with_gen_stop(txt))

### NLTK Stopword removal (extra stopwords)

data_words_clean = []
for word in tqdm(data_words):
    wrd = []
    for w in word:
        if w not in stop_words:
            wrd.append(w)
    data_words_clean.append(wrd)

Prepare Dictionary and Corpus for Topic Modeling

Like any other algorithm, LDA only understands numeric values, so we need to convert all the cleaned text into numbers.
 
In this tutorial we will convert the cleaned and tokenized text into a bag-of-words representation to make it numeric. You can think of it, for each document, as a mapping where the key is the word and the value is the number of times that word occurs in that document.
 
The two main inputs to the LDA topic model are:
  • Dictionary: a unique id for each unique word
  • Corpus: for each document, the number of times each word appears
 
# Create Dictionary
dictionary = corpora.Dictionary(data_words_clean)
# Print dictionary
print(dictionary.token2id)

## Create Term document frequency (corpus)
# Term Document Frequency
corpus = [dictionary.doc2bow(text) for text in data_words_clean]
# Print corpus for first document
print(corpus[0])
Dictionary:   
{'able': 0, 'add': 1, 'addison_reed': 2, 'afloat': 3, 'alejandro_de': 4, 'allow': 5 …..
 
For example, the id for the word 'able' is 0, the id for 'add' is 1, and so on.
 
Corpus:
[(0, 1), (1, 2), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1),…
 
For example, (0, 1) above means that in the first document, word id 0 (word: 'able') occurs once. Likewise, word id 1 (word: 'add') occurs twice, and so on.
If the word ids make the corpus hard to read, you can view a human-readable form of it with the following script.
# Easy to observe format of corpus
[[(dictionary[id], freq) for id, freq in cp] for cp in corpus[:1]]

Train LDA Topic Model with Gensim

We have now done everything required to train the LDA model.
For this tutorial I will pass a few parameters to the LDA model:
  • corpus: the bag-of-words corpus
  • num_topics: the number of topics; for this tutorial I keep it at 8
  • id2word: the dictionary
  • random_state: controls the randomness of the training process (for reproducibility)
  • passes: the number of passes through the corpus during training
Apart from these, there are many more parameters you should consider while tuning your LDA model for best performance; they are documented in Gensim's LdaModel reference.
 
 
start_time = time.time()
##
NUM_TOPICS = 8
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=NUM_TOPICS, id2word=dictionary, random_state=100, passes=10)
# Saving trained model
ldamodel.save('LDA_NYT')
# Loading trained model
ldamodel = gensim.models.ldamodel.LdaModel.load('LDA_NYT')
## Print time taken to train the model
print("--- %s seconds ---" % (time.time() - start_time))
The code above runs on a single core, so it takes some time. If you want a faster implementation, Gensim provides LdaMulticore, which is parallelized for multicore machines using multiprocessing. I have tested it on my i7 system and it takes about half the time of the single-core LDA.
start_time = time.time()
##
## Multicore LDA
NUM_TOPICS = 8
lda_multicore_model = gensim.models.ldamulticore.LdaMulticore(corpus, num_topics=NUM_TOPICS, id2word=dictionary, random_state=100, passes=10)
# Saving trained model
lda_multicore_model.save('LDA_NYT_multicore')
# Loading trained model
lda_multicore_model = gensim.models.ldamulticore.LdaMulticore.load('LDA_NYT_multicore')
## Print time taken to train the model
print("--- %s seconds ---" % (time.time() - start_time))

View topics of LDA model

The LDA model above is built with 8 different topics, where each topic is a combination of keywords and each keyword contributes a certain weight to that topic.
# See the topics
ldamodel.print_topics(-1)
[(0,
  '0.008*"israel" + 0.007*"armenian" + 0.007*"turkish" + 0.007*"israeli" + 0.006*"armenians" + 0.004*"jews" + 0.004*"kill" + 0.003*"armenia" + 0.003*"play" + 0.003*"arab"'),
 (1,
  '0.013*"year" + 0.012*"game" + 0.011*"team" + 0.010*"organization" + 0.008*"write" + 0.008*"good" + 0.007*"article" + 0.007*"player" + 0.007*"think" + 0.007*"university"'),
 (2,
  '0.019*"organization" + 0.013*"line" + 0.010*"posting" + 0.010*"host" + 0.010*"nntp" + 0.010*"university" + 0.009*"write" + 0.009*"lines" + 0.008*"know" + 0.007*"drive"'),
 (3,
  '0.013*"people" + 0.010*"write" + 0.009*"know" + 0.009*"think" + 0.006*"article" + 0.006*"organization" + 0.005*"believe" + 0.005*"like" + 0.005*"thing" + 0.005*"time"'),
 (4,
  '0.010*"write" + 0.010*"organization" + 0.009*"article" + 0.007*"like" + 0.006*"line" + 0.005*"time" + 0.005*"good" + 0.005*"nntp" + 0.005*"posting" + 0.005*"host"'),
 (5,
  '0.011*"space" + 0.007*"information" + 0.006*"government" + 0.006*"chip" + 0.006*"encryption" + 0.005*"clipper" + 0.005*"public" + 0.004*"technology" + 0.004*"nasa" + 0.004*"datum"'),
 (6,
  '0.010*"gordon" + 0.009*"health" + 0.008*"medical" + 0.008*"banks" + 0.007*"doctor" + 0.007*"disease" + 0.007*"patient" + 0.005*"insurance" + 0.004*"treatment" + 0.004*"reply"'),
 (7,
  '0.022*"file" + 0.011*"window" + 0.010*"program" + 0.009*"image" + 0.007*"server" + 0.006*"line" + 0.006*"available" + 0.006*"include" + 0.006*"display" + 0.006*"application"')]

Interpret LDA Gensim result

Topic 0 is represented as '0.008*"israel" + 0.007*"armenian" + 0.007*"turkish" + 0.007*"israeli" + 0.006*"armenians" + 0.004*"jews" + 0.004*"kill" + 0.003*"armenia" + 0.003*"play" + 0.003*"arab"'.
This means the top 10 keywords that contribute to this topic are 'israel', 'armenian', 'turkish' and so on, and the weight of 'israel' for topic 0 is 0.008.
The weights reflect how important a keyword is to that topic.
Looking at these keywords, can you guess what this topic could be? You might summarise topic 0 as "country" or "location".
Similarly, topic 6 represents "healthcare" and topic 7 represents "computer programming/graphics".
The other topics may not be mature enough to name yet; that is where tuning the LDA model comes in, which I will cover later in this tutorial.
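If you want these keyword/weight pairs programmatically rather than parsing the printed string, Gensim's show_topic method returns them directly. A minimal sketch (topic 0 is used here only as an example):
# Keyword/weight pairs for a single topic (topic 0 as an example)
for word, weight in ldamodel.show_topic(0, topn=10):
    print('word:', word, '->', 'weight:', round(weight, 4))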

Evaluate LDA model

Like any other algorithm, LDA needs to be evaluated to judge how good the trained model is. There are two convenient measures:
  • Perplexity score: lower is better
  • Coherence score: higher is better
In my experience, the coherence score is more helpful.
 
# Compute Perplexity Score
print('\nPerplexity Score: ', ldamodel.log_perplexity(corpus))

# Compute Coherence Score
coherence_model_lda = gensim.models.CoherenceModel(model=ldamodel, texts=data_words_clean, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
Perplexity Score:  -8.483322129214947
Coherence Score:  0.5751529939463009

Visualize topics-keywords of LDA

Now that the LDA model is built, the next step is to examine the produced topics and their associated keywords. Python's pyLDAvis package is the best tool for this: it produces an interactive chart and is designed to work well with Jupyter notebooks.
# To plot at Jupyter notebook
pyLDAvis.enable_notebook()
plot = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)
# Save pyLDA plot as html file
pyLDAvis.save_html(plot, 'LDA_NYT.html')
plot
Each bubble on the left-hand plot represents an individual topic. The larger the bubble, the more prevalent that topic is.
 
A good topic model will have fairly big, non-overlapping bubbles scattered throughout the chart rather than clustered in one region.
 
In my case topic 6, topic 7 and topic 8 are big and non-overlapping, while the remaining topics overlap each other. You already observed this when printing the topics: topic 6 represents "healthcare", topic 7 represents "computer programming/graphics" and topic 0 represents "country/location". The rest do not explain any particular topic clearly.
 
Note: one important point I observed is that the topic numbers shown when printing topics and when plotting them with pyLDAvis may not match.
 

Train LDA with mallet

So far you have seen Gensim's built-in version of the LDA algorithm. There is another package, Mallet, which often gives better-quality topics. The difference is that Gensim's standard LDA uses Variational Bayes, which is faster but less precise than Mallet's Gibbs sampling.
 
MALLET is a Java-based package, but Gensim provides a Python wrapper for Latent Dirichlet Allocation via Mallet.
 

Setup Mallet for LDA:

In order to use Mallet for LDA, download the zip file of the Mallet Java package from http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
Unzip the file (mallet-2.0.8) and place it on any drive; I have placed it on my C: drive.
 
Note: make sure that Java is installed and the Java environment variable is set on your system.
import os
## Setup mallet environment change it according to your drive
os.environ.update({'MALLET_HOME':r'C:/mallet-2.0.8'})
## Setup mallet path change it according to your drive
mallet_path = 'C:/mallet-2.0.8/bin/mallet'

start_time = time.time()
##
## Train LDA with mallet
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=20, id2word=dictionary)
## Print time taken to train the model
print("--- %s seconds ---" % (time.time() - start_time))

Evaluate Mallet LDA with Gensim LDA

Now it is time to evaluate this model and see whether Mallet's LDA gives better results than Gensim's built-in LDA.
# Compute Coherence Score for mallet
coherence_model_lda = gensim.models.CoherenceModel(model=ldamallet, texts=data_words_clean, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
Coherence Score:  0.6151392292265527
You can clearly observe the difference: just by changing the algorithm, the coherence score increased from 0.57 to 0.61.
 
Coherence Score for:
  • Gensim's built-in LDA: 0.5751529939463009
  • Mallet’s LDA: 0.6151392292265527

Predict topic and keyword for new document with LDA model

Let's try to predict the topic and keywords for a new document using our trained LDA model.
 
To do that, the new document needs to pass through the same data-preparation steps.
## Keeping first content of dataframe as our new document
new_doc = df['content'][0]

### Cleaning data

# Remove Emails
data = re.sub(r'\S*@\S*\s?', '', new_doc)
# Remove new line characters and extra space
data = re.sub(r'\s+', ' ', data)
# Remove single quotes
data = re.sub("'", "", data)

### Lemmatization
data_lemma = []
lis = []
doc = nlp(data)
for token in doc:
    lis.append(token.lemma_)
data_lemma.append(' '.join(lis))
    
### Tokenization and gensim stopword removal

# You can look for all gensim stopwords by running -> 'gensim.parsing.preprocessing.STOPWORDS'

# Function to tokenize
# Also remove words of 3 characters or fewer (you can change this)
def tokenization_with_gen_stop(text):
    result=[]
    for token in gensim.utils.simple_preprocess(text) :
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(token)
            
    return result

## Apply tokenization function
data_words = []
for txt in tqdm(data_lemma):
    data_words.append(tokenization_with_gen_stop(txt))
    
### NLTK Stopword removal (extra stopwords)

data_words_clean_new = []
for word in tqdm(data_words):
    for w in word:
        if w not in stop_words:
            data_words_clean_new.append(w)
After cleaning and pre-processing the data, we need to create a corpus for the new document using the main dictionary.
# Create corpus for new document
corpus_new = dictionary.doc2bow(data_words_clean_new)
corpus_new
Finally, we can print the topics for the new document.
print(ldamodel.get_document_topics(corpus_new))
[(2, 0.30002788), (3, 0.12005065), (4, 0.564969)]
 
The LDA output shows that topic 4 has the highest assigned probability, and topic 2 the second highest.
 
Note: LDA only returns the dominant topics; topics whose probability falls below a minimum threshold are omitted.
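If you want to see the probability of every topic for the document, not just the dominant ones, you can pass minimum_probability=0. A small sketch using the same corpus_new:
# Show all topic probabilities for the new document, including very small ones
print(ldamodel.get_document_topics(corpus_new, minimum_probability=0))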
 

Now you can find keywords for topic 4.

topic_prob = ldamodel.get_topic_terms(topicid=4)
for topic in topic_prob:
    print('word:',dictionary[topic[0]],'->','probability:',topic[1])

How to find the optimal number of topics for LDA?

At this point you know how to do topic modeling (Latent Dirichlet Allocation) using Gensim's built-in model and using Mallet. But in every case you had to specify the number of topics up front. Is there a way to find the optimal number of topics for LDA?
 
I prefer to find the optimal number of topics by building many LDA models with different numbers of topics (k) and picking the one that gives the highest coherence value, as sketched below.
 
If the same keywords keep repeating across multiple topics, it is probably a sign that k (the number of topics) is too large.
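Here is a minimal sketch of that search; the topic range tried below (4 to 20 in steps of 2) and the passes value are arbitrary choices you can adjust:
## Build LDA models for several values of k and record the coherence score of each
coherence_scores = {}
for k in range(4, 21, 2):
    model_k = gensim.models.ldamodel.LdaModel(corpus, num_topics=k, id2word=dictionary,
                                              random_state=100, passes=10)
    cm = gensim.models.CoherenceModel(model=model_k, texts=data_words_clean,
                                      dictionary=dictionary, coherence='c_v')
    coherence_scores[k] = cm.get_coherence()
    print('k =', k, '-> coherence =', coherence_scores[k])
# Pick the k with the highest coherence score
best_k = max(coherence_scores, key=coherence_scores.get)
print('Best number of topics:', best_k)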
 

Tuning LDA model

Like every algorithm, LDA also needs to be tuned to get the optimum result. To tune it you can:
  • Tune parameter values such as alpha, eta, gamma_threshold, minimum_phi_value etc. and check the coherence score (remember, higher is better); see the sketch after this list.
  • Keep only words with particular parts of speech (POS), such as nouns, adjectives and adverbs, in the corpus/dictionary.
  • Use stemming instead of, or along with, lemmatization.
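As a minimal sketch of the first point (the alpha and eta settings below are illustrative choices to experiment with, not recommendations):
## Re-train with explicit priors and compare the coherence score
## (alpha/eta values here are just examples)
tuned_model = gensim.models.ldamodel.LdaModel(corpus, num_topics=NUM_TOPICS, id2word=dictionary,
                                              random_state=100, passes=10,
                                              alpha='auto', eta='auto')
coherence_tuned = gensim.models.CoherenceModel(model=tuned_model, texts=data_words_clean,
                                               dictionary=dictionary, coherence='c_v')
print('Tuned coherence score:', coherence_tuned.get_coherence())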

Conclusion

In this tutorial I have covered:
  • Prerequisites for LDA modeling
  • Packages required for LDA model
  • Cleaning and Pre-processing for LDA
  • Prepare Dictionary and Corpus for Topic Modeling
  • Train LDA Topic Model with Gensim
  • View topics in LDA model
  • Evaluate LDA model
  • Visualize topics-keywords of LDA
  • Train Topic model with Mallet
  • Difference between Gensim LDA with Mallet LDA
  • Predict topic and keyword for new document with LDA model
  • How to find the optimal number of topics for LDA?
  • How to tune LDA model
If you have any questions or suggestions regarding this topic, see you in the comment section. I will try my best to answer.
