Gensim Doc2Vec Python implementation

Doc2vec (also known as paragraph2vec, and a form of sentence embedding) is an extension of word2vec. Its main objective is to convert a sentence or paragraph into a numeric vector.

In Natural Language Processing, Doc2Vec is used to find sentences related to a given sentence (whereas Word2Vec finds words related to a given word).

In this article I will walk you through a simple implementation of doc2vec using Python and Gensim.
I have a separate article that explains how doc2vec works. I recommend reading that article before this one.

Data Pre-Processing for doc2vec Gensim

# Import packages
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
# Note: word_tokenize needs NLTK's tokenizer data, e.g. nltk.download('punkt')
# ('punkt_tab' in newer NLTK releases)

# Example documents (a list of sentences)
doc = ["I love data science",
       "I love coding in python",
       "I love building NLP tool",
       "This is a good phone",
       "This is a good TV",
       "This is a good laptop"]

# Tokenize each document (lowercased first)
tokenized_doc = []
for d in doc:
    tokenized_doc.append(word_tokenize(d.lower()))
tokenized_doc


Output:
[['i', 'love', 'data', 'science'],
 ['i', 'love', 'coding', 'in', 'python'],
 ['i', 'love', 'building', 'nlp', 'tool'],
 ['this', 'is', 'a', 'good', 'phone'],
 ['this', 'is', 'a', 'good', 'tv'],
 ['this', 'is', 'a', 'good', 'laptop']]

# Convert the tokenized documents into Gensim's TaggedDocument format
tagged_data = [TaggedDocument(d, [i]) for i, d in enumerate(tokenized_doc)]
tagged_data

Output:
[TaggedDocument(words=['i', 'love', 'data', 'science'], tags=[0]),
 TaggedDocument(words=['i', 'love', 'coding', 'in', 'python'], tags=[1]),
 TaggedDocument(words=['i', 'love', 'building', 'nlp', 'tool'], tags=[2]),
 TaggedDocument(words=['this', 'is', 'a', 'good', 'phone'], tags=[3]),
 TaggedDocument(words=['this', 'is', 'a', 'good', 'tv'], tags=[4]),
 TaggedDocument(words=['this', 'is', 'a', 'good', 'laptop'], tags=[5])]

The steps above are only basic data pre-processing. In a complex real-world application, pre-processing is rarely this simple. In that case you should also consider steps such as stemming, lemmatization, n-grams and stop word removal. To keep this tutorial simple I am skipping those steps.
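
To give a feel for two of the extra steps mentioned above (stop word removal and n-grams), here is a minimal sketch in plain Python. The stop-word list and the helper names (STOP_WORDS, remove_stop_words, add_bigrams, preprocess) are my own assumptions for illustration, and it uses a simple whitespace split instead of word_tokenize so that it runs without NLTK.

```python
# Hypothetical, hand-picked stop-word list for illustration only;
# a real project would use a fuller list from an NLP library.
STOP_WORDS = {"i", "is", "a", "this", "in"}


def remove_stop_words(tokens):
    """Drop tokens that appear in STOP_WORDS."""
    return [t for t in tokens if t not in STOP_WORDS]


def add_bigrams(tokens):
    """Return the tokens followed by bigrams built from adjacent pairs."""
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


def preprocess(sentence):
    """Lowercase, split on whitespace, remove stop words, add bigrams."""
    tokens = sentence.lower().split()
    return add_bigrams(remove_stop_words(tokens))


print(preprocess("I love coding in python"))
# ['love', 'coding', 'python', 'love_coding', 'coding_python']
```

The resulting token lists could be wrapped in TaggedDocument in the same way as the tokenized sentences above.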

Now we are ready to train our doc2vec model.

Train save and load doc2vec model Python

Here I am using the distributed memory paragraph vector (PV-DM) model, which is Gensim's default for Doc2Vec.
Note: dm=1 means 'distributed memory' (PV-DM) and dm=0 means 'distributed bag of words' (PV-DBOW).

# Train doc2vec model (dm defaults to 1, i.e. PV-DM)
model = Doc2Vec(tagged_data, vector_size=20, window=2, min_count=1, workers=4, epochs=100)
# Save trained doc2vec model
model.save("test_doc2vec.model")
# Load saved doc2vec model
model = Doc2Vec.load("test_doc2vec.model")
# Print model vocabulary
# (Gensim 3.x attribute; in Gensim 4.0+ use model.wv.key_to_index instead)
model.wv.vocab


Output:
{'a': <gensim.models.keyedvectors.Vocab at 0xc45edbb710>,
 'building': <gensim.models.keyedvectors.Vocab at 0xc45edbb518>,
 'coding': <gensim.models.keyedvectors.Vocab at 0xc45edbb400>,
 'data': <gensim.models.keyedvectors.Vocab at 0xc45edbb320>,
 'good': <gensim.models.keyedvectors.Vocab at 0xc45edbb780>,
 'i': <gensim.models.keyedvectors.Vocab at 0xc45edbb048>,
 'in': <gensim.models.keyedvectors.Vocab at 0xc45edbb470>,
 'is': <gensim.models.keyedvectors.Vocab at 0xc45edbb6d8>,
 'laptop': <gensim.models.keyedvectors.Vocab at 0xc45edbb8d0>,
 'love': <gensim.models.keyedvectors.Vocab at 0xc45edbb2b0>,
 'nlp': <gensim.models.keyedvectors.Vocab at 0xc45edbb588>,
 'phone': <gensim.models.keyedvectors.Vocab at 0xc45edbb7f0>,
 'python': <gensim.models.keyedvectors.Vocab at 0xc45edbb4e0>,
 'science': <gensim.models.keyedvectors.Vocab at 0xc45edbb390>,
 'this': <gensim.models.keyedvectors.Vocab at 0xc45edbb668>,
 'tool': <gensim.models.keyedvectors.Vocab at 0xc45edbb5f8>,
 'tv': <gensim.models.keyedvectors.Vocab at 0xc45edbb860>}
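
To see why the vocabulary holds exactly these 17 words, and what a higher min_count would do, here is a small sketch that counts tokens with a plain Counter. This is not how Gensim builds its vocabulary internally; build_vocab is a hypothetical helper that only mimics the documented rule that words with a total frequency below min_count are ignored.

```python
from collections import Counter

# The tokenized training sentences from this tutorial.
tokenized = [
    ["i", "love", "data", "science"],
    ["i", "love", "coding", "in", "python"],
    ["i", "love", "building", "nlp", "tool"],
    ["this", "is", "a", "good", "phone"],
    ["this", "is", "a", "good", "tv"],
    ["this", "is", "a", "good", "laptop"],
]


def build_vocab(docs, min_count=1):
    """Return the sorted words whose total frequency is at least min_count."""
    counts = Counter(tok for d in docs for tok in d)
    return sorted(w for w, c in counts.items() if c >= min_count)


print(len(build_vocab(tokenized, min_count=1)))
# 17
print(build_vocab(tokenized, min_count=2))
# ['a', 'good', 'i', 'is', 'love', 'this']
```

With min_count=2, words such as 'laptop' and 'python' would be dropped, which is why min_count=1 is used for this tiny corpus.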

Document Similarity using doc2vec Python

# Find the most similar documents
test_doc = word_tokenize("That is a good device".lower())
# (model.docvecs is the Gensim 3.x name; Gensim 4.0+ renamed it to model.dv)
model.docvecs.most_similar(positive=[model.infer_vector(test_doc)], topn=5)

Output:

[(5, 0.28079578280448914),
 (0, 0.1330653727054596),
 (3, 0.12503036856651306),
 (4, 0.05355849117040634),
 (2, 0.05051974207162857)]

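Here (5, 0.28079578280448914) means that the provided test_doc is most similar to document 5 of the training set, with a cosine similarity of about 0.28. This score is a similarity measure, not a probability, and because training and infer_vector both involve randomness, the exact numbers can differ from run to run.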

Note: document 5 of the training data is: "This is a good laptop"
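
The scores returned by most_similar are cosine similarities between vectors, which range from -1 to 1. A minimal plain-Python sketch of that measure is shown below; cosine_similarity is a helper written for this illustration, not a Gensim function, and the two-dimensional vectors are toy inputs.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0  (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```

A score of about 0.28 therefore indicates only a weak similarity, which is not surprising for a model trained on six short sentences.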

Conclusion:

In this Doc2vec Gensim tutorial I have discussed:

    • How to implement doc2vec in Python using Gensim
    • Data pre-processing for doc2vec
    • Training a doc2vec model
    • Saving a trained doc2vec model
    • Loading a saved doc2vec model
    • Listing the vocabulary of a doc2vec model
    • Finding the vector representation of a document with a trained doc2vec model
    • Finding the similarity between documents or sentences with doc2vec in Gensim

If you have any questions or suggestions regarding this topic, please let me know in the comment section and I will try my best to answer.
