Doc2vec (also known as paragraph2vec, or more generally sentence embedding) is a modified version of word2vec. The main objective of doc2vec is to convert a sentence or paragraph into a vector (numeric) form.
In Natural Language Processing, Doc2Vec is used to find related sentences for a given sentence (instead of related words, as in Word2Vec).
In this article I will walk you through a simple implementation of doc2vec using Python and Gensim.
I have a separate article explaining how doc2vec works. I recommend reading that article before this one.
Must Read:
Data Pre-Processing for doc2vec Gensim
# Import packages
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

# Example document (list of sentences)
doc = ["I love data science",
       "I love coding in python",
       "I love building NLP tool",
       "This is a good phone",
       "This is a good TV",
       "This is a good laptop"]

# Tokenization of each document
tokenized_doc = []
for d in doc:
    tokenized_doc.append(word_tokenize(d.lower()))
tokenized_doc
Output:
[['i', 'love', 'data', 'science'],
 ['i', 'love', 'coding', 'in', 'python'],
 ['i', 'love', 'building', 'nlp', 'tool'],
 ['this', 'is', 'a', 'good', 'phone'],
 ['this', 'is', 'a', 'good', 'tv'],
 ['this', 'is', 'a', 'good', 'laptop']]
# Convert tokenized documents into gensim formatted tagged data
tagged_data = [TaggedDocument(d, [i]) for i, d in enumerate(tokenized_doc)]
tagged_data
Output:
[TaggedDocument(words=['i', 'love', 'data', 'science'], tags=[0]),
 TaggedDocument(words=['i', 'love', 'coding', 'in', 'python'], tags=[1]),
 TaggedDocument(words=['i', 'love', 'building', 'nlp', 'tool'], tags=[2]),
 TaggedDocument(words=['this', 'is', 'a', 'good', 'phone'], tags=[3]),
 TaggedDocument(words=['this', 'is', 'a', 'good', 'tv'], tags=[4]),
 TaggedDocument(words=['this', 'is', 'a', 'good', 'laptop'], tags=[5])]
The steps above are just basic data pre-processing. In a real-world application, pre-processing is rarely this simple: you would typically add steps like stemming, lemmatization, n-grams, and stop word removal. To keep this tutorial simple, I am skipping those steps.
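To give a flavour of one of those extra steps, here is a minimal sketch of stop word removal. The stop word list below is a small hand-picked set for illustration only; in practice you would use a proper list such as NLTK's `stopwords` corpus.

```python
# Hand-picked stop word list, for illustration only
STOP_WORDS = {"i", "a", "an", "the", "is", "in", "this"}

def preprocess(sentence):
    """Lower-case, split on whitespace, and drop stop words."""
    return [w for w in sentence.lower().split() if w not in STOP_WORDS]

print(preprocess("This is a good phone"))  # ['good', 'phone']
```

Dropping stop words shrinks the vocabulary, which can help on tiny corpora like ours, but it also removes context words that doc2vec could otherwise learn from, so it is a judgment call.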
Now we are ready to train our doc2vec model.
Also Read:
Train save and load doc2vec model Python
Here I am using the distributed memory paragraph vector (PV-DM) variant of doc2vec, which is Gensim's default.
Note: dm=1 means 'distributed memory' (PV-DM) and dm=0 means 'distributed bag of words' (PV-DBOW).
# Train doc2vec model
model = Doc2Vec(tagged_data, vector_size=20, window=2, min_count=1, workers=4, epochs=100)

# Save trained doc2vec model
model.save("test_doc2vec.model")

# Load saved doc2vec model
model = Doc2Vec.load("test_doc2vec.model")

# Print model vocabulary
model.wv.vocab
Output:
{‘a’: <gensim.models.keyedvectors.Vocab at 0xc45edbb710>,
‘building’: <gensim.models.keyedvectors.Vocab at 0xc45edbb518>,
‘coding’: <gensim.models.keyedvectors.Vocab at 0xc45edbb400>,
‘data’: <gensim.models.keyedvectors.Vocab at 0xc45edbb320>,
‘good’: <gensim.models.keyedvectors.Vocab at 0xc45edbb780>,
‘i’: <gensim.models.keyedvectors.Vocab at 0xc45edbb048>,
‘in’: <gensim.models.keyedvectors.Vocab at 0xc45edbb470>,
‘is’: <gensim.models.keyedvectors.Vocab at 0xc45edbb6d8>,
‘laptop’: <gensim.models.keyedvectors.Vocab at 0xc45edbb8d0>,
‘love’: <gensim.models.keyedvectors.Vocab at 0xc45edbb2b0>,
‘nlp’: <gensim.models.keyedvectors.Vocab at 0xc45edbb588>,
‘phone’: <gensim.models.keyedvectors.Vocab at 0xc45edbb7f0>,
‘python’: <gensim.models.keyedvectors.Vocab at 0xc45edbb4e0>,
‘science’: <gensim.models.keyedvectors.Vocab at 0xc45edbb390>,
‘this’: <gensim.models.keyedvectors.Vocab at 0xc45edbb668>,
‘tool’: <gensim.models.keyedvectors.Vocab at 0xc45edbb5f8>,
‘tv’: <gensim.models.keyedvectors.Vocab at 0xc45edbb860>}
Document Similarity using doc2vec Python
# Find the documents most similar to a new, unseen document
test_doc = word_tokenize("That is a good device".lower())
model.docvecs.most_similar(positive=[model.infer_vector(test_doc)], topn=5)
Output:
[(5, 0.28079578280448914),
(0, 0.1330653727054596),
(3, 0.12503036856651306),
(4, 0.05355849117040634),
(2, 0.05051974207162857)]
Here (5, 0.28079578280448914) means our provided test_doc is most similar to document 5 of the training set, with a cosine similarity of about 0.28 (note that this score is a similarity, not a probability).
Note: document 5 of training data is: “This is a good laptop”
Conclusion:
In this Doc2vec Gensim tutorial I have discussed:
- How to implement doc2vec in Python using Gensim
- Data pre-processing for doc2vec
- Training a doc2vec model
- Saving a trained doc2vec model
- Loading a saved doc2vec model
- Listing the doc2vec model vocabulary
- Finding the vector representation of a document using a trained doc2vec model
- Finding the similarity between two documents/sentences using doc2vec in Gensim
If you have any question or suggestion regarding this topic, please let me know in the comment section; I will try my best to answer.