Doc2vec (also known as paragraph2vec, or more generally sentence embedding) is a modified version of word2vec. The main objective of doc2vec is to convert a sentence or paragraph into a vector (numeric) form.
In Natural Language Processing, Doc2Vec is used to find related sentences for a given sentence (instead of related words, as in Word2Vec).
In this article I will walk you through a simple implementation of doc2vec using Python and Gensim.
I have a separate article explaining how doc2vec works. I recommend reading that article before this one.
Must Read:
Data Pre-Processing for doc2vec Gensim
# Import packages
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

# Example document (list of sentences)
doc = ["I love data science",
       "I love coding in python",
       "I love building NLP tool",
       "This is a good phone",
       "This is a good TV",
       "This is a good laptop"]

# Tokenization of each document
tokenized_doc = []
for d in doc:
    tokenized_doc.append(word_tokenize(d.lower()))
tokenized_doc
Output:
[['i', 'love', 'data', 'science'],
 ['i', 'love', 'coding', 'in', 'python'],
 ['i', 'love', 'building', 'nlp', 'tool'],
 ['this', 'is', 'a', 'good', 'phone'],
 ['this', 'is', 'a', 'good', 'tv'],
 ['this', 'is', 'a', 'good', 'laptop']]
# Convert tokenized documents into gensim formatted tagged data
tagged_data = [TaggedDocument(d, [i]) for i, d in enumerate(tokenized_doc)]
tagged_data
Output:
[TaggedDocument(words=['i', 'love', 'data', 'science'], tags=[0]),
 TaggedDocument(words=['i', 'love', 'coding', 'in', 'python'], tags=[1]),
 TaggedDocument(words=['i', 'love', 'building', 'nlp', 'tool'], tags=[2]),
 TaggedDocument(words=['this', 'is', 'a', 'good', 'phone'], tags=[3]),
 TaggedDocument(words=['this', 'is', 'a', 'good', 'tv'], tags=[4]),
 TaggedDocument(words=['this', 'is', 'a', 'good', 'laptop'], tags=[5])]
The steps above are just basic data pre-processing. In a real-world application, pre-processing is rarely this simple: you would typically add steps like stemming, lemmatization, n-grams, and stop word removal. To keep this tutorial simple, I am skipping those steps.
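To give a flavour of one of those extra steps, here is a minimal sketch of stop word removal. The stop word list below is a small hand-picked set for illustration only; in practice you would use a proper list such as NLTK's `stopwords` corpus.

```python
# Hand-picked stop word list, for illustration only
STOP_WORDS = {"i", "a", "an", "the", "is", "in", "this"}

def preprocess(sentence):
    """Lower-case, split on whitespace, and drop stop words."""
    return [w for w in sentence.lower().split() if w not in STOP_WORDS]

print(preprocess("This is a good phone"))  # ['good', 'phone']
```

Dropping stop words shrinks the vocabulary, which can help on tiny corpora like ours, but it also removes context words that doc2vec could otherwise learn from, so it is a judgment call.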
Now we are ready to train our doc2vec model.
Also Read:
Train save and load doc2vec model Python
Here I am using the distributed memory paragraph vector (PV-DM) variant of doc2vec, which is Gensim's default.
Note: dm=1 means 'distributed memory' (PV-DM) and dm=0 means 'distributed bag of words' (PV-DBOW).
# Train doc2vec model
model = Doc2Vec(tagged_data, vector_size=20, window=2, min_count=1, workers=4, epochs=100)

# Save trained doc2vec model
model.save("test_doc2vec.model")

# Load saved doc2vec model
model = Doc2Vec.load("test_doc2vec.model")

# Print model vocabulary
model.wv.vocab
Output:
{‘a’: <gensim.models.keyedvectors.Vocab at 0xc45edbb710>,
‘building’: <gensim.models.keyedvectors.Vocab at 0xc45edbb518>,
‘coding’: <gensim.models.keyedvectors.Vocab at 0xc45edbb400>,
‘data’: <gensim.models.keyedvectors.Vocab at 0xc45edbb320>,
‘good’: <gensim.models.keyedvectors.Vocab at 0xc45edbb780>,
‘i’: <gensim.models.keyedvectors.Vocab at 0xc45edbb048>,
‘in’: <gensim.models.keyedvectors.Vocab at 0xc45edbb470>,
‘is’: <gensim.models.keyedvectors.Vocab at 0xc45edbb6d8>,
‘laptop’: <gensim.models.keyedvectors.Vocab at 0xc45edbb8d0>,
‘love’: <gensim.models.keyedvectors.Vocab at 0xc45edbb2b0>,
‘nlp’: <gensim.models.keyedvectors.Vocab at 0xc45edbb588>,
‘phone’: <gensim.models.keyedvectors.Vocab at 0xc45edbb7f0>,
‘python’: <gensim.models.keyedvectors.Vocab at 0xc45edbb4e0>,
‘science’: <gensim.models.keyedvectors.Vocab at 0xc45edbb390>,
‘this’: <gensim.models.keyedvectors.Vocab at 0xc45edbb668>,
‘tool’: <gensim.models.keyedvectors.Vocab at 0xc45edbb5f8>,
‘tv’: <gensim.models.keyedvectors.Vocab at 0xc45edbb860>}
Document Similarity using doc2vec Python
# Find the documents most similar to a new, unseen document
test_doc = word_tokenize("That is a good device".lower())
model.docvecs.most_similar(positive=[model.infer_vector(test_doc)], topn=5)
Output:
[(5, 0.28079578280448914),
(0, 0.1330653727054596),
(3, 0.12503036856651306),
(4, 0.05355849117040634),
(2, 0.05051974207162857)]
Here (5, 0.28079578280448914) means our provided test_doc is most similar to document 5 of the training set, with a cosine similarity of about 0.28 (note that this score is a similarity, not a probability).
Note: document 5 of training data is: “This is a good laptop”
Conclusion:
In this Doc2vec Gensim tutorial I have discussed:
- How to implement doc2vec in Python using Gensim
- Data pre-processing for doc2vec
- Training a doc2vec model
- Saving a trained doc2vec model
- Loading a saved doc2vec model
- Listing the doc2vec model vocabulary
- Finding the vector representation of a document using a trained doc2vec model
- Finding the similarity between two documents/sentences using doc2vec in Gensim
If you have any question or suggestion regarding this topic, please let me know in the comment section; I will try my best to answer.