Gensim word2vec python implementation

 
Word embedding is one of the most important techniques in Natural Language Processing (NLP).

Word embedding is used to convert or map words to vectors of real numbers.
Using word embeddings, you can capture the meaning of a word in a document, its relation with the other words in that document, semantic and syntactic similarity, and more.
 
Word2vec is one of the most popular word embedding techniques. In my last few tutorials I have explained the different flavours of word2vec and how they work.
 
In this tutorial I will show you a gensim word2vec model implementation in Python.
 


Now let’s start gensim word2vec python implementation.

Install Packages

Now let’s install some packages to implement word2vec in gensim.
 
pip install gensim==3.8.3
pip install spacy==2.2.4
python -m spacy download en_core_web_sm
pip install matplotlib
pip install tqdm
 
It is good practice to create a fresh virtual environment when working on this kind of project.
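For example, you can create and activate one with Python's built-in venv module (the environment name word2vec_env below is just an example):

python -m venv word2vec_env
source word2vec_env/bin/activate      # Linux/macOS
word2vec_env\Scripts\activate         # Windows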
 
 
Now let’s import all required packages

Import packages to implement word2vec python

Let’s import all required packages for gensim word2vec implementation.
 
import json
import pandas as pd
from time import time
import re
from tqdm import tqdm

import spacy
nlp = spacy.load("en_core_web_sm", disable=['ner', 'parser']) # disabling Named Entity Recognition and the parser for speed

# To extract n-gram from text
from gensim.models.phrases import Phrases, Phraser
# To train word2vec
from gensim.models import Word2Vec
# To load pre trained word2vec
from gensim.models import KeyedVectors
# To read glove word embedding
from gensim.scripts.glove2word2vec import glove2word2vec

from sklearn.decomposition import PCA
import numpy as np
from sklearn.manifold import TSNE
import seaborn as sns
sns.set_style("darkgrid")
import matplotlib.pyplot as plt


# To find word frequency (Term frequency)
from collections import defaultdict

Download Data to implement word2vec gensim

For this tutorial I will be using the Yelp customer review dataset; you can find the link below to download it from Kaggle.
 
 
This is a review dataset covering various restaurants and their food and services. The dataset has several columns, such as: 'user_id', 'business_id', 'text', 'date', and 'compliment_count'.
 
Since we are only interested in building word2vec (word embeddings), for this tutorial I will only use the 'text' column.

Data Pre-processing for gensim word2vec

Data preparation for word embedding with gensim word2vec can vary from problem to problem. In this tutorial I have tried to share a standard data pre-processing pipeline which can be used in most gensim word2vec projects.
 
Let’s first read the Yelp review dataset and check whether there are any missing values.
# Read yelp review tip dataset
yelp_df = pd.read_json("input_data/yelp_academic_dataset_tip.json", lines=True)

print('List of all columns')
print(list(yelp_df))

# Check for missing values in our dataframe
# (there are no missing values in this dataset)
yelp_df.isnull().sum()
 
 

Step 1: Lemmatize, remove stopwords and remove non-alphabetic characters

# Lemmatizing, removing stopwords and removing non-alphabetic characters

# Initializing regular expression
regex = re.compile('[^a-zA-Z]')

clean_txt = []
for row_num in tqdm(range(len(yelp_df))):
    txt_list = []
    doc = nlp(yelp_df['text'][row_num])
    for token in doc:
        if not token.is_stop:
            # Removes non-alphabetic characters from lemmatized word
            txt_list.append(regex.sub('', token.lemma_))
    if len(txt_list) > 0:
        clean_txt.append(' '.join(txt_list))
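To see what this step actually does, you can compare one raw review with its cleaned version (a quick sanity check; note that clean_txt skips reviews that end up empty, so the indices may not line up exactly with yelp_df):

# Compare the first raw review with its cleaned counterpart
print(yelp_df['text'][0])
print(clean_txt[0])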

Step 2: Remove duplicates and missing values

# Make a Dataframe
yelp_df_clean = pd.DataFrame({'clean': clean_txt})
# Remove duplicates and missing values
yelp_df_clean = yelp_df_clean.dropna().drop_duplicates()
yelp_df_clean.shape

Step 3: Extract bigrams for gensim word2vec

# As Phrases() takes list of list as it's input
# Converting dataframe to list of list
sent = [row for row in yelp_df_clean['clean']]

# Tokenize each sentence
token_sent = [doc.split(" ") for doc in sent]

# Configuring Phrases() for bigram
bigram = Phrases(token_sent, min_count=35, threshold=2,delimiter=b' ')

# Initializing Phraser() for bigram
bigram_phraser = Phraser(bigram)

# Extract bigrams for gensim word2vec
bigram_token = []
for sen in token_sent:
    bigram_token.append(bigram_phraser[sen])
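To check what the phraser produced, you can print one sentence after bigram merging (a minimal sketch; whether any bigram actually appears depends on the corpus):

# Multi-word phrases, if detected, show up as single space-joined tokens
# because we set delimiter=b' '
print(bigram_token[0])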
At this point our data is ready for the word2vec python implementation. Now let’s look at the most frequent words to check whether our cleaned data still contains stopwords or inflected forms.
# Count most frequent words
word_freq = defaultdict(int)
for sen in bigram_token:
    for i in sen:
        word_freq[i] += 1
len(word_freq)

# print most frequent words
sorted(word_freq, key=word_freq.get, reverse=True)[:10]

Train word2vec python

In this tutorial I will train a skip-gram model. You can also train a CBOW model by changing the sg value to 0 (see the sketch after the hyperparameter list below).
 
Let’s train the gensim word2vec model with our own custom data as follows:
 
# Train word2vec
yelp_model = Word2Vec(bigram_token, min_count=1, size=300, workers=3, window=3, sg=1)
Now let’s explore the hyperparameters used in this model.
 
min_count: Ignores all words with total frequency lower than this number. Default value for min_count is 5.
 
size: Dimensionality of the word vectors. Default value is 100.
 
workers: Number of threads to train the model (faster training with multicore machines).
 
window: Maximum distance between the current and predicted word within a sentence.
 
sg: Training algorithm: 1 for skip-gram; otherwise CBOW. So here we are training a skip-gram model.
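For comparison, this is how the same call would look for the CBOW variant (a minimal sketch; the variable name yelp_cbow_model is just illustrative):

# CBOW variant: identical hyperparameters, only sg changes to 0
yelp_cbow_model = Word2Vec(bigram_token, min_count=1, size=300, workers=3, window=3, sg=0)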

Save & Load Gensim word2vec model

It is good practice to save the trained word2vec model so that we can load it later for reuse and update it with new data.

# Save word2vec gensim model
yelp_model.save("output_data/word2vec_model_yelp")
# Load saved gensim word2vec model
trained_yelp_model = Word2Vec.load("output_data/word2vec_model_yelp")

Explore Gensim word2vec model

Now it’s time to explore word embedding of our trained gensim word2vec model.

# Check word embedding for a perticular word
trained_yelp_model.wv['food']
# Dimension must be 300
trained_yelp_model.wv['food'].shape

(300,)

# Check the top 10 most similar words for a given word with gensim word2vec
trained_yelp_model.wv.most_similar('food', topn=10)

# Check the similarity score between two words
trained_yelp_model.wv.similarity('beer', 'drink')

0.5015675

# Most opposite to a word
trained_yelp_model.wv.most_similar(negative=["food"], topn=10)
The similarity score you get for a pair of words is calculated by taking the cosine similarity between those two words' vectors (word embeddings).
 
Note: if you check the similarity between two identical words, the score will be 1, since cosine similarity ranges from -1 to 1 (and for word vectors it usually falls between 0 and 1).
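If you want to verify this, you can reproduce gensim's similarity score by computing the cosine similarity yourself from the raw vectors (a minimal sketch using numpy only):

# Manual cosine similarity between 'beer' and 'drink'
v1 = trained_yelp_model.wv['beer']
v2 = trained_yelp_model.wv['drink']
cos_sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(cos_sim)  # should match trained_yelp_model.wv.similarity('beer', 'drink')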

Word2vec visualization tsne

It’s difficult to visualize word2vec (word embeddings) directly, as word embeddings usually have more than 3 dimensions (in our case 300).
 
So for word2vec visualization we need to reduce the dimensionality by applying PCA (Principal Component Analysis) and t-SNE.


The following code visualizes word2vec using a t-SNE plot.

# tsne plot for below word
# for_word = 'food'
def tsne_plot(for_word, w2v_model):
    # trained word2vec model dimension
    dim_size = w2v_model.wv.vectors.shape[1]

    arrays = np.empty((0, dim_size), dtype='f')
    word_labels = [for_word]
    color_list  = ['red']

    # adds the vector of the query word
    arrays = np.append(arrays, w2v_model.wv.__getitem__([for_word]), axis=0)

    # gets list of most similar words
    sim_words = w2v_model.wv.most_similar(for_word, topn=10)

    # adds the vector for each of the closest words to the array
    for wrd_score in sim_words:
        wrd_vector = w2v_model.wv.__getitem__([wrd_score[0]])
        word_labels.append(wrd_score[0])
        color_list.append('green')
        arrays = np.append(arrays, wrd_vector, axis=0)

    #---------------------- Apply PCA and tsne to reduce dimention --------------

    # Reduce the high-dimensional vectors to 10 components with PCA before t-SNE
    model_pca = PCA(n_components=10).fit_transform(arrays)

    # Find 2d coordinates with t-SNE (perplexity must stay below the number of points)
    np.set_printoptions(suppress=True)
    Y = TSNE(n_components=2, random_state=0, perplexity=10).fit_transform(model_pca)

    # Sets everything up to plot
    df_plot = pd.DataFrame({'x': [x for x in Y[:, 0]],
                       'y': [y for y in Y[:, 1]],
                       'words_name': word_labels,
                       'words_color': color_list})

    #------------------------- tsne plot Python -----------------------------------

    # plot dots with color and position
    plot_dot = sns.regplot(data=df_plot,
                     x="x",
                     y="y",
                     fit_reg=False,
                     marker="o",
                     scatter_kws={'s': 40,
                                  'facecolors': df_plot['words_color']
                                 }
                    )

    # Adds annotations with color one by one with a loop
    for line in range(0, df_plot.shape[0]):
         plot_dot.text(df_plot["x"][line],
                 df_plot['y'][line],
                 '  ' + df_plot["words_name"][line].title(),
                 horizontalalignment='left',
                 verticalalignment='bottom', size='medium',
                 color=df_plot['words_color'][line],
                 weight='normal'
                ).set_size(15)


    plt.xlim(Y[:, 0].min()-50, Y[:, 0].max()+50)
    plt.ylim(Y[:, 1].min()-50, Y[:, 1].max()+50)

    plt.title('t-SNE visualization for word "{}"'.format(for_word.title()))

# tsne plot for top 10 similar word to 'food'
tsne_plot(for_word='food', w2v_model=trained_yelp_model)
T-SNE plot for custom word2vec model

Update pre-trained gensim word2vec model

Now that you have trained and saved your custom word2vec model, you can always load this saved model and update it with a new dataset.

You can also update pre-trained word embeddings (explained below), such as the Google pre-trained word2vec or GloVe pre-trained models.
 
new_data = [['yes', 'this', 'is', 'the', 'word2vec', 'model'],[ 'if',"you","have","think","about","it"]]

# Update trained gensim word2vec model
trained_yelp_model.build_vocab(new_data, update = True)

# Continue training (update) the word2vec model with the new data
# Note: train() updates the model in place and returns word counts, not a new model
trained_yelp_model.train(new_data, total_examples=trained_yelp_model.corpus_count, epochs=trained_yelp_model.epochs)
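As a quick check that the update worked, you can confirm that a token from the new data is now in the vocabulary (a minimal sketch; in gensim 3.8 the vocabulary is exposed as wv.vocab):

# 'word2vec' came only from new_data, so it should now be in the vocabulary
print('word2vec' in trained_yelp_model.wv.vocab)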

Working with Pre-trained word embeddings python

Pre-trained models are the simplest way to start working with word embeddings. The advantage of pre-trained word embeddings is that they leverage massive datasets that you may not have access to, built from billions of words.
 
Pre-trained models are also available for different languages, which may help you build multilingual applications.
 
You can further update pre-trained word2vec model using your own custom data.
 
Now let’s work with some popular pre-trained embeddings in Python gensim.

Google Pre trained word2vec

You can download the Google pre-trained word2vec model from the link below (it is also included as a comment in the code):

 
 
After downloading the Google pre-trained word embedding you need to extract it into a folder, and then follow the code below.
 
Note: the dimensionality of the Google pre-trained word2vec vectors is 300.
# https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit
# Load Google Pre trained word2vec model
pretrained_google_news_model = KeyedVectors.load_word2vec_format('input_data/GoogleNews-vectors-negative300.bin', binary=True)

# Access vectors for specific words with a keyed lookup:
vector = pretrained_google_news_model['food']
# see the shape of the vector (300,)
vector.shape

(300,)
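You can query this pre-trained model exactly like our custom one; for example, a quick look at its nearest neighbours (the output will differ from our Yelp model, since this one was trained on Google News):

# Top 5 most similar words to 'food' according to the Google News vectors
pretrained_google_news_model.most_similar('food', topn=5)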

# tsne plot for top 10 similar word to 'food'
tsne_plot(for_word='food', w2v_model=pretrained_google_news_model)
T-SNE plot for the Google pre-trained word2vec model

Glove pre-trained model

GloVe (developed by a Stanford research team) is an unsupervised learning algorithm for obtaining vector representations of words (word vectors).
 
You just need to download the GloVe pre-trained model from the link below and follow the code to work with it.
 

glove.6B.300d.txt

# https://nlp.stanford.edu/projects/glove/
# Convert and save glove word embedding to gensim format
glove2word2vec(glove_input_file="input_data/glove.6B.300d.txt", word2vec_output_file="output_data/gensim_glove_vectors.txt")

# Read saved gensim glove word embedding
glove_model = KeyedVectors.load_word2vec_format("output_data/gensim_glove_vectors.txt", binary=False)

# tsne plot for top 10 similar word to 'food'
tsne_plot(for_word='food', w2v_model=glove_model)
T-SNE plot for the GloVe pre-trained model
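GloVe vectors also work well for the classic word-analogy check; for example (a minimal sketch, the exact ranking depends on the GloVe file you downloaded):

# king - man + woman should rank 'queen' near the top
glove_model.most_similar(positive=['king', 'woman'], negative=['man'], topn=3)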

Spacy pre-trained word embedding

The Python library spaCy also provides pre-trained word embeddings. You can use spaCy pre-trained word embeddings by downloading them with the commands below.
 
Note: I am sharing the commands to download the English pre-trained word vector models, though spaCy supports and provides word embeddings for multiple languages.
 

Small word embedding (the one I am using in this tutorial; note that the small model ships without static word vectors, so token vectors are derived from its context tensors)

python -m spacy download en_core_web_sm

Medium size spacy word embedding

python -m spacy download en_core_web_md

Large size spacy word embedding

python -m spacy download en_core_web_lg

# process a sentence using the pretrained model
doc = nlp("This is a sample text to check spacy word embeddings")
for token in doc:
    print(token.i, token)

# Get the vector for word 'text':
doc[4].vector
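With those vectors you can also compare tokens directly; a minimal sketch (with the small model spaCy may warn that no true word vectors are loaded, since the values come from context tensors):

# Similarity between 'text' (doc[4]) and 'pizza' (doc2[3])
doc2 = nlp("I enjoyed the pizza and the beer")
print(doc[4].similarity(doc2[3]))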
 

Explore word2vec Online

If you just want to explore word2vec online, visit the link below. This BioNLP portal lets you explore four different word2vec models.
 
You can check the similarity between two words and try word analogies.
 
 

Conclusion

In this tutorial you learned:
 
  • Implementation of word2vec in Python
  • Data pre-processing for gensim word2vec
  • Training word2vec in Python
  • Saving a trained gensim word2vec model
  • Loading a saved gensim word2vec model
  • Gensim word2vec visualization with t-SNE
  • Updating a pre-trained gensim word2vec model
  • Working with the Google pre-trained word2vec model in gensim
  • Working with the GloVe pre-trained model in gensim
  • Working with spaCy pre-trained word embeddings
 

If you have any questions or suggestions regarding this topic, see you in the comment section. I will try my best to answer.

