FastText Word Embeddings Python implementation

FastText is a library developed by the Facebook research team for text classification and for learning word embeddings. It is popular because it trains quickly and gives accurate results. If you want more detail, you can read the official fastText paper.

FastText can be used for two tasks:

  1. Text Representation (fastText word embeddings)
  2. Text Classification

In this fastText tutorial post, I will only talk about the Python implementation of fastText word embeddings on Windows. I will use the Gensim fastText library to train fastText word embeddings in Python.

How FastText word embeddings work

FastText is a modified version of word2vec (i.e. Skip-Gram and CBOW). The main difference between fastText and word2vec is its pooling strategy (that is, what the input, output, and dictionary of the model are).

In word2vec, each word is treated as a single atomic unit, but in FastText each word is represented as a bag of character n-grams.

A character n-gram is a contiguous sequence of n characters taken from a word. It may be a bigram (n = 2), a trigram (n = 3), etc.
For example, the character trigrams (n = 3) of the word “where” are:

<wh, whe, her, ere, re>

Here the symbols < and > mark the beginning and the end of the word. The FastText architecture also includes the word itself alongside its character n-grams. That means the input data to the model for the word “where” will be:

<wh, whe, her, ere, re> and <where>
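To make the n-gram idea concrete, here is a minimal sketch that reproduces the trigrams listed above. The helper names `char_ngrams` and `fasttext_input_units` are my own (they are not part of fastText or Gensim), and I use a single n for simplicity, whereas the real library typically works with a range of n-gram lengths.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of `word`, using '<' and '>' as boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]


def fasttext_input_units(word, n=3):
    """Character n-grams plus the whole word wrapped in boundary markers."""
    return char_ngrams(word, n) + [f"<{word}>"]


print(char_ngrams("where"))
print(fasttext_input_units("where"))
```

The boundary markers are what let the model tell the trigram “her” inside “where” apart from the standalone word “<her>”.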

The model itself is the same as in word2vec: a shallow neural network with one hidden layer.

FastText Architecture

Let’s explore FastText architecture in detail.
Let’s say we have a text like below:

“natural language processing in python”

Now, to prepare training data for a (Skip-Gram based) FastText model, we define a “context word” as a word that appears next to a given word in the text (the given word is our “target word”). That means we will be predicting the surrounding words for a given word.
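The pairing of target and context words can be sketched as follows. This is only an illustration under my own assumptions: `skipgram_pairs` is a hypothetical helper, the window is symmetric, and I use a window of 1 to keep the output short.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every word within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


tokens = "natural language processing in python".split()
for target, context in skipgram_pairs(tokens, window=1):
    print(target, "->", context)
```

A larger window simply produces more pairs per target word, which is what the `window` hyperparameter controls later in the tutorial.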

Note: FastText word embeddings support both Continuous Bag of Words (CBOW) and Skip-Gram models. In this article, I will explain and implement the skip-gram model to learn vector representations (FastText word embeddings).

Now let’s construct our training examples as in Skip-Gram: scanning through the text with a sliding window gives us a context word and a target word at each position, like so:

[Figure: context and target words produced by sliding a window over the text]

In the example above, for the context words “i” and “natural”, the target word is “like”. The full training data for FastText word embeddings looks like the figure below. Looking at this training data should clear up any confusion about fastText vs word2vec.

[Figure: full training data for FastText word embeddings]

Now you know that in word2vec (skip-gram) each word is treated as a single unit, but in FastText each word is represented as a bag of character n-grams. This training data preparation is the only difference between FastText word embeddings and skip-gram (or CBOW) word embeddings.
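Here is a toy sketch of what “a bag of character n-grams” means for the resulting word vector: the vectors of all units are looked up and combined (here, averaged). Everything in it is an assumption made for illustration: the tiny dimension `DIM`, the number of hash buckets `BUCKETS`, the use of `zlib.crc32` as the hash, and the random (untrained) vectors. The real library uses its own hashing scheme and learns the vectors during training.

```python
import random
import zlib

DIM = 4          # toy embedding size (assumption; this tutorial trains 300 dimensions)
BUCKETS = 1000   # toy number of hash buckets for n-grams (assumption)

rng = random.Random(0)
# One random vector per bucket; a trained model would learn these values.
ngram_table = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(BUCKETS)]


def char_ngrams(word, n=3):
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]


def word_vector(word, n=3):
    """Average the vectors of the word's n-grams and of the whole word."""
    units = char_ngrams(word, n) + [f"<{word}>"]
    rows = [ngram_table[zlib.crc32(u.encode("utf-8")) % BUCKETS] for u in units]
    return [sum(column) / len(rows) for column in zip(*rows)]


print(word_vector("where"))
print(word_vector("wherever"))  # any string gets a vector, even an unseen word
```

Because the vector is assembled from n-grams, two words that share many n-grams end up sharing many of the same table rows.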

Once the FastText training data is prepared, training the word embeddings, finding word similarities, etc. work the same way as in the word2vec model (in our example, the same as in the skip-gram model).

Now let’s see how to implement FastText word embeddings in Python using the Gensim library.

FastText vs word2vec

Word2vec treats each word as an atomic entity and generates one vector per word. As a result, word2vec gives poor results for rare words and cannot produce a vector at all for out-of-vocabulary words.

FastText (an extension of the word2vec model) treats each word as composed of character n-grams. FastText generates better word embeddings for rare and out-of-vocabulary words because, even if a word is rare, its character n-grams are still shared with other words.
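A quick way to see why shared n-grams help is to count how many trigrams an unseen word has in common with a known one. This is a small sketch with a hypothetical `char_ngrams` helper; the choice of “chicken” and “chickens” is just an example.

```python
def char_ngrams(word, n=3):
    """Set of character n-grams of `word`, with '<' and '>' as boundary markers."""
    wrapped = f"<{word}>"
    return {wrapped[i:i + n] for i in range(len(wrapped) - n + 1)}


known = char_ngrams("chicken")
unseen = char_ngrams("chickens")  # pretend this word never appeared in training

shared = known & unseen
print(sorted(shared))
print(f"{len(shared)} of {len(unseen)} trigrams of 'chickens' are already known")
```

Most of the trigrams of the unseen word were already learned from the known word, so FastText can still build a reasonable vector for it, whereas word2vec would have nothing to look up.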

FastText word embeddings python implementation

Import Packages to implement Gensim FastText

Now let’s import required libraries to train FastText word embeddings in Python.
It is a good practice to make a fresh virtual environment while working with this kind of project.

import pandas as pd
import numpy as np
import re
from tqdm import tqdm

import nltk
# Requires the NLTK stop word list: run nltk.download('stopwords') once beforehand
en_stop = set(nltk.corpus.stopwords.words('english'))

from gensim.models.fasttext import FastText

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import seaborn as sns
import matplotlib.pyplot as plt

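# Lemmatization (requires the WordNet data: run nltk.download('wordnet') once beforehand)
from nltk.stem import WordNetLemmatizer
stemmer = WordNetLemmatizer()  # despite the variable name, this is a lemmatizer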

Download Data to implement Gensim FastText

For this tutorial, I will use the Yelp customer review dataset. Download it from Kaggle by clicking the link below.

yelp_academic_dataset_tip.json

This is a dataset of customer reviews of various restaurants, covering their food and service. The dataset has the following columns: ‘business_id’, ‘text’, ‘user_id’, ‘date’, and ‘compliment_count’.

Since we are only interested in building fastText word embeddings, I will only use the ‘text’ column in this tutorial.

Data Pre-processing for Gensim FastText word embeddings Python

Data pre-processing steps for any Natural Language Processing task can vary from problem to problem. In this tutorial, I have tried to share standard data pre-processing steps which can be implemented in most word embeddings projects.

Let’s first read the yelp review dataset and check if there is any missing value or not.

# Read yelp review tip dataset
yelp_df = pd.read_json("yelp_academic_dataset_tip.json", lines=True)

print('List of all columns')
print(list(yelp_df))

# Checking for missing values in our dataframe
# (the output below shows that there are no missing values)
yelp_df.isnull().sum()
List of all columns
['user_id', 'business_id', 'text', 'date', 'compliment_count']
user_id             0
business_id         0
text                0
date                0
compliment_count    0
dtype: int64
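If you are wondering what `lines=True` means: the file is in JSON Lines format, with one JSON object per line. The sketch below mimics the read and the missing-value count using only the standard library. The two records are made up for illustration and are not taken from the Yelp data.

```python
import io
import json

# Two made-up records in JSON Lines layout (one JSON object per line).
sample = io.StringIO(
    '{"user_id": "u1", "business_id": "b1", "text": "Great ramen!", '
    '"date": "2020-01-01", "compliment_count": 0}\n'
    '{"user_id": "u2", "business_id": "b2", "text": null, '
    '"date": "2020-01-02", "compliment_count": 1}\n'
)

records = [json.loads(line) for line in sample if line.strip()]

# Count missing (None) values per column, similar in spirit to isnull().sum().
missing = {key: sum(rec.get(key) is None for rec in records) for key in records[0]}
print(missing)
```

In this toy input the second record has a missing text, so the count for ‘text’ is 1, while the real dataset shows 0 for every column.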

Step 1: Convert text column into list and subset

Now we need only the text column from the Yelp data, as a list. For this fastText tutorial post I am going to use only the first 100,000 (1 lakh) rows of text data to train the fastText word embeddings model.

# Subset data for gensim fastText model
all_sent = list(yelp_df['text'])
some_sent = all_sent[0:100000]
some_sent[0:10]
['Here for a quick mtg',
 'Cucumber strawberry refresher',
 'Very nice good service good food',
 "It's a small place. The staff is friendly.",
 '8 sandwiches, $24 total...what a bargain!!! And the sandwiches are awesome!!!',
 "Great ramen! Not only is the presentation gorgeous but the food is so good! Go and sit outside, it's a good atmosphere.",
 'Cochinita Pibil was memorable & delicious !',
 'Get a tsoynami for sure.',
 'Kelly is an awesome waitress there!',
 'Check out the great assortment of organic & conventional produce.']

Step 2: Text Cleaning

In text cleaning I have used the steps below:

  • Remove extra white space from the text
  • Remove all special characters from the text
  • Remove all single characters from the text
  • Convert the text to lower case
  • Word tokenization
  • Lemmatization
  • Remove stop words from the text
  • Remove words shorter than 3 characters from the text
# Text cleaning function for gensim fastText word embeddings in python
def process_text(document):

    # Remove extra white space from text
    document = re.sub(r'\s+', ' ', str(document))

    # Replace all the special characters in text with a space
    document = re.sub(r'\W', ' ', document)

    # Remove all single characters from text
    document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)

    # Converting to Lowercase
    document = document.lower()

    # Word tokenization
    tokens = document.split()
    # Lemmatization using NLTK
    lemma_txt = [stemmer.lemmatize(word) for word in tokens]
    # Remove stop words
    lemma_no_stop_txt = [word for word in lemma_txt if word not in en_stop]
    # Keep only words longer than 3 characters.
    # Note: this filtered list is not used to build clean_txt below,
    # so short tokens (such as '8' or 'mtg') remain in the cleaned text.
    tokens = [word for word in tokens if len(word) > 3]

    clean_txt = ' '.join(lemma_no_stop_txt)

    return clean_txt

clean_corpus = [process_text(sentence) for sentence in tqdm(some_sent) if sentence.strip() != '']

word_tokenizer = nltk.WordPunctTokenizer()
word_tokens = [word_tokenizer.tokenize(sent) for sent in tqdm(clean_corpus)]
word_tokens
[['quick', 'mtg'],
 ['cucumber', 'strawberry', 'refresher'],
 ['nice', 'good', 'service', 'good', 'food'],
 ['small', 'place', 'staff', 'friendly'],
 ['8', 'sandwich', '24', 'total', 'bargain', 'sandwich', 'awesome'],
 ['great',
.....
.....]
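To see what the regular-expression steps do on their own, here is a small walk-through on one of the sample reviews. It uses only the standard library and leaves out lemmatization and stop word removal, since those need the NLTK data.

```python
import re

text = "8 sandwiches, $24 total...what a bargain!!! And the sandwiches are awesome!!!"

step1 = re.sub(r"\s+", " ", text)              # collapse runs of whitespace
step2 = re.sub(r"\W", " ", step1)              # replace non-word characters with spaces
step3 = re.sub(r"\s+[a-zA-Z]\s+", " ", step2)  # drop isolated single letters such as "a"
tokens = step3.lower().split()                 # lowercase and split on whitespace

print(tokens)
```

Lemmatization (“sandwiches” becomes “sandwich”) and stop word removal (“what”, “and”, “the”, “are”) then reduce these tokens to the fifth list in the output above.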

Train fastText word embeddings python

As you already know FastText word embeddings support both Continuous Bag of Words (CBOW) and Skip-Gram models. In this tutorial, I will implement fastText word embeddings for Skip-Gram only.

Let’s train Gensim fastText word embeddings model with our own custom data:

# Defining values for parameters
embedding_size = 300
window_size = 5
min_word = 5
down_sampling = 1e-2

%%time
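# Note: size= and iter= are the Gensim 3.x parameter names;
# Gensim 4.x renamed them to vector_size= and epochs=
fast_Text_model = FastText(word_tokens,
                      size=embedding_size,
                      window=window_size,
                      min_count=min_word,
                      sample=down_sampling,
                      workers=4,
                      sg=1,
                      iter=100)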
Wall time: 4min 43s

It took around 5 minutes to train the fastText word embeddings model on 100,000 rows of data on my system. Let me know in the comment section how long it takes on your system.
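One thing to watch out for: the training call above uses the Gensim 3.x parameter names. Gensim 4.x renamed `size` to `vector_size` and `iter` to `epochs`. A small helper like the one below can pick the right names. `fasttext_kwargs` is a hypothetical function of my own, not part of Gensim.

```python
def fasttext_kwargs(gensim_version, embedding_size=300, epochs=100):
    """Pick parameter names that match the installed Gensim major version."""
    major = int(gensim_version.split(".")[0])
    if major >= 4:
        return {"vector_size": embedding_size, "epochs": epochs}
    return {"size": embedding_size, "iter": epochs}


print(fasttext_kwargs("3.8.3"))
print(fasttext_kwargs("4.3.2"))
```

You could then pass the result into the constructor, for example `FastText(word_tokens, window=window_size, min_count=min_word, sg=1, **fasttext_kwargs(gensim.__version__))` after `import gensim`.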

Let’s explore the hyperparameters used in this model.

size: Dimensionality of the word vectors (named vector_size in Gensim 4.x).
window: Maximum distance between the current and predicted word within a sentence.
min_count: The model ignores all words with total frequency lower than this.
sample: The threshold for configuring which higher-frequency words are randomly down sampled, useful range is (0, 1e-5).
workers: Use these many worker threads to train the model (=faster training with multicore machines).
sg: Training algorithm: skip-gram if sg=1, otherwise CBOW.
iter: Number of iterations (epochs) over the corpus (named epochs in Gensim 4.x).
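The `sample` parameter is the least intuitive of these, so here is a sketch of how down-sampling of frequent words works. The formula is the one from the original word2vec C code, which, as far as I understand, Gensim follows. The function name and the example frequencies are my own assumptions.

```python
import math


def keep_probability(word_fraction, sample):
    """Probability of keeping one occurrence of a word during training.

    word_fraction: share of all corpus tokens that are this word (between 0 and 1).
    sample: the down-sampling threshold (the `sample` parameter).
    """
    ratio = sample / word_fraction
    return min(1.0, math.sqrt(ratio) + ratio)


# A very frequent word that makes up 5% of all tokens (made-up number):
print(keep_probability(0.05, sample=1e-2))  # the value used in this tutorial
print(keep_probability(0.05, sample=1e-5))  # a much more aggressive threshold

# A rare word is always kept:
print(keep_probability(1e-6, sample=1e-5))
```

The smaller the threshold, the more aggressively frequent words are dropped, while rare words are not affected at all.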

Save & Load Gensim fastText word embeddings Python model

It is good practice to save the trained fastText word embeddings model, so that we can load it later for reuse and also update it with new data.

from gensim.models.fasttext import FastText
# Save fastText gensim model (the "model" directory must already exist)
fast_Text_model.save("model/ft_model_yelp")
# Load saved gensim fastText model
fast_Text_model = FastText.load("model/ft_model_yelp")

Explore Gensim fastText model

Now it’s time to explore word embedding of our trained Gensim fastText word embeddings model.

# Check word embedding for a particular word
fast_Text_model.wv['chicken']
array([ 0.13001636,  0.25822237, -0.14504315,  0.21484736,  0.06036462,
       -0.360732  ,  0.41524523, -0.28196216, -0.23852299, -0.10876507,
        0.19519699, -0.3540741 , -0.26487216, -0.04653963,  0.00437508,
       -0.0315864 , -0.31310642, -0.13987574,  0.07247996,  0.46589893,
       -0.46027127, -0.19893737,  0.04489833,  0.14796586,  0.07446998,
       -0.32722327,  0.19599411,  0.0702702 , -0.13025823, -0.1362607 ,
        0.15696119, -0.27530903, -0.04977593, -0.06334984, -0.0031571 ,
       -0.06545188,  0.14801551, -0.04874644,  0.24035221, -0.0827332 ,
       -0.14662515,  0.53350425, -0.3495066 ,  0.00763269,  0.06650718,

The similarity score is the cosine similarity between the word vectors (word embeddings) of two specific words.
Note: cosine similarity ranges from -1 to 1, so if you check the similarity between two identical words the score will be 1. Depending on how it is calculated, the score may sometimes fall in the range [0, 1] instead.
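For reference, cosine similarity can be computed by hand as in the sketch below. The three-dimensional vectors are made-up toy values, not real embeddings from the trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


beer = [0.2, 0.7, -0.1]   # made-up toy vectors
drink = [0.1, 0.9, 0.0]

print(cosine_similarity(beer, beer))                 # identical vectors
print(cosine_similarity(beer, drink))                # similar direction
print(cosine_similarity(beer, [-x for x in beer]))   # opposite direction
```

Identical vectors give the maximum score and exactly opposite vectors give the minimum, which is where the [-1, 1] range comes from.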

# Dimension must be 300
fast_Text_model.wv['chicken'].shape
(300,)
# Check top 10 similar words for a given word by gensim fastText
fast_Text_model.wv.most_similar('chicken', topn=10)
[('manchurian', 0.43081963062286377),
 ('saag', 0.39911460876464844),
 ('fried', 0.3987428545951843),
 ('yakisoba', 0.39076220989227295),
 ('kofta', 0.3893798589706421),
 ('paprikash', 0.38901203870773315),
 ('roti', 0.38451915979385376),
 ('chix', 0.37896728515625),
 ('shanghai', 0.3775859475135803),
 ('tahini', 0.37675777077674866)]
# Check similarity score between two words
fast_Text_model.wv.similarity('beer', 'drink')
0.23751268
# Words most opposite to a given word
fast_Text_model.wv.most_similar(negative=["chicken"], topn=10)
[('major', 0.06609419733285904),
 ('condition', 0.0604780912399292),
 ('facility', 0.052031226456165314),
 ('immediate', 0.051679082214832306),
 ('immediately', 0.05050109326839447),
 ('conditioned', 0.044969189912080765),
 ('often', 0.04249657690525055),
 ('record', 0.042114950716495514),
 ('conditioner', 0.03859124705195427),
 ('occasion', 0.03758380562067032)]
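Conceptually, these queries rank every word in the vocabulary by cosine similarity to a query vector, and passing a word as `negative` flips the sign of its vector first. The sketch below is my own simplified illustration of that idea on made-up two-dimensional vectors. It is not how Gensim is implemented internally.

```python
import math

toy_vectors = {  # made-up 2-d vectors, for illustration only
    "chicken": [1.0, 0.2],
    "fried": [0.9, 0.4],
    "roti": [0.7, 0.7],
    "facility": [-0.8, 0.3],
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def rank_by_similarity(query_vec, vectors, exclude=(), topn=10):
    """Rank words by cosine similarity to `query_vec`, highest first."""
    scored = [(w, cosine(query_vec, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


chicken = toy_vectors["chicken"]
print(rank_by_similarity(chicken, toy_vectors, exclude={"chicken"}))

# "Most opposite": rank against the negated query vector instead.
negated = [-x for x in chicken]
print(rank_by_similarity(negated, toy_vectors, exclude={"chicken"}))
```

In the toy data, “fried” ranks first for the normal query and “facility” ranks first for the negated one, mirroring the kind of results shown above.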

FastText word embeddings visualization using tsne

It’s difficult to visualize fastText word embeddings directly, as word embeddings usually have more than 3 dimensions (in our case 300).

So, to visualize fastText word embeddings, we need to reduce their dimensions by applying PCA (Principal Component Analysis) followed by t-SNE.

The following code visualizes fastText word embeddings using a t-SNE plot.

# tsne plot for below word
# for_word = 'food'
def tsne_plot(for_word, w2v_model):
    # trained fastText model dimension
    dim_size = w2v_model.wv.vector_size

    arrays = np.empty((0, dim_size), dtype='f')
    word_labels = [for_word]
    color_list  = ['red']

    # adds the vector of the query word
    arrays = np.append(arrays, w2v_model.wv[[for_word]], axis=0)

    # gets list of most similar words
    sim_words = w2v_model.wv.most_similar(for_word, topn=10)

    # adds the vector for each of the closest words to the array
    for wrd_score in sim_words:
        wrd_vector = w2v_model.wv[[wrd_score[0]]]
        word_labels.append(wrd_score[0])
        color_list.append('green')
        arrays = np.append(arrays, wrd_vector, axis=0)

    #---------------------- Apply PCA and tsne to reduce dimension --------------

    # first reduce the word vectors to 10 dimensions with PCA
    model_pca = PCA(n_components = 10).fit_transform(arrays)

    # then find 2d coordinates with t-SNE
    # (recent scikit-learn versions require perplexity to be smaller
    #  than the number of points, which is 11 here)
    np.set_printoptions(suppress=True)
    Y = TSNE(n_components=2, random_state=0, perplexity=min(15, len(arrays) - 1)).fit_transform(model_pca)

    # Sets everything up to plot
    df_plot = pd.DataFrame({'x': [x for x in Y[:, 0]],
                       'y': [y for y in Y[:, 1]],
                       'words_name': word_labels,
                       'words_color': color_list})

    #------------------------- tsne plot Python -----------------------------------

    # plot dots with color and position
    plot_dot = sns.regplot(data=df_plot,
                     x="x",
                     y="y",
                     fit_reg=False,
                     marker="o",
                     scatter_kws={'s': 40,
                                  'facecolors': df_plot['words_color']
                                 }
                    )

    # Adds annotations with color one by one with a loop
    for line in range(0, df_plot.shape[0]):
         plot_dot.text(df_plot["x"][line],
                 df_plot['y'][line],
                 '  ' + df_plot["words_name"][line].title(),
                 horizontalalignment='left',
                 verticalalignment='bottom', size='medium',
                 color=df_plot['words_color'][line],
                 weight='normal'
                ).set_size(15)


    plt.xlim(Y[:, 0].min()-50, Y[:, 0].max()+50)
    plt.ylim(Y[:, 1].min()-50, Y[:, 1].max()+50)

    plt.title('t-SNE visualization for word "{}"'.format(for_word.title()))

# tsne plot for top 10 similar words to 'chicken'
tsne_plot(for_word='chicken', w2v_model=fast_Text_model)
[Figure: t-SNE visualization of the FastText word embeddings most similar to “chicken”]

Update pre-trained Gensim fastText model

As with word2vec, you can also update your saved custom fastText model with new data.

new_data = [['yes', 'this', 'is', 'the', 'word2vec', 'model'], ['if', 'you', 'have', 'think', 'about', 'it']]

# Update the vocabulary of the trained gensim fastText model
fast_Text_model.build_vocab(new_data, update=True)

# Continue training the gensim fastText model on the new data
# (train() updates the model in place; it does not return a new model)
fast_Text_model.train(new_data, total_examples=len(new_data), epochs=fast_Text_model.epochs)

Working with Gensim fastText pre-trained model

Pre-trained models are the simplest way to start working with word embeddings. Their advantage is that they are trained on massive datasets that you may not have access to, containing billions of words.

Facebook hosts pre-trained word vectors for 157 languages. These fastText word embeddings models are free to download and use. You can download one of them by clicking the link below:

Download pre-trained fastText word embeddings

Now let’s work with Gensim fastText pre-trained models. Let’s start with the English model.

# Load pretrained fastText word embeddings python with gensim
from gensim.models.fasttext import load_facebook_model
pretrained_fastText_en = load_facebook_model('pretrined fastText model/cc.en.300.bin.gz')

# tsne plot for top 10 similar words to 'chicken'
tsne_plot(for_word='chicken', w2v_model=pretrained_fastText_en)
[Figure: t-SNE visualization of pre-trained fastText word embeddings]

Conclusion

In this tutorial you learned:

  • How FastText word embeddings work
  • FastText Architecture
  • FastText vs word2vec
  • FastText word embeddings python implementation
  • Data Pre-processing for Gensim FastText word embeddings Python
  • How to Train fastText word embeddings python
  • Save Gensim fastText word embeddings Python model
  • Load Gensim fastText word embeddings Python model
  • Explore Gensim fastText model
  • FastText visualization using tsne
  • Update pre-trained Gensim fastText model
  • Working with Gensim fastText pre-trained model

If you have any questions or suggestions regarding this topic, let me know in the comment section. I will try my best to answer.
