Sentence Similarity Matching using Transformer


A few weeks back I wrote a tutorial about document similarity matching using the TF-IDF technique. In this tutorial, we will take that to the next level and use a transformer-based model to measure the similarity between two sentences, paragraphs, or documents.

In recent years, transformer-based models have revolutionized natural language processing (NLP) tasks due to their ability to capture long-range dependencies and contextual information effectively. Whether it is question answering, chatbots, or named entity recognition, we now reach for a transformer model to improve the accuracy of the downstream task.

In my last tutorial, I showed you how to calculate document similarity using TF-IDF. That technique gave good accuracy, but it has some limitations. TF-IDF works well for documents with a large number of words, but it does not capture any context information.

For example, TF-IDF cannot identify the difference between the two sentences below. This is because the backbone of the TF-IDF technique is bag of words. Since both sentences contain the words “I”, “like”, and “apple”, TF-IDF will produce a high similarity between them.

I like to eat apple.

I like Apple products.
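To see why a bag-of-words view treats these two sentences as close, here is a minimal sketch in plain Python. It is not the TF-IDF implementation from the previous tutorial: the helper names tokenize and shared_words are made up for this illustration, and the code only collects the words both sentences share after lowercasing.

```python
import re

def tokenize(text):
    # Lowercase the text and keep only alphabetic words
    return re.findall(r"[a-z]+", text.lower())

def shared_words(sentence_a, sentence_b):
    # Words that occur in both sentences, ignoring order and context
    return set(tokenize(sentence_a)) & set(tokenize(sentence_b))

sentence_a = "I like to eat apple."
sentence_b = "I like Apple products."

common = shared_words(sentence_a, sentence_b)
print(sorted(common))
```

Three of the four words in the second sentence also appear in the first one, and nothing in this representation records that one “apple” is a fruit and the other is a company.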

While matching sentence similarity or paragraph similarity, we can solve this contextual issue using a good transformer model. In this tutorial, I am going to use the sentence_transformers module to find sentence similarity easily.

If you want to strengthen your understanding of the Transformer model architecture with both theory and coding, I highly recommend this Udemy course: Transformers for Natural Language Processing.

Once you complete that course, to get your hands dirty with various transformer-based applications, you should take this course: Deep Learning: Natural Language Processing with Transformers.

About sentence_transformers

Sentence Transformers is an open-source Python library that converts sentences or longer text into embeddings. To do that, it uses pretrained models like BERT, RoBERTa, etc. Developed by the UKP Lab at the Technical University of Darmstadt, Sentence Transformers simplifies the process of obtaining high-quality numerical representations (embeddings) for sentences, paragraphs, or text snippets.

The Sentence Transformers library makes it simple to use these powerful language models for your own projects. It’s like having a helpful friend that can quickly understand what sentences mean and give you easy-to-use numeric codes (the embeddings) to work with.

Sentence Similarity Matching in Python

The entire process of finding similarities between two documents is pretty easy with the sentence_transformers library. Let me break it into some steps for better understanding.

Step1: Install Libraries

As you know, the first step of working with any Python library is to install it. So we need to install the sentence_transformers library to use it in our project. Below is the pip command to do that.

pip install sentence_transformers

Step2: Download pre-trained model

Now that we have installed the required library, we can import it and then load a pretrained transformer model for sentence similarity matching. The model is downloaded automatically the first time it is loaded.

By default, this model will be saved inside your default cache directory. For me: C:\Users\Anindya\.cache\huggingface. If you want to save or download this model to your working directory say D:\ drive, then you can follow steps from this tutorial: Huggingface: Download and Save Model to Custom Path.

import numpy as np
from sentence_transformers import SentenceTransformer, util

# Download or Load sentence transformer model
roberta_model = SentenceTransformer('stsb-roberta-large')

For this demo project, I am downloading and using stsb-roberta-large model. This is a good pre-trained transformer-based model for sentence similarity matching. But you can also use other supported models of sentence_transformers library.


Let me list down supported models below. I will divide all model lists into two parts: Official and Unofficial. Official models are the models which are shared by sentence_transformers library.

Official Model List

Below table contains the complete list of official sentence transformer models.

| Model Name | Performance Sentence Embeddings (14 Datasets) | Performance Semantic Search (6 Datasets) | Avg. Performance | Speed (sentences/sec) | Model Size |
|---|---|---|---|---|---|
| all-mpnet-base-v2 | 69.57 | 57.02 | 63.30 | 2800 | 420 MB |
| gtr-t5-xxl | 70.73 | 55.76 | 63.25 | 50 | 9230 MB |
| gtr-t5-xl | 69.88 | 55.88 | 62.88 | 230 | 2370 MB |
| sentence-t5-xxl | 70.88 | 54.40 | 62.64 | 50 | 9230 MB |
| gtr-t5-large | 69.90 | 54.85 | 62.38 | 800 | 640 MB |
| all-mpnet-base-v1 | 69.98 | 54.69 | 62.34 | 2800 | 420 MB |
| multi-qa-mpnet-base-dot-v1 | 66.76 | 57.60 | 62.18 | 2800 | 420 MB |
| multi-qa-mpnet-base-cos-v1 | 66.29 | 57.46 | 61.88 | 2800 | 420 MB |
| all-roberta-large-v1 | 70.23 | 53.05 | 61.64 | 800 | 1360 MB |
| sentence-t5-xl | 69.23 | 51.19 | 60.21 | 230 | 2370 MB |
| all-distilroberta-v1 | 68.73 | 50.94 | 59.84 | 4000 | 290 MB |
| all-MiniLM-L12-v1 | 68.83 | 50.78 | 59.80 | 7500 | 120 MB |
| all-MiniLM-L12-v2 | 68.70 | 50.82 | 59.76 | 7500 | 120 MB |
| multi-qa-distilbert-dot-v1 | 66.67 | 52.51 | 59.59 | 4000 | 250 MB |
| multi-qa-distilbert-cos-v1 | 65.98 | 52.83 | 59.41 | 4000 | 250 MB |
| gtr-t5-base | 67.65 | 51.15 | 59.40 | 2500 | 210 MB |
| sentence-t5-large | 68.74 | 49.05 | 58.89 | 800 | 640 MB |
| all-MiniLM-L6-v2 | 68.06 | 49.54 | 58.80 | 14200 | 80 MB |
| multi-qa-MiniLM-L6-cos-v1 | 64.33 | 51.83 | 58.08 | 14200 | 80 MB |
| all-MiniLM-L6-v1 | 68.03 | 48.07 | 58.05 | 14200 | 80 MB |
| paraphrase-mpnet-base-v2 | 67.97 | 47.43 | 57.70 | 2800 | 420 MB |
| msmarco-bert-base-dot-v5 | 62.68 | 52.11 | 57.39 | 2800 | 420 MB |
| multi-qa-MiniLM-L6-dot-v1 | 63.90 | 49.19 | 56.55 | 14200 | 80 MB |
| sentence-t5-base | 67.84 | 44.63 | 56.23 | 2500 | 210 MB |
| msmarco-distilbert-base-tas-b | 62.57 | 49.25 | 55.91 | 4000 | 250 MB |
| msmarco-distilbert-dot-v5 | 61.84 | 49.47 | 55.66 | 4000 | 250 MB |
| paraphrase-distilroberta-base-v2 | 66.27 | 43.10 | 54.69 | 4000 | 290 MB |
| paraphrase-MiniLM-L12-v2 | 66.01 | 43.01 | 54.51 | 7500 | 120 MB |
| paraphrase-multilingual-mpnet-base-v2 | 65.83 | 41.68 | 53.75 | 2500 | 970 MB |
| paraphrase-TinyBERT-L6-v2 | 66.19 | 41.07 | 53.63 | 4500 | 240 MB |
| paraphrase-MiniLM-L6-v2 | 64.82 | 40.31 | 52.56 | 14200 | 80 MB |
| paraphrase-albert-small-v2 | 64.46 | 40.04 | 52.25 | 5000 | 43 MB |
| paraphrase-multilingual-MiniLM-L12-v2 | 64.25 | 39.19 | 51.72 | 7500 | 420 MB |
| paraphrase-MiniLM-L3-v2 | 62.29 | 39.19 | 50.74 | 19000 | 61 MB |
| distiluse-base-multilingual-cased-v1 | 61.30 | 29.87 | 45.59 | 4000 | 480 MB |
| distiluse-base-multilingual-cased-v2 | 60.18 | 27.35 | 43.77 | 4000 | 480 MB |
| average_word_embeddings_komninos | 51.13 | 21.64 | 36.39 | 22000 | 240 MB |
| average_word_embeddings_glove.6B.300d | 49.79 | 22.71 | 36.25 | 34000 | 420 MB |
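The table above is mostly a trade-off between accuracy, speed, and size. As a rough sketch of how you might shortlist a model for your own constraints, the snippet below copies a few rows from the table and picks the best average performance under a size limit and a minimum speed. The function best_model and its parameters are hypothetical helpers for this illustration, not part of sentence_transformers.

```python
# A few rows copied from the table above:
# (model name, avg. performance, speed, size in MB)
models = [
    ("all-mpnet-base-v2", 63.30, 2800, 420),
    ("all-distilroberta-v1", 59.84, 4000, 290),
    ("all-MiniLM-L12-v2", 59.76, 7500, 120),
    ("all-MiniLM-L6-v2", 58.80, 14200, 80),
    ("paraphrase-albert-small-v2", 52.25, 5000, 43),
]

def best_model(models, max_size_mb, min_speed=0):
    # Keep only the models that satisfy the size and speed constraints
    candidates = [m for m in models if m[3] <= max_size_mb and m[2] >= min_speed]
    if not candidates:
        return None
    # Among the remaining models, pick the highest average performance
    return max(candidates, key=lambda m: m[1])[0]

print(best_model(models, max_size_mb=150))
print(best_model(models, max_size_mb=500, min_speed=3000))
```

With these rows, a 150 MB budget selects all-MiniLM-L12-v2, while a 500 MB budget with a minimum speed of 3000 selects all-distilroberta-v1.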

Unofficial Model List

Here is the list of unofficial but good sentence transformer models for document similarity matching.

| Model Name | STSb Performance (Higher = Better) | Speed (Sent. / Sec on V100 GPU) |
|---|---|---|
| stsb-mpnet-base-v2 | 88.57 | 2800 |
| stsb-roberta-base-v2 | 87.21 | 2300 |
| stsb-distilroberta-base-v2 | 86.41 | 4000 |
| nli-mpnet-base-v2 | 86.53 | 2800 |
| stsb-roberta-large | 86.39 | 830 |
| nli-roberta-base-v2 | 85.54 | 2300 |
| stsb-roberta-base | 85.44 | 2300 |
| stsb-bert-large | 85.29 | 830 |
| stsb-distilbert-base | 85.16 | 4000 |
| stsb-bert-base | 85.14 | 2300 |
| nli-distilroberta-base-v2 | 84.38 | 4000 |
| paraphrase-xlm-r-multilingual-v1 | 83.50 | 2300 |
| paraphrase-distilroberta-base-v1 | 81.81 | 4000 |
| nli-bert-large | 79.19 | 830 |
| nli-distilbert-base | 78.69 | 4000 |
| nli-roberta-large | 78.69 | 830 |
| nli-bert-large-max-pooling | 78.41 | 830 |
| nli-bert-large-cls-pooling | 78.29 | 830 |
| nli-distilbert-base-max-pooling | 77.61 | 4000 |
| nli-roberta-base | 77.49 | 2300 |
| nli-bert-base-max-pooling | 77.21 | 2300 |
| nli-bert-base | 77.12 | 2300 |
| nli-bert-base-cls-pooling | 76.30 | 2300 |
| average_word_embeddings_glove.6B.300d | 61.77 | 34000 |
| average_word_embeddings_komninos | 61.56 | 22000 |
| average_word_embeddings_levy_dependency | 59.22 | 22000 |
| average_word_embeddings_glove.840B.300d | 52.54 | 34000 |

Step3: Define Input Sentences

Let’s now define two sentences to calculate the semantic similarity between them.

# Define two sentences
sentence1 = 'I love to read books.'
sentence2 = 'Reading is a great way to relax.'

Step4: Convert to Sentence Embeddings

This is the main part of the entire process of finding the similarity between two sentences. In this step, we need to convert our input sentences into embeddings.

An embedding is a numeric vector that represents a piece of text. This conversion is required for any NLP task because machine learning algorithms operate on numbers, not on raw text.

# Encode two input sentences to get their embeddings
sentence1_embedding = roberta_model.encode(sentence1, convert_to_tensor=True)
sentence2_embedding = roberta_model.encode(sentence2, convert_to_tensor=True)

# Print sentence embeddings
print('Sentence1 Embedding: ', '\n', sentence1_embedding, '\n\n', 'Sentence2 Embedding: ', '\n', sentence2_embedding)
Sentence1 Embedding:  
 tensor([-0.2088,  0.7282, -1.2856,  ...,  1.0879, -1.5999,  0.3414]) 

 Sentence2 Embedding:  
 tensor([-0.2012,  0.5449, -0.7802,  ...,  0.5461, -1.0937,  0.5332])

Step5: Calculate Similarity Score

In this step, we will take those two sentence embeddings and calculate the cosine similarity between them. Cosine similarity is one of the most popular measures for comparing embedding vectors because it depends on the angle between the vectors rather than on their length. You can also try distance measures like Euclidean, Manhattan, etc.

# Compute similarity score between two sentence embeddings
cosine_similarity_score = util.pytorch_cos_sim(sentence1_embedding, sentence2_embedding)

# Print similarity score between two input sentences
print('Sentence 1: ', sentence1)
print('Sentence 2: ', sentence2)
print('Similarity score: ', cosine_similarity_score.item())
Sentence 1:  I love to read books.
Sentence 2:  Reading is a great way to relax.
Similarity score:  0.6019294857978821

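As you can see, the cosine similarity score between those two input sentences is about 0.60, which is reasonable because both sentences are about reading but do not say the same thing. This score tells us how close the two input sentences are in the embedding vector space: the higher the score, the more similar the sentences.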
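If you are curious what these measures compute, here is a minimal sketch in plain Python of cosine similarity next to the Euclidean and Manhattan distances. This is not how sentence_transformers implements it internally; the function names and the toy 3-dimensional vectors are made up for this illustration.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    # Straight-line distance between the two points
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    # Sum of the absolute differences along each dimension
    return sum(abs(x - y) for x, y in zip(a, b))

# Toy vectors, not real model output
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]   # same direction as v1, twice as long
v3 = [3.0, -1.5, 0.0]  # perpendicular to v1

print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))
print(euclidean_distance(v1, v2))
print(manhattan_distance(v1, v2))
```

Note the difference in direction: for cosine similarity a higher value means more similar, while for the two distances a lower value means more similar. The vectors v1 and v2 point in the same direction, so their cosine similarity is about 1 even though their Euclidean and Manhattan distances are not zero.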

If you are still confused about this vector embedding concept then I will highly recommend you to take this small and low-cost Udemy course: Build NLP text embeddings using python.

Retrieve Top K most similar sentences from a corpus given a sentence

In the above section, I showed you how sentence similarity or document similarity matching works. In real-life projects, you might need to apply this similarity function to any number of sentences, so I thought of sharing that functionality as well.

In the code block below, I define a list of sentences as the input corpus. For a given query sentence, I want to find the most similar sentences from this input corpus.

# Defining list of sentences or documents
corpus = [
    'I love to read books.',
    'Programming is my passion.',
    'Traveling broadens the mind.',
    'Music is the universal language.'
]

In a similar way, let’s convert the above input corpus into embedding vectors. Note that in the embedding output, each row represents the embedding of one sentence from the input corpus.

# Encode corpus or list of documents to get corpus embeddings
corpus_embedding = roberta_model.encode(corpus, convert_to_tensor = True)
corpus_embedding
tensor([[-0.2088,  0.7282, -1.2856,  ...,  1.0879, -1.5999,  0.3414],
        [-0.2646, -0.8260, -0.5265,  ...,  0.4369, -2.1071,  0.9789],
        [-0.3444,  0.1054,  0.0739,  ..., -0.0828, -0.4223,  0.9589],
        [-1.4762,  1.0114,  0.1005,  ...,  0.9997, -0.4622,  0.0648]])

Let’s now define our query sentence and convert that into sentence embedding. Below Python code is to do that.

# Set query sentence or input sentence
query_sentence = 'Reading is a great way to relax.'

# Encode query sentence to get sentence embeddings
query_sentence_embedding = roberta_model.encode(query_sentence, convert_to_tensor = True)
query_sentence_embedding
tensor([-0.2012,  0.5449, -0.7802,  ...,  0.5461, -1.0937,  0.5332])

Now we will find sentence similarity matching using the same transformer model (stsb-roberta-large) and print top 2 most similar sentences (with query sentence) from the corpus document.

# Find most similar sentences from the corpus
import torch

# Define number of similar sentences to return
top_n = 2

# Compute cosine similarity scores of the query sentence with the corpus
cosine_similarity_score = util.pytorch_cos_sim(query_sentence_embedding, corpus_embedding)[0]

# Get the top_n highest scores and their corpus indices, sorted in descending order
# (torch.topk also works when the embeddings live on a GPU)
top_scores, top_indices = torch.topk(cosine_similarity_score, k=top_n)

# Print query sentence
print('For Sentence: ', query_sentence, '\n')

# Print top similar sentences from the corpus
print('Top', top_n, 'most similar sentences in corpus: ')

for score, idx in zip(top_scores, top_indices):
    print(corpus[int(idx)], 'Score: %.4f' % float(score))
For Sentence:  Reading is a great way to relax. 

Top 2 most similar sentences in corpus: 
I love to read books. Score: 0.6019
Traveling broadens the mind. Score: 0.3404
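The ranking step above does not depend on the model at all, so it can be sketched without any library. The example below is only an illustration under stated assumptions: the 2-dimensional vectors are made up, and normalize and top_k_similar are hypothetical helper names. It relies on the fact that once vectors are scaled to length 1, their dot product equals their cosine similarity.

```python
import heapq
import math

def normalize(vec):
    # Scale a vector to length 1 so that a dot product equals cosine similarity
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]

def top_k_similar(query_vec, corpus_vecs, k):
    # Score every corpus vector against the query, then keep the k best
    query_unit = normalize(query_vec)
    scores = [
        (sum(q * c for q, c in zip(query_unit, normalize(vec))), idx)
        for idx, vec in enumerate(corpus_vecs)
    ]
    # nlargest returns (score, index) pairs sorted from best to worst
    return heapq.nlargest(k, scores)

# Toy 2-dimensional "embeddings", made up for this illustration
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
toy_query = [2.0, 1.0]

for score, idx in top_k_similar(toy_query, toy_corpus, k=2):
    print(idx, round(score, 4))
```

Here the vector at index 2 ranks first and the vector at index 0 ranks second, because they point in the directions closest to the query.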

FAQs

List of supported models for sentence transformers

There are many pre-trained models you can work with in the sentence_transformers library. Some of them are official and the rest are unofficial. I have listed all of them in the above section.

Number of tokens supported by sentence transformer

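The amount of text that Sentence Transformers can handle depends on the pre-trained model used. Each model has a maximum sequence length measured in tokens, and any input longer than that is truncated. For the loaded model, you can read this limit from the roberta_model.max_seq_length attribute.

Do not confuse this limit with the embedding size. The script below returns 1024, which is the number of dimensions in the embedding vector produced by stsb-roberta-large, not the number of tokens it supports.

len(sentence1_embedding)
1024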

Final Note

In this tutorial, we learned how to implement sentence similarity matching using a pre-trained transformer model like RoBERTa. We are getting good accuracy for contextual sentences. But even this kind of model has some drawbacks.

For example, suppose you are working on a project where you need to find the similarity between two documents, and those documents contain more text than the token limit of your model allows. In this case, you cannot directly apply a sentence transformer to the whole document, because everything beyond the limit would be truncated.

To solve this, you first need to chunk or break your entire document into some parts based on the token limit of your model. Then you can apply transformer-based sentence similarity to those chunks of words or parts of the document. This way you can bypass the token limitation issue of any Transformer model.
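The chunking idea above can be sketched as follows. This is a simplified illustration, not a definitive implementation: chunk_words is a hypothetical helper, and it counts whitespace-separated words as a stand-in for tokens. A real tokenizer usually produces more tokens than words, so pick max_words well below the token limit of your model. The overlap parameter is also an assumption of this sketch; it repeats a few words between neighbouring chunks so that sentences cut at a boundary keep some context.

```python
def chunk_words(text, max_words, overlap=0):
    # Split text into chunks of at most max_words words,
    # where consecutive chunks share `overlap` words
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current chunk has reached the end of the text
        if start + max_words >= len(words):
            break
    return chunks

document = "one two three four five six seven eight nine ten"
for chunk in chunk_words(document, max_words=4, overlap=1):
    print(chunk)
```

Each chunk can then be encoded like any other sentence. How you combine the chunk-level scores into one document-level score (for example, the average or the maximum) is a design choice that depends on your project.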
