A few weeks back I wrote a tutorial about document similarity matching using the TF-IDF technique. In this tutorial, we will take that to the next level and use a transformer-based algorithm to match similarity between two sentences, paragraphs, or documents.
In recent years, transformer-based models have revolutionized natural language processing (NLP) due to their ability to capture long-range dependencies and contextual information effectively. Whether it is question answering, chatbots, or named entity recognition, we now reach for transformer models everywhere to improve the accuracy of our downstream tasks.
In my last tutorial, I showed you how to calculate document similarity matching using TF-IDF. That technique gives reasonable accuracy, but it has some limitations: TF-IDF works well only for documents with a large number of words, and it carries no contextual information.
For example, TF-IDF cannot identify the difference between the two sentences below. This is because the backbone of the TF-IDF technique is bag of words. Since both sentences contain the words "I", "like", and "apple", TF-IDF will produce a high similarity score between them.
I like to eat apple.
I like Apple products.
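To see this limitation concretely, here is a small sketch using scikit-learn (scikit-learn is not used in the rest of this tutorial; note that TfidfVectorizer lowercases text by default, so "apple" and "Apple" collapse into the same term):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two sentences with different meanings but overlapping words
sentences = ['I like to eat apple.', 'I like Apple products.']

# Bag-of-words TF-IDF loses both casing and context
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(sentences)

# Cosine similarity between the two TF-IDF vectors
score = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])[0][0]
print('TF-IDF similarity:', round(score, 4))
```

Even though the sentences mean very different things, TF-IDF reports a fairly high similarity purely because of the shared words.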
While matching sentence or paragraph similarity, we can solve this contextual issue using a good transformer model. In this tutorial, I am going to use the sentence_transformers module to find sentence similarity easily.
If you want to build a clear understanding of the Transformer model architecture, with both theory and coding, I highly recommend this Udemy course: Transformers for Natural Language Processing.
Once you complete that course, to get your hands dirty with various transformer-based applications, you should take this course: Deep Learning: Natural Language Processing with Transformers.
Sentence Transformers is an open-source Python library to convert sentences or longer text into sentence embeddings. To do that it uses pretrained models like BERT, RoBERTa, etc. Developed by the UKPLab at Technische Universität Darmstadt, Sentence Transformers simplifies the process of obtaining high-quality numerical representations (embeddings) for sentences, paragraphs, or text snippets.
The Sentence Transformers library makes it simple to use these powerful language models in your own projects. It's like having a helpful friend that can quickly understand what sentences mean and gives you easy-to-use code to work with.
Sentence Similarity Matching in Python
The entire process of finding similarities between two documents is pretty easy with the sentence_transformers library. Let me break it into a few steps for better understanding.
Step 1: Install Libraries
As you know, the first step of working with any Python library is to install it. So we need to install the sentence_transformers library to use it in our project. Below is the pip command to do that.
pip install sentence_transformers
Step 2: Download Pre-trained Model
Now that we have installed the required library, we can import it and load or download a pretrained transformer model for sentence similarity matching.
By default, this model is saved inside your Hugging Face cache directory. For me that is C:\Users\Anindya\.cache\huggingface. If you want to save or download this model to your working directory, say your D:\ drive, then you can follow the steps from this tutorial: Huggingface: Download and Save Model to Custom Path.
import numpy as np
from sentence_transformers import SentenceTransformer, util

# Download or load sentence transformer model
roberta_model = SentenceTransformer('stsb-roberta-large')
For this demo project, I am downloading and using the stsb-roberta-large model. This is a good pre-trained transformer-based model for sentence similarity matching. But you can also use other supported models of the sentence_transformers library.
Let me list the supported models below. I will divide the model list into two parts: official and unofficial. Official models are the models shared by the sentence_transformers project itself.
Official Model List
The table below contains the complete list of official sentence transformer models.
Unofficial Model List
Here is the list of unofficial but good sentence transformer models for document similarity matching.
Step 3: Define Input Sentences
Let’s now define two sentences to calculate the semantic similarity between them.
# Define two sentences
sentence1 = 'I love to read books.'
sentence2 = 'Reading is a great way to relax.'
Step 4: Convert to Sentence Embeddings
This is the main part of the entire process of finding the similarity between two sentences. In this step, we need to convert our input sentences into embeddings.
Embedding is the fancy name for converting text into numeric form. This is required for any NLP task because machine learning algorithms cannot understand raw text.
# Encode two input sentences to get their embeddings
sentence1_embedding = roberta_model.encode(sentence1, convert_to_tensor=True)
sentence2_embedding = roberta_model.encode(sentence2, convert_to_tensor=True)

# Print sentence embeddings
print('Sentence1 Embedding: ', '\n', sentence1_embedding, '\n\n', 'Sentence2 Embedding: ', '\n', sentence2_embedding)
Sentence1 Embedding:
tensor([-0.2088,  0.7282, -1.2856,  ...,  1.0879, -1.5999,  0.3414])

Sentence2 Embedding:
tensor([-0.2012,  0.5449, -0.7802,  ...,  0.5461, -1.0937,  0.5332])
Step 5: Calculate Similarity Score
In this step, we will take those two sentence embeddings and calculate the cosine similarity between them. Cosine similarity is the most popular and widely used metric for comparing embedding vectors or text. You can also try other distance measures like Euclidean, Manhattan, etc.
# Compute similarity score between two sentence embeddings
cosine_similarity_score = util.pytorch_cos_sim(sentence1_embedding, sentence2_embedding)

# Print similarity score between two input sentences
print('Sentence 1: ', sentence1)
print('Sentence 2: ', sentence2)
print('Similarity score: ', cosine_similarity_score.item())
Sentence 1:  I love to read books.
Sentence 2:  Reading is a great way to relax.
Similarity score:  0.6019294857978821
As you can see, the similarity between those two input sentences is around 60%, which makes sense. This similarity score is nothing but a measure of how close those two input sentences are in the embedding vector space.
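If you want to see what the cosine similarity formula actually does, here is a minimal NumPy sketch on two made-up 3-dimensional vectors (real stsb-roberta-large embeddings are much longer; the numbers here are purely illustrative):

```python
import numpy as np

# Toy "embeddings" (real sentence embeddings have far more dimensions)
vec1 = np.array([0.2, 0.7, -1.3])
vec2 = np.array([0.2, 0.5, -0.8])

# Cosine similarity = dot product divided by the product of the norms
score = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
print(round(score, 4))
```

A score close to 1 means the vectors point in nearly the same direction; a score near 0 means they are essentially unrelated.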
If you are still confused about this vector embedding concept, then I highly recommend taking this small and low-cost Udemy course: Build NLP text embeddings using python.
Retrieve Top K most similar sentences from a corpus given a sentence
In the above section, I showed you how sentence or document similarity matching works. Now in real-life projects, you might need to apply this similarity function to any number of sentences. So I thought of sharing that functionality as well.
In the below code block, I am defining a list of sentences as our input corpus. What I want to do is, for a given query sentence, find the top most similar sentences from this input corpus.
# Defining list of sentences or documents
corpus = [
    'I love to read books.',
    'Programming is my passion.',
    'Traveling broadens the mind.',
    'Music is the universal language.'
]
In a similar way, let’s convert the above input corpus into embedding vectors. Note that in the embedding output, each row represents the embedding of one sentence from the input corpus.
# Encode corpus or list of documents to get corpus embeddings
corpus_embedding = roberta_model.encode(corpus, convert_to_tensor=True)
corpus_embedding
tensor([[-0.2088,  0.7282, -1.2856,  ...,  1.0879, -1.5999,  0.3414],
        [-0.2646, -0.8260, -0.5265,  ...,  0.4369, -2.1071,  0.9789],
        [-0.3444,  0.1054,  0.0739,  ..., -0.0828, -0.4223,  0.9589],
        [-1.4762,  1.0114,  0.1005,  ...,  0.9997, -0.4622,  0.0648]])
Let’s now define our query sentence and convert it into a sentence embedding. The below Python code does that.
# Set query sentence or input sentence
query_sentence = 'Reading is a great way to relax.'

# Encode query sentence to get sentence embeddings
query_sentence_embedding = roberta_model.encode(query_sentence, convert_to_tensor=True)
query_sentence_embedding
tensor([-0.2012, 0.5449, -0.7802, ..., 0.5461, -1.0937, 0.5332])
Now we will find sentence similarity matching using the same transformer model (stsb-roberta-large) and print the top 2 most similar sentences (to the query sentence) from the corpus.
# Find most similar sentences from the corpus

# Define number of similar sentences to return
top_n = 2

# Compute cosine similarity scores of the query sentence with the corpus
# (take row [0] to get a flat vector of one score per corpus sentence)
cosine_similarity_score = util.pytorch_cos_sim(query_sentence_embedding, corpus_embedding)[0]

# Sort the similarity result in descending order and get the top n results
top_results = np.argpartition(-cosine_similarity_score, range(top_n))[0:top_n]

# Print query sentence
print('For Sentence: ', query_sentence, '\n')

# Print top similar sentences from the corpus
print('Top', top_n, 'most similar sentences in corpus: ')
for idx in top_results[0:top_n]:
    print(corpus[idx], 'Score: %.4f' % (cosine_similarity_score[idx]))
For Sentence:  Reading is a great way to relax.

Top 2 most similar sentences in corpus:
I love to read books. Score: 0.6019
Traveling broadens the mind. Score: 0.3404
List of supported models for sentence transformers
There are many pre-trained models you can use with the sentence_transformers library. Some of them are official and most of them are unofficial. I have listed all of them in the above section.
Number of tokens supported by sentence transformer
The number of tokens that Sentence Transformers can handle depends on the pre-trained model used. In our case, we are using the stsb-roberta-large model. Its token limit is defined by the model's max_seq_length attribute, which you can check using the below Python script.
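Here is a short snippet to do that. It assumes the roberta_model object loaded in Step 2 is still available in your session:

```python
# 'roberta_model' is the SentenceTransformer loaded in Step 2;
# max_seq_length is the maximum number of tokens the model will encode
print('Token limit: ', roberta_model.max_seq_length)

# For comparison, this is the size of each output embedding vector
print('Embedding dimension: ', roberta_model.get_sentence_embedding_dimension())
```

Any input longer than max_seq_length tokens is truncated before encoding.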
In this tutorial, we learned how to implement sentence similarity matching using a pre-trained transformer model like RoBERTa. We get good accuracy for contextual sentences, but even this kind of model has some drawbacks.
For example, suppose you are working on a project and need to find the similarity between two documents that contain a huge number of words, more than the model's token limit. In this case, you cannot directly apply a sentence transformer to the whole document.
To solve this, you first need to chunk or break your entire document into parts based on the token limit of your model. Then you can apply transformer-based sentence similarity to those chunks or parts of the document. This way you can work around the token limitation of any transformer model.
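As a rough illustration, here is one simple way to chunk a long document. This toy example splits on whitespace words rather than real model tokens, and the chunk size of 128 is arbitrary:

```python
# Split a long text into chunks of at most max_tokens words
def chunk_text(text, max_tokens=128):
    words = text.split()
    return [' '.join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]

# A fake long document of 300 repeated words
long_document = ('word ' * 300).strip()
chunks = chunk_text(long_document, max_tokens=128)
print(len(chunks))  # prints 3
```

You could then encode each chunk with roberta_model.encode and aggregate the chunk-level similarity scores, for example by taking their mean or maximum.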
Hi there, I’m Anindya Naskar, Data Science Engineer. I created this website to show you what I believe is the best possible way to get your start in the field of Data Science.