Document Similarity Matching using TF-IDF & Python


If you are working on a project where you need to find the similarity between two documents, TF-IDF can be a good choice. In this tutorial, I will show you how to use TF-IDF for document similarity matching in Python.

About TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical method for measuring the importance of a term (a word or group of words) within a document or a collection of documents.

TF-IDF calculates a numerical score for each term using a combination of term frequency (TF) and inverse document frequency (IDF).

Term Frequency (TF)

Term Frequency (TF) measures how often a specific word (or group of words) appears in a document. For example, in the document below, the word “cat” appears three times.

I have a cat. My cat is very cute. I love my cat.

You can calculate TF using the equation below:

TF(term) = (number of times the term appears in the document) / (number of unique words in the document)

So TF for the term “cat” = 3/9 ≈ 0.33.

Note: the number of unique words in the above document is 9 (“i”, “have”, “a”, “cat”, “my”, “is”, “very”, “cute”, “love”).
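Here is a minimal sketch of this calculation in plain Python (the function name term_frequency and the simple punctuation stripping are my own, for illustration only):

def term_frequency(term, document):
    # Lowercase and strip trailing punctuation from each word
    tokens = [word.strip(".,!?").lower() for word in document.split()]
    # TF as defined above: occurrences of the term divided by
    # the number of unique words in the document
    return tokens.count(term) / len(set(tokens))

document = "I have a cat. My cat is very cute. I love my cat."
print(term_frequency("cat", document))  # 3 / 9 ≈ 0.33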

Inverse Document Frequency (IDF)

IDF measures the importance of a term within a collection of documents. It helps identify how rare a term or word is across all the documents: the rarer the term, the higher its IDF score.

You can calculate IDF using the equation below:

IDF(term) = log(total number of documents / number of documents containing the term)

Let’s say we have 5 documents and we want to calculate the IDF score for a specific term, “water”.

Document1: Regular exercise is key to improving overall fitness levels

Document2: Water covers approximately 71% of the Earth’s surface

Document3: Drinking water helps keep our bodies hydrated and healthy

Document4: Physical activity boosts energy levels and enhances overall fitness

Document5: Water is essential for sustaining all forms of life

You can see that out of the 5 documents, only 3 documents contain the term “water”. To calculate the IDF for the term “water”, we use the formula:

IDF(water) = log(total number of documents / number of documents containing the term “water”)

IDF(water) = log(5 / 3) ≈ 0.22 (using a base-10 logarithm)
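As a quick check, here is a minimal sketch of this calculation in Python (the function name inverse_document_frequency and the whitespace tokenization are my own, for illustration only):

import math

documents = [
    "Regular exercise is key to improving overall fitness levels",
    "Water covers approximately 71% of the Earth's surface",
    "Drinking water helps keep our bodies hydrated and healthy",
    "Physical activity boosts energy levels and enhances overall fitness",
    "Water is essential for sustaining all forms of life",
]

def inverse_document_frequency(term, documents):
    # Count the documents that contain the term (case-insensitive)
    matches = sum(1 for doc in documents if term in doc.lower().split())
    return math.log10(len(documents) / matches)

print(inverse_document_frequency("water", documents))  # log10(5/3) ≈ 0.22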

TF-IDF

Once we have the TF and IDF scores separately, we can calculate TF-IDF for a specific term or word by multiplying its TF value by its IDF value.

TF-IDF = TF * IDF
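For example, take the term “water” in Document2 above. Document2 has 8 unique words and “water” appears once, so TF(water) = 1/8 = 0.125, and TF-IDF(water, Document2) = 0.125 * 0.22 ≈ 0.028. (This follows the unique-word TF definition used earlier; note that libraries like scikit-learn use slightly different variants of these formulas.)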

What is Document Similarity Matching

Document similarity matching is a technique for finding similarities between two or more documents based on their content. It helps us determine how closely related two documents are.

You can use this technique for tasks like identifying related documents, detecting plagiarism, recommending similar content, and many more.

Why TF-IDF for document matching

There are various advanced word embedding algorithms available, like doc2vec, word2vec, and fastText. So why should we use a TF-IDF-based technique instead of those?

In my experience, deep-learning-based word embeddings work well for general documents (like Wikipedia articles). But if you are working on a project where your documents are domain specific (Telecom, Banking, etc.), those deep-learning-based approaches may not give you proper results.


Since TF-IDF gives extra weight to rare words (through IDF), it works well for domain-specific documents. A Telecom-specific document may contain terms like “isdn”, “bandwidth”, “speed”, “tower”, etc., while a banking document may contain terms like “account”, “cheque”, “branch”, etc.

TF-IDF document similarity using Python

Now let’s implement a TF-IDF-based document similarity matcher in Python. I will divide the entire project into steps.

Goal of This Project

Before writing any code, let me first explain what I want to achieve with this project.

Let’s say we have a collection of 4 documents like the ones below. Note that for demo purposes I kept only one sentence per document. In reality, there can be a huge number of sentences per document, but the concept and the code stay the same.

Document1: I love to read books.

Document2: Programming is my passion.

Document3: Traveling broadens the mind.

Document4: Music is the universal language.

I also have one query document: “Reading is a great way to relax.”

Now I want to find the document in the collection that is most similar to the query document. Since the query document is about reading, the most similar document should be Document1. This is what we want to achieve in this project.

Step 1: Import libraries

Let’s first import all required Python libraries to implement document similarity matching using TF-IDF.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import string

nltk.download('stopwords')
nltk.download('punkt')
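Note: on newer NLTK versions, word_tokenize() may also need the punkt_tab resource; if you see an error mentioning it, run nltk.download('punkt_tab') as well.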

Step 2: Read Documents

For this demo project, I am going to use four sample documents. Let’s create a list of those four sample documents.

# Example documents
documents = [
    "I love to read books.",
    "Programming is my passion.",
    "Traveling broadens the mind.",
    "Music is the universal language."
]

Step 3: Preprocessing the Documents

Like any other NLP project, we need to apply some pre-processing steps to our text documents. Pre-processing is essential to clean our text data.

There are many pre-processing techniques available; we will use the ones that fit our requirements.

# Step 3: Preprocessing the documents
def preprocess_document(document):
    # Tokenization
    tokens = word_tokenize(document)
    # Lowercase conversion
    tokens = [token.lower() for token in tokens]
    # Punctuation removal
    tokens = [token for token in tokens if token not in string.punctuation]
    # Stop word removal
    stop_words = set(stopwords.words("english"))
    tokens = [token for token in tokens if token not in stop_words]
    # Stemming
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(token) for token in tokens]
    return " ".join(tokens)

preprocessed_documents = [preprocess_document(document) for document in documents]
print(preprocessed_documents)

Output

['love read book',
 'program passion',
 'travel broaden mind',
 'music univers languag']

Here in this code, first I am using the word_tokenize() function from the NLTK library to split the document into individual words or tokens.


Then I convert each token to lowercase using .lower().

Then I remove punctuation (“.”, “,”, etc.) using the string.punctuation constant.

After that, I remove stop words from the input text. Stop words are common words (like “the”, “is”, “a”, etc.) that often do not carry much meaning in a text. In this example, we are using the NLTK stop word list via stopwords.words("english").

Next, we apply stemming with the PorterStemmer class from the NLTK library. Stemming converts words to their base or root form, for example “reading” -> “read”, “books” -> “book”. You can also use lemmatization here.


Finally, the preprocessed tokens are joined back together into a single string using " ".join(tokens), with the tokens separated by spaces.
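If you want to try lemmatization instead of stemming, a minimal sketch of that variant could look like this (assuming the NLTK WordNet data is available; the commented line is a drop-in replacement for the stemming step inside preprocess_document()):

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # lemmatizer data, needed once

lemmatizer = WordNetLemmatizer()
# Inside preprocess_document(), replace the stemming line with:
# tokens = [lemmatizer.lemmatize(token) for token in tokens]
print(lemmatizer.lemmatize("books"))             # book
print(lemmatizer.lemmatize("reading", pos="v"))  # read (needs the verb POS tag)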

Step 4: Computing TF-IDF Values

To calculate TF-IDF scores, we will use the TfidfVectorizer class from the scikit-learn library. It produces a TF-IDF matrix as output, which contains the TF-IDF score for each term in each document.

Since the TF-IDF matrix is stored as a sparse matrix, printing it directly is not very readable. To visualize the TF-IDF document-term matrix, you can convert it to a pandas DataFrame.

# Step 4: Compute TF-IDF values
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(preprocessed_documents)

# Convert TF-IDF document term matrix to DataFrame
feature_names = vectorizer.get_feature_names_out()
df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)

# Print the TF-IDF DataFrame
print("\nTF-IDF DataFrame:")
print(df_tfidf)
TF-IDF DataFrame:
      book  broaden  languag     love     mind    music   passion   program   
0  0.57735  0.00000  0.00000  0.57735  0.00000  0.00000  0.000000  0.000000  \
1  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.707107  0.707107   
2  0.00000  0.57735  0.00000  0.00000  0.57735  0.00000  0.000000  0.000000   
3  0.00000  0.00000  0.57735  0.00000  0.00000  0.57735  0.000000  0.000000   

      read   travel  univers  
0  0.57735  0.00000  0.00000  
1  0.00000  0.00000  0.00000  
2  0.00000  0.57735  0.00000  
3  0.00000  0.00000  0.57735  

Here 0, 1, 2, and 3 represent the TF-IDF vectors for Document1, Document2, Document3, and Document4 respectively.
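A quick note on the numbers: by default, TfidfVectorizer L2-normalizes each document vector (norm='l2'). That is why Document1, whose three terms (“love”, “read”, “book”) each occur once and carry equal weight, shows 1/sqrt(3) ≈ 0.57735 for each of them, while the two terms of Document2 each show 1/sqrt(2) ≈ 0.707107.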

Step 5: Building TF-IDF Vectors for the Query Document

Similarly, we now need to convert our query document into a TF-IDF vector (TF-IDF embedding). Let’s do that in the Python code below.

# Step 5: Build TF-IDF vectors

# Example: Convert a new document into a TF-IDF vector
new_document = "Reading is a great way to relax."
new_preprocessed_document = preprocess_document(new_document)
new_tfidf_vector = vectorizer.transform([new_preprocessed_document])

# Convert TF-IDF matrix to DataFrame
feature_names = vectorizer.get_feature_names_out()
df_tfidf_new = pd.DataFrame(new_tfidf_vector.toarray(), columns=feature_names)

# Print the TF-IDF DataFrame
print("\nTF-IDF DataFrame:")
print(df_tfidf_new)
TF-IDF DataFrame:
   book  broaden  languag  love  mind  music  passion  program  read  travel   
0   0.0      0.0      0.0   0.0   0.0    0.0      0.0      0.0   1.0     0.0  \

   univers  
0      0.0  
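This output makes sense: after preprocessing, the query “Reading is a great way to relax.” becomes “read great way relax”. Of these tokens, only “read” exists in the vocabulary the vectorizer learned from our four documents, so transform() ignores the rest, and after L2 normalization the single surviving term gets a weight of 1.0.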

Step 6: Calculate Document Similarity

To measure the similarity between two documents, we can use similarity metrics such as cosine similarity or Euclidean distance. In my experience, cosine similarity works well for text similarity (or string similarity).


Cosine similarity calculates the cosine of the angle between two vectors and gives a value between -1 and 1. A cosine similarity of 1 indicates maximum similarity, while 0 indicates no similarity at all. Since TF-IDF vectors are non-negative, the scores here always fall between 0 and 1.
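For two vectors A and B, it is computed as:

cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)

that is, the dot product of the vectors divided by the product of their lengths.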

So we are going to use cosine similarity for this example project.

# Step 6: Measure document similarity

# Compute cosine similarity between the new document and all other documents
similarity_scores = cosine_similarity(new_tfidf_vector, tfidf_matrix)
print(similarity_scores)
[[0.57735027 0.         0.         0.        ]]

The above output shows the similarity scores between the query document and each document in our collection (doc1, doc2, doc3, and doc4).

The similarity score between the query document and doc1 is 0.577; for the rest of the documents (doc2, doc3, and doc4) the score is 0. This matches the vectors above: the query vector has weight 1.0 for “read” and doc1 has weight 0.577 for “read”, so their dot product is 0.577.

Step 7: Ranking Document Similarity

Once we have computed the similarity between the query document and each document in the collection, we can rank the documents in descending order of their similarity scores. This ranking lets us identify the most similar documents for a given query.

# Step 7: Rank the document similarity
similarity_scores = similarity_scores.flatten()  # Convert to 1D array
document_indices = similarity_scores.argsort()[::-1]  # Sort indices in descending order

# Print the most similar documents
print("Most similar documents for : ", new_document)
for index in document_indices:
    if similarity_scores[index] > 0:  # Keep only documents with a non-zero similarity score
        print(documents[index], "| Similarity Score:", similarity_scores[index])
Most similar documents for :  Reading is a great way to relax.
I love to read books. | Similarity Score: 0.5773502691896257

End Note

In this tutorial, I showed you how to retrieve similar documents in Python. We used the TF-IDF vector technique to compare documents using Python’s sklearn library.

Data pre-processing is one of the most important parts of any NLP or machine learning project. Before calculating the TF-IDF vectors, we applied different pre-processing techniques like lowercasing, stop word removal, and stemming to clean our text data.

That’s it for this tutorial. If you have any questions or suggestions regarding this tutorial, please let me know in the comment section below.
