If you are working on a project where you need to measure the similarity between two documents, TF-IDF can be a good choice. In this tutorial, I will show you how to use TF-IDF for document similarity matching in Python.
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical method for measuring how important a term (a word or group of words) is to a document within a collection of documents.
TF-IDF calculates a numerical score for each term by combining term frequency (TF) and inverse document frequency (IDF).
Term Frequency (TF)
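Term Frequency (TF) measures how often a specific word (or group of words) appears in a document. For example, in the document below, the word “cat” appears three times.
“I have a cat. My cat is very cute. I love my cat.”
You can calculate TF using the equation below:
TF(term) = number of times the term appears in the document / total number of terms in the document
So TF for the term “cat” = 3/13 ≈ 0.23.
Note: the document above contains 13 words in total, of which 9 are unique (“i”, “have”, “a”, “cat”, “my”, “is”, “very”, “cute”, “love”). The denominator in the usual definition of TF is the total count (13), not the number of unique words.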
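The TF calculation can be sketched in a few lines of Python. This is a minimal illustration, assuming the common definition where the denominator is the total number of tokens in the document (13 for the example sentence). The helper name term_frequency and the simple regex tokenizer are my own choices for this sketch, not part of any library.

```python
import re

def term_frequency(term, document):
    """Occurrences of `term` divided by the total number of tokens."""
    # Deliberately simple tokenizer: lowercase, keep alphabetic runs only
    tokens = re.findall(r"[a-z]+", document.lower())
    if not tokens:
        return 0.0  # avoid dividing by zero for an empty document
    return tokens.count(term.lower()) / len(tokens)

doc = "I have a cat. My cat is very cute. I love my cat."
tf_cat = term_frequency("cat", doc)
print(round(tf_cat, 2))  # 3 occurrences out of 13 tokens -> 0.23
```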
Inverse Document Frequency (IDF)
IDF measures how important a term is within a collection of documents by looking at how rare the term is across all of them.
You can calculate IDF using the equation below:
IDF(term) = log(total number of documents / number of documents containing the term)
Let’s say we have 5 documents and want to calculate the IDF score for the term “water”.
Document1: Regular exercise is key to improving overall fitness levels
Document2: Water covers approximately 71% of the Earth’s surface
Document3: Drinking water helps keep our bodies hydrated and healthy
Document4: Physical activity boosts energy levels and enhances overall fitness
Document5: Water is essential for sustaining all forms of life
You can see that out of 5 documents, only 3 contain the term “water”. To calculate the IDF for the term “water”, we use the formula:
IDF(water) = log(total number of documents / number of documents containing the term “water”)
IDF(water) = log(5 / 3) ≈ 0.222 (using a base-10 logarithm)
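The same IDF calculation can be checked in Python. This is a small sketch, assuming a base-10 logarithm and a plain whitespace tokenizer; the function name inverse_document_frequency is hypothetical. With base 10, log(5 / 3) evaluates to roughly 0.222.

```python
import math

documents = [
    "Regular exercise is key to improving overall fitness levels",
    "Water covers approximately 71% of the Earth's surface",
    "Drinking water helps keep our bodies hydrated and healthy",
    "Physical activity boosts energy levels and enhances overall fitness",
    "Water is essential for sustaining all forms of life",
]

def inverse_document_frequency(term, docs):
    """log10(N / df), where df is the number of documents containing `term`."""
    term = term.lower()
    # Count documents in which the term occurs as a whole token
    df = sum(1 for doc in docs if term in doc.lower().split())
    if df == 0:
        raise ValueError(f"{term!r} does not occur in any document")
    return math.log10(len(docs) / df)

idf_water = inverse_document_frequency("water", documents)
print(round(idf_water, 3))  # 0.222
```

A term that appears in fewer documents gets a higher score: “fitness” occurs in 2 of the 5 documents, so its IDF is log(5 / 2), which is larger than the value for “water”.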
Once we have TF and IDF scores separately, we can calculate TF-IDF for a specific term or word by multiplying the TF value with its IDF value.
tf-idf = TF * IDF
What is Document Similarity Matching
Document similarity matching is a technique to find similarities between two or more documents based on their content. It helps us to determine how closely related two documents are.
You can implement this technique for tasks like: identifying related documents, detecting plagiarism, recommending similar content, and many more.
Why TF-IDF for document matching
There are various more advanced embedding algorithms available, such as doc2vec, word2vec, and fastText. So why should we use a TF-IDF-based technique instead?
In my experience, deep learning based word embeddings work well for general documents (like Wikipedia articles). But if your documents are domain specific (for example Telecom or Banking), those approaches may not give you good results.
Since TF-IDF gives extra weight to rare words (through IDF), it tends to work well for domain-specific documents. A Telecom document can contain terms like “isdn”, “bandwidth”, “speed”, “tower”, etc., while a banking document can contain terms like “account”, “cheque”, “branch”, etc.
TF-IDF document similarity using Python
Now let’s implement a TF-IDF based document similarity matcher in Python. I will divide the project into steps. But before that, let’s understand what we want to achieve.
Goal of This Project
Before writing any code, let me first explain what I want to achieve with this project.
Let’s say we have a collection of 4 documents like the ones below. Note that for demo purposes I kept only one sentence per document. In reality, there can be a huge number of sentences per document, but the concept and the code stay the same.
Document1: I love to read books.
Document2: Programming is my passion.
Document3: Traveling broadens the mind.
Document4: Music is the universal language.
I also have one query document: “Reading is a great way to relax.”
Now I want to find the document in the collection that is most similar to the query document. Since the query document is about reading, the most similar document should be Document 1. This is what I want to achieve with this project.
Step1: Import libraries
Let’s first import all required Python libraries to implement document similarity matching using TF-IDF.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import string

nltk.download('stopwords')
nltk.download('punkt')
# Recent NLTK releases load the tokenizer data from 'punkt_tab' instead
nltk.download('punkt_tab')
Step2: Read Documents
For this demo project, I am going to use four sample documents. Let’s create a list of those four sample documents.
# Example documents
documents = [
    "I love to read books.",
    "Programming is my passion.",
    "Traveling broadens the mind.",
    "Music is the universal language."
]
Step3: Preprocessing the Documents
Like any other NLP project, we need to apply some pre-processing steps to our text document. Pre-processing is essential to clean our text data.
There are many preprocessing techniques available. We will use only the ones we need for this project.
# Step 3: Preprocessing the documents
def preprocess_document(document):
    # Tokenization
    tokens = word_tokenize(document)

    # Lowercase conversion
    tokens = [token.lower() for token in tokens]

    # Punctuation removal
    tokens = [token for token in tokens if token not in string.punctuation]

    # Stop word removal
    stop_words = set(stopwords.words("english"))
    tokens = [token for token in tokens if token not in stop_words]

    # Stemming
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(token) for token in tokens]

    return " ".join(tokens)

preprocessed_documents = [preprocess_document(document) for document in documents]
print(preprocessed_documents)
['love read book', 'program passion', 'travel broaden mind', 'music univers languag']
In this code, I first use the word_tokenize() function from the NLTK library to split the document into individual words or tokens.
Then each token is converted to lowercase using token.lower().
Then punctuation (“.”, “,”, etc.) is removed by dropping every token that appears in string.punctuation.
After that, stop words are removed from the input text. Stop words are common words (like “the”, “is”, “a”, etc.) that often do not carry much meaning in a text. In this example, we use the NLTK stop word list via stopwords.words("english").
Next, we apply stemming with the PorterStemmer class from the NLTK library. Stemming reduces words to their base or root form, for example “reading” -> “read” and “books” -> “book”. You can also use lemmatization here.
Finally, the preprocessed tokens are joined back together into a single string using " ".join(tokens), where the tokens are separated by a space.
Step4: Computing TF-IDF Values
To calculate TF-IDF scores, we will use the TfidfVectorizer class from the scikit-learn library. Its fit_transform() method learns the vocabulary from the documents and produces a TF-IDF matrix as output. The TF-IDF matrix contains the TF-IDF score for each term in each document.
Since the TF-IDF matrix is a SciPy sparse matrix, printing it directly only lists the non-zero entries. To view it as a document-term table, you need to convert it to a pandas DataFrame.
# Step 4: Compute TF-IDF values
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(preprocessed_documents)

# Convert TF-IDF document term matrix to DataFrame
feature_names = vectorizer.get_feature_names_out()
df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)

# Print the TF-IDF DataFrame
print("\nTF-IDF DataFrame:")
print(df_tfidf)
TF-IDF DataFrame:
      book  broaden  languag     love     mind    music   passion   program
0  0.57735  0.00000  0.00000  0.57735  0.00000  0.00000  0.000000  0.000000  \
1  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.707107  0.707107
2  0.00000  0.57735  0.00000  0.00000  0.57735  0.00000  0.000000  0.000000
3  0.00000  0.00000  0.57735  0.00000  0.00000  0.57735  0.000000  0.000000

      read   travel  univers
0  0.57735  0.00000  0.00000
1  0.00000  0.00000  0.00000
2  0.00000  0.57735  0.00000
3  0.00000  0.00000  0.57735
Here rows 0, 1, 2, and 3 are the TF-IDF vectors for Document1, Document2, Document3 and Document4 respectively.
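The values 0.57735 and 0.707107 can be reproduced by hand. With its default settings, TfidfVectorizer uses raw term counts for TF, a smoothed natural-log IDF of the form ln((1 + n) / (1 + df)) + 1, and then scales each row to unit Euclidean length. The sketch below follows those steps in plain Python on the preprocessed documents. The helper name tfidf_rows is my own, and the whitespace tokenizer is a simplification that happens to be enough for this small example.

```python
import math
from collections import Counter

preprocessed = [
    "love read book",
    "program passion",
    "travel broaden mind",
    "music univers languag",
]

def tfidf_rows(docs):
    """Return (vocabulary, rows) using smoothed IDF and L2-normalized rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]  # TF * IDF
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])  # scale row to unit length
    return vocab, rows

vocab, rows = tfidf_rows(preprocessed)
print(vocab)
print([round(v, 5) for v in rows[0]])
```

Every term in this collection occurs in exactly one document, so all terms share the same IDF. After normalization, a document with three terms gets 1/√3 ≈ 0.57735 per term and a document with two terms gets 1/√2 ≈ 0.707107.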
Step5: Building TF-IDF Vectors for Query Document
Now we similarly need to convert our query document to a TF-IDF vector (TF-IDF embedding), using the same vectorizer that was fitted on the collection. Let’s do that in the Python code below.
# Step 5: Build TF-IDF vectors
# Example: Convert a new document into a TF-IDF vector
new_document = "Reading is a great way to relax."
new_preprocessed_document = preprocess_document(new_document)
new_tfidf_vector = vectorizer.transform([new_preprocessed_document])

# Convert TF-IDF matrix to DataFrame
feature_names = vectorizer.get_feature_names_out()
df_tfidf_new = pd.DataFrame(new_tfidf_vector.toarray(), columns=feature_names)

# Print the TF-IDF DataFrame
print("\nTF-IDF DataFrame:")
print(df_tfidf_new)
TF-IDF DataFrame:
   book  broaden  languag  love  mind  music  passion  program  read  travel
0   0.0      0.0      0.0   0.0   0.0    0.0      0.0      0.0   1.0     0.0  \

   univers
0      0.0
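Only the “read” column is non-zero because transform() reuses the vocabulary learned from the collection, so query terms that never appeared there have no column to land in. The sketch below illustrates this in plain Python. The query token list is an assumption about what the preprocessing step produces for this sentence, and the variable names are my own.

```python
vocabulary = ["book", "broaden", "languag", "love", "mind", "music",
              "passion", "program", "read", "travel", "univers"]

# Assumed result of preprocessing "Reading is a great way to relax."
query_tokens = ["read", "great", "way", "relax"]

# Terms seen during fitting keep their column; all others are dropped
known = [t for t in query_tokens if t in vocabulary]
unknown = [t for t in query_tokens if t not in vocabulary]

print(known)    # ['read']
print(unknown)  # ['great', 'way', 'relax']
```

With a single surviving term, normalizing the vector to unit length gives that term a weight of exactly 1.0, which matches the output above.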
Step6: Calculate Document Similarity
To measure the similarity between two documents, we can use metrics such as cosine similarity or Euclidean distance. In my experience, cosine similarity works well for text similarity (or string similarity).
Cosine similarity calculates the cosine of the angle between two vectors. In general it gives a value between -1 and 1, but TF-IDF vectors have no negative components, so here the value lies between 0 and 1. A cosine similarity of 1 means the two vectors point in the same direction, while 0 means the documents share no terms.
So we are going to use cosine similarity for this example project.
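To make the formula concrete, here is a minimal pure-Python sketch of cosine similarity: the dot product of two vectors divided by the product of their lengths. The function name cosine and the toy vectors are illustrative assumptions, and returning 0.0 for an all-zero vector is a convention chosen for this sketch.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here for all-zero vectors
    return dot / (norm_u * norm_v)

# Toy count vectors over the terms (book, love, read, program)
doc1 = [1.0, 1.0, 1.0, 0.0]
doc2 = [0.0, 0.0, 0.0, 1.0]
query = [0.0, 0.0, 1.0, 0.0]

print(round(cosine(query, doc1), 5))  # 0.57735
print(cosine(query, doc2))            # 0.0
```

Because the result depends only on the angle between the vectors, a long document and a short one with the same mix of terms get the same score.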
# Step 6: Measure document similarity
# Compute cosine similarity between the new document and all other documents
similarity_scores = cosine_similarity(new_tfidf_vector, tfidf_matrix)
print(similarity_scores)
[[0.57735027 0. 0. 0. ]]
The output above shows the similarity score between the query document and each document in our collection (doc1, doc2, doc3, and doc4).
Here the similarity score between the query document and doc1 is 0.577. For the rest of the documents (doc2, doc3, and doc4), the similarity score is 0.
Step7: Ranking Document Similarity
Once we have computed the similarity between the query document and each document in the collection, we can rank the documents in descending order of similarity score. This ranking lets us identify the most similar documents for a given query.
# Step 7: Rank the document similarity
similarity_scores = similarity_scores.flatten()  # Convert to 1D array
document_indices = similarity_scores.argsort()[::-1]  # Sort indices in descending order

# Print the most similar documents
print("Most similar documents for : ", new_document)
for index in document_indices:
    if similarity_scores[index] > 0:  # Skip documents that share no terms with the query
        print(documents[index], "| Similarity Score:", similarity_scores[index])
Most similar documents for : Reading is a great way to relax.
I love to read books. | Similarity Score: 0.5773502691896257
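For a larger collection you usually want only the best few matches. The ranking logic can be wrapped in a small helper like the one sketched below. The function name rank_documents and its top_k and min_score parameters are hypothetical, and the scores list simply restates the values printed in Step 6.

```python
def rank_documents(documents, scores, top_k=3, min_score=0.0):
    """Return up to top_k (document, score) pairs with score > min_score, best first."""
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return [(doc, score) for doc, score in ranked if score > min_score][:top_k]

documents = [
    "I love to read books.",
    "Programming is my passion.",
    "Traveling broadens the mind.",
    "Music is the universal language.",
]
scores = [0.57735027, 0.0, 0.0, 0.0]  # similarity scores from Step 6

results = rank_documents(documents, scores)
for doc, score in results:
    print(f"{score:.3f}  {doc}")
```

Raising min_score is a simple way to suppress weak matches when many documents share only a single common term with the query.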
In this tutorial, I showed you how to retrieve similar documents in Python. We used TF-IDF vectors to compare documents with the scikit-learn library.
Data preprocessing is an important part of any NLP or machine learning project. Before calculating the TF-IDF vectors, we applied several preprocessing techniques such as lowercasing, stop word removal, and stemming to clean our text data.
This is it for this tutorial. If you have any questions or suggestions regarding this tutorial, please let me know in the comment section below.
Hi there, I’m Anindya Naskar, Data Science Engineer. I created this website to show you what I believe is the best possible way to get your start in the field of Data Science.