Form parser using OCR OpenCV and Python

In my last two posts, I discussed how to extract text from images using OCR and how to perform automatic feature-based image alignment. In this tutorial, I will show you how to combine those two techniques to build a form parser using Tesseract OCR, OpenCV, and Python.

Why build a Form parser?

First, let's understand why we would use OCR on forms, documents, and invoices.

Though we are living in the digital age, we still rely heavily on physical paper trails, especially in large organizations such as governments, corporate companies, and universities/colleges.

We need to digitize these documents to organize, categorize, and store them in our databases for further use.

Optical Character Recognition algorithms can automatically extract the required information from those scanned (or photographed) paper documents and store it in our databases.

Steps to build Form Parser Pipeline

We need to follow the below steps to implement a document parser pipeline with OpenCV, Tesseract OCR, and Python:

  1. Install required Python packages
  2. Image Alignment
  3. Specify Coordinates to extract specific entities
  4. Extract entity values using OCR

Install Python libraries

For this tutorial, I used the libraries below:

  • OpenCV
    • pip install opencv-python==4.4.0.46
    • pip install opencv-contrib-python==4.4.0.46 (To avoid error: AttributeError: module ‘cv2’ has no attribute ‘xfeatures2d’)
  • Tesseract OCR
    • pip install pytesseract

Note: pytesseract requires Python ‘>=3.7’. If you are using a lower version of Python, you can create a virtual environment for Python 3.7 and install pytesseract there.

Create a virtual environment using conda:

conda create --name form_parser python=3.7
conda activate form_parser

or

activate form_parser
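
With the environment active, install the same packages listed above inside it:

pip install opencv-python==4.4.0.46
pip install opencv-contrib-python==4.4.0.46
pip install pytesseract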

1. Image Alignment to build the form parser

Let’s say we have the below three form images, and we want to parse information from those forms. The problem is that those images are not correctly aligned, so we cannot write a single piece of logic to extract specific information from all of them. For example, the (pixel-wise) position of the Name field is different in each of these three forms.

To solve this, we need to align all the images to one template, so that if we specify an area (coordinates) on one form image, the same area can be applied to all forms, invoices, or documents of the same type.

So, to build a form extractor, the first task is to align those images to one reference image (a properly captured image of the form). In this tutorial, I will use a feature-based image alignment technique to register any form image against a template image of the same form or invoice.

[Image: Input photos of the form taken from different camera angles]

We need to align the above images to the template image below.

[Image: Reference image used for alignment and registration]

Code to achieve Feature-based Image Alignment with OpenCV

import cv2
import numpy as np
import pytesseract
import pandas as pd
import os

# ********************************************************************************
# Image Alignment with SIFT Algorithm
# ********************************************************************************
# Read Image to be aligned
imgTest = cv2.imread('input/test_form1.jpg')
# Read the reference (template) image
imgRef = cv2.imread('input/template_form.png')
# --------------------------------------------------
# Convert to grayscale.
imgTest_grey = cv2.cvtColor(imgTest, cv2.COLOR_BGR2GRAY)
imgRef_grey = cv2.cvtColor(imgRef, cv2.COLOR_BGR2GRAY)
height, width = imgRef_grey.shape
# --------------------------------------------------
# Configure the SIFT feature detector with up to 1000 features.
# (With OpenCV >= 4.4 you can also call cv2.SIFT_create(1000) from the main module.)
sift_detector = cv2.xfeatures2d.SIFT_create(1000)

# Extract key points and descriptors for both images
keyPoint1, des1 = sift_detector.detectAndCompute(imgTest_grey, None)
keyPoint2, des2 = sift_detector.detectAndCompute(imgRef_grey, None)

# Display keypoints of the reference image in green color
imgKp_Ref = cv2.drawKeypoints(imgRef, keyPoint2, 0, (0, 222, 0), None)
# imgKp_Ref = cv2.resize(imgKp_Ref, (width//2, height//2))
#
# cv2.imshow('Key Points', imgKp_Ref)
# cv2.waitKey(0)
# ----------------------------------------------------------
# Match features between two images using Brute Force matcher with L1 distance
matcher = cv2.BFMatcher(cv2.NORM_L1, crossCheck=True)

# Match the two sets of descriptors.
matches = matcher.match(des1, des2)

# Sort matches by distance (lower distance = better match).
# Note: in newer OpenCV versions match() returns a tuple, so use sorted() instead of .sort().
matches = sorted(matches, key=lambda x: x.distance)

# Take the top 90 % matches forward.
matches = matches[:int(len(matches) * 0.9)]
no_of_matches = len(matches)

# Draw the 100 best matches
imgMatch = cv2.drawMatches(imgTest, keyPoint1, imgRef, keyPoint2, matches[:100], None, flags=2)
# imgMatch = cv2.resize(imgMatch, (width//3, height//3))
#
# cv2.imshow('Image Match', imgMatch)
# cv2.waitKey(0)
# -----------------------------------------------------------
# Allocate (no_of_matches x 2) arrays to hold the matched point coordinates
p1 = np.zeros((no_of_matches, 2))
p2 = np.zeros((no_of_matches, 2))

# Storing values to the matrices
for i in range(len(matches)):
    p1[i, :] = keyPoint1[matches[i].queryIdx].pt
    p2[i, :] = keyPoint2[matches[i].trainIdx].pt

# Find the homography matrix.
homography, mask = cv2.findHomography(p1, p2, cv2.RANSAC)
# ---------------------------------------------------------
# Use homography matrix to transform the unaligned image wrt the reference image.
aligned_img = cv2.warpPerspective(imgTest, homography, (width, height))

# ********* Important: this resize factor must match the one used in the coordinate code *********
# Resizing the image to match the coordinate image
aligned_img = cv2.resize(aligned_img, (width // 3, height // 3))
# ******************************************************************************************

# Convert image to grey scale
aligned_img_grey = cv2.cvtColor(aligned_img, cv2.COLOR_BGR2GRAY)
# Converting grey image to binary image by Thresholding
aligned_img_threshold = cv2.threshold(aligned_img_grey, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]

# Copy of input image
imgTest_cp = imgTest.copy()
imgTest_cp = cv2.resize(imgTest_cp, (width // 3, height // 3))
# Save the align image output.
# cv2.imwrite('output.jpg', aligned_img)
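
If you want to verify the alignment visually, here is a small optional check (my addition, not part of the original script): blend the full-size warped image with the template and look for ghosting.

# Optional sanity check: overlay the full-size warped image on the template.
# A good alignment shows the two forms coinciding, with little ghosting.
full_size_aligned = cv2.warpPerspective(imgTest, homography, (width, height))
overlay = cv2.addWeighted(imgRef, 0.5, full_size_aligned, 0.5, 0)
cv2.imshow('Alignment check', overlay)
cv2.waitKey(0)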

If you have any doubts about understanding the above code, please read my last post, where I explained each and every part of the image alignment technique.

Note: In this post, I use the SIFT algorithm to detect keypoints. You can also use the ORB algorithm for this.
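
For reference, here is a minimal sketch of the ORB variant (only the detector and matcher change; the homography and warping steps stay the same). ORB produces binary descriptors, so the brute-force matcher should use Hamming distance instead of L1:

import cv2

imgTest_grey = cv2.imread('input/test_form1.jpg', cv2.IMREAD_GRAYSCALE)
imgRef_grey = cv2.imread('input/template_form.png', cv2.IMREAD_GRAYSCALE)

# ORB detector with up to 1000 features (available in the main OpenCV module)
orb_detector = cv2.ORB_create(1000)
keyPoint1, des1 = orb_detector.detectAndCompute(imgTest_grey, None)
keyPoint2, des2 = orb_detector.detectAndCompute(imgRef_grey, None)

# ORB descriptors are binary, so match them with Hamming distance
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda x: x.distance)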

After running the above code, you should get output like the images below:

[Image: Alignment output for 1st image]
[Image: Alignment output for 2nd image]
[Image: Alignment output for 3rd image]

2. Specify Coordinates to extract specific entities

Once the images are properly aligned to the template form, you need to specify which portions of the form to extract. For this example, I am going to extract the below information from our example form:

  • Name
  • Phone
  • Email
  • Customer ID
  • City
  • Country
[Image: Specific areas to extract from our input form image]

To achieve that, we need the coordinate information (top-left X, top-left Y, bottom-right X, bottom-right Y) of those particular areas (Name, Phone, etc.) in our input image.

You may use the code below to obtain coordinate values for our Region of Interest (ROI). You can also get the coordinate values using the Microsoft Paint app.

# Import the required libraries
from tkinter import *
from PIL import ImageTk, Image
import cv2

# Read input form image
img = cv2.imread('input/template_form.png')

# Find shape of the image in opencv
height, width, channel = img.shape
frame_width = width // 3
frame_height = height // 3

# Resize image opencv
img = cv2.resize(img, (frame_width, frame_height))
# Save image opencv
cv2.imwrite('resized_image.jpg', img)

# Create an instance of tkinter frame or window
win = Tk()

# Set the size of the window as per input form size
win.geometry(str(frame_width) + "x" + str(frame_height))


# Define a function to mark the clicked point on the canvas
def draw_point(event):
    x, y = event.x, event.y
    # Draw a dot (a zero-size oval with a thick outline) at the clicked point
    canvas.create_oval(x, y, x, y, fill="green", width=7)
    # Print mouse clicked coordinate
    print("Position = ({0},{1})".format(event.x, event.y))

# Create a canvas widget sized to the resized form
canvas = Canvas(win, width=frame_width, height=frame_height, background="white")
# Use a single geometry manager (mixing pack, place, and grid on one widget causes conflicts)
canvas.pack()

# Create an object of tkinter ImageTk
# Display image in tkinter canvas
img = ImageTk.PhotoImage(Image.open('resized_image.jpg'))
canvas.create_image(0, 0, image=img, anchor=NW)

canvas.bind('<Button-1>', draw_point)
# Run the tkinter main loop
win.mainloop()

In this code, the two lines computing frame_width and frame_height are very important: they resize the image to fit your display. For this example, I am using a resizing factor of 3. You can use any other factor that fits the image on your display, but you need to maintain the same resizing factor (in my case 3) in the OCR section.
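
To make this rule concrete, here is a tiny hypothetical helper (the name to_full_resolution is mine, not from the scripts above) showing how a point clicked on the 1/3-scale preview would map back to the full-resolution image:

RESIZE_FACTOR = 3

def to_full_resolution(x, y, factor=RESIZE_FACTOR):
    # A point clicked on the resized preview corresponds to
    # (x * factor, y * factor) in the original full-size image.
    return x * factor, y * factor

print(to_full_resolution(15, 328))  # -> (45, 984)

In this tutorial we never need this conversion, because the aligned image in the OCR script is resized by the same factor, so the clicked coordinates can be used on it directly.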

In the above code, I made a small Tkinter app with Python. You just need to run it and follow the instructions below to get the coordinates of any clicked point:

  • Click the left mouse button at the top left of your Area of Interest
  • Then click on the bottom right portion of the ROI
  • You will see the coordinate information for each clicked point in the console

Note: You should provide the template image as input to this code

[Image: Coordinate values printed for each clicked point]

3. Extract entity values using OCR

Now that you have the coordinate (ROI) information for all entities, you can extract (convert image to text) anything inside a particular Area of Interest by following these steps:

  • Crop that particular area from the input image
  • Send the cropped image to OCR
  • Get the OCR output

In the below code, we are just following the above steps. This code also writes the OCR output onto the input image.

# -----------------------------------------------------------
# Get coordinate position of all entities from entity_coordinate code
entity_roi = [
    [(15,328), (225,369), 'text', 'name'],
    [(242,330), (457,369), 'text', 'phone_number'],
    [(18,415), (226,452), 'text', 'email_address'],
    [(246,414), (449, 444), 'text', 'customer_id'],
    [(17,497), (228,532), 'text', 'city'],
    [(242,498), (453,535), 'text', 'country_name']
]

aligned_img_show = aligned_img.copy()
aligned_img_mask = np.zeros_like(aligned_img_show)

entity_type = []
entity_name = []
entity_value = []

for roi in entity_roi:
    top_left_x = roi[0][0]
    top_left_y = roi[0][1]
    bottom_right_x = roi[1][0]
    bottom_right_y = roi[1][1]
    # top_left_point = (roi[0][0], roi[0][1])
    # bottom_right_point = (roi[1][0], roi[1][1])

    cv2.rectangle(aligned_img_mask, (top_left_x, top_left_y), (bottom_right_x, bottom_right_y), (0, 0, 255), cv2.FILLED)
    aligned_img_show = cv2.addWeighted(aligned_img_show, 0.99, aligned_img_mask, 0.1, 0)

    # Crop the specific required portion of entire image
    img_cropped = aligned_img_threshold[top_left_y:bottom_right_y, top_left_x:bottom_right_x]

    # OCR section to extract text using pytesseract in python
    # --oem 3 uses the default OCR engine mode; --psm 6 assumes a single uniform block of text
    custom_config = r'--oem 3 --psm 6'

    # Extract text within specific coordinate using pytesseract OCR Python
    # Providing cropped image as input to the OCR
    ocr_output = pytesseract.image_to_string(img_cropped, config=custom_config, lang='eng')
    # Remove unwanted extra line gaps between sentences
    cleaned_output = os.linesep.join([s for s in ocr_output.splitlines() if s])
    print([ocr_output])
    print([cleaned_output])

    # Write OCR output in the original image *******
    # OpenCV font
    font = cv2.FONT_HERSHEY_DUPLEX
    # Red color code
    red_color = (0, 0, 255)
    # Write extracted entity value in red color on the form
    cv2.putText(aligned_img, f'{cleaned_output}', (top_left_x, top_left_y - 30), font, 0.5, red_color, 1)

    # -----------------------------------------------------------
    # Store OCR output in list
    entity_type.append(roi[2])
    entity_name.append(roi[3])
    entity_value.append(cleaned_output)

# -----------------------------------------------------------
# Store OCR output in dataframe
form_data = pd.DataFrame()
form_data['Entity_Name'] = entity_name
form_data['Entity_Value'] = entity_value
form_data['Entity_Type'] = entity_type

# Save the extracted form data as a CSV file
form_data.to_csv('form_data.csv', index=False)

cv2.imshow('Input Image', imgTest_cp)
cv2.imshow('Output Image', aligned_img)
cv2.imshow('Masked Image', aligned_img_show)
cv2.waitKey(0)

In this code:

  • The entity_roi list defines the top-left and bottom-right coordinates for every entity.
  • The for loop iterates over each entity (in this example: Name, Phone, Email, etc.).
  • cv2.rectangle and cv2.addWeighted highlight each ROI on a copy of the image, so we can visually verify that we are extracting the correct region of the form.
  • The slice of aligned_img_threshold crops the specific portion of the image for each entity, based on the coordinates in entity_roi.
  • pytesseract.image_to_string passes the cropped image to Tesseract OCR and returns the extracted text.
  • cv2.putText writes the extracted text (entity values) on top of the image.
  • The last block stores the extracted information in a pandas DataFrame and saves it as a CSV file.
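
If you want to double-check the saved file (a quick optional step, my addition rather than part of the original script), you can load it back with pandas:

import pandas as pd

form_data = pd.read_csv('form_data.csv')
print(form_data)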

One important point to ensure: the resizing factor (in my case 3) must be the same in both scripts.

After running the above code for all the images, you should see output like the images below:

[Image: Parsing output for 1st image]
[Image: Parsing output for 2nd image]
[Image: Parsing output for 3rd image]

At the end of the script, we store all the parsed information in a CSV file (which you can open in Excel) like below:

[Image: Extracted form data opened as a spreadsheet (Excel output)]

Conclusion

In this lesson, you learned how to use OpenCV and Tesseract to build a form parser. You can use the same technique to build an invoice parser or any other document parser.

We used a feature-based image alignment technique, which takes input images (possibly captured from various angles) and a template (reference) image, and aligns them so that they properly overlay on top of each other.

After image alignment, we retrieve the coordinate information for all desired entities in the document.

Then, we used Tesseract OCR to detect and extract the text from those specific areas of the input image, and finally stored all the entity information in a CSV file.

If you have any questions or suggestions about this post feel free to write in the comment section below.

4 thoughts on “Form parser using OCR OpenCV and Python”

  1. Hello sir, thanks a lot for your tutorial. Please, how to solve this error:

    matches.sort(key=lambda x: x.distance)
    AttributeError: ‘tuple’ object has no attribute ‘sort’

    Thanks.

    • In newer OpenCV versions, matcher.match() returns a tuple, which has no .sort() method; replace matches.sort(key=lambda x: x.distance) with matches = sorted(matches, key=lambda x: x.distance). Also make sure your template image and test form actually match: if no matches are found between the reference image and your test image, the later steps will fail.

  2. Hello, I have some problem:
    “ValueError: tile cannot extend outside image” from
    “ocr_output = pytesseract.image_to_string(img_cropped, config=custom_config, lang='eng')”

    How do I fix it?

    Thank you
