Image to Text using Tesseract OCR with Python

OCR, short for Optical Character Recognition (or optical character reader), is an essential part of data mining that mainly deals with typed, handwritten, or printed documents. Every image in the world contains some information, and some of it is text that humans can easily read; programming a machine to read that text is what we call OCR. In this tutorial, I will walk you through detailed code for the pytesseract (Python wrapper of Tesseract) image-to-string operation.

Table of contents

  1. Applications of OCR
  2. Best OCR libraries in Python
  3. About Tesseract OCR
  4. Install Tesseract OCR in Python
  5. OCR with Pytesseract and OpenCV
  6. Draw bounding boxes around text
  7. Extract specific text inside bounding box using Tesseract and OpenCV
    • Get ROI coordinates by mouse click
    • Extract specific text inside bounding box (ROI)
  8. Limitations of Tesseract OCR

1. Applications of OCR

Before we go deeper into the OCR implementation, we need to know the places where OCR can be applied. Some common examples of such use cases are as follows:

  • Digitizing scanned documents, books, and forms
  • Extracting data from invoices, receipts, and bills
  • Reading passports, ID cards, and number plates
  • Converting printed text into searchable and editable formats

2. Best OCR libraries in Python

There are lots of OCR libraries and tools available. Some popular ones are listed below:

  • Tesseract
  • Keras-OCR
  • Python-docTR
  • Easy OCR
  • Paddle OCR
  • etc.

In this post, I am going to explore the pytesseract OCR library in Python.

Note: If you are new to image processing and computer vision, I suggest you take this Udemy course: Master Computer Vision OpenCV4 in Python with Deep Learning.

3. About Tesseract OCR

[Figure: Architecture of Tesseract OCR]

Tesseract (originally developed by Hewlett-Packard, and sponsored by Google since 2006) is an open-source tool for Optical Character Recognition (OCR). The best part is that it supports a wide range of languages and is accessible from different programming languages through wrappers. In this post, I will use pytesseract, a Python library (wrapper) for Tesseract.

We usually use a convolutional neural network (CNN) to recognize an image containing a single character or object. Text of any length, however, is a sequence of characters, and such sequence problems are solved using RNNs and LSTMs. Tesseract uses an LSTM to detect and predict text in an image. LSTM (Long Short-Term Memory) is a popular architecture in the RNN family; read this post to learn more about RNNs. There are mainly 3 stages in Tesseract (a quick way to check which engine version you have follows this list):

  1. Find lines
  2. Detect words
  3. Classify characters
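
Since the LSTM engine arrived with Tesseract 4.0, it is worth checking which version (and which language packs) you actually have. Here is a minimal sketch using pytesseract helpers available in recent pytesseract versions (this assumes the Tesseract binary is already installed, which is covered in the next section):

# Import tesseract OCR wrapper
import pytesseract

# Print the version of the installed Tesseract engine (4.0+ includes the LSTM engine)
print(pytesseract.get_tesseract_version())
# List the installed language packs, e.g. ['eng', 'osd']
print(pytesseract.get_languages(config=''))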

4. Install Tesseract OCR in Python

To use Tesseract OCR (developed by Google) in Python, we need to install the pytesseract library by running the command below:

> pip install pytesseract
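
Note that pytesseract is only a wrapper: it calls the Tesseract engine under the hood, so the Tesseract binary itself must also be installed (for example, via apt install tesseract-ocr on Ubuntu, brew install tesseract on macOS, or the Windows installer). If the binary is not on your PATH, you can point pytesseract to it explicitly. A minimal sketch, where the path below is just an example and should be replaced with your actual install location:

# Import tesseract OCR wrapper
import pytesseract

# Tell pytesseract where the Tesseract binary lives (example Windows path; adjust for your system)
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'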

5. OCR with Pytesseract and OpenCV

Let’s first import the required packages and read the input image we want to convert into text.

In this tutorial, I am using the following sample invoice image:

[Figure: Input image (sample restaurant invoice)]
# Import OpenCV
import cv2
# Import tesseract OCR
import pytesseract

# Read image to convert image to string
img = cv2.imread('input/restaurant_bill.jpg')

Pre-processing for Tesseract

In every image analysis task, preprocessing is a must. To get better accuracy from Tesseract, you should do at least some basic image pre-processing, like grayscale conversion and binarization. Read this post to learn more about image transformation functions in OpenCV. A couple of optional extra pre-processing steps are sketched after the code below.

# Resize the image if required
height, width, channel = img.shape
images = cv2.resize(img, (width//2, height//2))

# Convert to grayscale image
gray_img = cv2.cvtColor(images, cv2.COLOR_BGR2GRAY)

# Converting grey image to binary image by Thresholding
thresh_img = cv2.threshold(gray_img, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
[Figure: grayscale and thresholded versions of the invoice image]
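
Depending on the image, a couple of extra optional steps sometimes improve Tesseract's accuracy further. Below is a minimal sketch of noise removal and morphological cleanup using standard OpenCV calls (whether these help is image-dependent, so treat them as knobs to experiment with rather than a fixed recipe):

import numpy as np

# Optional: remove salt-and-pepper noise from the binarized image
denoised = cv2.medianBlur(thresh_img, 3)

# Optional: morphological opening to remove small specks left after thresholding
kernel = np.ones((2, 2), np.uint8)
cleaned = cv2.morphologyEx(denoised, cv2.MORPH_OPEN, kernel)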

If you want to learn more about image pre-processing and transformations, the article below may be helpful:

Also Read:  Warp perspective and Transform OpenCV python

Tesseract Configuration

Once you are done with the image transformations (pre-processing), you should configure Tesseract. If you do not specify configuration parameters, Tesseract will use its default configuration.

# Configuring parameters for tesseract
custom_config = r'--oem 3 --psm 6'
  • Engine Mode (--oem): Tesseract uses different algorithms or engine modes in its back end. Below are the different OCR engine modes with their Tesseract ID numbers; you can select the one that works best for your requirement:

OEM ID: Description
  • 0: Legacy engine only
  • 1: Neural nets LSTM engine only
  • 2: Legacy + LSTM engines
  • 3: Default, based on what is currently available
  • Page Segmentation Mode (--psm): By configuring this, you tell Tesseract how to split the input image into blocks of text. Tesseract has 14 modes in total (0 to 13). From the list below, you can choose the one that best suits your requirements (a quick way to compare modes is shown right after the list):

PSM ID: Description
  • 0: Orientation and script detection (OSD) only
  • 1: Automatic page segmentation with OSD
  • 2: Automatic page segmentation, but no OSD or OCR
  • 3: Fully automatic page segmentation, but no OSD (default configuration)
  • 4: Assume a single column of text of variable sizes
  • 5: Assume a single uniform block of vertically aligned text
  • 6: Assume a single uniform block of text
  • 7: Treat the image as a single text line
  • 8: Treat the image as a single word
  • 9: Treat the image as a single word in a circle
  • 10: Treat the image as a single character
  • 11: Sparse text. Find as much text as possible, in no particular order
  • 12: Sparse text with OSD
  • 13: Raw line. Treat the image as a single text line, bypassing Tesseract-specific hacks
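
Since the best PSM depends on your page layout, it is worth comparing a couple of modes on your own image before settling on one. Here is a minimal sketch, reusing the thresh_img from the pre-processing step above (a sparse receipt may respond better to --psm 11, a clean block of text to --psm 6):

# Compare two page segmentation modes on the same pre-processed image
for psm in (6, 11):
    config = f'--oem 3 --psm {psm}'
    print(f'--- psm {psm} ---')
    print(pytesseract.image_to_string(thresh_img, config=config))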

Convert Image to Text

Now we can finally convert our pre-processed image to text. After conversion, we will also write the output text into a text file.

# Converting image to text with pytesseract
ocr_output = pytesseract.image_to_string(thresh_img, config=custom_config)
# Print output text from OCR
print(ocr_output)

# Writing OCR output to a text file
with open('python_ocr_output.txt', 'w') as f:
    f.write(ocr_output)

# Display the resized input image
cv2.imshow('Resized image', images)
cv2.waitKey(0)

Full Code

# Import OpenCV
import cv2
# Import tesseract OCR
import pytesseract

# Read image to convert image to string
img = cv2.imread('input/restaurant_bill.jpg')

# Resize the image if required
height, width, channel = img.shape
images = cv2.resize(img, (width//2, height//2))

# Convert to grayscale image
gray_img = cv2.cvtColor(images, cv2.COLOR_BGR2GRAY)

# Converting grey image to binary image by Thresholding
thresh_img = cv2.threshold(gray_img, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]

# configuring parameters for tesseract
custom_config = r'--oem 3 --psm 6'

# Converting image to text with pytesseract
ocr_output = pytesseract.image_to_string(thresh_img, config=custom_config)
# Print output text from OCR
print(ocr_output)

# Writing OCR output to a text file
with open('python_ocr_output.txt', 'w') as f:
    f.write(ocr_output)

# Display the resized input image
cv2.imshow('Resized image', images)
cv2.waitKey(0)
[Figure: OCR output saved in a txt file]

6. Draw bounding boxes around text

Using pytesseract, you can get bounding box information for OCR results with the following code. The script below gives bounding box information for each word detected by Tesseract during OCR, along with the surrounding block, paragraph, and line structure.

import pytesseract
from pytesseract import Output
# Import OpenCV library
import cv2

import csv

# Read image to extract text from image
img = cv2.imread('input/restaurant_bill.jpg')
# Resize the image if required
height, width, channel = img.shape
img = cv2.resize(img, (width//2, height//2))

# Convert image to grey scale
gray_image = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Converting grey image to binary image by Thresholding
thresh_img = cv2.threshold(gray_image, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]

# configuring parameters for tesseract
custom_config = r'--oem 3 --psm 6'

# Get all OCR output information from pytesseract
ocr_output_details = pytesseract.image_to_data(thresh_img, output_type = Output.DICT, config=custom_config, lang='eng')
# Total bounding boxes
n_boxes = len(ocr_output_details['level'])

# Extract and draw rectangles for all bounding boxes
for i in range(n_boxes):
    (x, y, w, h) = (ocr_output_details['left'][i], ocr_output_details['top'][i], ocr_output_details['width'][i], ocr_output_details['height'][i])
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

# Print OCR output keys
print(ocr_output_details.keys())

# Show output image with bounding boxes
cv2.imshow('img', img)
cv2.waitKey(0)
[Figure: OCR image after drawing bounding boxes]

Here in this code:

  • The custom_config line configures the Tesseract model (engine mode and page segmentation mode)
  • The image_to_data call extracts all OCR outputs from Tesseract in dictionary format, at the word level (a character-level alternative is sketched right after this list)
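
If you need character-level boxes instead of word-level boxes, pytesseract also provides image_to_boxes. A minimal sketch (note that image_to_boxes reports y coordinates measured from the bottom of the image, so they are flipped before drawing with OpenCV):

# Character-level bounding boxes using image_to_boxes
img_height = img.shape[0]
for line in pytesseract.image_to_boxes(thresh_img, config=custom_config).splitlines():
    ch, x1, y1, x2, y2 = line.split()[:5]
    # Flip y values: image_to_boxes uses a bottom-left origin, OpenCV a top-left origin
    cv2.rectangle(img, (int(x1), img_height - int(y1)), (int(x2), img_height - int(y2)), (255, 0, 0), 1)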

Now if you print the OCR output details, you will see all of the available output keys, shown below:

dict_keys(['level', 'page_num', 'block_num', 'par_num', 'line_num', 'word_num', 'left', 'top', 'width', 'height', 'conf', 'text'])

Here in this output dictionary, the key named 'text' contains the text output from the OCR, and the keys 'left', 'top', 'width', and 'height' contain the bounding box coordinates for that particular text. The 'conf' key holds a per-word confidence score, which is used in the sketch below.
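
For example, here is a minimal sketch that keeps only words detected with reasonable confidence (the threshold of 60 is just an example value to tune for your own images):

# Filter OCR words by confidence score
for i in range(n_boxes):
    word = ocr_output_details['text'][i].strip()
    if word and float(ocr_output_details['conf'][i]) > 60:
        print(word, ocr_output_details['conf'][i])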

Text format matching

Now that you have an image with bounding boxes, you can move on to the next part: organizing the captured text into a file whose text layout matches the input image.

Note: I have written this code based on my input image format and the output from Tesseract. If you use an image with a different layout, you must adapt the code accordingly.

The code below formats the resulting text according to the current image and saves the formatted result into a txt file.

# Arrange output text from OCR into the format as per image
output_text = []
word_list = []
last_word = ''

for word in ocr_output_details['text']:
    # If there is no text in an element
    if word != '':
        word_list.append(word)
        last_word = word
    # Append final list of words with valid words
    if (last_word != '' and word == '') or (word == ocr_output_details['text'][-1]):
        output_text.append(word_list)
        word_list = []

# Write the arranged OCR output to a text file
with open('OCR_output.txt', 'w', newline = '') as file:
    csv.writer(file, delimiter = ' ').writerows(output_text)
[Figure: OCR output saved in text format]

Full Code

import pytesseract
from pytesseract import Output
# Import OpenCV library
import cv2

import csv

# Read image to extract text from image
img = cv2.imread('input/restaurant_bill.jpg')
# Resize the image if required
height, width, channel = img.shape
img = cv2.resize(img, (width//2, height//2))

# Convert image to grey scale
gray_image = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Converting grey image to binary image by Thresholding
thresh_img = cv2.threshold(gray_image, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]

# configuring parameters for tesseract
custom_config = r'--oem 3 --psm 6'

# Get all OCR output information from pytesseract
ocr_output_details = pytesseract.image_to_data(thresh_img, output_type = Output.DICT, config=custom_config, lang='eng')
# Total bounding boxes
n_boxes = len(ocr_output_details['level'])

# Extract and draw rectangles for all bounding boxes
for i in range(n_boxes):
    (x, y, w, h) = (ocr_output_details['left'][i], ocr_output_details['top'][i], ocr_output_details['width'][i], ocr_output_details['height'][i])
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

# Print OCR Output
print(ocr_output_details.keys())

# Arrange output text from OCR into the format as per image
output_text = []
word_list = []
last_word = ''

for word in ocr_output_details['text']:
    # If there is no text in an element
    if word != '':
        word_list.append(word)
        last_word = word
    # Append final list of words with valid words
    if (last_word != '' and word == '') or (word == ocr_output_details['text'][-1]):
        output_text.append(word_list)
        word_list = []

# Write the arranged OCR output to a text file
with open('OCR_output.txt', 'w', newline = '') as file:
    csv.writer(file, delimiter = ' ').writerows(output_text)

cv2.imshow('img', img)
cv2.waitKey(0)

7. Extract specific text inside bounding box using Tesseract and OpenCV

So far we know how to extract all of the text available in an image. But what if you want to extract only some words or sentences from a specific region of the image?

For example, I want to extract only the “Amount” field from the invoice below.

[Figure: invoice with the Amount field highlighted]

To do that, we first need to know the coordinate information (top-left X, top-left Y, bottom-right X, bottom-right Y) for that specific portion of your input image.

7.1 Get ROI coordinates by mouse click

To get the coordinate values for our area of interest, you can use the code below. You just need to modify the cv2.imread path (input image directory) and the cv2.resize call (image resizing option).

import cv2

circles = []
counter = 0
counter2 = 0
Clickpoint1 = []
Clickpoint2 = []
myCoordinates = []

# Function to store left-mouse click coordinates
def mouseClickPoints(event, x, y, flags, params):
    global counter, Clickpoint1, Clickpoint2, counter2, circles
    if event == cv2.EVENT_LBUTTONDOWN:
        # Draw circle in red color
        cv2.circle(img, (x, y), 3, (0,0,255), cv2.FILLED)
        if counter == 0:
            Clickpoint1 = int(x), int(y)
            counter += 1
        elif counter == 1:
            Clickpoint2 = int(x), int(y)
            myCoordinates.append([Clickpoint1, Clickpoint2])
            counter = 0
            circles.append([x, y])
            counter2 += 1

# Read image
img = cv2.imread('input/restaurant_bill.jpg')

# Resize image
height, width, channel = img.shape
img = cv2.resize(img, (width//2, height//2))

while True:
    # To Display clicked points
    for x, y in circles:
        cv2.circle(img, (x, y), 3, (0,0,255), cv2.FILLED)
    # Display original image
    cv2.imshow('Original Image', img)
    # Collect coordinates of mouse click points
    cv2.setMouseCallback('Original Image', mouseClickPoints)
    # Press 'x' in keyboard to stop the program and print the coordinate values
    if cv2.waitKey(1) & 0xFF == ord('x'):
        print(myCoordinates)
        break

Note: After running the above code, first click the left mouse button at the top-left corner of your area of interest, then click at the bottom-right corner. Once you are done with the two clicks, press 'x' on your keyboard to stop the program and print the output coordinates of your Region of Interest.


If you are not sure how the above code works, please read this tutorial.

[Figure: clicking to get bounding box coordinates with OpenCV]

Output in the console:

[[(81, 322), (356, 351)]]

7.2. Extract specific text inside bounding box (ROI)

Once we have the ROI information, we can extract anything inside that particular area by following these steps:

  • Crop that specific area from the input image
  • Pass cropped image to OCR
  • Get the output from OCR

The code below simply follows these steps. It will also write the OCR output onto the input image.

import pytesseract
import cv2

# Read image for text extraction
img = cv2.imread('input/restaurant_bill.jpg')
height, width, channel = img.shape
# Resizing image if required
img = cv2.resize(img, (width//2, height//2))

# Convert image to grey scale
gray_image = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Converting grey image to binary image by Thresholding
threshold_img = cv2.threshold(gray_image, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]

# Crop image based on ROI coordinates (for specific bounding box)
# Coordinate information for required portion of image for which need to extract the text
roi_coordinate = [[(81, 322), (356, 351)]]

top_left_x = roi_coordinate[0][0][0]
top_left_y = roi_coordinate[0][0][1]
bottom_right_x = roi_coordinate[0][1][0]
bottom_right_y = roi_coordinate[0][1][1]

# Crop the specific required portion of entire image
img_cropped = threshold_img[top_left_y:bottom_right_y, top_left_x:bottom_right_x]

# Draw rectangle for area of interest (ROI)
cv2.rectangle(img, (top_left_x, top_left_y), (bottom_right_x, bottom_right_y), (0, 255, 0), 3)

# OCR section to extract text using pytesseract in python
# configuring parameters for tesseract
custom_config = r'--oem 3 --psm 6'

# Extract text within specific coordinate using pytesseract OCR Python
# Providing cropped image as input to the OCR
ocr_output = pytesseract.image_to_string(img_cropped, config=custom_config, lang='eng')

# Write OCR output on the original image
# OpenCV font
font = cv2.FONT_HERSHEY_DUPLEX
# Red color code
red_color = (0,0,255)

cv2.putText(img, ocr_output.strip(), (top_left_x - 25, top_left_y - 10), font, 0.5, red_color, 1)

print(ocr_output)

cv2.imshow('img', img)
cv2.waitKey(0)

Before running the above code, you need to modify the cv2.imread path (input image directory) and the cv2.resize call (image resizing option).

Note: The resizing option must be the same in both pieces of code (the extraction code and the get-ROI-coordinate code). Otherwise you may end up extracting some other part of the image.

[Figure: OCR output written on the image]

8. Limitations of Tesseract OCR

Although Tesseract is a good OCR tool, there are some limitations to this library. Here is what I found:

  • It cannot perform well for images with complex backgrounds and distorted perspectives.
  • Image orientation detection sometimes does not work
  • Not able to detect complex handwritten text

Conclusion

Finally, it can be concluded that Tesseract is ideal for scanning clean documents: you can easily convert images to text and then produce any formatted document like PDF, Word, etc. It achieves high accuracy on clean input and copes with a good variety of fonts. This is very useful for institutions where a large amount of documentation is involved, such as government offices, hospitals, and educational institutes. Since version 4.0, Tesseract supports deep-learning-based OCR (the LSTM engine), which is significantly more accurate.
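
As a closing example of that document workflow, pytesseract can ask Tesseract to emit a searchable PDF directly. A minimal sketch using image_to_pdf_or_hocr (the file names are just examples):

# Import tesseract OCR wrapper
import pytesseract

# Generate a searchable PDF from the input image; the call returns raw PDF bytes
pdf_bytes = pytesseract.image_to_pdf_or_hocr('input/restaurant_bill.jpg', extension='pdf')
with open('ocr_output.pdf', 'wb') as f:
    f.write(pdf_bytes)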

In my next post, I will show you how you can extract important information from any document, form, or invoice.
