Replace Text in a PDF File with Python

Need to replace text in a PDF file without manually editing it? In this post, we’ll focus on replacing text in a PDF file using Python, which can also be used for redacting specific portions of a PDF file.

Whether you want to replace text in a PDF, remove text from it, or redact a specific portion of the file, this post shows how to do it in Python. Along the way, we will also create a PDF from text and from an image.

Application of PDF Document Redaction

PDF document redaction is the process of obscuring or removing sensitive or confidential information from a PDF document. This is often necessary to protect privacy, security, or legal requirements.

The application of PDF document redaction can be seen in various industries, such as government, legal, financial, medical, and more. In these industries, it is often necessary to remove sensitive information from documents before they are shared with others.

For example, suppose you are working on an NLP project and need to develop an algorithm that extracts custom entities from documents. You need a large amount of training data to develop the model, and once your machine learning or deep learning model is ready, you also need data to test it.

Both steps require a large number of real documents. Real customer documents, however, contain sensitive or confidential information that should not be exposed to others. This is why you should always replace or redact sensitive text in the original PDFs filled in by customers before using them.

It is also worth noting that two separate teams should handle this work, ideally from different organizations: one team edits the PDFs, and the other works on the machine learning part, using only the redacted documents to develop its models.

Approach to PDF file manipulation in Python

There are two main approaches to replacing text in a PDF file with Python:

  1. Replace by text
  2. Replace by position

Approach 1: Replace by text in PDF

This technique works when you know the exact text or string that you want to remove or replace with another string. I will break the process into a few steps.

1.1 Find PDF file

To edit a PDF in Python, we first need a filled-in PDF file. For this tutorial, I am going to use a sample resume. You can download this sample CV from this link.

[Figure: Sample PDF file to read in Python]

1.2 Install Library

There are various libraries available for working with PDF files in Python, such as PyPDF2 and ReportLab. In this post, we will use the PyPDF2 and fpdf libraries to edit PDF files in Python. With PyPDF2, you can read the content of a PDF file and even edit it.

First, you need to install both the PyPDF2 and fpdf libraries. You can do this by running the following commands in your terminal or command prompt:

pip install PyPDF2
pip install fpdf

1.3 Read PDF file

Now that we have downloaded the sample PDF file and installed the required libraries, we can read the PDF file in Python.

import PyPDF2

# Open the sample PDF; passing the path lets PyPDF2 open and close the file itself
pdf_file = PyPDF2.PdfReader("sample_resume.pdf")

1.4 Explore PDF file in Python

The PyPDF2 library lets you explore and analyze the input PDF document. For example, you can count the number of pages in the PDF or print its entire text to the console. The code below does both.

Show PDF information:

The code below shows the metadata of your input PDF file.

pdf_file.metadata
{'/Author': 'Network and Computing Support',
 '/CreationDate': "D:20120105093837-06'00'",
 '/Creator': 'Microsoft® Word 2010',
 '/ModDate': "D:20140815141332-04'00'",
 '/Producer': 'Microsoft® Word 2010'}
Count Number of pages in the PDF:

The code below counts the number of pages and prints the result to the Python console. The sample PDF I am using has only one page.

# Count number of pages in our pdf file
number_of_pages = len(pdf_file.pages)
# print number of pages in the pdf file
print("Number of pages in this pdf: " + str(number_of_pages))
Number of pages in this pdf: 1
Extract Text from PDF:

In the Python code below, we extract the text from the PDF file and print it.

# Read first page
page = pdf_file.pages[0]

# print entire text of first page of the pdf
text = page.extract_text()
print(text)
Gregory T. Jones  
1234 Oak Avenue  
Bowling Green, Kentucky 42101  
(270) 555 -1234  
gregory.jones154@topper.wku.edu  
 
OBJECTIVE:  To obtain a n entry -level  position as a Mechanical  Engineer  with ABC Technologies , allowing 
me to utilize my education and internship experienc e while gaining valuable  work 
experience in a team oriented environment.    
 
EDUCATION:  Western Kentucky University – Bowling Green, Kentucky       Antic ipated May 2012  
 Bachelor of Science in Mechanical  Engineering , Minor: Mathematics     
 GPA: 3.2  
 
SKILLS &  QUALIFICATIONS : 
 Skilled in Solid Works , Math CAD, Matlab, MS Office, PLC programming and machining  
 Knowledgeable in Mechanical Engineering Sciences: Fluid Mechanics, Strength of Materials, 
Dynamic Systems Analysis, Vibratory Motion, Thermodynamics and Hea t Transfer

Print PDF dimension:

You can also check the dimensions of your input PDF using PyPDF2. The code below prints the width and height of the page.

print(page.mediabox.width, page.mediabox.height)
792 612

1.5 Replace text

In the document above, you can see some sensitive information such as the name, email address, and phone number. We need to replace this confidential information with something else.

For this example, I am going to replace the email address with xxx@gmail.com. Let’s do that:

old_text = "gregory.jones154@topper.wku.edu"
new_text = "xxx@gmail.com"

# Read first page
page = pdf_file.pages[0]
text = page.extract_text()

# replace text in pdf
text = text.replace(old_text, new_text)
print(text)
Gregory T. Jones  
1234 Oak Avenue  
Bowling Green, Kentucky 42101  
(270) 555 -1234  
xxx@gmail.com  
 
OBJECTIVE:  To obtain a n entry -level  position as a Mechanical  Engineer  with ABC Technologies , allowing 
me to utilize my education and internship experienc e while gaining valuable  work 
experience in a team oriented environment.    
 
EDUCATION:  Western Kentucky University – Bowling Green, Kentucky       Antic ipated May 2012  
 Bachelor of Science in Mechanical  Engineering , Minor: Mathematics     
 GPA: 3.2  

In this code, we search for a specific string in the text extracted from the PDF and replace it with another string. You can also remove that text entirely using the same Python code with only one change: set new_text = "" (an empty string).
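
The same idea extends to several strings at once. The sketch below is one possible way to do it and is not part of PyPDF2: redact_text and EMAIL_PATTERN are hypothetical names, and the email regular expression is a deliberately simple assumption that will not match every valid address.

```python
import re

# Hypothetical, simplified email pattern (an assumption, not a full validator)
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def redact_text(text, replacements, email_mask="xxx@gmail.com"):
    """Apply literal replacements, then mask anything that looks like an email."""
    for old, new in replacements.items():
        text = text.replace(old, new)
    return EMAIL_PATTERN.sub(email_mask, text)

# Stand-in for the text returned by page.extract_text()
sample = "Gregory T. Jones\n(270) 555 -1234\ngregory.jones154@topper.wku.edu"
redacted = redact_text(
    sample,
    {"Gregory T. Jones": "John Doe", "(270) 555 -1234": ""},
)
print(redacted)
```

Literal replacements run first so that known strings get their exact substitutes. The pattern then catches any email addresses that were not listed explicitly.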

1.6 Save edited PDF

Finally, you can save the edited PDF file. To write pdf files using Python, I am going to use fpdf library. Let’s do that in the below code.

from fpdf import FPDF

# Create PDF object
pdf = FPDF()

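# Re-encode so fpdf's Latin-1 core fonts accept the text
# (characters outside Latin-1 will not render correctly)
text2 = text.encode('utf-8').decode('latin-1')

# Split text into page-sized chunks (the limit counts characters, not lines)
chars_per_page = 4000
text_pages = [text2[i:i+chars_per_page] for i in range(0, len(text2), chars_per_page)]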

# Write text pages to PDF
pdf.set_font("Arial", size=12)
for text_page in text_pages:
    pdf.add_page()
    pdf.multi_cell(200, 10, txt=text_page, align="L")

# Save PDF
pdf.output("output_file.pdf", "F")
[Figure: Output PDF file after replacing text in Python]

The above picture is a screenshot of the output PDF generated by the fpdf package. As you can see, the information is preserved but the original formatting is lost. The next section addresses that.

Approach 2: Replace by position

To avoid these formatting issues, you can work with images when automating PDF redaction in Python. If you want to redact specific portions of a PDF file based on coordinates, follow these two steps:

  1. Convert the PDF file to an image
  2. Redact the desired portions of the image using coordinates

2.1 Convert pdf to image

To convert a PDF to an image in Python, we will use the pdf2image library, which is a wrapper around Poppler.

Poppler is an open-source PDF rendering library, used for rendering and manipulating PDF documents. It is written in C++ and has bindings or wrappers for several programming languages like Python (with pdf2image), making it easy to use in a wide range of projects.

To use pdf2image on Windows, follow these steps:

  1. Install the pdf2image library with the following command:
    • pip install pdf2image
  2. Download the latest Poppler build for Windows from this link. Unzip it and save the Poppler folder on the C:\ drive.
  3. Pass the Poppler path in your code

The code below converts a PDF to images using the pdf2image library.

from pdf2image import convert_from_path

def pdf_to_image(pdf_path, image_path):
    # Convert the PDF to a list of PIL images
    images = convert_from_path(pdf_path, poppler_path=r'C:\poppler-0.67.0\bin')

    # Loop through each image
    for i, image in enumerate(images):
        # Save the image
        image.save(image_path + str(i) + '.png', "PNG")

# Example usage
pdf_to_image('sample_resume.pdf', 'page')

The above code saves each page as an image named after its page index. Since my input PDF has only one page, it saves a single PNG file named "page0.png" in my working directory.

Note: If you do not pass poppler_path to convert_from_path and Poppler is not on your PATH, you will get the error below:

PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?
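
Before converting, you can check whether Poppler is reachable. The sketch below relies only on the standard library and assumes that the executable pdf2image needs for the page count is named pdfinfo, as the error message suggests. poppler_available is a hypothetical helper name, not a pdf2image function.

```python
import shutil

def poppler_available(poppler_path=None):
    """Return True if Poppler's `pdfinfo` executable can be located.

    poppler_path: optional folder to search (for example the Poppler `bin`
    folder); when None, the directories on the system PATH are searched.
    """
    return shutil.which("pdfinfo", path=poppler_path) is not None

if poppler_available():
    print("Poppler found on PATH")
else:
    print("Poppler not found on PATH; pass poppler_path to convert_from_path")
```

If the check fails on Windows, call the helper with the folder where you unzipped Poppler to confirm that the path you plan to pass as poppler_path is correct.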

2.2 Replace text in an image with Python

To replace text in images using Python, we can use the OpenCV library. There are three main steps.

In the first step, you need to find the (x, y) values (pixel coordinates) of the portion you want to replace. You can find them by opening the image in the Paint application on Windows. Hover over the desired point, and its pixel coordinates appear at the bottom left of the Paint window. Below are example images for the converted document image I am working on.

[Figure: Top-left point coordinate of the text to replace in the image]
[Figure: Bottom-right point coordinate of the text to replace in the image]

We need two points: the top left (red point) and the bottom right (green point). In my example, the top-left value is (531, 297) and the bottom-right value is (972, 325). Each value is the (x, y) coordinate of that point, where the first number is x and the second is y.

Once we have the two points, we need to calculate the width and height of the bounding box. The width is the difference between the bottom-right x value and the top-left x value. The height is the difference between the bottom-right y value and the top-left y value.
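
As a quick check of the arithmetic, here is a minimal sketch that computes the box from the two points used in this example. bounding_box is a hypothetical helper name introduced only for illustration. Width is measured along x and height along y.

```python
def bounding_box(top_left, bottom_right):
    """Return (x, y, width, height) for a box given its two corner points."""
    x1, y1 = top_left
    x2, y2 = bottom_right
    if x2 <= x1 or y2 <= y1:
        raise ValueError("bottom_right must be below and to the right of top_left")
    return x1, y1, x2 - x1, y2 - y1

x, y, width, height = bounding_box((531, 297), (972, 325))
print(x, y, width, height)  # 531 297 441 28
```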

The second step is to draw a rectangle based on the above coordinates. This rectangle will cover the portion of text you want to replace. In this example, I am drawing a red rectangle to make the explanation clearer. In practice, the rectangle color should match the image background color, which in my case is white.

In the third step, we need to write some replacement text inside the rectangle or inside the coordinate. So now let’s write our code.

import cv2

# Load the image
img = cv2.imread('page0.png')

# Specify the coordinates for the redaction
top_left_x = 531
top_left_y = 297
bottom_right_x = 972
bottom_right_y = 325
x, y = top_left_x, top_left_y
width, height = bottom_right_x - top_left_x, bottom_right_y - top_left_y

# Create a red rectangle to cover the desired portion of the image
red = (0, 0, 255)
img[y:y + height, x:x + width] = red

# Write text on the red rectangle using a white color
font = cv2.FONT_HERSHEY_SIMPLEX
org = (x + int(width / 4), y + int(height / 2))
fontScale = 1
color = (255, 255, 255)
thickness = 2
text = "xxx@gmail.com"
img = cv2.putText(img, text, org, font, fontScale, color, thickness, cv2.LINE_AA)

# Save the resulting image
cv2.imwrite('redacted_image_with_text.png', img)
[Figure: Output redacted image]

2.3 Convert back to PDF

At this point, we have successfully replaced text in the image using Python. Now we just need to convert it back to a PDF file to complete the project. The code below does that.

from PIL import Image

# Open the image file
image = Image.open("redacted_image_with_text.png")

# Save the image as a PDF file
image.save("redacted_pdf_file.pdf", "PDF")

In the above code, we create a PDF file from our image using the Pillow library. The page layout will look the same as in the input PDF. Note that the page is now an image, so its text can no longer be selected or searched in a PDF viewer.

[Figure: Output PDF with the same format]
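
The example above handles a single page. For a PDF with several pages, one possible approach is to pass the remaining pages through Pillow's save_all and append_images options, as sketched below. save_images_as_pdf is a hypothetical helper name, and the blank demo pages are stand-ins for your redacted page images.

```python
import os
import tempfile

from PIL import Image

def save_images_as_pdf(images, pdf_path):
    """Write a list of PIL images to one PDF, one image per page."""
    if not images:
        raise ValueError("at least one image is required")
    # Convert to RGB so that images with an alpha channel can be written
    pages = [image.convert("RGB") for image in images]
    first, rest = pages[0], pages[1:]
    first.save(pdf_path, "PDF", save_all=True, append_images=rest)

# Demo with two blank stand-in pages; in practice these would be the
# redacted page images, e.g. Image.open("redacted_image_with_text.png")
demo_pages = [Image.new("RGB", (200, 100), "white") for _ in range(2)]
demo_pdf_path = os.path.join(tempfile.gettempdir(), "redacted_pages_demo.pdf")
save_images_as_pdf(demo_pages, demo_pdf_path)
print("Wrote", demo_pdf_path)
```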

End Note

In this post, we learned how to do pdf redaction and manipulation using Python.

In the first approach, we learned how to replace text in PDF files using the PyPDF2 and fpdf libraries. The problem with this approach is that the output PDF file loses the original formatting. Another issue with this technique is that you must know the text or string you want to replace.

To resolve those issues, we learned the second approach. In this approach, we first converted the PDF into an image using the pdf2image library and then replaced text in that image using OpenCV. After that, we converted the redacted image back to a PDF file with Pillow. The advantage of this approach is that we do not lose the formatting of the input PDF. Also, we do not need to know the text we want to remove. We just need to know its position. This technique will help you automate PDF redaction when you are working with documents that have a fixed layout.

This is it for this tutorial. If you have any questions or suggestions regarding this post, please let me know in the comment section below.
