Generate Synthetic Text Data with Faker in Python

In this tutorial, I will show some best-practice methods for generating synthetic text data in Python. One of them uses a library called Faker.

Use of Synthetic Data

Synthetic data is useful for training machine learning models when real data is scarce or cannot be used. This applies to many types of models: time series, classification, regression, etc.

But I believe computer vision is the field where synthetic data, especially synthetic text data, is used the most. Let’s say you are developing a form parser. To build your form-parsing logic or algorithm, you need some filled-in forms to work with.

You cannot use customer data (filled forms) to train your machine learning model. So there are mainly two options for building your training dataset: either you redact the information in those fields (the way we did in PDF redaction), or you fill in the form manually. Manually filling in forms is time-consuming: preparing just 1000 training samples this way may take several days.

This is the reason we use synthetic training data. So in this tutorial, I thought of sharing some best-practice methods to generate synthetic text data (like name, address, phone number, latitude and longitude, license plate number, etc.) in Python.

How to Create Synthetic Data in Python

There are mainly two methods of creating realistic synthetic data in Python: the first uses a custom database, and the second uses a Python library named Faker. Let’s explore them one by one.

Method 1: Generate Synthetic Data using Custom Database

In this dummy (synthetic) data generator technique, you first prepare a database of real values for the field you want to synthesize. Then, by randomly combining entries from it, you can create realistic output.

Let’s say that for a specific PDF form redaction task, I need to generate unique and realistic names. To do that, I first need to create a database of real names. You can find lots of real names just by searching for baby names on Google. For example: this link.

Once you have that database ready, keep only the unique names and build your name generator the way I did in the code below:

# Python unique name generator
import random

# Database of unique first names and last names
first_names = ('Olivia', 'Andy', 'Ava', 'Emma')
last_names = ('Michel', 'Smith', 'Johnson', 'Williams')

# Print 3 random full names
for i in range(3):
    full_name = random.choice(first_names) + " " + random.choice(last_names)
    print(full_name)

Sample output (varies from run to run):
Andy Michel
Emma Smith
Ava Williams

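In this code, I simply pick a first name and a surname at random and join them into a full name, then print three names from our Python name generator. Note that random.choice can return the same combination more than once, so the output is random but not guaranteed to be unique.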
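If the names really have to be unique, one option is to build every first-name and last-name combination once and sample from that pool without replacement. The sketch below is only an illustration: the helper name unique_full_names and its seed parameter are hypothetical and not part of the original code.

```python
import itertools
import random

# Same small name database as in the example above
first_names = ("Olivia", "Andy", "Ava", "Emma")
last_names = ("Michel", "Smith", "Johnson", "Williams")


def unique_full_names(count, seed=None):
    """Return `count` distinct full names built from the two tuples.

    Hypothetical helper: `seed` is optional and only makes the result repeatable.
    """
    # Every possible (first, last) pair: 4 x 4 = 16 combinations here
    all_pairs = list(itertools.product(first_names, last_names))
    if count > len(all_pairs):
        raise ValueError("Not enough name combinations for the requested count")
    rng = random.Random(seed)
    # random.sample draws without replacement, so no pair appears twice
    return [f"{first} {last}" for first, last in rng.sample(all_pairs, count)]


names = unique_full_names(5, seed=42)
for name in names:
    print(name)
```

With only 4 first names and 4 last names the pool has 16 combinations, so a larger database is needed before this approach can produce thousands of distinct names.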

Method 2: Make Synthetic Data using Faker Python

That was a manual process, and it takes some time to prepare the database. If you want various fields or entities like name, phone number, address, etc., building a database for each one will burn a lot of your development time.

To speed up your development process, you can use a library called Faker to prepare a realistic mock (synthetic) dataset in Python. You can install this library using the command below.

pip install Faker

Once you install this library, it is easy to generate random names or addresses with a few lines of Python code. For example, below is the code to generate names using the Faker library in Python.

# Generate person name using faker python
from faker import Faker
# Faker.seed(0)
fake = Faker()

# Print 10 random names (duplicates are possible)
for _ in range(10):
    print(fake.name())

Sample output (varies from run to run):
Janice Johnston
Collin Lopez
Mary Alvarez
Peter Mcdowell
Sarah Villanueva
Kimberly Myers
Desiree Cain
Stephanie Lawrence
Lauren Hayes
Whitney Stark

Note that I commented out this line of code: Faker.seed(0). You only need this line when you want Faker to generate the same result on every run.

Similarly, you can generate random addresses like below:

# Generate addresses using faker python
from faker import Faker
# Faker.seed(0)
fake = Faker()

# Print 10 random addresses
for _ in range(10):
    print(fake.address())

Sample output (varies from run to run):
PSC 3185, Box 6765
APO AP 01156
08345 Malone Shores Apt. 616
Goodchester, NC 50938
35911 Elizabeth Knoll Suite 492
New James, NJ 33867
693 Levy Via Suite 172
North Ericaborough, NC 00920
8418 James Groves Apt. 872
Melissaside, ID 61049
3119 Morgan Port Suite 687
Gilesville, OR 14614
PSC 8124, Box 5507
APO AP 20129
142 Melissa Route Apt. 391
Franklinshire, NY 72503
9725 Clay Oval Suite 096
North Michaelmouth, CT 99543
02981 Michael Circle
Mcculloughfort, MA 33496

These names and addresses come from Faker’s default locale (en_US). If you want to generate country-specific names or addresses, you can also do that in Faker: you just need to use Faker’s localized providers.

Faker ships data for many countries and languages. Below is a list of Faker locales (the exact set depends on your Faker version).

ar_AA
ar_AE
ar_BH
ar_EG
ar_JO
ar_PS
ar_SA
az_AZ
bg_BG
bn_BD
bs_BA
cs_CZ
da_DK
de
de_AT
de_CH
de_DE
dk_DK
el_CY
el_GR
en
en_AU
en_CA
en_GB
en_IE
en_IN
en_NZ
en_PH
en_TH
en_US
es
es_AR
es_CA
es_CL
es_CO
es_ES
es_MX
et_EE
fa_IR
fi_FI
fil_PH
fr_BE
fr_CA
fr_CH
fr_FR
fr_QC
ga_IE
he_IL
hi_IN
hr_HR
hu_HU
hy_AM
id_ID
it_CH
it_IT
ja_JP
ka_GE
ko_KR
la
lb_LU
lt_LT
lv_LV
mt_MT
ne_NP
nl_BE
nl_NL
no_NO
or_IN
pl_PL
pt_BR
pt_PT
ro_RO
ru_RU
sk_SK
sl_SI
sq_AL
sv_SE
ta_IN
th
th_TH
tl_PH
tr_TR
tw_GH
uk_UA
vi_VN
zh_CN
zh_TW

In most of the locale codes above, the part before the underscore is the language code and the part after it is the country code; a few entries (such as de, en, or la) contain only a language code. For example, if I want to generate Indian names and addresses in English, I will use the code en_IN. If, on the other hand, I want Hindi, I will use the code hi_IN.

Below is example code to generate Indian names and addresses using the Faker library in Python.

from faker import Faker
# Generate indian names and address
fake = Faker('en_IN')

for _ in range(10):
    print('Name:')
    print(fake.name())
    print('Address:')
    print(fake.address())

Sample output (varies from run to run):
Name:
Kanav Sinha
Address:
H.No. 008, Karan Chowk, Morbi 057148
Name:
Jivin Sekhon
Address:
158, Sarkar Road, Solapur 340688
Name:
Charvi Ramesh
Address:
83/33, Mangal Chowk, Mango-137907
Name:
Pari Raval
Address:
82/600
Ramakrishnan Zila
Kumbakonam 270455
Name:
Ivan Dasgupta
Address:
24, Chakrabarti, Shahjahanpur 091944
Name:
Jayan Kothari
Address:
16, Ramesh Zila
Pondicherry-318423
Name:
Zaina Sandhu
Address:
91, Butala Nagar
Gangtok-559627
Name:
Riya Ravi
Address:
H.No. 68
Dass Path, Mira-Bhayandar 944092
Name:
Aniruddh Rajan
Address:
81
Bhattacharyya Ganj, Solapur-311301
Name:
Suhana Dora
Address:
H.No. 827, Chhabra Circle
Karimnagar-498119

List of Entities of Faker

Faker is not limited to names and addresses: you can generate many other kinds of entities with it. Below is a list of the entity groups (Faker calls them providers) supported by the Faker tool.

date_time
internet
job
lorem
person
phone_number
automotive
color
address
bank
company
currency
ssn
geo
barcode
misc
credit_card
emoji
file
isbn
passport
profile
python
sbn
user_agent

Final Thought

In this tutorial, I walked you through two methods that I prefer for generating synthetic data: the first is the traditional custom-database method, and the second uses the Faker tool in Python.

If you need control over the generated names, you can use the first method, but you will need to spend some time preparing your database. If you need a quick and good result, try the Faker tool in Python.

That’s it for this article. If you have any questions or suggestions regarding this Python Faker tutorial, please let me know in the comment section below. If you are new to the computer vision field, I recommend this Udemy course: Python for Computer Vision with OpenCV and Deep Learning.
