In this tutorial, I will show some best-practice methods for generating synthetic text data in Python. One of them uses a library called Faker.
Use of Synthetic Data
Synthetic data is useful for training machine learning models when real data is scarce or cannot be used. This applies to many kinds of models, such as time series, classification, and regression models.
But I believe computer vision is the field where synthetic data, especially synthetic text data, is used the most. Let’s say you are developing a form parser. To build the parsing logic or algorithm, you need some sample forms to work with.
You cannot use customer data (filled forms) to train your machine learning model. So there are mainly two options for building your training dataset: either redact the information in those fields (the way we did in PDF redaction), or fill in the form manually. Filling in forms manually is time-consuming; making just 1,000 training samples this way may take several days.
This is why synthetic training data is so useful. In this tutorial, I share some best-practice methods for generating synthetic text data (names, addresses, phone numbers, latitude/longitude, license plate numbers, etc.) in Python.
How to Create Synthetic Data in Python
There are mainly two methods of creating realistic synthetic data in Python: the first uses a custom database, and the second uses a Python library named Faker. Let’s explore them one by one.
Method 1: Generate Synthetic Data using Custom Database
In this technique, you first prepare a database of real values for the field you want to synthesize. Then, by randomly picking and combining those values, you can create realistic output.
Let’s say that for a specific PDF form redaction task, I need to generate unique and realistic names. To do that, I first need to build a database of real names. You can find lots of real names just by searching for baby names on Google.
Once you have that database ready, keep only the unique names and build your name generator as in the code below:
    # Python unique name generator
    import random

    # Unique name database
    first_names = ('Olivia', 'Andy', 'Ava', 'Emma')
    last_names = ('Michel', 'Smith', 'Johnson', 'Williams')

    # Print the final random names
    for i in range(3):
        full_name = random.choice(first_names) + " " + random.choice(last_names)
        print(full_name)

Output:

    Andy Michel
    Emma Smith
    Ava Williams
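In this code, I randomly pick a first name and a last name and join them into a full name. The loop prints 3 names from our Python name generator. Note that random.choice can pick the same combination more than once, so the names are random but not guaranteed to be unique.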
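If you need the generated names to be distinct, one option is to build every first/last combination up front and sample from that pool without replacement. The following is a minimal sketch using the same two tuples; the function name unique_full_names and its seed parameter are my own illustration, not part of the original generator.

```python
import itertools
import random

first_names = ('Olivia', 'Andy', 'Ava', 'Emma')
last_names = ('Michel', 'Smith', 'Johnson', 'Williams')


def unique_full_names(count, seed=None):
    """Return `count` distinct full names built from the two tuples above."""
    # Every possible "First Last" pair (4 x 4 = 16 combinations here).
    combos = [f"{first} {last}"
              for first, last in itertools.product(first_names, last_names)]
    if count > len(combos):
        raise ValueError(f"only {len(combos)} unique names are available")
    # A local Random instance keeps the optional seed from affecting other code.
    rng = random.Random(seed)
    # sample() draws without replacement, so no name can repeat.
    return rng.sample(combos, count)


names = unique_full_names(5, seed=42)
for name in names:
    print(name)
```

The trade-off is that the pool is limited to the product of the two lists, so a larger name database is needed for larger datasets.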
Method 2: Make Synthetic Data using Faker Python
That was a manual process, and it takes some time to prepare the database. If you need several fields or entities (name, phone number, address, etc.), preparing a database for each one will eat into your development time.
To speed up development, you can use a library called Faker to prepare realistic mock (synthetic) datasets in Python. You can install it with the command below.
pip install Faker
Once the library is installed, it is easy to generate random names or addresses with a few lines of Python code. For example, below is the code to generate names using the Faker library.
    # Generate person names using Faker
    from faker import Faker

    # Faker.seed(0)
    fake = Faker()

    # Print 10 random names
    for _ in range(10):
        print(fake.name())

Output:

    Janice Johnston
    Collin Lopez
    Mary Alvarez
    Peter Mcdowell
    Sarah Villanueva
    Kimberly Myers
    Desiree Cain
    Stephanie Lawrence
    Lauren Hayes
    Whitney Stark
Note that I commented out the line Faker.seed(0). You only need this line when you want Faker to generate the same results on every run.
Similarly, you can generate random addresses like below:
    # Generate addresses using Faker
    from faker import Faker

    # Faker.seed(0)
    fake = Faker()

    # Print 10 random addresses
    for _ in range(10):
        print(fake.address())

Output (one address per line):

    PSC 3185, Box 6765 APO AP 01156
    08345 Malone Shores Apt. 616 Goodchester, NC 50938
    35911 Elizabeth Knoll Suite 492 New James, NJ 33867
    693 Levy Via Suite 172 North Ericaborough, NC 00920
    8418 James Groves Apt. 872 Melissaside, ID 61049
    3119 Morgan Port Suite 687 Gilesville, OR 14614
    PSC 8124, Box 5507 APO AP 20129
    142 Melissa Route Apt. 391 Franklinshire, NY 72503
    9725 Clay Oval Suite 096 North Michaelmouth, CT 99543
    02981 Michael Circle Mcculloughfort, MA 33496
These names and addresses come from Faker’s default locale, en_US. If you want to generate country-specific names or addresses, you can do that with Faker’s localized providers.
Faker ships data for many countries and languages. Below is the list of Faker locales.
ar_AA ar_AE ar_BH ar_EG ar_JO ar_PS ar_SA az_AZ bg_BG bn_BD bs_BA cs_CZ da_DK de de_AT de_CH de_DE dk_DK el_CY el_GR en en_AU en_CA en_GB en_IE en_IN en_NZ en_PH en_TH en_US es es_AR es_CA es_CL es_CO es_ES es_MX et_EE fa_IR fi_FI fil_PH fr_BE fr_CA fr_CH fr_FR fr_QC ga_IE he_IL hi_IN hr_HR hu_HU hy_AM id_ID it_CH it_IT ja_JP ka_GE ko_KR la lb_LU lt_LT lv_LV mt_MT ne_NP nl_BE nl_NL no_NO or_IN pl_PL pt_BR pt_PT ro_RO ru_RU sk_SK sl_SI sq_AL sv_SE ta_IN th th_TH tl_PH tr_TR tw_GH uk_UA vi_VN zh_CN zh_TW
In most of the locale codes above, the part before the underscore is the language code and the part after it is the country code. A few entries, such as de, en, and es, are language-only codes. For example, if I want to generate Indian names and addresses in English, I will use the code en_IN; if I want them in Hindi, I will use hi_IN.
Below is example code to generate Indian names and addresses using the Faker library.
    from faker import Faker

    # Generate Indian names and addresses
    fake = Faker('en_IN')
    for _ in range(10):
        print('Name:')
        print(fake.name())
        print('Address:')
        print(fake.address())

Output:

    Name:
    Kanav Sinha
    Address:
    H.No. 008, Karan Chowk, Morbi 057148
    Name:
    Jivin Sekhon
    Address:
    158, Sarkar Road, Solapur 340688
    Name:
    Charvi Ramesh
    Address:
    83/33, Mangal Chowk, Mango-137907
    Name:
    Pari Raval
    Address:
    82/600 Ramakrishnan Zila Kumbakonam 270455
    Name:
    Ivan Dasgupta
    Address:
    24, Chakrabarti, Shahjahanpur 091944
    Name:
    Jayan Kothari
    Address:
    16, Ramesh Zila Pondicherry-318423
    Name:
    Zaina Sandhu
    Address:
    91, Butala Nagar Gangtok-559627
    Name:
    Riya Ravi
    Address:
    H.No. 68 Dass Path, Mira-Bhayandar 944092
    Name:
    Aniruddh Rajan
    Address:
    81 Bhattacharyya Ganj, Solapur-311301
    Name:
    Suhana Dora
    Address:
    H.No. 827, Chhabra Circle Karimnagar-498119
List of Entities of Faker
Faker can generate many entities besides names and addresses. Below is the list of standard providers (groups of related fields) supported by the Faker tool.
date_time, internet, job, lorem, person, phone_number, automotive, color, address, bank, company, currency, ssn, geo, barcode, misc, credit_card, emoji, file, isbn, passport, profile, python, sbn, user_agent
In this tutorial, I walked you through the two methods I prefer for generating synthetic data: the traditional method using a custom database, and the Faker tool in Python.
If you need control over the generated names, use the first method, but expect to spend some time preparing your database. If you need quick and good results, try the Faker tool in Python.
Hi there, I’m Anindya Naskar, Data Science Engineer. I created this website to show you what I believe is the best possible way to get your start in the field of Data Science.