Generate Synthetic Tabular Test Data in Python


In my previous article, I showed how to generate dummy text data with the Faker tool in Python. This article extends that one. Here I will show how easily you can generate synthetic test data (structured or tabular data) from a sample input dataset using CTGAN, a GAN-based machine learning model, in Python. The synthetic data is meant to preserve the characteristics of the input data and the relationships between its columns.

Faker produces fake text values such as names and addresses, which is useful for data redaction tasks. But what if we want to generate a whole structured or tabular dataset with many columns and rows in Python? That is what this article covers.

Use of Synthetic Test Data

Synthetic tabular data has several uses. For example, suppose you have trained a machine learning model and want to test it on data it has not seen. You cannot always ask your client for another set of real data: the client may refuse, or there may simply not be enough data to test with.

Or maybe you are a Python backend developer building a web or desktop application that fetches or displays table data. In that case, too, you need some dummy table data.

Methods to Generate Synthetic Data in Python

Before moving to the simple approach used in this article, let me give an overview of how synthetic test data is traditionally generated. The process is the same whether you work in Python or any other language.

To generate synthetic test data you need a sample (input) dataset that you want to imitate. The goal is a similar test dataset that does not copy any rows from the input. Most importantly, the generated dataset should preserve the relationships between columns.
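As a small illustration of what a "relationship between columns" can mean, here is a minimal sketch that checks a row-level rule on a made-up table. The column names are borrowed from the sales dataset used later in this article. The rule itself (turnover equals Quantity Ordered times Price Each) is an assumption made for illustration, and the helper name share_consistent is hypothetical, so verify the rule on your own input data before relying on it.

```python
import numpy as np
import pandas as pd


def share_consistent(df, rel_tol=0.01):
    """Fraction of rows where turnover is close to Quantity Ordered * Price Each.

    The rule is an assumption for illustration, not something the dataset guarantees.
    """
    expected = df["Quantity Ordered"] * df["Price Each"]
    ok = np.isclose(df["turnover"], expected, rtol=rel_tol)
    return float(ok.mean())


# Made-up rows; the third one deliberately breaks the rule.
sample = pd.DataFrame({
    "Quantity Ordered": [1, 2, 1, 3],
    "Price Each": [11.95, 150.0, 2.99, 14.95],
    "turnover": [11.95, 300.0, 5.00, 44.85],
})

print(share_consistent(sample))
```

Running the same check on the input data and on the generated data gives a quick sense of whether a rule that holds in the input survives generation.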

To generate a new dataset that maintains the same relationships between columns as an existing dataset, without copying any data, you need to do the following analysis:

  • Understand the Distribution: For each column in your dataset, understand the distribution of the data. This could be normal, binomial, Poisson, etc. You can use histogram plots to visualize and understand the distribution.
  • Descriptive Statistics: Check descriptive statistics like mean, median, mode, variance, etc. to understand your data.
  • Correlation: Check the correlation between each pair of numeric columns. For categorical columns, you can use the Chi-Squared test of independence.
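The three checks above can be sketched with pandas and NumPy as follows. The small table is made up for illustration and is not the article's dataset. The Chi-Squared statistic is computed by hand, without a continuity correction, to keep the example free of extra dependencies.

```python
import numpy as np
import pandas as pd

# Tiny made-up table standing in for a real input dataset.
df = pd.DataFrame({
    "category": ["A", "A", "B", "B", "A", "B", "A", "B"],
    "channel": ["web", "web", "store", "store", "web", "store", "store", "web"],
    "price": [10.0, 12.0, 30.0, 28.0, 11.0, 31.0, 9.0, 29.0],
    "cost": [5.0, 6.0, 15.0, 14.0, 5.5, 15.5, 4.5, 14.5],
})

# 1. Distribution: bin one numeric column, as a histogram plot would.
counts, bin_edges = np.histogram(df["price"], bins=4)

# 2. Descriptive statistics for every numeric column.
summary = df.describe()

# 3a. Pearson correlation between the numeric columns.
corr = df[["price", "cost"]].corr()

# 3b. Chi-Squared statistic for two categorical columns, computed by hand
#     from the contingency table (no continuity correction applied).
observed = pd.crosstab(df["category"], df["channel"]).to_numpy()
row_totals = observed.sum(axis=1)[:, None]
col_totals = observed.sum(axis=0)[None, :]
expected = row_totals * col_totals / observed.sum()
chi2 = float(((observed - expected) ** 2 / expected).sum())

print(counts)
print(summary)
print(corr)
print(chi2)
```

A large Chi-Squared statistic relative to the table size suggests the two categorical columns are related. To turn it into a p-value you would compare it against the Chi-Squared distribution with the matching degrees of freedom.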

These are the basic analyses you need to perform on your input dataset. You then generate the synthetic test dataset so that its statistics and column relationships match those of the input dataset (you can use Python for all of this analysis).

These methods work, but the analysis takes a lot of time before you can finally generate your dummy dataset. In this article, I replace all of that manual analysis with a machine learning model. The performance of this model surprised me: the generated synthetic data looks very similar to the input data without copying rows from it.
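To make the manual route concrete, here is a minimal sketch of a naive baseline generator. It is not the method used in this article. It assumes the numeric columns are roughly jointly normal, fits their mean and covariance, and samples new rows from that fit. The data and the function name sample_gaussian_baseline are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Made-up numeric input: cost depends on price plus some noise.
real = pd.DataFrame({"price": rng.normal(100.0, 20.0, size=500)})
real["cost"] = real["price"] * 0.6 + rng.normal(0.0, 5.0, size=500)


def sample_gaussian_baseline(df, num_rows, rng):
    """Draw rows from a multivariate normal fitted to df's numeric columns."""
    numeric = df.select_dtypes(include="number")
    mean = numeric.mean().to_numpy()
    cov = numeric.cov().to_numpy()
    samples = rng.multivariate_normal(mean, cov, size=num_rows)
    return pd.DataFrame(samples, columns=numeric.columns)


synthetic = sample_gaussian_baseline(real, num_rows=500, rng=rng)

# The linear relationship between the columns should be roughly preserved.
print(real.corr().loc["price", "cost"])
print(synthetic.corr().loc["price", "cost"])
```

This baseline ignores categorical columns and any non-normal shape in the data, which is exactly the kind of manual modelling work a learned generator is meant to take over.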

Steps to Generate Synthetic Data in Python

In this article, I will show how easily you can generate synthetic tabular test data in Python using a GAN-based model named CTGAN. It takes around 7 to 8 lines of code to generate dummy table data. Let me break the code into steps:

Install Required Libraries

Let’s first install all required Python libraries for this tabular data generation project.

!pip install pandas
!pip install pytorch-lightning==2.2.0
!pip install sdv

Read Input Data

As you already know, to generate synthetic data we need sample input data, and the generated data should keep the characteristics of that input. So we first read the sample data. For this example, I am going to use the sales orders dataset from Kaggle. You can download this data from this link.

import pandas as pd

# Load data from a CSV file
data = pd.read_csv('sales_data.csv')

# Look at some sample data
data.head()

Check Basic Statistics

Before generating the dataset let’s first quickly check some basic statistics about our input dataset. For this example, I am only checking the type and description of each column.

# Check the data types of each column
print(data.dtypes)
Order Date           object
Order ID              int64
Product              object
Product_ean         float64
catégorie            object
Purchase Address     object
Quantity Ordered      int64
Price Each          float64
Cost price          float64
turnover            float64
margin              float64
dtype: object
# Calculate descriptive statistics
print(data.describe())
            Order ID   Product_ean  Quantity Ordered     Price Each  \
count  185950.000000  1.859500e+05     185950.000000  185950.000000   
mean   230417.569379  5.509211e+12          1.124383     184.399735   
std     51512.737110  2.598403e+12          0.442793     332.731330   
min    141234.000000  1.000083e+12          1.000000       2.990000   
25%    185831.250000  3.254280e+12          1.000000      11.950000   
50%    230367.500000  5.511235e+12          1.000000      14.950000   
75%    275035.750000  7.765195e+12          1.000000     150.000000   
max    319670.000000  9.999983e+12          9.000000    1700.000000   

          Cost price       turnover         margin  
count  185950.000000  185950.000000  185950.000000  
mean       69.668583     185.490917     115.289422  
std       109.424191     332.919771     225.227190  
min         1.495000       2.990000       1.495000  
25%         5.975000      11.950000       5.975000  
50%         7.475000      14.950000       7.475000  
75%        97.500000     150.000000      52.500000  
max       561.000000    3400.000000    2278.000000  

Configure and Train CTGANSynthesizer Model

As I said, I will use a GAN-based machine learning model to generate synthetic structured (table) data, so we first need to train that model. For this example, I am going to use the CTGANSynthesizer model from the SDV library.


To train SDV's CTGANSynthesizer, we first need to describe the data with a SingleTableMetadata object. Its detect_from_dataframe method automatically detects the type and structure of each column in the input data frame, so we pass our input data to it.

The next step is to pass this metadata to CTGANSynthesizer and train it. For this example, I used a batch size of 500 and 10 epochs; you can try other configurations. I also measure the training time. For me, training for 10 epochs took around 5 minutes on an i7 laptop; it may differ on your system. Finally, I save the trained model in pickle format (.pkl).

# import warnings
# warnings.filterwarnings('ignore')

from datetime import datetime

from sdv.single_table import CTGANSynthesizer
from sdv.metadata import SingleTableMetadata

# Configure Model Meta Data
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data)

# Start training CTGANSynthesizer data generation model
start_time = datetime.now()

# Train Data synthesizer model
model = CTGANSynthesizer(metadata, batch_size = 500, epochs = 10, verbose=True)
model.fit(data)

# Training ended
end_time = datetime.now()

# Save trained model (the "Output Model" folder must already exist)
model.save("Output Model/sdv_ctgan_sales_data.pkl")

# Show training time
print('Duration: {}'.format(end_time - start_time))

As you can see, the above code shows a warning: We strongly recommend saving the metadata using ‘save_to_json’ for replicability in future SDV versions. To hide this warning, you can uncomment the first two lines of the code (I left them commented out to show you the warning). Note that this only suppresses the message; the warning itself recommends saving the metadata with save_to_json.

Generate Sample Data

Okay, so we have trained a GAN-based synthetic table data generator in Python. Let’s now load this model and generate some data. In the Python code below, I load the saved pickle model and generate test data with 20 rows. You can generate any number of rows by changing the num_rows value.

# load the model from disk
loaded_model = CTGANSynthesizer.load("Output Model/sdv_ctgan_sales_data.pkl")
new_data = loaded_model.sample(num_rows = 20)
new_data.head()

Modify Generated Data

As you can see in the generated output above, the Purchase Address column does not contain actual addresses. Instead, it contains values like sdv-pii-d8n8b, sdv-pii-us9yo, etc.

The address field contains such values because the SDV library automatically detects and anonymizes personally identifiable information (PII) such as addresses, names, phone numbers, etc. This is done to protect the privacy of the real data and prevent any leakage of sensitive information.


To solve this, you need to tell SDV how to generate values for that column. Older SDV releases offered an anonymize_fields parameter on the model for this purpose, but as far as I know the detect_from_dataframe method used in this article does not accept it. Instead, I will replace the transformer that SDV assigned to the column, as shown in the rest of this section.

First, let’s check how SDV converts each column type.

metadata.columns
{'Order Date': {'sdtype': 'datetime', 'datetime_format': '%Y-%m-%d %H:%M:%S'},
 'Order ID': {'sdtype': 'numerical'},
 'Product': {'sdtype': 'categorical'},
 'Product_ean': {'sdtype': 'id'},
 'catégorie': {'sdtype': 'categorical'},
 'Purchase Address': {'sdtype': 'unknown', 'pii': True},
 'Quantity Ordered': {'sdtype': 'categorical'},
 'Price Each': {'sdtype': 'numerical'},
 'Cost price': {'sdtype': 'numerical'},
 'turnover': {'sdtype': 'numerical'},
 'margin': {'sdtype': 'numerical'}}

Note: I am running all of the above code in a Jupyter Notebook, so I am not using the print function.

As you can see in the above output, the other columns have types like numerical, categorical, datetime, or id, but Purchase Address has ‘sdtype’: ‘unknown’ and ‘pii’: True. That means SDV flagged this column as sensitive. Because of the ‘pii’: True flag, SDV anonymizes the column instead of learning its real values.

The above output describes how SDV interprets the input columns. Let’s now check the transformers, which determine how each column is converted for training and how its output values are produced.

model.get_transformers()
{'Order Date': UnixTimestampEncoder(datetime_format='%Y-%m-%d %H:%M:%S', enforce_min_max_values=True),
 'Order ID': FloatFormatter(learn_rounding_scheme=True, enforce_min_max_values=True),
 'Product': None,
 'Product_ean': IDGenerator(),
 'catégorie': None,
 'Purchase Address': AnonymizedFaker(function_name='bothify', function_kwargs={'text': 'sdv-pii-?????', 'letters': '0123456789abcdefghijklmnopqrstuvwxyz'}),
 'Quantity Ordered': None,
 'Price Each': FloatFormatter(learn_rounding_scheme=True, enforce_min_max_values=True),
 'Cost price': FloatFormatter(learn_rounding_scheme=True, enforce_min_max_values=True),
 'turnover': FloatFormatter(learn_rounding_scheme=True, enforce_min_max_values=True),
 'margin': FloatFormatter(learn_rounding_scheme=True, enforce_min_max_values=True)}

In the above output, you can see that Purchase Address is handled by the AnonymizedFaker transformer with the pattern ‘text’: ‘sdv-pii-?????’. This is why the generated data has values like sdv-pii-d8n8b and sdv-pii-us9yo (instead of realistic addresses) in the Purchase Address column.

To generate random but realistic-looking addresses, you need to change the AnonymizedFaker transformer for that specific column (Purchase Address). In the Python code below, you pass provider_name and function_name to AnonymizedFaker.

These are the provider and function names from the Faker tool. Since I am trying to generate an address field, I use provider_name=’address’ and function_name=’address’. You can find the entire list of Faker functions from this link. If you face any difficulty understanding these functions, please read this article: Generate Synthetic Text Data with Faker in Python.

Once you have defined the new AnonymizedFaker transformer, the rest of the code is the same. Load your previously trained CTGANSynthesizer model and replace the column's transformer using update_transformers. Then train the CTGANSynthesizer model again with this update and save it. Note: I updated the transformer for only one column, but you can update several columns at the same time.
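To see how a pattern such as ‘sdv-pii-?????’ turns into values like sdv-pii-d8n8b, here is a standard-library imitation of the idea. It is a sketch, not SDV's or Faker's actual code: each ‘?’ is replaced with a random character drawn from the letters string shown in the transformer output. The function name fill_pattern is hypothetical.

```python
import random

# Same character set as the 'letters' argument in the transformer output above.
LETTERS = "0123456789abcdefghijklmnopqrstuvwxyz"


def fill_pattern(pattern, letters=LETTERS):
    """Replace every '?' in pattern with a random character from letters."""
    return "".join(random.choice(letters) if ch == "?" else ch for ch in pattern)


placeholders = [fill_pattern("sdv-pii-?????") for _ in range(3)]
print(placeholders)
```

The output carries no information about the real addresses, which is the point of the anonymization, but it also explains why the column looks meaningless until a more suitable Faker function is configured.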

from rdt.transformers import AnonymizedFaker

# Configure new transformation
purchase_address_transformer = AnonymizedFaker(
    provider_name='address',
    function_name='address',
    enforce_uniqueness=True
)

# Update and apply new transformation to 'Purchase Address' column
loaded_model.update_transformers({'Purchase Address': purchase_address_transformer})

# Again train updated model
loaded_model.fit(data)

# Save Updated model as pickle file
loaded_model.save("Output Model/sdv_ctgan_sales_data_v1.pkl")

# load the model from disk
loaded_model_v1 = CTGANSynthesizer.load("Output Model/sdv_ctgan_sales_data_v1.pkl")
new_data_v1 = loaded_model_v1.sample(num_rows = 20)
new_data_v1.head()

As you can see in the above output, the Purchase Address column now contains properly formatted addresses. These addresses are not real; they are random fake addresses.

Data Validation

In the data validation part, you can check different statistics to confirm that the generated data has the same characteristics as your input data. To keep this article simple, I am only checking descriptive statistics like mean, standard deviation, etc. Below is the code to compare the descriptive statistics of the two data frames (generated and input).

# Descriptive statistics of input data
input_data_desc = data.describe()

# Descriptive statistics of generated data
output_data_desc = new_data_v1.describe()

# Compare those
df_diff = input_data_desc.compare(output_data_desc)

# Print the result
print(df_diff)

As you can see in the above output, most statistics are close for the input and generated data frames. (The count row differs by design, since I generated only 20 rows.) Here self refers to input_data_desc and other to output_data_desc.
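Side-by-side values can be hard to judge by eye, so a relative difference per statistic is sometimes easier to read. The following is a minimal sketch on made-up data; the helper name describe_gap is hypothetical. With the article's variables you would call it as describe_gap(data, new_data_v1).

```python
import pandas as pd


def describe_gap(real, synthetic):
    """Relative difference of mean/std/min/max for each shared numeric column.

    A statistic that is 0 in the real data yields inf or NaN, so read those
    cells with care.
    """
    stats = ["mean", "std", "min", "max"]
    real_desc = real.describe().loc[stats]
    synth_desc = synthetic.describe().loc[stats]
    shared = real_desc.columns.intersection(synth_desc.columns)
    return (synth_desc[shared] - real_desc[shared]) / real_desc[shared].abs()


# Made-up frames: every price in `fake` is 10% higher than in `real`.
real = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})
fake = pd.DataFrame({"price": [11.0, 22.0, 33.0], "qty": [1, 2, 3]})

gap = describe_gap(real, fake)
print(gap)
```

Values near 0 mean the statistic is well preserved. What counts as "close enough" depends on how the test data will be used.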


But there is one problem: describe() covers only numerical variables by default, so we need to compare categorical variables separately.

Note: There are too many variables to fit in one output screen, so you need to scroll right to see them all (this is how Jupyter Notebook displays wide tables). To fit everything on one screen, you can select a subset of columns from the output pandas dataframe.

# Product Category
print(new_data_v1['catégorie'].unique())
print(data['catégorie'].unique())
['Vêtements' 'Électronique' 'Sports' 'Alimentation']
['Vêtements' 'Alimentation' 'Sports' 'Électronique']

In the above code, I am checking whether the categories are the same for the input and output dataframes. The output shows that both contain the same four categories. Let’s check another categorical variable, Product.

# Product Type
print(new_data_v1['Product'].unique())
print(data['Product'].unique())
['Bose SoundSport Headphones' 'AA Batteries (4-pack)' 'Wired Headphones'
 'Apple Airpods Headphones' 'Vareebadd Phone' 'AAA Batteries (4-pack)'
 'iPhone' 'Lightning Charging Cable' 'USB-C Charging Cable'
 '27in 4K Gaming Monitor']
['iPhone' 'Lightning Charging Cable' 'Wired Headphones' '27in FHD Monitor'
 'AAA Batteries (4-pack)' '27in 4K Gaming Monitor' 'USB-C Charging Cable'
 'Bose SoundSport Headphones' 'Apple Airpods Headphones'
 'Macbook Pro Laptop' 'Flatscreen TV' 'Vareebadd Phone'
 'AA Batteries (4-pack)' 'Google Phone' '20in Monitor'
 '34in Ultrawide Monitor' 'ThinkPad Laptop' 'LG Dryer'
 'LG Washing Machine']

As you can see in the above output, every category in the generated data also appears in the input data, but not every input category appears in the generated data. Why? The simple answer is that we generated only 20 rows of test data, so less frequent categories may not show up. If you generate 100 or 1000 rows, more (possibly all) of the input categories should appear in the generated data.
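Checking unique() only tells you which categories appear, not how often. A minimal sketch of a fuller comparison, on made-up data and with a hypothetical helper name compare_categories, lists the missing categories and puts the relative frequencies side by side. It assumes the columns contain no missing values.

```python
import pandas as pd


def compare_categories(real, synthetic):
    """Return (categories missing from synthetic, side-by-side relative frequencies)."""
    missing = sorted(set(real.unique()) - set(synthetic.unique()))
    freq = pd.concat(
        [
            real.value_counts(normalize=True).rename("real"),
            synthetic.value_counts(normalize=True).rename("synthetic"),
        ],
        axis=1,
    ).fillna(0.0)
    return missing, freq


# Made-up columns standing in for data['Product'] and new_data_v1['Product'].
real = pd.Series(["iPhone", "iPhone", "20in Monitor", "LG Dryer"])
fake = pd.Series(["iPhone", "20in Monitor", "20in Monitor", "iPhone"])

missing, freq = compare_categories(real, fake)
print(missing)
print(freq)
```

With the article's variables, the call would be compare_categories(data['Product'], new_data_v1['Product']). Rare categories with a small share in the real column are the ones most likely to be missing from a 20-row sample.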

Next Steps

In this article, I shared one of the easiest ways to generate synthetic test data (table or structured data) in Python. We generated a synthetic test dataset with 20 rows and modified the Purchase Address column. For all of this I used a simple sales orders dataset from Kaggle. Using this Python code, try to generate synthetic data for another type of business problem, such as time series, and let me know the result in the comment section.

While fitting the input data to the model, I did not perform any data pre-processing steps. That will not be a good idea in every case. If you see that the data generation model is not generating data as expected, try preprocessing your input data before fitting the model.

This preprocessing can be One-Hot encoding (for categorical variables), scaling (like StandardScaler) for numerical variables, missing value imputation (CTGANSynthesizer handles missing values with its own defaults, so check what your SDV version does), etc.
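As a rough sketch of those three steps using only pandas, the example below imputes missing numeric values with the median, standardizes the numeric column, and One-Hot encodes the categorical column. The table and the function name preprocess are made up for illustration. Keep in mind that, as the get_transformers() output earlier shows, SDV already applies its own per-column transformations, so treat this as an optional experiment rather than a required step.

```python
import pandas as pd


def preprocess(df, numeric_cols, categorical_cols):
    """Median-impute and standardize numeric columns, then One-Hot encode categoricals."""
    out = df.copy()
    for col in numeric_cols:
        out[col] = out[col].fillna(out[col].median())
        # Population standard deviation (ddof=0), as StandardScaler uses.
        # A constant column has std 0 and would need separate handling.
        out[col] = (out[col] - out[col].mean()) / out[col].std(ddof=0)
    return pd.get_dummies(out, columns=categorical_cols)


# Made-up input with one missing price.
raw = pd.DataFrame({
    "category": ["Sports", "Food", "Sports", "Food"],
    "price": [10.0, None, 30.0, 20.0],
})

processed = preprocess(raw, numeric_cols=["price"], categorical_cols=["category"])
print(processed)
```

If you train on preprocessed data, the generated rows come out in the same transformed form, so you would need to reverse the scaling and encoding before using them as test data.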

This is it for this article. If you have any questions or suggestions regarding this article, please let me know in the comment section below.
