Synthetic Data 02 - Generating Synthetic Tabular Data

Prerequisite Knowledge

  • Intermediate Machine Learning
  • Basic Programming

Introduction

In our previous article, we introduced the concept of synthetic data and its applications in data privacy and machine learning. In this article, we will show you how to generate synthetic tabular data using a generative adversarial network (GAN).

Tabular data is one of the most common and important data modalities. Enormous amounts of data, such as clinical trial records, financial data, and census results, are represented in tabular format. The ability to use synthetic datasets in which sensitive attributes and Personally Identifiable Information (PII) are not disclosed is crucial for staying compliant with privacy regulations, and convenient for data analysis, sharing, and experimenting.

Wondering why generative models could be an ideal method for creating synthetic data? Well, in a generative model, a neural network (NN) is used to approximate the underlying high-dimensional probability distribution of the input data. After the probability distribution has been learned, the model can generate synthetic records by randomly sampling from it [1]. As a result, the generated records are newly sampled rather than copied from the original data, but they retain the real dataset’s underlying probability distribution.
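To make the "learn a distribution, then sample from it" idea concrete, here is a deliberately tiny sketch. It is not a GAN: the "model" is just a normal distribution whose mean and standard deviation are estimated from a hypothetical column of ages (the `real_ages` values are made up for illustration). A GAN follows the same two steps, but learns a far richer distribution with a neural network.

```python
import random
import statistics

# Hypothetical "real" column: ages of ten people (illustrative values only).
real_ages = [39, 50, 38, 53, 28, 37, 49, 52, 31, 42]

# Step 1: "learn" the distribution. Here that only means estimating the
# mean and standard deviation of a normal distribution.
mu = statistics.mean(real_ages)
sigma = statistics.stdev(real_ages)

# Step 2: sample brand-new records from the learned distribution.
rng = random.Random(0)  # fixed seed so the sketch is reproducible
synthetic_ages = [rng.gauss(mu, sigma) for _ in range(1000)]

print(f"real mean:      {mu:.1f}")
print(f"synthetic mean: {statistics.mean(synthetic_ages):.1f}")
print(f"real std:       {sigma:.1f}")
print(f"synthetic std:  {statistics.stdev(synthetic_ages):.1f}")
```

The synthetic values are drawn from the fitted distribution rather than looked up in `real_ages`, which is the property that makes this family of methods attractive for privacy.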

What is a GAN?

Generative Adversarial Network [2]

A GAN consists of two models:

  • A generator that learns to produce fake data.

  • A discriminator that learns to distinguish the generator’s fake data from the real data.

The two models compete against each other in a zero-sum game, and this competition drives both of them to improve. At the start of training, the generator is not very good at generating fake data, and the discriminator catches the fakes easily. As training progresses, the generator gets progressively better at fooling the discriminator, ideally until the discriminator can no longer tell whether an input is real or fake. See Goodfellow et al. [3] for the mathematics behind GANs.
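For reference, the zero-sum game described above is usually written as the minimax objective from Goodfellow et al. Here $D(x)$ is the discriminator's estimated probability that $x$ is real, $G(z)$ is the generator's output for a noise vector $z$, $p_{\text{data}}$ is the real data distribution and $p_z$ is the noise distribution:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator tries to make both terms large (confident on real data, confident on fakes), while the generator tries to make the second term small. At the theoretical optimum the generator's distribution matches the data distribution and the discriminator outputs $1/2$ everywhere, which corresponds to "unable to tell if the input is real or not".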

Tabular GAN:

Developing a general-purpose GAN that would reliably work for a tabular dataset is not a straightforward task.

Challenges include:

  • Mixed data types: numerical, categorical, time, text
  • Different distributions: multimodal, long tail, non-gaussian
  • Learning from sparse one-hot-encoded vectors
  • Highly imbalanced categorical columns

To produce highly realistic tabular data, we will use CTGAN [4], a conditional generative adversarial network designed for tabular data. The model was developed by Xu et al. of MIT and is available as an open source project [5]. CTGAN uses GAN-based methods to model the distribution of tabular data and to sample rows from that distribution. It uses mode-specific normalization to deal with columns that have non-Gaussian and multimodal distributions, and a conditional generator with training-by-sampling to combat class imbalance. [6]
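Mode-specific normalization can be sketched in a few lines. In the CTGAN paper, each continuous column is modelled as a mixture of Gaussians; a value is then represented by a one-hot vector saying which mode it was assigned to, plus a scalar giving its position inside that mode. The sketch below assumes the mixture has already been fitted: the `MODES` parameters, the function names and the fallback for far-away values are all hypothetical and chosen for illustration, and this is not the library's implementation.

```python
import math
import random

# Hypothetical mixture for one continuous column. Each mode has a weight,
# a mean (eta) and a standard deviation (phi). CTGAN estimates these with a
# variational Gaussian mixture; here they are simply assumed.
MODES = [
    {"weight": 0.7, "eta": 0.0, "phi": 50.0},      # bulk of values near zero
    {"weight": 0.3, "eta": 5000.0, "phi": 800.0},  # smaller mode of large values
]


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def mode_specific_normalize(value, modes, rng):
    """Return (alpha, beta): position within the chosen mode, one-hot mode indicator."""
    densities = [m["weight"] * gaussian_pdf(value, m["eta"], m["phi"]) for m in modes]
    total = sum(densities)
    if total > 0:
        # Sample a mode in proportion to how likely it is to have produced the value.
        probs = [d / total for d in densities]
        k = rng.choices(range(len(modes)), weights=probs)[0]
    else:
        # Assumption for this sketch: if every density underflows to zero,
        # fall back to the mode with the smallest standardized distance.
        k = min(range(len(modes)),
                key=lambda i: abs(value - modes[i]["eta"]) / modes[i]["phi"])
    alpha = (value - modes[k]["eta"]) / (4 * modes[k]["phi"])
    beta = [1 if i == k else 0 for i in range(len(modes))]
    return alpha, beta


rng = random.Random(0)
for value in (10.0, 2174.0):
    alpha, beta = mode_specific_normalize(value, MODES, rng)
    print(value, "->", "alpha =", round(alpha, 3), "beta =", beta)
```

Normalizing within a mode keeps values from a long tail or a second peak in a sensible numeric range, instead of squashing them with a single global min-max or z-score scaling.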

CTGAN [7]

The conditional generator generates synthetic rows conditioned on a value of one of the discrete columns. With training-by-sampling, the condition vector (cond) and the training data are sampled according to the log-frequency of each category, so CTGAN can explore all possible discrete values evenly, including the rare ones. [8]
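The effect of sampling by log-frequency is easiest to see on a small imbalanced column. The sketch below uses a made-up `column` of country values and compares raw-frequency sampling with log-frequency sampling. Using `log(count + 1)` is an assumption of this sketch (it keeps a category seen only once from getting zero mass); the variable names are hypothetical and this is not CTGAN's own code.

```python
import math
import random
from collections import Counter

# Hypothetical, highly imbalanced discrete column.
column = ["United-States"] * 900 + ["Mexico"] * 90 + ["Cuba"] * 10

counts = Counter(column)
categories = list(counts)

# Raw frequencies: rare categories would almost never be picked.
raw_probs = [counts[c] / len(column) for c in categories]

# Log-frequency: probability mass proportional to log(count + 1).
log_mass = [math.log(counts[c] + 1) for c in categories]
log_probs = [m / sum(log_mass) for m in log_mass]

for c, raw, logp in zip(categories, raw_probs, log_probs):
    print(f"{c:15s} raw={raw:.3f}  log-frequency={logp:.3f}")

rng = random.Random(0)

# Pick the category used as the condition for one training step...
condition = rng.choices(categories, weights=log_probs)[0]

# ...then pick a real row that matches it, so the discriminator sees a
# real example satisfying the same condition.
matching_rows = [i for i, v in enumerate(column) if v == condition]
row_index = rng.choice(matching_rows)
print("condition:", condition, "| sampled real row:", row_index)
```

The ordering of the categories is preserved, but the rare ones are sampled far more often than their raw frequency would allow, which gives the generator a chance to learn them.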

Now, let’s see how to use CTGAN to generate a synthetic dataset from a real one! As an example, we use the Census Income [9] dataset, which is built into the package. (Remember to pip install ctgan first.)

from ctgan import load_demo

# Load the built-in Census Income demo table as a pandas DataFrame
real_data = load_demo()
print(real_data.head(5))

This loads the dataset:

age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country income
39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K

As we can see, the table contains information about working adults, including their age, gender, education, working hours per week, and income. It’s a multivariate dataset containing a mix of categorical, continuous and discrete variables. Now let us use CTGANSynthesizer to create a synthetic copy of this tabular data.

from ctgan import CTGANSynthesizer  # this class is named CTGAN in newer ctgan releases

# Names of the discrete (categorical) columns in the table
discrete_columns = [
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country',
    'income'
]

# Create the synthesizer and fit it to the real table
ctgan = CTGANSynthesizer(epochs=10)
ctgan.fit(real_data, discrete_columns)

# Generate 1000 rows of synthetic data
synthetic_data = ctgan.sample(1000)
print(synthetic_data.head(5))

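This returns a table of synthetic data with the same columns as the real data. Note that some generated values, such as the negative capital-gain and capital-loss entries, would not occur in the real data.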

age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country income
17 Self-emp-not-inc 158273 Bachelors 16 Never-married Machine-op-inspct Wife White Male -32 -8 39 United-States <=50K
24 Private 397997 HS-grad 9 Married-civ-spouse ? Unmarried White Female -31 -5 39 United-States >50K
53 Private 271068 Some-college 3 Married-civ-spouse Exec-managerial Not-in-family White Female 75 -5 40 United-States >50K
23 Private 276827 HS-grad 2 Married-civ-spouse Prof-specialty Unmarried Asian-Pac-Islander Male -48 -5 40 United-States <=50K
52 Private 187681 Some-college 9 Married-civ-spouse Sales Not-in-family White Male 34 -4 41 United-States >50K
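Two small checks can be handy around the fit and sample steps above. The sketch below uses tiny made-up DataFrames that mimic the tables in this article; `infer_discrete_columns` and `count_out_of_range` are hypothetical helper names, not part of ctgan. The first helper guesses the discrete columns from their dtype, and the second counts synthetic values that fall outside the range seen in the real data (such as the negative capital-gain values above).

```python
import pandas as pd


def infer_discrete_columns(df):
    """Return the names of non-numeric columns as a first guess at the discrete columns."""
    return [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]


def count_out_of_range(real, synthetic):
    """For each numeric column, count synthetic values outside the real [min, max]."""
    result = {}
    for col in real.columns:
        if pd.api.types.is_numeric_dtype(real[col]):
            below = synthetic[col] < real[col].min()
            above = synthetic[col] > real[col].max()
            result[col] = int((below | above).sum())
    return result


# Tiny hypothetical frames for illustration only.
real = pd.DataFrame({
    "age": [39, 50, 38],
    "capital-gain": [2174, 0, 0],
    "income": ["<=50K", "<=50K", ">50K"],
})
synthetic = pd.DataFrame({
    "age": [17, 24, 53],
    "capital-gain": [-32, -31, 75],
    "income": ["<=50K", ">50K", ">50K"],
})

print(infer_discrete_columns(real))
print(count_out_of_range(real, synthetic))
```

The dtype-based guess is only a starting point: a categorical column stored as integer codes would be missed and should still be listed by hand.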

Now, let’s check just how similar the synthetic data is to the real data. For this, we will use table_evaluator [10] to visualize the differences between the fake and real data. (Make sure to pip install table-evaluator first.)

from table_evaluator import TableEvaluator

# Compare the real and synthetic tables and plot the results
table_evaluator = TableEvaluator(real_data, synthetic_data)
table_evaluator.visual_evaluation()

Distribution Per Feature: 3 Features (Age, Occupation, Hours-Per-Week Worked)

Correlation Matrix between Real and Synthetic Data

Absolute Log Mean and STDs of Real and Synthetic Data

Looking at the distribution-per-feature plots, the correlation matrices, and the plot of absolute log means and standard deviations, we can see that the synthetic records represent the real ones pretty well. We can also run table_evaluator.evaluate(target_col='income') to get F1 scores and Jaccard similarity scores for classifiers trained on the real and on the synthetic data to predict income.
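If the F1 and Jaccard scores are unfamiliar, here is a minimal sketch of what they measure, using made-up predictions from two hypothetical classifiers: one trained on real data and one trained on synthetic data, both evaluated on the same real test rows. These are the textbook definitions; how table_evaluator computes its scores internally may differ.

```python
def f1_score(y_true, y_pred, positive):
    """F1 = 2*TP / (2*TP + FP + FN) for the given positive label."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def jaccard(a, b):
    """Jaccard similarity of two sets: |A intersect B| / |A union B|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical labels and predictions, for illustration only.
y_true     = [">50K", "<=50K", ">50K", "<=50K", ">50K", "<=50K"]
pred_real  = [">50K", "<=50K", ">50K", "<=50K", "<=50K", "<=50K"]  # model trained on real data
pred_synth = [">50K", "<=50K", ">50K", ">50K", "<=50K", "<=50K"]   # model trained on synthetic data

f1_real = f1_score(y_true, pred_real, positive=">50K")
f1_synth = f1_score(y_true, pred_synth, positive=">50K")

# Rows each model labels as ">50K"; a high overlap means the two models agree.
pos_real = {i for i, p in enumerate(pred_real) if p == ">50K"}
pos_synth = {i for i, p in enumerate(pred_synth) if p == ">50K"}

print(f"F1 (trained on real):      {f1_real:.3f}")
print(f"F1 (trained on synthetic): {f1_synth:.3f}")
print(f"Jaccard of positive sets:  {jaccard(pos_real, pos_synth):.3f}")
```

The closer the two F1 scores are, and the higher the overlap between the models' predictions, the more useful the synthetic data is as a stand-in for the real data in a downstream task.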

Conclusion:

In this second installment of the synthetic data series, we looked at how to generate a synthetic tabular dataset using CTGAN. Synthetic data unlocks opportunities for data sharing, experimenting, and analysis on a large scale, without disclosing sensitive information. It’s a tremendously useful tool!

In the next article (last of the series), we will discuss some limitations, challenges and the future of synthetic data.

Subscribe to our newsletter, and join us on Discord, where we share some of our other favourite open source projects just like this one! [11]

References:

  1. Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial networks. arXiv preprint arXiv:1406.2661.

  2. https://www.freecodecamp.org/news/an-intuitive-introduction-to-generative-adversarial-networks-gans-7a2264a81394/

  3. Same as No. 1

  4. Xu, L., Skoularidou, M., Cuesta-Infante, A., & Veeramachaneni, K. (2019). Modeling tabular data using conditional GAN. arXiv preprint arXiv:1907.00503.

  5. https://github.com/sdv-dev/CTGAN

  6. Same as No. 4

  7. Same as No. 4

  8. Same as No. 4

  9. https://archive.ics.uci.edu/ml/datasets/adult

  10. https://pypi.org/project/table-evaluator/

  11. Editorial review provided by Prateek Sanyal

This post is licensed under CC BY 4.0 by the author.