- Intermediate Machine Learning
- Basic Programming
In our previous article, we introduced the concept of synthetic data and its applications in data privacy and machine learning. In this article, we show how to generate synthetic tabular data using a generative adversarial network (GAN).
Tabular data is one of the most common and important data modalities. Enormous amounts of data, such as clinical trial records, financial data, and census results, are represented in tabular format. The ability to use synthetic datasets that do not disclose sensitive attributes or Personally Identifiable Information (PII) is crucial for staying compliant with privacy regulations, and convenient for data analysis, sharing, and experimentation.
Why are generative models a good fit for creating synthetic data? In a generative model, a neural network (NN) is used to approximate the underlying probability distribution of the input data in a high-dimensional latent space. After the probability distribution has been learned, the model can generate synthetic records by randomly sampling from the distribution 1. As a result, the generated records are not copies of the original records, but retain the real dataset’s underlying probability distribution.
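To make the learn-then-sample idea concrete, here is a minimal sketch in which a multivariate Gaussian stands in for the neural network. This is an assumption made purely for illustration (a GAN learns a far more flexible distribution), and the toy `real` table and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical "real" table: 500 rows of (age, hours-per-week), made up for this sketch.
real = np.column_stack([
    rng.normal(40, 12, size=500),
    rng.normal(38, 8, size=500),
])

# "Learn" the distribution: here just the empirical mean and covariance.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Generate synthetic rows by sampling from the learned distribution.
synthetic = rng.multivariate_normal(mean, cov, size=500)

print("real mean:     ", real.mean(axis=0).round(1))
print("synthetic mean:", synthetic.mean(axis=0).round(1))
```

The synthetic rows are fresh samples rather than copies of real rows, yet their summary statistics stay close to those of the real table.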
What is a GAN?
Generative Adversarial Network 2
A GAN consists of two models:
- A generator that learns to produce fake data.
- A discriminator that learns to distinguish the generator’s fake data from the real data.
The two models compete against each other in a zero-sum game that drives the whole system towards optimization. At the start of training, the generator is not very good at generating fake data, and the discriminator catches the fake data easily. As training progresses, the generator gets progressively better at generating fake data and fooling the discriminator, until the discriminator can no longer tell whether its input is real or not. See Goodfellow et al. 3 for the mathematical formulation behind the GAN.
Developing a general-purpose GAN that works reliably on tabular datasets is not a straightforward task. The main challenges are:
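For reference, the zero-sum game can be written as the minimax objective given in Goodfellow et al., where $G$ is the generator, $D$ is the discriminator, $p_{\text{data}}$ is the real data distribution, and $p_z$ is the noise prior fed to the generator:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]
```

The discriminator tries to maximize $V$ by scoring real rows high and generated rows low, while the generator tries to minimize it by making $D(G(z))$ as large as possible.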
- Mixed data types: numerical, categorical, time, text
- Different distributions: multimodal, long tail, non-gaussian
- Learning from sparse one-hot-encoded vectors
- Highly imbalanced categorical columns
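The last two challenges can be seen on a toy column. The sketch below uses a made-up, heavily imbalanced categorical column (hypothetical values, not the real dataset) and plain Python to show how sparse its one-hot encoding is and how unevenly the categories are represented.

```python
from collections import Counter

# Hypothetical categorical column, made up for this sketch: heavily imbalanced.
native_country = ["United-States"] * 90 + ["Mexico"] * 6 + ["India"] * 3 + ["Japan"] * 1

categories = sorted(set(native_country))

def one_hot(value, categories):
    """Return a 0/1 vector with a single 1 at the position of `value`."""
    return [1 if value == c else 0 for c in categories]

encoded = [one_hot(v, categories) for v in native_country]

# Sparsity: fraction of entries in the encoded matrix that are zero.
total = len(encoded) * len(categories)
zeros = total - sum(sum(row) for row in encoded)
print(f"one-hot sparsity: {zeros / total:.2f}")

# Imbalance: share of rows taken by each category.
counts = Counter(native_country)
for category, count in counts.most_common():
    print(f"{category:>15}: {count / len(native_country):.2%}")
```

With more categories the encoded matrix becomes sparser still, and a generator that rarely sees the minority categories has little signal to learn them from.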
To produce highly realistic tabular data, we will use a conditional generative adversarial network: CTGAN 4. This model was developed by Xu et al. of MIT, and it is an open source project 5. CTGAN uses GAN-based methods to model the tabular data distribution and sample rows from it. It uses mode-specific normalization to handle columns with non-Gaussian and multimodal distributions, while a conditional generator and training-by-sampling are used to combat class imbalance. 6
The conditional generator generates synthetic rows conditioned on one of the discrete columns. With training-by-sampling, the conditional vector and training data are sampled according to the log-frequency of each category, so CTGAN can explore all possible discrete values more evenly. 8
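Mode-specific normalization can be sketched as follows. CTGAN fits a variational Gaussian mixture model to each continuous column; this sketch skips the fitting step and simply assumes a two-mode mixture with hypothetical weights, means, and standard deviations. The function name is also hypothetical. Each value is encoded as a one-hot indicator of a sampled mode plus a scalar normalized within that mode.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Assumed mixture for one continuous column (hypothetical values; CTGAN
# estimates these with a variational Gaussian mixture model instead).
weights = np.array([0.6, 0.4])   # mode weights
means = np.array([20.0, 60.0])   # mode means
stds = np.array([3.0, 5.0])      # mode standard deviations

def mode_specific_normalize(value):
    """Encode a scalar as (value normalized within a sampled mode, one-hot mode indicator)."""
    # Weighted Gaussian density of the value under each mode.
    density = weights * np.exp(-0.5 * ((value - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    probs = density / density.sum()
    # Sample a mode according to those probabilities.
    k = rng.choice(len(weights), p=probs)
    # Normalize within the chosen mode, scaling by 4 standard deviations.
    alpha = (value - means[k]) / (4 * stds[k])
    beta = np.zeros(len(weights))
    beta[k] = 1.0
    return alpha, beta

alpha, beta = mode_specific_normalize(62.0)
print(alpha, beta)
```

Because the value is normalized relative to its own mode rather than the whole column, a multimodal column is represented without squashing its separate peaks into a single range.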
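The effect of log-frequency sampling can be illustrated on a made-up imbalanced column. This is a sketch under stated assumptions, not CTGAN's actual code: the column values are hypothetical, and using `log(count + 1)` as the weight is a choice made here to keep every weight positive.

```python
import math
from collections import Counter

# Hypothetical imbalanced discrete column, made up for this sketch.
column = ["<=50K"] * 950 + [">50K"] * 50

counts = Counter(column)
n = len(column)

# Raw frequency of each category.
raw = {cat: c / n for cat, c in counts.items()}

# Log-frequency weights, normalized into a probability mass function.
# Using log(count + 1) is an assumption of this sketch to keep weights positive.
log_weights = {cat: math.log(c + 1) for cat, c in counts.items()}
total = sum(log_weights.values())
log_freq = {cat: w / total for cat, w in log_weights.items()}

for cat in counts:
    print(f"{cat:>6}: raw={raw[cat]:.3f}  log-frequency={log_freq[cat]:.3f}")
```

The minority category is picked far more often under log-frequency sampling than under raw-frequency sampling, which gives the generator more chances to learn it.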
Now, let’s see how to use CTGAN to generate a synthetic dataset from a real one! As an example, we use the Census Income 9 dataset, which is built into the package. (Remember to `pip install ctgan`.)

    from ctgan import load_demo

    real_data = load_demo()
    print(real_data.head(5))

This loads the dataset:
As we can see, the table contains information about working adults, including their age, gender, education, working hours per week, income, etc. It’s a multivariate dataset containing a mix of categorical, continuous, and discrete variables. Now let us use `CTGANSynthesizer` to create a synthetic copy of this tabular data.

    # Note: recent ctgan releases name this class CTGAN; adjust the import
    # if CTGANSynthesizer is not found in your installed version.
    from ctgan import CTGANSynthesizer

    # Names of all the discrete columns
    discrete_columns = [
        'workclass',
        'education',
        'marital-status',
        'occupation',
        'relationship',
        'race',
        'sex',
        'native-country',
        'income'
    ]

    # Initialize the CTGANSynthesizer and call its fit method on the table
    ctgan = CTGANSynthesizer(epochs=10)
    ctgan.fit(real_data, discrete_columns)

    # Generate 1000 rows of synthetic data
    synthetic_data = ctgan.sample(1000)
    print(synthetic_data.head(5))

This returns a table of synthetic data with the same columns and format as the real data.
Now, let’s check how similar the synthetic data is to the real data. For this, we will use table_evaluator 10 to visualize the differences between the synthetic and real data. (Make sure to `pip install table-evaluator` first.)

    from table_evaluator import TableEvaluator

    # Compare the real table with the synthetic one and plot the results
    table_evaluator = TableEvaluator(real_data, synthetic_data)
    table_evaluator.visual_evaluation()

Distribution Per Feature: 3 Features (Age, Occupation, Hours-Per-Week Worked)
Correlation Matrix between Real and Synthetic Data
Absolute Log Mean and STDs of Real and Synthetic Data
Looking at the per-feature distribution plots, the correlation matrices, and the absolute log mean and standard deviation plot, we can see that the synthetic records represent the real ones fairly well. We can also run `table_evaluator.evaluate(target_col='income')` to get the F1 scores and the Jaccard similarity score for each feature.
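As a rough numerical complement to the plots, the comparison can also be sketched with plain pandas. The helper `compare_tables` and the stand-in frames below are hypothetical and are not part of table_evaluator. The helper compares per-column means and standard deviations and reports the largest gap between the two correlation matrices, for numeric columns only.

```python
import numpy as np
import pandas as pd

def compare_tables(real, synthetic):
    """Return per-column mean/std for both tables and the largest correlation gap (numeric columns only)."""
    numeric = real.select_dtypes(include="number").columns
    summary = pd.DataFrame({
        "real_mean": real[numeric].mean(),
        "synthetic_mean": synthetic[numeric].mean(),
        "real_std": real[numeric].std(),
        "synthetic_std": synthetic[numeric].std(),
    })
    # Largest absolute difference between the two correlation matrices.
    corr_gap = (real[numeric].corr() - synthetic[numeric].corr()).abs().to_numpy().max()
    return summary, corr_gap

# Stand-in frames, made up for this sketch; in the article these would be
# `real_data` and `synthetic_data`.
rng = np.random.default_rng(seed=0)
real_demo = pd.DataFrame({"age": rng.normal(40, 12, 1000), "hours": rng.normal(38, 8, 1000)})
synth_demo = pd.DataFrame({"age": rng.normal(41, 12, 1000), "hours": rng.normal(38, 8, 1000)})

summary, corr_gap = compare_tables(real_demo, synth_demo)
print(summary.round(2))
print("largest correlation gap:", round(corr_gap, 3))
```

Small gaps in these numbers are consistent with what the plots show, though they only cover numeric columns and say nothing about categorical ones.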
In this second installment of the synthetic data series, we looked at how to generate a synthetic tabular dataset using CTGAN. Synthetic data unlocks opportunities for data sharing, experimentation, and analysis on a large scale, without disclosing sensitive information. It’s a tremendously useful tool!
In the next article (the last of the series), we will discuss some limitations, challenges, and the future of synthetic data.
Subscribe to our newsletter, and join us on Discord, where we share some of our other favourite open source projects just like this one! 11
Goodfellow, Ian J., Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. “Generative adversarial networks.” arXiv preprint arXiv:1406.2661 (2014).
Same as No. 1
Xu, L., Skoularidou, M., Cuesta-Infante, A., & Veeramachaneni, K. (2019). Modeling tabular data using conditional GAN. arXiv preprint arXiv:1907.00503.
Same as No. 4
Same as No. 4
Same as No. 4
Editorial review provided by Prateek Sanyal