- Basic Machine Learning
- Basic Programming
“Data is the new oil in the digital era” [1]. Software engineers and data scientists often need access to large volumes of real data to develop, experiment, and innovate. Collecting such data unfortunately also introduces security liabilities and privacy concerns which affect individuals, organizations, and society at large. Data containing Personally Identifiable Information (PII) and Personal Health Information (PHI) are particularly vulnerable to disclosure, and need to be protected.
Regulations such as the General Data Protection Regulation (GDPR) [2] provide a level of legal protection for user data, but consequently introduce new technical challenges by restricting data usage, collection, and storage methods. In light of this, synthetic data could serve as a viable solution to protect user data privacy, stay compliant with regulations, and still maintain the pace and ability for development and innovation.
In this article, we will dive into the details of the significant benefits synthetic data offers:
- To support machine learning model development, allowing for faster iteration of experimentation and model building.
- To facilitate collaboration between data owners and external data scientists/engineers in a more secure and privacy-compliant way.
What is Synthetic Data?
As the name suggests, synthetic data is, in essence, ‘fake’ data that is artificially or programmatically generated, as opposed to ‘real’ data that is collected through real-world surveys or events. Synthetic data is derived from real data, and a good synthetic dataset captures the underlying structure and displays the same statistical distributions as the original data, making it statistically hard to distinguish from the real data.
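As a toy illustration of what "displaying the same statistical distribution" means, the sketch below fits a normal distribution to a made-up numeric column and samples new values from it. The `real_ages` values and the normality assumption are hypothetical and chosen only for illustration. This is not the generation method covered in the follow-up article, and a realistic generator would also need to model relationships between columns.

```python
import random
import statistics

# Hypothetical "real" column (made-up values, for illustration only)
real_ages = [72, 54, 53, 61, 47, 68, 59, 75, 50, 64]

# Summarize the real column with two statistics
mu = statistics.mean(real_ages)
sigma = statistics.stdev(real_ages)

# Draw synthetic values from a normal distribution with the same mean and spread
rng = random.Random(0)
synthetic_ages = [rng.gauss(mu, sigma) for _ in range(5000)]

print(f"real:      mean={mu:.1f}, stdev={sigma:.1f}")
print(f"synthetic: mean={statistics.mean(synthetic_ages):.1f}, "
      f"stdev={statistics.stdev(synthetic_ages):.1f}")
```

None of the synthetic values corresponds to a real record, yet the column as a whole has a similar mean and spread.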
The first major benefit of synthetic data is its ability to support machine learning and deep learning model development. Developers often need the flexibility to quickly test an idea or experiment with building a new model, yet acquiring and preparing sufficient amounts of real data can take weeks. Synthetic data enables faster iteration of model training and experimentation, as it provides a blueprint of how models can be built on real data. Additionally, with synthetic data, ML practitioners gain complete control over the dataset, including the degree of class separation, the sample size, and the amount of noise. In this article, we will show you how to improve an imbalanced dataset for machine learning with synthetic data.
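One way to see this kind of control in practice is scikit-learn's `make_classification`, which exposes class separation, sample size, and label noise as parameters. The sketch below is illustrative only; the specific parameter values are arbitrary choices, not recommendations.

```python
from sklearn.datasets import make_classification

# Each argument controls one property of the generated dataset
X, y = make_classification(
    n_samples=500,        # sample size
    n_features=10,        # number of feature columns
    weights=[0.9, 0.1],   # approximate class proportions
    class_sep=2.0,        # larger values push the classes further apart
    flip_y=0.05,          # fraction of labels assigned at random (label noise)
    random_state=0,       # fixed seed so the dataset is reproducible
)

print(X.shape)            # (500, 10)
print(y.mean())           # share of samples labelled 1
```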
The second major benefit of synthetic data is that it can protect data privacy. Real data contains sensitive and private user information that cannot be freely shared and is legally constrained. Approaches to preserve data privacy such as the k-anonymity model [3] involve generalizing or suppressing data records to a certain extent, which results in an overall loss of information and data utility. In such cases, synthetic data serves as an excellent alternative to these data anonymization techniques. Synthetic datasets can be more openly published, shared, and analyzed without revealing actual individual information.
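To make that information loss concrete, here is a minimal sketch of a k-anonymity check, assuming age and gender are the quasi-identifiers. The records, the `generalize_age` and `k_anonymity` helpers, and the decade-wide age buckets are hypothetical choices for illustration, not part of any particular anonymization tool.

```python
from collections import Counter

# Hypothetical records: (age, gender), treated here as quasi-identifiers
records = [(72, "M"), (75, "M"), (54, "F"), (53, "F"), (58, "F"), (46, "M")]

def generalize_age(age):
    """Replace an exact age with a decade-wide range, e.g. 54 -> '50-59'."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

def k_anonymity(rows):
    """Return the size of the smallest group of identical quasi-identifier rows."""
    return min(Counter(rows).values())

generalized = [(generalize_age(age), gender) for age, gender in records]

# Suppress (drop) rows whose group is still smaller than k = 2
k = 2
counts = Counter(generalized)
released = [row for row in generalized if counts[row] >= k]

print(k_anonymity(records))      # 1: every exact record is unique
print(k_anonymity(released))     # 2
print(len(records) - len(released), "record(s) suppressed")
```

Both steps cost information: exact ages are replaced by ranges, and one record is dropped entirely.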
Synthetic Data for ML Model Training
In this section, we will demonstrate one use case of synthetic data for machine learning: fixing an imbalanced dataset to support the training of a more accurate model.
What is dataset imbalance and why is it important?
“Any dataset with an unequal class distribution is technically imbalanced. However, a dataset is said to be imbalanced when there is a significant, or in some cases extreme, disproportion among the number of examples of each class of the problem.” [4]
A machine learning model's ability to detect the minority class is severely hindered when it is trained on an imbalanced dataset. This is because the model is not exposed to enough samples of the minority class during training, which inhibits its ability to recognize such instances in test and production data. For example, in fraud detection tasks, actual fraud records are almost always the minority class, and those are precisely the records we need to detect.
A technique we can employ to combat the imbalanced dataset problem is “resampling”, which either undersamples the majority class or oversamples the minority class. Below, we use SMOTE (Synthetic Minority Over-sampling Technique) to oversample the minority class and fix an imbalanced dataset. SMOTE creates synthetic observations by interpolating between an existing minority observation and one of its k nearest minority-class neighbours [5].
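The core idea can be sketched in a few lines: pick a minority sample, pick one of its k nearest minority neighbours, and create a new point somewhere on the line segment between them. The code below is a simplified, dependency-free illustration of that interpolation step, not the imbalanced-learn implementation. The function name `smote_sketch` and the sample points are hypothetical.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Random position on the segment between base and neighbour
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class samples with two features each
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=3)
print(new_points)
```

Because every synthetic point lies between two real minority points, the new samples stay inside the region already occupied by the minority class.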
We begin by creating an imbalanced dataset with a class 0 to class 1 ratio of 90:10, using scikit-learn's make_classification function:
    from sklearn.datasets import make_classification
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    # Generate a two-class dataset with roughly a 90:10 class ratio
    X, y = make_classification(
        n_classes=2,
        weights=[0.9, 0.1],
        n_samples=100,
    )

    df = pd.DataFrame(X)
    df['target'] = y

    # Bar chart of the number of samples per class.
    # Axis labels are set after plotting so pandas does not override them.
    colors = ['g', 'm']
    ax = df['target'].value_counts().plot(kind='bar', color=colors, title='Imbalanced Dataset')
    ax.set_xlabel("Classes")
    ax.set_ylabel("Number of samples")
We can visualize our dataset in a 2D graph by reducing the dimensions of the dataset using Principal Component Analysis (PCA).
    def plot_data(X, y, title="Imbalanced Dataset - 2 Component PCA"):
        """Scatter-plot the first two columns of X, colored by class label."""
        fig = plt.figure(figsize=(6, 6))
        ax = fig.add_subplot(1, 1, 1)
        ax.set_xlabel('PCA 1', fontsize=12)
        ax.set_ylabel('PCA 2', fontsize=12)
        ax.set_title(title, fontsize=16)
        colors = ['g', 'm']
        classes = ['0', '1']
        # One scatter call per class so each class gets its own color and legend entry
        for i, c in zip(np.unique(y), colors):
            ax.scatter(X[y == i, 0], X[y == i, 1], c=c)
        ax.legend(classes)
    from sklearn.decomposition import PCA

    # Project the features onto the first two principal components
    pca = PCA(n_components=2)
    X = pca.fit_transform(X)
    plot_data(X, y)
Next, we apply SMOTE to oversample the minority class and balance our dataset.
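    from imblearn.over_sampling import SMOTE

    smote = SMOTE()
    # fit_resample is the current method name; the older fit_sample alias
    # has been removed from recent imbalanced-learn releases
    X_os, y_os = smote.fit_resample(X, y)
    plot_data(X_os, y_os, title="Balanced Dataset - 2 Component PCA")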
    df1 = pd.DataFrame(X_os)
    df1['target'] = y_os

    # Bar chart of the class counts after oversampling
    colors = ['g', 'm']
    ax = df1['target'].value_counts().sort_index().plot(kind='bar', color=colors, title='Balanced Dataset')
    ax.set_xlabel("Classes")
    ax.set_ylabel("Number of samples")
We can see that the number of samples in class 0 and class 1 is now equal, and the dataset is balanced.
You should oversample only the training dataset, so make sure to split the dataset into training and test sets before you apply SMOTE. Otherwise, the test set will contain synthetic samples and will no longer reflect the real class distribution, so the evaluation won’t be a true measure of your model.
Other methods to combat imbalanced datasets include undersampling the majority class, collecting more data, and changing the evaluation metrics.
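On the point about evaluation metrics, a small worked example shows why accuracy alone can mislead on imbalanced data. It assumes a hypothetical classifier that always predicts the majority class on a 90:10 test set, and uses scikit-learn's metric functions.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Hypothetical 90:10 test set and a model that always predicts the majority class
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))           # 0.9
print(recall_score(y_true, y_pred))             # 0.0 (no minority sample is found)
print(balanced_accuracy_score(y_true, y_pred))  # 0.5
```

Accuracy looks high even though the model never detects the minority class, which is why metrics such as recall or balanced accuracy are often more informative here.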
Synthetic Data for Privacy Protection
Instead of masking or anonymizing the original data, one can use synthetic data to protect data privacy. A good synthetic dataset:
- Retains the underlying structure and statistical distribution of the original data
- Does not rely on masking or omitting of the original data
- Provides a strong privacy guarantee to prevent sensitive user information from being disclosed
The example below shows a synthetic dataset created from a sample real dataset. In the next article, we will dive into the method of generating a synthetic dataset from a real dataset like this one.
Sample real data:
| Name | Age | Gender | SIN | Chest pain location |
|---|---|---|---|---|
| Rylie Bradford | 72 | M | 100 709 112 | 0 |
| Karyn Polley | 54 | F | 722 260 965 | 1 |
| Gordie Quincy | 53 | M | 795 635 739 | 1 |
Sample synthetic data:
| Name | Age | Gender | SIN | Chest pain location |
|---|---|---|---|---|
| Simone Peacock | 75 | F | 970 440 905 | 1 |
| Allyson Wortham | 69 | M | 748 665 544 | 1 |
| Cyprian Traylor | 46 | M | 265 183 491 | 0 |
In doing so, data can be more freely shared and published, opening up collaboration opportunities on projects across government agencies, social sciences, health sectors, and software companies, where data is heavily regulated by privacy guidelines.
One of the most significant advantages that synthetic data offers is its ability to protect user data privacy. This unlocks opportunities for data sharing, publishing, and collaboration on a larger scale. Additionally, you could turn to synthetic data as a means to support the development of machine learning models and software tools, since it’s a cheaper and faster approach than collecting real-world data.
In the next article, we will show you how to generate synthetic data from a real dataset, and also discuss some limitations, challenges, and the future of synthetic data. [6]
[3] Sweeney, L. (2002). Achieving k-anonymity privacy protection using generalization and suppression. Int. J. Uncertain. Fuzz. Knowl.-Based Syst., 10(05), 571-588.
[4] Fernández, A., García, S., Galar, M., Prati, R. C., Krawczyk, B., & Herrera, F. (2018). Learning from imbalanced data sets (pp. 1-377). Berlin: Springer.
[5] Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357.
[6] Editorial review by Prateek Sanyal.