Privacy and Utility Evaluation of Synthetic Tabular Data for Machine Learning

Hermsen, Felix; Mandal, Avikarsha

doi:10.1007/978-3-031-57978-3_17

2024

Conference Paper

Abstract

Synthetic data generation approaches have attracted considerable attention as a potential substitute for classical anonymization methods. However, synthetic data still pose a wide range of privacy risks. For example, a synthetic dataset may contain data points that lie close to real data points, which increases the risk of linkage attacks. While differentially private generative models are generally considered immune to privacy attacks, it is not immediately evident how these models maintain privacy with reasonable utility. In this study, we evaluate the privacy and utility trade-offs in synthetic data generated by the state-of-the-art generative model CTGAN and its differentially private variant DPCTGAN for the mixed tabular data domain. We conduct experiments on widely recognized benchmark datasets to highlight the importance of selecting hyperparameters such that the model converges during training and produces synthetic data with satisfactory utility. Our experiments show that synthetic data generators trained with differential privacy may collapse during the training phase. Adding a smaller amount of noise allows the training to converge and could still limit the risk of privacy attacks such as membership inference and linkage.
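
The proximity risk mentioned in the abstract (synthetic points lying close to real points) can be illustrated with a distance-to-closest-record check. The following is a minimal sketch and not the evaluation procedure used in the paper: the function name `distance_to_closest_record`, the toy arrays, and the choice of Euclidean distance are all assumptions made for illustration. It also assumes purely numeric features, so mixed tabular data would first need to be encoded and scaled.

```python
import numpy as np

def distance_to_closest_record(synthetic, real):
    """For each synthetic row, return the Euclidean distance to its nearest real row.

    Hypothetical helper for illustration; assumes numeric, comparably scaled columns.
    """
    synthetic = np.asarray(synthetic, dtype=float)
    real = np.asarray(real, dtype=float)
    # Broadcast to pairwise differences of shape (n_synthetic, n_real, n_features).
    diffs = synthetic[:, None, :] - real[None, :, :]
    # Pairwise Euclidean distances of shape (n_synthetic, n_real).
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # Keep only the distance to the closest real record for each synthetic row.
    return dists.min(axis=1)

# Toy data (made up): three real records and two synthetic records.
real = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 5.0]])
synthetic = np.array([[0.0, 0.1], [4.0, 1.0]])

dcr = distance_to_closest_record(synthetic, real)
print(dcr)
```

In this toy example the first synthetic record sits very close to a real record, which is the kind of near-copy that could make linkage easier, while the second is comparatively far from every real record. Pairwise broadcasting uses memory proportional to the product of the two dataset sizes, so a nearest-neighbor index would be a more practical choice for larger datasets.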