Data-Centric AI: Tabular Data Synthesis with Deep Generative Models
Generative artificial intelligence (Generative AI) has attracted considerable attention in academia and industry because of its wide range of applications. Despite significant advances in computer vision and natural language processing, the use of Generative AI for tabular data remains underexplored. This gap is especially significant given that tabular data is the primary data modality in many practical settings. This thesis seeks to close this gap by focusing on the efficient generation, evaluation, refinement, and application of synthetic tabular data, while addressing the challenges inherent in its heterogeneous nature.
The primary goal of this thesis is to develop efficient algorithms for tabular data synthesis, advancing the field of Generative AI in the tabular domain. To achieve this goal, four specific research objectives have been outlined.
First, this thesis addresses gaps in evaluation metrics by proposing a unified framework for the comprehensive and consistent assessment of synthetic data. The framework is designed to meet diverse business requirements across various downstream tasks by incorporating a wide range of advanced metrics. These metrics cover univariate, bivariate, multivariate, cluster-level, and record-level evaluations. Additionally, standardized visualizations are provided to facilitate qualitative assessment. The results indicate that the proposed evaluation framework not only enables different data synthesis approaches to be compared, ranked, and selected consistently, but also serves as a valuable tool for quickly assessing the reliability of results obtained from synthetic data.
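As an illustration of what univariate and bivariate evaluations can look like, the following minimal sketch compares a real and a synthetic numeric table. It is not the framework proposed in this thesis: the function names ks_statistic and correlation_gap are hypothetical, and the choice of the two-sample Kolmogorov-Smirnov statistic and of a Pearson correlation gap is an assumption made for the example.

```python
import numpy as np

def ks_statistic(real, synth):
    """Univariate check: largest gap between the two empirical CDFs (0 = identical)."""
    real = np.sort(np.asarray(real, dtype=float))
    synth = np.sort(np.asarray(synth, dtype=float))
    # Evaluate both empirical CDFs at every observed value.
    grid = np.concatenate([real, synth])
    cdf_real = np.searchsorted(real, grid, side="right") / real.size
    cdf_synth = np.searchsorted(synth, grid, side="right") / synth.size
    return float(np.max(np.abs(cdf_real - cdf_synth)))

def correlation_gap(real, synth):
    """Bivariate check: mean absolute difference of the Pearson correlation matrices."""
    corr_real = np.corrcoef(np.asarray(real, dtype=float), rowvar=False)
    corr_synth = np.corrcoef(np.asarray(synth, dtype=float), rowvar=False)
    return float(np.mean(np.abs(corr_real - corr_synth)))
```

Lower values indicate closer agreement on both measures. A fuller evaluation would add the multivariate, cluster-level, and record-level checks described above, as well as handling for categorical columns.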
Second, this thesis presents innovative algorithms for tabular data generation that aim to significantly enhance the quality of data synthesis. One contribution is a reversible feature engineering pipeline that automatically represents tabular data in an efficient format while ensuring that the transformed data can be converted back to its original format. Additionally, the thesis proposes novel deep learning-based tabular data generation models that learn the joint distribution of multivariate datasets without relying on predefined distribution assumptions. Experimental results highlight the effectiveness of the proposed algorithms in handling heterogeneous data, demonstrating superior performance in most scenarios compared with similar alternatives given the same training time.
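The idea of a reversible transformation can be sketched as follows. This is a simplified illustration under stated assumptions, not the pipeline developed in this thesis: the class name ReversibleEncoder is hypothetical, numeric columns are min-max scaled, categorical columns are one-hot encoded, and every category is assumed to have been seen during fitting.

```python
import numpy as np

class ReversibleEncoder:
    """Hypothetical encoder: min-max scaling plus one-hot encoding, both invertible."""

    def fit(self, numeric, categorical):
        numeric = np.asarray(numeric, dtype=float)
        categorical = np.asarray(categorical)
        self.lo = numeric.min(axis=0)
        span = numeric.max(axis=0) - self.lo
        # Constant columns get a span of 1 to avoid division by zero.
        self.span = np.where(span == 0, 1.0, span)
        # Sorted unique levels per categorical column.
        self.levels = [np.unique(categorical[:, j]) for j in range(categorical.shape[1])]
        return self

    def transform(self, numeric, categorical):
        numeric = np.asarray(numeric, dtype=float)
        categorical = np.asarray(categorical)
        blocks = [(numeric - self.lo) / self.span]
        for j, levels in enumerate(self.levels):
            # Assumes every value was seen in fit().
            codes = np.searchsorted(levels, categorical[:, j])
            blocks.append(np.eye(levels.size)[codes])
        return np.hstack(blocks)

    def inverse_transform(self, encoded):
        encoded = np.asarray(encoded, dtype=float)
        d = self.lo.size
        numeric = encoded[:, :d] * self.span + self.lo
        columns, start = [], d
        for levels in self.levels:
            block = encoded[:, start:start + levels.size]
            # argmax also decodes soft (non one-hot) generator outputs.
            columns.append(levels[block.argmax(axis=1)])
            start += levels.size
        return numeric, np.column_stack(columns)
```

Because decoding uses argmax rather than an exact match, the same inverse step can map the continuous outputs of a generative model back to valid categories.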
Third, this thesis introduces innovative synthetic data prototype selection algorithms for refining the generated samples. This approach leverages a key advantage of Deep Generative Models: once trained, they can generate unlimited and diverse synthetic data. By carefully selecting high-quality samples or removing unrealistic ones, the quality of the synthetic data can be enhanced in a post-processing step. Building on this hypothesis, we treat data synthesis as an inherently iterative procedure in which generation, evaluation, and refinement operate in a repeated cycle. During the data generation phase, the evaluation framework helps identify potential issues and risks based on the use case, while the refinement (post-processing) step iteratively improves the synthetic data in line with the evaluation outcomes.
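One simple way to remove unrealistic samples in post-processing is a nearest-neighbor filter, sketched below. This is an assumed stand-in for illustration rather than the prototype selection algorithms of this thesis: the function names and the 0.95 quantile threshold are hypothetical, and all features are assumed to be numeric and comparably scaled.

```python
import numpy as np

def nearest_distances(queries, reference, exclude_self=False):
    """Euclidean distance from each query row to its nearest reference row."""
    diffs = queries[:, None, :] - reference[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    if exclude_self:
        # Only valid when queries and reference are the same table.
        np.fill_diagonal(dists, np.inf)
    return dists.min(axis=1)

def filter_unrealistic(real, synth, quantile=0.95):
    """Keep synthetic rows that lie about as close to the real data as real rows do."""
    real = np.asarray(real, dtype=float)
    synth = np.asarray(synth, dtype=float)
    # Threshold: a quantile of real-to-real nearest-neighbor distances.
    threshold = np.quantile(nearest_distances(real, real, exclude_self=True), quantile)
    keep = nearest_distances(synth, real) <= threshold
    return synth[keep], keep
```

In an iterative workflow, the retained rows would be re-evaluated and the generator sampled again until enough acceptable rows are collected. A filter of this kind ignores privacy risk, since rows very close to real records are kept, so a lower distance bound may also be needed for sensitive data.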
Last, this thesis applies and validates the proposed models across various domains, including business, healthcare, and government, each requiring distinct downstream tasks. Specifically, we assess the algorithms for data balancing on churn data, for data augmentation on health data, and for privacy-aware data representation on sensitive citizen data. These applications serve as practical illustrations of the effectiveness and utility of the proposed Generative AI models in real-world scenarios.
In summary, this thesis makes a substantial contribution to the advancement of Generative AI for tabular data. It presents innovative algorithms and evaluation methods and introduces practical frameworks applicable to real-world scenarios. The findings demonstrate the potential of Generative AI to improve the availability, quality, and quantity of tabular data across various fields.