Improved techniques for training tabular GANs using Cramer's V statistics

Mendikowski, Melle; Schindler, Benjamin; Schmid, Thomas; Möller, Ralf; Hartwig, Mattis

Please use this identifier to cite or link to this item: http://dx.doi.org/10.25673/115107

Full metadata record

DC Field	Value	Language
dc.contributor.author	Mendikowski, Melle	-
dc.contributor.author	Schindler, Benjamin	-
dc.contributor.author	Schmid, Thomas	-
dc.contributor.author	Möller, Ralf	-
dc.contributor.author	Hartwig, Mattis	-
dc.date.accessioned	2024-03-04T08:47:26Z	-
dc.date.available	2024-03-04T08:47:26Z	-
dc.date.issued	2023	-
dc.identifier.uri	https://opendata.uni-halle.de//handle/1981185920/117063	-
dc.identifier.uri	http://dx.doi.org/10.25673/115107	-
dc.description.abstract	Given the growing global demand for machine learning training data, synthetic data generation is a reasonable way to address the varied challenges of data acquisition. Conditional Tabular Generative Adversarial Network (CTGAN), an extension of the widely used Generative Adversarial Network (GAN), is considered one of the most promising techniques for tabular data generation. Despite CTGAN's numerous successes, it has been found to preserve categorical dependencies within the data poorly. In prior work, Cramer's V (CV), a natural metric for the association between categorical variables, was proposed for hyperparameter tuning of CTGAN models. In this paper, we explore two novel strategies that integrate CV statistics of data batches directly into CTGAN training. The first is a generator loss term that penalizes differences between the CV statistics of the original and generated data. The second is the extraction of the CV matrix as an additional feature for the critic. By applying our proposed methods to three benchmark datasets, we improve the averaged accuracy of supervised learning models trained on synthesized data by 11% compared to the legacy CTGAN. We also outline the impact of CV statistics on preserving dependencies between categorical data columns in terms of integrity and contingency similarity, discuss existing challenges, and identify potential improvements.	eng
dc.language.iso	eng	-
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/	-
dc.subject.ddc	610	-
dc.title	Improved techniques for training tabular GANs using Cramer's V statistics	eng
dc.type	Article	-
local.versionType	publishedVersion	-
local.bibliographicCitation.journaltitle	Proceedings of the 36th Canadian Conference on Artificial Intelligence	-
local.bibliographicCitation.pagestart	1	-
local.bibliographicCitation.pageend	12	-
local.bibliographicCitation.publishername	Canadian Artificial Intelligence Association	-
local.bibliographicCitation.publisherplace	[Kitchener, ON]	-
local.bibliographicCitation.doi	10.21428/594757db.4c0ffb71	-
local.openaccess	true	-
dc.identifier.ppn	1870547667	-
cbs.publication.displayform	2023	-
local.bibliographicCitation.year	2023	-
cbs.sru.importDate	2024-03-04T08:47:03Z	-
local.bibliographicCitation	Contained in Proceedings of the 36th Canadian Conference on Artificial Intelligence - [Kitchener, ON] : Canadian Artificial Intelligence Association, 2023	-
local.accessrights.dnb	free	-
Appears in Collections:	Open Access Publikationen der MLU
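
The abstract describes a generator loss term that penalizes differences between the CV statistics of original and generated data. As a rough illustration of that idea only, the sketch below computes the standard (uncorrected) Cramer's V from a contingency table and a mean absolute difference between two pairwise CV matrices. It is not the authors' implementation: the function names `cramers_v`, `cv_matrix`, and `cv_penalty` are hypothetical, the choice of mean absolute difference is an assumption, and this NumPy version is not differentiable, so it could not be used as a training loss as written.

```python
import numpy as np


def cramers_v(x, y):
    """Uncorrected Cramer's V between two categorical sequences."""
    # Map category labels to integer codes 0..r-1 and 0..c-1.
    _, xi = np.unique(np.asarray(x), return_inverse=True)
    _, yi = np.unique(np.asarray(y), return_inverse=True)
    xi, yi = xi.ravel(), yi.ravel()
    r, c = int(xi.max()) + 1, int(yi.max()) + 1
    if min(r, c) < 2:
        # Assumption: a constant column is treated as having no association.
        return 0.0
    # Build the r x c contingency table of joint counts.
    table = np.zeros((r, c))
    np.add.at(table, (xi, yi), 1)
    n = table.sum()
    # Expected counts under independence, then Pearson's chi-squared.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    return float(np.sqrt(chi2 / (n * (min(r, c) - 1))))


def cv_matrix(columns):
    """Pairwise Cramer's V matrix for a list of categorical columns."""
    k = len(columns)
    m = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            m[i, j] = cramers_v(columns[i], columns[j])
    return m


def cv_penalty(real_columns, generated_columns):
    """Mean absolute difference between the two CV matrices (assumed form)."""
    diff = cv_matrix(real_columns) - cv_matrix(generated_columns)
    return float(np.abs(diff).mean())


# Toy data: the second column depends fully on the first in `real`,
# and is independent of it in `generated`.
real = [["a", "a", "b", "b"], ["u", "u", "v", "v"]]
generated = [["a", "a", "b", "b"], ["u", "v", "u", "v"]]
penalty = cv_penalty(real, generated)
```

A penalty of zero means the two batches have identical pairwise CV matrices; larger values indicate that the generated batch has lost or invented categorical dependencies relative to the original batch.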

Files in This Item:

File	Description	Size	Format
51682622311607.pdf		844.76 kB	Adobe PDF
