ReTabSyn: Realistic Tabular Data Synthesis via Reinforcement Learning
Deep generative models can mitigate data scarcity and privacy concerns by producing synthetic training data, but in low-data, imbalanced tabular settings they struggle to fully learn the complex data distribution. We argue that striving for the full joint distribution may be overkill; for greater data efficiency, models should prioritize learning the conditional distribution $P(y \mid \bm{X})$, as suggested by recent theoretical analysis. We therefore address this limitation with \textbf{ReTabSyn}, a \textbf{Re}inforced \textbf{Tab}ular \textbf{Syn}thesis pipeline that provides direct feedback on […]