Language Models are Realistic Tabular Data Generators

Paper Title

Language Models are Realistic Tabular Data Generators

Authors

Vadim Borisov, Kathrin Seßler, Tobias Leemann, Martin Pawelczyk, Gjergji Kasneci

Abstract

Tabular data is among the oldest and most ubiquitous forms of data. However, the generation of synthetic samples with the original data's characteristics remains a significant challenge for tabular data. While many generative models from the computer vision domain, such as variational autoencoders or generative adversarial networks, have been adapted for tabular data generation, less research has been directed towards recent transformer-based large language models (LLMs), which are also generative in nature. To this end, we propose GReaT (Generation of Realistic Tabular data), which exploits an auto-regressive generative LLM to sample synthetic and yet highly realistic tabular data. Furthermore, GReaT can model tabular data distributions by conditioning on any subset of features; the remaining features are sampled without additional overhead. We demonstrate the effectiveness of the proposed approach in a series of experiments that quantify the validity and quality of the produced data samples from multiple angles. We find that GReaT maintains state-of-the-art performance across numerous real-world and synthetic data sets with heterogeneous feature types coming in various sizes.
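
The abstract does not spell out how table rows reach the language model. The following is a minimal sketch, assuming each row is serialized as comma-separated "feature is value" clauses in a random feature order, so that any subset of known features can serve as a text prefix for the model to continue. The function names (`encode_row`, `conditioning_prompt`, `decode_row`) and the example column names are hypothetical, and the language model itself is deliberately left out.

```python
import random


def encode_row(row, rng=None):
    """Serialize one table row as text, e.g. 'age is 39, education is Bachelors'."""
    if rng is None:
        rng = random.Random()
    items = list(row.items())
    # Assumption: shuffling the feature order during training lets the model
    # see every feature in every position, which is what makes conditioning
    # on an arbitrary subset of features possible.
    rng.shuffle(items)
    return ", ".join(f"{name} is {value}" for name, value in items)


def conditioning_prompt(known):
    """Build a text prefix from the known features; a model would continue it."""
    return ", ".join(f"{name} is {value}" for name, value in known.items()) + ", "


def decode_row(text):
    """Parse 'name is value' clauses back into a dict (values stay strings)."""
    row = {}
    for clause in text.split(","):
        clause = clause.strip()
        if " is " not in clause:
            continue  # skip empty or malformed clauses
        name, value = clause.split(" is ", 1)
        row[name.strip()] = value.strip()
    return row
```

In this sketch, a generated continuation would be appended to the prompt and passed through `decode_row` to recover a full row. Values that themselves contain commas would need escaping, which is not handled here.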
