Paper Title
STAR: SQL Guided Pre-Training for Context-dependent Text-to-SQL Parsing
Paper Authors
Paper Abstract
In this paper, we propose STAR, a novel SQL-guided pre-training framework for context-dependent text-to-SQL parsing, which leverages contextual information to enrich natural language (NL) utterance and table schema representations for text-to-SQL conversations. Concretely, we propose two novel pre-training objectives that respectively explore the context-dependent interactions of NL utterances and SQL queries within each text-to-SQL conversation: (i) a schema state tracking (SST) objective that tracks the schema states of context-dependent SQL queries by predicting and updating the value of each schema slot throughout the interaction; (ii) an utterance dependency tracking (UDT) objective that employs weighted contrastive learning to pull together the representations of semantically similar NL utterances and push apart those of semantically dissimilar NL utterances within each conversation. In addition, we construct a high-quality, large-scale context-dependent text-to-SQL conversation corpus to pre-train STAR. Extensive experiments show that STAR achieves new state-of-the-art performance on two downstream benchmarks (SParC and CoSQL), significantly outperforming previous pre-training methods and ranking first on the leaderboard. We believe the release of the constructed corpus, codebase, and pre-trained STAR checkpoints will push forward research in this area. For reproducibility, we release our code and data at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/star.
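The abstract does not spell out the exact form of the UDT objective, but the idea of weighted contrastive learning over utterance representations can be sketched as a minimal weighted InfoNCE-style loss in plain Python. This is an illustrative sketch only: the function names, the use of cosine similarity, the temperature value, and the per-pair weighting scheme are all assumptions, not STAR's actual implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (assumed non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def weighted_contrastive_loss(anchor, positives, negatives, weights, tau=0.1):
    """Weighted InfoNCE-style contrastive loss (illustrative sketch).

    Each positive utterance representation is pulled toward the anchor,
    with its contribution scaled by a weight (e.g., how semantically
    related the two utterances are); negative representations appear in
    the denominator, pushing dissimilar utterances apart.
    """
    neg_sum = sum(math.exp(cosine(anchor, n) / tau) for n in negatives)
    loss = 0.0
    for p, w in zip(positives, weights):
        pos = math.exp(cosine(anchor, p) / tau)
        loss += -w * math.log(pos / (pos + neg_sum))
    return loss / sum(weights)
```

With this formulation, a positive utterance whose representation is close to the anchor yields a lower loss than one that is far away, which is the behavior the UDT objective's "pull together / push apart" description implies.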