Paper Title
Retrieve Synonymous Keywords for Frequent Queries in Sponsored Search in a Data Augmentation Way
Authors
Abstract
In sponsored search, retrieving synonymous keywords is essential for accurately targeted advertising. The task poses two major challenges: the semantic gap between queries and keywords, and an extremely high precision requirement (>= 95%). To the best of our knowledge, this problem has not been openly discussed. In an industrial sponsored search system, keywords for frequent queries are usually retrieved ahead of time and stored in a lookup table. Treating these results as a seed dataset, we propose a data-augmentation-like framework to improve synonymous retrieval performance for these frequent queries. The framework comprises two steps: translation-based retrieval and discriminant-based filtering. First, we devise a Trie-based translation model to generate a data increment. In this phase, we apply a Bag-of-Core-Words trick, which increases the volume of the data increment 4.2 times while preserving the original precision. Then, we use a BERT-based discriminant model to filter out non-synonymous pairs; it outperforms the traditional feature-driven GBDT model by 11% absolute AUC. The method has been successfully applied to Baidu's sponsored search system, where it has yielded a significant improvement in revenue. In addition, a commercial Chinese dataset containing 500K synonymous pairs with a precision of 95% is released to the public for paraphrase study (http://ai.baidu.com/broad/subordinate?dataset=paraphrasing).
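The abstract does not describe how the Trie is combined with the translation model, so the following is only a minimal sketch of one common reading: a prefix tree over the keyword inventory restricts decoding so that every generated candidate is an existing keyword. The names `KeywordTrie`, `constrained_candidates`, and `score_token` are hypothetical, and the toy overlap score stands in for a real translation model's token log-probability.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

Prefix = Tuple[str, ...]


class KeywordTrie:
    """Prefix tree over tokenized keywords (hypothetical helper, not the paper's code)."""

    _END = None  # marker key meaning "a keyword ends at this node"

    def __init__(self, keywords: Iterable[Sequence[str]]) -> None:
        self.root: Dict = {}
        for tokens in keywords:
            node = self.root
            for tok in tokens:
                node = node.setdefault(tok, {})
            node[self._END] = True

    def _walk(self, prefix: Sequence[str]):
        node = self.root
        for tok in prefix:
            node = node.get(tok)
            if node is None:
                return None
        return node

    def next_tokens(self, prefix: Sequence[str]) -> List[str]:
        """Tokens that can legally extend `prefix` toward some stored keyword."""
        node = self._walk(prefix)
        return [] if node is None else [t for t in node if t is not self._END]

    def is_keyword(self, prefix: Sequence[str]) -> bool:
        node = self._walk(prefix)
        return node is not None and self._END in node


def constrained_candidates(
    trie: KeywordTrie,
    score_token: Callable[[Prefix, str], float],
    beam_size: int = 3,
    max_len: int = 6,
) -> List[Tuple[Prefix, float]]:
    """Beam search that only expands along trie edges.

    `score_token(prefix, token)` is an assumed stand-in for the translation
    model's log-probability of `token` given the query and the prefix.
    """
    beams: List[Tuple[Prefix, float]] = [((), 0.0)]
    finished: List[Tuple[Prefix, float]] = []
    for _ in range(max_len):
        expanded = [
            (prefix + (tok,), score + score_token(prefix, tok))
            for prefix, score in beams
            for tok in trie.next_tokens(prefix)
        ]
        if not expanded:
            break
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:beam_size]
        finished.extend(item for item in beams if trie.is_keyword(item[0]))
    return sorted(finished, key=lambda item: item[1], reverse=True)


# Toy usage: the keyword inventory and the overlap-based score are made up.
inventory = [["cheap", "flights"], ["cheap", "hotels"], ["low", "cost", "flights"]]
query_tokens = {"cheap", "flights"}
toy_score = lambda prefix, tok: 0.0 if tok in query_tokens else -1.0
for keyword, score in constrained_candidates(KeywordTrie(inventory), toy_score):
    print(" ".join(keyword), score)
```

Under this reading, the candidates returned here would form the raw data increment, and the second step of the framework (the BERT-based discriminant model) would then score each query-keyword pair and discard those below a threshold chosen to meet the precision requirement.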