Paper Title
Approximate Nearest Neighbor Search under Neural Similarity Metric for Large-Scale Recommendation
Authors
Abstract
Model-based methods for recommender systems have been studied extensively for years. Modern recommender systems usually resort to 1) representation learning models, which define user-item preference as the distance between their embedding representations, and 2) embedding-based Approximate Nearest Neighbor (ANN) search to tackle the efficiency problem introduced by a large-scale corpus. While providing efficient retrieval, the embedding-based retrieval pattern also limits model capacity, since the form of the user-item preference measure is restricted to the distance between embedding representations. Other, more precise user-item preference measures, e.g., preference scores derived directly from a deep neural network, are computationally intractable because no efficient retrieval method exists for them, and an exhaustive search over all user-item pairs is impractical. In this paper, we propose a novel method to extend ANN search to arbitrary matching functions, e.g., a deep neural network. Our main idea is to perform a greedy walk, guided by the matching function, in a similarity graph constructed from all items. To address the problem that the similarity measures used for graph construction and for user-item matching are heterogeneous, we propose a pluggable adversarial training task that ensures graph search with an arbitrary matching function can achieve fairly high precision. Experimental results on both open-source and industry datasets demonstrate the effectiveness of our method. The proposed method has been fully deployed in the Taobao display advertising platform and brings a considerable increase in advertising revenue. We also summarize our detailed deployment experience in this paper.
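The core retrieval idea in the abstract, a greedy walk over an item similarity graph scored by an arbitrary matching function, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `greedy_graph_search`, its parameters (`entry`, `top_k`, `beam_width`), and the toy chain graph are all assumptions made for illustration, and the adversarial training task is not shown. The `score` callable stands in for any user-item matching function, such as a deep network evaluated for a fixed user.

```python
import heapq
from typing import Callable, Dict, List, Sequence, Tuple


def greedy_graph_search(
    graph: Dict[int, Sequence[int]],
    score: Callable[[int], float],
    entry: int,
    top_k: int = 2,
    beam_width: int = 4,
) -> List[Tuple[int, float]]:
    """Best-first walk over an item graph (hypothetical helper).

    `graph` maps each item id to its neighbor ids; `score(item)` is an
    arbitrary matching function for one fixed user (higher is better).
    """
    visited = {entry}
    entry_score = score(entry)
    # Candidates still to expand; scores are negated because heapq is a min-heap.
    candidates = [(-entry_score, entry)]
    # Min-heap keeping the best `beam_width` items seen so far.
    results = [(entry_score, entry)]
    while candidates:
        neg_s, node = heapq.heappop(candidates)
        # Stop once the best unexpanded candidate is worse than the worst kept result.
        if len(results) >= beam_width and -neg_s < results[0][0]:
            break
        for nb in graph.get(node, ()):
            if nb in visited:
                continue
            visited.add(nb)
            s = score(nb)
            if len(results) < beam_width or s > results[0][0]:
                heapq.heappush(candidates, (-s, nb))
                heapq.heappush(results, (s, nb))
                if len(results) > beam_width:
                    heapq.heappop(results)  # drop the current worst result
    best = sorted(results, reverse=True)[:top_k]
    return [(item, s) for s, item in best]


# Toy example (assumed data): six items linked in a chain, with a scoring
# function that is not an embedding distance and peaks at item 4.
toy_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
top_items = greedy_graph_search(
    toy_graph, score=lambda i: -abs(i - 4), entry=0, top_k=2, beam_width=2
)
```

Only the items reached by the walk are scored, which is what makes an expensive matching function affordable at retrieval time. The walk works well only if neighbors in the graph tend to have similar matching scores, which is the mismatch the abstract's adversarial training task is meant to address.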