Paper Title
Hierarchical Conversational Preference Elicitation with Bandit Feedback
Paper Authors
Paper Abstract
Recent advances in conversational recommendation provide a promising way to efficiently elicit users' preferences via conversational interactions. To achieve this, the recommender system conducts conversations with users, asking about their preferences for different items or item categories. Most existing conversational recommender systems for cold-start users utilize a multi-armed bandit framework to learn users' preferences in an online manner. However, they rely on a pre-defined conversation frequency for asking about item categories instead of individual items, which may incur excessive conversational interactions that hurt user experience. To enable more flexible questioning about key-terms, we formulate a new conversational bandit problem that allows the recommender system to choose either a key-term or an item to recommend at each round and explicitly models the rewards of these actions. This motivates us to handle a new exploration-exploitation (EE) trade-off between key-term asking and item recommendation, which requires us to accurately model the relationship between key-term and item rewards. We conduct a survey and analyze a real-world dataset to find that, unlike the assumptions made in prior works, key-term rewards are mainly affected by the rewards of representative items. We propose two bandit algorithms, Hier-UCB and Hier-LinUCB, that leverage this observed relationship and the hierarchical structure between key-terms and items to efficiently learn which items to recommend. We theoretically prove that our algorithms can reduce the dependency of the regret bound on the total number of items, compared with previous work. We validate our proposed algorithms and regret bound on both synthetic and real-world data.
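To make the key-term/item hierarchy concrete, below is a minimal Python sketch under stated assumptions. The class name HierUCBSketch, the input format, and simulate_user_feedback are hypothetical; the abstract does not give the index rule, so the sketch uses a standard UCB1 index and scores each key-term by its best item, reflecting the abstract's observation that key-term rewards are mainly driven by representative items. The actual Hier-UCB additionally decides between asking a key-term and recommending an item each round, which this drill-down sketch omits.

```python
import math
import random

class HierUCBSketch:
    """A minimal two-level UCB sketch of the hierarchy the abstract describes:
    key-terms (categories) sit above items, and a key-term's value is driven
    by its representative items. The names and the exact index rule here are
    assumptions for illustration, not the authors' Hier-UCB."""

    def __init__(self, key_terms):
        # key_terms: {key_term_id: [item_id, ...]} -- assumed hierarchy input.
        self.key_terms = key_terms
        items = [i for members in key_terms.values() for i in members]
        self.n = {i: 0 for i in items}       # pull count per item
        self.mean = {i: 0.0 for i in items}  # empirical mean reward per item
        self.t = 0                           # total rounds so far

    def _item_ucb(self, item):
        if self.n[item] == 0:
            return float("inf")  # force each item to be tried once
        return self.mean[item] + math.sqrt(2.0 * math.log(self.t) / self.n[item])

    def _key_term_index(self, kt):
        # Key-term score = best item UCB beneath it, mirroring the observed
        # link between key-term rewards and representative items.
        return max(self._item_ucb(i) for i in self.key_terms[kt])

    def select(self):
        """Drill down: pick the most promising key-term, then its best item."""
        self.t += 1
        kt = max(self.key_terms, key=self._key_term_index)
        item = max(self.key_terms[kt], key=self._item_ucb)
        return kt, item

    def update(self, item, reward):
        # Incremental mean update after observing the user's feedback.
        self.n[item] += 1
        self.mean[item] += (reward - self.mean[item]) / self.n[item]

def simulate_user_feedback(item):
    # Hypothetical environment: the user secretly prefers item "A2".
    return 1.0 if random.random() < (0.8 if item == "A2" else 0.3) else 0.0

bandit = HierUCBSketch({"action_movies": ["A1", "A2"], "comedies": ["C1"]})
for _ in range(200):
    _, item = bandit.select()
    bandit.update(item, simulate_user_feedback(item))
print(max(bandit.mean, key=bandit.mean.get))  # should converge to "A2"
```

The point of the two-level index is that a confident estimate for a few representative items lets the learner rank whole key-terms without pulling every item, which is how the hierarchy can shrink the regret bound's dependence on the total number of items.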