Paper Title
Online Subset Selection using $α$-Core with no Augmented Regret
Paper Authors
Paper Abstract
We revisit the classic problem of optimal subset selection in the online learning set-up. Assume that the set $[N]$ consists of $N$ distinct elements. On the $t$th round, an adversary chooses a monotone reward function $f_t: 2^{[N]} \to \mathbb{R}_+$ that assigns a non-negative reward to each subset of $[N]$. An online policy selects (perhaps randomly) a subset $S_t \subseteq [N]$ consisting of $k$ elements before the reward function $f_t$ for the $t$th round is revealed to the learner. As a consequence of its choice, the policy receives a reward of $f_t(S_t)$ on the $t$th round. Our goal is to design an online sequential subset selection policy that maximizes the expected cumulative reward over a time horizon. To this end, we propose an online learning policy called SCore (Subset Selection with Core) that solves the problem for a large class of reward functions. The proposed SCore policy is based on a new polyhedral characterization of the reward functions called the $α$-Core, a generalization of the Core from the cooperative game theory literature. We establish a learning guarantee for the SCore policy in terms of a new performance metric called $α$-augmented regret. Under this new metric, the performance of the online policy is compared with an unrestricted offline benchmark that can select all $N$ elements on every round. We show that a large class of reward functions, including submodular functions, can be efficiently optimized with the SCore policy. We also extend the proposed policy to the optimistic learning set-up, where the learner has access to additional untrusted hints regarding the reward functions. Finally, we conclude the paper with a list of open problems.
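The Python sketch below illustrates the online protocol and the benchmark described in the abstract. It uses a toy coverage reward (monotone and submodular) and a uniformly random $k$-subset policy in place of the SCore policy; the particular form of the $α$-augmented regret shown ($α$ times the reward of the all-$N$ benchmark minus the policy's reward) is an assumed reading of the metric, and all names and parameter values are illustrative rather than taken from the paper.

```python
import math
import random

# Illustrative problem parameters: N ground elements, budget k, horizon T, and an
# approximation factor alpha (1 - 1/e is the usual submodular constant; the paper's
# actual alpha depends on the reward class and is assumed here).
N, k, T, alpha = 10, 3, 100, 1 - 1 / math.e

def make_coverage_reward(universe_size=30, rng=random):
    """Adversary's round-t reward: each element covers a random set of items and
    f_t(S) = number of items covered by S, a monotone submodular function."""
    covers = {e: frozenset(rng.sample(range(universe_size), 5)) for e in range(N)}
    return lambda S: len(set().union(*(covers[e] for e in S))) if S else 0

policy_reward, benchmark_reward = 0.0, 0.0
for t in range(T):
    S_t = set(random.sample(range(N), k))   # policy commits to k elements first ...
    f_t = make_coverage_reward()            # ... then the adversary's f_t is revealed
    policy_reward += f_t(S_t)               # reward collected by the online policy
    benchmark_reward += f_t(set(range(N)))  # unrestricted benchmark selects all N elements

# One natural reading of the alpha-augmented regret described in the abstract.
augmented_regret = alpha * benchmark_reward - policy_reward
print(f"alpha-augmented regret over {T} rounds: {augmented_regret:.1f}")
```

Replacing the random policy with a no-regret learner such as SCore would, per the paper's guarantee, keep this quantity sublinear in $T$ for the admissible reward classes.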