Paper title
Efficient Algorithms for Learning to Control Bandits with Unobserved Contexts
Paper authors
Paper abstract
Contextual bandits are widely used in the study of learning-based control policies for finite action spaces. While the problem is well studied for bandits with perfectly observed context vectors, little is known about the case of imperfectly observed contexts. For this setting, existing approaches are inapplicable, and new conceptual and technical frameworks are required. We present an implementable posterior sampling algorithm for bandits with imperfect context observations and study its performance in learning optimal decisions. The numerical results relate the performance of the algorithm to several quantities of interest, including the number of arms, the dimensions, the observation matrices, the posterior rescaling factors, and the signal-to-noise ratios. In general, the proposed algorithm exhibits efficiency in learning from noisy, imperfect observations and taking actions accordingly. The insights that the analysis provides, as well as the interesting future directions it points to, are also discussed.
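To make the setting concrete, the following is a minimal sketch of posterior (Thompson) sampling when only noisy linear observations of the contexts are available. It is not the authors' algorithm. It is a generic linear Thompson sampling loop written under assumptions that the abstract does not state: hidden contexts are standard Gaussian, each observation is `y = A x + noise`, rewards are linear in the hidden context, and the learner regresses rewards on the observations. The function name `simulate_posterior_sampling` and all parameter names are hypothetical; `scale` plays the role of a posterior rescaling factor.

```python
import numpy as np


def simulate_posterior_sampling(n_arms=5, dim=3, obs_dim=3, horizon=300,
                                noise_std=0.5, scale=1.0, seed=0):
    """Cumulative regret of a generic posterior sampling sketch (illustrative only)."""
    rng = np.random.default_rng(seed)
    mu = rng.normal(size=dim)             # unknown reward parameter (assumed)
    A = rng.normal(size=(obs_dim, dim))   # observation matrix (assumed known to nature only)

    # Under the Gaussian assumptions above, E[reward | y] = y @ eta, so the
    # best policy that sees only y picks argmax_i y_i @ eta.
    eta = np.linalg.solve(A @ A.T + noise_std**2 * np.eye(obs_dim), A @ mu)

    B = np.eye(obs_dim)                   # precision matrix with a ridge prior
    b = np.zeros(obs_dim)                 # running sum of reward * observation
    regret = np.zeros(horizon)

    for t in range(horizon):
        x = rng.normal(size=(n_arms, dim))                             # hidden contexts
        y = x @ A.T + noise_std * rng.normal(size=(n_arms, obs_dim))   # noisy observations

        # Draw a parameter from N(mean, scale^2 * B^{-1}) via a Cholesky factor of B.
        mean = np.linalg.solve(B, b)
        L = np.linalg.cholesky(B)
        sample = mean + scale * np.linalg.solve(L.T, rng.normal(size=obs_dim))

        arm = int(np.argmax(y @ sample))                   # act greedily on the sample
        reward = x[arm] @ mu + noise_std * rng.normal()    # reward noise reuses noise_std

        # Update the sufficient statistics with the chosen arm's observation.
        B += np.outer(y[arm], y[arm])
        b += reward * y[arm]

        # Regret against the oracle that knows eta but also sees only y.
        values = y @ eta
        regret[t] = values.max() - values[arm]

    return np.cumsum(regret)
```

Calling `simulate_posterior_sampling(n_arms=10, noise_std=1.0)` returns the cumulative regret curve for one simulated run. Varying `n_arms`, `dim`, `scale`, or `noise_std` gives a rough feel for the kinds of comparisons the abstract mentions, but the numbers come from this toy model and not from the paper.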