Paper Title

Self-guided Approximate Linear Programs

Paper Authors

Parshan Pakiman, Selvaprabu Nadarajah, Negar Soheili, Qihang Lin

Paper Abstract

Approximate linear programs (ALPs) are well-known models based on value function approximations (VFAs) to obtain policies and lower bounds on the optimal policy cost of discounted-cost Markov decision processes (MDPs). Formulating an ALP requires (i) basis functions, the linear combination of which defines the VFA, and (ii) a state-relevance distribution, which determines the relative importance of different states in the ALP objective for the purpose of minimizing VFA error. Both these choices are typically heuristic: basis function selection relies on domain knowledge while the state-relevance distribution is specified using the frequency of states visited by a heuristic policy. We propose a self-guided sequence of ALPs that embeds random basis functions obtained via inexpensive sampling and uses the known VFA from the previous iteration to guide VFA computation in the current iteration. Self-guided ALPs mitigate the need for domain knowledge during basis function selection as well as the impact of the initial choice of the state-relevance distribution, thus significantly reducing the ALP implementation burden. We establish high probability error bounds on the VFAs from this sequence and show that a worst-case measure of policy performance is improved. We find that these favorable implementation and theoretical properties translate to encouraging numerical results on perishable inventory control and options pricing applications, where self-guided ALP policies improve upon policies from problem-specific methods. More broadly, our research takes a meaningful step toward application-agnostic policies and bounds for MDPs.
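To make the approach described in the abstract concrete, below is a minimal sketch of a self-guided ALP sequence on a toy discounted-cost MDP. Each iteration samples additional random (Fourier) basis functions, solves the ALP, and adds constraints forcing the new VFA to dominate the previous iteration's VFA. The random instance, the specific basis sampling scheme, the helper names sample_fourier_basis and solve_alp, and the fully enumerated constraint set are our illustrative assumptions, not the paper's exact algorithm, which handles large state-action constraint sets and basis sampling with more care.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

# Toy discounted-cost MDP (assumed small random instance, for illustration only).
n_states, n_actions, gamma = 30, 3, 0.9
costs = rng.uniform(0.0, 1.0, size=(n_actions, n_states))          # c(s, a) >= 0
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))   # P[a][s, s'] transition probabilities
nu = np.full(n_states, 1.0 / n_states)                              # state-relevance distribution

# Embed the discrete states as points in [0, 1] so random Fourier features apply.
x = np.linspace(0.0, 1.0, n_states)

def sample_fourier_basis(n_basis):
    """Sample random basis functions cos(w*x + b) via cheap sampling of (w, b)."""
    w = rng.normal(0.0, 10.0, size=n_basis)
    b = rng.uniform(0.0, 2 * np.pi, size=n_basis)
    return np.cos(np.outer(x, w) + b)   # (n_states, n_basis), one column per basis function

def solve_alp(Phi, v_guide=None):
    """ALP: max nu'(Phi beta)  s.t.  (Phi beta)(s) <= c(s,a) + gamma * E[(Phi beta)(s') | s, a].
    If v_guide is given, add self-guiding constraints (Phi beta)(s) >= v_guide(s)."""
    A_ub = np.vstack([Phi - gamma * P[a] @ Phi for a in range(n_actions)])
    b_ub = np.concatenate([costs[a] for a in range(n_actions)])
    if v_guide is not None:
        A_ub = np.vstack([A_ub, -Phi])          # -(Phi beta) <= -v_guide
        b_ub = np.concatenate([b_ub, -v_guide])
    res = linprog(c=-(nu @ Phi), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * Phi.shape[1])  # weights are free variables
    assert res.success
    return Phi @ res.x   # the VFA evaluated at every state

# Self-guided sequence: grow the random basis, guiding each solve with the previous VFA.
v_prev, Phi = None, np.empty((n_states, 0))
for k in range(5):
    Phi = np.hstack([Phi, sample_fourier_basis(10)])  # embed newly sampled basis functions
    v_prev = solve_alp(Phi, v_guide=v_prev)
    print(f"iteration {k}: lower bound on nu-weighted optimal cost = {nu @ v_prev:.4f}")

Because the previous VFA remains feasible when its basis columns are retained, the printed lower bound is nondecreasing across iterations, which is the guiding role the abstract attributes to the known VFA from the previous iteration.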
