Paper Title


Timing is Everything: Learning to Act Selectively with Costly Actions and Budgetary Constraints

Authors

Mguni, David, Sootla, Aivar, Ziomek, Juliusz, Slumbers, Oliver, Dai, Zipeng, Shao, Kun, Wang, Jun

Abstract


Many real-world settings involve costs for performing actions; transaction costs in financial systems and fuel costs are common examples. In these settings, performing actions at every time step quickly accumulates costs, leading to vastly suboptimal outcomes. Additionally, repeated acting produces wear and tear and, ultimately, damage. Determining \textit{when to act} is crucial for achieving successful outcomes, and yet the challenge of efficiently \textit{learning} to behave optimally when actions incur minimally bounded costs remains unresolved. In this paper, we introduce a reinforcement learning (RL) framework named \textbf{L}earnable \textbf{I}mpulse \textbf{C}ontrol \textbf{R}einforcement \textbf{A}lgorithm (LICRA) for learning to optimally select both when to act and which actions to take when actions incur costs. At the core of LICRA is a nested structure that combines RL with a form of policy known as \textit{impulse control}, which learns to maximise objectives when actions incur costs. We prove that LICRA, which seamlessly adopts any RL method, converges to policies that optimally select when to perform actions and their optimal magnitudes. We then augment LICRA to handle problems in which the agent can perform at most $k<\infty$ actions and, more generally, faces a budget constraint. We show that LICRA learns the optimal value function and ensures budget constraints are satisfied almost surely. We empirically demonstrate LICRA's superior performance over benchmark RL methods in OpenAI Gym's \textit{Lunar Lander}, the \textit{Highway} environment, and a variant of the Merton portfolio problem in finance.
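To make the nested "when to act / what to do" structure described in the abstract concrete, the sketch below rolls out an episode in which an outer policy decides whether to intervene at the current state and an inner policy chooses the action magnitude only when an intervention is made; each intervention incurs a cost, and the number of interventions is capped at a budget of k. This is a minimal illustrative sketch, not the authors' implementation: the placeholder policies, the environment, and the values of `ACTION_COST` and `BUDGET_K` are assumptions, and in LICRA the two policies would be trained with a standard RL method.

```python
import numpy as np

ACTION_COST = 0.1   # cost charged each time the agent intervenes (assumed value)
BUDGET_K = 5        # at most k interventions per episode (assumed value)


def should_act(state, rng):
    """Outer 'impulse' policy: decide whether to intervene in this state.
    Placeholder: intervene with small probability."""
    return rng.random() < 0.2


def action_magnitude(state, rng):
    """Inner policy: choose the action magnitude, used only when intervening.
    Placeholder: sample a bounded continuous action."""
    return rng.uniform(-1.0, 1.0)


def run_episode(env_step, initial_state, horizon, rng):
    """Roll out one episode with the nested when-to-act / what-to-do structure.

    `env_step(state, action)` is a hypothetical environment transition returning
    (next_state, reward); action = 0.0 means 'do nothing'.
    """
    state, total_return, interventions = initial_state, 0.0, 0
    for _ in range(horizon):
        if interventions < BUDGET_K and should_act(state, rng):
            action = action_magnitude(state, rng)
            interventions += 1
            cost = ACTION_COST          # acting is costly
        else:
            action, cost = 0.0, 0.0     # no intervention, no cost
        state, reward = env_step(state, action)
        total_return += reward - cost   # cost is subtracted from the return
    return total_return, interventions


# Toy usage with a trivial 1-D environment: the agent is rewarded for keeping
# the state near zero, and an action nudges the state.
def toy_env_step(state, action):
    next_state = 0.95 * state + action + np.random.normal(scale=0.05)
    reward = -abs(next_state)
    return next_state, reward


rng = np.random.default_rng(0)
ret, n_acts = run_episode(toy_env_step, initial_state=1.0, horizon=50, rng=rng)
print(f"return={ret:.2f}, interventions={n_acts}")
```

Because the per-intervention cost is charged only when the outer policy chooses to act, acting at every step is penalised, and the budget cap enforces the at-most-$k$ constraint by construction.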
