Paper title
gTLO: A Generalized and Non-linear Multi-Objective Deep Reinforcement Learning Approach
Paper authors
Paper abstract
In real-world decision optimization, multiple competing objectives often must be taken into account. In classical reinforcement learning, these objectives have to be combined into a single reward function. In contrast, multi-objective reinforcement learning (MORL) methods learn from vectors of per-objective rewards. In multi-policy MORL, sets of decision policies are optimized for various preferences regarding the conflicting objectives. This is especially important when target preferences are not known during training or when preferences change dynamically during application. While it is generally straightforward to extend a single-objective reinforcement learning method to MORL via linear scalarization, the solutions reachable by such methods are limited to convex regions of the Pareto front. Non-linear MORL methods such as Thresholded Lexicographic Ordering (TLO) are designed to overcome this limitation. Generalized MORL methods use function approximation to generalize across objective preferences and thereby implicitly learn multiple policies in a data-efficient manner, even for complex decision problems with high-dimensional or continuous state spaces. In this work, we propose generalized Thresholded Lexicographic Ordering (gTLO), a novel method that aims to combine non-linear MORL with the advantages of generalized MORL. We introduce a deep reinforcement learning realization of the algorithm and present promising results on a standard benchmark for non-linear MORL and on a real-world application from the domain of manufacturing process control.
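The convexity limitation mentioned in the abstract can be illustrated with a small sketch. It compares linear scalarization with a simplified, single-threshold TLO selection rule on three made-up two-objective value vectors, where the middle one is Pareto-optimal but lies in a concave region of the front. The numbers, the function names, and the two-objective setup are assumptions chosen for illustration; this is not the gTLO algorithm proposed in the paper.

```python
import numpy as np

# Hypothetical value vectors (objective 0, objective 1), both to be maximized.
# Index 1 is non-dominated but lies below the line joining indices 0 and 2,
# i.e. in a concave region of the Pareto front.
VALUES = np.array([
    [0.0, 10.0],
    [4.0, 4.0],
    [10.0, 0.0],
])


def linear_scalarization_choice(values, w):
    """Pick the row maximizing w * obj0 + (1 - w) * obj1."""
    weights = np.array([w, 1.0 - w])
    return int(np.argmax(values @ weights))


def tlo_choice(values, threshold):
    """Simplified TLO for two objectives (illustrative assumption).

    Objective 0 is clipped at `threshold`; candidates are then compared
    lexicographically by (clipped objective 0, objective 1).
    """
    clipped = np.minimum(values[:, 0], threshold)
    # np.lexsort uses the LAST key as the primary sort key.
    order = np.lexsort((values[:, 1], clipped))
    return int(order[-1])  # largest element in ascending order


if __name__ == "__main__":
    weights = np.linspace(0.0, 1.0, 101)
    reachable = {linear_scalarization_choice(VALUES, w) for w in weights}
    print("reachable by linear scalarization:", sorted(reachable))
    for t in (0.0, 4.0, 10.0):
        print(f"TLO with threshold {t}: index {tlo_choice(VALUES, t)}")
```

In this toy setting, no weight makes the linear rule select index 1, because its scalarized value is always 4 while one of the two extreme points scores at least 5. Varying the threshold lets the TLO rule select each of the three points in turn.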