Paper Title

Cluster-Based Control of Transition-Independent MDPs

Paper Authors

Carmel Fiscko, Soummya Kar, Bruno Sinopoli

Paper Abstract

This work studies efficient solution methods for cluster-based control policies of transition-independent Markov decision processes (TI-MDPs). We focus on control of multi-agent systems, whereby a central planner (CP) influences agents to select desirable group behavior. The agents are partitioned into disjoint clusters such that agents in the same cluster receive the same controls but agents in different clusters may receive different controls. Under mild assumptions, this process can be modeled as a TI-MDP where each factor describes the behavior of one cluster. The action space of the TI-MDP grows exponentially with the number of clusters. To efficiently find a policy in this rapidly scaling space, we propose a clustered Bellman operator that, at each evaluation, optimizes over the action space of a single cluster. We present Clustered Value Iteration (CVI), which uses this operator to iteratively perform "round robin" optimization across the clusters. CVI converges exponentially faster than standard value iteration (VI), and can find policies that closely approximate the MDP's true optimal value. A special class of TI-MDPs with separable reward functions is investigated, and it is shown that CVI finds optimal policies on this class of problems. Next, the optimal clustering assignment problem is explored. The value functions of TI-MDPs with submodular reward functions are shown to be submodular, so submodular set optimization may be used to find a near-optimal clustering assignment. We propose an iterative greedy cluster splitting algorithm, which yields monotonic improvement in value at each iteration. Finally, simulations offer empirical assessment of the proposed methods.
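To make the "round robin" idea concrete, here is a minimal Python sketch of clustered value iteration on a toy two-cluster TI-MDP. Everything below is an illustrative assumption for exposition, not the paper's implementation: the model sizes, the random transition kernels and rewards, and the helper name `backup` are all invented. Each sweep applies the clustered Bellman operator, optimizing the action of one cluster while holding the other clusters' actions fixed.

```python
# Toy sketch of round-robin Clustered Value Iteration (CVI) on a
# transition-independent MDP. All sizes and the random model are
# illustrative assumptions, not the paper's implementation.
import itertools
import numpy as np

rng = np.random.default_rng(0)

n_clusters = 2   # number of disjoint agent clusters (factors)
n_states = 3     # states per factor
n_actions = 2    # actions available to each cluster
gamma = 0.9      # discount factor

# One independent transition kernel per cluster: P[c][a, s, s'].
P = [rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
     for _ in range(n_clusters)]

# Joint reward over the product state space and joint action,
# indexed by (s_0, s_1, a_0, a_1) in this two-cluster toy model.
R = rng.random((n_states,) * n_clusters + (n_actions,) * n_clusters)

joint_states = list(itertools.product(range(n_states), repeat=n_clusters))
V = {s: 0.0 for s in joint_states}

def backup(s, a, V):
    """One Bellman backup for joint state s and joint action a, using
    transition independence: P(s'|s,a) = prod_c P_c(s'_c | s_c, a_c)."""
    q = R[s + a]
    for s_next in joint_states:
        p = np.prod([P[c][a[c], s[c], s_next[c]] for c in range(n_clusters)])
        q += gamma * p * V[s_next]
    return q

# Round-robin CVI: each sweep optimizes the action of one cluster
# while holding the other clusters' actions fixed at the current policy.
policy = {s: tuple(0 for _ in range(n_clusters)) for s in joint_states}
for sweep in range(200):
    c = sweep % n_clusters                    # cluster updated this sweep
    new_V = {}
    for s in joint_states:
        best_q, best_a = -np.inf, policy[s]
        for a_c in range(n_actions):          # search only cluster c's actions
            a = tuple(a_c if i == c else policy[s][i]
                      for i in range(n_clusters))
            q = backup(s, a, V)
            if q > best_q:
                best_q, best_a = q, a
        new_V[s], policy[s] = best_q, best_a
    if max(abs(new_V[s] - V[s]) for s in joint_states) < 1e-8:
        V = new_V
        break
    V = new_V

print({s: round(v, 3) for s, v in V.items()})
```

The key point of the sketch is the inner loop: each application of the operator searches only the `n_actions` choices of one cluster rather than the `n_actions ** n_clusters` joint actions, which is what keeps the per-iteration cost from scaling exponentially in the number of clusters.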
