Paper Title
Minimizing the Age of Incorrect Information for Unknown Markovian Source
Paper Authors
Paper Abstract
Age of information minimization problems have been extensively studied in the framework of real-time monitoring applications. In this paper, we consider the problem of monitoring the state of an unknown remote source that evolves according to a Markovian process. A central scheduler decides at each time slot whether or not to schedule the source, so as to receive new status updates in a way that minimizes the Mean Age of Incorrect Information (MAoII). When the scheduler knows the source's parameters, we formulate the minimization problem as a Markov Decision Process (MDP) and prove that the optimal solution is a threshold-based policy. When the source's parameters are unknown, the difficulty lies in finding a strategy with a good trade-off between exploitation and exploration: the scheduler must run an algorithm that jointly estimates the unknown parameters (exploration) and minimizes the MAoII (exploitation). However, under our system model, the source can only be explored when the monitor decides to schedule it. A greedy approach therefore risks stopping the exploration process for good: at some point, the current estimate of the Markovian source's parameters may be one for which the corresponding optimal policy is never to transmit. From then on, the estimate of the unknown parameters can no longer improve, which may significantly degrade the algorithm's performance. To avoid this pitfall, we develop a new learning algorithm that strikes a good balance between exploration and exploitation. We then theoretically analyze the performance of our algorithm against a genie solution, proving that the regret at time T grows as O(log T). Finally, we provide numerical results that highlight the performance of our derived policy compared to the greedy approach.
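To make the scheduling-and-learning loop described above concrete, here is a minimal Python sketch. It is not the paper's algorithm: it assumes a symmetric two-state source that flips state with an unknown probability p_true at each slot, instantaneous and error-free status updates, an illustrative guess for the threshold map tau(p_hat), and forced exploration at exponentially spaced pairs of slots so that, unlike the pure greedy rule, parameter estimation never stalls. All names (simulate, p_true, p_hat, tau) are hypothetical.

```python
import random

def simulate(p_true=0.15, horizon=100_000, seed=0):
    """Threshold scheduling of a two-state Markov source with forced
    exploration, so that parameter estimation never stops (a sketch)."""
    rng = random.Random(seed)

    source = 0       # true state of the source
    belief = 0       # monitor's last received state
    aoii = 0         # current Age of Incorrect Information
    aoii_sum = 0     # running sum, to report the empirical MAoII

    flips = 0        # observed one-slot flips of the source
    samples = 0      # number of one-slot-apart observation pairs
    last_obs = None  # (slot, state) of the previous scheduled update
    p_hat = 0.5      # prior guess before any sample arrives

    # Forced exploration at exponentially spaced slots, scheduled in
    # consecutive pairs so each pair yields one clean flip sample.
    forced = {2 ** k for k in range(18)}
    forced |= {s + 1 for s in forced}

    for t in range(1, horizon + 1):
        # The source flips with the unknown probability p_true.
        if rng.random() < p_true:
            source ^= 1

        # Plug-in estimate of the flip probability.
        if samples:
            p_hat = flips / samples

        # Threshold rule: schedule once AoII reaches tau(p_hat). The
        # paper proves the optimal policy is threshold-based; this
        # particular mapping from p_hat to tau is an illustrative guess.
        tau = max(1, round(0.5 / max(p_hat, 1e-6)))

        if aoii >= tau or t in forced:
            # Scheduled update: assume instantaneous, error-free delivery.
            # Observations exactly one slot apart give a Bernoulli(p) sample.
            if last_obs is not None and t - last_obs[0] == 1:
                samples += 1
                flips += int(source != last_obs[1])
            last_obs = (t, source)
            belief = source

        # AoII grows by one per slot of mismatch and resets on agreement
        # (via an update, or the source drifting back on its own).
        aoii = aoii + 1 if belief != source else 0
        aoii_sum += aoii

    return aoii_sum / horizon, p_hat
```

Because the forced slots are exponentially spaced, they add only O(log T) extra transmissions over a horizon of T slots, which is consistent in spirit with the logarithmic regret claimed above; the paper's actual algorithm and analysis differ.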