Paper Title

A Simulation Environment and Reinforcement Learning Method for Waste Reduction

Paper Authors

Jullien, Sami, Ariannezhad, Mozhdeh, Groth, Paul, de Rijke, Maarten

Abstract


In retail (e.g., grocery stores, apparel shops, online retailers), inventory managers have to balance short-term risk (no items to sell) with long-term risk (over-ordering leading to product waste). This balancing task is made especially hard due to the lack of information about future customer purchases. In this paper, we study the problem of restocking a grocery store's inventory with perishable items over time, from a distributional point of view. The objective is to maximize sales while minimizing waste, with uncertainty about the actual consumption by customers. This problem is highly relevant today, given the growing demand for food and the impact of food waste on the environment, the economy, and purchasing power. We frame inventory restocking as a new reinforcement learning task that exhibits stochastic behavior conditioned on the agent's actions, making the environment partially observable. We make two main contributions. First, we introduce a new reinforcement learning environment, RetaiL, based on real grocery store data and expert knowledge. This environment is highly stochastic, and presents a unique challenge for reinforcement learning practitioners. We show that uncertainty about the future behavior of the environment is not handled well by classical supply chain algorithms, and that distributional approaches are a good way to account for the uncertainty. Second, we introduce GTDQN, a distributional reinforcement learning algorithm that learns a generalized Tukey Lambda distribution over the reward space. GTDQN provides a strong baseline for our environment. It outperforms other distributional reinforcement learning approaches in this partially observable setting, in both overall reward and reduction of generated waste.
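The abstract does not give implementation details for GTDQN, but the generalized (Tukey) Lambda distribution it learns is conveniently defined through its quantile function, which also makes inverse-transform sampling trivial. Below is a minimal sketch in the common Ramberg–Schmeiser parameterization; the parameter names `lam1`–`lam4` (location, inverse scale, and the two shape parameters) and the function names are illustrative assumptions, not the paper's API.

```python
import random


def gld_quantile(u, lam1, lam2, lam3, lam4):
    """Quantile function of the generalized (Tukey) Lambda distribution,
    Ramberg-Schmeiser parameterization:
        Q(u) = lam1 + (u**lam3 - (1 - u)**lam4) / lam2,  0 < u < 1.
    lam1 is location, lam2 an inverse scale, lam3/lam4 shape the tails."""
    return lam1 + (u ** lam3 - (1.0 - u) ** lam4) / lam2


def sample_gld(n, lam1, lam2, lam3, lam4, rng=random):
    """Inverse-transform sampling: push uniform draws through the quantile
    function. A distributional agent could sample returns this way from a
    predicted (lam1, lam2, lam3, lam4) tuple."""
    return [gld_quantile(rng.random(), lam1, lam2, lam3, lam4)
            for _ in range(n)]
```

With symmetric shape parameters (`lam3 == lam4`), the distribution's median equals `lam1`, which is why quantile-defined families like this pair naturally with distributional RL losses computed on sampled quantiles.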
