Paper Title

Visual processing in context of reinforcement learning

Author

Hlynsson, Hlynur Davíð

Abstract

Although deep reinforcement learning (RL) has recently enjoyed many successes, its methods are still data-inefficient, which makes solving numerous problems prohibitively expensive in terms of data. We aim to remedy this by taking advantage of the rich supervisory signal in unlabeled data for learning state representations. This thesis introduces three different representation learning algorithms that have access to different subsets of the data sources that traditional RL algorithms use: (i) GrICA is inspired by independent component analysis (ICA) and trains a deep neural network to output statistically independent features of the input. GrICA does so by minimizing the mutual information between each feature and the other features. Additionally, GrICA only requires an unsorted collection of environment states. (ii) Latent Representation Prediction (LARP) requires more context: in addition to requiring a state as an input, it also needs the previous state and an action that connects them. This method learns state representations by predicting the representation of the environment's next state given a current state and action. The predictor is used with a graph search algorithm. (iii) RewPred learns a state representation by training a deep neural network to learn a smoothed version of the reward function. The representation is used for preprocessing inputs to deep RL, while the reward predictor is used for reward shaping. This method needs only state-reward pairs from the environment for learning the representation. We discover that each method has its strengths and weaknesses, and conclude from our experiments that including unsupervised representation learning in RL problem-solving pipelines can speed up learning.
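
To make the third idea concrete, below is a minimal sketch of a RewPred-style pretraining step: a network is trained to regress a smoothed reward signal from raw states, and its hidden layer is then reused as a state representation for a downstream RL agent, with the predicted reward available for shaping. This is not the thesis's actual implementation; the architecture, the exponential-moving-average smoothing, and all names (RewardPredictor, smooth_rewards, hidden_dim) are illustrative assumptions.

```python
# Hedged sketch of a RewPred-style representation learner (assumptions noted above).
import torch
import torch.nn as nn

class RewardPredictor(nn.Module):
    def __init__(self, state_dim: int, hidden_dim: int = 64):
        super().__init__()
        # Feature extractor; its output is reused later as the state representation.
        self.features = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.head = nn.Linear(hidden_dim, 1)  # predicts the smoothed reward

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(state)).squeeze(-1)

def smooth_rewards(rewards: torch.Tensor, gamma: float = 0.9) -> torch.Tensor:
    """One possible smoothing: an exponential moving average over the reward sequence."""
    smoothed = torch.zeros_like(rewards)
    running = rewards[0]
    for t, r in enumerate(rewards):
        running = gamma * running + (1.0 - gamma) * r
        smoothed[t] = running
    return smoothed

# Toy training loop on random (state, reward) pairs; note that only state-reward
# pairs are needed, matching the data requirement stated in the abstract.
states = torch.randn(256, 8)       # 256 states, 8-dimensional (placeholder data)
rewards = torch.randn(256)         # raw reward observed for each state
targets = smooth_rewards(rewards)  # smoothed regression targets

model = RewardPredictor(state_dim=8)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(states), targets)
    loss.backward()
    optimizer.step()

# Downstream use: model.features(s) preprocesses states for the RL agent,
# while model(s) can serve as a shaping term added to the environment reward.
representation = model.features(states)
```

The design choice worth noting is that the same pretrained network serves two roles: its hidden layer becomes the input representation for the RL algorithm, and its scalar output provides the reward-shaping signal, so no additional labels beyond observed rewards are required.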
