Paper Title

Hyperparameter optimization with REINFORCE and Transformers

Paper Authors

Chepuri Shri Krishna, Ashish Gupta, Swarnim Narayan, Himanshu Rai, Diksha Manchanda

Paper Abstract

Reinforcement Learning has yielded promising results for Neural Architecture Search (NAS). In this paper, we demonstrate how its performance can be improved by using a simplified Transformer block to model the policy network. The simplified Transformer uses a 2-stream attention-based mechanism to model hyperparameter dependencies while avoiding layer normalization and position encoding. We posit that this parsimonious design balances model complexity against expressiveness, making it suitable for discovering optimal architectures in high-dimensional search spaces with limited exploration budgets. We demonstrate how the algorithm's performance can be further improved by a) using an actor-critic style algorithm instead of vanilla policy gradient and b) ensembling Transformer blocks with shared parameters, each block conditioned on a different autoregressive factorization order. Our algorithm works well as both a NAS and a generic hyperparameter optimization (HPO) algorithm: it outperformed most algorithms on NAS-Bench-101, a public dataset for benchmarking NAS algorithms. In particular, it outperformed RL-based methods that use alternative architectures to model the policy network, underlining the value of using attention-based networks in this setting. As a generic HPO algorithm, it outperformed Random Search in discovering more accurate multi-layer perceptron architectures across two regression tasks. We adhered to the guidelines listed in Lindauer and Hutter while designing experiments and reporting results.
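
Below is a minimal, illustrative sketch (not the authors' code) of the ingredients named in the abstract: an attention-based policy with residual connections but no layer normalization or position encoding, trained with REINFORCE plus a learned value baseline (actor-critic style), sampling under varying autoregressive factorization orders. For brevity it uses a single-stream stand-in for the paper's 2-stream attention, and it randomizes the factorization order per step rather than ensembling shared-parameter blocks. All class, function, and parameter names, the dimensions, and the reward function are hypothetical assumptions for illustration.

```python
# Hypothetical sketch of REINFORCE with a simplified attention-based policy
# for hyperparameter search. Not the authors' released implementation.
import random

import torch
import torch.nn as nn


class AttentionPolicy(nn.Module):
    """Simplified Transformer-style policy: residual attention + feed-forward,
    with no layer normalization and no position encoding, per the abstract."""

    def __init__(self, num_slots, choices_per_slot, d_model=64, n_heads=4):
        super().__init__()
        self.num_slots = num_slots
        # One learned query vector per hyperparameter slot.
        self.slot_emb = nn.Parameter(0.02 * torch.randn(num_slots, d_model))
        self.choice_emb = nn.Embedding(choices_per_slot, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        self.head = nn.Linear(d_model, choices_per_slot)  # actor logits
        self.value_head = nn.Linear(d_model, 1)           # critic baseline

    def step(self, slot, prefix):
        """Logits and value for one slot, conditioned on choices made so far."""
        q = self.slot_emb[slot].view(1, 1, -1)
        if prefix:
            kv = self.choice_emb(torch.tensor(prefix)).unsqueeze(0)
        else:
            kv = q  # nothing sampled yet: attend to the query itself
        h, _ = self.attn(q, kv, kv)
        h = h + self.ff(h)  # residual connection, deliberately no LayerNorm
        return self.head(h).squeeze(0).squeeze(0), self.value_head(h).mean()


def reinforce_step(policy, optimizer, reward_fn):
    """Sample one configuration under a random factorization order and apply
    a REINFORCE update with the critic's value as baseline."""
    order = list(range(policy.num_slots))
    random.shuffle(order)  # a different autoregressive factorization order
    config = [None] * policy.num_slots
    log_probs, values, prefix = [], [], []
    for slot in order:
        logits, value = policy.step(slot, prefix)
        dist = torch.distributions.Categorical(logits=logits)
        choice = dist.sample()
        log_probs.append(dist.log_prob(choice))
        values.append(value)
        config[slot] = choice.item()
        prefix.append(choice.item())
    reward = reward_fn(config)  # e.g. validation accuracy of the sampled model
    advantage = reward - torch.stack(values).mean()
    # Policy-gradient loss with a detached advantage, plus a critic loss that
    # regresses the baseline toward the observed reward.
    loss = -torch.stack(log_probs).sum() * advantage.detach() + advantage.pow(2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return config, reward
```

A hypothetical driver loop would call reinforce_step repeatedly, with a reward_fn that trains and evaluates the sampled configuration, keeping the best configuration found within the exploration budget.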
