Training-free Transformer Architecture Search

Paper title

Training-free Transformer Architecture Search

Authors

Qinqin Zhou, Kekai Sheng, Xiawu Zheng, Ke Li, Xing Sun, Yonghong Tian, Jie Chen, Rongrong Ji

Abstract

Recently, Vision Transformer (ViT) has achieved remarkable success in several computer vision tasks. This progress is closely tied to architecture design, so it is worthwhile to propose Transformer Architecture Search (TAS) to search for better ViTs automatically. However, according to our experimental observations, current TAS methods are time-consuming, and existing zero-cost proxies designed for CNNs do not generalize well to the ViT search space. In this paper, we investigate for the first time how to conduct TAS in a training-free manner and devise an effective training-free TAS (TF-TAS) scheme. First, we observe that the properties of multi-head self-attention (MSA) and multi-layer perceptron (MLP) in ViTs are quite different and that the synaptic diversity of MSA notably affects performance. Second, based on this observation, we devise a modular strategy in TF-TAS that evaluates and ranks ViT architectures from two theoretical perspectives, synaptic diversity and synaptic saliency, termed the DSS-indicator. With the DSS-indicator, evaluation results are strongly correlated with the test accuracies of ViT models. Experimental results demonstrate that TF-TAS achieves competitive performance against state-of-the-art manually or automatically designed ViT architectures and greatly improves search efficiency in the ViT search space: from about $24$ GPU days to less than $0.5$ GPU days. Moreover, the proposed DSS-indicator outperforms existing cutting-edge zero-cost approaches (e.g., TE-score and NASWOT).
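
The abstract claims that indicator scores correlate strongly with test accuracy. One common way to check such a claim for a training-free proxy is a rank correlation such as Kendall's tau; the abstract does not say which statistic the authors use, so the following is only a minimal sketch under that assumption. The function name `kendall_tau` and all numbers are hypothetical illustrations, not results from the paper.

```python
from itertools import combinations


def kendall_tau(scores, accuracies):
    """Kendall's tau-a between two equal-length sequences.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(scores) != len(accuracies):
        raise ValueError("sequences must have the same length")
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two items to compare")

    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        # A pair is concordant when both sequences order it the same way.
        product = (scores[i] - scores[j]) * (accuracies[i] - accuracies[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1

    total_pairs = n * (n - 1) / 2
    return (concordant - discordant) / total_pairs


# Hypothetical proxy scores and test accuracies for five candidate
# architectures (made-up values, for illustration only).
proxy_scores = [0.9, 2.1, 1.4, 3.0, 2.6]
test_accuracies = [71.2, 74.8, 73.9, 76.1, 74.5]

tau = kendall_tau(proxy_scores, test_accuracies)

# A training-free search would pick the candidate with the best proxy score
# instead of training every candidate.
best_index = max(range(len(proxy_scores)), key=proxy_scores.__getitem__)

print(f"Kendall tau: {tau:.2f}, selected candidate: {best_index}")
```

A tau close to 1 means the proxy ranks architectures almost the same way as their trained accuracy does, which is what makes selecting by score alone a reasonable substitute for training each candidate.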
