Paper Title
Foundation Transformers
Paper Authors
Paper Abstract
A big convergence of model architectures across language, vision, speech, and multimodal modeling is emerging. However, under the same name "Transformers", these areas use different implementations for better performance, e.g., Post-LayerNorm for BERT versus Pre-LayerNorm for GPT and vision Transformers. We call for the development of Foundation Transformers for true general-purpose modeling: a go-to architecture for various tasks and modalities with guaranteed training stability. In this work, we introduce a Transformer variant, named Magneto, to fulfill that goal. Specifically, we propose Sub-LayerNorm for good expressivity, together with an initialization strategy theoretically derived from DeepNet for stable scaling up. Extensive experiments demonstrate its superior performance and better stability over the de facto Transformer variants designed for various applications, including language modeling (i.e., BERT and GPT), machine translation, vision pretraining (i.e., BEiT), speech recognition, and multimodal pretraining (i.e., BEiT-3).
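The abstract names two concrete ingredients: Sub-LayerNorm (an extra LayerNorm placed before each sublayer's output projection, on top of the usual Pre-LN) and an initialization scaled by a factor derived from DeepNet. Below is a minimal, illustrative PyTorch sketch of a feed-forward sublayer in that spirit; it is not the authors' released implementation. The module name SubLNFeedForward, the GELU activation, and the free `gamma` argument are assumptions for illustration — the actual value of the scaling factor is derived in the paper from the model's depth and architecture.

```python
import torch
import torch.nn as nn

class SubLNFeedForward(nn.Module):
    """Illustrative feed-forward sublayer with Sub-LayerNorm-style normalization:
    one LayerNorm on the sublayer input (as in Pre-LN) and a second LayerNorm
    just before the output projection."""

    def __init__(self, d_model: int, d_ffn: int, gamma: float = 1.0):
        super().__init__()
        self.ln_in = nn.LayerNorm(d_model)   # norm on the sublayer input
        self.fc1 = nn.Linear(d_model, d_ffn)
        self.act = nn.GELU()                 # activation choice is an assumption
        self.ln_out = nn.LayerNorm(d_ffn)    # extra norm before the output projection
        self.fc2 = nn.Linear(d_ffn, d_model)

        # Scale selected weights by gamma, in the spirit of the DeepNet-derived
        # initialization mentioned in the abstract; the paper specifies how gamma
        # depends on depth, which is not reproduced here.
        for lin in (self.fc1, self.fc2):
            nn.init.xavier_normal_(lin.weight, gain=gamma)
            nn.init.zeros_(lin.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection around the whole sublayer.
        return x + self.fc2(self.ln_out(self.act(self.fc1(self.ln_in(x)))))

# Example usage with illustrative shapes:
# block = SubLNFeedForward(d_model=768, d_ffn=3072, gamma=1.5)
# y = block(torch.randn(2, 16, 768))   # (batch, sequence, hidden)
```

The same pattern applies to the attention sublayer: a LayerNorm on the input before the query/key/value projections and a second one before the output projection, which is what distinguishes Sub-LN from plain Pre-LN.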