Title


You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model

Authors

Shengkun Tang, Yaqing Wang, Zhenglun Kong, Tianchi Zhang, Yao Li, Caiwen Ding, Yanzhi Wang, Yi Liang, Dongkuan Xu

Abstract


Large-scale Transformer models bring significant improvements to various downstream vision-language tasks with a unified architecture. The performance improvements come with increasing model size, resulting in slow inference and increased serving cost. While some predictions benefit from the full complexity of a large-scale model, not all inputs need the same amount of computation, potentially wasting computational resources. To handle this challenge, early exiting adaptively allocates computational power according to input complexity to improve inference efficiency. Existing early exiting strategies usually adopt the output confidence of intermediate layers as a proxy for input complexity when deciding whether to skip the following layers. However, such strategies cannot be applied to the encoder of the widely used unified architecture with both an encoder and a decoder, because output confidence is difficult to estimate in the encoder; ignoring early exiting in the encoder component is suboptimal in terms of saved computation. To handle this challenge, we propose a novel early exiting strategy for unified vision-language models, named \textbf{MuE}, which dynamically skips layers in both the encoder and the decoder based on layer-wise input similarities, allowing multiple early exits. By decomposing the image and text modalities in the encoder, MuE is flexible and can skip different numbers of layers per modality, advancing inference efficiency while minimizing the performance drop. Experiments on the SNLI-VE and MS COCO datasets show that the proposed approach MuE can reduce expected inference time by up to 50\% and 40\% while maintaining 99\% and 96\% of performance, respectively.
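The abstract describes exiting once layer-wise input similarities indicate that representations have stopped changing, but gives no pseudocode. The following is a minimal illustrative sketch, not the paper's actual method: it assumes a cosine-similarity test between consecutive hidden states, and the function names and threshold value are made up for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two flattened hidden states.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def run_with_early_exit(layers, x, threshold=0.99):
    """Run a stack of layers, exiting as soon as consecutive hidden
    states become nearly identical (i.e., the representation saturates).

    Returns the final hidden state and the number of layers executed.
    """
    hidden = x
    for i, layer in enumerate(layers):
        new_hidden = layer(hidden)
        if cosine_similarity(hidden.ravel(), new_hidden.ravel()) >= threshold:
            return new_hidden, i + 1  # exit early; remaining layers are skipped
        hidden = new_hidden
    return hidden, len(layers)

# Toy demo: the first "layer" changes the representation a lot,
# the remaining ones barely change it, so the loop exits after layer 2.
def big_change(h):
    return h + np.array([10.0, -3.0, 2.0, 1.0])

def tiny_change(h):
    return h * 1.001

layers = [big_change] + [tiny_change] * 5
out, layers_used = run_with_early_exit(layers, np.ones(4))
```

In MuE this kind of decision is made per modality in the encoder (image and text streams can exit at different depths) and again in the decoder, which is what allows "multiple exits" for a single input.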
