MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing

Paper title

MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing

Authors

Zhaofan Qiu, Ting Yao, Chong-Wah Ngo, Tao Mei

Abstract

Convolutional Neural Networks (CNNs) have been regarded as the go-to models for visual recognition. More recently, convolution-free networks based on multi-head self-attention (MSA) or multi-layer perceptrons (MLPs) have become increasingly popular. Nevertheless, applying these newly minted networks to video recognition is not trivial, owing to the large variations and complexities in video data. In this paper, we present MLP-3D networks, a novel MLP-like 3D architecture for video recognition. Specifically, the architecture consists of MLP-3D blocks, where each block contains one MLP applied across tokens (i.e., token-mixing MLP) and one MLP applied independently to each token (i.e., channel MLP). By deriving the novel grouped time mixing (GTM) operations, we equip the basic token-mixing MLP with the ability of temporal modeling. GTM divides the input tokens into several temporal groups and linearly maps the tokens in each group with a shared projection matrix. Furthermore, we devise several variants of GTM with different grouping strategies, and compose each variant in different blocks of the MLP-3D network by greedy architecture search. Without depending on convolutions or attention mechanisms, our MLP-3D networks achieve 68.5%/81.4% top-1 accuracy on the Something-Something V2 and Kinetics-400 datasets, respectively. Despite requiring fewer computations, the results are comparable to those of state-of-the-art, widely used 3D CNNs and video transformers. Source code is available at https://github.com/ZhaofanQiu/MLP-3D.
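To make the GTM description in the abstract more concrete, the following is a minimal sketch in plain Python, not the authors' implementation (see the linked repository for that). It assumes one particular reading of the abstract: tokens are ordered by time, groups are contiguous and of equal size, and a single square matrix is reused for every group. The function name `grouped_time_mixing`, the list-of-lists layout, and the contiguous grouping strategy are all assumptions made for illustration; the abstract mentions several grouping variants, and only one possible variant is shown here.

```python
def grouped_time_mixing(tokens, weight):
    """Mix tokens along time within contiguous groups (illustrative sketch).

    tokens: list of T token vectors (each a list of C floats), ordered by time.
    weight: g x g matrix (list of lists) shared by every temporal group.
    Returns a new list of T mixed token vectors.
    """
    group_size = len(weight)
    if any(len(row) != group_size for row in weight):
        raise ValueError("weight must be a square matrix")
    if len(tokens) % group_size != 0:
        raise ValueError("number of tokens must be divisible by the group size")

    channels = len(tokens[0])
    mixed = []
    # Assumed grouping strategy: consecutive time steps form one group.
    for start in range(0, len(tokens), group_size):
        group = tokens[start:start + group_size]
        # The same projection matrix is applied to every group.
        for row in weight:
            mixed.append([
                sum(w * token[c] for w, token in zip(row, group))
                for c in range(channels)
            ])
    return mixed


if __name__ == "__main__":
    # Four time steps, two channels, groups of two time steps each.
    tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
    weight = [[0.5, 0.5], [0.5, -0.5]]
    print(grouped_time_mixing(tokens, weight))
```

Because the projection matrix is shared, the number of mixing parameters in this sketch depends only on the group size (g x g), not on the total number of time steps, which is one plausible reason a grouped design could keep computation low.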
