Paper Title
Spatial-Temporal Multi-Cue Network for Continuous Sign Language Recognition
Paper Authors
Paper Abstract
Despite the recent success of deep learning in continuous sign language recognition (CSLR), deep models typically focus on the most discriminative features, ignoring other potentially non-trivial and informative content. Such a characteristic heavily constrains their capability to learn the implicit visual grammars behind the collaboration of different visual cues (i.e., hand shape, facial expression, and body posture). By injecting multi-cue learning into the neural network design, we propose a spatial-temporal multi-cue (STMC) network to solve the vision-based sequence learning problem. Our STMC network consists of a spatial multi-cue (SMC) module and a temporal multi-cue (TMC) module. The SMC module is dedicated to spatial representation and explicitly decomposes the visual features of different cues with the aid of a self-contained pose estimation branch. The TMC module models temporal correlations along two parallel paths, i.e., intra-cue and inter-cue, which aim to preserve the uniqueness of each cue and explore the collaboration among multiple cues. Finally, we design a joint optimization strategy to achieve end-to-end sequence learning of the STMC network. To validate its effectiveness, we perform experiments on three large-scale CSLR benchmarks: PHOENIX-2014, CSL, and PHOENIX-2014-T. Experimental results demonstrate that the proposed method achieves new state-of-the-art performance on all three benchmarks.
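The two-path design of the TMC module can be illustrated with a minimal sketch. This is not the authors' implementation: the cue set (frame, hands, face, pose), the feature dimensions, and the simple averaging filter standing in for a learned temporal convolution are all illustrative assumptions; the point is only the split between a per-cue (intra-cue) path and a concatenated (inter-cue) path.

```python
import numpy as np

def temporal_conv(x, k=5):
    """Temporal smoothing along axis 0 (a stand-in for a learned 1-D conv).

    x: (T, D) feature sequence; returns an array of the same shape.
    """
    kernel = np.ones(k) / k
    return np.stack(
        [np.convolve(x[:, d], kernel, mode="same") for d in range(x.shape[1])],
        axis=1,
    )

def tmc_block(cues):
    """cues: list of (T, D) arrays, one per visual cue.

    Intra-cue path: filter each cue separately, preserving its uniqueness.
    Inter-cue path: filter the concatenation of all cues, modeling their
    collaboration. Returns (list of (T, D) arrays, (T, sum_of_D) array).
    """
    intra = [temporal_conv(c) for c in cues]
    inter = temporal_conv(np.concatenate(cues, axis=1))
    return intra, inter

# Hypothetical shapes: T frames, D features per cue, four cues.
T, D = 32, 16
cues = [np.random.randn(T, D) for _ in range(4)]  # frame, hands, face, pose
intra, inter = tmc_block(cues)
print(len(intra), intra[0].shape, inter.shape)  # 4 (32, 16) (32, 64)
```

In the paper, the outputs of both paths feed a joint optimization (e.g., sequence-learning losses on each path); the sketch stops at the feature level.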