Paper Title
Pose-based Sign Language Recognition using GCN and BERT
Paper Authors
Paper Abstract
Sign language recognition (SLR) plays a crucial role in bridging the communication gap between the hearing and vocally impaired community and the rest of society. Word-level sign language recognition (WSLR) is the first important step towards understanding and interpreting sign language. However, recognizing signs from videos is a challenging task, as the meaning of a word depends on a combination of subtle body motions, hand configurations, and other movements. Recent pose-based architectures for WSLR either model the spatial and temporal dependencies among the poses in different frames simultaneously, or model only the temporal information without fully utilizing the spatial information. We tackle the problem of WSLR with a novel pose-based approach that captures spatial and temporal information separately and performs late fusion. Our proposed architecture explicitly captures the spatial interactions in the video using a Graph Convolutional Network (GCN). The temporal dependencies between the frames are captured using Bidirectional Encoder Representations from Transformers (BERT). Experimental results on WLASL, a standard word-level sign language recognition dataset, show that our model significantly outperforms state-of-the-art pose-based methods, improving prediction accuracy by up to 5%.
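To make the two-branch late-fusion design concrete, below is a minimal PyTorch sketch of the idea the abstract describes: one branch applies graph convolutions over pose keypoints to capture spatial interactions, the other runs a Transformer encoder over per-frame tokens to capture temporal dependencies (standing in for BERT), and the two branches' class scores are fused at the end. All layer sizes, the keypoint count, the fully-connected placeholder adjacency, and the fusion-by-averaging choice are assumptions for illustration, not the authors' actual implementation.

```python
# Illustrative sketch only: hyperparameters, the skeleton graph, and the
# fusion rule are assumed, not taken from the paper.
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """One graph convolution layer over pose keypoints: X' = relu(A X W)."""
    def __init__(self, in_dim, out_dim, adj):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        # Normalized adjacency over keypoints, fixed (not learned) here.
        self.register_buffer("adj", adj)

    def forward(self, x):  # x: (batch, frames, keypoints, in_dim)
        x = torch.einsum("kj,btjc->btkc", self.adj, x)  # aggregate neighbors
        return torch.relu(self.linear(x))

class GcnBertLateFusion(nn.Module):
    def __init__(self, num_classes, num_keypoints=55, coord_dim=2, hid=128):
        super().__init__()
        # Placeholder fully-connected graph; a real model would use the
        # body/hand skeleton adjacency instead.
        adj = torch.ones(num_keypoints, num_keypoints) / num_keypoints
        self.gcn = nn.Sequential(
            GraphConv(coord_dim, hid, adj),
            GraphConv(hid, hid, adj),
        )
        # Temporal branch: a small Transformer encoder stands in for BERT.
        self.frame_proj = nn.Linear(num_keypoints * coord_dim, hid)
        enc_layer = nn.TransformerEncoderLayer(
            d_model=hid, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(enc_layer, num_layers=2)
        # One classification head per branch; predictions are late-fused.
        self.head_spatial = nn.Linear(hid, num_classes)
        self.head_temporal = nn.Linear(hid, num_classes)

    def forward(self, poses):  # poses: (batch, frames, keypoints, coord_dim)
        b, t, k, c = poses.shape
        # Spatial branch: GCN over keypoints, pooled over nodes and time.
        spatial = self.gcn(poses).mean(dim=(1, 2))           # (b, hid)
        # Temporal branch: one token per frame through the encoder.
        frames = self.frame_proj(poses.reshape(b, t, k * c))
        temporal = self.temporal(frames).mean(dim=1)         # (b, hid)
        # Late fusion: average the two branches' class scores.
        return (self.head_spatial(spatial) + self.head_temporal(temporal)) / 2

model = GcnBertLateFusion(num_classes=100)
logits = model(torch.randn(4, 32, 55, 2))  # 4 clips, 32 frames, 55 keypoints
print(logits.shape)  # torch.Size([4, 100])
```

Keeping a separate classification head per branch and combining only their scores is what distinguishes late fusion from concatenating features early; it lets each branch be pretrained or tuned independently before fusion.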