Paper Title
Boosting Continuous Sign Language Recognition via Cross Modality Augmentation
Paper Authors
Paper Abstract
Continuous sign language recognition (SLR) deals with unaligned video-text pairs and uses the word error rate (WER), i.e., edit distance, as the main evaluation metric. Since WER is not differentiable, we usually instead optimize the learning model with the connectionist temporal classification (CTC) objective loss, which maximizes the posterior probability over the sequential alignment. Due to this optimization gap, the predicted sentence with the highest decoding probability may not be the best choice under the WER metric. To tackle this issue, we propose a novel architecture with cross modality augmentation. Specifically, we first augment cross-modal data by simulating the calculation procedure of WER, i.e., applying substitution, deletion, and insertion to both the text label and its corresponding video. With these real and generated pseudo video-text pairs, we propose multiple loss terms to minimize the cross modality distance between the video and the ground truth label, and to make the network distinguish between real and pseudo modalities. The proposed framework can be easily extended to other existing CTC-based continuous SLR architectures. Extensive experiments on two continuous SLR benchmarks, i.e., RWTH-PHOENIX-Weather and CSL, validate the effectiveness of our proposed method.
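To make the two ingredients of the abstract concrete, below is a minimal Python sketch (not the authors' code) of the WER metric via dynamic-programming edit distance, together with a text-side version of the substitution/deletion/insertion augmentation. The function names `wer` and `augment_label`, the gloss vocabulary, the single-edit policy, and the assumption of non-empty labels are all illustrative; in the paper, each text edit is additionally mirrored on the gloss-aligned video segments so that every pseudo label keeps a matching pseudo video.

```python
import random

def wer(ref, hyp):
    """Word error rate: edit distance between a reference and a hypothesis
    gloss sequence, normalized by the reference length."""
    m, n = len(ref), len(hyp)
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all of ref[:i]
    for j in range(n + 1):
        d[0][j] = j  # insert all of hyp[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # substitution / match
                d[i - 1][j] + 1,                               # deletion
                d[i][j - 1] + 1,                               # insertion
            )
    return d[m][n] / max(m, 1)

def augment_label(glosses, vocab, rng=random):
    """Generate a pseudo text label by applying one random WER-style edit.
    Hypothetical sketch of the text side only: the paper mirrors the same
    edit on the gloss-aligned video segments to form the pseudo pair."""
    glosses = list(glosses)
    op = rng.choice(["substitute", "delete", "insert"])
    if op == "substitute":
        glosses[rng.randrange(len(glosses))] = rng.choice(vocab)
    elif op == "delete" and len(glosses) > 1:
        del glosses[rng.randrange(len(glosses))]
    else:  # insert (also the fallback when deletion would empty the label)
        glosses.insert(rng.randrange(len(glosses) + 1), rng.choice(vocab))
    return glosses

# Example: one deletion (TOMORROW) plus one insertion (WIND) gives 2 edits
# over a 3-gloss reference, hence WER = 2/3.
ref = ["RAIN", "TOMORROW", "NORTH"]
hyp = ["RAIN", "NORTH", "WIND"]
print(wer(ref, hyp))                                    # 0.666...
print(augment_label(ref, vocab=["SUN", "WIND", "SNOW"]))
```

In this reading, the pseudo pairs produced by `augment_label` are exactly the hard negatives the abstract refers to: they differ from the real pair by a known number of WER edits, which is what lets the proposed loss terms push the model toward decisions that are good under WER rather than only under CTC likelihood.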