Paper Title

Variable Attention Masking for Configurable Transformer Transducer Speech Recognition

Paper Authors

Pawel Swietojanski, Stefan Braun, Dogan Can, Thiago Fraga da Silva, Arnab Ghoshal, Takaaki Hori, Roger Hsiao, Henry Mason, Erik McDermott, Honza Silovsky, Ruchir Travadi, Xiaodan Zhuang

Paper Abstract

This work studies the use of attention masking in transformer transducer based speech recognition for building a single configurable model for different deployment scenarios. We present a comprehensive set of experiments comparing fixed masking, where the same attention mask is applied at every frame, with chunked masking, where the attention mask for each frame is determined by chunk boundaries, in terms of recognition accuracy and latency. We then explore the use of variable masking, where the attention masks are sampled from a target distribution at training time, to build models that can work in different configurations. Finally, we investigate how a single configurable model can be used to perform both first pass streaming recognition and second pass acoustic rescoring. Experiments show that chunked masking achieves a better accuracy vs latency trade-off compared to fixed masking, both with and without FastEmit. We also show that variable masking improves the accuracy by up to 8% relative in the acoustic re-scoring scenario.
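To make the masking schemes concrete, here is a minimal NumPy sketch of how a chunked attention mask could be built and how a variable mask could be sampled per utterance at training time. This is not the paper's implementation: the function names, the `left_chunks` parameter, and the example chunk-size distribution are illustrative assumptions.

```python
import numpy as np

def chunked_attention_mask(num_frames, chunk_size, left_chunks=1):
    """Chunked masking sketch: frame i may attend to every frame in its own
    chunk plus up to `left_chunks` preceding chunks (True = attend)."""
    mask = np.zeros((num_frames, num_frames), dtype=bool)
    for i in range(num_frames):
        chunk_idx = i // chunk_size
        start = max(0, (chunk_idx - left_chunks) * chunk_size)  # left context
        end = min(num_frames, (chunk_idx + 1) * chunk_size)     # end of own chunk
        mask[i, start:end] = True
    return mask

def sample_variable_mask(num_frames, chunk_sizes, probs, rng=None):
    """Variable masking sketch: draw a chunk size from a target distribution
    at training time, then build the corresponding chunked mask."""
    if rng is None:
        rng = np.random.default_rng()
    chunk_size = int(rng.choice(chunk_sizes, p=probs))
    return chunked_attention_mask(num_frames, chunk_size)

# Example: 12 frames; chunk size drawn from {2, 4, 12}, where 12 equals the
# utterance length, i.e. full-context attention of the kind a configurable
# model might use for second-pass acoustic rescoring (assumed setup).
mask = sample_variable_mask(12, chunk_sizes=[2, 4, 12], probs=[0.4, 0.4, 0.2])
print(mask.astype(int))
```

In this sketch, fixed masking corresponds to always using the same chunk size (or the same left/right context window at every frame), while variable masking exposes the model to several configurations during training so a single model can serve both streaming and full-context deployment.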
