Phoneme Based Neural Transducer for Large Vocabulary Speech Recognition

Paper title

Phoneme Based Neural Transducer for Large Vocabulary Speech Recognition

Authors

Wei Zhou, Simon Berger, Ralf Schlüter, Hermann Ney

Abstract

To join the advantages of classical and end-to-end approaches for speech recognition, we present a simple, novel and competitive approach for phoneme-based neural transducer modeling. Different alignment label topologies are compared and word-end-based phoneme label augmentation is proposed to improve performance. Utilizing the local dependency of phonemes, we adopt a simplified neural network structure and a straightforward integration with the external word-level language model to preserve the consistency of seq-to-seq modeling. We also present a simple, stable and efficient training procedure using frame-wise cross-entropy loss. A phonetic context size of one is shown to be sufficient for the best performance. A simplified scheduled sampling approach is applied for further improvement and different decoding approaches are briefly compared. The overall performance of our best model is comparable to state-of-the-art (SOTA) results for the TED-LIUM Release 2 and Switchboard corpora.
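The following is a minimal Python sketch, not the authors' implementation, of two ideas named in the abstract: word-end-based phoneme label augmentation and frame-wise cross-entropy training targets with a phonetic context size of one. It assumes a strictly monotonic topology in which every frame emits exactly one label (a phoneme or blank). This is only one of the label topologies the paper compares. The "#" word-end suffix, the symbols BLANK and SOS, and all function names are hypothetical choices made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

BLANK = "<b>"  # hypothetical blank symbol
SOS = "<s>"    # hypothetical sentence-start context symbol


def augment_word_end(lexicon: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Copy the lexicon, replacing the last phoneme of each pronunciation
    with a distinct word-end label (hypothetical '#' suffix)."""
    return {
        word: (phones[:-1] + [phones[-1] + "#"]) if phones else []
        for word, phones in lexicon.items()
    }


def words_to_labels(words: Sequence[str], lexicon: Dict[str, List[str]]) -> List[str]:
    """Concatenate the (augmented) pronunciations of a word sequence."""
    labels: List[str] = []
    for w in words:
        labels.extend(lexicon[w])
    return labels


def framewise_targets(
    alignment: Sequence[str], context_size: int = 1
) -> List[Tuple[Tuple[str, ...], str]]:
    """Pair each frame's target label with the last `context_size` non-blank
    labels emitted before that frame (one label per frame is assumed)."""
    if context_size < 1:
        raise ValueError("context_size must be at least 1")
    history = [SOS] * context_size
    pairs = []
    for label in alignment:
        pairs.append((tuple(history[-context_size:]), label))
        if label != BLANK:  # blanks do not change the label context
            history.append(label)
    return pairs


def framewise_ce(frame_probs: Sequence[Dict[str, float]], alignment: Sequence[str]) -> float:
    """Sum of per-frame cross-entropy terms -log p(target) for a fixed alignment."""
    return -sum(math.log(p[label]) for p, label in zip(frame_probs, alignment))


if __name__ == "__main__":
    lex = augment_word_end({"the": ["DH", "AH"], "cat": ["K", "AE", "T"]})
    labels = words_to_labels(["the", "cat"], lex)
    # A made-up 8-frame alignment whose non-blank labels spell `labels`.
    ali = ["DH", BLANK, "AH#", BLANK, "K", "AE", BLANK, "T#"]
    for ctx, target in framewise_targets(ali):
        print(ctx, "->", target)
```

In this sketch, a fixed frame-level alignment (how it is obtained is left open here) turns each (context, target) pair into an ordinary classification example. Training then reduces to summing per-frame cross-entropy terms instead of marginalizing over all alignments.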
