Paper title
Phoneme Based Neural Transducer for Large Vocabulary Speech Recognition
Paper authors
Paper abstract
To join the advantages of classical and end-to-end approaches for speech recognition, we present a simple, novel and competitive approach for phoneme-based neural transducer modeling. Different alignment label topologies are compared and word-end-based phoneme label augmentation is proposed to improve performance. Utilizing the local dependency of phonemes, we adopt a simplified neural network structure and a straightforward integration with the external word-level language model to preserve the consistency of seq-to-seq modeling. We also present a simple, stable and efficient training procedure using frame-wise cross-entropy loss. A phonetic context size of one is shown to be sufficient for the best performance. A simplified scheduled sampling approach is applied for further improvement and different decoding approaches are briefly compared. The overall performance of our best model is comparable to state-of-the-art (SOTA) results for the TED-LIUM Release 2 and Switchboard corpora.
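As an illustration of the word-end-based phoneme label augmentation mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes that augmentation means giving the final phoneme of each word a distinct label. The `#` marker, the toy lexicon, and the function names `augment_word_end` and `words_to_labels` are hypothetical and chosen only for this example.

```python
from typing import Dict, List

# Hypothetical marker: the abstract does not specify the notation used.
WORD_END = "#"


def augment_word_end(phonemes: List[str], marker: str = WORD_END) -> List[str]:
    """Return a copy of the pronunciation whose last phoneme carries the marker."""
    if not phonemes:
        return []
    return phonemes[:-1] + [phonemes[-1] + marker]


def words_to_labels(words: List[str], lexicon: Dict[str, List[str]]) -> List[str]:
    """Map a word sequence to a phoneme label sequence with word-end labels."""
    labels: List[str] = []
    for word in words:
        labels.extend(augment_word_end(lexicon[word]))
    return labels


# Toy pronunciation lexicon (assumed for illustration only).
lexicon = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

print(words_to_labels(["hello", "world"], lexicon))
```

With labels of this kind, word boundaries can be read directly off the phoneme sequence, which is one plausible reason such augmentation would pair well with an external word-level language model.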