Towards Real-time Mispronunciation Detection in Kids' Speech

Paper title

Towards Real-time Mispronunciation Detection in Kids' Speech

Authors

Peter Plantinga, Eric Fosler-Lussier

Abstract

Modern mispronunciation detection and diagnosis systems have seen significant gains in accuracy due to the introduction of deep learning. However, these systems have not been evaluated for the ability to be run in real-time, an important factor in applications that provide rapid feedback. In particular, the state-of-the-art uses bi-directional recurrent networks, where a uni-directional network may be more appropriate. Teacher-student learning is a natural approach to use to improve a uni-directional model, but when using a CTC objective, this is limited by poor alignment of outputs to evidence. We address this limitation by trying two loss terms for improving the alignments of our models. One loss is an "alignment loss" term that encourages outputs only when features do not resemble silence. The other loss term uses a uni-directional model as teacher model to align the bi-directional model. Our proposed model uses these aligned bi-directional models as teacher models. Experiments on the CSLU kids' corpus show that these changes decrease the latency of the outputs, and improve the detection rates, with a trade-off between these goals.
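The abstract does not give a formula for the "alignment loss", so the following is only a minimal sketch of one possible reading: penalize non-blank output probability on frames whose features resemble silence. The function name `alignment_penalty`, the use of frame energy as the silence cue, the threshold, and the blank index are all assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one frame's output logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def alignment_penalty(frame_logits, frame_energies, silence_threshold, blank_index=0):
    """Mean non-blank probability over frames judged to be silence.

    Hypothetical formulation: a frame counts as silence when its energy
    falls below `silence_threshold`. On those frames the model is penalized
    for putting probability on any label other than the CTC blank.
    """
    penalties = []
    for logits, energy in zip(frame_logits, frame_energies):
        if energy < silence_threshold:
            probs = softmax(logits)
            penalties.append(1.0 - probs[blank_index])
    # No silent frames means there is nothing to penalize.
    return sum(penalties) / len(penalties) if penalties else 0.0


# Two frames, two labels (index 0 is the blank). Frame 0 is silent.
well_aligned = alignment_penalty([[5.0, 0.0], [0.0, 5.0]], [0.01, 1.0], 0.1)
misaligned = alignment_penalty([[0.0, 5.0], [0.0, 5.0]], [0.01, 1.0], 0.1)
print(well_aligned < misaligned)
```

In a training setup, a term like this would presumably be added to the CTC objective with a weighting factor, which is where the trade-off between latency and detection rate mentioned in the abstract could be tuned.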
