Paper Title
A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency
Paper Authors
Paper Abstract
Thus far, end-to-end (E2E) models have not been shown to outperform state-of-the-art conventional models with respect to both quality, i.e., word error rate (WER), and latency, i.e., the time the hypothesis is finalized after the user stops speaking. In this paper, we develop a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer that surpass a conventional model in both quality and latency. On the quality side, we incorporate a large number of utterances across varied domains to increase acoustic diversity and the vocabulary seen by the model. We also train with accented English speech to make the model more robust to different pronunciations. In addition, given the increased amount of training data, we explore a varied learning rate schedule. On the latency front, we explore using the end-of-sentence decision emitted by the RNN-T model to close the microphone, and also introduce various optimizations to improve the speed of LAS rescoring. Overall, we find that RNN-T+LAS offers a better WER and latency tradeoff compared to a conventional model. For example, for the same latency, RNN-T+LAS obtains an 8% relative improvement in WER, while being more than 400 times smaller in model size.
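The two-pass design described above can be illustrated with a minimal sketch: a first pass (RNN-T in the paper) produces an n-best list with scores, and a second pass (LAS in the paper) rescores each hypothesis; the final score interpolates the two passes. The function name, the interpolation weight, and the toy scoring values below are illustrative assumptions, not details from the paper.

```python
def rescore_nbest(nbest, second_pass_score, weight=0.5):
    """Pick the hypothesis maximizing a weighted sum of first-pass and
    second-pass log-scores (a common two-pass rescoring scheme).

    nbest: list of (text, first_pass_log_score) pairs.
    second_pass_score: callable mapping text -> second-pass log-score.
    weight: interpolation weight on the second pass (assumed value).
    """
    best_text, best_score = None, float("-inf")
    for text, first_score in nbest:
        combined = (1 - weight) * first_score + weight * second_pass_score(text)
        if combined > best_score:
            best_text, best_score = text, combined
    return best_text, best_score

# Toy usage: the first pass slightly prefers "hey music", but the
# second pass strongly prefers "play music", which wins after rescoring.
nbest = [("play music", -1.0), ("hey music", -0.9)]
second_pass = {"play music": -0.5, "hey music": -2.0}
best, score = rescore_nbest(nbest, second_pass.get, weight=0.5)
# best == "play music", score == -0.75
```

The interpolation weight trades off trust between the two passes; in practice it would be tuned on a development set.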