Paper Title
SpeechNet: Weakly Supervised, End-to-End Speech Recognition at Industrial Scale
Paper Authors
Paper Abstract
End-to-end automatic speech recognition systems represent the state of the art, but they rely on thousands of hours of manually annotated speech for training, as well as heavyweight computation for inference. Of course, this impedes commercialization since most companies lack vast human and computational resources. In this paper, we explore training and deploying an ASR system in the label-scarce, compute-limited setting. To reduce human labor, we use a third-party ASR system as a weak supervision source, supplemented with labeling functions derived from implicit user feedback. To accelerate inference, we propose to route production-time queries across a pool of CUDA graphs of varying input lengths, the distribution of which best matches the traffic's. Compared to our third-party ASR, we achieve a relative improvement in word-error rate of 8% and a speedup of 600%. Our system, called SpeechNet, currently serves 12 million queries per day on our voice-enabled smart television. To our knowledge, this is the first time a large-scale, Wav2vec-based deployment has been described in the academic literature.
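The CUDA-graph pooling described in the abstract can be illustrated with a minimal sketch. The Python/PyTorch code below is a hypothetical illustration, not the authors' implementation: the bucket lengths, the single-query batch size, and the capture_graphs and infer helpers are assumptions; in the paper, the pool of input lengths is chosen so that its distribution best matches the production traffic's.

import bisect
import torch

# Hypothetical input-length buckets (samples at 16 kHz); the paper instead picks
# lengths whose distribution matches the observed traffic.
BUCKET_LENGTHS = [16000, 48000, 80000, 160000]

def capture_graphs(model, device="cuda"):
    """Pre-capture one CUDA graph per bucket, each with its own static tensors."""
    graphs = {}
    for n in BUCKET_LENGTHS:
        static_in = torch.zeros(1, n, device=device)
        with torch.no_grad():          # warm-up pass so lazy initialization is not captured
            model(static_in)
        torch.cuda.synchronize()
        g = torch.cuda.CUDAGraph()
        with torch.cuda.graph(g), torch.no_grad():
            static_out = model(static_in)
        graphs[n] = (g, static_in, static_out)
    return graphs

def infer(graphs, waveform):
    """Pad the query to the smallest bucket that fits, then replay that graph."""
    idx = bisect.bisect_left(BUCKET_LENGTHS, waveform.shape[-1])
    n = BUCKET_LENGTHS[min(idx, len(BUCKET_LENGTHS) - 1)]
    waveform = waveform[..., :n]       # sketch only: queries longer than the largest
                                       # bucket would need separate handling
    g, static_in, static_out = graphs[n]
    static_in.zero_()
    static_in[0, : waveform.shape[-1]].copy_(waveform)
    g.replay()                         # re-launches the captured kernels on the static tensors
    return static_out.clone()

Replaying a pre-captured graph avoids per-query kernel-launch overhead; the trade-off is that each query must be padded up to one of the fixed bucket lengths, which is why matching the bucket distribution to the traffic matters.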