A Conformer-based Waveform-domain Neural Acoustic Echo Canceller Optimized for ASR Accuracy

Paper Title

A Conformer-based Waveform-domain Neural Acoustic Echo Canceller Optimized for ASR Accuracy

Authors

Panchapagesan, Sankaran; Narayanan, Arun; Shabestary, Turaj Zakizadeh; Shao, Shuai; Howard, Nathan; Park, Alex; Walker, James; Gruenstein, Alexander

Abstract

Acoustic Echo Cancellation (AEC) is essential for accurate recognition of queries spoken to a smart speaker that is playing out audio. Previous work has shown that a neural AEC model operating on log-mel spectral features (denoted "logmel" hereafter) can greatly improve Automatic Speech Recognition (ASR) accuracy when optimized with an auxiliary loss utilizing a pre-trained ASR model encoder. In this paper, we develop a conformer-based waveform-domain neural AEC model inspired by the "TasNet" architecture. The model is trained by jointly optimizing Negative Scale-Invariant SNR (SISNR) and ASR losses on a large speech dataset. On a realistic rerecorded test set, we find that cascading a linear adaptive AEC and a waveform-domain neural AEC is very effective, giving 56-59% word error rate (WER) reduction over the linear AEC alone. On this test set, the 1.6M parameter waveform-domain neural AEC also improves over a larger 6.5M parameter logmel-domain neural AEC model by 20-29% in easy to moderate conditions. By operating on smaller frames, the waveform neural model is able to perform better at smaller sizes and is better suited for applications where memory is limited.
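The abstract names negative scale-invariant SNR (SISNR) as one of the two training losses but does not spell it out. The following is a minimal sketch of one common formulation, not the authors' implementation. The mean-removal step, the `eps` stabilizer, and the `joint_loss` helper with its `asr_weight` parameter are assumptions made for illustration; the paper's actual loss weighting is not given in this abstract.

```python
import math


def neg_si_snr(estimate, target, eps=1e-8):
    """Negative scale-invariant SNR (in dB) between two equal-length signals.

    Lower is better, so the value can be minimized directly as a loss.
    """
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    n = len(target)

    # Assumption: remove the mean of each signal so a DC offset is ignored.
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    est = [e - est_mean for e in estimate]
    tgt = [t - tgt_mean for t in target]

    # Project the estimate onto the target. The projection is the part of the
    # estimate that is a scaled copy of the target; the remainder is noise.
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt) + eps
    scale = dot / tgt_energy
    s_target = [scale * t for t in tgt]
    e_noise = [e - s for e, s in zip(est, s_target)]

    signal_energy = sum(s * s for s in s_target)
    noise_energy = sum(x * x for x in e_noise)
    si_snr_db = 10.0 * math.log10((signal_energy + eps) / (noise_energy + eps))
    return -si_snr_db


def joint_loss(si_snr_loss, asr_loss, asr_weight=1.0):
    """Hypothetical weighted sum of the two losses; asr_weight is an assumption."""
    return si_snr_loss + asr_weight * asr_loss
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the loss essentially unchanged, which is the property the "scale-invariant" name refers to.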
