Streaming Parrotron for on-device speech-to-speech conversion

Paper title

Streaming Parrotron for on-device speech-to-speech conversion

Authors

Oleg Rybakov, Fadi Biadsy, Xia Zhang, Liyang Jiang, Phoenix Meadowlark, Shivani Agrawal

Abstract

We present a fully on-device streaming Speech2Speech conversion model that normalizes given input speech directly to synthesized output speech. Deploying such a model on mobile devices poses significant challenges in terms of memory footprint and computation requirements. We present a streaming-based approach that achieves an acceptable delay with minimal loss in speech conversion quality compared to a reference state-of-the-art non-streaming approach. Our method first streams the encoder in real time while the speaker is speaking. Then, as soon as the speaker stops speaking, we run the spectrogram decoder in streaming mode alongside a streaming vocoder to generate output speech. To achieve an acceptable delay-quality trade-off, we propose a novel hybrid approach for look-ahead in the encoder that combines a look-ahead feature stacker with look-ahead self-attention. We show that our streaming approach is almost 2x faster than real time on the Pixel4 CPU.
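
The abstract names a look-ahead feature stacker without describing it here. As a rough illustration only, the general idea of stacking each input frame with a few future frames might be sketched as below. The function name `stack_with_lookahead`, the zero-padding at the end of the sequence, and the list-of-lists frame layout are all assumptions made for this sketch and are not taken from the paper.

```python
from typing import List


def stack_with_lookahead(frames: List[List[float]], lookahead: int) -> List[List[float]]:
    """Concatenate every frame with its next `lookahead` frames.

    Hypothetical helper for illustration; not the paper's implementation.
    Frames past the end of the sequence are replaced by zeros (an assumption),
    so the output has the same number of rows as the input.
    """
    if lookahead < 0:
        raise ValueError("lookahead must be non-negative")
    if not frames:
        return []

    dim = len(frames[0])
    zero_frame = [0.0] * dim
    stacked = []
    for t in range(len(frames)):
        row: List[float] = []
        # Offset 0 is the current frame; offsets 1..lookahead are future frames.
        for offset in range(lookahead + 1):
            index = t + offset
            row.extend(frames[index] if index < len(frames) else zero_frame)
        stacked.append(row)
    return stacked


if __name__ == "__main__":
    features = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
    for row in stack_with_lookahead(features, lookahead=1):
        print(row)
```

In a streaming setting, a stacked frame of this kind can only be emitted once its future frames have arrived, so the look-ahead length directly adds to the delay. That is the delay-quality trade-off the abstract refers to.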

Paper title