Lite Audio-Visual Speech Enhancement

Paper Title

Lite Audio-Visual Speech Enhancement

Authors

Shang-Yi Chuang, Yu Tsao, Chen-Chou Lo, Hsin-Min Wang

Abstract

Previous studies have confirmed the effectiveness of incorporating visual information into speech enhancement (SE) systems. Despite improved denoising performance, two problems may be encountered when implementing an audio-visual SE (AVSE) system: (1) additional processing costs are incurred to incorporate visual input and (2) the use of face or lip images may cause privacy problems. In this study, we propose a Lite AVSE (LAVSE) system to address these problems. The system includes two visual data compression techniques and removes the visual feature extraction network from the training model, yielding better online computation efficiency. Our experimental results indicate that the proposed LAVSE system can provide notably better performance than an audio-only SE system with a similar number of model parameters. In addition, the experimental results confirm the effectiveness of the two techniques for visual data compression.
