Paper Title

Audio-visual speech enhancement with a deep Kalman filter generative model

Authors

Ali Golmakani, Mostafa Sadeghi, Romain Serizel

Abstract

Deep latent variable generative models based on variational autoencoders (VAEs) have shown promising performance for audio-visual speech enhancement (AVSE). The underlying idea is to learn a VAE-based audio-visual prior distribution for clean speech data, and then combine it with a statistical noise model to recover a speech signal from a noisy audio recording and video (lip images) of the target speaker. Existing generative models developed for AVSE do not take into account the sequential nature of speech data, which prevents them from fully incorporating the power of visual data. In this paper, we present an audio-visual deep Kalman filter (AV-DKF) generative model which assumes a first-order Markov chain model for the latent variables and effectively fuses audio-visual data. Moreover, we develop an efficient inference methodology to estimate speech signals at test time. We conduct a set of experiments to compare different variants of generative models for speech enhancement. The results demonstrate the superiority of the AV-DKF model compared with both its audio-only version and the non-sequential audio-only and audio-visual VAE-based models.
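To make the structure named in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of the two ingredients it describes: a Gaussian first-order Markov transition p(z_t | z_{t-1}, v_t) conditioned on a visual (lip) embedding v_t, and a decoder mapping the latent state to frame-wise speech spectral variances. All names, dimensions, and architectural choices (MLP sizes, Tanh activations) are illustrative assumptions, not the authors' implementation.

# Illustrative sketch (not the paper's code) of a deep Kalman filter prior
# with audio-visual fusion. The latent z_t follows a first-order Markov
# chain p(z_t | z_{t-1}, v_t), where v_t is a visual (lip) embedding; the
# decoder outputs the variance of a zero-mean Gaussian speech model.
import torch
import torch.nn as nn

class AVDKFPrior(nn.Module):
    """Gaussian transition p(z_t | z_{t-1}, v_t) parameterized by an MLP."""
    def __init__(self, z_dim=32, v_dim=64, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim + v_dim, hidden), nn.Tanh())
        self.mean = nn.Linear(hidden, z_dim)
        self.logvar = nn.Linear(hidden, z_dim)

    def forward(self, z_prev, v_t):
        h = self.net(torch.cat([z_prev, v_t], dim=-1))
        return self.mean(h), self.logvar(h)

class Decoder(nn.Module):
    """Maps z_t to per-frequency log-variances of the speech STFT frame."""
    def __init__(self, z_dim=32, f_dim=257, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, f_dim),
        )

    def forward(self, z_t):
        return self.net(z_t)

# Ancestral sampling from the prior over T frames:
prior, dec = AVDKFPrior(), Decoder()
T, z_dim, v_dim = 10, 32, 64
z = torch.zeros(1, z_dim)            # initial latent state
v = torch.randn(T, 1, v_dim)         # visual (lip) embeddings, one per frame
for t in range(T):
    mu, logvar = prior(z, v[t])
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # z_t ~ N(mu, var)
    speech_logvar = dec(z)           # frame-wise speech spectral variance

Conditioning the transition on v_t is what allows visual information to steer the latent dynamics over time; a non-sequential VAE prior, by contrast, treats each frame independently and cannot exploit this temporal coupling.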
