Paper Title


RespVAD: Voice Activity Detection via Video-Extracted Respiration Patterns

Paper Authors

Arnab Kumar Mondal, Prathosh A. P.

Abstract


Voice Activity Detection (VAD) refers to the task of identifying regions of human speech in digital signals such as audio and video. While VAD is a necessary first step in many speech processing systems, it poses challenges when ambient noise levels are high during audio recording. To improve VAD performance in such conditions, several methods have been proposed that utilize visual information extracted from the mouth/lip region of the speaker's video recording. Even though these methods provide advantages over audio-only approaches, they depend on faithful extraction of the lip/mouth region. Motivated by this, a new paradigm for VAD is proposed, based on the fact that respiration forms the primary source of energy for speech production. Specifically, an audio-independent VAD technique is developed using the respiration pattern extracted from the speaker's video. The respiration pattern is first extracted from video of the speaker's abdominal-thoracic region using an optical-flow-based method. Subsequently, voice activity is detected from the respiration pattern signal using neural sequence-to-sequence prediction models. The efficacy of the proposed method is demonstrated through experiments on a challenging dataset recorded in real acoustic environments, and it is compared with four previous methods based on audio and visual cues.
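The first stage described in the abstract (averaging optical flow over the abdominal-thoracic region into a 1-D respiration signal) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the flow fields here are synthetic, the function names are hypothetical, and a real pipeline would compute dense optical flow (e.g., with a Farneback-style method) on the speaker's video before this step.

```python
# Hypothetical sketch: collapse per-frame optical flow over the
# abdominal-thoracic ROI into a 1-D respiration signal.
# flow_fields and the helper names below are illustrative assumptions,
# not the paper's actual code.
import math

def respiration_from_flow(flow_fields):
    """flow_fields: one 2-D grid of (dx, dy) vectors per video frame.
    Returns one respiration sample per frame: the mean vertical flow,
    since breathing motion in the torso is predominantly vertical."""
    signal = []
    for field in flow_fields:
        dys = [dy for row in field for (_, dy) in row]
        signal.append(sum(dys) / len(dys))
    return signal

def z_normalize(signal):
    """Zero-mean, unit-variance normalization of the respiration signal."""
    mean = sum(signal) / len(signal)
    var = sum((s - mean) ** 2 for s in signal) / len(signal)
    std = math.sqrt(var) or 1.0  # guard against a constant signal
    return [(s - mean) / std for s in signal]

# Synthetic input: a sinusoidal breathing motion over 8 frames, 2x2 ROI.
frames = [[[(0.0, math.sin(2 * math.pi * t / 8))] * 2] * 2 for t in range(8)]
resp = z_normalize(respiration_from_flow(frames))
```

The resulting 1-D signal is what the paper's second stage, a neural sequence-to-sequence model, would consume to predict frame-wise voice activity.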
