Paper Title

VarietySound: Timbre-Controllable Video to Sound Generation via Unsupervised Information Disentanglement

Authors

Chenye Cui, Yi Ren, Jinglin Liu, Rongjie Huang, Zhou Zhao

Abstract

Video-to-sound generation aims to generate realistic and natural sound given a video input. However, previous video-to-sound generation methods can only produce a random or average timbre, without any control over or specialization of the timbre of the generated sound, so users sometimes cannot obtain the timbre they want. In this paper, we pose the task of generating sound with a specific timbre given a video input and a reference audio sample. To solve this task, we disentangle each target audio into three components: temporal information, acoustic information, and background information. We first use three encoders to encode these components respectively: 1) a temporal encoder to encode temporal information, which is fed with video frames, since the input video shares the same temporal information as the original audio; 2) an acoustic encoder to encode timbre information, which takes the original audio as input and discards its temporal information via a temporal-corrupting operation; and 3) a background encoder to encode the residual or background sound, which takes the background part of the original audio as input. To improve the quality and temporal alignment of the generated result, we also adopt a mel discriminator and a temporal discriminator for adversarial training. Our experimental results on the VAS dataset demonstrate that our method can generate high-quality audio samples with good synchronization to events in the video and high timbre similarity to the reference audio.
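The abstract does not spell out what the "temporal-corrupting operation" in the acoustic encoder looks like. A common way to realize this idea in disentanglement work is to shuffle short segments of the mel-spectrogram along the time axis, which destroys onsets and rhythm while preserving the frame-level acoustic statistics that carry timbre. The sketch below is an illustrative assumption, not the paper's actual implementation; the function name `temporal_corrupt` and the segment length are hypothetical.

```python
import numpy as np

def temporal_corrupt(mel: np.ndarray, seg_len: int = 8, rng=None) -> np.ndarray:
    """Shuffle short time segments of a mel-spectrogram.

    Destroys temporal structure (onsets, rhythm) so an acoustic encoder
    cannot rely on it, while keeping every frame (and hence the timbre
    statistics) intact.

    mel: array of shape (T, n_mels), T frames along the time axis.
    """
    rng = np.random.default_rng(rng)
    # Cut the spectrogram into consecutive segments of seg_len frames.
    segments = [mel[i:i + seg_len] for i in range(0, mel.shape[0], seg_len)]
    # Reassemble the segments in a random order.
    order = rng.permutation(len(segments))
    return np.concatenate([segments[i] for i in order], axis=0)
```

Note that the operation is shape-preserving: the output contains exactly the same frames as the input, only in a different order, so the temporal encoder (driven by video frames) must supply all timing information at synthesis time.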
