Direct Speech-to-image Translation

Paper Title

Direct Speech-to-image Translation

Authors

Li, Jiguo; Zhang, Xinfeng; Jia, Chuanmin; Xu, Jizheng; Zhang, Li; Wang, Yue; Ma, Siwei; Gao, Wen

Abstract

Direct speech-to-image translation without text is an interesting and useful topic because of its potential applications in human-computer interaction, art creation, computer-aided design, and other areas, and because many languages have no written form. However, as far as we know, how to translate speech signals directly into images, and how well they can be translated, has not been well studied. In this paper, we attempt to translate speech signals into image signals without a transcription stage. Specifically, a speech encoder is designed to represent the input speech signal as an embedding feature, and it is trained with a pretrained image encoder using teacher-student learning to obtain better generalization ability on new classes. Subsequently, a stacked generative adversarial network is used to synthesize high-quality images conditioned on the embedding feature. Experimental results on both synthesized and real data show that the proposed method effectively translates raw speech signals into images without an intermediate text representation. An ablation study gives further insight into the method.
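
The teacher-student step mentioned in the abstract can be illustrated with a toy sketch. The abstract does not state the loss, the encoder architectures, or the embedding sizes, so everything below is an assumption made for illustration: the speech encoder is reduced to a single linear map, the pretrained image encoder is replaced by fixed target embeddings, and the student is fitted with a mean-squared-error loss by plain stochastic gradient descent. The names `apply`, `train_student`, and `demo` are hypothetical, and the stacked GAN stage is not shown.

```python
import random

def apply(w, x):
    """Multiply a weight matrix (list of rows) by a feature vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)

def avg_loss(w, speech_feats, teacher_embs):
    """Average distillation loss of the student over all pairs."""
    losses = [mse(apply(w, x), t) for x, t in zip(speech_feats, teacher_embs)]
    return sum(losses) / len(losses)

def train_student(w_init, speech_feats, teacher_embs, lr=0.1, epochs=300):
    """Fit a linear 'speech encoder' (student) to frozen teacher embeddings.

    Assumption: MSE loss and a linear student; the paper may use neither.
    """
    w = [row[:] for row in w_init]  # copy so the initial weights stay intact
    dim_out = len(w)
    for _ in range(epochs):
        for x, t in zip(speech_feats, teacher_embs):
            y = apply(w, x)
            for i in range(dim_out):
                # d(mse)/d(w[i][j]) = 2 * (y_i - t_i) * x_j / dim_out
                g = 2.0 * (y[i] - t[i]) / dim_out
                for j in range(len(x)):
                    w[i][j] -= lr * g * x[j]
    return w

def demo(seed=0, n=8, dim_in=4, dim_out=3):
    """Build toy paired data and return (loss before, loss after) training."""
    rng = random.Random(seed)
    # Stand-in for speech features of n utterances (hypothetical data).
    speech = [[rng.uniform(-1, 1) for _ in range(dim_in)] for _ in range(n)]
    # Stand-in for the frozen teacher: embeddings of the paired images,
    # generated here by a hidden linear map so that a perfect student exists.
    hidden = [[rng.uniform(-1, 1) for _ in range(dim_in)] for _ in range(dim_out)]
    teacher = [apply(hidden, x) for x in speech]
    w0 = [[rng.uniform(-0.1, 0.1) for _ in range(dim_in)] for _ in range(dim_out)]
    w1 = train_student(w0, speech, teacher)
    return avg_loss(w0, speech, teacher), avg_loss(w1, speech, teacher)

if __name__ == "__main__":
    before, after = demo()
    print("loss before:", before)
    print("loss after: ", after)
```

In this sketch the teacher is never updated; only the student weights change, so the speech embedding is pulled toward the image embedding space. A generator conditioned on that embedding would then be trained separately.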
