Embodied Self-supervised Learning by Coordinated Sampling and Training

Paper title

Embodied Self-supervised Learning by Coordinated Sampling and Training

Authors

Sun, Yifan; Wu, Xihong

Abstract

Self-supervised learning can significantly improve the performance of downstream tasks; however, the dimensions of the learned representations normally lack explicit physical meaning. In this work, we propose a novel self-supervised approach that solves inverse problems by employing the corresponding physical forward process, so that the learned representations have explicit physical meaning. The proposed approach works in an analysis-by-synthesis manner and learns an inference network by iterating between sampling and training. At the sampling step, given observed data, the inference network is used to approximate the intractable posterior, from which we sample input parameters and feed them to the physical process to generate data in the observation space. At the training step, the same network is optimized on the sampled paired data. We demonstrate the feasibility of the proposed method on the acoustic-to-articulatory inversion problem, in which articulatory information is inferred from speech. Given an articulatory synthesizer, an inference model can be trained completely from scratch with random initialization. Our experiments demonstrate that the proposed method converges steadily and that the network learns to control the articulatory synthesizer to speak like a human. We also demonstrate that trained models generalize well to unseen speakers and even to new languages, and that performance can be further improved through self-adaptation.
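The alternation between the sampling step and the training step can be illustrated with a toy sketch. It is not the authors' implementation. Everything in it is an assumption made for illustration: the linear map `A` stands in for the physical forward process (the paper uses an articulatory synthesizer), the inference model is a plain linear map `W` rather than a neural network, the approximate posterior is a Gaussian with fixed spread `sigma`, and the training step is a closed-form least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the physical forward process.
# The paper uses an articulatory synthesizer; this is a toy linear map.
A = np.array([[2.0, 0.5], [-1.0, 1.5]])


def forward(z):
    """Map parameters z of shape (n, 2) to observations of shape (n, 2)."""
    return z @ A.T


# Only observations are available to the learner; z_true is kept
# solely to check the result afterwards.
z_true = rng.normal(size=(500, 2))
x_obs = forward(z_true)

# Inference model (assumed linear here), randomly initialised.
W = rng.normal(scale=0.1, size=(2, 2))
sigma = 0.5  # assumed spread of the approximate posterior


def resynthesis_error(weights):
    """Mean squared error between observations and their re-synthesis."""
    return float(np.mean((forward(x_obs @ weights) - x_obs) ** 2))


initial_error = resynthesis_error(W)

for _ in range(5):
    # Sampling step: draw parameters from the approximate posterior
    # given the observed data, then run them through the forward process.
    z_sampled = x_obs @ W + sigma * rng.normal(size=x_obs.shape)
    x_sampled = forward(z_sampled)

    # Training step: fit the same inference model on the sampled pairs.
    W, *_ = np.linalg.lstsq(x_sampled, z_sampled, rcond=None)

final_error = resynthesis_error(W)
print(initial_error, final_error)
```

Because the toy forward process is linear and noise-free, a single round already recovers the inverse map, and the loop is kept only to mirror the iterative structure. With a nonlinear simulator and a neural inference network, each round would instead be a partial update, which is the setting the abstract describes.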
