Automatic Evaluation of Speaker Similarity

Paper title

Automatic Evaluation of Speaker Similarity

Authors

Kamil Deja, Ariadna Sanchez, Julian Roth, Marius Cotescu

Abstract

We introduce a new automatic evaluation method for speaker similarity assessment that is consistent with human perceptual scores. Modern neural text-to-speech models require a vast amount of clean training data, which is why many solutions have switched from single-speaker models to models trained on examples from many different speakers. Multi-speaker models bring new possibilities, such as faster creation of new voices, but also a new problem: speaker leakage, where the speaker identity of a synthesized example might not match that of the target speaker. Currently, the only way to discover this issue is through costly perceptual evaluations. In this work, we propose an automatic method for assessing speaker similarity. For that purpose, we extend recent work on speaker verification systems and evaluate how well different metrics and speaker embedding models reflect Multiple Stimuli with Hidden Reference and Anchor (MUSHRA) scores. Our experiments show that we can train a model to predict speaker similarity MUSHRA scores from speaker embeddings with 0.96 accuracy and a significant Pearson correlation of up to 0.78 at the utterance level.
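As a rough illustration of the kind of comparison the abstract describes, the sketch below scores each synthesized utterance by the cosine similarity between its speaker embedding and the target speaker's embedding, then measures how well those scores track listener ratings using the Pearson correlation. This is not the authors' method: the abstract does not say which metrics or embedding models were used, cosine similarity is assumed here only because it is a common choice in speaker verification, and the function names, embeddings, and MUSHRA values are all hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up 3-dimensional embeddings and listener scores, for illustration only.
    # Real speaker embeddings would come from a speaker verification model.
    target = [0.9, 0.1, 0.3]
    synthesized = [[0.8, 0.2, 0.3], [0.1, 0.9, 0.2], [0.5, 0.5, 0.4]]
    mushra = [85.0, 20.0, 55.0]  # hypothetical MUSHRA scores, one per utterance

    similarities = [cosine_similarity(target, emb) for emb in synthesized]
    print("utterance-level Pearson r:", pearson(similarities, mushra))
```

A high correlation on real data would suggest that the chosen embedding model and metric could stand in for a perceptual test; the trained predictor mentioned in the abstract goes beyond this simple fixed metric.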
