Statistical Context-Dependent Units Boundary Correction for Corpus-based Unit-Selection Text-to-Speech

Paper Title

Statistical Context-Dependent Units Boundary Correction for Corpus-based Unit-Selection Text-to-Speech

Authors

Zito, Claudio; Tesser, Fabio; Nicolao, Mauro; Cosi, Piero

Abstract

In this study, we present an innovative speaker-adaptation technique that improves segmentation accuracy for unit-selection Text-To-Speech (TTS) systems. Conventional speaker-adaptation techniques try to improve segmentation accuracy with acoustic models that are more robust to the speaker's characteristics; we instead use only context-dependent characteristics extracted with linguistic analysis techniques. In simple terms, we rely on the intuitive idea that context-dependent information is tightly correlated with the related acoustic waveform. We propose a statistical model that predicts correction values to reduce the systematic error produced by state-of-the-art Hidden Markov Model (HMM) based speech segmentation. Our approach consists of two phases: (1) identifying context-dependent phonetic unit classes (for instance, the class of vowels that form the nucleus of monosyllabic words); and (2) building a regression model that associates with each class the mean error made by the ASR during the segmentation of a single-speaker corpus. The approach is evaluated by comparing both the corrected unit boundaries and the state-of-the-art HMM segmentation against a reference alignment, which is assumed to be the optimal solution. In conclusion, our work supplies a first analysis of a model that is sensitive to speaker-dependent characteristics and robust to defective and noisy information, together with a very simple implementation that could serve as an alternative either to more expensive speaker-adaptation systems or to numerous manual correction sessions.
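
The two phases described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class labels, the boundary times, the sign convention (error = HMM boundary minus reference boundary) and the function names `fit_class_offsets` and `correct_boundary` are all assumptions made for illustration, and the per-class mean stands in for the regression model in its simplest form.

```python
from collections import defaultdict
from statistics import mean


def fit_class_offsets(samples):
    """Estimate the mean boundary error of each context class.

    samples: iterable of (context_class, hmm_boundary_s, reference_boundary_s).
    The error is taken as HMM boundary minus reference boundary (assumed convention).
    """
    errors = defaultdict(list)
    for context_class, hmm_t, ref_t in samples:
        errors[context_class].append(hmm_t - ref_t)
    return {cls: mean(errs) for cls, errs in errors.items()}


def correct_boundary(context_class, hmm_t, offsets):
    """Shift an HMM boundary by the mean error learned for its class.

    Classes never seen during fitting are left uncorrected.
    """
    return hmm_t - offsets.get(context_class, 0.0)


# Hypothetical training data: (class label, HMM boundary, reference boundary), in seconds.
training = [
    ("vowel_nucleus_monosyllabic", 0.512, 0.500),
    ("vowel_nucleus_monosyllabic", 1.230, 1.222),
    ("plosive_word_initial", 0.300, 0.305),
    ("plosive_word_initial", 2.100, 2.103),
]

offsets = fit_class_offsets(training)
corrected = correct_boundary("vowel_nucleus_monosyllabic", 3.010, offsets)
```

In this toy data the HMM places vowel-nucleus boundaries about 10 ms late on average, so a new boundary of that class is moved 10 ms earlier. A real system would derive the classes from linguistic analysis of the corpus and would need enough samples per class for the mean to be reliable.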
