Paper Title
Applying Syntax–Prosody Mapping Hypothesis and Prosodic Well-Formedness Constraints to Neural Sequence-to-Sequence Speech Synthesis
Paper Authors
Abstract
End-to-end text-to-speech synthesis (TTS), which generates speech directly from strings of text or phonemes, has improved synthesis quality over conventional TTS. However, most previous studies have been evaluated on subjective naturalness and have not objectively examined whether the models reproduce the pitch patterns of phonological phenomena that reflect syntactic structure in Japanese, such as downstep, rhythmic boost, and initial lowering. These phenomena can be explained linguistically by phonological constraints and the syntax–prosody mapping hypothesis (SPMH), which assumes projections from syntactic structures to the phonological hierarchy. Although some psycholinguistic experiments have verified the validity of the SPMH, it remains important to investigate whether it can be implemented in TTS. To synthesize linguistic phenomena involving syntactic or phonological constraints, we propose a model that uses phonological symbols based on the SPMH and prosodic well-formedness constraints. Experimental results showed that, for initial lowering and rhythmic boost, the proposed method synthesized pitch patterns similar to those reported in linguistics experiments. The proposed model efficiently synthesizes phonological phenomena in the test data that were not explicitly included in the training data.