Paper Title

NatiQ: An End-to-end Text-to-Speech System for Arabic

Paper Authors

Ahmed Abdelali, Nadir Durrani, Cenk Demiroglu, Fahim Dalvi, Hamdy Mubarak, Kareem Darwish

Abstract

NatiQ is an end-to-end text-to-speech system for Arabic. Our speech synthesizer uses an encoder-decoder architecture with attention. We used both Tacotron-based models (Tacotron-1 and Tacotron-2) and the faster Transformer model to generate mel-spectrograms from characters. To synthesize waveforms from the spectrograms, we concatenated Tacotron-1 with the WaveRNN vocoder, Tacotron-2 with the WaveGlow vocoder, and the ESPnet Transformer with the Parallel WaveGAN vocoder. To train our models, we used in-house speech data for two voices: 1) a neutral male voice, "Hamza", narrating general content and news, and 2) an expressive female voice, "Amina", narrating children's story books. Our best systems achieve an average Mean Opinion Score (MOS) of 4.21 and 4.40 for Amina and Hamza, respectively. Objective evaluation of the systems using word and character error rates (WER and CER), as well as response time measured by real-time factor, favored the end-to-end architecture ESPnet. The NatiQ demo is available online at https://tts.qcri.org
