Paper Title
Multi-Speaker and Wide-Band Simulated Conversations as Training Data for End-to-End Neural Diarization
Paper Authors
Abstract
End-to-end diarization presents an attractive alternative to standard cascaded diarization systems because a single system can handle all aspects of the task at once. Many flavors of end-to-end models have been proposed, but all of them require large amounts of annotated training data, which so far do not exist. The compromise solution consists of generating synthetic data, and the recently proposed simulated conversations (SC) have shown remarkable improvements over the original simulated mixtures (SM). In this work, we create SC with multiple speakers per conversation and show that they allow for substantially better performance than SM while also reducing the dependence on a fine-tuning stage. We also create SC from wide-band public audio sources and present an analysis on several evaluation sets. Together with this publication, we release the recipes for generating such data, models trained on public sets, and an implementation that efficiently handles multiple speakers per conversation and includes an auxiliary voice activity detection loss.