Paper Title

An enhanced Conv-TasNet model for speech separation using a speaker distance-based loss function

Authors

Arango-Sánchez, Jose A., Arias-Londoño, Julián D.

Abstract

This work addresses the problem of speech separation in the Spanish language using pre-trained deep learning models. As with many speech processing tasks, large databases in languages other than English are scarce. Therefore, this work explores different training strategies using the Conv-TasNet model as a benchmark. A scale-invariant signal-to-distortion ratio (SI-SDR) value of 9.9 dB was achieved with the best training strategy. Then, experimentally, we identified an inverse relationship between the speakers' similarity and the model's performance, so an improved Conv-TasNet architecture was proposed. The enhanced Conv-TasNet model uses pre-trained speech embeddings to add a between-speakers cosine similarity term to the cost function, yielding an SI-SDR of 10.6 dB. Lastly, final experiments regarding real-time deployment show some drawbacks in the speakers' channel synchronization due to the need to process small speech segments in which only one of the speakers appears.
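The abstract describes augmenting the separation objective with a cosine-similarity term computed from pre-trained speaker embeddings of the separated outputs. The sketch below is only an illustration of that idea, not the authors' implementation: the embedding model `embed_fn`, the weighting factor `alpha`, and the omission of permutation-invariant training are all assumptions.

import torch
import torch.nn.functional as F


def si_sdr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant signal-to-distortion ratio in dB over the last (time) axis."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference to isolate the target component.
    s_target = (torch.sum(est * ref, dim=-1, keepdim=True)
                / (torch.sum(ref ** 2, dim=-1, keepdim=True) + eps)) * ref
    e_noise = est - s_target
    return 10 * torch.log10((torch.sum(s_target ** 2, dim=-1) + eps)
                            / (torch.sum(e_noise ** 2, dim=-1) + eps))


def enhanced_separation_loss(est_sources, ref_sources, embed_fn, alpha=0.1):
    """est_sources, ref_sources: (batch, 2, samples) waveforms for two speakers.

    embed_fn: any pre-trained speaker-embedding model mapping (batch, samples)
    waveforms to (batch, dim) embeddings (a placeholder, not the paper's model).
    Permutation-invariant training is omitted here for brevity.
    """
    # Standard separation term: maximize the SI-SDR of each estimated source.
    sdr_loss = -si_sdr(est_sources, ref_sources).mean()
    # Similarity term: penalize the two separated channels sounding like the same speaker.
    emb_a = embed_fn(est_sources[:, 0])
    emb_b = embed_fn(est_sources[:, 1])
    cos_sim = F.cosine_similarity(emb_a, emb_b, dim=-1).mean()
    return sdr_loss + alpha * cos_sim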
