Tackling real noisy reverberant meetings with all-neural source separation, counting, and diarization system

Title

Tackling real noisy reverberant meetings with all-neural source separation, counting, and diarization system

Authors

Kinoshita, Keisuke; Delcroix, Marc; Araki, Shoko; Nakatani, Tomohiro

Abstract

Automatic meeting analysis is a fundamental technology required to let, e.g., smart devices follow and respond to our conversations. To achieve optimal automatic meeting analysis, we previously proposed an all-neural approach that jointly solves the source separation, speaker diarization, and source counting problems in an optimal way (in the sense that all three tasks can be jointly optimized through error back-propagation). That method was shown to handle simulated clean (noiseless and anechoic) dialog-like data well, and it achieved very good performance in comparison with several conventional methods. However, it was not clear whether such an all-neural approach would generalize to more complicated real meeting data containing more spontaneous speech, severe noise, and reverberation, or how it would perform in comparison with state-of-the-art systems in such scenarios. In this paper, we first consider the practical issues that must be addressed to improve the robustness of the all-neural approach, and then show experimentally that, even in real meeting scenarios, the all-neural approach can perform effective speech enhancement while simultaneously outperforming state-of-the-art systems.
