Paper Title
Continuous speech separation: dataset and analysis
Paper Authors
Paper Abstract
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms. Most prior studies on speech separation use pre-segmented signals of artificially mixed speech utterances which are mostly \emph{fully} overlapped, and the algorithms are evaluated based on signal-to-distortion ratio or similar performance metrics. However, in natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components. In addition, the signal-based metrics have very weak correlations with automatic speech recognition (ASR) accuracy. We think that not only does this make it hard to assess the practical relevance of the tested algorithms, it also hinders researchers from developing systems that can be readily applied to real scenarios. In this paper, we define continuous speech separation (CSS) as a task of generating a set of non-overlapped speech signals from a \textit{continuous} audio stream that contains multiple utterances that are \emph{partially} overlapped by a varying degree. A new real recorded dataset, called LibriCSS, is derived from LibriSpeech by concatenating the corpus utterances to simulate a conversation and capturing the audio replays with far-field microphones. A Kaldi-based ASR evaluation protocol is also established by using a well-trained multi-conditional acoustic model. By using this dataset, several aspects of a recently proposed speaker-independent CSS algorithm are investigated. The dataset and evaluation scripts are available to facilitate the research in this direction.
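Below is a minimal sketch of the kind of conversation-style simulation the abstract describes: single-speaker utterances are concatenated into one continuous stream in which consecutive utterances either partially overlap or are separated by silence. The function name, parameters, and overlap/silence statistics are illustrative assumptions for exposition only; they are not the official LibriCSS generation scripts.

```python
# Hypothetical sketch of building a continuous, partially overlapped stream
# from pre-segmented utterances (assumed names and parameters, not LibriCSS code).
import numpy as np

def simulate_conversation(utterances, sample_rate=16000,
                          overlap_ratio=0.2, silence_range=(0.5, 1.5),
                          rng=None):
    """Concatenate utterances into one continuous waveform.

    Roughly half the time the next utterance starts before the previous one
    ends (partial overlap); otherwise a random silence gap is inserted,
    leaving an overlap-free region.
    """
    rng = rng or np.random.default_rng(0)
    stream = np.zeros(0, dtype=np.float32)
    for utt in utterances:
        utt = utt.astype(np.float32)
        if len(stream) == 0:
            stream = utt
            continue
        if rng.random() < 0.5:
            # Partially overlap the new utterance with the tail of the stream.
            overlap = int(overlap_ratio * len(utt))
            start = max(len(stream) - overlap, 0)
        else:
            # Insert a random silence gap between utterances.
            gap_sec = rng.uniform(*silence_range)
            start = len(stream) + int(gap_sec * sample_rate)
        end = start + len(utt)
        if end > len(stream):
            stream = np.pad(stream, (0, end - len(stream)))
        stream[start:end] += utt
    return stream
```

In the actual dataset, such simulated conversations are played back and re-recorded with far-field microphones, and separation quality is then scored by running a Kaldi ASR system on the separated streams rather than by signal-based metrics.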