Paper Title
Combining Spectral and Self-Supervised Features for Low Resource Speech Recognition and Translation
Paper Authors
Paper Abstract
Self-Supervised Learning (SSL) models have been successfully applied to various deep learning-based speech tasks, particularly those with a limited amount of data. However, the quality of SSL representations depends heavily on the relatedness between the SSL training domain(s) and the target data domain. In contrast, spectral feature (SF) extractors such as log Mel-filterbanks are hand-crafted, non-learnable components, and could be more robust to domain shifts. The present work examines the hypothesis that combining non-learnable SF extractors with SSL models is an effective approach for low resource speech tasks. We propose a learnable and interpretable framework to combine SF and SSL representations. The proposed framework significantly outperforms both baseline and SSL models on Automatic Speech Recognition (ASR) and Speech Translation (ST) tasks on three low resource datasets. We additionally design a mixture-of-experts-based combination model. This last model reveals that the relative contribution of SSL models over conventional SF extractors is very small in the case of a domain mismatch between the SSL training set and the target language data.
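
To make the idea of a learnable combination of the two feature streams concrete, the following is a minimal, hypothetical PyTorch-style sketch. The class name `FeatureCombiner`, the feature dimensions, and the per-frame sigmoid gate are illustrative assumptions and are not the paper's actual architecture or its mixture-of-experts variant.

```python
import torch
import torch.nn as nn


class FeatureCombiner(nn.Module):
    """Illustrative learnable combiner of spectral (SF) and SSL features.

    Sketch only (not the paper's method): both streams are projected to a
    shared dimension and mixed with a learned per-frame gate in [0, 1].
    """

    def __init__(self, sf_dim: int, ssl_dim: int, model_dim: int):
        super().__init__()
        self.sf_proj = nn.Linear(sf_dim, model_dim)    # e.g. 80-dim log Mel-filterbank frames
        self.ssl_proj = nn.Linear(ssl_dim, model_dim)  # e.g. 768-dim SSL model outputs
        # Gate network producing one mixing weight per frame
        self.gate = nn.Sequential(nn.Linear(2 * model_dim, 1), nn.Sigmoid())

    def forward(self, sf: torch.Tensor, ssl: torch.Tensor) -> torch.Tensor:
        # sf: (batch, time, sf_dim); ssl: (batch, time, ssl_dim); assumed time-aligned
        h_sf = self.sf_proj(sf)
        h_ssl = self.ssl_proj(ssl)
        w = self.gate(torch.cat([h_sf, h_ssl], dim=-1))  # (batch, time, 1)
        # Convex combination; w can be inspected to gauge each stream's contribution
        return w * h_ssl + (1.0 - w) * h_sf


# Example usage with hypothetical dimensions
combiner = FeatureCombiner(sf_dim=80, ssl_dim=768, model_dim=256)
fused = combiner(torch.randn(4, 200, 80), torch.randn(4, 200, 768))  # (4, 200, 256)
```

Because the gate weight is a per-frame scalar in [0, 1], its average value gives a rough, interpretable estimate of how much the model relies on the SSL stream versus the spectral stream, which is the kind of analysis the abstract alludes to with the mixture-of-experts combination model.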