Paper Title

Investigating Self-supervised Pretraining Frameworks for Pathological Speech Recognition

Authors

Lester Phillip Violeta, Wen-Chin Huang, Tomoki Toda

Abstract


We investigate the performance of self-supervised pretraining frameworks on pathological speech datasets used for automatic speech recognition (ASR). Modern end-to-end models require thousands of hours of data to train well, but only a small number of pathological speech datasets are publicly available. A proven solution to this problem is to first pretrain the model on a large amount of healthy speech data and then fine-tune it on the pathological speech datasets. A new pretraining framework called self-supervised learning (SSL) trains a network using only speech data, providing more flexibility in training data requirements and allowing more speech data to be used in pretraining. We investigate SSL frameworks such as the wav2vec 2.0 and WavLM models using different setups and compare their performance with different supervised pretraining setups, using two types of pathological speech, namely, Japanese electrolaryngeal and English dysarthric speech. Our results show that although SSL has shown success with minimally resourced healthy speech, we do not find this to be the case with pathological speech. The best supervised setup outperforms the best SSL setup by 13.9% character error rate in electrolaryngeal speech and 16.8% word error rate in dysarthric speech.
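The abstract reports results in character error rate (CER) and word error rate (WER), the standard ASR metrics. As a minimal illustrative sketch (not the paper's evaluation code), both can be computed as Levenshtein edit distance between reference and hypothesis, normalized by reference length — over characters for CER, over words for WER:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences via dynamic programming."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # dp[j] = distance between ref[:i] and hyp[:j]
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def cer(ref, hyp):
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

def wer(ref, hyp):
    """Word error rate: word-level edit distance / number of reference words."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())
```

For example, `wer("the cat sat", "the cat sat")` is 0.0, while one substituted word out of three gives a WER of 1/3. CER is the more common metric for Japanese (as in the electrolaryngeal results) because Japanese text has no whitespace word boundaries.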
