Paper Title
More Speaking or More Speakers?
Paper Authors
Paper Abstract
Self-training (ST) and self-supervised learning (SSL) methods have demonstrated strong improvements in automatic speech recognition (ASR). In spite of these advances, to the best of our knowledge, there is no analysis of how the composition of the labelled and unlabelled datasets used in these methods affects the results. In this work, we aim to analyse the effect of the number of speakers in the training data on a recent SSL algorithm (wav2vec 2.0) and a recent ST algorithm (slimIPL). We perform a systematic analysis on both labelled and unlabelled data by varying the number of speakers while keeping the number of hours fixed, and vice versa. Our findings suggest that SSL requires a large amount of unlabelled data to produce high-accuracy results, while ST requires a sufficient number of speakers in the labelled data, especially in the low-data regime. In this manner, these two approaches improve supervised learning in different regimes of data composition.
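The controlled comparison described above (varying the number of speakers while holding total audio hours fixed, and vice versa) can be illustrated with a small sketch. The Python below is a hypothetical illustration, not the authors' code: the synthetic corpus, the function name `subsample_fixed_hours`, and the round-robin selection heuristic are all assumptions made for demonstration; a real setup would read speaker IDs and utterance durations from a corpus manifest.

```python
# Minimal sketch: build training subsets with a fixed hour budget but a
# varying number of speakers. All names and heuristics here are assumptions.
import random
from typing import Dict, List


def subsample_fixed_hours(
    utterances_by_speaker: Dict[str, List[float]],  # speaker id -> utterance durations (seconds)
    num_speakers: int,
    target_hours: float,
    seed: int = 0,
) -> Dict[str, List[float]]:
    """Pick `num_speakers` speakers and collect utterances round-robin
    until the total duration reaches roughly `target_hours`."""
    rng = random.Random(seed)
    speakers = rng.sample(sorted(utterances_by_speaker), num_speakers)
    budget = target_hours * 3600.0
    pools = {s: list(utterances_by_speaker[s]) for s in speakers}
    subset: Dict[str, List[float]] = {s: [] for s in speakers}
    total = 0.0
    # Round-robin over the chosen speakers so the hours are spread across them.
    while total < budget and any(pools.values()):
        for s in speakers:
            if total >= budget:
                break
            if pools[s]:
                dur = pools[s].pop()
                subset[s].append(dur)
                total += dur
    return subset


if __name__ == "__main__":
    # Synthetic corpus: 100 speakers, each with 200 utterances of 2-15 seconds.
    rng = random.Random(42)
    corpus = {
        f"spk{i:03d}": [rng.uniform(2.0, 15.0) for _ in range(200)]
        for i in range(100)
    }
    # The same 10 labelled hours drawn from 5 speakers vs. 50 speakers.
    for n in (5, 50):
        subset = subsample_fixed_hours(corpus, num_speakers=n, target_hours=10.0)
        hours = sum(sum(durs) for durs in subset.values()) / 3600.0
        print(f"{n:3d} speakers -> {hours:.1f} h of audio")
```

The same routine, applied to either the labelled or the unlabelled pool, yields the data-composition grid (speakers x hours) over which the supervised, ST, and SSL models can then be compared.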