Title
An investigation on selecting audio pre-trained models for audio captioning
Authors
Abstract
Audio captioning is the task of generating a description of an audio clip based on its content. Because the task is highly complex, pre-trained models are widely used in audio captioning. Unless the complete system is re-trained, it is hard to determine how much a given pre-trained model contributes to an audio captioning system. To avoid the time-consuming and energy-consuming process of retraining, it is necessary to propose a predictor of performance for pre-trained models in audio captioning. In this paper, a series of pre-trained models are investigated for the correlation between the extracted audio features and the performance of audio captioning. A couple of predictors are proposed based on the experimental results. The results demonstrate that the kurtosis and skewness of the extracted audio features may act as indicators of the performance of audio captioning systems built on pre-trained audio models, because both statistics correlate highly with captioning performance.
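The idea of using feature statistics as a performance predictor can be illustrated with a minimal sketch. It is not the authors' pipeline: the model names, the randomly generated feature matrices, and the placeholder scores are all hypothetical stand-ins, and the use of population moments and Pearson correlation is an assumption, since the abstract does not say which estimators or correlation measure were used.

```python
import numpy as np


def skewness(x):
    """Population skewness: third central moment over variance^(3/2)."""
    x = np.asarray(x, dtype=float).ravel()
    d = x - x.mean()
    return np.mean(d ** 3) / np.mean(d ** 2) ** 1.5


def excess_kurtosis(x):
    """Population excess kurtosis: fourth central moment over variance^2, minus 3."""
    x = np.asarray(x, dtype=float).ravel()
    d = x - x.mean()
    return np.mean(d ** 4) / np.mean(d ** 2) ** 2 - 3.0


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    return float(np.corrcoef(a, b)[0, 1])


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Hypothetical stand-ins for features extracted by four pre-trained models
    # (rows: audio frames, columns: feature dimensions). Real features would
    # come from the encoders under study.
    features = {
        "model_a": rng.normal(size=(200, 64)),
        "model_b": rng.exponential(size=(200, 64)),
        "model_c": rng.laplace(size=(200, 64)),
        "model_d": rng.uniform(size=(200, 64)),
    }

    # Placeholder captioning scores, one per model. These are random numbers,
    # not measured results; substitute the metric obtained after training.
    placeholder_scores = rng.uniform(size=len(features))

    skews = [skewness(f) for f in features.values()]
    kurts = [excess_kurtosis(f) for f in features.values()]

    print("corr(skewness, score):", pearson(skews, placeholder_scores))
    print("corr(kurtosis, score):", pearson(kurts, placeholder_scores))
```

With real data, a strong correlation between these statistics and the captioning metric would support using them to screen pre-trained models before committing to a full training run.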