Paper title
Effects of Word-frequency based Pre- and Post- Processings for Audio Captioning
Paper authors
Abstract
The system we used for Task 6 (Automated Audio Captioning) of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge combines three elements, namely data augmentation, multi-task learning, and post-processing, for audio captioning. The system received the highest evaluation scores, but which individual element contributed most to its performance has not yet been clarified. Here, to assess their contributions, we first conducted an element-wise ablation study on our system to estimate to what extent each element is effective. We then conducted a detailed module-wise ablation study to further clarify the key processing modules for improving accuracy. The results show that data augmentation and post-processing significantly improve the score in our system. In particular, mix-up data augmentation and beam search in post-processing improve SPIDEr by 0.8 and 1.6 points, respectively.
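For readers unfamiliar with beam search, the post-processing step named in the abstract, the following is a minimal generic sketch, not the authors' implementation. The scoring function `toy_model`, its vocabulary, and its probabilities are hypothetical and exist only to show a case where keeping more than one hypothesis changes the output.

```python
import math
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[Tuple[str, ...], float]  # (token sequence, summed log-probability)


def beam_search(
    next_log_probs: Callable[[Tuple[str, ...]], Dict[str, float]],
    beam_width: int = 3,
    max_len: int = 10,
    eos: str = "<eos>",
) -> List[Hypothesis]:
    """Return hypotheses sorted from highest to lowest total log-probability."""
    beams: List[Hypothesis] = [((), 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        # Extend every live hypothesis by every possible next token.
        candidates: List[Hypothesis] = []
        for seq, score in beams:
            for tok, lp in next_log_probs(seq).items():
                candidates.append((seq + (tok,), score + lp))
        # Keep only the beam_width best extensions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            if seq[-1] == eos:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len without an end token
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished


def toy_model(prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Hypothetical next-token log-probabilities (assumption, for illustration only)."""
    if prefix == ():
        return {"a": math.log(0.6), "the": math.log(0.4)}
    if prefix == ("a",):
        return {"dog": math.log(0.5), "cat": math.log(0.5)}
    if prefix == ("the",):
        return {"bird": math.log(0.9), "rain": math.log(0.1)}
    return {"<eos>": 0.0}  # every two-word caption ends with probability 1


greedy = beam_search(toy_model, beam_width=1)  # equivalent to greedy decoding
wider = beam_search(toy_model, beam_width=2)
```

In this toy setting, a beam width of 1 commits to "a" at the first step and ends with a caption of probability 0.3, whereas a beam width of 2 keeps "the" alive and finds "the bird" with probability 0.36. This is the general mechanism by which beam search can improve a caption over greedy decoding; how much it helps in a real captioning system depends on the model.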