Effects of Word-frequency based Pre- and Post- Processings for Audio Captioning

Paper Title

Effects of Word-frequency based Pre- and Post- Processings for Audio Captioning

Authors

Daiki Takeuchi, Yuma Koizumi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

Abstract

The system we used for Task 6 (Automated Audio Captioning) of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge combines three elements, namely, data augmentation, multi-task learning, and post-processing, for audio captioning. The system received the highest evaluation scores, but which of the individual elements contributed most to its performance has not yet been clarified. Here, to assess their contributions, we first conducted an element-wise ablation study on our system to estimate to what extent each element is effective. We then conducted a detailed module-wise ablation study to further clarify the key processing modules for improving accuracy. The results show that data augmentation and post-processing significantly improve the score in our system. In particular, mix-up data augmentation and beam search in post-processing improve SPIDEr by 0.8 and 1.6 points, respectively.
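
As a rough illustration of the beam search decoding mentioned in the abstract, the following is a minimal generic sketch and not the authors' implementation. The function `beam_search`, the callback `step_log_probs`, and the toy next-word table are all hypothetical names and data introduced here; a real captioning system would obtain the log-probabilities from its trained decoder, and would typically also apply length normalization, which this sketch omits.

```python
import math


def beam_search(step_log_probs, beam_width, max_len, eos="<eos>"):
    """Generic beam search over token sequences (illustrative sketch only).

    step_log_probs(prefix) must return a dict mapping each candidate next
    token to its log-probability given the tuple `prefix`.
    Returns the best (sequence, total_log_prob) pair found.
    """
    beams = [((), 0.0)]   # unfinished hypotheses: (prefix, cumulative log-prob)
    finished = []         # hypotheses that have emitted the end-of-sequence token

    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates = []
        for prefix, score in beams:
            for token, logp in step_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + logp))

        # Keep only the `beam_width` highest-scoring expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            if prefix[-1] == eos:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break

    # Hypotheses cut off by max_len still compete with finished ones.
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])


def toy_model(prefix):
    """Hypothetical next-word distribution, chosen so greedy decoding is suboptimal."""
    table = {
        (): {"a": 0.6, "the": 0.4},
        ("a",): {"dog": 0.5, "bird": 0.5},
        ("the",): {"bird": 0.9, "dog": 0.1},
    }
    probs = table.get(prefix, {"<eos>": 1.0})
    return {token: math.log(p) for token, p in probs.items()}


best, score = beam_search(toy_model, beam_width=2, max_len=5)
greedy, greedy_score = beam_search(toy_model, beam_width=1, max_len=5)
print(best, math.exp(score))
print(greedy, math.exp(greedy_score))
```

With a beam width of 1 the search reduces to greedy decoding and commits to the locally best first word; a wider beam keeps the second-best first word alive and finds a sequence with higher overall probability in this toy setting.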
