Sound Event Localization and Detection Using Activity-Coupled Cartesian DOA Vector and RD3net

Paper Title

Sound Event Localization and Detection Using Activity-Coupled Cartesian DOA Vector and RD3net

Authors

Shimada, Kazuki; Takahashi, Naoya; Takahashi, Shusuke; Mitsufuji, Yuki

Abstract

This report describes our systems submitted to DCASE2020 Task 3: Sound Event Localization and Detection (SELD). We consider two systems: a single-stage system that solves sound event localization (SEL) and sound event detection (SED) simultaneously, and a two-stage system that first handles the SED and SEL tasks individually and later combines their results. For the single-stage system, we propose a unified training framework that uses an activity-coupled Cartesian DOA vector (ACCDOA) representation as a single target for both the SED and SEL tasks. To efficiently estimate sound event locations and activities, we further propose RD3Net, which incorporates recurrent and convolution layers with dense skip connections and dilation. To generalize the models, we apply three data augmentation techniques: equalized mixture data augmentation (EMDA), rotation of first-order Ambisonics (FOA) signals, and a multichannel extension of SpecAugment. Our systems demonstrate a significant improvement over the baseline system.
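The ACCDOA idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes that, for one sound class in one time frame, the target is a unit direction-of-arrival vector in Cartesian coordinates scaled by the event activity (1 if active, 0 otherwise), so that the vector length carries the detection result and the vector direction carries the localization result. The function names `accdoa_encode` and `accdoa_decode`, the azimuth/elevation convention, and the 0.5 threshold are assumptions made for illustration and are not taken from the report.

```python
import math


def accdoa_encode(active, azimuth_deg, elevation_deg):
    """Build one ACCDOA-style target vector for a single class and frame.

    The vector is the unit DOA vector scaled by the activity (0 or 1).
    Assumed convention: azimuth in the x-y plane, elevation from that plane.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    scale = 1.0 if active else 0.0
    x = scale * math.cos(el) * math.cos(az)
    y = scale * math.cos(el) * math.sin(az)
    z = scale * math.sin(el)
    return (x, y, z)


def accdoa_decode(vector, threshold=0.5):
    """Recover (is_active, azimuth_deg, elevation_deg) from one vector.

    Activity is decided by comparing the vector length to a threshold
    (0.5 here is an assumed value); the angles come from its direction.
    """
    x, y, z = vector
    norm = math.sqrt(x * x + y * y + z * z)
    azimuth = math.degrees(math.atan2(y, x))
    elevation = math.degrees(math.atan2(z, math.hypot(x, y)))
    return norm > threshold, azimuth, elevation
```

In this sketch a single three-dimensional regression target replaces a separate activity label and DOA label: encoding an active event and decoding it again returns the original angles, while an inactive event encodes to the zero vector and is decoded as inactive.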
