Paper Title
An efficient encoder-decoder architecture with top-down attention for speech separation
Paper Authors
Paper Abstract
Deep neural networks have shown excellent prospects in speech separation tasks. However, obtaining good results while keeping low model complexity remains challenging in real-world applications. In this paper, we provide a bio-inspired, efficient encoder-decoder architecture that mimics the brain's top-down attention, called TDANet, which reduces model complexity without sacrificing performance. The top-down attention in TDANet is extracted by a global attention (GA) module and cascaded local attention (LA) layers. The GA module takes multi-scale acoustic features as input to extract a global attention signal, which then modulates features of different scales through direct top-down connections. The LA layers use features of adjacent layers as input to extract local attention signals, which are used to modulate the lateral input in a top-down manner. On three benchmark datasets, TDANet consistently achieved separation performance competitive with previous state-of-the-art (SOTA) methods while being more efficient. Specifically, TDANet's multiply-accumulate operations (MACs) are only 5\% of those of Sepformer, one of the previous SOTA models, and its CPU inference time is only 10\% of Sepformer's. In addition, a large-size version of TDANet obtained SOTA results on all three datasets, with MACs still only 10\% of Sepformer's and CPU inference time only 24\% of Sepformer's.
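The following is a minimal, illustrative sketch of the top-down attention flow described in the abstract, not the authors' TDANet implementation. The module names (GlobalAttention, LocalAttention), the use of average pooling, 1x1 convolutions, and sigmoid gating, and the tensor shapes are all assumptions introduced only to show the data flow: a global signal computed from multi-scale features modulates each scale, and each local step modulates a finer lateral feature with the signal coming from the coarser level above.

```python
# Hedged sketch of GA-style and LA-style top-down attention (assumed design,
# not the paper's code). Shapes: (batch, channels, time).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalAttention(nn.Module):
    """Pools multi-scale features to the coarsest resolution and emits a gate."""

    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, feats):
        # feats: list of (B, C, T_i) tensors at decreasing temporal resolution.
        target_len = feats[-1].shape[-1]
        pooled = [F.adaptive_avg_pool1d(f, target_len) for f in feats]
        fused = torch.stack(pooled, dim=0).sum(dim=0)
        # Global attention signal at the coarsest scale, (B, C, T_min).
        return torch.sigmoid(self.proj(fused))


class LocalAttention(nn.Module):
    """Modulates a lateral feature with the adjacent, coarser top-down feature."""

    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, lateral, top_down):
        # Upsample the coarser top-down feature to the lateral resolution,
        # turn it into a gate, and modulate the lateral input multiplicatively.
        top_down = F.interpolate(top_down, size=lateral.shape[-1], mode="nearest")
        gate = torch.sigmoid(self.proj(top_down))
        return lateral * gate


if __name__ == "__main__":
    B, C = 2, 64
    # Three scales of "acoustic features" with halving temporal resolution.
    feats = [torch.randn(B, C, t) for t in (1000, 500, 250)]

    ga = GlobalAttention(C)
    g = ga(feats)  # (B, C, 250)

    # Modulate every scale with the global signal via direct top-down connections.
    modulated = [f * F.interpolate(g, size=f.shape[-1], mode="nearest") for f in feats]

    # Cascaded local attention: start from the coarsest scale and successively
    # refine each finer lateral feature in a top-down manner.
    la = LocalAttention(C)
    top_down = modulated[-1]
    for lateral in reversed(modulated[:-1]):
        top_down = la(lateral, top_down)
    print(top_down.shape)  # torch.Size([2, 64, 1000])
```

In this sketch, gating is done with element-wise multiplication after sigmoid projection; the actual paper may combine the attention signal with the lateral features differently.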