Paper Title


MCSAE: Masked Cross Self-Attentive Encoding for Speaker Embedding

Paper Authors

Soonshin Seo, Ji-Hwan Kim

Abstract


In general, a self-attention mechanism has been applied to speaker embedding encoding. Previous studies focused on training self-attention in a high-level layer, such as the last pooling layer. However, this reduces the effect of low-level features in the speaker embedding encoding. Therefore, we propose masked cross self-attentive encoding (MCSAE) using ResNet, which focuses on the features of both high-level and low-level layers. Based on multi-layer aggregation, the output features of each residual layer are used for the MCSAE. In the MCSAE, a cross self-attention module is trained on the interdependence of the input features, and a random masking regularization module is applied to prevent overfitting. As such, the MCSAE enhances the weights of frames that represent speaker information. The output features are then concatenated and encoded into the speaker embedding. Therefore, a more informative speaker embedding is encoded by using the MCSAE. The experimental results showed an equal error rate of 2.63% and a minimum detection cost function of 0.1453 on the VoxCeleb1 evaluation dataset, improving on previous self-attentive encoding and state-of-the-art encoding methods.
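The abstract describes two of the core ingredients: self-attentive pooling over frame-level features from a residual layer, and random masking of attention scores as regularization. The sketch below is a minimal NumPy illustration of these two ideas for a single layer; the function name, the parametrization of the scorer (`W`, `v`), and the `mask_prob` parameter are assumptions for illustration, not the paper's exact formulation (the paper's cross self-attention across multiple residual layers is not reproduced here).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attentive_pooling(H, W, v, mask_prob=0.0, rng=None):
    """Hypothetical single-layer sketch of masked self-attentive pooling.

    H: (T, D) frame-level features from one residual layer.
    W: (D, A), v: (A,) -- learnable parameters of the attention scorer.
    mask_prob: probability of randomly masking a frame's score
               during training (simplified random-masking regularization).
    Returns a (D,) utterance-level vector (weighted sum of frames).
    """
    rng = rng or np.random.default_rng(0)
    scores = np.tanh(H @ W) @ v                    # (T,) one score per frame
    if mask_prob > 0:
        keep = rng.random(scores.shape[0]) > mask_prob
        scores = np.where(keep, scores, -np.inf)   # masked frames get zero weight
    alpha = softmax(scores)                        # attention weights over frames
    return alpha @ H                               # pooled embedding
```

Under a multi-layer aggregation scheme like the one the abstract mentions, pooled vectors from each residual layer would then be concatenated before the final embedding layer.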
