Paper Title

TMS: A Temporal Multi-scale Backbone Design for Speaker Embedding

Authors

Ruiteng Zhang, Jianguo Wei, Xugang Lu, Wenhuan Lu, Di Jin, Junhai Xu, Lin Zhang, Yantao Ji, Jianwu Dang

Abstract

Speaker embedding is an important front-end module that extracts discriminative speaker features for speech applications requiring speaker information. Current state-of-the-art (SOTA) backbone networks for speaker embedding aggregate multi-scale features from an utterance using multi-branch architectures. However, naively adding many multi-scale branches built from simple full convolutions does not improve performance efficiently, because model parameters and computational complexity grow rapidly. As a result, most current SOTA architectures can afford only a few branches, covering a limited number of temporal scales. To address this problem, we propose an effective temporal multi-scale (TMS) model in which multi-scale branches can be added to a speaker embedding network with almost no increase in computational cost. The model builds on the conventional time delay neural network (TDNN) and separates the network into two modeling operators: a channel-modeling operator and a temporal multi-branch modeling operator. Adding temporal scales in the temporal multi-branch operator requires only a small increase in the number of parameters, which leaves more of the computational budget for additional branches with larger temporal scales. For the inference stage, we further develop a systematic re-parameterization method that converts the TMS-based model into a single-path topology to increase inference speed. We evaluate the TMS method for automatic speaker verification (ASV) under in-domain and out-of-domain conditions. Results show that the TMS-based model significantly outperforms SOTA ASV models while also achieving faster inference.
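The abstract does not give the details of the re-parameterization, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes each temporal branch is a per-channel (depthwise) 1-D convolution with an odd kernel size, zero "same" padding, no dilation, and no normalization or nonlinearity inside the branches, and that branch outputs are summed. Under those assumptions, the parallel branches collapse into a single kernel by zero-padding each kernel to the widest size and adding them. All function names and sizes here are hypothetical. The helper `param_counts` also illustrates why separating channel modeling from temporal modeling keeps extra branches cheap.

```python
import random


def depthwise_conv1d(x, kernels):
    """Apply one odd-length kernel per channel with zero 'same' padding."""
    out = []
    for channel, kernel in zip(x, kernels):
        half = len(kernel) // 2
        padded = [0.0] * half + list(channel) + [0.0] * half
        out.append([
            sum(w * padded[t + i] for i, w in enumerate(kernel))
            for t in range(len(channel))
        ])
    return out


def multi_branch(x, branches):
    """Training-time form: sum the outputs of all parallel temporal branches."""
    total = [[0.0] * len(channel) for channel in x]
    for kernels in branches:
        y = depthwise_conv1d(x, kernels)
        total = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(total, y)]
    return total


def merge_branches(branches):
    """Inference-time form: zero-pad every kernel to the widest size and add."""
    num_channels = len(branches[0])
    k_max = max(len(kernels[0]) for kernels in branches)
    merged = [[0.0] * k_max for _ in range(num_channels)]
    for kernels in branches:
        offset = (k_max - len(kernels[0])) // 2  # centers the smaller kernel
        for c, kernel in enumerate(kernels):
            for i, w in enumerate(kernel):
                merged[c][offset + i] += w
    return merged


def param_counts(channels, kernel_sizes):
    """Weight counts (biases ignored): full-conv branches vs. the split design."""
    full = sum(channels * channels * k for k in kernel_sizes)
    split = channels * channels + sum(channels * k for k in kernel_sizes)
    return full, split


# Hypothetical sizes chosen only for illustration.
random.seed(0)
C, T = 4, 16
kernel_sizes = [1, 3, 5, 7]
x = [[random.uniform(-1, 1) for _ in range(T)] for _ in range(C)]
branches = [
    [[random.uniform(-1, 1) for _ in range(k)] for _ in range(C)]
    for k in kernel_sizes
]

y_multi = multi_branch(x, branches)
y_single = depthwise_conv1d(x, merge_branches(branches))
max_diff = max(
    abs(a - b) for ra, rb in zip(y_multi, y_single) for a, b in zip(ra, rb)
)
print("max difference between multi-branch and merged outputs:", max_diff)
print("weights (full-conv branches, split design):", param_counts(C, kernel_sizes))
```

The merge is exact only because convolution is linear and the branches are summed directly. Branches with dilation, per-branch normalization, or nonlinearities would each need their own handling before they could be folded into one path.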
