Paper Title

Small-Footprint Keyword Spotting with Multi-Scale Temporal Convolution

Paper Authors

Ximin Li, Xiaodong Wei, Xiaowei Qin

Abstract

Keyword Spotting (KWS) plays a vital role in human-computer interaction for smart on-device terminals and service robots. It remains challenging to achieve the trade-off between small footprint and high accuracy for the KWS task. In this paper, we explore the application of multi-scale temporal modeling to the small-footprint keyword spotting task. We propose a multi-branch temporal convolution module (MTConv), a CNN block consisting of multiple temporal convolution filters with different kernel sizes, which enriches the temporal feature space. Besides, taking advantage of temporal and depthwise convolution, a temporal efficient neural network (TENet) is designed for the KWS system. Based on the proposed model, we replace standard temporal convolution layers with MTConvs, which can be trained for better performance. At the inference stage, the MTConv can be equivalently converted to the base convolution architecture, so no extra parameters or computational costs are added compared to the base model. The results on Google Speech Command Dataset show that one of our models trained with MTConv achieves 96.8% accuracy with only 100K parameters.
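To make the branch-fusion idea in the abstract concrete, below is a minimal PyTorch sketch of a multi-branch depthwise temporal convolution whose parallel branches can be merged into one equivalent convolution for inference. This is a sketch under stated assumptions, not the authors' released MTConv/TENet implementation: the class name, the kernel sizes (3, 5, 9), the use of bias-only fusion, and the omission of batch-normalization folding are all illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleTemporalConv(nn.Module):
    """Illustrative multi-branch temporal convolution (MTConv-style).

    Training: the output is the sum of parallel depthwise 1D convolutions
    with different (odd) kernel sizes, each 'same'-padded.
    Inference: fuse() zero-pads every smaller kernel to the largest size
    and sums the weights and biases, yielding a single convolution that is
    mathematically equivalent to the multi-branch module.
    """

    def __init__(self, channels, kernel_sizes=(3, 5, 9)):
        super().__init__()
        self.kernel_sizes = sorted(kernel_sizes)  # assumed odd sizes
        self.max_k = self.kernel_sizes[-1]
        self.branches = nn.ModuleList([
            nn.Conv1d(channels, channels, k, padding=k // 2,
                      groups=channels, bias=True)
            for k in self.kernel_sizes
        ])
        self.fused = None  # set by fuse()

    def forward(self, x):
        if self.fused is not None:
            return self.fused(x)
        return sum(branch(x) for branch in self.branches)

    @torch.no_grad()
    def fuse(self):
        channels = self.branches[0].weight.shape[0]
        fused = nn.Conv1d(channels, channels, self.max_k,
                          padding=self.max_k // 2, groups=channels, bias=True)
        weight = torch.zeros_like(fused.weight)
        bias = torch.zeros_like(fused.bias)
        for branch in self.branches:
            k = branch.kernel_size[0]
            pad = (self.max_k - k) // 2
            # Zero-pad the smaller kernel symmetrically to the largest
            # size; with 'same' padding this preserves the branch output.
            weight += F.pad(branch.weight, (pad, pad))
            bias += branch.bias
        fused.weight.copy_(weight)
        fused.bias.copy_(bias)
        self.fused = fused
        return self


if __name__ == "__main__":
    # Quick equivalence check: multi-branch vs. fused single convolution.
    m = MultiScaleTemporalConv(channels=16).eval()
    x = torch.randn(2, 16, 101)          # (batch, channels, time frames)
    y_multi = m(x)
    y_fused = m.fuse()(x)
    print(torch.allclose(y_multi, y_fused, atol=1e-5))  # True
```

After fuse(), the module runs a single depthwise convolution with the same parameter count and cost as the largest-kernel branch alone, which mirrors the abstract's claim that MTConv adds no extra parameters or computation at inference relative to the base model.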
