Lightweight End-to-End Speech Recognition from Raw Audio Data Using Sinc-Convolutions

Paper Title

Lightweight End-to-End Speech Recognition from Raw Audio Data Using Sinc-Convolutions

Authors

Ludwig Kürzinger, Nicolas Lindae, Palle Klewitz, Gerhard Rigoll

Abstract

Many end-to-end Automatic Speech Recognition (ASR) systems still rely on pre-processed frequency-domain features that are handcrafted to emulate human hearing. Our work is motivated by recent advances in integrated learnable feature extraction. To this end, we propose Lightweight Sinc-Convolutions (LSC), which combine Sinc-convolutions with depthwise convolutions as a low-parameter, machine-learnable feature extraction front end for end-to-end ASR systems. We integrated LSC into the hybrid CTC/attention architecture for evaluation. The resulting end-to-end model shows smooth convergence behaviour that is further improved by applying SpecAugment in the time domain. We also discuss filter-level improvements, such as using log-compression as the activation function. Our model achieves a word error rate of 10.7% on the TEDlium v2 test dataset, surpassing the corresponding architecture with log-mel filterbank features by an absolute 1.9% while having only 21% of its model size.
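The abstract names Sinc-convolutions and log-compression without spelling them out, so the following is a minimal illustrative sketch, not the authors' implementation. It assumes the common formulation in which each filter is a windowed band-pass kernel defined only by a low and a high cutoff frequency, and it assumes log-compression of the form log(|x| + 1). The function names, the Hamming window, the kernel size of 101, and the 16 kHz sample rate are all choices made for this example. In a trainable model the two cutoffs would be the learnable parameters; here they are fixed numbers, and the depthwise convolutions and the rest of the LSC architecture are not shown.

```python
import math

def sinc(x):
    """Unnormalised sinc: sin(x) / x, with sinc(0) defined as 1."""
    return 1.0 if x == 0 else math.sin(x) / x

def sinc_bandpass(f_low, f_high, kernel_size=101, sample_rate=16000):
    """Windowed band-pass kernel described by just two cutoff frequencies (Hz).

    Assumed formulation: difference of two ideal low-pass filters,
    multiplied by a Hamming window. kernel_size should be odd.
    """
    f1 = f_low / sample_rate    # normalised low cutoff
    f2 = f_high / sample_rate   # normalised high cutoff
    half = (kernel_size - 1) // 2
    kernel = []
    for i in range(kernel_size):
        n = i - half
        ideal = (2 * f2 * sinc(2 * math.pi * f2 * n)
                 - 2 * f1 * sinc(2 * math.pi * f1 * n))
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (kernel_size - 1))
        kernel.append(ideal * window)
    return kernel

def conv1d_valid(signal, kernel):
    """Plain 1-D filtering without padding (the kernel is symmetric)."""
    k = len(kernel)
    return [sum(signal[t + j] * kernel[j] for j in range(k))
            for t in range(len(signal) - k + 1)]

def log_compress(values):
    """Assumed log-compression activation: log(|x| + 1)."""
    return [math.log(abs(v) + 1.0) for v in values]

def tone(freq, n_samples=400, sample_rate=16000):
    """Pure sine tone used as a toy raw-audio input."""
    return [math.sin(2 * math.pi * freq * t / sample_rate)
            for t in range(n_samples)]

kernel = sinc_bandpass(300.0, 3000.0)
in_band = log_compress(conv1d_valid(tone(1000.0), kernel))
out_band = log_compress(conv1d_valid(tone(6000.0), kernel))
```

With these settings the 1000 Hz tone lies inside the 300 to 3000 Hz pass band and the 6000 Hz tone lies outside it, so `in_band` carries far more energy than `out_band`. The point of the parameterisation is that the whole 101-tap kernel is determined by two numbers, which is where the low parameter count of such a front end comes from.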
