Paper title
Lightweight End-to-End Speech Recognition from Raw Audio Data Using Sinc-Convolutions
Paper authors
Paper abstract
Many end-to-end Automatic Speech Recognition (ASR) systems still rely on pre-processed frequency-domain features that are handcrafted to emulate human hearing. Our work is motivated by recent advances in integrated learnable feature extraction. To this end, we propose Lightweight Sinc-Convolutions (LSC), which integrate Sinc-convolutions with depthwise convolutions as a low-parameter, machine-learnable feature extraction for end-to-end ASR systems. We integrated LSC into the hybrid CTC/attention architecture for evaluation. The resulting end-to-end model shows smooth convergence behaviour that is further improved by applying SpecAugment in the time domain. We also discuss filter-level improvements, such as using log-compression as the activation function. Our model achieves a word error rate of 10.7% on the TEDlium v2 test dataset, surpassing the corresponding architecture with log-mel filterbank features by an absolute 1.9% while having only 21% of its model size.