The Impact of Audio Input Representations on Neural Network Based Music Transcription

Paper title

The impact of audio input representations on neural network based music transcription

Authors

Kin Wai Cheuk, Kat Agres, Dorien Herremans

Abstract

This paper thoroughly analyses the effect of different input representations on polyphonic multi-instrument music transcription. We use our own GPU-based spectrogram extraction tool, nnAudio, to investigate the influence of using a linear-frequency spectrogram, log-frequency spectrogram, Mel spectrogram, and constant-Q transform (CQT). Our results show that an $8.33$% increase in transcription accuracy and a $9.39$% reduction in error can be obtained by choosing the appropriate input representation (a log-frequency spectrogram with an STFT window length of 4,096 and 2,048 frequency bins in the spectrogram) without changing the neural network design (single-layer, fully connected). Our experiments also show that the Mel spectrogram is a compact representation for which we can reduce the number of frequency bins to only 512 while still keeping a relatively high music transcription accuracy.
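
To make the best-performing configuration concrete, the following is a minimal NumPy sketch of a linear-frequency spectrogram with an STFT window length of 4,096 and 2,048 frequency bins, followed by a crude log-frequency version. It does not use nnAudio and is not the authors' implementation: the hop size, the 44.1 kHz sample rate, the lower frequency bound `f_min`, and the interpolation-based log-frequency mapping are all assumptions made for illustration.

```python
import numpy as np

def linear_spectrogram(signal, n_fft=4096, hop=512, n_bins=2048):
    """Magnitude STFT with a Hann window, keeping the lowest n_bins bins.

    hop=512 is an assumed value; the abstract only gives the window
    length (4,096) and the number of frequency bins (2,048).
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    spec = np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, n_fft // 2 + 1)
    return spec[:, :n_bins].T                   # (n_bins, n_frames)

def log_frequency_spectrogram(lin_spec, sr, n_fft=4096, n_bins=2048, f_min=50.0):
    """Resample the linear-frequency axis onto log-spaced frequencies.

    Linear interpolation is a simple stand-in for a proper log-frequency
    filter bank; f_min is a hypothetical lower bound.
    """
    lin_freqs = np.arange(lin_spec.shape[0]) * sr / n_fft
    log_freqs = np.geomspace(f_min, lin_freqs[-1], n_bins)
    log_spec = np.stack(
        [np.interp(log_freqs, lin_freqs, frame) for frame in lin_spec.T], axis=1
    )
    return log_spec, log_freqs

# Usage: one second of a 440 Hz sine at an assumed 44.1 kHz sample rate.
sr = 44100
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)
lin = linear_spectrogram(tone)
log_spec, log_freqs = log_frequency_spectrogram(lin, sr)
```

Both outputs have 2,048 frequency bins per frame, so either one could feed the same single-layer network; only the spacing of the frequency axis differs, with the log-frequency version devoting more bins to the low register where musical pitches are closely spaced.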
