Paper title
Deep-Learning Architectures for Multi-Pitch Estimation: Towards Reliable Evaluation
Paper authors
Abstract
Extracting pitch information from music recordings is a challenging but important problem in music signal processing. Frame-wise transcription, or multi-pitch estimation, aims to detect the simultaneous activity of pitches in polyphonic music recordings and has recently seen major improvements thanks to deep-learning techniques, with a variety of proposed network architectures. In this paper, we realize different architectures based on CNNs, the U-net structure, and self-attention components. We propose several modifications to these architectures, including self-attention modules for skip connections, recurrent layers to replace the self-attention, and a multi-task strategy with simultaneous prediction of the degree of polyphony. We compare variants of these architectures in different sizes for multi-pitch estimation, focusing on Western classical music beyond the piano-solo scenario using the MusicNet and Schubert Winterreise datasets. Our experiments indicate that most architectures yield competitive results and that larger model variants seem to be beneficial. However, we find that these results substantially depend on randomization effects and the particular choice of the training-test split, which calls into question claims of superiority for particular architectures when the improvements are only small. We therefore investigate the influence of dataset splits in the presence of several movements of a work cycle (cross-version evaluation) and propose a best-practice splitting strategy for MusicNet, which weakens the influence of individual test tracks and suppresses overfitting to specific works and recording conditions. A final evaluation on a mixed dataset suggests that improvements on one specific dataset do not necessarily generalize to other scenarios, thus emphasizing the need for further high-quality multi-pitch datasets in order to reliably measure progress in music transcription tasks.