Paper title
Importance of Data Loading Pipeline in Training Deep Neural Networks
Paper authors
Abstract
Training large-scale deep neural networks is a long, time-consuming operation that often requires many GPUs for acceleration. For large models, data loading accounts for a significant portion of total training time. Because GPU servers are typically expensive, techniques that reduce training time are valuable. Slow training is observed especially in real-world applications that require extensive data augmentation. Data augmentation techniques include padding, rotation, adding noise, downsampling, upsampling, etc. These additional operations increase the need to build an efficient data loading pipeline and to explore existing tools that speed up training. We focus on comparing two main tools designed for this task: a binary data format to accelerate data reading, and NVIDIA DALI to accelerate data augmentation. Our study shows an improvement on the order of 20% to 40% when such dedicated tools are used.