Paper Title

On the Difficulty of Designing Processor Arrays for Deep Neural Networks

Authors

Kevin Stehle, Günther Schindler, Holger Fröning

Abstract

Systolic arrays are a promising computing concept that is particularly in line with CMOS technology trends and the linear algebra operations found in the processing of artificial neural networks. The recent success of such deep learning methods in a wide set of applications has led to a variety of models which, albeit conceptually similar in being based on convolutions and fully-connected layers, show a huge diversity in operations due to a large design space: an operand's dimensions vary substantially since they depend on design principles such as receptive field size, number of features, and the striding, dilating, and grouping of features. Furthermore, recent networks extend previously plain feedforward models with various forms of connectivity, such as in ResNet or DenseNet. The problem of choosing an optimal systolic array configuration cannot be solved analytically; instead, methods and tools are required that facilitate fast and accurate reasoning about optimality in terms of total cycles, utilization, and amount of data movement. In this work we introduce Camuy, a lightweight model of a weight-stationary systolic array for linear algebra operations that allows quick exploration of different configurations, such as systolic array dimensions and input/output bitwidths. Camuy aids accelerator designers in finding either optimal configurations for a particular network architecture or robust performance across a variety of network architectures. It offers simple integration into existing machine learning tool stacks (e.g., TensorFlow) through custom operators. We present an analysis of popular DNN models to illustrate how Camuy can estimate required cycles, data movement costs, and systolic array utilization, and show how progress in network architecture design impacts the efficiency of inference on accelerators based on systolic arrays.
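
To make the kind of estimation the abstract describes concrete, the sketch below shows a minimal first-order cost model for a weight-stationary systolic array executing a matrix multiplication. This is an illustrative assumption-based sketch, not Camuy's actual implementation or API: the names ArrayConfig and gemm_cost, and the cycle accounting (weight preload plus pipeline fill/drain per tile), are all hypothetical simplifications.

```python
import math
from dataclasses import dataclass


@dataclass
class ArrayConfig:
    """A weight-stationary systolic array of rows x cols MAC units (illustrative)."""
    rows: int
    cols: int


def gemm_cost(m: int, k: int, n: int, cfg: ArrayConfig) -> dict:
    """First-order estimate for C = A (m x k) @ B (k x n).

    Assumes B (the weights) is held stationary, so it is partitioned into
    ceil(k/rows) x ceil(n/cols) tiles; A is re-streamed once per column tile.
    """
    row_tiles = math.ceil(k / cfg.rows)
    col_tiles = math.ceil(n / cfg.cols)
    tiles = row_tiles * col_tiles

    # Per tile: preload weights row by row, stream m activation rows,
    # then account for pipeline fill/drain (~ rows + cols cycles).
    cycles_per_tile = cfg.rows + m + (cfg.rows + cfg.cols)
    total_cycles = tiles * cycles_per_tile

    macs = m * k * n                            # useful work
    peak = total_cycles * cfg.rows * cfg.cols   # array capacity over the run
    data_movement = {
        "weights_in": k * n,                    # each weight loaded exactly once
        "activations_in": m * k * col_tiles,    # A re-streamed per column tile
        "partials_out": m * n * row_tiles,      # partial sums emitted per row tile
    }
    return {
        "cycles": total_cycles,
        "utilization": macs / peak,
        "data_movement": data_movement,
    }


# Sweep a few array shapes for a 2048 -> 1000 fully-connected classifier head
# at batch size 8 -- the kind of configuration exploration the paper targets.
for rows, cols in [(64, 64), (128, 128), (256, 256)]:
    est = gemm_cost(m=8, k=2048, n=1000, cfg=ArrayConfig(rows, cols))
    print(rows, cols, est["cycles"], f'{est["utilization"]:.2%}')
```

Even this toy model reproduces the qualitative effect the paper studies: at small batch sizes, larger arrays finish in fewer cycles but at sharply lower utilization, because each weight tile is amortized over too few streamed activations. Camuy's purpose is to make such trade-offs measurable for real DNN layers rather than hand-derived as above.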
