Paper Title

TensorDash: Exploiting Sparsity to Accelerate Deep Neural Network Training and Inference

Paper Authors

Mostafa Mahmoud, Isak Edo, Ali Hadi Zadeh, Omar Mohamed Awad, Gennady Pekhimenko, Jorge Albericio, Andreas Moshovos

Abstract

TensorDash is a hardware-level technique for enabling data-parallel MAC units to take advantage of sparsity in their input operand streams. When used to compose a hardware accelerator for deep learning, TensorDash can speed up the training process while also increasing energy efficiency. TensorDash combines a low-cost, sparse input operand interconnect, comprising an 8-input multiplexer per multiplier input, with an area-efficient hardware scheduler. While the interconnect allows only a very limited set of movements per operand, the scheduler can effectively extract sparsity when it is present in the activations, weights, or gradients of neural networks. Over a wide set of models covering various applications, TensorDash accelerates the training process by $1.95{\times}$ while being $1.89\times$ more energy efficient, and $1.6\times$ more energy efficient when taking on-chip and off-chip memory accesses into account. While TensorDash works with any datatype, we demonstrate it with both single-precision floating-point units and bfloat16.
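The scheduling idea in the abstract can be illustrated with a toy model: a MAC lane that may look a few entries ahead in its operand stream and promote the first nonzero operand pair into the current cycle, discarding the zero products it skipped over. This is only a sketch of the general lookahead principle, not the paper's exact 8-input multiplexer connectivity (which also allows limited movement across neighboring lanes); the function name `drain_cycles` and the window shape are illustrative assumptions.

```python
def drain_cycles(stream, lookahead):
    """Cycles one MAC lane needs to drain a stream of (a, w) operand pairs,
    skipping zero products visible within a small lookahead window.
    With lookahead=0 this degenerates to the dense baseline: one pair
    per cycle regardless of sparsity."""
    i, cycles, n = 0, 0, len(stream)
    while i < n:
        # Candidate pairs reachable by the (hypothetical) mux this cycle.
        window = stream[i:i + lookahead + 1]
        # Skip over leading zero-product pairs; they cost no extra cycle
        # as long as they fall inside the reachable window.
        j = 0
        while j < len(window) and window[j][0] * window[j][1] == 0:
            j += 1
        if j < len(window):
            # Consume the promoted nonzero pair plus the skipped zeros.
            i += j + 1
        else:
            # Only zeros in sight: the whole window drains in one cycle.
            i += len(window)
        cycles += 1
    return cycles

# Example: half the products are zero, so a lookahead of 2 halves the cycles.
pairs = [(1, 1), (0, 2), (3, 0), (4, 5)]
print(drain_cycles(pairs, lookahead=0))  # dense baseline: 4 cycles
print(drain_cycles(pairs, lookahead=2))  # sparsity-aware: 2 cycles
```

The speedup this model yields depends on both the sparsity level and the window size, which mirrors the paper's trade-off: a larger movement set extracts more sparsity but costs more multiplexer area.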
