Energy-efficient Dense DNN Acceleration with Signed Bit-slice Architecture

Paper title

Energy-efficient Dense DNN Acceleration with Signed Bit-slice Architecture

Authors

Dongseok Im, Gwangtae Park, Zhiyong Li, Junha Ryu, Hoi-Jun Yoo

Abstract

As the number of deep neural networks (DNNs) executed on a mobile system-on-chip (SoC) increases, the mobile SoC struggles to deliver real-time DNN acceleration within its limited hardware resources and power budget. Although previous mobile neural processing units (NPUs) take advantage of low-bit computing and sparsity exploitation, they are incapable of accelerating high-precision and dense DNNs. This paper proposes an energy-efficient signed bit-slice architecture that accelerates both high-precision and dense DNNs by exploiting the large number of zero values in signed bit-slices. The proposed signed bit-slice representation (SBR) changes a signed $1111_{2}$ bit-slice to $0000_{2}$ by borrowing a $1$ from the next lower-order bit-slice. As a result, it generates a large number of zero bit-slices even in dense DNNs. Moreover, it balances the positive and negative values of 2's complement data, which enables bit-slice-based output speculation: the high-order bit-slices are pre-computed, and the remaining dense low-order bit-slices are skipped. The signed bit-slice architecture compresses and skips the zero-valued input signed bit-slices, and its zero-skipping unit also supports output skipping by masking the speculated inputs as zero. Additionally, a heterogeneous network-on-chip (NoC) helps exploit data reusability and reduces the required transmission bandwidth. The paper also introduces a specialized instruction set architecture (ISA) and a hierarchical instruction decoder to control the signed bit-slice architecture. Finally, the signed bit-slice architecture outperforms the previous bit-slice accelerator, Bit-fusion, with more than $3.65\times$ higher area efficiency, $3.88\times$ higher energy efficiency, and $5.35\times$ higher throughput.
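
The borrowing step described in the abstract can be illustrated with a small sketch. This is one plausible reading of the idea, not the paper's actual encoding: the slice width (`SLICE_BITS = 3`), the two-slice layout, and the function names `conventional_slices` and `signed_slices` are all assumptions made for illustration. The sketch splits a small 2's complement integer into a high and a low slice, then moves one unit from the high slice into the low slice whenever the value is negative, so that a high slice of $-1$ ($1111_{2}$) becomes $0$ and the low slice becomes negative.

```python
SLICE_BITS = 3  # assumed magnitude bits per slice (hypothetical choice)
RADIX = 1 << SLICE_BITS


def conventional_slices(x: int) -> tuple[int, int]:
    """Plain 2's complement split: signed high slice, unsigned low slice."""
    return x >> SLICE_BITS, x & (RADIX - 1)


def signed_slices(x: int) -> tuple[int, int]:
    """Assumed SBR-like split: both slices carry the sign of x.

    For a negative x with a non-zero low slice, add 1 to the high slice
    and subtract RADIX from the low slice. The value is unchanged, but a
    high slice of -1 (1111 in 4-bit 2's complement) becomes 0.
    """
    hi, lo = conventional_slices(x)
    if hi < 0 and lo != 0:
        hi += 1
        lo -= RADIX
    return hi, lo


def reconstruct(hi: int, lo: int) -> int:
    """Recombine two slices into the original integer."""
    return hi * RADIX + lo


def count_zero_high(split, values) -> int:
    """Count how many values yield a zero high-order slice."""
    return sum(1 for v in values if split(v)[0] == 0)


if __name__ == "__main__":
    small = range(-7, 8)  # dense, near-zero data with mixed signs
    print("conventional:", count_zero_high(conventional_slices, small))
    print("signed      :", count_zero_high(signed_slices, small))
```

Under these assumptions, small negative values no longer produce an all-ones high slice, which is the property the abstract credits for the extra zero bit-slices in dense data. The real design's slice widths, compression format, and speculation logic are not covered by this sketch.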

Paper title