Paper Title

Energy-efficient Dense DNN Acceleration with Signed Bit-slice Architecture

Authors

Dongseok Im, Gwangtae Park, Zhiyong Li, Junha Ryu, Hoi-Jun Yoo

Abstract

As the number of deep neural networks (DNNs) executed on a mobile system-on-chip (SoC) increases, the mobile SoC struggles to provide real-time DNN acceleration within its limited hardware resources and power budget. Although previous mobile neural processing units (NPUs) take advantage of low-bit computing and sparsity exploitation, they are incapable of accelerating high-precision and dense DNNs. This paper proposes an energy-efficient signed bit-slice architecture that accelerates both high-precision and dense DNNs by exploiting the large number of zero values in signed bit-slices. The proposed signed bit-slice representation (SBR) changes a signed $1111_{2}$ bit-slice to $0000_{2}$ by borrowing a $1$ from its lower-order bit-slice. As a result, it generates a large number of zero bit-slices even in dense DNNs. Moreover, it balances the positive and negative values of 2's complement data, allowing bit-slice-based output speculation, which pre-computes the high-order bit-slices and skips the remaining dense low-order bit-slices. The signed bit-slice architecture compresses and skips zero-valued input signed bit-slices, and the zero-skipping unit also supports output skipping by masking the speculated inputs as zero. Additionally, the heterogeneous network-on-chip (NoC) helps exploit data reusability and reduces transmission bandwidth. The paper introduces a specialized instruction set architecture (ISA) and a hierarchical instruction decoder to control the signed bit-slice architecture. Finally, the signed bit-slice architecture outperforms the previous bit-slice accelerator, Bit-fusion, with over $3.65\times$ higher area efficiency, $3.88\times$ higher energy efficiency, and $5.35\times$ higher throughput.
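The borrowing step described in the abstract can be illustrated with a small sketch. This is only one plausible reading of the abstract, not the authors' exact encoding: it assumes the top slice is a signed 4-bit value and each lower slice carries 3 data bits (radix 8), and the function names `conventional_slices`, `signed_slices`, and `value_of` are hypothetical. For a negative number, each nonzero lower slice is made negative (minus 8) and the next higher slice is incremented by 1, so a top slice of $1111_{2}$ (that is, $-1$) becomes $0000_{2}$.

```python
def conventional_slices(x, num_slices=2):
    """Split a 2's-complement integer into slices, lowest slice first.

    Assumed layout (not taken from the paper): lower slices are unsigned
    3-bit values (0..7) and the top slice is a signed 4-bit value (-8..7).
    """
    if not -(8 ** num_slices) <= x < 8 ** num_slices:
        raise ValueError("x does not fit in the assumed slice layout")
    slices = []
    for _ in range(num_slices - 1):
        slices.append(x & 7)  # low 3 bits; Python treats negatives as 2's complement
        x >>= 3               # arithmetic shift keeps the sign
    slices.append(x)          # remaining signed top slice
    return slices


def signed_slices(x, num_slices=2):
    """Sketch of a signed bit-slice recoding under the assumptions above."""
    slices = conventional_slices(x, num_slices)
    if x < 0:
        for i in range(num_slices - 1):
            if slices[i] > 0:
                slices[i] -= 8      # lower slice becomes a negative signed value
                slices[i + 1] += 1  # the higher slice absorbs the borrowed 1
    return slices


def value_of(slices):
    """Recombine radix-8 slices (lowest first) into the integer they encode."""
    return sum(d * 8 ** i for i, d in enumerate(slices))


if __name__ == "__main__":
    for x in (-1, -9, 5):
        print(x, conventional_slices(x), signed_slices(x))
```

Under these assumptions, small negative values no longer produce an all-ones top slice, so the count of zero-valued high-order slices grows, and every slice of a negative number is non-positive, which is consistent with the balance between positive and negative values that the abstract mentions.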
