Paper Title
Layer-specific Optimization for Mixed Data Flow with Mixed Precision in FPGA Design for CNN-based Object Detectors
Paper Authors
Paper Abstract
Convolutional neural networks (CNNs) require both intensive computation and frequent memory access, which leads to low processing speed and large power dissipation. Although the characteristics of the different layers in a CNN frequently differ considerably, previous hardware designs have applied a common optimization scheme to all of them. This paper proposes a layer-specific design that employs different organizations optimized for the different layers. The proposed design applies two layer-specific optimizations: layer-specific mixed data flow and layer-specific mixed precision. The mixed data flow aims to minimize off-chip access while demanding minimal on-chip memory (BRAM) resources of the FPGA device. The mixed-precision quantization is employed to achieve both lossless accuracy and aggressive model compression, thereby further reducing off-chip access. A Bayesian optimization approach is used to select the best sparsity for each layer, achieving the best trade-off between accuracy and compression. This mixing scheme allows the entire network model to be stored in the BRAMs of the FPGA, aggressively reducing off-chip access and thereby achieving a significant performance enhancement. The model size is reduced by 22.66-28.93 times compared with the full-precision network, with negligible accuracy degradation on the VOC, COCO, and ImageNet datasets. Furthermore, the combination of mixed data flow and mixed precision significantly outperforms previous works in terms of throughput, off-chip access, and on-chip memory requirements.
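To make the Bayesian search for per-layer sparsity concrete, below is a minimal illustrative sketch, not the authors' code. It assumes scikit-optimize's `gp_minimize` as the Bayesian optimizer; the accuracy proxy `accuracy_proxy`, the per-layer parameter counts `PARAMS`, the sensitivity values `SENSITIVITY`, and the trade-off weight `LAMBDA` are all hypothetical stand-ins for the paper's actual accuracy and compression measurements.

```python
# Sketch: per-layer sparsity selection via Bayesian optimization.
# Assumptions (not from the paper): scikit-optimize as the optimizer,
# a synthetic accuracy proxy, and illustrative layer statistics.
import numpy as np
from skopt import gp_minimize
from skopt.space import Real

NUM_LAYERS = 8                                   # assumed network depth
PARAMS = np.geomspace(1e4, 1e6, NUM_LAYERS)      # assumed per-layer parameter counts
SENSITIVITY = np.linspace(0.5, 2.0, NUM_LAYERS)  # assumed per-layer pruning sensitivity
LAMBDA = 0.1                                     # assumed accuracy/compression trade-off weight

def accuracy_proxy(sparsity):
    # Toy stand-in for validation accuracy: pruning a sensitive layer
    # hurts more, and the damage grows with the pruned fraction.
    drop = sum(s * r ** 2 for s, r in zip(SENSITIVITY, sparsity))
    return max(0.0, 1.0 - 0.05 * drop)

def kept_fraction(sparsity):
    # Fraction of weights remaining after per-layer pruning
    # (smaller means a more compressed model).
    kept = sum(p * (1.0 - r) for p, r in zip(PARAMS, sparsity))
    return kept / PARAMS.sum()

def objective(sparsity):
    # gp_minimize minimizes, so negate the accuracy-minus-size score.
    return -(accuracy_proxy(sparsity) - LAMBDA * kept_fraction(sparsity))

# One sparsity ratio per layer, each searched over [0, 0.95].
space = [Real(0.0, 0.95, name=f"sparsity_layer_{i}") for i in range(NUM_LAYERS)]
result = gp_minimize(objective, space, n_calls=60, random_state=0)

print("best per-layer sparsity:", np.round(result.x, 2))
print("proxy accuracy:", round(accuracy_proxy(result.x), 4))
```

In a real deployment the proxy would be replaced by fine-tuning and evaluating the pruned, quantized network on a validation set, so that the optimizer's trade-off mirrors the accuracy-versus-compression balance described in the abstract.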