Paper Title
Layer-specific Optimization for Mixed Data Flow with Mixed Precision in FPGA Design for CNN-based Object Detectors
Paper Authors
Paper Abstract
Convolutional neural networks (CNNs) require both intensive computation and frequent memory access, which leads to low processing speed and large power dissipation. Although the characteristics of the different layers in a CNN frequently differ considerably, previous hardware designs have applied a common optimization scheme to all of them. This paper proposes a layer-specific design that employs different organizations optimized for the different layers. The proposed design applies two layer-specific optimizations: layer-specific mixed data flow and layer-specific mixed precision. The mixed data flow aims to minimize off-chip access while demanding minimal on-chip memory (BRAM) resources of the FPGA device. The mixed-precision quantization is employed to achieve both lossless accuracy and aggressive model compression, thereby further reducing off-chip access. A Bayesian optimization approach is used to select the best sparsity for each layer, achieving the best trade-off between accuracy and compression. This mixing scheme allows the entire network model to be stored in the BRAMs of the FPGA, aggressively reducing off-chip access and thereby achieving a significant performance enhancement. The model size is reduced by 22.66-28.93 times compared with the full-precision network, with negligible accuracy degradation on the VOC, COCO, and ImageNet datasets. Furthermore, the combination of mixed data flow and mixed precision significantly outperforms previous works in terms of throughput, off-chip access, and on-chip memory requirements.
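To make the Bayesian search for per-layer sparsity concrete, below is a minimal illustrative sketch, not the authors' code. It assumes scikit-optimize's `gp_minimize` as the Bayesian optimizer; the accuracy proxy `accuracy_proxy`, the per-layer parameter counts `PARAMS`, the sensitivity values `SENSITIVITY`, and the trade-off weight `LAMBDA` are all hypothetical stand-ins for the paper's actual accuracy and compression measurements.

```python
# Sketch: per-layer sparsity selection via Bayesian optimization.
# Assumptions (not from the paper): scikit-optimize as the optimizer,
# a synthetic accuracy proxy, and illustrative layer statistics.
import numpy as np
from skopt import gp_minimize
from skopt.space import Real

NUM_LAYERS = 8                                   # assumed network depth
PARAMS = np.geomspace(1e4, 1e6, NUM_LAYERS)      # assumed per-layer parameter counts
SENSITIVITY = np.linspace(0.5, 2.0, NUM_LAYERS)  # assumed per-layer pruning sensitivity
LAMBDA = 0.1                                     # assumed accuracy/compression trade-off weight

def accuracy_proxy(sparsity):
    # Toy stand-in for validation accuracy: pruning a sensitive layer
    # hurts more, and the damage grows with the pruned fraction.
    drop = sum(s * r ** 2 for s, r in zip(SENSITIVITY, sparsity))
    return max(0.0, 1.0 - 0.05 * drop)

def kept_fraction(sparsity):
    # Fraction of weights remaining after per-layer pruning
    # (smaller means a more compressed model).
    kept = sum(p * (1.0 - r) for p, r in zip(PARAMS, sparsity))
    return kept / PARAMS.sum()

def objective(sparsity):
    # gp_minimize minimizes, so negate the accuracy-minus-size score.
    return -(accuracy_proxy(sparsity) - LAMBDA * kept_fraction(sparsity))

# One sparsity ratio per layer, each searched over [0, 0.95].
space = [Real(0.0, 0.95, name=f"sparsity_layer_{i}") for i in range(NUM_LAYERS)]
result = gp_minimize(objective, space, n_calls=60, random_state=0)

print("best per-layer sparsity:", np.round(result.x, 2))
print("proxy accuracy:", round(accuracy_proxy(result.x), 4))
```

In a real deployment the proxy would be replaced by fine-tuning and evaluating the pruned, quantized network on a validation set, so that the optimizer's trade-off mirrors the accuracy-versus-compression balance described in the abstract.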