Paper Title


Open-source FPGA-ML codesign for the MLPerf Tiny Benchmark

Authors

Hendrik Borras, Giuseppe Di Guglielmo, Javier Duarte, Nicolò Ghielmetti, Ben Hawks, Scott Hauck, Shih-Chieh Hsu, Ryan Kastner, Jason Liang, Andres Meza, Jules Muhizi, Tai Nguyen, Rushil Roy, Nhan Tran, Yaman Umuroglu, Olivia Weng, Aidan Yokuda, Michaela Blott

Abstract

We present our development experience and recent results for the MLPerf Tiny Inference Benchmark on field-programmable gate array (FPGA) platforms. We use the open-source hls4ml and FINN workflows, which aim to democratize AI-hardware codesign of optimized neural networks on FPGAs. We present the design and implementation process for the keyword spotting, anomaly detection, and image classification benchmark tasks. The resulting hardware implementations are quantized, configurable, spatial dataflow architectures tailored for speed and efficiency and introduce new generic optimizations and common workflows developed as a part of this work. The full workflow is presented from quantization-aware training to FPGA implementation. The solutions are deployed on system-on-chip (Pynq-Z2) and pure FPGA (Arty A7-100T) platforms. The resulting submissions achieve latencies as low as 20 $\mu$s and energy consumption as low as 30 $\mu$J per inference. We demonstrate how emerging ML benchmarks on heterogeneous hardware platforms can catalyze collaboration and the development of new techniques and more accessible tools.
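The abstract's workflow starts from quantization-aware training (QAT), in which the forward pass sees quantized weights and activations so the network learns to tolerate low-precision FPGA arithmetic. The sketch below is a minimal, self-contained numpy illustration of the core fake-quantization step underlying QAT; it is not the paper's hls4ml/FINN implementation, and the function name, bit width, and value range are illustrative assumptions.

```python
import numpy as np

def fake_quantize(x, bits=4, x_min=-1.0, x_max=1.0):
    """Uniformly quantize x to 2**bits levels over [x_min, x_max],
    then map back to floats ("fake" quantization). In QAT frameworks
    such as QKeras or Brevitas, this quantized value is used in the
    forward pass while gradients flow through unchanged
    (straight-through estimator)."""
    levels = 2 ** bits - 1              # number of quantization steps
    scale = (x_max - x_min) / levels    # step size of the uniform grid
    x_clipped = np.clip(x, x_min, x_max)
    q = np.round((x_clipped - x_min) / scale)  # integer grid index
    return q * scale + x_min            # back to float on the grid

# Example: 4-bit quantization of a small weight vector.
# Out-of-range values saturate to the clipping bounds.
w = np.array([-0.73, 0.01, 0.42, 1.5])
w_q = fake_quantize(w, bits=4)
```

On hardware, only the integer index `q` and the shared `scale` would be stored, which is what makes narrow fixed-point datapaths on the FPGA possible.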
