ICARUS: A Specialized Architecture for Neural Radiance Fields Rendering

Paper Title

ICARUS: A Specialized Architecture for Neural Radiance Fields Rendering

Authors

Chaolin Rao, Huangjie Yu, Haochuan Wan, Jindong Zhou, Yueyang Zheng, Yu Ma, Anpei Chen, Minye Wu, Binzhe Yuan, Pingqiang Zhou, Xin Lou, Jingyi Yu

Abstract

The practical deployment of Neural Radiance Fields (NeRF) in rendering applications faces several challenges, the most critical being low rendering speed even on high-end graphics processing units (GPUs). In this paper, we present ICARUS, a specialized accelerator architecture tailored for NeRF rendering. Unlike GPUs, which use general-purpose computing and memory architectures for NeRF, ICARUS executes the complete NeRF pipeline using dedicated plenoptic cores (PLCore), each consisting of a positional encoding unit (PEU), a multi-layer perceptron (MLP) engine, and a volume rendering unit (VRU). A PLCore takes in positions and directions and renders the corresponding pixel colors without any intermediate data going off-chip for temporary storage and exchange, which can be time- and power-consuming. To implement the most expensive component of NeRF, i.e., the MLP, we transform the fully connected operations into approximated reconfigurable multiple constant multiplications (MCMs), where common subexpressions are shared across different multiplications to improve computation efficiency. We build a prototype of ICARUS using Synopsys HAPS-80 S104, a field programmable gate array (FPGA)-based prototyping system for large-scale integrated circuits and systems design. We evaluate the power-performance-area (PPA) of a PLCore using 40 nm LP CMOS technology. Working at 400 MHz, a single PLCore occupies 16.5 mm$^2$ and consumes 282.8 mW, translating to 0.105 µJ/sample. The results are compared with those of GPU and tensor processing unit (TPU) implementations.
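The three stages named in the abstract (positional encoding, MLP, volume rendering) can be illustrated in software with a minimal sketch. This is not the ICARUS hardware design or the authors' code. It assumes the sinusoidal encoding and front-to-back alpha compositing commonly used for NeRF, and it replaces the trained MLP with a hypothetical stand-in, `toy_field`, which also ignores the viewing direction as an input. All function names and parameters are illustrative assumptions.

```python
import math


def positional_encoding(x, num_freqs):
    """Map each coordinate p to sin(2^k * pi * p), cos(2^k * pi * p) for k < num_freqs."""
    out = []
    for p in x:
        for k in range(num_freqs):
            angle = (2.0 ** k) * math.pi * p
            out.append(math.sin(angle))
            out.append(math.cos(angle))
    return out


def toy_field(encoded):
    """Hypothetical stand-in for the MLP stage; NOT a trained network."""
    s = sum(encoded) / len(encoded)
    density = abs(s) * 5.0
    rgb = [0.5 + 0.5 * math.tanh(s), 0.5, 0.5 - 0.5 * math.tanh(s)]
    return density, rgb


def volume_render(densities, colors, deltas):
    """Composite per-sample (density, RGB) along one ray, front to back."""
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    for sigma, rgb, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        for c in range(3):
            pixel[c] += weight * rgb[c]
        transmittance *= 1.0 - alpha
    return pixel, transmittance


def render_ray(origin, direction, near, far, num_samples, num_freqs=4):
    """Run encode -> field -> composite for one ray, sampling at segment midpoints."""
    delta = (far - near) / num_samples
    densities, colors, deltas = [], [], []
    for i in range(num_samples):
        t = near + (i + 0.5) * delta
        position = [o + t * d for o, d in zip(origin, direction)]
        density, rgb = toy_field(positional_encoding(position, num_freqs))
        densities.append(density)
        colors.append(rgb)
        deltas.append(delta)
    return volume_render(densities, colors, deltas)


if __name__ == "__main__":
    pixel, remaining = render_ray([0.0, 0.0, 0.0], [0.0, 0.0, 1.0], 0.1, 2.0, 16)
    print(pixel, remaining)
```

In this sketch every intermediate value for a ray stays inside `render_ray`, which loosely mirrors the abstract's point that a PLCore goes from positions and directions to pixel colors without moving intermediate data off-chip.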
