LoopStack: a Lightweight Tensor Algebra Compiler Stack

Paper Title

LoopStack: a Lightweight Tensor Algebra Compiler Stack

Authors

Bram Wasti, José Pablo Cambronero, Benoit Steiner, Hugh Leather, Aleksandar Zlateski

Abstract

We present LoopStack, a domain-specific compiler stack for tensor operations, composed of a frontend, LoopTool, and an efficient optimizing code generator, LoopNest. This stack enables us to compile entire neural networks and generate code targeting the AVX2, AVX512, NEON, and NEONfp16 instruction sets while incorporating optimizations often missing from other machine learning compiler backends. We evaluate our stack on a collection of full neural networks and commonly used network blocks as well as individual operators, and show that LoopStack generates machine code that matches and frequently exceeds the performance of state-of-the-art machine learning frameworks in both cases. We also show that, for a large collection of schedules, LoopNest's compilation is orders of magnitude faster than LLVM's, while resulting in equal or improved runtime performance. Additionally, LoopStack has a very small memory footprint: a binary size of 245KB and under 30K lines of effective code make it ideal for use on mobile and embedded devices.
