Title
A Learned Performance Model for Tensor Processing Units
Authors
Abstract
Accurate hardware performance models are critical to efficient code generation. They can be used by compilers to make heuristic decisions, by superoptimizers as a minimization objective, or by autotuners to find an optimal configuration for a specific program. However, they are difficult to develop because contemporary processors are complex, and the recent proliferation of deep learning accelerators has increased the development burden. We demonstrate a method of learning performance models from a corpus of tensor computation graph programs for Tensor Processing Unit (TPU) instances. We show that our learned model outperforms a heavily-optimized analytical performance model on two tasks -- tile-size selection and operator fusion -- and that it helps an autotuner discover faster programs in a setting where access to TPUs is limited or expensive.
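The autotuning use case above can be sketched in a few lines: a cheap learned cost model ranks many candidate configurations, and only the top few are actually benchmarked on scarce hardware. This is a minimal, hypothetical illustration, not the paper's implementation; `predicted_cost` here is a toy stand-in for the learned model, which in the paper is a neural network over tensor computation graphs.

```python
import itertools

def predicted_cost(tile):
    # Stand-in for a learned performance model (hypothetical toy proxy):
    # penalize padding waste on a 128x128 workload plus a small term for
    # on-chip buffer pressure proportional to tile area.
    th, tw = tile
    waste = (128 % th) + (128 % tw)
    return waste + 0.01 * (th * tw)

def autotune(candidates, budget, measure):
    """Rank candidates with the cheap model, then spend the limited
    hardware-measurement budget on only the top `budget` predictions."""
    ranked = sorted(candidates, key=predicted_cost)
    shortlist = ranked[:budget]
    return min(shortlist, key=measure)

# Hypothetical usage: enumerate square-ish tile sizes; in a real setting
# `measure` would execute the tiled program on a TPU.
candidates = [(h, w) for h, w in itertools.product([8, 16, 32, 64, 128], repeat=2)]
best = autotune(candidates, budget=3, measure=predicted_cost)
```

The design point this sketches is the one the abstract makes: when hardware access is limited or expensive, the learned model filters the search space so the autotuner benchmarks only a handful of promising candidates.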