Paper Title
Characterizing a Neutron-Induced Fault Model for Deep Neural Networks
Authors
Abstract
Evaluating the reliability of Deep Neural Networks (DNNs) executed on Graphics Processing Units (GPUs) is a challenging problem, since the hardware architecture is highly complex and the software frameworks are composed of many layers of abstraction. While software-level fault injection is a common and fast way to evaluate the reliability of complex applications, it may produce unrealistic results, since it has limited access to the hardware resources and the adopted fault models may be too naive (i.e., single and double bit flips). In contrast, physical fault injection with a neutron beam provides realistic error rates but lacks visibility into fault propagation. This paper proposes a characterization of the DNN fault model that combines neutron beam experiments with fault injection at the software level. We exposed GPUs running General Matrix Multiplication (GEMM) and DNNs to a neutron beam to measure their error rates. On DNNs, we observe that the percentage of critical errors can be up to 61%, and show that ECC is ineffective in reducing critical errors. We then performed a complementary software-level fault injection, using fault models derived from RTL simulations. Our results show that, by injecting complex fault models, the YOLOv3 misdetection rate is validated to be very close to the rate measured with beam experiments, which is 8.66x higher than the one measured with fault injection using only single-bit flips.
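To make the single-bit-flip fault model mentioned in the abstract concrete, the following is a minimal sketch of what such an injection could look like at the software level. It is not the authors' injection tool: the function names (`flip_bit_float32`, `matmul`, `inject_single_bit_flip`) are hypothetical, the plain-Python `matmul` merely stands in for a GPU GEMM kernel, and values are assumed to be stored as IEEE 754 single-precision floats.

```python
import struct


def flip_bit_float32(value: float, bit: int) -> float:
    """Return `value` (stored as float32) with one bit of its IEEE 754 encoding flipped."""
    if not 0 <= bit < 32:
        raise ValueError("bit must be in [0, 31]")
    # Reinterpret the float32 as a 32-bit unsigned integer, flip the bit, and convert back.
    (encoded,) = struct.unpack("<I", struct.pack("<f", value))
    (corrupted,) = struct.unpack("<f", struct.pack("<I", encoded ^ (1 << bit)))
    return corrupted


def matmul(a, b):
    """Plain-Python matrix multiplication (illustrative stand-in for a GEMM kernel)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def inject_single_bit_flip(matrix, row, col, bit):
    """Return a copy of `matrix` with one bit flipped in the element at (row, col)."""
    corrupted = [list(r) for r in matrix]
    corrupted[row][col] = flip_bit_float32(corrupted[row][col], bit)
    return corrupted


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[0.5, 0.0], [0.0, 0.5]]
golden = matmul(a, b)                               # fault-free reference output
faulty = inject_single_bit_flip(golden, 0, 1, 31)   # flip the sign bit of one element

# Count output elements that differ from the fault-free reference.
mismatches = sum(
    g != f for g_row, f_row in zip(golden, faulty) for g, f in zip(g_row, f_row)
)
print(golden, faulty, mismatches)
```

A single flipped bit corrupts exactly one output element here, and its effect depends heavily on the bit position (sign, exponent, or mantissa). This is the sense in which such a model may be too simple compared with the multi-element corruptions that a hardware fault can produce.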