Paper Title

In-memory Implementation of On-chip Trainable and Scalable ANN for AI/ML Applications

Authors

Abhash Kumar, Jawar Singh, Sai Manohar Beeraka, Bharat Gupta

Abstract

Traditional von Neumann architecture based processors become inefficient in terms of energy and throughput because they involve separate processing and memory units, a limitation also known as the~\textit{memory wall}. The memory wall problem is further exacerbated when massive parallelism and frequent data movement between processing and memory units are required for real-time implementation of artificial neural networks (ANNs), which enable many intelligent applications. One of the most promising approaches to addressing the memory wall problem is to carry out computations inside the memory core itself, which enhances the memory bandwidth and energy efficiency available for extensive computations. This paper presents an in-memory computing architecture for ANNs, enabling artificial intelligence (AI) and machine learning (ML) applications. The proposed architecture utilizes a deep in-memory architecture based on a standard six-transistor (6T) static random access memory (SRAM) core for the implementation of a multi-layered perceptron. Our novel on-chip training and inference in-memory architecture reduces energy cost and enhances throughput by simultaneously accessing multiple rows of the SRAM array per precharge cycle and eliminating frequent data accesses. The proposed architecture realizes backpropagation, the keystone of network training, using newly proposed building blocks such as weight update, analog multiplication, error calculation, signed analog-to-digital conversion, and other necessary signal control units. The proposed architecture was trained and tested on the IRIS dataset and is $\approx46\times$ more energy efficient per MAC (multiply-and-accumulate) operation compared to earlier classifiers.
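For readers unfamiliar with the computation being mapped to hardware, the following is a minimal software-only sketch of the multi-layered perceptron training loop — forward MACs, error calculation, backpropagation, and weight update — that the proposed architecture realizes with analog in-memory SRAM blocks. The layer sizes, learning rate, and synthetic data are illustrative assumptions, not values taken from the paper (which trains on the IRIS dataset).

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 4, 8, 3            # e.g. 4 IRIS features, 3 classes (assumed sizes)

# Synthetic stand-in data; the paper uses the IRIS dataset.
X = rng.normal(size=(30, n_in))
T = np.eye(n_out)[rng.integers(0, n_out, size=30)]   # one-hot targets

W1 = rng.normal(scale=0.5, size=(n_in, n_hid))
W2 = rng.normal(scale=0.5, size=(n_hid, n_out))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr, losses = 0.5, []
for epoch in range(200):
    # Forward pass: each matrix product is a batch of MAC operations,
    # which the in-memory architecture evaluates in the analog domain.
    H = sigmoid(X @ W1)
    O = sigmoid(H @ W2)
    losses.append(np.mean((O - T) ** 2))

    # Error calculation and backpropagated deltas (sigmoid' = o(1 - o)).
    dO = (O - T) * O * (1 - O)
    dH = (dO @ W2.T) * H * (1 - H)

    # Weight update step (the "weight update" building block in the paper).
    W2 -= lr * H.T @ dO / len(X)
    W1 -= lr * X.T @ dH / len(X)
```

In the proposed hardware, the matrix-vector products above are performed inside the SRAM array itself, avoiding the data movement that dominates energy cost in a von Neumann implementation of this loop.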
