Paper title
Diagonal Memory Optimisation for Machine Learning on Micro-controllers
Paper authors
Abstract
As machine learning spreads into more and more application areas, micro-controllers and low-power CPUs are increasingly being used to perform inference with machine learning models. The capability to deploy onto these limited hardware targets is enabling machine learning models to be used across a diverse range of new domains. Optimising the inference process on these targets poses different challenges from either desktop CPU or GPU implementations, as the small amount of RAM available on these targets sets limits on the size of models which can be executed. An analysis of the memory use patterns of eleven machine learning models was performed. Memory load and store patterns were observed using a modified version of the Valgrind debugging tool, identifying the memory areas holding values necessary for the calculation as inference progressed. These analyses identified opportunities to optimise the memory use of these models by overlapping the input and output buffers of individual tensor operations. Three methods are presented which can calculate the safe overlap of input and output buffers for tensor operations, ranging from a computationally expensive approach with the ability to operate on compiled layer operations, to a versatile analytical solution which requires access to the original source code of the layer. The diagonal memory optimisation technique is described and shown to achieve memory savings of up to 34.5% when applied to eleven common models. Micro-controller targets are identified on which some models can only be deployed if diagonal memory optimisation is used.