Paper Title

Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks

Paper Authors

Tilman Räuker, Anson Ho, Stephen Casper, Dylan Hadfield-Menell

Paper Abstract

The last decade of machine learning has seen drastic increases in scale and capabilities. Deep neural networks (DNNs) are increasingly being deployed in the real world. However, they are difficult to analyze, raising concerns about using them without a rigorous understanding of how they function. Effective tools for interpreting them will be important for building more trustworthy AI by helping to identify problems, fix bugs, and improve basic understanding. In particular, "inner" interpretability techniques, which focus on explaining the internal components of DNNs, are well-suited for developing a mechanistic understanding, guiding manual modifications, and reverse engineering solutions. Much recent work has focused on DNN interpretability, and rapid progress has thus far made a thorough systematization of methods difficult. In this survey, we review over 300 works with a focus on inner interpretability tools. We introduce a taxonomy that classifies methods by what part of the network they help to explain (weights, neurons, subnetworks, or latent representations) and whether they are implemented during (intrinsic) or after (post hoc) training. To our knowledge, we are also the first to survey a number of connections between interpretability research and work in adversarial robustness, continual learning, modularity, network compression, and studying the human visual system. We discuss key challenges and argue that the status quo in interpretability research is largely unproductive. Finally, we highlight the importance of future work that emphasizes diagnostics, debugging, adversaries, and benchmarking in order to make interpretability tools more useful to engineers in practical applications.
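To make the two-axis taxonomy described in the abstract concrete, below is a minimal Python sketch of how a survey entry might be represented: one axis for which part of the network a method helps to explain (weights, neurons, subnetworks, or latent representations) and one for whether it is intrinsic or post hoc. The class and example entry are our own illustration, not code or a classification from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class NetworkPart(Enum):
    """Which part of the DNN a method helps to explain (first taxonomy axis)."""
    WEIGHTS = "weights"
    NEURONS = "neurons"
    SUBNETWORKS = "subnetworks"
    LATENT_REPRESENTATIONS = "latent representations"


class Stage(Enum):
    """Whether the method is applied during or after training (second taxonomy axis)."""
    INTRINSIC = "intrinsic"   # built into the model during training
    POST_HOC = "post hoc"     # applied to an already-trained model


@dataclass
class InterpretabilityMethod:
    """One entry in a two-axis taxonomy like the survey's (illustrative only)."""
    name: str
    part: NetworkPart
    stage: Stage


# Hypothetical example entry, for illustration only.
probing = InterpretabilityMethod(
    name="linear probing of hidden activations",
    part=NetworkPart.LATENT_REPRESENTATIONS,
    stage=Stage.POST_HOC,
)
print(probing)
```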
