Paper Title

Achieving and Understanding Out-of-Distribution Generalization in Systematic Reasoning in Small-Scale Transformers

Paper Authors

Andrew J. Nam, Mustafa Abdool, Trevor Maxfield, James L. McClelland

Paper Abstract

Out-of-distribution generalization (OODG) is a longstanding challenge for neural networks. This challenge is quite apparent in tasks with well-defined variables and rules, where explicit use of the rules could solve problems independently of the particular values of the variables, but networks tend to be tied to the range of values sampled in their training data. Large transformer-based language models have pushed the boundaries on how well neural networks can solve previously unseen problems, but their complexity and the lack of clarity about the relevant content in their training data obfuscate how they achieve such robustness. As a step toward understanding how transformer-based systems generalize, we explore the question of OODG in small-scale transformers trained with examples from a known distribution. Using a reasoning task based on the puzzle Sudoku, we show that OODG can occur on a complex problem if the training set includes examples sampled from the whole distribution of simpler component tasks. Successful generalization depends on carefully managing positional alignment when absolute position encoding is used, but we find that suppressing sensitivity to absolute positions overcomes this limitation. Taken together, our results represent a small step toward understanding and promoting systematic generalization in transformers.
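The abstract's final claim — that suppressing sensitivity to absolute positions helps — rests on a basic property of attention: without absolute position encodings, self-attention is permutation-equivariant, so a model cannot latch onto where in the sequence a value appeared during training. The minimal sketch below (an illustration of this property, not the paper's actual model or training setup; all weights and dimensions are arbitrary) contrasts a single attention layer with and without sinusoidal absolute position encodings:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (no position info)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores, axis=-1) @ v

def sinusoidal_pe(n, d):
    """Standard sinusoidal absolute position encodings (n positions, dim d)."""
    pos = np.arange(n)[:, None]
    i = np.arange(d)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

rng = np.random.default_rng(0)
n, d = 5, 8  # toy sequence length and model dimension (arbitrary choices)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
x = rng.standard_normal((n, d))   # toy token embeddings
perm = rng.permutation(n)

# Without absolute positions: permuting the input tokens merely permutes
# the outputs — the layer carries no notion of "position 3".
out = self_attention(x, Wq, Wk, Wv)
out_perm = self_attention(x[perm], Wq, Wk, Wv)
assert np.allclose(out[perm], out_perm)

# With absolute position encodings added to the input, the same permutation
# changes the outputs themselves: the layer is now position-sensitive, which
# is what ties learned behavior to the positions seen in training.
pe = sinusoidal_pe(n, d)
out_pe = self_attention(x + pe, Wq, Wk, Wv)
out_pe_perm = self_attention(x[perm] + pe, Wq, Wk, Wv)
assert not np.allclose(out_pe[perm], out_pe_perm)
```

The permutation test makes the trade-off concrete: absolute position encodings give the model positional information, but also a shortcut to overfit to the specific positional alignments present in the training distribution.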
