Paper Title

Large Language Models are Zero-Shot Reasoners

Paper Authors

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, Yusuke Iwasawa

Paper Abstract

Pretrained large language models (LLMs) are widely used in many sub-fields of natural language processing (NLP) and generally known as excellent few-shot learners with task-specific exemplars. Notably, chain of thought (CoT) prompting, a recent technique for eliciting complex multi-step reasoning through step-by-step answer examples, achieved the state-of-the-art performances in arithmetics and symbolic reasoning, difficult system-2 tasks that do not follow the standard scaling laws for LLMs. While these successes are often attributed to LLMs' ability for few-shot learning, we show that LLMs are decent zero-shot reasoners by simply adding "Let's think step by step" before each answer. Experimental results demonstrate that our Zero-shot-CoT, using the same single prompt template, significantly outperforms zero-shot LLM performances on diverse benchmark reasoning tasks including arithmetics (MultiArith, GSM8K, AQUA-RAT, SVAMP), symbolic reasoning (Last Letter, Coin Flip), and other logical reasoning tasks (Date Understanding, Tracking Shuffled Objects), without any hand-crafted few-shot examples, e.g. increasing the accuracy on MultiArith from 17.7% to 78.7% and GSM8K from 10.4% to 40.7% with large InstructGPT model (text-davinci-002), as well as similar magnitudes of improvements with another off-the-shelf large model, 540B parameter PaLM. The versatility of this single prompt across very diverse reasoning tasks hints at untapped and understudied fundamental zero-shot capabilities of LLMs, suggesting high-level, multi-task broad cognitive capabilities may be extracted by simple prompting. We hope our work not only serves as the minimal strongest zero-shot baseline for the challenging reasoning benchmarks, but also highlights the importance of carefully exploring and analyzing the enormous zero-shot knowledge hidden inside LLMs before crafting finetuning datasets or few-shot exemplars.
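
The abstract describes Zero-shot-CoT as a single trigger phrase that elicits a step-by-step rationale, from which a final answer is then extracted with a second prompt. Below is a minimal sketch of that two-stage pipeline in Python. The `complete` function is a hypothetical placeholder for whatever LLM completion API is used (it is not from the paper), and the answer-extraction trigger shown is the arithmetic variant; other task types would use a different extraction phrase.

```python
def complete(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its text completion.
    Plug in your own completion API here (assumption, not the paper's code)."""
    raise NotImplementedError

def zero_shot_cot(question: str) -> str:
    # Stage 1: reasoning extraction -- append the trigger phrase
    # "Let's think step by step." so the model generates a rationale.
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    rationale = complete(reasoning_prompt)

    # Stage 2: answer extraction -- feed the rationale back with an
    # answer-format trigger so the final answer can be parsed out.
    answer_prompt = (
        f"{reasoning_prompt} {rationale}\n"
        "Therefore, the answer (arabic numerals) is"
    )
    return complete(answer_prompt).strip()

# Example usage on a MultiArith-style arithmetic question:
# zero_shot_cot("A juggler has 16 balls. Half are golf balls, and half "
#               "of the golf balls are blue. How many blue golf balls "
#               "are there?")
```

The second stage exists because the free-form rationale does not reliably end in a machine-parseable answer; conditioning on the rationale plus an explicit format cue makes the final answer easy to extract.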
