Paper Summary: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.


WHAT

Chain-of-Thought (CoT) prompting is a technique in which the model's output includes a series of intermediate reasoning steps before the final answer.

[Figure: Chain-of-Thought prompting example, from the paper]

WHY

Because even very large language models struggle with tasks that require multiple steps of reasoning and/or symbolic manipulation, and simply scaling up the parameter count doesn't seem to help much.

HOW

Using few-shot prompting: one adds exemplars in which the answer is preceded by the reasoning steps that lead to it, and the model imitates that format when answering the new question.
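A minimal sketch of how such a few-shot CoT prompt can be assembled. The exemplar below is the "tennis balls" example from the paper's Figure 1; the helper function and the `Q:`/`A:` formatting are my own illustration, not the paper's exact code.

```python
# Sketch: building a few-shot Chain-of-Thought prompt.
# The paper uses 8 hand-written exemplars; one is shown here for brevity.

COT_EXEMPLARS = [
    {
        "question": "Roger has 5 tennis balls. He buys 2 more cans of tennis "
                    "balls. Each can has 3 tennis balls. How many tennis "
                    "balls does he have now?",
        "reasoning": "Roger started with 5 balls. 2 cans of 3 tennis balls "
                     "each is 6 tennis balls. 5 + 6 = 11.",
        "answer": "11",
    },
]

def build_cot_prompt(exemplars, new_question):
    """Concatenate (question, reasoning, answer) exemplars, then the new question."""
    parts = []
    for ex in exemplars:
        parts.append(f"Q: {ex['question']}\n"
                     f"A: {ex['reasoning']} The answer is {ex['answer']}.")
    parts.append(f"Q: {new_question}\nA:")  # the model completes from here
    return "\n\n".join(parts)

prompt = build_cot_prompt(
    COT_EXEMPLARS,
    "The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?",
)
print(prompt)
```

The resulting string would be sent as-is to a pre-trained model; because every exemplar shows reasoning before the answer, the completion tends to follow the same pattern.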

CLAIMS/QUOTES

  • Types of tasks amenable to CoT: "... chain-of-thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks"

  • Interpretability: CoT may add some level of interpretability into how the model arrives at an answer. The authors note this needs more study.

  • CoT can be added to any model without retraining or fine-tuning: it can be elicited from any pre-trained LLM simply by including CoT exemplars in a few-shot prompt.

  • Emergent behavior: CoT only works in large models: "... chain-of-thought prompting does not positively impact performance for small models, and only yields performance gains when used with models of ∼100B parameters."

    • In smaller models, the chains produced were "fluent but illogical", actually making results worse than with normal prompting.
  • Performance vs Special-purpose models: Vanilla LLMs with CoT in-context learning outperform LLMs that have been fine-tuned to excel in specific domains (e.g. GPT-3 fine-tuned on math).

  • The more complex the task, the more CoT helps: performance gains are larger for tasks that require multi-step reasoning, such as logic and commonsense tasks.

NOTES

  • The CoT need not be shown to the user; it can simply be hidden, with only the final answer surfaced.

  • The experiments used just 8 examples of CoT in the context, for few-shot prompting.

  • A related zero-shot approach is to simply append "let's think step-by-step" to the prompt, without any few-shot exemplars. This was proposed by Kojima et al. (2022) in "Large Language Models are Zero-Shot Reasoners", and is a separate technique not covered in this paper.
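For contrast with the few-shot setup above, the zero-shot variant can be sketched as follows (the question and function name are my own illustration; only the trigger phrase comes from Kojima et al.):

```python
# Sketch of the zero-shot CoT variant (Kojima et al., 2022):
# no exemplars, just a trigger phrase appended after the question.

def zero_shot_cot_prompt(question: str) -> str:
    """Build a zero-shot CoT prompt by appending the trigger phrase."""
    return f"Q: {question}\nA: Let's think step by step."

print(zero_shot_cot_prompt(
    "If a train travels 60 km in 1.5 hours, what is its average speed?"
))
```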

MY 2¢

  • It's important to realize that CoT is an inference-type technique. It does not change the training-time setup of a model at all!

    • "No language models were finetuned in the process of writing this paper."
  • This wasn't discussed in the paper, but there are clear latency tradeoffs when adding CoT: the model must generate all the reasoning tokens before it reaches the final answer.
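A back-of-the-envelope model of that latency cost. All numbers here are hypothetical placeholders, not measurements; the only assumption carried over is that autoregressive decoding time grows roughly linearly with output length.

```python
# Hypothetical latency model: per-token decode cost is assumed constant,
# so a reasoning chain multiplies time-to-final-answer.

def decode_latency_s(n_output_tokens: int, s_per_token: float = 0.05) -> float:
    """Approximate decode time assuming a fixed (made-up) per-token cost."""
    return n_output_tokens * s_per_token

direct = decode_latency_s(5)     # answer only, e.g. "The answer is 11."
with_cot = decode_latency_s(60)  # reasoning chain + final answer
print(f"direct: {direct:.2f}s, with CoT: {with_cot:.2f}s")
```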

