
Language models can reason better if they write down intermediate steps. A new study shows how such "System 2 Reasoning" can, at least partially, be trained into language models.


In recent years, AI methods like Chain-of-Thought Prompting or Branch-Solve-Merge have demonstrated that large language models achieve better results when they are made to generate their answers in multiple steps.

This multi-step process can be seen as an expression of Daniel Kahneman's "System 2" thinking, in which information is processed slowly and deliberately. Its counterpart, "System 1", is a fast, unconscious, and automatic mode of thinking.

Researchers from Meta AI have now developed a method to "distill" the computationally intensive "System 2 Reasoning" of AI models into the parameters of a language model. The results show that the resulting "System 1" model in some cases matches the performance of the original multi-step process, at significantly lower computational cost.


The process works as follows: First, a "System 2" method is applied to a large amount of example data. Then the answers are filtered, for example by keeping only consistent results. Finally, this filtered data is used to fine-tune the language model. In essence, the team uses System 2 prompts to generate synthetic training data, then fine-tunes the LLM to skip the intermediate steps and answer directly, as the sketch below illustrates.
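To make the pipeline concrete, here is a minimal sketch in Python. It is an illustration, not the paper's code: `model.generate` is a placeholder for any LLM API, `extract_final_answer` is a hypothetical parsing helper, and the filtering rule shown (self-consistency via majority voting over sampled chain-of-thought outputs) is one example of the consistency filtering the article describes.

```python
from collections import Counter

def extract_final_answer(output: str) -> str:
    # Hypothetical helper: in practice the final answer is parsed from the
    # model output, e.g. the text after an "Answer:" marker or the last line.
    return output.strip().splitlines()[-1]

def system2_answer(model, question: str, n_samples: int = 8):
    """Run a System 2 method (here: sampled chain of thought) and keep the
    result only if the samples agree by strict majority (self-consistency)."""
    answers = [
        extract_final_answer(model.generate(f"{question}\nLet's think step by step."))
        for _ in range(n_samples)
    ]
    answer, count = Counter(answers).most_common(1)[0]
    # Filtering step: discard examples without a clear majority answer.
    return answer if count > n_samples // 2 else None

def build_distillation_dataset(model, questions):
    """Collect (question, direct answer) pairs with the intermediate reasoning
    stripped, so fine-tuning teaches the model to answer in a single step."""
    return [
        {"prompt": q, "completion": a}
        for q in questions
        if (a := system2_answer(model, q)) is not None
    ]
```

The resulting dataset is then used for standard supervised fine-tuning; because the completions contain only the final answers, the model learns to produce them without generating the reasoning steps first.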

Chain-of-thought remains out of reach

The researchers applied the method to four different "System 2" approaches and five task types. They found that distillation works in many, but not all cases.

For methods such as System 2 Attention, which removes biased or irrelevant information from the input, and Rephrase and Respond, which improves answers by first restating the question, the resulting "System 1" models delivered results comparable to their "System 2" counterparts while generating significantly fewer tokens.
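The token savings are easy to see in the Rephrase and Respond case: the System 2 variant needs two generation calls per query, while the distilled model answers in one. The sketch below is schematic; the prompt wording and the `model.generate` interface are placeholder assumptions, not the paper's exact setup.

```python
def rephrase_and_respond(model, question: str) -> str:
    """System 2 (before distillation): rephrase the question, then answer it.
    Two calls, and the rephrased text adds to the generated token count."""
    rephrased = model.generate(
        "Rephrase and expand the following question to make it clearer:\n" + question
    )
    return model.generate(f"{rephrased}\n\nNow answer the rephrased question.")

def distilled_answer(model, question: str) -> str:
    """System 1 (after distillation): a single direct call. Fine-tuning on
    filtered System 2 outputs aims to match the quality at far fewer tokens."""
    return model.generate(question)
```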

However, the distillation failed for complex mathematical reasoning with Chain-of-Thought Prompting. The researchers suspect that some tasks are simply too complex for "System 1" thinking, especially since the models repeatedly fail at such reasoning tasks even with CoT.

Nevertheless, the researchers see their method as a promising approach for developing powerful AI systems that can then focus on the genuinely challenging problems using methods like CoT.

Summary
  • Meta AI researchers have developed a method to "distill" the computationally intensive "System 2 Reasoning" of AI models into the parameters of a language model. In some cases, the resulting "System 1" model achieves similarly good results with significantly less computational effort.
  • To do this, a "System 2" method is first applied to sample data, the responses are filtered, and the language model is then fine-tuned on this synthetic training data.
  • Distillation works with methods such as System 2 Attention and Rephrase and Respond, but fails for complex chain-of-thought prompting on mathematical reasoning. Nevertheless, the researchers see it as a promising approach for developing powerful AI systems that can focus on the truly challenging problems.