TL;DR: Chain of Thought (CoT) reasoning models are reshaping AI by breaking complex problems into manageable steps and blending intuitive with logical reasoning. DeepSeek's distinctive use of Outcome Reward training sets it apart from traditionally trained models by prioritizing creative divergence and speculation about user intent.
Introduction
The world of AI reasoning models is rapidly evolving, and with the release of GPT O3-mini, we are witnessing a proliferation of these technologies. Having explored models such as O1, DeepSeek (DS), and Gemini 2 Flash, I've observed unique features in DeepSeek's Chain of Thought (CoT) that set it apart. Unlike GPT, which excels at executing commands with precision, DeepSeek breaks problems into detailed steps and anticipates user intent, often using speculative language like "maybe."
Key Differences in CoT Models
- GPT O1: Executes complex, high-level prompts with precision but lacks creativity.
- DeepSeek (DS): Exhibits superior divergent thinking, creativity, and expressiveness with simple prompts, though it may struggle with complex instructions.
Understanding the Differences
These differences are rooted in the distinct training methodologies: DeepSeek employs Outcome Reward training, whereas most other Large Language Models (LLMs) use Process Reward training. Let's explore these reward systems further.
Chapter 1: Chain of Thought Training—Building the Framework
AI's capability for deep thinking originates from "patient problem decomposition" combined with "intuitive answer targeting." By encouraging AI to break down problems as humans do, intuitive guesses transform into logical reasoning.
Think of CoT training like starting a puzzle by identifying edge pieces. It provides AI with a "reasoning map," guiding it to identify the problem, break down steps, and connect the logic rather than jumping to conclusions.
Example: Reducing Urban Traffic Congestion
- Without CoT: Simply suggests building more subways.
- With CoT:
  - Analyzes primary causes, e.g., too many private cars.
  - Offers demand-side solutions, such as improving public transportation.
  - Proposes supply-side solutions, such as optimizing traffic lights.
  - Suggests long-term planning, such as work-residence balance policies.
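As a rough illustration of the "reasoning map" idea, here is a minimal Python sketch of a CoT-style prompt scaffold. The template wording and the function name are assumptions made for illustration, not any vendor's actual prompt or training format.

```python
# A minimal sketch of a CoT-style prompt scaffold. The template text and the
# function name are illustrative assumptions, not any model's real format.

COT_TEMPLATE = (
    "Question: {question}\n"
    "Think step by step:\n"
    "1. Identify the core problem.\n"
    "2. Break it into sub-problems.\n"
    "3. Reason through each sub-problem.\n"
    "4. Combine the partial results into a final answer.\n"
    "Answer:"
)

def build_cot_prompt(question: str) -> str:
    """Wrap a raw question in a reasoning scaffold so the model is nudged to
    decompose the problem instead of jumping straight to a conclusion."""
    return COT_TEMPLATE.format(question=question)

print(build_cot_prompt("How can a city reduce traffic congestion?"))
```

The scaffold plays the same role as the edge pieces of the puzzle: the model is asked to lay out the frame of the problem before filling in the middle.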
Chapter 2: Process Reward—Giving Small Rewards for Each Correct Step
Process reward immerses AI in human thinking processes, rewarding each step that moves toward a reasonable outcome. The approach is akin to GPS navigation, which recalculates the route the moment you take a wrong turn rather than waiting until you arrive to announce that the route was wrong.
- Core techniques include:
  - Step Scoring: Independently evaluates each reasoning step.
  - Logical Coherence: Ensures the logic chain remains unbroken.
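Below is a minimal sketch of how step-level scoring might be wired up. The `score_step` judge is a hypothetical stand-in; real process reward models are learned verifiers rather than hand-written rules.

```python
# A minimal sketch of process (step-level) reward. The `score_step` judge is
# a hypothetical placeholder for a learned verifier.

from typing import Callable, List

def process_reward(steps: List[str],
                   score_step: Callable[[str, List[str]], float]) -> float:
    """Score every reasoning step in the context of the steps before it and
    average the results. A step that breaks the logic chain should receive a
    low score, dragging down the reward for the whole trajectory."""
    if not steps:
        return 0.0
    return sum(score_step(step, steps[:i]) for i, step in enumerate(steps)) / len(steps)

# Toy judge that only rewards non-empty steps, just to make the sketch runnable.
toy_judge = lambda step, prev: 1.0 if step.strip() else 0.0
print(process_reward(["Identify the main causes", "Propose fixes", ""], toy_judge))
```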
Chapter 3: Outcome Reward—Focusing on Final Success
Outcome reward trains AI by giving the model only the question and the desired result, leaving it free to decide the intermediate process as long as it reaches the correct answer in a way humans can understand.
- Human-centered design:
  - Prefers analogies over formulas.
  - Adapts scenarios to the audience, using different methods for engineers than for young students.
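For contrast, here is a minimal sketch of an outcome-only reward, assuming the final answer can be checked mechanically. It illustrates the idea; it is not DeepSeek's actual reward implementation.

```python
# A minimal sketch of outcome (result-only) reward: the intermediate reasoning
# is never inspected, only the final answer. Illustrative, not DeepSeek's code.

def outcome_reward(model_answer: str, reference_answer: str) -> float:
    """Return 1.0 if the final answer matches the reference, else 0.0."""
    return 1.0 if model_answer.strip().lower() == reference_answer.strip().lower() else 0.0

print(outcome_reward("42", "42"))  # 1.0
print(outcome_reward("41", "42"))  # 0.0
```

Because nothing in between is graded, the model is free to reach the answer through whatever reasoning path it finds, which is where the divergent, speculative style described above comes from.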
Chapter 4: Reward Fusion—Balancing Process and Outcome
Ideal AI thinking requires balancing "rational decomposition" with "emotional expression." Process reward acts as the conductor, guiding each step, while outcome reward is the audience's applause, influencing the emotional tone.
Example: Explaining Why Leaves Fall to a Child
- Pure Process AI: Offers detailed, technical explanations.
- Pure Outcome AI: Provides simple, imaginative answers.
- Balanced AI: Delivers scientific explanations alongside engaging narratives.
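One simple way to picture the fusion is as a weighted blend of the two signals. The sketch below uses a linear interpolation whose weight is an assumption for illustration, not a published training recipe.

```python
# A minimal sketch of fusing process and outcome rewards with a tunable weight.
# The default weight is an illustrative assumption, not a published recipe.

def fused_reward(process_score: float, outcome_score: float,
                 outcome_weight: float = 0.5) -> float:
    """Linearly interpolate between step-level and result-level reward.

    outcome_weight = 0.0 -> pure process reward (every step is graded)
    outcome_weight = 1.0 -> pure outcome reward (only the answer matters)
    """
    return (1.0 - outcome_weight) * process_score + outcome_weight * outcome_score

# Clean steps with a clumsy final answer still earn partial credit, and vice versa.
print(fused_reward(process_score=0.9, outcome_score=0.4, outcome_weight=0.3))
```

Shifting the weight toward 0 yields the meticulous, step-graded behavior; shifting it toward 1 yields the answer-focused, more freewheeling behavior.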
Process reward ensures credibility, while outcome reward adds empathy, creating a balanced AI that turns cold code into warm, relatable interactions.
As AI learns to dynamically balance these approaches, it transforms into a more human-like assistant, capable of both rigorous analysis and empathetic communication.