Chain of Thought: DeepSeek's Unique Approach to Reasoning Models

This article discusses Chain of Thought (CoT) reasoning, focusing on DeepSeek's approach compared with other reasoning models such as GPT.


Introduction

The release of GPT O3-mini, the fourth reasoning model on the market, signals the growing prevalence of this technology. Having experimented with several reasoning models (O1, DeepSeek, Gemini 2 Flash), I've observed that DeepSeek's CoT differs noticeably from the others: it breaks problems down into more detailed steps and frequently speculates ("maybe...") to anticipate the user's intent, whereas GPT's CoT concentrates on executing the user's commands.

Key Differences

* GPT O1: Excels at executing complex, high-level prompts but exhibits less creativity.

* DeepSeek (DS): Demonstrates stronger divergent thinking, creativity, and expressiveness with simple prompts. However, it can go off track with lengthy instructions, suggesting its effectiveness declines as instruction complexity increases.

Underlying Reasons

These differences stem from DeepSeek's use of Outcome Reward training, unlike other Large Language Models (LLMs) that utilize Process Reward training. The following chapters will delve into these two reward mechanisms.


Chapter 1: Chain of Thought Training: Building the Framework Before Adding Details


AI's deep-thinking ability arises from jointly training "patient problem decomposition" and "intuitive answer targeting." Forcing the AI to dissect problems the way humans do turns "intuitive leaps" into "logical ladders."


Similar to starting a puzzle by finding the edge pieces, CoT training provides AI with a "reasoning map." It guides the AI to follow the path of "identifying the problem → breaking down steps → connecting logic," instead of directly guessing the complete picture.


Example:

Question: How to reduce urban traffic congestion?

* Without CoT: Build more subways. (Correct result, but lacks a reusable thinking framework)

* With CoT:

   * Analyze the main cause: Too many private cars.

   * Demand-side solutions: Encourage public transportation/ride-sharing.

   * Supply-side solutions: Optimize traffic light algorithms.

   * Long-term planning: Work-residence balance policies.

     (Traceable process, adjustable strategies)
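To make the contrast concrete, here is a minimal sketch (Python, purely illustrative and not taken from DeepSeek's or GPT's actual prompting) of the difference between asking for an answer directly and guiding the model along the "identify the problem → break down steps → connect logic" path. Both prompt templates and function names are hypothetical.

```python
# Illustrative only: contrasts a direct prompt with a CoT-style prompt.
# Neither string is taken from any real model's system prompt.

def build_direct_prompt(question: str) -> str:
    """'Guess the complete picture': ask for the answer alone."""
    return f"Question: {question}\nGive the answer in one sentence."


def build_cot_prompt(question: str) -> str:
    """Guide the model along 'identify the problem -> break down steps -> connect logic'."""
    return (
        f"Question: {question}\n"
        "First restate the core problem.\n"
        "Then list the main causes.\n"
        "Then propose demand-side, supply-side, and long-term measures.\n"
        "Only after that, give your final recommendation."
    )


if __name__ == "__main__":
    q = "How to reduce urban traffic congestion?"
    print(build_direct_prompt(q))
    print("---")
    print(build_cot_prompt(q))
```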


Chapter 2: Process Reward: Small Rewards for Each Correct Step


Process reward teaches the AI many human thinking processes, so it learns how humans reason and carries out tasks in sensible steps. It scores not only the correctness of the answer but also whether the AI's CoT demonstrates reasonable deduction.

Like GPS navigation, process reward says "recalculating route" at each wrong turn instead of only declaring "wrong route" once you reach the destination.

Core Techniques:

* Step Scoring: Independently evaluates each step in the reasoning process (e.g., whether the intermediate formula in a math problem is reasonable).

* Logical Coherence Detection: Ensures the "because A, therefore B" chain is unbroken (e.g., avoids jumps like "cold weather → therefore eat more watermelon").

* Analogy: A teacher gives points to students who raise their hands to speak in each class.
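As a rough illustration of step scoring, the sketch below assigns each reasoning step its own small score and sums them. The `score_step` heuristic is a toy placeholder; in a real setup that role is played by a learned process reward model, and nothing here reflects any vendor's actual implementation.

```python
# Toy sketch of process reward: every intermediate step earns its own score,
# so credit accumulates along the chain rather than only at the final answer.

from typing import List


def score_step(previous_steps: List[str], step: str) -> float:
    """Toy stand-in for a learned step verifier: reward steps that state a
    reason, penalize bare 'therefore' jumps with no supporting 'because'.
    `previous_steps` is accepted so a real scorer could condition on context
    (unused in this toy version)."""
    has_reason = "because" in step.lower()
    is_jump = "therefore" in step.lower() and not has_reason
    return 0.3 if is_jump else 1.0


def process_reward(chain_of_thought: List[str]) -> float:
    """Small rewards for each correct step, summed over the whole chain."""
    return sum(
        score_step(chain_of_thought[:i], step)
        for i, step in enumerate(chain_of_thought)
    )


steps = [
    "The main cause of congestion is that there are too many private cars.",
    "Therefore we should encourage public transportation, because buses move more people per lane.",
    "Therefore everyone should eat more watermelon.",  # an illogical jump, scored low
]
print(process_reward(steps))  # 1.0 + 1.0 + 0.3 = 2.3
```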


Chapter 3: Outcome Reward: Focusing Solely on Final Success or Failure


Outcome reward gives the model only a question and the desired result, training the AI to work out the intermediate thinking process on its own until it reaches that result.

The goal is to make the AI understand that the correct answer must still be expressed in a "human-understandable" way.
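As a minimal counterpart to the process-reward sketch above, here is an equally simplified illustration of outcome reward: only the final answer is checked against the target, and the intermediate steps are deliberately ignored. The exact-match check is an assumption for illustration, not how any real verifier works.

```python
# Toy sketch of outcome reward: the score depends only on whether the final
# answer matches the target; the chain of thought is not inspected at all.

from typing import List


def outcome_reward(chain_of_thought: List[str], final_answer: str, target: str) -> float:
    """1.0 if the final answer matches the target, else 0.0."""
    del chain_of_thought  # intentionally unused: the model chooses its own path
    return 1.0 if final_answer.strip().lower() == target.strip().lower() else 0.0


print(outcome_reward(
    ["some long exploratory reasoning the model invented on its own..."],
    "Build more subways",
    "build more subways",
))  # 1.0
```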


Humanized Design:

* Learning Preference: Humans prefer "using analogies to explain quantum mechanics" instead of piling up formulas.

* Scenario Adaptation: Provide engineers with code + principles, while using stories + illustrations for elementary school students.

* Analogy: The exam only counts the final grade, regardless of daily homework.


Chapter 4: Reward Fusion: Process and Outcome are Equally Important


Ideal AI thinking lies in the coexistence of "rational decomposition" and "emotional expression." Like a symphony orchestra, process reward is the conductor making sure each musician plays according to the score, while outcome reward is the audience's applause that decides whether to adjust the passion of the melody.

Example:

* Question: How to explain "why leaves fall" to a child?

* Pure Process AI: Explains the abscission layer cells, abscisic acid hormone... step by step (rigorous but boring).

* Pure Outcome AI: "The big tree is going to sleep in winter!" (lively but lacks knowledge).

* Balanced AI:

   * Scientific level: Reduced light in autumn → leaves stop making nutrients → abscission cells separate (process reward supervision).

   * Expression level: The big tree is like changing into its sleeping clothes, taking off the old leaves and waiting to put on new clothes in spring! (outcome reward optimization).
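One simple way to picture the fusion is a weighted blend of the two signals, as in the self-contained sketch below. The toy scorers stand in for real reward models, and the 0.6 weight is arbitrary rather than a value reported for DeepSeek or any other system.

```python
# Toy sketch of reward fusion: blend a step-level (process) score with an
# answer-level (outcome) score. Both scorers and the weight are illustrative.

from typing import List


def toy_process_score(steps: List[str]) -> float:
    """Fraction of steps that state a reason -- stands in for a process reward model."""
    if not steps:
        return 0.0
    return sum(1 for s in steps if "because" in s.lower()) / len(steps)


def toy_outcome_score(final_answer: str, target: str) -> float:
    """Exact match on the final answer -- stands in for an outcome reward check."""
    return 1.0 if final_answer.strip().lower() == target.strip().lower() else 0.0


def fused_reward(steps: List[str], final_answer: str, target: str, w: float = 0.6) -> float:
    """Weighted blend: w on the process signal, (1 - w) on the outcome signal."""
    return w * toy_process_score(steps) + (1.0 - w) * toy_outcome_score(final_answer, target)


steps = [
    "Leaves stop making nutrients because autumn light is reduced.",
    "The abscission cells then separate the leaf from the branch.",
]
print(fused_reward(steps, "The tree rests for winter", "the tree rests for winter"))  # 0.6*0.5 + 0.4*1.0 = 0.7
```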

Process reward and outcome reward are like the two strands of DNA:

* The process ensures the thinking is credible (no fabrication).

* The outcome gives the expression empathy (no technically correct but useless statements).


When AI learns to dynamically balance between the two, the cold code becomes warm.
