Chain of Thought: DeepSeek's Unique Approach to Reasoning Models

This article discusses Chain of Thought (CoT) reasoning, focusing on DeepSeek's approach compared with other reasoning models such as GPT.


Introduction

The release of GPT O3-mini, the fourth reasoning model on the market, signals the growing prevalence of this technology. Having experimented with several reasoning models (O1, DeepSeek, Gemini 2 Flash), I've observed that DeepSeek's CoT differs noticeably from the others: it breaks problems down into more detailed steps and frequently speculates ("maybe...") to anticipate the user's intent, whereas GPT's CoT concentrates on executing the user's commands.

Key Differences

* GPT O1: Excels at executing complex, high-level prompts but exhibits less creativity.

* DeepSeek (DS): Demonstrates stronger divergent thinking, creativity, and expressiveness with simple prompts. However, it can go off track with lengthy instructions, suggesting its effectiveness declines as instruction complexity increases.

Underlying Reasons

These differences stem from DeepSeek's use of Outcome Reward training, unlike other Large Language Models (LLMs) that utilize Process Reward training. The following chapters will delve into these two reward mechanisms.


Chapter 1: Chain of Thought Training: Building the Framework Before Adding Details


AI's deep-thinking ability arises from jointly training "patient problem decomposition" and "intuitive answer targeting." Forcing the AI to dissect problems the way humans do turns "intuitive leaps" into "logical ladders."


Similar to starting a puzzle by finding the edge pieces, CoT training provides AI with a "reasoning map." It guides the AI to follow the path of "identifying the problem → breaking down steps → connecting logic," instead of directly guessing the complete picture.


Example:

Question: How to reduce urban traffic congestion?

* Without CoT: Build more subways. (Correct result, but lacks a reusable thinking framework)

* With CoT:

   * Analyze the main cause: Too many private cars.

   * Demand-side solutions: Encourage public transportation/ride-sharing.

   * Supply-side solutions: Optimize traffic light algorithms.

   * Long-term planning: Work-residence balance policies.

     (Traceable process, adjustable strategies)
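To make the contrast concrete, here is a minimal sketch (Python, purely illustrative and not taken from DeepSeek's or GPT's actual prompting) of the difference between asking for an answer directly and guiding the model along the "identify the problem → break down steps → connect logic" path. Both prompt templates and function names are hypothetical.

```python
# Illustrative only: contrasts a direct prompt with a CoT-style prompt.
# Neither string is taken from any real model's system prompt.

def build_direct_prompt(question: str) -> str:
    """'Guess the complete picture': ask for the answer alone."""
    return f"Question: {question}\nGive the answer in one sentence."


def build_cot_prompt(question: str) -> str:
    """Guide the model along 'identify the problem -> break down steps -> connect logic'."""
    return (
        f"Question: {question}\n"
        "First restate the core problem.\n"
        "Then list the main causes.\n"
        "Then propose demand-side, supply-side, and long-term measures.\n"
        "Only after that, give your final recommendation."
    )


if __name__ == "__main__":
    q = "How to reduce urban traffic congestion?"
    print(build_direct_prompt(q))
    print("---")
    print(build_cot_prompt(q))
```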


Chapter 2: Process Reward: Small Rewards for Each Correct Step


Process reward teaches the AI many human thinking processes, so it learns how humans reason and carries out tasks in sensible steps. It scores not only the correctness of the answer but also whether the AI's CoT demonstrates reasonable deduction.

Like GPS navigation, process reward says "recalculating route" at each wrong turn instead of only declaring "wrong route" once you reach the destination.

Core Techniques:

* Step Scoring: Independently evaluates each step in the reasoning process (e.g., whether the intermediate formula in a math problem is reasonable).

* Logical Coherence Detection: Ensures the "because A, therefore B" chain is unbroken (e.g., avoids jumps like "cold weather → therefore eat more watermelon").

* Analogy: A teacher gives points to students who raise their hands to speak in each class.
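As a rough illustration of step scoring, the sketch below assigns each reasoning step its own small score and sums them. The `score_step` heuristic is a toy placeholder; in a real setup that role is played by a learned process reward model, and nothing here reflects any vendor's actual implementation.

```python
# Toy sketch of process reward: every intermediate step earns its own score,
# so credit accumulates along the chain rather than only at the final answer.

from typing import List


def score_step(previous_steps: List[str], step: str) -> float:
    """Toy stand-in for a learned step verifier: reward steps that state a
    reason, penalize bare 'therefore' jumps with no supporting 'because'.
    `previous_steps` is accepted so a real scorer could condition on context
    (unused in this toy version)."""
    has_reason = "because" in step.lower()
    is_jump = "therefore" in step.lower() and not has_reason
    return 0.3 if is_jump else 1.0


def process_reward(chain_of_thought: List[str]) -> float:
    """Small rewards for each correct step, summed over the whole chain."""
    return sum(
        score_step(chain_of_thought[:i], step)
        for i, step in enumerate(chain_of_thought)
    )


steps = [
    "The main cause of congestion is that there are too many private cars.",
    "Therefore we should encourage public transportation, because buses move more people per lane.",
    "Therefore everyone should eat more watermelon.",  # an illogical jump, scored low
]
print(process_reward(steps))  # 1.0 + 1.0 + 0.3 = 2.3
```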


Chapter 3: Outcome Reward: Focusing Solely on Final Success or Failure


Outcome reward gives the model only a question and the desired result, training the AI to work out the intermediate thinking process on its own until it reaches that result.

The goal is to make the AI understand that the correct answer must still be expressed in a "human-understandable" way.
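As a minimal counterpart to the process-reward sketch above, here is an equally simplified illustration of outcome reward: only the final answer is checked against the target, and the intermediate steps are deliberately ignored. The exact-match check is an assumption for illustration, not how any real verifier works.

```python
# Toy sketch of outcome reward: the score depends only on whether the final
# answer matches the target; the chain of thought is not inspected at all.

from typing import List


def outcome_reward(chain_of_thought: List[str], final_answer: str, target: str) -> float:
    """1.0 if the final answer matches the target, else 0.0."""
    del chain_of_thought  # intentionally unused: the model chooses its own path
    return 1.0 if final_answer.strip().lower() == target.strip().lower() else 0.0


print(outcome_reward(
    ["some long exploratory reasoning the model invented on its own..."],
    "Build more subways",
    "build more subways",
))  # 1.0
```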


Humanized Design:

* Learning Preference: Humans prefer "using analogies to explain quantum mechanics" instead of piling up formulas.

* Scenario Adaptation: Provide engineers with code + principles, while using stories + illustrations for elementary school students.

* Analogy: The exam only counts the final grade, regardless of daily homework.


Chapter 4: Reward Fusion: Process and Outcome are Equally Important


Ideal AI thinking lies in the coexistence of "rational decomposition" and "emotional expression." Like a symphony orchestra, process reward is the conductor making sure each musician plays according to the score, while outcome reward is the audience's applause that decides whether to adjust the passion of the melody.

Example:

* Question: How to explain "why leaves fall" to a child?

* Pure Process AI: Explains the abscission layer cells, abscisic acid hormone... step by step (rigorous but boring).

* Pure Outcome AI: "The big tree is going to sleep in winter!" (lively but lacks knowledge).

* Balanced AI:

   * Scientific level: Reduced light in autumn → leaves stop making nutrients → abscission cells separate (process reward supervision).

   * Expression level: The big tree is like changing into its sleeping clothes, taking off the old leaves and waiting to put on new clothes in spring! (outcome reward optimization).
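One simple way to picture the fusion is a weighted blend of the two signals, as in the self-contained sketch below. The toy scorers stand in for real reward models, and the 0.6 weight is arbitrary rather than a value reported for DeepSeek or any other system.

```python
# Toy sketch of reward fusion: blend a step-level (process) score with an
# answer-level (outcome) score. Both scorers and the weight are illustrative.

from typing import List


def toy_process_score(steps: List[str]) -> float:
    """Fraction of steps that state a reason -- stands in for a process reward model."""
    if not steps:
        return 0.0
    return sum(1 for s in steps if "because" in s.lower()) / len(steps)


def toy_outcome_score(final_answer: str, target: str) -> float:
    """Exact match on the final answer -- stands in for an outcome reward check."""
    return 1.0 if final_answer.strip().lower() == target.strip().lower() else 0.0


def fused_reward(steps: List[str], final_answer: str, target: str, w: float = 0.6) -> float:
    """Weighted blend: w on the process signal, (1 - w) on the outcome signal."""
    return w * toy_process_score(steps) + (1.0 - w) * toy_outcome_score(final_answer, target)


steps = [
    "Leaves stop making nutrients because autumn light is reduced.",
    "The abscission cells then separate the leaf from the branch.",
]
print(fused_reward(steps, "The tree rests for winter", "the tree rests for winter"))  # 0.6*0.5 + 0.4*1.0 = 0.7
```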

Process reward and outcome reward are like the two strands of DNA:

* The process ensures the thinking is credible (no fabrication).

* The outcome gives the expression empathy (no technically correct but useless statements).


When AI learns to dynamically balance between the two, the cold code becomes warm.
