TL;DR: The scaling laws of AI are hitting diminishing returns, ushering in an era where architectural innovation, not just brute-force compute, will define progress. DeepSeek's recent DeepSeek-OCR, with its "visual compression" of context, represents a groundbreaking shift. By converting long textual conversations into "photographic" memory fragments, DeepSeek is tackling AI's critical long-context problem, enabling theoretically infinite conversations while optimizing compute. This innovation highlights a fundamental divergence in global AI strategy: where Western tech often "stacks resources," Chinese firms are excelling at "engineering optimization" – a difference that could reshape the competitive landscape and democratize advanced AI capabilities.
I am James, CEO of Mercury Technology Solutions.
The trajectory of AI development, particularly between the East and the West, continues to reveal two fundamentally different approaches to technological progress. While much of the recent conversation has revolved around the perceived plateau of AI scaling laws—especially after GPT-5 didn't deliver the same "magic leap" as its predecessors—the real breakthroughs are now occurring in the intricate dance of optimization.
Yesterday, DeepSeek unveiled DeepSeek-OCR, and I believe this innovation opens a crucial new frontier for AI optimization. It's a testament to thinking differently about the very nature of AI memory.
The Elephant in the Room: AI's Contextual Amnesia
Anyone who has spent significant time conversing with an LLM has experienced it: the longer the conversation, the "dumber" the AI becomes. Responses drift, coherence fades, and eventually the AI forgets earlier details entirely. Our knee-jerk reaction is often simply to start a new conversation, which, to our relief, immediately restores the AI's "freshness" and quality.
This isn't a bug; it's a fundamental challenge: AI struggles with excessively long contexts. Imagine trying to meticulously remember every word of an entire book while simultaneously processing new information. Your brain would quickly "crash." LLMs face a similar computational avalanche when processing long context windows: the self-attention at the heart of the transformer scales quadratically with context length, leading to memory overloads and unacceptably slow response times. Handling such windows is theoretically possible, but in practice the latency makes the system unusable.
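To make the scaling problem concrete, here is a back-of-the-envelope sketch in Python; the cost formula is a standard approximation, and the numbers are purely illustrative rather than measurements of any particular model:

```python
# Why long contexts hurt: self-attention compares every token with every
# other token, so its cost grows quadratically with sequence length.

def attention_flops(num_tokens: int, hidden_dim: int = 4096) -> float:
    """Rough FLOPs for one self-attention pass (QK^T plus attention-times-V)."""
    return 2 * 2 * (num_tokens ** 2) * hidden_dim  # two matmuls, ~2 FLOPs per entry

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> {attention_flops(n):.2e} FLOPs")
# 100x more tokens means 10,000x more attention compute.
```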
DeepSeek's team, however, proposed a radical solution: "photographing" old conversations.
Visual Compression: A Human-like Approach to AI Memory
Initially, the idea sounds counter-intuitive. Converting text into images, then asking an AI to "read" those images to reconstruct the conversation? Wouldn't that lead to massive information loss and increased storage requirements?
DeepSeek's results are, frankly, astonishing. They discovered that a page of 1,000 words could be reconstructed with over 97% accuracy using only about 100 "visual tokens." This is like compressing a 100,000-word conversation into 10,000 "photo fragments," allowing the AI to recall the gist of your discussion by looking at these fragments. Even pushing the compression ratio to 20x (50 visual tokens for 1,000 words) still retained approximately 60% accuracy. Think about recalling details from a month-old conversation – 60% retention is impressive for a human, let alone an AI.
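To keep those ratios straight, here is a quick arithmetic sketch; the accounting below is our own simplification, not DeepSeek's exact tokenization:

```python
# 10x: 1,000 words -> ~100 visual tokens (~97% reconstruction accuracy)
# 20x: 1,000 words -> ~50 visual tokens  (~60% reconstruction accuracy)

def visual_tokens(words: int, words_per_token: int) -> int:
    """Visual tokens needed at a given compression ratio."""
    return words // words_per_token

conversation_words = 100_000  # a very long conversation
print(visual_tokens(conversation_words, 10))  # 10000 "photo fragment" tokens at 10x
print(visual_tokens(conversation_words, 20))  # 5000 tokens at 20x, lower fidelity
```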
(A crucial caveat: these tests were primarily in OCR scenarios – text reconstruction from images. The effectiveness in complex multi-turn dialogue, code discussions, or intricate reasoning still requires full validation, as the paper itself acknowledges these are preliminary results.)
However, from an engineering standpoint, the performance is remarkable. A single A100 GPU can process 200,000 pages daily, and a 20-node cluster (eight A100s per node) scales that to roughly 33 million pages a day. For use cases involving massive document processing, such as generating training data for large models or building enterprise knowledge bases, this efficiency gain is transformative.
DeepSeek has even open-sourced the code and model weights, lowering the barrier to entry. While the model isn't fine-tuned for conversational use and requires specific prompt formats, the underlying optimization is undeniable.
Smart Architecture: Adaptive Compression and the "Human Forgetting" Hypothesis
DeepSeek-OCR isn't a rigid, one-size-fits-all solution. Its architecture is flexible, offering multiple modes like a camera's various shooting settings. A simple slide might only need 64 visual tokens at 512x512 resolution (Tiny mode), while a complex newspaper layout can be handled with around 800 tokens using a multi-view "Gundam" mode.
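As a sketch of how such mode selection might look in code (only the Tiny and Gundam figures come from the numbers quoted above; the selection heuristic itself is our assumption):

```python
# Resolution modes quoted above. Any additional modes and the
# complexity threshold are illustrative assumptions, not the real API.

MODES = {
    "tiny":   {"resolution": (512, 512),   "visual_tokens": 64},   # simple slides
    "gundam": {"resolution": "multi-view", "visual_tokens": 800},  # dense newspaper layouts
}

def pick_mode(layout_complexity: float) -> str:
    """Spend few tokens on simple pages, more on complex ones (score in 0..1)."""
    return "tiny" if layout_complexity < 0.5 else "gundam"

print(pick_mode(0.2))  # -> "tiny"
print(pick_mode(0.9))  # -> "gundam"
```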
This flexibility is key. It's akin to how humans process information – simple notes are stored differently than complex academic papers. DeepSeek-OCR intelligently adjusts compression based on content complexity, conserving resources where possible and applying more power when needed. The underlying principle is profound: the limit of compression depends on complexity, mirroring how human memory operates.
This brings us to the paper's most insightful concept: "Letting AI forget like a human."
Consider your own memory. You can repeat a recent sentence verbatim. An hour-old conversation's gist is clear. Yesterday's events are key fragments. Last week's discussion is hazy. Last month's is largely forgotten.
DeepSeek proposes a similar mechanism for AI: recent interactions are kept as raw text. One-hour-old content becomes a high-resolution "photo" (800 tokens). This morning's dialogue degrades to standard definition (256 tokens). Yesterday's becomes low-res (100 tokens), and older memories are either heavily compressed or discarded.
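A minimal sketch of such an age-tiered scheme, in Python; the time thresholds and the placeholder encoder below are our assumptions for illustration, not DeepSeek's published design:

```python
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    age_hours: float  # how long ago this turn happened
    text: str         # the original conversation text

def token_budget(age_hours: float) -> int | None:
    """Map a memory's age to a visual-token budget.

    None means keep raw text; 0 means discard. Thresholds are illustrative.
    """
    if age_hours < 1:
        return None   # recent turns stay as raw text
    if age_hours < 6:
        return 800    # high-resolution "photo"
    if age_hours < 24:
        return 256    # standard definition
    if age_hours < 168:
        return 100    # low resolution
    return 0          # heavily compressed away / forgotten

def compress_to_visual_tokens(text: str, budget: int) -> list[str]:
    """Placeholder for rendering text to an image and encoding it into
    `budget` visual tokens (a stand-in for the real visual encoder)."""
    return [f"<vtok{i}>" for i in range(budget)]

def store(entry: MemoryEntry) -> str | list[str] | None:
    budget = token_budget(entry.age_hours)
    if budget is None:
        return entry.text  # keep verbatim
    if budget == 0:
        return None        # forgotten
    return compress_to_visual_tokens(entry.text, budget)
```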
This design resembles the fading nature of human memory, and it opens up the possibility of AI handling theoretically infinite conversations, as older memories automatically "fade" to make room for new ones.
Of course, challenges remain. How do we determine which information is "important" and deserves high-resolution retention? What happens if a user, 50 turns into a conversation, suddenly references a detail from turn 5 that has been heavily compressed? This might require "memory importance scoring" or user-assigned importance tags.
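For illustration only, here is one speculative way such "memory importance scoring" could work; every signal and weight is a placeholder, not anything DeepSeek has described:

```python
def importance(turn_text: str, user_pinned: bool, times_referenced: int) -> float:
    """Score a conversation turn; higher scores earn a larger token budget
    when the memory is compressed. All weights are made up for illustration."""
    score = 0.5 if user_pinned else 0.0               # explicit user tags dominate
    score += 0.3 * min(times_referenced, 5) / 5       # re-referenced details matter
    score += 0.2 * min(len(turn_text) / 2000, 1.0)    # longer turns carry more content
    return score

# A pinned turn referenced twice: 0.5 + 0.12 + a small length term
print(importance("Project deadline is May 3rd.", user_pinned=True, times_referenced=2))
```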
The Global AI Divide: Engineering Optimization vs. Resource Stacking
This research vividly illustrates a defining characteristic of Chinese AI companies: an extreme focus on cost optimization and engineering efficiency.
DeepSeek's previous V3 model achieved GPT-4 level performance with a fraction of the compute (2.788M H800 GPU hours, estimated $5.57M training cost), astonishing the industry. This OCR model continues that trend, relentlessly seeking to achieve the best results with the fewest tokens.
In contrast to the "stack resources until it works" approach often seen in Western AI development, Chinese teams excel at deep optimization under resource constraints. This may be a direct result of GPU export restrictions forcing innovation, compounded by a strong engineering culture of efficiency. While OpenAI can burn vast sums training ever-larger models, DeepSeek must find ways to achieve comparable results with less.
This divergence is actively reshaping the global AI competitive landscape. While some Western companies are still competing on who has the largest model or the highest training costs, Chinese firms are exploring how to achieve 90% of the effect with 10% of the cost. In the long run, this engineering optimization capability could prove to be a more formidable competitive advantage than sheer resource deployment, especially for large-scale commercial applications where cost control is paramount.
Looking Ahead: The Promise of R2 and Beyond
If DeepSeek integrates these kinds of innovative techniques into their next-generation reasoning model, R2, it could lead to substantial shifts. R1 already demonstrated that Chinese teams can approach parity with their Western counterparts in reasoning, but its long-context handling remained limited by traditional architectures. If R2 combines visual compression, MoE optimization, and other as-yet-unannounced techniques, it could dramatically reduce the computational cost of long contexts while maintaining powerful reasoning.
This isn't just a performance bump; it's an expansion of use cases. Imagine an AI that remembers dozens of conversation turns, processes extremely long documents, and maintains an acceptable inference cost. This would be transformative for applications requiring extended interaction, such as education, medical consultation, or legal analysis. And if the cost is low enough, these capabilities could move from being "exclusive to large corporations" to being "accessible to small and medium developers."
DeepSeek's technological roadmap consistently points towards "more efficient, more practical" solutions, rather than simply chasing benchmark numbers. V3, OCR, and likely R2 all follow this path. While these projections rest on current information and a degree of speculation, the direction is clear and technically grounded.
Human memory doesn't function like a traditional computer, logging every detail. We remember impressions, key information, and emotional connections, not verbatim transcripts. We forget details but retain the important. We re-encode memories, storing them more efficiently. DeepSeek-OCR offers a viable pathway for AI to mimic this: when handling long contexts, a visual representation might be far more efficient than pure text.
Whether this idea holds up in broader contexts remains to be seen. But it undeniably proves one thing: under resource constraints, by deeply contemplating the nature of the problem, cleverly designing the architecture, and meticulously optimizing every component, it is still possible to build highly competitive systems. This, perhaps, is a microcosm of China's AI development – a victory not of resource stacking, but of engineering optimization.
The next time you find your AI "forgetting" your previous conversation, perhaps a future AI will respond: "I haven't forgotten; I've simply photographed our conversation and stored it deep within my memory. If you need it, I can always retrieve it for you."
At that moment, the dialogue between AI and humanity might become far more natural, and enduring.
Mercury Technology Solutions. Accelerate Digitality.