DeepSeek V4 Sets Open-Weight Bar With 1M-Token Context

Million-token documents stopped being edge cases and started becoming baselines when DeepSeek made 1M tokens the default context window in V4 on April 24, pairing that leap with open weights and turnkey APIs. The release arrived as two preview models, V4-Pro and V4-Flash, that reframed long context not as a marketing checkbox but as an architectural commitment. Both models used a Mixture-of-Experts design and a revised attention scheme that fused token-level compression with DeepSeek Sparse Attention, cutting compute and memory demands enough to make million-token inputs practical at production scale. V4-Pro disclosed 1.6 trillion total parameters with 49 billion active per token, aimed at frontier reasoning; V4-Flash offered 284 billion total parameters with 13 billion active, tuned for fast, low-cost inference. Adjustable “Thinking” and “Non-Thinking” modes, plus a reasoning_effort control, turned deliberation into a tunable resource, letting teams dial depth up or down for retrieval, coding, or agentic planning without reworking their pipelines.
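To make that control concrete, here is a minimal sketch of a per-call effort setting, assuming an OpenAI-compatible Chat Completions endpoint; the base URL, model id, and the exact shape of the reasoning_effort field are illustrative assumptions, not confirmed API details:

```python
# Minimal sketch: dialing deliberation per call on an OpenAI-compatible
# Chat Completions endpoint. The base URL, model id, and the exact
# shape of the reasoning_effort field are assumptions for illustration.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepseek.com",  # assumed endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",  # hypothetical model id
    messages=[{"role": "user", "content": "Summarize the attached filings."}],
    # Pass the non-standard field through the SDK's escape hatch:
    # a cheap retrieval pass gets "low"; agentic planning might get "high".
    extra_body={"reasoning_effort": "low"},
)
print(response.choices[0].message.content)
```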

Architecture, Benchmarks, and Developer Flow

DeepSeek positioned V4-Pro as competitive with Claude Opus 4.6, GPT-5.4 xHigh, and Gemini 3.1 Pro across knowledge, reasoning, and agentic tasks. Its claims included an open-source lead in agentic coding and a 3206 Codeforces rating, with only Gemini edging it out on broad world knowledge. Flash tracked close to Pro on structured reasoning and matched it on simpler agent workflows at a fraction of the price, underscoring a performance-per-dollar thesis rather than a chase for raw scale. Compatibility smoothed adoption: V4 spoke the OpenAI Chat Completions API and Anthropic-style interfaces, slotted into agent stacks like Claude Code, OpenClaw, and OpenCode, and required minimal glue code to replace legacy endpoints. DeepSeek also announced that deepseek-chat and deepseek-reasoner would be retired on July 24, with aliases already pointing to Flash to ease cutovers. For teams juggling latency budgets, the reasoning_effort knob enabled per-call tradeoffs, turning “thinking time” into an operational variable.
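Given that compatibility story, a cutover could be as small as remapping model strings. A hedged sketch of such a shim follows; the V4 model ids are illustrative and should be verified against the live model list before the July 24 retirement:

```python
# Migration sketch: route legacy model names to assumed V4 replacements.
# The V4 model ids below are illustrative, not published identifiers.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_API_KEY")

LEGACY_TO_V4 = {
    "deepseek-chat": "deepseek-v4-flash",      # alias already points at Flash
    "deepseek-reasoner": "deepseek-v4-flash",  # per the announced cutover
}

def complete(model: str, prompt: str) -> str:
    """Swap retired model names for their V4 replacements, then call."""
    resp = client.chat.completions.create(
        model=LEGACY_TO_V4.get(model, model),
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

Since the announced aliases already resolve to Flash, a shim like this mainly buys explicitness: logs and cost dashboards show the real model well before the old names disappear.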

Open Weights, Long-Context Norms, and What to Do Next

The release also signaled broader shifts: long context moved from a premium add-on to a baseline capability, driven by attention-level gains rather than brute-force scaling, and open weights remained central as V4 landed on Hugging Face under an open license. The lab staked its claim as the open-weight frontier standard-bearer while domestic rivals (Alibaba, Xiaomi’s MiMo, and Moonshot’s Kimi) pressed their own upgrades, and a China-native compute path advanced through Huawei Ascend optimizations. Practical next steps were clear: start with retrieval-heavy workloads, where million-token windows compressed RAG stacks, then ratchet up reasoning_effort only for agent turns that earned it. Pilots benefited from diffing Pro and Flash on the same traces to set cost floors, routing complex tool use to Pro and steady automation to Flash; a sketch of that routing follows below. Migration plans were best anchored to the July 24 endpoint sunset, and teams pursuing regulated or on-prem deployments were advised to validate V4’s open weights against internal inference stacks, using Ascend builds where supply chains required it.
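One way to operationalize that Pro/Flash split is a thin router keyed to turn complexity. The model ids, effort levels, and tool-count heuristic below are illustrative assumptions, not published defaults:

```python
# Routing sketch: send complex tool use to Pro with high effort,
# steady automation to Flash with low effort. All thresholds and
# model ids here are illustrative assumptions.

def route_turn(num_tool_calls: int, needs_planning: bool) -> dict:
    """Pick a model and effort level for a single agent turn."""
    if needs_planning or num_tool_calls > 3:
        return {"model": "deepseek-v4-pro", "reasoning_effort": "high"}
    return {"model": "deepseek-v4-flash", "reasoning_effort": "low"}

# A simple retrieval turn stays on the cheap path:
print(route_turn(num_tool_calls=1, needs_planning=False))
# -> {'model': 'deepseek-v4-flash', 'reasoning_effort': 'low'}
```

Replaying the same traces through both branches of a router like this is one way to establish the cost floor before committing production traffic.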
