DeepSeek Releases DSpark, a Speculative Decoding Framework That Accelerates DeepSeek-V4 Per-User Generation 60–85% Over MTP-1

DeepSeek released DSpark, a speculative decoding framework, with open-source checkpoints and training code. It is a serving optimization, not a new model. The checkpoints DeepSeek-V4-Pro-DSpark and DeepSeek-V4-Flash-DSpark reuse the existing V4 weights, with a draft module attached.

The DeepSeek research team also open-sourced DeepSpec, an MIT-licensed codebase for training and evaluating speculative decoding drafters. The work targets one problem: faster large-model inference in busy production serving.

AI Spending Takes Center Stage in Big Tech Earnings | Bloomberg Tech 7/23/2026

IBM CEO: ‘Technology Spend Will Keep Growing’

TL;DR

DSpark pairs a parallel draft backbone with a tiny sequential head to cut suffix decay.
A confidence head and load-aware scheduler verify more tokens when GPUs are idle, fewer when busy.
Offline, accepted length rises 26–31% over Eagle3 and 16–18% over DFlash.
In production on DeepSeek-V4, per-user generation runs 60–85% faster than the MTP-1 baseline.
Output stays lossless, and the checkpoints plus DeepSpec training code are open-source.

What is DSpark?

Speculative decoding splits generation into two roles. A small draft model proposes a block of tokens. The full target model then verifies that block in one forward pass.

Rejection sampling accepts the longest valid prefix and appends one bonus token. Because the rule preserves the target distribution exactly, there is no quality loss. DSpark keeps this guarantee. It changes how tokens are drafted and how many get verified.

The Latency Math it Optimizes

Per-token latency follows one equation from the paper: L = (T_draft + T_verify) / τ. Here τ is the number of tokens accepted per cycle. Speedup comes from three levers only.

You can draft faster, lowering T_draft. You can draft better, raising τ. Or you can verify smarter, reducing wasted T_verify. DSpark pulls all three levers at once.

How It Works: Semi-Autoregressive Generation

Earlier drafters force a trade-off. Autoregressive drafters like Eagle3 condition each token on prior ones. That gives strong acceptance, but drafting cost grows with block size.

Parallel drafters like DFlash produce the whole block in one pass. Drafting stays cheap, but each position ignores its neighbors. The result is ‘multi-modal collision’ and rapid acceptance decay along the suffix.

DSpark splits drafting into two stages. A heavy parallel backbone, DFlash in their setup, produces base logits for every position. Then a lightweight sequential head adds a prefix-dependent bias before sampling each token.

The default sequential head is a Markov head. It only looks at the immediately preceding token. A low-rank factorization (rank 256) keeps it cheap, even with large vocabularies.

Once position one samples ‘of’, the head boosts ‘course’ and suppresses ‘problem’. An optional RNN head tracks the full block prefix. It adds only marginal gains, so the Markov head ships as the default.

The payoff shows up position by position. DSpark inherits the parallel backbone’s high first-token accuracy. The sequential head then holds acceptance steady deep into the block.

Training freezes the target model and reuses its embedding and output head. A total-variation loss is the key term. Minimizing that distance directly maximizes the draft’s acceptance rate.

How It Works: Confidence-Scheduled Verification

More draft tokens do not always mean more speed. Verifying tokens that will be rejected wastes batch capacity under heavy load. DSpark adds two parts to fix this.

A confidence head outputs a score for each draft position. The score estimates the chance that token survives verification, given accepted predecessors. It is supervised by the analytical per-step acceptance rate.

Raw neural confidence is usually overconfident. So the research team applies Sequential Temperature Scaling, a post-hoc calibration step. It cuts expected calibration error from 3–8% down to about 1%.

A hardware-aware prefix scheduler then sets the verification length per request. It uses a profiled throughput curve, SPS(B), measured once at startup. When GPUs are idle, it verifies more tokens. When GPUs are busy, it verifies fewer.

The scheduler uses an early-stopping rule to stay lossless. The appendix section gives a counterexample showing why a naive global search would leak information.

Metrics

Offline tests cover math, code, and daily chat. Targets include Qwen3-4B, 8B, 14B, and Gemma4-12B. DSpark beats both baselines on accepted length across every domain.

Against Eagle3, macro-average accepted length rises 30.9%, 26.7%, and 30.0% on the three Qwen3 sizes. Against DFlash, gains are 16.3%, 18.4%, and 18.3%. A 2-layer DSpark even beats a 5-layer DFlash.

The sequential head adds little cost. Scaling draft length from 4 to 16 adds only 0.2–1.3% per-round latency. In return, accepted length improves by up to 30%.

Production results come from DeepSeek-V4-Flash and V4-Pro under live traffic. The baseline is MTP-1, the prior single-token setup. At matched throughput, per-user speed rises 60–85% on Flash and 57–78% on Pro. The shipped configuration is DSpark-5, a five-token draft block with the Markov head.

Drafter	Drafting style	Block cost	Suffix acceptance	Verification length
Eagle3	Autoregressive	Grows with block size	High, stable	Fixed
DFlash	Parallel	Near-constant	Decays fast	Fixed (full block)
MTP-1	Single-token (MTP)	Low	—	Static 2 tokens
DSpark	Parallel + sequential head	Near-constant	High, stable	Dynamic, load-aware

Use Cases With Examples

Structured workloads gain the most from longer verification. In code generation, acceptance is naturally high. The scheduler can verify long prefixes with little waste, so coding agents stream output faster.

Open-ended chat behaves differently. A confidence-threshold sweep raised chat acceptance from 45.7% to 95.7%. The confidence head flags uncertain suffix tokens so they can be pruned.

Math reasoning sits between the two. Its acceptance rose from 76.9% to 92.5% in the same sweep. Long step-by-step traces benefit from steady deep-block acceptance.

High-concurrency serving is the headline case. At moderate load, the scheduler runs roughly 4–6 verified tokens per request. As concurrency rises, it trims that budget to protect throughput.

Try It

DeepSpec runs in three stages: data preparation, training, then evaluation. A config selects the algorithm and target model. Evaluation benchmarks a trained draft checkpoint across nine datasets.

# Install dependencies
python -m pip install -r requirements.txt

# Train a DSpark draft against a Qwen3-4B target.
# The algorithm and target are chosen by the config, e.g.
# config/dspark/dspark_qwen3_4b.py
bash scripts/train/train.sh

# Evaluate the trained draft across the 9 benchmark datasets.
# Set in the eval config:
#   target_name_or_path = Qwen/Qwen3-4B
#   draft_name_or_path  = ~/checkpoints/deepspec/dspark_block8_qwen3_4b/step_latest
bash scripts/eval/eval.sh

The default configs assume one node with 8 GPUs. Reduce CUDA_VISIBLE_DEVICES for fewer. Note the target cache can be large, near 38 TB for the Qwen3-4B setting.

For the production checkpoints, the draft module attaches to the existing V4 weights. The Hugging Face cards include a minimal inference example in the inference folder. No retraining of the target model is required.

The interactive demo below shows the mechanism. Pick a drafter, a domain, and a GPU-load level. Watch the draft block, the confidence scores, and the scheduler’s verification budget change in real time. The numbers are illustrative, modeled on the paper’s reported behavior.

Check out the Paper, GitHub and Model weight on HF. Also, feel free to follow us on Twitter and don’t forget to join our 150k+ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.

Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? Connect with us

Credit: Source link