Based on *DWDP: Distributed Weight Data Parallelism for High-Performance LLM Inference on NVL72* (arXiv:2604.01621) and TensorRT-LLM commit e92ee4 (PR #12136).
Motivation
Multi-GPU MoE inference typically uses DEP (Data + Expert Parallelism), where every layer requires barrier-like collective synchronization. With real-world workloads this is harmful because:
- Attention: input sequence lengths (ISLs) vary across requests, so ranks finish attention at different times
- MoE: hot experts cause some ranks to process more tokens than others
Both amplify into global wait time at every sync point. Measured on DeepSeek-R1/GB200: ~12% overhead at just CV=20% ISL variation.
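A toy simulation makes the amplification concrete: at every barrier, all ranks wait for the slowest one, so per-rank variance compounds into global wait time. The i.i.d. Gaussian model of per-layer times below is my own illustrative assumption, not the paper's methodology.

```python
import random

def sync_overhead(num_ranks, num_layers, cv, seed=0):
    """Simulate barrier-style execution: every layer, all ranks wait for
    the slowest one. Returns relative overhead vs. a perfectly balanced run."""
    rng = random.Random(seed)
    mean = 1.0                       # arbitrary per-layer time unit
    total_synced = 0.0
    for _ in range(num_layers):
        # Per-rank layer times with coefficient of variation `cv`.
        times = [max(0.0, rng.gauss(mean, cv * mean)) for _ in range(num_ranks)]
        total_synced += max(times)   # barrier: wait for the straggler
    balanced = num_layers * mean     # ideal time with zero variance
    return total_synced / balanced - 1.0

# Overhead grows with both the number of ranks and the variation.
print(f"{sync_overhead(num_ranks=8, num_layers=61, cv=0.20):.1%}")
```

Even this crude model shows a double-digit percentage of pure waiting at CV=20%, in the same ballpark as the paper's ~12% measurement.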
Core solution
Instead of sending tokens to where weights are (EP's AlltoAll), DWDP pulls weights to where tokens are via async P2P (cudaMemcpyAsync).
Under DWDP, weight placement for the two components is:
- Attention: fully replicated (unchanged DP)
- MoE: partitioned across ranks; each rank permanently holds its local experts, fetches missing experts on demand
Thus there is no explicit collective synchronization; each rank progresses independently. While executing the MoE at layer i, each rank asynchronously prefetches the experts needed for layer i+1 through a double-buffered pipeline.
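A minimal sketch of the double-buffered cadence. `fetch_remote_experts` and `moe_forward` are hypothetical stand-ins; in the real system the fetch is issued as `cudaMemcpyAsync` on a side stream and overlaps with the compute, whereas here it runs inline for clarity.

```python
def run_layers(num_layers, fetch_remote_experts, moe_forward):
    """Double-buffered layer-wise prefetch (illustrative sketch).

    buffers[cur] holds the remote experts for the layer being computed;
    buffers[nxt] is filled for the next layer while this one runs.
    """
    buffers = [None, None]
    buffers[0] = fetch_remote_experts(0)      # prologue: fill buffer 0
    outputs = []
    for layer in range(num_layers):
        cur, nxt = layer % 2, (layer + 1) % 2
        if layer + 1 < num_layers:
            # In DWDP this copy is asynchronous and overlaps with the
            # moe_forward call below; sequential here for clarity.
            buffers[nxt] = fetch_remote_experts(layer + 1)
        outputs.append(moe_forward(layer, buffers[cur]))
    return outputs

# Toy stand-ins: "expert weights" are just labeled strings.
trace = run_layers(
    num_layers=3,
    fetch_remote_experts=lambda l: f"experts@{l}",
    moe_forward=lambda l, w: (l, w),
)
print(trace)  # [(0, 'experts@0'), (1, 'experts@1'), (2, 'experts@2')]
```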
Q: Why not use NCCL Allgather?
- synchronization semantics force all ranks to participate simultaneously;
- NCCL uses SMs, interfering with compute.
`cudaMemcpyAsync` uses the copy engine only: no SM interference and no cross-rank synchronization.
Expert splitting and fetching
In the rest of this post, we use k and p to denote num_experts_per_worker and num_prefetch_experts. They satisfy the following constraint:

k * W = E, p = (W - 1) * k = E - k

where E is the total number of experts and W is the number of DWDP workers. DWDP assigns the experts in range [r*k, (r+1)*k) to rank r, so the remaining E - k experts are stored on the other ranks. During runtime, rank r fetches p experts from its peers:
- k experts from rank (r+1) mod W
- ...
- k experts from rank (r+W-1) mod W
DWDP prefetches all experts from each peer, layer by layer, before knowing which experts the router will pick: routing decisions are unavailable at prefetch time because the fetch is pipelined ahead of the next layer's compute. As a result, some fetched experts may go unused in any given batch.
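To make the layout concrete, here is a sketch assuming each rank permanently holds a contiguous block of k = num_experts_per_worker experts and prefetches every non-local expert from its peers. The function names are mine, not TRTLLM's.

```python
def local_experts(rank, k):
    """Experts permanently resident on `rank` (k = num_experts_per_worker)."""
    return list(range(rank * k, (rank + 1) * k))

def fetch_plan(rank, k, W):
    """Which experts `rank` pulls from each peer: all non-local experts,
    i.e. p = (W - 1) * k experts in total."""
    return {peer: local_experts(peer, k)
            for peer in range(W) if peer != rank}

# E = 16 experts, W = 4 workers, k = 4 experts per worker.
plan = fetch_plan(rank=1, k=4, W=4)
print(local_experts(1, 4))  # [4, 5, 6, 7] stay resident on rank 1
print(sorted(e for exps in plan.values() for e in exps))
# every non-local expert is fetched exactly once
```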
Implementation optimizations
- GroupedGEMM kernel for non-contiguous expert weights. The naive baseline merged all local and prefetched weights into a single contiguous buffer via a pre-GEMM D2D copy. DWDP extends the cuteDSL GroupedGEMM kernel to accept a list of weight tensors at different memory locations, eliminating the D2D copies.
- Round-robin sliced prefetch (not in the open-source TRTLLM). Multiple destination ranks simultaneously pull from the same source rank, resulting in many-to-one copy-engine contention. For DWDP with W ranks, the number of competing pulls X on one source at a given instant follows a binomial distribution, X ~ Binomial(W - 1, 1/(W - 1)). The key architectural observation is that the copy engine is pipelined: for sufficiently small requests it can keep multiple slices in flight. Let the slice size be s and consider the regime where s is small enough that the copy engine can keep two slices in flight at the same time.
DWDP's proposed fix (paper only - NOT in the open-source TRTLLM): split each transfer into 1 MB slices and schedule them round-robin across peers:
for offset in range(0, M, slice_size):          # iterate slices first
    for peer in round_robin(remote_peers):      # then round-robin peers
        copy_plan.append((dst[peer], src[peer], offset, slice_size))
DWDP thus interleaves DMA requests across sources; the copy engine's two-slice pipeline then absorbs contention in the most common case. Current ToT TRTLLM instead issues one whole-tensor copy per remote expert tensor, with no slicing or interleaving.
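The sliced round-robin schedule can be made concrete with a small plan builder. This is a sketch: the `(peer, offset, length)` descriptor format and names are mine, not TRTLLM's.

```python
def build_copy_plan(num_bytes, slice_size, peers):
    """Round-robin sliced prefetch: emit (peer, offset, length) copy
    descriptors, interleaving slices across source ranks so no single
    source copy engine sees a long back-to-back burst from one puller."""
    plan = []
    for offset in range(0, num_bytes, slice_size):
        length = min(slice_size, num_bytes - offset)  # last slice may be short
        for peer in peers:                            # round-robin across sources
            plan.append((peer, offset, length))
    return plan

MB = 1 << 20
plan = build_copy_plan(num_bytes=3 * MB, slice_size=1 * MB, peers=[0, 2, 3])
print(plan[:3])  # first slice of every peer is issued before any second slice
# [(0, 0, 1048576), (2, 0, 1048576), (3, 0, 1048576)]
```

Contrast with the whole-tensor behavior: without slicing, all 3 MB for peer 0 would be queued before a single byte from peers 2 and 3.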
Evaluation
Experiments are conducted on NVL72; the baseline is disaggregated prefill/decode DEP. Performance is reported as TPS/GPU at a given TPS/user constraint.
Context only
Drawn from Table 3 of the paper.
(a) As the sequence length grows, computation accounts for a larger fraction of the context-phase latency, so the relative gain from reducing synchronized communication becomes smaller.
(d) DWDP3, however, shows worse TTFT speedup, likely because the smaller context-side deployment provides lower aggregate throughput and therefore higher queueing delay before the first token.
End-to-end comparison
Comparing Pareto points with similar TPS/user, we find that DWDP typically uses fewer context GPUs than the baseline. This suggests the gain primarily comes from reduced context-GPU demand.
At high TPS/user, the system is already heavily generation-bottlenecked, so the context stage cannot accumulate enough tokens to amortize DWDP's prefetch overhead.
What the paper actually proves: DWDP lets you serve similar TPS/user with fewer context GPUs, at the cost of significantly worse TTFT. The per-GPU efficiency improvement (1.09-1.11) is cleanly proven in Section 5.2 (context-only). The end-to-end headline (8.8% TPS/GPU) mixes this with GPU-count reduction and should not be read as a throughput improvement.
Discussions
Prefill only - not decode
DWDP is designed for the context (prefill) phase only.
Why is decode incompatible? DWDP requires the per-layer weight-prefetch time to fit within the per-layer compute window. During decode, each forward pass processes only 1 new token per request (memory-bandwidth bound), while the weight matrices to prefetch are the same size regardless. The compute window is therefore orders of magnitude too small to amortize the P2P weight fetch. Paper's data: at MNT=16K (an already reduced compute budget), TPS/GPU speedup drops to 1.01; decode's compute window is smaller still.
Chunked prefill
Chunked prefill can balance the sequence length across requests by scheduling them at a finer granularity. However, this causes a smaller compute window, thus making it harder to hide prefetch. Table 3b at MNT=16K: gain drops to 1.01. The compute window determines everything; chunking directly shrinks it.
Continuous batching (overlap scheduler) - hard incompatibility
DWDP requires `disable_overlap_scheduler=True`. The overlap scheduler's model of overlapping sampling and scheduling between iterations conflicts with DWDP's static double-buffer layer-wise prefetch cadence.
Hardware requirements: why NVL72?
DWDP requires the per-layer prefetch time to be smaller than the per-layer compute window, and the prefetch time depends on interconnect bandwidth. From Table 1 at ISL=8K/MNT=32768:
- Per-layer P2P copy: ~7 us
- Per-layer compute window: ~10-11 us
- Ratio (compute window / copy time): ~1.4 - barely viable on GB200/NVL72
Scale bandwidth down:
| System | Bandwidth | Per-layer prefetch | Compute window | Ratio |
|---|---|---|---|---|
| NVLink 5 (NVL72) | 1800 GB/s | ~7 us | ~10-11 us | 1.4 (viable) |
| NVLink 4 (H100 node) | 600 GB/s | ~21 us | ~10-11 us | 0.5 (infeasible) |
| InfiniBand NDR (cross-node) | 50 GB/s | ~250 us | ~10-11 us | 0.04 (impossible) |
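The table rows follow directly from payload / bandwidth. The per-layer payload used below (~12.6 MB) is back-solved from the ~7 us figure at 1800 GB/s; it is my inference, not a number stated in the paper.

```python
def prefetch_us(payload_mb, bandwidth_gbps):
    """Per-layer prefetch time in microseconds: payload / bandwidth."""
    return payload_mb * 1e6 / (bandwidth_gbps * 1e9) * 1e6

# Payload inferred from ~7 us at 1800 GB/s (assumption, see above).
PAYLOAD_MB = 12.6
COMPUTE_WINDOW_US = 10.0   # low end of the ~10-11 us window

for name, bw in [("NVLink 5", 1800), ("NVLink 4", 600), ("IB NDR", 50)]:
    t = prefetch_us(PAYLOAD_MB, bw)
    print(f"{name:>8}: prefetch {t:5.0f} us, "
          f"compute/prefetch ratio {COMPUTE_WINDOW_US / t:.2f}")
```

The printed ratios (~1.4, ~0.5, ~0.04) reproduce the table's viability column: only NVLink 5 keeps the prefetch inside the compute window.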
NVL72 uniquely provides:
- High bandwidth: 3x that of NVLink 4
- All-pairs at scale: 72 GPUs all at NVLink speed - enables large DWDP groups without cross-node penalty
TRTLLM configuration knobs for TPS/GPU vs. TPS/user
Key levers for balancing hardware efficiency against per-user latency:
| Knob | Description | Favor TPS/GPU | Favor TPS/user |
|---|---|---|---|
| `max_num_tokens` | Upper bound on total tokens processed per forward pass across all batched requests. | higher | lower |
| `max_batch_size` | Maximum number of concurrent requests the engine batches together in one iteration. | higher | lower |
| `capacity_scheduler_policy` | KV-cache admission policy: `MAX_UTILIZATION` oversubscribes and may evict requests; `GUARANTEED_NO_EVICT` reserves full capacity per admitted request. | `MAX_UTILIZATION` | `GUARANTEED_NO_EVICT` |
| `enable_chunked_prefill` | Splits long prefill into chunks so prefill can interleave with decode in the same iteration. | off | on |
| `free_gpu_memory_fraction` | Fraction of remaining GPU memory allocated to the KV-cache pool after weights. | higher | lower |
| Context GPU count (disagg) | Number of GPUs assigned to the context (prefill) stage in disaggregated prefill/decode serving. | fewer | more |
| `speculative_config` | Enables speculative decoding (a draft proposes tokens, the target verifies) to cut per-token latency at the cost of extra compute. | off (high load) | on (low load) |
In disaggregated serving, context GPU count is the dominant trade-off lever. DWDP's contribution is to shift the Pareto frontier: sustaining similar TPS/user with fewer context GPUs.
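To tie the table together, here is a hedged sketch of a throughput-leaning configuration. The key names follow the table above, but the nesting and values are illustrative, not a schema-accurate or tuned TRTLLM config.

```python
# Hypothetical throughput-leaning (TPS/GPU) settings; values are
# illustrative placeholders, not tuned recommendations.
throughput_config = {
    "max_num_tokens": 8192,            # large token budget per iteration
    "max_batch_size": 512,             # pack many requests per step
    "scheduler_config": {
        "capacity_scheduler_policy": "MAX_UTILIZATION",
    },
    "enable_chunked_prefill": False,   # keep long uninterrupted prefills
    "kv_cache_config": {
        "free_gpu_memory_fraction": 0.9,
    },
    "speculative_config": None,        # off under high load
}

# Flipping knobs toward the right-hand column of the table trades
# hardware efficiency for lower per-user latency.
latency_overrides = {"max_batch_size": 32, "enable_chunked_prefill": True}
print({**throughput_config, **latency_overrides}["max_batch_size"])  # 32
```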