Based on *DWDP: Distributed Weight Data Parallelism for High-Performance LLM Inference on NVL72* (arXiv:2604.01621) and TensorRT-LLM commit e92ee4 (PR #12136).
Motivation
Multi-GPU MoE inference typically uses DEP (Data + Expert Parallelism), where every layer requires barrier-like collective synchronization. With real-world workloads this is harmful because:
- Attention: input sequence lengths (ISLs) vary across requests, so ranks finish attention at different times
- MoE: hot experts cause some ranks to process more tokens than others
Both amplify into global wait time at every sync point. Measured on DeepSeek-R1/GB200: ~12% overhead at just CV=20% ISL variation.
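A toy simulation makes the amplification concrete: at every barrier, all ranks wait for the slowest one, so per-rank variance compounds into global wait time. The i.i.d. Gaussian model of per-layer times below is my own illustrative assumption, not the paper's methodology.

```python
import random

def sync_overhead(num_ranks, num_layers, cv, seed=0):
    """Simulate barrier-style execution: every layer, all ranks wait for
    the slowest one. Returns relative overhead vs. a perfectly balanced run."""
    rng = random.Random(seed)
    mean = 1.0                       # arbitrary per-layer time unit
    total_synced = 0.0
    for _ in range(num_layers):
        # Per-rank layer times with coefficient of variation `cv`.
        times = [max(0.0, rng.gauss(mean, cv * mean)) for _ in range(num_ranks)]
        total_synced += max(times)   # barrier: wait for the straggler
    balanced = num_layers * mean     # ideal time with zero variance
    return total_synced / balanced - 1.0

# Overhead grows with both the number of ranks and the variation.
print(f"{sync_overhead(num_ranks=8, num_layers=61, cv=0.20):.1%}")
```

Even this crude model shows a double-digit percentage of pure waiting at CV=20%, in the same ballpark as the paper's ~12% measurement.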
Core solution
Instead of sending tokens to where weights are (EP's AlltoAll), DWDP pulls weights to where tokens are via async P2P (cudaMemcpyAsync).
Under DWDP, weight placement for the two components is:
- Attention: fully replicated (unchanged DP)
- MoE: partitioned across ranks; each rank permanently holds its local experts, fetches missing experts on demand
Thus there is no explicit collective synchronization; each rank progresses independently. While executing the MoE at layer i, each rank asynchronously prefetches the experts needed for layer i+1 through a double-buffered pipeline.
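A minimal sketch of the double-buffered cadence. `fetch_remote_experts` and `moe_forward` are hypothetical stand-ins; in the real system the fetch is issued as `cudaMemcpyAsync` on a side stream and overlaps with the compute, whereas here it runs inline for clarity.

```python
def run_layers(num_layers, fetch_remote_experts, moe_forward):
    """Double-buffered layer-wise prefetch (illustrative sketch).

    buffers[cur] holds the remote experts for the layer being computed;
    buffers[nxt] is filled for the next layer while this one runs.
    """
    buffers = [None, None]
    buffers[0] = fetch_remote_experts(0)      # prologue: fill buffer 0
    outputs = []
    for layer in range(num_layers):
        cur, nxt = layer % 2, (layer + 1) % 2
        if layer + 1 < num_layers:
            # In DWDP this copy is asynchronous and overlaps with the
            # moe_forward call below; sequential here for clarity.
            buffers[nxt] = fetch_remote_experts(layer + 1)
        outputs.append(moe_forward(layer, buffers[cur]))
    return outputs

# Toy stand-ins: "expert weights" are just labeled strings.
trace = run_layers(
    num_layers=3,
    fetch_remote_experts=lambda l: f"experts@{l}",
    moe_forward=lambda l, w: (l, w),
)
print(trace)  # [(0, 'experts@0'), (1, 'experts@1'), (2, 'experts@2')]
```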
Q: Why not use NCCL Allgather?
- synchronization semantics force all ranks to participate simultaneously;
- NCCL uses SMs, interfering with compute.
`cudaMemcpyAsync` uses the copy engine only: no SM interference and no cross-rank synchronization.
Expert splitting and fetching
In the rest of this post, we use k and p to denote num_experts_per_worker and num_prefetch_experts. They satisfy the following constraint:

k * W = E, p = (W - 1) * k = E - k

where E is the total number of experts and W is the number of DWDP workers. DWDP assigns the experts in range [r*k, (r+1)*k) to rank r, so the remaining E - k experts are stored on the other ranks. During runtime, rank r fetches p experts from its peers:
- k experts from rank (r+1) mod W
- ...
- k experts from rank (r+W-1) mod W
DWDP prefetches all experts from each peer, layer by layer, before knowing which experts the router will pick: routing decisions are unavailable at prefetch time because the fetch is pipelined ahead of the next layer's compute. As a result, some fetched experts may go unused in any given batch.
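To make the layout concrete, here is a sketch assuming each rank permanently holds a contiguous block of k = num_experts_per_worker experts and prefetches every non-local expert from its peers. The function names are mine, not TRTLLM's.

```python
def local_experts(rank, k):
    """Experts permanently resident on `rank` (k = num_experts_per_worker)."""
    return list(range(rank * k, (rank + 1) * k))

def fetch_plan(rank, k, W):
    """Which experts `rank` pulls from each peer: all non-local experts,
    i.e. p = (W - 1) * k experts in total."""
    return {peer: local_experts(peer, k)
            for peer in range(W) if peer != rank}

# E = 16 experts, W = 4 workers, k = 4 experts per worker.
plan = fetch_plan(rank=1, k=4, W=4)
print(local_experts(1, 4))  # [4, 5, 6, 7] stay resident on rank 1
print(sorted(e for exps in plan.values() for e in exps))
# every non-local expert is fetched exactly once
```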
Implementation optimizations
- GroupedGEMM kernel for non-contiguous expert weights. The naive baseline merged all local and prefetched weights into a single contiguous buffer via a pre-GEMM D2D copy. DWDP extends the cuteDSL GroupedGEMM kernel to accept a list of weight tensors at different memory locations, eliminating the D2D copies.
- Round-robin sliced prefetch (not in the open-source TRTLLM). Multiple destination ranks simultaneously pull from the same source rank, resulting in many-to-one copy-engine contention. For DWDP with W ranks, the number of competing pulls X on one source at a given instant follows a binomial distribution, X ~ Binomial(W - 1, 1/(W - 1)). The key architectural observation is that the copy engine is pipelined: for sufficiently small requests it can keep multiple slices in flight. Let the slice size be s and consider the regime where s is small enough that the copy engine can keep two slices in flight at the same time.
DWDP's proposed fix (paper only - NOT in the open-source TRTLLM): split each transfer into 1 MB slices and schedule them round-robin across peers:
for offset in range(0, M, slice_size):          # iterate slices first
    for peer in round_robin(remote_peers):      # then round-robin peers
        copy_plan.append((dst[peer], src[peer], offset, slice_size))
DWDP thus interleaves DMA requests across sources; the copy engine's two-slice pipeline then absorbs contention in the most common case. Current ToT TRTLLM instead issues one whole-tensor copy per remote expert tensor, with no slicing or interleaving.
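The sliced round-robin schedule can be made concrete with a small plan builder. This is a sketch: the `(peer, offset, length)` descriptor format and names are mine, not TRTLLM's.

```python
def build_copy_plan(num_bytes, slice_size, peers):
    """Round-robin sliced prefetch: emit (peer, offset, length) copy
    descriptors, interleaving slices across source ranks so no single
    source copy engine sees a long back-to-back burst from one puller."""
    plan = []
    for offset in range(0, num_bytes, slice_size):
        length = min(slice_size, num_bytes - offset)  # last slice may be short
        for peer in peers:                            # round-robin across sources
            plan.append((peer, offset, length))
    return plan

MB = 1 << 20
plan = build_copy_plan(num_bytes=3 * MB, slice_size=1 * MB, peers=[0, 2, 3])
print(plan[:3])  # first slice of every peer is issued before any second slice
# [(0, 0, 1048576), (2, 0, 1048576), (3, 0, 1048576)]
```

Contrast with the whole-tensor behavior: without slicing, all 3 MB for peer 0 would be queued before a single byte from peers 2 and 3.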
Evaluation
Experiments are conducted on NVL72; the baseline is disaggregated prefill/decode DEP. Performance is reported as TPS/GPU at a given TPS/user constraint.
Context only
Drawn from Table 3 of the paper.
(a) As the sequence length grows, computation accounts for a larger fraction of the context-phase latency, so the relative gain from reducing synchronized communication becomes smaller.
(d) DWDP3, however, shows worse TTFT speedup, likely because the smaller context-side deployment provides lower aggregate throughput and therefore higher queueing delay before the first token.
End-to-end comparison
Comparing Pareto points with similar TPS/user, we find that DWDP typically uses fewer context GPUs than the baseline. This suggests the gain primarily comes from reduced context-GPU demand.
At high TPS/user, the system is already heavily generation-bottlenecked, so the context stage cannot accumulate enough tokens to amortize DWDP's prefetch overhead.
What the paper actually proves: DWDP lets you serve similar TPS/user with fewer context GPUs, at the cost of significantly worse TTFT. The per-GPU efficiency improvement (1.09-1.11) is cleanly proven in Section 5.2 (context-only). The end-to-end headline (8.8% TPS/GPU) mixes this with GPU-count reduction and should not be read as a throughput improvement.
Discussions
Prefill only - not decode
DWDP is designed for the context (prefill) phase only.
Why is decode incompatible? DWDP requires the per-layer weight-prefetch time to fit within the per-layer compute window. During decode, each forward pass processes only 1 new token per request (memory-bandwidth bound), while the weight matrices to prefetch are the same size regardless. The compute window is therefore orders of magnitude too small to amortize the P2P weight fetch. Paper's data: at MNT=16K (an already reduced compute budget), TPS/GPU speedup drops to 1.01; decode's compute window is smaller still.
Chunked prefill
Chunked prefill can balance the sequence length across requests by scheduling them at a finer granularity. However, this causes a smaller compute window, thus making it harder to hide prefetch. Table 3b at MNT=16K: gain drops to 1.01. The compute window determines everything; chunking directly shrinks it.
Continuous batching (overlap scheduler) - hard incompatibility
DWDP requires `disable_overlap_scheduler=True`. The overlap scheduler's model of overlapping sampling and scheduling between iterations conflicts with DWDP's static double-buffer layer-wise prefetch cadence.
Hardware requirements: why NVL72?
DWDP requires the per-layer prefetch time to be smaller than the per-layer compute window, and the prefetch time depends on interconnect bandwidth. From Table 1 at ISL=8K/MNT=32768:
- Per-layer P2P copy: ~7 us
- Per-layer compute window: ~10-11 us
- Ratio (compute window / copy time): ~1.4 - barely viable on GB200/NVL72
Scale bandwidth down:
| System | Bandwidth | Per-layer prefetch | Compute window | Ratio |
|---|---|---|---|---|
| NVLink 5 (NVL72) | 1800 GB/s | ~7 us | ~10-11 us | 1.4 (viable) |
| NVLink 4 (H100 node) | 600 GB/s | ~21 us | ~10-11 us | 0.5 (infeasible) |
| InfiniBand NDR (cross-node) | 50 GB/s | ~250 us | ~10-11 us | 0.04 (impossible) |
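The table rows follow directly from payload / bandwidth. The per-layer payload used below (~12.6 MB) is back-solved from the ~7 us figure at 1800 GB/s; it is my inference, not a number stated in the paper.

```python
def prefetch_us(payload_mb, bandwidth_gbps):
    """Per-layer prefetch time in microseconds: payload / bandwidth."""
    return payload_mb * 1e6 / (bandwidth_gbps * 1e9) * 1e6

# Payload inferred from ~7 us at 1800 GB/s (assumption, see above).
PAYLOAD_MB = 12.6
COMPUTE_WINDOW_US = 10.0   # low end of the ~10-11 us window

for name, bw in [("NVLink 5", 1800), ("NVLink 4", 600), ("IB NDR", 50)]:
    t = prefetch_us(PAYLOAD_MB, bw)
    print(f"{name:>8}: prefetch {t:5.0f} us, "
          f"compute/prefetch ratio {COMPUTE_WINDOW_US / t:.2f}")
```

The printed ratios (~1.4, ~0.5, ~0.04) reproduce the table's viability column: only NVLink 5 keeps the prefetch inside the compute window.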
NVL72 uniquely provides:
- High bandwidth: 3x that of NVLink 4
- All-pairs at scale: 72 GPUs all at NVLink speed - enables large DWDP groups without cross-node penalty
TRTLLM configuration knobs for TPS/GPU vs. TPS/user
Key levers for balancing hardware efficiency against per-user latency:
| Knob | Description | Favor TPS/GPU | Favor TPS/user |
|---|---|---|---|
| `max_num_tokens` | Upper bound on total tokens processed per forward pass across all batched requests. | higher | lower |
| `max_batch_size` | Maximum number of concurrent requests the engine batches together in one iteration. | higher | lower |
| `capacity_scheduler_policy` | KV-cache admission policy: `MAX_UTILIZATION` oversubscribes and may evict requests; `GUARANTEED_NO_EVICT` reserves full capacity per admitted request. | `MAX_UTILIZATION` | `GUARANTEED_NO_EVICT` |
| `enable_chunked_prefill` | Splits long prefill into chunks so prefill can interleave with decode in the same iteration. | off | on |
| `free_gpu_memory_fraction` | Fraction of remaining GPU memory allocated to the KV-cache pool after weights. | higher | lower |
| Context GPU count (disagg) | Number of GPUs assigned to the context (prefill) stage in disaggregated prefill/decode serving. | fewer | more |
| `speculative_config` | Enables speculative decoding (a draft proposes tokens, the target verifies) to cut per-token latency at the cost of extra compute. | off (high load) | on (low load) |
In disaggregated serving, context GPU count is the dominant trade-off lever. DWDP's contribution is to shift the Pareto frontier: sustaining similar TPS/user with fewer context GPUs.
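To tie the table together, here is a hedged sketch of a throughput-leaning configuration. The key names follow the table above, but the nesting and values are illustrative, not a schema-accurate or tuned TRTLLM config.

```python
# Hypothetical throughput-leaning (TPS/GPU) settings; values are
# illustrative placeholders, not tuned recommendations.
throughput_config = {
    "max_num_tokens": 8192,            # large token budget per iteration
    "max_batch_size": 512,             # pack many requests per step
    "scheduler_config": {
        "capacity_scheduler_policy": "MAX_UTILIZATION",
    },
    "enable_chunked_prefill": False,   # keep long uninterrupted prefills
    "kv_cache_config": {
        "free_gpu_memory_fraction": 0.9,
    },
    "speculative_config": None,        # off under high load
}

# Flipping knobs toward the right-hand column of the table trades
# hardware efficiency for lower per-user latency.
latency_overrides = {"max_batch_size": 32, "enable_chunked_prefill": True}
print({**throughput_config, **latency_overrides}["max_batch_size"])  # 32
```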