[paper] DWDP

April 19, 2026

Based on DWDP: Distributed Weight Data Parallelism for High-Performance LLM Inference on NVL72 arXiv:2604.01621 and TensorRT-LLM commit e92ee4 (PR #12136).

Motivation

Multi-GPU MoE inference typically uses DEP (Data + Expert Parallelism), in which every layer requires barrier-like synchronization. Real-world workloads make this harmful because:

  • Attention: varying input sequence lengths (ISLs) across requests
  • MoE: hot experts cause some ranks to process more tokens

Both amplify into global wait time at every sync point. Measured on DeepSeek-R1/GB200: ~12% overhead at just CV=20% ISL variation.

Core solution

Instead of sending tokens to where the weights are (EP's AlltoAll), DWDP pulls weights to where the tokens are via async P2P copies (cudaMemcpyAsync). Under DWDP, the two weight groups are handled as follows:

  • Attention: fully replicated (unchanged DP)
  • MoE: partitioned across ranks; each rank permanently holds its local experts, fetches missing experts on demand

Thus there is no explicit collective sync: each rank progresses independently. While executing the MoE at layer l, each rank asynchronously prefetches the experts needed for layer l+1 through a double-buffered pipeline.
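The layer-wise cadence can be sketched as a minimal simulation (illustrative only, not the TRT-LLM implementation; the `events` trace stands in for async copy issue and stream ordering):

```python
def run_layers(num_layers):
    """Simulate DWDP's double-buffered prefetch: while layer l computes,
    the remote experts for layer l+1 are fetched into the other buffer."""
    events = []                        # ordered trace of issued operations
    buffers = [None, None]             # two staging buffers for remote experts
    buffers[0] = 0                     # experts for layer 0 fetched up front
    events.append(("prefetch", 0, 0))
    for l in range(num_layers):
        nxt = (l + 1) % 2              # buffer that layer l+1 will use
        if l + 1 < num_layers:         # issue async prefetch before compute
            buffers[nxt] = l + 1
            events.append(("prefetch", l + 1, nxt))
        events.append(("compute", l, l % 2))   # MoE compute for layer l
    return events

trace = run_layers(4)
```

The trace shows the overlap property the text describes: the prefetch for layer l+1 is always issued before the compute of layer l, and the two buffers alternate.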


Q: Why not use NCCL Allgather?

  • NCCL's synchronization semantics force all ranks to participate simultaneously;
  • NCCL kernels run on SMs, interfering with compute. cudaMemcpyAsync uses only the copy engine (DMA): no SM interference, no cross-rank synchronization.

Expert splitting and fetching

In the rest of this post, we use L and P to denote num_experts_per_worker and num_prefetch_experts. They satisfy the following constraint:

E = L + (N - 1) × P,

where E is the total number of experts and N is the number of DWDP workers.

DWDP assigns experts in the range [r × P, r × P + L) to rank r, so L - P of its experts are also stored on adjacent ranks. At runtime, rank r fetches the remaining E - L experts from peers:

  • experts [r' × P, (r' + 1) × P) from each rank r' < r
  • experts [(r' - 1) × P + L, r' × P + L) from each rank r' > r
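A quick sanity check of this layout (a sketch with hypothetical helper names, treating the bracketed ranges as half-open so the counts match E = L + (N - 1) × P): each rank's local range plus its fetched ranges should cover all E experts exactly, with no double-fetch of local experts.

```python
def expert_plan(r, N, L, P):
    """Experts held locally by rank r, plus the ranges it fetches from peers.
    Ranges are half-open; total experts E = L + (N - 1) * P."""
    local = set(range(r * P, r * P + L))           # permanently resident experts
    fetched = set()
    for rp in range(N):
        if rp < r:                                  # peers below: [r'P, (r'+1)P)
            fetched |= set(range(rp * P, (rp + 1) * P))
        elif rp > r:                                # peers above: [(r'-1)P+L, r'P+L)
            fetched |= set(range((rp - 1) * P + L, rp * P + L))
    return local, fetched

N, L, P = 4, 16, 8                                  # example sizes: E = 16 + 3*8 = 40
E = L + (N - 1) * P
```

For every rank, `local | fetched` covers exactly {0, ..., E-1}, and each peer contributes exactly P experts.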

DWDP prefetches all P experts from each peer, layer by layer, before knowing which experts the router will pick. Routing decisions are unavailable at prefetch time (the fetch is pipelined ahead of the next layer's compute), so some fetched experts may go unused in any given batch.

Implementation optimizations

  1. GroupedGEMM kernel for non-contiguous expert weights. The naive baseline merged all local and prefetched weights into a single contiguous buffer via a pre-GEMM D2D copy. DWDP extends the CuTe DSL GroupedGEMM kernel to accept a list of weight tensors at different memory locations, eliminating the D2D copies.

  2. Round-robin sliced prefetch (not available in open-source TRTLLM). Multiple destination ranks may simultaneously pull from the same source rank, producing many-to-one copy-engine contention. For DWDP with N ranks, the number of competing pulls X at one source follows a binomial distribution:

X ~ Binomial(N - 2, 1/(N - 1)).

The key architectural observation is that the copy engine is pipelined: for sufficiently small requests, it can keep multiple slices in flight. Let the slice size be s, and consider the regime where s is small enough that the copy engine can keep two slices in flight at the same time.

DWDP's proposed fix (paper only, NOT in open-source TRTLLM): split each transfer into 1 MB slices and schedule them round-robin across peers:

copy_plan = []                                # DMA requests in issue order
for offset in range(0, M, slice_size):        # iterate slices first
    for peer in round_robin(remote_peers):    # then round-robin across peers
        copy_plan.append((peer, offset, slice_size))  # one 1 MB slice per peer per round

DWDP interleaves DMA requests across sources. The copy engine's two-slice pipeline provides robustness against X ≤ 1 contention (the most common case).
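How common is "the most common case"? Evaluating P(X ≤ 1) for X ~ Binomial(N - 2, 1/(N - 1)) with plain Python:

```python
from math import comb

def prob_at_most_one(N):
    """P(X <= 1) where X ~ Binomial(N - 2, 1 / (N - 1)): the probability that
    at most one other destination is pulling from the same source rank."""
    n, p = N - 2, 1.0 / (N - 1)

    def pmf(k):
        return comb(n, k) * p**k * (1 - p) ** (n - k)

    return pmf(0) + pmf(1)
```

For N = 8 this gives about 0.79, and as N grows it approaches 2/e ≈ 0.736, so the two-slice pipeline covers the contention level seen roughly three quarters of the time.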

Current top-of-tree TRTLLM instead issues one whole-tensor copy per remote expert tensor, with no slicing or interleaving.
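To make the contrast concrete, the sketch below (illustrative only; names are hypothetical) builds both schedules and checks the interleaving property: in the sliced plan, consecutive DMA requests always target different peers.

```python
SLICE = 1 << 20                                    # 1 MB slices, per the paper

def sliced_plan(peers, nbytes):
    """Round-robin plan: iterate slice offsets first, then peers, so
    consecutive DMA requests target different source ranks."""
    return [(peer, off) for off in range(0, nbytes, SLICE) for peer in peers]

def whole_tensor_plan(peers, nbytes):
    """ToT-style plan: one unsliced copy per remote tensor."""
    return [(peer, 0) for peer in peers]

plan = sliced_plan(["rank1", "rank2", "rank3"], 4 * SLICE)
```

With 3 peers and a 4 MB tensor, the sliced plan issues 12 small requests that rotate across sources, while the whole-tensor plan issues 3 large back-to-back copies that can serialize behind a contended source.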

Evaluation

Experiments are conducted on NVL72; the baseline is disaggregated prefill/decode with DEP. Performance is reported as TPS/GPU under a given TPS/user constraint.

Context only

Drawn based on Table 3 from the paper.


(a) As the sequence length grows, computation accounts for a larger fraction of the context-phase latency, so the relative gain from reducing synchronized communication becomes smaller.

(d) DWDP3, however, shows worse TTFT speedup, likely because the smaller context-side deployment provides lower aggregate throughput and therefore higher queueing delay before the first token.

End-to-end comparison

Comparing pareto points with similar TPS/user, we find that DWDP typically uses fewer context GPUs than the baseline. This suggests that the gain primarily comes from reduced context GPU demand.

At high TPS/user, the system is already heavily generation-bottlenecked, so the context stage cannot accumulate enough tokens to amortize DWDP's prefetch overhead.

What the paper actually proves: DWDP lets you serve similar TPS/user with fewer context GPUs, at the cost of significantly worse TTFT. The per-GPU efficiency improvement (1.09-1.11×) is cleanly proven in Section 5.2 (context-only). The end-to-end headline (8.8% TPS/GPU) mixes this with GPU-count reduction and should not be read as a throughput improvement.

Discussions

Prefill only - not decode

DWDP is designed for the context (prefill) phase only.

Why is decode incompatible? DWDP requires T_compute > T_prefetch. During decode, each forward pass processes only one new token per request (memory-bandwidth bound), while the weight matrices to prefetch are the same size regardless. The compute window is therefore orders of magnitude too small to amortize the P2P weight fetch. The paper's own data: at MNT=16K (an already-reduced compute budget), TPS/GPU speedup drops to 1.01×; decode's compute window is smaller still.

Chunked prefill

Chunked prefill can balance sequence lengths across requests by scheduling them at a finer granularity. However, this shrinks the compute window, making prefetch harder to hide. Table 3b at MNT=16K: the gain drops to 1.01×. The compute window determines everything, and chunking directly shrinks it.

Continuous batching (overlap scheduler) - hard incompatibility

DWDP requires disable_overlap_scheduler=True. The overlap scheduler overlaps sampling and scheduling across iterations, which conflicts with DWDP's static, double-buffered, layer-wise prefetch cadence.

Hardware requirements: why NVL72?

DWDP requires T_compute > T_prefetch, and T_prefetch depends on bandwidth. From Table 1 at ISL=8K / MNT=32768:

  • Per-layer P2P copy: ~7 us
  • Per-layer compute window: ~10-11 us
  • Ratio: ~1.4×, barely viable on GB200/NVL72

Scale bandwidth down:

| System | Bandwidth | Per-layer prefetch | Compute window | Ratio |
| --- | --- | --- | --- | --- |
| NVLink 5 (NVL72) | 1800 GB/s | ~7 us | ~10-11 us | 1.4× (viable) |
| NVLink 4 (H100 node) | 600 GB/s | ~21 us | ~10-11 us | 0.5× (infeasible) |
| InfiniBand NDR (cross-node) | 50 GB/s | ~250 us | ~10-11 us | 0.04× (impossible) |
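These prefetch times are consistent with a fixed per-layer remote-weight volume divided by link bandwidth; back-solving from the NVL72 row gives roughly 12.6 MB per layer (our inference from the table, not a figure stated in the paper):

```python
def prefetch_us(bytes_per_layer, bw_gb_s):
    """Per-layer prefetch time in microseconds at a given link bandwidth."""
    return bytes_per_layer / (bw_gb_s * 1e9) * 1e6

# ~12.6 MB, back-solved from the NVL72 row (7 us at 1800 GB/s)
PER_LAYER = 7e-6 * 1800e9
```

At 600 GB/s the same volume takes ~21 us, already exceeding the ~10-11 us compute window, which is why NVLink 4 cannot hide the prefetch.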

NVL72 uniquely provides:

  1. High bandwidth: 3× more than NVLink 4
  2. All-pairs at scale: 72 GPUs all at NVLink speed - enables large DWDP groups without cross-node penalty

TRTLLM configuration knobs for TPS/GPU vs. TPS/user

Key levers for balancing hardware efficiency against per-user latency:

| Knob | Description | Favor TPS/GPU | Favor TPS/user |
| --- | --- | --- | --- |
| max_num_tokens | Upper bound on total tokens processed per forward pass across all batched requests. | higher | lower |
| max_batch_size | Maximum number of concurrent requests the engine batches together in one iteration. | higher | lower |
| capacity_scheduler_policy | KV-cache admission policy: MAX_UTILIZATION oversubscribes and may evict requests; GUARANTEED_NO_EVICT reserves full capacity per admitted request. | MAX_UTILIZATION | GUARANTEED_NO_EVICT |
| enable_chunked_prefill | Splits long prefill into chunks so prefill can interleave with decode in the same iteration. | off | on |
| free_gpu_memory_fraction | Fraction of remaining GPU memory allocated to the KV-cache pool after weights. | higher | lower |
| Context GPU count (disagg) | Number of GPUs assigned to the context (prefill) stage in disaggregated prefill/decode serving. | fewer | more |
| speculative_config | Enables speculative decoding (a draft proposes tokens, the target verifies) to cut per-token latency at the cost of extra compute. | off (high load) | on (low load) |

In disaggregated serving, context GPU count is the dominant trade-off lever. DWDP's contribution is to shift the Pareto frontier: sustaining similar TPS/user with fewer context GPUs.
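A throughput-leaning combination of these knobs might look as follows. This is a sketch only: the field names mirror the table above, but the exact option schema varies across TensorRT-LLM versions, so check the version you run against.

```python
# Illustrative TPS/GPU-leaning settings (NOT a verified TRT-LLM schema;
# knob names taken from the table above, nesting is assumed).
throughput_config = {
    "max_num_tokens": 8192,               # larger forward passes
    "max_batch_size": 512,                # more concurrent requests
    "enable_chunked_prefill": False,      # keep large compute windows (DWDP-friendly)
    "scheduler_config": {
        "capacity_scheduler_policy": "MAX_UTILIZATION",
    },
    "kv_cache_config": {
        "free_gpu_memory_fraction": 0.9,  # large KV pool
    },
    "speculative_config": None,           # off under high load
}
```

Flipping each row toward its "Favor TPS/user" column (smaller batches, GUARANTEED_NO_EVICT, chunked prefill on, speculative decoding on) yields the latency-leaning counterpart.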