Based on Manifold-Constrained Hyper-Connections (mHC, arXiv:2512.24880) and the Megatron-LM integration tracked in issue #2919, the PyTorch-path PR #2943, and the cuTile kernel-fusion PR #3828. The DeepSeek-V4 model released today (2026/04/24) also adopts this hyper-connection structure, as expected.
In this post we focus on the computation flow and the dedicated optimizations rather than on the theoretical analysis.
Parameterization and manifold projection
Hyper-Connections (HC) widen the residual from a single stream of width $C$ to $n$ parallel streams with learnable mixing:

$$x_{l+1} = \mathcal{H}^{\mathrm{res}}_l\, x_l + \mathcal{H}^{\mathrm{post}}_l\, \mathcal{F}\!\left(\mathcal{H}^{\mathrm{pre}}_l\, x_l\right)$$

with $\mathcal{H}^{\mathrm{pre}}_l \in \mathbb{R}^{1 \times n}$, $\mathcal{H}^{\mathrm{post}}_l \in \mathbb{R}^{n \times 1}$ and $\mathcal{H}^{\mathrm{res}}_l \in \mathbb{R}^{n \times n}$. mHC parameterizes these three mappings so that each is projected onto a well-behaved manifold, keeping the composite stable across depth. Given the input hidden matrix $x_l \in \mathbb{R}^{n \times C}$ at layer $l$, the computation proceeds in two phases.
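To see why the Birkhoff-polytope constraint on $\mathcal{H}^{\mathrm{res}}$ matters for stability: a doubly-stochastic mix redistributes signal across the $n$ streams without amplifying or shrinking its total. A minimal numpy check with illustrative values (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, C = 4, 8
x = rng.standard_normal((n, C))           # n residual streams of width C

# A doubly-stochastic residual mix: by Birkhoff's theorem, any such matrix
# is a convex combination of permutation matrices.
H_res = 0.6 * np.eye(n) + 0.4 * np.eye(n)[::-1]

mixed = H_res @ x
# Columns of H_res sum to 1, so the total mass across streams is preserved.
```

Iterating such a mix across depth can only re-route signal between streams, never blow it up or cancel it, which is the stability property the manifold constraint buys.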
Phase 1: initial mapping computation
The flattened input $\bar{x}_l = \mathrm{flatten}(x_l) \in \mathbb{R}^{nC}$ preserves full cross-stream context. After an RMSNorm, the dynamic and static mappings are computed as:

$$\tilde{\mathcal{H}}^{\mathrm{pre}}_l = \alpha^{\mathrm{pre}}\,\mathrm{RMSNorm}(\bar{x}_l)\, W^{\mathrm{pre}} + b^{\mathrm{pre}}, \qquad \tilde{\mathcal{H}}^{\mathrm{post}}_l = \alpha^{\mathrm{post}}\,\mathrm{RMSNorm}(\bar{x}_l)\, W^{\mathrm{post}} + b^{\mathrm{post}}, \qquad \tilde{\mathcal{H}}^{\mathrm{res}}_l = \mathrm{reshape}\!\left(\alpha^{\mathrm{res}}\,\mathrm{RMSNorm}(\bar{x}_l)\, W^{\mathrm{res}} + b^{\mathrm{res}}\right)$$

where $W^{\mathrm{pre}}, W^{\mathrm{post}} \in \mathbb{R}^{nC \times n}$ and $W^{\mathrm{res}} \in \mathbb{R}^{nC \times n^2}$ are learnable linear projections, and $\mathrm{reshape}$ maps the output from $\mathbb{R}^{n^2}$ back to $\mathbb{R}^{n \times n}$. In practice the three projections are packed into a single $nC \times n(n+2)$ linear. $W^{\ast}$, $b^{\ast}$ and the scalar gates $\alpha^{\ast}$ are all learnable parameters.
Phase 2: manifold projection
Each raw mapping is then pushed onto its target manifold:

$$\mathcal{H}^{\mathrm{pre}}_l = \sigma\!\left(\tilde{\mathcal{H}}^{\mathrm{pre}}_l\right), \qquad \mathcal{H}^{\mathrm{post}}_l = \sigma\!\left(\tilde{\mathcal{H}}^{\mathrm{post}}_l\right), \qquad \mathcal{H}^{\mathrm{res}}_l = \mathrm{SK}\!\left(\tilde{\mathcal{H}}^{\mathrm{res}}_l\right)$$

where $\sigma$ is the sigmoid and $\mathrm{SK}$ denotes Sinkhorn-Knopp. The non-negativity enforced on $\mathcal{H}^{\mathrm{pre}}$ and $\mathcal{H}^{\mathrm{post}}$ prevents signal cancellation from positive-negative coefficient composition, acting as a supplementary manifold projection alongside the Birkhoff-polytope constraint on $\mathcal{H}^{\mathrm{res}}$.
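The two phases can be sketched in numpy as follows; this is a shape-level illustration that omits the scalar gates $\alpha$ for brevity, and the function and variable names are ours, not the reference implementation's:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rms_norm(v, eps=1e-6):
    return v / np.sqrt(np.mean(v * v) + eps)

def sinkhorn_knopp(logits, n_iter=20):
    # Row-max subtraction before the exponential for numerical stability.
    m = np.exp(logits - logits.max(axis=-1, keepdims=True))
    for _ in range(n_iter):
        m = m / m.sum(axis=1, keepdims=True)  # normalize rows
        m = m / m.sum(axis=0, keepdims=True)  # normalize columns
    return m

def mhc_mappings(x, W, b, n):
    """Phase 1 + 2: compute (H_pre, H_post, H_res) from the (n, C) input.

    W is the packed (nC, n*(n+2)) projection, b the static biases.
    """
    raw = rms_norm(x.reshape(-1)) @ W + b              # phase 1: packed linear
    h_pre = sigmoid(raw[:n])                           # phase 2: non-negative
    h_post = sigmoid(raw[n:2 * n])
    h_res = sinkhorn_knopp(raw[2 * n:].reshape(n, n))  # doubly stochastic
    return h_pre, h_post, h_res

rng = np.random.default_rng(0)
n, C = 4, 16
x = rng.standard_normal((n, C))
W = rng.standard_normal((n * C, n * (n + 2))) * 0.02
b = rng.standard_normal(n * (n + 2)) * 0.02
h_pre, h_post, h_res = mhc_mappings(x, W, b, n)
```

After the projection, `h_res` has rows and columns summing to 1 (up to the 20-iteration tolerance), while `h_pre` and `h_post` are strictly positive.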
Implementation details of Sinkhorn-Knopp
The Sinkhorn-Knopp operator itself enforces double stochasticity through iterative normalization. Starting from the positive matrix $M^{(0)} = \exp\!\left(\tilde{\mathcal{H}}^{\mathrm{res}}_l\right)$, it alternates row and column normalization:

$$M^{(t+1)} = \mathcal{N}_c\!\left(\mathcal{N}_r\!\left(M^{(t)}\right)\right)$$

where $\mathcal{N}_r$ and $\mathcal{N}_c$ divide each row or column by its sum. The sequence converges to a doubly-stochastic $\mathcal{H}^{\mathrm{res}}_l$ as $t \to \infty$; the implementation uses a fixed $t_{\max} = 20$ iterations, with row-max subtraction before the exponential for numerical stability.
Notably, there is an implementation trick: activation checkpointing/recomputation. Only the input of the Sinkhorn-Knopp operator is recorded in forward, and the iteration is replayed during backward; otherwise autograd would keep the intermediate activations of all 20 normalization steps alive between forward and backward. In backward, the replay manually enables autograd recording and backpropagates through the freshly rebuilt iteration graph to produce the input gradient.
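The trick can be sketched as a custom autograd `Function` in plain PyTorch; this is a simplified stand-in for the production kernels, assuming the 20-iteration operator above:

```python
import torch

class SinkhornRecompute(torch.autograd.Function):
    """Sinkhorn-Knopp with recomputation: save only the input logits and
    replay the iteration under autograd in backward. A sketch of the trick
    described above, not the Megatron/TileLang implementation."""

    @staticmethod
    def _iterate(logits, n_iter=20):
        # Row-max subtraction before exp for numerical stability.
        m = torch.exp(logits - logits.amax(dim=-1, keepdim=True))
        for _ in range(n_iter):
            m = m / m.sum(dim=-1, keepdim=True)  # normalize rows
            m = m / m.sum(dim=-2, keepdim=True)  # normalize columns
        return m

    @staticmethod
    def forward(ctx, logits):
        ctx.save_for_backward(logits)  # the only tensor kept alive
        # forward runs untaped, so the 20 intermediate matrices are freed
        return SinkhornRecompute._iterate(logits)

    @staticmethod
    def backward(ctx, grad_out):
        (logits,) = ctx.saved_tensors
        with torch.enable_grad():  # manually re-enable autograd recording
            leaf = logits.detach().requires_grad_(True)
            out = SinkhornRecompute._iterate(leaf)
        return torch.autograd.grad(out, leaf, grad_out)[0]

torch.manual_seed(0)
logits = torch.randn(4, 4, dtype=torch.float64, requires_grad=True)
out = SinkhornRecompute.apply(logits)
w = torch.arange(16, dtype=torch.float64).reshape(4, 4)
(out * w).sum().backward()  # gradient flows through the replayed iteration
```

The gradient matches what autograd would produce by taping the whole iteration, since the replay executes exactly the same ops.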
In the DeepSeek-V4 release, they use TileLang to implement the Sinkhorn-Knopp iteration fused with the pre/post/res weight split.
Forward compute workflow
Each transformer layer applies the mHC pattern twice: once around self-attention, once around the MLP. The per-sublayer flow, with $T$ tokens, batch $B$, expansion $n$, and hidden size $C$, is:

$$h = \mathcal{H}^{\mathrm{pre}} x \in \mathbb{R}^{B \times T \times C}, \qquad y = \mathcal{F}(h), \qquad x' = \mathcal{H}^{\mathrm{res}} x + \mathcal{H}^{\mathrm{post}} y \in \mathbb{R}^{B \times T \times n \times C}$$

The block entry replicates the single input stream $n$ times, and the block exit averages the $n$ streams back.
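In code, the per-token flow looks like the following numpy shape sketch; the mappings are hypothetical hand-picked values, and `f` stands in for the wrapped attention or MLP sublayer:

```python
import numpy as np

def mhc_sublayer(x, h_pre, h_post, h_res, f):
    """One mHC sublayer on a single token's wide residual x of shape (n, C)."""
    h = h_pre @ x                           # aggregate: n streams -> 1, shape (C,)
    y = f(h)                                # run the sublayer on the single stream
    return h_res @ x + np.outer(h_post, y)  # residual mix + 1-to-n expansion

n, C = 4, 8
x = np.repeat(np.ones((1, C)), n, axis=0)   # block entry: replicate n times

# Illustrative mappings: uniform aggregate, identity residual mix.
h_pre = np.full(n, 1.0 / n)
h_post = np.full(n, 0.5)
h_res = np.eye(n)

x = mhc_sublayer(x, h_pre, h_post, h_res, f=lambda h: 2.0 * h)
out = x.mean(axis=0)                        # block exit: average the n streams
```

With these values each stream receives the identity-mixed input plus half of the doubled sublayer output, so every entry of `out` is $1 + 0.5 \cdot 2 = 2$.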
Compute and memory overhead
The extra work that mHC adds per sublayer falls into three buckets; everything else (sigmoids, the 20 Sinkhorn-Knopp iterations) is $O(n^2)$ per token and negligible at scale.
| op | FLOPs per token |
|---|---|
| compute mappings (linear $nC \to n(n+2)$) | $2n^2(n+2)C$ |
| aggregate ($n$-to-1) + expand (1-to-$n$) | $4nC$ |
| mix (batched $n \times n$ by $n \times C$) | $2n^2C$ |
For the typical $n = 4$, $C = 7168$ (the configuration consistent with the parameter counts in the DS-V4 tech report), per sublayer the total is $\approx 240C \approx 1.7$M FLOPs per token. Against $8C^2$ for attention QKVO and $\approx 16C^2$ for a 4x-expansion MLP, the ratio is roughly 0.4% and 0.2% respectively.
Yet the paper reports 6.7% end-to-end training overhead, well above what the FLOP count predicts. The gap is dominated by activation memory: the residual stream widens from $C$ to $nC$, so every op along it (mix, bias-dropout-add, layernorm input, the residual add itself) runs on an $n\times$ larger tensor.
Parameter overhead is small: each HC module adds $n^2(n+2)C$ weights from the packed projection plus $n(n+2)$ biases and three scalar gates. For $n = 4$, $C = 7168$ that is 688k parameters per sublayer, on the order of 0.1% of the sublayer's own GEMM weights, which is negligible.
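The arithmetic above is easy to sanity-check; the script assumes $n = 4$ and $C = 7168$, the pair consistent with the 688k figure:

```python
# Overhead arithmetic for n = 4, C = 7168 (values inferred from the
# 688k-parameter figure above; treat them as an assumption).
n, C = 4, 7168

params = n * C * n * (n + 2)          # packed projection: (nC) x n(n+2) weights
flops_mappings = 2 * params           # per token: one pass through the packed linear
flops_agg_expand = 4 * n * C          # n-to-1 aggregate + 1-to-n expand
flops_mix = 2 * n * n * C             # H_res (n x n) applied to x (n x C)
flops_total = flops_mappings + flops_agg_expand + flops_mix

flops_qkvo = 8 * C * C                # attention QKVO GEMMs per token
ratio = flops_total / flops_qkvo
print(params, flops_total, round(100 * ratio, 2))  # 688128 1720320 0.42
```

The packed-projection term dominates the FLOP total, yet the whole thing stays below half a percent of the attention GEMMs.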
Megatron integration
PRs #2943 and #3828 target the dev branch. A separate mHC PR #3430 is being merged into main.
mHC integration
PR #2943 is the PyTorch-path landing, correctness-first and without custom kernels.
It adds a `HyperConnectionModule` (one per sublayer, i.e. two per transformer layer) and a `HyperConnectionTransformerLayer` subclass that wraps the attention and MLP sublayers, with the $\mathcal{H}^{\mathrm{pre}}$ aggregate on the input side and the $\mathcal{H}^{\mathrm{res}}$/$\mathcal{H}^{\mathrm{post}}$ residual mix on the output side.
Parallelism
- TP: supported. The projection is small and stays replicated; only the sublayer weights are sharded as before.
- SP: HC parameters are marked `sequence_parallel=True` so their gradients participate in the standard SP allreduce.
- PP: validated at PP=4 on Qwen3-30B-A3B. DualPipe is extended so the first stream copy is cached locally per stage and recompute windows do not cross stage boundaries.
- CP: unaffected, since mHC is per-token.
Block-level recomputation
The wide residual costs $n$ times the activation memory, so the non-fused path checkpoints the $\mathcal{H}^{\mathrm{pre}}$ aggregate and groups layers into recompute blocks.
Given total layers $L$ and block size $B$, the peak-activation memory scales as $L/B + B$ layer-activations (one persisted input per block plus the live recompute window), minimized at $B = \sqrt{L}$.
Only the first-layer input of each block is persisted; the rest is transient inside the recompute bubble.
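The block-size trade-off can be checked numerically; the layer count here is illustrative, not taken from the PR:

```python
# Peak activation memory in units of one layer's wide-residual activations:
# L/B persisted block inputs plus the B layers alive inside the current
# recompute bubble (the simplified cost model described above).
def peak_activations(L, B):
    return L / B + B

L = 48  # illustrative total layer count
best_B = min(range(1, L + 1), key=lambda B: peak_activations(L, B))
print(best_B)  # 7, close to sqrt(48) ~ 6.93
```

Because the curve is flat near the minimum, rounding $\sqrt{L}$ to a divisor-friendly block size costs almost nothing in practice.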
Kernel fusion
PR #3828 is the GB200-targeted follow-up: four cuTile kernels that collapse the per-layer mHC forward and backward into four launches, enabled with --use-fused-mhc.
| kernel | fuses | why it matters | speedup |
|---|---|---|---|
| `fused_proj_rms` | RMSNorm and the packed projection | every layer computes its mappings here; the row-wise sum-of-squares is reused from the matmul accumulator | 1.40x |
| `fused_sinkhorn` | exponential and the 20 alternating row/column normalizations | hides the iteration count behind a single launch; fp32 math with row-max subtraction and a tighter stopping tolerance | 6.89x |
| `fused_h_aggregate` | weight broadcast, multiply, and $n$-stream reduction | memory-bound, but avoids materializing the broadcast | 1.13x |
| `fused_h_post_bda` | residual mixing $\mathcal{H}^{\mathrm{res}} x$, post-expansion $\mathcal{H}^{\mathrm{post}} y$, and bias-dropout-add | hottest op on the wide residual; keeps the mix in registers | 3.24x |
These are rough speedups measured in the PR. End-to-end, the paper reports only 6.7% training-time overhead versus baseline dense residuals at 27B with $n = 4$, which is what makes the manifold constraint affordable in production.
Results
From the paper (dense 27B, matched tokens):
| benchmark | baseline | HC | mHC |
|---|---|---|---|
| MMLU (acc.) | 59.0 | 63.0 | 63.4 |
| BBH (EM) | 43.8 | 48.9 | 51.0 |
| DROP (F1) | 47.0 | 51.6 | 53.9 |
| GSM8K (EM) | 46.7 | 53.2 | 53.8 |
Summary: mHC equals standard hyper-connections plus a Birkhoff-polytope projection on $\mathcal{H}^{\mathrm{res}}$ via Sinkhorn-Knopp. The Megatron integration consists of two PRs: a correctness-first PyTorch path with block-level recompute (#2943), followed by four cuTile-fused kernels (#3828) that amortize the $n\times$ residual width at roughly 6% to 7% training-time cost.