[paper] mHC: Manifold-Constrained Hyper-Connections

April 20, 2026, last update: April 24, 2026

Based on Manifold-Constrained Hyper-Connections (arXiv:2512.24880) and the Megatron-LM integration tracked in issue #2919, PyTorch-path PR #2943, and cuTile kernel-fusion PR #3828. The DeepSeek-V4 model released today (2026/04/24) also adopts this hyper-connection structure, as expected.

In this post we focus on the computation flow and the dedicated optimizations rather than the theoretical analysis.

Parameterization and manifold projection

Hyper-Connections (HC) widen the residual from a single stream of width C to n parallel streams with learnable mixing:

\mathbf{x}_{l+1} \;=\; \mathbf{H}_l^{\mathrm{res}}\, \mathbf{x}_l \;+\; (\mathbf{H}_l^{\mathrm{post}})^\top\, F(\mathbf{H}_l^{\mathrm{pre}}\, \mathbf{x}_l,\; \mathbf{W}_l)

with \mathbf{H}_l^{\mathrm{res}} \in \mathbb{R}^{n \times n} and \mathbf{H}_l^{\mathrm{pre}}, \mathbf{H}_l^{\mathrm{post}} \in \mathbb{R}^{1 \times n}. mHC parameterizes these three mappings so that each is projected onto a well-behaved manifold, keeping the composite stable across depth. Given the input hidden matrix \mathbf{x}_l \in \mathbb{R}^{n \times C} at layer l, the computation proceeds in two phases.

Phase 1: initial mapping computation

The flattened input \vec{\mathbf{x}}_l = \mathrm{vec}(\mathbf{x}_l) \in \mathbb{R}^{1 \times nC} preserves full cross-stream context. After an RMSNorm, the dynamic and static mappings are computed as:

\begin{aligned} \vec{\mathbf{x}}_l' &= \mathrm{RMSNorm}(\vec{\mathbf{x}}_l) \\ \tilde{\mathbf{H}}_l^{\mathrm{pre}} &= \alpha_l^{\mathrm{pre}} \cdot \big(\vec{\mathbf{x}}_l'\, \boldsymbol{\phi}_l^{\mathrm{pre}}\big) + \mathbf{b}_l^{\mathrm{pre}} \\ \tilde{\mathbf{H}}_l^{\mathrm{post}} &= \alpha_l^{\mathrm{post}} \cdot \big(\vec{\mathbf{x}}_l'\, \boldsymbol{\phi}_l^{\mathrm{post}}\big) + \mathbf{b}_l^{\mathrm{post}} \\ \tilde{\mathbf{H}}_l^{\mathrm{res}} &= \alpha_l^{\mathrm{res}} \cdot \mathrm{mat}\big(\vec{\mathbf{x}}_l'\, \boldsymbol{\phi}_l^{\mathrm{res}}\big) + \mathbf{b}_l^{\mathrm{res}} \end{aligned}

where \boldsymbol{\phi}_l^{\mathrm{pre}}, \boldsymbol{\phi}_l^{\mathrm{post}} \in \mathbb{R}^{nC \times n} and \boldsymbol{\phi}_l^{\mathrm{res}} \in \mathbb{R}^{nC \times n^2} are learnable linear projections, and \mathrm{mat}(\cdot) reshapes the output from \mathbb{R}^{1 \times n^2} back to \mathbb{R}^{n \times n}. In practice the three projections are packed into a single nC \to n^2 + 2n linear. \boldsymbol{\phi}, \mathbf{b}, and \alpha are all learnable parameters.
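A minimal NumPy sketch of phase 1 for a single token, with illustrative shapes. The packed slice order (res | pre | post), the helper names, and the weightless RMSNorm are assumptions of this sketch, not the reference implementation:

```python
import numpy as np

def rms_norm(v, eps=1e-6):
    # RMSNorm without a learnable weight, kept minimal for illustration.
    return v / np.sqrt(np.mean(v * v, axis=-1, keepdims=True) + eps)

def compute_mappings(x, phi, b, alpha, n):
    """Phase 1 for one token.

    x:     [n, C] input hidden matrix
    phi:   [n*C, n*n + 2*n] packed projection (assumed layout: res | pre | post)
    b:     [n*n + 2*n] packed bias, sliced the same way
    alpha: (a_res, a_pre, a_post) scalar gates
    """
    v = rms_norm(x.reshape(1, -1))      # vec(x), then RMSNorm
    raw = (v @ phi)[0]                  # [n^2 + 2n] raw projection output
    h_res = alpha[0] * raw[:n * n].reshape(n, n) + b[:n * n].reshape(n, n)
    h_pre = alpha[1] * raw[n * n:n * n + n] + b[n * n:n * n + n]
    h_post = alpha[2] * raw[n * n + n:] + b[n * n + n:]
    return h_res, h_pre, h_post

n, C = 4, 16
rng = np.random.default_rng(0)
h_res, h_pre, h_post = compute_mappings(
    rng.standard_normal((n, C)),
    rng.standard_normal((n * C, n * n + 2 * n)),
    rng.standard_normal(n * n + 2 * n),
    (0.1, 0.1, 0.1), n)
```

Note that the single packed GEMM is what makes the later kernel fusion with RMSNorm natural: one matmul produces all three raw mappings.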

Phase 2: manifold projection

Each raw mapping is then pushed onto its target manifold:

\mathbf{H}_l^{\mathrm{pre}} = \sigma(\tilde{\mathbf{H}}_l^{\mathrm{pre}}), \quad \mathbf{H}_l^{\mathrm{post}} = 2\sigma(\tilde{\mathbf{H}}_l^{\mathrm{post}}), \quad \mathbf{H}_l^{\mathrm{res}} = \mathrm{SK}(\tilde{\mathbf{H}}_l^{\mathrm{res}})

where \sigma(\cdot) is the sigmoid and \mathrm{SK}(\cdot) denotes Sinkhorn-Knopp. The non-negativity enforced on \mathbf{H}_l^{\mathrm{pre}} and \mathbf{H}_l^{\mathrm{post}} prevents signal cancellation from positive-negative coefficient composition, acting as a supplementary manifold projection alongside the Birkhoff-polytope constraint on \mathbf{H}_l^{\mathrm{res}}.

Implementation details of Sinkhorn-Knopp

The Sinkhorn-Knopp operator itself enforces double stochasticity through iterative normalization. Starting from the positive matrix \mathbf{M}^{(0)} = \exp(\tilde{\mathbf{H}}_l^{\mathrm{res}}), it alternates row and column normalization:

\mathbf{M}^{(t)} = \mathcal{T}_r\big(\mathcal{T}_c(\mathbf{M}^{(t-1)})\big)

where \mathcal{T}_r and \mathcal{T}_c divide each row or column by its sum. As t \to \infty the sequence converges to a doubly-stochastic matrix; the implementation truncates at t_{\max} = 20 and takes \mathbf{H}_l^{\mathrm{res}} = \mathbf{M}^{(t_{\max})}, with row-max subtraction before the exponential for numerical stability.
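The iteration fits in a few lines of NumPy. The function name `sinkhorn_knopp` and the ε guard on the denominators are choices of this sketch; after the final row normalization the rows sum to one exactly, while the columns are only approximately normalized:

```python
import numpy as np

def sinkhorn_knopp(h_res_raw, t_max=20, eps=1e-8):
    """Project a raw n x n matrix toward the Birkhoff polytope."""
    # Row-max subtraction before exp, for numerical stability.
    m = np.exp(h_res_raw - h_res_raw.max(axis=1, keepdims=True))
    for _ in range(t_max):
        m = m / (m.sum(axis=0, keepdims=True) + eps)  # T_c: column normalize
        m = m / (m.sum(axis=1, keepdims=True) + eps)  # T_r: row normalize
    return m

h = sinkhorn_knopp(np.random.default_rng(0).standard_normal((4, 4)))
```

Because the loop is just 40 tiny reductions on an n x n matrix, it is launch-bound on GPU, which is exactly what the fused kernel in PR #3828 targets.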

Notably, there is an implementation trick: activation checkpointing/recomputation. Only the input of the Sinkhorn-Knopp operator is saved during forward, and the iteration is replayed during backward; otherwise autograd would store the intermediate activations of all 20 normalization steps between forward and backward for no good reason.

In backward, the saved input is re-run with autograd recording enabled, and gradients propagate through the replayed iteration, as in standard activation checkpointing.

In the DeepSeek-V4 release, TileLang is used to implement the Sinkhorn-Knopp iteration fused with the pre/post/res weight split.

Forward compute workflow

Each transformer layer applies the mHC pattern twice: once around self-attention, once around the MLP. The per-sublayer flow, with s tokens, batch b, expansion n, and hidden size C, is:

The block entry replicates the single input stream n times, and the block exit averages the n streams back.
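The per-sublayer flow can be sketched end-to-end in NumPy, per token for clarity. Here `sublayer_f` stands in for attention or MLP, and the helper name is an assumption of this sketch:

```python
import numpy as np

def mhc_sublayer(x, h_res, h_pre, h_post, sublayer_f):
    """One mHC-wrapped sublayer.

    x: [n, C] wide residual; h_res: [n, n]; h_pre, h_post: [n]
    (the paper's 1 x n row vectors).
    """
    u = h_pre @ x                      # aggregate: n streams -> one [C] input
    y = sublayer_f(u)                  # attention or MLP on the single stream
    # residual mix + post-expansion: H_res x + (H_post)^T F(...)
    return h_res @ x + np.outer(h_post, y)

n, C = 4, 8
x = np.random.default_rng(1).standard_normal((n, C))
# Identity mix, uniform aggregate, and a toy F that doubles its input.
out = mhc_sublayer(x, np.eye(n), np.full(n, 0.25), np.ones(n), lambda u: 2.0 * u)
```

With identity \mathbf{H}^{\mathrm{res}} and uniform weights this degenerates to an ordinary residual connection broadcast across streams, which is also how the HC literature initializes toward the pre-norm baseline.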

Compute and memory overhead

The extra work that mHC adds per sublayer falls into three buckets; everything else (sigmoids, the 20 Sinkhorn-Knopp iterations) is O(n^2) and negligible at scale.

| op | FLOPs |
| --- | --- |
| compute mappings (linear nC \to n^2 + 2n) | 2\, sb\, nC\, (n^2 + 2n) |
| aggregate (n-to-1) + expand (1-to-n) | \approx 3\, sb\, nC |
| mix \mathbf{H}^{\mathrm{res}}\, \mathbf{x}_l (batched n{\times}n \cdot n{\times}C) | 2\, sb\, n^2 C |

For typical n = 4 (in the DS-V4 tech report), per sublayer the total is \approx 236\, sbC FLOPs. Against 8\, sbC^2 for attention QKVO and 16\, sbC^2 for a 4\times MLP at C = 7168, the ratio is roughly 0.4\% and 0.2\% respectively.
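These totals are easy to sanity-check mechanically. Since s and b factor out of every bucket, everything below is expressed as a coefficient of sbC (the variable names are mine):

```python
n, C = 4, 7168

# Per-sublayer FLOPs, divided through by s*b*C.
mappings = 2 * n * (n**2 + 2 * n)  # linear nC -> n^2 + 2n: 2 * nC * (n^2+2n) / C
agg_exp = 3 * n                    # aggregate + expand, approximate
mix = 2 * n**2                     # H_res @ x: batched [n,n] @ [n,C]
total = mappings + agg_exp + mix   # coefficient of sbC; 192 + 12 + 32

# Baselines for comparison, also divided by s*b*C.
attention_qkvo = 8 * C             # 8 sbC^2
mlp_4x = 16 * C                    # 16 sbC^2
```

So mHC compute is sub-percent of either GEMM-heavy neighbor; the overhead story is about memory, not FLOPs.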

Yet the paper reports 6.7% end-to-end training overhead — well above what the FLOP count predicts. The gap is dominated by activation memory: the residual stream widens from sbC to sb\,nC, so every op along it — mix, bias-dropout-add, layernorm input, the residual add itself — runs on an n\times larger tensor.

Parameter overhead is small: each HC module adds nC(n^2 + 2n) weights from the packed \boldsymbol{\phi} projection plus biases and three scalar gates. For n = 4, C = 7168 that is \sim688k parameters per sublayer, on the order of 0.1% of the sublayer's own GEMM weights, which is negligible.
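The \sim688k figure reproduces directly from the packed layout (the additive bias and gate terms are tiny):

```python
n, C = 4, 7168

phi_params = n * C * (n**2 + 2 * n)  # packed projection: nC -> n^2 + 2n
bias_params = n**2 + 2 * n           # packed bias
gate_params = 3                      # alpha_res, alpha_pre, alpha_post
total_params = phi_params + bias_params + gate_params
```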

Megatron integration

PRs #2943 and #3828 target the dev branch; a separate mHC PR #3430 is being merged into main.

mHC integration

PR #2943 is the PyTorch-path landing: correctness-first, with no custom kernels. It adds a HyperConnectionModule (one per sublayer, i.e. two per transformer layer) and a HyperConnectionTransformerLayer subclass that wraps the attention and MLP sublayers, with aggregate on the input side and the residual mix on the output side.

Parallelism

  • TP: supported. The nC \to n^2 + 2n projection is small and stays replicated; only the sublayer weights are sharded as before.
  • SP: HC parameters are marked sequence_parallel=True so their gradients participate in the standard SP allreduce.
  • PP: validated at PP=4 on Qwen3-30B-A3B. DualPipe is extended so the first stream copy \mathbf{x}_{l_0} is cached locally per stage and recompute windows do not cross stage boundaries.
  • CP: unaffected, since mHC is per-token.

Block-level recomputation

The wide residual costs n times the activation memory, so the non-fused path checkpoints aggregate and groups layers into recompute blocks. Given L total layers and block size L_r, the peak-activation memory is nC \cdot \lceil L/L_r \rceil + (n+2)C \cdot L_r, minimized at

L_r^{*} \approx \sqrt{\frac{nL}{n+2}}.

Only the first-layer input of each block is persisted; the rest is transient inside the recompute bubble.
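The closed form follows from relaxing the ceiling to L/L_r and setting the derivative of nCL/L_r + (n+2)C L_r to zero. A brute-force check over integer block sizes confirms the formula lands near the true minimizer (L = 61 is an arbitrary example here; C factors out):

```python
import math

def peak_activation(L, L_r, n, C=1.0):
    # nC per persisted block input, times the number of blocks,
    # plus (n+2)C per layer transient inside one recompute block.
    return n * C * math.ceil(L / L_r) + (n + 2) * C * L_r

L, n = 61, 4
best = min(range(1, L + 1), key=lambda r: peak_activation(L, r, n))
closed_form = math.sqrt(n * L / (n + 2))  # ~6.4 for these values
```

The ceiling makes the integer optimum land within a step or two of the continuous one, so in practice rounding L_r^{*} is good enough.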

Kernel fusion

PR #3828 is the GB200-targeted follow-up: four cuTile kernels that collapse the per-layer mHC forward and backward into four launches, enabled with --use-fused-mhc.

| kernel | fuses | why it matters | speedup |
| --- | --- | --- | --- |
| fused_proj_rms | RMSNorm and the nC \to n^2 + 2n projection | every layer computes its mappings here; the row-wise sum-of-squares is reused from the matmul accumulator | 1.40x |
| fused_sinkhorn | exponential and the 20 alternating row/column normalizations | hides the iteration count behind a single launch; fp32 math with row-max subtraction and tighter \epsilon{=}10^{-8} | 6.89x |
| fused_h_aggregate | weight broadcast, multiply, and n-stream reduction | memory-bound, but avoids materializing the [s,b,n,C] broadcast | 1.13x |
| fused_h_post_bda | residual mixing \mathbf{H}^{\mathrm{res}}\,\mathbf{r}, post-expansion \mathbf{h}_{\mathrm{post}} \odot (\mathbf{x} + \mathbf{b}), and bias | hottest op on the wide residual; keeps the [n,n]@[n,C] mix in registers | 3.24x |

Rough \sim 35\% speedups measured in the PR. End-to-end, the paper reports only 6.7% training-time overhead versus baseline dense residuals at 27B with n = 4, which is what makes the manifold constraint affordable in production.

Results

From the paper (dense 27B, matched tokens):

| benchmark | baseline | HC | mHC |
| --- | --- | --- | --- |
| MMLU (acc.) | 59.0 | 63.0 | 63.4 |
| BBH (EM) | 43.8 | 48.9 | 51.0 |
| DROP (F1) | 47.0 | 51.6 | 53.9 |
| GSM8K (EM) | 46.7 | 53.2 | 53.8 |

Summary: mHC equals standard hyper-connections plus a Birkhoff-polytope projection on \mathbf{H}^{\mathrm{res}} via Sinkhorn-Knopp. The Megatron integration consists of two PRs: a correctness-first PyTorch path with block-level recompute (#2943), followed by four cuTile-fused kernels (#3828) that amortize the n-times residual width at roughly 6% to 7% training-time cost.