Gated Delta Network (GDN) 1 is a linear attention (LA) variant that harnesses both gating for memory control and the delta update rule for precise memory modifications.
It is used in the incredible Qwen3-Next: for every 3 GDN layers there is 1 full attention layer (a 3:1 linear-to-full ratio).
GDN workflow
In this post, I'm going to walk through the algorithm design and the corresponding hardware parallel optimizations in Megatron.
Variants of linear attention
Linear attention maintains a matrix-valued state St∈Rdv×dk that acts as a key-value associative memory.
Each step absorbs the current key and value, then emits an output via ot=Stqt.
Let's first look at how the variants differ in how St is updated.
Mamba2 2 introduces a data-dependent scalar gate αt∈(0,1) that decays the entire state before each write:
St=αtSt−1+vtkt⊤
This forgets bulk context cheaply, but it cannot remove a single key-value pair without also decaying every other association at the same rate.
Unrolling the recurrence shows that St is a weighted sum of past key-value outer products, with each historical contribution multiplied by the product of all gates emitted since:
St=∑i=1t(∏j=i+1tαj)viki⊤
Define the cumulative decay γt=∏i=1tαi.
Then ∏j=i+1tαj=γt/γi, so the state and output collapse to
St=∑i=1t(γt/γi)viki⊤, ot=Stqt=∑i=1t(γt/γi)(ki⊤qt)vi
Stacking the per-token vectors into row matrices Q,K,V∈RL×⋅ gives the parallel matrix form
O=((QK⊤)⊙Γ)V
where Γ∈RL×L is the decay-aware causal mask with Γij=γi/γj for i≥j and 0 otherwise.
Compared with the standard causal mask M of vanilla linear attention, Γ just replaces each 1 entry with the appropriate ratio of cumulative gates, so token i sees token j≤i scaled by γi/γj — exactly the surviving fraction of α products between them.
This is the form Mamba2 trains in: a single fused matmul per layer, no token-by-token iteration, and the same Γ structure is what GDN later reuses inside each chunk.
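To make the equivalence concrete, here is a small torch sketch (illustrative shapes, not a fused kernel) that builds Γ from the cumulative gates and checks the parallel form against the token-by-token recurrence:

```python
import torch

torch.manual_seed(0)
L, dk, dv = 128, 32, 16
q, k, v = torch.randn(L, dk), torch.randn(L, dk), torch.randn(L, dv)
alpha = 0.9 + 0.1 * torch.rand(L)            # data-dependent gates in (0, 1)

# recurrent form: S_t = alpha_t S_{t-1} + v_t k_t^T, o_t = S_t q_t
S = torch.zeros(dv, dk)
o_rec = []
for t in range(L):
    S = alpha[t] * S + torch.outer(v[t], k[t])
    o_rec.append(S @ q[t])
o_rec = torch.stack(o_rec)

# parallel form: O = ((Q K^T) ⊙ Γ) V with Γ_ij = γ_i / γ_j for i >= j, else 0
log_gamma = torch.log(alpha).cumsum(0)                        # log γ_t
Gamma = torch.tril(torch.exp(log_gamma[:, None] - log_gamma[None, :]))
o_par = ((q @ k.T) * Gamma) @ v

print(torch.allclose(o_rec, o_par, atol=1e-3))                # True
```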
DeltaNet 3 instead applies a generalized Householder update to overwrite one slot at a time, with writing strength βt∈(0,1):
St=St−1(I−βtktkt⊤)+βtvtkt⊤
The (I−βtktkt⊤) factor subtracts the value currently associated with kt before the new vt is written, giving precise edits but no bulk-clear mechanism.
The gated delta rule combines both into a single transition:
St=αtSt−1(I−βtktkt⊤)+βtvtkt⊤
αt→0 wipes the state regardless of βt (Mamba2-style hard reset).
αt→1 with βt→1 falls back to the pure delta rule.
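As a reference point for the rest of the post, here is a minimal token-by-token torch sketch of the gated delta rule; the zero initial state and shapes are illustrative, and real implementations use the chunkwise kernel described next:

```python
import torch

def gated_delta_rule_reference(q, k, v, alpha, beta):
    """Token-by-token reference of the gated delta rule (illustrative, not a fused kernel).
    q, k: (L, dk); v: (L, dv); alpha, beta: (L,) with entries in (0, 1).
    Rows of k are assumed L2-normalized, which keeps the Householder factor well-conditioned."""
    L, dk = k.shape
    dv = v.shape[-1]
    S = torch.zeros(dv, dk, dtype=v.dtype)           # matrix-valued state S_t
    outputs = []
    for t in range(L):
        # decay the whole state, erase what is stored under k_t, then write beta_t v_t k_t^T
        S = alpha[t] * S @ (torch.eye(dk) - beta[t] * torch.outer(k[t], k[t])) \
            + beta[t] * torch.outer(v[t], k[t])
        outputs.append(S @ q[t])                      # o_t = S_t q_t
    return torch.stack(outputs)
```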
Chunkwise parallel training
Running the recurrence token-by-token is memory-bound and leaves tensor cores idle.
GDN trains chunkwise: split the sequence into chunks of size C, propagate S[t] between chunks, and express the work inside a chunk as dense matmuls.
Mathematical details: how to convert GDN into a parallel form
Partially expanding the GDN recurrence over r steps inside chunk [t] splits the running state into a gated transition product F[t]r and a gated accumulated-write sum G[t]r:
S[t]r=S[t]0F[t]r+G[t]r, F[t]r=∏i=1rα[t]i(I−β[t]ik[t]ik[t]i⊤), G[t]r=∑i=1rβ[t]iv[t]ik[t]i⊤∏j=i+1rα[t]j(I−β[t]jk[t]jk[t]j⊤) (1)
Here S[t]0 is the state entering chunk [t] and S[t]r the state after its first r tokens.
Each α[t]i is a scalar, so it factors out of the matrix products.
Pulling the cumulative gate γ[t]r=∏i=1rα[t]i out of the first term gives F[t]r=γ[t]rP[t]r, where P[t]r is the β-only Householder product:
P[t]r=∏i=1r(I−β[t]ik[t]ik[t]i⊤)
For G[t]r the inner α[t]j's collapse to ratios γ[t]r/γ[t]i that scale the i-th historical write — exactly the entries of the decay-aware mask Γ[t] that re-enter at the matrix level below.
We first derive the WY representation of the β-only building blocks P[t]r and H[t]r (these are DeltaNet's results), then re-insert the γ factors to recover F[t]r and G[t]r. Here H[t]r is G[t]r with the α's stripped:
H[t]r=∑i=1rβ[t]iv[t]ik[t]i⊤∏j=i+1r(I−β[t]jk[t]jk[t]j⊤)
Let's say we want to prove that:
P[t]r=I−∑i=1rw[t]ik[t]i⊤, with w[t]r=β[t]r(k[t]r−∑i=1r−1w[t]i(k[t]i⊤k[t]r)) (2)
H[t]r=∑i=1ru[t]ik[t]i⊤, with u[t]r=β[t]r(v[t]r−∑i=1r−1u[t]i(k[t]i⊤k[t]r)) (3)
The rest of this section derives (2) and (3) from (1), then turns the per-step recursion into a single C×C triangular solve.
Drop the [t] subscript for readability.
Folding Pr into the W form
Base case (r=1). P1=I−β1k1k1⊤ already matches the WY form with w1=β1k1, consistent with the r=1 instance of (2) (the inner sum is empty).
Inductive step. Assume Pr−1=I−∑i=1r−1wiki⊤.
Following the left-to-right product convention of (1):
Pr=Pr−1(I−βrkrkr⊤)=Pr−1−βr(Pr−1kr)kr⊤
The only nontrivial piece is the vector Pr−1kr∈Rdk:
Pr−1kr=kr−∑i=1r−1wi(ki⊤kr)
which is exactly the bracketed expression inside the definition of wr in (2).
Defining
wr:=βr(kr−∑i=1r−1wi(ki⊤kr))
reduces the previous line to Pr=Pr−1−wrkr⊤, and substituting the inductive form for Pr−1 yields
Pr=I−∑i=1rwiki⊤
closing the induction.
Folding Hr into the U form
The accumulated-write term Hr admits its own one-step recurrence.
Peel off the i=r summand and factor the remaining product:
Hr=βrvrkr⊤+Hr−1(I−βrkrkr⊤)=Hr−1+βr(vr−Hr−1kr)kr⊤
Inductive claim. Hr=∑i=1ruiki⊤ with the recursion from (3).
For r=1, u1=β1v1 (empty inner sum), so u1k1⊤=β1v1k1⊤=H1.
Assuming the claim at r−1, we have:
Hr−1kr=∑i=1r−1ui(ki⊤kr)
so the bracketed term in the one-step recurrence becomes
βr(vr−Hr−1kr)=βr(vr−∑i=1r−1ui(ki⊤kr))=:ur
which matches the definition of ur in (3).
Substituting back gives Hr=Hr−1+urkr⊤=∑i=1ruiki⊤, closing the induction.
Intuition. ur is the r-th write after the previously stored writes are projected away along the new key direction.
Compared with the W recursion, the only change is the source vector: kr for wr (because we're tracking the transition matrix), vr for ur (because we're tracking the accumulated content).
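A quick numerical check of the two inductions, as a small torch sketch with random unit keys (sizes are arbitrary): it builds Pr and Hr directly from the Householder products, rebuilds them from the wr and ur recursions, and compares.

```python
import torch

torch.manual_seed(0)
r, dk, dv = 8, 16, 8                           # illustrative sizes
k = torch.nn.functional.normalize(torch.randn(r, dk), dim=-1)   # unit keys
v = torch.randn(r, dv)
beta = torch.rand(r)

# direct evaluation of the beta-only building blocks
P = torch.eye(dk)
H = torch.zeros(dv, dk)
for i in range(r):
    house = torch.eye(dk) - beta[i] * torch.outer(k[i], k[i])
    P = P @ house                              # P_i = P_{i-1} (I - beta_i k_i k_i^T)
    H = H @ house + beta[i] * torch.outer(v[i], k[i])

# WY form rebuilt from the w / u recursions in (2) and (3)
w, u = torch.zeros(r, dk), torch.zeros(r, dv)
for i in range(r):
    proj = k[:i] @ k[i]                        # k_j^T k_i for j < i
    w[i] = beta[i] * (k[i] - (proj[:, None] * w[:i]).sum(0))
    u[i] = beta[i] * (v[i] - (proj[:, None] * u[:i]).sum(0))

print(torch.allclose(P, torch.eye(dk) - w.T @ k, atol=1e-5))    # P_r = I - sum_i w_i k_i^T
print(torch.allclose(H, u.T @ k, atol=1e-5))                    # H_r = sum_i u_i k_i^T
```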
From sequential recursion to a triangular solve: W=TK
The recursion in (2) looks inherently sequential — wr depends on every earlier wi.
Stacking the per-step vectors row-wise into W,K∈RC×dk (so Wr,:=wr⊤, Kr,:=kr⊤) reveals that it is actually a single triangular linear system in disguise.
Step 1 — transpose into row form. Transposing the wr recursion in (2):
wr⊤=βrkr⊤−βr∑i=1r−1(ki⊤kr)wi⊤
so row r of W satisfies
Wr,:=βrKr,:−βr∑i=1r−1(ki⊤kr)Wi,: (4)
Step 2 — identify coefficients with entries of diag(β)KK⊤. Define A:=diag(β)KK⊤. For any r,i:
Ar,i=βr(KK⊤)r,i=βrkr⊤ki=βrki⊤kr
(the last equality uses that ki⊤kr is a scalar). So the coefficient βr(ki⊤kr) in (4) is exactly Ar,i.
Step 3 — restrict to the strictly lower-triangular part. The sum in (4) runs only over i=1,…,r−1, never including i≥r.
Equivalently, only the strictly lower-triangular entries of A matter.
Let L:=strictLower(A), with Lr,i=Ar,i for i<r and 0 otherwise.
Then
(I+L)W=diag(β)K, i.e. W=TK with T:=(I+L)−1diag(β)
The exact same argument applied to the ur recursion in (3) replaces K on the right-hand side with V (because ur's "source" vector is vr rather than kr), giving U=TV with the same T.
Cost view. The sequential reading computes w1,…,wC one at a time with C data-dependent steps. The matrix reading (I+L)W=diag(β)K is a single C×C lower-triangular system with dk right-hand sides (one per column of K), solved by a batched forward substitution — tensor-core-friendly, and the same T is reused for both W and U.
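A small torch sketch of this cost view (chunk size and dimensions are illustrative): the sequential reading and a single triangular solve with torch.linalg.solve_triangular produce the same W and U, with one system shared by both right-hand sides.

```python
import torch

torch.manual_seed(0)
C, dk, dv = 64, 32, 16                        # chunk size and head dims (illustrative)
K = torch.nn.functional.normalize(torch.randn(C, dk), dim=-1)   # unit keys, as in GDN
V = torch.randn(C, dv)
beta = torch.rand(C)                          # writing strengths in (0, 1)

# sequential reading: C data-dependent steps for w_r and u_r
W_seq, U_seq = torch.zeros(C, dk), torch.zeros(C, dv)
for r in range(C):
    proj = K[:r] @ K[r]                       # k_i^T k_r for i < r
    W_seq[r] = beta[r] * (K[r] - (proj[:, None] * W_seq[:r]).sum(0))
    U_seq[r] = beta[r] * (V[r] - (proj[:, None] * U_seq[:r]).sum(0))

# matrix reading: one C x C lower-triangular solve shared by W and U
A = torch.diag(beta) @ (K @ K.T)              # A_{r,i} = beta_r k_r^T k_i
L = torch.tril(A, diagonal=-1)                # strictly lower-triangular part
rhs = torch.cat([beta[:, None] * K, beta[:, None] * V], dim=1)   # diag(beta) [K | V]
WU = torch.linalg.solve_triangular(torch.eye(C) + L, rhs, upper=False)
W_mat, U_mat = WU[:, :dk], WU[:, dk:]

print(torch.allclose(W_seq, W_mat, atol=1e-5), torch.allclose(U_seq, U_mat, atol=1e-5))
```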
Unified matrix view
Stacking the per-step vectors row-wise, (2) and (3) compactly read
P[t]=I−W[t]⊤K[t] and H[t]=U[t]⊤K[t], with W[t]=TK[t] and U[t]=TV[t]
So a product of r rank-1 perturbations plus an accumulated rank-r write — together carrying O(r(dk+dv)) degrees of freedom — are expressed by a single batched C×C triangular inverse applied to K[t] and V[t], replacing r sequential Householder-and-write applications with tensor-core-friendly matmuls.
Re-inserting the α gates
So far we have the WY representation of the β-only P[t]r and H[t]r.
Putting the α's back recovers F[t]r and G[t]r from (1).
For F this is a global scalar: F[t]r=γ[t]rP[t]r, so the row-stacked form is F[t]r=γ[t]r(I−W[t]⊤K[t]) with the same W[t].
For G the α's distribute as ratios: the i-th historical write enters with multiplier γ[t]r/γ[t]i, which is exactly the (r,i) entry of the decay-aware causal mask Γ[t].
Tracing through the W-form argument, every βr(ki⊤kr) coefficient picks up an extra γ[t]r/γ[t]i factor, which means K[t]K[t]⊤ inside T is replaced by Γ[t]⊙K[t]K[t]⊤.
The resulting gated UT transform gives the row-stacked accumulated-write matrix directly:
U[t]=(I+strictLower(diag(β[t])(Γ[t]⊙K[t]K[t]⊤)))−1diag(β[t])V[t]
With appropriately decayed copies of the per-chunk matrices (each vector decayed either to the first or to the last position of the chunk), the cross-chunk recurrence and per-chunk output also reduce to dense matrix products,
where M is the standard causal mask and W[t] comes from the same UT transform applied to K[t].
Every intra-chunk operation is now matmul-shaped, so the algorithm preserves the gated delta semantics exactly while running tensor-core-bound.
The inter-chunk computation, however, is still sequential.
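The claim about Γ[t]⊙K[t]K[t]⊤ can be checked numerically. The sketch below (illustrative sizes, not the fla kernel) evaluates the gated accumulated write directly from the recurrence, recovers U from the gated UT transform, and re-stacks it against keys decayed to the chunk's last position:

```python
import torch

torch.manual_seed(0)
C, dk, dv = 64, 32, 16
K = torch.nn.functional.normalize(torch.randn(C, dk), dim=-1)
V = torch.randn(C, dv)
alpha = 0.9 + 0.1 * torch.rand(C)             # gates in (0, 1)
beta = torch.rand(C)
gamma = alpha.cumprod(0)                       # cumulative decay inside the chunk

# direct evaluation of the gated accumulated write (zero incoming state)
G = torch.zeros(dv, dk)
for t in range(C):
    G = alpha[t] * G @ (torch.eye(dk) - beta[t] * torch.outer(K[t], K[t])) \
        + beta[t] * torch.outer(V[t], K[t])

# gated UT transform: K K^T inside T is replaced by Gamma ⊙ K K^T
Gamma = torch.tril(gamma[:, None] / gamma[None, :])
A = torch.diag(beta) @ (Gamma * (K @ K.T))
L = torch.tril(A, diagonal=-1)
U = torch.linalg.solve_triangular(torch.eye(C) + L, beta[:, None] * V, upper=False)

# re-stack with each k_i decayed to the last position of the chunk
K_decayed = (gamma[-1] / gamma)[:, None] * K
print(torch.allclose(G, U.T @ K_decayed, atol=1e-4))             # True
```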
Block compute flow
The block diagram realizes one step of this recurrence.
Let xt be the block input.
Four linear projections fan xt out into (q,k), v, (α,β), and a residual gate g.
The q,k path applies a short causal Conv, then SiLU, then L2 normalization on each head.
The L2 step pins ∥kt∥=1 so that (I−βtktkt⊤) stays well-conditioned after many delta updates.
The v path uses the same Conv then SiLU stack without L2, since values are the content being written rather than the lookup direction.
The α,β path is a plain linear projection; αt uses Mamba2's parameterization and βt a sigmoid so both stay in (0,1).
(qt,kt,vt,αt,βt) feed the gated delta rule and yield ot=Stqt.
ot is RMS-normalized, gated elementwise by SiLU(g) (the residual branch on the right of the diagram), then projected back to the model dimension by the top Linear.
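Putting the block together, here is a single-head torch sketch of the data flow. All module names, the conv length, and the sigmoid stand-in for Mamba2's α parameterization are my own illustrative choices, and the token-by-token loop stands in for the chunkwise kernel; this is not the actual Qwen3-Next / Megatron layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GDNBlockSketch(nn.Module):
    """Single-head sketch of the block diagram above (illustrative, not the real layer)."""

    def __init__(self, d_model, d_k, d_v, conv_size=4):
        super().__init__()
        self.qk_proj = nn.Linear(d_model, 2 * d_k)
        self.v_proj = nn.Linear(d_model, d_v)
        self.ab_proj = nn.Linear(d_model, 2)           # (alpha, beta) logits
        self.g_proj = nn.Linear(d_model, d_v)          # residual gate branch
        # short depthwise causal convolutions along the sequence axis
        self.qk_conv = nn.Conv1d(2 * d_k, 2 * d_k, conv_size, padding=conv_size - 1, groups=2 * d_k)
        self.v_conv = nn.Conv1d(d_v, d_v, conv_size, padding=conv_size - 1, groups=d_v)
        self.out_proj = nn.Linear(d_v, d_model)

    def _causal_conv(self, conv, x):                   # x: (L, C) -> (L, C)
        L = x.shape[0]
        return conv(x.T.unsqueeze(0))[0, :, :L].T      # trim the right overhang => causal

    def forward(self, x):                              # x: (L, d_model), batch omitted
        L = x.shape[0]
        qk = F.silu(self._causal_conv(self.qk_conv, self.qk_proj(x)))
        q, k = qk.chunk(2, dim=-1)
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)    # L2 norm keeps ||k_t|| = 1
        v = F.silu(self._causal_conv(self.v_conv, self.v_proj(x)))
        alpha, beta = torch.sigmoid(self.ab_proj(x)).unbind(-1)  # both in (0, 1)
        g = self.g_proj(x)

        # gated delta rule, token by token (the chunkwise kernel replaces this loop)
        d_k, d_v = k.shape[-1], v.shape[-1]
        S = x.new_zeros(d_v, d_k)
        o = []
        for t in range(L):
            S = alpha[t] * S @ (torch.eye(d_k, device=x.device) - beta[t] * torch.outer(k[t], k[t])) \
                + beta[t] * torch.outer(v[t], k[t])
            o.append(S @ q[t])                         # o_t = S_t q_t
        o = torch.stack(o)

        o = o * torch.rsqrt(o.pow(2).mean(-1, keepdim=True) + 1e-6)  # RMS norm (no learned scale)
        return self.out_proj(o * F.silu(g))                          # SiLU(g)-gated, then project
```

For example, GDNBlockSketch(256, 64, 128)(torch.randn(16, 256)) returns a (16, 256) tensor.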
Cost summary vs full SDPA
The chunkwise algorithm replaces SDPA's O(L2) sequence-axis cost with a fixed-size recurrent state plus per-chunk local work.
Let L be the sequence length, d the per-head dimension (taking dk=dv=d for brevity), and C the chunk size (C=64 in fla).
| Quantity (per head, per layer) | Full SDPA | GDN chunkwise |
| --- | --- | --- |
| Training compute | O(L2d) | O(LCd+Ld2) |
| Training memory (activations) | O(Ld) with FlashAttention | O(Ld)+O(d2) state |
| Inference compute per generated token | O(Ld) (scan KV cache) | O(d2) (matvec with state) |
| Inference memory per generated token | O(Ld), KV cache grows | O(d2), state is fixed |
Training. SDPA scales quadratically with L along the sequence axis.
GDN turns the quadratic term into a chunk-local O(LCd) contribution (linear in L once C is fixed) plus a cross-chunk O(Ld2) term from sweeping the d×d state through the L/C chunks.
For L≫max(C,d) the saving in attention-core flops is roughly a factor of L/(C+d).
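Plugging in illustrative numbers (L=32768, d=128, C=64) makes the ratio concrete; treating the two GDN terms literally gives a saving of about L/(C+d):

```python
# rough attention-core flop counts per head per layer (constant factors dropped)
L, d, C = 32768, 128, 64
sdpa = L**2 * d                   # O(L^2 d)
gdn = L * C * d + L * d**2        # O(LCd) intra-chunk + O(Ld^2) cross-chunk
print(sdpa / gdn)                 # = L / (C + d) ≈ 170.7
```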
Inference. SDPA must attend to every prior key, so per-token compute and KV-cache memory both grow linearly with L.
GDN compresses everything into the dk×dv matrix St: each new token costs one O(d2) matvec, and the memory footprint never grows with context length.
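A similarly rough memory comparison per head, with the same illustrative L and d:

```python
# rough per-head memory at generation time (float counts, constants dropped)
L, d = 32768, 128
sdpa_kv_cache = 2 * L * d         # keys + values kept for every past token
gdn_state = d * d                 # fixed d_k x d_v state, independent of context length
print(sdpa_kv_cache / gdn_state)  # = 2L / d = 512
```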