Autoregressive language models based on the decoder-only Transformer architecture generate tokens sequentially, conditioning each prediction on all previously generated tokens. During inference, the key and value vectors of prior tokens are cached so that they do not have to be recomputed at every step. In the standard Multi-Head Attention (MHA) formulation, the size of this KV cache grows linearly with both the sequence length and the number of attention heads, creating a significant memory bottleneck that limits the maximum context length achievable on commodity hardware. Multi-Head Latent Attention (MLA), introduced in DeepSeek-V2, addresses this bottleneck through a low-rank joint projection of the key and value representations, achieving KV cache sizes comparable to Grouped-Query Attention (GQA) while preserving — and in some cases exceeding — the modeling capacity of full MHA.

This article presents a detailed analysis of the MLA mechanism, including its low-rank compression strategy, the decoupled Rotary Position Embedding (RoPE) design, and the weight-absorption trick that enables efficient inference. We also provide a rigorous comparison of the computational and memory costs of MLA relative to MHA, MQA, and GQA.

Notation

| Symbol | Description |
| --- | --- |
| $d$ | Model hidden dimension |
| $d_h$ | Dimension of embedding per attention head |
| $d_c$ | KV compression dimension in MLA |
| $d_c'$ | Query compression dimension in MLA |
| $d_h^R$ | Per-head dimension of the decoupled RoPE queries and keys in MLA |
| $n_h$ | Number of attention heads |
| $n_g$ | Number of KV groups in GQA |
| $l$ | Number of Transformer layers |
| $h_t \in \mathbb{R}^{d}$ | Attention input for the $t$-th token at a given layer |
| $u_t \in \mathbb{R}^{d}$ | Output hidden state for the $t$-th token at a given layer |

Background: KV Cache in Autoregressive Transformers

In a decoder-only Transformer, each autoregressive step processes the current token embedding $h_t$ through a series of projections to produce query, key, and value vectors for every attention head:

$$[q_{t,1};\; q_{t,2};\; \ldots;\; q_{t,n_h}] = W_Q \, h_t$$
$$[k_{t,1};\; k_{t,2};\; \ldots;\; k_{t,n_h}] = W_K \, h_t$$
$$[v_{t,1};\; v_{t,2};\; \ldots;\; v_{t,n_h}] = W_V \, h_t$$

where $W_Q, W_K, W_V \in \mathbb{R}^{d_h n_h \times d}$ are the projection weight matrices and $q_{t,i}, k_{t,i}, v_{t,i} \in \mathbb{R}^{d_h}$ denote the query, key, and value for head $i$ of token $t$.

Within each head, the attention output is computed as a weighted sum over all prior values:

$$o_{t,i} = \sum_{j=1}^{t} \mathrm{softmax}_j\!\left(\frac{q_{t,i}^\top k_{j,i}}{\sqrt{d_h}}\right) v_{j,i}$$

The final layer output is obtained by concatenating the per-head outputs and projecting:

$$u_t = W_O \, [o_{t,1};\; o_{t,2};\; \ldots;\; o_{t,n_h}]$$

To avoid redundant recomputation, the key-value pairs $(k_{j,i}, v_{j,i})$ are cached after their first computation. The per-token KV cache size in MHA is therefore $2 \, n_h \, d_h$ elements per layer, yielding a total of $2 \, n_h \, d_h \, l$ elements per token across all layers. For a model with $n_h = 64$, $d_h = 128$, and $l = 60$, this amounts to nearly one million float16 values per token — a substantial memory footprint that grows without bound as the sequence length increases.
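As a quick check of this arithmetic, a few lines of Python reproduce the figure quoted above (the configuration is the article's running example; the snippet is purely illustrative):

```python
# Per-token KV cache for standard MHA: 2 * n_h * d_h elements per layer.
n_h, d_h, n_layers = 64, 128, 60              # running example from the text
elems_per_token = 2 * n_h * d_h * n_layers    # keys + values, across all layers
print(elems_per_token)                        # 983040 -> ~1M float16 values
print(elems_per_token * 2 / 2**20, "MiB")     # 2 bytes each -> 1.875 MiB per token
```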

Prior Approaches: MQA and GQA

Multi-Query Attention (MQA) [Shazeer, 2019] collapses all heads to share a single pair of key and value projections, reducing the per-token KV cache to $2 \, d_h \, l$. Although effective in reducing memory, the aggressive sharing of KV representations across heads degrades model quality, as each head loses its distinctive key-value subspace.

Grouped-Query Attention (GQA) [Ainslie et al., 2023] interpolates between MHA and MQA by partitioning the $n_h$ query heads into $n_g$ groups, each sharing one key-value head. The per-token KV cache becomes $2 \, n_g \, d_h \, l$, where $n_g$ controls the trade-off between cache efficiency and model capacity. When $n_g = n_h$, GQA reduces to MHA; when $n_g = 1$, it reduces to MQA.

Figure: Comparison of KV cache sizes across MHA, GQA, MQA, and MLA

Complexity Comparison

It is instructive to compare the per-token KV cache and the per-token projection FLOPs for each attention variant. For a sequence of length $T$ with hidden dimension $d = n_h d_h$:

| Method | KV Cache per Token (per layer) | Key Projection FLOPs | Value Projection FLOPs |
| --- | --- | --- | --- |
| MHA | $2 n_h d_h$ | $n_h d_h \cdot d$ | $n_h d_h \cdot d$ |
| GQA | $2 n_g d_h$ | $n_g d_h \cdot d$ | $n_g d_h \cdot d$ |
| MQA | $2 d_h$ | $d_h \cdot d$ | $d_h \cdot d$ |
| MLA | $d_c + d_h^R$ | $d_c \cdot d$ | $d_c \cdot d$ |

In MHA, the KV projection matrices $W_K, W_V \in \mathbb{R}^{n_h d_h \times d}$ require $2 n_h d_h d$ FLOPs per token, and the cache must store $2 n_h d_h$ elements per layer. GQA reduces the projection cost to $2 n_g d_h d$ by sharing KV within each group, but the cache reduction is only proportional to $n_g / n_h$. MQA achieves the smallest projection cost of $2 d_h d$, but at the cost of collapsing all head-specific information. MLA, by contrast, derives both key and value from a single compressed vector of dimension $d_c \ll n_h d_h$: one shared down-projection of cost $d_c \cdot d$ per token produces the latent, and the cache stores only $d_c + d_h^R$ elements (the compressed vector plus the decoupled RoPE key). Critically, the cache size is independent of $n_h$, which allows MLA to increase the number of heads freely without any cache penalty — a property that none of the other methods possess.

Low-Rank Joint Key-Value Compression

The central innovation of MLA is the joint low-rank compression of the key and value representations. Rather than projecting $h_t$ independently into key and value spaces of dimension $n_h d_h$, MLA first projects it into a low-dimensional latent space and then up-projects separately to recover the key and value:

$$c_t^{KV} = W_{DKV} \, h_t$$
$$k_t^C = W_{UK} \, c_t^{KV}$$
$$v_t^C = W_{UV} \, c_t^{KV}$$

Here $c_t^{KV} \in \mathbb{R}^{d_c}$ is the compressed latent vector, which is the only representation stored in the KV cache during inference. The down-projection matrix is $W_{DKV} \in \mathbb{R}^{d_c \times d}$, and the up-projection matrices are $W_{UK} \in \mathbb{R}^{n_h d_h \times d_c}$ and $W_{UV} \in \mathbb{R}^{n_h d_h \times d_c}$. The compression dimension $d_c$ is chosen to be much smaller than $n_h d_h$; in DeepSeek-V2, $d_c = 4 d_h$, which for $d_h = 128$ yields $d_c = 512$, compared to $n_h d_h = 8192$ for $n_h = 64$.

The low-rank structure imposes a bottleneck: the key and value vectors for all heads are constrained to lie in a $d_c$-dimensional subspace. This is a form of information compression, and the key question is whether $d_c$ dimensions suffice to preserve the expressive power of the full-rank key and value. Empirically, DeepSeek-V2 demonstrates that with $d_c = 4 d_h$, MLA not only matches but slightly exceeds the performance of MHA — a result we revisit in the discussion of experimental results.
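The joint compression path can be sketched in a few lines of PyTorch. This is a minimal illustration of the equations above, not DeepSeek-V2's implementation; the module and variable names (`JointKVCompression`, `W_dkv`, and so on) are invented here, and the RoPE path is omitted since it is treated in the next section.

```python
import torch
import torch.nn as nn

class JointKVCompression(nn.Module):
    """Down-project h_t to a shared latent c_t^KV, then up-project to per-head K and V."""
    def __init__(self, d: int, n_h: int, d_h: int, d_c: int):
        super().__init__()
        self.n_h, self.d_h = n_h, d_h
        self.W_dkv = nn.Linear(d, d_c, bias=False)          # W_DKV: d -> d_c
        self.W_uk  = nn.Linear(d_c, n_h * d_h, bias=False)  # W_UK: d_c -> n_h * d_h
        self.W_uv  = nn.Linear(d_c, n_h * d_h, bias=False)  # W_UV: d_c -> n_h * d_h

    def forward(self, h):                  # h: (batch, seq, d)
        c_kv = self.W_dkv(h)               # (batch, seq, d_c) -- the only tensor cached
        k = self.W_uk(c_kv).view(*h.shape[:2], self.n_h, self.d_h)
        v = self.W_uv(c_kv).view(*h.shape[:2], self.n_h, self.d_h)
        return c_kv, k, v

# Dimensions from the running example: d = n_h * d_h = 8192, d_c = 4 * d_h = 512.
mla_kv = JointKVCompression(d=8192, n_h=64, d_h=128, d_c=512)
c_kv, k, v = mla_kv(torch.randn(1, 4, 8192))
print(c_kv.shape, k.shape, v.shape)   # (1, 4, 512) (1, 4, 64, 128) (1, 4, 64, 128)
```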

To further reduce the memory footprint during training (where the query activations must also be stored for backpropagation), MLA applies the same low-rank compression strategy to the query:

$$c_t^Q = W_{DQ} \, h_t$$
$$q_t^C = W_{UQ} \, c_t^Q$$

where $c_t^Q \in \mathbb{R}^{d_c'}$ is the compressed query latent, $W_{DQ} \in \mathbb{R}^{d_c' \times d}$, and $W_{UQ} \in \mathbb{R}^{n_h d_h \times d_c'}$. Note that the query compression affects only training memory, not inference cache size, since queries are not cached.

Decoupled Rotary Position Embedding

Rotary Position Embedding (RoPE) [Su et al., 2024] encodes positional information by applying a rotation matrix to the query and key vectors. For a vector $x \in \mathbb{R}^{2m}$, the rotation at position $t$ is defined as:

$$\mathrm{RoPE}(x, t) = R(t) \, x, \quad R(t) = \mathrm{diag}\!\left(R_1^{t}, R_2^{t}, \ldots, R_m^{t}\right)$$

where each $R_i^{t} \in \mathbb{R}^{2 \times 2}$ is a 2D rotation by angle $t \theta_i$:

$$R_i^{t} = \begin{pmatrix} \cos(t\theta_i) & -\sin(t\theta_i) \\ \sin(t\theta_i) & \cos(t\theta_i) \end{pmatrix}$$

The key property of RoPE is that the attention score between a query at position $t$ and a key at position $s$ depends only on the relative position $t - s$:

$$\langle R(t) \, q, \, R(s) \, k \rangle = q^\top R(t)^\top R(s) \, k = q^\top R(s - t) \, k$$

since $R(t)^\top R(s) = R(s - t)$ by the group structure of rotation matrices.
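A short numerical check of this relative-position property follows. The frequency schedule $\theta_i = 10000^{-2(i-1)/(2m)}$ is the common RoPE default and is an assumption here, since the text does not fix the base:

```python
import torch

def rope(x: torch.Tensor, pos: int, base: float = 10000.0) -> torch.Tensor:
    """Apply the block-diagonal rotation R(pos) to x, pairing dimensions (2i, 2i+1)."""
    d = x.shape[-1]
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)  # (d/2,)
    ang = pos * theta
    cos, sin = torch.cos(ang), torch.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1).flatten(-2)

q, k = torch.randn(64), torch.randn(64)
# The score depends only on the offset between the two positions:
s1 = rope(q, 7)   @ rope(k, 12)    # positions (7, 12),   offset 5
s2 = rope(q, 100) @ rope(k, 105)   # positions (100, 105), offset 5
print(torch.allclose(s1, s2, atol=1e-4))   # True
```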

The Incompatibility Problem

In MLA, the key is reconstructed from the compressed latent: $k_t^C = W_{UK} \, c_t^{KV}$. If one were to apply RoPE naively by rotating the reconstructed key, the result would be $R(t) \, W_{UK} \, c_t^{KV}$. Since $c_t^{KV}$ is the cached quantity, this rotation must be applied at every inference step for all cached positions — defeating the purpose of caching. Moreover, the rotation matrix $R(t)$ cannot be absorbed into $W_{UK}$, because $R(t)$ is position-dependent and varies across tokens; there is no fixed matrix that can be pre-multiplied into $W_{UK}$ to account for all positions simultaneously.

Equivalently, consider attempting to absorb $R(t)$ into $W_{UK}$. We would like to write:

$$R(t) \, W_{UK} \, c_t^{KV} = \tilde{W}_{UK}(t) \, c_t^{KV}$$

but $\tilde{W}_{UK}(t)$ depends on the position $t$, meaning it cannot be pre-computed and must be materialized for each token — incurring an $O(d_c \cdot n_h d_h)$ cost per token, which is exactly the cost we sought to avoid.
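The position dependence is easy to see numerically: the matrix that would have to be absorbed, $R(t) \, W_{UK}$, changes with $t$, so no single pre-computed matrix can stand in for it. A tiny sketch, using toy dimensions and a dense rotation matrix for clarity:

```python
import torch

def rotation_matrix(pos: int, d: int, base: float = 10000.0) -> torch.Tensor:
    """Dense d x d block-diagonal RoPE rotation R(pos)."""
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    R = torch.zeros(d, d)
    for i, ang in enumerate(pos * theta):
        c, s = torch.cos(ang), torch.sin(ang)
        R[2 * i, 2 * i], R[2 * i, 2 * i + 1] = c, -s
        R[2 * i + 1, 2 * i], R[2 * i + 1, 2 * i + 1] = s, c
    return R

d_h, d_c = 8, 4                          # toy sizes, not the real configuration
W_uk = torch.randn(d_h, d_c)             # stand-in for one head's W_UK
# The "absorbed" matrix R(t) @ W_UK differs from position to position:
same = torch.allclose(rotation_matrix(1, d_h) @ W_uk, rotation_matrix(2, d_h) @ W_uk)
print(same)                              # False -- nothing can be pre-computed once
```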

The Decoupled Solution

DeepSeek-V2 resolves this by separating the positional and content components of the key. The content information is carried by the compressed latent $c_t^{KV}$ (up-projected without RoPE), while the positional information is carried by a single small RoPE-encoded key vector per token, shared across all heads.

Query side. The compressed query latent $c_t^Q$ is up-projected into the per-head content queries, and a separate projection of $c_t^Q$, with RoPE applied, produces the positional queries:

$$[q_{t,1}^C;\; \ldots;\; q_{t,n_h}^C] = W_{UQ} \, c_t^Q$$
$$[q_{t,1}^R;\; \ldots;\; q_{t,n_h}^R] = \mathrm{RoPE}(W_{QR} \, c_t^Q)$$

where $W_{QR} \in \mathbb{R}^{n_h d_h^R \times d_c'}$ produces a separate set of RoPE queries of dimension $d_h^R$ per head. The final query for head $i$ is the concatenation of the content and positional parts:

$$q_{t,i} = [q_{t,i}^C;\; q_{t,i}^R] \in \mathbb{R}^{d_h + d_h^R}$$

Key side. The content key is up-projected from the compressed latent without RoPE:

$$[k_{t,1}^C;\; \ldots;\; k_{t,n_h}^C] = W_{UK} \, c_t^{KV}$$

and the positional key is produced from the raw hidden state with RoPE:

$$k_t^R = \mathrm{RoPE}(W_{KR} \, h_t) \in \mathbb{R}^{d_h^R}$$

where $W_{KR} \in \mathbb{R}^{d_h^R \times d}$ and the resulting $k_t^R$ is shared across all heads. The final key for head $i$ is:

$$k_{t,i} = [k_{t,i}^C;\; k_t^R] \in \mathbb{R}^{d_h + d_h^R}$$

This decoupling ensures that the compressed latent $c_t^{KV}$ is stored as-is in the cache (without any position-dependent transformation), while the position information is captured by the small additional vector $k_t^R$ of dimension $d_h^R$.
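Putting the two sides together, a minimal sketch of the decoupled projections for a single token is shown below. The dimensions, layer names, and the `rope` helper are illustrative assumptions (biases omitted); the snippet only checks shapes and is not DeepSeek-V2's actual implementation.

```python
import torch
import torch.nn as nn

def rope(x, pos, base=10000.0):
    """Apply R(pos) along the last dimension, pairing (2i, 2i+1)."""
    dd = x.shape[-1]
    theta = base ** (-torch.arange(0, dd, 2, dtype=torch.float32) / dd)
    ang = pos * theta
    c, s = torch.cos(ang), torch.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack([x1 * c - x2 * s, x1 * s + x2 * c], dim=-1).flatten(-2)

# Toy dimensions (illustrative): model dim, heads, head dim, KV latent, Q latent, RoPE dim.
d, n_h, d_h, d_c, d_cq, d_hr = 512, 4, 32, 64, 48, 16

W_dq  = nn.Linear(d,    d_cq,       bias=False)  # query down-projection  (W_DQ)
W_uq  = nn.Linear(d_cq, n_h * d_h,  bias=False)  # content query up-proj. (W_UQ)
W_qr  = nn.Linear(d_cq, n_h * d_hr, bias=False)  # decoupled RoPE queries (W_QR)
W_dkv = nn.Linear(d,    d_c,        bias=False)  # joint KV down-proj.    (W_DKV)
W_uk  = nn.Linear(d_c,  n_h * d_h,  bias=False)  # content key up-proj.   (W_UK)
W_kr  = nn.Linear(d,    d_hr,       bias=False)  # shared RoPE key        (W_KR)

h, pos = torch.randn(d), 5                       # one token at position 5
c_q, c_kv = W_dq(h), W_dkv(h)                    # c_kv (and k_r below) go in the cache

q_c = W_uq(c_q).view(n_h, d_h)                   # content queries, per head
q_r = rope(W_qr(c_q).view(n_h, d_hr), pos)       # RoPE queries, per head
k_c = W_uk(c_kv).view(n_h, d_h)                  # content keys, per head
k_r = rope(W_kr(h), pos)                         # single RoPE key, shared by all heads

q = torch.cat([q_c, q_r], dim=-1)                     # (n_h, d_h + d_hr)
k = torch.cat([k_c, k_r.expand(n_h, d_hr)], dim=-1)   # (n_h, d_h + d_hr)
print(q.shape, k.shape, c_kv.shape, k_r.shape)   # (4, 48) (4, 48) (64,) (16,)
```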

Attention Score Computation

With the decoupled query and key, the attention score for head $i$ decomposes into a content component and a positional component:

$$q_{t,i}^\top k_{j,i} = (q_{t,i}^C)^\top k_{j,i}^C + (q_{t,i}^R)^\top k_j^R$$

The attention output is then:

$$o_{t,i} = \sum_{j=1}^{t} \mathrm{softmax}_j\!\left(\frac{(q_{t,i}^C)^\top k_{j,i}^C + (q_{t,i}^R)^\top k_j^R}{\sqrt{d_h + d_h^R}}\right) v_{j,i}^C$$

The per-token cache now stores $c_j^{KV} \in \mathbb{R}^{d_c}$ and $k_j^R \in \mathbb{R}^{d_h^R}$, for a total of $d_c + d_h^R$ elements per layer — independent of $n_h$.
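Because query and key are simple concatenations of the content and RoPE parts, the full dot product splits exactly into the two terms above; a two-line check with toy dimensions:

```python
import torch

d_h, d_hr = 32, 16                                    # toy per-head dimensions
q_c, q_r, k_c, k_r = (torch.randn(n) for n in (d_h, d_hr, d_h, d_hr))
q, k = torch.cat([q_c, q_r]), torch.cat([k_c, k_r])
# Concatenated dot product = content term + positional term:
print(torch.allclose(q @ k, q_c @ k_c + q_r @ k_r, atol=1e-5))  # True
```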

Figure: Architecture of MLA with decoupled RoPE

Inference-Time Weight Absorption

A crucial advantage of MLA is that the up-projection matrices $W_{UK}$ and $W_{UV}$ can be absorbed into other weight matrices at inference time, effectively eliminating the cost of explicitly reconstructing the full key and value vectors.

Key Absorption: $W_{UK}$ into $W_{UQ}$

Consider the content term of the attention score:

$$(q_{t,i}^C)^\top k_{j,i}^C = (W_{UQ,i} \, c_t^Q)^\top (W_{UK,i} \, c_j^{KV}) = (c_t^Q)^\top W_{UQ,i}^\top W_{UK,i} \, c_j^{KV}$$

where $W_{UQ,i} \in \mathbb{R}^{d_h \times d_c'}$ and $W_{UK,i} \in \mathbb{R}^{d_h \times d_c}$ are the slices of the up-projection matrices corresponding to head $i$. The product $W_{UQ,i}^\top W_{UK,i} \in \mathbb{R}^{d_c' \times d_c}$ can be pre-computed once and reused for all tokens, since it depends only on the model weights, not on the input. Let us define:

$$\hat{W}_{QK,i} = W_{UQ,i}^\top W_{UK,i}$$

Then the content attention score simplifies to:

$$(q_{t,i}^C)^\top k_{j,i}^C = (c_t^Q)^\top \hat{W}_{QK,i} \, c_j^{KV}$$

This means the attention score is computed directly from the compressed representations $c_t^Q$ and $c_j^{KV}$, without ever materializing the full $d_h$-dimensional key or query vectors. The pre-computed matrix $\hat{W}_{QK,i}$ has size $d_c' \times d_c$ and depends only on the model weights, so its cost is paid once and amortized over all tokens.
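A quick numerical check that the absorbed path matches the explicit one (toy per-head dimensions; the variable names are illustrative):

```python
import torch

d_h, d_c, d_cq = 32, 64, 48              # toy sizes: per-head dim, KV latent, Q latent
W_uq_i = torch.randn(d_h, d_cq)          # head i slice of W_UQ
W_uk_i = torch.randn(d_h, d_c)           # head i slice of W_UK
c_q, c_kv = torch.randn(d_cq), torch.randn(d_c)

# Explicit path: materialize the full query and key, then take their dot product.
explicit = (W_uq_i @ c_q) @ (W_uk_i @ c_kv)

# Absorbed path: pre-compute W_hat once from the weights, score in the compressed space.
W_hat_qk_i = W_uq_i.T @ W_uk_i           # (d_cq, d_c), input-independent
absorbed = c_q @ W_hat_qk_i @ c_kv

print(torch.allclose(explicit, absorbed, rtol=1e-4, atol=1e-2))  # True
```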

Value Absorption: $W_{UV}$ into $W_O$

After computing the attention weights, the weighted sum of values is:

$$o_{t,i} = \sum_{j=1}^{t} \alpha_{t,j,i} \, v_{j,i}^C = \sum_{j=1}^{t} \alpha_{t,j,i} \, W_{UV,i} \, c_j^{KV} = W_{UV,i} \sum_{j=1}^{t} \alpha_{t,j,i} \, c_j^{KV}$$

where $\alpha_{t,j,i}$ are the attention weights for head $i$ and $W_{UV,i} \in \mathbb{R}^{d_h \times d_c}$ is the value up-projection for head $i$. Since the weighted sum $\sum_j \alpha_{t,j,i} \, c_j^{KV}$ is a vector in $\mathbb{R}^{d_c}$, the computation proceeds as:

  1. Compute the weighted sum in the compressed space: $\hat{c}_t^{KV,i} = \sum_j \alpha_{t,j,i} \, c_j^{KV} \in \mathbb{R}^{d_c}$.
  2. Up-project: $o_{t,i} = W_{UV,i} \, \hat{c}_t^{KV,i}$.

Now, the final output projection concatenates all heads and applies $W_O$:

$$u_t = W_O \, [o_{t,1};\; \ldots;\; o_{t,n_h}] = W_O \, [W_{UV,1} \, \hat{c}_t^{KV,1};\; \ldots;\; W_{UV,n_h} \, \hat{c}_t^{KV,n_h}]$$

This can be rewritten by merging $W_O$ and $W_{UV}$ into a single matrix. Let $W_O = [W_{O,1}, \ldots, W_{O,n_h}]$ where $W_{O,i} \in \mathbb{R}^{d \times d_h}$. Then:

$$u_t = \sum_{i=1}^{n_h} W_{O,i} \, W_{UV,i} \, \hat{c}_t^{KV,i} = \sum_{i=1}^{n_h} \hat{W}_{OV,i} \, \hat{c}_t^{KV,i}$$

where $\hat{W}_{OV,i} = W_{O,i} \, W_{UV,i} \in \mathbb{R}^{d \times d_c}$ can be pre-computed. The net effect is that the value up-projection and output projection are fused into a single $d \times d_c$ matrix multiplication per head, and the full $d_h$-dimensional value vectors never need to be materialized.
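The same check for the value path: aggregating in the compressed space and applying the fused matrix gives the same output as up-projecting every cached value explicitly (toy dimensions, one head):

```python
import torch

d, d_h, d_c, T = 256, 32, 64, 10         # toy sizes: model dim, head dim, latent, cached tokens
W_uv_i = torch.randn(d_h, d_c)           # head i slice of W_UV
W_o_i  = torch.randn(d, d_h)             # head i block of W_O
alpha  = torch.softmax(torch.randn(T), dim=0)   # attention weights for head i
c_kv   = torch.randn(T, d_c)             # cached latents c_j^KV, j = 1..T

# Explicit path: up-project every cached latent to a full value, aggregate, project.
explicit = W_o_i @ (alpha @ (c_kv @ W_uv_i.T))

# Absorbed path: aggregate in the compressed space, then apply the fused matrix.
W_hat_ov_i = W_o_i @ W_uv_i              # (d, d_c), pre-computed once
absorbed = W_hat_ov_i @ (alpha @ c_kv)

print(torch.allclose(explicit, absorbed, rtol=1e-4, atol=1e-2))  # True
```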

Summary of Absorption Benefits

| Quantity | Without Absorption | With Absorption |
| --- | --- | --- |
| Cached per token (per layer) | $2 n_h d_h$ (full KV) | $d_c + d_h^R$ (compressed latent + RoPE key) |
| KV up-projection per decoding step | $2\, T \cdot n_h d_h \cdot d_c$ (reconstruct every cached position) | none (folded into $\hat{W}_{QK}$ and $\hat{W}_{OV}$) |
| Content score per query–key pair | $d_h$ per head (dot product in $\mathbb{R}^{d_h}$) | $d_c$ per head (dot product in the compressed space) |
| Value aggregation per key | $d_h$ per head | $d_c$ per head (sum in the compressed space) |
| Output projection | $d \cdot n_h d_h$ (via $W_O$) | $d \cdot d_c$ per head (via $\hat{W}_{OV}$) |

The absorption trick thus transforms MLA from a method that merely compresses the cache into one that also keeps every sequence-length-dependent computation in the compressed space: the full keys and values never have to be reconstructed for the cached positions at decoding time.

Experimental Results

The following table summarizes the per-token KV cache comparison across methods, using the DeepSeek-V2 configuration where $d_c = 4 d_h$ and $d_h^R = d_h / 2$:

| Method | KV Cache per Token (all $l$ layers) | Relative Size |
| --- | --- | --- |
| MHA | $2 n_h d_h \, l$ | $1\times$ (baseline) |
| GQA | $2 n_g d_h \, l$ | $n_g / n_h$ |
| MQA | $2 d_h \, l$ | $1 / n_h$ |
| MLA | $(d_c + d_h^R) \, l = \tfrac{9}{2} d_h \, l$ | $\tfrac{9}{4 n_h}$ |

For $n_h = 64$ and $d_h = 128$, MHA requires $16384$ elements per layer per token, while MLA requires only $576$ — a compression ratio of approximately $28.4\times$.
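The ratio follows directly from the table; for completeness, a tiny computation:

```python
n_h, d_h = 64, 128
d_c, d_hr = 4 * d_h, d_h // 2            # DeepSeek-V2-style choice: d_c = 4*d_h, d_h^R = d_h/2
mha, mla = 2 * n_h * d_h, d_c + d_hr     # elements per layer, per token
print(mha, mla, round(mha / mla, 1))     # 16384 576 28.4
```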

Figure: MLA vs. MHA benchmark comparison

The benchmark results demonstrate that MLA not only matches but slightly exceeds MHA across a range of evaluation metrics. This may appear counterintuitive for a compression-based method, but the explanation lies in the architectural flexibility afforded by the independence of the KV cache from the head count. In standard MHA, increasing the number of heads linearly increases the KV cache, creating a hard constraint on model width. In MLA, the KV cache size depends only on $d_c$ and $d_h^R$, so the number of heads can be increased freely to enhance model capacity without any cache penalty. DeepSeek-V2 exploits this by using roughly $3\times$ the head count typical for a model of its size, enlarging the total attention width without enlarging the cache. The resulting model has finer-grained attention patterns while maintaining an efficient cache footprint.

References

  1. DeepSeek-AI. “DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model.” arXiv preprint arXiv:2405.04434, 2024.

  2. Shazeer, N. “Fast Transformer Decoding: One Write-Head is All You Need.” arXiv preprint arXiv:1911.02150, 2019.

  3. Ainslie, J., Lee-Thorp, J., de Jong, M., et al. “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.” Proceedings of EMNLP, 2023.

  4. Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., and Liu, Y. “RoFormer: Enhanced Transformer with Rotary Position Embedding.” Neurocomputing, 2024.