Attention

This article takes an engineering perspective. The mathematical derivation of MLA (from RoPE to latent projection, the partial-RoPE compatibility proof, and the weight-absorption algebra) is in /posts/mathematics/position-encoding/mla-from-rope/. This article does not repeat the math; it covers only the engineering numbers and design trade-offs that matter for DeepSeek V2/V3 deployment.

Figure from DeepSeek-V2: A Strong, Economical, and Efficient MoE Language Model

1. Why DeepSeek Chose MLA: Engineering Motivation

DeepSeek V2 / V3 [1] adopt MLA as the replacement for standard MHA, and the root motivation is deployment economics. In LLM serving, the KV cache memory footprint directly determines how many concurrent requests fit on one card. For DeepSeek V2's size ($n_h = 128$, $d_h = 128$, $l = 60$), standard MHA caches $2 n_h d_h = 32{,}768$ elements per token per layer; across 60 layers that is ~2M elements/token, or ~4 MB/token in bf16. At a 32K context, a single sequence consumes ~128 GB, more than a single 80 GB H100 can hold. ...
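The KV cache arithmetic above can be checked with a few lines of Python. This is a minimal sketch, not code from the article: the helper name `mha_kv_cache_bytes` is hypothetical, bf16 is taken as 2 bytes per element, and "32K" is assumed to mean 32 × 1024 tokens.

```python
def mha_kv_cache_bytes(n_heads, head_dim, n_layers, seq_len, bytes_per_elem=2):
    """Bytes of KV cache for one sequence under standard MHA.

    Hypothetical helper for illustration; bytes_per_elem=2 assumes bf16.
    """
    # Each layer stores one K vector and one V vector per head, per token.
    elems_per_token_per_layer = 2 * n_heads * head_dim
    elems_per_token = elems_per_token_per_layer * n_layers
    return elems_per_token * bytes_per_elem * seq_len


# DeepSeek V2 sizes quoted in the text: n_h = 128, d_h = 128, l = 60.
per_layer_elems = 2 * 128 * 128
per_token_bytes = mha_kv_cache_bytes(128, 128, 60, seq_len=1)
per_seq_bytes = mha_kv_cache_bytes(128, 128, 60, seq_len=32 * 1024)

print(f"elements per token per layer: {per_layer_elems}")
print(f"bytes per token (all layers): {per_token_bytes}")
print(f"bytes per 32K-token sequence: {per_seq_bytes}")
print(f"fits in 80 GB: {per_seq_bytes <= 80e9}")
```

With decimal units the per-token figure is just under 4 MB and the per-sequence figure is just under 129 GB, in line with the rounded ~4 MB and ~128 GB quoted above.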

Qwen3.5 vs Qwen3: A Deep Architectural Comparison

Multi-Head Latent Attention: DeepSeek V2/V3 Engineering View