Qwen3.5 vs Qwen3: A Deep Architectural Comparison

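Chinese version: Read in Chinese

Figure from Qwen3.5-Omni Technical Report

Based on Qwen3.5 official technical documentation and code structure analysis.

Interactive Architecture Comparison

Below is an interactive architecture visualization of Qwen3-VL and Qwen3.5. It supports tab switching, drag-to-pan, and scroll-wheel zoom; click a node to view its details. Usage tips: click the tabs at the top to switch between the Qwen3-VL / Qwen3.5 / Compare views; scroll to zoom; drag to pan; click a node to view its parameter details. ...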

March 7, 2026 · 5 min · LexHsu

Multi-Head Latent Attention: DeepSeek V2/V3 Engineering View

Chinese version: Read in Chinese

This article takes the engineering perspective. The mathematical derivation of MLA (from RoPE to latent projection, the partial-RoPE compatibility proof, and the weight absorption algebra) is in /posts/mathematics/position-encoding/mla-from-rope/. This article does not repeat the math; it covers only the engineering numbers and design trade-offs that matter for DeepSeek V2/V3 deployment.

Figure from DeepSeek-V2: A Strong, Economical, and Efficient MoE Language Model

1. Why DeepSeek Chose MLA: Engineering Motivation

DeepSeek V2 / V3 [1] adopt MLA as the replacement for standard MHA, and the root motivation is deployment economics. In LLM serving, the KV cache memory footprint directly determines how many concurrent requests fit on one card. At DeepSeek V2's size ($n_h = 128$, $d_h = 128$, $l = 60$), standard MHA caches $2 n_h d_h = 32{,}768$ elements per token per layer; across 60 layers that is ~2M elements per token, or ~4 MB per token in bf16 (2 bytes per element). At a 32K-token context, a single sequence consumes ~128 GB, more than a single 80 GB H100 can hold. ...
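
The MHA cache figures quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, not code from the article: the function names are hypothetical, and it assumes that "32K" means 32,768 tokens and that bf16 takes 2 bytes per element.

```python
def kv_elems_per_token_per_layer(n_heads: int, head_dim: int) -> int:
    """Elements cached per token in one layer under standard MHA (keys + values)."""
    return 2 * n_heads * head_dim


def mha_kv_cache_bytes(n_heads: int, head_dim: int, n_layers: int,
                       seq_len: int, bytes_per_elem: int = 2) -> int:
    """Total KV cache size in bytes for one sequence under standard MHA."""
    per_token = kv_elems_per_token_per_layer(n_heads, head_dim) * n_layers
    return per_token * seq_len * bytes_per_elem


# DeepSeek V2 sizes quoted in the text: n_h = 128, d_h = 128, l = 60.
per_layer = kv_elems_per_token_per_layer(128, 128)      # elements per token per layer
per_token_bytes = mha_kv_cache_bytes(128, 128, 60, 1)   # bytes per token in bf16
total_bytes = mha_kv_cache_bytes(128, 128, 60, 32 * 1024)  # assumed 32K = 32,768 tokens

print(f"{per_layer:,} elements per token per layer")
print(f"{per_token_bytes / 1e6:.2f} MB per token")
print(f"{total_bytes / 1e9:.1f} GB per 32K-token sequence")
```

Under these assumptions the result is about 3.93 MB per token and about 128.8 GB per sequence (120 GiB), which matches the rounded figures in the text and exceeds the 80 GB of a single H100.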

September 13, 2025 · 3 min · LexHsu