MoE | Xu'Blog

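Based on the official Qwen3.5 technical documentation and an analysis of the code structure.

Interactive architecture comparison

[Interactive figure: architecture visualization of Qwen3-VL and Qwen3.5, with Qwen3-VL / Qwen3.5 / Compare views and per-node parameter details.]

1. Attention mechanism: a fundamental redesign

This is the largest generational difference. Qwen3 uses standard Transformer attention, while Qwen3.5 introduces hybrid attention (Hybrid Attention).

| Dimension | Qwen3 | Qwen3.5 |
| --- | --- | --- |
| Attention type | Standard softmax attention | Hybrid attention: Gated DeltaNet (linear) + Full Attention |
| Layer ratio | All layers are Full Attention | 3:1, i.e. every 3 linear-attention layers are followed by 1 full-attention layer |
| Complexity | O(L²·d) | O(L·d²), near-linear in sequence length |
| KV Cache | Stores every historical KV pair; grows linearly with sequence length | 75% of layers use a fixed-size recurrent state S_t and cache no KV |
| Long-context decay | Yes | Linear layers do decay, but a Full Attention layer every 4th layer performs a "context refresh" |
| Sequence parallelism | Supported | Not supported (the attention implementation is incompatible) |

1.1 Gated DeltaNet state update formula

    S_t = β_t ⊙ S_{t-1} + Δ_t ⊗ (K_t ⊗ V_t)

- β_t: gating parameter (controls how much memory is retained or forgotten)
- Δ_t: incremental update parameter (precisely modifies specific positions instead of overwriting the whole state)
- The state has a fixed size, O(1) with respect to sequence length; it does not grow as the sequence gets longer.

1.2 Layer distribution example (24-layer model)

    Layer 0: linear_attention
    Layer 1: linear_attention
    Layer 2: linear_attention
    Layer 3: full_attention   ← context refresh
    Layer 4: linear_attention
    Layer 5: linear_attention
    Layer 6: linear_attention
    Layer 7: full_attention   ← context refresh
    ... repeats (full_attention_interval=4)

Configuration parameters: ...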
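The 3:1 layer pattern from section 1.2 can be reproduced with a few lines of Python. This is only a sketch: the helper name `layer_types` is hypothetical, and the rule that every `full_attention_interval`-th layer is a full-attention layer is inferred from the listing in section 1.2 rather than taken from the Qwen3.5 code.

```python
def layer_types(num_layers: int, full_attention_interval: int = 4) -> list[str]:
    """Return the attention type of each layer, indexed from 0.

    Assumption: layer i is full attention when (i + 1) is a multiple of
    full_attention_interval, which matches layers 3 and 7 in the listing.
    """
    if full_attention_interval < 1:
        raise ValueError("full_attention_interval must be >= 1")
    return [
        "full_attention" if (i + 1) % full_attention_interval == 0 else "linear_attention"
        for i in range(num_layers)
    ]


layout = layer_types(24)
for i, kind in enumerate(layout[:8]):
    print(f"Layer {i}: {kind}")

# For 24 layers this gives 18 linear layers and 6 full-attention layers (75% / 25%).
print(layout.count("linear_attention"), layout.count("full_attention"))
```

With an interval of 4, three quarters of the layers are linear, which is where the "75% of layers" figure in the comparison table comes from.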
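To make the fixed-size state from section 1.1 concrete, here is a minimal pure-Python sketch of the simplified recurrence S_t = β_t ⊙ S_{t-1} + Δ_t ⊗ (K_t ⊗ V_t). It rests on several assumptions: β_t and Δ_t are treated as scalars, K_t ⊗ V_t is read as an outer product, the constants 0.9 and 0.1 are made up, and `gated_update` is a hypothetical helper. It follows the formula as written in this post, not necessarily the exact kernel used in the Qwen3.5 code.

```python
import random


def gated_update(S, k, v, beta, delta):
    """One step of S_t = beta * S_{t-1} + delta * outer(k, v).

    S is a d_k x d_v matrix stored as a list of rows; beta and delta are scalars.
    """
    return [
        [beta * S[i][j] + delta * k[i] * v[j] for j in range(len(v))]
        for i in range(len(k))
    ]


# Deterministic check on a tiny state.
S_small = gated_update([[0.0, 0.0], [0.0, 0.0]], [1.0, 0.0], [2.0, 3.0], beta=0.5, delta=1.0)
S_small = gated_update(S_small, [1.0, 0.0], [2.0, 3.0], beta=0.5, delta=1.0)

# Run the recurrence over a long sequence and compare memory footprints.
random.seed(0)
d_k, d_v, seq_len = 4, 8, 1000
S = [[0.0] * d_v for _ in range(d_k)]
for _ in range(seq_len):
    k = [random.gauss(0.0, 1.0) for _ in range(d_k)]
    v = [random.gauss(0.0, 1.0) for _ in range(d_v)]
    S = gated_update(S, k, v, beta=0.9, delta=0.1)  # hypothetical constant gates

state_entries = len(S) * len(S[0])        # stays d_k * d_v regardless of seq_len
kv_cache_entries = seq_len * (d_k + d_v)  # what a full-attention layer would keep
print(state_entries, kv_cache_entries)
```

The recurrent state holds d_k × d_v numbers however long the sequence is, while a KV cache for the same sequence grows with every token. That contrast is the point of the "KV Cache" row in the comparison table.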