Qwen3.5 vs Qwen3: A Deep Architectural Comparison

Based on Qwen3.5 official technical documentation and code structure analysis.

[Interactive architecture comparison of Qwen3-VL and Qwen3.5: tab views (Qwen3-VL / Qwen3.5 / Compare), pan and zoom, click a node for parameter details.]

1. Attention Mechanism: A Fundamental Rework

This is the largest generational difference. Qwen3 uses standard Transformer attention, while Qwen3.5 introduces hybrid attention.

| Dimension | Qwen3 | Qwen3.5 |
|---|---|---|
| Attention type | Standard softmax attention | Hybrid: Gated DeltaNet (linear) + full attention |
| Layer ratio | All layers use full attention | 3:1 (three linear-attention layers per full-attention layer) |
| Complexity | O(L²·d) | O(L·d²), near-linear |
| KV cache | Stores all past KV pairs; grows linearly with sequence length | 75% of layers keep a fixed-size recurrent state S_t and cache no KV |
| Long-context decay | Present | Linear layers decay, but a full-attention layer every 4 layers performs a "context refresh" |
| Sequence parallelism | Supported | Not supported (the attention implementation is incompatible) |

1.1 Gated DeltaNet State Update

S_t = β_t ⊙ S_{t-1} + Δ_t ⊗ (K_t ⊗ V_t)

- β_t: gating parameter that controls memory retention vs. forgetting
- Δ_t: delta-update parameter that edits specific slots of the state rather than overwriting it wholesale
- The state has fixed size, O(1); it does not grow with sequence length

1.2 Layer Distribution Example (24-layer model)

Layer 0: linear_attention
Layer 1: linear_attention
Layer 2: linear_attention
Layer 3: full_attention  ← context refresh
Layer 4: linear_attention
Layer 5: linear_attention
Layer 6: linear_attention
Layer 7: full_attention  ← context refresh
... (pattern repeats with full_attention_interval=4)

Configuration parameters: ...
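To make the hybrid layer layout and the fixed-size recurrent state concrete, here is a minimal Python sketch. It is an illustration under assumptions, not the Qwen3.5 implementation: the tensor shapes, the elementwise gates, and the helper names (`layer_types`, `gated_delta_step`) are invented for this example; only the `full_attention_interval` name and the update S_t = β_t ⊙ S_{t-1} + Δ_t ⊗ (K_t ⊗ V_t) come from the excerpt above.

```python
# Illustrative sketch only: a 3:1 hybrid layer layout and a fixed-size
# recurrent state updated once per token. Shapes, gate handling, and helper
# names are assumptions for the example, not the Qwen3.5 implementation.
import torch


def layer_types(num_layers: int, full_attention_interval: int = 4):
    """Three linear-attention layers followed by one full-attention layer, repeated."""
    return [
        "full_attention" if (i + 1) % full_attention_interval == 0 else "linear_attention"
        for i in range(num_layers)
    ]


def gated_delta_step(S, k, v, beta, delta):
    """One update of the fixed-size state S: S_t = beta ⊙ S_{t-1} + delta ⊙ (k ⊗ v).

    S           : (d_k, d_v) state matrix; its size never depends on sequence length
    k, v        : (d_k,), (d_v,) key and value for the current token
    beta, delta : (d_k, d_v) gates in [0, 1]; beta controls retention vs. forgetting,
                  delta edits specific slots instead of overwriting the whole state
    """
    return beta * S + delta * torch.outer(k, v)


print(layer_types(8))
# ['linear_attention', 'linear_attention', 'linear_attention', 'full_attention',
#  'linear_attention', 'linear_attention', 'linear_attention', 'full_attention']

d_k, d_v = 4, 4
S = torch.zeros(d_k, d_v)
for _ in range(1000):  # process 1000 tokens; S stays (4, 4) throughout
    k, v = torch.randn(d_k), torch.randn(d_v)
    beta = torch.sigmoid(torch.randn(d_k, d_v))
    delta = torch.sigmoid(torch.randn(d_k, d_v))
    S = gated_delta_step(S, k, v, beta, delta)
print(S.shape)  # torch.Size([4, 4])
```

Because S stays (d_k, d_v) no matter how many tokens have been processed, the 75% of layers that use it contribute O(1) cache per layer, which is where the near-linear overall behavior in the table above comes from.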

April 29, 2026 · 3 min read · LexHsu

CORAL: Autonomous Multi-Agent Evolution for Open-Ended Discovery

Introduction Open-ended discovery—the search for novel, high-quality solutions in domains where the solution space lacks clear structure and evaluation may be expensive or sparse—remains one of the hardest challenges in automated scientific reasoning. Unlike constrained optimization, where gradients or convexity guide the search, open-ended problems demand sustained exploration, accumulation of partial insights, and the ability to redirect effort when progress stalls. Mathematical conjecture proving, systems-level code optimization, and combinatorial design all fall squarely in this category. ...

May 15, 2025 · 18 min read · LexHsu

Multi-Head Latent Attention: Efficient KV Cache Compression in DeepSeek-V2

Autoregressive language models based on the decoder-only Transformer architecture generate tokens sequentially, conditioning each prediction on all previously generated tokens. During inference, the key-value pairs of prior tokens must be retained to ensure coherence across the generation sequence. In the standard Multi-Head Attention (MHA) formulation, the size of this KV cache grows linearly with both the sequence length and the number of attention heads, creating a significant memory bottleneck that limits the maximum context length achievable on commodity hardware. Multi-Head Latent Attention (MLA), introduced in DeepSeek-V2, addresses this bottleneck through low-rank joint projection of the key and value representations, achieving KV cache sizes comparable to Grouped-Query Attention (GQA) while preserving — and in some cases exceeding — the modeling capacity of full MHA. ...
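As a rough illustration of the low-rank joint projection, here is a minimal sketch of the caching arithmetic. The weight and dimension names (`W_dkv`, `W_uk`, `W_uv`, `d_latent`) are placeholders chosen for the example, and the sketch omits details DeepSeek-V2 handles separately (notably the decoupled RoPE path and query compression); it only shows why caching one small shared latent per token is cheaper than caching full per-head keys and values.

```python
# Sketch of low-rank joint KV compression: cache one small latent per token,
# expand to per-head K/V only when attention is computed. Dimension and weight
# names are illustrative, not the DeepSeek-V2 implementation.
import torch

d_model, d_latent, n_heads, d_head = 1024, 128, 8, 64

W_dkv = torch.randn(d_model, d_latent) / d_model ** 0.5            # shared down-projection
W_uk  = torch.randn(d_latent, n_heads * d_head) / d_latent ** 0.5  # up-projection to keys
W_uv  = torch.randn(d_latent, n_heads * d_head) / d_latent ** 0.5  # up-projection to values


def compress_kv(h: torch.Tensor) -> torch.Tensor:
    """Hidden states (T, d_model) -> cached latent (T, d_latent)."""
    return h @ W_dkv


def expand_kv(c_kv: torch.Tensor):
    """Cached latent (T, d_latent) -> per-head keys and values (T, n_heads, d_head)."""
    T = c_kv.shape[0]
    k = (c_kv @ W_uk).view(T, n_heads, d_head)
    v = (c_kv @ W_uv).view(T, n_heads, d_head)
    return k, v


h = torch.randn(16, d_model)  # 16 previously generated tokens
c_kv = compress_kv(h)         # only this latent is cached
k, v = expand_kv(c_kv)        # reconstructed when attention is computed

# Per-token cache cost: d_latent floats here vs. 2 * n_heads * d_head for MHA.
print(c_kv.shape[1], 2 * n_heads * d_head)  # 128 vs. 1024
```

In practice the up-projections can be absorbed into the query and output projections at inference time, so full per-head K/V never need to be materialized; the sketch keeps the expansion explicit only for readability.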

February 15, 2025 · 13 min read · LexHsu