Foundation Models

Qwen3.5 vs Qwen3: A Deep Architectural Comparison

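A Chinese version of this article is available. Figure from Qwen3.5-Omni Technical Report. Based on the official Qwen3.5 technical documentation and an analysis of the code structure. Interactive architecture comparison: below is an interactive architecture visualization of Qwen3-VL and Qwen3.5. Click the tabs at the top to switch between the Qwen3-VL / Qwen3.5 / Compare views, scroll to zoom, drag to pan, and click a node to view its parameter details. ...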

CORAL: Autonomous Multi-Agent Evolution for Open-Ended Discovery

Introduction. Figure from CORAL: Autonomous Multi-Agent Evolution for Open-Ended Discovery. Open-ended discovery, the search for novel, high-quality solutions in domains where the solution space lacks clear structure and evaluation may be expensive or sparse, remains one of the hardest challenges in automated scientific reasoning. Unlike constrained optimization, where gradients or convexity guide the search, open-ended problems demand sustained exploration, accumulation of partial insights, and the ability to redirect effort when progress stalls. Mathematical conjecture proving, systems-level code optimization, and combinatorial design all fall squarely in this category. ...

InSpatio-World: Real-Time 4D World Simulation via Spatiotemporal Autoregressive Modeling

Figure from InSpatio-World: Real-Time 4D World Simulation via Spatiotemporal Autoregressive Modeling. The ability to simulate a 4D world, one that evolves in time and can be viewed from arbitrary perspectives, is a foundational capability for autonomous driving, robotics, and embodied AI. Existing video generation models produce visually compelling sequences but lack spatial consistency when the camera moves. 3D reconstruction methods achieve geometric fidelity but struggle with dynamic scenes and real-time performance. InSpatio-World bridges this gap through a spatiotemporal autoregressive (STAR) architecture that combines the strengths of both paradigms. ...

Multi-Head Latent Attention: DeepSeek V2/V3 Engineering View

A Chinese version of this article is available. This article focuses on the engineering perspective. The mathematical derivation of MLA (from RoPE to latent projection, the partial-RoPE compatibility proof, and the weight-absorption algebra) is in /posts/mathematics/position-encoding/mla-from-rope/. This article does not repeat the math; it discusses only the engineering numbers and design trade-offs that matter for DeepSeek V2/V3 deployment. Figure from DeepSeek-V2: A Strong, Economical, and Efficient MoE Language Model. 1. Why DeepSeek Chose MLA: Engineering Motivation. DeepSeek V2 / V3 [1] adopt MLA as the replacement for standard MHA; the root motivation is deployment economics. In LLM serving, the KV cache memory footprint directly determines how many concurrent requests fit on one card. For DeepSeek V2's size ($n_h = 128$, $d_h = 128$, $l = 60$), standard MHA caches $2 n_h d_h = 32{,}768$ elements per token per layer; across 60 layers that is ~2M elements/token, or ~4 MB/token in bf16. At a 32K context, a single sequence consumes ~128 GB, more than a single 80 GB H100 can hold. ...
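The KV cache arithmetic in this excerpt can be checked with a few lines of Python. This is a minimal sketch, not code from the article: the function name `mha_kv_cache_bytes` and its parameters are hypothetical, and it assumes 32K means 32 × 1024 tokens, bf16 takes 2 bytes per element, and one key vector plus one value vector is cached per head per layer.

```python
def mha_kv_cache_bytes(n_heads, head_dim, n_layers, seq_len, bytes_per_elem=2):
    """KV cache size in bytes for one sequence under standard MHA.

    Hypothetical helper for illustration; bytes_per_elem=2 assumes bf16.
    """
    # Each token stores one key and one value vector per head, per layer.
    elems_per_token_per_layer = 2 * n_heads * head_dim
    elems_per_token = elems_per_token_per_layer * n_layers
    return elems_per_token * seq_len * bytes_per_elem


# Figures quoted in the excerpt: n_h = 128, d_h = 128, l = 60.
per_token = mha_kv_cache_bytes(128, 128, 60, seq_len=1)
per_sequence = mha_kv_cache_bytes(128, 128, 60, seq_len=32 * 1024)

print(f"per token:    {per_token / 1e6:.2f} MB")
print(f"per sequence: {per_sequence / 1e9:.1f} GB")
```

Under these assumptions the per-token figure comes out just under 4 MB and the 32K-token figure just under 129 GB (decimal units), which matches the rounded ~4 MB and ~128 GB quoted above.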