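[{"content":" Chinese version: Read the Chinese version\nFigure from Qwen3.5-Omni Technical Report\nBased on Qwen3.5 official technical documentation and code structure analysis.\nInteractive Architecture Comparison\nBelow is an interactive architecture visualization of Qwen3-VL and Qwen3.5. It supports tab switching, drag-to-pan, and scroll-wheel zoom; click a node to see its details.\nTips: click the tabs at the top to switch between the Qwen3-VL / Qwen3.5 / Compare views; scroll to zoom; drag to pan; click a node to view its parameters.\n1. Attention Mechanism: A Fundamental Redesign\nThis is the largest generational difference. Qwen3 uses standard Transformer attention, whereas Qwen3.5 introduces hybrid attention.\nDimension | Qwen3 | Qwen3.5\nAttention type | Standard softmax attention | Hybrid attention: Gated DeltaNet (linear) + full attention\nLayer ratio | All full attention | 3:1, that is, 3 linear-attention layers followed by 1 full-attention layer\nComplexity | O(L²·d) | O(L·d²), near-linear\nKV cache | Stores every historical KV pair and grows linearly with sequence length | 75% of layers keep a fixed-size recurrent state S_t and cache no KV\nLong-context decay | Present | Linear layers decay, but every fourth layer is a full-attention layer that performs a \u0026quot;context refresh\u0026quot;\nSequence parallelism | Supported | Not supported (incompatible attention implementation)\n1.1 Gated DeltaNet State Update Formula\nS_t = β_t ⊙ S_{t-1} + Δ_t ⊗ (K_t ⊗ V_t)\nβ_t = gating parameter (controls how much memory is retained or forgotten)\nΔ_t = incremental update parameter (precisely modifies specific positions instead of overwriting the whole state)\nThe state size is fixed, O(1), and does not grow with sequence length.\n1.2 Layer Layout Example (24-Layer Model)\nLayer 0: linear_attention\nLayer 1: linear_attention\nLayer 2: linear_attention\nLayer 3: full_attention ← context refresh\nLayer 4: linear_attention\nLayer 5: linear_attention\nLayer 6: linear_attention\nLayer 7: full_attention ← context refresh\n... repeats (full_attention_interval=4)\nConfiguration parameters:\n{ \u0026#34;num_hidden_layers\u0026#34;: 24, \u0026#34;layer_types\u0026#34;: [ \u0026#34;linear_attention\u0026#34;, \u0026#34;linear_attention\u0026#34;, \u0026#34;linear_attention\u0026#34;, \u0026#34;full_attention\u0026#34; ], \u0026#34;full_attention_interval\u0026#34;: 4 }\n1.3 KV Cache Memory Comparison\nSequence length | Pure full attention | Pure linear attention | Hybrid (3:1)\n32K | 8 GB | 256 MB | ~2.7 GB\n128K | 128 GB | 1 GB | ~34 GB\n262K | 512 GB | 2 GB | ~130 GB\nAt a 128K sequence length, the hybrid scheme cuts KV cache memory by 73%.\n1.4 Compute Comparison (24 Layers, 32K Sequence)\nStrategy | Full-attention layers | Linear-attention layers | Relative compute\nPure full attention | 24 | 0 | 100%\nPure linear attention | 0 | 24 | ~25%\nHybrid (interval=4) | 6 | 18 | ~44%\nThe hybrid scheme saves 56% of the compute while preserving model quality.\n2. 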
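Vision Encoder: From DeepStack Multi-Layer Injection to Joint Training\nDimension | Qwen3-VL | Qwen3.5\nVision encoder architecture | SigLIP2, 24 layers, patch_size=16, merge_size=2 | Identical\nDeepStack | deepstack_visual_indexes: [5, 11, 17] (injection at three layers) | deepstack_visual_indexes: [] (disabled)\nFusion architecture | Late Fusion (the ViT and the tokenizer encode independently, then the results are concatenated) | Still Late Fusion\nTraining strategy | ViT pre-training → LLM pre-training → alignment fine-tuning + DeepStack patch | Joint multimodal training from the first step of pre-training\nKey changes\nDeepStack removed: Qwen3-VL\u0026rsquo;s DeepStack extracts multi-scale features from ViT layers 5, 11, and 17 and injects them into the first 3 LLM layers by residual addition through 3 Mergers. It is an engineering patch for the fact that \u0026quot;LLM pre-training never sees visual tokens\u0026quot;. Qwen3.5 removes it entirely (deepstack_visual_indexes = []).\nEarly Training, not Early Fusion: Qwen3.5\u0026rsquo;s vision encoder has exactly the same architecture parameters as Qwen3-VL\u0026rsquo;s, and the visual tokenization pipeline is kept intact. The change is one of training strategy: vision and language data are fed in jointly from the pre-training stage and supervised with a joint loss. This does not amount to Early Fusion in the academic sense (modalities sharing a representation space at the lowest layers). When the model handles both modalities from the very first step, the attention in every LLM layer naturally learns cross-modal routing, and the DeepStack patch is no longer needed.\n3. Parameters Unique to the Linear-Attention Layers (New in Qwen3.5)\nThese SSM components are the core of Gated DeltaNet and are entirely absent from Qwen3:\nParameter | Role\nconv1d.weight | 1D convolution (kernel size=4); captures local dependencies and compensates for the weak local modeling of linear attention\nA_log | State transition matrix (stored in log form; -exp(A_log) is taken at load time for numerical stability)\ndt_proj (weight + bias) | Time-step gating projection; produces the dynamic gating parameters (the core of Gated DeltaNet\u0026rsquo;s adaptive memory update)\nD_proj | Residual/skip connection; strengthens gradient flow and improves training stability\nConfiguration parameters specific to linear attention:\nParameter | Description | Typical value\nlinear_conv_kernel_dim | 1D convolution kernel size | 4\nlinear_key_head_dim | Key head dimension | 128\nlinear_value_head_dim | Value head dimension | 128\nlinear_num_key_heads | Number of key heads (sets the upper bound on memory capacity) | 16\nlinear_num_value_heads | Number of value heads (sets the output dimension) | 16\n4. MoE Architecture Upgrade\nDimension | Qwen3-MoE | Qwen3.5-MoE\nSparsity | Basic MoE | Highly sparse MoE, activation ratio \u0026lt; 5% on the largest model\nRouting strategy | - | Top-8 routing, 64 experts + shared expert\nIntegration with attention | Standalone MoE | MoE combined closely with hybrid attention: the FFN uses MoE, attention uses the hybrid mechanism\nMemory efficiency | Standard | Memory footprint reduced by 60%\nMoE variants compared:\nModel | Total parameters | Active parameters | Activation ratio\nQwen3.5-35B-A3B | 35B | 3B | ~8.6%\nQwen3.5-122B-A10B | 122B | 10B | ~8.2%\nQwen3.5-397B-A17B | 397B | 17B | ~4.3%\nThe combination of highly sparse MoE and hybrid attention lets a very large model (397B) run inference efficiently with only 17B active parameters, sharply reducing memory and compute cost.\n5. Positional Encoding Changes\nDimension | Qwen3 | Qwen3.5\nRoPE coverage | Standard ratio | partial_rotary_factor: 0.25; RoPE is applied to only 25% of each attention head\u0026rsquo;s dimensions\nMaximum context | 256K | 1M tokens\nM-RoPE | Must distinguish image/video tokens | Same requirement, plus a new mm_token_type_ids field (image=1, video=2)\nRoPE is applied to only 25% of each attention head\u0026rsquo;s dimensions, so the remaining 75% of the dimensions in every head carry no positional encoding. This fits the hybrid attention architecture: the linear-attention layers need no positional encoding of their own, and the full-attention layers need only part of each head to carry positional information in order to maintain quality over ultra-long contexts of up to 1M tokens.\n6. 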
Tool Calling Format Changes\nDimension | Qwen3 | Qwen3.5\nFormat | JSON: {\u0026quot;name\u0026quot;: \u0026quot;...\u0026quot;, \u0026quot;arguments\u0026quot;: {...}} | XML: \u0026lt;function=name\u0026gt;\u0026lt;parameter=key\u0026gt;value\u0026lt;/parameter\u0026gt;\u0026lt;/function\u0026gt;\nAdvantage | Structured and easy to parse programmatically | Closer to natural language, so the model generates it more fluently\n7. Summary of the Generational Architecture Evolution\nflowchart LR\nsubgraph Qwen3[\"Qwen3\"]\nA1[\"Standard Softmax Attention\"]\nA2[\"Bolt-on vision encoder\"]\nA3[\"DeepStack Merger\"]\nA4[\"Basic MoE\"]\nA5[\"Standard RoPE\"]\nA6[\"JSON Tool Calling\"]\nend\nsubgraph Qwen35[\"Qwen3.5\"]\nB1[\"Hybrid attention (Gated DeltaNet + Full Attention)\"]\nB2[\"Joint multimodal training\"]\nB3[\"Removed, simpler vision architecture\"]\nB4[\"Highly sparse MoE (activation ratio \u003c 5%)\"]\nB5[\"Partial RoPE (25%) + 1M context\"]\nB6[\"XML Tool Calling\"]\nend\nA1 --\u003e B1\nA2 --\u003e B2\nA3 --\u003e B3\nA4 --\u003e B4\nA5 --\u003e B5\nA6 --\u003e B6\nQwen3.5\u0026rsquo;s core design philosophy can be summed up as trading structural innovation for efficiency. Hybrid attention preserves quality at roughly 44% of the compute (a 56% saving), highly sparse MoE drives the largest model with an activation ratio \u0026lt;5%, and Partial RoPE underpins the 1M context. Each of these substantially lowers inference cost without sacrificing capability, and in some cases improves it.\nReferences\nSome arXiv IDs in these references are 2026 placeholder numbers; the links will be updated once the papers are officially released.\nQwen Team, 2026. Qwen3.5-Omni Technical Report. arXiv:2604.15804\nYang, S. et al., 2024. Gated Delta Networks: Improving Mamba2 with Delta Rule. arXiv:2412.06464\nSu, J. et al., 2021. RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864\nZhai, X. et al., 2023. Sigmoid Loss for Language Image Pre-Training (SigLIP). arXiv:2303.15343\nMeng, L. et al., 2024. DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs. 
arXiv:2406.04334\nRelated Concepts\nPartial RoPE geometry: the geometric foundation of the position-encoding decoupling used by Qwen3.5, see /posts/mathematics/position-encoding/rope-geometry/\nMLA vs Hybrid Attention: Qwen3.5\u0026rsquo;s hybrid linear attention and DeepSeek-V2\u0026rsquo;s MLA pursue different KV cache reduction routes, see /posts/mathematics/position-encoding/mla-from-rope/ ","permalink":"https://xuquant.com/en/posts/foundation-models/qwen3-vs-qwen3-5-architecture/","summary":"\u003cblockquote\u003e\n\u003cp\u003eChinese version: \u003ca href=\"/posts/foundation-models/qwen3-vs-qwen3-5-architecture.zh/\"\u003eRead the Chinese version\u003c/a\u003e\u003c/p\u003e\n\u003c/blockquote\u003e\n\u003cp\u003e\u003cpicture\u003e\n  \u003csource srcset=\"/images/paper-figures/qwen3.5-arch.webp\" type=\"image/webp\"\u003e\n  \u003cimg src=\"/images/paper-figures/qwen3.5-arch.png\" alt=\"Qwen3.5-Omni Architecture\" loading=\"lazy\" decoding=\"async\"\u003e\n\u003c/picture\u003e\n\u003cem\u003eFigure from \u003ca href=\"https://arxiv.org/abs/2604.15804\"\u003eQwen3.5-Omni Technical Report\u003c/a\u003e\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003eBased on Qwen3.5 official technical documentation and code structure analysis.\u003c/p\u003e\n\u003ch2 id=\"交互式架构对比\"\u003eInteractive Architecture Comparison\u003c/h2\u003e\n\u003cp\u003eBelow is an interactive architecture visualization of Qwen3-VL and Qwen3.5. It supports tab switching, drag-to-pan, and scroll-wheel zoom; click a node to see its details.\u003c/p\u003e\n\u003ciframe src=\"/qwen-arch-compare.html\" width=\"100%\" height=\"680\" style=\"border:1px solid #1e2d4a; border-radius:8px; margin:16px 0;\" loading=\"lazy\"\u003e\u003c/iframe\u003e\n\u003cblockquote\u003e\n\u003cp\u003e\u003cstrong\u003eTips\u003c/strong\u003e: click the tabs at the top to switch between the Qwen3-VL / Qwen3.5 / Compare views; scroll to zoom; drag to pan; click a node to view its parameters.\u003c/p\u003e","title":"Qwen3.5 vs Qwen3: A Deep Architectural Comparison"},{"content":"Introduction\nFigure from CORAL: Autonomous Multi-Agent Evolution for Open-Ended Discovery\nOpen-ended discovery\u0026mdash;the search for novel, high-quality solutions in domains where the solution space 
lacks clear structure and evaluation may be expensive or sparse\u0026mdash;remains one of the hardest challenges in automated scientific reasoning. Unlike constrained optimization, where gradients or convexity guide the search, open-ended problems demand sustained exploration, accumulation of partial insights, and the ability to redirect effort when progress stalls. Mathematical conjecture proving, systems-level code optimization, and combinatorial design all fall squarely in this category.\nThe emergence of large language model (LLM)-driven evolutionary search has begun to change what is possible. FunSearch (Romera-Paredes et al., 2024) demonstrated that an LLM could mutate programs evolved over a population, discovering new results in combinatorics and combinatorial optimization. AlphaEvolve (Novak et al., 2025) extended this idea with MAP-Elites archiving and island-model parallelism, achieving notable advances in matrix multiplication and graph algorithms. Yet both systems share a fundamental limitation: the search itself is governed by fixed heuristics. The choice of which parent to mutate, how to construct the mutation prompt, when to evaluate, and what knowledge to carry forward are all determined by pre-written rules. The LLM functions as a proposal engine embedded in a rigid loop; it cannot decide to run a local test before submitting, nor can it pause to write down an insight for later reuse.\nThe core insight behind CORAL (Qu et al., 2026) is that delegating more search decisions to autonomous agents, rather than pre-defining them as fixed procedures, unlocks substantially better performance. Where FunSearch hard-codes a selection rule, a CORAL agent decides what to read based on its own reasoning. Where AlphaEvolve invokes the evaluator after every proposal, a CORAL agent may choose to validate locally first, iterate on a draft, and only call the external evaluator when confidence is high. 
Where traditional evolutionary search discards knowledge between runs, CORAL agents accumulate observations, strategies, and reusable tools in a shared memory that persists across evaluations and agents.\nCORAL introduces three mechanisms that make this autonomy practical at scale: shared persistent memory provides a filesystem-based knowledge repository that all agents read from and write to; asynchronous multi-agent organization enables $N$ agents to explore in parallel without any direct message passing; and heartbeat-based interventions inject structured reflection, consolidation, and redirection prompts at configurable intervals, preventing agents from getting stuck in unproductive loops. Evaluated across eleven tasks spanning mathematical optimization and systems engineering, CORAL achieves the best final score on every task and establishes eight new state-of-the-art results. Its improvement rate exceeds that of fixed evolutionary baselines by 3\u0026ndash;10×, and it typically converges in 5\u0026ndash;20 evaluations where baselines require 60\u0026ndash;100. On Anthropic\u0026rsquo;s kernel engineering benchmark, four co-evolving agents lower the best known score from 1,363 to 1,103 cycles\u0026mdash;a 19% improvement\u0026mdash;without any web search.\nProblem Formulation\nAn open-ended discovery task is defined as a pair $(x, E)$, where $x$ is a task description and $E$ is an evaluator function. For a candidate solution $y$, the evaluator returns a score and optional feedback:\n$$E(x, y) := (s, f)$$\nHere $s$ is a scalar score (to be maximized or minimized depending on the task) and $f$ is auxiliary feedback, which may take the form of sub-score decompositions, textual criticism from an LLM-based judge, or execution traces. 
The feedback signal is richer than a single number, but it is not a gradient; it does not directly indicate how to improve $y$.\nEach improvement step in CORAL follows a four-stage cycle:\n1. RETRIEVE: Construct a working context $\\hat{\\mathcal{M}}_t$ from shared persistent memory $\\mathcal{M}_t$, selecting relevant prior attempts, notes, and skills.\n2. PROPOSE: Generate a candidate solution $y_{t+1}$ conditioned on the task $x$ and the retrieved context $\\hat{\\mathcal{M}}_t$.\n3. EVALUATE: Obtain score and feedback $(s_{t+1}, f_{t+1}) = E(x, y_{t+1})$ from the external evaluator.\n4. UPDATE: Integrate the new information into shared persistent memory, producing $\\mathcal{M}_{t+1}$.\nThe critical question is who decides at each stage. The table below contrasts three paradigms:\nStage | Fixed Evolutionary Search | Autonomous Single-Agent | Autonomous Multi-Agent\nRetrieve | Fixed selection rule (e.g., top-$k$ by score) | Agent decides what to read | Each agent independently decides\nPropose | Single LLM forward pass per candidate | Agent may iterate, test locally, refine | Multiple agents explore in parallel\nEvaluate | External call after every proposal | Agent decides when to call evaluator | Shared evaluator, agents decide timing\nUpdate | Fixed rule (e.g., replace worst in population) | Agent decides what knowledge to write | Asynchronous writes to shared memory\nCommunication | None | None | Indirect, via shared persistent memory\nIn fixed evolutionary search, the LLM is a passive proposal engine. In autonomous single-agent evolution, it becomes an active optimizer that plans its own search trajectory. 
In autonomous multi-agent evolution, multiple such optimizers collaborate implicitly through a shared knowledge store, achieving both diversity and cumulative progress.\nCore Mechanisms 3.1 Shared Persistent Memory CORAL organizes shared knowledge as a filesystem with three root directories, each mapped into every agent\u0026rsquo;s workspace via symbolic links:\nattempts/ stores historical evaluations. Each attempt is a JSON record keyed by commit hash, containing the solution snapshot, score, status (improved / baseline / regressed / crashed / timeout), parent hash, timestamp, and evaluator feedback. Agents browse high-performing solutions, compare approaches, and trace lineage through the parent_hash field.\nnotes/ captures observations, learnings, and reflections in Markdown files with YAML frontmatter. Agents decide what to record and where to file it. Special subdirectories support collective knowledge: _synthesis/ holds cross-cutting summaries produced by consolidation heartbeats, _connections.md maps patterns across categories, and _open-questions.md tracks unresolved gaps and contradictions. On demanding tasks like kernel engineering, agents create directories such as \u0026ldquo;what NEVER worked\u0026rdquo; to catalog dead ends\u0026mdash;a practice that emerges organically rather than being prescribed.\nskills/ records reusable procedures, tools, and scripts. Each skill consists of a natural-language description (SKILL.md) paired with executable artifacts (functions and example scripts). The system provides a built-in skill_creator skill that guides agents through the create-test-refine workflow for producing new skills.\nThe filesystem-as-memory design has practical consequences. Agents access memory through CLI tools (coral notes, coral skills) or direct Bash file reads, both of which are natural operations for code agents. 
Concurrency safety comes from atomic writes using the temp-file-then-rename pattern, and because each attempt writes to a uniquely named file (keyed by commit hash), no explicit locking is needed. Git version control on the shared directory provides an audit trail: every Attempt record includes a shared_state_hash linking it to the exact memory snapshot at evaluation time.\n3.2 Asynchronous Multi-Agent Organization\n$N$ agents run asynchronously, each maintaining its own local context $\\mathcal{C}_t^{(i)}$ and operating in an isolated workspace. Each agent has an independent git worktree on its own branch with its own Python virtual environment (.venv). The worktrees share the underlying repository object database, keeping disk usage modest, while branch isolation ensures that one agent\u0026rsquo;s experimental changes never corrupt another\u0026rsquo;s workspace.\nThe defining design choice is that agents do not communicate directly. Coordination happens entirely through shared persistent memory: when agent $i$ writes an attempt, note, or skill to $\\mathcal{M}$, agent $j$ may retrieve it when constructing $\\hat{\\mathcal{M}}_t^{(j)}$ in a subsequent step. This indirect coordination has three desirable properties. First, it increases exploration diversity: each agent follows its own reasoning about what to try next, producing search trajectories that overlap only partially (pairwise Jaccard similarity of 0.31\u0026ndash;0.43 on pressure-test tasks). Second, it enables shared accumulation: discoveries by one agent\u0026mdash;whether a high-scoring solution or a useful insight\u0026mdash;become immediately available to all others. Third, it eliminates the need for any message-passing protocol, avoiding the engineering complexity of queuing, ordering, and consensus that plagues explicit communication frameworks.\nA critical security measure is evaluator separation. 
The grader code resides in .coral/private/eval/, which is excluded from the agent\u0026rsquo;s worktree via gitignore rules and directory permissions. Agents cannot read the scoring function, preventing them from gaming the evaluator\u0026mdash;a real concern when agents are powerful enough to manipulate code.\n3.3 Heartbeat Mechanism Left to their own devices, autonomous agents can fall into unproductive patterns: obsessing over minor variations of the same approach, failing to recognize when a strategy is exhausted, or neglecting to consolidate scattered insights. CORAL\u0026rsquo;s heartbeat mechanism addresses this through structured interventions that inject targeted prompts at configurable intervals. Three heartbeat types are defined in the paper, with a fourth discovered in the codebase:\nReflect (every 1 evaluation, local scope). After each evaluation, the agent is prompted to anchor its reflection on the concrete result, check for unexpected findings, analyze causes, assess confidence, and plan the next experiment. This keeps the agent grounded in evidence rather than drifting into speculation.\nConsolidate (every 10 evaluations, global scope). At regular intervals, the system prompts the agent to synthesize knowledge across all agents\u0026rsquo; work, producing entries in notes/_synthesis/, updating notes/_connections.md with cross-category patterns, and refreshing notes/_open-questions.md with current gaps. Consolidation heartbeats ensure that partial discoveries do not languish in isolation.\nPivot (triggered after 5 consecutive evaluations without improvement, local scope). When an agent\u0026rsquo;s progress plateaus, the pivot heartbeat directs it to diagnose the ceiling, study what distinguishes top performers, select a new direction (a different algorithm family, problem formulation, or representation), restart from a strong foundation, and quickly submit a minimal implementation. 
Plateau detection tracks evals_since_improvement with a cooldown mechanism to avoid triggering on every marginal regression.\nLint Wiki (every 10 evaluations, global scope). This fourth heartbeat, present in the codebase but not discussed in the paper, spawns a librarian sub-agent that performs health checks on the knowledge base: detecting contradictions between notes, deduplicating overlapping entries, and restructuring the directory hierarchy. It reflects CORAL\u0026rsquo;s philosophy that knowledge quality requires active maintenance, not just passive accumulation.\nThe triggering mechanism is worth understanding in detail. The agent manager runs a monitoring loop that polls .coral/public/attempts/ every 5 seconds. For each new attempt, it updates per-agent tracking state: local evaluation count, best score, and consecutive evaluations without improvement. When a heartbeat condition is met, the manager sends a SIGINT to interrupt the agent\u0026rsquo;s current operation, composes a combined prompt (original task context plus heartbeat directive), and resumes the agent with this enriched context. Agents can also create heartbeat events themselves using coral heartbeat set, giving them a degree of self-regulation.\nImplementation Insights from Code The paper presents CORAL\u0026rsquo;s mechanisms at a conceptual level. Reading the source code reveals several design decisions that are essential to making autonomous multi-agent evolution work in practice but receive little or no discussion in the paper itself. This section documents those insights.\n4.1 The Filesystem as Message Bus CORAL\u0026rsquo;s most striking architectural choice is the absence of any centralized coordination service. All inter-agent communication flows through the filesystem: agents read and write files in the shared memory directory, and the manager monitors the filesystem to trigger heartbeats. There is no message queue, no RPC framework, no database. This design is both simple and robust. 
Atomic writes via the temp-file-then-rename pattern guarantee that no agent ever reads a partially written file. Git versioning on the shared directory means that every state change is auditable and revertible. The Attempt record\u0026rsquo;s shared_state_hash field creates a snapshot link between each evaluation and the memory state at that moment, enabling post-hoc analysis of exactly what information was available to each agent.\n4.2 Crash Recovery as a First-Class Concern Autonomous agents crash. They run out of context windows, encounter Python import errors, or produce output that the LLM cannot parse. CORAL treats crash recovery as a first-class concern rather than an afterthought. The exit classifier categorizes every agent termination into three types: clean (normal exit), no_result (ran but produced no evaluation), and session_error (crash or timeout). A crash circuit breaker monitors failure frequency: if three crashes occur within a short window, the system pauses the agent for five minutes before restarting, preventing rapid crash loops from wasting API credits. An important nuance is the evaluator queue exemption: if an agent is waiting for an evaluator response, the manager does not count this as a stall, avoiding false-positive kill signals during long-running evaluations.\nThe evaluator itself runs as an independent subprocess with a hard timeout (default 300 seconds, configurable per task). If the grader does not return within the limit, it is killed with SIGKILL\u0026mdash;no graceful shutdown, no chance to hang indefinitely. This hard boundary ensures that a buggy or adversarial solution cannot monopolize evaluation resources.\n4.3 The Agent-as-Optimizer Philosophy The CORAL.md template file, injected into every agent\u0026rsquo;s workspace as its primary instruction set, encodes a distinctive philosophy about how agents should approach search. 
Three directives stand out:\n\u0026ldquo;Eval early and often.\u0026rdquo; The template urges agents to submit solutions for external evaluation rather than over-optimizing locally. The reasoning is that the external evaluator provides the only reliable signal; local tests may be incomplete or misleading.\n\u0026ldquo;Bias toward speed.\u0026rdquo; A rough but evaluable solution is preferable to a perfect but untested one. This directive combats the tendency of LLM agents to refine indefinitely without ever checking whether their refinements actually improve the score.\n\u0026ldquo;Every eval should produce at least one note or skill update.\u0026rdquo; Knowledge accumulation is not optional. Even a failed evaluation should generate an insight\u0026mdash;what was tried, why it failed, what should be avoided next time. This rule ensures that the shared memory grows monotonically, benefiting all agents.\nGit operations are entirely managed by the framework. Agents never run git commit or git add directly; instead, they call coral eval -m \u0026quot;message\u0026quot;, which stages all changes, commits, evaluates, and records the attempt atomically. This prevents agents from accidentally corrupting the repository state and ensures that every evaluation corresponds to a clean commit.\n4.4 The Sub-Agent System CORAL deploys specialized sub-agents for tasks that benefit from focused expertise:\nDeep-researcher performs structured literature review. When the warm-start option is enabled, this sub-agent surveys relevant web resources before the main agent begins coding, providing an initial knowledge base that accelerates early progress.\nLibrarian conducts knowledge base health checks during lint-wiki heartbeats. 
It scans shared notes for contradictions (e.g., two notes claiming opposite conclusions about the same technique), identifies redundant entries covering the same ground, and restructures the directory hierarchy when the organization becomes unwieldy.\nSkill-creator is a meta-skill: it guides agents through the process of creating, testing, and refining new skills. When an agent discovers a reusable procedure\u0026mdash;say, a particular code transformation pattern that consistently reduces cycle count in kernel engineering\u0026mdash;it can invoke the skill-creator to package this procedure into a properly documented and tested skill that other agents can discover and apply.\nExperimental Analysis\n5.1 Single-Agent Results\nCORAL was evaluated on eleven tasks: six mathematical optimization problems (circle packing, signal processing, Erdős minimum overlap, MMD-16-2, MMD-14-3, 3rd-order autocorrelation inequality) and five systems optimization problems (EPLB, PRISM, LLM-SQL, transaction scheduling, Cloudcast). All results are averaged over four independent trials with a budget of 3 hours wall-clock time or 100 iterations (whichever is longer).\nSingle-agent CORAL achieves the best final score on all eleven tasks against three baselines\u0026mdash;OpenEvolve, ShinkaEvolve, and EvoX (the strongest competitor with its meta-evolutionary search strategy)\u0026mdash;and establishes new state-of-the-art results on eight. The improvement rate, defined as the fraction of evaluations that produce a strictly better score, is 3\u0026ndash;10× higher than that of the baselines across tasks. Perhaps more striking is the evaluation efficiency: CORAL typically converges in 5\u0026ndash;20 evaluations where baselines require 60\u0026ndash;100. On circle packing, CORAL matches the SOTA in just 11 evaluations (OpenEvolve needs 100); on MMD-16-2, it reaches the known optimum in 6 evaluations (EvoX requires 18).\nThe efficiency gain is not accidental. 
Because autonomous agents can validate locally before calling the external evaluator, a significant fraction of submissions are already pre-screened. On kernel engineering, 57% of evaluations are preceded by a local test, and 47% of locally tested submissions produce an improvement. The agent is not guessing; it is making informed proposals.\n5.2 Multi-Agent Results The multi-agent setting reveals CORAL\u0026rsquo;s most impressive results. On Anthropic\u0026rsquo;s kernel engineering task, four co-evolving agents (Claude Code with Opus 4.6) achieve 1,103 cycles, compared to 1,350 for single-agent CORAL and 2,740 for OpenEvolve. The four agents collectively produce 596 evaluations with a 9% improvement rate. Cross-pollination is critical: 66% of new records originate from a cross-agent parent\u0026mdash;a solution proposed by one agent that another agent picks up and improves. On polyominoes packing, four agents reach 84.2% coverage (versus 80.2% for single-agent), and with web search enabled, CORAL attains 89.4%, surpassing the prior SOTA of 87%.\nThe multi-agent advantage is not limited to proprietary models. Using the fully open-source stack (OpenCode + MiniMax M2.5), four-agent CORAL consistently outperforms its single-agent counterpart across all mathematical and systems tasks, with gains ranging from 0.15% to 20.8%.\n5.3 Why Autonomous Evolution Works Three mechanisms explain the performance gap:\nLocal verification. Agents test solutions locally before submitting them for external evaluation. The local test rate varies by task: 57% on kernel engineering (where compilation and cycle counting can be done locally), 61% on transaction scheduling, but 0% on PRISM (where the evaluator generates random test cases that cannot be reproduced locally). Where local testing is feasible, it acts as a high-pass filter, catching compilation failures and obvious regressions before they consume evaluation budget.\nKnowledge accumulation. 
On demanding tasks, agents create 0.55\u0026ndash;0.68 knowledge artifacts per attempt, compared to 0.05 on standard tasks. This tenfold difference reflects a qualitative shift: on standard tasks, notes tend to be lightweight progress logs (\u0026ldquo;tried parameter X, got score Y\u0026rdquo;), while on hard tasks they capture reusable insights (\u0026ldquo;identified VALU architecture bottleneck at depth-0 XOR; switching to per-lane ALU saves 64 VALU at cost of 512 ALU\u0026rdquo;). Knowledge access correlates with improvement: on kernel engineering, 55% of evaluations that access prior knowledge produce an improvement, versus 9% overall.\nCross-agent information transfer. In the four-agent kernel engineering run, 36% of attempts use another agent\u0026rsquo;s commit as a parent. Cross-agent parents achieve a 17% improvement rate (versus 9% overall), and 66% of new records trace back to cross-agent lineages. The transfer modes differ by task: kernel engineering favors direct code reuse (agents copy and modify promising commits), while polyominoes packing favors knowledge transfer (87% of rounds reference another agent\u0026rsquo;s notes or skills). These complementary patterns emerge organically from the shared memory architecture.\n5.4 Ablation Studies Disabling knowledge accumulation (removing notes and skills) degrades performance on all tested tasks. The effect is largest on kernel engineering, where scores regress from 1,350 to 1,601 cycles\u0026mdash;an 18.6% setback. On polyominoes and transaction scheduling, the regression is 3.6% and 2.7% respectively. Knowledge is not a nice-to-have; it is load-bearing.\nSeparating co-evolution from independent execution isolates the value of shared memory. Four co-evolving agents achieve 1,103 cycles on kernel engineering, while the best of four independently run agents reaches only 1,180\u0026mdash;a 6.5% gap that cannot be explained by additional computation alone. On polyominoes, the gap is 4.2%. 
The shared memory enables a form of soft coordination that makes the whole greater than the sum of its parts.\nDiscussion 6.1 From Fixed to Autonomous Evolution Fixed evolutionary search treats the LLM as a sophisticated mutator embedded in a rigid loop. The search strategy\u0026mdash;what to mutate, how to select parents, when to evaluate\u0026mdash;is determined entirely by hand-coded heuristics. This approach works when the problem structure aligns with the heuristics, but it fundamentally limits the LLM\u0026rsquo;s capacity for planning and strategic decision-making. CORAL\u0026rsquo;s shift to autonomous evolution asks a different question: rather than \u0026ldquo;how should we orchestrate the LLM?\u0026rdquo;, it asks \u0026ldquo;what decisions should we let the LLM make for itself?\u0026rdquo;\nThe answer, it turns out, is \u0026ldquo;most of them.\u0026rdquo; Agents that can choose what to read, when to test, what to record, and when to pivot consistently outperform fixed procedures. The heartbeat mechanism is key to making this work: it provides soft guidance (reflection prompts, consolidation triggers, redirection cues) rather than hard constraints (fixed selection rules, mandatory evaluation after every proposal). The agent retains autonomy over its search trajectory while benefiting from periodic nudges that prevent common failure modes.\n6.2 Implicit Protocols in Multi-Agent Collaboration CORAL\u0026rsquo;s multi-agent organization deliberately avoids explicit communication. There is no message-passing protocol, no shared plan, no role assignment. Yet the agents develop what can be called implicit protocols: patterns of coordination that emerge from shared memory access. The Jaccard similarity of 0.31\u0026ndash;0.43 between agents\u0026rsquo; attempted strategies indicates that more than half of each agent\u0026rsquo;s search vocabulary is unique, providing genuine exploration diversity. 
At the same time, the 36% cross-agent parent rate on kernel engineering shows that agents are effectively building on each other\u0026rsquo;s discoveries. The result is a system that combines the breadth of independent exploration with the depth of shared accumulation.\nThis horizontal-parallel architecture contrasts with vertical-sequential frameworks like MetaGPT, where agents assume fixed roles (product manager, architect, engineer) and pass artifacts through a predefined pipeline. For open-ended problems, where the optimal division of labor is unknown in advance, horizontal parallelism with implicit coordination is more appropriate: it allows the discovery process itself to determine what each agent should work on.\n6.3 Limitations\nCORAL\u0026rsquo;s approach carries three notable limitations. First, it depends on frontier foundation models capable of handling complex coding agent workflows; single-agent runs cost approximately $30\u0026ndash;60 per 3-hour session, and four-agent runs are roughly four times that amount. Deploying on smaller, locally-hostable models remains an open challenge. Second, all agents are initialized identically, with the same task prompt and the same CORAL.md instructions. Injecting heterogeneous personalities, roles, or private information at initialization could further increase exploration diversity, but how to do this systematically is not yet understood. Third, the framework assumes the availability of a reliable evaluator. For problems where evaluation is itself expensive, incomplete, or ambiguous\u0026mdash;a common situation in real-world scientific discovery\u0026mdash;the evaluator may need to co-evolve with the solution, a direction that CORAL does not currently explore.\nReferences\nSome arXiv IDs in these references are 2026 placeholder numbers; the links will be updated once the papers are officially released.\n- Romera-Paredes, B. et al., 2024. Mathematical discoveries from program search with large language models. Nature, 625, pp.468\u0026ndash;475.\n- Novak, R. et al., 2025. 
AlphaEvolve: A coding agent for scientific and algorithmic discovery. Google DeepMind Technical Report.\n- Sharma, R., 2025. OpenEvolve: An open-source implementation of evolutionary search with LLMs. GitHub Repository.\n- Lange, R. et al., 2025. ShinkaEvolve: Accelerating evolutionary search with diversity-guided sampling. Preprint.\n- Liu, Y. et al., 2026. EvoX: Meta-evolutionary search for open-ended discovery. Preprint.\n- Qu, A., Zheng, H., Zhou, Z. et al., 2026. CORAL: Towards autonomous multi-agent evolution for open-ended discovery. arXiv:2604.01658.\nRelated Concepts RL fundamentals — the autonomous decision view of explore-exploit trade-off in RL terms: see https://xuquant.com/en/posts/autonomous-driving/basic_rl/ Multi-agent reasoning with visual primitives — DeepSeek\u0026rsquo;s \u0026ldquo;thinking with visual primitives\u0026rdquo; exhibits a parallel self-directed reasoning path: see /posts/foundation-models/deepseek-thinking-with-visual-primitives/ ","permalink":"https://xuquant.com/en/posts/foundation-models/coral-autonomous-multi-agent-evolution/","summary":"\u003ch2 id=\"introduction\"\u003eIntroduction\u003c/h2\u003e\n\u003cp\u003e\u003cpicture\u003e\n  \u003csource srcset=\"/images/paper-figures/coral-arch.webp\" type=\"image/webp\"\u003e\n  \u003cimg src=\"/images/paper-figures/coral-arch.png\" alt=\"CORAL Architecture\" loading=\"lazy\" decoding=\"async\"\u003e\n\u003c/picture\u003e\n\u003cem\u003eFigure from \u003ca href=\"https://arxiv.org/abs/2604.01658\"\u003eCORAL: Autonomous Multi-Agent Evolution for Open-Ended Discovery\u003c/a\u003e\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003eOpen-ended discovery\u0026mdash;the search for novel, high-quality solutions in domains where the solution space lacks clear structure and evaluation may be expensive or sparse\u0026mdash;remains one of the hardest challenges in automated scientific reasoning. 
Unlike constrained optimization, where gradients or convexity guide the search, open-ended problems demand sustained exploration, accumulation of partial insights, and the ability to redirect effort when progress stalls. Mathematical conjecture proving, systems-level code optimization, and combinatorial design all fall squarely in this category.\u003c/p\u003e","title":"CORAL: Autonomous Multi-Agent Evolution for Open-Ended Discovery"},{"content":" Figure from InSpatio-World: Real-Time 4D World Simulation via Spatiotemporal Autoregressive Modeling\nThe ability to simulate a 4D world \u0026mdash; one that evolves in time and can be viewed from arbitrary perspectives \u0026mdash; is a foundational capability for autonomous driving, robotics, and embodied AI. Existing video generation models produce visually compelling sequences but lack spatial consistency when the camera moves. 3D reconstruction methods achieve geometric fidelity but struggle with dynamic scenes and real-time performance. InSpatio-World bridges this gap through a spatiotemporal autoregressive (STAR) architecture that combines the strengths of both paradigms.\nThis article provides a detailed technical analysis based on the paper (arXiv:2604.07209) and the open-source implementation.\nInteractive Demo The following viewer shows the complete pipeline output for a circular orbit trajectory. Three videos play in sync: the original input, the geometric rendering condition, and the predicted novel view.\nControls: Play/pause all videos simultaneously. Drag the timeline to seek. Speed control: 0.5x\u0026ndash;2.0x. Keyboard: Space = play/pause, Arrow keys = frame step.\n1. The Core Problem: Why Not Just Generate Video? Video generation models (Sora, Wan, CogVideo) produce temporally coherent frames but have no notion of 3D geometry. 
When you ask them to \u0026ldquo;move the camera left,\u0026rdquo; they hallucinate plausible-looking motion that is not geometrically consistent with the underlying scene.\n[Figure: comparison of three paradigms]\nVideo Generation (examples: Sora, Wan2.1, CogVideoX): photorealistic and temporally coherent, but with no 3D consistency; geometry is hallucinated.\n3D Reconstruction (examples: NeRF, 3DGS, InstantNGP): geometric fidelity and multi-view consistency, but static scenes only and not real-time.\nInSpatio-World (STAR + JDMD, 1.3B params): photorealistic, 3D consistent, handles dynamic scenes, real-time (24 FPS).\nInSpatio-World identifies three specific failure modes in existing autoregressive world simulators:\nSpatial Persistence Degradation: As the autoregressive rollout extends, the model \u0026ldquo;forgets\u0026rdquo; the original scene geometry. Objects drift, textures blur, and structural coherence decays.\nSynthetic-to-Real Gap: Training on rendered (synthetic) data provides precise camera control but produces artifacts. Training on real video produces realistic frames but lacks control signals. Neither alone is sufficient.\nInsufficient Control Precision: Existing trajectory-conditioned models fail to accurately follow user-specified camera paths, especially for large rotations.\n2. 
Architecture: STAR (Spatiotemporal Autoregressive)\nThe STAR architecture generates video in blocks of $N_f$ frames (default: 3), each conditioned on three types of information:\n[Figure: STAR block-wise causal denoising. Inputs: Reference z_ref (global anchor; source video latent), History z_{\u0026lt;i} (temporal context; previous block output), and Geometry [z_warp, mask] (explicit 3D constraint). A Causal DiT with KV cache denoises z_i given z_{\u0026lt;i}, z_ref, and [z_warp, m], producing the denoised z_i.]\nThe denoising process for block $i$ is:\n$$\\hat{z}_i = \\text{Denoise}_\\theta(z_i, \\sigma \\mid z_{\u0026lt;i}, z_{\\text{ref}_i}, [z_{\\text{warp}_i}, m_i])$$\n2.1 Implicit ST-Cache: The Global Spatial Anchor\nThe reference latent $z_{\\text{ref}}$ is extracted from the source video and injected into every block as a persistent spatial anchor. This solves spatial persistence degradation by ensuring the model always has access to the original scene appearance.\nIn the implementation, this works through a KV cache mechanism:\n# Concatenate reference + history as context frames\ncontext_frames = torch.cat([ref_block, last_pred_padded], dim=1)\n# Reference block is prepended to every denoising step\ndenoised_pred, _ = denoise_block(\n    noisy_current,\n    context=context_frames,\n    render_block=render_condition,\n    ...\n)\nA critical implementation detail: position encoding anchoring. The RoPE position indices for the reference block, history block, and current block are each anchored to fixed absolute positions, preventing the position encoding from drifting as the sequence length grows during autoregressive rollout.\n2.2 Explicit Spatial Constraint: Depth → Point Cloud → Render\nThe explicit geometric pipeline operates in three stages:\nDepth estimation: Depth-Anything-3 (DA3) estimates per-frame depth maps and camera poses from the source video.\nPoint cloud reconstruction: Each frame\u0026rsquo;s depth map is unprojected into a 3D point cloud (one PLY per frame). 
Trajectory-conditioned rendering: Given a user-specified camera trajectory, the point cloud is re-projected to the novel viewpoint, producing render_offline.mp4 and mask_offline.mp4.\n[Figure: geometry pipeline. Source Video + Trajectory → DA3 (depth + pose estimation) → Point Cloud (3D unproject + reproject) → Geometry Condition (render_video + mask_video).]\nThe render video provides a coarse geometric guide for where objects should appear from the new viewpoint, while the mask indicates which pixels have valid geometry. The DiT learns to refine this coarse render into a photorealistic frame.\n2.3 Trajectory Specification\nTrajectories are defined as simple text files with three lines: pitch angles (degrees), yaw angles (degrees), and displacement scale factors. The sphere2pose function converts spherical coordinates to 4×4 camera-to-world matrices:\n# x_y_circle_cycle.txt\n0 0 ... 30 30 ... 0 0 ... -30 -30 ... 0 0\n0 0 ... 45 45 ... 90 90 ... 45 45 ... 0 0\n1.0 1.0 ... 1.0\nKeyframes are interpolated using scipy.interpolate.UnivariateSpline for smooth trajectories. The system adaptively adjusts frame count based on total angular change (0.3\u0026ndash;0.8 degrees per frame).\n3. JDMD: Solving the Synthetic-Real Gap\nTraining on synthetic data (rendered point clouds) provides precise camera control but produces visual artifacts. Training on real video produces beautiful frames but lacks control signals. 
InSpatio-World\u0026rsquo;s solution: train on both simultaneously.\n[Figure: two training branches with shared DiT weights]\nV2V Branch (Synthetic). Input: source video + trajectory. GT: re-rendered novel view. Learns: precise motion control. Loss: L_vis + lambda * L_ctrl. Render artifacts are acceptable as GT because the geometry is correct.\nT2V Branch (Real). Input: text caption + video. GT: real video frames. Learns: visual fidelity. Loss: L_vis (standard diffusion). No geometry is needed because the photorealism is correct.\nThe JDMD (Joint Distribution Matching Distillation) loss:\n$$\\mathcal{L}_{\\text{JDMD}} = \\mathcal{L}_{\\text{vis}} + \\lambda_{\\text{ctrl}} \\cdot \\mathcal{L}_{\\text{ctrl}}$$\n$\\mathcal{L}_{\\text{vis}}$: Standard flow-matching loss on latent space, applied to both branches.\n$\\mathcal{L}_{\\text{ctrl}}$: Control precision loss, computed only on the V2V branch, measuring how well the generated video follows the specified camera trajectory.\nThis dual-branch training ensures the model inherits both geometric accuracy (from synthetic data) and visual realism (from real data).\n4. Inference Pipeline\nThe complete inference pipeline has three steps:\nStep 1: Caption Generation\nFlorence-2 generates a text description from the source video. This caption provides semantic context for the T2V component of the model.\nStep 2: Depth Estimation + Geometric Rendering\nDA3 estimates depth maps and camera poses. 
The depth maps are unprojected to point clouds and re-rendered from the target trajectory viewpoints, producing the geometry condition videos.\nStep 3: Autoregressive Inference\nThe Causal DiT generates the novel-view video block by block, with each block conditioned on the reference latent, the history cache, and the geometric render.\n# Run the complete pipeline\nbash run_test_pipeline.sh \\\n    --input_dir ./test/example \\\n    --traj_txt_path ./traj/x_y_circle_cycle.txt\nKey inference options:\nFlag | Purpose\n--relative_to_source | Combine trajectory relative to initial view (for driving)\n--rotation_only | Pan/tilt only, ignore translation\n--freeze_repeat N | Freeze time, repeat frame N times\n--use_tae | Tiny AutoEncoder for faster inference\n--compile_dit | torch.compile acceleration\n5. Performance\nMetric | Value\nModel size | 1.3B parameters\nFPS (H-series GPU) | 24\nFPS (RTX 4090) | 10\nWorldScore-Dynamic | 68.72 (SOTA among real-time methods)\nCamera control precision | 81.51\nRE10K-Long FID | 42.68\nRE10K-Long FVD | 100.55\nThe model achieves real-time performance while maintaining competitive quality against offline methods. The block-wise causal architecture enables streaming output \u0026mdash; the first few frames are available before the entire sequence is generated.\n6. Connection to Autonomous Driving\nInSpatio-World has a natural connection to autonomous driving planning. The project includes integration documentation for DrivoR, a Transformer-based E2E planner that achieves PDMS 93.7 on NAVSIM-v1.\nThe key insight: use InSpatio-World not as a planner, but as a future observation generator. Given a candidate trajectory from DrivoR, InSpatio-World can render what the ego vehicle would see if it followed that trajectory, enabling:\nFuture-consistency scoring: Add a feature to the DrivoR scorer that evaluates whether the predicted future observation is consistent with the planned trajectory. 
Counterfactual data augmentation: Generate training data for rare scenarios by rendering novel views along hypothetical trajectories that differ from the ground truth. Trajectory-conditioned world simulation: Combine DrivoR\u0026rsquo;s trajectory output with InSpatio-World\u0026rsquo;s rendering to create a closed-loop simulation environment. This points toward a broader trend: the convergence of world models and planning models in autonomous driving, where the world model provides the \u0026ldquo;what would happen\u0026rdquo; and the planner provides the \u0026ldquo;what should I do.\u0026rdquo;\n7. Limitations and Open Questions Long-range consistency: While the ST-Cache mitigates degradation, extremely long rollouts (hundreds of frames) still show gradual drift. 360-degree roaming: The current architecture handles moderate viewpoint changes well but struggles with full panoramic exploration. Dynamic objects: The explicit geometric pipeline (point cloud re-projection) treats objects as static; handling moving objects in the scene remains an open challenge. Sim-to-real gap for driving: Although JDMD helps, the gap between rendered and real driving scenes is larger than for general video, due to complex reflections, transparent surfaces, and fine textures. 
References Some of the arXiv IDs in these references are 2026 placeholder numbers; the links will be updated once the papers are officially public.\n- InSpatio-World: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling (arXiv:2604.07209)\n- Project Page\n- Wan2.1: Open and Advanced Large-Scale Video Generation Models\n- Depth-Anything-3: Monocular Depth Estimation\n- DrivoR: Driving on Registers for End-to-End Autonomous Driving\n- NAVSIM Benchmark\nRelated Concepts\nOntology of 4D representation: the Newtonian vs Minkowski structural debate behind InSpatio-World\u0026rsquo;s \u0026ldquo;4D simulation\u0026rdquo; (see /posts/world-models/vision-2d-to-4d/)\nAlternative video world-model line: Wan2.2\u0026rsquo;s video generation vs InSpatio\u0026rsquo;s autoregressive 4D simulation in the world-modeling landscape (see /posts/world-models/wan2.2-video-world-model-boundary/) ","permalink":"https://xuquant.com/en/posts/foundation-models/inspatio-world-4d-simulator/","summary":"\u003cp\u003e\u003cpicture\u003e\n  \u003csource srcset=\"/images/paper-figures/inspatio-world-arch.webp\" type=\"image/webp\"\u003e\n  \u003cimg src=\"/images/paper-figures/inspatio-world-arch.png\" alt=\"InSpatio-World Architecture\" loading=\"lazy\" decoding=\"async\"\u003e\n\u003c/picture\u003e\n\u003cem\u003eFigure from \u003ca href=\"https://arxiv.org/abs/2604.07209\"\u003eInSpatio-World: Real-Time 4D World Simulation via Spatiotemporal Autoregressive Modeling\u003c/a\u003e\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003eThe ability to simulate a 4D world \u0026mdash; one that evolves in time and can be viewed from arbitrary perspectives \u0026mdash; is a foundational capability for autonomous driving, robotics, and embodied AI. Existing video generation models produce visually compelling sequences but lack spatial consistency when the camera moves. 3D reconstruction methods achieve geometric fidelity but struggle with dynamic scenes and real-time performance. 
InSpatio-World bridges this gap through a spatiotemporal autoregressive (STAR) architecture that combines the strengths of both paradigms.\u003c/p\u003e","title":"InSpatio-World: Real-Time 4D World Simulation via Spatiotemporal Autoregressive Modeling"},{"content":" 中文版本：阅读中文版\nIntroduction The integration of reinforcement learning into end-to-end autonomous driving systems has emerged as a promising direction for improving trajectory planning beyond what supervised learning alone can achieve. However, the direct application of standard RL algorithms to driving tasks faces core challenges: the sim-to-real gap in log-replay environments, the computational bottleneck of online simulation, and the difficulty of defining dense reward signals for continuous trajectory generation.\nThis article examines the RL pipeline for end-to-end autonomous driving through the lens of post-training alignment. We begin with the concept of metric caching, which decouples expensive environment evaluation from model training. We then analyze how Direct Preference Optimization (DPO) can be applied across different action representations\u0026mdash;discrete tokens, continuous regression, and diffusion models\u0026mdash;and discuss the basic distinction between offline and online RL in the driving context. Finally, we present three strategies for breaking the sampling ceiling that limits the performance of iterative self-improvement pipelines.\nMetric Cache: Decoupling Evaluation from Training A central engineering insight in modern driving RL pipelines is the separation of environment simulation from model training through precomputed metric caches. The metric cache is a serialized snapshot of ground-truth environmental data and scene context, designed specifically to accelerate the evaluation of predicted trajectories.\nThe cache contains several key components. 
The reference trajectory is generated by a rule-based planner (typically an Intelligent Driver Model) and serves as a baseline for comparison. The ego state records the initial position, velocity, and heading of the ego vehicle. The observation field stores interpolated ground-truth future trajectories of all surrounding agents at 10 Hz, enabling precise collision detection during evaluation. The centerline and route lane IDs encode the navigable path for computing progress and direction compliance. The drivable area map provides a polygonal representation of road boundaries for off-road detection.\nThe production pipeline proceeds in three stages. First, the raw scenario is loaded from the driving database, and the rule-based planner generates a reference trajectory. Second, ground-truth agent trajectories are interpolated and map features are extracted. Third, all components are serialized into a compressed cache file. At evaluation time, the model simply generates a predicted trajectory, and the scoring module loads the cache to perform collision detection against the stored observations, boundary checks against the drivable area map, and progress computation against the centerline\u0026mdash;all without accessing the original database.\nThis design has a profound implication for the training pipeline: it enables the Generate-Score-Train loop that underpins post-training RL. By precomputing all environment information, the system can rapidly evaluate thousands of candidate trajectories from a single scene, producing the preference pairs needed for DPO training.\nPost-Training Pipeline: DPO for Trajectory Planning\nSampling and Preference Pair Construction\nThe post-training pipeline begins with sampling. For each input context (multi-camera observations, navigation command, ego history), the model generates $K$ candidate trajectories (typically $K=128$). 
Each trajectory is then evaluated by the scoring module, which produces a multi-dimensional score vector comprising: collision penalty, drivable area compliance, ego progress, time-to-collision, comfort, and a weighted total score.\nThe candidate trajectories are encoded as discrete action sequences through a Vector Quantization (VQ) module. Specifically, each trajectory is represented as a sequence of 8 discrete token IDs, corresponding to 4 seconds of prediction at 0.5-second intervals. The model records both the selected action tokens and their log probabilities under the current policy, which are stored for subsequent DPO training as the reference policy probabilities $\\log \\pi_{\\text{ref}}(a|x)$.\nPreference pairs are constructed by selecting the highest-scoring trajectory as the winner and the lowest-scoring as the loser, based on the total weighted score. Crucially, the reference policy probabilities are recorded at sampling time, eliminating the need to maintain a separate frozen reference model during training.\nDPO Loss Formulation\nFor discrete action spaces, the DPO loss follows the standard formulation. Let $y_w$ denote the winner trajectory and $y_l$ the loser trajectory. The joint log-probability of a trajectory under an autoregressive model is the sum of per-step log-probabilities:\n$$\\log \\pi(y|x) = \\sum_{t=1}^{T} \\log \\pi(a_t | a_{\u0026lt;t}, x)$$\nThe DPO loss is then:\n$$L_{\\text{DPO}} = -\\log \\sigma\\left(\\beta \\left( \\log \\frac{\\pi_\\theta(y_w|x)}{\\pi_{\\text{ref}}(y_w|x)} - \\log \\frac{\\pi_\\theta(y_l|x)}{\\pi_{\\text{ref}}(y_l|x)} \\right)\\right)$$\nwhere $\\beta$ controls the deviation from the reference policy. During training, the log-probabilities are computed by applying a log-softmax to the model\u0026rsquo;s output logits, gathering the values at the positions corresponding to the actual action tokens, and summing across time steps. 
The reference log-probabilities are read directly from the cached sampling data.\nTo monitor training progress, the implicit reward is computed as:\n$$r(x, y) = \\beta \\left(\\log \\pi_\\theta(y|x) - \\log \\pi_{\\text{ref}}(y|x)\\right)$$\nThe training objective is to increase the implicit reward for winners while decreasing it for losers, widening the margin between them.\nAction Space Comparison for DPO\nThe choice of action representation determines how $\\log P(y|x)$ is computed for DPO, and each choice carries distinct trade-offs.\nDiscrete Token Space\nIn the discrete setting, the model outputs a sequence of token IDs from a learned codebook (e.g., 8192 entries). The log-probability is computed via the standard softmax over logits:\n$$\\log P(y|x) = \\sum_{t=1}^{T} \\log \\frac{\\exp(z_{a_t})}{\\sum_{k=1}^{K} \\exp(z_k)}$$\nwhere $z_k$ is the logit of codebook entry $k$ at step $t$ and $K$ here denotes the codebook size rather than the number of sampled candidates. This representation is naturally multi-modal, provides exact probability values, and is robust to noise. It is also directly compatible with policy gradient RL methods. However, discretization introduces precision loss and faces the curse of dimensionality when the action space grows large. In the driving domain, this limitation is mitigated by the fact that the codebook can be trained to cover the relevant trajectory manifold effectively.\nContinuous Regression\nWhen the model directly regresses trajectory coordinates, the log-probability must be approximated under a distributional assumption. The most common approach assumes a Gaussian distribution with the model output as the mean and a fixed variance $\\sigma^2$. Under this assumption:\n$$\\log P(y|x) = -\\frac{1}{2\\sigma^2} \\|y - \\mu_\\theta(x)\\|^2 + \\text{const}$$\nThat is, up to a scale factor and an additive constant, the negative squared error serves as a proxy for log-probability. 
The DPO loss then becomes a contrastive objective that pulls the model\u0026rsquo;s prediction closer to the winner trajectory while pushing it away from the loser (the constant factor $1/(2\\sigma^2)$ is absorbed into $\\beta$, which scales the whole margin):\n$$L_{\\text{DPO-Reg}} = -\\log \\sigma\\left(\\beta \\left(\\left[-\\|y_w - \\mu_\\theta\\|^2 + \\|y_w - \\mu_{\\text{ref}}\\|^2\\right] - \\left[-\\|y_l - \\mu_\\theta\\|^2 + \\|y_l - \\mu_{\\text{ref}}\\|^2\\right]\\right)\\right)$$\nMore sophisticated models (e.g., Trajectron++, MultiPath) output a Gaussian Mixture Model with parameters $(\\pi_k, \\mu_k, \\Sigma_k)$, where the probability density is:\n$$P(y|x) = \\sum_{k=1}^{K} \\pi_k \\cdot \\mathcal{N}(y | \\mu_k, \\Sigma_k)$$\nThe log-probability of a sampled trajectory is computed via log-sum-exp over the mixture components. Continuous regression offers precise coordinate prediction and fast inference, but suffers from the averaging curse\u0026mdash;mode-averaged predictions tend toward the mean of multi-modal distributions, producing unrealistic trajectories at decision points.\nDiffusion Models\nDiffusion-based trajectory decoders generate continuous coordinates through an iterative denoising process. Computing $\\log P(y|x)$ for DPO requires a different approach: the denoising reconstruction error serves as a proxy for negative log-likelihood. Specifically, writing $x$ for the clean trajectory and $x_t$ for its noised version at diffusion step $t$:\n$$\\log P_\\theta(x) \\approx -\\mathbb{E}_{t, \\epsilon}\\left[\\|\\epsilon - \\epsilon_\\theta(x_t, t)\\|^2\\right]$$\nThe intuition is that if the model can accurately predict the noise added to a trajectory, then that trajectory is \u0026ldquo;likely\u0026rdquo; under the model\u0026rsquo;s distribution. 
For DPO, the loss compares the denoising errors for winner and loser trajectories:\n$$L_{\\text{Diffusion-DPO}} = -\\log \\sigma\\left(\\beta\\left[\\|\\text{Error}_{\\text{Loser}}\\|^2 - \\|\\text{Error}_{\\text{Winner}}\\|^2\\right]\\right)$$\nThe winner trajectory should be easier to denoise (lower error), while the loser should be harder. Diffusion models combine multi-modality with high precision and physical consistency, but are sensitive to hyperparameters and slower at inference time.\nSummary\nModel Type | Output | Log-Probability Proxy | DPO Objective | Strengths | Weaknesses\nDiscrete (VQ) | Token IDs | $\\log \\text{Softmax}(\\text{logits})$ | Increase winner token logit | Multi-modal, exact probability, RL-friendly | Precision loss, curse of dimensionality\nRegression | $(x, y)$ coordinates | $-\\text{MSE}(\\text{pred}, \\text{target})$ | Pull closer to winner coords | Precise, fast inference | Mode averaging, distributional assumption\nDiffusion | $(x, y)$ coordinates | $-\\text{MSE}(\\text{pred\\_noise}, \\text{noise})$ | Make winner easier to denoise | Multi-modal + precise, physically consistent | Hyperparameter-sensitive, slow\nOffline RL vs. Online RL for Driving\nThe Contextual Bandit Structure\nIn most end-to-end driving systems operating on log-replay data, the RL problem has a different structure from the standard Markov Decision Process (MDP) assumed by algorithms like PPO or DQN. At time $t=0$, the model observes the current scene and generates a complete future trajectory (e.g., 8 seconds). There is no sequential interaction: the model does not observe the outcome of the first second before deciding the second. The environment feedback arrives only after the entire trajectory is generated and evaluated.\nThis makes the problem a Contextual Bandit: the context is the traffic scene, the action is the generated trajectory, and the reward is the evaluation score. 
The model commits to a single action (the full trajectory) and receives a single reward, with no intermediate state transitions.\nWhy Iterative Offline Beats Naive Online\nThe current pipeline operates in an offline-to-online iterative mode. Scene data (the \u0026ldquo;prompt\u0026rdquo;) comes from real driving logs and is fixed. Experience data (the \u0026ldquo;samples\u0026rdquo;) is self-generated by the model through sampling. This self-generated experience is a crucial advantage over traditional offline RL, which only learns from human demonstrations. Self-generated samples allow the model to learn from its own failures\u0026mdash;a trajectory that appears smooth but causes a collision at second 3 is an excellent negative example.\nThe engineering advantages of the offline sampling mode over true online RL are significant:\nProperty | Online RL | Offline Sampling\nComputation | CPU/IO-blocked: GPU waits for simulator | CPU cluster samples to disk; GPU trains at 100% utilization\nData Efficiency | On-policy: samples discarded after one update | Off-policy: samples reused across multiple epochs\nStability | Prone to collapse from poor batches | Global view: cache can be cleaned before training\nThroughput | Simulator runs at 10\u0026ndash;20 Hz | Sampling fully parallelized across CPU cluster\nThe Simulator Flaw\nA deeper problem with online RL in log-replay environments is the simulator flaw. In most driving benchmarks, other agents follow their recorded trajectories regardless of the ego vehicle\u0026rsquo;s actions. If the ego vehicle swerves into an adjacent car, that car does not react\u0026mdash;it is a \u0026ldquo;ghost car\u0026rdquo; replaying a recording. An online RL agent would quickly discover this and learn either overly conservative policies (never move when any car is nearby) or overly aggressive ones (exploit the fact that other cars never react). 
Neither strategy transfers to the real world.\nBreaking the Sampling Ceiling\nThe core limitation of the Generate-Score-Train pipeline is captured by the relation:\n$$\\text{Training Ceiling} = \\max(\\text{Samples})$$\nIf the model is weak and all $K$ sampled trajectories are poor, DPO can only select the \u0026ldquo;least bad\u0026rdquo; trajectory as the winner. The model learns to distinguish bad from worse, but never sees what a genuinely good trajectory looks like. Three strategies can break through this ceiling.\nIterative Self-Improvement\nThe most engineering-friendly approach requires no architectural changes, only a change in the training loop. Instead of a single round of sampling and training, the process is iterated:\n1. Initial model $\\pi_0$ samples to produce dataset $D_0$.\n2. Training on $D_0$ produces improved model $\\pi_1$.\n3. $\\pi_1$ samples again (now exploring regions of trajectory space that $\\pi_0$ could not reach) to produce $D_1$.\n4. Training on $D_1$ produces $\\pi_2$.\n5. Repeat for $N$ iterations.\nEach iteration shifts the sampling distribution toward better regions. The first round might discover \u0026ldquo;slow but safe\u0026rdquo; trajectories; the second round, building on a stronger policy, might explore \u0026ldquo;fast and safe\u0026rdquo; trajectories. This is essentially off-policy RL with iterative data collection, while maintaining the engineering simplicity of the offline pipeline.\nTest-Time Compute and Search\nRather than improving the model, this approach improves the sampling process itself. Two strategies are available:\nGuided sampling exploits the structure of diffusion models by introducing a lightweight cost function during the reverse denoising process. 
This steers trajectory generation toward collision-free regions, raising the floor of sample quality without additional model training.\nTree search (e.g., Monte Carlo Tree Search) generates a large number of candidate trajectories (e.g., 1000) and uses a fast value model to pre-filter them down to a small set (e.g., 10) for expensive evaluation. This front-loads computational effort into the data generation phase, effectively performing \u0026ldquo;thinking\u0026rdquo; during sampling and distilling the results into the trained model.\nExpert Injection The fastest way to raise the ceiling is to introduce external expertise. During sampling, rule-based or optimization-based planners (e.g., lattice planners) generate trajectories that are mixed into the candidate pool. These expert trajectories become the winners in the preference pairs, forcing the model to learn: \u0026ldquo;this is how an expert planner handles this situation.\u0026rdquo; Over time, the model internalizes the expert\u0026rsquo;s decision-making patterns while retaining the neural network\u0026rsquo;s ability to generalize to scenarios where the rule-based planner fails.\nDiscussion The Generate-Score-Train paradigm has become the standard approach for aligning large models (whether LLMs, VLMs, or end-to-end driving systems) to desired behaviors. Its strength lies in engineering pragmatism: it decouples the expensive simulation step from the GPU-intensive training step, enables data reuse, and allows quality control before training. In this framework, sampling quality determines the performance ceiling, and the loss function merely determines how efficiently the model approaches that ceiling.\nThe three strategies for breaking the sampling ceiling are complementary rather than mutually exclusive. Iterative self-improvement provides a natural progression of model capability. Test-time search improves sample quality at the cost of additional computation during data generation. 
Expert injection provides an immediate boost by importing external knowledge. In practice, the most effective pipelines combine all three, using expert trajectories to bootstrap the first iteration, iterative self-improvement to progressively expand the frontier, and guided sampling or tree search to maximize the quality of each iteration\u0026rsquo;s samples.\nThe path from offline DPO to truly online RL in autonomous driving remains open. The simulator flaw\u0026mdash;the non-reactivity of log-replay agents\u0026mdash;is a basic obstacle that cannot be solved by algorithmic improvements alone. Addressing it requires either more realistic reactive simulators or hybrid approaches that combine log-replay evaluation with learned environment models. Until then, the iterative offline paradigm, with its engineering simplicity and demonstrated effectiveness, remains the pragmatic choice for production systems.\nReferences 1. Rafailov, R., Sharma, A., Mitchell, E., et al. \u0026ldquo;Direct Preference Optimization: Your Language Model is Secretly a Reward Model.\u0026rdquo; NeurIPS, 2023.\n2. Wallace, B., Dang, M., Rafailov, R., et al. \u0026ldquo;Diffusion Model Alignment Using Direct Preference Optimization.\u0026rdquo; arXiv:2311.12908, 2023.\n3. Shao, Z., Wang, P., Zhu, Q., et al. \u0026ldquo;DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.\u0026rdquo; arXiv:2402.03300, 2024.\n4. Chai, Y., et al. \u0026ldquo;UniAD: Planning-oriented Autonomous Driving.\u0026rdquo; CVPR, 2023.\n5. Daoud, A., et al. \u0026ldquo;DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving.\u0026rdquo; arXiv, 2024.\n6. Hu, Y., et al. \u0026ldquo;Planning-oriented Autonomous Driving via Interactive Multi-agent Modeling.\u0026rdquo; NeurIPS, 2023.\n7. Silver, D., Huang, A., Maddison, C.J., et al. \u0026ldquo;Mastering the Game of Go with Deep Neural Networks and Tree Search.\u0026rdquo; Nature, 2016.\n8. 
VAE-based discretization: van den Oord, A., Vinyals, O., and Kavukcuoglu, K. \u0026ldquo;Neural Discrete Representation Learning.\u0026rdquo; NeurIPS, 2017.\n9. Salzmann, T., Ivanovic, B., Chakravarty, P., and Pavone, M. \u0026ldquo;Trajectron++: Dynamically-Feasible Trajectory Forecasting with Heterogeneous Data.\u0026rdquo; ECCV, 2020.\n","permalink":"https://xuquant.com/en/posts/autonomous-driving/basic_rl/","summary":"\u003cblockquote\u003e\n\u003cp\u003e中文版本：\u003ca href=\"/posts/autonomous-driving/basic_RL.zh/\"\u003e阅读中文版\u003c/a\u003e\u003c/p\u003e\n\u003c/blockquote\u003e\n\u003ch2 id=\"introduction\"\u003eIntroduction\u003c/h2\u003e\n\u003cp\u003eThe integration of reinforcement learning into end-to-end autonomous driving systems has emerged as a promising direction for improving trajectory planning beyond what supervised learning alone can achieve. However, the direct application of standard RL algorithms to driving tasks faces core challenges: the sim-to-real gap in log-replay environments, the computational bottleneck of online simulation, and the difficulty of defining dense reward signals for continuous trajectory generation.\u003c/p\u003e","title":"Reinforcement Learning for End-to-End Autonomous Driving: From Offline DPO to Iterative Self-Improvement"},{"content":" 中文版本: 阅读中文版\nThis article focuses on the engineering perspective. The mathematical derivation of MLA (from RoPE to latent projection, partial RoPE compatibility proof, weight absorption algebra) is in /posts/mathematics/position-encoding/mla-from-rope/. This article does not repeat the math; it discusses only the engineering numbers and design trade-offs that matter for DeepSeek V2/V3 deployment.\nFigure from DeepSeek-V2: A Strong, Economical, and Efficient MoE Language Model\n1. Why DeepSeek Chose MLA: Engineering Motivation DeepSeek V2 / V3 [1] adopt MLA as the replacement for standard MHA; the root motivation is deployment economics. 
In LLM serving, the KV cache memory footprint directly determines how many concurrent requests fit on one card. For DeepSeek V2\u0026rsquo;s size ($n_h = 128$, $d_h = 128$, $l = 60$), standard MHA caches $2 n_h d_h = 32{,}768$ elements per token per layer; across 60 layers that is ~2M elements/token, or ~4 MB/token in bf16. Under 32 K context, a single sequence consumes ~128 GB, more than a single 80 GB H100 can hold.\nMLA compresses the per-token per-layer cache to $d_c + d_h^R$ elements. DeepSeek V2 picks $d_c = 4 d_h = 512$ and $d_h^R = d_h / 2 = 64$, totaling 576 elements/token/layer, a compression of about 57× relative to MHA (note: the 28.4× figure often quoted in early literature corresponds to the $n_h = 64$ configuration). Under 32 K context the single-sequence KV cache drops to ~2.2 GB, which brings long-context serving within the memory budget of a single card.\nGQA and MQA can also compress KV cache, but they carry structural costs:\nMQA shares one K/V pair across all heads ($n_g = 1$), compressing MHA by a factor of $n_h$. The cost is significant expressiveness loss: all query heads see the same K/V, sacrificing per-head differentiated attention patterns.\nGQA is a compromise with $n_g = n_h / r$, where every $r$ query heads share one K/V group. LLaMA-2 70B uses $r = 8$, engineering-validated as low-loss, but the compression ratio is only 8×, roughly 7× short of MLA\u0026rsquo;s ~57×.\nMLA\u0026rsquo;s engineering advantage: KV cache size is decoupled from $n_h$. $d_c$ is an independent design variable, allowed to be much smaller than $n_h d_h$ without forcing multi-head K/V sharing. DeepSeek V2 retains 128 heads while pushing cache down to near-MQA levels, something GQA structurally cannot achieve.\n2. 
Deployment Numbers: DeepSeek V2 / V3 Measurements The table below summarizes deployment-side numbers from DeepSeek reports (V2 from [1], V3 from [2]).\n| Dimension | DeepSeek 67B (MHA baseline) | DeepSeek V2 (MLA) | DeepSeek V3 (MLA) |\n| --- | --- | --- | --- |\n| KV cache elements / token / layer | ~32 K | ~0.6 K | ~0.6 K |\n| Compression ratio vs MHA | 1× | ~57× | ~57× |\n| Max generation throughput (H800 cluster) | 3.5 K tok/s | 50 K+ tok/s | same order |\n| Single-sequence 32K context KV footprint | ~128 GB | ~2.2 GB | ~2.2 GB |\n| Training activation savings | n/a | ~30% (query also compressed) | ~30% |\nThe ~14× throughput improvement does not scale linearly with the ~57× cache compression because throughput also depends on attention compute, MoE routing, and network bandwidth. Cache compression mainly relieves the memory-bandwidth bottleneck under long context, with a relatively smaller gain for short context.\nDeepSeek reports do not provide a direct MLA-vs-GQA comparison under identical conditions (V2\u0026rsquo;s baseline is 67B MHA). As an estimate: if V2 used $r = 8$ GQA, the per-token-per-layer cache would be ~4 K elements (~8 KB in bf16), still ~7× that of MLA; under 32K context the KV footprint would be ~15 GB, which fits, but single-card concurrency would be significantly lower than with MLA.\n3. Latent Dimension Design Choice DeepSeek V2 uses $d_c = 4 d_h = 512$ for KV compression, $d_c\u0026#x27; = 12 d_h = 1536$ for query compression, and $d_h^R = d_h / 2 = 64$. These numbers are not ablated in the paper; they look like heuristic engineering picks, chosen to match a total KV cache budget similar to GQA.\nThis is an open problem with MLA design: the optimal $d_c$ depends on model size, context length, training data distribution, and several other factors, but no systematic ablation exists in the public literature. One direction worth tracking is the trade-off between $d_c$ and head count $n_h$: MLA allows $n_h$ to grow without a cache penalty, but as $n_h$ grows, does the per-head projection quality in the latent subspace degrade? 
DeepSeek V2 uses 128 heads (~2× of same-size LLaMA); is it near the knee of this trade-off? No public data answers this.\n4. Engineering Critique: Heuristic Choices and Real-World Long-Context Gains MLA\u0026rsquo;s success on DeepSeek V2/V3 should not overshadow several under-justified design choices. First, $d_c = 4 d_h$ lacks ablation support: the paper does not sweep $d_c \\in \\{2, 4, 8, 16\\} \\times d_h$, nor does it justify why query compression uses $d_c\u0026#x27; = 12 d_h$ while KV compression uses $4 d_h$. These number combinations work in practice but are \u0026ldquo;tuned in\u0026rdquo; rather than \u0026ldquo;derived\u0026rdquo;. A truly convincing MLA paper should present the complete Pareto frontier of latent dim vs model quality vs inference speed vs training stability, and that does not exist today.\nSecond, does the KV cache compression deliver the theoretically estimated gains at long context (\u0026gt;32K)? In theory, 57× compression should reduce attention\u0026rsquo;s memory-bandwidth consumption by 57×, but at inference time attention must still perform the $W_{UK}, W_{UV}$ up-projections (or their weight-absorbed equivalents), and that cost grows linearly with context length. Under short context the cache is not the bottleneck; under long context the up-projection cost grows to dominate. MLA\u0026rsquo;s real speedup is therefore context-length-dependent, but DeepSeek reports only aggregate \u0026ldquo;max throughput\u0026rdquo; numbers. An independent benchmark (e.g. the vLLM implementation in [6]) would be more convincing.\nThird, does MLA + partial RoPE introduce a representation bias? Decoupled RoPE applies RoPE only to $d_h^R = d_h/2 = 64$ dimensions, leaving the other $d_h = 128$ dimensions positionless. Attention\u0026rsquo;s positional sensitivity thus comes from only 1/3 of the dimensions. 
Under long context the positional signal is already weak relative to the content signal, and partial RoPE dilutes it further. Does this hurt fine-grained long-range reference? Does V3\u0026rsquo;s 128K-context needle-in-a-haystack performance match expectation? DeepSeek reports aggregate perplexity and standard benchmarks, without breakdowns for position-sensitive tasks.\nRelated Concepts\nMLA Mathematical Derivation (canonical version): full derivation from RoPE to latent projection, partial RoPE compatibility proof, weight absorption algebra; see /posts/mathematics/position-encoding/mla-from-rope/\nLow-Rank Approximation Theory: MLA\u0026rsquo;s down-projection + up-projection is the engineering realization of SVD low-rank truncation; see /posts/mathematics/matrix/svd-low-rank/\nKV Cache Inference Acceleration (cross-domain): X-Cache in world model inference is the same idea in the vision domain; see /posts/world-models/xpeng-x-cache-world-model-inference-acceleration/\nReferences [1] DeepSeek-AI. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. arXiv:2405.04434, 2024.\n[2] DeepSeek-AI. DeepSeek-V3 Technical Report. arXiv:2412.19437, 2024.\n[3] Shazeer, N. Fast Transformer Decoding: One Write-Head is All You Need (MQA). arXiv:1911.02150, 2019.\n[4] Ainslie, J., Lee-Thorp, J., de Jong, M., et al. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. Proceedings of EMNLP, 2023.\n[5] Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., and Liu, Y. RoFormer: Enhanced Transformer with Rotary Position Embedding. Neurocomputing, 2024.\n[6] Kwon, W., et al. Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM). 
SOSP 2023.\n","permalink":"https://xuquant.com/en/posts/foundation-models/deepseek_series1_mla/","summary":"\u003cblockquote\u003e\n\u003cp\u003e中文版本: \u003ca href=\"/posts/foundation-models/deepseek_series1_MLA.zh/\"\u003e阅读中文版\u003c/a\u003e\u003c/p\u003e\n\u003c/blockquote\u003e\n\u003cblockquote\u003e\n\u003cp\u003e\u003cstrong\u003eThis article focuses on engineering perspective.\u003c/strong\u003e The mathematical derivation of MLA (from RoPE to latent projection, partial RoPE compatibility proof, weight absorption algebra) is in /posts/mathematics/position-encoding/mla-from-rope/. This article does not repeat the math — it discusses only the engineering numbers and design trade-offs that matter for DeepSeek V2/V3 deployment.\u003c/p\u003e\n\u003c/blockquote\u003e\n\u003cp\u003e\u003cpicture\u003e\n  \u003csource srcset=\"/images/paper-figures/deepseek-v2-mla-arch.webp\" type=\"image/webp\"\u003e\n  \u003cimg src=\"/images/paper-figures/deepseek-v2-mla-arch.png\" alt=\"DeepSeek-V2 MLA Architecture\" loading=\"lazy\" decoding=\"async\"\u003e\n\u003c/picture\u003e\n\u003cem\u003eFigure from \u003ca href=\"https://arxiv.org/abs/2405.04434\"\u003eDeepSeek-V2: A Strong, Economical, and Efficient MoE Language Model\u003c/a\u003e\u003c/em\u003e\u003c/p\u003e\n\u003ch2 id=\"1-why-deepseek-chose-mla-engineering-motivation\"\u003e1. Why DeepSeek Chose MLA: Engineering Motivation\u003c/h2\u003e\n\u003cp\u003eDeepSeek V2 / V3 \u003ca href=\"/en/posts/foundation-models/deepseek_series1_mla/#ref1\"\u003e[1]\u003c/a\u003e adopt \u003ca href=\"/posts/mathematics/position-encoding/mla-from-rope/\"\u003eMLA\u003c/a\u003e as the replacement for standard MHA — the root motivation is deployment economics. In LLM serving, the KV cache memory footprint directly determines how many concurrent requests fit on one card. 
For DeepSeek V2\u0026rsquo;s size (\u003cspan class=\"katex\"\u003e\u003cmath xmlns=\"http://www.w3.org/1998/Math/MathML\"\u003e\u003csemantics\u003e\u003cmrow\u003e\u003cmsub\u003e\u003cmi\u003en\u003c/mi\u003e\u003cmi\u003eh\u003c/mi\u003e\u003c/msub\u003e\u003cmo\u003e=\u003c/mo\u003e\u003cmn\u003e128\u003c/mn\u003e\u003c/mrow\u003e\u003cannotation encoding=\"application/x-tex\"\u003en_h = 128\u003c/annotation\u003e\u003c/semantics\u003e\u003c/math\u003e\u003c/span\u003e, \u003cspan class=\"katex\"\u003e\u003cmath xmlns=\"http://www.w3.org/1998/Math/MathML\"\u003e\u003csemantics\u003e\u003cmrow\u003e\u003cmsub\u003e\u003cmi\u003ed\u003c/mi\u003e\u003cmi\u003eh\u003c/mi\u003e\u003c/msub\u003e\u003cmo\u003e=\u003c/mo\u003e\u003cmn\u003e128\u003c/mn\u003e\u003c/mrow\u003e\u003cannotation encoding=\"application/x-tex\"\u003ed_h = 128\u003c/annotation\u003e\u003c/semantics\u003e\u003c/math\u003e\u003c/span\u003e, \u003cspan class=\"katex\"\u003e\u003cmath xmlns=\"http://www.w3.org/1998/Math/MathML\"\u003e\u003csemantics\u003e\u003cmrow\u003e\u003cmi\u003el\u003c/mi\u003e\u003cmo\u003e=\u003c/mo\u003e\u003cmn\u003e60\u003c/mn\u003e\u003c/mrow\u003e\u003cannotation encoding=\"application/x-tex\"\u003el = 60\u003c/annotation\u003e\u003c/semantics\u003e\u003c/math\u003e\u003c/span\u003e), standard MHA caches \u003cspan class=\"katex\"\u003e\u003cmath xmlns=\"http://www.w3.org/1998/Math/MathML\"\u003e\u003csemantics\u003e\u003cmrow\u003e\u003cmn\u003e2\u003c/mn\u003e\u003cmsub\u003e\u003cmi\u003en\u003c/mi\u003e\u003cmi\u003eh\u003c/mi\u003e\u003c/msub\u003e\u003cmsub\u003e\u003cmi\u003ed\u003c/mi\u003e\u003cmi\u003eh\u003c/mi\u003e\u003c/msub\u003e\u003cmo\u003e=\u003c/mo\u003e\u003cmn\u003e32,768\u003c/mn\u003e\u003c/mrow\u003e\u003cannotation encoding=\"application/x-tex\"\u003e2 n_h d_h = 32{,}768\u003c/annotation\u003e\u003c/semantics\u003e\u003c/math\u003e\u003c/span\u003e elements per token per layer; across 60 layers, ~2M elements/token, ~4 MB/token in bf16. 
Under 32 K context, a single sequence consumes ~128 GB — one H100 80 GB cannot even fit it.\u003c/p\u003e","title":"Multi-Head Latent Attention: DeepSeek V2/V3 Engineering View"},{"content":" 中文版本：阅读中文版\nIntroduction Figure from Alpamayo-R1: Bridging Reasoning and Action Prediction for Autonomous Driving\nEnd-to-end autonomous driving has made significant progress in recent years, yet deploying Vision-Language-Action (VLA) models in real-world driving scenarios remains challenging. The basic difficulties are fourfold. First, multi-frame temporal understanding requires the model to extract decision-relevant changes from highly redundant consecutive observations, rather than merely processing static snapshots. Second, driving decisions must be causal: the model must model why a particular action is taken, not just learn statistical correlations between situations and actions. Third, predicted trajectories must satisfy kinematic and dynamic constraints while remaining multi-modal and efficient enough for real-time inference. Fourth, the reasoning process must be tightly aligned with action output\u0026mdash;reasoning should not be a post-hoc rationalization but must be verifiable by and constrained by the actual actions taken.\nThe Alpamayo system, built on the Cosmos-Reason VLM as its vision-language backbone, addresses these challenges through a carefully co-designed architecture spanning structure, data, training, and reinforcement learning. Rather than optimizing individual modules in isolation, the system treats reasoning-action alignment, ego-shortcut avoidance, and real-time multi-modal trajectory generation as joint design objectives. 
This article provides a technical overview of the Alpamayo approach, covering its system architecture, vision encoder design, trajectory decoding strategy, training pipeline, the Cause-of-Change (COC) dataset paradigm, and reinforcement learning fine-tuning.\nSystem Architecture The Alpamayo system takes as input multi-camera, multi-timestamp visual observations, user navigation commands, and historical ego motion (velocity, trajectory history). It produces three types of output: a reason trace that explains the key objects, causal relationships, and environmental changes underlying the decision; a meta action specifying a high-level semantic action such as stop, yield, follow, or lane change; and future trajectories that are kinematically feasible and executable.\nA critical design principle governs the role of ego information. Ego state is treated as a conditioning signal rather than a primary causal source for decision-making. This distinction is essential for avoiding the ego-shortcut problem, where the model learns to infer decisions from its own kinematic state (e.g., \u0026ldquo;I am stopped, therefore there must be a red light\u0026rdquo;) rather than from genuine environmental understanding. By structurally demoting ego information from a causal driver to a conditioning context, the system forces the model to ground its reasoning in external observations.\nVision Encoder Design The vision encoder must satisfy a stringent set of constraints: it must produce a compact token representation that preserves environmentally relevant semantic information while meeting the real-time requirements of a VLA driving system.\nTri-plane Compression for Surround-View Cameras For surround-view camera inputs, the system employs a Tri-plane compression strategy. Rather than naively concatenating tokens from multiple camera views\u0026mdash;which would lead to token explosion\u0026mdash;the encoder projects information from all views onto three orthogonal planes (XY, XZ, YZ). 
This tri-plane representation unifies multi-view information into a coherent 3D scene semantics while keeping the token count manageable. The approach draws on the observation that 3D structural information can be efficiently factorized into lower-dimensional projections without significant semantic loss, analogous to how tri-plane representations have been used in neural radiance fields.\nTemporal Compression Consecutive frames in a driving video stream contain large amounts of redundant information. The system addresses this by treating time as an additional dimension and performing joint spatiotemporal encoding. A cross-timestep joint encoding module combined with global attention-based compression (referred to as Flex) allows the model to distill temporally salient changes from the redundant background. This design ensures that the token budget is spent on information that actually changes and matters for decision-making, rather than on encoding the static environment repeatedly across time steps.\nLearnable Queries Structured feature representations (such as tri-planes) impose an inductive bias that can limit the model\u0026rsquo;s expressiveness. To address this, the system introduces learnable query tokens that allow the model to autonomously select and attend to the most relevant information. These queries operate on top of the structured representation, providing a flexible mechanism for extracting task-relevant features without being constrained by the fixed spatial structure of the tri-plane.\nInference-Time Token Pruning At inference time, the system applies post-training token pruning techniques to further reduce the computational cost. 
Tokens that contribute less to the final prediction are identified and removed, allowing the model to run faster without significant performance degradation.\nTrajectory Decoder Action Representation The system does not directly predict raw trajectory coordinates, which would be susceptible to sensor noise and difficult to constrain kinematically. Instead, it uses a control-level representation based on a bicycle model. This choice ensures that the predicted actions inherently satisfy dynamic constraints, facilitates multi-modal trajectory modeling, and improves both the stability and interpretability of the output trajectories.\nFidelity Fidelity refers to the preservation of information through the reason-to-action-to-control encoding and decoding pipeline. High fidelity means that the high-level decision intent (captured in the reason trace and meta action) is faithfully reflected in the low-level control commands. The system is designed to minimize information loss at each stage of this pipeline, ensuring that the executed trajectory is a true realization of the model\u0026rsquo;s reasoning.\nExpert Decoder: The \u0026ldquo;Big Brain, Small Brain\u0026rdquo; Architecture The decoding stage employs a dual-expert architecture. The VLA model (the \u0026ldquo;big brain\u0026rdquo;) handles perception, reasoning, and meta action generation, outputting key-value representations that encode the decision context. A separate Action Expert (the \u0026ldquo;small brain\u0026rdquo;) receives these KV representations and decodes them into high-precision, smooth continuous control commands via Flow Matching. This separation of concerns allows the VLA to focus on high-level cognitive tasks while the Action Expert specializes in fine-grained trajectory generation, analogous to how the cerebellum refines motor commands from cortical intent.\nTraining Strategy Discrete Action Tokens The choice of discrete action tokens serves three purposes. 
First, it makes the model amenable to reinforcement learning: discrete action spaces allow direct application of policy gradient methods (such as GRPO) for optimizing reasoning quality and consistency. Second, discrete tokens share the same representational space as language tokens, providing a natural foundation for reason-action alignment. Third, the combination of discrete representation for training stability and Flow Matching for inference-time precision and multi-modality yields a system that is both robust during training and expressive during deployment.\nTraining Decoupling The training procedure follows a decoupled strategy. The VLA model (perception and reasoning) is trained first. Once converged, its parameters are frozen and the KV representations are exported. The Action Expert is then trained separately on these frozen representations. This decoupling prevents the noise and gradient signal from the low-level control task from contaminating the high-level reasoning module, preserving the quality of the learned reason traces.\nCOC Dataset The Cause-of-Change (COC) dataset paradigm is central to the system\u0026rsquo;s approach to reasoning quality. Existing driving datasets contain reasoning annotations that are vague, post-hoc, and decoupled from the actual actions taken. A model trained on such data learns what to do but not why, producing reasoning traces that are retroactive justifications rather than genuine causal explanations.\nThe COC paradigm enforces an explicit causal structure. Each annotation must specify which environmental change and which key object caused the current decision and action. This requires imposing a strict causal template that anchors explanations in observable environmental factors, not merely generating longer reasoning traces.\nTo construct COC data at scale, the system combines high-quality manual annotations with an automated teacher-student pipeline. 
Manual annotations cover the design domain\u0026mdash;weather, lighting, and road conditions\u0026mdash;with explicit causal reasoning about critical objects. The automated pipeline uses large language models (such as Qwen) as teachers to generate ego behavior reasoning and action predictions, constrained by prompts that forbid ego-triggered explanations and require references to external objects and environmental changes.\nRL Fine-tuning Objective The reinforcement learning stage aims to provide explicit inference feedback by optimizing the model\u0026rsquo;s reasoning and action based on its own rollouts. The system uses Group Relative Policy Optimization (GRPO), which aligns the optimization objective with on-policy rollouts from the current model.\nReward Design The reward function comprises three components. The first is reasoning quality, evaluated by an expert LLM acting as a judge that penalizes hallucinations and causally empty explanations. The second is reason-action consistency, which verifies alignment between the reasoning trace and the executed trajectory by inversely solving the generated trajectory into a meta action and comparing it with the meta action stated in the reasoning trace. The third is trajectory quality, computed via rule-based metrics including collision, boundary violation, comfort, and efficiency.\nCost-Effective RL On-policy sampling is computationally expensive. To address this, the system constructs a dedicated post-training dataset and uses model logits and reward signals to estimate sample value. The key metric is KL divergence between the current policy and the reference policy: samples with higher divergence are more informative for training. This allows the system to prioritize high-value samples and reduce the total number of rollouts needed. 
The Cosmos-RL framework provides the infrastructure for this efficient RL pipeline.\nDiscussion The Alpamayo system\u0026rsquo;s core contribution lies not in any single module but in the joint optimization across architecture, data, training, and reinforcement learning. The structural demotion of ego information prevents shortcut learning. The COC dataset paradigm enforces genuine causal reasoning rather than post-hoc explanation. The decoupled training strategy preserves reasoning quality while enabling high-fidelity trajectory generation. The GRPO fine-tuning stage closes the loop by providing direct feedback on reasoning quality and reason-action consistency.\nSeveral open questions remain. The trade-off between token compression and information preservation in the vision encoder may become more acute as the system scales to longer temporal horizons. The COC annotation process, while effective, relies on large language models as teachers, raising questions about the ceiling of reasoning quality that can be achieved through distillation. The iterative nature of the RL fine-tuning pipeline, while cost-effective relative to fully online RL, still requires careful scheduling of sampling and training iterations. Finally, the generalization of the ego-shortcut avoidance strategy to more complex multi-agent interactions deserves further investigation.\nReferences 1. NVIDIA. \u0026ldquo;Cosmos-Reason: Reasoning and Action Alignment for Autonomous Driving.\u0026rdquo; Technical Report, 2025.\n2. NVIDIA. \u0026ldquo;Cosmos-RL: A Framework for Reinforcement Learning with Vision-Language Models.\u0026rdquo; 2025. Available at: https://nvidia-cosmos.github.io/cosmos-rl/\n3. Chan, E.R., Lin, C.Z., Chan, M.A., et al. \u0026ldquo;Efficient Geometry-aware 3D Generative Adversarial Networks.\u0026rdquo; CVPR, 2022. (Tri-plane representation)\n4. Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., et al. \u0026ldquo;Flow Matching for Generative Modeling.\u0026rdquo; ICLR, 2023.\n5. 
Shao, Z., Wang, P., Zhu, Q., et al. \u0026ldquo;DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.\u0026rdquo; arXiv:2402.03300, 2024. (GRPO)\n6. Rafailov, R., Sharma, A., Mitchell, E., et al. \u0026ldquo;Direct Preference Optimization: Your Language Model is Secretly a Reward Model.\u0026rdquo; NeurIPS, 2023. (DPO)\n","permalink":"https://xuquant.com/en/posts/autonomous-driving/nvidia_vla/","summary":"\u003cblockquote\u003e\n\u003cp\u003e中文版本：\u003ca href=\"/posts/autonomous-driving/Nvidia_VLA.zh/\"\u003e阅读中文版\u003c/a\u003e\u003c/p\u003e\n\u003c/blockquote\u003e\n\u003ch2 id=\"introduction\"\u003eIntroduction\u003c/h2\u003e\n\u003cp\u003e\u003cpicture\u003e\n  \u003csource srcset=\"/images/paper-figures/nvidia-vla-arch.webp\" type=\"image/webp\"\u003e\n  \u003cimg src=\"/images/paper-figures/nvidia-vla-arch.png\" alt=\"Alpamayo Architecture\" loading=\"lazy\" decoding=\"async\"\u003e\n\u003c/picture\u003e\n\u003cem\u003eFigure from \u003ca href=\"https://arxiv.org/abs/2511.00088\"\u003eAlpamayo-R1: Bridging Reasoning and Action Prediction for Autonomous Driving\u003c/a\u003e\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003eEnd-to-end autonomous driving has made significant progress in recent years, yet deploying Vision-Language-Action (VLA) models in real-world driving scenarios remains challenging. The basic difficulties are fourfold. First, multi-frame temporal understanding requires the model to extract decision-relevant changes from highly redundant consecutive observations, rather than merely processing static snapshots. Second, driving decisions must be causal: the model must model \u003cem\u003ewhy\u003c/em\u003e a particular action is taken, not just learn statistical correlations between situations and actions. Third, predicted trajectories must satisfy kinematic and dynamic constraints while remaining multi-modal and efficient enough for real-time inference. 
Fourth, the reasoning process must be tightly aligned with action output\u0026mdash;reasoning should not be a post-hoc rationalization but must be verifiable by and constrained by the actual actions taken.\u003c/p\u003e","title":"Alpamayo: Reasoning-Action Aligned VLA for Autonomous Driving"},{"content":" 中文版本：阅读中文版\n1. Why End-to-End Driving Needs Reinforcement Learning Figure from AlphaDrive: GRPO-based RL for Autonomous Driving\nSupervised learning, whether through imitation learning or behavior cloning, can only take an autonomous driving system so far. The core limitation is distributional: the training data is drawn from expert demonstrations, and any distributional shift between training and deployment leads to compounding errors. More critically, supervised objectives are misaligned with the true goal of driving. Minimizing the L2 distance to a ground-truth trajectory penalizes safe deviations as harshly as dangerous ones, and provides no mechanism for the model to discover better trajectories than those in the dataset.\nReinforcement learning offers a principled alternative. Rather than mimicking specific actions, RL optimizes a reward signal that directly measures driving quality: collision avoidance, progress toward the destination, passenger comfort, rule compliance. The policy is free to discover novel strategies that achieve high reward, even if they differ from expert behavior. This is particularly valuable for handling long-tail scenarios where no demonstration exists.\nThe challenge, however, is that driving is not a standard MDP. In most end-to-end systems operating on log-replay data, the model generates a complete future trajectory at $t=0$ and receives a single reward after evaluation, which is a contextual bandit structure, not a sequential decision process. 
This structural difference propagates through every aspect of the RL pipeline: how advantages are estimated, how sampling works, and how the loss function is designed.\n2. From REINFORCE to PPO to GRPO: The Policy Gradient Lineage 2.1 The Policy Gradient Theorem Consider a parameterized policy $\\pi_\\theta(a|s)$. The objective is to maximize the expected return:\n$$J(\\theta) = \\mathbb{E}_{\\tau \\sim \\pi_\\theta}\\left[R(\\tau)\\right] = \\mathbb{E}_{\\tau \\sim \\pi_\\theta}\\left[\\sum_{t=0}^{T} \\gamma^t r(s_t, a_t)\\right]$$\nThe policy gradient theorem [1] gives the gradient of this objective:\n$$\\nabla_\\theta J(\\theta) = \\mathbb{E}_{\\tau \\sim \\pi_\\theta}\\left[\\sum_{t=0}^{T} \\nabla_\\theta \\log \\pi_\\theta(a_t|s_t) \\cdot R(\\tau)\\right]$$\nThis is the REINFORCE estimator [2]. Intuitively, it increases the log-probability of actions that led to high returns and decreases the log-probability of those that did not. The following diagram illustrates the computational flow:\n[Figure: Policy Gradient Computation Flow. Steps: (1) sample $\\tau \\sim \\pi_\\theta(a|s)$; (2) compute the return $R(\\tau) = \\sum_t \\gamma^t r_t$; (3) compute the log-probability gradient $\\nabla_\\theta \\log \\pi_\\theta(a_t|s_t)$; (4) update $\\theta \\leftarrow \\theta + \\alpha \\nabla_\\theta J$. REINFORCE problem: high variance, because $R(\\tau)$ is a single-trajectory Monte Carlo estimate. Remedies shown: baseline $R(\\tau) - b$, advantage $A(s,a)$, group-relative $A_i$.]\n2.2 Variance Reduction: From Returns to Advantages The raw REINFORCE gradient uses $R(\\tau)$ as a multiplier. This is problematic because $R(\\tau)$ has high variance: a single trajectory return fluctuates wildly around the true expected return. The standard fix is to replace $R(\\tau)$ with the advantage function:\n$$A^\\pi(s_t, a_t) = Q^\\pi(s_t, a_t) - V^\\pi(s_t)$$\nThe advantage measures how much better action $a_t$ is compared to the average action from state $s_t$. 
This yields the advantage actor-critic gradient:\n$$\\nabla_\\theta J(\\theta) = \\mathbb{E}\\left[\\sum_{t=0}^{T} \\nabla_\\theta \\log \\pi_\\theta(a_t|s_t) \\cdot A^{\\pi}(s_t, a_t)\\right]$$\nIn practice, the advantage is estimated using Generalized Advantage Estimation (GAE) [3], which interpolates between the high-variance Monte Carlo estimate and the high-bias TD(0) estimate via a parameter $\\lambda$:\n$$\\hat{A}^{\\text{GAE}}_t = \\sum_{l=0}^{T-t} (\\gamma\\lambda)^l \\delta_{t+l}$$\nwhere $\\delta_t = r_t + \\gamma V(s_{t+1}) - V(s_t)$ is the TD error. GAE requires a learned value function $V_\\psi$, which is typically a neural network trained concurrently with the policy.\n2.3 PPO: Clipped Surrogate Objective PPO constrains the policy ratio to address the instability of large policy updates:\n$$\\rho_t(\\theta) = \\frac{\\pi_\\theta(a_t|s_t)}{\\pi_{\\theta_{\\text{old}}}(a_t|s_t)}$$\nThe clipped surrogate objective is:\n$$L^{\\text{PPO-clip}}(\\theta) = \\mathbb{E}_t\\left[\\min\\left(\\rho_t(\\theta) \\hat{A}_t,\\ \\text{clip}(\\rho_t(\\theta), 1-\\varepsilon, 1+\\varepsilon) \\hat{A}_t\\right)\\right]$$\nThe clip removes the incentive for moving the ratio outside $[1-\\varepsilon, 1+\\varepsilon]$, while the outer $\\min$ makes the objective a pessimistic lower bound on the unclipped surrogate. Combined with the advantage estimator from GAE, PPO provides stable policy improvement in practice.\nHowever, PPO has a significant architectural cost: it requires a value network $V_\\psi$ of comparable size to the policy network for computing GAE. 
In the LLM setting, this means training and maintaining a second model of comparable parameter count, which roughly doubles memory usage and complicates the training pipeline.\n2.4 GRPO: Eliminating the Value Network\nGroup Relative Policy Optimization (GRPO), introduced in DeepSeekMath [5], removes the value network entirely. The key idea is simple but powerful: for a given input, sample a group of $G$ outputs from the old policy, score them all, and use the group statistics as the baseline.\n[Figure: Advantage Estimation, PPO (GAE) vs. GRPO (Group Relative). PPO: a value network $V_\\psi$ learns state values; the TD errors $\\delta_t = r_t + \\gamma V(s_{t+1}) - V(s_t)$ feed GAE, $\\sum_l (\\gamma\\lambda)^l \\delta_{t+l}$, with the bias-variance tradeoff set by $\\lambda$; the result is a per-token advantage $\\hat{A}_t$ that requires a separate $V_\\psi$ (about 2x memory). GRPO: sample $G$ outputs $o_1, \\ldots, o_G \\sim \\pi_{\\theta_{\\text{old}}}$, score them with the reward model to obtain $r_1, \\ldots, r_G$, and group-normalize, $\\tilde{r}_i = (r_i - \\text{mean}) / \\text{std}$; the result is a per-output advantage $\\hat{A}_i = \\tilde{r}_i$ with no value network (about 1x memory).]\nGiven a question (or scene) $q$, sample $G$ outputs $\\{o_1, o_2, \\ldots, o_G\\}$ from $\\pi_{\\theta_{\\text{old}}}$. Each output receives a reward $r_i$ from the reward model. 
The group-relative advantage is:\nr~i=ri−mean(r)std(r)+ε\\tilde{r}_i = \\frac{r_i - \\text{mean}(\\mathbf{r})}{\\text{std}(\\mathbf{r}) + \\varepsilon}A^i,t=r~i(all tokens in output i share the same advantage)\\hat{A}_{i,t} = \\tilde{r}_i \\quad \\text{(all tokens in output } i \\text{ share the same advantage)}The GRPO objective then applies the same clipped surrogate structure as PPO, but with these group-relative advantages:\nJGRPO(θ)=E[1G∑i=1G1∣oi∣∑t=1∣oi∣{min⁡[ρi,tA^i,t, clip(ρi,t,1−ε,1+ε)A^i,t]−βDKL[πθ∥πref]}]\\mathcal{J}_{\\text{GRPO}}(\\theta) = \\mathbb{E}\\left[\\frac{1}{G}\\sum_{i=1}^{G}\\frac{1}{|o_i|}\\sum_{t=1}^{|o_i|}\\left\\{\\min\\left[\\rho_{i,t}\\hat{A}_{i,t},\\ \\text{clip}(\\rho_{i,t}, 1-\\varepsilon, 1+\\varepsilon)\\hat{A}_{i,t}\\right] - \\beta \\mathbb{D}_{KL}[\\pi_\\theta \\| \\pi_{\\text{ref}}]\\right\\}\\right]where ρi,t=πθ(oi,t∣q,oi,\u0026lt;t)/πθold(oi,t∣q,oi,\u0026lt;t)\\rho_{i,t} = \\pi_\\theta(o_{i,t}|q, o_{i,\u0026lt;t}) / \\pi_{\\theta_{\\text{old}}}(o_{i,t}|q, o_{i,\u0026lt;t}).\nA critical design choice in GRPO is the placement of the KL penalty. In PPO, the KL term is added into the reward at each step, which means it affects the advantage computation. In GRPO, the KL penalty is placed directly in the loss function, decoupled from advantage estimation. This keeps the advantage computation clean and interpretable.\nDimension PPO GRPO Value function Requires learned VψV_\\psi None; group mean as baseline Advantage estimation GAE via TD errors Group-relative normalization KL penalty Embedded in per-step reward Directly in loss function Sampling Single output per input Group of GG outputs per input Memory overhead ~2x (policy + value networks) ~1x (policy only) Per-token advantage Yes (varies across positions) No (shared across output) 2.5 GRPO in Autonomous Driving: AlphaDrive The first application of GRPO to autonomous driving is AlphaDrive [6], which applies GRPO-based RL to Vision-Language Models for planning. 
AlphaDrive introduces four planning-oriented RL rewards tailored to driving scenarios and employs a two-stage training pipeline (SFT followed by RL). A notable finding is that RL training elicits emergent multi-modal planning capabilities\u0026mdash;the model learns to propose diverse viable trajectories without explicit multi-modal supervision. This is particularly significant because multi-modality in trajectory planning (e.g., deciding whether to pass on the left or right of an obstacle) is a core requirement for safe and efficient driving.\n3. The Unified Optimization Framework Across PPO, GRPO, and their variants, the optimization objective for policy methods can be expressed in a unified form:\nLtotal=λpolicy⋅Lpolicy+λreg⋅Lreg+λaux⋅Laux\\mathcal{L}_{\\text{total}} = \\lambda_{\\text{policy}} \\cdot \\mathcal{L}_{\\text{policy}} + \\lambda_{\\text{reg}} \\cdot \\mathcal{L}_{\\text{reg}} + \\lambda_{\\text{aux}} \\cdot \\mathcal{L}_{\\text{aux}}Each term serves a distinct purpose:\nLpolicy\\mathcal{L}_{\\text{policy}}: The core policy optimization loss that drives the policy toward higher-reward actions. This is the clipped surrogate objective (PPO-clip or GRPO-clip). Lreg\\mathcal{L}_{\\text{reg}}: Regularization that constrains the policy from deviating too far from a reference. This includes KL divergence DKL[πθ∥πref]\\mathbb{D}_{KL}[\\pi_\\theta \\| \\pi_{\\text{ref}}], entropy bonuses for exploration, and trust region constraints. Laux\\mathcal{L}_{\\text{aux}}: Auxiliary losses that preserve capabilities from pre-training or supervised fine-tuning. This includes imitation learning (behavior cloning) loss, reconstruction loss (for diffusion decoders), and value function loss. 
The mapping to concrete terms in a driving RL pipeline:\nAbstract Term Concrete Implementation Lpolicy\\mathcal{L}_{\\text{policy}} LGRPO-clip\\mathcal{L}_{\\text{GRPO-clip}} or LPPO-clip\\mathcal{L}_{\\text{PPO-clip}} Lreg\\mathcal{L}_{\\text{reg}} LKL+Lentropy\\mathcal{L}_{\\text{KL}} + \\mathcal{L}_{\\text{entropy}} Laux\\mathcal{L}_{\\text{aux}} LBC+Lvalue+∑mλaux,mLaux,m\\mathcal{L}_{\\text{BC}} + \\mathcal{L}_{\\text{value}} + \\sum_m \\lambda_{\\text{aux},m} \\mathcal{L}_{\\text{aux},m} The core policy loss expands to:\nLGRPO-clip=−E[1G∑i=1Gmin⁡(ρi(θ)A^i, clip(ρi(θ),1−ε,1+ε)A^i)]\\mathcal{L}_{\\text{GRPO-clip}} = -\\mathbb{E}\\left[\\frac{1}{G}\\sum_{i=1}^{G}\\min\\left(\\rho_i(\\theta)\\hat{A}_i,\\ \\text{clip}(\\rho_i(\\theta), 1-\\varepsilon, 1+\\varepsilon)\\hat{A}_i\\right)\\right]where ρi(θ)=πθ(ai∣s)/πθold(ai∣s)=exp⁡(log⁡πθ(ai∣s)−log⁡πθold(ai∣s))\\rho_i(\\theta) = \\pi_\\theta(a_i|s) / \\pi_{\\theta_{\\text{old}}}(a_i|s) = \\exp(\\log \\pi_\\theta(a_i|s) - \\log \\pi_{\\theta_{\\text{old}}}(a_i|s)).\n4. Sampling: LLM vs. Autonomous Driving The sampling step\u0026mdash;generating candidate outputs from the current policy\u0026mdash;is where the difference between LLM and driving RL becomes most pronounced. In both cases, the quality of the advantage estimate depends on the diversity and quality of the sampled group. But the constraints on what constitutes a valid sample differ at a basic level.\nSampling Space: LLM vs. Autonomous Driving LLM Sampling Temperature / top-k / top-p Any token sequence is \"valid\" No physical constraints Driving Sampling Diffusion noise / perturbation Must satisfy kinematics Physical constraints bound the space High-reward Low-reward / Invalid Medium-reward LLM sampling is unconstrained. Given a prompt, the model samples token sequences via temperature scaling, top-k filtering, or nucleus sampling (top-p). Any sequence of valid tokens is a syntactically legal output; the only question is whether it is semantically useful. 
The sampling space is the full vocabulary raised to the sequence length, and the diversity of samples is controlled by the temperature parameter.\nDriving sampling is subject to strict constraints. A sampled trajectory must satisfy:\nKinematic feasibility: The trajectory must respect vehicle dynamics\u0026mdash;maximum steering angle, acceleration limits, jerk constraints. A trajectory that requires instantaneous lateral displacement is physically impossible. Scene consistency: The trajectory must not pass through observed obstacles, violate traffic rules, or leave the drivable area. Temporal coherence: The trajectory must be smooth and continuous, without discontinuous jumps in position or heading. These constraints mean that naive perturbation of a trajectory (analogous to temperature sampling in LLMs) produces mostly invalid samples. A small perturbation might push the trajectory into an obstacle; a large perturbation might produce a physically impossible path. The sampling strategy must be carefully designed to produce meaningful diversity\u0026mdash;trajectories that differ in interesting ways (left pass vs. right pass, aggressive merge vs. conservative yield) while remaining physically feasible.\nThis is where diffusion-based trajectory decoders offer a natural advantage. The denoising process can be guided to satisfy constraints, and the noise schedule controls the exploration-exploitation tradeoff in a physically meaningful way.\n5. 
Loss Design: Multi-Objective Composition\nThe full training objective in a production driving RL system typically combines multiple loss terms:\n$$\\mathcal{L}_{\\text{total}} = \\lambda_{\\text{pg}} \\cdot \\mathcal{L}_{\\text{GRPO-clip}} + \\lambda_{\\text{kl}} \\cdot \\mathcal{L}_{\\text{KL}} + \\lambda_{\\text{vf}} \\cdot \\mathcal{L}_{\\text{value}} + \\lambda_{\\text{ent}} \\cdot \\mathcal{L}_{\\text{entropy}} + \\lambda_{\\text{bc}} \\cdot \\mathcal{L}_{\\text{BC}} + \\sum_m \\lambda_{\\text{aux},m} \\cdot \\mathcal{L}_{\\text{aux},m}$$\nEach term plays a specific role:\nPolicy gradient loss ($\\mathcal{L}_{\\text{GRPO-clip}}$): The primary driver of policy improvement. The clip mechanism prevents destructively large updates, while the group-relative advantage provides a variance-reduced gradient signal.\nKL divergence ($\\mathcal{L}_{\\text{KL}} = \\mathbb{D}_{KL}[\\pi_\\theta \\| \\pi_{\\text{ref}}]$): Constrains the policy from drifting too far from the reference (typically the SFT checkpoint). Without this, RL training can cause the model to \u0026ldquo;forget\u0026rdquo; its pre-trained capabilities. An unconstrained policy is also more prone to reward hacking, where it finds loopholes in the reward function that produce high scores but low-quality trajectories.\nEntropy bonus ($\\mathcal{L}_{\\text{entropy}} = -\\mathbb{E}[\\mathcal{H}(\\pi_\\theta)]$): Encourages exploration by preventing the policy from collapsing to a deterministic mode. In driving, this is essential for maintaining multi-modality: the model should continue to propose diverse plausible trajectories rather than converging to a single average solution.\nBehavior cloning loss ($\\mathcal{L}_{\\text{BC}}$): An auxiliary imitation loss computed on expert demonstrations. This acts as a regularizer that prevents the policy from departing too far from safe, human-like driving behavior. 
It is particularly important in early RL training when the reward signal may be noisy or sparse.\nValue function loss ($\\mathcal{L}_{\\text{value}}$): When a value network is used (as in PPO), this is the regression loss for training $V_\\psi$. In GRPO-based systems, this term is absent, but it may still appear in hybrid approaches that combine GRPO advantages with a learned baseline for additional variance reduction.\nOther auxiliary losses ($\\mathcal{L}_{\\text{aux},m}$): Domain-specific terms such as reconstruction loss for diffusion decoders, collision prediction loss, or comfort regularization. These are typically small in magnitude but provide important inductive biases.\nThe coefficients $\\{\\lambda\\}$ are critical hyperparameters. In practice, they are tuned through a combination of grid search and manual adjustment. A common pattern is to start with a high $\\lambda_{\\text{bc}}$ (strong imitation regularization) and gradually anneal it as the RL training stabilizes, allowing the policy gradient signal to dominate.\n# Pseudocode: GRPO training loop for driving\nfor each iteration:\n    for each scene s in batch:\n        # 1. Sample G trajectories per scene from the old policy\n        trajectories = sample_G(π_θ_old, s, G=16)\n\n        # 2. Score each trajectory with the reward model\n        for i in range(G):\n            r_i = reward_model(trajectories[i], s)\n\n        # 3. Compute group-relative advantages\n        r_mean = mean(r_1, ..., r_G)\n        r_std = std(r_1, ..., r_G) + ε_std   # small constant, avoids division by zero\n        for i in range(G):\n            Â_i = (r_i - r_mean) / r_std\n\n        # 4. Compute the clipped surrogate loss for this scene\n        for i in range(G):\n            ρ_i = π_θ(a_i|s) / π_θ_old(a_i|s)\n            L_pg_i = -min(ρ_i * Â_i, clip(ρ_i, 1-ε, 1+ε) * Â_i)\n        L_pg[s] = mean(L_pg_1, ..., L_pg_G)\n\n    # Average the policy loss over all scenes in the batch\n    L_pg = mean over s of L_pg[s]\n\n    # 5. Compute regularization losses\n    L_kl = D_KL[π_θ || π_ref]\n    L_ent = -H(π_θ)\n    L_bc = imitation_loss(π_θ, expert_data)\n\n    # 6. Total loss and update (no value loss: GRPO has no value network)\n    L_total = λ_pg * L_pg + λ_kl * L_kl + λ_ent * L_ent + λ_bc * L_bc\n    θ = θ - α * ∇_θ L_total\n6. 
Diffusion, Noise, and Exploration For diffusion-based trajectory decoders, the relationship between noise and exploration deserves special attention. In the standard diffusion process, a clean trajectory x0x_0 is corrupted by adding Gaussian noise over TT steps:\nxt=αˉtx0+1−αˉtϵ,ϵ∼N(0,I)x_t = \\sqrt{\\bar{\\alpha}_t} x_0 + \\sqrt{1 - \\bar{\\alpha}_t} \\epsilon, \\quad \\epsilon \\sim \\mathcal{N}(0, I)At inference time, the model denoises from xTx_T (pure noise) back to x0x_0 (a clean trajectory). The initial noise ϵ\\epsilon determines which trajectory is generated. In the RL context, this noise plays a role analogous to temperature in LLM sampling\u0026mdash;but with a crucial difference.\nNoise in diffusion-based driving is not merely \u0026ldquo;sampling randomness.\u0026rdquo; It directly determines:\nExploration range: The magnitude and structure of the noise control how far the generated trajectories can deviate from the mean. Larger noise leads to more diverse candidates. Candidate trajectory morphology: Different noise realizations produce qualitatively different trajectory shapes\u0026mdash;lane-change vs. lane-follow, aggressive vs. conservative, left vs. right. The noise does not just shift a trajectory; it can change its mode. Group distribution quality: For GRPO, the advantage estimation depends on the group of samples having meaningful reward variance. If the noise is too small, all trajectories are nearly identical, and the group-relative advantage is dominated by noise rather than signal. If the noise is too large, many trajectories become physically invalid, and the reward signal becomes uninformative. This creates a three-way tension in noise scheduling:\nEffective diversity: The noise must be large enough to produce trajectories with meaningfully different rewards, enabling the group-relative advantage to separate good from bad. 
Trajectory validity: The noise must be small enough (or the denoising process must be constrained enough) to keep trajectories within the kinematically feasible and scene-consistent region. Alignment with training objectives: The exploration direction should be consistent with what the reward function actually measures. Noise that produces diverse but reward-irrelevant variations (e.g., tiny lateral shifts that do not affect collision safety) wastes sampling budget. In practice, these tensions are addressed through a combination of constrained diffusion (guiding the denoising process with kinematic constraints), adaptive noise scheduling (adjusting noise levels based on scene complexity), and rejection sampling (discarding trajectories that violate hard constraints before computing rewards).\n7. Summary Component REINFORCE PPO GRPO Objective Maximize J(θ)=E[R(τ)]J(\\theta) = \\mathbb{E}[R(\\tau)] Clipped surrogate Clipped surrogate Advantage R(τ)R(\\tau) (raw return) GAE via learned VψV_\\psi Group-relative normalization Baseline None Learned value function Group mean reward Value network No Yes (same scale as policy) No Variance Very high Low (GAE + learned baseline) Moderate (group statistics) Update constraint None Clip ratio [1−ε,1+ε][1-\\varepsilon, 1+\\varepsilon] Clip ratio [1−ε,1+ε][1-\\varepsilon, 1+\\varepsilon] KL regularization None In reward (affects advantage) In loss (independent of advantage) Memory 1x ~2x ~1x Driving applicability Baseline only General purpose VLM planning, group-sampled scenarios The progression from REINFORCE to PPO to GRPO represents a trajectory of increasing practical efficiency: REINFORCE establishes the theoretical foundation, PPO introduces stable optimization through clipping and learned baselines, and GRPO removes the expensive value network by exploiting the group structure of the sampling process. 
For autonomous driving, GRPO is particularly attractive because the contextual bandit structure of trajectory planning naturally produces group-sampled outputs, and the absence of a value network simplifies the training pipeline for already complex end-to-end models.\nHowever, GRPO is not a universal replacement for PPO. In settings where per-token advantages matter (e.g., sequential decision-making with meaningful intermediate states), GAE provides a richer signal than the per-output advantage of GRPO. The choice between the two should be guided by the structure of the problem: contextual bandit with group sampling favors GRPO; sequential MDP with long horizons favors PPO.\nReferences 1. Sutton, R.S., McAllester, D.A., Singh, S.P., \u0026amp; Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. NeurIPS. Link\n2. Williams, R.J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4), 229-256.\n3. Schulman, J., Moritz, P., Levine, S., Jordan, M.I., \u0026amp; Abbeel, P. (2016). High-dimensional continuous control using generalized advantage estimation. ICLR.\n4. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., \u0026amp; Klimov, O. (2017). Proximal policy optimization algorithms. arXiv:1707.06347.\n5. Shao, Z., Wang, P., Zhu, Q., et al. (2024). DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv:2402.03300. GRPO is introduced in Section 4 of this paper.\n6. Jiang, B., Chen, S., Zhang, Q., Liu, W., \u0026amp; Wang, X. (2025). AlphaDrive: Unleashing the power of VLMs in autonomous driving via GRPO-based reasoning and planning. 
arXiv:2503.07608.\n","permalink":"https://xuquant.com/en/posts/autonomous-driving/rl-policy-optimization-e2e-driving/","summary":"\u003cblockquote\u003e\n\u003cp\u003e中文版本：\u003ca href=\"/posts/autonomous-driving/rl-policy-optimization-e2e-driving.zh/\"\u003e阅读中文版\u003c/a\u003e\u003c/p\u003e\n\u003c/blockquote\u003e\n\u003ch2 id=\"1-why-end-to-end-driving-needs-reinforcement-learning\"\u003e1. Why End-to-End Driving Needs Reinforcement Learning\u003c/h2\u003e\n\u003cp\u003e\u003cpicture\u003e\n  \u003csource srcset=\"/images/paper-figures/alphadrive-arch.webp\" type=\"image/webp\"\u003e\n  \u003cimg src=\"/images/paper-figures/alphadrive-arch.png\" alt=\"AlphaDrive Architecture\" loading=\"lazy\" decoding=\"async\"\u003e\n\u003c/picture\u003e\n\u003cem\u003eFigure from \u003ca href=\"https://arxiv.org/abs/2503.07608\"\u003eAlphaDrive: GRPO-based RL for Autonomous Driving\u003c/a\u003e\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003eSupervised learning\u0026mdash;whether through imitation learning or behavior cloning\u0026mdash;can only take an autonomous driving system so far. The core limitation is distributional: the training data is drawn from expert demonstrations, and any distributional shift between training and deployment leads to compounding errors. More critically, supervised objectives are misaligned with the true goal of driving. Minimizing the L2 distance to a ground-truth trajectory penalizes safe deviations as harshly as dangerous ones, and provides no mechanism for the model to discover \u003cem\u003ebetter\u003c/em\u003e trajectories than those in the dataset.\u003c/p\u003e","title":"Policy Optimization for End-to-End Autonomous Driving: From REINFORCE to GRPO"},{"content":" 中文版本：阅读中文版\nIntroduction The trajectory of autonomous driving architecture has undergone a paradigm shift: from the classical modular pipeline (perception →\\to prediction →\\to planning →\\to control) toward end-to-end systems that map sensory inputs directly to driving actions. 
This transition is not merely an engineering convenience\u0026mdash;it reflects a deep recognition that modular interfaces impose information bottlenecks and that joint optimization across the full stack can yield emergent capabilities invisible to individually optimized modules.\nThe evolution can be broadly characterized in three phases:\nV1.0 \u0026mdash; Modular end-to-end: Individual modules (detection, tracking, prediction) are trained end-to-end with differentiable interfaces, but the overall architecture retains a modular structure with hand-designed information flow. V2.0 \u0026mdash; One-stage end-to-end: A single model directly predicts trajectories from multi-modal sensor inputs. The core research question becomes: what is the optimal decoder head for the planner? V3.0 \u0026mdash; VLA-native end-to-end: The action space is natively integrated into a vision-language-action model, where driving decisions emerge from the same representational substrate as linguistic reasoning. This article focuses on the V2.0 →\\to V3.0 transition. We examine the three dominant decoder paradigms\u0026mdash;autoregressive (AR), diffusion, and flow matching\u0026mdash;analyze their trade-offs in diversity, stability, and real-time feasibility, and discuss how the VLA paradigm in V3.0 resolves core tensions that persist in V2.0 architectures.\nV2.0: The Planner Decoder Selection Problem The central design decision in a one-stage end-to-end system is the planner decoder head: the mechanism by which the model\u0026rsquo;s learned scene representation is decoded into a drivable trajectory. Unlike classification or detection heads, trajectory decoding must satisfy several competing constraints simultaneously:\nMulti-modality: At any given scene, multiple plausible trajectories exist (lane keeping, lane change, yield). The decoder must represent this multi-modal distribution without collapsing to a single mode. 
Temporal consistency: Consecutive frames must produce consistent trajectories; jitter between frames is unacceptable for passenger comfort and safety. Kinematic feasibility: Predicted trajectories must satisfy vehicle dynamics constraints (curvature, acceleration, jerk). Real-time inference: The decoder must produce trajectories within the vehicle\u0026rsquo;s control loop latency budget (typically ≤100\\leq 100 ms). Three families of decoder architectures have emerged as the leading candidates: autoregressive token prediction, diffusion-based generation, and flow matching. We analyze each in turn.\nAutoregressive (AR) Decoding The autoregressive approach treats trajectory generation as a next-token prediction problem, directly borrowing the paradigm that has proven enormously successful in large language models. Given a trajectory τ=(a1,a2,…,aT)\\tau = (a_1, a_2, \\ldots, a_T) discretized into action tokens, the model generates:\np(τ)=∏t=1Tp(at∣a\u0026lt;t,x)p(\\tau) = \\prod_{t=1}^{T} p(a_t \\mid a_{\u0026lt;t}, \\mathbf{x})where x\\mathbf{x} denotes the scene encoding (visual features, map information, ego state). This formulation is exemplified by MotionLM [1], which represents continuous trajectories as sequences of discrete motion tokens and casts multi-agent motion prediction as a language modeling task.\nThe key advantage of AR decoding is its expressive multi-modality: by modeling the full conditional distribution autoregressively, the decoder can naturally represent diverse trajectory outcomes. However, this advantage comes at a cost:\nInter-frame inconsistency: Because each frame\u0026rsquo;s trajectory is generated independently from the same conditional distribution, small perturbations in the scene encoding can lead to mode-switching between frames, producing the characteristic \u0026ldquo;jitter\u0026rdquo; or \u0026ldquo;wobble\u0026rdquo; in the ego trajectory. 
Error accumulation: Autoregressive errors compound over the trajectory horizon, particularly for long-horizon predictions. Recent work has attempted to mitigate the jitter problem through reinforcement learning. Specifically, GRPO (Group Relative Policy Optimization) with a frame-consistency reward can reduce inter-frame variability. However, this approach introduces its own pathology: by penalizing deviation from the previous frame\u0026rsquo;s trajectory, the model becomes overly conservative and the lane-change trigger metric degrades\u0026mdash;the model learns to \u0026ldquo;play it safe\u0026rdquo; by avoiding lane changes altogether.\nDiffusion-Based Decoding Diffusion models generate trajectories by iteratively denoising from a Gaussian prior:\nτ0∼pθ(τ0∣x)=∫p(τK)∏k=K1pθ(τk−1∣τk,x) dτ1…dτK\\tau_0 \\sim p_\\theta(\\tau_0 \\mid \\mathbf{x}) = \\int p(\\tau_K) \\prod_{k=K}^{1} p_\\theta(\\tau_{k-1} \\mid \\tau_k, \\mathbf{x}) \\, d\\tau_1 \\ldots d\\tau_Kwhere KK is the number of denoising steps and τK∼N(0,I)\\tau_K \\sim \\mathcal{N}(0, \\mathbf{I}).\nDiffusionDrive [2] introduces a critical innovation: anchor-based truncated diffusion. Rather than denoising from pure noise, the model starts from a set of anchor trajectories that represent different driving intentions (lane keeping, left lane change, right lane change). The diffusion schedule is truncated\u0026mdash;starting from an intermediate noise level rather than pure noise\u0026mdash;which dramatically reduces the number of required denoising steps while preserving multi-modality.\nThe truncation strategy addresses a fundamental limitation of naive diffusion for driving: full denoising from τK\\tau_K is both computationally expensive and prone to mode collapse when the distribution is highly concentrated. 
By conditioning on anchors and truncating the schedule, DiffusionDrive achieves real-time inference with multi-modal output.\nHowever, the anchor-based approach introduces a subtle problem: the AR-like jitter reappears at the anchor selection level. When the model switches between anchors across consecutive frames, the resulting trajectory exhibits the same inconsistency that plagues AR decoding.\nFlow Matching Decoding\nFlow matching learns a continuous-time velocity field whose ODE transports a simple prior distribution to the target trajectory distribution:\n$$\\frac{d\\tau_t}{dt} = v_\\theta(\\tau_t, t, \\mathbf{x}), \\quad t \\in [0, 1]$$\nwhere $v_\\theta$ is the learned velocity field and the trajectory is obtained by solving the ODE from $t=0$ to $t=1$. This formulation, known as FlowDrive in the driving context, has several attractive properties:\nSmooth trajectories: Because the ODE solver produces a continuous trajectory, the output is inherently smooth. In practice, flow matching produces the smoothest, most \u0026ldquo;silky\u0026rdquo; trajectories among the three approaches.\nDeterministic inference: The ODE solver is deterministic given the same initial conditions, so the only randomness is the draw of the initial prior sample.\nThe critical weakness of flow matching is mode collapse via ODE sampling. The vector field is trained to minimize the flow matching loss:\n$$\\mathcal{L}_{FM} = \\mathbb{E}_{t, \\tau_0, \\tau_1} \\left[ \\| v_\\theta(\\tau_t, t, \\mathbf{x}) - (\\tau_1 - \\tau_0) \\|^2 \\right]$$\nHere, unlike the diffusion indexing above, $\\tau_0$ is a sample from the prior, $\\tau_1$ is the ground-truth trajectory, and $\\tau_t = (1 - t)\\tau_0 + t\\tau_1$ is their linear interpolation. Under this objective, the learned flow tends to transport all prior samples toward the dominant mode, particularly in regions where the trajectory distribution is highly concentrated. 
This is fundamentally different from diffusion, where the stochastic sampling process inherently maintains diversity.\nAttempts to apply GRPO reinforcement learning to flow matching face a particularly severe version of the \u0026ldquo;all-or-nothing\u0026rdquo; problem: the RL signal tends to push the entire batch toward either the good mode or the bad mode, rather than improving the average case. This bimodal training dynamic makes GRPO for flow matching unstable in practice.\nThree-Way Trade-off\nThe three approaches can be positioned in a trade-off space along three axes: trajectory diversity, temporal consistency, and inference determinism:\n[Figure: three-way trade-off among diversity (multi-modality), temporal consistency, and inference determinism. AR: high diversity, low consistency. Flow: best smoothness, mode collapse. Diffusion: balanced, anchor jitter. AR + Diffusion: sweet spot.]\nThe table below summarizes the qualitative trade-offs observed across reproduced experiments:\nProperty | AR (MotionLM-style) | Flow Matching | DiffusionDrive | AR + Diffusion\nTrajectory diversity | High | Low (mode collapse) | Moderate | High\nInter-frame consistency | Low (jitter) | Best (smooth) | Moderate (anchor jitter) | Moderate-High\nGRPO compatibility | Good (but hurts lane change) | Poor (all-or-nothing) | Moderate | Good\nInference speed | Fast (single pass) | Fast (few ODE steps) | Moderate ($K$ denoising steps) | Moderate\nReal-time feasibility | Yes | Yes | With truncation: Yes | Yes\nAR + Diffusion: The Optimal Combination\nThe experimental evidence points to a hybrid AR + Diffusion strategy as the most effective decoder for one-stage end-to-end driving. 
The intuition is straightforward: AR decoding provides the diversity guarantee, while the diffusion denoising process acts as a consistency regularizer, smoothing out the mode-switching artifacts of pure AR.\nPer internal engineering documentation (which the author was unable to independently verify from publicly accessible sources), the Chainflow-VLA system (combining AR trajectory tokenization with chain-of-diffusion refinement) reportedly achieved a PDMS score of 94.05 on NAVSIM v1 navtest [3]. If the number holds, it would offer indicative support for the AR + Diffusion hybrid route; given its unverifiable nature, however, the argument below does not rest on this single data point but on the reproducible results from the DiffusionDrive / GoalFlow / ReflectDrive line of work.\nThe two components address complementary failure modes:\nAR prevents the mode collapse that plagues flow matching and, to a lesser extent, diffusion.\nDiffusion denoising smooths the AR jitter by denoising across the trajectory sequence rather than within a single frame.\nAlgorithm: AR + Diffusion Decoding\nInput: Scene encoding x, anchors A = {a_1, ..., a_M}\nOutput: Trajectory tau_0\n1. // AR phase: generate coarse trajectory tokens\n2. for t = 1 to T do\n3.     a_t ~ p_theta(a_t | a_{\u0026lt;t}, x)\n4. end\n5. tau_coarse = TokenToTrajectory(a_1, ..., a_T)\n6.\n7. // Diffusion refinement phase: start from the noised coarse trajectory\n8. tau_K = AddNoise(tau_coarse, sigma_K)\n9. for k = K to 1 do\n10.     tau_{k-1} = DenoiseStep(tau_k, x, A)\n11. end\n12. return tau_0\nIt should be noted that the exact architecture of Chainflow-VLA is not fully detailed in publicly available documentation; the description above reflects the general principle of AR-initialized diffusion refinement consistent with the reported approach. 
Readers are advised to consult the original source for precise architectural details.\nV3.0: VLA Architecture \u0026mdash; Two Philosophies of Action Integration The transition from V2.0 to V3.0 is marked by a major architectural shift: the introduction of Vision-Language-Action (VLA) models, where driving actions are natively generated within the same large model that processes visual and linguistic inputs. This is not simply \u0026ldquo;adding an action head to a VLM\u0026rdquo;\u0026mdash;it requires a deep rethinking of how action representations relate to the model\u0026rsquo;s internal semantics.\nThe Corner Case Motivation The primary motivation for V3.0 is the corner case problem. In autonomous driving, corner cases are scenarios characterized by three properties:\nMinimal visual difference: The perceptual distinction between a safe and an unsafe scenario can be extremely subtle (e.g., a pedestrian glancing at their phone while crossing vs. walking purposefully). High decision significance: Despite the minimal perceptual difference, the correct action can be qualitatively different (emergency brake vs. moderate slowdown). Temporal context dependence: The correct decision cannot be determined from a single frame alone; it requires understanding the temporal evolution of the scene. These properties make corner cases fundamentally unsuitable for the V2.0 paradigm, where the planner decoder operates on a single-frame scene encoding. 
The VLA approach addresses this by grounding the action in a richer semantic representation that includes temporal reasoning and causal understanding.\nTwo Architectural Philosophies\nThe integration of action into a VLA model admits two fundamentally different architectural philosophies, depending on the assumed relationship between semantic understanding and action generation:\nPhilosophy 1: Action requires deep semantic alignment (Concat-KV)\nIf one believes that driving action requires multi-layer semantic abstraction (understanding why a situation is dangerous, not just what is present), then action tokens should be integrated into the LLM\u0026rsquo;s key-value cache alongside text tokens. In this approach, the action tokens attend to and are attended by the full sequence of visual and linguistic tokens, enabling the model to ground its actions in the same deep semantic representations that support reasoning. The joint cache is the concatenation of the per-modality caches:\n$$\\text{KV}_{\\text{joint}} = \\text{Concat}(\\text{KV}_{\\text{vision}}, \\text{KV}_{\\text{language}}, \\text{KV}_{\\text{action}})$$\nThe advantage is that actions are fully grounded in the model\u0026rsquo;s semantic understanding. The risk is that the action head inherits the full complexity of the LLM\u0026rsquo;s attention patterns, making training unstable and inference expensive. 
OpenDriveVLA [4] exemplifies this approach with a hierarchical vision-language alignment process that projects both 2D and 3D visual features into the language embedding space before action decoding.\nPhilosophy 2: VLM as feature extractor + downstream action module\nIf one believes that driving action is primarily a low-dimensional conditional generation problem\u0026mdash;the scene understanding is \u0026ldquo;solved\u0026rdquo; by the VLM\u0026rsquo;s visual encoder, and the action module only needs to sample from the conditional distribution given stable scene features\u0026mdash;then a decoupled architecture is more appropriate. The VLM serves as a frozen feature extractor, and a lightweight action module generates trajectories conditioned on the VLM\u0026rsquo;s output features.\nz=VLMencoder(x),τ∼pθ(τ∣z)\\mathbf{z} = \\text{VLM}_{\\text{encoder}}(\\mathbf{x}), \\quad \\tau \\sim p_\\theta(\\tau \\mid \\mathbf{z})The advantage is training stability: the VLM encoder is not disrupted by the action training signal, and the action module can be trained independently with standard imitation learning or RL. The risk is that the action module may not have access to the full depth of the VLM\u0026rsquo;s semantic understanding, limiting its ability to handle corner cases that require causal reasoning.\nThe choice between these two philosophies is not settled. It depends on the empirical answer to a deeper question: is driving action fundamentally a semantic reasoning problem or a conditional generation problem? 
If the former, concat-KV is justified; if the latter, the decoupled approach is more efficient and stable.\n[Figure: V3.0: Two Philosophies of Action Integration. Philosophy 1, Concat-KV (action needs deep semantic alignment): VLM encoder, LLM decoder, and action head share one KV cache, so action tokens attend to vision and language; grounded in semantics, but training can be unstable. Philosophy 2, Decoupled (VLM as feature extractor): a frozen VLM encoder produces z, and a lightweight action module samples tau ~ p(tau | z); training is stable, but causal reasoning may be missed.]\nEngineering Practice: From Research to Production\nThe transition from research prototypes to production-grade end-to-end driving systems requires solving a distinct set of engineering challenges. The following sections document key practices observed in real-world deployments.\nData Infrastructure\nA one-stage end-to-end model is only as good as its training data. The data infrastructure challenge has several dimensions:\nFormat unification: Multiple data sources (perception labels, driving behavior, navigation instructions) must be unified into a single training format. The \u0026ldquo;six-in-one\u0026rdquo; unification format integrates perception data from five separate pipelines (object detection, occupancy, lane detection, traffic light, and driving behavior) into a single schema, enabling joint training over 1.5M+ clips from heterogeneous sources.\nData quality: Validation workflows are essential. Each data source has its own failure modes (mislabeled bounding boxes, inconsistent lane topology, incorrect traffic light states). A structured data acceptance process\u0026mdash;with automated sanity checks and human review\u0026mdash;catches systematic errors before they contaminate training.\nDistribution balancing: Real-world driving data is heavily imbalanced: highway cruising dominates, while urban intersections and corner cases are underrepresented. 
Explicit distribution construction\u0026mdash;through targeted data collection, augmentation, and re-weighting\u0026mdash;is necessary to ensure the model does not degenerate into a \u0026ldquo;go straight\u0026rdquo; policy.\nTraining Optimization\nScaling to million-clip training sets requires significant infrastructure investment:\n- Distributed training: The codebase must support 1M+ clips across 16+ GPU nodes with near-linear scaling. The key bottleneck is typically gradient synchronization and data loading, not compute.\n- Training efficiency: Through architecture and pipeline optimizations, the training time for 1M clips can be reduced from 8 days to 5 days (roughly a 37% reduction in wall-clock time), primarily through mixed-precision training, gradient accumulation, and optimized data loading.\n- Incremental gains: Two phases of improvement are typical:\n  - Data scaling: Increasing training data from 25K to 750K clips, combined with model structure optimization, reduces Ego ADE (Average Displacement Error) by more than 10%, from 3.0m to 2.6m.\n  - Feature distillation: Removing unnecessary structured information (e.g., explicit object proposals) and using a pure feature representation with expert supervision further reduces Ego ADE by 7.6%, from 2.6m to 2.4m.\nThe second phase is particularly noteworthy: it suggests that explicit structured representations (object boxes, lane lines) may not be necessary for the planner, and that learned dense features can be more informative when properly supervised.\nEvaluation Systems\nOpen-loop metrics (ADE, FDE) are necessary but insufficient for evaluating driving quality. 
A comprehensive evaluation system must assess multiple dimensions:\n| Dimension | Metrics | Description |\n| --- | --- | --- |\n| Safety | TTC (Time-to-Collision) | Minimum time to collision with any dynamic object |\n| Comfort | Jerk, lateral acceleration | Passenger comfort metrics |\n| Efficiency | Progress, speed deviation | How efficiently the ego reaches its destination |\n| Compliance | Traffic light adherence, lane keeping | Adherence to traffic regulations |\n| Consistency | Trajectory overlap ratio | Agreement between consecutive-frame predictions |\nBenchmark construction: A dedicated test set of 1200 clips covering diverse scenarios (urban, highway, intersection, adverse weather) provides the foundation for reproducible evaluation.\nEfficiency: Evaluation pipeline optimization can reduce per-clip evaluation time from 10 minutes to 10 seconds, enabling rapid iteration during development.\nSemi-closed-loop metrics: Pure open-loop evaluation can miss failure modes that only appear under the model\u0026rsquo;s own actions. Semi-closed-loop metrics\u0026mdash;where the model\u0026rsquo;s predicted trajectory is \u0026ldquo;unrolled\u0026rdquo; for a few steps without affecting the environment\u0026mdash;provide a middle ground. Key metrics include GT-free TTC (safety), comfort measures, and efficiency, computed under the model\u0026rsquo;s own trajectory rather than the ground-truth future.\nReal-Vehicle Deployment\nThe transition from simulation to real-vehicle testing reveals additional challenges not captured by any open-loop or semi-closed-loop metric. 
Successfully deploying a one-stage model on a real vehicle requires:\n- Latency optimization: The model must produce trajectories within the vehicle\u0026rsquo;s control cycle (≤ 100 ms), including all pre-processing, inference, and post-processing.\n- Fallback mechanisms: When the model\u0026rsquo;s confidence is low (e.g., in out-of-distribution scenarios), the system must gracefully fall back to a rule-based planner or emergency stop.\n- Monitoring and logging: Comprehensive logging of model inputs, outputs, and internal states is essential for post-hoc analysis of failure cases.\nArchitecture Evolution Summary\nThe full architectural evolution from modular systems through V2.0 one-stage models to V3.0 VLA-native systems can be visualized as follows:\n[Figure: End-to-End Autonomous Driving: Architecture Evolution. V1.0 Modular (Perception, Prediction, Planning, Control): information bottleneck at each interface, individually optimized modules. V2.0 One-Stage (shared backbone over visual, map, and ego inputs; planner decoder: AR / Diffusion / Flow; trajectory output): no information bottleneck, decoder selection is key, AR+Diffusion best on NavSim, corner cases remain. V3.0 VLA-Native (vision-language model with multi-frame temporal understanding; reasoning trace and meta action; action generation via Concat-KV or Decoupled): action generated natively in the LLM, grounded in causal reasoning, corner case capable; open architecture choice: semantic vs. conditional.]\nDiscussion and Open Questions\nSeveral core questions remain open as the field moves toward V3.0:\n1. Is the corner case problem primarily a representation problem or a data problem? If corner cases arise from insufficient coverage in the training distribution, then more data (or better augmentation) is the solution. If they arise from the model\u0026rsquo;s inability to represent the relevant distinctions, then architectural changes (like VLA) are necessary. 
The truth is likely a combination, but the relative importance determines whether V3.0 is a qualitative leap or an incremental improvement.\n2. Can the two VLA philosophies be unified? The concat-KV and decoupled approaches represent two ends of a spectrum. A promising direction is adaptive grounding: use concat-KV for scenes that require deep reasoning (detected by an uncertainty or complexity estimator) and the decoupled approach for routine driving. This would give the best of both worlds at the cost of architectural complexity.\n3. 
How should we evaluate corner case performance? Current benchmarks (NavSim, nuScenes) are dominated by routine driving scenarios. Dedicated corner case benchmarks [5] are emerging, but standardized evaluation remains an open problem. The WM-MoE framework [6] proposes using world models to generate corner cases, but the fidelity of these generated scenarios to real-world corner cases is not yet validated.\n4. What is the role of reinforcement learning? GRPO and similar RL methods can improve specific metrics (frame consistency, lane-change triggering) but often introduce new failure modes. The RL reward design problem for driving is fundamentally harder than for language: there is no simple analog of \u0026ldquo;helpfulness\u0026rdquo; that captures all aspects of safe, efficient, and comfortable driving.\nReferences\n[1] MotionLM: Multi-Agent Motion Forecasting as Language Modeling. Waymo Research, ICCV 2023. arXiv:2309.16534\n[2] DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving. Liao et al., 2024. arXiv:2411.15139\n[3] Chainflow-VLA: AR-initialized chain-of-diffusion for end-to-end driving. NavSim v1 navtest leaderboard, PDMS 94.05. The author was unable to independently verify the leaderboard ranking from publicly accessible sources as of this writing; the score is reported as cited in internal engineering documentation.\n[4] OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision-Language-Action Model. 2025. arXiv:2503.23463\n[5] Driving in Corner Case: A Real-World Adversarial Driving Benchmark for End-to-End Autonomous Driving. 2025. arXiv:2512.16055\n[6] WM-MoE: Addressing corner cases in autonomous driving with a world model-based Mixture of Experts. Transportation Research Part C, 2026. DOI:10.1016/j.trc.2025.105607\n[7] NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking. Dauner et al., NeurIPS 2024. 
arXiv:2406.15349\n[8] GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectories Generation in End-to-End Autonomous Driving. CVPR 2025.\n[9] A Survey on Vision-Language-Action Models for Autonomous Driving. Jiang et al., ICCV 2025 Workshop. arXiv:2512.16760\n[10] GuideFlow: Constraint-Guided Flow Matching for Planning in End-to-End Autonomous Driving. 2025. arXiv:2511.18729\n","permalink":"https://xuquant.com/en/posts/autonomous-driving/e2e-autonomous-driving-evolution/","summary":"\u003cblockquote\u003e\n\u003cp\u003e中文版本：\u003ca href=\"/posts/autonomous-driving/e2e-autonomous-driving-evolution.zh/\"\u003e阅读中文版\u003c/a\u003e\u003c/p\u003e\n\u003c/blockquote\u003e\n\u003ch2 id=\"introduction\"\u003eIntroduction\u003c/h2\u003e\n\u003cp\u003eThe trajectory of autonomous driving architecture has undergone a paradigm shift: from the classical modular pipeline (perception \u003cspan class=\"katex\"\u003e\u003cmath xmlns=\"http://www.w3.org/1998/Math/MathML\"\u003e\u003csemantics\u003e\u003cmrow\u003e\u003cmo\u003e→\u003c/mo\u003e\u003c/mrow\u003e\u003cannotation encoding=\"application/x-tex\"\u003e\\to\u003c/annotation\u003e\u003c/semantics\u003e\u003c/math\u003e\u003c/span\u003e prediction \u003cspan class=\"katex\"\u003e\u003cmath xmlns=\"http://www.w3.org/1998/Math/MathML\"\u003e\u003csemantics\u003e\u003cmrow\u003e\u003cmo\u003e→\u003c/mo\u003e\u003c/mrow\u003e\u003cannotation encoding=\"application/x-tex\"\u003e\\to\u003c/annotation\u003e\u003c/semantics\u003e\u003c/math\u003e\u003c/span\u003e planning \u003cspan class=\"katex\"\u003e\u003cmath xmlns=\"http://www.w3.org/1998/Math/MathML\"\u003e\u003csemantics\u003e\u003cmrow\u003e\u003cmo\u003e→\u003c/mo\u003e\u003c/mrow\u003e\u003cannotation encoding=\"application/x-tex\"\u003e\\to\u003c/annotation\u003e\u003c/semantics\u003e\u003c/math\u003e\u003c/span\u003e control) toward end-to-end systems that map sensory inputs directly to driving actions. 
This transition is not merely an engineering convenience\u0026mdash;it reflects a deep recognition that modular interfaces impose information bottlenecks and that joint optimization across the full stack can yield emergent capabilities invisible to individually optimized modules.\u003c/p\u003e","title":"End-to-End Autonomous Driving: From Modular Decoders to VLA Architectures"},{"content":" 中文版本：阅读中文版\nFigure from DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving\nAutoregressive (AR) trajectory generation \u0026mdash; predicting driving trajectories as sequences of discrete tokens, much like language models predict text \u0026mdash; has emerged as a powerful paradigm for end-to-end autonomous driving. But how do we turn continuous trajectories into discrete tokens? How do we ensure the tokenized representation preserves enough fidelity for planning? And how does the AR paradigm combine with diffusion and reinforcement learning to produce state-of-the-art results? This article walks through the complete pipeline, from tokenization theory to RL post-training.\n1. Background: Regression vs. Classification in AR Planning Autoregressive trajectory generation splits into two main paradigms:\nRegression-based AR Continuous output at each step + No discretization error - Multimodality via GMM is hard - True distribution unknown Examples: MDR, UniAD Classification-based AR Discrete token prediction + Explicit distribution modeling + Natural multimodality - Quantization error Examples: MotionLM, SMART Regression-based AR outputs continuous coordinates at each step. In theory, multi-modal regression (e.g., via GMM) can capture diverse behaviors. In practice, the true distribution is unknown, and fitting enough modes is extremely difficult.\nClassification-based AR discretizes the continuous action space and predicts token indices via cross-entropy. 
This naturally models the conditional probability distribution $p(a_t \\mid a_{\u0026lt;t}, x)$, making multimodality a first-class citizen.\n1.1 Discretization: State Quantities vs. High-Order Motion Quantities\nFor classification-based AR, the choice of what to discretize is critical:\n| Quantity | Pros | Cons |\n| --- | --- | --- |\n| State $(x, y, h)$ | Directly available from data; no inverse kinematics | Requires clustering to build vocabulary |\n| High-order $(\\text{acc}, \\text{yaw\\_rate})$ | More compact; control-oriented | GT values hard to obtain; unreasonable for VRU; incompatible with low-frequency prediction |\nThe high-order approach has three specific problems:\n- Ground truth acquisition: Acceleration and yaw rate for obstacles are difficult to measure accurately.\n- Uniform kinematic model: Applying the same model to vehicles, cyclists, and pedestrians is unreasonable.\n- Frequency limitation: At a low prediction rate (e.g., one token per second), assuming constant acceleration/yaw rate over the whole 1-second interval cannot capture the actual motion trend.\nVAE-based discretization avoids explicit quantization by operating in latent space, but suffers from training instability and mode collapse.\nOur choice: State-based discretization via clustering. Trajectories are state quantities directly extractable from data, eliminating inverse kinematics errors. Clustering compresses massive historical data into a finite, representative Trajectory Vocabulary.\n2. The Mdriver AR Pipeline\n2.1 Task Definition\nWe assume any trajectory of length $T$ can be composed of trajectory fragments (tokens). 
For an 8-second trajectory at 2 Hz (16 coordinate points), we can define:\n- 16 single-point tokens\n- 8 two-point tokens (each covering 1 second)\n- 4 four-point tokens (each covering 2 seconds)\nThe joint distribution factorizes as:\n$$p(S_{1:T} \\mid \\text{Env}) = \\prod_{t=1}^{T} p(S_{t:t+n} \\mid s_{\u0026lt;t}, \\text{Env})$$\nwhere $S_{t:t+n}$ denotes the state sequence $(x, y, h, v)$ from timestep $t$ to $t+n$.\n2.2 Tokenizer: Cluster, Match, Reconstruct\nTokenization is the core of the AR model. It has three stages:\nStage 1: Clustering\nGiven $n$ training trajectories, each containing 17 frames (1 current + 16 future) with state $[x, y, h, v]$ per frame:\n- Apply k-means separately per category (ego, vehicle, cyclist, pedestrian).\n- The vocabulary has shape $[m, 3, 4]$: $m$ cluster centers, each a token with 3 points per segment (current + 2 future) and 4 state dimensions.\n- Current vocabulary size: $m = 6000$ per category (uniform across categories; ablation on category-specific sizes is pending).\nToken refinement (optional but important):\n- Heading fix: Ensure motion direction is consistent with heading.\n- Velocity fix: Use finite-difference velocity instead of raw values.\nThese fixes reduce noise from imperfect perception data, producing cleaner cluster centers.\nStage 2: Matching\nMatching assigns each ground-truth trajectory segment to the nearest token. This step is critical and involves a subtle design decision:\n[Figure: matching from T1 (origin), showing a token against the GT trajectory. Wrong: match each segment starting from the GT position. Correct: match starting from the previously matched token. 
Key insight: matching from GT at each step introduces no accumulation error, but inference uses tokens, which creates a train-test mismatch.]\nThe critical question: When matching the GT segment at $T_2$, should we start from the GT position at $T_1$ or from the token-matched position at $T_1$?\nAnswer: We must match from the token position. 
Matching from GT creates a train-test mismatch \u0026mdash; at inference time, the model always conditions on its own previous predictions, not the ground truth. Matching from GT introduces no accumulation error during training, but the model never learns to recover from its own prediction errors.\nMatching cost: Currently using a weighted combination of center-point L2 distance and heading L2 distance. The SMART approach uses bounding-box corner matching, which avoids the threshold-tuning problem.\nStage 3: Reconstruction Error Analysis After matching, we reconstruct full trajectories from tokens and measure error against GT. Key observation: finer tokens and larger vocabulary reduce reconstruction error, but model performance does not always correlate with reconstruction accuracy. The tokenizer\u0026rsquo;s fidelity is necessary but not sufficient for good downstream performance.\n3. Model Architecture: Why AR + Diffusion? Pure diffusion models (DiffusionDrive, GoalFlow) have demonstrated that anchors are critical for preventing trajectory divergence. The question is: where do the anchors come from?\nPure Diffusion Huge search Divergent without anchor AR + Diffusion AR Small refinement Coherent output Analogy 1. Have a concept 2. Describe in words 3. Refine expression Step 1-2 = AR Step 3 = Diffusion = AR + Diffusion The AR model learns the conditional distribution p(xt+1∣x1:t)p(x_{t+1} \\mid x_{1:t}), but during rollout it samples from its own predictions, accumulating exposure bias and compounding error over long horizons. Diffusion\u0026rsquo;s multi-step iterative refinement is naturally suited to correcting this drift.\nThe complementary strengths:\nAR solves Diffusion\u0026rsquo;s cold-start: Pure diffusion starts from Gaussian noise with an enormous search space. AR provides a trajectory already on the data manifold, dramatically reducing the denoising burden. 
Diffusion solves AR\u0026rsquo;s drift: Global modeling via diffusion acts as a \u0026ldquo;smoothing filter,\u0026rdquo; correcting accumulated deviations in long-horizon predictions.\nThis combination reportedly achieved top results on the NavSim benchmark (NAVSIM v1 navtest), with Chainflow-VLA scoring 94.05 PDMS (per internal engineering documentation; not independently verified from public sources).\n4. RL Post-Training with GRPO\nFor a pre-trained AR model, reinforcement learning can further optimize driving strategy through environment interaction. We formulate the problem as an MDP: $\\mathcal{M} = (S, A, P, R, \\gamma)$.\n4.1 State, Action, Reward\n| Component | Definition |\n| --- | --- |\n| State $S$ | Latent representation from encoder: $\\text{element} = f_{\\text{encoder}}(\\text{input})$ |\n| Action $A$ | Scheme 1: Actor network replaces decoder; action = token selection from discrete vocabulary. Scheme 2: Actor makes continuous adjustment $(\\Delta x, \\Delta y, \\Delta h)$ to selected token. 
|\n| Reward $R$ | TTC-based collision penalty: $R_{\\text{TTC}} = -\\frac{10}{\\max(0, 2 - \\text{TTC})} - 1$ |\nThe TTC reward is designed with exponential growth:\n- TTC \u0026gt; 2s: No penalty (safe)\n- TTC ≤ 2s: Exponentially increasing penalty\n- TTC = 0: Large penalty + episode termination\n4.2 From PPO to GRPO\nThe core evolution from PPO to GRPO lies in how the advantage $A$ is estimated:\nPPO uses a learned value function $V_\\phi(s)$ as baseline:\n$$A_\\pi(s_t, a_t) = Q_\\pi(s_t, a_t) - V_\\phi(s_t)$$\nThis requires training and maintaining a separate value network.\nGRPO replaces the value baseline with the group mean, eliminating the value network entirely:\n$$A_i = \\frac{r_i - \\bar{r}}{\\sigma_r + \\epsilon}$$\nwhere $r_i$ is the reward for the $i$-th sample in a group of $G$ trajectories sampled from the same initial state, $\\bar{r} = \\frac{1}{G}\\sum_{j=1}^{G} r_j$, and $\\sigma_r$ is the group standard deviation.\nThis is particularly well-suited for driving: we can sample $G$ candidate trajectories for the same scene, evaluate them all, and use the relative ranking within the group as the advantage signal. 
No value network needed.\n4.3 Loss Design\nThe total loss combines multiple objectives:\n$$\\mathcal{L}_{\\text{total}}(\\theta) = \\lambda_{\\text{pg}} \\mathcal{L}_{\\text{GRPO-clip}} + \\lambda_{\\text{kl}} \\mathcal{L}_{\\text{KL}} + \\lambda_{\\text{vf}} \\mathcal{L}_{\\text{value}} + \\lambda_{\\text{ent}} \\mathcal{L}_{\\text{entropy}} + \\lambda_{\\text{bc}} \\mathcal{L}_{\\text{BC}} + \\sum_{m=1}^{M} \\lambda_{\\text{aux},m} \\mathcal{L}_{\\text{aux},m}$$\n| Term | Role |\n| --- | --- |\n| $\\mathcal{L}_{\\text{GRPO-clip}}$ | Policy gradient with clipped importance ratio |\n| $\\mathcal{L}_{\\text{KL}}$ | Prevent distribution drift from reference policy |\n| $\\mathcal{L}_{\\text{value}}$ | Value function fitting (only if a critic is kept; pure GRPO drops this term) |\n| $\\mathcal{L}_{\\text{entropy}}$ | Maintain exploration |\n| $\\mathcal{L}_{\\text{BC}}$ | Behavioral cloning: preserve pre-training capability |\n4.4 Sampling in Driving vs. LLM\nA crucial difference from LLM RL: the sampling space is constrained. In language, temperature/top-k/top-p sampling is unrestricted. In driving, sampled trajectories must satisfy physical constraints \u0026mdash; no sudden velocity changes, heading reversals, or kinematically impossible curvatures.\nFor diffusion-based planners specifically, noise experiments are essential because noise directly determines exploration range, candidate diversity, and group distribution. The right noise level must provide:\n- Effective diversity: Group rewards should have meaningful spread.\n- Physical plausibility: No obviously infeasible samples.\n- Goal alignment: Exploration direction should align with training objectives.\n5. 
Evaluation Metrics\n5.1 Accuracy Metrics\n| Metric | Description |\n| --- | --- |\n| Top1_ADE | ADE of highest-scoring mode |\n| minADE | Minimum ADE across all modes (chosen per agent, so the best modes of different agents may not belong to the same joint mode) |\n| Joint_minADE | Minimum ADE at the mode level (all agents from the same mode) |\n5.2 Kinematic Metrics (Ego Only)\n| Metric | Description |\n| --- | --- |\n| Top1_Kinematic_Score | Weighted average of all kinematic sub-metrics |\n| Top1_Kinematic_Rec_Cons | Reconstructability: forward-predicted vs. inverse-reconstructed state error via bicycle model |\n| Top1_Kinematic_Vel | Velocity error (predicted vs. GT via finite difference) |\n| Top1_Kinematic_Acc | Acceleration error |\n| Top1_Kinematic_YR | Yaw rate error |\n| Top1_Kinematic_Jerk_Long/Lat | Longitudinal/lateral jerk error |\nThe Reconstruction Consistency metric is particularly insightful: it evaluates whether predicted trajectories satisfy the bicycle kinematic model by computing forward predictions, then inversely reconstructing the states, and measuring the residual. This tests physical plausibility independent of GT.\n5.3 Interaction Metrics (Collision)\n| Metric | Description |\n| --- | --- |\n| Top1_CR_Ego | Ego collision rate in top-1 mode |\n| Top1_CR_Agents | Any-agent collision rate = Pairwise + Agent-Time components |\n| Top1_CR_Scenario | Per-scenario binary: does any collision occur? |\n| Joint_minCR_* | Same metrics at mode level (best mode selected) |\n6. 
Quantitative Results\nComparison with the regression baseline on the same train/test split:\n| Metric | Regression Model | AR Model |\n| --- | --- | --- |\n| Top1_ADE (Ego) | 2.622 | 2.869 |\n| Top1_ADE (Agent) | 1.759 | 1.847 |\n| Joint_minADE (Ego) | \u0026ndash; | 2.811 |\n| Joint_minADE (Agent) | \u0026ndash; | 1.841 |\n| minADE (Ego) | 1.464 | 1.876 |\n| minADE (Agent) | 1.576 | 1.286 |\nThe AR model shows slightly higher top-1 ADE (expected: discrete quantization introduces error) and a higher ego minADE (1.876 vs. 1.464), but achieves a clearly lower minADE for agents (1.286 vs. 1.576), suggesting that its multi-modal predictions better cover the distribution of agent behaviors.\n7. 
Qualitative Observations The AR model demonstrates strong interactive behavior across challenging scenarios:\nUnprotected left turn: Waits for through-traffic, then proceeds smoothly. Obstacle circumnavigation: Suggests detour paths when GT chooses to stop; \u0026ldquo;finds a way\u0026rdquo; around obstacles. Narrow road steering: Small lateral adjustments to create clearance. Cut-in: Assertive lane changes into tight gaps. Pedestrian yielding: Smooth deceleration at crosswalks. These behaviors emerge naturally from the AR model\u0026rsquo;s learned distribution over trajectory tokens, without explicit rule programming.\n8. Remaining Challenges Token vocabulary design: Should we cluster per-timestep or globally? Current uniform m=6000m=6000 across categories needs ablation. Online vs. offline matching: Current online matching in the model slows training; offline matching is planned. AR+Diffusion integration: The AR+Diffusion pipeline is the theoretical target but only the AR baseline + RL has been completed so far. Frame consistency: AR models exhibit higher ADE between consecutive rollout frames (jitter). GRPO with frame-stability reward can fix this but at some cost to lane-change trigger metrics. 
References\n- MotionLM: Multi-Agent Motion Forecasting as Language Modeling (Waymo, ICCV 2023)\n- DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving (CVPR 2025)\n- GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectories Generation in End-to-End Autonomous Driving (CVPR 2025)\n- AlphaDrive: GRPO-based RL for Autonomous Driving\n- SMART: Scalable Multi-agent Real-time Motion Generation via Next-token Prediction\n- NavSim Benchmark\n","permalink":"https://xuquant.com/en/posts/autonomous-driving/ar-trajectory-tokenization/","summary":"\u003cblockquote\u003e\n\u003cp\u003e中文版本：\u003ca href=\"/posts/autonomous-driving/ar-trajectory-tokenization.zh/\"\u003e阅读中文版\u003c/a\u003e\u003c/p\u003e\n\u003c/blockquote\u003e\n\u003cp\u003e\u003cpicture\u003e\n  \u003csource srcset=\"/images/paper-figures/diffusion-drive-fig1.webp\" type=\"image/webp\"\u003e\n  \u003cimg src=\"/images/paper-figures/diffusion-drive-fig1.png\" alt=\"DiffusionDrive: End-to-End Autonomous Driving Paradigm Comparison\" loading=\"lazy\" decoding=\"async\"\u003e\n\u003c/picture\u003e\n\u003cem\u003eFigure from \u003ca href=\"https://arxiv.org/abs/2411.15139\"\u003eDiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving\u003c/a\u003e\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003eAutoregressive (AR) trajectory generation \u0026mdash; predicting driving trajectories as sequences of discrete tokens, much like language models predict text \u0026mdash; has emerged as a powerful paradigm for end-to-end autonomous driving. But how do we turn continuous trajectories into discrete tokens? How do we ensure the tokenized representation preserves enough fidelity for planning? And how does the AR paradigm combine with diffusion and reinforcement learning to produce state-of-the-art results? 
This article walks through the complete pipeline, from tokenization theory to RL post-training.\u003c/p\u003e","title":"Trajectory Tokenization for Autoregressive Planning: Clustering, Matching, and the AR+Diffusion Paradigm"},{"content":" 中文版本：阅读中文版\nFigure from DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving\nThe trajectory planner is the decision-making core of an autonomous driving system. Its task: given the current scene, output a future trajectory that is safe, comfortable, and efficient. Most production systems today use some form of regression \u0026mdash; minimizing the distance between predicted and ground-truth trajectories. Yet a growing body of research and engineering evidence suggests this approach has a basic flaw: it assumes the feasible set is convex when it is emphatically not. This article lays out the first-principles argument for why generative approaches (diffusion, autoregressive) are necessary paradigm shifts, not merely improvements.\n1. The Non-Convexity of the Feasible Set A set SS is convex if for any two points A,B∈SA, B \\in S, every point on the line segment connecting them also belongs to SS. In driving, this property fails dramatically:\nObs Ego A: left detour (feasible) B: right detour (feasible) C = (A+B)/2: CRASH Trajectory A goes left around the obstacle; trajectory B goes right. Both are valid. Their average A+B2\\frac{A+B}{2} drives straight into the obstacle \u0026mdash; infeasible. The feasible set is not convex, and no amount of regularization changes this geometric fact.\n2. 
Why Regression Fails: MSE Averages Modes\nRegression with MSE loss minimizes:\n$$\\min \\; \\mathbb{E}\\left[\\| y_{\\text{pred}} - y_{\\text{gt}} \\|^2\\right]$$\nWhen the data distribution is multimodal (e.g., left detour and right detour are both common), the optimal MSE predictor outputs the conditional mean, which for two equally likely modes $A$ and $B$ is their midpoint:\n$$y^* = \\mathbb{E}[y_{\\text{gt}} \\mid x] = \\frac{A + B}{2}$$\nThis is not a bug in training \u0026mdash; it is the mathematically correct solution to the wrong objective. The regression objective assumes a unimodal distribution centered on the mean, which is provably incorrect for non-convex feasible sets.\n[Figure: density over trajectory space with two peaks, Mode A (left) and Mode B (right); the MSE mean falls in the low-density valley between them.]\nThe MSE mean lands in the valley between two modes \u0026mdash; a region of low probability density. The model outputs a trajectory that no human driver would ever take.\n3. GMM: A Patch, Not a Solution\nGaussian Mixture Models (GMM) with $K$ components attempt to address multimodality by learning $K$ means. Each component\u0026rsquo;s mean update $\\mu_i$ is still the weighted average of samples assigned to that component, with $\\gamma_{n,i}$ the responsibility of component $i$ for sample $y_n$:\n$$\\mu_i = \\frac{\\sum_n \\gamma_{n,i} \\cdot y_n}{\\sum_n \\gamma_{n,i}}$$\nThis creates two problems:\n- Spurious peaks: When two true modes are close, their Gaussian components can overlap and produce a false peak in the valley between them.\n- Finite approximation: $K$ Gaussians are finite convex building blocks. A non-convex shape can never be perfectly tiled by convex pieces. 
There will always be \u0026ldquo;gaps\u0026rdquo; (non-zero probability where there should be none) and \u0026ldquo;dead corners\u0026rdquo; (insufficient $K$ to cover all modes).\n[Figure: true bimodal distribution vs. a GMM fit with K=2, showing spurious density in the valley.]\nGMM is a patch, not a solution. It uses a finite number of simple convex building blocks to approximate a complex non-convex shape. 
The approximation error is structural, not parametric: it cannot be fixed by increasing training data or tuning hyperparameters.\n4. The Penalty Loss Illusion\nA common engineering practice is to add penalty terms (collision, off-road, comfort) on top of MSE loss:\n$$\\mathcal{L} = \\mathcal{L}_{\\text{MSE}} + \\lambda_1 \\mathcal{L}_{\\text{collision}} + \\lambda_2 \\mathcal{L}_{\\text{off-road}} + \\lambda_3 \\mathcal{L}_{\\text{comfort}} + \\cdots$$\nThis amounts to converting hard constraints into soft penalties, with the weights $\\lambda_i$ playing the role of Lagrange multipliers. The approach is valid only when the optimization problem is convex. On a non-convex landscape, gradient descent from the MSE initialization can get trapped in local minima, and the penalty terms merely push the solution toward the nearest feasible boundary rather than the globally optimal trajectory.\nThe classical EM (Expectation Maximization) planner understood this well. It decomposed the problem into two stages:\n[Figure: Step A, Path Decider (select corridor, non-convex → convex); Step B, Speed Optimizer (QP in convex sub-region); Result: smooth, feasible trajectory]\nStep A (Path Decider): Choose a corridor (e.g., \"go left\"), cutting the non-convex space into a convex sub-region.\nStep B (Speed Optimizer): Solve a Quadratic Program (QP) within this convex sub-region to obtain a smooth trajectory.\nFirst find a convex sub-problem, then solve it. End-to-end regression skips Step A entirely, attempting to solve the non-convex problem in one shot.\n5. 
Generative Models: Learning the Non-Convex Shape\nGenerative approaches take a fundamentally different path:\nMethod | How it handles non-convexity\nDiffusion | Directly learns the shape of the non-convex distribution via gradient/flow field\nAutoregressive | Decomposes the joint distribution via chain rule into conditional distributions; converts a geometric problem into a sequential decision problem\n5.1 Diffusion: Learning the Contour\nA diffusion model learns the score function $\\nabla_y \\log p(y \\mid x)$, which points toward higher-density regions at every point in trajectory space. During sampling, it follows this gradient field from noise to data, naturally navigating around infeasible regions:\n[Figure: samples starting from noise flow to feasible region A (Mode A) or feasible region B (Mode B), avoiding the infeasible region between them]\nThe score field naturally pushes samples away from infeasible regions (near-zero density) and toward high-density modes.\n5.2 Autoregressive: Sequential Decision Decomposition\nThe autoregressive approach applies the chain rule to decompose the joint trajectory distribution:\n$$p(S_{1:T} \\mid \\text{Env}) = \\prod_{t=1}^{T} p(S_{t:t+n} \\mid s_{\u003ct}, \\text{Env})$$\nAt each step, the model only needs to predict a local trajectory segment conditioned on the current state. Each local prediction faces a simpler distribution (often nearly unimodal at the step level), and the global multimodality emerges from the sequential composition of these choices.\nThis converts a geometric problem (find a trajectory in a non-convex set) into a sequential decision problem (at each step, choose the most likely next segment), which is precisely the regime where autoregressive models excel.\n6. 
The Convergence: AR + Diffusion\nThe most promising direction combines both paradigms, leveraging their complementary strengths:\nAspect | AR | Diffusion\nStrength | Accurate single-step prediction; diversity via token vocabulary | Global trajectory coherence; smooth \"error correction\" over long horizons\nWeakness | Exposure bias and compounding error over long rollouts | Cold-start problem: enormous search space from pure noise\nRole in combination | Provides anchor trajectory near the data manifold | Refines anchor into globally coherent, smooth trajectory\nThe synergy is clear:\nAR solves Diffusion's cold-start: Instead of starting from Gaussian noise, diffusion begins from the AR-generated anchor, which is already near the manifold, vastly reducing the denoising burden.\nDiffusion solves AR's drift: The global refinement step corrects compounding errors that accumulate in long autoregressive rollouts.\nThis AR + Diffusion combination reportedly achieved top-ranking results on the NAVSIM v1 navtest benchmark (Chainflow-VLA, 94.05 PDMS; the author has not independently verified this figure from public sources) and has been validated in works like DiffusionDrive (anchor-based truncated diffusion, NAVSIM v1) and GoalFlow (goal-point guided flow matching, NAVSIM v1).\n[Figure: AR (sequential token prediction → diversity) produces an anchor; Diffusion (global refinement, smooth filter → coherence) refines it; output: diverse + coherent + smooth]\n7. 
Summary\nApproach | Non-convex handling | Multimodality | Limitation\nRegression (MSE) | None: outputs conditional mean | Fails: averages modes into infeasible region | Structural failure on non-convex sets\nGMM | Partial: finite convex approximation | Limited by $K$; spurious peaks | Patch, not solution\nMSE + Penalty Loss | Indirect via soft constraints | Same MSE mean, just pushed toward boundary | Only valid for convex sub-problems\nDiffusion | Direct: learns the full distribution shape | Natural: samples from learned modes | Cold-start; may lack diversity without anchors\nAutoregressive | Decomposition via chain rule | Natural: sequential choices compose to multimodality | Compounding error; frame inconsistency\nAR + Diffusion | Both: decomposition + global refinement | Best of both: diverse anchors + coherent output | Engineering complexity; training cost\nThe progression from regression to GMM to generative models is not a matter of incremental improvement. It reflects a basic recognition: the planning problem in autonomous driving is inherently non-convex, and any approach that ignores this geometric fact will produce artifacts that no amount of engineering patching can fix.\nReferences\n- DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving (CVPR 2025)\n- GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectory Generation\n- MotionLM: Multi-Agent Motion Forecasting as Language Modeling (Waymo, ICRA 2024)\n- AlphaDrive: GRPO-based RL with Planning Reasoning for Autonomous Driving\n- NavSim Benchmark\n- Diffusion-Planner: Transformer-based Diffusion for Closed-Loop Planning\n","permalink":"https://xuquant.com/en/posts/autonomous-driving/generative-planning-nonconvex/","summary":"\u003cblockquote\u003e\n\u003cp\u003e中文版本：\u003ca href=\"/posts/autonomous-driving/generative-planning-nonconvex.zh/\"\u003e阅读中文版\u003c/a\u003e\u003c/p\u003e\n\u003c/blockquote\u003e\n\u003cp\u003e\u003cpicture\u003e\n  \u003csource 
srcset=\"/images/paper-figures/diffusion-drive-arch.webp\" type=\"image/webp\"\u003e\n  \u003cimg src=\"/images/paper-figures/diffusion-drive-arch.png\" alt=\"DiffusionDrive Architecture\" loading=\"lazy\" decoding=\"async\"\u003e\n\u003c/picture\u003e\n\u003cem\u003eFigure from \u003ca href=\"https://arxiv.org/abs/2411.15139\"\u003eDiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving\u003c/a\u003e\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003eThe trajectory planner is the decision-making core of an autonomous driving system. Its task: given the current scene, output a future trajectory that is safe, comfortable, and efficient. Most production systems today use some form of regression \u0026mdash; minimizing the distance between predicted and ground-truth trajectories. Yet a growing body of research and engineering evidence suggests this approach has a basic flaw: it assumes the feasible set is convex when it is emphatically not. This article lays out the first-principles argument for why generative approaches (diffusion, autoregressive) are \u003cem\u003enecessary\u003c/em\u003e paradigm shifts, not merely improvements.\u003c/p\u003e","title":"Why Generative Planning? The Non-Convexity Argument Against Regression in Autonomous Driving"}]