Autonomous Driving

ReflectDrive-2: Li Auto's Discrete-Diffusion End-to-End Driving with Joint RL Optimization

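Introduction: Discrete Diffusion + End-to-End Driving = A New Paradigm?

Across 2025-2026, the debate over the right route to end-to-end autonomous driving has kept intensifying. The mainstream splits into two camps:

| Approach | Representatives | Core idea | Pain points |
| --- | --- | --- | --- |
| Autoregressive (AR) | GPT-driver, the VLA family | Emit the trajectory sequentially, token by token | Serial decoding is slow; only small models fit on-device |
| Continuous diffusion | UniAD, DriveWM, PlanningDiffuser | Generate the trajectory by denoising in continuous space | Anchors/goals bring in an extra system and disrupt the data distribution |

Li Auto's ReflectDrive-2 (CVPR 2026) takes a third route: a discrete diffusion model for end-to-end autonomous driving. At first glance it looks like an upgraded ReflectDrive, but a closer look suggests it may be a fresh rethink of production-grade end-to-end systems. This article gives a full technical walkthrough along four dimensions: modeling choice, inference architecture, training strategy, and engineering deployment.

1. Why Discrete Diffusion? Starting from First Principles

The Essential Contrast Between the Three Routes

    flowchart LR
      subgraph AR["Autoregressive (AR)"]
        direction TB
        AR1[t₁: output token 1]
        AR2["t₂: output token 2 ⬅️ depends on t₁"]
        AR3["t₃: output token 3 ⬅️ depends on t₁,t₂"]
        AR1 --> AR2 --> AR3
        AR_style["❌ serial bottleneck<br/>❌ small on-device models only<br/>✅ well explored"]
      end
      subgraph ContDiff["Continuous Diffusion"]
        direction TB
        CD1[Continuous noise injection]
        CD2[N denoising steps in continuous space]
        CD3[Output continuous trajectory coordinates]
        CD1 --> CD2 --> CD3
        CD_style["⚠️ needs an extra anchor system<br/>⚠️ breaks the regularity of the data distribution<br/>✅ parallel generation"]
      end
      subgraph DiscDiff["Discrete Diffusion (this approach)"]
        direction TB
        DD1[Token-level mask injection]
        DD2[N bidirectional parallel denoising steps]
        DD3[Output discrete token sequence]
        DD1 --> DD2 --> DD3
        DD_style["✅ fully parallel decoding<br/>✅ unified vocabulary eases pretraining<br/>✅ Token2Token supports AutoEdit<br/>✅ clear RL exploration space"]
      end
      AR --- ContDiff --- DiscDiff

Five Advantages of Discrete Diffusion

| # | Advantage | vs. AR | vs. continuous diffusion |
| --- | --- | --- | --- |
| 1 | Unified vocabulary | Same | All inputs (vision / state / navigation) can be discretized into one token space → natural information exchange, supports pretraining tasks |
| 2 | Efficient sampling | ❌ Serial, O(n) | ✅ Parallel decoding, O(1) per step |
| 3 | Native AutoEdit support | ❌ Not supported | ✅ Direct token-to-token rewriting |
| 4 | RL-friendly | Difficult (sequence credit assignment) | ✅ Discrete action space, clear exploration |
| 5 | End-to-end scaling | Limited by serial decoding | ✅ Separate Action Expert FFN, parameter-efficient |

2. Model Architecture: A Compact 0.8B-Parameter Design

Overall Architecture

    flowchart LR
      subgraph Input["Multimodal input"]
        CAM["Three surround-view cameras<br/>front-left / front / front-right<br/>2 time frames each"]
        NAV["Navigation instruction tokens<br/>(after text encoding)"]
        EGO["Ego-state tokens<br/>speed / heading, etc."]
      end
      subgraph Encoder["Vision encoder ViT (0.1B)"]
        direction TB
        V1[Patch Embedding]
        V2[Transformer Blocks]
        V1 --> V2
      end
      subgraph Backbone["Masked diffusion language model (0.7B)"]
        direction TB
        B1["Prompt Tokens<br/>Causal Attention<br/>⬆️ enables KV-cache reuse"]
        B2["Trajectory Token block<br/>Bidirectional Attention<br/>⬆️ enables diffusion denoising"]
        B3["Action Expert FFN<br/>hidden layer slimmed 4096→1024<br/>+ Action Head output layer"]
        B1 --> B2 --> B3
      end
      subgraph Output["Output"]
        OUT["16 discrete trajectory tokens<br/>8 waypoints × 2 coordinates<br/>(longitudinal x + lateral y)"]
      end
      CAM --> Encoder
      Encoder --> Backbone
      NAV --> Backbone
      EGO --> Backbone
      Backbone --> OUT
      style Encoder fill:#e1f5fe
      style Backbone fill:#fff3e0
      style Output fill:#e8f5e9

Key Design Decisions

Hybrid attention pattern

The model mixes two attention mechanisms within a single Transformer: ...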
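The hybrid attention pattern named in the architecture diagram (causal attention over prompt tokens, bidirectional attention inside the trajectory block) can be illustrated with a boolean mask. This is a minimal sketch, not ReflectDrive-2's implementation: the function name `hybrid_attention_mask` is hypothetical, and letting trajectory tokens attend to the entire prompt is an assumption, since the excerpt is truncated before it gives the exact layout.

```python
def hybrid_attention_mask(num_prompt: int, num_traj: int) -> list[list[bool]]:
    """Return mask[i][j], True when query token i may attend to key token j.

    Assumed layout (not confirmed by the excerpt): prompt tokens come first,
    followed by the block of trajectory tokens.
    """
    total = num_prompt + num_traj
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < num_prompt:
                # Prompt rows are causal: each token sees itself and earlier
                # prompt tokens only. These rows never depend on trajectory
                # tokens, which is what lets their keys/values be cached.
                mask[i][j] = j <= i
            else:
                # Trajectory rows are bidirectional: they see the prompt and
                # every trajectory token, including later ones.
                mask[i][j] = True
    return mask


# 3 prompt tokens followed by 2 trajectory tokens.
mask = hybrid_attention_mask(num_prompt=3, num_traj=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because the prompt rows are identical no matter what the trajectory tokens contain, the prompt's keys and values could be computed once and reused across all denoising steps, while only the trajectory block is recomputed.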

X-Cache: Inference-Acceleration Infra for XPeng's Autonomous-Driving World Model

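Introduction: The Infra Bottleneck of World Models

Autonomous driving is going through a paradigm shift: from modular perception-prediction-planning-control to end-to-end / VLA (Vision-Language-Action) systems. In this new paradigm, the world model is evolving from a "flashy video-generation demo" into foundational infrastructure for the driving R&D stack. XPeng's X-World world model has already entered production workflows such as closed-loop simulation, online reinforcement learning, and data generation, where it is used to develop and validate VLA 2.0. One fundamental bottleneck stands in the way: inference is too slow.

A world model works autoregressively: it generates a segment of future frames → the policy model observes them and outputs an action → the world model responds with the next segment. If every link in this loop takes tens of seconds, closed-loop throughput cannot support large-scale training or real-time evaluation.

X-Cache is a training-free inference-acceleration scheme aimed at exactly this bottleneck: it reuses cached results across segments at the DiT (Diffusion Transformer) block level, reaching a 2.6x to 2.7x wall-clock speedup and a ~71% block skip rate while keeping SSIM > 0.9990, that is, with very little loss in visual quality. This article analyzes it in depth along three dimensions: problem motivation, core technical architecture, and engineering design details.

1. Why Conventional Diffusion Caching Does Not Suit World Models

1.1 The Assumption Behind Conventional Diffusion Caching

Existing inference acceleration for video diffusion models mostly caches along the denoising-step axis, that is, it reuses intermediate features between adjacent denoising steps. The core assumption is:

$$\text{latents at adjacent steps } t \text{ and } t-1 \text{ are highly similar} \implies \text{reusable}$$

This works well in standard DDPM / DDIM sampling, because these samplers typically need 50 to 1000 denoising steps, so redundancy between steps is abundant. ...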
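The step-axis assumption described in section 1.1 can be sketched as a wrapper that skips a block whenever its input has barely changed since the last real computation. This is a generic illustration of conventional step-wise caching, not X-Cache itself (which the excerpt says caches across segments); the class name `StepCachedBlock`, the relative-L1 change metric, and the threshold value are all assumptions made for the example.

```python
from typing import Callable, Optional

Vector = list[float]


class StepCachedBlock:
    """Reuse a block's last output when its input changed less than a threshold."""

    def __init__(self, block: Callable[[Vector], Vector], threshold: float):
        self.block = block
        self.threshold = threshold
        self.last_input: Optional[Vector] = None
        self.last_output: Optional[Vector] = None
        self.computed = 0
        self.skipped = 0

    def _relative_change(self, x: Vector) -> float:
        # Relative L1 distance to the input of the last real computation.
        num = sum(abs(a - b) for a, b in zip(x, self.last_input))
        den = sum(abs(b) for b in self.last_input) or 1.0
        return num / den

    def __call__(self, x: Vector) -> Vector:
        if self.last_input is not None and self._relative_change(x) < self.threshold:
            # Adjacent steps look alike: skip the block and reuse the cache.
            self.skipped += 1
            return self.last_output
        # Input drifted too far (or first call): recompute and refresh the cache.
        self.last_input = list(x)
        self.last_output = self.block(x)
        self.computed += 1
        return self.last_output


# Toy "block" that doubles its input; the latent shrinks by 1% per step.
cached = StepCachedBlock(lambda v: [2.0 * a for a in v], threshold=0.05)
latent = [1.0, 1.0]
for _ in range(4):
    cached(latent)
    latent = [a * 0.99 for a in latent]
print(cached.computed, cached.skipped)
```

The change is measured against the last input that was actually computed, not the previous step's input, so small per-step drifts accumulate and eventually force a recomputation.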

Reinforcement Learning for End-to-End Autonomous Driving: From Offline DPO to Iterative Self-Improvement

Introduction

The integration of reinforcement learning into end-to-end autonomous driving systems has emerged as a promising direction for improving trajectory planning beyond what supervised learning alone can achieve. However, the direct application of standard RL algorithms to driving tasks faces fundamental challenges: the sim-to-real gap in log-replay environments, the computational bottleneck of online simulation, and the difficulty of defining dense reward signals for continuous trajectory generation.

This article examines the RL pipeline for end-to-end autonomous driving through the lens of post-training alignment. We begin with the concept of metric caching, which decouples expensive environment evaluation from model training. We then analyze how Direct Preference Optimization (DPO) can be applied across different action representations (discrete tokens, continuous regression, and diffusion models) and discuss the fundamental distinction between offline and online RL in the driving context. Finally, we present three strategies for breaking the sampling ceiling that limits the performance of iterative self-improvement pipelines. ...
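To make the two ideas named above concrete, here is a minimal sketch of how cached metrics might feed a DPO update. The loss is the standard DPO objective, -log σ(β[(log π(y_w) - log π_ref(y_w)) - (log π(y_l) - log π_ref(y_l))]). Everything else is an assumption for illustration: the cache layout, the scenario and trajectory identifiers, the scores, and the helper `build_preference_pair` are hypothetical and not taken from the article.

```python
import math

# Hypothetical metric cache: (scenario_id, trajectory_id) -> closed-loop score.
# It is filled once by an offline evaluator, so training never calls the simulator.
metric_cache = {
    ("scene_001", "traj_a"): 0.92,
    ("scene_001", "traj_b"): 0.35,
    ("scene_001", "traj_c"): 0.61,
}


def build_preference_pair(scenario_id: str, cache: dict) -> tuple[str, str]:
    """Pick the best- and worst-scoring cached trajectories as (chosen, rejected)."""
    scored = [(score, traj) for (scene, traj), score in cache.items() if scene == scenario_id]
    return max(scored)[1], min(scored)[1]


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair, from sequence log-probabilities."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), evaluated in a stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


chosen, rejected = build_preference_pair("scene_001", metric_cache)
print(chosen, rejected)
# Made-up log-probabilities: the policy already prefers the chosen trajectory
# slightly more than the reference model does.
print(dpo_loss(logp_chosen=-1.0, logp_rejected=-3.0,
               ref_logp_chosen=-2.0, ref_logp_rejected=-2.0))
```

When the policy and the reference agree exactly, the margin is zero and the loss equals log 2; it falls as the policy raises the chosen trajectory's likelihood relative to the rejected one.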

Vision-Language-Action Models for Autonomous Driving: The Cosmos-Reason Approach

Introduction

End-to-end autonomous driving has made significant progress in recent years, yet deploying Vision-Language-Action (VLA) models in real-world driving scenarios remains challenging. The fundamental difficulties are fourfold. First, multi-frame temporal understanding requires the model to extract decision-relevant changes from highly redundant consecutive observations, rather than merely processing static snapshots. Second, driving decisions must be causal: the model must capture why a particular action is taken, not just learn statistical correlations between situations and actions. Third, predicted trajectories must satisfy kinematic and dynamic constraints while remaining multi-modal and efficient enough for real-time inference. Fourth, the reasoning process must be tightly aligned with action output: reasoning should not be a post-hoc rationalization but must be verifiable against, and constrained by, the actions actually taken. ...

End-to-End Autonomous Driving: From Modular Decoders to VLA Architectures

Introduction

The trajectory of autonomous driving architecture has undergone a paradigm shift: from the classical modular pipeline (perception → prediction → planning → control) toward end-to-end systems that map sensory inputs directly to driving actions. This transition is not merely an engineering convenience; it reflects a deep recognition that modular interfaces impose information bottlenecks and that joint optimization across the full stack can yield emergent capabilities invisible to individually optimized modules. The evolution can be broadly characterized in three phases: ...

Policy Optimization for End-to-End Autonomous Driving: From REINFORCE to GRPO

1. Why End-to-End Driving Needs Reinforcement Learning

Supervised learning, whether through imitation learning or behavior cloning, can only take an autonomous driving system so far. The fundamental limitation is distributional: the training data is drawn from expert demonstrations, and any distributional shift between training and deployment leads to compounding errors. More critically, supervised objectives are misaligned with the true goal of driving. Minimizing the L2 distance to a ground-truth trajectory penalizes safe deviations as harshly as dangerous ones, and provides no mechanism for the model to discover better trajectories than those in the dataset. ...
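The claim that an L2 objective penalizes safe and dangerous deviations equally can be checked with a small worked example. The scene below is invented for illustration (a straight expert trajectory, one obstacle beside the lane, and two candidates offset by 0.5 m in opposite directions); the helper names are hypothetical.

```python
import math

Point = tuple[float, float]


def l2_distance(traj: list[Point], ref: list[Point]) -> float:
    """Euclidean distance between two trajectories with matched waypoints."""
    return math.sqrt(sum((x1 - x2) ** 2 + (y1 - y2) ** 2
                         for (x1, y1), (x2, y2) in zip(traj, ref)))


def min_clearance(traj: list[Point], obstacle: Point) -> float:
    """Smallest waypoint-to-obstacle distance along the trajectory."""
    ox, oy = obstacle
    return min(math.hypot(x - ox, y - oy) for x, y in traj)


# Assumed scene: expert drives straight along y = 0, obstacle sits at (10, 1.5).
expert = [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0), (15.0, 0.0)]
obstacle = (10.0, 1.5)
safe = [(x, -0.5) for x, _ in expert]   # shifts 0.5 m away from the obstacle
risky = [(x, 0.5) for x, _ in expert]   # shifts 0.5 m toward the obstacle

print(l2_distance(safe, expert), l2_distance(risky, expert))
print(min_clearance(safe, obstacle), min_clearance(risky, obstacle))
```

Both candidates receive the same L2 penalty, yet their clearance to the obstacle differs by a factor of two. A reward that scores clearance directly can tell them apart, which is the kind of signal a supervised distance loss cannot provide.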

Trajectory Tokenization for Autoregressive Planning: Clustering, Matching, and the AR+Diffusion Paradigm

Autoregressive (AR) trajectory generation — predicting driving trajectories as sequences of discrete tokens, much like language models predict text — has emerged as a powerful paradigm for end-to-end autonomous driving. But how do we turn continuous trajectories into discrete tokens? How do we ensure the tokenized representation preserves enough fidelity for planning? And how does the AR paradigm combine with diffusion and reinforcement learning to produce state-of-the-art results? This article walks through the complete pipeline, from tokenization theory to RL post-training. ...
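As a rough illustration of turning a continuous trajectory into discrete tokens, the sketch below matches each step's displacement to the nearest entry of a small codebook. It assumes a hand-written codebook standing in for centroids that a clustering step would normally learn, and a greedy matching rule; neither is taken from the article, whose actual scheme is not shown in this excerpt.

```python
Point = tuple[float, float]

# Hypothetical codebook of per-step displacements (dx, dy) in metres.
# In practice these would be cluster centroids learned from driving logs.
CODEBOOK: list[Point] = [
    (0.0, 0.0),   # 0: stay
    (1.0, 0.0),   # 1: slow forward
    (2.0, 0.0),   # 2: forward
    (2.0, 0.5),   # 3: forward, drift left
    (2.0, -0.5),  # 4: forward, drift right
]


def tokenize(trajectory: list[Point], codebook: list[Point]) -> list[int]:
    """Greedily match each waypoint to the displacement token that lands closest."""
    tokens = []
    pos = (0.0, 0.0)  # ego starts at the origin of its own frame
    for wx, wy in trajectory:
        # Measure the target from the reconstructed position, so quantization
        # error from earlier steps is corrected instead of accumulating.
        dx, dy = wx - pos[0], wy - pos[1]
        token = min(range(len(codebook)),
                    key=lambda k: (codebook[k][0] - dx) ** 2 + (codebook[k][1] - dy) ** 2)
        tokens.append(token)
        pos = (pos[0] + codebook[token][0], pos[1] + codebook[token][1])
    return tokens


def detokenize(tokens: list[int], codebook: list[Point]) -> list[Point]:
    """Rebuild waypoints by accumulating the displacement of each token."""
    pos = (0.0, 0.0)
    waypoints = []
    for token in tokens:
        pos = (pos[0] + codebook[token][0], pos[1] + codebook[token][1])
        waypoints.append(pos)
    return waypoints


trajectory = [(2.0, 0.1), (4.1, 0.4), (6.0, 0.6)]
tokens = tokenize(trajectory, CODEBOOK)
print(tokens)
print(detokenize(tokens, CODEBOOK))
```

The gap between the original and reconstructed waypoints is the fidelity cost of tokenization; a larger or better-fitted codebook shrinks it at the price of a bigger vocabulary.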

Why Generative Planning? The Non-Convexity Argument Against Regression in Autonomous Driving

The trajectory planner is the decision-making core of an autonomous driving system. Its task: given the current scene, output a future trajectory that is safe, comfortable, and efficient. Most production systems today use some form of regression — minimizing the distance between predicted and ground-truth trajectories. Yet a growing body of research and engineering evidence suggests this approach has a fundamental flaw: it assumes the feasible set is convex when it is emphatically not. This article lays out the first-principles argument for why generative approaches (diffusion, autoregressive) are not merely improvements but necessary paradigm shifts. ...
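The non-convexity argument can be made concrete with a toy scene. An L2 regressor trained on two equally likely demonstrations for the same input converges toward their pointwise mean, so the sketch below checks whether that mean stays feasible. The obstacle, the two trajectories, and the helper names are invented for illustration.

```python
import math

Point = tuple[float, float]

# Assumed scene: a circular obstacle of radius 1 m centred in the lane.
OBSTACLE_CENTER = (10.0, 0.0)
OBSTACLE_RADIUS = 1.0


def is_feasible(traj: list[Point]) -> bool:
    """A trajectory is feasible if every waypoint stays outside the obstacle."""
    ox, oy = OBSTACLE_CENTER
    return all(math.hypot(x - ox, y - oy) > OBSTACLE_RADIUS for x, y in traj)


def pointwise_mean(a: list[Point], b: list[Point]) -> list[Point]:
    """Average two trajectories waypoint by waypoint."""
    return [((x1 + x2) / 2, (y1 + y2) / 2) for (x1, y1), (x2, y2) in zip(a, b)]


# Two valid demonstrations: pass the obstacle on the left or on the right.
pass_left = [(0.0, 0.0), (5.0, 1.5), (10.0, 2.0), (15.0, 1.5), (20.0, 0.0)]
pass_right = [(x, -y) for x, y in pass_left]

regressed = pointwise_mean(pass_left, pass_right)
print(is_feasible(pass_left), is_feasible(pass_right), is_feasible(regressed))
```

Both demonstrations are feasible, but their average drives straight through the obstacle: the feasible set is not closed under averaging, which is exactly what convexity would require. A generative planner that samples one mode at a time avoids producing this in-between trajectory.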