ReflectDrive-2: Li Auto's Discrete Diffusion End-to-End Driving with Joint RL Optimization

Introduction: Discrete Diffusion + End-to-End Driving = A New Paradigm?

In 2025-2026, the battle over end-to-end autonomous driving architectures has intensified. The mainstream splits into two camps:

| Approach | Representatives | Core Idea | Pain Point |
|---|---|---|---|
| Autoregressive (AR) | GPT-Driver, VLA series | Sequential token-by-token trajectory output | Serial decoding is slow; only small models fit on-device |
| Continuous Diffusion | UniAD, DriveWM, PlanningDiffuser | Denoising trajectory generation in continuous space | Anchors/goals require an extra system and break the data distribution |

Li Auto's ReflectDrive-2 (CVPR 2026) takes a third path: a discrete diffusion model for end-to-end autonomous driving. At first glance it looks like a mere upgrade of ReflectDrive, but closer study reveals something else: this may be a fresh rethinking of what a production-grade end-to-end system should look like. This article gives a complete technical analysis along four dimensions: modeling choices, inference architecture, training strategy, and deployment engineering.

1. Why Discrete Diffusion? Starting from First Principles

The essential comparison of the three routes:

```mermaid
flowchart LR
    subgraph AR["Autoregressive (AR)"]
        direction TB
        AR1["t₁: emit token 1"]
        AR2["t₂: emit token 2 ⬅️ depends on t₁"]
        AR3["t₃: emit token 3 ⬅️ depends on t₁, t₂"]
        AR1 --> AR2 --> AR3
        AR_style["❌ serial bottleneck ❌ small on-device models only ✅ well-explored"]
    end
    subgraph ContDiff["Continuous Diffusion"]
        direction TB
        CD1["continuous noise injection"]
        CD2["continuous-space denoising, N steps"]
        CD3["output continuous trajectory coordinates"]
        CD1 --> CD2 --> CD3
        CD_style["⚠️ extra anchor system required ⚠️ breaks data distribution ✅ parallel generation"]
    end
    subgraph DiscDiff["Discrete Diffusion (this work)"]
        direction TB
        DD1["token-level mask injection"]
        DD2["bidirectional parallel denoising, N steps"]
        DD3["output discrete token sequence"]
        DD1 --> DD2 --> DD3
        DD_style["✅ fully parallel decoding ✅ unified vocabulary eases pretraining ✅ token-to-token supports AutoEdit ✅ clean RL exploration space"]
    end
    AR --- ContDiff --- DiscDiff
```

Five advantages of discrete diffusion:

| # | Advantage | vs. AR | vs. Continuous Diffusion |
|---|---|---|---|
| 1 | Unified vocabulary | same | All inputs (vision / state / navigation) discretize into one token space → natural information exchange, supports pretraining tasks |
| 2 | Efficient sampling | ❌ serial, O(n) | ✅ parallel decoding, O(1) per step |
| 3 | Native AutoEdit support | ❌ unsupported | ✅ direct token-to-token rewriting |
| 4 | RL-friendly | hard (sequence credit assignment) | ✅ discrete action space, clean exploration |
| 5 | End-to-end scaling | limited by serial decoding | ✅ separate Action Expert FFN, parameter-efficient |

2. Model Architecture: A Compact 0.8B-Parameter Design

Overall architecture:

```mermaid
flowchart LR
    subgraph Input["Multimodal input"]
        CAM["Three surround cameras (front-left / front / front-right), 2 time frames each"]
        NAV["Navigation instruction tokens (text-encoded)"]
        EGO["Ego-state tokens (speed / heading, etc.)"]
    end
    subgraph Encoder["Vision encoder: ViT (0.1B)"]
        direction TB
        V1["Patch Embedding"]
        V2["Transformer Blocks"]
        V1 --> V2
    end
    subgraph Backbone["Masked diffusion language model (0.7B)"]
        direction TB
        B1["Prompt tokens: causal attention ⬆️ enables KV-cache reuse"]
        B2["Trajectory token block: bidirectional attention ⬆️ enables diffusion denoising"]
        B3["Action Expert FFN: hidden 4096→1024 slimmed, plus Action Head output layer"]
        B1 --> B2 --> B3
    end
    subgraph Output["Output"]
        OUT["16 discrete trajectory tokens = 8 waypoints × 2 coordinates (longitudinal x + lateral y)"]
    end
    CAM --> Encoder
    Encoder --> Backbone
    NAV --> Backbone
    EGO --> Backbone
    Backbone --> OUT
    style Encoder fill:#e1f5fe
    style Backbone fill:#fff3e0
    style Output fill:#e8f5e9
```

Key design decisions: mixed attention patterns. The model mixes two attention mechanisms within a single Transformer: ...
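The bidirectional parallel denoising that the architecture enables can be illustrated with a minimal sketch. Everything here is an assumption for illustration: `dummy_logits` stands in for the real 0.7B backbone, and `VOCAB`, `SEQ_LEN`, and the confidence-based unmasking schedule are hypothetical, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, SEQ_LEN, MASK = 256, 16, -1  # hypothetical: 16 trajectory tokens, 256-entry codebook

def dummy_logits(tokens):
    """Stand-in for the masked-diffusion backbone: per-position logits, all in parallel."""
    return rng.normal(size=(SEQ_LEN, VOCAB))

def denoise(num_steps=4):
    tokens = np.full(SEQ_LEN, MASK)  # start with every trajectory token masked
    for step in range(num_steps):
        logits = dummy_logits(tokens)            # one parallel forward pass per step
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        conf = probs.max(-1)                     # model confidence per position
        pred = probs.argmax(-1)
        masked = np.where(tokens == MASK)[0]
        k = int(np.ceil(len(masked) / (num_steps - step)))   # unmask quota this step
        keep = masked[np.argsort(-conf[masked])[:k]]         # most confident positions first
        tokens[keep] = pred[keep]
    return tokens

trajectory_tokens = denoise()
```

The point of the sketch is the cost model: each denoising step is one full-width forward pass over all 16 positions, so decoding takes N passes total regardless of sequence length, versus 16 strictly serial passes for an AR decoder.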

May 8, 2026 · 8 min read · LexHsu

Reinforcement Learning for End-to-End Autonomous Driving: From Offline DPO to Iterative Self-Improvement

Introduction The integration of reinforcement learning into end-to-end autonomous driving systems has emerged as a promising direction for improving trajectory planning beyond what supervised learning alone can achieve. However, the direct application of standard RL algorithms to driving tasks faces fundamental challenges: the sim-to-real gap in log-replay environments, the computational bottleneck of online simulation, and the difficulty of defining dense reward signals for continuous trajectory generation.

This article examines the RL pipeline for end-to-end autonomous driving through the lens of post-training alignment. We begin with the concept of metric caching, which decouples expensive environment evaluation from model training. We then analyze how Direct Preference Optimization (DPO) can be applied across different action representations—discrete tokens, continuous regression, and diffusion models—and discuss the fundamental distinction between offline and online RL in the driving context.

Finally, we present three strategies for breaking the sampling ceiling that limits the performance of iterative self-improvement pipelines. ...
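The core DPO update described above can be sketched in a few lines. This is an illustration only: the log-probability values, the `beta` default, and the idea of sourcing preference labels from a cached driving metric are assumptions for the example, not specifics from the article.

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO on one chosen/rejected trajectory pair.

    logp_* are summed token log-probs under the policy; ref_logp_* under
    the frozen reference model. The loss pushes the policy to widen the
    chosen-vs-rejected margin relative to the reference.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log sigmoid(margin)

# Preference labels can come from precomputed (cached) environment metrics:
# the trajectory with the better cached score is "chosen", so no simulator
# runs inside the training loop.
loss = dpo_loss(logp_w=-12.3, logp_l=-11.8, ref_logp_w=-12.5, ref_logp_l=-11.9)
```

Because the loss needs only log-probabilities and a fixed preference label, it is fully offline: the expensive evaluation happens once, when the metrics are cached.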

January 20, 2026 · 12 min read · LexHsu

Policy Optimization for End-to-End Autonomous Driving: From REINFORCE to GRPO

1. Why End-to-End Driving Needs Reinforcement Learning Supervised learning—whether through imitation learning or behavior cloning—can only take an autonomous driving system so far. The fundamental limitation is distributional: the training data is drawn from expert demonstrations, and any distributional shift between training and deployment leads to compounding errors. More critically, supervised objectives are misaligned with the true goal of driving. Minimizing the L2 distance to a ground-truth trajectory penalizes safe deviations as harshly as dangerous ones, and provides no mechanism for the model to discover better trajectories than those in the dataset. ...
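One way to see how RL sidesteps the misaligned L2 objective is the group-relative advantage used in GRPO-style policy optimization: instead of imitating a single ground-truth trajectory, several sampled trajectories are scored by a driving metric and each is weighted by how much better or worse it is than its peers. A minimal sketch, with illustrative reward values and a hypothetical epsilon for numerical stability:

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled trajectory's
    reward against the group mean and std, so no learned critic is needed."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# e.g. 4 trajectories sampled for one scene, scored by a driving metric
adv = grpo_advantages([0.9, 0.4, 0.7, 0.2])
```

The policy-gradient update then weights each trajectory's log-probability by its advantage, so a safe deviation that scores well is reinforced rather than penalized for straying from the demonstration.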

April 30, 2025 · 14 min read · LexHsu