RL on Xu'Blog

https://xuquant.com/tags/rl/
Recent content in RL on Xu'Blog
Last updated: Thu, 28 May 2026 22:30:00 +0800

Qwen-VLA Explained: T2A Decompression Prior, Flow-Matching PPO, Cross-Embodiment Zero-Shot
https://xuquant.com/posts/foundation-models/qwen-vla/
Thu, 28 May 2026 22:30:00 +0800
Qwen-VLA (arXiv:2605.30280), released by the Qwen Team on 2026-05-28, joins a Qwen3.5-4B multimodal backbone and a 1.15B single-stream DiT flow-matching action expert into a unified embodied policy. The most interesting part is not the numbers but T2A: freeze the VLM, mask out the images, and learn an action prior from text and the embodiment prompt alone, then separately add images, specialize, and apply RL. This post follows the paper through the architecture, the four-stage recipe, the five-dimension T2A ablation, the log-prob trick in flow-matching PPO, what the 26.6% zero-shot result on DOMINO actually means, and a few remaining reservations.

ReflectDrive-2: Li Auto's Discrete-Diffusion End-to-End Driving with Joint RL Optimization
https://xuquant.com/posts/autonomous-driving/reflectdrive-2-discrete-diffusion-end-to-end-driving/
Sat, 25 Apr 2026 18:00:00 +0800
An in-depth look at Li Auto's ReflectDrive-2: discrete diffusion for end-to-end planning, a three-stage "decide, draft, reflect" pipeline with AutoEdit local correction, joint RL optimization that amplifies the AutoEdit gain 6x, 91.0 PDMS from camera-only input (NAVSIM v1 navtest), and 31.8 ms per frame on Thor.

CORAL: Autonomous Multi-Agent Evolution for Open-Ended Discovery
https://xuquant.com/posts/foundation-models/coral-autonomous-multi-agent-evolution/
Sat, 22 Nov 2025 10:00:00 +0800
How delegating the key decisions of evolutionary search to autonomous agents, rather than to fixed heuristics, yields faster convergence and stronger results on mathematical optimization and systems optimization tasks.

Reinforcement Learning for End-to-End Autonomous Driving: From Offline DPO to Iterative Self-Improvement
https://xuquant.com/posts/autonomous-driving/basic_rl/
Sat, 20 Sep 2025 10:00:00 +0800
A comprehensive analysis of applying reinforcement learning to end-to-end autonomous driving systems, covering the metric caching mechanism, DPO under different action representations, and strategies for breaking through the sampling ceiling of iterative self-improvement pipelines.

Alpamayo: A Reasoning-Action Aligned VLA System for Autonomous Driving
https://xuquant.com/posts/autonomous-driving/nvidia_vla/
Sat, 30 Aug 2025 10:00:00 +0800
A technical deep dive into Nvidia's Alpamayo VLA autonomous driving system, which uses Cosmos-Reason as its VLM backbone. Covers triplane visual encoding, avoidance of the ego-vehicle shortcut, the "change-cause" (变化因) dataset paradigm, and reasoning-action alignment achieved through reinforcement learning.

Policy Optimization for End-to-End Autonomous Driving: From REINFORCE to GRPO
https://xuquant.com/posts/autonomous-driving/rl-policy-optimization-e2e-driving/
Sat, 09 Aug 2025 10:00:00 +0800
A systematic derivation of policy optimization methods for end-to-end autonomous driving, from REINFORCE to PPO to GRPO, covering advantage estimation, the differences between LLM sampling and driving sampling, multi-objective loss design, and the role of noise in diffusion-model exploration.

Trajectory Tokenization for Autoregressive Planning: Clustering, Matching, and the AR+Diffusion Paradigm
https://xuquant.com/posts/autonomous-driving/ar-trajectory-tokenization/
Sat, 28 Jun 2025 10:00:00 +0800
An in-depth look at trajectory tokenization for autoregressive driving planners: from state discretization based on k-means clustering, to token matching and reconstruction, to the AR+Diffusion paradigm and GRPO-based reinforcement learning post-training.