<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>RL on Xu'Blog</title><link>https://xuquant.com/tags/rl/</link><description>Recent content in RL on Xu'Blog</description><image><title>Xu'Blog</title><url>https://xuquant.com/og-default.png</url><link>https://xuquant.com/og-default.png</link></image><generator>Hugo -- 0.152.2</generator><language>zh</language><lastBuildDate>Thu, 28 May 2026 22:30:00 +0800</lastBuildDate><atom:link href="https://xuquant.com/tags/rl/index.xml" rel="self" type="application/rss+xml"/><item><title>Qwen-VLA Explained: T2A Decompression Prior, Flow-Matching PPO, and Cross-Embodiment Zero-Shot</title><link>https://xuquant.com/posts/foundation-models/qwen-vla/</link><pubDate>Thu, 28 May 2026 22:30:00 +0800</pubDate><guid>https://xuquant.com/posts/foundation-models/qwen-vla/</guid><description>Qwen-VLA (arXiv:2605.30280), released by the Qwen Team on 2026-05-28, combines a Qwen3.5-4B multimodal backbone with a 1.15B single-stream DiT flow-matching action expert into a unified embodied policy. The most interesting part is not the numbers but T2A: freeze the VLM, mask out the images, and learn the action prior from text and the embodiment prompt alone, then add images, specialize, and apply RL in separate later stages. This post follows the paper through the architecture, the four-stage recipe, the five-dimension T2A ablation, the log-prob trick in flow-matching PPO, what the 26.6% zero-shot result on DOMINO actually means, and a few remaining doubts.</description></item><item><title>ReflectDrive-2: Li Auto's Discrete-Diffusion End-to-End Driving with Joint RL Optimization</title><link>https://xuquant.com/posts/autonomous-driving/reflectdrive-2-discrete-diffusion-end-to-end-driving/</link><pubDate>Sat, 25 Apr 2026 18:00:00 +0800</pubDate><guid>https://xuquant.com/posts/autonomous-driving/reflectdrive-2-discrete-diffusion-end-to-end-driving/</guid><description>An in-depth look at Li Auto's ReflectDrive-2: discrete diffusion for end-to-end planning, a three-stage decide-draft-reflect pipeline paired with AutoEdit local corrections, joint RL optimization that amplifies the AutoEdit gain 6x, 91.0 PDMS with camera-only input (NAVSIM v1 navtest), and 31.8 ms per frame on Thor.</description></item>
<item><title>CORAL: Autonomous Multi-Agent Evolution for Open-Ended Discovery</title><link>https://xuquant.com/posts/foundation-models/coral-autonomous-multi-agent-evolution/</link><pubDate>Sat, 22 Nov 2025 10:00:00 +0800</pubDate><guid>https://xuquant.com/posts/foundation-models/coral-autonomous-multi-agent-evolution/</guid><description>How delegating the key decisions of evolutionary search to autonomous agents, rather than fixed heuristics, yields faster convergence and stronger results on mathematical and systems optimization tasks.</description></item><item><title>Reinforcement Learning for End-to-End Autonomous Driving: From Offline DPO to Iterative Self-Improvement</title><link>https://xuquant.com/posts/autonomous-driving/basic_rl/</link><pubDate>Sat, 20 Sep 2025 10:00:00 +0800</pubDate><guid>https://xuquant.com/posts/autonomous-driving/basic_rl/</guid><description>A comprehensive analysis of applying reinforcement learning to end-to-end autonomous driving systems, covering the metric caching mechanism, DPO under different action representations, and strategies for breaking through the sampling ceiling of iterative self-improvement pipelines.</description></item><item><title>Alpamayo: A Reasoning-Action Aligned VLA System for Autonomous Driving</title><link>https://xuquant.com/posts/autonomous-driving/nvidia_vla/</link><pubDate>Sat, 30 Aug 2025 10:00:00 +0800</pubDate><guid>https://xuquant.com/posts/autonomous-driving/nvidia_vla/</guid><description>A technical deep dive into the Nvidia Alpamayo VLA autonomous driving system, which uses Cosmos-Reason as its VLM backbone, covering triplane visual encoding, ego shortcut avoidance, the cause-of-change dataset paradigm, and reasoning-action alignment achieved through reinforcement learning.</description></item><item><title>Policy Optimization for End-to-End Autonomous Driving: From REINFORCE to GRPO</title><link>https://xuquant.com/posts/autonomous-driving/rl-policy-optimization-e2e-driving/</link><pubDate>Sat, 09 Aug 2025 10:00:00 +0800</pubDate><guid>https://xuquant.com/posts/autonomous-driving/rl-policy-optimization-e2e-driving/</guid><description>A systematic derivation of policy optimization methods for end-to-end autonomous driving, from REINFORCE to PPO to GRPO, covering advantage estimation, how sampling differs between LLMs and driving, multi-objective loss design, and the role of noise in diffusion-model exploration.</description></item><item><title>Trajectory Tokenization for Autoregressive Planning: Clustering, Matching, and the AR+Diffusion Paradigm</title><link>https://xuquant.com/posts/autonomous-driving/ar-trajectory-tokenization/</link><pubDate>Sat, 28 Jun 2025 10:00:00 +0800</pubDate><guid>https://xuquant.com/posts/autonomous-driving/ar-trajectory-tokenization/</guid><description>An in-depth look at trajectory tokenization for autoregressive driving planners: from state discretization via k-means clustering, to token matching and reconstruction, to the AR+Diffusion paradigm and GRPO-based reinforcement learning post-training.</description></item></channel></rss>