Trajectory Tokenization for Autoregressive Planning: Clustering, Matching, and the AR+Diffusion Paradigm

Figure: End-to-end autonomous driving paradigm comparison. Source: DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving.

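Autoregressive (AR) trajectory generation, which predicts a driving trajectory as a sequence of discrete tokens much as a language model predicts text, has become a strong paradigm for end-to-end autonomous driving. But how do we turn continuous trajectories into discrete tokens? How do we make sure the tokenized representation keeps enough planning fidelity? And how does the AR paradigm combine with diffusion models and reinforcement learning to reach state-of-the-art results? This post walks through the whole pipeline, from tokenization theory to RL post-training.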

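1. Background: Regression vs. Classification in AR Planning

Autoregressive trajectory generation comes in two basic flavors:

Regression-based AR outputs continuous coordinates at every step. In theory, multimodal regression (for example, with a GMM) can capture diverse behaviors. In practice, the true distribution is unknown and fitting a sufficient number of modes is very hard.

Classification-based AR discretizes the continuous action space and predicts token indices with a cross-entropy loss. This directly models the conditional distribution $p(a_t \mid a_{<t}, x)$, which makes multimodality a first-class citizen.

1.1 Discretization: State Quantities vs. High-Order Motion Quantities

For classification-based AR, the choice of what to discretize is critical: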

Quantity	Pros	Cons
State $(x, y, h)$	Directly available from data; no inverse kinematics	Requires clustering to build vocabulary
High-order $(\text{acc}, \text{yaw\_rate})$	More compact; control-oriented	GT values hard to obtain; a single kinematic model is unreasonable for VRUs (vulnerable road users); incompatible with low-frequency prediction

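The high-order approach has three concrete problems:

Ground truth is hard to obtain: the acceleration and yaw rate of obstacles are difficult to measure accurately.
Unified kinematic model: applying the same model to vehicles, cyclists, and pedestrians is unreasonable.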
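Frequency limitation: with frames 0.5 s apart (2 Hz), assuming that acceleration and yaw rate stay constant over a full second cannot capture the actual motion trend.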

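VAE-based discretization avoids explicit quantization by operating in a latent space, but it suffers from unstable training and mode collapse.

Our choice: clustering-based discretization of state quantities. A trajectory is a state quantity that can be read directly from data, which removes the error of inverse kinematics. Clustering compresses a large volume of historical data into a finite, representative Trajectory Vocabulary.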

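2. Mdriver AR Pipeline

2.1 Task Definition

We assume that any trajectory of length $T$ can be composed of trajectory segments (tokens). For an 8-second trajectory at 2 Hz (16 coordinate points), we can define:

16 single-point tokens (each covering 0.5 seconds)
8 two-point tokens (each covering 1 second)
4 four-point tokens (each covering 2 seconds)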

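With $n$ points per token, the joint distribution factorizes over the $T/n$ tokens:

$$p(S_{1:T} \mid \text{Env}) = \prod_{k=1}^{T/n} p(S_{(k-1)n+1:kn} \mid S_{1:(k-1)n}, \text{Env})$$

where $S_{i:j}$ denotes the sequence of states $(x, y, h, v)$ from time step $i$ to time step $j$, and the conditioning history is empty for $k = 1$.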
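The token counts listed above can be checked with a small sketch. The helper name `splitIntoTokens` and the toy constant-speed trajectory are hypothetical and only serve to illustrate how a 16-point trajectory breaks into fixed-length segments.

```javascript
// Split a trajectory (array of states) into consecutive tokens of
// `pointsPerToken` points each. Hypothetical helper, for illustration only.
function splitIntoTokens(trajectory, pointsPerToken) {
  if (trajectory.length % pointsPerToken !== 0) {
    throw new Error("trajectory length must be a multiple of pointsPerToken");
  }
  const tokens = [];
  for (let i = 0; i < trajectory.length; i += pointsPerToken) {
    tokens.push(trajectory.slice(i, i + pointsPerToken));
  }
  return tokens;
}

// Toy 8-second trajectory at 2 Hz: 16 future points, state = [x, y, h, v],
// driving straight at 5 m/s (made-up values).
const HZ = 2;
const trajectory = Array.from({ length: 16 }, (_, i) => [5 * (i + 1) / HZ, 0, 0, 5]);

for (const n of [1, 2, 4]) {
  const tokens = splitIntoTokens(trajectory, n);
  console.log(`${n}-point tokens: ${tokens.length}, each covering ${n / HZ} s`);
}
```

Longer tokens mean fewer autoregressive steps per trajectory, at the price of a larger space of possible segments that the vocabulary has to cover.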

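2.2 Tokenizer: Clustering, Matching, Reconstruction

Tokenization is the core of the AR model and has three stages:

Stage 1: Clustering

Given $N$ training trajectories, each with 17 frames (1 current + 16 future) and a per-frame state $[x, y, h, v]$:

Apply k-means clustering separately per category (ego, vehicle, cyclist, pedestrian).
The token vocabulary has shape $[m, 3, 4]$: $m$ cluster centers, 3 points per segment (current + 2 future frames), and 4 state dimensions.
Current vocabulary size: $m = 6000$ for every category (uniform across categories; an ablation on category-specific sizes is still pending).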
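As a rough sketch of the clustering step, the code below runs plain Lloyd's k-means over flattened 3-point segments. It is not the pipeline's implementation: the evenly spaced initialization, the iteration count, and the toy straight-line segments are all assumptions, and a real run would be repeated once per category with $m = 6000$.

```javascript
// Minimal Lloyd's k-means over flattened trajectory segments.
// Each segment is 3 points x 4 state dims, flattened to a 12-vector.
function squaredDistance(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += (a[i] - b[i]) ** 2;
  return sum;
}

function nearestIndex(point, centers) {
  let best = 0;
  for (let k = 1; k < centers.length; k++) {
    if (squaredDistance(point, centers[k]) < squaredDistance(point, centers[best])) best = k;
  }
  return best;
}

function kMeans(points, m, iterations = 20) {
  // Assumption: initialize with m evenly spaced samples; production code would
  // more likely use k-means++ or a library implementation.
  const stride = Math.floor(points.length / m);
  let centers = Array.from({ length: m }, (_, k) => points[k * stride].slice());
  for (let it = 0; it < iterations; it++) {
    const sums = centers.map(c => new Array(c.length).fill(0));
    const counts = new Array(m).fill(0);
    for (const p of points) {
      const k = nearestIndex(p, centers);
      counts[k]++;
      for (let d = 0; d < p.length; d++) sums[k][d] += p[d];
    }
    // Keep the old center when a cluster receives no points.
    centers = centers.map((c, k) => (counts[k] === 0 ? c : sums[k].map(s => s / counts[k])));
  }
  return centers;
}

// Toy data: straight segments at speed v in the segment's local frame, 0.5 s steps.
function straightSegment(v) {
  return [0, 1, 2].flatMap(i => [v * 0.5 * i, 0, 0, v]);
}
const segments = [0, 0.2, 0.4, 9.6, 9.8, 10].map(straightSegment);
const vocabulary = kMeans(segments, 2); // m = 2 centers, each a flattened [3, 4] segment
console.log(vocabulary.map(c => c[3].toFixed(2))); // speed component of each center
```

With this toy input the two centers settle on a "nearly stopped" segment and a "cruising" segment, which is the kind of compression the trajectory vocabulary is meant to provide.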

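Token refinement (optional but important):

Heading correction: make the heading consistent with the direction of motion.
Velocity correction: use finite-difference velocity instead of the raw value.

These corrections reduce the noise introduced by imperfect perception data and yield cleaner cluster centers.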
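The two corrections can be sketched as a single pass over a segment. This is a minimal sketch under stated assumptions: the function name `refineSegment`, the forward/backward differencing scheme, and the `minSpeed` cutoff below which the raw heading is kept are all hypothetical choices, not details from the pipeline.

```javascript
// Recompute heading and speed from positions by finite differences.
// points: array of [x, y, h, v]; dt: seconds between frames (0.5 s at 2 Hz).
function refineSegment(points, dt, minSpeed = 0.1) {
  return points.map((p, i) => {
    // Forward difference, except at the last point where a backward difference is used.
    const a = i < points.length - 1 ? points[i] : points[i - 1];
    const b = i < points.length - 1 ? points[i + 1] : points[i];
    const dx = b[0] - a[0];
    const dy = b[1] - a[1];
    const speed = Math.hypot(dx, dy) / dt;
    // Assumption: below minSpeed the motion direction is unreliable, so keep the raw heading.
    const heading = speed >= minSpeed ? Math.atan2(dy, dx) : p[2];
    return [p[0], p[1], heading, speed];
  });
}

// A segment that moves 2 m per 0.5 s along +x, with noisy raw heading and speed.
const noisy = [
  [0, 0, 0.3, 4.0],
  [2, 0, -0.2, 6.0],
  [4, 0, 0.1, 3.0],
];
console.log(refineSegment(noisy, 0.5));
```

After refinement every point reports a heading of 0 rad and a speed of 4 m/s, which is what the positions alone imply.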

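Stage 2: Matching

Matching assigns each ground-truth (GT) trajectory segment to its nearest token. This step is critical and involves a subtle design decision:

Exposure Bias: Teacher Forcing vs. AR Rollout

Drag the σ slider to set the per-step prediction noise, then click "Step" to unroll the rollout one step at a time or "Auto-play" to advance continuously. Red: teacher forcing, where every step starts from GT and errors do not accumulate. Orange: autoregressive rollout, where every step starts from the previous prediction, so the accumulated error grows with σ.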

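// Assumes `container` is the host DOM element supplied by the embedding page and that d3 is loaded.
const W = container.clientWidth;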
const H = container.clientHeight;
const margin = {top: 30, right: 30, bottom: 50, left: 50};
const iW = W - margin.left - margin.right;
const iH = H - margin.top - margin.bottom - 60;

const svg = d3.select(container).append("svg")
  .attr("width", W).attr("height", H);
svg.append("rect").attr("width", W).attr("height", H).attr("fill", "#1a1a2e");

const g = svg.append("g")
  .attr("transform", `translate(${margin.left},${margin.top})`);

// Ground truth trajectory: smooth S-curve over t ∈ [0, 10]
const N_STEPS = 12;
function gtY(t) {
  return 2 * Math.sin(t * 0.55) + 0.15 * t;
}

const xScale = d3.scaleLinear().domain([0, 10]).range([0, iW]);
const yScale = d3.scaleLinear().domain([-3.5, 4.5]).range([iH, 0]);

// Axes
const xAxis = g.append("g").attr("transform", `translate(0,${iH})`)
  .call(d3.axisBottom(xScale).ticks(10));
xAxis.selectAll("text").attr("fill", "#f1f5f9");
xAxis.selectAll("line").attr("stroke", "#475569");
xAxis.selectAll("path").attr("stroke", "#475569");

const yAxis = g.append("g").call(d3.axisLeft(yScale).ticks(6));
yAxis.selectAll("text").attr("fill", "#f1f5f9");
yAxis.selectAll("line").attr("stroke", "#475569");
yAxis.selectAll("path").attr("stroke", "#475569");

g.append("text").attr("x", iW / 2).attr("y", iH + 36)
  .attr("text-anchor", "middle").attr("fill", "#f1f5f9").attr("font-size", "11px")
  .text("Time step t");
g.append("text").attr("transform", "rotate(-90)")
  .attr("x", -iH / 2).attr("y", -36)
  .attr("text-anchor", "middle").attr("fill", "#f1f5f9").attr("font-size", "11px")
  .text("Trajectory position y(t)");

// GT curve
const gtPoints = [];
for (let i = 0; i <= N_STEPS; i++) {
  const t = i * 10 / N_STEPS;
  gtPoints.push({t, y: gtY(t)});
}

const lineGen = d3.line().x(d => xScale(d.t)).y(d => yScale(d.y)).curve(d3.curveMonotoneX);
g.append("path").datum(gtPoints).attr("fill", "none")
  .attr("stroke", "#10b981").attr("stroke-width", 2.5).attr("d", lineGen);

// GT step markers
g.selectAll(".gt-pt").data(gtPoints).enter().append("circle")
  .attr("class", "gt-pt").attr("cx", d => xScale(d.t)).attr("cy", d => yScale(d.y))
  .attr("r", 3.5).attr("fill", "#10b981");

// State for animation
let sigma = 0.3;
let stepIdx = 0;
let arPath = [{t: 0, y: gtY(0)}]; // AR rollout starts at GT[0]
let tfPath = [{t: 0, y: gtY(0)}]; // Teacher forcing starts at GT[0]

// "Model" prediction: GT slope + perturbation proportional to drift from GT
// In AR rollout: from the current AR point, predict the next one using the GT increment, a bias of 0.4 * drift * σ, and fresh noise
function predictFromPoint(cur_t, cur_y) {
  const next_t = cur_t + 10 / N_STEPS;
  const trueNextY = gtY(next_t);
  const trueSlope = (trueNextY - gtY(cur_t));
  // Local slope from cur_y (model "sees" cur_y as condition); when cur_y deviates from GT, prediction inherits + amplifies
  const drift = cur_y - gtY(cur_t);
  const noise = sigma * (Math.random() - 0.5) * 2;
  // AR rollout: error compounds — drift propagates forward, plus fresh noise
  const predY = cur_y + trueSlope + 0.4 * drift * sigma + noise;
  return {t: next_t, y: predY};
}

// Teacher forcing: always re-anchor at GT, single-step prediction error
function predictTF(cur_t) {
  const next_t = cur_t + 10 / N_STEPS;
  const trueNextY = gtY(next_t);
  const noise = sigma * (Math.random() - 0.5) * 2;
  return {t: next_t, y: trueNextY + noise};
}

const arPathSel = g.append("path").attr("fill", "none")
  .attr("stroke", "#f97316").attr("stroke-width", 2).attr("stroke-dasharray", "5,3");
const tfPathSel = g.append("path").attr("fill", "none")
  .attr("stroke", "#ef4444").attr("stroke-width", 2).attr("stroke-dasharray", "5,3");

const arPtsG = g.append("g");
const tfPtsG = g.append("g");

function render() {
  arPathSel.attr("d", lineGen(arPath));
  tfPathSel.attr("d", lineGen(tfPath));
  const arSel = arPtsG.selectAll("circle").data(arPath);
  arSel.enter().append("circle").attr("r", 4).attr("fill", "#f97316")
    .attr("stroke", "#7c2d12").attr("stroke-width", 0.8).merge(arSel)
    .attr("cx", d => xScale(d.t)).attr("cy", d => yScale(d.y));
  arSel.exit().remove();
  const tfSel = tfPtsG.selectAll("circle").data(tfPath);
  tfSel.enter().append("circle").attr("r", 4).attr("fill", "#ef4444")
    .attr("stroke", "#7f1d1d").attr("stroke-width", 0.8).merge(tfSel)
    .attr("cx", d => xScale(d.t)).attr("cy", d => yScale(d.y));
  tfSel.exit().remove();

  stepLabel.text(`Step: ${stepIdx} / ${N_STEPS}`);

  // Running mean absolute error against GT over the steps taken so far
  if (stepIdx > 0) {
    let arErr = 0, tfErr = 0;
    for (let i = 1; i <= stepIdx; i++) {
      arErr += Math.abs(arPath[i].y - gtY(arPath[i].t));
      tfErr += Math.abs(tfPath[i].y - gtY(tfPath[i].t));
    }
    arErr /= stepIdx;
    tfErr /= stepIdx;
    errLabel.text(`Running ADE → AR: ${arErr.toFixed(2)}, TF: ${tfErr.toFixed(2)}`);
  } else {
    errLabel.text("Running ADE → AR: 0.00, TF: 0.00");
  }
}

function stepOnce() {
  if (stepIdx >= N_STEPS) return;
  const arCur = arPath[arPath.length - 1];
  arPath.push(predictFromPoint(arCur.t, arCur.y));
  tfPath.push(predictTF(arCur.t));
  stepIdx++;
  render();
}

function reset() {
  stepIdx = 0;
  arPath = [{t: 0, y: gtY(0)}];
  tfPath = [{t: 0, y: gtY(0)}];
  render();
}

// Legend
const legend = g.append("g").attr("transform", `translate(${iW - 220}, 8)`);
legend.append("line").attr("x1", 0).attr("y1", 6).attr("x2", 20).attr("y2", 6)
  .attr("stroke", "#10b981").attr("stroke-width", 2.5);
legend.append("text").attr("x", 26).attr("y", 10).attr("fill", "#e5e7eb").attr("font-size", "11px")
  .text("Ground Truth");
legend.append("line").attr("x1", 0).attr("y1", 26).attr("x2", 20).attr("y2", 26)
  .attr("stroke", "#ef4444").attr("stroke-width", 2).attr("stroke-dasharray", "5,3");
legend.append("text").attr("x", 26).attr("y", 30).attr("fill", "#e5e7eb").attr("font-size", "11px")
  .text("Teacher Forcing (starts from GT)");
legend.append("line").attr("x1", 0).attr("y1", 46).attr("x2", 20).attr("y2", 46)
  .attr("stroke", "#f97316").attr("stroke-width", 2).attr("stroke-dasharray", "5,3");
legend.append("text").attr("x", 26).attr("y", 50).attr("fill", "#e5e7eb").attr("font-size", "11px")
  .text("AR Rollout (starts from prediction)");

const stepLabel = g.append("text").attr("x", 8).attr("y", 18)
  .attr("fill", "#fbbf24").attr("font-size", "12px").attr("font-weight", "bold")
  .text(`Step: 0 / ${N_STEPS}`);
const errLabel = g.append("text").attr("x", 8).attr("y", 36)
  .attr("fill", "#fbbf24").attr("font-size", "11px")
  .text("Running ADE → AR: 0.00, TF: 0.00");

// Controls
const controls = d3.select(container).append("div")
  .attr("style", "text-align:center; margin-top:10px; font-size:13px; color:#f1f5f9;");
controls.append("span").text("Prediction noise σ: ");
const slider = controls.append("input").attr("type", "range")
  .attr("min", 0).attr("max", 1).attr("step", 0.02).attr("value", 0.3)
  .attr("style", "width:200px; vertical-align:middle;");
const sigmaReadout = controls.append("span").style("margin-left", "8px").style("font-weight", "bold").text("0.30");
slider.on("input", function() {
  sigma = +this.value;
  sigmaReadout.text(sigma.toFixed(2));
});

const stepBtn = controls.append("button").text("Step")
  .attr("style", "margin-left:14px; padding:3px 12px; cursor:pointer; border:1px solid #999; border-radius:3px; background:#f9f9f9;");
stepBtn.on("click", stepOnce);

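const playBtn = controls.append("button").text("▶ Auto-play")
  .attr("style", "margin-left:8px; padding:3px 12px; cursor:pointer; border:1px solid #999; border-radius:3px; background:#f9f9f9;");
let playing = false;
let playTimer = null;
// Stop the auto-play timer (if any) and restore the button label.
function stopPlaying() {
  playing = false;
  if (playTimer) { playTimer.stop(); playTimer = null; }
  playBtn.text("▶ Auto-play");
}
playBtn.on("click", () => {
  if (playing) { stopPlaying(); return; }
  playing = true;
  playBtn.text("⏸ Pause");
  // Advance one step every 500 ms until the rollout is complete.
  playTimer = d3.interval(() => {
    stepOnce();
    if (stepIdx >= N_STEPS) stopPlaying();
  }, 500);
});

const resetBtn = controls.append("button").text("↻ Reset")
  .attr("style", "margin-left:8px; padding:3px 12px; cursor:pointer; border:1px solid #999; border-radius:3px; background:#f9f9f9;");
resetBtn.on("click", () => {
  stopPlaying();
  reset();
});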

render();

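Key question: when matching the GT segment at time $T_2$, should we start from the GT position at $T_1$, or from the position of the token matched at $T_1$?

Answer: we must match starting from the token position. Matching from GT creates a train-inference mismatch: at inference time the model always conditions on its own previous predictions, never on ground truth. Matching from GT introduces no accumulated error during training, but the model then never learns to recover from its own prediction errors.

Matching cost function: we currently use a weighted combination of the center-point L2 distance and the heading L2 distance. SMART instead matches on bounding-box corners, which avoids the threshold-tuning problem.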
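To make the "match from the token position" rule concrete, here is a minimal sketch of rolling matching with a weighted position-plus-heading cost. It simplifies heavily: tokens are single-step displacements in the local frame of the previous pose, `headingWeight` is a hypothetical tuning knob, and the three-entry vocabulary is a toy.

```javascript
// Wrap an angle to [-pi, pi].
function wrapAngle(a) {
  return Math.atan2(Math.sin(a), Math.cos(a));
}

// Express `pose` in the local frame of `anchor`. Poses are {x, y, h}.
function toLocal(anchor, pose) {
  const dx = pose.x - anchor.x, dy = pose.y - anchor.y;
  const c = Math.cos(anchor.h), s = Math.sin(anchor.h);
  return { x: c * dx + s * dy, y: -s * dx + c * dy, h: wrapAngle(pose.h - anchor.h) };
}

// Apply a local-frame token (displacement) to `anchor`, giving a global pose.
function applyToken(anchor, token) {
  const c = Math.cos(anchor.h), s = Math.sin(anchor.h);
  return {
    x: anchor.x + c * token.x - s * token.y,
    y: anchor.y + s * token.x + c * token.y,
    h: wrapAngle(anchor.h + token.h),
  };
}

// Weighted position + heading cost; headingWeight is a hypothetical tuning knob.
function matchCost(target, token, headingWeight = 1.0) {
  return Math.hypot(target.x - token.x, target.y - token.y)
    + headingWeight * Math.abs(wrapAngle(target.h - token.h));
}

// Rolling match: each step is matched relative to the previous *matched* pose.
function rollingMatch(gt, vocabulary) {
  let anchor = gt[0];
  const ids = [], reconstructed = [gt[0]];
  for (let t = 1; t < gt.length; t++) {
    const target = toLocal(anchor, gt[t]);
    let best = 0;
    vocabulary.forEach((tok, k) => {
      if (matchCost(target, tok) < matchCost(target, vocabulary[best])) best = k;
    });
    anchor = applyToken(anchor, vocabulary[best]); // continue from the token, not from GT
    ids.push(best);
    reconstructed.push(anchor);
  }
  return { ids, reconstructed };
}

// Toy vocabulary: forward steps of 0, 1, or 2 m. GT advances 1.4 m per step.
const vocabulary = [{ x: 0, y: 0, h: 0 }, { x: 1, y: 0, h: 0 }, { x: 2, y: 0, h: 0 }];
const gt = [0, 1.4, 2.8, 4.2, 5.6].map(x => ({ x, y: 0, h: 0 }));
const { ids, reconstructed } = rollingMatch(gt, vocabulary);
console.log(ids, reconstructed.map(p => p.x));
```

In this toy case the matcher alternates between the 1 m and 2 m tokens, so the reconstructed position stays within 0.4 m of GT. Matching every step from the GT pose would pick the 1 m token each time, and the replayed token sequence would fall 1.6 m behind by the last step.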

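Stage 3: Reconstruction Error Analysis

After matching, we reconstruct the full trajectory from the tokens and measure its error against GT. Key observation: finer-grained tokens and a larger vocabulary reduce reconstruction error, but model performance does not always track reconstruction accuracy. Tokenizer fidelity is a necessary condition for good downstream performance, not a sufficient one.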
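The first half of that observation, that a larger vocabulary lowers reconstruction error, can be reproduced on a toy 1-D problem. The sketch below is only a stand-in for the full $(x, y, h, v)$ case: the uniform step vocabulary, the synthetic GT positions, and the function names are assumptions, and it says nothing about downstream model performance.

```javascript
// Roll a 1-D trajectory through a vocabulary of step sizes and return the
// mean absolute reconstruction error against GT.
function reconstructionError(gtPositions, stepVocabulary) {
  let pos = gtPositions[0];
  let total = 0;
  for (let t = 1; t < gtPositions.length; t++) {
    const target = gtPositions[t] - pos; // step needed from the matched position
    let best = stepVocabulary[0];
    for (const step of stepVocabulary) {
      if (Math.abs(target - step) < Math.abs(target - best)) best = step;
    }
    pos += best;
    total += Math.abs(gtPositions[t] - pos);
  }
  return total / (gtPositions.length - 1);
}

// Uniform step vocabulary covering [0, maxStep] with `size` entries.
function uniformVocabulary(size, maxStep) {
  return Array.from({ length: size }, (_, k) => (k * maxStep) / (size - 1));
}

// Synthetic GT: 17 frames advancing about 1.4 m per step with a small wobble.
const gtPositions = Array.from({ length: 17 }, (_, t) => 1.4 * t + 0.3 * Math.sin(t));
for (const size of [3, 9, 33]) {
  const err = reconstructionError(gtPositions, uniformVocabulary(size, 4));
  console.log(`vocabulary size ${size}: mean reconstruction error ${err.toFixed(3)}`);
}
```

Because matching restarts from the reconstructed position, the per-step error here is bounded by half the spacing between vocabulary entries, so it shrinks as the vocabulary grows.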

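3. Model Architecture: Why AR + Diffusion?

Pure diffusion-style planners (DiffusionDrive, GoalFlow) have shown that anchors are essential for keeping trajectories from diverging. The question is: where do the anchors come from?

An AR model learns the conditional distribution $p(x_{t+1} \mid x_{1:t})$, but during rollout it samples from its own predictions, so exposure bias accumulates and errors compound over long horizons. Diffusion's multi-step iterative refinement is a natural fit for correcting this drift.

Complementary strengths:

AR solves diffusion's cold-start problem: pure diffusion starts from Gaussian noise and faces a huge search space. AR supplies a trajectory that already lies on the data manifold, which greatly reduces the denoising burden.
Diffusion solves AR's drift problem: diffusion's global modeling acts as a "smoothing filter" that corrects the bias accumulated in long-horizon predictions.

This combination has reportedly achieved top-tier results on the NavSim benchmark (NAVSIM v1 navtest), with Chainflow-VLA reaching 94.05 PDMS (the figure comes from internal engineering documents; this blog could not independently verify it from public sources).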

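4. RL Post-Training with GRPO

For a pretrained AR model, reinforcement learning can further optimize the driving policy through interaction with the environment. We formulate the problem as an MDP: $\mathcal{M} = (S, A, P, R, \gamma)$.

4.1 State, Action, Reward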

Component	Definition
State $S$	Latent representation from encoder: $\text{element} = f_{\text{encoder}}(\text{input})$
Action $A$	Scheme 1: Actor network replaces decoder; action = token selection from discrete vocabulary. Scheme 2: Actor makes continuous adjustment $(\Delta x, \Delta y, \Delta h)$ to selected token.
Reward $R$	TTC-based collision penalty: $R_{\text{TTC}} = -\left(10^{\max(0,\, 2 - \text{TTC})} - 1\right)$

The TTC reward is designed to grow exponentially:

TTC > 2 s: no penalty (safe)
TTC $\leq$ 2 s: exponentially growing penalty
TTC = 0: large penalty + episode termination
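The three regimes can be sketched as a small reward function. This assumes the penalty takes the exponential form $-(10^{\max(0,\, 2 - \text{TTC})} - 1)$, which is zero above the 2 s threshold and grows exponentially below it; the function name `ttcReward` and the returned `done` flag are illustrative.

```javascript
// TTC-based penalty, assuming the exponential form -(10^max(0, 2 - TTC) - 1).
const TTC_THRESHOLD_S = 2;

function ttcReward(ttc) {
  const excess = Math.max(0, TTC_THRESHOLD_S - ttc);
  const reward = 1 - Math.pow(10, excess); // 0 when TTC >= 2 s, increasingly negative below
  // Assumption: TTC <= 0 is treated as a collision that terminates the episode.
  return { reward, done: ttc <= 0 };
}

for (const ttc of [3, 2, 1, 0]) {
  const { reward, done } = ttcReward(ttc);
  console.log(`TTC=${ttc}s reward=${reward} done=${done}`);
}
```

Under this assumed form the penalty is 0 at TTC = 2 s, -9 at TTC = 1 s, and -99 at TTC = 0, so the last second before contact dominates the signal.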

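4.2 From PPO to GRPO

The core change from PPO to GRPO is how the advantage $A$ is estimated:

PPO uses a learned value function $V_\phi(s)$ as the baseline:

$$A_\pi(s_t, a_t) = Q_\pi(s_t, a_t) - V_\phi(s_t)$$

This requires training and maintaining a separate value network.

GRPO replaces the value baseline with the group mean, which removes the value network entirely:

$$A_i = \frac{r_i - \bar{r}}{\sigma_r + \epsilon}$$

where $r_i$ is the reward of the $i$-th of $G$ trajectories sampled from the same initial state, $\bar{r} = \frac{1}{G}\sum_{j=1}^{G} r_j$ is the group mean, $\sigma_r$ is the within-group standard deviation, and $\epsilon$ is a small constant for numerical stability.

This suits driving particularly well: we can sample $G$ candidate trajectories for the same scene, evaluate all of them, and use each one's reward relative to the rest of the group as the advantage signal. No value network is needed.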
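The group-relative advantage is short enough to write out directly. The sketch below assumes the population standard deviation (dividing by $G$) and made-up rewards; whether a given implementation uses the population or sample form is not stated here.

```javascript
// Group-relative advantages: A_i = (r_i - mean) / (std + eps).
function groupAdvantages(rewards, eps = 1e-8) {
  const G = rewards.length;
  const mean = rewards.reduce((a, b) => a + b, 0) / G;
  // Assumption: population variance (divide by G).
  const variance = rewards.reduce((a, r) => a + (r - mean) ** 2, 0) / G;
  const std = Math.sqrt(variance);
  return rewards.map(r => (r - mean) / (std + eps));
}

// Rewards of G = 4 candidate trajectories sampled for the same scene (made-up values).
const rewards = [1.0, 0.0, -1.0, 0.0];
console.log(groupAdvantages(rewards));
```

One consequence worth noting: if every candidate in the group receives the same reward, all advantages are zero and that scene contributes no policy gradient, which is why reward spread within a group matters.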

4.3 Loss Design

The total loss combines several objectives:

$$\mathcal{L}_{\text{total}}(\theta) = \lambda_{\text{pg}} \mathcal{L}_{\text{GRPO-clip}} + \lambda_{\text{kl}} \mathcal{L}_{\text{KL}} + \lambda_{\text{vf}} \mathcal{L}_{\text{value}} + \lambda_{\text{ent}} \mathcal{L}_{\text{entropy}} + \lambda_{\text{bc}} \mathcal{L}_{\text{BC}} + \sum_{m=1}^{M} \lambda_{\text{aux},m} \mathcal{L}_{\text{aux},m}$$

Term	Role
$\mathcal{L}_{\text{GRPO-clip}}$	Policy gradient with clipped importance ratio
$\mathcal{L}_{\text{KL}}$	Prevent distribution drift from reference policy
$\mathcal{L}_{\text{value}}$	Value function fitting (only if a critic is retained; pure GRPO needs none)
$\mathcal{L}_{\text{entropy}}$	Maintain exploration
$\mathcal{L}_{\text{BC}}$	Behavioral cloning: preserve pre-training capability
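As an illustration of how these terms fit together, the sketch below computes the standard clipped-ratio surrogate for one group and then forms a weighted sum. The clip range of 0.2, the term values, and the weights are placeholders chosen for the example, not settings from this work.

```javascript
// Clipped policy-gradient term for one group:
// L = -mean_i( min(ratio_i * A_i, clip(ratio_i, 1 - eps, 1 + eps) * A_i) ).
function grpoClipLoss(logpNew, logpOld, advantages, clipEps = 0.2) {
  let total = 0;
  for (let i = 0; i < advantages.length; i++) {
    const ratio = Math.exp(logpNew[i] - logpOld[i]);
    const clipped = Math.min(Math.max(ratio, 1 - clipEps), 1 + clipEps);
    total += Math.min(ratio * advantages[i], clipped * advantages[i]);
  }
  return -total / advantages.length;
}

// Weighted sum of named loss terms; terms without a weight contribute nothing.
function totalLoss(terms, weights) {
  return Object.keys(terms).reduce((sum, name) => sum + (weights[name] || 0) * terms[name], 0);
}

// Two samples: one whose probability rose 1.5x with A = +1, one that fell 0.5x with A = -1.
const pg = grpoClipLoss([Math.log(1.5), Math.log(0.5)], [0, 0], [1, -1]);
const loss = totalLoss(
  { pg, kl: 0.02, entropy: -1.2, bc: 0.4 },    // placeholder term values
  { pg: 1.0, kl: 0.1, entropy: 0.01, bc: 0.5 } // placeholder lambdas
);
console.log(pg, loss);
```

Both samples in the example hit the clip boundary (1.2 and 0.8), so the surrogate stops rewarding further movement in the direction the policy has already moved.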

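4.4 Sampling in Driving vs. LLMs

A key difference from RL for LLMs is that the sampling space is constrained. Temperature/top-k/top-p sampling in language is unconstrained, whereas in driving a sampled trajectory must satisfy physical constraints: no abrupt speed changes, no heading reversals, and no kinematically impossible curvature.

For diffusion-based planners, noise experiments are especially important because the noise directly determines the exploration range, the diversity of candidates, and the group distribution. A suitable noise level must provide:

Effective diversity: rewards within a group should have meaningful spread.
Physical feasibility: no obviously infeasible samples.
Objective consistency: the direction of exploration should align with the training objective.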
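The first two criteria lend themselves to a quick per-group check. The sketch below is one possible way to do it, not the procedure used here: the acceleration and yaw-rate limits are illustrative placeholders, and `isFeasible` and `groupStats` are hypothetical names.

```javascript
// Feasibility check for one sampled trajectory of [x, y, h, v] states at step dt.
// The limits are illustrative placeholders, not values from this post.
function isFeasible(traj, dt, { maxAccel = 6, maxYawRate = 1.0 } = {}) {
  for (let i = 1; i < traj.length; i++) {
    const accel = (traj[i][3] - traj[i - 1][3]) / dt;
    const rawDh = traj[i][2] - traj[i - 1][2];
    const dh = Math.atan2(Math.sin(rawDh), Math.cos(rawDh)); // wrapped heading change
    if (Math.abs(accel) > maxAccel || Math.abs(dh / dt) > maxYawRate) return false;
  }
  return true;
}

// Summarize a group of samples: feasible fraction and reward spread.
function groupStats(trajectories, rewards, dt) {
  const feasibleFraction = trajectories.filter(t => isFeasible(t, dt)).length / trajectories.length;
  const mean = rewards.reduce((a, b) => a + b, 0) / rewards.length;
  const rewardStd = Math.sqrt(rewards.reduce((a, r) => a + (r - mean) ** 2, 0) / rewards.length);
  return { feasibleFraction, rewardStd };
}

// Two made-up samples at 2 Hz: a smooth one and one with a heading reversal.
const smooth = [[0, 0, 0, 5], [2.5, 0, 0, 5], [5, 0, 0, 5]];
const reversed = [[0, 0, 0, 5], [2.5, 0, Math.PI, 5], [5, 0, 0, 5]];
console.log(groupStats([smooth, reversed], [1, -1], 0.5));
```

Sweeping the noise level and watching these two numbers gives a rough picture of the trade-off: too little noise collapses `rewardStd` toward zero, while too much drives `feasibleFraction` down.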

5. Evaluation Metrics

5.1 Accuracy Metrics

Metric	Description
Top1_ADE	ADE of highest-scoring mode
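minADE	Minimum ADE across all modes, selected per agent (different agents may take their best value from different modes, so the result need not correspond to one consistent joint mode)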
Joint_minADE	Minimum ADE at the mode level (all agents from the same mode)
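The difference between the three accuracy metrics is easiest to see in code. The sketch below assumes predictions are indexed as `predictions[mode][agent]` and that per-agent ADEs are averaged over agents; both the layout and the averaging are assumptions made for illustration.

```javascript
// ADE between one predicted and one ground-truth trajectory (arrays of [x, y]).
function ade(pred, gt) {
  const total = pred.reduce((s, p, t) => s + Math.hypot(p[0] - gt[t][0], p[1] - gt[t][1]), 0);
  return total / pred.length;
}
const mean = xs => xs.reduce((a, b) => a + b, 0) / xs.length;

// predictions[mode][agent] is a trajectory; gt[agent] is a trajectory.
function top1Ade(predictions, scores, gt) {
  const best = scores.indexOf(Math.max(...scores)); // highest-scoring mode
  return mean(gt.map((g, a) => ade(predictions[best][a], g)));
}
function minAde(predictions, gt) {
  // Best mode chosen independently for each agent.
  return mean(gt.map((g, a) => Math.min(...predictions.map(mode => ade(mode[a], g)))));
}
function jointMinAde(predictions, gt) {
  // One mode chosen for the whole scene.
  return Math.min(...predictions.map(mode => mean(gt.map((g, a) => ade(mode[a], g)))));
}

// Two agents, two modes, one-point trajectories (made-up values).
const gt = [[[0, 0]], [[0, 0]]];
const predictions = [
  [[[0, 0]], [[2, 0]]], // mode 0: agent 0 exact, agent 1 off by 2 m
  [[[2, 0]], [[0, 0]]], // mode 1: agent 0 off by 2 m, agent 1 exact
];
const scores = [0.6, 0.4];
console.log(top1Ade(predictions, scores, gt), minAde(predictions, gt), jointMinAde(predictions, gt));
```

In this toy scene minADE is 0 because each agent borrows its best mode, while Joint_minADE is 1 because no single mode is right for both agents at once.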

5.2 Kinematic Metrics (Ego Only)

Metric	Description
Top1_Kinematic_Score	Weighted average of all kinematic sub-metrics
Top1_Kinematic_Rec_Cons	Reconstructability: forward-predicted vs. inverse-reconstructed state error via bicycle model
Top1_Kinematic_Vel	Velocity error (predicted vs. GT via finite difference)
Top1_Kinematic_Acc	Acceleration error
Top1_Kinematic_YR	Yaw rate error
Top1_Kinematic_Jerk_Long/Lat	Longitudinal/lateral jerk error

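The **Reconstruction Consistency** metric is especially insightful: it forward-predicts the state, reconstructs it inversely, and measures the residual, which evaluates whether the predicted trajectory satisfies the bicycle kinematic model. This tests physical feasibility independently of GT.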

5.3 Interaction Metrics (Collision)

Metric	Description
Top1_CR_Ego	Ego collision rate in top-1 mode
Top1_CR_Agents	Any-agent collision rate = Pairwise + Agent-Time components
Top1_CR_Scenario	Per-scenario binary: does any collision occur?
Joint_minCR_*	Same metrics at mode level (best mode selected)

6. Quantitative Results

Comparison against a regression baseline on the same train/test split:

Metric	Regression Model	AR Model
Top1_ADE (Ego)	2.622	2.869
Top1_ADE (Agent)	1.759	1.847
Joint_minADE (Ego)	–	2.811
Joint_minADE (Agent)	–	1.841
minADE (Ego)	1.464	1.876
minADE (Agent)	1.576	1.286

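The AR model's top-1 ADE is slightly higher (as expected, since discrete quantization introduces error), and its ego minADE is also higher (1.876 vs. 1.464). On agent minADE, however, it is markedly lower (1.286 vs. 1.576), which suggests that its multimodal predictions cover the distribution of agent behaviors better.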

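7. Qualitative Observations

The AR model shows good interactive behavior across a range of challenging scenarios:

Unprotected left turn: waits for through traffic to pass, then proceeds smoothly.
Obstacle bypass: proposes a path around the obstacle where GT chooses to stop; it can "find a way out" around obstacles.
Narrow-road steering: makes small lateral adjustments to open up a safe margin.
Cut-in lane change: cuts decisively into a narrow gap.
Yielding to pedestrians: decelerates smoothly at crosswalks.

These behaviors emerge naturally from the distribution the AR model learns over trajectory tokens, without explicit rule programming.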

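8. Open Problems

Token vocabulary design: should clustering be done per time step or globally? The current uniform $m = 6000$ across categories still needs ablation.
Online vs. offline matching: the current in-model online matching slows training; the plan is to move to offline matching.
AR+Diffusion integration: the AR+Diffusion pipeline is the theoretical target; so far only the AR baseline + RL post-training has been completed.
Inter-frame consistency: the AR model shows higher ADE between consecutive rollout frames (jitter). GRPO with a frame-stability reward can improve this, at some cost to the lane-change trigger metric.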

References

- MotionLM: Multi-Agent Motion Forecasting as Language Modeling (Waymo, ICCV 2023)

- DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving (CVPR 2025)

- GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectory Generation

- AlphaDrive: GRPO-based RL for Autonomous Driving

- SMART: Scalable Multi-agent Real-time Motion Generation via Next-token Prediction (NeurIPS 2024)

- NavSim Benchmark
