Autoregressive (AR) trajectory generation — predicting driving trajectories as sequences of discrete tokens, much like language models predict text — has emerged as a powerful paradigm for end-to-end autonomous driving. But how do we turn continuous trajectories into discrete tokens? How do we ensure the tokenized representation preserves enough fidelity for planning? And how does the AR paradigm combine with diffusion and reinforcement learning to produce state-of-the-art results? This article walks through the complete pipeline, from tokenization theory to RL post-training.

1. Background: Regression vs. Classification in AR Planning

Autoregressive trajectory generation splits into two fundamental paradigms:

| Paradigm | Output | Pros | Cons | Examples |
|---|---|---|---|---|
| Regression-based AR | Continuous output at each step | No discretization error | Multimodality via GMM is hard; the true distribution is unknown | MDR, UniAD |
| Classification-based AR | Discrete token prediction | Explicit distribution modeling; natural multimodality | Quantization error | MotionLM, SMART |

Regression-based AR outputs continuous coordinates at each step. In theory, multi-modal regression (e.g., via GMM) can capture diverse behaviors. In practice, the true distribution is unknown, and fitting enough modes is extremely difficult.

Classification-based AR discretizes the continuous action space and predicts token indices via cross-entropy. This naturally models the conditional probability distribution $p(a_t \mid a_{<t}, x)$, making multimodality a first-class citizen.

1.1 Discretization: State Quantities vs. High-Order Motion Quantities

For classification-based AR, the choice of what to discretize is critical:

| Quantity | Pros | Cons |
|---|---|---|
| State $(x, y, h)$ | Directly available from data; no inverse kinematics | Requires clustering to build a vocabulary |
| High-order $(\text{acc}, \text{yaw\_rate})$ | More compact; control-oriented | GT values hard to obtain; unreasonable for VRUs; incompatible with low-frequency prediction |

The high-order approach has three specific problems:

  1. Ground truth acquisition: Acceleration and yaw rate for obstacles are difficult to measure accurately.
  2. Uniform kinematic model: Applying the same model to vehicles, cyclists, and pedestrians is unreasonable.
  3. Frequency limitation: At 0.5 Hz, assuming constant acceleration/yaw rate over 1 second cannot capture the actual motion trend.

VAE-based discretization avoids explicit quantization by operating in latent space, but suffers from training instability and mode collapse.

Our choice: State-based discretization via clustering. Trajectories are state quantities directly extractable from data, eliminating inverse kinematics errors. Clustering compresses massive historical data into a finite, representative Trajectory Vocabulary.

2. The Mdriver AR Pipeline

2.1 Task Definition

We assume any trajectory of length $T$ can be composed of trajectory fragments (tokens). For an 8-second trajectory at 2 Hz (16 coordinate points), we can define:

  • 16 single-point tokens
  • 8 two-point tokens (each covering 1 second)
  • 4 four-point tokens (each covering 2 seconds)

The joint distribution factorizes as:

$$p(S_{1:T} \mid \text{Env}) = \prod_{t=1}^{T} p(S_{t:t+n} \mid S_{<t}, \text{Env})$$

where $S_{t:t+n}$ denotes the state sequence $(x, y, h, v)$ from timestep $t$ to $t+n$.
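At inference time this factorization is sampled token by token. The sketch below shows a minimal, hypothetical rollout loop; `model`, `vocab`, and the sampling details are placeholder assumptions, and the vocabulary construction is described in Section 2.2:

```python
import torch

def rollout(model, env_feats, vocab, n_steps=8, temperature=1.0):
    """Minimal AR rollout sketch: sample p(S_{t:t+n} | S_{<t}, Env) step by step.

    Assumptions (not the actual Mdriver API): model(env_feats, history) returns
    logits over the trajectory vocabulary; vocab is an [m, 3, 4] tensor of
    cluster-center segments with state (x, y, h, v).
    """
    history = []      # previously chosen token indices, i.e. S_{<t}
    segments = []
    for _ in range(n_steps):
        logits = model(env_feats, history)
        probs = torch.softmax(logits / temperature, dim=-1)
        idx = torch.multinomial(probs, num_samples=1).item()   # sample one token
        history.append(idx)
        segments.append(vocab[idx])        # decoded state segment for this token
    # Naive concatenation; overlapping "current" points are ignored for brevity.
    return torch.cat(segments, dim=0)
```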

2.2 Tokenizer: Cluster, Match, Reconstruct

Tokenization is the core of the AR model. It has three stages:

Stage 1: Clustering

Given $n$ training trajectories, each containing 17 frames (1 current + 16 future) with state $[x, y, h, v]$ per frame:

  1. Apply k-means separately per category (ego, vehicle, cyclist, pedestrian).
  2. Each token has shape $[m, 3, 4]$: $m$ cluster centers, 3 points per segment (current + 2 future), 4 state dimensions.
  3. Current vocabulary size: $m = 6000$ per category (uniform across categories; ablation on category-specific sizes is pending).
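A minimal clustering sketch, assuming the segments are already extracted into an `[n, 3, 4]` array per category (the function name and k-means settings are illustrative, not the production pipeline):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(segments, m=6000, seed=0):
    """Cluster trajectory segments into a token vocabulary (sketch).

    segments: [n, 3, 4] array of n training segments, 3 points each
    (current + 2 future), 4 state dims (x, y, h, v).
    Returns cluster centers with shape [m, 3, 4].
    """
    flat = segments.reshape(len(segments), -1)                  # [n, 12]
    km = KMeans(n_clusters=m, n_init=10, random_state=seed).fit(flat)
    return km.cluster_centers_.reshape(m, 3, 4)

# One vocabulary per category, with a uniform size for now:
# vocab = {c: build_vocabulary(segments_by_category[c])
#          for c in ("ego", "vehicle", "cyclist", "pedestrian")}
```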

Token refinement (optional but important):

  • Heading fix: Ensure motion direction is consistent with heading.
  • Velocity fix: Use finite-difference velocity instead of raw values.

These fixes reduce noise from imperfect perception data, producing cleaner cluster centers.
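One plausible implementation of the two fixes, shown only as an assumed sketch of how such refinement could look (the actual fix may differ):

```python
import numpy as np

def refine_segment(seg, dt=0.5):
    """Token refinement sketch: derive heading from the motion direction and
    replace raw velocity with finite-difference velocity.

    seg: [3, 4] segment of (x, y, h, v); dt is the assumed frame interval (2 Hz).
    """
    seg = seg.copy()
    deltas = np.diff(seg[:, :2], axis=0)                   # per-step displacement
    seg[:-1, 2] = np.arctan2(deltas[:, 1], deltas[:, 0])   # heading fix
    seg[:-1, 3] = np.linalg.norm(deltas, axis=-1) / dt     # velocity fix
    return seg
```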

Stage 2: Matching

Matching assigns each ground-truth trajectory segment to the nearest token. This step is critical and involves a subtle design decision:

[Figure: matching the GT segment from the GT position vs. from the previously matched token position. Matching from GT avoids accumulation error during training, but inference conditions on tokens, producing a train-test mismatch.]

The critical question: when matching the GT segment at $T_2$, should we start from the GT position at $T_1$ or from the token-matched position at $T_1$?

Answer: we must match from the token position. Matching from GT creates a train-test mismatch, because at inference time the model always conditions on its own previous predictions, never the ground truth. Although matching from GT introduces no accumulation error during training, the model then never learns to recover from its own prediction errors.

Matching cost: Currently using a weighted combination of center-point L2 distance and heading L2 distance. The SMART approach uses bounding-box corner matching, which avoids the threshold-tuning problem.
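A greedy matching sketch that conditions each step on the previously matched token rather than the GT position; frame rotation and the tuned cost weights are omitted, and the names and weights are illustrative:

```python
import numpy as np

def match_tokens(gt_segments, vocab, w_pos=1.0, w_head=0.5):
    """Match a GT trajectory to tokens, rolling forward from the matched
    token pose so training sees the same accumulation error as inference.

    gt_segments: [T, 3, 4] GT segments; vocab: [m, 3, 4] cluster centers.
    """
    matched = []
    cur_xy = gt_segments[0, 0, :2]        # start from the true current pose
    cur_h = gt_segments[0, 0, 2]
    for seg in gt_segments:
        cand_xy = cur_xy + vocab[:, -1, :2]    # candidate segment endpoints
        cand_h = cur_h + vocab[:, -1, 2]
        cost = (w_pos * np.linalg.norm(cand_xy - seg[-1, :2], axis=-1)
                + w_head * np.abs(cand_h - seg[-1, 2]))
        best = int(np.argmin(cost))
        matched.append(best)
        cur_xy, cur_h = cand_xy[best], cand_h[best]   # continue from the token
    return matched
```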

Stage 3: Reconstruction Error Analysis

After matching, we reconstruct full trajectories from tokens and measure error against GT. Key observation: finer tokens and larger vocabulary reduce reconstruction error, but model performance does not always correlate with reconstruction accuracy. The tokenizer’s fidelity is necessary but not sufficient for good downstream performance.

3. Model Architecture: Why AR + Diffusion?

Pure diffusion models (DiffusionDrive, GoalFlow) have demonstrated that anchors are critical for preventing trajectory divergence. The question is: where do the anchors come from?

[Figure: pure diffusion searches a huge space and diverges without an anchor, while AR + Diffusion starts from an AR proposal and needs only a small refinement. Analogy: (1) have a concept, (2) describe it in words, (3) refine the expression; steps 1-2 correspond to AR, step 3 to diffusion.]

The AR model learns the conditional distribution $p(x_{t+1} \mid x_{1:t})$, but during rollout it samples from its own predictions, accumulating exposure bias and compounding error over long horizons. Diffusion's multi-step iterative refinement is naturally suited to correcting this drift.

The complementary strengths:

  1. AR solves Diffusion’s cold-start: Pure diffusion starts from Gaussian noise with an enormous search space. AR provides a trajectory already on the data manifold, dramatically reducing the denoising burden.
  2. Diffusion solves AR’s drift: Global modeling via diffusion acts as a “smoothing filter,” correcting accumulated deviations in long-horizon predictions.

This combination achieved top-1 results on the NavSim benchmark, with Chainflow-VLA scoring 94.05 PDMS.
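A minimal sketch of how the two stages could compose, assuming a hypothetical `refiner` denoiser; this illustrates the idea, not the Chainflow-VLA implementation:

```python
import torch

def refine_with_diffusion(refiner, coarse_traj, env_feats, n_steps=10):
    """Start the denoising chain from the AR proposal instead of pure noise.

    refiner(traj, step, env_feats) -> traj is an assumed denoiser signature;
    coarse_traj is the trajectory decoded from the AR token rollout.
    """
    traj = coarse_traj.clone()
    for step in reversed(range(n_steps)):
        traj = refiner(traj, step, env_feats)   # each step nudges the trajectory
    return traj
```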

4. RL Post-Training with GRPO

For a pre-trained AR model, reinforcement learning can further optimize the driving strategy through environment interaction. We formulate the problem as an MDP: $\mathcal{M} = (S, A, P, R, \gamma)$.

4.1 State, Action, Reward

| Component | Definition |
|---|---|
| State $S$ | Latent representation from the encoder: $\text{element} = f_{\text{encoder}}(\text{input})$ |
| Action $A$ | Scheme 1: an actor network replaces the decoder; the action is token selection from the discrete vocabulary. Scheme 2: the actor makes a continuous adjustment $(\Delta x, \Delta y, \Delta h)$ to the selected token. |
| Reward $R$ | TTC-based collision penalty: $R_{\text{TTC}} = -\left(10^{\max(0,\, 2 - \text{TTC})} - 1\right)$ |

The TTC reward is designed with exponential growth:

  • TTC > 2s: No penalty (safe)
  • TTC ≤ 2s: Exponentially increasing penalty
  • TTC = 0: Large penalty + episode termination
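A sketch of the reward under the piecewise behavior above; the collision penalty value and termination handling are assumptions, not the tuned constants:

```python
def ttc_reward(ttc, threshold=2.0):
    """TTC-based collision penalty (sketch). Returns (reward, done).

    No penalty above the 2 s threshold, exponentially increasing penalty as
    TTC shrinks below it, and a hard penalty plus termination at collision.
    """
    if ttc <= 0.0:
        return -1000.0, True                  # collision: large penalty, end episode
    gap = max(0.0, threshold - ttc)           # 0 when TTC > 2 s
    return -(10.0 ** gap - 1.0), False        # exponential growth as TTC decreases
```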

4.2 From PPO to GRPO

The core evolution from PPO to GRPO lies in how the advantage $A$ is estimated:

PPO uses a learned value function $V_\phi(s)$ as the baseline:

$$A_\pi(s_t, a_t) = Q_\pi(s_t, a_t) - V_\phi(s_t)$$

This requires training and maintaining a separate value network.

GRPO replaces the value baseline with the group mean, eliminating the value network entirely:

$$A_i = \frac{r_i - \bar{r}}{\sigma_r + \epsilon}$$

where $r_i$ is the reward of the $i$-th sample in a group of $G$ trajectories sampled from the same initial state, $\bar{r} = \frac{1}{G}\sum_{j=1}^{G} r_j$, and $\sigma_r$ is the group standard deviation.

This is particularly well-suited to driving: we can sample $G$ candidate trajectories for the same scene, evaluate them all, and use the relative ranking within the group as the advantage signal. No value network is needed.
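A minimal sketch of the group-relative advantage computation (scene sampling and reward evaluation are assumed to happen elsewhere):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Normalize each candidate's reward by the group mean and std (no value
    network). `rewards` holds the G scalar rewards of trajectories sampled
    from the same initial state.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: four candidates for one scene; the sign and magnitude give the
# relative ranking used as the advantage signal.
# grpo_advantages([-6.0, -1.0, 0.0, -99.0])
```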

4.3 Loss Design

The total loss combines multiple objectives:

$$\mathcal{L}_{\text{total}}(\theta) = \lambda_{\text{pg}} \mathcal{L}_{\text{GRPO-clip}} + \lambda_{\text{kl}} \mathcal{L}_{\text{KL}} + \lambda_{\text{vf}} \mathcal{L}_{\text{value}} + \lambda_{\text{ent}} \mathcal{L}_{\text{entropy}} + \lambda_{\text{bc}} \mathcal{L}_{\text{BC}} + \sum_{m=1}^{M} \lambda_{\text{aux},m} \mathcal{L}_{\text{aux},m}$$

| Term | Role |
|---|---|
| $\mathcal{L}_{\text{GRPO-clip}}$ | Policy gradient with clipped importance ratio |
| $\mathcal{L}_{\text{KL}}$ | Prevent distribution drift from the reference policy |
| $\mathcal{L}_{\text{value}}$ | Value function fitting (if applicable) |
| $\mathcal{L}_{\text{entropy}}$ | Maintain exploration |
| $\mathcal{L}_{\text{BC}}$ | Behavioral cloning: preserve pre-training capability |
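For reference, a standard clipped importance-ratio policy-gradient term of the kind used in PPO-style/GRPO objectives (a generic sketch, not the exact Mdriver loss code):

```python
import torch

def grpo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped policy-gradient term. logp_new / logp_old are log-probs of the
    sampled tokens under the current and behavior policies; advantages are
    the group-relative advantages from Section 4.2.
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()
```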

4.4 Sampling in Driving vs. LLM

A crucial difference from LLM RL: the sampling space is constrained. In language, temperature/top-k/top-p sampling is unrestricted. In driving, sampled trajectories must satisfy physical constraints — no sudden velocity changes, heading reversals, or kinematically impossible curvatures.

For diffusion-based planners specifically, noise experiments are essential because noise directly determines exploration range, candidate diversity, and group distribution. The right noise level must provide:

  1. Effective diversity: Group rewards should have meaningful spread.
  2. Physical plausibility: No obviously infeasible samples.
  3. Goal alignment: Exploration direction should align with training objectives.
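A simple feasibility filter of the kind implied by these constraints, with illustrative thresholds (the real limits would be category- and speed-dependent):

```python
import numpy as np

def is_kinematically_feasible(traj, dt=0.5, a_max=6.0, yaw_rate_max=0.6):
    """Reject sampled candidates that violate basic physical limits (sketch).

    traj: [T, 4] states (x, y, h, v); a_max in m/s^2, yaw_rate_max in rad/s.
    """
    acc = np.diff(traj[:, 3]) / dt                    # finite-difference acceleration
    yaw_rate = np.diff(np.unwrap(traj[:, 2])) / dt    # finite-difference yaw rate
    return bool(np.all(np.abs(acc) <= a_max) and
                np.all(np.abs(yaw_rate) <= yaw_rate_max))
```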

5. Evaluation Metrics

5.1 Accuracy Metrics

| Metric | Description |
|---|---|
| Top1_ADE | ADE of the highest-scoring mode |
| minADE | Minimum ADE across all modes (per-agent; may be intra-modally inconsistent) |
| Joint_minADE | Minimum ADE at the mode level (all agents taken from the same mode) |

5.2 Kinematic Metrics (Ego Only)

| Metric | Description |
|---|---|
| Top1_Kinematic_Score | Weighted average of all kinematic sub-metrics |
| Top1_Kinematic_Rec_Cons | Reconstructability: error between forward-predicted and inversely reconstructed states via the bicycle model |
| Top1_Kinematic_Vel | Velocity error (predicted vs. GT via finite difference) |
| Top1_Kinematic_Acc | Acceleration error |
| Top1_Kinematic_YR | Yaw rate error |
| Top1_Kinematic_Jerk_Long/Lat | Longitudinal/lateral jerk error |

The Reconstruction Consistency metric is particularly insightful: it evaluates whether predicted trajectories satisfy the bicycle kinematic model by computing forward predictions, then inversely reconstructing the states, and measuring the residual. This tests physical plausibility independent of GT.
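The sketch below illustrates the idea with a simplified point-kinematics stand-in for the bicycle model; the actual metric uses the full bicycle model and may invert controls differently:

```python
import numpy as np

def forward_step(state, acc, yaw_rate, dt=0.5):
    """One forward step of a simplified kinematic model; state = (x, y, h, v)."""
    x, y, h, v = state
    return np.array([x + v * np.cos(h) * dt,
                     y + v * np.sin(h) * dt,
                     h + yaw_rate * dt,
                     v + acc * dt])

def reconstruction_consistency(traj, dt=0.5):
    """Invert controls from consecutive predicted states, re-simulate them,
    and return the mean position residual; low residual = physically plausible."""
    residuals = []
    for s0, s1 in zip(traj[:-1], traj[1:]):
        acc = (s1[3] - s0[3]) / dt            # recovered (inverse) controls
        yaw_rate = (s1[2] - s0[2]) / dt
        s1_hat = forward_step(s0, acc, yaw_rate, dt)
        residuals.append(np.linalg.norm(s1_hat[:2] - s1[:2]))
    return float(np.mean(residuals))
```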

5.3 Interaction Metrics (Collision)

| Metric | Description |
|---|---|
| Top1_CR_Ego | Ego collision rate in the top-1 mode |
| Top1_CR_Agents | Any-agent collision rate = Pairwise + Agent-Time components |
| Top1_CR_Scenario | Per-scenario binary: does any collision occur? |
| Joint_minCR_* | Same metrics at the mode level (best mode selected) |

6. Quantitative Results

Comparison with the regression baseline on the same train/test split:

| Metric | Regression Model | AR Model |
|---|---|---|
| Top1_ADE (Ego) | 2.622 | 2.869 |
| Top1_ADE (Agent) | 1.759 | 1.847 |
| Joint_minADE (Ego) | | 2.811 |
| Joint_minADE (Agent) | | 1.841 |
| minADE (Ego) | 1.464 | 1.876 |
| minADE (Agent) | 1.576 | 1.286 |

The AR model shows slightly higher top-1 ADE (expected: discrete quantization introduces error), but achieves significantly lower minADE for agents (1.286 vs. 1.576), confirming that its multi-modal predictions better cover the distribution of agent behaviors.

7. Qualitative Observations

The AR model demonstrates strong interactive behavior across challenging scenarios:

  • Unprotected left turn: Waits for through-traffic, then proceeds smoothly.
  • Obstacle circumnavigation: Suggests detour paths when GT chooses to stop; “finds a way” around obstacles.
  • Narrow road steering: Small lateral adjustments to create clearance.
  • Cut-in: Assertive lane changes into tight gaps.
  • Pedestrian yielding: Smooth deceleration at crosswalks.

These behaviors emerge naturally from the AR model’s learned distribution over trajectory tokens, without explicit rule programming.

8. Remaining Challenges

  1. Token vocabulary design: Should we cluster per-timestep or globally? The current uniform $m = 6000$ across categories needs ablation.
  2. Online vs. offline matching: Current online matching in the model slows training; offline matching is planned.
  3. AR+Diffusion integration: The AR+Diffusion pipeline is the theoretical target but only the AR baseline + RL has been completed so far.
  4. Frame consistency: AR models exhibit higher ADE between consecutive rollout frames (jitter). GRPO with a frame-stability reward can fix this, but at some cost to lane-change trigger metrics.

References