Autoregressive (AR) trajectory generation — predicting driving trajectories as sequences of discrete tokens, much like language models predict text — has emerged as a powerful paradigm for end-to-end autonomous driving. But how do we turn continuous trajectories into discrete tokens? How do we ensure the tokenized representation preserves enough fidelity for planning? And how does the AR paradigm combine with diffusion and reinforcement learning to produce state-of-the-art results? This article walks through the complete pipeline, from tokenization theory to RL post-training.

1. Background: Regression vs. Classification in AR Planning

Autoregressive trajectory generation splits into two fundamental paradigms:

| Paradigm | Output | Pros | Cons | Examples |
|---|---|---|---|---|
| Regression-based AR | Continuous output at each step | No discretization error | Multimodality via GMM is hard; the true distribution is unknown | MDR, UniAD |
| Classification-based AR | Discrete token prediction | Explicit distribution modeling; natural multimodality | Quantization error | MotionLM, SMART |

Regression-based AR outputs continuous coordinates at each step. In theory, multi-modal regression (e.g., via GMM) can capture diverse behaviors. In practice, the true distribution is unknown, and fitting enough modes is extremely difficult.

Classification-based AR discretizes the continuous action space and predicts token indices via cross-entropy. This naturally models the conditional probability distribution $p(a_t \mid a_{<t}, x)$, making multimodality a first-class citizen.

1.1 Discretization: State Quantities vs. High-Order Motion Quantities

For classification-based AR, the choice of what to discretize is critical:

| Quantity | Pros | Cons |
|---|---|---|
| State $(x, y, h)$ | Directly available from data; no inverse kinematics | Requires clustering to build a vocabulary |
| High-order $(\text{acc}, \text{yaw\_rate})$ | More compact; control-oriented | GT values hard to obtain; unreasonable for VRUs; incompatible with low-frequency prediction |

The high-order approach has three specific problems:

  1. Ground truth acquisition: Acceleration and yaw rate for obstacles are difficult to measure accurately.
  2. Uniform kinematic model: Applying the same model to vehicles, cyclists, and pedestrians is unreasonable.
  3. Frequency limitation: At 0.5 Hz, assuming constant acceleration/yaw rate over 1 second cannot capture the actual motion trend.

VAE-based discretization avoids explicit quantization by operating in latent space, but suffers from training instability and mode collapse.

Our choice: State-based discretization via clustering. Trajectories are state quantities directly extractable from data, eliminating inverse kinematics errors. Clustering compresses massive historical data into a finite, representative Trajectory Vocabulary.

2. The Mdriver AR Pipeline

2.1 Task Definition

We assume any trajectory of length $T$ can be composed of trajectory fragments (tokens). For an 8-second trajectory at 2 Hz (16 coordinate points), we can define:

  • 16 single-point tokens
  • 8 two-point tokens (each covering 1 second)
  • 4 four-point tokens (each covering 2 seconds)

The joint distribution factorizes as:

$$p(S_{1:T} \mid \text{Env}) = \prod_{t=1}^{T} p(S_{t:t+n} \mid S_{<t}, \text{Env})$$

where $S_{t:t+n}$ denotes the state sequence $(x, y, h, v)$ from timestep $t$ to $t+n$.
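At inference time this factorization is sampled token by token. The sketch below shows a minimal, hypothetical rollout loop; `model`, `vocab`, and the sampling details are placeholder assumptions, and the vocabulary construction is described in Section 2.2:

```python
import torch

def rollout(model, env_feats, vocab, n_steps=8, temperature=1.0):
    """Minimal AR rollout sketch: sample p(S_{t:t+n} | S_{<t}, Env) step by step.

    Assumptions (not the actual Mdriver API): model(env_feats, history) returns
    logits over the trajectory vocabulary; vocab is an [m, 3, 4] tensor of
    cluster-center segments with state (x, y, h, v).
    """
    history = []      # previously chosen token indices, i.e. S_{<t}
    segments = []
    for _ in range(n_steps):
        logits = model(env_feats, history)
        probs = torch.softmax(logits / temperature, dim=-1)
        idx = torch.multinomial(probs, num_samples=1).item()   # sample one token
        history.append(idx)
        segments.append(vocab[idx])        # decoded state segment for this token
    # Naive concatenation; overlapping "current" points are ignored for brevity.
    return torch.cat(segments, dim=0)
```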

2.2 Tokenizer: Cluster, Match, Reconstruct

Tokenization is the core of the AR model. It has three stages:

Stage 1: Clustering

Given $n$ training trajectories, each containing 17 frames (1 current + 16 future) with state $[x, y, h, v]$ per frame:

  1. Apply k-means separately per category (ego, vehicle, cyclist, pedestrian).
  2. Each token has shape $[m, 3, 4]$: $m$ cluster centers, 3 points per segment (current + 2 future), 4 state dimensions.
  3. Current vocabulary size: $m = 6000$ per category (uniform across categories; ablation on category-specific sizes is pending).
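A minimal clustering sketch, assuming the segments are already extracted into an `[n, 3, 4]` array per category (the function name and k-means settings are illustrative, not the production pipeline):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(segments, m=6000, seed=0):
    """Cluster trajectory segments into a token vocabulary (sketch).

    segments: [n, 3, 4] array of n training segments, 3 points each
    (current + 2 future), 4 state dims (x, y, h, v).
    Returns cluster centers with shape [m, 3, 4].
    """
    flat = segments.reshape(len(segments), -1)                  # [n, 12]
    km = KMeans(n_clusters=m, n_init=10, random_state=seed).fit(flat)
    return km.cluster_centers_.reshape(m, 3, 4)

# One vocabulary per category, with a uniform size for now:
# vocab = {c: build_vocabulary(segments_by_category[c])
#          for c in ("ego", "vehicle", "cyclist", "pedestrian")}
```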

Token refinement (optional but important):

  • Heading fix: Ensure motion direction is consistent with heading.
  • Velocity fix: Use finite-difference velocity instead of raw values.

These fixes reduce noise from imperfect perception data, producing cleaner cluster centers.
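One plausible implementation of the two fixes, shown only as an assumed sketch of how such refinement could look (the actual fix may differ):

```python
import numpy as np

def refine_segment(seg, dt=0.5):
    """Token refinement sketch: derive heading from the motion direction and
    replace raw velocity with finite-difference velocity.

    seg: [3, 4] segment of (x, y, h, v); dt is the assumed frame interval (2 Hz).
    """
    seg = seg.copy()
    deltas = np.diff(seg[:, :2], axis=0)                   # per-step displacement
    seg[:-1, 2] = np.arctan2(deltas[:, 1], deltas[:, 0])   # heading fix
    seg[:-1, 3] = np.linalg.norm(deltas, axis=-1) / dt     # velocity fix
    return seg
```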

Stage 2: Matching

Matching assigns each ground-truth trajectory segment to the nearest token. This step is critical and involves a subtle design decision:

[Figure: matching the GT segment from the GT position vs. from the previously matched token position. Matching from GT avoids accumulation error during training, but inference conditions on tokens, producing a train-test mismatch.]

The critical question: when matching the GT segment at $T_2$, should we start from the GT position at $T_1$ or from the token-matched position at $T_1$?

Answer: we must match from the token position. Matching from GT creates a train-test mismatch, because at inference time the model always conditions on its own previous predictions, never the ground truth. Although matching from GT introduces no accumulation error during training, the model then never learns to recover from its own prediction errors.

Matching cost: Currently using a weighted combination of center-point L2 distance and heading L2 distance. The SMART approach uses bounding-box corner matching, which avoids the threshold-tuning problem.
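A greedy matching sketch that conditions each step on the previously matched token rather than the GT position; frame rotation and the tuned cost weights are omitted, and the names and weights are illustrative:

```python
import numpy as np

def match_tokens(gt_segments, vocab, w_pos=1.0, w_head=0.5):
    """Match a GT trajectory to tokens, rolling forward from the matched
    token pose so training sees the same accumulation error as inference.

    gt_segments: [T, 3, 4] GT segments; vocab: [m, 3, 4] cluster centers.
    """
    matched = []
    cur_xy = gt_segments[0, 0, :2]        # start from the true current pose
    cur_h = gt_segments[0, 0, 2]
    for seg in gt_segments:
        cand_xy = cur_xy + vocab[:, -1, :2]    # candidate segment endpoints
        cand_h = cur_h + vocab[:, -1, 2]
        cost = (w_pos * np.linalg.norm(cand_xy - seg[-1, :2], axis=-1)
                + w_head * np.abs(cand_h - seg[-1, 2]))
        best = int(np.argmin(cost))
        matched.append(best)
        cur_xy, cur_h = cand_xy[best], cand_h[best]   # continue from the token
    return matched
```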

Stage 3: Reconstruction Error Analysis

After matching, we reconstruct full trajectories from tokens and measure error against GT. Key observation: finer tokens and larger vocabulary reduce reconstruction error, but model performance does not always correlate with reconstruction accuracy. The tokenizer’s fidelity is necessary but not sufficient for good downstream performance.

3. Model Architecture: Why AR + Diffusion?

Pure diffusion models (DiffusionDrive, GoalFlow) have demonstrated that anchors are critical for preventing trajectory divergence. The question is: where do the anchors come from?

[Figure: pure diffusion searches a huge space and diverges without an anchor, while AR + Diffusion starts from an AR proposal and needs only a small refinement. Analogy: (1) have a concept, (2) describe it in words, (3) refine the expression; steps 1-2 correspond to AR, step 3 to diffusion.]

The AR model learns the conditional distribution $p(x_{t+1} \mid x_{1:t})$, but during rollout it samples from its own predictions, accumulating exposure bias and compounding error over long horizons. Diffusion's multi-step iterative refinement is naturally suited to correcting this drift.

The complementary strengths:

  1. AR solves Diffusion’s cold-start: Pure diffusion starts from Gaussian noise with an enormous search space. AR provides a trajectory already on the data manifold, dramatically reducing the denoising burden.
  2. Diffusion solves AR’s drift: Global modeling via diffusion acts as a “smoothing filter,” correcting accumulated deviations in long-horizon predictions.

This combination achieved top-1 results on the NavSim benchmark, with Chainflow-VLA scoring 94.05 PDMS.
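A minimal sketch of how the two stages could compose, assuming a hypothetical `refiner` denoiser; this illustrates the idea, not the Chainflow-VLA implementation:

```python
import torch

def refine_with_diffusion(refiner, coarse_traj, env_feats, n_steps=10):
    """Start the denoising chain from the AR proposal instead of pure noise.

    refiner(traj, step, env_feats) -> traj is an assumed denoiser signature;
    coarse_traj is the trajectory decoded from the AR token rollout.
    """
    traj = coarse_traj.clone()
    for step in reversed(range(n_steps)):
        traj = refiner(traj, step, env_feats)   # each step nudges the trajectory
    return traj
```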

4. RL Post-Training with GRPO

For a pre-trained AR model, reinforcement learning can further optimize the driving strategy through environment interaction. We formulate the problem as an MDP: $\mathcal{M} = (S, A, P, R, \gamma)$.

4.1 State, Action, Reward

| Component | Definition |
|---|---|
| State $S$ | Latent representation from the encoder: $\text{element} = f_{\text{encoder}}(\text{input})$ |
| Action $A$ | Scheme 1: an actor network replaces the decoder; the action is token selection from the discrete vocabulary. Scheme 2: the actor makes a continuous adjustment $(\Delta x, \Delta y, \Delta h)$ to the selected token. |
| Reward $R$ | TTC-based collision penalty: $R_{\text{TTC}} = -\left(10^{\max(0,\, 2 - \text{TTC})} - 1\right)$ |

The TTC reward is designed with exponential growth:

  • TTC > 2s: No penalty (safe)
  • TTC ≤ 2s: Exponentially increasing penalty
  • TTC = 0: Large penalty + episode termination
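A sketch of the reward under the piecewise behavior above; the collision penalty value and termination handling are assumptions, not the tuned constants:

```python
def ttc_reward(ttc, threshold=2.0):
    """TTC-based collision penalty (sketch). Returns (reward, done).

    No penalty above the 2 s threshold, exponentially increasing penalty as
    TTC shrinks below it, and a hard penalty plus termination at collision.
    """
    if ttc <= 0.0:
        return -1000.0, True                  # collision: large penalty, end episode
    gap = max(0.0, threshold - ttc)           # 0 when TTC > 2 s
    return -(10.0 ** gap - 1.0), False        # exponential growth as TTC decreases
```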

4.2 From PPO to GRPO

The core evolution from PPO to GRPO lies in how the advantage $A$ is estimated:

PPO uses a learned value function $V_\phi(s)$ as the baseline:

$$A_\pi(s_t, a_t) = Q_\pi(s_t, a_t) - V_\phi(s_t)$$

This requires training and maintaining a separate value network.

GRPO replaces the value baseline with the group mean, eliminating the value network entirely:

$$A_i = \frac{r_i - \bar{r}}{\sigma_r + \epsilon}$$

where $r_i$ is the reward of the $i$-th sample in a group of $G$ trajectories sampled from the same initial state, $\bar{r} = \frac{1}{G}\sum_{j=1}^{G} r_j$, and $\sigma_r$ is the group standard deviation.

This is particularly well-suited to driving: we can sample $G$ candidate trajectories for the same scene, evaluate them all, and use the relative ranking within the group as the advantage signal. No value network is needed.
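A minimal sketch of the group-relative advantage computation (scene sampling and reward evaluation are assumed to happen elsewhere):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Normalize each candidate's reward by the group mean and std (no value
    network). `rewards` holds the G scalar rewards of trajectories sampled
    from the same initial state.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: four candidates for one scene; the sign and magnitude give the
# relative ranking used as the advantage signal.
# grpo_advantages([-6.0, -1.0, 0.0, -99.0])
```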

4.3 Loss Design

The total loss combines multiple objectives:

$$\mathcal{L}_{\text{total}}(\theta) = \lambda_{\text{pg}} \mathcal{L}_{\text{GRPO-clip}} + \lambda_{\text{kl}} \mathcal{L}_{\text{KL}} + \lambda_{\text{vf}} \mathcal{L}_{\text{value}} + \lambda_{\text{ent}} \mathcal{L}_{\text{entropy}} + \lambda_{\text{bc}} \mathcal{L}_{\text{BC}} + \sum_{m=1}^{M} \lambda_{\text{aux},m} \mathcal{L}_{\text{aux},m}$$

| Term | Role |
|---|---|
| $\mathcal{L}_{\text{GRPO-clip}}$ | Policy gradient with clipped importance ratio |
| $\mathcal{L}_{\text{KL}}$ | Prevent distribution drift from the reference policy |
| $\mathcal{L}_{\text{value}}$ | Value function fitting (if applicable) |
| $\mathcal{L}_{\text{entropy}}$ | Maintain exploration |
| $\mathcal{L}_{\text{BC}}$ | Behavioral cloning: preserve pre-training capability |
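For reference, a standard clipped importance-ratio policy-gradient term of the kind used in PPO-style/GRPO objectives (a generic sketch, not the exact Mdriver loss code):

```python
import torch

def grpo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped policy-gradient term. logp_new / logp_old are log-probs of the
    sampled tokens under the current and behavior policies; advantages are
    the group-relative advantages from Section 4.2.
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()
```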

4.4 Sampling in Driving vs. LLM

A crucial difference from LLM RL: the sampling space is constrained. In language, temperature/top-k/top-p sampling is unrestricted. In driving, sampled trajectories must satisfy physical constraints — no sudden velocity changes, heading reversals, or kinematically impossible curvatures.

For diffusion-based planners specifically, noise experiments are essential because noise directly determines exploration range, candidate diversity, and group distribution. The right noise level must provide:

  1. Effective diversity: Group rewards should have meaningful spread.
  2. Physical plausibility: No obviously infeasible samples.
  3. Goal alignment: Exploration direction should align with training objectives.
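A simple feasibility filter of the kind implied by these constraints, with illustrative thresholds (the real limits would be category- and speed-dependent):

```python
import numpy as np

def is_kinematically_feasible(traj, dt=0.5, a_max=6.0, yaw_rate_max=0.6):
    """Reject sampled candidates that violate basic physical limits (sketch).

    traj: [T, 4] states (x, y, h, v); a_max in m/s^2, yaw_rate_max in rad/s.
    """
    acc = np.diff(traj[:, 3]) / dt                    # finite-difference acceleration
    yaw_rate = np.diff(np.unwrap(traj[:, 2])) / dt    # finite-difference yaw rate
    return bool(np.all(np.abs(acc) <= a_max) and
                np.all(np.abs(yaw_rate) <= yaw_rate_max))
```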

5. Evaluation Metrics

5.1 Accuracy Metrics

| Metric | Description |
|---|---|
| Top1_ADE | ADE of the highest-scoring mode |
| minADE | Minimum ADE across all modes (per-agent; may be intra-modally inconsistent) |
| Joint_minADE | Minimum ADE at the mode level (all agents taken from the same mode) |

5.2 Kinematic Metrics (Ego Only)

| Metric | Description |
|---|---|
| Top1_Kinematic_Score | Weighted average of all kinematic sub-metrics |
| Top1_Kinematic_Rec_Cons | Reconstructability: error between forward-predicted and inversely reconstructed states via the bicycle model |
| Top1_Kinematic_Vel | Velocity error (predicted vs. GT via finite difference) |
| Top1_Kinematic_Acc | Acceleration error |
| Top1_Kinematic_YR | Yaw rate error |
| Top1_Kinematic_Jerk_Long/Lat | Longitudinal/lateral jerk error |

The Reconstruction Consistency metric is particularly insightful: it evaluates whether predicted trajectories satisfy the bicycle kinematic model by computing forward predictions, then inversely reconstructing the states, and measuring the residual. This tests physical plausibility independent of GT.
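The sketch below illustrates the idea with a simplified point-kinematics stand-in for the bicycle model; the actual metric uses the full bicycle model and may invert controls differently:

```python
import numpy as np

def forward_step(state, acc, yaw_rate, dt=0.5):
    """One forward step of a simplified kinematic model; state = (x, y, h, v)."""
    x, y, h, v = state
    return np.array([x + v * np.cos(h) * dt,
                     y + v * np.sin(h) * dt,
                     h + yaw_rate * dt,
                     v + acc * dt])

def reconstruction_consistency(traj, dt=0.5):
    """Invert controls from consecutive predicted states, re-simulate them,
    and return the mean position residual; low residual = physically plausible."""
    residuals = []
    for s0, s1 in zip(traj[:-1], traj[1:]):
        acc = (s1[3] - s0[3]) / dt            # recovered (inverse) controls
        yaw_rate = (s1[2] - s0[2]) / dt
        s1_hat = forward_step(s0, acc, yaw_rate, dt)
        residuals.append(np.linalg.norm(s1_hat[:2] - s1[:2]))
    return float(np.mean(residuals))
```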

5.3 Interaction Metrics (Collision)

| Metric | Description |
|---|---|
| Top1_CR_Ego | Ego collision rate in the top-1 mode |
| Top1_CR_Agents | Any-agent collision rate = Pairwise + Agent-Time components |
| Top1_CR_Scenario | Per-scenario binary: does any collision occur? |
| Joint_minCR_* | Same metrics at the mode level (best mode selected) |

6. Quantitative Results

Comparison with the regression baseline on the same train/test split:

| Metric | Regression Model | AR Model |
|---|---|---|
| Top1_ADE (Ego) | 2.622 | 2.869 |
| Top1_ADE (Agent) | 1.759 | 1.847 |
| Joint_minADE (Ego) | | 2.811 |
| Joint_minADE (Agent) | | 1.841 |
| minADE (Ego) | 1.464 | 1.876 |
| minADE (Agent) | 1.576 | 1.286 |

The AR model shows slightly higher top-1 ADE (expected: discrete quantization introduces error), but achieves significantly lower minADE for agents (1.286 vs. 1.576), confirming that its multi-modal predictions better cover the distribution of agent behaviors.

7. Qualitative Observations

The AR model demonstrates strong interactive behavior across challenging scenarios:

  • Unprotected left turn: Waits for through-traffic, then proceeds smoothly.
  • Obstacle circumnavigation: Suggests detour paths when GT chooses to stop; “finds a way” around obstacles.
  • Narrow road steering: Small lateral adjustments to create clearance.
  • Cut-in: Assertive lane changes into tight gaps.
  • Pedestrian yielding: Smooth deceleration at crosswalks.

These behaviors emerge naturally from the AR model’s learned distribution over trajectory tokens, without explicit rule programming.

8. Remaining Challenges

  1. Token vocabulary design: Should we cluster per-timestep or globally? The current uniform $m = 6000$ across categories needs ablation.
  2. Online vs. offline matching: Current online matching in the model slows training; offline matching is planned.
  3. AR+Diffusion integration: The AR+Diffusion pipeline is the theoretical target but only the AR baseline + RL has been completed so far.
  4. Frame consistency: AR models exhibit higher ADE between consecutive rollout frames (jitter). GRPO with a frame-stability reward can fix this, but at some cost to lane-change trigger metrics.

References