Autoregressive (AR) trajectory generation — predicting driving trajectories as sequences of discrete tokens, much like language models predict text — has emerged as a powerful paradigm for end-to-end autonomous driving. But how do we turn continuous trajectories into discrete tokens? How do we ensure the tokenized representation preserves enough fidelity for planning? And how does the AR paradigm combine with diffusion and reinforcement learning to produce state-of-the-art results? This article walks through the complete pipeline, from tokenization theory to RL post-training.
1. Background: Regression vs. Classification in AR Planning
Autoregressive trajectory generation splits into two fundamental paradigms:
Regression-based AR outputs continuous coordinates at each step. In theory, multi-modal regression (e.g., via GMM) can capture diverse behaviors. In practice, the true distribution is unknown, and fitting enough modes is extremely difficult.
Classification-based AR discretizes the continuous action space and predicts token indices via cross-entropy. This naturally models the conditional probability distribution $p(a_t \mid a_{<t})$ over a finite token set, making multimodality a first-class citizen.
1.1 Discretization: State Quantities vs. High-Order Motion Quantities
For classification-based AR, the choice of what to discretize is critical:
| Quantity | Pros | Cons |
|---|---|---|
| State | Directly available from data; no inverse kinematics | Requires clustering to build vocabulary |
| High-order | More compact; control-oriented | GT values hard to obtain; shared kinematic model unreasonable for VRUs (vulnerable road users); incompatible with low-frequency prediction |
The high-order approach has three specific problems:
- Ground truth acquisition: Acceleration and yaw rate for obstacles are difficult to measure accurately.
- Uniform kinematic model: Applying the same model to vehicles, cyclists, and pedestrians is unreasonable.
- Frequency limitation: At a low prediction frequency such as 0.5 Hz, acceleration and yaw rate must be assumed constant over each long interval, which cannot capture the actual motion trend.
VAE-based discretization avoids explicit quantization by operating in latent space, but suffers from training instability and mode collapse.
Our choice: State-based discretization via clustering. Trajectories are state quantities directly extractable from data, eliminating inverse kinematics errors. Clustering compresses massive historical data into a finite, representative Trajectory Vocabulary.
2. The Mdriver AR Pipeline
2.1 Task Definition
We assume any trajectory of length $T$ can be composed of trajectory fragments (tokens). For an 8-second trajectory at 2 Hz (16 coordinate points), we can define:
- 16 single-point tokens
- 8 two-point tokens (each covering 1 second)
- 4 four-point tokens (each covering 2 seconds)
The joint distribution factorizes autoregressively as:

$$p(s_{1:T}) = \prod_{t=1}^{T} p(s_t \mid s_{1:t-1})$$

where $s_{i:j}$ denotes the state sequence from timestep $i$ to $j$.
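The fragment choices above amount to different reshapes of the same coordinate sequence. A minimal numpy sketch (the random trajectory is purely illustrative):

```python
import numpy as np

# 8-second trajectory at 2 Hz: 16 future (x, y) coordinate points.
traj = np.random.randn(16, 2).cumsum(axis=0)

# Split into 8 two-point tokens (each covering 1 second) or
# 4 four-point tokens (each covering 2 seconds).
two_point_tokens = traj.reshape(8, 2, 2)    # (num_tokens, points_per_token, xy)
four_point_tokens = traj.reshape(4, 4, 2)
```

Concatenating the tokens back along the time axis recovers the original trajectory exactly, so fragment length only changes the granularity of the AR factorization, not the information content.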
2.2 Tokenizer: Cluster, Match, Reconstruct
Tokenization is the core of the AR model. It has three stages:
Stage 1: Clustering
Given a large set of training trajectories, each containing 17 frames (1 current + 16 future) with a 4-dimensional state per frame:
- Apply k-means separately per category (ego, vehicle, cyclist, pedestrian).
- The vocabulary tensor has shape $(K, 3, 4)$: $K$ cluster centers, 3 points per segment (current + 2 future), and 4 state dimensions.
- The vocabulary size is currently uniform across categories; an ablation on category-specific sizes is pending.
Token refinement (optional but important):
- Heading fix: Ensure motion direction is consistent with heading.
- Velocity fix: Use finite-difference velocity instead of raw values.
These fixes reduce noise from imperfect perception data, producing cleaner cluster centers.
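The clustering stage can be sketched with a minimal Lloyd's-algorithm k-means over flattened segments; the category names and segment shape follow the text, while `k`, the iteration count, and the random data are illustrative assumptions:

```python
import numpy as np

def build_vocabulary(segments_by_category, k=16, iters=20, seed=0):
    """Cluster trajectory segments per category into a token vocabulary.

    segments_by_category: dict mapping category name to an array of
        shape (N, 3, 4) -- 3 points per segment (current + 2 future),
        4 state dimensions per point.
    Returns a dict of (k, 3, 4) cluster-center tensors.
    """
    rng = np.random.default_rng(seed)
    vocab = {}
    for category, segs in segments_by_category.items():
        X = segs.reshape(len(segs), -1)                        # (N, 12)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(iters):
            # Assign each segment to its nearest center.
            d = np.linalg.norm(X[:, None] - centers[None], axis=-1)
            labels = d.argmin(axis=1)
            # Recompute centers; keep the old center if a cluster empties.
            for j in range(k):
                if (labels == j).any():
                    centers[j] = X[labels == j].mean(axis=0)
        vocab[category] = centers.reshape(k, 3, 4)
    return vocab

# Separate vocabularies per category, as the text prescribes.
rng = np.random.default_rng(1)
data = {c: rng.normal(size=(200, 3, 4))
        for c in ("ego", "vehicle", "cyclist", "pedestrian")}
vocab = build_vocabulary(data, k=8)
```

In practice the heading and velocity fixes would be applied to the segments before clustering, so the cluster centers inherit the cleaned statistics.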
Stage 2: Matching
Matching assigns each ground-truth trajectory segment to the nearest token. This is critically important and has a subtle design decision:
The critical question: When matching the GT segment at timestep $t$, should we start from the GT position at $t-1$ or from the token-matched position at $t-1$?
Answer: We must match from the token position. Matching from GT creates a train-test mismatch — at inference time, the model always conditions on its own previous predictions, not the ground truth. Matching from GT introduces no accumulation error during training, but the model never learns to recover from its own prediction errors.
Matching cost: Currently using a weighted combination of center-point L2 distance and heading L2 distance. The SMART approach uses bounding-box corner matching, which avoids the threshold-tuning problem.
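Matching from the token position amounts to a greedy rollout. A simplified sketch in which tokens are 2-D displacement vectors and the cost is plain endpoint L2 (the article's actual cost adds a weighted heading term); it also reports the reconstruction error used in the next stage:

```python
import numpy as np

def match_from_token_position(gt_points, vocab):
    """Match each GT step to the nearest token, conditioning on the
    token-reconstructed position (not the GT position), so training
    matches the inference-time rollout.

    gt_points: (T, 2) absolute GT positions.
    vocab:     (K, 2) token displacement vectors.
    Returns token indices, the reconstructed trajectory, and mean error.
    """
    pos = np.zeros(2)
    indices, recon = [], []
    for t in range(len(gt_points)):
        # Candidate endpoints if each token is appended at the CURRENT
        # reconstructed position -- deliberately not at gt_points[t-1].
        candidates = pos + vocab                          # (K, 2)
        cost = np.linalg.norm(candidates - gt_points[t], axis=1)
        k = int(cost.argmin())
        pos = candidates[k]
        indices.append(k)
        recon.append(pos.copy())
    recon = np.asarray(recon)
    recon_error = np.linalg.norm(recon - gt_points, axis=1).mean()
    return indices, recon, recon_error

# Toy vocabulary of four unit moves and a GT path it can represent exactly.
vocab = np.array([[1., 0.], [0., 1.], [-1., 0.], [0., -1.]])
gt = np.array([[1., 0.], [1., 1.], [2., 1.]])
idx, recon, err = match_from_token_position(gt, vocab)
```

Because `pos` carries the accumulated token error forward, the matcher is forced to pick tokens that recover from earlier quantization drift, which is exactly the behavior the model must learn for inference.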
Stage 3: Reconstruction Error Analysis
After matching, we reconstruct full trajectories from tokens and measure error against GT. Key observation: finer tokens and larger vocabulary reduce reconstruction error, but model performance does not always correlate with reconstruction accuracy. The tokenizer’s fidelity is necessary but not sufficient for good downstream performance.
3. Model Architecture: Why AR + Diffusion?
Pure diffusion models (DiffusionDrive, GoalFlow) have demonstrated that anchors are critical for preventing trajectory divergence. The question is: where do the anchors come from?
The AR model learns the conditional distribution $p(s_t \mid s_{1:t-1})$, but during rollout it samples from its own predictions, accumulating exposure bias and compounding error over long horizons. Diffusion’s multi-step iterative refinement is naturally suited to correcting this drift.
The complementary strengths:
- AR solves Diffusion’s cold-start: Pure diffusion starts from Gaussian noise with an enormous search space. AR provides a trajectory already on the data manifold, dramatically reducing the denoising burden.
- Diffusion solves AR’s drift: Global modeling via diffusion acts as a “smoothing filter,” correcting accumulated deviations in long-horizon predictions.
This combination achieved top-1 results on the NavSim benchmark, with Chainflow-VLA scoring 94.05 PDMS.
4. RL Post-Training with GRPO
For a pre-trained AR model, reinforcement learning can further optimize driving strategy through environment interaction. We formulate the problem as an MDP $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$.
4.1 State, Action, Reward
| Component | Definition |
|---|---|
| State | Latent scene representation produced by the encoder |
| Action | Scheme 1: Actor network replaces decoder; action = token selection from discrete vocabulary. Scheme 2: Actor makes continuous adjustment to selected token. |
| Reward | TTC-based collision penalty (exponentially shaped; detailed below) |
The TTC penalty is designed to grow exponentially as time-to-collision shrinks:
- TTC > 2 s: no penalty (safe)
- TTC ≤ 2 s: exponentially increasing penalty
- TTC = 0: large penalty + episode termination
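A sketch of such a shaped penalty; the scale `alpha`, the base penalty, and the collision penalty are illustrative constants, not values from the article:

```python
import math

def ttc_reward(ttc, ttc_safe=2.0, base=-1.0, alpha=2.0,
               collision_penalty=-100.0):
    """Exponentially shaped time-to-collision penalty.

    ttc > ttc_safe      -> 0 (safe)
    0 < ttc <= ttc_safe -> penalty grows exponentially as ttc -> 0
    ttc <= 0            -> large penalty (episode should also terminate)
    """
    if ttc <= 0.0:
        return collision_penalty
    if ttc > ttc_safe:
        return 0.0
    return base * math.exp(alpha * (ttc_safe - ttc))
```

With these constants the penalty is $-1$ right at the 2 s boundary and grows smoothly as TTC approaches zero, so the gradient signal strengthens exactly where safety margin is being lost.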
4.2 From PPO to GRPO
The core evolution from PPO to GRPO lies in how the advantage is estimated:
PPO uses a learned value function $V_\phi$ as its baseline:

$$\hat{A}_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$$

(shown here as the one-step TD advantage; GAE is the common generalization).
This requires training and maintaining a separate value network.
GRPO replaces the value baseline with the group mean, eliminating the value network entirely:

$$\hat{A}_i = \frac{r_i - \mu}{\sigma}, \qquad \mu = \frac{1}{G}\sum_{j=1}^{G} r_j$$

where $r_i$ is the reward for the $i$-th sample in a group of $G$ trajectories sampled from the same initial state, $\mu$ is the group mean, and $\sigma$ is the group standard deviation.
This is particularly well-suited for driving: we can sample $G$ candidate trajectories for the same scene, evaluate them all, and use the relative ranking within the group as the advantage signal. No value network needed.
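The group-relative advantage is a per-group z-score. A minimal sketch:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: z-score each trajectory's reward
    against the group sampled from the same initial state."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# G = 4 candidate trajectories for one scene.
adv = grpo_advantages([1.0, 0.0, 2.0, 1.0])
```

Trajectories above the group mean get positive advantage and are reinforced; those below are suppressed, regardless of the absolute reward scale.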
4.3 Loss Design
The total loss combines multiple objectives:
| Term | Role |
|---|---|
| $\mathcal{L}_{\text{PG}}$ | Policy gradient with clipped importance ratio |
| $\mathcal{L}_{\text{KL}}$ | Prevent distribution drift from the reference policy |
| $\mathcal{L}_{V}$ | Value function fitting (if applicable) |
| $\mathcal{L}_{\text{ent}}$ | Entropy bonus to maintain exploration |
| $\mathcal{L}_{\text{BC}}$ | Behavioral cloning: preserve pre-training capability |
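These terms can be combined as in the following numpy sketch (in practice this runs under an autograd framework; the coefficients, the trajectory-level log-prob interface, and the entropy proxy are assumptions, not the article's exact formulation):

```python
import numpy as np

def grpo_total_loss(logp_new, logp_old, logp_ref, adv, logp_bc,
                    clip_eps=0.2, kl_coef=0.04, bc_coef=0.1, ent_coef=0.01):
    """Clipped policy-gradient loss with KL-to-reference and BC terms.

    logp_new / logp_old / logp_ref: (G,) summed token log-probs per
        trajectory under the current, rollout, and frozen reference
        policies.
    adv: (G,) group-relative advantages (treated as constants).
    logp_bc: (G,) log-probs of the expert (GT) token sequences.
    """
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps)
    l_pg = -np.minimum(ratio * adv, clipped * adv).mean()
    lr = logp_ref - logp_new
    l_kl = (np.exp(lr) - lr - 1).mean()       # k3-style KL estimator
    l_bc = -logp_bc.mean()                    # behavioral cloning
    l_ent = logp_new.mean()                   # crude entropy proxy
    return l_pg + kl_coef * l_kl + bc_coef * l_bc + ent_coef * l_ent

# Sanity check: identical policies and zero-mean advantages -> zero loss.
z = np.zeros(4)
loss = grpo_total_loss(z, z, z, np.array([1.0, -1.0, 0.5, -0.5]), z)
```

Note the value-fitting term is omitted here, since GRPO's group baseline makes it unnecessary.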
4.4 Sampling in Driving vs. LLM
A crucial difference from LLM RL: the sampling space is constrained. In language, temperature/top-k/top-p sampling is unrestricted. In driving, sampled trajectories must satisfy physical constraints — no sudden velocity changes, heading reversals, or kinematically impossible curvatures.
For diffusion-based planners specifically, noise experiments are essential because noise directly determines exploration range, candidate diversity, and group distribution. The right noise level must provide:
- Effective diversity: Group rewards should have meaningful spread.
- Physical plausibility: No obviously infeasible samples.
- Goal alignment: Exploration direction should align with training objectives.
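A minimal feasibility filter over sampled trajectories illustrates the physical-plausibility check; all thresholds are illustrative, not tuned values from the article:

```python
import numpy as np

def is_physically_feasible(traj, dt=0.5, v_max=30.0, a_max=8.0,
                           yaw_rate_max=1.0):
    """Reject sampled trajectories with implausible velocity,
    acceleration, or heading-change profiles.

    traj: (T, 2) positions at dt-second intervals, T >= 3.
    """
    vel = np.diff(traj, axis=0) / dt                 # (T-1, 2)
    speed = np.linalg.norm(vel, axis=1)
    if speed.max() > v_max:
        return False                                 # sudden velocity spike
    acc = np.diff(speed) / dt
    if np.abs(acc).max() > a_max:
        return False                                 # implausible acceleration
    heading = np.arctan2(vel[:, 1], vel[:, 0])
    dh = np.diff(heading)
    dh = (dh + np.pi) % (2 * np.pi) - np.pi          # wrap to [-pi, pi]
    if (np.abs(dh) / dt).max() > yaw_rate_max:
        return False                                 # heading reversal / spin
    return True
```

Samples failing this filter would be dropped (or resampled) before they enter a GRPO group, so the group statistics are computed only over physically meaningful candidates.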
5. Evaluation Metrics
5.1 Accuracy Metrics
| Metric | Description |
|---|---|
| Top1_ADE | ADE of highest-scoring mode |
| minADE | Minimum ADE across modes, taken per agent (each agent may select a different mode, so the combined scene can be modally inconsistent) |
| Joint_minADE | Minimum ADE at the mode level (all agents from the same mode) |
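The difference between per-agent minADE and Joint_minADE is easiest to see in code; shapes follow the convention (modes $M$, agents $A$, timesteps $T$, 2):

```python
import numpy as np

def ade_metrics(pred, gt):
    """pred: (M, A, T, 2) multi-modal predictions; gt: (A, T, 2).

    Returns per-agent minADE (best mode chosen independently per agent)
    and Joint_minADE (one mode chosen jointly for all agents).
    """
    # ADE per (mode, agent): mean L2 error over timesteps.
    ade = np.linalg.norm(pred - gt[None], axis=-1).mean(axis=-1)  # (M, A)
    min_ade = ade.min(axis=0).mean()           # best mode PER AGENT
    joint_min_ade = ade.mean(axis=1).min()     # best SINGLE mode overall
    return min_ade, joint_min_ade

# Two modes, each perfect for one agent but wrong for the other.
gt = np.zeros((2, 1, 2))
pred = np.zeros((2, 2, 1, 2))
pred[0, 1, 0, 0] = 1.0        # mode 0 misses agent 1
pred[1, 0, 0, 0] = 1.0        # mode 1 misses agent 0
min_ade, joint_min_ade = ade_metrics(pred, gt)
```

Here per-agent minADE is zero (each agent cherry-picks its own mode) while Joint_minADE is not, which is exactly the inconsistency the joint metric exposes.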
5.2 Kinematic Metrics (Ego Only)
| Metric | Description |
|---|---|
| Top1_Kinematic_Score | Weighted average of all kinematic sub-metrics |
| Top1_Kinematic_Rec_Cons | Reconstructability: forward-predicted vs. inverse-reconstructed state error via bicycle model |
| Top1_Kinematic_Vel | Velocity error (predicted vs. GT via finite difference) |
| Top1_Kinematic_Acc | Acceleration error |
| Top1_Kinematic_YR | Yaw rate error |
| Top1_Kinematic_Jerk_Long/Lat | Longitudinal/lateral jerk error |
The Reconstruction Consistency metric is particularly insightful: it evaluates whether predicted trajectories satisfy the bicycle kinematic model by computing forward predictions, then inversely reconstructing the states, and measuring the residual. This tests physical plausibility independent of GT.
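The idea can be sketched with a simplified unicycle-style stand-in for the bicycle model (the real metric would use wheelbase and steering angle; `dt` and the forward model here are illustrative):

```python
import numpy as np

def forward_step(x, y, yaw, v, a, yaw_rate, dt=0.5):
    """One forward step of a simple kinematic model."""
    return (x + v * np.cos(yaw) * dt,
            y + v * np.sin(yaw) * dt,
            yaw + yaw_rate * dt,
            v + a * dt)

def reconstruction_consistency(states, dt=0.5):
    """states: (T, 4) rows of (x, y, yaw, v).

    Inversely reconstruct (a, yaw_rate) between consecutive predicted
    states, re-run them through the forward model, and return the mean
    position residual. Zero residual means the trajectory is exactly
    realizable by this kinematic model -- no GT required.
    """
    residuals = []
    for t in range(len(states) - 1):
        x, y, yaw, v = states[t]
        xn, yn, yawn, vn = states[t + 1]
        a = (vn - v) / dt                    # inverse reconstruction
        yaw_rate = (yawn - yaw) / dt
        px, py, _, _ = forward_step(x, y, yaw, v, a, yaw_rate, dt)
        residuals.append(np.hypot(px - xn, py - yn))
    return float(np.mean(residuals))
```

A straight constant-speed trajectory scores zero; any position jump the recovered controls cannot explain shows up directly in the residual.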
5.3 Interaction Metrics (Collision)
| Metric | Description |
|---|---|
| Top1_CR_Ego | Ego collision rate in top-1 mode |
| Top1_CR_Agents | Any-agent collision rate = Pairwise + Agent-Time components |
| Top1_CR_Scenario | Per-scenario binary: does any collision occur? |
| Joint_minCR_* | Same metrics at mode level (best mode selected) |
6. Quantitative Results
Comparison with the regression baseline on the same train/test split:
| Metric | Regression Model | AR Model |
|---|---|---|
| Top1_ADE (Ego) | 2.622 | 2.869 |
| Top1_ADE (Agent) | 1.759 | 1.847 |
| Joint_minADE (Ego) | – | 2.811 |
| Joint_minADE (Agent) | – | 1.841 |
| minADE (Ego) | 1.464 | 1.876 |
| minADE (Agent) | 1.576 | 1.286 |
The AR model shows slightly higher top-1 ADE (expected: discrete quantization introduces error), but achieves significantly lower minADE for agents (1.286 vs. 1.576), confirming that its multi-modal predictions better cover the distribution of agent behaviors.
7. Qualitative Observations
The AR model demonstrates strong interactive behavior across challenging scenarios:
- Unprotected left turn: Waits for through-traffic, then proceeds smoothly.
- Obstacle circumnavigation: Suggests detour paths when GT chooses to stop; “finds a way” around obstacles.
- Narrow road steering: Small lateral adjustments to create clearance.
- Cut-in: Assertive lane changes into tight gaps.
- Pedestrian yielding: Smooth deceleration at crosswalks.
These behaviors emerge naturally from the AR model’s learned distribution over trajectory tokens, without explicit rule programming.
8. Remaining Challenges
- Token vocabulary design: Should we cluster per timestep or globally? The current vocabulary size, uniform across categories, still needs ablation.
- Online vs. offline matching: Current online matching in the model slows training; offline matching is planned.
- AR+Diffusion integration: The AR+Diffusion pipeline is the theoretical target but only the AR baseline + RL has been completed so far.
- Frame consistency: AR models exhibit higher ADE between consecutive rollout frames (jitter). GRPO with a frame-stability reward can reduce this jitter, but at some cost to lane-change trigger metrics.
References
- MotionLM: Multi-Agent Motion Forecasting as Language Modeling (Waymo, ICRA 2024)
- DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving (CVPR 2025)
- GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectory Generation
- AlphaDrive: GRPO-based RL for Autonomous Driving
- SMART: Scalable Multi-agent Real-time Simulation
- NavSim Benchmark