The trajectory planner is the decision-making core of an autonomous driving system. Its task: given the current scene, output a future trajectory that is safe, comfortable, and efficient. Most production systems today use some form of regression — minimizing the distance between predicted and ground-truth trajectories. Yet a growing body of research and engineering evidence suggests this approach has a fundamental flaw: it assumes the feasible set is convex when it is emphatically not. This article lays out the first-principles argument for why generative approaches (diffusion, autoregressive) are not merely improvements but necessary paradigm shifts.

1. The Non-Convexity of the Feasible Set

A set $S$ is convex if, for any two points $A, B \in S$, every point on the line segment connecting them also belongs to $S$. In driving, this property fails dramatically:

[Figure: an ego vehicle approaching an obstacle. Trajectory A: left detour (feasible); trajectory B: right detour (feasible); C = (A+B)/2: collision.]

Trajectory A goes left around the obstacle; trajectory B goes right. Both are valid. Their average $\frac{A+B}{2}$ drives straight into the obstacle — infeasible. The feasible set is not convex, and no amount of regularization changes this geometric fact.
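To make the geometry concrete, here is a minimal numeric sketch (the obstacle position, radius, and detour shapes are toy values assumed for illustration): both detours pass a collision check, while their point-wise average does not.

```python
# Minimal sketch (toy numbers): two feasible detours around a circular
# obstacle, and their point-wise average, which collides.
import numpy as np

obstacle_center = np.array([5.0, 0.0])   # assumed obstacle position
obstacle_radius = 1.0

x = np.linspace(0.0, 10.0, 50)                          # longitudinal samples
offset = 2.0 * np.exp(-0.5 * ((x - 5.0) / 1.5) ** 2)    # lateral bump near the obstacle

traj_A = np.stack([x,  offset], axis=1)  # left detour
traj_B = np.stack([x, -offset], axis=1)  # right detour
traj_C = 0.5 * (traj_A + traj_B)         # MSE-style average: goes straight

def collides(traj):
    dists = np.linalg.norm(traj - obstacle_center, axis=1)
    return bool(np.any(dists < obstacle_radius))

print(collides(traj_A), collides(traj_B), collides(traj_C))  # False False True
```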

2. Why Regression Fails: MSE Averages Modes

Regression with MSE loss minimizes:

$$\min \; \mathbb{E}\left[\| y_{\text{pred}} - y_{\text{gt}} \|^2\right]$$

When the data distribution is multimodal (e.g., left detour and right detour are both common), the optimal MSE predictor outputs the conditional mean:

$$y^* = \mathbb{E}[y_{\text{gt}} \mid x] = \frac{A + B}{2}$$

This is not a bug in training — it is the mathematically correct solution to the wrong objective. The regression objective assumes a unimodal distribution centered on the mean, which is provably incorrect for non-convex feasible sets.
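A tiny sketch of this effect (toy bimodal targets at ±2 are assumed): minimizing MSE by gradient descent converges to the conditional mean, a value the data never takes.

```python
# Minimal sketch: with bimodal targets (y = -2 or +2, equally likely), the
# MSE-optimal prediction converges to the mean (0), which never occurs in the data.
import numpy as np

rng = np.random.default_rng(0)
y_gt = rng.choice([-2.0, 2.0], size=10_000)    # two equally likely modes

y_pred = 1.5                                   # arbitrary initialization
lr = 0.1
for _ in range(200):
    grad = 2.0 * np.mean(y_pred - y_gt)        # d/dy_pred of E[(y_pred - y_gt)^2]
    y_pred -= lr * grad

print(y_pred)          # ~0.0: the conditional mean
print(np.mean(y_gt))   # same value; neither -2 nor +2
```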

[Figure: density over trajectory space with Mode A (left) and Mode B (right); the MSE mean lands in the low-density valley between them.]

The MSE mean lands in the valley between two modes — a region of low probability density. The model outputs a trajectory that no human driver would ever take.

3. GMM: A Patch, Not a Solution

Gaussian Mixture Models (GMMs) with $K$ components attempt to address multimodality by learning $K$ means. Each component's mean update $\mu_i$ is still the weighted average of the samples assigned to that component:

$$\mu_i = \frac{\sum_n \gamma_{n,i}\, y_n}{\sum_n \gamma_{n,i}}$$
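As a concrete illustration, here is one hand-written EM iteration on toy 1-D samples (all values assumed): the M-step mean is exactly the responsibility-weighted average above, so each $\mu_i$ is always a convex combination of the data.

```python
# Minimal sketch of one EM iteration for a 1-D, two-component GMM.
import numpy as np
from scipy.stats import norm

y = np.array([-2.1, -1.9, -2.0, 1.9, 2.0, 2.1])   # toy bimodal samples
mu, sigma, pi = np.array([-1.0, 1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])

# E-step: responsibilities gamma[n, i]
lik = pi * norm.pdf(y[:, None], loc=mu, scale=sigma)   # shape (N, K)
gamma = lik / lik.sum(axis=1, keepdims=True)

# M-step: each mean is still a weighted average of the assigned samples
mu_new = (gamma * y[:, None]).sum(axis=0) / gamma.sum(axis=0)
print(mu_new)   # moves toward (-2, +2), but remains a convex combination of the y values
```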

This creates two problems:

  1. Spurious peaks: When two true modes are close, their Gaussian components can overlap and produce a false peak in the valley between them.
  2. Finite approximation: $K$ Gaussians are a finite set of convex building blocks. A non-convex shape can never be perfectly tiled by convex pieces. There will always be “gaps” (non-zero probability where there should be none) and “dead corners” (insufficient $K$ to cover all modes).
[Figure: true bimodal distribution vs. a GMM with K=2 whose overlapping components place spurious density in the valley.]

GMM is a patch, not a solution. It uses a finite number of simple convex building blocks to approximate a complex non-convex shape. The approximation error is structural, not parametric — it cannot be fixed by increasing training data or tuning hyperparameters.
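The spurious-peak failure is easy to reproduce numerically. In the sketch below (component means and width are assumed toy values), two equal-weight Gaussians whose means lie within roughly two standard deviations of each other produce a mixture whose density is highest at the midpoint between them, i.e., in the valley.

```python
# Minimal sketch of the "spurious peak": when two components overlap,
# the mixture density peaks between them rather than at either mean.
import numpy as np
from scipy.stats import norm

mu_a, mu_b, sigma = -1.0, 1.0, 1.2          # overlapping components (assumed values)

def mixture_pdf(x):
    return 0.5 * norm.pdf(x, mu_a, sigma) + 0.5 * norm.pdf(x, mu_b, sigma)

print(mixture_pdf(0.0))    # ~0.235: density at the valley midpoint
print(mixture_pdf(mu_a))   # ~0.208: density at a component mean
# The mixture assigns MORE density to the valley than to either mode center.
```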

4. The Penalty Loss Illusion

A common engineering practice is to add penalty terms (collision, off-road, comfort) on top of MSE loss:

$$\mathcal{L} = \mathcal{L}_{\text{MSE}} + \lambda_1 \mathcal{L}_{\text{collision}} + \lambda_2 \mathcal{L}_{\text{off-road}} + \lambda_3 \mathcal{L}_{\text{comfort}} + \cdots$$

This amounts to converting hard constraints into soft penalties, in the spirit of Lagrange multipliers. That construction recovers the constrained optimum only when the underlying problem is convex. On a non-convex landscape, gradient descent from the MSE initialization can get trapped in local minima, and the penalty terms merely push the solution toward the nearest feasible boundary rather than toward the globally optimal trajectory.
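A one-dimensional toy (all numbers assumed) shows the effect: with bimodal targets at ±2 and a soft collision penalty for |y| < 1, gradient descent from the MSE solution settles near the obstacle boundary rather than at either feasible mode.

```python
# Minimal sketch: MSE toward bimodal data at ±2 plus a soft collision penalty
# for |y| < 1. The optimizer ends up near the boundary, not at a mode.
import numpy as np

rng = np.random.default_rng(0)
y_gt = rng.choice([-2.0, 2.0], size=10_000)      # feasible modes: left / right detour
lam = 10.0                                       # collision penalty weight (assumed)

def grad(y):
    g_mse = 2.0 * np.mean(y - y_gt)                            # pulls toward the mean (0)
    g_col = -2.0 * lam * max(0.0, 1.0 - abs(y)) * np.sign(y)   # pushes out of |y| < 1
    return g_mse + g_col

y = 0.01                                         # start near the pure-MSE optimum
for _ in range(2000):
    y -= 0.01 * grad(y)

print(y)   # ~0.9: hugs the obstacle boundary instead of reaching a mode at ±2
```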

The classical EM (Expectation Maximization) planner understood this well. It decomposed the problem into two stages:

[Diagram: Step A (Path Decider) selects a corridor, turning the non-convex problem into a convex sub-region; Step B (Speed Optimizer) solves a QP within it; result: a smooth, feasible trajectory.]
  1. Step A (Path Decider): Choose a corridor (e.g., “go left”), cutting the non-convex space into a convex sub-region.
  2. Step B (Speed Optimizer): Solve a Quadratic Program (QP) within this convex sub-region to obtain a smooth trajectory.

The key insight: first find a convex sub-problem, then solve it. End-to-end regression skips Step A entirely, attempting to solve the non-convex problem in one shot.
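A minimal sketch of the two-stage idea, with assumed corridor bounds and a generic box-constrained solver standing in for a production QP: committing to the left corridor turns the obstacle constraint into simple bounds, and the smoothing objective inside those bounds is convex.

```python
# Minimal sketch (toy numbers): Step A picks the "left" corridor as box
# constraints; Step B solves a small convex smoothing problem inside it.
import numpy as np
from scipy.optimize import minimize

N = 20
lo, hi = np.full(N, -0.5), np.full(N, 3.0)   # default corridor (lateral offset, m)
lo[8:13] = 1.5                               # Step A: obstacle ahead -> commit to the LEFT corridor
lo[0] = hi[0] = 0.0                          # start on the lane center
lo[-1] = hi[-1] = 0.0                        # return to the lane center

def smoothness(y):
    return np.sum(np.diff(y, n=2) ** 2)      # Step B: convex (quadratic) objective

res = minimize(smoothness, x0=np.clip(np.zeros(N), lo, hi),
               bounds=list(zip(lo, hi)), method="L-BFGS-B")
print(np.round(res.x, 2))                    # smooth swerve to ~1.5 around indices 8..12
```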

5. Generative Models: Learning the Non-Convex Shape

Generative approaches take a fundamentally different path:

| Method | How it handles non-convexity |
| --- | --- |
| Diffusion | Directly learns the shape of the non-convex distribution via a gradient/flow field |
| Autoregressive | Decomposes the joint distribution via the chain rule into conditional distributions; converts a geometric problem into a sequential decision problem |

5.1 Diffusion: Learning the Contour

A diffusion model learns the score function $\nabla_y \log p(y \mid x)$, which points toward higher-density regions at every point in trajectory space. During sampling, it follows this gradient field from noise to data, naturally navigating around infeasible regions:

[Figure: score field over trajectory space with feasible modes A and B and an infeasible region; samples starting from noise flow toward Mode A or Mode B.]

The score field naturally pushes samples away from infeasible regions (zero density) and toward high-density modes.
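A small sketch of this behavior (toy bimodal density with assumed modes at lateral offsets ±2): following the analytic score field from random initializations always terminates in one of the modes and never in the valley, which is exactly what sampling from a learned score aims to reproduce.

```python
# Minimal sketch: the score of a bimodal density points away from the
# low-density valley, so following it lands in a mode, never at the mean (0).
import numpy as np
from scipy.stats import norm

modes, sigma = np.array([-2.0, 2.0]), 0.5     # assumed toy distribution

def score(y):
    """d/dy log p(y) for an equal-weight two-Gaussian mixture."""
    w = norm.pdf(y, modes, sigma)
    return np.sum(w * (modes - y)) / (np.sum(w) * sigma**2)

rng = np.random.default_rng(0)
endpoints = []
for _ in range(200):
    y = rng.normal(0.0, 2.0)                  # "noise" initialization
    for _ in range(500):
        y += 0.01 * score(y)                  # follow the gradient field
    endpoints.append(y)

print(np.unique(np.round(endpoints, 1)))      # [-2.  2.]  -- no sample in the valley
```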

5.2 Autoregressive: Sequential Decision Decomposition

The autoregressive approach applies the chain rule to decompose the joint trajectory distribution:

$$p(S_{1:T} \mid \text{Env}) = \prod_{t=1}^{T} p(S_{t:t+n} \mid S_{<t}, \text{Env})$$

At each step, the model only needs to predict a local trajectory segment conditioned on the current state. Each local prediction faces a simpler distribution (often nearly unimodal at the step level), and the global multimodality emerges from the sequential composition of these choices.

This converts a geometric problem (find a trajectory in a non-convex set) into a sequential decision problem (at each step, choose the most likely next segment), which is precisely the regime where autoregressive models excel.
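The sketch below illustrates the decomposition with a hand-written toy conditional (all thresholds and step sizes are assumed): each step's choice is simple, yet the rollouts as a whole cover both global modes without ever averaging them.

```python
# Minimal sketch of the chain-rule view: each step samples a short segment
# conditioned on the history; global multimodality emerges from composition.
import numpy as np

rng = np.random.default_rng(0)

def next_segment(history):
    """p(s_t | s_<t): near-unimodal once a direction has been committed."""
    if abs(history[-1]) < 0.5:                    # still undecided near the lane center
        direction = rng.choice([-1.0, 1.0])       # the only genuinely multimodal choice
    else:
        direction = np.sign(history[-1])          # stay consistent with the commitment
    return history[-1] + 0.4 * direction + 0.05 * rng.normal()

final_offsets = []
for _ in range(1000):
    traj = [0.0]
    for _ in range(8):                            # T = 8 short segments
        traj.append(next_segment(traj))
    final_offsets.append(traj[-1])

print(np.round(np.mean(np.array(final_offsets) > 0), 2))  # ~0.5: both modes are covered
print(np.round(np.mean(np.abs(final_offsets)), 1))        # ~3: rollouts commit to a side, none averages to 0
```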

6. The Convergence: AR + Diffusion

The most promising direction combines both paradigms, leveraging their complementary strengths:

| | AR | Diffusion |
| --- | --- | --- |
| Strength | Accurate single-step prediction; diversity via token vocabulary | Global trajectory coherence; smooth “error correction” over long horizons |
| Weakness | Exposure bias and compounding error over long rollouts | Cold-start problem: enormous search space from pure noise |
| Role in combination | Provides anchor trajectory near the data manifold | Refines the anchor into a globally coherent, smooth trajectory |

The synergy is clear:

  • AR solves Diffusion’s cold-start: Instead of starting from Gaussian noise, diffusion begins from the AR-generated anchor — already near the manifold — vastly reducing the denoising burden.
  • Diffusion solves AR’s drift: The global refinement step corrects compounding errors that accumulate in long autoregressive rollouts.

This AR + Diffusion combination achieved top-ranking results on the NavSim benchmark (Chainflow-VLA, 94.05 PDMS) and has been validated in works like DiffusionDrive (anchor-based truncated diffusion) and GoalFlow (goal-point guided flow matching).
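The sketch below is only a schematic of the anchor-then-refine idea, not a reproduction of any of these systems: a toy AR rollout stands in for the anchor policy, and a few local-smoothing steps that start from the lightly noised anchor stand in for truncated diffusion.

```python
# Minimal sketch (stand-in components, not a real model): an AR "policy"
# proposes a coarse anchor near the manifold; a short truncated refinement,
# starting from the noised anchor rather than pure noise, smooths it.
import numpy as np

rng = np.random.default_rng(0)

def ar_anchor(T=20):
    """Toy AR rollout: commit left and step outward, one segment at a time."""
    y = [0.0]
    for _ in range(T - 1):
        y.append(y[-1] + 0.2 + 0.1 * rng.normal())    # coarse, slightly jittery
    return np.array(y)

def refine(traj, steps=8):
    """Stand-in for truncated diffusion: begin near the anchor, denoise briefly."""
    x = traj + 0.1 * rng.normal(size=traj.shape)       # light noise, not pure noise
    for _ in range(steps):
        s = np.empty_like(x)
        s[1:-1] = (x[:-2] + x[1:-1] + x[2:]) / 3.0     # surrogate "denoiser": local averaging
        s[0], s[-1] = (x[0] + x[1]) / 2.0, (x[-2] + x[-1]) / 2.0
        x = 0.5 * x + 0.5 * s                          # step toward the smoother manifold
    return x

anchor = ar_anchor()
final = refine(anchor)
# Curvature (second differences) of the refined trajectory is typically much
# smaller than that of the raw anchor.
print(np.abs(np.diff(anchor, n=2)).max(), np.abs(np.diff(final, n=2)).max())
```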

[Diagram: AR sequential token prediction provides diversity and an anchor; diffusion global refinement acts as a smoothing filter providing coherence; the refined output is diverse, coherent, and smooth.]

7. Summary

| Approach | Non-convex handling | Multimodality | Limitation |
| --- | --- | --- | --- |
| Regression (MSE) | None: outputs the conditional mean | Fails: averages modes into an infeasible region | Structural failure on non-convex sets |
| GMM | Partial: finite convex approximation | Limited by $K$; spurious peaks | Patch, not solution |
| MSE + penalty loss | Indirect, via soft constraints | Same MSE mean, just pushed toward a boundary | Only valid for convex sub-problems |
| Diffusion | Direct: learns the full distribution shape | Natural: samples from learned modes | Cold start; may lack diversity without anchors |
| Autoregressive | Decomposition via the chain rule | Natural: sequential choices compose into multimodality | Compounding error; frame inconsistency |
| AR + Diffusion | Both: decomposition + global refinement | Best of both: diverse anchors + coherent outputs | Engineering complexity; training cost |

The progression from regression to GMM to generative models is not a matter of incremental improvement. It reflects a fundamental recognition: the planning problem in autonomous driving is inherently non-convex, and any approach that ignores this geometric fact will produce artifacts that no amount of engineering patching can fix.

References