Introduction
The trajectory of autonomous driving architecture has undergone a paradigm shift: from the classical modular pipeline (perception → prediction → planning → control) toward end-to-end systems that map sensory inputs directly to driving actions. This transition is not merely an engineering convenience—it reflects a deep recognition that modular interfaces impose information bottlenecks and that joint optimization across the full stack can yield emergent capabilities invisible to individually optimized modules.
The evolution can be broadly characterized in three phases:
- V1.0 — Modular end-to-end: Individual modules (detection, tracking, prediction) are trained end-to-end with differentiable interfaces, but the overall architecture retains a modular structure with hand-designed information flow.
- V2.0 — One-stage end-to-end: A single model directly predicts trajectories from multi-modal sensor inputs. The core research question becomes: what is the optimal decoder head for the planner?
- V3.0 — VLA-native end-to-end: The action space is natively integrated into a vision-language-action model, where driving decisions emerge from the same representational substrate as linguistic reasoning.
This article focuses on the V2.0 → V3.0 transition. We examine the three dominant decoder paradigms—autoregressive (AR), diffusion, and flow matching—analyze their trade-offs in diversity, stability, and real-time feasibility, and discuss how the VLA paradigm in V3.0 resolves fundamental tensions that persist in V2.0 architectures.
V2.0: The Planner Decoder Selection Problem
The central design decision in a one-stage end-to-end system is the planner decoder head: the mechanism by which the model’s learned scene representation is decoded into a drivable trajectory. Unlike classification or detection heads, trajectory decoding must satisfy several competing constraints simultaneously:
- Multi-modality: In any given scene, multiple plausible trajectories exist (lane keeping, lane change, yield). The decoder must represent this multi-modal distribution without collapsing to a single mode.
- Temporal consistency: Consecutive frames must produce consistent trajectories; jitter between frames is unacceptable for passenger comfort and safety.
- Kinematic feasibility: Predicted trajectories must satisfy vehicle dynamics constraints (curvature, acceleration, jerk).
- Real-time inference: The decoder must produce trajectories within the vehicle’s control loop latency budget (typically on the order of 100 ms).
Three families of decoder architectures have emerged as the leading candidates: autoregressive token prediction, diffusion-based generation, and flow matching. We analyze each in turn.
Autoregressive (AR) Decoding
The autoregressive approach treats trajectory generation as a next-token prediction problem, directly borrowing the paradigm that has proven enormously successful in large language models. Given a trajectory discretized into action tokens $a_1, \dots, a_T$, the model generates:

$$p(a_{1:T} \mid c) = \prod_{t=1}^{T} p(a_t \mid a_{<t}, c)$$

where $c$ denotes the scene encoding (visual features, map information, ego state). This formulation is exemplified by MotionLM [1], which represents continuous trajectories as sequences of discrete motion tokens and casts multi-agent motion prediction as a language modeling task.
The key advantage of AR decoding is its expressive multi-modality: by modeling the full conditional distribution autoregressively, the decoder can naturally represent diverse trajectory outcomes. However, this advantage comes at a cost:
- Inter-frame inconsistency: Because each frame’s trajectory is generated independently from the same conditional distribution, small perturbations in the scene encoding can lead to mode-switching between frames, producing the characteristic “jitter” or “wobble” in the ego trajectory.
- Error accumulation: Autoregressive errors compound over the trajectory horizon, particularly for long-horizon predictions.
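To make the tokenization step concrete, the sketch below quantizes per-step (x, y) displacements onto a uniform grid, turning a continuous trajectory into a sequence of discrete motion tokens. The bin count, displacement range, and delta-based vocabulary here are illustrative assumptions, not MotionLM's actual scheme:

```python
import numpy as np

def trajectory_to_tokens(traj, n_bins=13, max_delta=3.0):
    """Quantize per-step (x, y) displacements into discrete motion tokens.

    traj: (T, 2) array of waypoints in the ego frame (meters).
    Returns a (T-1,) array of token ids in [0, n_bins**2).
    """
    deltas = np.diff(traj, axis=0)                 # (T-1, 2) step displacements
    clipped = np.clip(deltas, -max_delta, max_delta)
    # Map each displacement component to a bin index in [0, n_bins).
    bins = np.round((clipped + max_delta) / (2 * max_delta) * (n_bins - 1)).astype(int)
    return bins[:, 0] * n_bins + bins[:, 1]        # flatten (dx_bin, dy_bin) -> token id

def tokens_to_trajectory(tokens, start, n_bins=13, max_delta=3.0):
    """Inverse mapping: decode token ids back to waypoints (lossy due to quantization)."""
    dx_bins, dy_bins = tokens // n_bins, tokens % n_bins
    deltas = np.stack([dx_bins, dy_bins], axis=1) / (n_bins - 1) * (2 * max_delta) - max_delta
    return np.vstack([start, start + np.cumsum(deltas, axis=0)])

traj = np.array([[0.0, 0.0], [1.0, 0.1], [2.1, 0.3], [3.0, 0.2]])
tokens = trajectory_to_tokens(traj)
recon = tokens_to_trajectory(tokens, traj[0])
```

The quantization is lossy, and the reconstruction error compounds along the horizon via the cumulative sum—a small-scale illustration of the error accumulation noted above.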
Recent work has attempted to mitigate the jitter problem through reinforcement learning. Specifically, GRPO (Group Relative Policy Optimization) with a frame-consistency reward can reduce inter-frame variability. However, this approach introduces its own pathology: by penalizing deviation from the previous frame’s trajectory, the model becomes overly conservative and the lane-change trigger metric degrades—the model learns to “play it safe” by avoiding lane changes altogether.
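A minimal sketch of this dynamic, with an assumed reward shape (progress minus deviation from the previous frame's plan; the weights are illustrative, not a production reward):

```python
import numpy as np

def frame_consistency_reward(traj, prev_traj, w_progress=1.0, w_consist=0.5):
    """Toy reward: forward progress minus a penalty for deviating from the
    previous frame's plan. The penalty is the GRPO-style consistency term;
    set w_consist too high and lane changes stop being rewarded at all."""
    progress = traj[-1, 0] - traj[0, 0]
    deviation = np.mean(np.linalg.norm(traj - prev_traj, axis=1))
    return w_progress * progress - w_consist * deviation

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sample against its group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

prev = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
group = [
    np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]),  # repeat the previous plan
    np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 1.5]]),  # lane change: progress + deviation
    np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.0]]),  # overly slow
]
rewards = [frame_consistency_reward(t, prev) for t in group]
adv = grpo_advantages(rewards)
```

Note the pathology this exposes: the highest-reward sample is the one that simply repeats the previous frame's plan, so the lane-change candidate receives a lower advantage than plain repetition.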
Diffusion-Based Decoding
Diffusion models generate trajectories by iteratively denoising from a Gaussian prior:

$$\tau_T \sim \mathcal{N}(0, I), \qquad \tau_{t-1} \sim p_\theta(\tau_{t-1} \mid \tau_t, c), \quad t = T, \dots, 1$$

where $T$ is the number of denoising steps and $c$ is the scene encoding.
DiffusionDrive [2] introduces a critical innovation: anchor-based truncated diffusion. Rather than denoising from pure noise, the model starts from a set of anchor trajectories that represent different driving intentions (lane keeping, left lane change, right lane change). The diffusion schedule is truncated—starting from an intermediate noise level rather than pure noise—which dramatically reduces the number of required denoising steps while preserving multi-modality.
The truncation strategy addresses a fundamental limitation of naive diffusion for driving: full denoising from pure Gaussian noise at $t = T$ is both computationally expensive and prone to mode collapse when the distribution is highly concentrated. By conditioning on anchors and truncating the schedule, DiffusionDrive achieves real-time inference with multi-modal output.
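The sampling loop can be sketched as follows. The denoiser here is a stand-in (a smoothing operator that ignores the noise level), and the start level and step count are illustrative; DiffusionDrive's actual network and schedule differ:

```python
import numpy as np

rng = np.random.default_rng(0)

def truncated_diffusion_sample(anchor, denoise_fn, t_start=0.1, n_steps=2):
    """Anchor-based truncated diffusion: noise the anchor to an intermediate
    level t_start (instead of starting from pure noise at t=1), then run only
    n_steps denoising steps."""
    tau = anchor + np.sqrt(t_start) * rng.normal(size=anchor.shape)
    for t in np.linspace(t_start, 0.0, n_steps + 1)[1:]:
        tau = denoise_fn(tau, t)   # one denoising step toward noise level t
    return tau

def toy_denoiser(tau, t):
    """Stand-in denoiser: local smoothing with fixed endpoints. A real model
    predicts the noise (or velocity) and uses t; this toy ignores it."""
    sm = tau.copy()
    sm[1:-1] = (tau[:-2] + tau[1:-1] + tau[2:]) / 3.0
    return sm

# Two anchors encoding different intentions: lane keep vs. left lane change.
keep_anchor = np.stack([np.arange(5, dtype=float), np.zeros(5)], axis=1)
anchors = {"keep": keep_anchor, "left": keep_anchor + np.array([0.0, 1.5])}
samples = {k: truncated_diffusion_sample(a, toy_denoiser) for k, a in anchors.items()}
```

Because each sample starts near its anchor rather than from pure noise, the two intentions survive the (short) denoising schedule as distinct modes.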
However, the anchor-based approach introduces a subtle problem: the AR-like jitter reappears at the anchor selection level. When the model switches between anchors across consecutive frames, the resulting trajectory exhibits the same inconsistency that plagues AR decoding.
Flow Matching Decoding
Flow matching learns a continuous-time vector field (ODE) that transports a simple prior distribution to the target trajectory distribution:

$$\frac{d\tau_t}{dt} = v_\theta(\tau_t, t, c)$$

where $v_\theta$ is the learned velocity field and the trajectory is obtained by solving the ODE from $t = 0$ to $t = 1$. This formulation, known as FlowDrive in the driving context, has several attractive properties:
- Smooth trajectories: Because the ODE solver produces a continuous trajectory, the output is inherently smooth. In practice, flow matching produces the smoothest, most “silky” trajectories among the three approaches.
- Deterministic inference: The ODE solver is deterministic given the same initial conditions, eliminating sampling noise.
The critical weakness of flow matching is mode collapse via ODE sampling. Because the vector field is trained to minimize the flow matching loss:

$$\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{t,\,\tau_0,\,\tau_1}\left[\,\left\| v_\theta(\tau_t, t, c) - (\tau_1 - \tau_0) \right\|^2\,\right], \qquad \tau_t = (1 - t)\,\tau_0 + t\,\tau_1,$$

the learned flow tends to transport all prior samples toward the dominant mode, particularly in regions where the trajectory distribution is highly concentrated. This is fundamentally different from diffusion, where the stochastic sampling process inherently maintains diversity.
Attempts to apply GRPO reinforcement learning to flow matching face a particularly severe version of the “all-or-nothing” problem: the RL signal tends to push the entire batch toward either the good mode or the bad mode, rather than improving the average case. This bimodal training dynamic makes GRPO for flow matching unstable in practice.
Three-Way Trade-off
The three approaches can be positioned in a trade-off space along three axes: trajectory diversity, temporal consistency, and inference determinism.
The table below summarizes the quantitative trade-offs observed across reproduced experiments:
| Property | AR (MotionLM-style) | Flow Matching | DiffusionDrive | AR + Diffusion |
|---|---|---|---|---|
| Trajectory diversity | High | Low (mode collapse) | Moderate | High |
| Inter-frame consistency | Low (jitter) | Best (smooth) | Moderate (anchor jitter) | Moderate-High |
| GRPO compatibility | Good (but hurts lane change) | Poor (all-or-nothing) | Moderate | Good |
| Inference speed | Fast (single pass) | Fast (few ODE steps) | Moderate (a few truncated denoising steps) | Moderate |
| Real-time feasibility | Yes | Yes | With truncation: Yes | Yes |
AR + Diffusion: The Optimal Combination
The experimental evidence points to a hybrid AR + Diffusion strategy as the most effective decoder for one-stage end-to-end driving. The intuition is straightforward: AR decoding provides the diversity guarantee, while the diffusion denoising process acts as a consistency regularizer, smoothing out the mode-switching artifacts of pure AR.
On the NavSim benchmark, the Chainflow-VLA system (which combines AR trajectory tokenization with chain-of-diffusion refinement) achieved a PDMS score of 94.05, ranking first on the NavSim v1 navtest leaderboard at the time of submission [3]. This result provides strong empirical support for the hybrid approach.
The key insight is that the two components address complementary failure modes:
- AR prevents the mode collapse that plagues flow matching and, to a lesser extent, diffusion.
- Diffusion denoising smooths the AR jitter by denoising across the trajectory sequence rather than within a single frame.
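The division of labor can be sketched as a two-stage pipeline. Both stages below are stand-ins (a noisy forward proposal for the AR head, endpoint-preserving smoothing for the diffusion refiner); this illustrates the general principle of AR-initialized refinement, not Chainflow-VLA's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(2)

def ar_propose(n_steps=8, jitter=0.4):
    """Stand-in AR head: a forward trajectory with per-step sampling noise
    (the 'jitter' that pure AR decoding exhibits)."""
    deltas = np.stack([np.ones(n_steps), rng.normal(0, jitter, n_steps)], axis=1)
    return np.vstack([[0.0, 0.0], np.cumsum(deltas, axis=0)])

def diffusion_refine(traj, n_iters=20, strength=0.5):
    """Stand-in refinement: iterative denoising modeled as endpoint-preserving
    local smoothing across the whole trajectory sequence."""
    out = traj.copy()
    for _ in range(n_iters):
        out[1:-1] = (1 - strength) * out[1:-1] + strength * 0.5 * (out[:-2] + out[2:])
    return out

def roughness(traj):
    """Mean second-difference magnitude: a simple jitter proxy."""
    return np.mean(np.linalg.norm(np.diff(traj, n=2, axis=0), axis=1))

raw = ar_propose()          # diverse but jittery AR proposal
refined = diffusion_refine(raw)
```

The refiner leaves the endpoints (and hence the chosen mode) untouched while suppressing high-frequency jitter—the complementary-failure-modes argument in miniature.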
It should be noted that the exact architecture of Chainflow-VLA is not fully detailed in publicly available documentation; the description above reflects the general principle of AR-initialized diffusion refinement consistent with the reported approach. Readers are advised to consult the original source for precise architectural details.
V3.0: VLA Architecture — Two Philosophies of Action Integration
The transition from V2.0 to V3.0 is marked by a fundamental architectural shift: the introduction of Vision-Language-Action (VLA) models, where driving actions are natively generated within the same large model that processes visual and linguistic inputs. This is not simply “adding an action head to a VLM”—it requires a deep rethinking of how action representations relate to the model’s internal semantics.
The Corner Case Motivation
The primary motivation for V3.0 is the corner case problem. In autonomous driving, corner cases are scenarios characterized by three properties:
- Minimal visual difference: The perceptual distinction between a safe and an unsafe scenario can be extremely subtle (e.g., a pedestrian glancing at their phone while crossing vs. walking purposefully).
- High decision significance: Despite the minimal perceptual difference, the correct action can be qualitatively different (emergency brake vs. moderate slowdown).
- Temporal context dependence: The correct decision cannot be determined from a single frame alone; it requires understanding the temporal evolution of the scene.
These properties make corner cases fundamentally unsuitable for the V2.0 paradigm, where the planner decoder operates on a single-frame scene encoding. The VLA approach addresses this by grounding the action in a richer semantic representation that includes temporal reasoning and causal understanding.
Two Architectural Philosophies
The integration of action into a VLA model admits two fundamentally different architectural philosophies, depending on the assumed relationship between semantic understanding and action generation:
Philosophy 1: Action requires deep semantic alignment (Concat-KV)
If one believes that driving action requires multi-layer semantic abstraction—understanding why a situation is dangerous, not just what is present—then action tokens should be integrated into the LLM’s key-value cache alongside text tokens. In this approach, the action tokens attend to and are attended by the full sequence of visual and linguistic tokens, enabling the model to ground its actions in the same deep semantic representations that support reasoning.
The advantage is that actions are fully grounded in the model’s semantic understanding. The risk is that the action head inherits the full complexity of the LLM’s attention patterns, making training unstable and inference expensive. OpenDriveVLA [4] exemplifies this approach with a hierarchical vision-language alignment process that projects both 2D and 3D visual features into the language embedding space before action decoding.
Philosophy 2: VLM as feature extractor + downstream action module
If one believes that driving action is primarily a low-dimensional conditional generation problem—the scene understanding is “solved” by the VLM’s visual encoder, and the action module only needs to sample from the conditional distribution given stable scene features—then a decoupled architecture is more appropriate. The VLM serves as a frozen feature extractor, and a lightweight action module generates trajectories conditioned on the VLM’s output features.
The advantage is training stability: the VLM encoder is not disrupted by the action training signal, and the action module can be trained independently with standard imitation learning or RL. The risk is that the action module may not have access to the full depth of the VLM’s semantic understanding, limiting its ability to handle corner cases that require causal reasoning.
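The data-flow difference between the two philosophies can be sketched schematically. Everything below is a toy stand-in (a single-head attention layer for the LLM, random tensors for tokens); the point is only where the action representation enters the computation:

```python
import numpy as np

rng = np.random.default_rng(3)
D = 16  # hidden size (illustrative)

def llm_layer(tokens):
    """Stand-in transformer layer: every token attends to every other token."""
    attn = tokens @ tokens.T / np.sqrt(D)
    weights = np.exp(attn - attn.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ tokens

def concat_kv_policy(vision_text_tokens, action_queries):
    """Philosophy 1: action tokens share the sequence (and KV cache) with
    vision/text tokens, so they co-evolve with the semantic representation."""
    seq = np.vstack([vision_text_tokens, action_queries])
    out = llm_layer(llm_layer(seq))
    return out[-len(action_queries):]

def decoupled_policy(vision_text_tokens, action_head_w):
    """Philosophy 2: the VLM is a frozen feature extractor; a lightweight
    head decodes actions from a pooled scene feature."""
    feat = llm_layer(vision_text_tokens).mean(axis=0)
    return feat @ action_head_w

vt = rng.normal(size=(6, D))        # "scene" tokens from the VLM
queries = rng.normal(size=(2, D))   # learnable action queries
head_w = rng.normal(size=(D, 4))    # tiny action head (e.g., 2 waypoints x 2 coords)

a1 = concat_kv_policy(vt, queries)
a2 = decoupled_policy(vt, head_w)
```

In the first variant, gradients from the action loss flow back through the full attention stack; in the second, they stop at the head—which is precisely the stability-versus-grounding trade-off described above.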
The choice between these two philosophies is not settled. It depends on the empirical answer to a deeper question: is driving action fundamentally a semantic reasoning problem or a conditional generation problem? If the former, concat-KV is justified; if the latter, the decoupled approach is more efficient and stable.
Engineering Practice: From Research to Production
The transition from research prototypes to production-grade end-to-end driving systems requires solving a distinct set of engineering challenges. The following sections document key practices observed in real-world deployments.
Data Infrastructure
A one-stage end-to-end model is only as good as its training data. The data infrastructure challenge has several dimensions:
Format unification: Multiple data sources (perception labels, driving behavior, navigation instructions) must be unified into a single training format. The “six-in-one” unification format integrates perception data from five separate pipelines (object detection, occupancy, lane detection, traffic light, and driving behavior) into a single schema, enabling joint training over 1.5M+ clips from heterogeneous sources.
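A unified record might look like the following sketch. The field names and the acceptance checks are hypothetical illustrations of the idea, not the production schema:

```python
from dataclasses import dataclass, field

@dataclass
class UnifiedClip:
    """Hypothetical unified training record combining the separate pipelines."""
    clip_id: str
    detections: list = field(default_factory=list)      # 3D boxes (detection pipeline)
    occupancy: list = field(default_factory=list)       # occupancy grid / voxels
    lanes: list = field(default_factory=list)           # lane-topology polylines
    traffic_lights: list = field(default_factory=list)  # light states with timestamps
    ego_trajectory: list = field(default_factory=list)  # future ego waypoints (supervision)
    navigation: str = ""                                # routing instruction

def validate(clip: UnifiedClip) -> list:
    """Minimal automated acceptance check: flag missing id or empty supervision."""
    issues = []
    if not clip.clip_id:
        issues.append("missing clip_id")
    if len(clip.ego_trajectory) < 2:
        issues.append("ego trajectory too short to supervise a planner")
    return issues

good = UnifiedClip("clip_0001", ego_trajectory=[(0, 0), (1, 0), (2, 0)])
bad = UnifiedClip("", ego_trajectory=[])
```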
Data quality: Validation workflows are essential. Each data source has its own failure modes (mislabeled bounding boxes, inconsistent lane topology, incorrect traffic light states). A structured data acceptance process—with automated sanity checks and human review—catches systematic errors before they contaminate training.
Distribution balancing: Real-world driving data is heavily imbalanced: highway cruising dominates, while urban intersections and corner cases are underrepresented. Explicit distribution construction—through targeted data collection, augmentation, and re-weighting—is necessary to ensure the model does not degenerate into a “go straight” policy.
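One standard re-weighting recipe is inverse-frequency sampling, sketched below on a toy label distribution (the scenario taxonomy and ratios are illustrative):

```python
import collections

def balanced_weights(scenario_labels):
    """Inverse-frequency sampling weights so rare scenarios (intersections,
    corner cases) are not drowned out by highway cruising."""
    counts = collections.Counter(scenario_labels)
    weights = [1.0 / counts[s] for s in scenario_labels]
    total = sum(weights)
    return [w / total for w in weights]

labels = ["highway"] * 90 + ["intersection"] * 8 + ["corner_case"] * 2
w = balanced_weights(labels)

# Each *class* now receives equal total sampling mass despite a 45:4:1 clip ratio.
mass = collections.defaultdict(float)
for lbl, wi in zip(labels, w):
    mass[lbl] += wi
```

In a real pipeline these weights would drive a weighted sampler in the data loader; pure inverse frequency is usually tempered (e.g., a fractional exponent) to avoid over-sampling a handful of rare clips.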
Training Optimization
Scaling to million-clip training sets requires significant infrastructure investment:
- Distributed training: The codebase must support 1M+ clips across 16+ GPU nodes with near-linear scaling. The key bottleneck is typically gradient synchronization and data loading, not compute.
- Training efficiency: Through architecture and pipeline optimizations, the training time for 1M clips can be reduced from 8 days to 5 days (a 30% improvement), primarily through mixed-precision training, gradient accumulation, and optimized data loading.
- Incremental gains: Two phases of improvement are typical:
- Data scaling: Increasing training data from 25K to 750K clips, combined with model structure optimization, reduces Ego ADE (Average Displacement Error) by 10+%, from 3.0m to 2.6m.
- Feature distillation: Removing unnecessary structured information (e.g., explicit object proposals) and using a pure feature representation with expert supervision further reduces Ego ADE by 7.6%, from 2.6m to 2.4m.
The second phase is particularly noteworthy: it suggests that explicit structured representations (object boxes, lane lines) may not be necessary for the planner, and that learned dense features can be more informative when properly supervised.
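For reference, Ego ADE is straightforward to compute; the sketch below also checks the reported data-scaling gain (3.0 m → 2.6 m) against the "10+%" claim:

```python
import numpy as np

def ego_ade(pred, gt):
    """Average Displacement Error: mean Euclidean distance between predicted
    and ground-truth ego waypoints over the prediction horizon."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

gt = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
pred = np.array([[1.0, 0.3], [2.0, 0.4], [3.0, 0.5]])
ade = ego_ade(pred, gt)                    # mean of 0.3, 0.4, 0.5 = 0.4 m

# The data-scaling phase: 3.0 m -> 2.6 m is a ~13% relative reduction,
# consistent with the "10+%" figure in the text.
rel_gain = (3.0 - 2.6) / 3.0
```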
Evaluation Systems
Open-loop metrics (ADE, FDE) are necessary but insufficient for evaluating driving quality. A comprehensive evaluation system must assess multiple dimensions:
| Dimension | Metrics | Description |
|---|---|---|
| Safety | TTC (Time-to-Collision) | Minimum time to collision with any dynamic object |
| Comfort | Jerk, lateral acceleration | Passenger comfort metrics |
| Efficiency | Progress, speed deviation | How efficiently the ego reaches its destination |
| Compliance | Traffic light adherence, lane keeping | Adherence to traffic regulations |
| Consistency | Trajectory overlap ratio | Agreement between consecutive-frame predictions |
Benchmark construction: A dedicated test set of 1200 clips covering diverse scenarios (urban, highway, intersection, adverse weather) provides the foundation for reproducible evaluation.
Efficiency: Evaluation pipeline optimization can reduce per-clip evaluation time from 10 minutes to 10 seconds, enabling rapid iteration during development.
Semi-closed-loop metrics: Pure open-loop evaluation can miss failure modes that only appear under the model’s own actions. Semi-closed-loop metrics—where the model’s predicted trajectory is “unrolled” for a few steps without affecting the environment—provide a middle ground. Key metrics include GT-free TTC (safety), comfort measures, and efficiency, computed under the model’s own trajectory rather than the ground-truth future.
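Two of these metrics—a GT-free TTC proxy and the trajectory overlap ratio—can be sketched directly. The thresholds (collision radius, overlap tolerance) are illustrative assumptions:

```python
import numpy as np

def min_ttc(ego_traj, obstacle_traj, dt=0.5, radius=2.0):
    """GT-free Time-to-Collision proxy: earliest time at which the unrolled ego
    plan comes within `radius` meters of a predicted obstacle track."""
    dists = np.linalg.norm(ego_traj - obstacle_traj, axis=1)
    hits = np.nonzero(dists < radius)[0]
    return float(hits[0] * dt) if hits.size else float("inf")

def overlap_ratio(traj_t, traj_prev, tol=0.5):
    """Consistency metric: fraction of this frame's waypoints that lie within
    `tol` meters of the previous frame's (time-aligned) plan."""
    d = np.linalg.norm(traj_t - traj_prev, axis=1)
    return float(np.mean(d < tol))

ego = np.stack([np.arange(6, dtype=float), np.zeros(6)], axis=1)        # ego moving forward
obs = np.stack([5.0 - np.arange(6, dtype=float), np.zeros(6)], axis=1)  # oncoming object
prev_plan = ego + np.array([0.0, 0.2])                                  # last frame's plan
```

Both metrics are computed under the model's own unrolled trajectory, not the ground-truth future—the defining property of the semi-closed-loop setting.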
Real-Vehicle Deployment
The transition from simulation to real-vehicle testing reveals additional challenges not captured by any open-loop or semi-closed-loop metric. Successfully deploying a one-stage model on a real vehicle requires:
- Latency optimization: The model must produce trajectories within the vehicle’s control cycle (100ms), including all pre-processing, inference, and post-processing.
- Fallback mechanisms: When the model’s confidence is low (e.g., in out-of-distribution scenarios), the system must gracefully fall back to a rule-based planner or emergency stop.
- Monitoring and logging: Comprehensive logging of model inputs, outputs, and internal states is essential for post-hoc analysis of failure cases.
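The fallback arbitration logic can be sketched as a simple priority cascade. The confidence threshold and the feasibility check are placeholders; a production stack would use calibrated uncertainty and full kinematic validation:

```python
def select_plan(model_traj, model_confidence, rule_based_traj,
                conf_threshold=0.7, is_feasible=lambda t: True):
    """Fallback cascade: learned plan if confident and feasible, else the
    rule-based planner, else emergency stop. Returns (source, trajectory)."""
    if model_confidence >= conf_threshold and is_feasible(model_traj):
        return "learned", model_traj
    if rule_based_traj is not None:
        return "rule_based", rule_based_traj
    return "emergency_stop", []

# Confident model -> learned plan; low confidence -> rule-based fallback.
source, plan = select_plan([(0, 0), (1, 0)], 0.9, [(0, 0), (0.5, 0)])
```

Logging the chosen `source` alongside inputs and outputs is what makes the monitoring requirement above actionable: fallback-rate trends are an early signal of distribution shift.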
Architecture Evolution Summary
The full architectural evolution can be summarized as follows: V1.0 systems train modules end-to-end but keep hand-designed interfaces between them; V2.0 systems collapse the stack into a one-stage model whose central design question is the planner decoder head (AR, diffusion, flow matching, or hybrids); and V3.0 systems make action generation native to a VLA model, sharing its representational substrate with visual and linguistic reasoning.
Discussion and Open Questions
Several fundamental questions remain open as the field moves toward V3.0:
1. Is the corner case problem primarily a representation problem or a data problem? If corner cases arise from insufficient coverage in the training distribution, then more data (or better augmentation) is the solution. If they arise from the model’s inability to represent the relevant distinctions, then architectural changes (like VLA) are necessary. The truth is likely a combination, but the relative importance determines whether V3.0 is a qualitative leap or an incremental improvement.
2. Can the two VLA philosophies be unified? The concat-KV and decoupled approaches represent two ends of a spectrum. A promising direction is adaptive grounding: use concat-KV for scenes that require deep reasoning (detected by an uncertainty or complexity estimator) and the decoupled approach for routine driving. This would give the best of both worlds at the cost of architectural complexity.
3. How should we evaluate corner case performance? Current benchmarks (NavSim, nuScenes) are dominated by routine driving scenarios. Dedicated corner case benchmarks [5] are emerging, but standardized evaluation remains an open problem. The WM-MoE framework [6] proposes using world models to generate corner cases, but the fidelity of these generated scenarios to real-world corner cases is not yet validated.
4. What is the role of reinforcement learning? GRPO and similar RL methods can improve specific metrics (frame consistency, lane-change triggering) but often introduce new failure modes. The RL reward design problem for driving is fundamentally harder than for language: there is no simple analog of “helpfulness” that captures all aspects of safe, efficient, and comfortable driving.
References
[1] MotionLM: Multi-Agent Motion Forecasting as Language Modeling. Waymo Research, ICCV 2023. arXiv:2309.16534
[2] DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving. Liao et al., 2024. arXiv:2411.15139
[3] Chainflow-VLA: AR-initialized chain-of-diffusion for end-to-end driving. NavSim v1 navtest leaderboard, PDMS 94.05. The author was unable to independently verify the leaderboard ranking from publicly accessible sources as of this writing; the score is reported as cited in internal engineering documentation.
[4] OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision-Language-Action Model. 2025. arXiv:2503.23463
[5] Driving in Corner Case: A Real-World Adversarial Driving Benchmark for End-to-End Autonomous Driving. 2025. arXiv:2512.16055
[6] WM-MoE: Addressing corner cases in autonomous driving with a world model-based Mixture of Experts. Transportation Research Part C, 2026. DOI:10.1016/j.trc.2025.105607
[7] NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking. Da et al., CoRL 2024. arXiv:2406.15349
[8] GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectories Generation in End-to-End Autonomous Driving. CVPR 2025.
[9] A Survey on Vision-Language-Action Models for Autonomous Driving. Jiang et al., ICCV 2025 Workshop. arXiv:2512.16760
[10] GuideFlow: Constraint-Guided Flow Matching for Planning in End-to-End Autonomous Driving. 2025. arXiv:2511.18729