Introduction

The integration of reinforcement learning into end-to-end autonomous driving systems has emerged as a promising direction for improving trajectory planning beyond what supervised learning alone can achieve. However, the direct application of standard RL algorithms to driving tasks faces fundamental challenges: the sim-to-real gap in log-replay environments, the computational bottleneck of online simulation, and the difficulty of defining dense reward signals for continuous trajectory generation.
This article examines the RL pipeline for end-to-end autonomous driving through the lens of post-training alignment. We begin with metric caching, which decouples expensive environment evaluation from model training. We then analyze how Direct Preference Optimization (DPO) can be applied across different action representations (discrete tokens, continuous regression, and diffusion models) and discuss the distinction between offline and online RL in the driving context. Finally, we present three strategies for breaking the sampling ceiling that limits the performance of iterative self-improvement pipelines.
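To make the metric-caching idea concrete before the detailed discussion, here is a minimal sketch. It is illustrative only: `MetricCache`, `trajectory_key`, and `evaluate_fn` are hypothetical names, and the lambda scorer stands in for whatever expensive closed-loop evaluation (e.g. a simulator-based driving score) a real pipeline would use.

```python
from __future__ import annotations

import hashlib
from typing import Callable

import numpy as np


def trajectory_key(trajectory: np.ndarray, precision: int = 3) -> str:
    """Hash a (T, 2) waypoint trajectory into a stable cache key."""
    rounded = np.round(trajectory, precision)
    return hashlib.sha256(rounded.tobytes()).hexdigest()


class MetricCache:
    """Maps candidate trajectories to precomputed environment scores.

    evaluate_fn stands in for an expensive closed-loop rollout; it is
    only invoked on a cache miss, so training never blocks on the
    simulator for trajectories that were already scored offline.
    """

    def __init__(self, evaluate_fn: Callable[[np.ndarray], float]):
        self._evaluate_fn = evaluate_fn
        self._scores: dict[str, float] = {}

    def score(self, trajectory: np.ndarray) -> float:
        key = trajectory_key(trajectory)
        if key not in self._scores:
            self._scores[key] = self._evaluate_fn(trajectory)
        return self._scores[key]


# Offline phase: score a set of candidate trajectories once.
cache = MetricCache(evaluate_fn=lambda traj: float(-np.abs(traj).sum()))  # dummy scorer
candidates = [np.random.randn(8, 2) for _ in range(4)]
for traj in candidates:
    cache.score(traj)

# Training phase: cheap lookups turn cached scores into preference
# pairs (e.g. for DPO) with no simulator in the loop.
ranked = sorted(candidates, key=cache.score, reverse=True)
chosen, rejected = ranked[0], ranked[-1]
```

The design point is the separation of phases: all simulator calls happen in an offline pass over a fixed candidate set, after which training consumes only dictionary lookups.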
...