Vision-Language-Action Models for Autonomous Driving: The Cosmos-Reason Approach
Introduction

End-to-end autonomous driving has made significant progress in recent years, yet deploying Vision-Language-Action (VLA) models in real-world driving scenarios remains challenging. The fundamental difficulties are fourfold. First, multi-frame temporal understanding requires the model to extract decision-relevant changes from highly redundant consecutive observations rather than merely processing static snapshots. Second, driving decisions must be causal: the model must capture why a particular action is taken, not just learn statistical correlations between situations and actions. Third, predicted trajectories must satisfy kinematic and dynamic constraints while remaining multi-modal and efficient enough for real-time inference. Fourth, the reasoning process must be tightly aligned with the action output: reasoning should not be a post-hoc rationalization but must be verifiable against, and constrained by, the actions actually taken. ...
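To make the third difficulty concrete, a minimal sketch of what "kinematically feasible" can mean for a predicted waypoint trajectory follows. The function name, the limit values (`v_max`, `a_max`, `kappa_max`), and the finite-difference formulation are illustrative assumptions, not part of any specific model described here; real planners typically check against a vehicle-specific dynamics model.

```python
import math

def trajectory_feasible(waypoints, dt, v_max=20.0, a_max=4.0, kappa_max=0.2):
    """Check a 2-D waypoint trajectory against simple kinematic limits.

    waypoints: list of (x, y) positions sampled every `dt` seconds.
    v_max (m/s), a_max (m/s^2), kappa_max (1/m) are illustrative limits.
    """
    # Finite-difference speed between consecutive waypoints.
    speeds = [math.hypot(x1 - x0, y1 - y0) / dt
              for (x0, y0), (x1, y1) in zip(waypoints, waypoints[1:])]
    if any(v > v_max for v in speeds):
        return False
    # Longitudinal acceleration between consecutive speed samples.
    if any(abs(v1 - v0) / dt > a_max for v0, v1 in zip(speeds, speeds[1:])):
        return False
    # Curvature approximated as heading change per unit arc length.
    for i in range(1, len(waypoints) - 1):
        h0 = math.atan2(waypoints[i][1] - waypoints[i - 1][1],
                        waypoints[i][0] - waypoints[i - 1][0])
        h1 = math.atan2(waypoints[i + 1][1] - waypoints[i][1],
                        waypoints[i + 1][0] - waypoints[i][0])
        # Wrap the heading difference to (-pi, pi].
        dtheta = (h1 - h0 + math.pi) % (2 * math.pi) - math.pi
        arc = speeds[i] * dt
        if arc > 1e-6 and abs(dtheta) / arc > kappa_max:
            return False
    return True

# A straight trajectory at a constant 10 m/s passes all three checks.
straight = [(i * 1.0, 0.0) for i in range(10)]  # 1 m spacing, dt = 0.1 s
print(trajectory_feasible(straight, dt=0.1))  # → True
```

A check of this kind can act as a hard filter over a model's multi-modal trajectory proposals, rejecting candidates the vehicle could not physically execute before any scoring or selection step.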