Introduction

The integration of reinforcement learning into end-to-end autonomous driving systems has emerged as a promising direction for improving trajectory planning beyond what supervised learning alone can achieve. However, the direct application of standard RL algorithms to driving tasks faces fundamental challenges: the sim-to-real gap in log-replay environments, the computational bottleneck of online simulation, and the difficulty of defining dense reward signals for continuous trajectory generation.
This article examines the RL pipeline for end-to-end autonomous driving through the lens of post-training alignment. We begin with metric caching, which decouples expensive environment evaluation from model training. We then analyze how Direct Preference Optimization (DPO) can be applied across different action representations (discrete tokens, continuous regression, and diffusion models) and discuss the distinction between offline and online RL in the driving context. Finally, we present three strategies for breaking the sampling ceiling that limits the performance of iterative self-improvement pipelines.
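To make the metric-caching idea concrete before the detailed discussion, here is a minimal sketch. It is illustrative only: `MetricCache`, `trajectory_key`, and `evaluate_fn` are hypothetical names, and the lambda scorer stands in for whatever expensive closed-loop evaluation (e.g. a simulator-based driving score) a real pipeline would use.

```python
from __future__ import annotations

import hashlib
from typing import Callable

import numpy as np


def trajectory_key(trajectory: np.ndarray, precision: int = 3) -> str:
    """Hash a (T, 2) waypoint trajectory into a stable cache key."""
    rounded = np.round(trajectory, precision)
    return hashlib.sha256(rounded.tobytes()).hexdigest()


class MetricCache:
    """Maps candidate trajectories to precomputed environment scores.

    evaluate_fn stands in for an expensive closed-loop rollout; it is
    only invoked on a cache miss, so training never blocks on the
    simulator for trajectories that were already scored offline.
    """

    def __init__(self, evaluate_fn: Callable[[np.ndarray], float]):
        self._evaluate_fn = evaluate_fn
        self._scores: dict[str, float] = {}

    def score(self, trajectory: np.ndarray) -> float:
        key = trajectory_key(trajectory)
        if key not in self._scores:
            self._scores[key] = self._evaluate_fn(trajectory)
        return self._scores[key]


# Offline phase: score a set of candidate trajectories once.
cache = MetricCache(evaluate_fn=lambda traj: float(-np.abs(traj).sum()))  # dummy scorer
candidates = [np.random.randn(8, 2) for _ in range(4)]
for traj in candidates:
    cache.score(traj)

# Training phase: cheap lookups turn cached scores into preference
# pairs (e.g. for DPO) with no simulator in the loop.
ranked = sorted(candidates, key=cache.score, reverse=True)
chosen, rejected = ranked[0], ranked[-1]
```

The design point is the separation of phases: all simulator calls happen in an offline pass over a fixed candidate set, after which training consumes only dictionary lookups.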
...