Introduction

The integration of reinforcement learning into end-to-end autonomous driving systems has emerged as a promising direction for improving trajectory planning beyond what supervised learning alone can achieve. However, the direct application of standard RL algorithms to driving tasks faces fundamental challenges: the sim-to-real gap in log-replay environments, the computational bottleneck of online simulation, and the difficulty of defining dense reward signals for continuous trajectory generation.

This article examines the RL pipeline for end-to-end autonomous driving through the lens of post-training alignment. We begin with the concept of metric caching, which decouples expensive environment evaluation from model training. We then analyze how Direct Preference Optimization (DPO) can be applied across different action representations—discrete tokens, continuous regression, and diffusion models—and discuss the fundamental distinction between offline and online RL in the driving context. Finally, we present three strategies for breaking the sampling ceiling that limits the performance of iterative self-improvement pipelines.

Metric Cache: Decoupling Evaluation from Training

A central engineering insight in modern driving RL pipelines is the separation of environment simulation from model training through precomputed metric caches. The metric cache is a serialized snapshot of ground-truth environmental data and scene context, designed specifically to accelerate the evaluation of predicted trajectories.

The cache contains several key components. The reference trajectory is generated by a rule-based planner (typically an Intelligent Driver Model) and serves as a baseline for comparison. The ego state records the initial position, velocity, and heading of the ego vehicle. The observation field stores interpolated ground-truth future trajectories of all surrounding agents at 10 Hz, enabling precise collision detection during evaluation. The centerline and route lane IDs encode the navigable path for computing progress and direction compliance. The drivable area map provides a polygonal representation of road boundaries for off-road detection.
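
To make the structure concrete, here is a minimal sketch of a cache record holding the fields described above; all identifiers, shapes, and types are illustrative assumptions, not the production schema:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MetricCache:
    # Rule-based reference trajectory (e.g., IDM), shape (T, 3): x, y, heading
    reference_trajectory: np.ndarray
    # Ego state at t=0
    ego_position: np.ndarray       # (2,)
    ego_velocity: np.ndarray       # (2,)
    ego_heading: float
    # Interpolated ground-truth agent boxes at 10 Hz, shape (T, N, 5):
    # x, y, heading, length, width -- used for collision detection
    observations: np.ndarray
    # Navigable path for progress and direction-compliance scoring
    centerline: np.ndarray         # (M, 2) polyline
    route_lane_ids: list
    # Polygonal road-boundary representation for off-road detection
    drivable_area: np.ndarray      # (P, 2) polygon vertices
```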

The production pipeline proceeds in three stages. First, the raw scenario is loaded from the driving database, and the rule-based planner generates a reference trajectory. Second, ground-truth agent trajectories are interpolated and map features are extracted. Third, all components are serialized into a compressed cache file. At evaluation time, the model simply generates a predicted trajectory, and the scoring module loads the cache to perform collision detection against the stored observations, boundary checks against the drivable area map, and progress computation against the centerline—all without accessing the original database.

This design has a profound implication for the training pipeline: it enables the Generate-Score-Train loop that underpins post-training RL. By precomputing all environment information, the system can rapidly evaluate thousands of candidate trajectories from a single scene, producing the preference pairs needed for DPO training.
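
A compact sketch of this loop, with hypothetical stand-ins (`model.sample`, `scorer.score`, `build_pair`, `dpo_update`) for the actual pipeline components:

```python
def generate_score_train_round(model, scenes, caches, scorer, k=128):
    pairs = []
    for scene, cache in zip(scenes, caches):
        # Generate: sample k candidate trajectories for one scene
        candidates, ref_logp = model.sample(scene, k=k)
        # Score: evaluate every candidate against the precomputed cache,
        # never touching the original database
        scores = [scorer.score(traj, cache) for traj in candidates]
        pairs.append(build_pair(candidates, scores, ref_logp))
    # Train: one DPO pass over the collected preference pairs
    dpo_update(model, pairs)
```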

Post-Training Pipeline: DPO for Trajectory Planning

Sampling and Preference Pair Construction

The post-training pipeline begins with sampling. For each input context (multi-camera observations, navigation command, ego history), the model generates $K$ candidate trajectories (typically $K=128$). Each trajectory is then evaluated by the scoring module, which produces a multi-dimensional score vector comprising collision penalty, drivable area compliance, ego progress, time-to-collision, comfort, and a weighted total score.

The candidate trajectories are encoded as discrete action sequences through a Vector Quantization (VQ) module. Specifically, each trajectory is represented as a sequence of 8 discrete token IDs, corresponding to 4 seconds of prediction at 0.5-second intervals. The model records both the selected action tokens and their log probabilities under the current policy, which are stored for subsequent DPO training as the reference policy probabilities $\log \pi_{\text{ref}}(a|x)$.

Preference pairs are constructed by selecting the highest-scoring trajectory as the winner and the lowest-scoring as the loser, based on the total weighted score. Crucially, the reference policy probabilities are recorded at sampling time, eliminating the need to maintain a separate frozen reference model during training.
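
A sketch of the pair-construction step (the `build_pair` helper referenced in the earlier loop); the score-vector field names and weights are illustrative assumptions:

```python
import numpy as np

# Illustrative weights; the real pipeline defines its own weighting
WEIGHTS = {"collision": 5.0, "drivable_area": 3.0, "progress": 1.0,
           "ttc": 2.0, "comfort": 0.5}

def total_score(score_vec):
    return sum(WEIGHTS[name] * score_vec[name] for name in WEIGHTS)

def build_pair(candidates, score_vecs, ref_logp):
    totals = np.array([total_score(s) for s in score_vecs])
    w, l = int(totals.argmax()), int(totals.argmin())
    # Reference log-probs were recorded at sampling time, so no frozen
    # reference model is needed during training
    return {"winner": candidates[w], "winner_ref_logp": ref_logp[w],
            "loser": candidates[l], "loser_ref_logp": ref_logp[l]}
```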

DPO Loss Formulation

For discrete action spaces, the DPO loss follows the standard formulation. Let $y_w$ denote the winner trajectory and $y_l$ the loser trajectory. The joint log-probability of a trajectory under an autoregressive model is the sum of per-step log-probabilities:

$$\log \pi(y|x) = \sum_{t=1}^{T} \log \pi(a_t \mid a_{<t}, x)$$

The DPO loss is then:

$$L_{\text{DPO}} = -\log \sigma\left(\beta \left( \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right)\right)$$

where $\beta$ controls the deviation from the reference policy. During training, the log-probabilities are computed by taking the log-softmax of the model's output logits, gathering the values at the positions of the actual action tokens, and summing across time steps. The reference log-probabilities are read directly from the cached sampling data.

To monitor training progress, the implicit reward is computed as:

$$r(x, y) = \beta \left(\log \pi_\theta(y|x) - \log \pi_{\text{ref}}(y|x)\right)$$

The training objective is to increase the implicit reward for winners while decreasing it for losers, widening the margin between them.
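
A minimal PyTorch sketch of this discrete-token DPO loss, including the implicit-reward margin used for monitoring; tensor names and shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def seq_log_prob(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Joint log-probability of each token sequence.
    logits: (B, T, V) policy outputs; tokens: (B, T) sampled action IDs."""
    logp = F.log_softmax(logits, dim=-1)
    per_step = logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)  # (B, T)
    return per_step.sum(dim=-1)                                   # (B,)

def dpo_loss(logits_w, tokens_w, ref_logp_w,
             logits_l, tokens_l, ref_logp_l, beta: float = 0.1):
    pi_w = seq_log_prob(logits_w, tokens_w)
    pi_l = seq_log_prob(logits_l, tokens_l)
    # Implicit rewards r = beta * (log pi_theta - log pi_ref); the
    # reference terms are read from the cached sampling data
    r_w = beta * (pi_w - ref_logp_w)
    r_l = beta * (pi_l - ref_logp_l)
    loss = -F.logsigmoid(r_w - r_l).mean()
    margin = (r_w - r_l).mean().detach()  # monitored during training
    return loss, margin
```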

Action Space Comparison for DPO

The choice of action representation fundamentally determines how $\log P(y|x)$ is computed for DPO, and each choice carries distinct trade-offs.

Discrete Token Space

In the discrete setting, the model outputs a sequence of token IDs from a learned codebook (e.g., 8192 entries). The log-probability is computed via the standard softmax over logits:

$$\log P(y|x) = \sum_{t=1}^{T} \log \frac{\exp(z_{a_t})}{\sum_{k=1}^{K} \exp(z_k)}$$

This representation is naturally multi-modal, provides exact probability values, and is robust to noise. It is also directly compatible with policy gradient RL methods. However, discretization introduces precision loss and faces the curse of dimensionality when the action space grows large. In the driving domain, this limitation is mitigated by the fact that the codebook can be trained to cover the relevant trajectory manifold effectively.

Continuous Regression

When the model directly regresses trajectory coordinates, the log-probability must be approximated under a distributional assumption. The most common approach assumes a Gaussian distribution with the model output as the mean and a fixed variance $\sigma^2$. Under this assumption:

$$\log P(y|x) \propto -\frac{1}{2\sigma^2} \|y - \mu_\theta(x)\|^2$$

That is, the negative mean squared error serves as a proxy for log-probability. The DPO loss then becomes a contrastive objective that pulls the model’s prediction closer to the winner trajectory while pushing it away from the loser:

$$L_{\text{DPO-Reg}} = -\log \sigma\left(\beta \left[ \left(-\|y_w - \mu_\theta\|^2 + \|y_w - \mu_{\text{ref}}\|^2\right) - \left(-\|y_l - \mu_\theta\|^2 + \|y_l - \mu_{\text{ref}}\|^2\right) \right]\right)$$
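
A sketch of this regression-form loss in PyTorch, assuming trajectories of shape (batch, time, 2) and a reference prediction $\mu_{\text{ref}}$ cached at sampling time or produced by a frozen copy of the model:

```python
import torch
import torch.nn.functional as F

def dpo_reg_loss(mu_theta, mu_ref, y_w, y_l, beta=1.0):
    # Negative squared error serves as the log-probability proxy
    logp_w = -((y_w - mu_theta) ** 2).sum(dim=(-2, -1))
    logp_l = -((y_l - mu_theta) ** 2).sum(dim=(-2, -1))
    ref_w = -((y_w - mu_ref) ** 2).sum(dim=(-2, -1))
    ref_l = -((y_l - mu_ref) ** 2).sum(dim=(-2, -1))
    # Pull the prediction toward the winner, push it away from the loser
    return -F.logsigmoid(beta * ((logp_w - ref_w) - (logp_l - ref_l))).mean()
```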

More sophisticated models (e.g., Trajectron++, MultiPath) output a Gaussian Mixture Model with parameters $(\pi_k, \mu_k, \Sigma_k)$, where the probability density is:

$$P(y|x) = \sum_{k=1}^{K} \pi_k \cdot \mathcal{N}(y \mid \mu_k, \Sigma_k)$$

The log-probability of a sampled trajectory is computed via log-sum-exp over the mixture components. Continuous regression offers precise coordinate prediction and fast inference, but suffers from the averaging curse—mode-averaged predictions tend toward the mean of multi-modal distributions, producing unrealistic trajectories at decision points.
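
A sketch of the mixture log-likelihood via log-sum-exp, assuming diagonal covariances; shapes are illustrative:

```python
import math
import torch

def gmm_log_prob(y, pi, mu, sigma):
    """y: (B, T, 2) trajectory; pi: (B, K) mixture weights;
    mu, sigma: (B, K, T, 2) per-component means and diagonal std-devs."""
    diff = y.unsqueeze(1) - mu                                # (B, K, T, 2)
    comp_logp = (-0.5 * (diff / sigma) ** 2
                 - torch.log(sigma)
                 - 0.5 * math.log(2 * math.pi)).sum(dim=(-2, -1))  # (B, K)
    # Log-sum-exp over mixture components, weighted by pi
    return torch.logsumexp(torch.log(pi) + comp_logp, dim=-1)      # (B,)
```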

Diffusion Models

Diffusion-based trajectory decoders generate continuous coordinates through an iterative denoising process. Computing $\log P(y|x)$ for DPO requires a different approach: the denoising reconstruction error serves as a proxy for the negative log-likelihood. Specifically:

$$\log P_\theta(y|x) \approx -\mathbb{E}_{t, \epsilon}\left[\|\epsilon - \epsilon_\theta(y_t, t, x)\|^2\right]$$

The intuition is that if the model can accurately predict the noise added to a trajectory, then that trajectory is “likely” under the model’s distribution. For DPO, the loss compares the denoising errors for winner and loser trajectories:

$$L_{\text{Diffusion-DPO}} = -\log \sigma\left(\beta\left[\|\text{Error}_{\text{Loser}}\|^2 - \|\text{Error}_{\text{Winner}}\|^2\right]\right)$$

The winner trajectory should be easier to denoise (lower error), while the loser should be harder. Diffusion models combine multi-modality with high precision and physical consistency, but are sensitive to hyperparameters and slower at inference time.
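
A sketch of this simplified objective (the full Diffusion-DPO loss of Wallace et al. also subtracts reference-model denoising errors); the scheduler interface and shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(eps_model, y_w, y_l, cond, alpha_bar, beta=500.0):
    """eps_model(y_t, t, cond) predicts the injected noise;
    alpha_bar: (num_steps,) cumulative noise schedule."""
    B = y_w.shape[0]
    t = torch.randint(0, alpha_bar.shape[0], (B,), device=y_w.device)
    a = alpha_bar[t].view(B, 1, 1)
    eps = torch.randn_like(y_w)                  # shared noise for both
    # Forward-diffuse winner and loser with the same t and noise
    yw_t = a.sqrt() * y_w + (1.0 - a).sqrt() * eps
    yl_t = a.sqrt() * y_l + (1.0 - a).sqrt() * eps
    err_w = ((eps - eps_model(yw_t, t, cond)) ** 2).sum(dim=(-2, -1))
    err_l = ((eps - eps_model(yl_t, t, cond)) ** 2).sum(dim=(-2, -1))
    # Winner should be easier to denoise (lower error) than the loser
    return -F.logsigmoid(beta * (err_l - err_w)).mean()
```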

Summary

| Model Type | Output | Log-Probability Proxy | DPO Objective | Strengths | Weaknesses |
|---|---|---|---|---|---|
| Discrete (VQ) | Token IDs | $\log \text{Softmax}(\text{logits})$ | Increase winner token logits | Multi-modal, exact probability, RL-friendly | Precision loss, curse of dimensionality |
| Regression | $(x, y)$ coordinates | $-\text{MSE}(\text{pred}, \text{target})$ | Pull prediction toward winner coordinates | Precise, fast inference | Mode averaging, distributional assumption |
| Diffusion | $(x, y)$ coordinates | $-\text{MSE}(\text{pred\_noise}, \text{noise})$ | Make winner easier to denoise | Multi-modal and precise, physically consistent | Hyperparameter-sensitive, slow inference |

Offline RL vs. Online RL for Driving

The Contextual Bandit Structure

In most end-to-end driving systems operating on log-replay data, the RL problem has a fundamentally different structure from the standard Markov Decision Process (MDP) assumed by algorithms like PPO or DQN. At time $t=0$, the model observes the current scene and generates a complete future trajectory (e.g., 8 seconds). There is no sequential interaction: the model does not observe the outcome of the first second before deciding the second. The environment feedback arrives only after the entire trajectory is generated and evaluated.

This makes the problem a Contextual Bandit: the environment is the traffic scene, the action is the generated trajectory, and the reward is the evaluation score. The model commits to a single action (the full trajectory) and receives a single reward, with no intermediate state transitions.

Why Iterative Offline Beats Naive Online

The current pipeline operates in an offline-to-online iterative mode. Scene data (the “prompt”) comes from real driving logs and is fixed. Experience data (the “samples”) is self-generated by the model through sampling. This self-generated experience is a crucial advantage over traditional offline RL, which only learns from human demonstrations. Self-generated samples allow the model to learn from its own failures—a trajectory that appears smooth but causes a collision at second 3 is an excellent negative example.

The engineering advantages of the offline sampling mode over true online RL are significant:

| Property | Online RL | Offline Sampling |
|---|---|---|
| Computation | CPU/IO-blocked: GPU waits for the simulator | CPU cluster samples to disk; GPU trains at 100% utilization |
| Data efficiency | On-policy: samples discarded after one update | Off-policy: samples reused across multiple epochs |
| Stability | Prone to collapse from poor batches | Global view: cache can be cleaned before training |
| Throughput | Simulator runs at 10–20 Hz | Sampling fully parallelized across the CPU cluster |

The Simulator Flaw

A deeper problem with online RL in log-replay environments is the simulator flaw. In most driving benchmarks, other agents follow their recorded trajectories regardless of the ego vehicle’s actions. If the ego vehicle swerves into an adjacent car, that car does not react—it is a “ghost car” replaying a recording. An online RL agent would quickly discover this and learn either overly conservative policies (never move when any car is nearby) or overly aggressive ones (exploit the fact that other cars never react). Neither strategy transfers to the real world.

Breaking the Sampling Ceiling

The fundamental limitation of the Generate-Score-Train pipeline is captured by a simple relation:

$$\text{Training Ceiling} = \max(\text{Samples})$$

If the model is weak and all $K$ sampled trajectories are poor, DPO can only select the "least bad" trajectory as the winner. The model learns to distinguish bad from worse, but never sees what a genuinely good trajectory looks like. Three strategies can break through this ceiling.

Iterative Self-Improvement

The most engineering-friendly approach requires no architectural changes, only a change in the training loop. Instead of a single round of sampling and training, the process is iterated:

  1. The initial model $\pi_0$ samples to produce dataset $D_0$.
  2. Training on $D_0$ produces an improved model $\pi_1$.
  3. $\pi_1$ samples again (now exploring regions of the state space that $\pi_0$ could not reach) to produce $D_1$.
  4. Training on $D_1$ produces $\pi_2$.
  5. Repeat for $N$ iterations.

Each iteration shifts the sampling distribution toward better regions. The first round might discover “slow but safe” trajectories; the second round, building on a stronger policy, might explore “fast and safe” trajectories. This is essentially off-policy RL with iterative data collection, while maintaining the engineering simplicity of the offline pipeline.
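
A schematic of the loop, with every function a hypothetical stand-in for a pipeline stage:

```python
def iterative_self_improvement(model, scenes, caches, scorer, rounds=3):
    for _ in range(rounds):
        # Sample with the current policy: each round explores regions
        # the previous policy could not reach
        dataset = sample_and_score(model, scenes, caches, scorer)
        # Global view of the cache: drop degenerate pairs before training
        dataset = filter_pairs(dataset)
        # Off-policy DPO training on this round's preference pairs
        model = train_dpo(model, dataset)
    return model
```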

Test-Time Search

Rather than improving the model, this approach improves the sampling process itself. Two strategies are available:

Guided sampling exploits the structure of diffusion models by introducing a lightweight cost function during the reverse denoising process. This steers trajectory generation toward collision-free regions, raising the floor of sample quality without additional model training.
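
A sketch of one guided reverse step, assuming a differentiable `collision_cost` against the metric cache and a standard `reverse_step` sampler; both names are hypothetical:

```python
import torch

def guided_denoise_step(eps_model, y_t, t, cond, cache, scale=0.1):
    # Differentiable penalty w.r.t. the current noisy trajectory
    y_t = y_t.detach().requires_grad_(True)
    cost = collision_cost(y_t, cache)
    grad = torch.autograd.grad(cost.sum(), y_t)[0]
    # Standard reverse-diffusion step, then steer the sample away
    # from high-cost (collision-prone) regions
    y_prev = reverse_step(eps_model, y_t, t, cond)
    return y_prev - scale * grad
```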

Tree search (e.g., Monte Carlo Tree Search) generates a large number of candidate trajectories (e.g., 1000) and uses a fast value model to pre-filter them down to a small set (e.g., 10) for expensive evaluation. This front-loads computational effort into the data generation phase, effectively performing “thinking” during sampling and distilling the results into the trained model.
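
A sketch of the pre-filtering step; `value_model` and its interface are hypothetical:

```python
import torch

def prefilter(model, value_model, scorer, scene, cache, n=1000, top_k=10):
    # Cheap generation of a large candidate set
    candidates, _ = model.sample(scene, k=n)
    # Fast approximate values; exact cache-based scoring is reserved
    # for the surviving top-k candidates
    values = value_model(scene, candidates)          # (n,)
    keep = torch.topk(values, top_k).indices
    return [(candidates[i], scorer.score(candidates[i], cache))
            for i in keep]
```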

Expert Injection

The fastest way to raise the ceiling is to introduce external expertise. During sampling, rule-based or optimization-based planners (e.g., lattice planners) generate trajectories that are mixed into the candidate pool. These expert trajectories become the winners in the preference pairs, forcing the model to learn: “this is how an expert planner handles this situation.” Over time, the model internalizes the expert’s decision-making patterns while retaining the neural network’s ability to generalize to scenarios where the rule-based planner fails.
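
A sketch of expert injection, with `lattice_planner` as a hypothetical rule-based expert; it assumes expert trajectories can be scored under the current policy (e.g., after tokenization by the same VQ module), which the DPO ratio requires:

```python
def sample_with_experts(model, scene, cache, scorer, k=128):
    candidates, ref_logp = model.sample(scene, k=k)
    expert_trajs = lattice_planner.plan(scene, cache)   # external expertise
    # Expert trajectories still need reference log-probabilities under
    # the current policy for the DPO ratio
    expert_logp = [model.log_prob(traj, scene) for traj in expert_trajs]
    pool = list(candidates) + list(expert_trajs)
    logps = list(ref_logp) + expert_logp
    scores = [scorer.score(traj, cache) for traj in pool]
    # Experts frequently win the pair, teaching the model how an
    # expert planner handles this scene
    return build_pair(pool, scores, logps)
```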

Discussion

The Generate-Score-Train paradigm has become the standard approach for aligning large models (whether LLMs, VLMs, or end-to-end driving systems) to desired behaviors. Its strength lies in engineering pragmatism: it decouples the expensive simulation step from the GPU-intensive training step, enables data reuse, and allows quality control before training. The key insight is that in this framework, sampling quality determines the performance ceiling, and the loss function merely determines how efficiently the model approaches that ceiling.

The three strategies for breaking the sampling ceiling are complementary rather than mutually exclusive. Iterative self-improvement provides a natural progression of model capability. Test-time search improves sample quality at the cost of additional computation during data generation. Expert injection provides an immediate boost by importing external knowledge. In practice, the most effective pipelines combine all three, using expert trajectories to bootstrap the first iteration, iterative self-improvement to progressively expand the frontier, and guided sampling or tree search to maximize the quality of each iteration’s samples.

The path from offline DPO to truly online RL in autonomous driving remains open. The simulator flaw—the non-reactivity of log-replay agents—is a fundamental obstacle that cannot be solved by algorithmic improvements alone. Addressing it requires either more realistic reactive simulators or hybrid approaches that combine log-replay evaluation with learned environment models. Until then, the iterative offline paradigm, with its engineering simplicity and demonstrated effectiveness, remains the pragmatic choice for production systems.

References

  1. Rafailov, R., Sharma, A., Mitchell, E., et al. "Direct Preference Optimization: Your Language Model Is Secretly a Reward Model." NeurIPS, 2023.
  2. Wallace, B., Dang, M., Rafailov, R., et al. "Diffusion Model Alignment Using Direct Preference Optimization." arXiv:2311.12908, 2023.
  3. Shao, Z., Wang, P., Zhu, Q., et al. "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models." arXiv:2402.03300, 2024.
  4. Hu, Y., et al. "Planning-Oriented Autonomous Driving." CVPR, 2023.
  5. Liao, B., et al. "DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving." arXiv, 2024.
  6. Silver, D., Huang, A., Maddison, C. J., et al. "Mastering the Game of Go with Deep Neural Networks and Tree Search." Nature, 2016.
  7. van den Oord, A., Vinyals, O., and Kavukcuoglu, K. "Neural Discrete Representation Learning." NeurIPS, 2017.
  8. Salzmann, T., Ivanovic, B., Chakravarty, P., and Pavone, M. "Trajectron++: Dynamically-Feasible Trajectory Forecasting with Heterogeneous Data." ECCV, 2020.