1. Why End-to-End Driving Needs Reinforcement Learning
Supervised learning—whether through imitation learning or behavior cloning—can only take an autonomous driving system so far. The fundamental limitation is distributional: the training data is drawn from expert demonstrations, and any distributional shift between training and deployment leads to compounding errors. More critically, supervised objectives are misaligned with the true goal of driving. Minimizing the L2 distance to a ground-truth trajectory penalizes safe deviations as harshly as dangerous ones, and provides no mechanism for the model to discover better trajectories than those in the dataset.
Reinforcement learning offers a principled alternative. Rather than mimicking specific actions, RL optimizes a reward signal that directly measures driving quality—collision avoidance, progress toward the destination, passenger comfort, rule compliance. The policy is free to discover novel strategies that achieve high reward, even if they differ from expert behavior. This is particularly valuable for handling long-tail scenarios where no demonstration exists.
The challenge, however, is that driving is not a standard MDP. In most end-to-end systems operating on log-replay data, the model generates a complete future trajectory at once and receives a single reward after evaluation—a contextual bandit structure, not a sequential decision process. This structural difference propagates through every aspect of the RL pipeline: how advantages are estimated, how sampling works, and how the loss function is designed.
2. From REINFORCE to PPO to GRPO: The Policy Gradient Lineage
2.1 The Policy Gradient Theorem
Consider a parameterized policy $\pi_\theta(a \mid s)$. The objective is to maximize the expected return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right]$$

The policy gradient theorem [1] gives the gradient of this objective:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \right]$$
This is the REINFORCE estimator [2]. Intuitively, it increases the log-probability of actions that led to high returns and decreases the log-probability of those that did not. The computational flow is: sample a trajectory from the current policy, compute its return, and weight each step's score-function gradient by that return.
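As a concrete sketch, the Monte Carlo estimator can be written in a few lines of NumPy (function and array names here are illustrative, not from any particular codebase):

```python
import numpy as np

def reinforce_gradient(log_prob_grads, trajectory_return):
    """Single-trajectory REINFORCE estimate: sum of grad log pi(a_t|s_t) * R(tau).

    log_prob_grads:    (T, D) array, per-step gradients of log pi w.r.t. theta.
    trajectory_return: scalar return R(tau) of the sampled trajectory.
    """
    grads = np.asarray(log_prob_grads, dtype=float)
    # Every step's score-function gradient is weighted by the same return.
    return trajectory_return * grads.sum(axis=0)
```

In practice one averages this estimate over a batch of sampled trajectories; the high variance of that average is exactly what motivates the baselines discussed next.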
2.2 Variance Reduction: From Returns to Advantages
The raw REINFORCE gradient uses the full return $R(\tau)$ as a multiplier. This is problematic because $R(\tau)$ has high variance—a single trajectory's return fluctuates wildly around the true expected return. The standard fix is to replace $R(\tau)$ with the advantage function:

$$A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)$$

The advantage measures how much better action $a_t$ is compared to the average action from state $s_t$. This yields the advantage actor-critic gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\pi}(s_t, a_t) \right]$$
In practice, the advantage is estimated using Generalized Advantage Estimation (GAE) [3], which interpolates between the high-variance Monte Carlo estimate and the high-bias TD(0) estimate via a parameter $\lambda$:

$$\hat{A}_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l\, \delta_{t+l}$$

where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD error. GAE requires a learned value function $V_\phi$, which is typically a neural network trained concurrently with the policy.
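A minimal GAE implementation, assuming per-step rewards and bootstrapped value estimates are already available (names are illustrative):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation via a backward recursion.

    rewards: (T,) per-step rewards.
    values:  (T+1,) value estimates V(s_0)..V(s_T); values[-1] bootstraps the tail.
    Returns: (T,) advantages A_t = sum_l (gamma*lam)^l * delta_{t+l}.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    deltas = rewards + gamma * values[1:] - values[:-1]  # TD errors delta_t
    adv = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv
```

Setting `lam=0` recovers the one-step TD error (high bias, low variance); `lam=1` recovers the Monte Carlo advantage (low bias, high variance).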
2.3 PPO: Clipped Surrogate Objective
Proximal Policy Optimization (PPO) [4] addresses the instability of large policy updates. The key insight is to constrain the policy ratio:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

The clipped surrogate objective is:

$$\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[ \min\left( r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \right]$$

The clip removes the incentive for moving the ratio outside $[1-\epsilon,\, 1+\epsilon]$, while the outer $\min$ ensures the clipped objective is a pessimistic lower bound on the unclipped one. Combined with the advantage estimator from GAE, PPO provides stable policy improvement in practice.
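The clipped surrogate is straightforward to express as code; here it is as a framework-free NumPy function returning a loss to be minimized (all names illustrative):

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    """PPO clipped surrogate, negated so it is a loss to minimize:
    -E[min(r * A, clip(r, 1-eps, 1+eps) * A)], with r = exp(logp_new - logp_old).
    """
    ratio = np.exp(logp_new - logp_old)              # r_t(theta)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    return -np.mean(np.minimum(unclipped, clipped))  # pessimistic lower bound
```

When the new and old policies agree, the ratio is 1 and the loss reduces to the negated mean advantage; when a positive-advantage action's ratio exceeds `1 + eps`, the clip caps its contribution, removing the incentive to push further.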
However, PPO has a significant architectural cost: it requires a value network of comparable size to the policy network for computing GAE. In the LLM setting, this means training and maintaining a second model of equal parameter count, which doubles memory usage and complicates the training pipeline.
2.4 GRPO: Eliminating the Value Network
Group Relative Policy Optimization (GRPO), introduced in DeepSeekMath [5], removes the value network entirely. The key idea is simple but powerful: for a given input, sample a group of outputs from the old policy, score them all, and use the group statistics as the baseline.
Given a question (or scene) $q$, sample $G$ outputs $\{o_1, o_2, \dots, o_G\}$ from the old policy $\pi_{\theta_{\text{old}}}$. Each output $o_i$ receives a reward $r_i$ from the reward model. The group-relative advantage is:

$$\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}$$
The GRPO objective then applies the same clipped surrogate structure as PPO, but with these group-relative advantages:

$$\mathcal{J}^{\text{GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G}\sum_{i=1}^{G} \min\left( r_i(\theta)\,\hat{A}_i,\ \operatorname{clip}\left(r_i(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_i \right) \right] - \beta\, \mathbb{D}_{\text{KL}}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)$$

where $r_i(\theta) = \dfrac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}$.
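A sketch of the group-relative advantage and the resulting GRPO loss, including the unbiased per-sample KL estimator used in the GRPO paper (names are illustrative; a real implementation would operate on per-token log-probabilities in a framework such as PyTorch):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantage: z-score each output's reward within its group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def grpo_loss(logp_new, logp_old, rewards, logp_ref=None, eps=0.2, beta=0.01):
    """Clipped surrogate with group-relative advantages; KL penalty in the loss."""
    adv = grpo_advantages(rewards)
    ratio = np.exp(logp_new - logp_old)
    surrogate = np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)
    loss = -np.mean(surrogate)
    if logp_ref is not None:
        # Unbiased estimator of KL(pi_theta || pi_ref): E[rho - log(rho) - 1],
        # with rho = pi_ref / pi_theta.
        log_rho = logp_ref - logp_new
        loss += beta * np.mean(np.exp(log_rho) - log_rho - 1)
    return loss
```

Note that the KL term is added to the loss directly, not folded into the rewards, so it never touches the advantage computation.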
A critical design choice in GRPO is the placement of the KL penalty. In PPO, the KL term is added into the reward at each step, which means it affects the advantage computation. In GRPO, the KL penalty is placed directly in the loss function, decoupled from advantage estimation. This keeps the advantage computation clean and interpretable.
| Dimension | PPO | GRPO |
|---|---|---|
| Value function | Requires learned $V_\phi$ | None; group mean as baseline |
| Advantage estimation | GAE via TD errors | Group-relative normalization |
| KL penalty | Embedded in per-step reward | Directly in loss function |
| Sampling | Single output per input | Group of outputs per input |
| Memory overhead | ~2x (policy + value networks) | ~1x (policy only) |
| Per-token advantage | Yes (varies across positions) | No (shared across output) |
2.5 GRPO in Autonomous Driving: AlphaDrive
The first application of GRPO to autonomous driving is AlphaDrive [6], which applies GRPO-based RL to Vision-Language Models for planning. AlphaDrive introduces four planning-oriented RL rewards tailored to driving scenarios and employs a two-stage training pipeline (SFT followed by RL). A notable finding is that RL training elicits emergent multi-modal planning capabilities—the model learns to propose diverse viable trajectories without explicit multi-modal supervision. This is particularly significant because multi-modality in trajectory planning (e.g., deciding whether to pass on the left or right of an obstacle) is a core requirement for safe and efficient driving.
3. The Unified Optimization Framework
Across PPO, GRPO, and their variants, the optimization objective for policy methods can be expressed in a unified form:

$$\mathcal{L} = \mathcal{L}_{\text{policy}} + \mathcal{L}_{\text{reg}} + \mathcal{L}_{\text{aux}}$$
Each term serves a distinct purpose:
- $\mathcal{L}_{\text{policy}}$: The core policy optimization loss that drives the policy toward higher-reward actions. This is the clipped surrogate objective (PPO-clip or GRPO-clip).
- $\mathcal{L}_{\text{reg}}$: Regularization that constrains the policy from deviating too far from a reference. This includes KL divergence $\mathbb{D}_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}})$, entropy bonuses for exploration, and trust region constraints.
- $\mathcal{L}_{\text{aux}}$: Auxiliary losses that preserve capabilities from pre-training or supervised fine-tuning. This includes imitation learning (behavior cloning) loss, reconstruction loss (for diffusion decoders), and value function loss.
The mapping to concrete terms in a driving RL pipeline:
| Abstract Term | Concrete Implementation |
|---|---|
| $\mathcal{L}_{\text{policy}}$ | Clipped surrogate (PPO-clip or GRPO-clip) on sampled trajectories |
| $\mathcal{L}_{\text{reg}}$ | KL divergence to the SFT reference policy; entropy bonus |
| $\mathcal{L}_{\text{aux}}$ | Behavior cloning or reconstruction loss; value loss (PPO only) |
The core policy loss expands to:

$$\mathcal{L}_{\text{policy}} = -\,\mathbb{E}\left[ \frac{1}{G}\sum_{i=1}^{G} \min\left( r_i(\theta)\,\hat{A}_i,\ \operatorname{clip}\left(r_i(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_i \right) \right]$$

where $r_i(\theta) = \dfrac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}$.
4. Sampling: LLM vs. Autonomous Driving
The sampling step—generating candidate outputs from the current policy—is where the difference between LLM and driving RL becomes most pronounced. In both cases, the quality of the advantage estimate depends on the diversity and quality of the sampled group. But the constraints on what constitutes a valid sample differ fundamentally.
LLM sampling is essentially unconstrained. Given a prompt, the model samples token sequences via temperature scaling, top-k filtering, or nucleus sampling (top-p). Any sequence of valid tokens is a syntactically legal output; the only question is whether it is semantically useful. The sampling space is the full vocabulary raised to the sequence length, and the diversity of samples is controlled by the temperature parameter.
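For reference, temperature scaling with top-k filtering can be sketched as follows (an illustrative helper, not any specific library's API):

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_k=None, rng=None):
    """Sample one token index from a logit vector with temperature scaling
    and optional top-k filtering."""
    rng = rng or np.random.default_rng(0)
    scaled = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    if top_k is not None:
        cutoff = np.sort(scaled)[-top_k]             # k-th largest logit
        scaled = np.where(scaled >= cutoff, scaled, -np.inf)
    probs = np.exp(scaled - scaled.max())            # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```

Raising the temperature flattens the distribution and increases sample diversity; any sampled index is a legal output, which is precisely the property driving does not share.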
Driving sampling is fundamentally constrained. A sampled trajectory must satisfy:
- Kinematic feasibility: The trajectory must respect vehicle dynamics—maximum steering angle, acceleration limits, jerk constraints. A trajectory that requires instantaneous lateral displacement is physically impossible.
- Scene consistency: The trajectory must not pass through observed obstacles, violate traffic rules, or leave the drivable area.
- Temporal coherence: The trajectory must be smooth and continuous, without discontinuous jumps in position or heading.
These constraints mean that naive perturbation of a trajectory (analogous to temperature sampling in LLMs) produces mostly invalid samples. A small perturbation might push the trajectory into an obstacle; a large perturbation might produce a physically impossible path. The sampling strategy must be carefully designed to produce meaningful diversity—trajectories that differ in interesting ways (left pass vs. right pass, aggressive merge vs. conservative yield) while remaining physically feasible.
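A minimal feasibility filter along these lines might check speed, acceleration, and jerk limits on a waypoint trajectory by finite differences (the thresholds below are illustrative placeholders, not production values, and real systems would add steering and scene checks):

```python
import numpy as np

def is_feasible(traj, dt=0.1, v_max=20.0, a_max=4.0, jerk_max=10.0):
    """Reject trajectories that violate simple kinematic limits.

    traj: (T, 2) array of (x, y) waypoints sampled at interval dt seconds.
    """
    v = np.diff(traj, axis=0) / dt   # per-step velocity vectors
    a = np.diff(v, axis=0) / dt      # per-step acceleration vectors
    j = np.diff(a, axis=0) / dt      # per-step jerk vectors
    return bool(
        np.all(np.linalg.norm(v, axis=1) <= v_max)
        and np.all(np.linalg.norm(a, axis=1) <= a_max)
        and np.all(np.linalg.norm(j, axis=1) <= jerk_max)
    )
```

A filter like this is what makes naive perturbation wasteful: most random perturbations of a valid trajectory fail at least one of these checks.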
This is where diffusion-based trajectory decoders offer a natural advantage. The denoising process can be guided to satisfy constraints, and the noise schedule controls the exploration-exploitation tradeoff in a physically meaningful way.
5. Loss Design: Multi-Objective Composition
The full training objective in a production driving RL system typically combines multiple loss terms:

$$\mathcal{L} = \mathcal{L}_{\text{PG}} + \beta_{\text{KL}}\,\mathcal{L}_{\text{KL}} - \beta_{\text{ent}}\,\mathcal{H}(\pi_\theta) + \beta_{\text{BC}}\,\mathcal{L}_{\text{BC}} + \beta_{V}\,\mathcal{L}_{V} + \mathcal{L}_{\text{misc}}$$
Each term plays a specific role:
Policy gradient loss ($\mathcal{L}_{\text{PG}}$): The primary driver of policy improvement. The clip mechanism prevents destructively large updates, while the group-relative advantage provides a variance-reduced gradient signal.
KL divergence ($\mathcal{L}_{\text{KL}}$): Constrains the policy from drifting too far from the reference (typically the SFT checkpoint). Without this, RL training can cause the model to “forget” its pre-trained capabilities, and it leaves the policy freer to engage in reward hacking, where it finds loopholes in the reward function that produce high scores but low-quality trajectories.
Entropy bonus ($\mathcal{H}(\pi_\theta)$): Encourages exploration by preventing the policy from collapsing to a deterministic mode. In driving, this is essential for maintaining multi-modality: the model should continue to propose diverse plausible trajectories rather than converging to a single average solution.
Behavior cloning loss ($\mathcal{L}_{\text{BC}}$): An auxiliary imitation loss computed on expert demonstrations. This acts as a regularizer that prevents the policy from departing too far from safe, human-like driving behavior. It is particularly important in early RL training, when the reward signal may be noisy or sparse.
Value function loss ($\mathcal{L}_{V}$): When a value network is used (as in PPO), this is the regression loss for training $V_\phi$. In GRPO-based systems, this term is absent, but it may still appear in hybrid approaches that combine GRPO advantages with a learned baseline for additional variance reduction.
Other auxiliary losses ($\mathcal{L}_{\text{misc}}$): Domain-specific terms such as reconstruction loss for diffusion decoders, collision prediction loss, or comfort regularization. These are typically small in magnitude but provide important inductive biases.
The coefficients $\beta_{\text{KL}}, \beta_{\text{ent}}, \beta_{\text{BC}}, \beta_{V}$ are critical hyperparameters. In practice, they are tuned through a combination of grid search and manual adjustment. A common pattern is to start with a high $\beta_{\text{BC}}$ (strong imitation regularization) and gradually anneal it as RL training stabilizes, allowing the policy gradient signal to dominate.
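The annealing pattern can be as simple as a linear schedule (a sketch; the schedule shape and endpoint values are assumptions for illustration):

```python
def annealed_beta_bc(step, total_steps, beta_start=1.0, beta_end=0.05):
    """Linearly anneal the behavior-cloning weight beta_BC over training,
    clamping to the endpoints outside [0, total_steps]."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return beta_start + frac * (beta_end - beta_start)
```

Cosine or step schedules are equally common; the essential property is that imitation regularization dominates early and the RL signal dominates late.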
6. Diffusion, Noise, and Exploration
For diffusion-based trajectory decoders, the relationship between noise and exploration deserves special attention. In the standard diffusion process, a clean trajectory $\tau_0$ is corrupted by adding Gaussian noise over $T$ steps:

$$q(\tau_t \mid \tau_0) = \mathcal{N}\left( \sqrt{\bar{\alpha}_t}\,\tau_0,\ (1 - \bar{\alpha}_t)\,\mathbf{I} \right)$$

At inference time, the model denoises from $\tau_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ (pure noise) back to $\tau_0$ (a clean trajectory). The initial noise $\tau_T$ determines which trajectory is generated. In the RL context, this noise plays a role analogous to temperature in LLM sampling—but with a crucial difference.
Noise in diffusion-based driving is not merely “sampling randomness.” It directly determines:
- Exploration range: The magnitude and structure of the noise control how far the generated trajectories can deviate from the mean. Larger noise leads to more diverse candidates.
- Candidate trajectory morphology: Different noise realizations produce qualitatively different trajectory shapes—lane-change vs. lane-follow, aggressive vs. conservative, left vs. right. The noise does not just shift a trajectory; it can change its mode.
- Group distribution quality: For GRPO, the advantage estimation depends on the group of samples having meaningful reward variance. If the noise is too small, all trajectories are nearly identical, and the group-relative advantage is dominated by noise rather than signal. If the noise is too large, many trajectories become physically invalid, and the reward signal becomes uninformative.
This creates a three-way tension in noise scheduling:
- Effective diversity: The noise must be large enough to produce trajectories with meaningfully different rewards, enabling the group-relative advantage to separate good from bad.
- Trajectory validity: The noise must be small enough (or the denoising process must be constrained enough) to keep trajectories within the kinematically feasible and scene-consistent region.
- Alignment with training objectives: The exploration direction should be consistent with what the reward function actually measures. Noise that produces diverse but reward-irrelevant variations (e.g., tiny lateral shifts that do not affect collision safety) wastes sampling budget.
In practice, these tensions are addressed through a combination of constrained diffusion (guiding the denoising process with kinematic constraints), adaptive noise scheduling (adjusting noise levels based on scene complexity), and rejection sampling (discarding trajectories that violate hard constraints before computing rewards).
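A sketch combining rejection sampling with an adaptive noise scale (the shrink-on-failure heuristic and all names are illustrative assumptions, standing in for a constrained diffusion decoder):

```python
import numpy as np

def sample_valid_group(sample_fn, is_valid_fn, group_size=8, sigma=1.0,
                       max_tries=100, shrink=0.8, rng=None):
    """Rejection-sample a group of valid trajectories, shrinking the noise
    scale after each invalid draw so exploration stays near the feasible set.

    sample_fn(noise) -> trajectory; is_valid_fn(trajectory) -> bool.
    Returns the accepted group and the final noise scale.
    """
    rng = rng or np.random.default_rng(0)
    group, tries = [], 0
    while len(group) < group_size and tries < max_tries:
        traj = sample_fn(sigma * rng.standard_normal())  # noise seed for decoder
        if is_valid_fn(traj):
            group.append(traj)
        else:
            sigma *= shrink  # invalid draw: explore more conservatively
        tries += 1
    return group, sigma
```

The returned `sigma` can be carried across scenes so that complex scenes, where more samples fail validity checks, automatically get more conservative exploration.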
7. Summary
| Component | REINFORCE | PPO | GRPO |
|---|---|---|---|
| Objective | Maximize $\mathbb{E}[R(\tau)]$ | Clipped surrogate | Clipped surrogate |
| Advantage | $R(\tau)$ (raw return) | GAE via learned $V_\phi$ | Group-relative normalization |
| Baseline | None | Learned value function | Group mean reward |
| Value network | No | Yes (same scale as policy) | No |
| Variance | Very high | Low (GAE + learned baseline) | Moderate (group statistics) |
| Update constraint | None | Clip ratio to $[1-\epsilon, 1+\epsilon]$ | Clip ratio to $[1-\epsilon, 1+\epsilon]$ |
| KL regularization | None | In reward (affects advantage) | In loss (independent of advantage) |
| Memory | 1x | ~2x | ~1x |
| Driving applicability | Baseline only | General purpose | VLM planning, group-sampled scenarios |
The progression from REINFORCE to PPO to GRPO represents a trajectory of increasing practical efficiency: REINFORCE establishes the theoretical foundation, PPO introduces stable optimization through clipping and learned baselines, and GRPO removes the expensive value network by exploiting the group structure of the sampling process. For autonomous driving, GRPO is particularly attractive because the contextual bandit structure of trajectory planning naturally produces group-sampled outputs, and the absence of a value network simplifies the training pipeline for already complex end-to-end models.
However, GRPO is not a universal replacement for PPO. In settings where per-token advantages matter (e.g., sequential decision-making with meaningful intermediate states), GAE provides a richer signal than the per-output advantage of GRPO. The choice between the two should be guided by the structure of the problem: contextual bandit with group sampling favors GRPO; sequential MDP with long horizons favors PPO.
References
[1] Sutton, R.S., McAllester, D.A., Singh, S.P., & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. NeurIPS.
[2] Williams, R.J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4), 229-256.
[3] Schulman, J., Moritz, P., Levine, S., Jordan, M.I., & Abbeel, P. (2016). High-dimensional continuous control using generalized advantage estimation. ICLR.
[4] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv:1707.06347.
[5] Shao, Z., Wang, P., Zhu, Q., et al. (2024). DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv:2402.03300. GRPO is introduced in Section 4.
[6] Jiang, B., Chen, S., Zhang, Q., Liu, W., & Wang, X. (2025). AlphaDrive: Unleashing the power of VLMs in autonomous driving via GRPO-based reasoning and planning. arXiv:2503.07608.