1. Why End-to-End Driving Needs Reinforcement Learning

Supervised learning—whether through imitation learning or behavior cloning—can only take an autonomous driving system so far. The fundamental limitation is distributional: the training data is drawn from expert demonstrations, and any distributional shift between training and deployment leads to compounding errors. More critically, supervised objectives are misaligned with the true goal of driving. Minimizing the L2 distance to a ground-truth trajectory penalizes safe deviations as harshly as dangerous ones, and provides no mechanism for the model to discover better trajectories than those in the dataset.

Reinforcement learning offers a principled alternative. Rather than mimicking specific actions, RL optimizes a reward signal that directly measures driving quality—collision avoidance, progress toward the destination, passenger comfort, rule compliance. The policy is free to discover novel strategies that achieve high reward, even if they differ from expert behavior. This is particularly valuable for handling long-tail scenarios where no demonstration exists.

The challenge, however, is that driving is not a standard MDP. In most end-to-end systems operating on log-replay data, the model generates a complete future trajectory at $t=0$ and receives a single reward after evaluation—a contextual bandit structure, not a sequential decision process. This structural difference propagates through every aspect of the RL pipeline: how advantages are estimated, how sampling works, and how the loss function is designed.

2. From REINFORCE to PPO to GRPO: The Policy Gradient Lineage

2.1 The Policy Gradient Theorem

Consider a parameterized policy $\pi_\theta(a|s)$. The objective is to maximize the expected return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau)\right] = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\right]$$

The policy gradient theorem [1] gives the gradient of this objective:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot R(\tau)\right]$$

This is the REINFORCE estimator [2]. Intuitively, it increases the log-probability of actions that led to high returns and decreases the log-probability of those that did not. The following diagram illustrates the computational flow:

Figure: the policy gradient computation flow: (1) sample $\tau \sim \pi_\theta(a|s)$; (2) compute the return $R(\tau) = \sum_t \gamma^t r_t$; (3) compute $\nabla_\theta \log \pi_\theta(a_t|s_t)$; (4) update $\theta \leftarrow \theta + \alpha \nabla_\theta J$. The REINFORCE problem is high variance: $R(\tau)$ is a Monte Carlo estimate, and a single-trajectory return fluctuates enormously. Successive remedies: a baseline $R(\tau) - b$, the advantage $A(s,a)$, and the group-relative advantage $\hat{A}_i$.
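
To make the estimator concrete, here is a minimal sketch of one REINFORCE update, assuming a PyTorch policy whose sampled actions carry differentiable log-probabilities (the trajectory format is illustrative, not from any specific system):

```python
import torch

def reinforce_step(optimizer, trajectories, gamma=0.99):
    """One REINFORCE update.

    trajectories: list of trajectories, each a list of
    (log_prob, reward) pairs, where log_prob is a differentiable
    tensor produced while sampling from the current policy.
    """
    loss = 0.0
    for traj in trajectories:
        # Monte Carlo return R(tau) = sum_t gamma^t r_t
        R = sum(gamma**t * r for t, (_, r) in enumerate(traj))
        # Ascend J(theta) by descending -log pi(a_t|s_t) * R(tau)
        for log_prob, _ in traj:
            loss = loss - log_prob * R
    loss = loss / len(trajectories)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```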

2.2 Variance Reduction: From Returns to Advantages

The raw REINFORCE gradient uses $R(\tau)$ as a multiplier. This is problematic because $R(\tau)$ has high variance—a single trajectory return fluctuates wildly around the true expected return. The standard fix is to replace $R(\tau)$ with the advantage function:

$$A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$$

The advantage measures how much better action $a_t$ is compared to the average action from state $s_t$. This yields the advantage actor-critic gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot A^{\pi}(s_t, a_t)\right]$$

In practice, the advantage is estimated using Generalized Advantage Estimation (GAE) [3], which interpolates between the high-variance Monte Carlo estimate and the high-bias TD(0) estimate via a parameter $\lambda$:

$$\hat{A}^{\text{GAE}}_t = \sum_{l=0}^{T-t} (\gamma\lambda)^l \delta_{t+l}$$

where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD error. GAE requires a learned value function $V_\psi$, which is typically a neural network trained concurrently with the policy.
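
In implementations, GAE is usually computed with the standard backward recursion $\hat{A}_t = \delta_t + \gamma\lambda\hat{A}_{t+1}$, which is equivalent to the sum above; a minimal NumPy sketch:

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation via the backward recursion
    A_t = delta_t + gamma * lam * A_{t+1}.

    rewards: array of shape [T]; values: array of shape [T+1],
    where values[T] is the bootstrap value of the final state.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    a = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        a = delta + gamma * lam * a
        advantages[t] = a
    return advantages
```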

2.3 PPO: Clipped Surrogate Objective

Proximal Policy Optimization (PPO) [4] addresses the instability of large policy updates. The key insight is to constrain the policy ratio:

$$\rho_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$$

The clipped surrogate objective is:

$$L^{\text{PPO-clip}}(\theta) = \mathbb{E}_t\left[\min\left(\rho_t(\theta) \hat{A}_t,\ \text{clip}(\rho_t(\theta), 1-\varepsilon, 1+\varepsilon) \hat{A}_t\right)\right]$$

The clip removes the incentive for moving the ratio outside $[1-\varepsilon, 1+\varepsilon]$, while the outer $\min$ ensures the clipped version is a pessimistic lower bound on the unclipped objective. Combined with the advantage estimator from GAE, PPO provides stable, approximately monotonic policy improvement in practice.
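
In code, the clipped surrogate reduces to a few tensor operations; a PyTorch-style sketch, assuming per-timestep batched tensors:

```python
import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, eps=0.2):
    """Clipped surrogate loss; returns a scalar to *minimize*."""
    ratio = torch.exp(log_probs - old_log_probs)           # rho_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    # Elementwise min keeps the pessimistic bound; negate for descent
    return -torch.min(unclipped, clipped).mean()
```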

However, PPO has a significant architectural cost: it requires a value network $V_\psi$ of comparable size to the policy network for computing GAE. In the LLM setting, this means training and maintaining a second model of equal parameter count, which doubles memory usage and complicates the training pipeline.

2.4 GRPO: Eliminating the Value Network

Group Relative Policy Optimization (GRPO), introduced in DeepSeekMath [5], removes the value network entirely. The key idea is simple but powerful: for a given input, sample a group of $G$ outputs from the old policy, score them all, and use the group statistics as the baseline.

Figure: advantage estimation in PPO (GAE) vs. GRPO (group relative). PPO: a value network $V_\psi$ learns state values; TD errors $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ are combined via GAE $\sum_l (\gamma\lambda)^l \delta_{t+l}$ (bias-variance tradeoff via $\lambda$) into per-token advantages $\hat{A}_t$, at the cost of a separate $V_\psi$ (roughly 2x memory). GRPO: sample $G$ outputs $o_1, \ldots, o_G \sim \pi_{\text{old}}$, score them with a reward model to obtain $r_1, \ldots, r_G$, normalize within the group as $\tilde{r}_i = (r_i - \text{mean})/\text{std}$, and use $\hat{A}_i = \tilde{r}_i$ as a per-output advantage, with no value network (roughly 1x memory).

Given a question (or scene) $q$, sample $G$ outputs $\{o_1, o_2, \ldots, o_G\}$ from $\pi_{\theta_{\text{old}}}$. Each output receives a reward $r_i$ from the reward model. The group-relative advantage is:

$$\tilde{r}_i = \frac{r_i - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r}) + \varepsilon}, \qquad \hat{A}_{i,t} = \tilde{r}_i \quad \text{(all tokens in output } i \text{ share the same advantage)}$$
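
The normalization itself is a one-liner per group; a PyTorch-style sketch:

```python
import torch

def group_relative_advantages(rewards, eps=1e-8):
    """rewards: tensor of shape [G], one scalar reward per sampled
    output; returns the group-normalized advantages A_i."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```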

The GRPO objective then applies the same clipped surrogate structure as PPO, but with these group-relative advantages:

$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\left\{\min\left[\rho_{i,t}\hat{A}_{i,t},\ \text{clip}(\rho_{i,t}, 1-\varepsilon, 1+\varepsilon)\hat{A}_{i,t}\right] - \beta\, \mathbb{D}_{\text{KL}}[\pi_\theta \| \pi_{\text{ref}}]\right\}\right]$$

where $\rho_{i,t} = \pi_\theta(o_{i,t} \mid q, o_{i,<t}) / \pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})$.

A critical design choice in GRPO is the placement of the KL penalty. In PPO, the KL term is added into the reward at each step, which means it affects the advantage computation. In GRPO, the KL penalty is placed directly in the loss function, decoupled from advantage estimation. This keeps the advantage computation clean and interpretable.
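
For the KL term itself, DeepSeekMath uses an unbiased, always non-negative per-token estimator rather than the naive $\log(\pi_\theta/\pi_{\text{ref}})$; a sketch:

```python
import torch

def kl_to_ref(log_probs, ref_log_probs):
    """Per-token estimator of D_KL[pi_theta || pi_ref] used in GRPO:
    pi_ref/pi_theta - log(pi_ref/pi_theta) - 1, which is >= 0 and
    unbiased under samples from pi_theta."""
    log_ratio = ref_log_probs - log_probs
    return torch.exp(log_ratio) - log_ratio - 1.0
```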

| Dimension | PPO | GRPO |
| --- | --- | --- |
| Value function | Requires learned $V_\psi$ | None; group mean as baseline |
| Advantage estimation | GAE via TD errors | Group-relative normalization |
| KL penalty | Embedded in per-step reward | Directly in loss function |
| Sampling | Single output per input | Group of $G$ outputs per input |
| Memory overhead | ~2x (policy + value networks) | ~1x (policy only) |
| Per-token advantage | Yes (varies across positions) | No (shared across output) |

2.5 GRPO in Autonomous Driving: AlphaDrive

The first application of GRPO to autonomous driving is AlphaDrive [6], which applies GRPO-based RL to Vision-Language Models for planning. AlphaDrive introduces four planning-oriented RL rewards tailored to driving scenarios and employs a two-stage training pipeline (SFT followed by RL). A notable finding is that RL training elicits emergent multi-modal planning capabilities—the model learns to propose diverse viable trajectories without explicit multi-modal supervision. This is particularly significant because multi-modality in trajectory planning (e.g., deciding whether to pass on the left or right of an obstacle) is a core requirement for safe and efficient driving.

3. The Unified Optimization Framework

Across PPO, GRPO, and their variants, the optimization objective for policy methods can be expressed in a unified form:

$$\mathcal{L}_{\text{total}} = \lambda_{\text{policy}} \cdot \mathcal{L}_{\text{policy}} + \lambda_{\text{reg}} \cdot \mathcal{L}_{\text{reg}} + \lambda_{\text{aux}} \cdot \mathcal{L}_{\text{aux}}$$

Each term serves a distinct purpose:

  • $\mathcal{L}_{\text{policy}}$: The core policy optimization loss that drives the policy toward higher-reward actions. This is the clipped surrogate objective (PPO-clip or GRPO-clip).
  • $\mathcal{L}_{\text{reg}}$: Regularization that constrains the policy from deviating too far from a reference. This includes the KL divergence $\mathbb{D}_{\text{KL}}[\pi_\theta \| \pi_{\text{ref}}]$, entropy bonuses for exploration, and trust region constraints.
  • $\mathcal{L}_{\text{aux}}$: Auxiliary losses that preserve capabilities from pre-training or supervised fine-tuning. This includes imitation learning (behavior cloning) loss, reconstruction loss (for diffusion decoders), and value function loss.

The mapping to concrete terms in a driving RL pipeline:

| Abstract term | Concrete implementation |
| --- | --- |
| $\mathcal{L}_{\text{policy}}$ | $\mathcal{L}_{\text{GRPO-clip}}$ or $\mathcal{L}_{\text{PPO-clip}}$ |
| $\mathcal{L}_{\text{reg}}$ | $\mathcal{L}_{\text{KL}} + \mathcal{L}_{\text{entropy}}$ |
| $\mathcal{L}_{\text{aux}}$ | $\mathcal{L}_{\text{BC}} + \mathcal{L}_{\text{value}} + \sum_m \lambda_{\text{aux},m} \mathcal{L}_{\text{aux},m}$ |

The core policy loss expands to:

$$\mathcal{L}_{\text{GRPO-clip}} = -\mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\min\left(\rho_i(\theta)\hat{A}_i,\ \text{clip}(\rho_i(\theta), 1-\varepsilon, 1+\varepsilon)\hat{A}_i\right)\right]$$

where $\rho_i(\theta) = \pi_\theta(a_i|s) / \pi_{\theta_{\text{old}}}(a_i|s) = \exp(\log \pi_\theta(a_i|s) - \log \pi_{\theta_{\text{old}}}(a_i|s))$.
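
The log-space form is the one to implement, since multiplying many small per-token probabilities underflows; a PyTorch-style sketch:

```python
import torch

def prob_ratio(log_probs, old_log_probs):
    """rho_i(theta) = exp(log pi_theta - log pi_theta_old), computed
    in log space for numerical stability. The old-policy log-probs
    are treated as constants (no gradient flows through them)."""
    return torch.exp(log_probs - old_log_probs.detach())
```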

4. Sampling: LLM vs. Autonomous Driving

The sampling step—generating candidate outputs from the current policy—is where the difference between LLM and driving RL becomes most pronounced. In both cases, the quality of the advantage estimate depends on the diversity and quality of the sampled group. But the constraints on what constitutes a valid sample differ fundamentally.

Figure: sampling spaces in LLMs vs. autonomous driving. LLM sampling (temperature / top-k / top-p): any token sequence is "valid" and there are no physical constraints. Driving sampling (diffusion noise / perturbation): trajectories must satisfy kinematics, and physical constraints bound the space, partitioning candidates into high-reward, medium-reward, and low-reward or invalid regions.

LLM sampling is essentially unconstrained. Given a prompt, the model samples token sequences via temperature scaling, top-k filtering, or nucleus sampling (top-p). Any sequence of valid tokens is a syntactically legal output; the only question is whether it is semantically useful. The sampling space scales as the vocabulary size raised to the sequence length, and sample diversity is controlled by the temperature parameter.

Driving sampling is fundamentally constrained. A sampled trajectory must satisfy:

  1. Kinematic feasibility: The trajectory must respect vehicle dynamics—maximum steering angle, acceleration limits, jerk constraints. A trajectory that requires instantaneous lateral displacement is physically impossible.
  2. Scene consistency: The trajectory must not pass through observed obstacles, violate traffic rules, or leave the drivable area.
  3. Temporal coherence: The trajectory must be smooth and continuous, without discontinuous jumps in position or heading.

These constraints mean that naive perturbation of a trajectory (analogous to temperature sampling in LLMs) produces mostly invalid samples. A small perturbation might push the trajectory into an obstacle; a large perturbation might produce a physically impossible path. The sampling strategy must be carefully designed to produce meaningful diversity—trajectories that differ in interesting ways (left pass vs. right pass, aggressive merge vs. conservative yield) while remaining physically feasible.
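
As a concrete illustration of these validity checks, here is a sketch of a post-hoc kinematic filter; the limits, time step, and waypoint format are hypothetical placeholders, not values from any specific system:

```python
import numpy as np

def is_kinematically_feasible(traj, dt=0.1,
                              a_max=4.0,       # m/s^2, hypothetical limit
                              jerk_max=10.0,   # m/s^3, hypothetical limit
                              kappa_max=0.2):  # 1/m, max curvature
    """traj: array [T, 2] of (x, y) waypoints at a fixed time step dt."""
    v = np.diff(traj, axis=0) / dt      # velocities,     shape [T-1, 2]
    a = np.diff(v, axis=0) / dt         # accelerations,  shape [T-2, 2]
    jerk = np.diff(a, axis=0) / dt      # jerk,           shape [T-3, 2]
    speed = np.linalg.norm(v, axis=1)

    if np.any(np.linalg.norm(a, axis=1) > a_max):
        return False
    if np.any(np.linalg.norm(jerk, axis=1) > jerk_max):
        return False
    # Curvature kappa = |x' y'' - y' x''| / speed^3 (guard slow segments)
    cross = v[:-1, 0] * a[:, 1] - v[:-1, 1] * a[:, 0]
    kappa = np.abs(cross) / np.maximum(speed[:-1] ** 3, 1e-3)
    return not np.any(kappa > kappa_max)
```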

This is where diffusion-based trajectory decoders offer a natural advantage. The denoising process can be guided to satisfy constraints, and the noise schedule controls the exploration-exploitation tradeoff in a physically meaningful way.

5. Loss Design: Multi-Objective Composition

The full training objective in a production driving RL system typically combines multiple loss terms:

$$\mathcal{L}_{\text{total}} = \lambda_{\text{pg}} \cdot \mathcal{L}_{\text{GRPO-clip}} + \lambda_{\text{kl}} \cdot \mathcal{L}_{\text{KL}} + \lambda_{\text{vf}} \cdot \mathcal{L}_{\text{value}} + \lambda_{\text{ent}} \cdot \mathcal{L}_{\text{entropy}} + \lambda_{\text{bc}} \cdot \mathcal{L}_{\text{BC}} + \sum_m \lambda_{\text{aux},m} \cdot \mathcal{L}_{\text{aux},m}$$

Each term plays a specific role:

Policy gradient loss ($\mathcal{L}_{\text{GRPO-clip}}$): The primary driver of policy improvement. The clip mechanism prevents destructively large updates, while the group-relative advantage provides a variance-reduced gradient signal.

KL divergence ($\mathcal{L}_{\text{KL}} = \mathbb{D}_{\text{KL}}[\pi_\theta \| \pi_{\text{ref}}]$): Constrains the policy from drifting too far from the reference (typically the SFT checkpoint). Without this, RL training can cause the model to "forget" its pre-trained capabilities, and it leaves the policy free to engage in reward hacking: finding loopholes in the reward function that produce high scores but low-quality trajectories.

Entropy bonus ($\mathcal{L}_{\text{entropy}} = -\mathbb{E}[\mathcal{H}(\pi_\theta)]$): Encourages exploration by preventing the policy from collapsing to a deterministic mode. In driving, this is essential for maintaining multi-modality: the model should continue to propose diverse plausible trajectories rather than converging to a single average solution.

Behavior cloning loss ($\mathcal{L}_{\text{BC}}$): An auxiliary imitation loss computed on expert demonstrations. This acts as a regularizer that prevents the policy from departing too far from safe, human-like driving behavior. It is particularly important in early RL training when the reward signal may be noisy or sparse.

Value function loss ($\mathcal{L}_{\text{value}}$): When a value network is used (as in PPO), this is the regression loss for training $V_\psi$. In GRPO-based systems, this term is absent, but it may still appear in hybrid approaches that combine GRPO advantages with a learned baseline for additional variance reduction.

Other auxiliary losses ($\mathcal{L}_{\text{aux},m}$): Domain-specific terms such as reconstruction loss for diffusion decoders, collision prediction loss, or comfort regularization. These are typically small in magnitude but provide important inductive biases.

The coefficients $\{\lambda\}$ are critical hyperparameters. In practice, they are tuned through a combination of grid search and manual adjustment. A common pattern is to start with a high $\lambda_{\text{bc}}$ (strong imitation regularization) and gradually anneal it as RL training stabilizes, allowing the policy gradient signal to dominate.
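
A minimal sketch of such an annealing schedule (the specific values are illustrative, not drawn from any particular system):

```python
def bc_coefficient(step, lambda_init=1.0, lambda_final=0.1,
                   anneal_steps=10_000):
    """Linearly anneal the behavior-cloning weight as RL stabilizes."""
    frac = min(step / anneal_steps, 1.0)
    return lambda_init + frac * (lambda_final - lambda_init)
```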

Putting the pieces together, the full training loop:

```python
# Pseudocode: GRPO training loop for driving.
# sample_G, reward_model, log_prob, kl_divergence, entropy, and
# imitation_loss are placeholders for system-specific components.
for iteration in range(num_iterations):
    for scene in batch:
        # 1. Sample a group of G trajectories from the old policy
        trajectories = sample_G(pi_theta_old, scene, G=16)

        # 2. Score each trajectory with the reward model
        rewards = [reward_model(traj, scene) for traj in trajectories]

        # 3. Compute group-relative advantages
        r_mean = mean(rewards)
        r_std = std(rewards) + eps          # eps guards against zero std
        advantages = [(r - r_mean) / r_std for r in rewards]

        # 4. Clipped surrogate loss, averaged over the group
        losses = []
        for traj, A in zip(trajectories, advantages):
            rho = exp(log_prob(pi_theta, traj, scene)
                      - log_prob(pi_theta_old, traj, scene))
            losses.append(-min(rho * A,
                               clip(rho, 1 - eps_clip, 1 + eps_clip) * A))
        L_pg = mean(losses)

        # 5. Regularization and auxiliary losses
        L_kl = kl_divergence(pi_theta, pi_ref)
        L_ent = -entropy(pi_theta)
        L_bc = imitation_loss(pi_theta, expert_data)

        # 6. Total loss and gradient step
        L_total = (lambda_pg * L_pg + lambda_kl * L_kl
                   + lambda_ent * L_ent + lambda_bc * L_bc)
        theta = theta - alpha * grad(L_total, theta)
```

6. Diffusion, Noise, and Exploration

For diffusion-based trajectory decoders, the relationship between noise and exploration deserves special attention. In the standard diffusion process, a clean trajectory $x_0$ is corrupted by adding Gaussian noise over $T$ steps:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

At inference time, the model denoises from $x_T$ (pure noise) back to $x_0$ (a clean trajectory). The initial noise $\epsilon$ determines which trajectory is generated. In the RL context, this noise plays a role analogous to temperature in LLM sampling—but with a crucial difference.
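
A sketch of the forward noising step, showing how a group of noise draws seeds a group of distinct candidates (the shapes and the scalar $\bar{\alpha}_t$ are illustrative):

```python
import torch

def forward_noise(x0, alpha_bar_t, n_samples=16):
    """x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, drawn for a
    group of noise samples; each eps seeds a different candidate.

    x0: clean trajectory tensor, e.g. shape [T, 2];
    alpha_bar_t: cumulative noise-schedule scalar in (0, 1).
    """
    eps = torch.randn(n_samples, *x0.shape)
    return (alpha_bar_t ** 0.5) * x0 + ((1 - alpha_bar_t) ** 0.5) * eps
```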

Noise in diffusion-based driving is not merely “sampling randomness.” It directly determines:

  1. Exploration range: The magnitude and structure of the noise control how far the generated trajectories can deviate from the mean. Larger noise leads to more diverse candidates.
  2. Candidate trajectory morphology: Different noise realizations produce qualitatively different trajectory shapes—lane-change vs. lane-follow, aggressive vs. conservative, left vs. right. The noise does not just shift a trajectory; it can change its mode.
  3. Group distribution quality: For GRPO, the advantage estimation depends on the group of samples having meaningful reward variance. If the noise is too small, all trajectories are nearly identical, and the group-relative advantage is dominated by noise rather than signal. If the noise is too large, many trajectories become physically invalid, and the reward signal becomes uninformative.

This creates a three-way tension in noise scheduling:

  • Effective diversity: The noise must be large enough to produce trajectories with meaningfully different rewards, enabling the group-relative advantage to separate good from bad.
  • Trajectory validity: The noise must be small enough (or the denoising process must be constrained enough) to keep trajectories within the kinematically feasible and scene-consistent region.
  • Alignment with training objectives: The exploration direction should be consistent with what the reward function actually measures. Noise that produces diverse but reward-irrelevant variations (e.g., tiny lateral shifts that do not affect collision safety) wastes sampling budget.

In practice, these tensions are addressed through a combination of constrained diffusion (guiding the denoising process with kinematic constraints), adaptive noise scheduling (adjusting noise levels based on scene complexity), and rejection sampling (discarding trajectories that violate hard constraints before computing rewards).

7. Summary

| Component | REINFORCE | PPO | GRPO |
| --- | --- | --- | --- |
| Objective | Maximize $J(\theta) = \mathbb{E}[R(\tau)]$ | Clipped surrogate | Clipped surrogate |
| Advantage | $R(\tau)$ (raw return) | GAE via learned $V_\psi$ | Group-relative normalization |
| Baseline | None | Learned value function | Group mean reward |
| Value network | No | Yes (same scale as policy) | No |
| Variance | Very high | Low (GAE + learned baseline) | Moderate (group statistics) |
| Update constraint | None | Clip ratio $[1-\varepsilon, 1+\varepsilon]$ | Clip ratio $[1-\varepsilon, 1+\varepsilon]$ |
| KL regularization | None | In reward (affects advantage) | In loss (independent of advantage) |
| Memory | 1x | ~2x | ~1x |
| Driving applicability | Baseline only | General purpose | VLM planning, group-sampled scenarios |

The progression from REINFORCE to PPO to GRPO represents a trajectory of increasing practical efficiency: REINFORCE establishes the theoretical foundation, PPO introduces stable optimization through clipping and learned baselines, and GRPO removes the expensive value network by exploiting the group structure of the sampling process. For autonomous driving, GRPO is particularly attractive because the contextual bandit structure of trajectory planning naturally produces group-sampled outputs, and the absence of a value network simplifies the training pipeline for already complex end-to-end models.

However, GRPO is not a universal replacement for PPO. In settings where per-token advantages matter (e.g., sequential decision-making with meaningful intermediate states), GAE provides a richer signal than the per-output advantage of GRPO. The choice between the two should be guided by the structure of the problem: contextual bandit with group sampling favors GRPO; sequential MDP with long horizons favors PPO.

References

  1. Sutton, R.S., McAllester, D.A., Singh, S.P., & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. NeurIPS.

  2. Williams, R.J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4), 229-256.

  3. Schulman, J., Moritz, P., Levine, S., Jordan, M.I., & Abbeel, P. (2016). High-dimensional continuous control using generalized advantage estimation. ICLR.

  4. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv:1707.06347.

  5. Shao, Z., Wang, P., Zhu, Q., et al. (2024). DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv:2402.03300. GRPO is introduced in Section 4 of this paper.

  6. Jiang, B., Chen, S., Zhang, Q., Liu, W., & Wang, X. (2025). AlphaDrive: Unleashing the power of VLMs in autonomous driving via GRPO-based reasoning and planning. arXiv:2503.07608.