Policy Optimization for End-to-End Autonomous Driving: From REINFORCE to GRPO
1. Why End-to-End Driving Needs Reinforcement Learning Supervised learning—whether through imitation learning or behavior cloning—can only take an autonomous driving system so far. The fundamental limitation is distributional: the training data is drawn from expert demonstrations, and any distributional shift between training and deployment leads to compounding errors. More critically, supervised objectives are misaligned with the true goal of driving. Minimizing the L2 distance to a ground-truth trajectory penalizes safe deviations as harshly as dangerous ones, and provides no mechanism for the model to discover better trajectories than those in the dataset. ...