REINFORCE: Direct Policy Optimization After Deep Q-Learning
Reinforcement Learning
Policy Gradients
PyTorch
Full PyTorch implementation of the REINFORCE policy gradient algorithm with baseline variance reduction on a continuous-state grid environment.
Why REINFORCE After DQN?
In Deep Q-Networks: From Tables to Neural Function Approximators, we kept the Bellman target and replaced the Q-table with a neural network. That solved state-space scaling, but it kept one central design choice: the policy is still implicit, obtained by an argmax over the estimated Q-values.
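To make the contrast concrete, here is a minimal, framework-free sketch of an implicit policy next to an explicit one. It is an illustration under assumptions, not the implementation developed in this article: the function names, the four-action setup, the numbers, and the choice of a softmax over action preferences are all hypothetical.

```python
import math
import random


def greedy_action(q_values):
    """Implicit policy: the action is whatever maximizes the Q-values."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def softmax(logits):
    """Turn raw action preferences into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(logits, rng=random):
    """Explicit stochastic policy: sample an action from softmax(logits)."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical values for a single state with four actions.
q_values = [0.1, 0.7, 0.3, 0.2]   # assumed output of a Q-network
logits = [0.0, 1.5, 0.5, -1.0]    # assumed output of a policy network

print(greedy_action(q_values))    # deterministic: depends only on the Q-values
print(softmax(logits))            # action probabilities the policy represents
print(sample_action(logits))      # stochastic: varies from call to call
```

In the first case the behavior is a by-product of value estimates; in the second, the action distribution itself is the object being represented, which is what a policy gradient method adjusts directly.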