Expected SARSA: Bridging On-Policy and Off-Policy TD Control
Reinforcement Learning
TD Learning
Expected SARSA
How Expected SARSA averages over policy distributions to reduce update variance compared to SARSA and Q-Learning, with code and policy visualization.
Why Expected SARSA?
In a previous post, we placed SARSA and Q-Learning side by side and observed the clearest expression of the on-policy vs off-policy divide. That comparison surfaced a deeper tension: SARSA is honest about the cost of exploration, but it carries sampling noise into every update; Q-Learning bootstraps from the imagined best action, gaining speed at the cost of fidelity to what the agent will actually do. Both compromises are measurable.
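To make the contrast concrete, here is a minimal sketch of the three one-step TD targets for a single transition, assuming an ε-greedy behavior policy (the function name `td_targets` and the specific numbers are illustrative, not from the earlier post). SARSA bootstraps from a sampled next action, Q-Learning from the greedy one, and Expected SARSA from the expectation over the policy's action distribution:

```python
import numpy as np

def td_targets(q_next, reward, gamma=0.99, epsilon=0.1, seed=0):
    """Compare the one-step TD targets of SARSA, Q-Learning, and
    Expected SARSA given Q(s', .) for a single transition."""
    n_actions = len(q_next)
    rng = np.random.default_rng(seed)

    # SARSA: bootstrap from the action actually sampled by the
    # epsilon-greedy behavior policy -- the sample carries noise.
    if rng.random() < epsilon:
        a_next = int(rng.integers(n_actions))
    else:
        a_next = int(np.argmax(q_next))
    sarsa = reward + gamma * q_next[a_next]

    # Q-Learning: bootstrap from the greedy action, ignoring the
    # exploration the agent will actually perform.
    q_learning = reward + gamma * np.max(q_next)

    # Expected SARSA: average over the epsilon-greedy distribution,
    # removing the sampling noise while staying faithful to the policy.
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(q_next)] += 1.0 - epsilon
    expected_sarsa = reward + gamma * float(np.dot(probs, q_next))

    return sarsa, q_learning, expected_sarsa
```

Note that the Expected SARSA target never exceeds the Q-Learning target, since an expectation under any distribution is at most the maximum; the gap between the two is exactly the exploration cost that Q-Learning's greedy bootstrap ignores.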