This text draws primarily from course materials for PA230 Reinforcement Learning, taught by Petr Novotný. Any errors or inaccuracies are my own.
Value & Policy Iteration
Problem: How to compute the optimal value vector and find an optimal policy in an MDP.
Solution: Use linear programming.
Problem: Linear programming is computationally slow.
Solution: Repeatedly apply Bellman updates until convergence -> Value Iteration
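A minimal tabular sketch of Value Iteration (not taken from the course materials); P and R are hypothetical arrays holding the transition probabilities and expected rewards:

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-8):
    """P[s, a, s'] = transition probability, R[s, a] = expected reward.
    Repeatedly applies the Bellman optimality update until the value
    vector stops changing (up to tol)."""
    n_states, n_actions, _ = P.shape
    v = np.zeros(n_states)
    while True:
        # Q(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) * v(s')
        q = R + gamma * P @ v
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)  # optimal values and a greedy policy
        v = v_new
```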
Problem: How do we turn the optimal values into a policy?
Solution: Choose actions greedily with respect to the computed values.
Problem: We don't actually need exact values to find an optimal policy, so Value Iteration does more work than necessary.
Solution: Start with any policy, evaluate it (exactly or approximately), make it greedy with respect to the computed values, and repeat until the policy stops changing -> Policy Iteration
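A sketch of Policy Iteration over the same hypothetical P/R arrays; evaluation is done exactly here by solving a linear system, but an iterative (approximate) evaluation slots in the same way:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.99):
    """Alternate policy evaluation with greedy improvement until stable."""
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Evaluation: v = R_pi + gamma * P_pi v  =>  (I - gamma * P_pi) v = R_pi
        P_pi = P[np.arange(n_states), policy]          # (S, S)
        R_pi = R[np.arange(n_states), policy]          # (S,)
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Improvement: act greedily with respect to the computed values.
        new_policy = (R + gamma * P @ v).argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, v
        policy = new_policy
```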
Monte Carlo
Problem: We may not have complete access to the MDP - no explicit enumeration of states and actions, and no known transition probabilities or rewards; we can only interact with the environment.
Solution: Sample trajectories and estimate each state's value as the average return observed after visiting it, counting either the first or every visit in an episode -> First/Every-visit Monte Carlo Prediction
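A first-visit Monte Carlo prediction sketch; the assumption that each sampled episode is a list of (state, reward) pairs is mine, not from the course:

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=0.99):
    """episodes: iterable of [(state, reward), ...] trajectories.
    Estimates v(s) as the average return following the first visit to s."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        g, returns = 0.0, []
        # Backwards pass: compute the return from each time step.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns.append((state, g))
        returns.reverse()
        seen = set()
        for state, g in returns:
            if state not in seen:              # first visit only
                seen.add(state)
                returns_sum[state] += g
                returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```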
Problem: How to use Monte Carlo evaluation to find better policies?
Solution: Estimate action values q(s, a) instead of state values; after each evaluation, make the policy greedy with respect to the current estimates -> Monte Carlo Control
Problem: Some state-action pairs won't be explored if the state is never visited or the action is never chosen.
Solution: Begin each episode in a random state with a random first action -> Exploring Starts
Problem: Random starts may not be possible in some environments (like robotics).
Solution: Use an ε-greedy policy: with probability ε choose a random action, otherwise take the current greedy action -> ε-greedy Monte Carlo Control
Problem: What if we need to evaluate or improve one policy (the target) while the data is generated by a different, exploratory one (the behaviour policy)?
Solution: Use importance sampling to correct for the difference between the behaviour and target policies -> Off-policy Monte Carlo
Temporal Difference
Problem: Monte Carlo has high variance and requires waiting until the end of an episode before updating.
Solution: Instead of using the full trajectory return, use the reward plus the next state's value (bootstrapping) as the target -> TD(0) Evaluation
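A tabular TD(0) evaluation sketch; the Gym-style reset()/step() interface and the policy(state) callable are assumptions for illustration:

```python
from collections import defaultdict

def td0_evaluation(env, policy, n_episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular TD(0): after every step, nudge v(s) towards the
    bootstrapped target r + gamma * v(s')."""
    v = defaultdict(float)
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            target = reward + (0.0 if done else gamma * v[next_state])
            v[state] += alpha * (target - v[state])    # TD(0) update
            state = next_state
    return v
```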
Problem: How to do control (find optimal policy) with TD?
Solution: Learn q-values and update after every step, with the TD target using the action the ε-greedy policy actually takes next -> SARSA
Problem: SARSA pushes q-values towards the values of the current ε-greedy policy, but ideally we want the optimal values.
Solution: Use the best next-state action in the TD-target calculation -> Q-learning
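A tabular Q-learning sketch with an ε-greedy behaviour policy (again assuming a Gym-style environment that exposes env.action_space.n):

```python
import random
from collections import defaultdict

def q_learning(env, n_episodes=5000, alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular Q-learning: the TD target uses max_a' Q(s', a'), regardless
    of which action the behaviour policy will actually take next."""
    n_actions = env.action_space.n
    q = defaultdict(lambda: [0.0] * n_actions)

    def eps_greedy(state):
        if random.random() < eps:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: q[state][a])

    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            action = eps_greedy(state)
            next_state, reward, done, _ = env.step(action)
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q
```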
Problem: Q-learning takes a max over noisy estimates in its target, so it's positively biased when there's uncertainty (maximization bias).
Solution: Keep two independent Q estimates; use one to select the best next action and the other to evaluate it -> Double Q-learning
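A sketch of a single Double Q-learning update, assuming q1 and q2 are tabular estimates stored like the defaultdict in the Q-learning sketch above:

```python
import random

def double_q_update(q1, q2, state, action, reward, next_state, done,
                    alpha=0.1, gamma=0.99):
    """A coin flip decides which table is updated; the *other* table
    evaluates the action selected by the first, removing the positive
    bias of maxing over noisy estimates."""
    if random.random() < 0.5:
        q_sel, q_eval = q1, q2
    else:
        q_sel, q_eval = q2, q1
    if done:
        target = reward
    else:
        best = max(range(len(q_sel[next_state])),
                   key=lambda a: q_sel[next_state][a])
        target = reward + gamma * q_eval[next_state][best]
    q_sel[state][action] += alpha * (target - q_sel[state][action])
```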
Problem: Can we get something between full Monte Carlo and TD(0)?
Solution: Look ahead n steps -> n-step SARSA/Q-learning
Problem: It's hard to pick the right value of n.
Solution: Compute n-step returns for all n and combine them with geometrically decaying weights controlled by λ -> TD(λ)
Problem: With this forward view of TD(λ), we still have to wait until the episode ends before updating.
Solution: Keep an eligibility trace for each state and spread every TD error across states in proportion to their traces, so states visited more recently and more often get bigger updates -> Backward-view TD(λ)
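A backward-view TD(λ) sketch with accumulating eligibility traces (same assumed env/policy interface as the TD(0) sketch above):

```python
from collections import defaultdict

def td_lambda_backward(env, policy, n_episodes=1000,
                       alpha=0.1, gamma=0.99, lam=0.9):
    """Every state's trace decays by gamma*lambda each step and is bumped
    on a visit; the current TD error is applied to all states in
    proportion to their traces."""
    v = defaultdict(float)
    for _ in range(n_episodes):
        traces = defaultdict(float)
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            delta = reward + (0.0 if done else gamma * v[next_state]) - v[state]
            traces[state] += 1.0                      # accumulate trace on visit
            for s in list(traces):
                v[s] += alpha * delta * traces[s]     # credit by eligibility
                traces[s] *= gamma * lam              # decay
            state = next_state
    return v
```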
Value-Based On-Policy Methods
Problem: There are too many states to store values in tables.
Solution: Use function approximation with neural nets.
Problem: Find parameters such that the network outputs a state's value when given that state's representation (policy evaluation).
Solution: Minimize the mean squared error between the network's prediction v̂(s, w) and the true value v(s), weighted by how often each state is visited.
Problem: We don't know the true values v(s).
Solution: Replace them with estimated targets:
MC return as target -> Gradient MC evaluation
TD(0) target (bootstrapped) -> Semi-gradient TD(0)
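A semi-gradient TD(0) sketch with a linear approximator v̂(s, w) = wᵀx(s); the featurize helper is a hypothetical name. "Semi-gradient" because the bootstrapped target is treated as a constant and not differentiated through:

```python
import numpy as np

def semi_gradient_td0(env, policy, featurize, n_features,
                      n_episodes=1000, alpha=0.01, gamma=0.99):
    """Linear semi-gradient TD(0); featurize(state) returns a feature vector."""
    w = np.zeros(n_features)
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            x = featurize(state)
            target = reward + (0.0 if done else gamma * featurize(next_state) @ w)
            w += alpha * (target - x @ w) * x   # gradient of v_hat w.r.t. w is x
            state = next_state
    return w
```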
Problem: How to do control with function approximation?
Solution: Estimate q-values in the same way, but with the SARSA target -> Semi-gradient SARSA
Deep Q-Networks (DQN)
Problem: Off-policy algorithms with function approximation suffer from instability - the combination of function approximation, bootstrapping, and off-policy training is known as the "deadly triad".
Solution: Do it anyway, but stabilize the training with the tricks below (a minimal update-step sketch follows the list):
Store and replay past experience
Use separate target network
Double DQN
Dueling networks
N-step returns
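A PyTorch sketch of the two core stabilizers, a replay buffer and a frozen target network, inside a single DQN update step; Double DQN, dueling heads, and n-step returns are omitted, and all names are illustrative:

```python
import random
from collections import deque

import torch
import torch.nn as nn

class ReplayBuffer:
    """Fixed-size buffer of past transitions; random sampling breaks
    the correlations of consecutive experience."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, *transition):                 # (s, a, r, s', done)
        self.buffer.append(transition)

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s2, d = zip(*batch)
        return (torch.tensor(s, dtype=torch.float32),
                torch.tensor(a, dtype=torch.int64),
                torch.tensor(r, dtype=torch.float32),
                torch.tensor(s2, dtype=torch.float32),
                torch.tensor(d, dtype=torch.float32))

def dqn_update(q_net, target_net, optimizer, buffer, batch_size=64, gamma=0.99):
    """One DQN gradient step: TD targets come from the frozen target network,
    which is synced with q_net only every few thousand steps elsewhere."""
    s, a, r, s2, d = buffer.sample(batch_size)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s, a)
    with torch.no_grad():
        target = r + gamma * (1 - d) * target_net(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```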
Policy Gradient Methods
Problem: We want to work directly with the policy instead of values.
Solution: Use a parametric representation of the policy instead of Q; a softmax output layer turns the parameters into a distribution over actions.
Problem: How to compute the gradient of the expected return so we can do gradient ascent?
Solution: Use the policy gradient theorem to derive it -> Vanilla MC policy gradient
Problem: Every action on a trajectory is reinforced in proportion to the whole trajectory's return, without distinguishing which actions actually helped.
Solution: Weight each action by the return from that step onward instead (step-wise gradient) -> REINFORCE
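A PyTorch sketch of the step-wise REINFORCE update; the assumption that the episode is stored as (state tensor, action index, reward) triples and that policy_net maps a state to action logits is mine:

```python
import torch

def reinforce_update(policy_net, optimizer, episode, gamma=0.99):
    """One REINFORCE update from a single episode.  Each log-probability
    is weighted by the return from that step onward (reward-to-go),
    not by the whole trajectory's return."""
    returns, g = [], 0.0
    for _, _, reward in reversed(episode):
        g = reward + gamma * g
        returns.append(g)
    returns.reverse()
    returns = torch.tensor(returns, dtype=torch.float32)

    loss = 0.0
    for (state, action, _), g_t in zip(episode, returns):
        logits = policy_net(state)                       # unnormalized action scores
        log_prob = torch.log_softmax(logits, dim=-1)[action]
        loss = loss - log_prob * g_t                     # gradient ascent on E[return]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```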
Problem: High variance in updates.
Solution: Compare returns to a reference value (baseline) to better judge if the outcome was actually above average from that state.
Problem: v(s) is a good baseline, but we don't know its true value.
Solution: Estimate it with gradient every-visit MC -> REINFORCE with baseline
Problem: The return G used in these updates is still a source of variance.
Solution: Remove it with bootstrapping, just like the move from MC to TD methods:
Actor: the policy that picks actions
Critic: the value estimate whose temporal-difference error tells how good the action was
-> Actor-Critic
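A one-step Actor-Critic sketch; actor and critic are assumed to be small torch modules mapping a state tensor to action logits and to a scalar value, respectively:

```python
import torch

def actor_critic_step(actor, critic, actor_opt, critic_opt,
                      state, action, reward, next_state, done, gamma=0.99):
    """The critic's TD error plays the role of the advantage."""
    with torch.no_grad():
        target = reward + (0.0 if done else gamma * critic(next_state))
    td_error = target - critic(state)

    # Critic: move v(s) towards the bootstrapped target.
    critic_loss = td_error.pow(2)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: raise the log-probability of the action, scaled by the TD error.
    log_prob = torch.log_softmax(actor(state), dim=-1)[action]
    actor_loss = -log_prob * td_error.detach()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```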
TRPO, PPO
Problem: Training is unstable with large updates - we can't be sure that a step in parameter space actually improves the policy.
Solution: In each step, optimize a surrogate objective built around the previous policy π: the old policy's return plus the expected advantage of the actions the new policy π' would take.
Problem: We don't know the actual advantages.
Solution: Replace them with a neural-network estimate.
Problem: We can't sample actions from π' - our data was collected with π.
Solution: Rewrite the objective with importance sampling; this is a good approximation only when π' is close to π, i.e. inside a "trust region" around π.
Problem: How to enforce the trust region constraint?
Solution: Penalize the KL divergence between π and π' -> TRPO (Equation 9 in the TRPO paper)
Problem: The steps with the recommended KL coefficient are too small.
Solution 1: Maximize the objective subject to a KL-divergence constraint -> TRPO (Equation 12 in the TRPO paper)
Solution 2: Change it adaptively -> Adaptive KL Penalty (from PPO)
Problem: Ideally, we'd solve an unconstrained optimization problem.
Solution: Use a clipped objective that keeps the importance ratio π'(a|s)/π(a|s) within [1 - ε, 1 + ε], preventing too-large policy changes (from PPO).
Problem: How to estimate advantages effectively?
Solution: Use generalized advantage estimation (GAE).
Solution: Clipped loss + GAE + entropy regularization -> Proximal Policy Optimization (PPO)
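A sketch of the two PPO ingredients named above, GAE and the clipped surrogate loss with an entropy bonus; the value-function loss and the data-collection loop are omitted, and names and hyperparameters are illustrative:

```python
import torch

def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation: an exponentially weighted sum of
    TD errors, trading bias for variance via lambda.  `values` has one
    extra entry for the state after the last step."""
    advantages, adv = [], 0.0
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        adv = delta + gamma * lam * not_done * adv
        advantages.append(adv)
    advantages.reverse()
    return torch.tensor(advantages, dtype=torch.float32)

def ppo_loss(new_log_probs, old_log_probs, advantages, entropy,
             clip_eps=0.2, ent_coef=0.01):
    """PPO's clipped surrogate objective (to be maximized, hence the minus).
    The ratio pi'(a|s)/pi(a|s) is clipped to [1-eps, 1+eps] so a single
    update cannot move the policy too far from the one that collected the data."""
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -(torch.min(unclipped, clipped).mean() + ent_coef * entropy.mean())
```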