This text draws primarily from course materials for PA230 Reinforcement Learning, taught by Petr Novotný. Any errors or inaccuracies are my own.
Value & Policy Iteration
Problem: How to compute the optimal value vector and find an optimal policy in an MDP.
Solution: Use linear programming.
Problem: Linear programming is computationally slow.
Solution: Repeatedly apply Bellman updates until convergence -> Value Iteration
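A minimal tabular sketch of Value Iteration (not taken from the course materials); P and R are hypothetical arrays holding the transition probabilities and expected rewards:

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-8):
    """P[s, a, s'] = transition probability, R[s, a] = expected reward.
    Repeatedly applies the Bellman optimality update until the value
    vector stops changing (up to tol)."""
    n_states, n_actions, _ = P.shape
    v = np.zeros(n_states)
    while True:
        # Q(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) * v(s')
        q = R + gamma * P @ v
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)  # optimal values and a greedy policy
        v = v_new
```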
Problem: How do we turn the optimal values into a policy?
Solution: Choose actions greedily with respect to the computed values.
Problem: We don't actually need exact values to find an optimal policy, so Value Iteration does more work than necessary.
Solution: Start with any policy, evaluate it (exactly or approximately), make it greedy with respect to the computed values, and repeat until the policy stops changing -> Policy Iteration
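A sketch of Policy Iteration over the same hypothetical P/R arrays; evaluation is done exactly here by solving a linear system, but an iterative (approximate) evaluation slots in the same way:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.99):
    """Alternate policy evaluation with greedy improvement until stable."""
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Evaluation: v = R_pi + gamma * P_pi v  =>  (I - gamma * P_pi) v = R_pi
        P_pi = P[np.arange(n_states), policy]          # (S, S)
        R_pi = R[np.arange(n_states), policy]          # (S,)
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Improvement: act greedily with respect to the computed values.
        new_policy = (R + gamma * P @ v).argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, v
        policy = new_policy
```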
Monte Carlo
Problem: We may not have complete access to the MDP - no explicit enumeration of states and actions, and no known transition probabilities or rewards; we can only interact with the environment.
Solution: Sample trajectories and estimate each state's value as the average return observed after visiting it, counting either the first or every visit in an episode -> First/Every-visit Monte Carlo Prediction
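A first-visit Monte Carlo prediction sketch; the assumption that each sampled episode is a list of (state, reward) pairs is mine, not from the course:

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=0.99):
    """episodes: iterable of [(state, reward), ...] trajectories.
    Estimates v(s) as the average return following the first visit to s."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        g, returns = 0.0, []
        # Backwards pass: compute the return from each time step.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns.append((state, g))
        returns.reverse()
        seen = set()
        for state, g in returns:
            if state not in seen:              # first visit only
                seen.add(state)
                returns_sum[state] += g
                returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```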
Problem: How to use Monte Carlo evaluation to find better policies?
Solution: Estimate action values q(s, a) instead of state values; after each evaluation, make the policy greedy with respect to the current estimates -> Monte Carlo Control
Problem: Some state-action pairs won't be explored if the state is never visited or the action is never chosen.
Solution: Begin each episode in a random state with a random first action -> Exploring Starts
Problem: Random starts may not be possible in some environments (like robotics).
Solution: Use an ε-greedy policy: with probability ε choose a random action, otherwise take the current greedy action -> ε-greedy Monte Carlo Control
Problem: What if we need to evaluate or improve one policy (the target) while the data is generated by a different, exploratory one (the behaviour policy)?
Solution: Use importance sampling to correct for the difference between the behaviour and target policies -> Off-policy Monte Carlo
Temporal Difference
Problem: Monte Carlo has high variance and requires waiting until the end of an episode before updating.
Solution: Instead of using the full trajectory return, use the reward plus the next state's value (bootstrapping) as the target -> TD(0) Evaluation
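A tabular TD(0) evaluation sketch; the Gym-style reset()/step() interface and the policy(state) callable are assumptions for illustration:

```python
from collections import defaultdict

def td0_evaluation(env, policy, n_episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular TD(0): after every step, nudge v(s) towards the
    bootstrapped target r + gamma * v(s')."""
    v = defaultdict(float)
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            target = reward + (0.0 if done else gamma * v[next_state])
            v[state] += alpha * (target - v[state])    # TD(0) update
            state = next_state
    return v
```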
Problem: How to do control (find optimal policy) with TD?
Solution: Learn q-values and update after every step, with the TD target using the action the ε-greedy policy actually takes next -> SARSA
Problem: SARSA pushes q-values towards the values of the current ε-greedy policy, but ideally we want the optimal values.
Solution: Use the best next-state action in the TD-target calculation -> Q-learning
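A tabular Q-learning sketch with an ε-greedy behaviour policy (again assuming a Gym-style environment that exposes env.action_space.n):

```python
import random
from collections import defaultdict

def q_learning(env, n_episodes=5000, alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular Q-learning: the TD target uses max_a' Q(s', a'), regardless
    of which action the behaviour policy will actually take next."""
    n_actions = env.action_space.n
    q = defaultdict(lambda: [0.0] * n_actions)

    def eps_greedy(state):
        if random.random() < eps:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: q[state][a])

    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            action = eps_greedy(state)
            next_state, reward, done, _ = env.step(action)
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q
```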
Problem: Q-learning takes a max over noisy estimates in its target, so it's positively biased when there's uncertainty (maximization bias).
Solution: Keep two independent Q estimates; use one to select the best next action and the other to evaluate it -> Double Q-learning
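A sketch of a single Double Q-learning update, assuming q1 and q2 are tabular estimates stored like the defaultdict in the Q-learning sketch above:

```python
import random

def double_q_update(q1, q2, state, action, reward, next_state, done,
                    alpha=0.1, gamma=0.99):
    """A coin flip decides which table is updated; the *other* table
    evaluates the action selected by the first, removing the positive
    bias of maxing over noisy estimates."""
    if random.random() < 0.5:
        q_sel, q_eval = q1, q2
    else:
        q_sel, q_eval = q2, q1
    if done:
        target = reward
    else:
        best = max(range(len(q_sel[next_state])),
                   key=lambda a: q_sel[next_state][a])
        target = reward + gamma * q_eval[next_state][best]
    q_sel[state][action] += alpha * (target - q_sel[state][action])
```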
Problem: Can we get something between full Monte Carlo and TD(0)?
Solution: Look ahead n steps -> n-step SARSA/Q-learning
Problem: It's hard to pick the right value of n.
Solution: Compute n-step returns for all n and combine them with geometrically decaying weights controlled by λ -> TD(λ)
Problem: With this forward view of TD(λ), we still have to wait until the episode ends before updating.
Solution: Keep an eligibility trace for each state and spread every TD error across states in proportion to their traces, so states visited more recently and more often get bigger updates -> Backward-view TD(λ)
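A backward-view TD(λ) sketch with accumulating eligibility traces (same assumed env/policy interface as the TD(0) sketch above):

```python
from collections import defaultdict

def td_lambda_backward(env, policy, n_episodes=1000,
                       alpha=0.1, gamma=0.99, lam=0.9):
    """Every state's trace decays by gamma*lambda each step and is bumped
    on a visit; the current TD error is applied to all states in
    proportion to their traces."""
    v = defaultdict(float)
    for _ in range(n_episodes):
        traces = defaultdict(float)
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            delta = reward + (0.0 if done else gamma * v[next_state]) - v[state]
            traces[state] += 1.0                      # accumulate trace on visit
            for s in list(traces):
                v[s] += alpha * delta * traces[s]     # credit by eligibility
                traces[s] *= gamma * lam              # decay
            state = next_state
    return v
```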
Value-Based On-Policy Methods
Problem: There are too many states to store values in tables.
Solution: Use function approximation with neural nets.
Problem: Find parameters such that the network outputs a state's value when given that state's representation (policy evaluation).
Solution: Minimize the mean squared error between the network's prediction v̂(s, w) and the true value v(s), weighted by how often each state is visited.
Problem: We don't know the true values v(s).
Solution: Replace them with estimated targets:
MC return as target -> Gradient MC evaluation
TD(0) target (bootstrapped) -> Semi-gradient TD(0)
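A semi-gradient TD(0) sketch with a linear approximator v̂(s, w) = wᵀx(s); the featurize helper is a hypothetical name. "Semi-gradient" because the bootstrapped target is treated as a constant and not differentiated through:

```python
import numpy as np

def semi_gradient_td0(env, policy, featurize, n_features,
                      n_episodes=1000, alpha=0.01, gamma=0.99):
    """Linear semi-gradient TD(0); featurize(state) returns a feature vector."""
    w = np.zeros(n_features)
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            x = featurize(state)
            target = reward + (0.0 if done else gamma * featurize(next_state) @ w)
            w += alpha * (target - x @ w) * x   # gradient of v_hat w.r.t. w is x
            state = next_state
    return w
```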
Problem: How to do control with function approximation?
Solution: Estimate q-values in the same way, but with the SARSA target -> Semi-gradient SARSA
Deep Q-Networks (DQN)
Problem: Off-policy algorithms with function approximation suffer from instability - the combination of function approximation, bootstrapping, and off-policy training is known as the "deadly triad".
Solution: Do it anyway, but stabilize the training with the tricks below (a minimal update-step sketch follows the list):
Store and replay past experience
Use separate target network
Double DQN
Dueling networks
N-step returns
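A PyTorch sketch of the two core stabilizers, a replay buffer and a frozen target network, inside a single DQN update step; Double DQN, dueling heads, and n-step returns are omitted, and all names are illustrative:

```python
import random
from collections import deque

import torch
import torch.nn as nn

class ReplayBuffer:
    """Fixed-size buffer of past transitions; random sampling breaks
    the correlations of consecutive experience."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, *transition):                 # (s, a, r, s', done)
        self.buffer.append(transition)

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s2, d = zip(*batch)
        return (torch.tensor(s, dtype=torch.float32),
                torch.tensor(a, dtype=torch.int64),
                torch.tensor(r, dtype=torch.float32),
                torch.tensor(s2, dtype=torch.float32),
                torch.tensor(d, dtype=torch.float32))

def dqn_update(q_net, target_net, optimizer, buffer, batch_size=64, gamma=0.99):
    """One DQN gradient step: TD targets come from the frozen target network,
    which is synced with q_net only every few thousand steps elsewhere."""
    s, a, r, s2, d = buffer.sample(batch_size)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s, a)
    with torch.no_grad():
        target = r + gamma * (1 - d) * target_net(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```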
Policy Gradient Methods
Problem: We want to work directly with the policy instead of values.
Solution: Use a parametric representation of the policy instead of Q; a softmax output layer turns the parameters into a distribution over actions.
Problem: How to compute the gradient of the expected return so we can do gradient ascent?
Solution: Use the policy gradient theorem to derive it -> Vanilla MC policy gradient
Problem: Every action on a trajectory is reinforced in proportion to the whole trajectory's return, without distinguishing which actions actually helped.
Solution: Weight each action by the return from that step onward instead (step-wise gradient) -> REINFORCE
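A PyTorch sketch of the step-wise REINFORCE update; the assumption that the episode is stored as (state tensor, action index, reward) triples and that policy_net maps a state to action logits is mine:

```python
import torch

def reinforce_update(policy_net, optimizer, episode, gamma=0.99):
    """One REINFORCE update from a single episode.  Each log-probability
    is weighted by the return from that step onward (reward-to-go),
    not by the whole trajectory's return."""
    returns, g = [], 0.0
    for _, _, reward in reversed(episode):
        g = reward + gamma * g
        returns.append(g)
    returns.reverse()
    returns = torch.tensor(returns, dtype=torch.float32)

    loss = 0.0
    for (state, action, _), g_t in zip(episode, returns):
        logits = policy_net(state)                       # unnormalized action scores
        log_prob = torch.log_softmax(logits, dim=-1)[action]
        loss = loss - log_prob * g_t                     # gradient ascent on E[return]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```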
Problem: High variance in updates.
Solution: Compare returns to a reference value (baseline) to better judge if the outcome was actually above average from that state.
Problem: v(s) is a good baseline, but we don't know its true value.
Solution: Estimate it with gradient every-visit MC -> REINFORCE with baseline
Problem: The return G used in these updates is still a source of variance.
Solution: Remove it with bootstrapping, just like the move from MC to TD methods:
Actor: the policy that picks actions
Critic: the value estimate whose temporal-difference error tells how good the action was
-> Actor-Critic
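A one-step Actor-Critic sketch; actor and critic are assumed to be small torch modules mapping a state tensor to action logits and to a scalar value, respectively:

```python
import torch

def actor_critic_step(actor, critic, actor_opt, critic_opt,
                      state, action, reward, next_state, done, gamma=0.99):
    """The critic's TD error plays the role of the advantage."""
    with torch.no_grad():
        target = reward + (0.0 if done else gamma * critic(next_state))
    td_error = target - critic(state)

    # Critic: move v(s) towards the bootstrapped target.
    critic_loss = td_error.pow(2)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: raise the log-probability of the action, scaled by the TD error.
    log_prob = torch.log_softmax(actor(state), dim=-1)[action]
    actor_loss = -log_prob * td_error.detach()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```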
TRPO, PPO
Problem: Training is unstable with large updates - we can't be sure that a step in parameter space actually improves the policy.
Solution: In each step, optimize a surrogate objective built around the previous policy π: the old policy's return plus the expected advantage of the actions the new policy π' would take.
Problem: We don't know the actual advantages.
Solution: Replace them with a neural-network estimate.
Problem: We can't sample actions from π' - our data was collected with π.
Solution: Rewrite the objective with importance sampling; this is a good approximation only when π' is close to π, i.e. inside a "trust region" around π.
Problem: How to enforce the trust region constraint?
Solution: Penalize the KL divergence between π and π' -> TRPO (Equation 9 in the TRPO paper)
Problem: The steps with the recommended KL coefficient are too small.
Solution 1: Maximize the objective subject to a KL-divergence constraint -> TRPO (Equation 12 in the TRPO paper)
Solution 2: Change it adaptively -> Adaptive KL Penalty (from PPO)
Problem: Ideally, we'd solve an unconstrained optimization problem.
Solution: Use a clipped objective that keeps the importance ratio π'(a|s)/π(a|s) within [1 - ε, 1 + ε], preventing too-large policy changes (from PPO).
Problem: How to estimate advantages effectively?
Solution: Use generalized advantage estimation (GAE).
Solution: Clipped loss + GAE + entropy regularization -> Proximal Policy Optimization (PPO)
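A sketch of the two PPO ingredients named above, GAE and the clipped surrogate loss with an entropy bonus; the value-function loss and the data-collection loop are omitted, and names and hyperparameters are illustrative:

```python
import torch

def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation: an exponentially weighted sum of
    TD errors, trading bias for variance via lambda.  `values` has one
    extra entry for the state after the last step."""
    advantages, adv = [], 0.0
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        adv = delta + gamma * lam * not_done * adv
        advantages.append(adv)
    advantages.reverse()
    return torch.tensor(advantages, dtype=torch.float32)

def ppo_loss(new_log_probs, old_log_probs, advantages, entropy,
             clip_eps=0.2, ent_coef=0.01):
    """PPO's clipped surrogate objective (to be maximized, hence the minus).
    The ratio pi'(a|s)/pi(a|s) is clipped to [1-eps, 1+eps] so a single
    update cannot move the policy too far from the one that collected the data."""
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -(torch.min(unclipped, clipped).mean() + ent_coef * entropy.mean())
```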