Model-Free Prediction: Reinforcement Learning

Part 4: Model-Free Prediction with Monte-Carlo Learning, Temporal-Difference Learning and TD(λ)

Previously, we looked at planning by dynamic programming to solve a known MDP. In this post, we will use model-free prediction to estimate the value function of an unknown MDP, i.e. we will look at policy evaluation of an unknown MDP. This series of blog posts contains a summary of concepts explained in Introduction to Reinforcement Learning by David Silver.

The three main methods that will be explained for model-free predictions are:

  • Monte-Carlo Learning
  • Temporal-Difference Learning
  • TD(λ)

This post will mainly look at evaluating a given policy in an unknown MDP and not finding the optimal policy.

Monte-Carlo Learning

Monte-Carlo methods are model-free and learn directly from episodes of experience. Monte-Carlo learns from complete episodes, with no bootstrapping. One drawback of MC is that it only applies to episodic Markov Decision Processes, where all episodes must terminate.

Model-Free: no knowledge of MDP transitions / rewards
Bootstrapping: update involves an estimate

Monte-Carlo Policy Evaluation

Goal: Given a policy π, learn v_π (value for the policy) from episodes of experience.

Episodes of experience are generated under policy π: S₁, A₁, R₂, …, S_T ∼ π
The return is the total discounted reward: Gₜ = Rₜ₊₁ + γRₜ₊₂ + … + γ^(T−1)R_T
The value function is the expected return: v_π(s) = E_π[Gₜ | Sₜ = s]

Monte-Carlo policy evaluation uses the empirical mean return instead of the expected return. Two approaches to evaluating the value function of a policy at a state are First-Visit Monte-Carlo Policy Evaluation and Every-Visit Monte-Carlo Policy Evaluation.

First-Visit Monte-Carlo Policy Evaluation

  1. Goal: evaluate the value of state s under a given policy.
  2. The first time-step t that state s is visited in an episode:
  3. Increment the counter: N(s) ← N(s) + 1
  4. Increment the total return: S(s) ← S(s) + Gₜ
  5. The value is estimated by the mean return: V(s) = S(s)/N(s)
  6. By the law of large numbers, V(s) → v_π(s) as N(s) → ∞

Every-Visit Monte-Carlo Policy Evaluation

  1. Goal: evaluate the value of state s under a given policy.
  2. Every time-step t that state s is visited in an episode:
  3. Increment the counter: N(s) ← N(s) + 1
  4. Increment the total return: S(s) ← S(s) + Gₜ
  5. The value is estimated by the mean return: V(s) = S(s)/N(s)
  6. By the law of large numbers, V(s) → v_π(s) as N(s) → ∞
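
Below is a minimal sketch of both variants in Python. It assumes episodes are given as lists of (state, reward) pairs, where the reward is the Rₜ₊₁ received after leaving that state; the function name mc_policy_evaluation and this episode format are illustrative, not from the lectures.

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma=1.0, first_visit=True):
    """Estimate V(s) by the empirical mean return.

    episodes: list of episodes, each a list of (state, reward) pairs,
              where reward is R_{t+1}, received after leaving the state.
    """
    N = defaultdict(int)    # visit counter N(s)
    S = defaultdict(float)  # total return S(s)

    for episode in episodes:
        # Compute the return G_t for every time-step by working backwards.
        G, returns = 0.0, []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()   # back to forward time order

        seen = set()
        for state, G in returns:
            if first_visit and state in seen:
                continue    # first-visit: only the first occurrence of s counts
            seen.add(state)
            N[state] += 1   # N(s) <- N(s) + 1
            S[state] += G   # S(s) <- S(s) + G_t

    return {s: S[s] / N[s] for s in N}   # V(s) = S(s) / N(s)
```

With first_visit=False the same function performs every-visit evaluation.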

In both of the above approaches we keep running totals and only compute the value once all the episodes have been collected. Instead, we can use the incremental mean equation to update the value estimate after every episode.

Incremental Mean
The mean µ₁, µ₂, … of a sequence x₁, x₂, … can be computed incrementally.

μₖ = (x₁ + x₂ + … + xₖ)/k = μₖ₋₁ + (1/k)(xₖ − μₖ₋₁)
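
As a quick sanity check, here is the incremental mean in a few lines of Python (a sketch; the function name incremental_mean is just illustrative):

```python
def incremental_mean(xs):
    """mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k, computed in a single pass."""
    mu = 0.0
    for k, x in enumerate(xs, start=1):
        mu += (x - mu) / k   # correct the old mean towards the new sample
    return mu

assert incremental_mean([2.0, 4.0, 6.0]) == 4.0
```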

Incremental Monte-Carlo Updates
Update V(s) incrementally after each episode S₁, A₁, R₂, …, S_T. For each state Sₜ with return Gₜ:

N(Sₜ) ← N(Sₜ) + 1
V(Sₜ) ← V(Sₜ) + (1/N(Sₜ))(Gₜ − V(Sₜ))

In non-stationary problems (where the environment drifts and we do not need to remember what happened a long time ago), we can track a running mean with a constant step size α, i.e. forget old episodes:

V(Sₜ) ← V(Sₜ) + α(Gₜ − V(Sₜ))
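
A sketch of both incremental update rules, assuming the same (state, reward) episode format as before; alpha=None gives the running mean with step size 1/N(s), while a constant alpha forgets old episodes:

```python
from collections import defaultdict

def incremental_mc_update(V, N, episode, gamma=1.0, alpha=None):
    """Update V in place after one complete episode (every-visit).

    episode: list of (state, reward) pairs, reward being R_{t+1}.
    alpha:   None  -> step size 1/N(s) (running mean);
             float -> constant step size (non-stationary problems).
    """
    G = 0.0
    for state, reward in reversed(episode):
        G = reward + gamma * G                  # return G_t from this state onwards
        N[state] += 1
        step = alpha if alpha is not None else 1.0 / N[state]
        V[state] += step * (G - V[state])       # V(S_t) <- V(S_t) + step * (G_t - V(S_t))

# Usage: V, N = defaultdict(float), defaultdict(int); call once per episode.
```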

Temporal-Difference Learning

Temporal-Difference (TD) learning is model-free: it learns directly from experience of interacting with the environment. TD learns from incomplete episodes by bootstrapping, i.e. updating a guess of the value function towards another guess.

In both MC and TD the goal is to learn v_π online from experience under policy π.
If we apply incremental every-visit Monte-Carlo, we update the value V(Sₜ) towards the actual return Gₜ:

V(Sₜ) ← V(Sₜ) + α(Gₜ − V(Sₜ))

The simplest temporal-difference learning algorithm, TD(0), differs in that we update the value V(Sₜ) towards an estimated return Rₜ₊₁ + γV(Sₜ₊₁):

V(Sₜ) ← V(Sₜ) + α(Rₜ₊₁ + γV(Sₜ₊₁) − V(Sₜ))

Rₜ₊₁ + γV(Sₜ₊₁) is the TD target and δₜ = Rₜ₊₁ + γV(Sₜ₊₁) − V(Sₜ) is the TD error.

TD updates the value function after every step, which lets it learn before the final outcome is known, unlike MC, which must wait until the end of the episode before the return is known. TD also works in continuing (non-terminating) environments, whereas MC only works for episodic (terminating) environments with complete sequences.
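
A sketch of a single TD(0) update, assuming each transition is given as Sₜ, Rₜ₊₁, Sₜ₊₁ plus a done flag, and V is a dict of current estimates (names are illustrative):

```python
def td0_update(V, s, r, s_next, done, gamma=1.0, alpha=0.1):
    """Apply one TD(0) update and return the TD error."""
    td_target = r + (0.0 if done else gamma * V[s_next])  # R_{t+1} + gamma * V(S_{t+1})
    td_error = td_target - V[s]                           # delta_t
    V[s] += alpha * td_error                              # V(S_t) <- V(S_t) + alpha * delta_t
    return td_error
```

Because the target only needs the next state, this update can be applied after every single step, long before the episode ends.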

An example that illustrates the difference between TD and MC is predicting how long it takes to drive home, updating the estimate at each state along the way.
In MC, we would assign each state the value observed at the end of the journey (the actual outcome).
In TD, we would update the value of each state along the way, using the effect that the next state has on the current estimate (the estimated outcome).

There is a trade-off between bias and variance. MC has high variance and zero bias, as it uses the return Gₜ, which depends on many random actions, transitions and rewards. It therefore has good convergence properties, even with function approximation, and is not sensitive to the initial value.

TD has low variance and some bias, as the TD target depends on only one random action, transition and reward. It is usually more efficient than MC. TD(0) converges to v_π(s), though not always with function approximation, and unlike MC it is more sensitive to the initial value.

Batch MC and TD

We have seen that MC and TD converge: V(s) → v_π(s) as experience → ∞.
But in practice we cannot go on forever, so what do these algorithms converge to when applied repeatedly to a batch of finite experience?

Suppose we have two states A, B with no discounting and 8 episodes of experience.

The eight episodes are:
A, 0, B, 0
B, 1
B, 1
B, 1
B, 1
B, 1
B, 1
B, 0

What is the value of state A, V(A)?
MC converges to the solution that best fits the observed returns, i.e. the one with minimum mean-squared error:

minimise Σₖ Σₜ (Gₜᵏ − V(Sₜᵏ))², summed over all episodes k and time-steps t

Therefore V(A) = 0, as the only episode in which state A appears has a return of 0.

TD(0) converges to the solution of the maximum-likelihood Markov model, i.e. the solution of the MDP that best fits the data:

P̂(s′ | s, a) = N(s, a, s′) / N(s, a) (the empirical transition probabilities)
R̂(s, a) = the mean of the rewards observed from (s, a)

Therefore V(A) = 0.75: in the fitted model A always transitions to B with reward 0, and V(B) = 0.75 because a reward of 1 was observed in 6 of the 8 episodes, so V(A) = 0 + V(B). TD exploits the Markov property, unlike MC.
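
The two answers can be checked numerically. The sketch below assumes the standard version of this example (A,0,B,0 once; B,1 six times; B,0 once) and undiscounted returns:

```python
# Episodes as (state, reward) pairs, reward received on leaving the state.
episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]

# Monte-Carlo: V(A) is the mean of the returns actually observed from A.
returns_from_A = [sum(r for _, r in ep) for ep in episodes if ep[0][0] == "A"]
print(sum(returns_from_A) / len(returns_from_A))         # 0.0

# TD / certainty-equivalence: fit the maximum-likelihood model first.
# A is always followed by B with reward 0, so V(A) = 0 + V(B).
rewards_from_B = [r for ep in episodes for s, r in ep if s == "B"]
V_B = sum(rewards_from_B) / len(rewards_from_B)          # 6/8 = 0.75
print(0 + V_B)                                           # 0.75
```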

Comparison Between Backup Methods

Monte-Carlo Backup:
The value of a state Sₜ can only be updated once a terminal state is reached:

V(Sₜ) ← V(Sₜ) + α(Gₜ − V(Sₜ))

Temporal-Difference TD(0) Backup:
The value of a state Sₜ is updated using only a one-step lookahead:

V(Sₜ) ← V(Sₜ) + α(Rₜ₊₁ + γV(Sₜ₊₁) − V(Sₜ))

Dynamic Programming Backup:
The value at Sₜ is updated with a one-step lookahead over every possible successor state, taking an expected value:

V(Sₜ) ← E_π[Rₜ₊₁ + γV(Sₜ₊₁)]

n-Step Return
n-step temporal-difference learning sits between TD(0) and MC: the value is updated by looking ahead n steps and then applying the temporal-difference bootstrap.

Gₜ⁽ⁿ⁾ = Rₜ₊₁ + γRₜ₊₂ + … + γⁿ⁻¹Rₜ₊ₙ + γⁿV(Sₜ₊ₙ)
V(Sₜ) ← V(Sₜ) + α(Gₜ⁽ⁿ⁾ − V(Sₜ))

Instead of committing to a single n-step return Gₜ⁽ⁿ⁾, we can combine all n-step returns with a geometrically decaying weight; the resulting weighted sum is called the λ-return.

Gₜ^λ = (1 − λ) Σₙ λⁿ⁻¹ Gₜ⁽ⁿ⁾, summing over n = 1, 2, …
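
A sketch of both quantities, assuming an episode stored as parallel lists rewards[k] = Rₖ₊₁ and next_states[k] = Sₖ₊₁, and V a dict of current estimates (the helper names are illustrative):

```python
def n_step_return(rewards, next_states, V, t, n, gamma=1.0):
    """G_t^(n) = R_{t+1} + ... + gamma^(n-1) R_{t+n} + gamma^n V(S_{t+n})."""
    T = len(rewards)
    G, discount = 0.0, 1.0
    for k in range(t, min(t + n, T)):
        G += discount * rewards[k]
        discount *= gamma
    if t + n < T:                                  # bootstrap unless the episode has ended
        G += discount * V[next_states[t + n - 1]]  # next_states[t+n-1] is S_{t+n}
    return G

def lambda_return(rewards, next_states, V, t, lam, gamma=1.0):
    """G_t^lambda = (1 - lambda) * sum_n lambda^(n-1) G_t^(n), truncated at episode end."""
    T = len(rewards)
    total, weight = 0.0, 1.0 - lam
    for n in range(1, T - t):
        total += weight * n_step_return(rewards, next_states, V, t, n, gamma)
        weight *= lam
    # All remaining weight falls on the full (Monte-Carlo) return G_t.
    total += (lam ** (T - t - 1)) * n_step_return(rewards, next_states, V, t, T - t, gamma)
    return total
```

Setting lam=0 recovers the TD(0) target and lam=1 recovers the Monte-Carlo return.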

Forward-view TD(λ)

The value at a state can now be computed using a forward-view TD(λ)

V(Sₜ) ← V(Sₜ) + α(Gₜ^λ − V(Sₜ))

The forward view looks into the future to compute the λ-return and updates the value function towards it; like MC, it can only be computed from complete episodes.

Backward-view TD(λ)

The backward view provides a mechanism to update the value online, at every step, from incomplete sequences. We keep an eligibility trace for every state s and update V(s) in proportion to the TD error δₜ and the eligibility trace Eₜ(s):

δₜ = Rₜ₊₁ + γV(Sₜ₊₁) − V(Sₜ)
V(s) ← V(s) + α δₜ Eₜ(s)

Eligibility Trace
Eligibility traces combine both frequency heuristic and recency heuristic.
- Frequency heuristic: assign credit to most frequent states
- Recency heuristic: assign credit to most recent states

E₀(s) = 0
Eₜ(s) = γλ Eₜ₋₁(s) + 1(Sₜ = s)
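
A sketch of backward-view TD(λ) over one episode, assuming transitions arrive as (Sₜ, Rₜ₊₁, Sₜ₊₁, done) tuples, V is a defaultdict(float) of estimates, and accumulating traces are used (names are illustrative):

```python
from collections import defaultdict

def backward_td_lambda_episode(V, transitions, gamma=1.0, lam=0.9, alpha=0.1):
    """Run backward-view TD(lambda) online over one episode, updating V in place."""
    E = defaultdict(float)                    # eligibility traces E_t(s), reset per episode
    for s, r, s_next, done in transitions:
        td_target = r + (0.0 if done else gamma * V[s_next])
        delta = td_target - V[s]              # TD error delta_t

        # Decay every trace, then bump the trace of the current state.
        for state in list(E):
            E[state] *= gamma * lam           # E_t(s) = gamma * lambda * E_{t-1}(s)
        E[s] += 1.0                           # ... + 1(S_t = s)

        # Update every state in proportion to its TD error and trace.
        for state, e in E.items():
            V[state] += alpha * delta * e
```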

We have looked at various methods for model-free predictions such as Monte-Carlo Learning, Temporal-Difference Learning and TD(λ). These methods allowed us to find the value of a state when given a policy. In the next post, we will look at finding the optimal policies using model-free methods.

If you enjoyed this post and want to see more, don't forget to follow and/or leave a clap.
