Introduction to Approximate Solution Methods for Reinforcement Learning


This post is part of a series about Reinforcement Learning (RL), following Sutton and Barto’s famous book “Reinforcement Learning: An Introduction” [1].

In the previous posts we finished dissecting Part I of said book, which introduces the fundamental solution techniques that form the basis for many RL methods: Dynamic Programming (DP), Monte Carlo methods (MC) and Temporal Difference Learning (TD). What separates Part I from Part II of Sutton’s book, and justifies the distinction, is a constraint on the problem size: while Part I covered tabular solution methods, we now dare to dive deeper into this fascinating topic and include function approximation.

To make this specific: in Part I we assumed the state space of the problems under investigation to be small enough that we could represent both it and the found solutions via a simple table (imagine a table denoting a certain “goodness” – a value – for each state). Now, in Part II, we drop this assumption, and are thus able to tackle much larger problems.

And this modified setup is dearly needed, as we observed first-hand: in a previous post we managed to learn to play Tic Tac Toe, but already failed for Connect Four – since the number of states there is on the order of 10²⁰. Or, consider an RL problem which learns a task based on camera images: the number of possible camera images is greater than the number of atoms in the known universe [1].

These numbers should convince everyone that approximate solution methods are absolutely necessary. Besides making such problems tractable, they also offer generalization: tabular methods treat two close, but still different, states completely separately – whereas with function approximation, we can hope that the model detects such similar states and generalizes between them.

With that, let’s begin. In the next few paragraphs, we will:

  • give an introduction to function approximation
  • produce solution methods for such problems
  • discuss different choices for approximation functions.

Introduction to Function Approximation

As opposed to tabular solution methods, for which we used a table to represent e.g. value functions, we now use a parametrized function

$$\hat{v}(s, \mathbf{w}) \approx v_\pi(s)$$

with a weight vector

$$\mathbf{w} \in \mathbb{R}^d$$

$\hat{v}$ might be almost anything, such as a linear function of the input values, or a deep neural network. Later in this post we will discuss different possibilities in detail.

Usually, the number of weights is much smaller than the number of states – which yields generalization: when we update our function by adjusting some weights, we don’t just update a single entry in a table – but this will have an effect on (possibly) all other estimates, too.


Let’s recap the update rules from a few of the methods we saw in previous posts.

MC methods assign the observed return $G_t$ as the value estimate for a state:

$$S_t \mapsto G_t$$

TD(0) bootstraps the value estimate of the next state:

$$S_t \mapsto R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t)$$

While DP uses:

$$s \mapsto \mathbb{E}_\pi\!\left[ R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) \mid S_t = s \right]$$

From now on, we will interpret updates of the form $s \mapsto u$ as input / output pairs of a function we would like to approximate, and use techniques from machine learning for this – in particular, supervised learning. The task of estimating numbers ($u$) from inputs is known as function approximation, or regression.

To solve this problem, we can in theory resort to any method for such a task. We will discuss this in a bit, but should mention that there are certain requirements on such methods: for one, they should be able to learn incrementally, from a dataset that grows over time – since in RL we usually build up experience step by step, which differs from, e.g., classical supervised learning tasks. Further, the chosen method should be able to handle non-stationary targets – which we will discuss in the next subsection.

The Prediction Objective

Throughout Part I of Sutton’s book, we never needed a prediction objective or anything similar – after all, we could always converge to the optimal value function, describing each state’s value perfectly. For the reasons stated above, this is no longer possible – requiring us to define an objective, a cost function, which we want to optimize.

We use the following, known as the mean squared value error:

$$\overline{VE}(\mathbf{w}) = \sum_{s \in \mathcal{S}} \mu(s) \left[ v_\pi(s) - \hat{v}(s, \mathbf{w}) \right]^2$$

Let’s try to understand this. It is a weighted sum of the squared differences between true and predicted values – which intuitively makes sense and is common in supervised learning. Note that this requires us to define a distribution µ, which specifies how much we care about each state.

Often, this simply is a measure proportional to how often states are visited – the on-policy-distribution, on which we will focus in this section.

However, note that it is actually not clear whether this is the right objective: in RL, we care about finding good policies. Some method might optimize the above objective extremely well, but still fail to solve the problem at hand – e.g. when the policy spends too much time in undesired states. Still, as discussed, we need some such objective – and for lack of better alternatives, we optimize this one.
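To make the objective concrete, here is a minimal Python sketch of computing it over a small state space – the three-state example and all its numbers are made up purely for illustration:

```python
def mean_squared_value_error(mu, v_pi, v_hat):
    """VE(w) = sum over states of mu(s) * (v_pi(s) - v_hat(s, w))^2."""
    return sum(m * (v - v_est) ** 2 for m, v, v_est in zip(mu, v_pi, v_hat))

# Hypothetical 3-state example: mu weights each state's error
# by how often that state is visited under the policy.
mu = [0.5, 0.3, 0.2]          # on-policy distribution (made up)
v_pi = [1.0, 0.0, -1.0]       # "true" values (made up)
v_hat = [0.8, 0.1, -0.5]      # current approximation (made up)
ve = mean_squared_value_error(mu, v_pi, v_hat)
```

Note how the rarely visited third state contributes its (large) squared error only with weight 0.2 – exactly the weighting the on-policy distribution provides.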


Next, let’s introduce a method for minimizing this objective.

Minimizing the Prediction Objective

The tool we pick for this task is Stochastic Gradient Descent (SGD). Unlike Sutton, I do not want to go into too many details here, and only focus on the RL part – so I would like to refer the interested reader to [1] or any other tutorial on SGD / deep learning.

In principle, SGD samples batches (or mini-batches) of data to compute the gradient of the objective, and moves the weights a small step in the direction minimizing this objective.

For our objective, the resulting SGD update rule is:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \left[ v_\pi(S_t) - \hat{v}(S_t, \mathbf{w}_t) \right] \nabla \hat{v}(S_t, \mathbf{w}_t)$$

Now the interesting part: assume we do not have access to the true value $v_\pi$, but only to some (possibly noisy) approximation of it, say $U_t$:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \left[ U_t - \hat{v}(S_t, \mathbf{w}_t) \right] \nabla \hat{v}(S_t, \mathbf{w}_t)$$

One can show that if $U_t$ is an unbiased estimate of $v_\pi$, the solution obtained via SGD converges to a local optimum – convenient. We can now simply use e.g. the MC return $G_t$ as $U_t$, and obtain our very first gradient RL method:

[Algorithm box: Gradient Monte Carlo prediction for estimating $\hat{v} \approx v_\pi$]
Image from [1]
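Gradient Monte Carlo prediction can be sketched in a few lines of plain Python – assuming linear features and undiscounted returns for simplicity; the feature map and episode data below are made up:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gradient_mc_episode(w, episode, alpha, features):
    """One episode of gradient MC prediction with linear v(s, w) = w . x(s).
    episode is a list of (S_t, R_{t+1}) pairs; returns are undiscounted."""
    G = 0.0
    for state, reward in reversed(episode):
        G += reward                   # G_t = R_{t+1} + G_{t+1}
        x = features(state)
        error = G - dot(w, x)         # U_t - v(S_t, w), with U_t = G_t
        # For linear v, the gradient w.r.t. w is just x(s)
        w = [w_i + alpha * error * x_i for w_i, x_i in zip(w, x)]
    return w

# Hypothetical 3-step episode, with made-up features x(s) = (1, s)
feats = lambda s: [1.0, float(s)]
w = gradient_mc_episode([0.0, 0.0], [(0, 1.0), (1, 0.0), (2, 5.0)], 0.1, feats)
```

As in the algorithm box, the returns are computed after the episode ends, by sweeping backwards through the collected transitions.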

It is also possible to use other estimates for $U_t$ – in particular ones based on bootstrapping, i.e. on previous value estimates. When doing so, we lose these convergence guarantees – but, as is so often the case, it still works well empirically. Such methods are called semi-gradient methods, since they only consider the effect of changing the weights on the value estimate, but not on the target.

Based on this we can introduce TD(0) with function approximation:

[Algorithm box: Semi-gradient TD(0) for estimating $\hat{v} \approx v_\pi$]
Image from [1]
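A single semi-gradient TD(0) update for a linear value function can be sketched as follows (my own minimal sketch, not the book’s pseudocode):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def semi_gradient_td0_step(w, x_t, reward, x_next, alpha, gamma):
    """One semi-gradient TD(0) update for linear v(s, w) = w . x(s).
    x_next is None when the successor state is terminal.
    The bootstrapped target depends on w, but is treated as a constant
    (hence 'semi'-gradient: only the estimate's gradient is taken)."""
    v_next = dot(w, x_next) if x_next is not None else 0.0
    td_error = reward + gamma * v_next - dot(w, x_t)
    # For linear v, the gradient w.r.t. w is just x_t
    return [w_i + alpha * td_error * x_i for w_i, x_i in zip(w, x_t)]
```

Calling this once per observed transition, inside the usual episode loop, reproduces the structure of the algorithm box.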

A natural extension of this – and likewise of the corresponding n-step tabular method – is n-step semi-gradient TD:

[Algorithm box: n-step semi-gradient TD for estimating $\hat{v} \approx v_\pi$]
Image from [1]
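The n-step target used by this method can be sketched as a small helper – a sketch under the assumption that the n rewards have already been collected in a list (not the book’s pseudocode):

```python
def n_step_return(rewards, v_bootstrap, gamma):
    """n-step target: G = sum_{i=1..n} gamma^(i-1) * R_{t+i} + gamma^n * v_bootstrap,
    where rewards = [R_{t+1}, ..., R_{t+n}] and v_bootstrap is the current
    estimate v(S_{t+n}, w); pass 0.0 when S_{t+n} is terminal."""
    G = 0.0
    for i, r in enumerate(rewards):
        G += (gamma ** i) * r
    return G + (gamma ** len(rewards)) * v_bootstrap
```

This target then plugs into the same semi-gradient update as before, in place of the TD(0) target.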

Methods for Function Approximation

In the remainder of Chapter 9 Sutton describes different ways of representing the approximate function: a large part of the chapter covers linear function approximation and feature design for this, and for non-linear function approximation artificial neural networks are introduced. We will only briefly cover these topics, as on this blog we mainly work with (deep) neural networks and not simple linear approximations, and also suspect the astute reader is already familiar with basics of deep learning and neural networks.

Linear Function Approximation

Still, let’s briefly discuss linear approximation. Here, the state-value function is approximated by the inner product

$$\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s) = \sum_{i=1}^{d} w_i x_i(s)$$

Here, the state is described by the feature vector

$$\mathbf{x}(s) = \big( x_1(s), x_2(s), \dots, x_d(s) \big)^\top$$

– and, as we can see, $\hat{v}$ is a linear combination of the weights.

Due to the simplicity of this representation, there are some elegant formulas (and closed-form solutions) for the minimizer, as well as some convergence guarantees.
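To illustrate such a closed-form solution, here is a small sketch of my own (not from the book) that minimizes the VE objective exactly for a two-dimensional linear approximation, by solving the weighted normal equations by hand:

```python
def ve_minimizer_2d(features, v_pi, mu):
    """Closed-form minimizer of VE for d = 2 linear features:
    solves A w = b, with A = sum_s mu(s) x(s) x(s)^T
    and b = sum_s mu(s) v_pi(s) x(s)."""
    a11 = a12 = a22 = b1 = b2 = 0.0
    for (x1, x2), v, m in zip(features, v_pi, mu):
        a11 += m * x1 * x1
        a12 += m * x1 * x2
        a22 += m * x2 * x2
        b1 += m * v * x1
        b2 += m * v * x2
    det = a11 * a22 - a12 * a12      # assumes A is invertible
    return [(a22 * b1 - a12 * b2) / det,
            (a11 * b2 - a12 * b1) / det]
```

For example, with features (1, 0), (0, 1), (1, 1), true values 1, 2, 3 and a uniform µ, this recovers w = (1, 2) – an exact fit. For higher-dimensional features one would solve the same linear system with a general-purpose solver.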


Feature Construction for Linear Methods

A limitation of the naive linear function approximation introduced above is that each feature enters separately – no interactions between features are possible. Sutton lists the cart-pole problem as an example: here, a high angular velocity can be good or bad, depending on the context. When the pole is nicely centered, one should probably avoid quick, jerky movements. However, the closer the pole gets to falling over, the higher the velocities needed to recover might be.

There is thus a separate branch of research on designing efficient feature representations (although one could argue that, due to the rise of deep learning, this is becoming less important).

One family of such representations are polynomials. As an introductory example, imagine the state vector consists of two components, $s_1$ and $s_2$. We could then define the feature vector

$$\mathbf{x}(s) = \big( 1, s_1, s_2, s_1 s_2 \big)^\top$$

Then, using this representation, we could still do linear function approximation – i.e. assign four weights to the four newly constructed features, and overall still have a function that is linear in the weights.

More generally, for a $k$-dimensional state, each feature of the order-$n$ polynomial basis can be written as

$$x_i(s) = \prod_{j=1}^{k} s_j^{c_{i,j}}$$

where the $c_{i,j}$ are integers in $\{0, \dots, n\}$ – yielding $(n+1)^k$ features in total.
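A sketch of how such a feature basis could be generated (my own helper, not from the book):

```python
import itertools

def polynomial_features(s, n):
    """All order-n polynomial-basis features of a k-dimensional state s:
    one feature prod_j s_j^c_j per exponent tuple (c_1, ..., c_k) in {0..n}^k."""
    feats = []
    for c in itertools.product(range(n + 1), repeat=len(s)):
        prod = 1.0
        for s_j, c_j in zip(s, c):
            prod *= s_j ** c_j
        feats.append(prod)
    return feats
```

For the two-component example above, `polynomial_features(s, 1)` produces exactly the four features 1, s_2, s_1 and s_1·s_2 – and in general the feature count grows as (n+1)^k, which is why this basis is only practical for small state dimensions.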

Other commonly used bases are the Fourier basis, coarse and tile coding, and radial basis functions – but as mentioned we will not dive deeper at this point.

Conclusion

In this post we made an important step beyond the previous posts towards deploying RL algorithms “in the wild”. In the preceding posts, we focused on introducing the essential RL methods, albeit in the form of tabular methods. We saw that they quickly reach their limits when deployed to larger problems and thus realized that approximate solution methods are needed.

In this post we introduced the fundamentals for this. Next to enabling the tackling of large-scale, real-world problems, these methods also introduce generalization – a crucial property for any successful RL algorithm.

We began by introducing a suitable prediction objective and ways of optimizing this.

Then we introduced our first gradient and semi-gradient RL algorithms for the prediction objective – that is learning a value function for a given policy.

Lastly we discussed different ways for constructing the approximation function.

As always, thanks for reading! And if you are interested, stay tuned for the next post in which we will dive into the corresponding control problem.


References

[1] Sutton, R. S. & Barto, A. G.: Reinforcement Learning: An Introduction (2nd ed.), http://incompleteideas.net/book/RLbook2020.pdf

[2] PettingZoo: https://pettingzoo.farama.org/
