1. Introduction
Dynamic programming (DP) is both a mathematical optimization method and a computer programming method, developed by Richard Bellman in the 1950s. The main idea is to break a complicated problem down into simpler sub-problems recursively. If there is a recursive relation between the original problem and its sub-problems, then DP methods are applicable. In the optimization literature this relationship is called the Bellman equation.
In machine learning, one of the problems we are interested in is sequential decision making: how a software agent ought to take actions in a stochastic environment so as to maximize its cumulative reward. We formulate the interaction between the agent and the environment as a Markov Decision Process (MDP), whose structure can be modeled by the Bellman equation. But when the state space of the environment is too large, or an exact model of the dynamics is absent, the MDP can only be solved approximately with simulation and function approximation, and this is what we call approximate dynamic programming (also known as neuro-dynamic programming or reinforcement learning). A large body of methods has been developed, e.g. temporal difference learning (TD) [4], least-squares temporal difference learning (LSTD) [3,4], least-squares policy evaluation (LSPE) [9] and least-squares policy iteration (LSPI) [6].
Computing the value function of a given policy (policy evaluation) via the Bellman equation plays an important role in approximate DP, and it is equivalent to solving a linear system [13]. In matrix computations, Krylov subspace methods [5,8] are among the most powerful tools for solving large-scale linear systems of equations and least-squares problems. So there is a connection between these two areas, but this connection has not been explored in depth.
We use the term approximate DP rather than reinforcement learning in this report to emphasize its strong relationship with DP, because DP provides theoretical foundations and important guidelines for designing approximate DP algorithms.
This report reviews some fundamental concepts and ideas of approximate DP and Krylov subspace methods and the connections between them.
2. Approximate dynamic programming methods
2.1. Problem description
A Markov decision process (MDP) is a framework for modeling the learning process in which an agent learns from interaction with an environment. Given the current state $s_t$ at time $t$, the agent interacts with the environment by selecting an action $a_t$ according to the policy $\pi$, and the environment takes the agent to another state $s_{t+1}$ and then provides a numerical feedback (immediate reward) $r_{t+1}$ according to the dynamics (transition probability) $p(s_{t+1} \mid s_t, a_t)$. The policy $\pi(a \mid s)$ is a probability distribution from which we select the action. Our goal is to find an optimal policy (often deterministic) that maximizes the long-term cumulative reward. We assume that the policy is stationary. We only consider the infinite-horizon discounted problem, because the ideas can also be applied to the simpler absorbing Markov chains [17]. So the objective function is the discounted return
$$G_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}, \qquad 0 \le \gamma < 1. \qquad (1)$$
It is easy to see the recursive structure
$$G_t = r_{t+1} + \gamma\, G_{t+1}.$$
The value function of the given policy $\pi$ at state $s$ is defined as
$$v^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\, G_t \mid s_t = s \,\right], \qquad (2)$$
which can be expanded as
$$v^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\left[\, r(s, a, s') + \gamma\, v^{\pi}(s') \,\right]. \qquad (3)$$
Equation (3) is the Bellman equation for the MDP; it depicts the recursive relationship between the value of $s$ and the values of its possible successor states $s'$, and is consistent with (1). It can be written in a more compact form. For the fixed policy $\pi$, let $p^{\pi}(s' \mid s) = \sum_{a} \pi(a \mid s)\, p(s' \mid s, a)$ and let $r^{\pi}(s)$ be the expected immediate reward from state $s$; then
$$v^{\pi}(s) = r^{\pi}(s) + \gamma \sum_{s'} p^{\pi}(s' \mid s)\, v^{\pi}(s'). \qquad (4)$$
Equation (4) can be written in the matrix form
$$v^{\pi} = r^{\pi} + \gamma P^{\pi} v^{\pi}, \qquad (5)$$
where $v^{\pi}, r^{\pi} \in \mathbb{R}^{n}$, $P^{\pi} \in \mathbb{R}^{n \times n}$ is the transition matrix with $[P^{\pi}]_{ss'} = p^{\pi}(s' \mid s)$, and $n$ is the number of states. Here $n$ can be infinite, but for simplicity we assume it is finite. Since we only consider a fixed policy, we can drop the superscript and write (5) as $v = r + \gamma P v$, or $(I - \gamma P)\, v = r$.
We define a shorthand notation for the DP mapping:
$$T : \mathcal{V} \to \mathcal{V}, \qquad T v = r + \gamma P v, \qquad (6)$$
where $\mathcal{V}$ is the value function space and $T$ is called the Bellman operator. It can be proved that $T$ is a contraction mapping [1] and that its fixed point $v^{\pi}$ is the solution of (5). Finding the fixed point is exactly what we do in the policy evaluation step of the policy iteration method [13].
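To make this concrete, here is a minimal sketch (not from the report) that evaluates a fixed policy exactly by solving the linear form of (5); the three-state transition matrix, reward vector and discount factor are made-up illustrative values.

```python
import numpy as np

# Illustrative 3-state MDP under a fixed policy (made-up values).
P = np.array([[0.5, 0.5, 0.0],      # transition matrix P under the policy
              [0.1, 0.6, 0.3],
              [0.0, 0.2, 0.8]])
r = np.array([1.0, 0.0, 2.0])        # expected one-step reward per state
gamma = 0.9                           # discount factor

# Solve (I - gamma * P) v = r, the linear form of the Bellman equation (5).
v = np.linalg.solve(np.eye(3) - gamma * P, r)

# The fixed point of the Bellman operator T v = r + gamma * P v is the same v.
assert np.allclose(r + gamma * P @ v, v)
print(v)
```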
Solving (5) with a direct linear solver is not feasible when the state space is very large or when the transition matrix is unavailable. But we can find approximate solutions by value function approximation (VFA). Linear approximation is a popular VFA method [11] and has the form
$$\hat{v}(s) = \sum_{j=1}^{K} w_j\, \phi_j(s), \qquad \text{or in matrix form} \qquad \hat{v} = \Phi w, \qquad (7)$$
where $\phi_j$ is the $j$-th feature function or basis function over the state space, $w \in \mathbb{R}^{K}$ is the weight vector, $\Phi \in \mathbb{R}^{n \times K}$ is the feature matrix and $K$ is the number of features, see [14]. For each state $i$, $\hat{v}(i) = \phi(i)^{\top} w$, where $\phi(i)^{\top}$ is the $i$-th row of $\Phi$ and represents the feature vector for state $i$. Linear function approximation has better convergence properties [17] than nonlinear function approximation and is easier to analyze, interpret and implement. We can introduce nonlinearity when doing feature extraction, e.g. with deep neural networks. Suppose we already have well-chosen features; then, with the linear approximation, (5) becomes
$$\Phi w \approx r + \gamma P \Phi w. \qquad (8)$$
We define a $\xi$-weighted quadratic norm on $\mathbb{R}^{n}$,
$$\|v\|_{\xi}^{2} = \sum_{i=1}^{n} \xi_i\, v(i)^{2} = v^{\top} \Xi\, v,$$
where $\xi = (\xi_1, \ldots, \xi_n)^{\top}$ with $\xi_i > 0$ and $\sum_i \xi_i = 1$ is the weight vector, and $\Xi$ is the diagonal matrix with $\xi$ on the diagonal. Then we can define the operator $\Pi$ that projects any value function to its closest representable function in $\operatorname{range}(\Phi)$ (i.e. the column space of $\Phi$) with respect to $\|\cdot\|_{\xi}$. That means $\Pi$ is an orthogonal projector onto $\operatorname{range}(\Phi)$ with respect to $\|\cdot\|_{\xi}$, i.e.
$$\Pi v = \arg\min_{\hat{v} \in \operatorname{range}(\Phi)} \|\hat{v} - v\|_{\xi},$$
which is equivalent to
$$\Pi = \Phi\left(\Phi^{\top} \Xi\, \Phi\right)^{-1} \Phi^{\top} \Xi. \qquad (9)$$
We assume that the Markov chain we are studying is ergodic; then it has a stationary distribution
$$\xi^{\top} P = \xi^{\top}, \qquad \sum_{i=1}^{n} \xi_i = 1, \quad \xi_i > 0. \qquad (10)$$
We usually use this stationary distribution as our weight vector $\xi$; otherwise we may cause a norm mismatch problem, which can make the iterative methods described in the following sections diverge [1,14].
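As a small illustration of the projector, the following sketch forms $\Pi$ as in (9) from an arbitrary feature matrix and weight vector (illustrative values, not from the report) and checks its defining properties.

```python
import numpy as np

n, K = 6, 2
rng = np.random.default_rng(0)
Phi = rng.normal(size=(n, K))             # feature matrix, one row per state
xi = rng.random(n); xi /= xi.sum()        # state weights (e.g. a stationary distribution)
Xi = np.diag(xi)

# Pi = Phi (Phi^T Xi Phi)^{-1} Phi^T Xi, the xi-weighted projector of (9).
Pi = Phi @ np.linalg.solve(Phi.T @ Xi @ Phi, Phi.T @ Xi)

# Pi is idempotent and leaves range(Phi) unchanged ...
assert np.allclose(Pi @ Pi, Pi)
assert np.allclose(Pi @ Phi, Phi)
# ... and the residual of any projected vector is Xi-orthogonal to range(Phi).
v = rng.normal(size=n)
assert np.allclose(Phi.T @ Xi @ (v - Pi @ v), 0)
```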
Bertsekas [1] gives a short summary of the main approximate DP algorithms:
- Direct policy evaluation: find $w$ that minimizes $\|\Phi w - v^{\pi}\|_{\xi}^{2}$. We compute an approximation of the projection $\Pi v^{\pi}$ by Monte Carlo simulation. We can introduce nonlinear architectures in $\hat{v}$, e.g. deep neural networks.
- Indirect policy evaluation:
  - Solve the projected equation (Galerkin approximation [16,18])
$$\Phi w = \Pi\, T(\Phi w) \qquad (11)$$
  by simulation. Equation (11) is equivalent to minimizing the expected temporal difference error (TDE) $\|\Pi T(\Phi w) - \Phi w\|_{\xi}$ down to 0 over $w$ [14].
    - TD learning: a stochastic iterative algorithm for solving (11).
    - Least-squares temporal difference learning (LSTD): the TD fixed point method [14].
    - Least-squares policy evaluation (LSPE): a simulation-based form of projected value iteration, essentially projected value iteration with simulation noise.
  - Bellman equation error (Bellman residual [14]) method: minimize $\|T(\Phi w) - \Phi w\|_{\xi}^{2}$.
2.2. Direct policy evaluation
We are interested in finding $\hat{v} = \Phi w$ with $\hat{v} \approx v$. Suppose we can sample the exact values of $v$ or give an accurate estimate of $v$; then we can learn the parameters in a supervised learning paradigm. The samples or estimates of $v$ can be considered as the true labels, and $\Phi w$ is the approximator. Then it becomes a standard weighted least-squares problem, whose solution is the projection of $v$ onto $\operatorname{range}(\Phi)$ with respect to $\|\cdot\|_{\xi}$ [1]. Then
$$w^{*} = \arg\min_{w} \|\Phi w - v\|_{\xi}^{2} = \arg\min_{w} \sum_{i=1}^{n} \xi_i \left(\phi(i)^{\top} w - v(i)\right)^{2}.$$
Setting the gradient of this objective to 0, we have
$$\Phi^{\top} \Xi\, \Phi\, w^{*} = \Phi^{\top} \Xi\, v, \qquad \text{i.e.} \qquad w^{*} = \left(\Phi^{\top} \Xi\, \Phi\right)^{-1} \Phi^{\top} \Xi\, v.$$
Suppose we have samples $(s_1, v_1), \ldots, (s_m, v_m)$ generated by the environment or a simulator of the environment according to $\xi$, where $v_k$ is an (unbiased) sample of $v(s_k)$; then we can get the simulation-based approximation
$$\hat{w} = \left(\sum_{k=1}^{m} \phi(s_k)\, \phi(s_k)^{\top}\right)^{-1} \sum_{k=1}^{m} \phi(s_k)\, v_k.$$
Note that $\sum_{k} \phi(s_k)\, \phi(s_k)^{\top}$ is a $K \times K$ matrix. Thus we are using low-dimensional linear algebra (of order $K$, the number of basis functions) compared with (5).
We can obtain an unbiased estimate of the true value function $v$ by Monte Carlo sampling, i.e. perform a long-term lookahead through the Markov chain and collect all the future rewards. However, the Monte Carlo estimate of $v$ requires a large number of samples, as it suffers from high variance.
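A minimal sketch of this direct (Monte Carlo) approach is shown below; the trajectory format and the names `direct_policy_evaluation` and `phi` are illustrative assumptions rather than notation from the report.

```python
import numpy as np

def direct_policy_evaluation(episodes, phi, gamma, K):
    """Sketch of direct (Monte Carlo) policy evaluation with linear features.

    episodes: list of trajectories, each a list of (state, reward) pairs, where
    the reward is the one received for the transition out of that state;
    phi(s) returns the K-dimensional feature vector of state s.
    """
    A = np.zeros((K, K))
    b = np.zeros(K)
    for traj in episodes:
        # Monte Carlo return G_t for every visited state (backward pass).
        G = 0.0
        targets = []
        for (s, rwd) in reversed(traj):
            G = rwd + gamma * G
            targets.append((s, G))
        for s, G in targets:
            f = phi(s)
            A += np.outer(f, f)     # ~ Phi^T Xi Phi, weighted by visit frequency
            b += f * G              # ~ Phi^T Xi v
    return np.linalg.solve(A, b)    # weighted least-squares weights w
```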
2.3. Indirect policy evaluation
2.3.1. Stochastic algorithms: a general view
Consider the approximate solution $x$ of a linear fixed point equation $x = b + A x$ by using $m$ simulation samples $b_1, \ldots, b_m$ and $A_1, \ldots, A_m$, where $b_k = b + \varepsilon_k$ and $A_k = A + E_k$, and $\varepsilon_k$, $E_k$ are simulation noise. Think of $x = b + Ax$ as approximate policy evaluation, which has the same form as (5) [1]:
- Stochastic approximation (SA) approach: starting from an initial point $x_0$, for $k = 0, 1, \ldots$ iterate
$$x_{k+1} = (1 - \eta_k)\, x_k + \eta_k \left( b_k + A_k x_k \right), \qquad (12)$$
where $\eta_k \in (0, 1]$ are the weights (step sizes).
- Monte Carlo estimation (MCE) approach: form Monte Carlo estimates of $b$ and $A$,
$$\bar{b}_m = \frac{1}{m} \sum_{k=1}^{m} b_k, \qquad \bar{A}_m = \frac{1}{m} \sum_{k=1}^{m} A_k, \qquad (13)$$
and then solve $x = \bar{b}_m + \bar{A}_m x$ by a direct method or an iterative method. The estimates $\bar{b}_m$ and $\bar{A}_m$ can be built incrementally.
TD is an SA method; LSTD and LSPE are MCE methods. We introduce them in Sections 2.3.2–2.3.4.
2.3.2. Temporal difference (TD) learning
In 1988, Sutton developed temporal difference learning as a class of incremental learning methods for general prediction problems. It assigns credit through the difference between the predictions at the current state and at its successor, which distinguishes it from conventional supervised learning methods. It has the form [4]
$$V(s_t) \leftarrow V(s_t) + \alpha \left[\, r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t) \,\right], \qquad (14)$$
where $\alpha$ is a constant step-size parameter. This form is equivalent to (12). TD learning is a combination of Monte Carlo ideas and DP ideas. It can learn directly from experience $(s_t, r_{t+1}, s_{t+1})$ sampled from the environment, which is similar to Monte Carlo methods. It also updates the estimates partially based on other learned estimates, without collecting all future rewards (bootstrapping), which is similar to DP.
When we use linear function approximation $\hat{v}(s) = \phi(s)^{\top} w$, the parameter update can be written as [4]
$$w_{t+1} = w_t + \alpha \left(\, r_{t+1} + \gamma\, \phi(s_{t+1})^{\top} w_t - \phi(s_t)^{\top} w_t \,\right) \phi(s_t). \qquad (15)$$
Equation (15) is like solving $\min_{w} \left( v(s_t) - \phi(s_t)^{\top} w \right)^{2}$: we apply gradient descent and substitute the unknown target $v(s_t)$ by its bootstrapped approximation $r_{t+1} + \gamma\, \phi(s_{t+1})^{\top} w_t$. We call these semi-gradient methods because we only take the derivative of the prediction $\phi(s_t)^{\top} w$ but not of the target. TD is proved to converge with probability one [17].
TD requires less memory and less peak computation, and produces more accurate predictions, than conventional supervised learning methods.
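The following sketch implements the semi-gradient TD(0) update (15) with linear features; the sample format and function names are illustrative assumptions.

```python
import numpy as np

def td0_linear(transitions, phi, K, gamma=0.9, alpha=0.05):
    """Sketch of semi-gradient TD(0) with linear features, cf. Equation (15).

    transitions: an iterable of (s, r, s_next) samples from the fixed policy;
    phi(s) returns the K-dimensional feature vector of state s.
    """
    w = np.zeros(K)
    for s, rwd, s_next in transitions:
        f, f_next = phi(s), phi(s_next)
        delta = rwd + gamma * f_next @ w - f @ w   # temporal difference error
        w += alpha * delta * f                      # semi-gradient update
    return w
```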
2.3.3. Least-squares temporal difference (LSTD) learning
As $\Pi T$ is a contraction mapping on $\operatorname{range}(\Phi)$ with respect to $\|\cdot\|_{\xi}$ [1,14], there exists a unique fixed point $\Phi w^{*}$ that satisfies (11). That means $\Phi w^{*}$ is the unique solution of
$$\Phi w = \Pi \left( r + \gamma P \Phi w \right), \qquad (16)$$
i.e. $\Phi w^{*}$ is the unique point in $\operatorname{range}(\Phi)$ with the property that projecting its Bellman image $T(\Phi w^{*})$ with $\Pi$ gives the point itself. According to the definition of $\Pi$, $w^{*}$ should satisfy the orthogonality condition
$$\Phi^{\top} \Xi \left( T(\Phi w^{*}) - \Phi w^{*} \right) = \Phi^{\top} \Xi \left( r + \gamma P \Phi w^{*} - \Phi w^{*} \right) = 0. \qquad (17)$$
The orthogonality condition (17) is equivalent to the Galerkin method [16,18]. Equation (17) can be written as a linear system of equations $C w^{*} = d$, where
$$C = \Phi^{\top} \Xi\, (I - \gamma P)\, \Phi, \qquad d = \Phi^{\top} \Xi\, r. \qquad (18)$$
So we have $w^{*} = C^{-1} d$. The simulation mechanism for $C$ and $d$ is
$$C_m = \frac{1}{m} \sum_{k=1}^{m} \phi(s_k) \left( \phi(s_k) - \gamma\, \phi(s_{k+1}) \right)^{\top}, \qquad (19)$$
$$d_m = \frac{1}{m} \sum_{k=1}^{m} \phi(s_k)\, r_{k+1}, \qquad (20)$$
where each term can be computed with low-dimensional ($K$-dimensional) calculations. Then $w^{*}$ can be approximated by $\hat{w} = C_m^{-1} d_m$.
LSTD makes more efficient use of data because it extracts more information from each additional observation, and thus would be expected to converge with fewer training samples. In addition, it does not require the user to manually tune a step-size schedule for good performance. LSTD converges to the same point as TD [4].
TD's convergence can be slowed dramatically by a poor choice of the step-size parameters $\alpha_t$. Furthermore, TD's performance is sensitive to the initial estimate for $w$. LSTD does not rely on an arbitrary initial estimate. TD is also sensitive to the ranges of the individual features; LSTD is not.
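A corresponding LSTD sketch is shown below: it accumulates the estimates (19) and (20) from sampled transitions and solves $C_m w = d_m$. The sample format and function names are illustrative, and the small ridge term `reg` is an added stabilizer rather than part of the method described above.

```python
import numpy as np

def lstd(transitions, phi, K, gamma=0.9, reg=1e-6):
    """Sketch of LSTD(0): build the simulation estimates (19)-(20), solve C w = d."""
    C = np.zeros((K, K))
    d = np.zeros(K)
    m = 0
    for s, rwd, s_next in transitions:
        f, f_next = phi(s), phi(s_next)
        C += np.outer(f, f - gamma * f_next)   # accumulates phi (phi - gamma phi')^T
        d += f * rwd                            # accumulates phi * r
        m += 1
    C /= m
    d /= m
    return np.linalg.solve(C + reg * np.eye(K), d)
```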
2.3.4. Least-squares policy evaluation
An alternative way to solve the projected equation is projected value iteration (PVI), that is
$$\Phi w_{k+1} = \Pi\, T(\Phi w_k). \qquad (21)$$
It converges to $w^{*}$ because $\Pi T$ is a contraction mapping with respect to $\|\cdot\|_{\xi}$. Rather than directly solving the projected equation, it uses an iterative way to approach the fixed point starting from an arbitrary initial point. Each step is equivalent to the minimization problem
$$w_{k+1} = \arg\min_{w} \left\| \Phi w - \left( r + \gamma P \Phi w_k \right) \right\|_{\xi}^{2}. \qquad (22)$$
Setting the gradient with respect to $w$ to 0, we have
$$\Phi^{\top} \Xi \left( \Phi w_{k+1} - r - \gamma P \Phi w_k \right) = 0. \qquad (23)$$
Then we have
$$w_{k+1} = w_k - B^{-1} \left( C w_k - d \right), \qquad B = \Phi^{\top} \Xi\, \Phi, \qquad (24)$$
where $C$ and $d$ are defined as in (18) and can be estimated by (19) and (20). $B$ can be estimated by
$$B_m = \frac{1}{m} \sum_{k=1}^{m} \phi(s_k)\, \phi(s_k)^{\top}, \qquad (25)$$
and $B_m$, $C_m$, $d_m$ can be formed incrementally or in temporal-difference style [1].
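Below is a sketch of LSPE built on the same sample format as the LSTD sketch above: it estimates $B$, $C$ and $d$ and then iterates (24). The names and the batch (non-incremental) form are illustrative simplifications.

```python
import numpy as np

def lspe(transitions, phi, K, gamma=0.9, n_iters=50):
    """Sketch of LSPE: estimate B, C, d from samples, then iterate (24)."""
    B = np.zeros((K, K)); C = np.zeros((K, K)); d = np.zeros(K); m = 0
    for s, rwd, s_next in transitions:
        f, f_next = phi(s), phi(s_next)
        B += np.outer(f, f)
        C += np.outer(f, f - gamma * f_next)
        d += f * rwd
        m += 1
    B, C, d = B / m, C / m, d / m

    w = np.zeros(K)
    for _ in range(n_iters):
        w = w - np.linalg.solve(B, C @ w - d)   # projected value iteration step
    return w
```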
2.3.5. Bellman equation error methods
The main idea of the Bellman equation error (BE) method is to find a $w$ so that $\|T(\Phi w) - \Phi w\|_{\xi}^{2}$ is minimized. Recall that $T(\Phi w) = r + \gamma P \Phi w$. With the notation $\Psi = (I - \gamma P)\, \Phi$, we have $T(\Phi w) - \Phi w = r - \Psi w$. Then it becomes a standard least-squares problem with respect to $\min_{w} \|r - \Psi w\|_{\xi}^{2}$. It is easy to see that the solution of this weighted least-squares problem is
$$w_{\mathrm{BE}} = \left( \Psi^{\top} \Xi\, \Psi \right)^{-1} \Psi^{\top} \Xi\, r. \qquad (26)$$
One reason we are interested in the BE method is the well-known result $\|v - \hat{v}\|_{\infty} \le \frac{1}{1 - \gamma}\, \|T\hat{v} - \hat{v}\|_{\infty}$, and for the weighted norm we have a bound of the form $\|v - \hat{v}\|_{\xi} \le \frac{\sqrt{C_{\xi}}}{1 - \gamma}\, \|T\hat{v} - \hat{v}\|_{\xi}$, where $C_{\xi}$ is a "concentration coefficient" which can be seen as a measure of the stochasticity of the MDP [14]. This means that when the BE is small, we are close to the true value function $v$. For TD, there is no similar result. Actually, it is proved [14] that the BE is an upper bound of the TDE. More precisely, since $T\hat{v} - \hat{v} = \left( T\hat{v} - \Pi T\hat{v} \right) + \left( \Pi T\hat{v} - \hat{v} \right)$ and the two terms are $\xi$-orthogonal, the Pythagorean theorem gives
$$\|T\hat{v} - \hat{v}\|_{\xi}^{2} = \|T\hat{v} - \Pi T\hat{v}\|_{\xi}^{2} + \|\Pi T\hat{v} - \hat{v}\|_{\xi}^{2}, \qquad (27)$$
which means that minimizing the BE minimizes not only the TDE but also $\|T\hat{v} - \Pi T\hat{v}\|_{\xi}$, the part that the TD method misses. We can also see that the TDE equals $\|\Pi T\hat{v} - \hat{v}\|_{\xi}$, and this is why people also call it the projected Bellman error (PBE).
The BE has the additional component, which has to do with the features. Hence, if the feature space is fixed, the term we really want to minimize is the TDE. Only when the features can be modified do we start to look at the BE.
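A model-based sketch of the BE minimizer (26) is given below; it assumes the transition matrix and reward vector are available (the model-free case requires double sampling, which is not shown).

```python
import numpy as np

def bellman_error_weights(P, r, Phi, xi, gamma=0.9):
    """Sketch of the Bellman-error (residual) minimizer (26): a xi-weighted
    least-squares solve with Psi = (I - gamma P) Phi, assuming P and r are known."""
    n = P.shape[0]
    Xi = np.diag(xi)
    Psi = (np.eye(n) - gamma * P) @ Phi
    return np.linalg.solve(Psi.T @ Xi @ Psi, Psi.T @ Xi @ r)
```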
2.4. Least-squares policy iteration (LSPI)
Motivated by LSTD, Lagoudakis and Parr [6] extended it to control problems. Instead of evaluating the state value function $v$, they evaluate state-action pairs $(s, a)$, i.e. the $Q$ function. The resulting method is referred to as LSTDQ. The $Q$ function is defined as
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\, \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \,\middle|\, s_t = s,\ a_t = a \,\right]. \qquad (28)$$
We can directly choose actions and improve our policy from the $Q$ function without any knowledge of the underlying model of the environment. Everything about LSTDQ looks similar to LSTD after we change $\Phi$ to be the feature matrix over all state-action pairs. Then we put LSTDQ into the general policy iteration (GPI) framework to update our policy and obtain the LSPI algorithm.
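A sketch of the LSTDQ evaluation step is shown below; the state-action feature map `phi_sa`, the `policy` callable and the ridge term are illustrative assumptions, and the policy-improvement loop of LSPI is omitted.

```python
import numpy as np

def lstdq(samples, phi_sa, policy, K, gamma=0.9, reg=1e-6):
    """Sketch of LSTDQ: LSTD over state-action features phi_sa(s, a).

    samples: (s, a, r, s_next) tuples; policy(s_next) returns the action the
    current policy would take in s_next.
    """
    C = np.zeros((K, K)); d = np.zeros(K)
    for s, a, rwd, s_next in samples:
        f = phi_sa(s, a)
        f_next = phi_sa(s_next, policy(s_next))
        C += np.outer(f, f - gamma * f_next)
        d += f * rwd
    w = np.linalg.solve(C + reg * np.eye(K), d)
    return w   # greedy improvement step would pick argmax_a phi_sa(s, a) @ w
```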
2.5. A unified oblique projection view
In LSTD, we attempt to find the fixed point of the composite operator $\Pi T$, where $\Pi$ is an orthogonal projection onto $\operatorname{range}(\Phi)$ with respect to $\|\cdot\|_{\xi}$. Now we consider using a projector that is not necessarily orthogonal: a general operator $\Pi_{X}$ that satisfies $\Pi_{X}^{2} = \Pi_{X}$ and whose range is $\operatorname{range}(\Phi)$. In the most general case, such operators are called oblique projections and can be written as $\Pi_{X} = \Phi \left( X^{\top} \Phi \right)^{-1} X^{\top}$, with $X \in \mathbb{R}^{n \times K}$ such that $X^{\top} \Phi$ is nonsingular. The parameter $X$ specifies the projection direction: precisely, $\Pi_{X}$ is the projection onto $\operatorname{range}(\Phi)$ orthogonally to $\operatorname{range}(X)$, or we could write it as $\Pi_{\Phi \perp X}$ (the $\Pi$ defined in (9) can be written as $\Pi_{\Xi\Phi}$). The vector $\left( X^{\top} \Phi \right)^{-1} X^{\top} v$ is the coordinate vector of the projection of $v$ in the basis $\Phi$.
Scherrer [14] provided a unified oblique projection view of the TD fixed point and BE minimization methods. Recall that $T(\Phi w) = r + \gamma P \Phi w$. If we write
$$X_{\mathrm{TD}} = \Xi\, \Phi \qquad \text{and} \qquad X_{\mathrm{BE}} = \Xi\, (I - \gamma P)\, \Phi,$$
we have: (1) the TD fixed point and the BE minimizer are solutions of the projected equation
$$X^{\top} \left( T(\Phi w) - \Phi w \right) = 0 \qquad (29)$$
(with $X = X_{\mathrm{TD}}$ and $X = X_{\mathrm{BE}}$, respectively); (2) when it exists, the solution of this projected equation is the projection of $v$ onto $\operatorname{range}(\Phi)$ orthogonally to $\operatorname{range}\!\left( (I - \gamma P)^{\top} X \right)$, i.e. formally
$$\Phi w = \Pi_{(I - \gamma P)^{\top} X}\, v.$$
From this perspective, we put the two methods under the same framework of a projected fixed point equation.
From the results above, one can bound the approximation error of the oblique projection: $\|v - \Pi_{(I - \gamma P)^{\top} X}\, v\|_{\xi}$ is bounded by the error of the orthogonal projection, $\|v - \Pi v\|_{\xi}$, times a factor involving the spectral radius of certain $K \times K$ matrices determined by $\Phi$, $\Xi$, $P$ and $X$; these matrices can be estimated by simulation [14]. This inequality can also give us some inspiration on how to choose $X$ and extract features $\Phi$: e.g. if we choose $X$ such that $(I - \gamma P)^{\top} X = \Xi \Phi$, we obtain $\Phi w = \Pi v$, which is the orthogonal projection of the true $v$ onto $\operatorname{range}(\Phi)$ with respect to $\|\cdot\|_{\xi}$.
Finally, Scherrer gives a comparison of the TD and BE methods: BE is more robust than TD because it comes with a performance guarantee. Although TD often outperforms BE, it can sometimes fail dramatically, so TD is riskier. In practice, TD is easier to implement by sampling and has lower variance; BE needs more samples to eliminate the effect of noise.
3. Krylov subspace methods
The earliest Krylov subspace methods were developed by Lanczos, Hestenes and Stiefel in the late 1940s and early 1950s for solving large sparse linear systems of equations $Ax = b$ (with nonsingular $A \in \mathbb{R}^{n \times n}$). Krylov subspace methods form a class of iterative methods. An iterative method solves the linear system by producing a sequence of iterates $x_1, x_2, \ldots$ starting from an initial point $x_0$, which is expected to converge to the true solution. Iterative methods have advantages over direct methods (e.g. matrix factorization methods) when $A$ is a large sparse matrix whose structure is hard to exploit. Moreover, we can stop an iterative method early if we just need an approximate solution, which we cannot do with a direct method. Krylov subspace methods are different from classical iterative methods (e.g. matrix splitting methods) because they have memory, i.e. they construct $x_k$ not only from $x_{k-1}$ but also from the information of the previous iterates, so they are faster. Furthermore, Krylov subspace methods require only the ability to compute $Ax$ for any $x$ (some also require the computation of $A^{\top} x$).
3.1. Projection Process
First we introduce a general projection process to find the solution of $Ax = b$. Let $x_0$ denote an initial approximation to the solution $x$, and let $r_0 = b - A x_0$ be the initial residual. At step $k$, we impose the conditions
$$x_k \in x_0 + \mathcal{S}_k, \qquad r_k = b - A x_k \perp \mathcal{C}_k, \qquad (30)$$
where $\mathcal{S}_k$ is a $k$-dimensional subspace, referred to as the search subspace, and $\mathcal{C}_k$ is also a $k$-dimensional subspace, referred to as the constraint subspace. Since $\mathcal{S}_k$ has dimension $k$, we have $k$ degrees of freedom to construct $x_k$. In order to determine $x_k$, we need $\mathcal{C}_k$ to give us $k$ constraints, which are imposed on the residual $r_k$ in (30).
Let $S_k \in \mathbb{R}^{n \times k}$ and $C_k \in \mathbb{R}^{n \times k}$ be two full-column-rank matrices such that
$$\operatorname{range}(S_k) = \mathcal{S}_k, \qquad \operatorname{range}(C_k) = \mathcal{C}_k. \qquad (31)$$
Then from (31) there exists $t_k \in \mathbb{R}^{k}$ such that
$$x_k = x_0 + S_k\, t_k, \qquad (32)$$
and the constraint in (30) becomes
$$C_k^{\top} \left( b - A x_k \right) = 0, \qquad (33)$$
which can be rewritten as
$$C_k^{\top} A S_k\, t_k = C_k^{\top} r_0. \qquad (34)$$
Theorem 1 (see Theorem 2.1.2 in [8]). The matrix $C_k^{\top} A S_k$ in Equation (34) is nonsingular if and only if
$$\mathbb{R}^{n} = A \mathcal{S}_k \oplus \mathcal{C}_k^{\perp}. \qquad (35)$$
Now we assume that (35) holds. The linear system (34) can be regarded as a projected version of the original linear system $Ax = b$. As the matrix $C_k^{\top} A S_k$ has dimensions $k \times k$, (34) can also be regarded as a reduced linear system.
The matrix $P_k = A S_k \left( C_k^{\top} A S_k \right)^{-1} C_k^{\top}$ is an oblique projector because $P_k^{2} = P_k$, and $P_k r_0$ is the projection of $r_0$ onto $A \mathcal{S}_k$ along $\mathcal{C}_k^{\perp}$. It is easy to verify that, under (35),
$$r_k = b - A x_k = \left( I - P_k \right) r_0. \qquad (36)$$
Theorem 2 (see Lemma 2.1.7 in [8]). Suppose that (35) holds. If $r_0 \in A \mathcal{S}_k$, then $x_k$ is the solution of $Ax = b$.
Theorem 2 gives a sufficient condition for $x_k$ to be the solution of $Ax = b$ at step $k$: in order to make $x_k = x$, we need the search space to be rich enough that $r_0 \in A \mathcal{S}_k$ [8]. It therefore seems like a good idea to start the projection process with the search space $\mathcal{S}_1 = \operatorname{span}\{r_0\}$ and to build up a nested sequence of search spaces
$$\mathcal{S}_1 \subset \mathcal{S}_2 \subset \cdots \subset \mathcal{S}_k \subset \cdots, \qquad (37)$$
and these spaces eventually satisfy
$$r_0 \in A \mathcal{S}_k \qquad (38)$$
for some $k$. Then we can obtain the solution in a finite number of steps, and we hope that $k \ll n$.
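The following sketch performs one generic projection step, i.e. it solves the reduced system (34) for given (illustrative) search- and constraint-space basis matrices.

```python
import numpy as np

def projection_step(A, b, x0, S, C):
    """Sketch of one projection step: search space range(S), constraint space
    range(C); solves the reduced system (34), C^T A S t = C^T r0.
    A, b, x0, S, C are generic inputs with S, C of full column rank."""
    r0 = b - A @ x0
    t = np.linalg.solve(C.T @ A @ S, C.T @ r0)
    xk = x0 + S @ t
    rk = b - A @ xk
    assert np.allclose(C.T @ rk, 0)   # residual is orthogonal to the constraint space
    return xk
```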
3.2. Krylov Subspaces
Given $A \in \mathbb{R}^{n \times n}$ and a nonzero vector $v \in \mathbb{R}^{n}$, consider the following Krylov sequence:
$$v,\ Av,\ A^{2}v,\ A^{3}v,\ \ldots \qquad (39)$$
There exists an integer $d$ such that $v, Av, \ldots, A^{d-1}v$ are linearly independent but $v, Av, \ldots, A^{d}v$ are linearly dependent. It is easy to see that $d \le n$. The subspace $\mathcal{K}_k(A, v) = \operatorname{span}\{v, Av, \ldots, A^{k-1}v\}$ is called a Krylov subspace. We will show that the solution of $Ax = b$ lies in $x_0 + \mathcal{K}_m(A, r_0)$, where $m$ is the degree of the minimal polynomial of $A$, to be introduced below.
The characteristic polynomial of the matrix $A$ is defined by
$$p(\lambda) = \det(\lambda I - A) = \prod_{i=1}^{\ell} \left( \lambda - \lambda_i \right)^{n_i}, \qquad (40)$$
where $\lambda_1, \ldots, \lambda_{\ell}$ are the distinct eigenvalues of $A$ and $n_i$ is the algebraic multiplicity of the corresponding $\lambda_i$. The minimal polynomial of $A$ is defined as
$$q(\lambda) = \prod_{i=1}^{\ell} \left( \lambda - \lambda_i \right)^{m_i}, \qquad m = \sum_{i=1}^{\ell} m_i, \qquad (41)$$
where $m_i$ is the largest size of a Jordan block associated with $\lambda_i$. We know that for a Jordan block $J_i$ with eigenvalue $\lambda_i$ whose size is $m_i$, $(J_i - \lambda_i I)^{m_i} = 0$ and $(J_i - \lambda_i I)^{m_i - 1} \ne 0$. And for a polynomial $f$ and a matrix $B$ whose Jordan blocks are $J_1, \ldots, J_{\ell}$, $f(B)$ has diagonal blocks $f(J_1), \ldots, f(J_{\ell})$ in the same basis. Then for the matrix $A$ whose minimal polynomial is $q$, we have $q(A) = 0$, and therefore for all $v$ we have $q(A)\, v = 0$.
Writing $q(\lambda) = \sum_{j=0}^{m} c_j \lambda^{j}$ with $c_m = 1$, the relation $q(A)\, v = 0$ means that $A^{m} v$ can be written as
$$A^{m} v = -\sum_{j=0}^{m-1} c_j\, A^{j} v, \qquad (42)$$
where $c_0 = q(0) = \prod_{i=1}^{\ell} (-\lambda_i)^{m_i}$. It is easy to see from (41) that $c_0 \ne 0$, since $A$ is nonsingular and hence all $\lambda_i \ne 0$. So $c_0 \ne 0$ means that $A^{-1} v$ can be represented by $v, Av, \ldots, A^{m-1}v$:
$$A^{-1} v = -\frac{1}{c_0} \sum_{j=1}^{m} c_j\, A^{j-1} v,$$
which means $A^{-1} v \in \mathcal{K}_m(A, v)$, and we must have
$$x = x_0 + A^{-1} r_0 \in x_0 + \mathcal{K}_m(A, r_0).$$
Combining this conclusion with Theorem 2 and the fact that $A^{m} r_0 \in \mathcal{K}_m(A, r_0)$, we see that $\mathcal{K}_k(A, r_0)$ will eventually stop growing and become invariant under $A$, and that $\{\mathcal{K}_k(A, r_0)\}_{k \ge 1}$ is a choice of nested sequence of spaces satisfying (38). So we have:
Theorem 3. If the minimal polynomial of the nonsingular matrix $A$ has degree $m$, then the solution of $Ax = b$ lies in the space $x_0 + \mathcal{K}_m(A, r_0)$.
Actually, with different choices of $\mathcal{S}_k$ and $\mathcal{C}_k$, we obtain different Krylov subspace methods, as Table 1 shows.

Table 1. Different Krylov subspace methods (writing $\mathcal{K}_k = \mathcal{K}_k(A, r_0)$).

| Name | Problem | $\mathcal{S}_k$ | $\mathcal{C}_k$ | Property |
| --- | --- | --- | --- | --- |
| CG | SPD $A$ | $\mathcal{K}_k$ | $\mathcal{K}_k$ | minimizes the $A$-norm of the error |
| SYMMLQ | symmetric $A$ | $A\mathcal{K}_k$ | $\mathcal{K}_k$ | minimizes the 2-norm of the error |
| MINRES | symmetric $A$ | $\mathcal{K}_k$ | $A\mathcal{K}_k$ | minimizes the 2-norm of the residual |
| GMRES | general $A$ | $\mathcal{K}_k$ | $A\mathcal{K}_k$ | minimizes the 2-norm of the residual |
| BiCG | general $A$ | $\mathcal{K}_k$ | $\mathcal{K}_k(A^{\top}, \tilde{r}_0)$ | does not solve an optimization problem |
| QMR | general $A$ | $\mathcal{K}_k$ | no constraint-space formulation | quasi-minimizes the residual norm |
3.3. The Arnoldi algorithm for orthonormal basis
If we form the matrix $K_j = [\, v,\ Av,\ \ldots,\ A^{j-1}v \,]$, then because $A^{j} v$ converges in direction to an eigenvector corresponding to the largest eigenvalue in magnitude, $K_j$ is likely to be very ill-conditioned for large $j$. In order to avoid numerical trouble, we want to find orthonormal vectors $v_1, v_2, \ldots, v_k$ such that
$$\operatorname{span}\{v_1, \ldots, v_k\} = \mathcal{K}_k(A, v).$$
The idea is to use the Gram–Schmidt orthogonalization process, and we show one step of it in the following. Suppose we want to find $v_1$ and $v_2$ such that $\operatorname{span}\{v_1, v_2\} = \mathcal{K}_2(A, v)$. This can be done by:
1. Normalize: $v_1 = v / \|v\|_2$; then $\operatorname{span}\{v_1\} = \mathcal{K}_1(A, v)$.
2. Let $w = A v_1$; then $\operatorname{span}\{v_1, w\} = \mathcal{K}_2(A, v)$.
3. Orthogonalize $w$ against $v_1$: $\tilde{v}_2 = w - (v_1^{\top} w)\, v_1$; then $\tilde{v}_2 \perp v_1$ and we still have $\operatorname{span}\{v_1, \tilde{v}_2\} = \mathcal{K}_2(A, v)$.
4. Normalize: $v_2 = \tilde{v}_2 / \|\tilde{v}_2\|_2$; then $v_2 \perp v_1$ and we still have $\operatorname{span}\{v_1, v_2\} = \mathcal{K}_2(A, v)$.
We can write the Arnoldi algorithm in matrix form. Let $V_k = [\, v_1, \ldots, v_k \,]$ be the matrix whose columns are the orthonormal vectors $v_1, \ldots, v_k$. After $k$ steps, the algorithm satisfies [8]
$$A V_k = V_k H_k + h_{k+1, k}\, v_{k+1}\, e_k^{\top}, \qquad (43)$$
where $H_k \in \mathbb{R}^{k \times k}$ is the unreduced upper Hessenberg matrix with entries $h_{ij} = v_i^{\top} A v_j$ and $h_{j+1, j} \ne 0$, and $e_k$ is the $k$-th column of the identity matrix $I_k$. At the $m$-th iteration we get $h_{m+1, m} = 0$; the Arnoldi algorithm then stops and we have
$$A V_m = V_m H_m. \qquad (44)$$
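A compact sketch of the Arnoldi process is given below; it uses modified Gram–Schmidt, and the breakdown tolerance is an illustrative choice.

```python
import numpy as np

def arnoldi(A, v, k):
    """Sketch of k steps of the Arnoldi process: returns V with orthonormal
    columns spanning Krylov subspaces of (A, v) and the (k+1) x k upper
    Hessenberg matrix H such that A V[:, :k] = V @ H (if no breakdown)."""
    n = A.shape[0]
    V = np.zeros((n, k + 1))
    H = np.zeros((k + 1, k))
    V[:, 0] = v / np.linalg.norm(v)
    for j in range(k):
        w = A @ V[:, j]
        for i in range(j + 1):                 # orthogonalize against previous vectors
            H[i, j] = V[:, i] @ w
            w -= H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] < 1e-12:                # breakdown: the Krylov space is A-invariant
            return V[:, :j + 1], H[:j + 1, :j + 1]
        V[:, j + 1] = w / H[j + 1, j]
    return V, H
```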
4. Connections
One direct and most common application of Krylov subspace methods to VFA is to construct basis functions [11]. Recall that, as (5) shows, we need to solve the linear system
$$(I - \gamma P)\, v = r.$$
Since $P$ is a stochastic matrix and $0 \le \gamma < 1$, the spectral radius of $\gamma P$ is at most $\gamma < 1$, so $I - \gamma P$ is nonsingular. If we want to approximate $(I - \gamma P)^{-1}$, one way is to use the Neumann series expansion [13],
$$(I - \gamma P)^{-1} = \sum_{k=0}^{\infty} (\gamma P)^{k} = I + \gamma P + (\gamma P)^{2} + \cdots. \qquad (45)$$
Then the solution of (5) lies in $\mathcal{K}_m(P, r)$, where $m$ is the degree of the minimal polynomial of $P$, as shown below:
$$v = (I - \gamma P)^{-1} r = \sum_{k=0}^{\infty} (\gamma P)^{k} r \in \operatorname{span}\{r,\ Pr,\ P^{2}r,\ \ldots\} = \mathcal{K}_m(P, r). \qquad (46)$$
Therefore, we can set our feature matrix as $\Phi = [\, r,\ Pr,\ \ldots,\ P^{K-1}r \,]$. In practice, we could start from an initial basis vector and progressively add new Krylov vectors into the basis function matrix, as sketched below.
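Here is a sketch of such a construction: it builds an orthonormalized Krylov basis from $r$ and $P$, which assumes the model-based setting where both are known; the function name and tolerance are illustrative.

```python
import numpy as np

def krylov_features(P, r, K, gamma=0.9):
    """Sketch of a Krylov basis for VFA: spans {r, Pr, ..., P^{K-1} r},
    orthonormalized Arnoldi-style to avoid ill-conditioning."""
    n = len(r)
    Phi = np.zeros((n, K))
    Phi[:, 0] = r / np.linalg.norm(r)
    for j in range(1, K):
        w = P @ Phi[:, j - 1]
        for i in range(j):                    # orthogonalize against earlier columns
            w -= (Phi[:, i] @ w) * Phi[:, i]
        nrm = np.linalg.norm(w)
        if nrm < 1e-12:                       # Krylov space stopped growing
            return Phi[:, :j]
        Phi[:, j] = w / nrm
    return Phi
```

With the true model available, the projected (LSTD) solution in this basis approaches the exact value function as $K$ grows toward the degree of the minimal polynomial of $P$ with respect to $r$.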
Parr et al. [10] show that the basis constructed from the Krylov subspace is equivalent to the Bellman Error Basis Functions (BEBFs) and Model Error Basis Functions (MEBFs), which progressively add the Bellman error vectors and model error vectors, respectively, into the current basis.
Senda et al. [15] applied the Conjugate Gradient (CG) algorithm, the Bi-Conjugate Gradient (BiCG) algorithm and the Block Product-type Krylov Subspace Method (BPKSM) to policy evaluation and compared them with existing stationary iterative methods. They concluded that the algorithms based on Krylov subspace methods were tens to hundreds of times more efficient than the conventional algorithms, and that the convergence rate of the Krylov subspace algorithms improved as the iteration number increased, whereas that of the other algorithms remained the same. Krylov subspace methods can also be used in other areas of machine learning [19,20,21,22,23,24].
5. Conclusion and Discussion
From the last section we can see that constructing basis functions is only a simple and direct use of Krylov subspace methods. It works better in model-based learning, in which we have enough information about the transition matrix $P$. Whether or not it can be extended to model-free learning has not been well studied. Bertsekas [2] provided a good example of how to connect proximal iterations for solving linear systems with TD($\lambda$) methods and obtained some interesting results. We can take inspiration from this and try to use Krylov subspace methods for linear systems to solve approximate DP problems as well.
References
- Dimitri P. Bertsekas, Dynamic Programming and Stochastic Control lecture notes, Fall 2015, MIT.
- Dimitri P. Bertsekas, Proximal algorithms and temporal differences for large linear systems: extrapolation, approximation, and simulation, Report LIDS-P-3205, MIT; arXiv preprint arXiv: 1610.05427, 2016.
- Justin A. Boyan, Technical update: Least-squares temporal difference learning, Machine Learning, 49(2-3):233-246, 2002.
- Steven J. Bradtke and Andrew G. Barto, Linear least-squares algorithms for temporal difference learning, Machine Learning, 22:33-57, 1996.
- Gene H. Golub and Charles F. Van Loan, Matrix Computations, 4th Edn., The Johns Hopkins University Press, Baltimore, 2013, Chapter 11.
- Michail, G. Lagoudakis and Ronald Parr, Least-squares policy iteration, Journal of Machine Learning Research, 4:1107-1149, 2003.
- Alessandro Lazaric, Mohammad Ghavamzadeh and Rémi Munos, Finite-sample analysis of LSTD, In Proceedings of the 27th International Conference on Machine Learning, 2010.
- Jörg Liesen and Zdeněk Strakoš, Krylov Subspace Methods: Principles and Analysis, Oxford University Press, 2015, Chapters 1 and 2.
- Angelia Nedić and Dimitri P. Bertsekas, Least-squares policy evaluation algorithms with linear function approximation, Discrete Event Dynamic Systems: Theory and Applications, 13(1-2):79-110, 2003.
- Ronald Parr, Lihong Li, Gavin Taylor, Christopher Painter-Wakefield and Michael L. Littman, An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning, Proceedings of the 25th international conference on Machine learning, pp.752-759, Helsinki, Finland, July 05-09, 2008.
- Marek Petrik, An analysis of Laplacian methods for value function approximation in MDPs, Proceedings of the 20th international joint conference on Artifical intelligence, pp.2574-2579, Hyderabad, India, January 06-12, 2007.
- Pascal Poupart and Craig Boutilier, VDCBPI: an approximate scalable algorithm for large scale POMDPs, Advances in Neural Information Processing Systems 17 (NIPS-2004), pp.1081-1088, MIT Press, 2005.
- Martin L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, John Wiley & Sons, Inc., New York, NY, 1994, Chapter 6.
- Bruno Scherrer, Should one compute the temporal difference fixed point or minimize the Bellman residual: the unified oblique projection view, Proceedings of International Conference on Machine Learning, Haifa, Israel, pp.959-966, 2010.
- K. Senda, S. Hattori, T. Hishinuma and T. Kohda, Acceleration of reinforcement learning by policy evaluation using nonstationary iterative method, IEEE Trans. Cybern., 44(12):2696-2705, Dec. 2014.
- Csaba Szepesvári, Least squares temporal difference learning and Galerkin’s method, Mini-Workshop: Mathematics of Machine Learning, Oberwolfach Reports, European Mathematical Society, Oberwolfach, pp. 2385-2388, 2011.
- John N. Tsitsiklis and Benjamin Van Roy, An analysis of temporal-difference learning with function approximation, IEEE Trans. Automat. Contr., 42(5):674-690, 1997.
- Huizhen Yu and Dimitri P. Bertsekas, Error bounds for approximations from projected linear equations, Mathematics of Operations Research, 35(2):306-329, 2010.
- Sitao Luan, Mingde Zhao, Xiao-Wen Chang, Doina Precup (2019). Break the ceiling: Stronger multi-scale deep graph convolutional networks. Advances in neural information processing systems, 32.
- Sitao Luan, Xiao-Wen Chang, Doina Precup (2019). Revisit Policy Optimization in Matrix Form. arXiv preprint arXiv:1909.09186.
- Sitao Luan, Mingde Zhao, Chenqing Hua, Xiao-Wen Chang, Doina Precup (2020). Complete the missing half: Augmenting aggregation filtering with diversification for graph convolutional networks. arXiv preprint arXiv:2008.08844.
- Sitao Luan, Chenqing Hua, Qincheng Lu, Jiaqi Zhu, Mingde Zhao, Shuyuan Zhang, Xiao-Wen Chang, Doina Precup (2021). Is Heterophily A Real Nightmare For Graph Neural Networks To Do Node Classification?. arXiv preprint arXiv:2109.05641.
- Sitao Luan, Chenqing Hua, Qincheng Lu, Jiaqi Zhu, Mingde Zhao, Shuyuan Zhang, Xiao-Wen Chang, Doina Precup (2022). Revisiting heterophily for graph neural networks. arXiv preprint arXiv:2210.07606.
- Sitao Luan, Chenqing Hua, Minkai Xu, Qincheng Lu, Jiaqi Zhu, Xiao-Wen Chang, Jie Fu, Jure Leskovec, Doina Precup (2023). When Do Graph Neural Networks Help with Node Classification: Investigating the Homophily Principle on Node Distinguishability. arXiv preprint arXiv:2304.14274.