Preprint
Brief Report

Approximate Dynamic Programming and the Krylov Subspace Methods


This version is not peer-reviewed

Submitted:

28 June 2023

Posted:

29 June 2023

Abstract
This report discusses the application of approximate dynamic programming (ADP) and Krylov subspace methods to sequential decision-making problems in machine learning. ADP is used when the state space is large or when the exact dynamics are unknown. The report explores ADP methods such as temporal difference learning and least-squares policy evaluation, and focuses on the use of Krylov subspace methods for solving the Bellman equation in ADP. It provides insights into linear approximation, stochastic algorithms, and the Arnoldi algorithm for constructing an orthonormal basis. Overall, the report highlights the theoretical foundation provided by ADP and its connection with Krylov subspace methods, shedding light on their application in computer science.
Keywords: 
Subject: Computer Science and Mathematics - Computer Science

1. Introduction

Dynamic programming (DP) is both an optimization method and a computer programming method which was developed by Richard Bellman in the 1950s. The main idea is to break a complicated problem down into simpler sub-problems recursively. If there exists a recursive relation between the original problem and its sub-problems, then DP methods are applicable. In the optimization literature this relationship is called the Bellman equation.
In machine learning, one problem we are interested in is the sequential decision-making problem: how software agents ought to take actions to maximize the cumulative reward received from a stochastic environment. We formulate the interaction between the agent and the environment as a Markov decision process (MDP), whose structure can be modeled by the Bellman equation. But when the state space of the environment is too large, or an exact model of the dynamics is absent, the MDP can only be solved approximately with simulation and function approximation; this is what we call approximate dynamic programming (also known as neuro-dynamic programming or reinforcement learning). A large body of methods has been developed, e.g. temporal difference learning (TD) [4], least-squares temporal difference learning (LSTD) [3,4], least-squares policy evaluation (LSPE) [9] and least-squares policy iteration (LSPI) [6].
Computing the value function of a given policy (policy evaluation) via the Bellman equation plays an important role in approximate DP, and it is equivalent to solving a linear system [13]. In matrix computations, Krylov subspace methods [5,8] are the most powerful tools for solving large-scale linear systems of equations and least-squares problems. So there is a connection between these two areas, but this connection has not been explored in depth.
Instead of the term reinforcement learning, we use approximate DP in this report to emphasize its strong relationship with DP, because DP provides the theoretical foundations and important guidelines for designing approximate DP algorithms.
This report reviews some fundamental concepts and ideas of approximate DP and Krylov subspace methods and the connections between them.

2. Approximate dynamic programming methods

2.1. Problem description

A Markov decision process (MDP) is a framework for modeling the learning process in which an agent learns from interaction with an environment. Given the current state $S_t \in \mathcal{S}$ at time $t$, the agent interacts with the environment by selecting an action $A_t \in \mathcal{A}(S_t)$ according to the policy $\pi(A_t \mid S_t)$; the environment then takes the agent to another state $S_{t+1} \in \mathcal{S}$ and provides a numerical feedback (immediate reward) $R_{t+1} = R(S_t, A_t, S_{t+1}) \in \mathcal{R}$ according to the dynamics (transition probability) $p(S_{t+1}, R_{t+1} \mid S_t, A_t)$. The policy $\pi$ is a probability distribution from which we select the action. Our goal is to find an optimal policy (often deterministic) that maximizes the long-term cumulative reward. We assume that the policy is stationary. We only consider the infinite-horizon discounted problem because the ideas can also be applied to the simpler absorbing Markov chains [17]. So the objective function is $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$. It is easy to see the recursive structure
$$G_t = R_{t+1} + \gamma G_{t+1}. \tag{1}$$
The value function of the given policy $\pi$ at state $s$ is defined as
$$v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s\right], \quad s \in \mathcal{S}. \tag{2}$$
It can be written as
$$\begin{aligned}
v_\pi(s) &= \mathbb{E}_\pi\left[R_{t+1} + \gamma G_{t+1} \mid S_t = s\right] \\
&= \sum_a \pi(A_t = a \mid S_t = s) \sum_{s'} \sum_{r} p(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a)\left[r + \gamma\, \mathbb{E}_\pi\left[G_{t+1} \mid S_{t+1} = s'\right]\right] \\
&= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_\pi(s')\right], \quad s \in \mathcal{S}.
\end{aligned} \tag{3}$$
Equation (3) is the Bellman equation for the MDP. It depicts the recursive relationship between the value of $s$ and the values of its possible successor states $s'$, and it is consistent with (1). It can be written in a more compact form:
$$\begin{aligned}
v_\pi(s) &= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_\pi(s')\right] \\
&= \sum_a \pi(a \mid s) \sum_{s', r} r \cdot p(s', r \mid s, a) + \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, \gamma v_\pi(s') \\
&= \sum_a \pi(a \mid s)\, r(s, a) + \gamma \sum_{s'} \Big\{\sum_a \pi(a \mid s)\, p(s' \mid s, a)\Big\} v_\pi(s') \\
&= r_\pi(s) + \gamma \sum_{s'} p_\pi(s' \mid s)\, v_\pi(s'), \quad s \in \mathcal{S}.
\end{aligned} \tag{4}$$
Equation (4) can be written in matrix form as
$$v_\pi = r_\pi + \gamma P_\pi v_\pi \tag{5}$$
where $v_\pi \in \mathbb{R}^{|\mathcal{S}|}$, $r_\pi \in \mathbb{R}^{|\mathcal{S}|}$, $P_\pi \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{S}|}$ and $P_\pi(i, j) = p_\pi(j \mid i)$. Here $|\mathcal{S}|$ can be infinite, but for simplicity we assume it is finite. Since we only consider a fixed policy, we can write (5) as $v = r + \gamma P v$, or $Lv = r$ where $L = I - \gamma P$.
We define a shorthand notation for the DP mapping:
$$Tv = r + \gamma P v, \quad v \in \mathcal{V} \tag{6}$$
where $\mathcal{V}$ is the value function space and $T$ is called the Bellman operator. It can be proved that $T$ is a contraction mapping [1] and that its fixed point $v^*$ is the solution of (5). Finding this fixed point is exactly what we do in the policy evaluation step of the policy iteration method [13].
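To make this concrete, the following minimal sketch (in Python/NumPy; the 3-state chain, rewards and discount factor are made up for illustration) solves $Lv = r$ directly and checks that repeatedly applying the Bellman operator $T$ reaches the same fixed point.

```python
# Minimal sketch of exact policy evaluation on a made-up 3-state Markov
# reward process; P, r and gamma are illustrative only.
import numpy as np

gamma = 0.9
P = np.array([[0.5, 0.5, 0.0],   # P[i, j] = p_pi(j | i) under the fixed policy
              [0.1, 0.6, 0.3],
              [0.0, 0.2, 0.8]])
r = np.array([1.0, 0.0, 2.0])    # expected one-step rewards r_pi(s)

# Direct solve of L v = r with L = I - gamma * P
v_direct = np.linalg.solve(np.eye(3) - gamma * P, r)

# The same fixed point reached by repeatedly applying the Bellman operator T
v = np.zeros(3)
for _ in range(500):
    v = r + gamma * P @ v        # T v = r + gamma * P v

print(np.allclose(v, v_direct))  # True: both recover v_pi
```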
Solving (5) by a direct linear solver is not feasible when the state space is very large or when the transition matrix is unavailable, but we can find approximate solutions by value function approximation (VFA). Linear approximation is a popular VFA method [11]; it has the form $\hat{v} = \Phi\theta$, with $\Phi = (\Phi_1, \Phi_2, \ldots, \Phi_K) \in \mathbb{R}^{|\mathcal{S}| \times K}$, where $\Phi_j$ is the $j$-th feature function (or basis function) over the state space and $K$ is the number of features; see [14]. For each state $i$, $\hat{v}(i) = \sum_{j=1}^{K} \Phi_j(i)\,\theta_j = \phi(i)^T\theta$, where $\phi(i)^T$ is the $i$-th row of $\Phi$ and represents the feature vector of state $i$. Linear function approximation has better convergence properties [17] than nonlinear function approximation and is easier to analyze, interpret and implement. We can introduce nonlinearity when doing feature extraction, e.g. with deep neural networks. Suppose we already have well-chosen features; then, under the linear approximation, (5) becomes
$$\Phi\theta = r + \gamma P \Phi\theta. \tag{7}$$
We define a $\xi$-weighted quadratic norm on $\mathcal{V}$,
$$\|v\|_\xi = \sqrt{\sum_{s \in \mathcal{S}} \xi(s)\,(v(s))^2} = \sqrt{v^T \Xi v}, \quad v \in \mathcal{V} \tag{8}$$
where $\xi = (\xi_i)_{i=1}^{|\mathcal{S}|}$ with $\xi_i > 0$ is the weight vector and $\Xi$ is the diagonal matrix with $\xi$ on the diagonal. Then we can define the operator $\Pi$ that projects any value function to its closest representable function in $\mathrm{range}(\Phi)$ (i.e. the column space of $\Phi$) with respect to $\Xi$. That means $\Pi$ is an orthogonal projector onto $\mathrm{range}(\Phi)$ with respect to $\Xi$, i.e. $\Phi^T \Xi (v - \Pi v) = 0$, which is equivalent to
$$\Pi v \equiv \Phi\theta^{*}, \quad \text{where } \theta^{*} = \arg\min_{\theta \in \mathbb{R}^K} \|v - \Phi\theta\|_\xi. \tag{9}$$
We assume that the Markov chain we are studying is ergodic; then we have the stationary distribution
$$\mu(s) = \lim_{N \to \infty} \frac{1}{N} \sum_{k=1}^{N} P(S_k = s \mid S_0 = s_0) > 0, \quad \forall s, s_0 \in \mathcal{S}. \tag{10}$$
We usually use the vector $\mu$ as our weight vector $\xi$; otherwise a norm mismatch problem may arise, which can make the iterative methods described in the following sections diverge [1,14].
Bertsekas [1] gives a short summary of the main approximate DP algorithms:
  • Direct policy evaluation: Find $\theta$ that minimizes $\|v - \Phi\theta\|_\xi^2$. We compute an approximation of the projection by Monte Carlo simulation. We can introduce nonlinear architectures in $\hat{v}$, e.g. deep neural networks.
  • Indirect policy evaluation:
    - Solve the projected equation (Galerkin approximation [16,18])
      $$\Phi\theta = \Pi T(\Phi\theta) \tag{11}$$
      by simulation. Equation (11) is equivalent to minimizing the expected temporal difference error (TDE) $\|\hat{v} - \Pi T\hat{v}\|_\xi$ down to 0 for $\hat{v} \in \mathrm{range}(\Phi)$ [14].
      * TD learning: stochastic iterative algorithm for solving (11).
      * Least-squares temporal difference learning (LSTD): TD fixed point method [14].
      * Least-squares policy evaluation (LSPE): a simulation-based form of projected value iteration; essentially $\Phi\theta_{k+1} = \Pi T(\Phi\theta_k) + \text{simulation noise}$.
    - Bellman equation error (Bellman residual [14]) method: minimize $\|\Phi\theta - T(\Phi\theta)\|_\xi$.

2.2. Direct policy evaluation

We are interested in finding $\theta^* = \arg\min_{\theta \in \mathbb{R}^K} \|v - \hat{v}\|_\xi$ with $\hat{v} = \Phi\theta$. Suppose we can sample the exact values of $v$ or estimate them accurately; then we can learn the parameters in the supervised learning paradigm: the samples or estimates of $v$ are treated as the true labels, and $\hat{v}$ is the approximator. It then becomes a standard weighted least-squares problem, whose solution is the projection of $v$ onto $\mathrm{range}(\Phi)$ with respect to $\Xi$ [1]. Then
$$\theta^* = \arg\min_{\theta \in \mathbb{R}^K} \|v - \Phi\theta\|_\xi^2 = \arg\min_{\theta \in \mathbb{R}^K} \sum_{i=1}^{|\mathcal{S}|} \xi_i\left(v(i) - \phi(i)^T\theta\right)^2.$$
Setting the gradient with respect to $\theta$ to 0, we have
$$\theta^* = \left(\sum_{i=1}^{|\mathcal{S}|} \xi_i\, \phi(i)\phi(i)^T\right)^{-1} \sum_{i=1}^{|\mathcal{S}|} \xi_i\, \phi(i)\, v(i).$$
Suppose we have samples $\{(s_1, v(s_1)), \ldots, (s_m, v(s_m))\}$ generated by the environment, or by a simulator of the environment, according to $\xi$, where $s_j \in \{1, \ldots, |\mathcal{S}|\}$ for $j = 1, \ldots, m$. Then we obtain the simulation-based approximation
$$\hat{\theta}^* = \arg\min_{\theta \in \mathbb{R}^K} \sum_{j=1}^{m}\left(v(s_j) - \phi(s_j)^T\theta\right)^2 = \left(\sum_{j=1}^{m} \phi(s_j)\phi(s_j)^T\right)^{-1} \sum_{j=1}^{m} \phi(s_j)\, v(s_j).$$
Note that $\phi(s_j)\phi(s_j)^T$ is a $K \times K$ matrix. Thus, we are using low-dimensional linear algebra (of order $K$, the number of basis functions) compared with (5).
We can obtain an unbiased estimate of the true value function $v$ by Monte Carlo sampling, i.e. by looking far ahead along the Markov chain and collecting all the future rewards. However, the Monte Carlo estimate of $v$ requires a large number of samples because it suffers from high variance.
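As a concrete (purely hypothetical) illustration of direct policy evaluation, the sketch below uses truncated Monte Carlo returns as regression targets for a linear approximator and solves the resulting simulation-based least-squares problem; the chain, rewards and features are invented for the example.

```python
# Hypothetical sketch of direct policy evaluation: truncated Monte Carlo
# returns serve as targets for the linear approximator v_hat = Phi @ theta.
import numpy as np

rng = np.random.default_rng(0)
gamma, n_states, K = 0.9, 5, 2
P = rng.dirichlet(np.ones(n_states), size=n_states)  # made-up transition matrix
r = rng.standard_normal(n_states)                    # made-up rewards r_pi(s)
Phi = rng.standard_normal((n_states, K))             # made-up feature matrix

def mc_return(s, horizon=200):
    """Roll the chain forward and accumulate discounted rewards (truncated)."""
    g, disc = 0.0, 1.0
    for _ in range(horizon):
        g += disc * r[s]
        disc *= gamma
        s = rng.choice(n_states, p=P[s])
    return g

# Sample states by following the chain (so the weights approximate mu), then
# solve the simulation-based least-squares problem for theta.
samples, s = [], 0
for _ in range(300):
    samples.append((s, mc_return(s)))
    s = rng.choice(n_states, p=P[s])

A = sum(np.outer(Phi[s], Phi[s]) for s, _ in samples)  # K x K matrix
b = sum(Phi[s] * g for s, g in samples)                # K-vector
theta_hat = np.linalg.solve(A, b)
print(theta_hat)
```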

2.3. Indirect policy evaluation

2.3.1. Stochastic algorithms: a general view

Consider computing an approximate solution $\hat{x}_m$ of a linear system of equations $x = b + Ax$, $A \in \mathbb{R}^{n \times n}$, by using $m$ simulation samples $b + w_k$ and $A + W_k$, $k = 1, \ldots, m$, where $w_k, W_k$ are simulation noise. Think of $x = b + Ax$ as approximate policy evaluation, which has the same form as (5) [1]:
  • Stochastic approximation (SA) approach: starting from an initial point $x_0$, for $k = 1, \ldots, m$,
    $$x_k = (1 - \alpha_k)\, x_{k-1} + \alpha_k\left((b + w_k) + (A + W_k)\, x_{k-1}\right) \tag{12}$$
    where $\alpha_k$, $k = 1, \ldots, m$, are the weights (step sizes).
  • Monte Carlo estimation (MCE) approach: form Monte Carlo estimates of $b$ and $A$,
    $$b_m = \frac{1}{m}\sum_{k=1}^{m}(b + w_k), \qquad A_m = \frac{1}{m}\sum_{k=1}^{m}(A + W_k) \tag{13}$$
    and then solve $x = b_m + A_m x$ by a direct or an iterative method. $A_m$ and $b_m$ can be built incrementally.
TD is an SA method; LSTD and LSPE are MCE methods. We introduce them in Sections 2.3.2-2.3.4.
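The small sketch below (made-up $A$, $b$ and noise model) contrasts the two approaches on a noisy fixed-point problem $x = b + Ax$.

```python
# Sketch contrasting the SA and MCE approaches on x = b + A x with simulated
# noise; all data here are invented for illustration.
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = 0.1 * rng.random((n, n))          # spectral radius well below 1
b = rng.standard_normal(n)
x_true = np.linalg.solve(np.eye(n) - A, b)

m = 5000
noise = lambda shape: 0.1 * rng.standard_normal(shape)

# Stochastic approximation: one noisy fixed-point step per sample, cf. (12)
x_sa = np.zeros(n)
for k in range(1, m + 1):
    alpha = 1.0 / k                   # diminishing step sizes
    x_sa = (1 - alpha) * x_sa + alpha * ((b + noise(n)) + (A + noise((n, n))) @ x_sa)

# Monte Carlo estimation: average the noisy samples, then solve once, cf. (13)
A_m = np.mean([A + noise((n, n)) for _ in range(m)], axis=0)
b_m = np.mean([b + noise(n) for _ in range(m)], axis=0)
x_mce = np.linalg.solve(np.eye(n) - A_m, b_m)

print(np.linalg.norm(x_sa - x_true), np.linalg.norm(x_mce - x_true))
```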

2.3.2. Temporal difference (TD) learning

In 1988, Sutton developed temporal difference learning as a class of incremental learning methods for general prediction problems. It assigns credit through the difference between the predictions at the current state and at its successor state, which distinguishes it from conventional supervised learning methods. It has the form [4]
$$v(S_t) \leftarrow v(S_t) + \alpha\left[R_{t+1} + \gamma v(S_{t+1}) - v(S_t)\right] \tag{14}$$
where $\alpha$ is a constant step-size parameter. This form is equivalent to (12). TD learning is a combination of Monte Carlo ideas and DP ideas. It can learn directly from the experience $R_{t+1}$ sampled from the environment, which is similar to Monte Carlo methods, and it updates the estimates partly based on other learned estimates $v(S_{t+1})$, without collecting all future rewards (bootstrapping), which is similar to DP.
When we use the linear function approximation $\hat{v}(s_t) = \phi(s_t)^T\theta_t$, the parameter update can be written as [4]
$$\theta_{t+1} = \theta_t + \alpha\left[R_{t+1} + \gamma\,\phi(s_{t+1})^T\theta_t - \phi(s_t)^T\theta_t\right]\phi(s_t). \tag{15}$$
Equation (15) can be viewed as performing $\min_\theta J(\theta) = (v(s_t) - \hat{v}(s_t))^2$: we apply gradient descent, $\theta_{t+1} = \theta_t - \frac{1}{2}\alpha\nabla J(\theta) = \theta_t + \alpha\left(v(s_t) - \hat{v}(s_t)\right)\nabla\hat{v}(s_t)$, and substitute $v(s_t)$ by its approximation $R_{t+1} + \gamma\phi(s_{t+1})^T\theta_t$. We call these semi-gradient methods because we only take the derivative of $\hat{v}(s_t)$ but not of $R_{t+1} + \gamma\phi(s_{t+1})^T\theta_t$. TD is proved to converge with probability one [17].
TD requires less memory and less peak computation, and produces more accurate predictions, than conventional supervised learning methods.
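A minimal sketch of semi-gradient TD(0) with linear features, following update (15); the chain, features and step size below are made up for illustration.

```python
# Minimal sketch of semi-gradient TD(0) with linear features on a made-up chain.
import numpy as np

rng = np.random.default_rng(2)
gamma, alpha, n_states, K = 0.9, 0.05, 5, 3
P = rng.dirichlet(np.ones(n_states), size=n_states)  # made-up transition matrix
r = rng.standard_normal(n_states)                    # reward received on leaving state s
Phi = rng.standard_normal((n_states, K))             # made-up features

theta = np.zeros(K)
s = 0
for _ in range(20000):
    s_next = rng.choice(n_states, p=P[s])
    td_error = r[s] + gamma * Phi[s_next] @ theta - Phi[s] @ theta
    theta += alpha * td_error * Phi[s]               # semi-gradient update (15)
    s = s_next

print(theta)
```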

2.3.3. Least-squares temporal difference (LSTD) learning

Since $\Pi T$ is a contraction mapping on $\mathrm{range}(\Phi)$ with respect to $\Xi$ [1,14], there exists a unique fixed point $\Phi\theta^*$ that satisfies (11). That means $\theta^*$ is the unique solution of
$$\Phi\theta = \Pi T(\Phi\theta), \tag{16}$$
i.e. $\Phi\theta^*$ is the unique point in $\mathrm{range}(\Phi)$ that is mapped to itself by the projected operator $\Pi T$. According to the definition of $\Pi$, $\theta^*$ should satisfy the condition
$$\Phi^T \Xi\left(\Phi\theta^* - (r + \gamma P\Phi\theta^*)\right) = 0. \tag{17}$$
The orthogonality condition (17) is equivalent to the Galerkin method [16,18]. Equation (17) can be written as a linear system of equations $C\theta^* = d$, where
$$C = \Phi^T \Xi (I - \gamma P)\Phi, \qquad d = \Phi^T \Xi r. \tag{18}$$
So we have $\theta^* = C^{-1}d$. The simulation mechanism for estimating $C$ and $d$ is
$$d_k = \frac{1}{k+1}\sum_{t=0}^{k}\phi(s_t)\, r_t \;\longrightarrow\; \sum_{i,j}\xi_i P(i,j)\,\phi(i)\, R(i,j) = \Phi^T\Xi r = d \tag{19}$$
$$C_k = \frac{1}{k+1}\sum_{t=0}^{k}\phi(s_t)\left(\phi(s_t) - \gamma\,\phi(s_{t+1})\right)^T \;\longrightarrow\; \Phi^T\Xi(I - \gamma P)\Phi = C \tag{20}$$
where $C_k \in \mathbb{R}^{K \times K}$ and $d_k \in \mathbb{R}^{K}$ can be computed with low-dimensional calculations. Then $\theta^*$ can be approximated by $\hat{\theta}_k = C_k^{-1} d_k$.
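A hypothetical LSTD sketch: accumulate the simulation estimates $C_k$ and $d_k$ of (19) and (20) along a single trajectory of a made-up chain, then solve the low-dimensional system $C_k\theta = d_k$.

```python
# Hypothetical LSTD sketch: build C_k and d_k from one long trajectory and
# solve the K x K system; the model below is invented.
import numpy as np

rng = np.random.default_rng(3)
gamma, n_states, K = 0.9, 5, 3
P = rng.dirichlet(np.ones(n_states), size=n_states)
r = rng.standard_normal(n_states)
Phi = rng.standard_normal((n_states, K))

C, d = np.zeros((K, K)), np.zeros(K)
s, T_steps = 0, 50000
for _ in range(T_steps):
    s_next = rng.choice(n_states, p=P[s])
    phi, phi_next = Phi[s], Phi[s_next]
    C += np.outer(phi, phi - gamma * phi_next)   # running sum for (20)
    d += phi * r[s]                              # running sum for (19)
    s = s_next

theta_lstd = np.linalg.solve(C / T_steps, d / T_steps)
print(theta_lstd)
```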
LSTD makes more efficient use of data because it extracts more information from each additional observation, and thus can be expected to converge with fewer training samples. In addition, it does not require the user to manually tune a step-size schedule for good performance. LSTD converges to the same point as TD [4].
TD's convergence can be slowed dramatically by a poor choice of the step-size parameter $\alpha$, and its performance is sensitive to the initial estimate of $\theta$; LSTD does not rely on an arbitrary initial estimate. TD is also sensitive to the ranges of the individual features; LSTD is not.

2.3.4. Least-squares policy evaluation

An alternative way to solve the projected equation is projected value iteration (PVI), that is
$$\Phi\theta_{k+1} = \Pi T(\Phi\theta_k) = \Pi\left(r + \gamma P\Phi\theta_k\right). \tag{21}$$
It converges to $\theta^*$ because $\Pi T$ is a contraction mapping with respect to $\Xi$. Rather than solving the projected equation directly, it approaches the fixed point iteratively, starting from an arbitrary initial point. Each iteration is equivalent to the minimization problem
$$\theta_{k+1} = \arg\min_{\theta \in \mathbb{R}^K}\left\|\Phi\theta - (r + \gamma P\Phi\theta_k)\right\|_\xi^2. \tag{22}$$
Setting the gradient with respect to $\theta$ to 0, we have
$$\Phi^T\Xi\left(\Phi\theta_{k+1} - (r + \gamma P\Phi\theta_k)\right) = 0. \tag{23}$$
It follows that
$$\theta_{k+1} = \theta_k - \left(\Phi^T\Xi\Phi\right)^{-1}\left(C\theta_k - d\right) \tag{24}$$
where $C$ and $d$ are defined in (18) and can be estimated by (19) and (20). $G = \Phi^T\Xi\Phi$ can be estimated by
$$G_k = \frac{1}{k+1}\sum_{t=0}^{k}\phi(s_t)\,\phi(s_t)^T \;\longrightarrow\; \Phi^T\Xi\Phi = G. \tag{25}$$
$C_k$, $d_k$ and $G_k$ can be formed incrementally or in a temporal-difference style [1].
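The sketch below shows the deterministic core of LSPE, i.e. the projected value iteration update (24), with $G$, $C$, $d$ computed exactly from a made-up model (in practice they would be replaced by the trajectory-based estimates $G_k$, $C_k$, $d_k$); the iterates converge to the LSTD solution $C^{-1}d$.

```python
# Sketch of the LSPE / projected value iteration update (24) with exact
# G, C, d from a made-up model; estimates G_k, C_k, d_k would be used in practice.
import numpy as np

def lspe(G, C, d, n_iters=300):
    theta = np.zeros(d.shape[0])
    for _ in range(n_iters):
        theta = theta - np.linalg.solve(G, C @ theta - d)   # update (24)
    return theta

rng = np.random.default_rng(4)
gamma, n_states, K = 0.9, 4, 2
P = rng.dirichlet(np.ones(n_states), size=n_states)
r = rng.standard_normal(n_states)
Phi = rng.standard_normal((n_states, K))

mu = np.ones(n_states) / n_states   # stationary distribution as the weights xi,
for _ in range(1000):               # to avoid the norm mismatch problem
    mu = mu @ P
Xi = np.diag(mu)

G = Phi.T @ Xi @ Phi
C = Phi.T @ Xi @ (np.eye(n_states) - gamma * P) @ Phi
d = Phi.T @ Xi @ r
print(np.allclose(lspe(G, C, d), np.linalg.solve(C, d)))   # True
```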

2.3.5. Bellman equation error methods

The main idea of the Bellman equation error (BE) method is to find a $\hat{v} \in \mathrm{range}(\Phi)$ such that $\|\hat{v} - T\hat{v}\|_\xi$ is minimized. Recall that $L = I - \gamma P$. With the notation $\Psi = L\Phi$, we have $\|\hat{v} - T\hat{v}\|_\xi = \|\Phi\theta - r - \gamma P\Phi\theta\|_\xi = \|\Psi\theta - r\|_\xi$. Then it becomes a standard least-squares problem with respect to $\|\cdot\|_\xi$, and it is easy to see that the solution of this weighted least-squares problem is
$$\theta_{BE} = \left(\Psi^T\Xi\Psi\right)^{-1}\Psi^T\Xi r. \tag{26}$$
One reason that we are interested in the BE method is the well-known result: $\forall \hat{v}$, $\|v - \hat{v}\| \le \frac{1}{1-\gamma}\|T\hat{v} - \hat{v}\|$, and for the weighted norm we have $\forall \hat{v}$, $\|v - \hat{v}\|_\xi \le \frac{\sqrt{C(\xi)}}{1-\gamma}\|T\hat{v} - \hat{v}\|_\xi$, where $C(\xi) := \max_{i,j} P(i,j)/\xi_j$ is a "concentration coefficient", which can be seen as a measure of the stochasticity of the MDP [14]. This means that when the BE is small, we are close to the true value function $v$; for TD there is no similar result. Actually, it is proved [14] that the BE is an upper bound of the TDE. More precisely, since $\hat{v} = \Phi\theta = \Pi\Phi\theta$, the Pythagorean theorem gives
$$\mathrm{BE}^2 = \|\hat{v} - T\hat{v}\|_\xi^2 = \|\Pi(\hat{v} - T\hat{v})\|_\xi^2 + \|(I - \Pi)(\hat{v} - T\hat{v})\|_\xi^2 = \mathrm{TDE}^2 + \|T\hat{v} - \Pi T\hat{v}\|_\xi^2, \tag{27}$$
which means that minimizing the BE not only minimizes the TDE but also minimizes $\|T\hat{v} - \Pi T\hat{v}\|_\xi^2$, the part that the TD method misses. We can also see that $\mathrm{TDE}^2 = \|\Pi(\hat{v} - T\hat{v})\|_\xi^2$, which is why the TDE is also called the projected Bellman error (PBE).
The BE has an additional component that depends on the features. Hence, if the feature space is fixed, the term we really want to minimize is the TDE; only when the features can be modified do we start to look at the BE.
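A small sketch computing the Bellman-residual solution (26) and, for comparison, the TD fixed point on a made-up model:

```python
# Sketch of the Bellman-equation-error minimizer (26) versus the TD fixed
# point; all matrices below are illustrative.
import numpy as np

rng = np.random.default_rng(5)
gamma, n_states, K = 0.9, 4, 2
P = rng.dirichlet(np.ones(n_states), size=n_states)
r = rng.standard_normal(n_states)
Phi = rng.standard_normal((n_states, K))

mu = np.ones(n_states) / n_states
for _ in range(1000):
    mu = mu @ P                      # stationary distribution used as xi
Xi = np.diag(mu)

L = np.eye(n_states) - gamma * P
Psi = L @ Phi
theta_be = np.linalg.solve(Psi.T @ Xi @ Psi, Psi.T @ Xi @ r)       # BE minimizer (26)
theta_td = np.linalg.solve(Phi.T @ Xi @ L @ Phi, Phi.T @ Xi @ r)   # TD fixed point
print(theta_be, theta_td)
```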

2.4. Least-squares policy iteration (LSPI)

Motivated by LSTD, Lagoudakis and Parr [6] extended it to control problems. Instead of evaluating the state value function $v$, they evaluate state-action pairs $(s, a)$, i.e. the $Q$ function. The resulting method is referred to as LSTDQ. The $Q$ function is defined as
$$Q(s, a) = \mathbb{E}\left[\sum_{k=0}^{\infty}\gamma^k R_{t+k+1} \,\Big|\, S_t = s, A_t = a\right], \quad s \in \mathcal{S},\; a \in \mathcal{A}(s). \tag{28}$$
We can directly choose actions and improve our policy from the $Q$ function without any knowledge of the underlying model of the environment. Everything about LSTDQ looks similar to LSTD after we change $\Phi$ to be the feature matrix over all state-action pairs. Putting LSTDQ into the generalized policy iteration (GPI) framework to update the policy yields the LSPI algorithm.

2.5. A unified oblique projection view

In LSTD, we attempt to find the fixed point of the composite operator $\Pi T$, where $\Pi$ is an orthogonal projection onto $\mathrm{range}(\Phi)$ with respect to $\Xi$. Now we consider using a projector $\Pi$ that is not necessarily orthogonal, i.e. a general operator that satisfies $\Pi^2 = \Pi$ and whose range is $\mathrm{range}(\Phi)$. In the most general case, such operators are called oblique projections and can be written as $\Pi_X = \Phi\pi_X$ with $\pi_X = (X^T\Phi)^{-1}X^T$. The parameter $X$ specifies the projection direction: precisely, $\Pi_X$ is the projection onto $\mathrm{range}(\Phi)$ orthogonally to $\mathrm{range}(X)$, i.e. $\forall v,\ (v - \Pi_X v) \perp X$ (the $\Pi$ defined in (9) can be written as $\Pi_{\Xi\Phi}$). The vector $\pi_X v$ gives the coordinates of the projection of $v$ in the basis $\Phi$.
Scherrer [14] provided a unified oblique projection view of the TD fixed point and BE minimization methods. Recall that $\theta_{TD} = (\Phi^T\Xi L\Phi)^{-1}\Phi^T\Xi r$ and $\theta_{BE} = (\Psi^T\Xi\Psi)^{-1}\Psi^T\Xi r$. If we write $X_{TD} = \Xi\Phi$ and $X_{BE} = \Xi L\Phi$, then (i) the TD fixed point and the BE minimizer are the solutions of the projected equation $\hat{v}_X = \Pi_X T\hat{v}_X$ with $X = X_{TD}$ and $X = X_{BE}$, respectively; and (ii) when it exists, the solution of this projected equation is the projection of $v$ onto $\mathrm{range}(\Phi)$ orthogonally to $\mathrm{range}(L^T X)$, i.e. $\hat{v}_X = \Pi_{L^T X}\, v$. From this perspective, the two methods are placed under the same framework of a projected fixed-point equation.
From the results above, it is easy to obtain
$$\|v - \hat{v}_X\|_\xi \le \|\Pi_{L^T X}\|_\xi\, \|v - \Pi v\|_\xi = \sqrt{\sigma(ABCB^T)}\;\|v - \Pi v\|_\xi, \quad \forall X, \tag{29}$$
where $A = \Phi^T\Xi\Phi$, $B = (X^T L\Phi)^{-1}$ and $C = X^T L\,\Xi^{-1} L^T X$ are matrices of size $K \times K$, and $\sigma$ denotes the spectral radius of the given matrix. $A$, $B$, $C$ can be estimated by simulation. This inequality can also give us some inspiration on how to choose $X$ and extract $\Phi$: e.g. if we choose $X = (L^T)^{-1}\Xi\Phi$, we obtain $\hat{v} = \Pi_{\Xi\Phi}\, v$, which is the orthogonal projection of the true $v$ onto $\mathrm{range}(\Phi)$ with respect to $\Xi$.
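The following numerical sketch (made-up model data) checks the unified view: for both $X_{TD} = \Xi\Phi$ and $X_{BE} = \Xi L\Phi$, the solution of the projected equation coincides with the oblique projection $\Pi_{L^T X}\, v$ of the true value function.

```python
# Numerical sketch of the oblique projection view: the solution of the
# projected equation equals Pi_{L^T X} v for X = X_TD and X = X_BE.
import numpy as np

rng = np.random.default_rng(6)
gamma, n_states, K = 0.9, 5, 2
P = rng.dirichlet(np.ones(n_states), size=n_states)
r = rng.standard_normal(n_states)
Phi = rng.standard_normal((n_states, K))

mu = np.ones(n_states) / n_states
for _ in range(1000):
    mu = mu @ P
Xi = np.diag(mu)
L = np.eye(n_states) - gamma * P
v = np.linalg.solve(L, r)                       # true value function

def projected_solution(X):
    """theta solving Phi theta = Pi_X T(Phi theta), i.e. (X^T L Phi) theta = X^T r."""
    return np.linalg.solve(X.T @ L @ Phi, X.T @ r)

def oblique_projection(X, y):
    """Pi_X y = Phi (X^T Phi)^{-1} X^T y."""
    return Phi @ np.linalg.solve(X.T @ Phi, X.T @ y)

for X in (Xi @ Phi, Xi @ L @ Phi):              # X_TD and X_BE
    theta = projected_solution(X)
    print(np.allclose(Phi @ theta, oblique_projection(L.T @ X, v)))   # True, True
```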
Finally, Scherrer compares the TD and BE methods: BE is more robust than TD because it comes with a performance guarantee. Although TD often outperforms BE, it can sometimes fail dramatically, so TD is riskier. In practice, TD is easier to implement by sampling and has lower variance, whereas BE needs more samples to eliminate the effect of the noise.

3. Krylov subspace methods

The earliest Krylov subspace methods were developed by Lanczos, Hestenes and Stiefel in the late 1940s and early 1950s for solving large sparse linear systems of equations $Ax = b$ (with nonsingular $A \in \mathbb{R}^{n \times n}$). Krylov subspace methods form a class of iterative methods. An iterative method solves the linear system by producing a sequence of iterates $\{x^{(k)}\}$, starting from an initial point $x^{(0)}$, that is expected to converge to the true solution. Iterative methods have advantages over direct methods (e.g. matrix factorization methods) when $A$ is a large sparse matrix whose structure is hard to exploit. Moreover, we can stop an iterative method early if we only need an approximate solution, which we cannot do with a direct method. Krylov subspace methods differ from classical iterative methods (e.g. matrix splitting methods) because they have memory, i.e. they construct $x^{(k+1)}$ not only from $x^{(k)}$ but also from the information of the previous iterates, so they are faster. Furthermore, Krylov subspace methods require only the ability to compute $Ax$ for any $x$ (some also require the computation of $A^T x$).

3.1. Projection Process

First we introduce a general projection process for finding the solution $\hat{x}$. Let $x_0$ denote an initial approximation to $\hat{x}$ and let $r_0 \equiv b - Ax_0$ be the initial residual. At step $k$, we impose the following conditions on $x_k$:
$$x_k \in x_0 + \mathcal{S}_k, \qquad r_k \equiv b - Ax_k \perp \mathcal{C}_k, \qquad k = 1, 2, \ldots \tag{30}$$
where $\mathcal{S}_k \subset \mathbb{R}^n$ is a $k$-dimensional subspace, referred to as the search subspace, and $\mathcal{C}_k \subset \mathbb{R}^n$ is also a $k$-dimensional subspace, referred to as the constraint subspace. Since $\mathcal{S}_k$ has dimension $k$, we have $k$ degrees of freedom to construct $x_k$. In order to determine $x_k$, we need $\mathcal{C}_k$ to give us $k$ constraints, which are imposed on the residual $r_k$ in (30).
Let $V_k \in \mathbb{R}^{n \times k}$ and $U_k \in \mathbb{R}^{n \times k}$ be two full-column-rank matrices such that
$$\mathcal{S}_k = \mathrm{range}(V_k), \qquad \mathcal{C}_k = \mathrm{range}(U_k). \tag{31}$$
Then from (31) there exists $z_k \in \mathbb{R}^k$ such that
$$x_k = x_0 + V_k z_k \tag{32}$$
and
$$U_k^T\left(b - Ax_0 - AV_k z_k\right) = 0, \tag{33}$$
which can be rewritten as
$$U_k^T A V_k\, z_k = U_k^T r_0. \tag{34}$$
Theorem 1 (see Theorem 2.1.2 in [8]). The matrix $U_k^T A V_k$ in Equation (34) is nonsingular if and only if
$$\mathbb{R}^n = A\mathcal{S}_k \oplus \mathcal{C}_k^{\perp}. \tag{35}$$
Now we assume that (35) holds. The linear system (34) can be regarded as a projected version of the original linear system $Ax = b$. As the dimension of the matrix $U_k^T A V_k$ is $k \times k$, (34) can also be regarded as a reduced linear system.
From (32) and (34)
$$x_k = x_0 + V_k\left(U_k^T A V_k\right)^{-1} U_k^T r_0. \tag{36}$$
Then
$$r_k = b - Ax_k = b - Ax_0 - AV_k\left(U_k^T A V_k\right)^{-1} U_k^T r_0 = (I - P_k)\, r_0 \tag{37}$$
where
$$P_k \equiv AV_k\left(U_k^T A V_k\right)^{-1} U_k^T.$$
The matrix $P_k$ is a projector because $P_k^2 = P_k$; in general it is an oblique projector. So $r_k$ is the projection of $r_0$ onto $\mathcal{C}_k^{\perp}$ along $A\mathcal{S}_k$, and it is easy to verify that, for the decomposition in (35), $A\mathcal{S}_k = \mathcal{R}(P_k)$ and $\mathcal{C}_k^{\perp} = \mathcal{N}(P_k) = \mathcal{R}(I - P_k)$.
Theorem 2 (see Lemma 2.1.7 in [8]). Suppose that (35) holds. If $r_0 \in \mathcal{S}_k$ and $A\mathcal{S}_k = \mathcal{S}_k$, then $x_k$ is the solution of $Ax = b$.
Theorem 2 gives a sufficient condition for $x_k$ to be the solution of $Ax = b$ at step $k$: in order to make $r_k = 0$ we need $r_0 = P_k r_0$ [8]. It therefore seems like a good idea to start the projection process with the search space $\mathcal{S}_1 = \mathrm{span}\{r_0\}$ and to build up a nested sequence of search spaces
$$\mathrm{span}\{r_0\} = \mathcal{S}_1 \subseteq \mathcal{S}_2 \subseteq \mathcal{S}_3 \subseteq \cdots \tag{38}$$
such that these spaces eventually satisfy $A\mathcal{S}_k = \mathcal{S}_k$ for some $k$. Then we can obtain the solution in a finite number of steps, and we hope that $k \ll n$.
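One step of the general projection process takes only a few lines; the sketch below uses random (purely illustrative) bases $V_k$ and $U_k$ and verifies that the resulting residual satisfies the constraint in (30).

```python
# Sketch of one step of the projection process (30)-(34) with random search
# and constraint bases; checks that the new residual is orthogonal to C_k.
import numpy as np

rng = np.random.default_rng(7)
n, k = 8, 3
A = rng.standard_normal((n, n)) + 5 * np.eye(n)   # comfortably nonsingular
b = rng.standard_normal(n)
x0 = np.zeros(n)
r0 = b - A @ x0

V = rng.standard_normal((n, k))   # columns span the search space S_k
U = rng.standard_normal((n, k))   # columns span the constraint space C_k

z = np.linalg.solve(U.T @ A @ V, U.T @ r0)   # reduced system (34)
x_k = x0 + V @ z                             # (32)
r_k = b - A @ x_k
print(np.allclose(U.T @ r_k, 0))             # True: r_k is orthogonal to C_k
```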

3.2. Krylov Subspaces

Given $A \in \mathbb{R}^{n \times n}$ and a nonzero vector $v \in \mathbb{R}^n$, consider the following Krylov sequence:
$$v,\; Av,\; A^2 v,\; \ldots$$
There exists an integer $d = d(A, v)$ such that $\{v, \ldots, A^{d-1}v\}$ are linearly independent but $\{v, \ldots, A^{d}v\}$ are linearly dependent. It is easy to see that $d \le n$. The subspace $\mathcal{K}_k(A, v) = \mathrm{span}\{v, Av, \ldots, A^{k-1}v\}$ is called a Krylov subspace. We will show that $d \le m$, where $m$ is the degree of the minimal polynomial of $A$, to be introduced below.
The characteristic polynomial of the matrix $A$ is defined by
$$P_A(t) = \det(tI - A) = \prod_{i=1}^{k}(t - \lambda_i)^{s_i} \tag{39}$$
where $\lambda_1, \ldots, \lambda_k$ are the distinct eigenvalues of $A$ and $s_i$ is the algebraic multiplicity of the corresponding $\lambda_i$. The minimal polynomial of $A$ is defined as
$$m_A(t) = \prod_{i=1}^{k}(t - \lambda_i)^{m_i} \tag{40}$$
where $m_i$ is the largest size of a Jordan block associated with $\lambda_i$. Recall that for a Jordan block $J_i$ with eigenvalue $\lambda_i$ and size $m_i$, $(J_i - \lambda_i I)^{m_i} = 0$ and $(J_i - \lambda_i I)^{m_i - 1} \ne 0$, and that for a polynomial $q(t)$ and a matrix $B$ whose Jordan blocks are $B_1, \ldots, B_k$, $q(B)$ is block diagonal with blocks $q(B_1), \ldots, q(B_k)$. Then for the matrix $A$ whose minimal polynomial is $m_A(t)$, we have $m_A(A) = 0$, and therefore $m_A(A)v = 0$ for all $v$. $m_A(t)$ can be written as
$$m_A(t) = \prod_{i=1}^{k}(t - \lambda_i)^{m_i} = \sum_{j=0}^{m}\alpha_j t^j \tag{41}$$
where $m = \sum_{i=1}^{k} m_i$. It is easy to see from (41) that $\alpha_m = 1$, so $m_A(A)v = 0$ means that $A^m v$ can be represented by $v, Av, \ldots, A^{m-1}v$, i.e. $A^m v \in \mathcal{K}_m(A, v)$, and we must have $d \le m$. Combining this conclusion with Theorem 2 and the fact that $r_0 \in \mathcal{K}_i(A, r_0)$, we see that $\mathcal{K}_i(A, r_0)$ will eventually stop growing and become invariant under $A$, and that $\mathcal{K}_i(A, r_0)$, $i = 1, \ldots, m$, is a choice of nested sequence of spaces satisfying (38). So we have
Theorem 3. If the minimal polynomial of the nonsingular matrix $A$ has degree $m$, then the solution of $Ax = b$ lies in the space $\mathcal{K}_m(A, b)$.
With different choices of $\mathcal{S}_k$ and $\mathcal{C}_k$, we obtain different Krylov subspace methods, as Table 1 shows.
Table 1. Different Krylov subspace methods.

| Name   | Problem     | $\mathcal{S}_k$          | $\mathcal{C}_k$                    | Property |
|--------|-------------|--------------------------|------------------------------------|----------|
| CG     | SPD $A$ ¹   | $\mathcal{K}_k(A, r_0)$  | $\mathcal{K}_k(A, r_0)$            | $x_k = \arg\min_{x \in x_0 + \mathcal{K}_k(A, r_0)} \|x - x^*\|_A$ |
| SYMMLQ | $A^T = A$   | $A\mathcal{K}_k(A, r_0)$ | $\mathcal{K}_k(A, r_0)$            | $x_k = \arg\min_{x \in x_0 + A\mathcal{K}_k(A, r_0)} \|x - x^*\|_2$ ² |
| MINRES | $A^T = A$   | $\mathcal{K}_k(A, r_0)$  | $A\mathcal{K}_k(A, r_0)$           | $x_k = \arg\min_{x \in x_0 + \mathcal{K}_k(A, r_0)} \|b - Ax\|_2$ |
| GMRES  | general $A$ | $\mathcal{K}_k(A, r_0)$  | $A\mathcal{K}_k(A, r_0)$           | $x_k = \arg\min_{x \in x_0 + \mathcal{K}_k(A, r_0)} \|b - Ax\|_2$ |
| BiCG   | general $A$ | $\mathcal{K}_k(A, r_0)$  | $\mathcal{K}_k(A^T, \tilde{r}_0)$  | does not solve an optimization problem |
| QMR    | general $A$ | $\mathcal{K}_k(A, r_0)$  | no $\mathcal{C}_k$                 | $x_k \approx \arg\min_{x \in x_0 + \mathcal{K}_k(A, r_0)} \|b - Ax\|_2$ |

¹ Symmetric positive definite $A$: $x^T A x > 0$ for all $x \neq 0$. ² Vector 2-norm: $\|x\|_2 = \sqrt{x^T x}$.
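As a quick numerical illustration of Theorem 3, the sketch below builds a matrix whose minimal polynomial has degree $m = 2$ (two distinct eigenvalues, diagonalizable) and checks that the solution of $Ax = b$ already lies in $\mathcal{K}_2(A, b)$; the matrix and right-hand side are made up.

```python
# Illustration of Theorem 3: A is diagonalizable with eigenvalues {2, 3}, so
# its minimal polynomial (t - 2)(t - 3) has degree m = 2 and the solution
# x = A^{-1} b lies in K_2(A, b) = span{b, A b}.
import numpy as np

A = np.diag([2.0, 2.0, 3.0, 3.0, 3.0])
b = np.array([1.0, -1.0, 2.0, 0.5, 1.5])
x_true = np.linalg.solve(A, b)

K2 = np.column_stack([b, A @ b])                       # Krylov basis of K_2(A, b)
coeffs, *_ = np.linalg.lstsq(K2, x_true, rcond=None)
print(np.allclose(K2 @ coeffs, x_true))                # True
```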

3.3. The Arnoldi algorithm for orthonormal basis

If we form the Krylov matrix $K_j(A, r_0)$,
$$K_j(A, r_0) = \left[r_0,\; Ar_0,\; A^2 r_0,\; \ldots,\; A^{j-1} r_0\right],$$
then, because the direction of $A^j r_0$ converges to that of an eigenvector corresponding to the eigenvalue of largest magnitude, $K_j(A, r_0)$ is likely to be very ill-conditioned for large $j$. In order to avoid numerical trouble, we want to find orthonormal vectors $\{v_i\}_{i=1}^{j}$ such that
$$\mathcal{K}_j(A, r_0) = \mathrm{range}\left(K_j(A, r_0)\right) = \mathrm{span}\{v_1, v_2, \ldots, v_j\}.$$
The idea is to use the Gram–Schmidt orthogonalization process; we show one step of it in the following.
Suppose we want to find $v_1$ and $v_2$ such that
$$\mathcal{K}_2(A, r_0) = \mathrm{span}\{v_1, v_2\}, \qquad v_i^T v_j = \delta_{ij}, \quad i, j = 1, 2.$$
Then this can be done by:
1. Normalize $v_1$: $v_1 := r_0 / \|r_0\|_2$; then $\|v_1\|_2 = 1$.
2. Let $w_1 := A v_1$; then
$$\mathcal{K}_2(A, r_0) = \mathrm{span}\{v_1, w_1\}.$$
3. Orthogonalize $w_1$ against $v_1$: $v_2 := w_1 - (w_1^T v_1)\, v_1$; then
$$v_2^T v_1 = w_1^T v_1 - (v_1^T v_1)(w_1^T v_1) = 0$$
and we still have
$$\mathcal{K}_2(A, r_0) = \mathrm{span}\{v_1, v_2\}.$$
4. Normalize $v_2$: $v_2 := v_2 / \|v_2\|_2$; then $\|v_2\|_2 = 1$ and we still have $\mathcal{K}_2(A, r_0) = \mathrm{span}\{v_1, v_2\}$.
We can write the Arnoldi algorithm in matrix form. Let $V_k = [v_1, \ldots, v_k]$ be the matrix whose columns are the orthonormal vectors $v_1, \ldots, v_k$. After $k$ steps, the algorithm satisfies [8]
$$A V_k = V_k H_k + h_{k+1,k}\, v_{k+1} e_k^T, \qquad k = 1, 2, \ldots$$
$$H_k = \begin{pmatrix}
h_{11} & h_{12} & \cdots & h_{1,k-1} & h_{1,k} \\
h_{21} & h_{22} & \cdots & h_{2,k-1} & h_{2,k} \\
       & h_{32} & \cdots & h_{3,k-1} & h_{3,k} \\
       &        & \ddots & \vdots    & \vdots  \\
       &        &        & h_{k,k-1} & h_{k,k}
\end{pmatrix}$$
where $H_k$ is the $k \times k$ unreduced upper Hessenberg matrix with $h_{ij} = w_j^T v_i$, $i = 1, \ldots, j$, and $h_{j+1,j} = \|w_j\|_2$, and $e_k$ is the $k$-th column of $I_k$. At the $m$-th iteration we get $h_{m+1,m} = 0$; then the Arnoldi algorithm stops and we have
$$A V_m = V_m H_m, \qquad V_m^T V_m = I_m.$$
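A compact sketch of the Arnoldi process (the code uses the modified Gram–Schmidt variant), producing $V_k$ with orthonormal columns and the Hessenberg matrix so that $A V_k = V_k H_k + h_{k+1,k} v_{k+1} e_k^T$; the test matrix is random and purely illustrative.

```python
# Sketch of the Arnoldi process (modified Gram-Schmidt variant).
import numpy as np

def arnoldi(A, r0, k):
    n = A.shape[0]
    V = np.zeros((n, k + 1))
    H = np.zeros((k + 1, k))
    V[:, 0] = r0 / np.linalg.norm(r0)
    for j in range(k):
        w = A @ V[:, j]
        for i in range(j + 1):            # orthogonalize w against v_1, ..., v_{j+1}
            H[i, j] = V[:, i] @ w
            w = w - H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] < 1e-12:           # invariant subspace found: stop early
            return V[:, :j + 1], H[:j + 1, :j]
        V[:, j + 1] = w / H[j + 1, j]
    return V, H

# Quick check: A V_k = V_{k+1} H (the relation above) and V_k has orthonormal columns
rng = np.random.default_rng(8)
A = rng.standard_normal((6, 6))
V, H = arnoldi(A, rng.standard_normal(6), 4)
print(np.allclose(A @ V[:, :4], V @ H), np.allclose(V[:, :4].T @ V[:, :4], np.eye(4)))
```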

4. Connections

One direct and common application of Krylov subspace methods to VFA is the construction of basis functions [11]. Recall that, as (5) shows, we need to solve the linear system
$$(I - \gamma P)\, v = r, \qquad 0 \le \gamma < 1.$$
Since $P$ is a stochastic matrix and $0 \le \gamma < 1$, we have $\sigma(\gamma P) < 1$, so $I - \gamma P \in \mathbb{R}^{n \times n}$ is nonsingular. If we want to approximate $(I - \gamma P)^{-1}$, one way is to use the Neumann series expansion [13],
$$(I - \gamma P)^{-1} = \sum_{i=0}^{\infty}(\gamma P)^i.$$
Then the solution of (5) lies in $\mathcal{K}_m(P, r)$, where $m$ is the degree of the minimal polynomial of $P$, as shown below:
$$v = (I - \gamma P)^{-1} r = \sum_{i=0}^{\infty}(\gamma P)^i r \in \mathrm{span}\{r, Pr, P^2 r, \ldots\} = \mathrm{span}\{r, Pr, P^2 r, \ldots, P^{m-1} r\}.$$
Therefore, we can set our feature matrix to $\Phi = K_m(P, r)$.
In practice, we can start with a random initial guess $\theta_0$ and $K_1(P, r)$, and progressively add $P^i r$ to the basis function matrix.
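A sketch of this construction on a made-up chain: take an orthonormal basis of $\mathcal{K}_K(P, r)$ (computed here with a QR factorization, as a stand-in for the Arnoldi process of Section 3.3) as the feature matrix and solve the TD fixed-point equation; the weighted approximation error typically shrinks as the Krylov basis grows.

```python
# Sketch of Krylov basis construction for VFA on a made-up chain.
import numpy as np

rng = np.random.default_rng(9)
gamma, n = 0.9, 30
P = rng.dirichlet(np.ones(n), size=n)
r = rng.standard_normal(n)
v_true = np.linalg.solve(np.eye(n) - gamma * P, r)

mu = np.ones(n) / n
for _ in range(2000):
    mu = mu @ P                   # stationary distribution used as the weights
Xi = np.diag(mu)
L = np.eye(n) - gamma * P

for K in (1, 2, 4, 8):
    Kry = np.column_stack([np.linalg.matrix_power(P, i) @ r for i in range(K)])
    Phi, _ = np.linalg.qr(Kry)    # orthonormal basis of K_K(P, r)
    theta = np.linalg.solve(Phi.T @ Xi @ L @ Phi, Phi.T @ Xi @ r)
    err = v_true - Phi @ theta
    print(K, float(np.sqrt(err @ Xi @ err)))   # xi-weighted error of the approximation
```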
Parr et al. [10] show that the basis constructed from the Krylov subspace is equivalent to the Bellman Error Basis Functions (BEBFs) and the Model Error Basis Functions (MEBFs), which respectively add the Bellman error vectors and the model error vectors progressively to the current basis.
Senda et al. [15] applied the Conjugate Gradient (CG) algorithm, the Bi-Conjugate Gradient (BiCG) algorithm and the Block Product-type Krylov Subspace Method (BPKSM) to policy evaluation and compared them with existing stationary iterative methods. They concluded that the algorithms based on Krylov subspace methods are tens to hundreds of times more efficient than the conventional algorithms, and they found that the convergence rate of the Krylov subspace algorithms improves as the iteration number increases, whereas that of the other algorithms stays the same. Krylov subspace methods can also be used in other areas of machine learning [19,20,21,22,23,24].

5. Conclusion and Discussion

From the last section we can see that constructing basis functions is only a simple and direct use of Krylov subspace methods. It works best in model-based learning, where we have enough information about the transition function $P$; whether it can be extended to model-free learning has not been well studied. Bertsekas [2] provides a good example of how to connect proximal iterations for solving $x = Ax + b$ with TD($\lambda$) methods and obtains some interesting results. We can draw inspiration from this work and try to use Krylov subspace methods for linear systems to solve approximate DP problems as well.

References

  1. Dimitri P. Bertsekas, Dynamic Programming and Stochastic Control, lecture notes, Fall 2015, MIT.
  2. Dimitri P. Bertsekas, Proximal algorithms and temporal differences for large linear systems: extrapolation, approximation, and simulation, Report LIDS-P-3205, MIT; arXiv preprint arXiv: 1610.05427, 2016.
  3. Justin A. Boyan, Technical update: Least-squares temporal difference learning, Machine Learning, 49(2-3):233-246, 2002.
  4. Steven J. Bradtke and Andrew G. Barto, Linear least-squares algorithms for temporal difference learning, Machine Learning, 22:33-57, 1996.
  5. Gene H. Golub and Charles F. Van Loan, Matrix Computations, 4th Edn., The Johns Hopkins University Press, Baltimore, 2013, Chapter 11.
  6. Michail G. Lagoudakis and Ronald Parr, Least-squares policy iteration, Journal of Machine Learning Research, 4:1107-1149, 2003.
  7. Alessandro Lazaric, Mohammad Ghavamzadeh and Rémi Munos, Finite-sample analysis of LSTD, In Proceedings of the 27th International Conference on Machine Learning, 2010.
  8. Jörg Liesen and Zdeněk Strakoš, Krylov Subspace Methods: Principles and Analysis, Oxford University Press, 2015, Chapters 1 and 2.
  9. Angelia Nedić and Dimitri P. Bertsekas, Least-squares policy evaluation algorithms with linear function approximation, Discrete Event Dynamic Systems: Theory and Applications, 13(1-2):79-110, 2003.
  10. Ronald Parr, Lihong Li, Gavin Taylor, Christopher Painter-Wakefield and Michael L. Littman, An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning, Proceedings of the 25th international conference on Machine learning, pp.752-759, Helsinki, Finland, July 05-09, 2008.
  11. Marek Petrik, An analysis of Laplacian methods for value function approximation in MDPs, Proceedings of the 20th international joint conference on Artifical intelligence, pp.2574-2579, Hyderabad, India, January 06-12, 2007.
  12. Pascal Poupart and Craig Boutilier, VDCBPI: an approximate scalable algorithm for large scale POMDPs, Advances in Neural Information Processing Systems 17 (NIPS-2004), pp.1081-1088, MIT Press, 2005.
  13. Martin L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, John Wiley & Sons, Inc., New York, NY, 1994, Chapter 6.
  14. Bruno Scherrer, Should one compute the temporal difference fixed point or minimize the Bellman residual: the unified oblique projection view, Proceedings of International Conference on Machine Learning, Haifa, Israel, pp.959-966, 2010.
  15. K. Senda, S. Hattori, T. Hishinuma and T. Kohda, Acceleration of reinforcement learning by policy evaluation using nonstationary iterative method, IEEE Trans. Cybern., 44(12):2696-2705, Dec. 2014.
  16. Csaba Szepesvári, Least squares temporal difference learning and Galerkin’s method, Mini-Workshop: Mathematics of Machine Learning, Oberwolfach Reports, European Mathematical Society, Oberwolfach, pp. 2385-2388, 2011.
  17. John N. Tsitsiklis and Benjamin Van Roy, An analysis of temporal-difference learning with function approximation, IEEE Trans. Automat. Contr., 42(5):674-690, 1997.
  18. Huizhen Yu and Dimitri P. Bertsekas, Error bounds for approximations from projected linear equations, Mathematics of Operations Research, 35(2):306-329, 2010.
  19. Sitao Luan, Mingde Zhao, Xiao-Wen Chang, Doina Precup (2019). Break the ceiling: Stronger multi-scale deep graph convolutional networks. Advances in neural information processing systems, 32.
  20. Sitao Luan, Xiao-Wen Chang, Doina Precup (2019). Revisit Policy Optimization in Matrix Form. arXiv preprint arXiv:1909.09186.
  21. Sitao Luan, Mingde Zhao, Chenqing Hua, Xiao-Wen Chang, Doina Precup (2020). Complete the missing half: Augmenting aggregation filtering with diversification for graph convolutional networks. arXiv preprint arXiv:2008.08844.
  22. Sitao Luan, Chenqing Hua, Qincheng Lu, Jiaqi Zhu, Mingde Zhao, Shuyuan Zhang, Xiao-Wen Chang, Doina Precup (2021). Is Heterophily A Real Nightmare For Graph Neural Networks To Do Node Classification?. arXiv preprint arXiv:2109.05641.
  23. Sitao Luan, Chenqing Hua, Qincheng Lu, Jiaqi Zhu, Mingde Zhao, Shuyuan Zhang, Xiao-Wen Chang, Doina Precup (2022). Revisiting heterophily for graph neural networks. arXiv preprint arXiv:2210.07606.
  24. Sitao Luan, Chenqing Hua, Minkai Xu, Qincheng Lu, Jiaqi Zhu, Xiao-Wen Chang, Jie Fu, Jure Leskovec, Doina Precup (2023). When Do Graph Neural Networks Help with Node Classification: Investigating the Homophily Principle on Node Distinguishability. arXiv preprint arXiv:2304.14274.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.