1. Introduction
Dynamic programming (DP) is both a mathematical optimization method and a computer programming method, developed by Richard Bellman in the 1950s. The main idea is to break a complicated problem down into simpler sub-problems recursively. If there is a recursive relation between the original problem and its sub-problems, then DP methods are applicable. In the optimization literature this relationship is called the Bellman equation.
In machine learning, one of the problems we are interested in is sequential decision making: how a software agent ought to take actions in a stochastic environment so as to maximize its cumulative reward. We formulate the interaction between the agent and the environment as a Markov Decision Process (MDP), whose structure can be modeled by the Bellman equation. But when the state space of the environment is too large, or an exact model of the dynamics is absent, the MDP can only be solved approximately with simulation and function approximation, and this is what we call approximate dynamic programming (also known as neuro-dynamic programming or reinforcement learning). A large body of methods has been developed, e.g. temporal difference learning (TD) [4], least-squares temporal difference learning (LSTD) [3,4], least-squares policy evaluation (LSPE) [9] and least-squares policy iteration (LSPI) [6].
Computing the value function of a given policy (policy evaluation) via the Bellman equation plays an important role in approximate DP, and it is equivalent to solving a linear system [13]. In matrix computations, Krylov subspace methods [5,8] are among the most powerful tools for solving large-scale linear systems of equations and least-squares problems. So there is a connection between these two areas, but this connection has not been explored in depth.
We use the term approximate DP rather than reinforcement learning in this report to emphasize its strong relationship with DP, because DP provides theoretical foundations and important guidelines for designing approximate DP algorithms.
This report reviews some fundamental concepts and ideas of approximate DP and Krylov subspace methods and the connections between them.
2. Approximate dynamic programming methods
2.1. Problem description
A Markov decision process (MDP) is a framework for modeling the learning process in which an agent learns from interaction with an environment. Given the current state $s_t$ at time $t$, the agent interacts with the environment by selecting an action $a_t$ according to the policy $\pi$, and the environment takes the agent to another state $s_{t+1}$ and then provides a numerical feedback (immediate reward) $r_{t+1}$ according to the dynamics (transition probability) $p(s_{t+1} \mid s_t, a_t)$. The policy $\pi(a \mid s)$ is a probability distribution from which we select the action. Our goal is to find an optimal policy (often deterministic) that maximizes the long-term cumulative reward. We assume that the policy is stationary. We only consider the infinite-horizon discounted problem, because the ideas can also be applied to the simpler absorbing Markov chains [17]. So the objective function is the discounted return
$$G_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}, \qquad 0 \le \gamma < 1. \qquad (1)$$
It is easy to see the recursive structure
$$G_t = r_{t+1} + \gamma\, G_{t+1}.$$
The value function of the given policy $\pi$ at state $s$ is defined as
$$v^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\, G_t \mid s_t = s \,\right], \qquad (2)$$
which can be expanded as
$$v^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\left[\, r(s, a, s') + \gamma\, v^{\pi}(s') \,\right]. \qquad (3)$$
Equation (3) is the Bellman equation for the MDP; it depicts the recursive relationship between the value of $s$ and the values of its possible successor states $s'$, and is consistent with (1). It can be written in a more compact form. For the fixed policy $\pi$, let $p^{\pi}(s' \mid s) = \sum_{a} \pi(a \mid s)\, p(s' \mid s, a)$ and let $r^{\pi}(s)$ be the expected immediate reward from state $s$; then
$$v^{\pi}(s) = r^{\pi}(s) + \gamma \sum_{s'} p^{\pi}(s' \mid s)\, v^{\pi}(s'). \qquad (4)$$
Equation (4) can be written in the matrix form
$$v^{\pi} = r^{\pi} + \gamma P^{\pi} v^{\pi}, \qquad (5)$$
where $v^{\pi}, r^{\pi} \in \mathbb{R}^{n}$, $P^{\pi} \in \mathbb{R}^{n \times n}$ is the transition matrix with $[P^{\pi}]_{ss'} = p^{\pi}(s' \mid s)$, and $n$ is the number of states. Here $n$ can be infinite, but for simplicity we assume it is finite. Since we only consider a fixed policy, we can drop the superscript and write (5) as $v = r + \gamma P v$, or $(I - \gamma P)\, v = r$.
We define a shorthand notation for the DP mapping:
$$T : \mathcal{V} \to \mathcal{V}, \qquad T v = r + \gamma P v, \qquad (6)$$
where $\mathcal{V}$ is the value function space and $T$ is called the Bellman operator. It can be proved that $T$ is a contraction mapping [1] and that its fixed point $v^{\pi}$ is the solution of (5). Finding the fixed point is exactly what we do in the policy evaluation step of the policy iteration method [13].
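To make this concrete, here is a minimal sketch (not from the report) that evaluates a fixed policy exactly by solving the linear form of (5); the three-state transition matrix, reward vector and discount factor are made-up illustrative values.

```python
import numpy as np

# Illustrative 3-state MDP under a fixed policy (made-up values).
P = np.array([[0.5, 0.5, 0.0],      # transition matrix P under the policy
              [0.1, 0.6, 0.3],
              [0.0, 0.2, 0.8]])
r = np.array([1.0, 0.0, 2.0])        # expected one-step reward per state
gamma = 0.9                           # discount factor

# Solve (I - gamma * P) v = r, the linear form of the Bellman equation (5).
v = np.linalg.solve(np.eye(3) - gamma * P, r)

# The fixed point of the Bellman operator T v = r + gamma * P v is the same v.
assert np.allclose(r + gamma * P @ v, v)
print(v)
```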
Solving (5) with a direct linear solver is not feasible when the state space is very large or when the transition matrix is unavailable. But we can find approximate solutions by value function approximation (VFA). Linear approximation is a popular VFA method [11] and has the form
$$\hat{v}(s) = \sum_{j=1}^{K} w_j\, \phi_j(s), \qquad \text{or in matrix form} \qquad \hat{v} = \Phi w, \qquad (7)$$
where $\phi_j$ is the $j$-th feature function or basis function over the state space, $w \in \mathbb{R}^{K}$ is the weight vector, $\Phi \in \mathbb{R}^{n \times K}$ is the feature matrix and $K$ is the number of features, see [14]. For each state $i$, $\hat{v}(i) = \phi(i)^{\top} w$, where $\phi(i)^{\top}$ is the $i$-th row of $\Phi$ and represents the feature vector for state $i$. Linear function approximation has better convergence properties [17] than nonlinear function approximation and is easier to analyze, interpret and implement. We can introduce nonlinearity when doing feature extraction, e.g. with deep neural networks. Suppose we already have well-chosen features; then, with the linear approximation, (5) becomes
$$\Phi w \approx r + \gamma P \Phi w. \qquad (8)$$
We define a $\xi$-weighted quadratic norm on $\mathbb{R}^{n}$,
$$\|v\|_{\xi}^{2} = \sum_{i=1}^{n} \xi_i\, v(i)^{2} = v^{\top} \Xi\, v,$$
where $\xi = (\xi_1, \ldots, \xi_n)^{\top}$ with $\xi_i > 0$ and $\sum_i \xi_i = 1$ is the weight vector, and $\Xi$ is the diagonal matrix with $\xi$ on the diagonal. Then we can define the operator $\Pi$ that projects any value function to its closest representable function in $\operatorname{range}(\Phi)$ (i.e. the column space of $\Phi$) with respect to $\|\cdot\|_{\xi}$. That means $\Pi$ is an orthogonal projector onto $\operatorname{range}(\Phi)$ with respect to $\|\cdot\|_{\xi}$, i.e.
$$\Pi v = \arg\min_{\hat{v} \in \operatorname{range}(\Phi)} \|\hat{v} - v\|_{\xi},$$
which is equivalent to
$$\Pi = \Phi\left(\Phi^{\top} \Xi\, \Phi\right)^{-1} \Phi^{\top} \Xi. \qquad (9)$$
We assume that the Markov chain we are studying is ergodic; then it has a stationary distribution
$$\xi^{\top} P = \xi^{\top}, \qquad \sum_{i=1}^{n} \xi_i = 1, \quad \xi_i > 0. \qquad (10)$$
We usually use this stationary distribution as our weight vector $\xi$; otherwise we may cause a norm mismatch problem, which can make the iterative methods described in the following sections diverge [1,14].
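As a small illustration of the projector, the following sketch forms $\Pi$ as in (9) from an arbitrary feature matrix and weight vector (illustrative values, not from the report) and checks its defining properties.

```python
import numpy as np

n, K = 6, 2
rng = np.random.default_rng(0)
Phi = rng.normal(size=(n, K))             # feature matrix, one row per state
xi = rng.random(n); xi /= xi.sum()        # state weights (e.g. a stationary distribution)
Xi = np.diag(xi)

# Pi = Phi (Phi^T Xi Phi)^{-1} Phi^T Xi, the xi-weighted projector of (9).
Pi = Phi @ np.linalg.solve(Phi.T @ Xi @ Phi, Phi.T @ Xi)

# Pi is idempotent and leaves range(Phi) unchanged ...
assert np.allclose(Pi @ Pi, Pi)
assert np.allclose(Pi @ Phi, Phi)
# ... and the residual of any projected vector is Xi-orthogonal to range(Phi).
v = rng.normal(size=n)
assert np.allclose(Phi.T @ Xi @ (v - Pi @ v), 0)
```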
Bertsekas [1] gives a short summary of the main approximate DP algorithms:
- Direct policy evaluation: find $w$ that minimizes $\|\Phi w - v^{\pi}\|_{\xi}^{2}$. We compute an approximation of the projection $\Pi v^{\pi}$ by Monte Carlo simulation. We can introduce nonlinear architectures in $\hat{v}$, e.g. deep neural networks.
- Indirect policy evaluation:
  - Solve the projected equation (Galerkin approximation [16,18])
$$\Phi w = \Pi\, T(\Phi w) \qquad (11)$$
  by simulation. Equation (11) is equivalent to minimizing the expected temporal difference error (TDE) $\|\Pi T(\Phi w) - \Phi w\|_{\xi}$ down to 0 over $w$ [14].
    - TD learning: a stochastic iterative algorithm for solving (11).
    - Least-squares temporal difference learning (LSTD): the TD fixed point method [14].
    - Least-squares policy evaluation (LSPE): a simulation-based form of projected value iteration, essentially projected value iteration with simulation noise.
  - Bellman equation error (Bellman residual [14]) method: minimize $\|T(\Phi w) - \Phi w\|_{\xi}^{2}$.
2.2. Direct policy evaluation
We are interested in finding $\hat{v} = \Phi w$ with $\hat{v} \approx v$. Suppose we can sample the exact values of $v$ or give an accurate estimate of $v$; then we can learn the parameters in a supervised learning paradigm. The samples or estimates of $v$ can be considered as the true labels, and $\Phi w$ is the approximator. Then it becomes a standard weighted least-squares problem, whose solution is the projection of $v$ onto $\operatorname{range}(\Phi)$ with respect to $\|\cdot\|_{\xi}$ [1]. Then
$$w^{*} = \arg\min_{w} \|\Phi w - v\|_{\xi}^{2} = \arg\min_{w} \sum_{i=1}^{n} \xi_i \left(\phi(i)^{\top} w - v(i)\right)^{2}.$$
Setting the gradient of this objective to 0, we have
$$\Phi^{\top} \Xi\, \Phi\, w^{*} = \Phi^{\top} \Xi\, v, \qquad \text{i.e.} \qquad w^{*} = \left(\Phi^{\top} \Xi\, \Phi\right)^{-1} \Phi^{\top} \Xi\, v.$$
Suppose we have samples $(s_1, v_1), \ldots, (s_m, v_m)$ generated by the environment or a simulator of the environment according to $\xi$, where $v_k$ is an (unbiased) sample of $v(s_k)$; then we can get the simulation-based approximation
$$\hat{w} = \left(\sum_{k=1}^{m} \phi(s_k)\, \phi(s_k)^{\top}\right)^{-1} \sum_{k=1}^{m} \phi(s_k)\, v_k.$$
Note that $\sum_{k} \phi(s_k)\, \phi(s_k)^{\top}$ is a $K \times K$ matrix. Thus we are using low-dimensional linear algebra (of order $K$, the number of basis functions) compared with (5).
We can obtain an unbiased estimate of the true value function $v$ by Monte Carlo sampling, i.e. perform a long-term lookahead through the Markov chain and collect all the future rewards. However, the Monte Carlo estimate of $v$ requires a large number of samples, as it suffers from high variance.
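A minimal sketch of this direct (Monte Carlo) approach is shown below; the trajectory format and the names `direct_policy_evaluation` and `phi` are illustrative assumptions rather than notation from the report.

```python
import numpy as np

def direct_policy_evaluation(episodes, phi, gamma, K):
    """Sketch of direct (Monte Carlo) policy evaluation with linear features.

    episodes: list of trajectories, each a list of (state, reward) pairs, where
    the reward is the one received for the transition out of that state;
    phi(s) returns the K-dimensional feature vector of state s.
    """
    A = np.zeros((K, K))
    b = np.zeros(K)
    for traj in episodes:
        # Monte Carlo return G_t for every visited state (backward pass).
        G = 0.0
        targets = []
        for (s, rwd) in reversed(traj):
            G = rwd + gamma * G
            targets.append((s, G))
        for s, G in targets:
            f = phi(s)
            A += np.outer(f, f)     # ~ Phi^T Xi Phi, weighted by visit frequency
            b += f * G              # ~ Phi^T Xi v
    return np.linalg.solve(A, b)    # weighted least-squares weights w
```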
2.3. Indirect policy evaluation
2.3.1. Stochastic algorithms: a general view
Consider the approximate solution $x$ of a linear fixed point equation $x = b + A x$ by using $m$ simulation samples $b_1, \ldots, b_m$ and $A_1, \ldots, A_m$, where $b_k = b + \varepsilon_k$ and $A_k = A + E_k$, and $\varepsilon_k$, $E_k$ are simulation noise. Think of $x = b + Ax$ as approximate policy evaluation, which has the same form as (5) [1]:
- Stochastic approximation (SA) approach: starting from an initial point $x_0$, for $k = 0, 1, \ldots$ iterate
$$x_{k+1} = (1 - \eta_k)\, x_k + \eta_k \left( b_k + A_k x_k \right), \qquad (12)$$
where $\eta_k \in (0, 1]$ are the weights (step sizes).
- Monte Carlo estimation (MCE) approach: form Monte Carlo estimates of $b$ and $A$,
$$\bar{b}_m = \frac{1}{m} \sum_{k=1}^{m} b_k, \qquad \bar{A}_m = \frac{1}{m} \sum_{k=1}^{m} A_k, \qquad (13)$$
and then solve $x = \bar{b}_m + \bar{A}_m x$ by a direct method or an iterative method. The estimates $\bar{b}_m$ and $\bar{A}_m$ can be built incrementally.
TD is an SA method; LSTD and LSPE are MCE methods. We introduce them in Sections 2.3.2–2.3.4.
2.3.2. Temporal difference (TD) learning
In 1988, Sutton developed temporal difference learning as a class of incremental learning methods for general prediction problems. It assigns credit through the difference between the predictions at the current state and at its successor, which distinguishes it from conventional supervised learning methods. It has the form [4]
$$V(s_t) \leftarrow V(s_t) + \alpha \left[\, r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t) \,\right], \qquad (14)$$
where $\alpha$ is a constant step-size parameter. This form is equivalent to (12). TD learning is a combination of Monte Carlo ideas and DP ideas. It can learn directly from experience $(s_t, r_{t+1}, s_{t+1})$ sampled from the environment, which is similar to Monte Carlo methods. It also updates the estimates partially based on other learned estimates, without collecting all future rewards (bootstrapping), which is similar to DP.
When we use linear function approximation $\hat{v}(s) = \phi(s)^{\top} w$, the parameter update can be written as [4]
$$w_{t+1} = w_t + \alpha \left(\, r_{t+1} + \gamma\, \phi(s_{t+1})^{\top} w_t - \phi(s_t)^{\top} w_t \,\right) \phi(s_t). \qquad (15)$$
Equation (15) is like solving $\min_{w} \left( v(s_t) - \phi(s_t)^{\top} w \right)^{2}$: we apply gradient descent and substitute the unknown target $v(s_t)$ by its bootstrapped approximation $r_{t+1} + \gamma\, \phi(s_{t+1})^{\top} w_t$. We call these semi-gradient methods because we only take the derivative of the prediction $\phi(s_t)^{\top} w$ but not of the target. TD is proved to converge with probability one [17].
TD requires less memory and less peak computation, and produces more accurate predictions, than conventional supervised learning methods.
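The following sketch implements the semi-gradient TD(0) update (15) with linear features; the sample format and function names are illustrative assumptions.

```python
import numpy as np

def td0_linear(transitions, phi, K, gamma=0.9, alpha=0.05):
    """Sketch of semi-gradient TD(0) with linear features, cf. Equation (15).

    transitions: an iterable of (s, r, s_next) samples from the fixed policy;
    phi(s) returns the K-dimensional feature vector of state s.
    """
    w = np.zeros(K)
    for s, rwd, s_next in transitions:
        f, f_next = phi(s), phi(s_next)
        delta = rwd + gamma * f_next @ w - f @ w   # temporal difference error
        w += alpha * delta * f                      # semi-gradient update
    return w
```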
2.3.3. Least-squares temporal difference (LSTD) learning
As $\Pi T$ is a contraction mapping on $\operatorname{range}(\Phi)$ with respect to $\|\cdot\|_{\xi}$ [1,14], there exists a unique fixed point $\Phi w^{*}$ that satisfies (11). That means $\Phi w^{*}$ is the unique solution of
$$\Phi w = \Pi \left( r + \gamma P \Phi w \right), \qquad (16)$$
i.e. $\Phi w^{*}$ is the unique point in $\operatorname{range}(\Phi)$ with the property that projecting its Bellman image $T(\Phi w^{*})$ with $\Pi$ gives the point itself. According to the definition of $\Pi$, $w^{*}$ should satisfy the orthogonality condition
$$\Phi^{\top} \Xi \left( T(\Phi w^{*}) - \Phi w^{*} \right) = \Phi^{\top} \Xi \left( r + \gamma P \Phi w^{*} - \Phi w^{*} \right) = 0. \qquad (17)$$
The orthogonality condition (17) is equivalent to the Galerkin method [16,18]. Equation (17) can be written as a linear system of equations $C w^{*} = d$, where
$$C = \Phi^{\top} \Xi\, (I - \gamma P)\, \Phi, \qquad d = \Phi^{\top} \Xi\, r. \qquad (18)$$
So we have $w^{*} = C^{-1} d$. The simulation mechanism for $C$ and $d$ is
$$C_m = \frac{1}{m} \sum_{k=1}^{m} \phi(s_k) \left( \phi(s_k) - \gamma\, \phi(s_{k+1}) \right)^{\top}, \qquad (19)$$
$$d_m = \frac{1}{m} \sum_{k=1}^{m} \phi(s_k)\, r_{k+1}, \qquad (20)$$
where each term can be computed with low-dimensional ($K$-dimensional) calculations. Then $w^{*}$ can be approximated by $\hat{w} = C_m^{-1} d_m$.
LSTD makes more efficient use of data because it extracts more information from each additional observation, and thus would be expected to converge with fewer training samples. In addition, it does not require the user to manually tune a step-size schedule for good performance. LSTD converges to the same point as TD [4].
TD's convergence can be slowed dramatically by a poor choice of the step-size parameters $\alpha_t$. Furthermore, TD's performance is sensitive to the initial estimate for $w$. LSTD does not rely on an arbitrary initial estimate. TD is also sensitive to the ranges of the individual features; LSTD is not.
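A corresponding LSTD sketch is shown below: it accumulates the estimates (19) and (20) from sampled transitions and solves $C_m w = d_m$. The sample format and function names are illustrative, and the small ridge term `reg` is an added stabilizer rather than part of the method described above.

```python
import numpy as np

def lstd(transitions, phi, K, gamma=0.9, reg=1e-6):
    """Sketch of LSTD(0): build the simulation estimates (19)-(20), solve C w = d."""
    C = np.zeros((K, K))
    d = np.zeros(K)
    m = 0
    for s, rwd, s_next in transitions:
        f, f_next = phi(s), phi(s_next)
        C += np.outer(f, f - gamma * f_next)   # accumulates phi (phi - gamma phi')^T
        d += f * rwd                            # accumulates phi * r
        m += 1
    C /= m
    d /= m
    return np.linalg.solve(C + reg * np.eye(K), d)
```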
2.3.4. Least-squares policy evaluation
An alternative way to solve the projected equation is projected value iteration (PVI), that is
$$\Phi w_{k+1} = \Pi\, T(\Phi w_k). \qquad (21)$$
It converges to $w^{*}$ because $\Pi T$ is a contraction mapping with respect to $\|\cdot\|_{\xi}$. Rather than directly solving the projected equation, it uses an iterative way to approach the fixed point starting from an arbitrary initial point. Each step is equivalent to the minimization problem
$$w_{k+1} = \arg\min_{w} \left\| \Phi w - \left( r + \gamma P \Phi w_k \right) \right\|_{\xi}^{2}. \qquad (22)$$
Setting the gradient with respect to $w$ to 0, we have
$$\Phi^{\top} \Xi \left( \Phi w_{k+1} - r - \gamma P \Phi w_k \right) = 0. \qquad (23)$$
Then we have
$$w_{k+1} = w_k - B^{-1} \left( C w_k - d \right), \qquad B = \Phi^{\top} \Xi\, \Phi, \qquad (24)$$
where $C$ and $d$ are defined as in (18) and can be estimated by (19) and (20). $B$ can be estimated by
$$B_m = \frac{1}{m} \sum_{k=1}^{m} \phi(s_k)\, \phi(s_k)^{\top}, \qquad (25)$$
and $B_m$, $C_m$, $d_m$ can be formed incrementally or in temporal-difference style [1].
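Below is a sketch of LSPE built on the same sample format as the LSTD sketch above: it estimates $B$, $C$ and $d$ and then iterates (24). The names and the batch (non-incremental) form are illustrative simplifications.

```python
import numpy as np

def lspe(transitions, phi, K, gamma=0.9, n_iters=50):
    """Sketch of LSPE: estimate B, C, d from samples, then iterate (24)."""
    B = np.zeros((K, K)); C = np.zeros((K, K)); d = np.zeros(K); m = 0
    for s, rwd, s_next in transitions:
        f, f_next = phi(s), phi(s_next)
        B += np.outer(f, f)
        C += np.outer(f, f - gamma * f_next)
        d += f * rwd
        m += 1
    B, C, d = B / m, C / m, d / m

    w = np.zeros(K)
    for _ in range(n_iters):
        w = w - np.linalg.solve(B, C @ w - d)   # projected value iteration step
    return w
```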
2.3.5. Bellman equation error methods
The main idea of the Bellman equation error (BE) method is to find a $w$ so that $\|T(\Phi w) - \Phi w\|_{\xi}^{2}$ is minimized. Recall that $T(\Phi w) = r + \gamma P \Phi w$. With the notation $\Psi = (I - \gamma P)\, \Phi$, we have $T(\Phi w) - \Phi w = r - \Psi w$. Then it becomes a standard least-squares problem with respect to $\min_{w} \|r - \Psi w\|_{\xi}^{2}$. It is easy to see that the solution of this weighted least-squares problem is
$$w_{\mathrm{BE}} = \left( \Psi^{\top} \Xi\, \Psi \right)^{-1} \Psi^{\top} \Xi\, r. \qquad (26)$$
One reason we are interested in the BE method is the well-known result $\|v - \hat{v}\|_{\infty} \le \frac{1}{1 - \gamma}\, \|T\hat{v} - \hat{v}\|_{\infty}$, and for the weighted norm we have a bound of the form $\|v - \hat{v}\|_{\xi} \le \frac{\sqrt{C_{\xi}}}{1 - \gamma}\, \|T\hat{v} - \hat{v}\|_{\xi}$, where $C_{\xi}$ is a "concentration coefficient" which can be seen as a measure of the stochasticity of the MDP [14]. This means that when the BE is small, we are close to the true value function $v$. For TD, there is no similar result. Actually, it is proved [14] that the BE is an upper bound of the TDE. More precisely, since $T\hat{v} - \hat{v} = \left( T\hat{v} - \Pi T\hat{v} \right) + \left( \Pi T\hat{v} - \hat{v} \right)$ and the two terms are $\xi$-orthogonal, the Pythagorean theorem gives
$$\|T\hat{v} - \hat{v}\|_{\xi}^{2} = \|T\hat{v} - \Pi T\hat{v}\|_{\xi}^{2} + \|\Pi T\hat{v} - \hat{v}\|_{\xi}^{2}, \qquad (27)$$
which means that minimizing the BE minimizes not only the TDE but also $\|T\hat{v} - \Pi T\hat{v}\|_{\xi}$, the part that the TD method misses. We can also see that the TDE equals $\|\Pi T\hat{v} - \hat{v}\|_{\xi}$, and this is why people also call it the projected Bellman error (PBE).
The BE has the additional component, which has to do with the features. Hence, if the feature space is fixed, the term we really want to minimize is the TDE. Only when the features can be modified do we start to look at the BE.
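A model-based sketch of the BE minimizer (26) is given below; it assumes the transition matrix and reward vector are available (the model-free case requires double sampling, which is not shown).

```python
import numpy as np

def bellman_error_weights(P, r, Phi, xi, gamma=0.9):
    """Sketch of the Bellman-error (residual) minimizer (26): a xi-weighted
    least-squares solve with Psi = (I - gamma P) Phi, assuming P and r are known."""
    n = P.shape[0]
    Xi = np.diag(xi)
    Psi = (np.eye(n) - gamma * P) @ Phi
    return np.linalg.solve(Psi.T @ Xi @ Psi, Psi.T @ Xi @ r)
```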
2.4. Least-squares policy iteration (LSPI)
Motivated by LSTD, Lagoudakis and Parr [6] extended it to control problems. Instead of evaluating the state value function $v$, they evaluate state-action pairs $(s, a)$, i.e. the $Q$ function. The resulting method is referred to as LSTDQ. The $Q$ function is defined as
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\, \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \,\middle|\, s_t = s,\ a_t = a \,\right]. \qquad (28)$$
We can directly choose actions and improve our policy from the $Q$ function without any knowledge of the underlying model of the environment. Everything about LSTDQ looks similar to LSTD after we change $\Phi$ to be the feature matrix over all state-action pairs. Then we put LSTDQ into the general policy iteration (GPI) framework to update our policy and obtain the LSPI algorithm.
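A sketch of the LSTDQ evaluation step is shown below; the state-action feature map `phi_sa`, the `policy` callable and the ridge term are illustrative assumptions, and the policy-improvement loop of LSPI is omitted.

```python
import numpy as np

def lstdq(samples, phi_sa, policy, K, gamma=0.9, reg=1e-6):
    """Sketch of LSTDQ: LSTD over state-action features phi_sa(s, a).

    samples: (s, a, r, s_next) tuples; policy(s_next) returns the action the
    current policy would take in s_next.
    """
    C = np.zeros((K, K)); d = np.zeros(K)
    for s, a, rwd, s_next in samples:
        f = phi_sa(s, a)
        f_next = phi_sa(s_next, policy(s_next))
        C += np.outer(f, f - gamma * f_next)
        d += f * rwd
    w = np.linalg.solve(C + reg * np.eye(K), d)
    return w   # greedy improvement step would pick argmax_a phi_sa(s, a) @ w
```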
2.5. A unified oblique projection view
In LSTD, we attempt to find the fixed point of the composite operator $\Pi T$, where $\Pi$ is an orthogonal projection onto $\operatorname{range}(\Phi)$ with respect to $\|\cdot\|_{\xi}$. Now we consider using a projector that is not necessarily orthogonal: a general operator $\Pi_{X}$ that satisfies $\Pi_{X}^{2} = \Pi_{X}$ and whose range is $\operatorname{range}(\Phi)$. In the most general case, such operators are called oblique projections and can be written as $\Pi_{X} = \Phi \left( X^{\top} \Phi \right)^{-1} X^{\top}$, with $X \in \mathbb{R}^{n \times K}$ such that $X^{\top} \Phi$ is nonsingular. The parameter $X$ specifies the projection direction: precisely, $\Pi_{X}$ is the projection onto $\operatorname{range}(\Phi)$ orthogonally to $\operatorname{range}(X)$, or we could write it as $\Pi_{\Phi \perp X}$ (the $\Pi$ defined in (9) can be written as $\Pi_{\Xi\Phi}$). The vector $\left( X^{\top} \Phi \right)^{-1} X^{\top} v$ is the coordinate vector of the projection of $v$ in the basis $\Phi$.
Scherrer [14] provided a unified oblique projection view of the TD fixed point and BE minimization methods. Recall that $T(\Phi w) = r + \gamma P \Phi w$. If we write
$$X_{\mathrm{TD}} = \Xi\, \Phi \qquad \text{and} \qquad X_{\mathrm{BE}} = \Xi\, (I - \gamma P)\, \Phi,$$
we have: (1) the TD fixed point and the BE minimizer are solutions of the projected equation
$$X^{\top} \left( T(\Phi w) - \Phi w \right) = 0 \qquad (29)$$
(with $X = X_{\mathrm{TD}}$ and $X = X_{\mathrm{BE}}$, respectively); (2) when it exists, the solution of this projected equation is the projection of $v$ onto $\operatorname{range}(\Phi)$ orthogonally to $\operatorname{range}\!\left( (I - \gamma P)^{\top} X \right)$, i.e. formally
$$\Phi w = \Pi_{(I - \gamma P)^{\top} X}\, v.$$
From this perspective, we put the two methods under the same framework of a projected fixed point equation.
From the results above, one can bound the approximation error of the oblique projection: $\|v - \Pi_{(I - \gamma P)^{\top} X}\, v\|_{\xi}$ is bounded by the error of the orthogonal projection, $\|v - \Pi v\|_{\xi}$, times a factor involving the spectral radius of certain $K \times K$ matrices determined by $\Phi$, $\Xi$, $P$ and $X$; these matrices can be estimated by simulation [14]. This inequality can also give us some inspiration on how to choose $X$ and extract features $\Phi$: e.g. if we choose $X$ such that $(I - \gamma P)^{\top} X = \Xi \Phi$, we obtain $\Phi w = \Pi v$, which is the orthogonal projection of the true $v$ onto $\operatorname{range}(\Phi)$ with respect to $\|\cdot\|_{\xi}$.
Finally, Scherrer gives a comparison of the TD and BE methods: BE is more robust than TD because it comes with a performance guarantee. Although TD often outperforms BE, it can sometimes fail dramatically, so TD is riskier. In practice, TD is easier to implement by sampling and has lower variance; BE needs more samples to eliminate the effect of noise.
3. Krylov subspace methods
The earliest Krylov subspace methods were developed by Lanczos, Hestenes and Stiefel in the late 1940s and early 1950s for solving large sparse linear systems of equations $Ax = b$ (with nonsingular $A \in \mathbb{R}^{n \times n}$). Krylov subspace methods form a class of iterative methods. An iterative method solves the linear system by producing a sequence of iterates $x_1, x_2, \ldots$ starting from an initial point $x_0$, which is expected to converge to the true solution. Iterative methods have advantages over direct methods (e.g. matrix factorization methods) when $A$ is a large sparse matrix whose structure is hard to exploit. Moreover, we can stop an iterative method early if we just need an approximate solution, which we cannot do with a direct method. Krylov subspace methods are different from classical iterative methods (e.g. matrix splitting methods) because they have memory, i.e. they construct $x_k$ not only from $x_{k-1}$ but also from the information of the previous iterates, so they are faster. Furthermore, Krylov subspace methods require only the ability to compute $Ax$ for any $x$ (some also require the computation of $A^{\top} x$).
3.1. Projection Process
First we introduce a general projection process to find the solution of $Ax = b$. Let $x_0$ denote an initial approximation to the solution $x$, and let $r_0 = b - A x_0$ be the initial residual. At step $k$, we impose the conditions
$$x_k \in x_0 + \mathcal{S}_k, \qquad r_k = b - A x_k \perp \mathcal{C}_k, \qquad (30)$$
where $\mathcal{S}_k$ is a $k$-dimensional subspace, referred to as the search subspace, and $\mathcal{C}_k$ is also a $k$-dimensional subspace, referred to as the constraint subspace. Since $\mathcal{S}_k$ has dimension $k$, we have $k$ degrees of freedom to construct $x_k$. In order to determine $x_k$, we need $\mathcal{C}_k$ to give us $k$ constraints, which are imposed on the residual $r_k$ in (30).
Let $S_k \in \mathbb{R}^{n \times k}$ and $C_k \in \mathbb{R}^{n \times k}$ be two full-column-rank matrices such that
$$\operatorname{range}(S_k) = \mathcal{S}_k, \qquad \operatorname{range}(C_k) = \mathcal{C}_k. \qquad (31)$$
Then from (31) there exists $t_k \in \mathbb{R}^{k}$ such that
$$x_k = x_0 + S_k\, t_k, \qquad (32)$$
and the constraint in (30) becomes
$$C_k^{\top} \left( b - A x_k \right) = 0, \qquad (33)$$
which can be rewritten as
$$C_k^{\top} A S_k\, t_k = C_k^{\top} r_0. \qquad (34)$$
Theorem 1 (see Theorem 2.1.2 in [8]). The matrix $C_k^{\top} A S_k$ in Equation (34) is nonsingular if and only if
$$\mathbb{R}^{n} = A \mathcal{S}_k \oplus \mathcal{C}_k^{\perp}. \qquad (35)$$
Now we assume that (35) holds. The linear system (34) can be regarded as a projected version of the original linear system $Ax = b$. As the matrix $C_k^{\top} A S_k$ has dimensions $k \times k$, (34) can also be regarded as a reduced linear system.
The matrix $P_k = A S_k \left( C_k^{\top} A S_k \right)^{-1} C_k^{\top}$ is an oblique projector because $P_k^{2} = P_k$, and $P_k r_0$ is the projection of $r_0$ onto $A \mathcal{S}_k$ along $\mathcal{C}_k^{\perp}$. It is easy to verify that, under (35),
$$r_k = b - A x_k = \left( I - P_k \right) r_0. \qquad (36)$$
Theorem 2 (see Lemma 2.1.7 in [8]). Suppose that (35) holds. If $r_0 \in A \mathcal{S}_k$, then $x_k$ is the solution of $Ax = b$.
Theorem 2 gives a sufficient condition for $x_k$ to be the solution of $Ax = b$ at step $k$: in order to make $x_k = x$, we need the search space to be rich enough that $r_0 \in A \mathcal{S}_k$ [8]. It therefore seems like a good idea to start the projection process with the search space $\mathcal{S}_1 = \operatorname{span}\{r_0\}$ and to build up a nested sequence of search spaces
$$\mathcal{S}_1 \subset \mathcal{S}_2 \subset \cdots \subset \mathcal{S}_k \subset \cdots, \qquad (37)$$
and these spaces eventually satisfy
$$r_0 \in A \mathcal{S}_k \qquad (38)$$
for some $k$. Then we can obtain the solution in a finite number of steps, and we hope that $k \ll n$.
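The following sketch performs one generic projection step, i.e. it solves the reduced system (34) for given (illustrative) search- and constraint-space basis matrices.

```python
import numpy as np

def projection_step(A, b, x0, S, C):
    """Sketch of one projection step: search space range(S), constraint space
    range(C); solves the reduced system (34), C^T A S t = C^T r0.
    A, b, x0, S, C are generic inputs with S, C of full column rank."""
    r0 = b - A @ x0
    t = np.linalg.solve(C.T @ A @ S, C.T @ r0)
    xk = x0 + S @ t
    rk = b - A @ xk
    assert np.allclose(C.T @ rk, 0)   # residual is orthogonal to the constraint space
    return xk
```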
3.2. Krylov Subspaces
Given $A \in \mathbb{R}^{n \times n}$ and a nonzero vector $v \in \mathbb{R}^{n}$, consider the following Krylov sequence:
$$v,\ Av,\ A^{2}v,\ A^{3}v,\ \ldots \qquad (39)$$
There exists an integer $d$ such that $v, Av, \ldots, A^{d-1}v$ are linearly independent but $v, Av, \ldots, A^{d}v$ are linearly dependent. It is easy to see that $d \le n$. The subspace $\mathcal{K}_k(A, v) = \operatorname{span}\{v, Av, \ldots, A^{k-1}v\}$ is called a Krylov subspace. We will show that the solution of $Ax = b$ lies in $x_0 + \mathcal{K}_m(A, r_0)$, where $m$ is the degree of the minimal polynomial of $A$, to be introduced below.
The characteristic polynomial of the matrix $A$ is defined by
$$p(\lambda) = \det(\lambda I - A) = \prod_{i=1}^{\ell} \left( \lambda - \lambda_i \right)^{n_i}, \qquad (40)$$
where $\lambda_1, \ldots, \lambda_{\ell}$ are the distinct eigenvalues of $A$ and $n_i$ is the algebraic multiplicity of the corresponding $\lambda_i$. The minimal polynomial of $A$ is defined as
$$q(\lambda) = \prod_{i=1}^{\ell} \left( \lambda - \lambda_i \right)^{m_i}, \qquad m = \sum_{i=1}^{\ell} m_i, \qquad (41)$$
where $m_i$ is the largest size of a Jordan block associated with $\lambda_i$. We know that for a Jordan block $J_i$ with eigenvalue $\lambda_i$ whose size is $m_i$, $(J_i - \lambda_i I)^{m_i} = 0$ and $(J_i - \lambda_i I)^{m_i - 1} \ne 0$. And for a polynomial $f$ and a matrix $B$ whose Jordan blocks are $J_1, \ldots, J_{\ell}$, $f(B)$ has diagonal blocks $f(J_1), \ldots, f(J_{\ell})$ in the same basis. Then for the matrix $A$ whose minimal polynomial is $q$, we have $q(A) = 0$, and therefore for all $v$ we have $q(A)\, v = 0$.
Writing $q(\lambda) = \sum_{j=0}^{m} c_j \lambda^{j}$ with $c_m = 1$, the relation $q(A)\, v = 0$ means that $A^{m} v$ can be written as
$$A^{m} v = -\sum_{j=0}^{m-1} c_j\, A^{j} v, \qquad (42)$$
where $c_0 = q(0) = \prod_{i=1}^{\ell} (-\lambda_i)^{m_i}$. It is easy to see from (41) that $c_0 \ne 0$, since $A$ is nonsingular and hence all $\lambda_i \ne 0$. So $c_0 \ne 0$ means that $A^{-1} v$ can be represented by $v, Av, \ldots, A^{m-1}v$:
$$A^{-1} v = -\frac{1}{c_0} \sum_{j=1}^{m} c_j\, A^{j-1} v,$$
which means $A^{-1} v \in \mathcal{K}_m(A, v)$, and we must have
$$x = x_0 + A^{-1} r_0 \in x_0 + \mathcal{K}_m(A, r_0).$$
Combining this conclusion with Theorem 2 and the fact that $A^{m} r_0 \in \mathcal{K}_m(A, r_0)$, we see that $\mathcal{K}_k(A, r_0)$ will eventually stop growing and become invariant under $A$, and that $\{\mathcal{K}_k(A, r_0)\}_{k \ge 1}$ is a choice of nested sequence of spaces satisfying (38). So we have:
Theorem 3. If the minimal polynomial of the nonsingular matrix $A$ has degree $m$, then the solution of $Ax = b$ lies in the space $x_0 + \mathcal{K}_m(A, r_0)$.
Actually, with different choices of $\mathcal{S}_k$ and $\mathcal{C}_k$, we obtain different Krylov subspace methods, as Table 1 shows.

Table 1. Different Krylov subspace methods (writing $\mathcal{K}_k = \mathcal{K}_k(A, r_0)$).

| Name | Problem | $\mathcal{S}_k$ | $\mathcal{C}_k$ | Property |
| --- | --- | --- | --- | --- |
| CG | SPD $A$ | $\mathcal{K}_k$ | $\mathcal{K}_k$ | minimizes the $A$-norm of the error |
| SYMMLQ | symmetric $A$ | $A\mathcal{K}_k$ | $\mathcal{K}_k$ | minimizes the 2-norm of the error |
| MINRES | symmetric $A$ | $\mathcal{K}_k$ | $A\mathcal{K}_k$ | minimizes the 2-norm of the residual |
| GMRES | general $A$ | $\mathcal{K}_k$ | $A\mathcal{K}_k$ | minimizes the 2-norm of the residual |
| BiCG | general $A$ | $\mathcal{K}_k$ | $\mathcal{K}_k(A^{\top}, \tilde{r}_0)$ | does not solve an optimization problem |
| QMR | general $A$ | $\mathcal{K}_k$ | no constraint-space formulation | quasi-minimizes the residual norm |
3.3. The Arnoldi algorithm for orthonormal basis
If we form the matrix $K_j = [\, v,\ Av,\ \ldots,\ A^{j-1}v \,]$, then because $A^{j} v$ converges in direction to an eigenvector corresponding to the largest eigenvalue in magnitude, $K_j$ is likely to be very ill-conditioned for large $j$. In order to avoid numerical trouble, we want to find orthonormal vectors $v_1, v_2, \ldots, v_k$ such that
$$\operatorname{span}\{v_1, \ldots, v_k\} = \mathcal{K}_k(A, v).$$
The idea is to use the Gram–Schmidt orthogonalization process, and we show one step of it in the following. Suppose we want to find $v_1$ and $v_2$ such that $\operatorname{span}\{v_1, v_2\} = \mathcal{K}_2(A, v)$. This can be done by:
1. Normalize: $v_1 = v / \|v\|_2$; then $\operatorname{span}\{v_1\} = \mathcal{K}_1(A, v)$.
2. Let $w = A v_1$; then $\operatorname{span}\{v_1, w\} = \mathcal{K}_2(A, v)$.
3. Orthogonalize $w$ against $v_1$: $\tilde{v}_2 = w - (v_1^{\top} w)\, v_1$; then $\tilde{v}_2 \perp v_1$ and we still have $\operatorname{span}\{v_1, \tilde{v}_2\} = \mathcal{K}_2(A, v)$.
4. Normalize: $v_2 = \tilde{v}_2 / \|\tilde{v}_2\|_2$; then $v_2 \perp v_1$ and we still have $\operatorname{span}\{v_1, v_2\} = \mathcal{K}_2(A, v)$.
We can write the Arnoldi algorithm in matrix form. Let $V_k = [\, v_1, \ldots, v_k \,]$ be the matrix whose columns are the orthonormal vectors $v_1, \ldots, v_k$. After $k$ steps, the algorithm satisfies [8]
$$A V_k = V_k H_k + h_{k+1, k}\, v_{k+1}\, e_k^{\top}, \qquad (43)$$
where $H_k \in \mathbb{R}^{k \times k}$ is the unreduced upper Hessenberg matrix with entries $h_{ij} = v_i^{\top} A v_j$ and $h_{j+1, j} \ne 0$, and $e_k$ is the $k$-th column of the identity matrix $I_k$. At the $m$-th iteration we get $h_{m+1, m} = 0$; the Arnoldi algorithm then stops and we have
$$A V_m = V_m H_m. \qquad (44)$$
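A compact sketch of the Arnoldi process is given below; it uses modified Gram–Schmidt, and the breakdown tolerance is an illustrative choice.

```python
import numpy as np

def arnoldi(A, v, k):
    """Sketch of k steps of the Arnoldi process: returns V with orthonormal
    columns spanning Krylov subspaces of (A, v) and the (k+1) x k upper
    Hessenberg matrix H such that A V[:, :k] = V @ H (if no breakdown)."""
    n = A.shape[0]
    V = np.zeros((n, k + 1))
    H = np.zeros((k + 1, k))
    V[:, 0] = v / np.linalg.norm(v)
    for j in range(k):
        w = A @ V[:, j]
        for i in range(j + 1):                 # orthogonalize against previous vectors
            H[i, j] = V[:, i] @ w
            w -= H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] < 1e-12:                # breakdown: the Krylov space is A-invariant
            return V[:, :j + 1], H[:j + 1, :j + 1]
        V[:, j + 1] = w / H[j + 1, j]
    return V, H
```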
4. Connections
One direct and most common application of Krylov subspace methods to VFA is to construct basis functions [11]. Recall that, as (5) shows, we need to solve the linear system
$$(I - \gamma P)\, v = r.$$
Since $P$ is a stochastic matrix and $0 \le \gamma < 1$, the spectral radius of $\gamma P$ is at most $\gamma < 1$, so $I - \gamma P$ is nonsingular. If we want to approximate $(I - \gamma P)^{-1}$, one way is to use the Neumann series expansion [13],
$$(I - \gamma P)^{-1} = \sum_{k=0}^{\infty} (\gamma P)^{k} = I + \gamma P + (\gamma P)^{2} + \cdots. \qquad (45)$$
Then the solution of (5) lies in $\mathcal{K}_m(P, r)$, where $m$ is the degree of the minimal polynomial of $P$, as shown below:
$$v = (I - \gamma P)^{-1} r = \sum_{k=0}^{\infty} (\gamma P)^{k} r \in \operatorname{span}\{r,\ Pr,\ P^{2}r,\ \ldots\} = \mathcal{K}_m(P, r). \qquad (46)$$
Therefore, we can set our feature matrix as $\Phi = [\, r,\ Pr,\ \ldots,\ P^{K-1}r \,]$. In practice, we could start from an initial basis vector and progressively add new Krylov vectors into the basis function matrix, as sketched below.
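Here is a sketch of such a construction: it builds an orthonormalized Krylov basis from $r$ and $P$, which assumes the model-based setting where both are known; the function name and tolerance are illustrative.

```python
import numpy as np

def krylov_features(P, r, K, gamma=0.9):
    """Sketch of a Krylov basis for VFA: spans {r, Pr, ..., P^{K-1} r},
    orthonormalized Arnoldi-style to avoid ill-conditioning."""
    n = len(r)
    Phi = np.zeros((n, K))
    Phi[:, 0] = r / np.linalg.norm(r)
    for j in range(1, K):
        w = P @ Phi[:, j - 1]
        for i in range(j):                    # orthogonalize against earlier columns
            w -= (Phi[:, i] @ w) * Phi[:, i]
        nrm = np.linalg.norm(w)
        if nrm < 1e-12:                       # Krylov space stopped growing
            return Phi[:, :j]
        Phi[:, j] = w / nrm
    return Phi
```

With the true model available, the projected (LSTD) solution in this basis approaches the exact value function as $K$ grows toward the degree of the minimal polynomial of $P$ with respect to $r$.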
Parr et al. [10] show that the basis constructed from the Krylov subspace is equivalent to the Bellman Error Basis Functions (BEBFs) and Model Error Basis Functions (MEBFs), which progressively add the Bellman error vectors and model error vectors, respectively, into the current basis.
Senda et al. [15] applied the Conjugate Gradient (CG) algorithm, the Bi-Conjugate Gradient (BiCG) algorithm and the Block Product-type Krylov Subspace Method (BPKSM) to policy evaluation and compared them with existing stationary iterative methods. They concluded that the algorithms based on Krylov subspace methods were tens to hundreds of times more efficient than the conventional algorithms, and that the convergence rate of the Krylov subspace algorithms improved as the iteration number increased, whereas that of the other algorithms remained the same. Krylov subspace methods can also be used in other areas of machine learning [19,20,21,22,23,24].
5. Conclusion and Discussion
From the last section we can see that constructing basis functions is only a simple and direct use of Krylov subspace methods. It works better in model-based learning, in which we have enough information about the transition matrix $P$. Whether or not it can be extended to model-free learning has not been well studied. Bertsekas [2] provided a good example of how to connect proximal iterations for solving linear systems with TD($\lambda$) methods and obtained some interesting results. We can take inspiration from this and try to use Krylov subspace methods for linear systems to solve approximate DP problems as well.
References
- Dimitri P. Bertsekas, Dynamic Programming and Stochastic Control lecture notes, Fall 2015, MIT.
- Dimitri P. Bertsekas, Proximal algorithms and temporal differences for large linear systems: extrapolation, approximation, and simulation, Report LIDS-P-3205, MIT; arXiv preprint arXiv: 1610.05427, 2016.
- Justin A. Boyan, Technical update: Least-squares temporal difference learning, Machine Learning, 49(2-3):233-246, 2002.
- Steven J. Bradtke and Andrew G. Barto, Linear least-squares algorithms for temporal difference learning, Machine Learning, 22:33-57, 1996.
- Gene H. Golub and Charles F. Van Loan, Matrix Computations, 4th Edn., The Johns Hopkins University Press, Baltimore, 2013, Chapter 11.
- Michail, G. Lagoudakis and Ronald Parr, Least-squares policy iteration, Journal of Machine Learning Research, 4:1107-1149, 2003.
- Alessandro Lazaric, Mohammad Ghavamzadeh and Rémi Munos, Finite-sample analysis of LSTD, In Proceedings of the 27th International Conference on Machine Learning, 2010.
- Jörg Liesen and Zdeněk Strakoš, Krylov Subspace Methods: Principles and Analysis, Oxford University Press, 2015, Chapters 1 and 2.
- Angelia Nedić and Dimitri P. Bertsekas, Least-squares policy evaluation algorithms with linear function approximation, Discrete Event Dynamic Systems: Theory and Applications, 13(1-2):79-110, 2003.
- Ronald Parr, Lihong Li, Gavin Taylor, Christopher Painter-Wakefield and Michael L. Littman, An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning, Proceedings of the 25th international conference on Machine learning, pp.752-759, Helsinki, Finland, July 05-09, 2008.
- Marek Petrik, An analysis of Laplacian methods for value function approximation in MDPs, Proceedings of the 20th international joint conference on Artifical intelligence, pp.2574-2579, Hyderabad, India, January 06-12, 2007.
- Pascal Poupart and Craig Boutilier, VDCBPI: an approximate scalable algorithm for large scale POMDPs, Advances in Neural Information Processing Systems 17 (NIPS-2004), pp.1081-1088, MIT Press, 2005.
- Martin L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, John Wiley & Sons, Inc., New York, NY, 1994, Chapter 6.
- Bruno Scherrer, Should one compute the temporal difference fixed point or minimize the Bellman residual: the unified oblique projection view, Proceedings of International Conference on Machine Learning, Haifa, Israel, pp.959-966, 2010.
- K. Senda, S. Hattori, T. Hishinuma and T. Kohda, Acceleration of reinforcement learning by policy evaluation using nonstationary iterative method, IEEE Trans. Cybern., 44(12):2696-2705, Dec. 2014.
- Csaba Szepesvári, Least squares temporal difference learning and Galerkin’s method, Mini-Workshop: Mathematics of Machine Learning, Oberwolfach Reports, European Mathematical Society, Oberwolfach, pp. 2385-2388, 2011.
- John N. Tsitsiklis and Benjamin Van Roy, An analysis of temporal-difference learning with function approximation, IEEE Trans. Automat. Contr., 42(5):674-690, 1997.
- Huizhen Yu and Dimitri P. Bertsekas, Error bounds for approximations from projected linear equations, Mathematics of Operations Research, 35(2):306-329, 2010.
- Sitao Luan, Mingde Zhao, Xiao-Wen Chang, Doina Precup (2019). Break the ceiling: Stronger multi-scale deep graph convolutional networks. Advances in neural information processing systems, 32.
- Sitao Luan, Xiao-Wen Chang, Doina Precup (2019). Revisit Policy Optimization in Matrix Form. arXiv preprint arXiv:1909.09186.
- Sitao Luan, Mingde Zhao, Chenqing Hua, Xiao-Wen Chang, Doina Precup (2020). Complete the missing half: Augmenting aggregation filtering with diversification for graph convolutional networks. arXiv preprint arXiv:2008.08844.
- Sitao Luan, Chenqing Hua, Qincheng Lu, Jiaqi Zhu, Mingde Zhao, Shuyuan Zhang, Xiao-Wen Chang, Doina Precup (2021). Is Heterophily A Real Nightmare For Graph Neural Networks To Do Node Classification?. arXiv preprint arXiv:2109.05641.
- Sitao Luan, Chenqing Hua, Qincheng Lu, Jiaqi Zhu, Mingde Zhao, Shuyuan Zhang, Xiao-Wen Chang, Doina Precup (2022). Revisiting heterophily for graph neural networks. arXiv preprint arXiv:2210.07606.
- Sitao Luan, Chenqing Hua, Minkai Xu, Qincheng Lu, Jiaqi Zhu, Xiao-Wen Chang, Jie Fu, Jure Leskovec, Doina Precup (2023). When Do Graph Neural Networks Help with Node Classification: Investigating the Homophily Principle on Node Distinguishability. arXiv preprint arXiv:2304.14274.