Preprint Article

The Finite-Time Turnpike Property in Machine Learning

Submitted: 16 August 2024. Posted: 20 August 2024.
Abstract
The finite-time turnpike property describes the situation in an optimal control problem where an optimal trajectory reaches the desired state before the end of the time interval and remains there. We consider a machine learning problem with a neural ordinary differential equation that can be seen as a homogenization of a deep ResNet. We show that, with appropriate scaling of the quadratic control cost and the non-smooth tracking term, the optimal control problem has the finite-time turnpike property, that is, the desired state is reached in the interior of the time interval and the optimal state remains there until the terminal time $T$. This property allows a compromise between the depth of the network and the size of the optimal system parameters, which we hope will help to determine optimal depths for neural network architectures in the future.
Keywords: 
Subject: Computer Science and Mathematics - Artificial Intelligence and Machine Learning

1. Introduction

We consider a system that is governed by a neural ODE and can be viewed as a continuous-time ResNet. The system $\mathcal{S}$ is defined as follows:
$$\mathcal{S}: \qquad x(0) = x_0 \in \mathbb{R}^d, \qquad x'(t) = \sum_{i=1}^{p} \sigma\big( a_i(t)^\top x(t) + b_i(t) \big)\, w_i(t) \qquad (1)$$
(see for example [1,2]). For $i \in \{1, \ldots, p\}$ we have $w_i(t) \in \mathbb{R}^d$; the $w_i(t)$ are the columns of the matrix $W(t) \in \mathbb{R}^{d \times p}$. We have $a_i(t) \in \mathbb{R}^d$, and the $a_i(t)$ are the columns of the matrix $A(t) \in \mathbb{R}^{d \times p}$. The bias vector $b(t)$ is in $\mathbb{R}^p$ and has the components $b_i(t)$.
The motivation to study $\mathcal{S}$ is that a time-discrete version can be considered as a residual neural network (ResNet), an architecture that has been used widely in many applications; see [3] for examples in image registration and classification problems. A time-discrete version can be obtained, for example, by an explicit Euler discretization of $\mathcal{S}$.
The activation function $\sigma$ is assumed to be differentiable and Lipschitz continuous with a Lipschitz constant that is less than or equal to $1$, for example
$$\sigma(z) = \tanh(z) \qquad \text{or} \qquad \sigma(z) = \frac{1}{1 + \exp(-z)}.$$
It acts on vectors componentwise.
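To make the connection between $\mathcal{S}$ and ResNets concrete, here is a minimal sketch of an explicit Euler discretization of $\mathcal{S}$. It is our own illustration (the function name, the random parameters and the choice $\sigma = \tanh$ are assumptions), not an implementation taken from the paper.

```python
import numpy as np

def euler_resnet_forward(x0, W, A, b, dt):
    """Explicit Euler discretization of the neural ODE S:
        x_{k+1} = x_k + dt * sum_i sigma(a_i^T x_k + b_i) * w_i,
    which is a ResNet with N = len(W) residual layers.
    W, A have shape (N, d, p); b has shape (N, p)."""
    x = np.array(x0, dtype=float)
    for Wk, Ak, bk in zip(W, A, b):
        pre = Ak.T @ x + bk              # p pre-activations a_i^T x + b_i
        x = x + dt * Wk @ np.tanh(pre)   # residual update, sigma = tanh
    return x

# Small usage example with random layer parameters.
rng = np.random.default_rng(0)
d, p, N, T = 3, 4, 50, 1.0
dt = T / N
W = 0.1 * rng.standard_normal((N, d, p))
A = 0.1 * rng.standard_normal((N, d, p))
b = 0.1 * rng.standard_normal((N, p))
x0 = np.ones(d)
print(euler_resnet_forward(x0, W, A, b, dt))
```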
For a given time horizon $T > 0$, we study an optimal control problem on the time interval $[0, T]$, where the desired state at the terminal time is prescribed by the terminal condition $x(T) = x_T$, with $x_T \in \mathbb{R}^d$ denoting the given desired output of the system. Let $t_0 \in (0, T)$ be given. For the training of the system, we study the loss function with the tracking term
$$Q(W, A, b) = \int_{t_0}^{T} |x(t) - x_T| + |x'(t)| \, dt$$
with the non-smooth norm $|z| = \sum_{i=1}^{d} |z_i|$.
We define the control cost (regularization term)
$$R(W, A, b) = \int_{0}^{T} \tfrac{1}{2} \|W(t)\|^2 + \tfrac{1}{2} \|A(t)\|^2 + \tfrac{1}{2} \|b(t)\|^2 \, dt.$$
Here $\|W(t)\|$ denotes the Frobenius norm of $W(t)$. We introduce the space
$$X(T) = \Big\{ \text{measurable functions } (W(t), A(t), b(t)) \text{ defined on } (0, T) \text{ such that } \int_0^T \|W(t)\|^2 + \|A(t)\|^2 + \|b(t)\|^2 \, dt < \infty \Big\}.$$
Lemma 10 in [4] states that system $\mathcal{S}$ is exactly controllable, that is, the terminal condition
$$x(t_0) = z$$
can be satisfied for all $t_0 > 0$. To be precise, for all $t_0 > 0$ there exists a constant $C_e > 0$ such that for all $z \in \mathbb{R}^d$ we can find a control $u_{exact}$ such that for the state $\tilde{x}$ that is generated by $\mathcal{S}$ with the initial condition $\tilde{x}(0) = x_0$ we have $\tilde{x}(t_0) = z$ and
$$\|u_{exact}\|_{L^2(0, t_0)} \le C_e\, \|z - x_0\|. \qquad (2)$$
Also the linearized system is exactly controllable, in the sense that for all $t_0 > 0$ there exists a constant $C_e > 0$ such that for all $z \in \mathbb{R}^d$ we can find a control $\tilde{v}$ such that for the state $\tilde{x}$ that is generated by the linearized system stated below with the initial condition $\tilde{x}(0) = 0$ we have $\tilde{x}(t_0) = z$ and
$$\|\tilde{v}\|_{L^2(0, t_0)} \le C_e\, \|z\|. \qquad (3)$$
The linearized system at a given $u = (W, A, b)$, for the variation $\delta x$ of the state that is generated by a variation $\delta u = (\delta W, \delta A, \delta b)$ of the control, is
$$\delta x'(t) = \sum_{i=1}^{p} \sigma\big( a_i(t)^\top x(t) + b_i(t) \big)\, \delta w_i(t) + \sum_{i=1}^{p} \sigma'\big( a_i(t)^\top x(t) + b_i(t) \big)\, w_i(t)\, \big( x(t)^\top \delta a_i(t) \big)$$
$$+ \sum_{i=1}^{p} \sigma'\big( a_i(t)^\top x(t) + b_i(t) \big)\, w_i(t)\, \delta b_i(t) + \sum_{i=1}^{p} \sigma'\big( a_i(t)^\top x(t) + b_i(t) \big)\, w_i(t)\, \big( a_i(t)^\top \delta x(t) \big)$$
with the initial condition $\delta x(0) = 0$.
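As an illustration of this variational equation (again our own sketch, under the assumptions of the example above with $\sigma = \tanh$, so that $\sigma' = 1 - \tanh^2$), the following code integrates $\mathcal{S}$ and its linearization with forward Euler and compares $\delta x(T)$ with a finite-difference quotient of the nonlinear flow; all names and step sizes are hypothetical choices.

```python
import numpy as np

def forward_and_linearized(x0, W, A, b, dW, dA, db, dt):
    """Forward Euler for the neural ODE S and for its linearization along the
    trajectory, with parameter variation (dW, dA, db) and delta_x(0) = 0."""
    x = np.array(x0, dtype=float)
    dx = np.zeros_like(x)
    for Wk, Ak, bk, dWk, dAk, dbk in zip(W, A, b, dW, dA, db):
        pre = Ak.T @ x + bk
        s, sp = np.tanh(pre), 1.0 - np.tanh(pre) ** 2
        # the four terms of the variational equation, vectorized over i
        dx_rhs = dWk @ s + Wk @ (sp * (dAk.T @ x + dbk + Ak.T @ dx))
        x_rhs = Wk @ s
        x, dx = x + dt * x_rhs, dx + dt * dx_rhs
    return x, dx

rng = np.random.default_rng(1)
d, p, N, T = 3, 4, 200, 1.0
dt = T / N
W, A = 0.2 * rng.standard_normal((2, N, d, p))
b = 0.2 * rng.standard_normal((N, p))
dW, dA = rng.standard_normal((2, N, d, p))
db = rng.standard_normal((N, p))
x0 = np.ones(d)

eps = 1e-6
x_T, dx_T = forward_and_linearized(x0, W, A, b, dW, dA, db, dt)
x_T_eps, _ = forward_and_linearized(x0, W + eps * dW, A + eps * dA, b + eps * db,
                                    dW, dA, db, dt)
print(np.linalg.norm((x_T_eps - x_T) / eps - dx_T))  # small: the linearization is consistent
```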
A universal approximation theorem for the corresponding time-discrete case with recurrent neural networks can be found in the seminal paper [5] by Cybenko, see also [6,7,8,9].
For a parameter $\gamma > 0$ define
$$J(W, A, b) = \gamma\, Q(W, A, b) + R(W, A, b). \qquad (4)$$
We study the minimization (training) problem
$$P(T, \gamma): \qquad \min_{(W, A, b) \in X(T)} J(W, A, b).$$
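For intuition, a Riemann-sum evaluation of the discretized objective $J = \gamma Q + R$ along the Euler trajectory might look as follows. The discretization choices, names and parameter values are our own assumptions, not taken from the paper.

```python
import numpy as np

def discretized_objective(x0, xT, W, A, b, t0, T, gamma):
    """Riemann-sum approximation of J(W, A, b) = gamma * Q + R for the
    Euler-discretized neural ODE, with the 1-norm tracking term
    Q = int_{t0}^{T} |x - xT|_1 + |x'|_1 dt and the quadratic cost
    R = int_0^T (|W|_F^2 + |A|_F^2 + |b|^2) / 2 dt."""
    N = len(W)
    dt = T / N
    x = np.array(x0, dtype=float)
    Q, R = 0.0, 0.0
    for k, (Wk, Ak, bk) in enumerate(zip(W, A, b)):
        xdot = Wk @ np.tanh(Ak.T @ x + bk)           # right-hand side of S
        if k * dt >= t0:                             # tracking only on [t0, T]
            Q += dt * (np.abs(x - xT).sum() + np.abs(xdot).sum())
        R += dt * 0.5 * ((Wk ** 2).sum() + (Ak ** 2).sum() + (bk ** 2).sum())
        x = x + dt * xdot                            # explicit Euler step
    return gamma * Q + R

rng = np.random.default_rng(2)
d, p, N, T, t0, gamma = 3, 4, 100, 1.0, 0.5, 10.0
W, A = 0.1 * rng.standard_normal((2, N, d, p))
b = 0.1 * rng.standard_normal((N, p))
x0, xT = np.zeros(d), np.ones(d)
print(discretized_objective(x0, xT, W, A, b, t0, T, gamma))
```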
Our main result is that the optimal control problem $P(T, \gamma)$ has the finite-time turnpike property, that is, the desired state is already reached in the interior of the time interval $[0, T]$ and remains there until the end of the time interval. The finite-time turnpike property has been studied for example in [10,11] and [12]. In the first two references, the finite-time turnpike property is achieved by the non-smoothness of the objective functional. In this paper, we use a similar approach adapted to the framework of neural ordinary differential equations.
The finite-time turnpike property is an extremal case of the celebrated turnpike property that was originally studied in economics. Turnpike analysis investigates how the solutions of dynamic optimal control problems with a time evolution are related to the solutions of the corresponding static problems, where the time derivatives are set to zero and the initial conditions are dropped. It turns out that, for large time horizons, the solution of the dynamic problem is often very close to the solution of the corresponding static problem on large parts of the time interval. For an overview of the turnpike property, see [13,14,15,16] and the numerous references therein.
In the case of the finite-time turnpike property, after finite time the solution of the dynamic problem coincides with the solution of the static problem. The exponential turnpike property for ResNets and beyond has been studied for example in [17], but not the finite-time turnpike property.

2. The Finite-Time Turnpike Property

The following theorem contains our main result, which states that for a sufficiently large weight $\gamma$ the non-smooth objective functional entails the finite-time turnpike property.
Theorem 1.
For each sufficiently large $\gamma > 0$, each optimal trajectory for $P(T, \gamma)$ satisfies
$$x(t) = x_T, \qquad t \in [t_0, T],$$
that is, $P(T, \gamma)$ has the finite-time turnpike property. For $t \ge t_0$, the optimal parameters satisfy $W(t) = 0$, $A(t) = 0$ and $b(t) = 0$. The optimal parameters remain unchanged if $\gamma$ is further enlarged or if $T$ is further enlarged.
For the proof of Theorem 1 we need a result about the embedding of the Sobolev space $W^{1,1}$ into the space of continuous functions. Let
$$L^1(0, T) = \Big\{ f : [0, T] \to \mathbb{R} : f \text{ is measurable and integrable, i.e., } \int_0^T |f(t)| \, dt < \infty \Big\}.$$
Consider the embedding of the space
$$W^{1,1}(0, T) = \big\{ f \in L^1(0, T) : f' \in L^1(0, T) \big\}$$
into the space of continuous functions.
Lemma 1.
Let $t_0 \in [0, T)$. For all $x \in W^{1,1}(t_0, T)$ we have
$$\max_{t \in [t_0, T]} |x(t)| \le \Big( \frac{1}{T - t_0} + 1 \Big) \int_{t_0}^{T} |x(t)| + |x'(t)| \, dt. \qquad (5)$$
Proof of Lemma 1.
For $t_1, t_2 \in [t_0, T]$ with $t_1 \le t_2$ we have
$$|x(t_1) - x(t_2)| = \Big| \int_{t_1}^{t_2} x'(t) \, dt \Big| \le \int_{t_1}^{t_2} |x'(t)| \, dt.$$
Thus $x$ is continuous on $[t_0, T]$. Hence there exists a point $t_* \in [t_0, T]$ with
$$|x(t_*)| = \min_{t \in [t_0, T]} |x(t)| \le \frac{1}{T - t_0} \int_{t_0}^{T} |x(t)| \, dt.$$
Thus for all $\tau \in [t_0, T]$ the following inequality holds:
$$|x(\tau)| \le |x(t_*)| + |x(t_*) - x(\tau)| \le \frac{1}{T - t_0} \int_{t_0}^{T} |x(t)| \, dt + \int_{t_0}^{T} |x'(t)| \, dt \le \Big( \frac{1}{T - t_0} + 1 \Big) \int_{t_0}^{T} |x(t)| + |x'(t)| \, dt.$$
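As a quick numerical sanity check of the inequality of Lemma 1 (our own illustration), the sketch below evaluates both sides for a sample smooth function on $[t_0, T]$ with a simple Riemann sum.

```python
import numpy as np

# Check max |x| <= (1/(T - t0) + 1) * int_{t0}^{T} (|x(t)| + |x'(t)|) dt
# for the sample function x(t) = sin(3 t) * exp(-t).
t0, T = 0.5, 2.0
t = np.linspace(t0, T, 20001)
dt = t[1] - t[0]
x = np.sin(3 * t) * np.exp(-t)
xp = (3 * np.cos(3 * t) - np.sin(3 * t)) * np.exp(-t)   # exact derivative

lhs = np.max(np.abs(x))
rhs = (1.0 / (T - t0) + 1.0) * np.sum(np.abs(x) + np.abs(xp)) * dt
print(lhs, rhs, lhs <= rhs)   # the bound of Lemma 1 holds for this sample
```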
Now we are prepared for the proof of Theorem 1.
Proof of Theorem 1.
Case 1: If $x_0 = x_T$, the parameters $u^* = (W^*, A^*, b^*) = (0, 0, 0)$ generate the constant state $x(t) = x_T$. Hence $u^* = 0$ solves $P(T, \gamma)$ and the assertion follows.
Case 2: Now we assume that $x_0 \neq x_T$. For $u = (W, A, b) \in X(T)$ define the cost
$$C_{(0, t_0)}(u) = \int_{0}^{t_0} \tfrac{1}{2} \|W(t)\|^2 + \tfrac{1}{2} \|A(t)\|^2 + \tfrac{1}{2} \|b(t)\|^2 \, dt.$$
Define the non-smooth tracking term
$$I_{non}(u) = \int_{t_0}^{T} |x(t) - x_T| + |x'(t)| \, dt.$$
Define the objective functional
$$K_T(u) = C_{(0, t_0)}(u) + \gamma\, I_{non}(u).$$
We consider the auxiliary problem
$$Q(T): \qquad \min_{u \in X(T)} K_T(u).$$
We show by an indirect proof that for each solution $u^*$ of $Q(T)$ we have
$$I_{non}(u^*) = 0.$$
Suppose that there exists a solution $u^* = (W^*, A^*, b^*)$ of $Q(T)$ such that $I_{non}(u^*) > 0$. Then for the corresponding optimal state $x^*$ that is generated by $\mathcal{S}$ we have $x^*(t_0) \neq x_T$; otherwise we could switch off the control at $t_0$ and continue with the zero control $(0, 0, 0)$ for $t \in (t_0, T]$, which generates the constant state $x_T$ on $(t_0, T]$ and strictly improves the performance.
Define the auxiliary state
$$\tilde{x}(t_0) = x_T + \frac{1}{I_{non}(u^*)} \big( x^*(t_0) - x_T \big).$$
The exact controllability of the linearized system implies that we can find a control $\tilde{v} \in L^2(0, t_0)$ that, due to (3), satisfies the inequality
$$\|\tilde{v}\|_{L^2(0, t_0)} \le C_e\, \|\tilde{x}(t_0) - x_T\| = \frac{C_e}{I_{non}(u^*)}\, \|x^*(t_0) - x_T\|$$
and generates the state $\tilde{V}$ with $\tilde{V}(0) = 0$ and $\tilde{V}(t_0) = \tilde{x}(t_0) - x_T$.
Due to (5) we have
$$\|x^*(t_0) - x_T\| \le \Big( \frac{1}{T - t_0} + 1 \Big) \int_{t_0}^{T} |x^*(t) - x_T| + |x^{*\prime}(t)| \, dt = \Big( \frac{1}{T - t_0} + 1 \Big) I_{non}(u^*). \qquad (6)$$
Thus we have
$$\|\tilde{v}\|_{L^2(0, t_0)} \le C_e \Big( \frac{1}{T - t_0} + 1 \Big).$$
For a step size $\varepsilon \in (0, I_{non}(u^*))$ define
$$\lambda = 1 - \frac{\varepsilon}{I_{non}(u^*)} \in (0, 1).$$
Consider the control $u$ with
$$u(t) = u^*(t) - \varepsilon\, \tilde{v}(t)$$
for $t \in (0, t_0]$; for $t \in (t_0, T)$ we define $\varepsilon \tilde{v} = (\delta W, \delta A, \delta b)$ with $\delta W(t) = \frac{\varepsilon}{I_{non}(u^*)} W^*(t)$, $\delta A(t) = \frac{\varepsilon}{I_{non}(u^*)} A^*(t)$ and $\delta b(t) = \frac{\varepsilon}{I_{non}(u^*)} b^*(t)$, so that $u(t) = \lambda\, u^*(t)$ on $(t_0, T)$.
Then, if $\gamma > 0$ is sufficiently large, $-\tilde{v}$ is a descent direction in the sense that a small step from $u^*$ in the direction $-\tilde{v}$ improves the performance. This can be seen as follows.
For the state $x = x^* + \delta x$ that is generated with the solution $\delta x$ of the linearized system with the initial condition $\delta x(0) = 0$ we have at $t_0$
$$x(t_0) - x_T = (x^*(t_0) - x_T) - \varepsilon\, (\tilde{x}(t_0) - x_T) = \Big( 1 - \frac{\varepsilon}{I_{non}(u^*)} \Big) (x^*(t_0) - x_T) = \lambda\, (x^*(t_0) - x_T).$$
Hence on $[t_0, T]$ the state $x = x^* + \delta x$ that is generated with the solution $\delta x$ of the linearized system with the initial condition $\delta x(t_0) = -\frac{\varepsilon}{I_{non}(u^*)} \big( x^*(t_0) - x_T \big)$ is
$$x(t) = x_T + \lambda\, \big( x^*(t) - x_T \big).$$
Thus for the tracking term we have the bound
$$I_{non}(u) = \lambda\, I_{non}(u^*) + O(\|\delta u\|^2) = \Big( 1 - \frac{\varepsilon}{I_{non}(u^*)} \Big) I_{non}(u^*) + O(\|\delta u\|^2).$$
For the control cost we have
$$C_{(0,t_0)}(u) = \big\langle u^* - \varepsilon \tilde{v},\, u^* - \varepsilon \tilde{v} \big\rangle_{L^2(0,t_0)} = C_{(0,t_0)}(u^*) - 2\varepsilon\, \big\langle u^*, \tilde{v} \big\rangle_{L^2(0,t_0)} + \varepsilon^2\, C_{(0,t_0)}(\tilde{v}).$$
Define
$$p(\varepsilon) = K_T(u^* - \varepsilon \tilde{v}) = C_{(0,t_0)}(u^*) - 2\varepsilon\, \big\langle u^*, \tilde{v} \big\rangle_{L^2(0,t_0)} + \varepsilon^2\, C_{(0,t_0)}(\tilde{v}) + \gamma \Big[ \Big( 1 - \frac{\varepsilon}{I_{non}(u^*)} \Big) I_{non}(u^*) + O(\|\delta u\|^2) \Big].$$
Then we have
$$p'(\varepsilon) = -2\, \big\langle u^*, \tilde{v} \big\rangle_{L^2(0,t_0)} + 2\varepsilon\, C_{(0,t_0)}(\tilde{v}) - \gamma + O(\varepsilon).$$
This yields
$$p'(0) = -2\, \big\langle u^*, \tilde{v} \big\rangle_{L^2(0,t_0)} - \gamma.$$
The exact controllability of $\mathcal{S}$ implies that there is a control $u_{exact} \in L^2(0, t_0)$ with (due to (2))
$$\|u_{exact}\|_{L^2(0, t_0)} \le C_e\, \|x_0 - x_T\|$$
that generates the state $V_{exact}$ with $V_{exact}(0) = x_0$ and $V_{exact}(t_0) = x_T$. For $t > t_0$, let $u_{exact}(t) = 0$. Since $u_{exact}$ is feasible for $Q(T)$, this yields the inequality
$$C_{(0, t_0)}(u^*) \le K_T(u^*) \le K_T(u_{exact}) = \|u_{exact}\|_{L^2(0, t_0)}^2 \le C_e^2\, \|x_0 - x_T\|^2.$$
Hence we have
$$\big| \langle u^*, \tilde{v} \rangle_{L^2(0,t_0)} \big| \le C_e\, \|x_0 - x_T\|\, \|\tilde{v}\|_{L^2(0, t_0)} \le \|x_0 - x_T\|\, C_e^2 \Big( \frac{1}{T - t_0} + 1 \Big).$$
Thus if
$$\gamma > 2\, \|x_0 - x_T\|\, C_e^2 \Big( \frac{1}{T - t_0} + 1 \Big),$$
we have $p'(0) \le -\gamma + 2\, \|x_0 - x_T\|\, C_e^2 \big( \tfrac{1}{T - t_0} + 1 \big) < 0$. This implies that for $\varepsilon > 0$ sufficiently small we have
$$K_T(u^* - \varepsilon \tilde{v}) < K_T(u^*),$$
which contradicts the optimality of $u^*$.
Hence for any optimal control $u^*$ of $Q(T)$ we have $I_{non}(u^*) = 0$. With inequality (6) this implies that for the optimal state we have $x^*(t_0) = x_T$.
Now we come back to problem
$$P(T, \gamma): \qquad \min_{u \in X(T)} J(u)$$
with $J$ defined in (4). Let $v_P(T)$ denote the optimal value of $P(T, \gamma)$ and $v_Q(T)$ denote the optimal value of $Q(T)$. Since $K_T(u) \le J(u)$, we have $v_Q(T) \le v_P(T)$.
Moreover, any optimal control $u^*$ for $Q(T)$ is feasible for $P(T, \gamma)$. Since $x^*(t_0) = x_T$, we can assume that the control is switched off on $(t_0, T)$ (this leaves $K_T(u^*)$ unchanged), so that $C_{(t_0, T)}(u^*) = 0$, where $C_{(t_0, T)}$ denotes the control cost over $(t_0, T)$. Hence $v_P(T) \le J(u^*) = K_T(u^*) = v_Q(T)$, and thus $v_P(T) \le v_Q(T)$. Therefore we have
$$v_P(T) = v_Q(T).$$
This implies that parameters that are optimal for $P(T, \gamma)$ are also optimal for $Q(T)$: for any optimal control $u$ of $P(T, \gamma)$ we have $K_T(u) \le J(u) = v_Q(T)$, so $u$ is also optimal for $Q(T)$; thus $I_{non}(u) = 0$, and $J(u) = K_T(u)$ implies $C_{(t_0, T)}(u) = 0$, that is, the optimal parameters vanish on $(t_0, T)$. With Lemma 1 the assertion follows. Thus we have proved Theorem 1. □

3. Existence of Solutions of $P(T, \gamma)$ for Fixed $A$

For the sake of completeness of the analysis, we also state an existence result. However, we can only prove the existence of a solution for the problem where the matrix $A$ is fixed and not an optimization parameter of $P(T, \gamma)$. Thus, for a given matrix-valued function $A(t)$, we consider the problem
$$P(T, \gamma, A): \qquad \min_{(\cdot, A, \cdot) \in X(T)} J(\cdot, A, \cdot).$$
In order to show the existence of a solution of $P(T, \gamma, A)$, we assume that there exists a number $M > 0$ such that for $t \in [0, T]$ almost everywhere we have $\max_{i \in \{1, \ldots, p\}} \|a_i(t)\| \le M$. This is the case if the $a_i$ are elements of the function space $L^\infty(0, T)$, for example if they are step functions over $(0, T)$.
Theorem 2.
Assume that $\sup_{x} |\sigma(x)| \le 1$ and that the Lipschitz constant of $\sigma$ is less than or equal to $1$. Assume that $A(t)$ is given such that we have
$$\operatorname{ess\,sup}_{i \in \{1, \ldots, p\},\, s \in [0, T]} \|a_i(s)\| < \infty.$$
Then for each $T > 0$ and $\gamma > 0$, problem $P(T, \gamma, A)$ has a solution $(W, b)$ such that $(W, A, b) \in X(T)$.
If $A(t) = 0$ for $t \ge t_0$, then for sufficiently large $\gamma$ each solution has the finite-time turnpike property stated in Theorem 1.
The proof of Theorem 2 uses Gronwall’s Lemma (see for example [18]). For the convenience of the reader we state it here:
Lemma 2
(Gronwall's Lemma). Let $L > 0$, $U_0 \ge 0$, $\varepsilon \ge 0$ and an integrable function $U$ on $[0, T]$ be given.
Assume that for $t \in [0, T]$ almost everywhere the integral inequality
$$0 \le U(t) \le U_0 + \int_0^t L\, U(\tau) + \varepsilon \, d\tau$$
holds. Then for $t \in [0, T]$ almost everywhere the function $U$ satisfies the inequality
$$U(t) \le U_0\, e^{L t} + \varepsilon\, \frac{e^{L t} - 1}{L}.$$
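As a small numerical illustration (ours, not from the paper): in the extremal case $U'(t) = L\, U(t) + \varepsilon$, $U(0) = U_0$, the integral inequality holds with equality and the explicit solution coincides with the Gronwall bound. The sketch compares a forward Euler approximation with the bound.

```python
import numpy as np

# Extremal case of Gronwall's Lemma: U' = L*U + eps, U(0) = U0.
# The bound U(t) <= U0*exp(L*t) + eps*(exp(L*t) - 1)/L holds with equality here.
L, U0, eps, T, n = 2.0, 0.5, 0.1, 1.0, 100000
dt = T / n
t = np.linspace(0.0, T, n + 1)
U = np.empty(n + 1)
U[0] = U0
for k in range(n):
    U[k + 1] = U[k] + dt * (L * U[k] + eps)       # forward Euler
bound = U0 * np.exp(L * t) + eps * (np.exp(L * t) - 1.0) / L
print(np.max(U - bound) <= 1e-3)                   # Euler stays (slightly) below the bound
```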
Now we present the proof of Theorem 2.
Proof of Theorem 2.
Consider a minimizing sequence $(u_n)_{n=1}^{\infty}$ with $u_n = (W_n, A, b_n) \in X(T)$ for all $n \in \{1, 2, 3, \ldots\}$. Define the norm
$$\|u_n\|_{X(T)} = \Big( \int_0^T \|W_n(t)\|^2 + \|A(t)\|^2 + \|b_n(t)\|^2 \, dt \Big)^{1/2}$$
and the corresponding inner product that gives a Hilbert space structure to $X(T)$. Due to the definition of $J$, there exists a number $M > 0$ such that for all $n \in \{1, 2, \ldots\}$ we have
$$\|u_n\|_{X(T)} \le M,$$
that is, the sequence is bounded in $X(T)$.
Hence there exists a weakly converging subsequence with a limit
$$u^* = (W^*, A, b^*) \in X(T).$$
Let $x^*$ denote the state generated by $u^*$. For the states $x_n$ generated by the $u_n$ as solutions of $\mathcal{S}$, the bound on $\|u_n\|_{X(T)}$ and the bound $\sup_x |\sigma(x)| \le 1$ imply that, by increasing $M$ if necessary, we can assume that we have
$$\sup_{s \in [0, T],\, n \in \{1, 2, 3, \ldots\}} \|x_n(s)\| \le M.$$
Due to Mazur's Lemma (see for example [19,20]), there exists a sequence of convex combinations that converges strongly. To be precise, there exist convex combinations
$$v_k = \sum_{m=k}^{N(k)} \lambda_m^{(k)} u_m, \qquad \text{with } \lambda_m^{(k)} \ge 0,\ k \le m \le N(k),\ \text{and } \sum_{m=k}^{N(k)} \lambda_m^{(k)} = 1,$$
such that
$$\lim_{k \to \infty} \|v_k - u^*\|_{X(T)} = 0.$$
This implies
$$\lim_{k \to \infty} \int_0^T \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} W_m(t) - W^*(t) \Big\| + \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} b_m(t) - b^*(t) \Big\| \, dt = 0.$$
Since $\sigma$ is Lipschitz continuous with a Lipschitz constant that is less than or equal to $1$, we have for $i \in \{1, \ldots, p\}$
$$\Big| \sigma\Big( \sum_{m=k}^{N(k)} \lambda_m^{(k)} \big[ a_i(t)^\top x_m(t) + (b_i)_m(t) \big] \Big) - \sigma\big( a_i(t)^\top x^*(t) + (b_i)^*(t) \big) \Big| \le \Big| \sum_{m=k}^{N(k)} \lambda_m^{(k)} \big[ a_i(t)^\top x_m(t) + (b_i)_m(t) \big] - \big( a_i(t)^\top x^*(t) + (b_i)^*(t) \big) \Big|.$$
Thus for $t \in [0, T]$ almost everywhere we have
$$\Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} x_m(t) - x^*(t) \Big\| \le \sum_{i=1}^{p} \int_0^t \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} (w_i)_m(s) - (w_i)^*(s) \Big\| \, \Big| \sigma\big( a_i(s)^\top x^*(s) + (b_i)^*(s) \big) \Big|$$
$$+ \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} (w_i)_m(s) \Big\| \, \Big| \sigma\Big( \sum_{m=k}^{N(k)} \lambda_m^{(k)} \big[ a_i(s)^\top x_m(s) + (b_i)_m(s) \big] \Big) - \sigma\big( a_i(s)^\top x^*(s) + (b_i)^*(s) \big) \Big| \, ds.$$
Then the fact that $\sup_x |\sigma(x)| \le 1$, the Cauchy-Schwarz inequality, the bound $\|u_n\|_{X(T)} \le M$ and the Lipschitz estimate above yield
$$\Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} x_m(t) - x^*(t) \Big\| \le \sum_{i=1}^{p} \int_0^t \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} (w_i)_m(s) - (w_i)^*(s) \Big\| \, ds$$
$$+ \sqrt{ \int_0^t \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} (w_i)_m(s) \Big\|^2 ds }\; \sqrt{ \int_0^t \Big| \sigma\Big( \sum_{m=k}^{N(k)} \lambda_m^{(k)} \big[ a_i(s)^\top x_m(s) + (b_i)_m(s) \big] \Big) - \sigma\big( a_i(s)^\top x^*(s) + (b_i)^*(s) \big) \Big|^2 ds }$$
$$\le \sum_{i=1}^{p} \int_0^t \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} (w_i)_m(s) - (w_i)^*(s) \Big\| \, ds + M \sqrt{ \int_0^t \Big| \sum_{m=k}^{N(k)} \lambda_m^{(k)} \big[ a_i(s)^\top x_m(s) + (b_i)_m(s) \big] - \big( a_i(s)^\top x^*(s) + (b_i)^*(s) \big) \Big|^2 ds }$$
$$\le \sum_{i=1}^{p} \int_0^t \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} (w_i)_m(s) - (w_i)^*(s) \Big\| \, ds + M \sqrt{ \int_0^t \Big| a_i(s)^\top \Big( \sum_{m=k}^{N(k)} \lambda_m^{(k)} x_m(s) - x^*(s) \Big) \Big|^2 ds } + M \sqrt{ \int_0^t \Big| \sum_{m=k}^{N(k)} \lambda_m^{(k)} (b_i)_m(s) - (b_i)^*(s) \Big|^2 ds }.$$
Due to Mazur's Lemma, this yields the existence of a sequence $(\varepsilon_k)_k$ with $\varepsilon_k \ge 0$ and $\lim_{k \to \infty} \varepsilon_k = 0$ such that
$$\Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} x_m(t) - x^*(t) \Big\| \le \varepsilon_k + \sum_{i=1}^{p} M \sqrt{ \int_0^t \operatorname{ess\,sup}_{\tau \in (0, T)} \|a_i(\tau)\|^2 \; \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} x_m(s) - x^*(s) \Big\|^2 \, ds }.$$
Thus, by increasing the value of $M$ if necessary, we obtain for $t \in [0, T]$ almost everywhere
$$\Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} x_m(t) - x^*(t) \Big\| \le \varepsilon_k + \sum_{i=1}^{p} M \sqrt{ \int_0^t M^2\, \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} x_m(s) - x^*(s) \Big\|^2 \, ds }.$$
Since $(|u| + |v|)^2 \le 2|u|^2 + 2|v|^2$ and
$$\Big( \sum_{i=1}^{p} |z_i| \Big)^2 \le p \sum_{i=1}^{p} |z_i|^2,$$
this yields the integral inequality
$$\Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} x_m(t) - x^*(t) \Big\|^2 \le 2\, \varepsilon_k^2 + 2\, p\, M^4 \sum_{i=1}^{p} \int_0^t \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} x_m(s) - x^*(s) \Big\|^2 \, ds.$$
Now Gronwall's Lemma yields for $t \in [0, T]$ almost everywhere
$$\Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} x_m(t) - x^*(t) \Big\| = O(\varepsilon_k).$$
This yields
$$\lim_{k \to \infty} \max_{t \in [0, T]} \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} x_m(t) - x^*(t) \Big\| = 0.$$
For the time derivatives we obtain, again by increasing the value of $M$ if necessary,
$$\int_0^T \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} x_m'(t) - x^{*\prime}(t) \Big\| \, dt$$
$$\le \sum_{i=1}^{p} \int_0^T \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} (w_i)_m(t) - (w_i)^*(t) \Big\| \, \Big| \sigma\big( a_i(t)^\top x^*(t) + (b_i)^*(t) \big) \Big|$$
$$+ \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} (w_i)_m(t) \Big\| \, \Big| \sigma\Big( \sum_{m=k}^{N(k)} \lambda_m^{(k)} \big[ a_i(t)^\top x_m(t) + (b_i)_m(t) \big] \Big) - \sigma\big( a_i(t)^\top x^*(t) + (b_i)^*(t) \big) \Big| \, dt$$
$$\le \sum_{i=1}^{p} \int_0^T \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} (w_i)_m(t) - (w_i)^*(t) \Big\| \, dt$$
$$+ \sqrt{ \int_0^T \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} (w_i)_m(t) \Big\|^2 dt } \; \sqrt{ \int_0^T \Big| \sum_{m=k}^{N(k)} \lambda_m^{(k)} \big[ a_i(t)^\top x_m(t) + (b_i)_m(t) \big] - \big( a_i(t)^\top x^*(t) + (b_i)^*(t) \big) \Big|^2 dt }$$
$$\le \varepsilon_k + \sum_{i=1}^{p} M \sqrt{ \int_0^T \Big| a_i(t)^\top \Big( \sum_{m=k}^{N(k)} \lambda_m^{(k)} x_m(t) - x^*(t) \Big) \Big|^2 dt } + M \sqrt{ \int_0^T \Big| \sum_{m=k}^{N(k)} \lambda_m^{(k)} (b_i)_m(t) - (b_i)^*(t) \Big|^2 dt }$$
$$\le \varepsilon_k\, (1 + M) + M \sum_{i=1}^{p} \sqrt{ \int_0^T \|a_i(t)\|^2\, \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} x_m(t) - x^*(t) \Big\|^2 \, dt }$$
$$\le \varepsilon_k\, (1 + M) + M \sum_{i=1}^{p} \operatorname{ess\,sup}_{s \in [0, T]} \|a_i(s)\| \sqrt{ \int_0^T \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} x_m(t) - x^*(t) \Big\|^2 \, dt }$$
$$\le \varepsilon_k\, (1 + M) + M \sum_{i=1}^{p} \operatorname{ess\,sup}_{s \in [0, T]} \|a_i(s)\|\; \varepsilon_k = O(\varepsilon_k).$$
Thus we have
$$\liminf_{k \to \infty} Q(v_k) \ge Q(u^*), \qquad \liminf_{k \to \infty} R(v_k) \ge R(u^*).$$
This yields
$$\liminf_{k \to \infty} J(u_k) \ge J(u^*).$$
Hence $u^*$ is a solution of $P(T, \gamma, A)$. This shows that solutions of $P(T, \gamma, A)$ exist.
The exact controllability properties that have been used for the construction in the proof of Theorem 1 still hold if the matrix A is fixed. Hence the assertion follows. □

4. Discussion

We have shown that with a suitable non-smooth loss function each solution of a learning problem has the finite-time turnpike property, which means that the optimal state reaches the desired state exactly after finite time. Since the finite time $t_0$ can be considered as a problem parameter, this allows us to choose $t_0$ in a convenient way. Thus $t_0$ arises as an additional design parameter for neural network architectures that corresponds to the number of layers. Since for $t \in [t_0, T]$ the optimal parameters are zero, system $\mathcal{S}$ does not change the state on $[t_0, T]$, and thus the time horizon can be cut off at $t_0$.
Hence the problem of finding the optimal number of layers in a neural network corresponds, in the setting of neural ODEs, to a problem of time-optimal control where the task is to find a minimal value of $t_0$ subject to the constraint that $x(t_0) = x_T$ and that the optimal parameters $u$ satisfy $\tfrac{1}{2} \|u\|_{X(t_0)}^2 \le \rho$. Here the number $\rho$ is prescribed as a problem parameter. Let $\omega(T, \gamma)$ denote the optimal value of $P(T, \gamma)$. Then for optimal parameters $u$ that solve $P(T, \gamma)$ we have $\tfrac{1}{2} \|u\|_{X(t_0)}^2 \le \omega(T, \gamma)$. Since Theorem 1 implies that for the optimal state we have $x(t_0) = x_T$, we conclude that optimal parameters for $P(T, \gamma)$ also solve the time-optimal control problem with parameter $\rho = \omega(T, \gamma)$, and the optimal time is $t_0$.
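To illustrate the correspondence between $t_0$ and the number of layers (a sketch with our own hypothetical setup): if the trained layer parameters vanish for $t \ge t_0$, the Euler-discretized network can be truncated after the first $\lceil t_0 / \Delta t \rceil$ layers without changing its output.

```python
import math
import numpy as np

def forward(x0, W, A, b, dt):
    """Explicit Euler forward pass through all layers (as in Section 1)."""
    x = np.array(x0, dtype=float)
    for Wk, Ak, bk in zip(W, A, b):
        x = x + dt * Wk @ np.tanh(Ak.T @ x + bk)
    return x

rng = np.random.default_rng(3)
d, p, N, T, t0 = 3, 4, 80, 1.0, 0.4
dt = T / N
W = 0.1 * rng.standard_normal((N, d, p))
A = 0.1 * rng.standard_normal((N, d, p))
b = 0.1 * rng.standard_normal((N, p))

# Emulate the finite-time turnpike situation: parameters vanish for t >= t0.
n_layers = math.ceil(t0 / dt)
W[n_layers:], A[n_layers:], b[n_layers:] = 0.0, 0.0, 0.0

x0 = np.ones(d)
full = forward(x0, W, A, b, dt)
truncated = forward(x0, W[:n_layers], A[:n_layers], b[:n_layers], dt)
print(np.allclose(full, truncated))   # True: layers after t0 do not change the state
```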
We have shown the existence of a solution of the nonlinear optimization problem for the case that one of the parameters, namely the matrix $A(t)$, is fixed. In order to show that a solution also exists with $A$ as an additional optimization parameter, we expect that an additional regularization term in the objective functional (for example $\int_0^T \|A'(t)\|^2 \, dt$) is necessary. This is a topic for future research. We expect that the finite-time turnpike property also holds in the case $t_0 = 0$. However, the proof that is presented here does not apply to this case, so this is another topic for future research. As a possible application of our results we have in mind the numerical solution of shape inverse problems as described in [21].

Funding

This research was funded by the DFG in the framework of the Collaborative Research Centre CRC/Transregio 154, Mathematical Modelling, Simulation and Optimization Using the Example of Gas Networks (project C03 and project C05, Projektnummer 239904186), and by the Bundesministerium für Bildung und Forschung (BMBF) and the Croatian Ministry of Science and Education under DAAD grant 57654073 'Uncertain data in control of PDE systems'.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Marion, P. Generalization bounds for neural ordinary differential equations and deep residual networks. Advances in Neural Information Processing Systems 2024, 36.
  2. Dupont, E.; Doucet, A.; Teh, Y.W. Augmented Neural ODEs. Advances in Neural Information Processing Systems; Wallach, H.; Larochelle, H.; Beygelzimer, A.; d’Alché-Buc, F.; Fox, E.; Garnett, R., Eds. Curran Associates, Inc., 2019, Vol. 32.
  3. Thorpe, M.; van Gennip, Y. Deep limits of residual neural networks. Research in the Mathematical Sciences 2023, 10, 6.
  4. Álvarez López, A.; Slimane, A.H.; Zuazua, E. Interplay between depth and width for interpolation in neural ODEs, 2024, [arXiv:math.OC/2401.09902].
  5. Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems 1989, 2, 303–314.
  6. Pinkus, A. Approximation theory of the MLP model in neural networks. Acta numerica 1999, 8, 143–195.
  7. Schäfer, A.M.; Zimmermann, H.G. Recurrent neural networks are universal approximators. Artificial Neural Networks–ICANN 2006: 16th International Conference, Athens, Greece, September 10-14, 2006. Proceedings, Part I 16. Springer, 2006, pp. 632–640.
  8. Schäfer, A.M.; Udluft, S.; Zimmermann, H.G. Learning long term dependencies with recurrent neural networks. Artificial Neural Networks–ICANN 2006: 16th International Conference, Athens, Greece, September 10-14, 2006. Proceedings, Part I 16. Springer, 2006, pp. 71–80.
  9. Schaefer, A.M.; Udluft, S.; Zimmermann, H.G. Learning long-term dependencies with recurrent neural networks. Neurocomputing 2008, 71, 2481–2488.
  10. Gugat, M.; Schuster, M.; Zuazua, E. The finite-time turnpike phenomenon for optimal control problems: Stabilization by non-smooth tracking terms. Stabilization of distributed parameter systems: design methods and applications. Springer, 2021, pp. 17–41.
  11. Gugat, M.; Schuster, M. Optimal Neumann control of the wave equation with $L^1$-control cost: the finite-time turnpike property. Optimization 2024, pp. 1–28.
  12. Gugat, M. Optimal boundary control of the wave equation: The finite-time turnpike phenomenon. Mathematical Reports 2022.
  13. Zaslavski, A.J. Turnpike Phenomenon in Metric Spaces; Vol. 201, Springer Nature, 2023.
  14. Grüne, L.; Faulwasser, T. Turnpike properties in optimal control: An overview of discrete-time and continuous-time results. Handbook of Numerical Analysis; Trelat, E.; Zuazua, E., Eds., 2022.
  15. Grüne, L.; Guglielmi, R. Turnpike properties and strict dissipativity for discrete time linear quadratic optimal control problems. SIAM J. Control Optim. 2018, 56, 1282–1302. doi:10.1137/17M112350X.
  16. Trélat, E.; Zuazua, E. The turnpike property in finite-dimensional nonlinear optimal control. Journal of Differential Equations 2015, 258, 81–114.
  17. Geshkovski, B.; Zuazua, E. Turnpike in optimal control of PDEs, ResNets, and beyond. Acta Numerica 2022, 31, 135–263.
  18. Gugat, M. Optimal boundary control and boundary stabilization of hyperbolic systems; Birkhäuser, 2015.
  19. Ciarlet, P.G. Mathematical elasticity: Three-dimensional elasticity; SIAM, 2021.
  20. Heuser, H.G. Functional analysis. Transl. by John Horvath. A Wiley-Interscience Publication. Chichester etc.: John Wiley & Sons, 1982.
  21. Jackowska-Strumillo, L.; Sokolowski, J.; Żochowski, A.; Henrot, A. On numerical solution of shape inverse problems. Computational Optimization and Applications 2002, 23, 231–255.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.