Preprint Article

The Finite-Time Turnpike Property in Machine Learning

Submitted: 16 August 2024. Posted: 20 August 2024.
Abstract
The finite-time turnpike property describes the situation in an optimal control problem where an optimal trajectory reaches the desired state before the end of the time interval and remains there. We consider a machine learning problem with a neural ordinary differential equation that can be seen as a homogenization of a deep ResNet. We show that, with appropriate scaling of the quadratic control cost and the non-smooth tracking term, the optimal control problem has the finite-time turnpike property, that is, the desired state is reached in the interior of the time interval and the optimal state remains there until the terminal time $T$. This property allows a compromise between the depth of the network and the size of the optimal system parameters, which we hope will help to determine optimal depths for neural network architectures in the future.
Keywords: 
Subject: Computer Science and Mathematics - Artificial Intelligence and Machine Learning

1. Introduction

We consider a system that is governed by a neural ODE and can be viewed as a continuous-time ResNet. The system $\mathcal{S}$ is defined as follows:
$$\mathcal{S}: \qquad x(0) = x_0 \in \mathbb{R}^d, \qquad x'(t) = \sum_{i=1}^{p} \sigma\big( a_i(t)^\top x(t) + b_i(t) \big)\, w_i(t) \qquad (1)$$
(see for example [1,2]). For $i \in \{1, \ldots, p\}$ we have $w_i(t) \in \mathbb{R}^d$; the $w_i(t)$ are the columns of the matrix $W(t) \in \mathbb{R}^{d \times p}$. We have $a_i(t) \in \mathbb{R}^d$, and the $a_i(t)$ are the columns of the matrix $A(t) \in \mathbb{R}^{d \times p}$. The bias vector $b(t)$ is in $\mathbb{R}^p$ and has the components $b_i(t)$.
The motivation to study $\mathcal{S}$ is that a time-discrete version can be considered as a residual neural network (ResNet), an architecture that has been used widely in many applications; see [3] for examples in image registration and classification problems. A time-discrete version can be obtained, for example, by an explicit Euler discretization of $\mathcal{S}$.
The activation function $\sigma$ is assumed to be differentiable and Lipschitz continuous with a Lipschitz constant that is less than or equal to $1$, for example
$$\sigma(z) = \tanh(z) \qquad \text{or} \qquad \sigma(z) = \frac{1}{1 + \exp(-z)}.$$
It acts on vectors componentwise.
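To make the connection between $\mathcal{S}$ and ResNets concrete, here is a minimal sketch of an explicit Euler discretization of $\mathcal{S}$. It is our own illustration (the function name, the random parameters and the choice $\sigma = \tanh$ are assumptions), not an implementation taken from the paper.

```python
import numpy as np

def euler_resnet_forward(x0, W, A, b, dt):
    """Explicit Euler discretization of the neural ODE S:
        x_{k+1} = x_k + dt * sum_i sigma(a_i^T x_k + b_i) * w_i,
    which is a ResNet with N = len(W) residual layers.
    W, A have shape (N, d, p); b has shape (N, p)."""
    x = np.array(x0, dtype=float)
    for Wk, Ak, bk in zip(W, A, b):
        pre = Ak.T @ x + bk              # p pre-activations a_i^T x + b_i
        x = x + dt * Wk @ np.tanh(pre)   # residual update, sigma = tanh
    return x

# Small usage example with random layer parameters.
rng = np.random.default_rng(0)
d, p, N, T = 3, 4, 50, 1.0
dt = T / N
W = 0.1 * rng.standard_normal((N, d, p))
A = 0.1 * rng.standard_normal((N, d, p))
b = 0.1 * rng.standard_normal((N, p))
x0 = np.ones(d)
print(euler_resnet_forward(x0, W, A, b, dt))
```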
For a given time horizon $T > 0$, we study an optimal control problem on the time interval $[0, T]$, where the desired state at the terminal time is prescribed by the terminal condition $x(T) = x_T$, with $x_T \in \mathbb{R}^d$ denoting the given desired output of the system. Let $t_0 \in (0, T)$ be given. For the training of the system, we study the loss function with the tracking term
$$Q(W, A, b) = \int_{t_0}^{T} |x(t) - x_T| + |x'(t)| \, dt$$
with the non-smooth norm $|z| = \sum_{i=1}^{d} |z_i|$.
We define the control cost (regularization term)
$$R(W, A, b) = \int_{0}^{T} \tfrac{1}{2} \|W(t)\|^2 + \tfrac{1}{2} \|A(t)\|^2 + \tfrac{1}{2} \|b(t)\|^2 \, dt.$$
Here $\|W(t)\|$ denotes the Frobenius norm of $W(t)$. We introduce the space
$$X(T) = \Big\{ \text{measurable functions } (W(t), A(t), b(t)) \text{ defined on } (0, T) \text{ such that } \int_0^T \|W(t)\|^2 + \|A(t)\|^2 + \|b(t)\|^2 \, dt < \infty \Big\}.$$
Lemma 10 in [4] states that system $\mathcal{S}$ is exactly controllable, that is, the terminal condition
$$x(t_0) = z$$
can be satisfied for all $t_0 > 0$. To be precise, for all $t_0 > 0$ there exists a constant $C_e > 0$ such that for all $z \in \mathbb{R}^d$ we can find a control $u_{exact}$ such that for the state $\tilde{x}$ that is generated by $\mathcal{S}$ with the initial condition $\tilde{x}(0) = x_0$ we have $\tilde{x}(t_0) = z$ and
$$\|u_{exact}\|_{L^2(0, t_0)} \le C_e\, \|z - x_0\|. \qquad (2)$$
Also the linearized system is exactly controllable, in the sense that for all $t_0 > 0$ there exists a constant $C_e > 0$ such that for all $z \in \mathbb{R}^d$ we can find a control $\tilde{v}$ such that for the state $\tilde{x}$ that is generated by the linearized system stated below with the initial condition $\tilde{x}(0) = 0$ we have $\tilde{x}(t_0) = z$ and
$$\|\tilde{v}\|_{L^2(0, t_0)} \le C_e\, \|z\|. \qquad (3)$$
The linearized system at a given $u = (W, A, b)$, for the variation $\delta x$ of the state that is generated by a variation $\delta u = (\delta W, \delta A, \delta b)$ of the control, is
$$\delta x'(t) = \sum_{i=1}^{p} \sigma\big( a_i(t)^\top x(t) + b_i(t) \big)\, \delta w_i(t) + \sum_{i=1}^{p} \sigma'\big( a_i(t)^\top x(t) + b_i(t) \big)\, w_i(t)\, \big( x(t)^\top \delta a_i(t) \big)$$
$$+ \sum_{i=1}^{p} \sigma'\big( a_i(t)^\top x(t) + b_i(t) \big)\, w_i(t)\, \delta b_i(t) + \sum_{i=1}^{p} \sigma'\big( a_i(t)^\top x(t) + b_i(t) \big)\, w_i(t)\, \big( a_i(t)^\top \delta x(t) \big)$$
with the initial condition $\delta x(0) = 0$.
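As an illustration of this variational equation (again our own sketch, under the assumptions of the example above with $\sigma = \tanh$, so that $\sigma' = 1 - \tanh^2$), the following code integrates $\mathcal{S}$ and its linearization with forward Euler and compares $\delta x(T)$ with a finite-difference quotient of the nonlinear flow; all names and step sizes are hypothetical choices.

```python
import numpy as np

def forward_and_linearized(x0, W, A, b, dW, dA, db, dt):
    """Forward Euler for the neural ODE S and for its linearization along the
    trajectory, with parameter variation (dW, dA, db) and delta_x(0) = 0."""
    x = np.array(x0, dtype=float)
    dx = np.zeros_like(x)
    for Wk, Ak, bk, dWk, dAk, dbk in zip(W, A, b, dW, dA, db):
        pre = Ak.T @ x + bk
        s, sp = np.tanh(pre), 1.0 - np.tanh(pre) ** 2
        # the four terms of the variational equation, vectorized over i
        dx_rhs = dWk @ s + Wk @ (sp * (dAk.T @ x + dbk + Ak.T @ dx))
        x_rhs = Wk @ s
        x, dx = x + dt * x_rhs, dx + dt * dx_rhs
    return x, dx

rng = np.random.default_rng(1)
d, p, N, T = 3, 4, 200, 1.0
dt = T / N
W, A = 0.2 * rng.standard_normal((2, N, d, p))
b = 0.2 * rng.standard_normal((N, p))
dW, dA = rng.standard_normal((2, N, d, p))
db = rng.standard_normal((N, p))
x0 = np.ones(d)

eps = 1e-6
x_T, dx_T = forward_and_linearized(x0, W, A, b, dW, dA, db, dt)
x_T_eps, _ = forward_and_linearized(x0, W + eps * dW, A + eps * dA, b + eps * db,
                                    dW, dA, db, dt)
print(np.linalg.norm((x_T_eps - x_T) / eps - dx_T))  # small: the linearization is consistent
```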
A universal approximation theorem for the corresponding time-discrete case with recurrent neural networks can be found in the seminal paper [5] by Cybenko, see also [6,7,8,9].
For a parameter $\gamma > 0$ define
$$J(W, A, b) = \gamma\, Q(W, A, b) + R(W, A, b). \qquad (4)$$
We study the minimization (training) problem
$$P(T, \gamma): \qquad \min_{(W, A, b) \in X(T)} J(W, A, b).$$
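For intuition, a Riemann-sum evaluation of the discretized objective $J = \gamma Q + R$ along the Euler trajectory might look as follows. The discretization choices, names and parameter values are our own assumptions, not taken from the paper.

```python
import numpy as np

def discretized_objective(x0, xT, W, A, b, t0, T, gamma):
    """Riemann-sum approximation of J(W, A, b) = gamma * Q + R for the
    Euler-discretized neural ODE, with the 1-norm tracking term
    Q = int_{t0}^{T} |x - xT|_1 + |x'|_1 dt and the quadratic cost
    R = int_0^T (|W|_F^2 + |A|_F^2 + |b|^2) / 2 dt."""
    N = len(W)
    dt = T / N
    x = np.array(x0, dtype=float)
    Q, R = 0.0, 0.0
    for k, (Wk, Ak, bk) in enumerate(zip(W, A, b)):
        xdot = Wk @ np.tanh(Ak.T @ x + bk)           # right-hand side of S
        if k * dt >= t0:                             # tracking only on [t0, T]
            Q += dt * (np.abs(x - xT).sum() + np.abs(xdot).sum())
        R += dt * 0.5 * ((Wk ** 2).sum() + (Ak ** 2).sum() + (bk ** 2).sum())
        x = x + dt * xdot                            # explicit Euler step
    return gamma * Q + R

rng = np.random.default_rng(2)
d, p, N, T, t0, gamma = 3, 4, 100, 1.0, 0.5, 10.0
W, A = 0.1 * rng.standard_normal((2, N, d, p))
b = 0.1 * rng.standard_normal((N, p))
x0, xT = np.zeros(d), np.ones(d)
print(discretized_objective(x0, xT, W, A, b, t0, T, gamma))
```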
Our main result is that the optimal control problem $P(T, \gamma)$ has the finite-time turnpike property, that is, the desired state is already reached in the interior of the time interval $[0, T]$ and remains there until the end of the time interval. The finite-time turnpike property has been studied for example in [10,11] and [12]. In the first two references, the finite-time turnpike property is achieved by the non-smoothness of the objective functional. In this paper, we use a similar approach adapted to the framework of neural ordinary differential equations.
The finite-time turnpike property is an extremal case of the celebrated turnpike property that was originally studied in economics. Turnpike analysis investigates how the solutions of dynamic optimal control problems with a time evolution are related to the solutions of the corresponding static problems, where the time derivatives are set to zero and the initial conditions are dropped. It turns out that, for large time horizons, the solution of the dynamic problem is often very close to the solution of the corresponding static problem on large parts of the time interval. For an overview of the turnpike property, see [13,14,15,16] and the numerous references therein.
In the case of the finite-time turnpike property, after finite time the solution of the dynamic problem coincides with the solution of the static problem. The exponential turnpike property for ResNets and beyond has been studied for example in [17], but not the finite-time turnpike property.

2. The Finite-Time Turnpike Property

The following theorem contains our main result, which states that for a sufficiently large weight $\gamma$ the non-smooth objective functional entails the finite-time turnpike property.
Theorem 1.
For each sufficiently large $\gamma > 0$, each optimal trajectory for $P(T, \gamma)$ satisfies
$$x(t) = x_T, \qquad t \in [t_0, T],$$
that is, $P(T, \gamma)$ has the finite-time turnpike property. For $t \ge t_0$, the optimal parameters satisfy $W(t) = 0$, $A(t) = 0$ and $b(t) = 0$. The optimal parameters remain unchanged if $\gamma$ is further enlarged or if $T$ is further enlarged.
For the proof of Theorem 1 we need a result about the embedding of the Sobolev space $W^{1,1}$ into the space of continuous functions. Let
$$L^1(0, T) = \Big\{ f : [0, T] \to \mathbb{R} : f \text{ is measurable and integrable, i.e., } \int_0^T |f(t)| \, dt < \infty \Big\}.$$
Consider the embedding of the space
$$W^{1,1}(0, T) = \big\{ f \in L^1(0, T) : f' \in L^1(0, T) \big\}$$
into the space of continuous functions.
Lemma 1.
Let $t_0 \in [0, T)$. For all $x \in W^{1,1}(t_0, T)$ we have
$$\max_{t \in [t_0, T]} |x(t)| \le \Big( \frac{1}{T - t_0} + 1 \Big) \int_{t_0}^{T} |x(t)| + |x'(t)| \, dt. \qquad (5)$$
Proof of Lemma 1.
For $t_1, t_2 \in [t_0, T]$ with $t_1 \le t_2$ we have
$$|x(t_1) - x(t_2)| = \Big| \int_{t_1}^{t_2} x'(t) \, dt \Big| \le \int_{t_1}^{t_2} |x'(t)| \, dt.$$
Thus $x$ is continuous on $[t_0, T]$. Hence there exists a point $t_* \in [t_0, T]$ with
$$|x(t_*)| = \min_{t \in [t_0, T]} |x(t)| \le \frac{1}{T - t_0} \int_{t_0}^{T} |x(t)| \, dt.$$
Thus for all $\tau \in [t_0, T]$ the following inequality holds:
$$|x(\tau)| \le |x(t_*)| + |x(t_*) - x(\tau)| \le \frac{1}{T - t_0} \int_{t_0}^{T} |x(t)| \, dt + \int_{t_0}^{T} |x'(t)| \, dt \le \Big( \frac{1}{T - t_0} + 1 \Big) \int_{t_0}^{T} |x(t)| + |x'(t)| \, dt.$$
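As a quick numerical sanity check of the inequality of Lemma 1 (our own illustration), the sketch below evaluates both sides for a sample smooth function on $[t_0, T]$ with a simple Riemann sum.

```python
import numpy as np

# Check max |x| <= (1/(T - t0) + 1) * int_{t0}^{T} (|x(t)| + |x'(t)|) dt
# for the sample function x(t) = sin(3 t) * exp(-t).
t0, T = 0.5, 2.0
t = np.linspace(t0, T, 20001)
dt = t[1] - t[0]
x = np.sin(3 * t) * np.exp(-t)
xp = (3 * np.cos(3 * t) - np.sin(3 * t)) * np.exp(-t)   # exact derivative

lhs = np.max(np.abs(x))
rhs = (1.0 / (T - t0) + 1.0) * np.sum(np.abs(x) + np.abs(xp)) * dt
print(lhs, rhs, lhs <= rhs)   # the bound of Lemma 1 holds for this sample
```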
Now we are prepared for the proof of Theorem 1.
Proof of Theorem 1.
Case 1: If $x_0 = x_T$, the parameters $u^* = (W^*, A^*, b^*) = (0, 0, 0)$ generate the constant state $x(t) = x_T$. Hence $u^* = 0$ solves $P(T, \gamma)$ and the assertion follows.
Case 2: Now we assume that $x_0 \neq x_T$. For $u = (W, A, b) \in X(T)$ define the cost
$$C_{(0, t_0)}(u) = \int_{0}^{t_0} \tfrac{1}{2} \|W(t)\|^2 + \tfrac{1}{2} \|A(t)\|^2 + \tfrac{1}{2} \|b(t)\|^2 \, dt.$$
Define the non-smooth tracking term
$$I_{non}(u) = \int_{t_0}^{T} |x(t) - x_T| + |x'(t)| \, dt.$$
Define the objective functional
$$K_T(u) = C_{(0, t_0)}(u) + \gamma\, I_{non}(u).$$
We consider the auxiliary problem
$$Q(T): \qquad \min_{u \in X(T)} K_T(u).$$
We show by an indirect proof that for each solution $u^*$ of $Q(T)$ we have
$$I_{non}(u^*) = 0.$$
Suppose that there exists a solution $u^* = (W^*, A^*, b^*)$ of $Q(T)$ such that $I_{non}(u^*) > 0$. Then for the corresponding optimal state $x^*$ that is generated by $\mathcal{S}$ we have $x^*(t_0) \neq x_T$; otherwise we could switch off the control at $t_0$ and continue with the zero control $(0, 0, 0)$ for $t \in (t_0, T]$, which generates the constant state $x_T$ on $(t_0, T]$ and strictly improves the performance.
Define the auxiliary state
$$\tilde{x}(t_0) = x_T + \frac{1}{I_{non}(u^*)} \big( x^*(t_0) - x_T \big).$$
The exact controllability of the linearized system implies that we can find a control $\tilde{v} \in L^2(0, t_0)$ that, due to (3), satisfies the inequality
$$\|\tilde{v}\|_{L^2(0, t_0)} \le C_e\, \|\tilde{x}(t_0) - x_T\| = \frac{C_e}{I_{non}(u^*)}\, \|x^*(t_0) - x_T\|$$
and generates the state $\tilde{V}$ with $\tilde{V}(0) = 0$ and $\tilde{V}(t_0) = \tilde{x}(t_0) - x_T$.
Due to (5) we have
$$\|x^*(t_0) - x_T\| \le \Big( \frac{1}{T - t_0} + 1 \Big) \int_{t_0}^{T} |x^*(t) - x_T| + |x^{*\prime}(t)| \, dt = \Big( \frac{1}{T - t_0} + 1 \Big) I_{non}(u^*). \qquad (6)$$
Thus we have
$$\|\tilde{v}\|_{L^2(0, t_0)} \le C_e \Big( \frac{1}{T - t_0} + 1 \Big).$$
For a step size $\varepsilon \in (0, I_{non}(u^*))$ define
$$\lambda = 1 - \frac{\varepsilon}{I_{non}(u^*)} \in (0, 1).$$
Consider the control $u$ with
$$u(t) = u^*(t) - \varepsilon\, \tilde{v}(t)$$
for $t \in (0, t_0]$; for $t \in (t_0, T)$ we define $\varepsilon \tilde{v} = (\delta W, \delta A, \delta b)$ with $\delta W(t) = \frac{\varepsilon}{I_{non}(u^*)} W^*(t)$, $\delta A(t) = \frac{\varepsilon}{I_{non}(u^*)} A^*(t)$ and $\delta b(t) = \frac{\varepsilon}{I_{non}(u^*)} b^*(t)$, so that $u(t) = \lambda\, u^*(t)$ on $(t_0, T)$.
Then, if $\gamma > 0$ is sufficiently large, $-\tilde{v}$ is a descent direction in the sense that a small step from $u^*$ in the direction $-\tilde{v}$ improves the performance. This can be seen as follows.
For the state $x = x^* + \delta x$ that is generated with the solution $\delta x$ of the linearized system with the initial condition $\delta x(0) = 0$ we have at $t_0$
$$x(t_0) - x_T = (x^*(t_0) - x_T) - \varepsilon\, (\tilde{x}(t_0) - x_T) = \Big( 1 - \frac{\varepsilon}{I_{non}(u^*)} \Big) (x^*(t_0) - x_T) = \lambda\, (x^*(t_0) - x_T).$$
Hence on $[t_0, T]$ the state $x = x^* + \delta x$ that is generated with the solution $\delta x$ of the linearized system with the initial condition $\delta x(t_0) = -\frac{\varepsilon}{I_{non}(u^*)} \big( x^*(t_0) - x_T \big)$ is
$$x(t) = x_T + \lambda\, \big( x^*(t) - x_T \big).$$
Thus for the tracking term we have the bound
$$I_{non}(u) = \lambda\, I_{non}(u^*) + O(\|\delta u\|^2) = \Big( 1 - \frac{\varepsilon}{I_{non}(u^*)} \Big) I_{non}(u^*) + O(\|\delta u\|^2).$$
For the control cost we have
$$C_{(0,t_0)}(u) = \big\langle u^* - \varepsilon \tilde{v},\, u^* - \varepsilon \tilde{v} \big\rangle_{L^2(0,t_0)} = C_{(0,t_0)}(u^*) - 2\varepsilon\, \big\langle u^*, \tilde{v} \big\rangle_{L^2(0,t_0)} + \varepsilon^2\, C_{(0,t_0)}(\tilde{v}).$$
Define
$$p(\varepsilon) = K_T(u^* - \varepsilon \tilde{v}) = C_{(0,t_0)}(u^*) - 2\varepsilon\, \big\langle u^*, \tilde{v} \big\rangle_{L^2(0,t_0)} + \varepsilon^2\, C_{(0,t_0)}(\tilde{v}) + \gamma \Big[ \Big( 1 - \frac{\varepsilon}{I_{non}(u^*)} \Big) I_{non}(u^*) + O(\|\delta u\|^2) \Big].$$
Then we have
$$p'(\varepsilon) = -2\, \big\langle u^*, \tilde{v} \big\rangle_{L^2(0,t_0)} + 2\varepsilon\, C_{(0,t_0)}(\tilde{v}) - \gamma + O(\varepsilon).$$
This yields
$$p'(0) = -2\, \big\langle u^*, \tilde{v} \big\rangle_{L^2(0,t_0)} - \gamma.$$
The exact controllability of $\mathcal{S}$ implies that there is a control $u_{exact} \in L^2(0, t_0)$ with (due to (2))
$$\|u_{exact}\|_{L^2(0, t_0)} \le C_e\, \|x_0 - x_T\|$$
that generates the state $V_{exact}$ with $V_{exact}(0) = x_0$ and $V_{exact}(t_0) = x_T$. For $t > t_0$, let $u_{exact}(t) = 0$. Since $u_{exact}$ is feasible for $Q(T)$, this yields the inequality
$$C_{(0, t_0)}(u^*) \le K_T(u^*) \le K_T(u_{exact}) = \|u_{exact}\|_{L^2(0, t_0)}^2 \le C_e^2\, \|x_0 - x_T\|^2.$$
Hence we have
$$\big| \langle u^*, \tilde{v} \rangle_{L^2(0,t_0)} \big| \le C_e\, \|x_0 - x_T\|\, \|\tilde{v}\|_{L^2(0, t_0)} \le \|x_0 - x_T\|\, C_e^2 \Big( \frac{1}{T - t_0} + 1 \Big).$$
Thus if
$$\gamma > 2\, \|x_0 - x_T\|\, C_e^2 \Big( \frac{1}{T - t_0} + 1 \Big),$$
we have $p'(0) \le -\gamma + 2\, \|x_0 - x_T\|\, C_e^2 \big( \tfrac{1}{T - t_0} + 1 \big) < 0$. This implies that for $\varepsilon > 0$ sufficiently small we have
$$K_T(u^* - \varepsilon \tilde{v}) < K_T(u^*),$$
which contradicts the optimality of $u^*$.
Hence for any optimal control $u^*$ of $Q(T)$ we have $I_{non}(u^*) = 0$. With inequality (6) this implies that for the optimal state we have $x^*(t_0) = x_T$.
Now we come back to problem
$$P(T, \gamma): \qquad \min_{u \in X(T)} J(u)$$
with $J$ defined in (4). Let $v_P(T)$ denote the optimal value of $P(T, \gamma)$ and $v_Q(T)$ denote the optimal value of $Q(T)$. Since $K_T(u) \le J(u)$, we have $v_Q(T) \le v_P(T)$.
Moreover, any optimal control $u^*$ for $Q(T)$ is feasible for $P(T, \gamma)$. Since $x^*(t_0) = x_T$, we can assume that the control is switched off on $(t_0, T)$ (this leaves $K_T(u^*)$ unchanged), so that $C_{(t_0, T)}(u^*) = 0$, where $C_{(t_0, T)}$ denotes the control cost over $(t_0, T)$. Hence $v_P(T) \le J(u^*) = K_T(u^*) = v_Q(T)$, and thus $v_P(T) \le v_Q(T)$. Therefore we have
$$v_P(T) = v_Q(T).$$
This implies that parameters that are optimal for $P(T, \gamma)$ are also optimal for $Q(T)$: for any optimal control $u$ of $P(T, \gamma)$ we have $K_T(u) \le J(u) = v_Q(T)$, so $u$ is also optimal for $Q(T)$; thus $I_{non}(u) = 0$, and $J(u) = K_T(u)$ implies $C_{(t_0, T)}(u) = 0$, that is, the optimal parameters vanish on $(t_0, T)$. With Lemma 1 the assertion follows. Thus we have proved Theorem 1. □

3. Existence of Solutions of $P(T, \gamma)$ for Fixed $A$

For the sake of completeness of the analysis, we also state an existence result. However, we can only prove the existence of a solution for the problem where the matrix $A$ is fixed and not an optimization parameter of $P(T, \gamma)$. Thus, for a given matrix-valued function $A(t)$, we consider the problem
$$P(T, \gamma, A): \qquad \min_{(\cdot, A, \cdot) \in X(T)} J(\cdot, A, \cdot).$$
In order to show the existence of a solution of $P(T, \gamma, A)$, we assume that there exists a number $M > 0$ such that for $t \in [0, T]$ almost everywhere we have $\max_{i \in \{1, \ldots, p\}} \|a_i(t)\| \le M$. This is the case if the $a_i$ are elements of the function space $L^\infty(0, T)$, for example if they are step functions over $(0, T)$.
Theorem 2.
Assume that $\sup_{x} |\sigma(x)| \le 1$ and that the Lipschitz constant of $\sigma$ is less than or equal to $1$. Assume that $A(t)$ is given such that we have
$$\operatorname{ess\,sup}_{i \in \{1, \ldots, p\},\, s \in [0, T]} \|a_i(s)\| < \infty.$$
Then for each $T > 0$ and $\gamma > 0$, problem $P(T, \gamma, A)$ has a solution $(W, b)$ such that $(W, A, b) \in X(T)$.
If $A(t) = 0$ for $t \ge t_0$, then for sufficiently large $\gamma$ each solution has the finite-time turnpike property stated in Theorem 1.
The proof of Theorem 2 uses Gronwall’s Lemma (see for example [18]). For the convenience of the reader we state it here:
Lemma 2
(Gronwall's Lemma). Let $L > 0$, $U_0 \ge 0$, $\varepsilon \ge 0$ and an integrable function $U$ on $[0, T]$ be given.
Assume that for $t \in [0, T]$ almost everywhere the integral inequality
$$0 \le U(t) \le U_0 + \int_0^t L\, U(\tau) + \varepsilon \, d\tau$$
holds. Then for $t \in [0, T]$ almost everywhere the function $U$ satisfies the inequality
$$U(t) \le U_0\, e^{L t} + \varepsilon\, \frac{e^{L t} - 1}{L}.$$
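As a small numerical illustration (ours, not from the paper): in the extremal case $U'(t) = L\, U(t) + \varepsilon$, $U(0) = U_0$, the integral inequality holds with equality and the explicit solution coincides with the Gronwall bound. The sketch compares a forward Euler approximation with the bound.

```python
import numpy as np

# Extremal case of Gronwall's Lemma: U' = L*U + eps, U(0) = U0.
# The bound U(t) <= U0*exp(L*t) + eps*(exp(L*t) - 1)/L holds with equality here.
L, U0, eps, T, n = 2.0, 0.5, 0.1, 1.0, 100000
dt = T / n
t = np.linspace(0.0, T, n + 1)
U = np.empty(n + 1)
U[0] = U0
for k in range(n):
    U[k + 1] = U[k] + dt * (L * U[k] + eps)       # forward Euler
bound = U0 * np.exp(L * t) + eps * (np.exp(L * t) - 1.0) / L
print(np.max(U - bound) <= 1e-3)                   # Euler stays (slightly) below the bound
```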
Now we present the proof of Theorem 2.
Proof of Theorem 2.
Consider a minimizing sequence $(u_n)_{n=1}^{\infty}$ with $u_n = (W_n, A, b_n) \in X(T)$ for all $n \in \{1, 2, 3, \ldots\}$. Define the norm
$$\|u_n\|_{X(T)} = \Big( \int_0^T \|W_n(t)\|^2 + \|A(t)\|^2 + \|b_n(t)\|^2 \, dt \Big)^{1/2}$$
and the corresponding inner product that gives a Hilbert space structure to $X(T)$. Due to the definition of $J$, there exists a number $M > 0$ such that for all $n \in \{1, 2, \ldots\}$ we have
$$\|u_n\|_{X(T)} \le M,$$
that is, the sequence is bounded in $X(T)$.
Hence there exists a weakly converging subsequence with a limit
$$u^* = (W^*, A, b^*) \in X(T).$$
Let $x^*$ denote the state generated by $u^*$. For the states $x_n$ generated by the $u_n$ as solutions of $\mathcal{S}$, the bound on $\|u_n\|_{X(T)}$ and the bound $\sup_x |\sigma(x)| \le 1$ imply that, by increasing $M$ if necessary, we can assume that we have
$$\sup_{s \in [0, T],\, n \in \{1, 2, 3, \ldots\}} \|x_n(s)\| \le M.$$
Due to Mazur's Lemma (see for example [19,20]), there exists a sequence of convex combinations that converges strongly. To be precise, there exist convex combinations
$$v_k = \sum_{m=k}^{N(k)} \lambda_m^{(k)} u_m, \qquad \text{with } \lambda_m^{(k)} \ge 0,\ k \le m \le N(k),\ \text{and } \sum_{m=k}^{N(k)} \lambda_m^{(k)} = 1,$$
such that
$$\lim_{k \to \infty} \|v_k - u^*\|_{X(T)} = 0.$$
This implies
$$\lim_{k \to \infty} \int_0^T \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} W_m(t) - W^*(t) \Big\| + \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} b_m(t) - b^*(t) \Big\| \, dt = 0.$$
Since $\sigma$ is Lipschitz continuous with a Lipschitz constant that is less than or equal to $1$, we have for $i \in \{1, \ldots, p\}$
$$\Big| \sigma\Big( \sum_{m=k}^{N(k)} \lambda_m^{(k)} \big[ a_i(t)^\top x_m(t) + (b_i)_m(t) \big] \Big) - \sigma\big( a_i(t)^\top x^*(t) + (b_i)^*(t) \big) \Big| \le \Big| \sum_{m=k}^{N(k)} \lambda_m^{(k)} \big[ a_i(t)^\top x_m(t) + (b_i)_m(t) \big] - \big( a_i(t)^\top x^*(t) + (b_i)^*(t) \big) \Big|.$$
Thus for $t \in [0, T]$ almost everywhere we have
$$\Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} x_m(t) - x^*(t) \Big\| \le \sum_{i=1}^{p} \int_0^t \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} (w_i)_m(s) - (w_i)^*(s) \Big\| \, \Big| \sigma\big( a_i(s)^\top x^*(s) + (b_i)^*(s) \big) \Big|$$
$$+ \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} (w_i)_m(s) \Big\| \, \Big| \sigma\Big( \sum_{m=k}^{N(k)} \lambda_m^{(k)} \big[ a_i(s)^\top x_m(s) + (b_i)_m(s) \big] \Big) - \sigma\big( a_i(s)^\top x^*(s) + (b_i)^*(s) \big) \Big| \, ds.$$
Then the fact that $\sup_x |\sigma(x)| \le 1$, the Cauchy-Schwarz inequality, the bound $\|u_n\|_{X(T)} \le M$ and the Lipschitz estimate above yield
$$\Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} x_m(t) - x^*(t) \Big\| \le \sum_{i=1}^{p} \int_0^t \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} (w_i)_m(s) - (w_i)^*(s) \Big\| \, ds$$
$$+ \sqrt{ \int_0^t \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} (w_i)_m(s) \Big\|^2 ds }\; \sqrt{ \int_0^t \Big| \sigma\Big( \sum_{m=k}^{N(k)} \lambda_m^{(k)} \big[ a_i(s)^\top x_m(s) + (b_i)_m(s) \big] \Big) - \sigma\big( a_i(s)^\top x^*(s) + (b_i)^*(s) \big) \Big|^2 ds }$$
$$\le \sum_{i=1}^{p} \int_0^t \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} (w_i)_m(s) - (w_i)^*(s) \Big\| \, ds + M \sqrt{ \int_0^t \Big| \sum_{m=k}^{N(k)} \lambda_m^{(k)} \big[ a_i(s)^\top x_m(s) + (b_i)_m(s) \big] - \big( a_i(s)^\top x^*(s) + (b_i)^*(s) \big) \Big|^2 ds }$$
$$\le \sum_{i=1}^{p} \int_0^t \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} (w_i)_m(s) - (w_i)^*(s) \Big\| \, ds + M \sqrt{ \int_0^t \Big| a_i(s)^\top \Big( \sum_{m=k}^{N(k)} \lambda_m^{(k)} x_m(s) - x^*(s) \Big) \Big|^2 ds } + M \sqrt{ \int_0^t \Big| \sum_{m=k}^{N(k)} \lambda_m^{(k)} (b_i)_m(s) - (b_i)^*(s) \Big|^2 ds }.$$
Due to Mazur's Lemma, this yields the existence of a sequence $(\varepsilon_k)_k$ with $\varepsilon_k \ge 0$ and $\lim_{k \to \infty} \varepsilon_k = 0$ such that
$$\Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} x_m(t) - x^*(t) \Big\| \le \varepsilon_k + \sum_{i=1}^{p} M \sqrt{ \int_0^t \operatorname{ess\,sup}_{\tau \in (0, T)} \|a_i(\tau)\|^2 \; \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} x_m(s) - x^*(s) \Big\|^2 \, ds }.$$
Thus, by increasing the value of $M$ if necessary, we obtain for $t \in [0, T]$ almost everywhere
$$\Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} x_m(t) - x^*(t) \Big\| \le \varepsilon_k + \sum_{i=1}^{p} M \sqrt{ \int_0^t M^2\, \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} x_m(s) - x^*(s) \Big\|^2 \, ds }.$$
Since $(|u| + |v|)^2 \le 2|u|^2 + 2|v|^2$ and
$$\Big( \sum_{i=1}^{p} |z_i| \Big)^2 \le p \sum_{i=1}^{p} |z_i|^2,$$
this yields the integral inequality
$$\Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} x_m(t) - x^*(t) \Big\|^2 \le 2\, \varepsilon_k^2 + 2\, p\, M^4 \sum_{i=1}^{p} \int_0^t \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} x_m(s) - x^*(s) \Big\|^2 \, ds.$$
Now Gronwall's Lemma yields for $t \in [0, T]$ almost everywhere
$$\Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} x_m(t) - x^*(t) \Big\| = O(\varepsilon_k).$$
This yields
$$\lim_{k \to \infty} \max_{t \in [0, T]} \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} x_m(t) - x^*(t) \Big\| = 0.$$
For the time derivatives we obtain, again by increasing the value of $M$ if necessary,
$$\int_0^T \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} x_m'(t) - x^{*\prime}(t) \Big\| \, dt$$
$$\le \sum_{i=1}^{p} \int_0^T \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} (w_i)_m(t) - (w_i)^*(t) \Big\| \, \Big| \sigma\big( a_i(t)^\top x^*(t) + (b_i)^*(t) \big) \Big|$$
$$+ \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} (w_i)_m(t) \Big\| \, \Big| \sigma\Big( \sum_{m=k}^{N(k)} \lambda_m^{(k)} \big[ a_i(t)^\top x_m(t) + (b_i)_m(t) \big] \Big) - \sigma\big( a_i(t)^\top x^*(t) + (b_i)^*(t) \big) \Big| \, dt$$
$$\le \sum_{i=1}^{p} \int_0^T \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} (w_i)_m(t) - (w_i)^*(t) \Big\| \, dt$$
$$+ \sqrt{ \int_0^T \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} (w_i)_m(t) \Big\|^2 dt } \; \sqrt{ \int_0^T \Big| \sum_{m=k}^{N(k)} \lambda_m^{(k)} \big[ a_i(t)^\top x_m(t) + (b_i)_m(t) \big] - \big( a_i(t)^\top x^*(t) + (b_i)^*(t) \big) \Big|^2 dt }$$
$$\le \varepsilon_k + \sum_{i=1}^{p} M \sqrt{ \int_0^T \Big| a_i(t)^\top \Big( \sum_{m=k}^{N(k)} \lambda_m^{(k)} x_m(t) - x^*(t) \Big) \Big|^2 dt } + M \sqrt{ \int_0^T \Big| \sum_{m=k}^{N(k)} \lambda_m^{(k)} (b_i)_m(t) - (b_i)^*(t) \Big|^2 dt }$$
$$\le \varepsilon_k\, (1 + M) + M \sum_{i=1}^{p} \sqrt{ \int_0^T \|a_i(t)\|^2\, \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} x_m(t) - x^*(t) \Big\|^2 \, dt }$$
$$\le \varepsilon_k\, (1 + M) + M \sum_{i=1}^{p} \operatorname{ess\,sup}_{s \in [0, T]} \|a_i(s)\| \sqrt{ \int_0^T \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)} x_m(t) - x^*(t) \Big\|^2 \, dt }$$
$$\le \varepsilon_k\, (1 + M) + M \sum_{i=1}^{p} \operatorname{ess\,sup}_{s \in [0, T]} \|a_i(s)\|\; \varepsilon_k = O(\varepsilon_k).$$
Thus we have
$$\liminf_{k \to \infty} Q(v_k) \ge Q(u^*), \qquad \liminf_{k \to \infty} R(v_k) \ge R(u^*).$$
This yields
$$\liminf_{k \to \infty} J(u_k) \ge J(u^*).$$
Hence $u^*$ is a solution of $P(T, \gamma, A)$. This shows that solutions of $P(T, \gamma, A)$ exist.
The exact controllability properties that have been used for the construction in the proof of Theorem 1 still hold if the matrix A is fixed. Hence the assertion follows. □

4. Discussion

We have shown that with a suitable non-smooth loss function each solution of a learning problem has the finite-time turnpike property, which means that the optimal state reaches the desired state exactly after finite time. Since the finite time $t_0$ can be considered as a problem parameter, this allows us to choose $t_0$ in a convenient way. Thus $t_0$ arises as an additional design parameter for neural network architectures that corresponds to the number of layers. Since for $t \in [t_0, T]$ the optimal parameters are zero, system $\mathcal{S}$ does not change the state on $[t_0, T]$, and thus the time horizon can be cut off at $t_0$.
Hence the problem of finding the optimal number of layers in a neural network corresponds, in the setting of neural ODEs, to a problem of time-optimal control where the task is to find a minimal value of $t_0$ subject to the constraint that $x(t_0) = x_T$ and that the optimal parameters $u$ satisfy $\tfrac{1}{2} \|u\|_{X(t_0)}^2 \le \rho$. Here the number $\rho$ is prescribed as a problem parameter. Let $\omega(T, \gamma)$ denote the optimal value of $P(T, \gamma)$. Then for optimal parameters $u$ that solve $P(T, \gamma)$ we have $\tfrac{1}{2} \|u\|_{X(t_0)}^2 \le \omega(T, \gamma)$. Since Theorem 1 implies that for the optimal state we have $x(t_0) = x_T$, we conclude that optimal parameters for $P(T, \gamma)$ also solve the time-optimal control problem with parameter $\rho = \omega(T, \gamma)$, and the optimal time is $t_0$.
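To illustrate the correspondence between $t_0$ and the number of layers (a sketch with our own hypothetical setup): if the trained layer parameters vanish for $t \ge t_0$, the Euler-discretized network can be truncated after the first $\lceil t_0 / \Delta t \rceil$ layers without changing its output.

```python
import math
import numpy as np

def forward(x0, W, A, b, dt):
    """Explicit Euler forward pass through all layers (as in Section 1)."""
    x = np.array(x0, dtype=float)
    for Wk, Ak, bk in zip(W, A, b):
        x = x + dt * Wk @ np.tanh(Ak.T @ x + bk)
    return x

rng = np.random.default_rng(3)
d, p, N, T, t0 = 3, 4, 80, 1.0, 0.4
dt = T / N
W = 0.1 * rng.standard_normal((N, d, p))
A = 0.1 * rng.standard_normal((N, d, p))
b = 0.1 * rng.standard_normal((N, p))

# Emulate the finite-time turnpike situation: parameters vanish for t >= t0.
n_layers = math.ceil(t0 / dt)
W[n_layers:], A[n_layers:], b[n_layers:] = 0.0, 0.0, 0.0

x0 = np.ones(d)
full = forward(x0, W, A, b, dt)
truncated = forward(x0, W[:n_layers], A[:n_layers], b[:n_layers], dt)
print(np.allclose(full, truncated))   # True: layers after t0 do not change the state
```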
We have shown the existence of a solution of the nonlinear optimization problem for the case that one of the parameters, namely the matrix $A(t)$, is fixed. In order to show that a solution also exists with $A$ as an additional optimization parameter, we expect that an additional regularization term in the objective functional (for example $\int_0^T \|A'(t)\|^2 \, dt$) is necessary. This is a topic for future research. We expect that the finite-time turnpike property also holds in the case $t_0 = 0$. However, the proof that is presented here does not apply to this case, so this is another topic for future research. As a possible application of our results we have in mind the numerical solution of shape inverse problems as described in [21].

Funding

This research was funded by the DFG in the framework of the Collaborative Research Centre CRC/Transregio 154, Mathematical Modelling, Simulation and Optimization Using the Example of Gas Networks (project C03 and project C05, Projektnummer 239904186), and by the Bundesministerium für Bildung und Forschung (BMBF) and the Croatian Ministry of Science and Education under DAAD grant 57654073 'Uncertain data in control of PDE systems'.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Marion, P. Generalization bounds for neural ordinary differential equations and deep residual networks. Advances in Neural Information Processing Systems 2024, 36.
  2. Dupont, E.; Doucet, A.; Teh, Y.W. Augmented Neural ODEs. Advances in Neural Information Processing Systems; Wallach, H.; Larochelle, H.; Beygelzimer, A.; d’Alché-Buc, F.; Fox, E.; Garnett, R., Eds. Curran Associates, Inc., 2019, Vol. 32.
  3. Thorpe, M.; van Gennip, Y. Deep limits of residual neural networks. Research in the Mathematical Sciences 2023, 10, 6.
  4. Álvarez López, A.; Slimane, A.H.; Zuazua, E. Interplay between depth and width for interpolation in neural ODEs, 2024, [arXiv:math.OC/2401.09902].
  5. Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems 1989, 2, 303–314.
  6. Pinkus, A. Approximation theory of the MLP model in neural networks. Acta numerica 1999, 8, 143–195.
  7. Schäfer, A.M.; Zimmermann, H.G. Recurrent neural networks are universal approximators. Artificial Neural Networks–ICANN 2006: 16th International Conference, Athens, Greece, September 10-14, 2006. Proceedings, Part I 16. Springer, 2006, pp. 632–640.
  8. Schäfer, A.M.; Udluft, S.; Zimmermann, H.G. Learning long term dependencies with recurrent neural networks. Artificial Neural Networks–ICANN 2006: 16th International Conference, Athens, Greece, September 10-14, 2006. Proceedings, Part I 16. Springer, 2006, pp. 71–80.
  9. Schaefer, A.M.; Udluft, S.; Zimmermann, H.G. Learning long-term dependencies with recurrent neural networks. Neurocomputing 2008, 71, 2481–2488.
  10. Gugat, M.; Schuster, M.; Zuazua, E. The finite-time turnpike phenomenon for optimal control problems: Stabilization by non-smooth tracking terms. Stabilization of distributed parameter systems: design methods and applications. Springer, 2021, pp. 17–41.
  11. Gugat, M.; Schuster, M. Optimal Neumann control of the wave equation with $L^1$-control cost: the finite-time turnpike property. Optimization 2024, pp. 1–28.
  12. Gugat, M. Optimal boundary control of the wave equation: The finite-time turnpike phenomenon. Mathematical Reports 2022.
  13. Zaslavski, A.J. Turnpike Phenomenon in Metric Spaces; Vol. 201, Springer Nature, 2023.
  14. Grüne, L.; Faulwasser, T. Turnpike properties in optimal control: An overview of discrete-time and continuous-time results. Handbook of Numerical Analysis; Trelat, E.; Zuazua, E., Eds., 2022.
  15. Grüne, L.; Guglielmi, R. Turnpike properties and strict dissipativity for discrete time linear quadratic optimal control problems. SIAM J. Control Optim. 2018, 56, 1282–1302. doi:10.1137/17M112350X.
  16. Trélat, E.; Zuazua, E. The turnpike property in finite-dimensional nonlinear optimal control. Journal of Differential Equations 2015, 258, 81–114.
  17. Geshkovski, B.; Zuazua, E. Turnpike in optimal control of PDEs, ResNets, and beyond. Acta Numerica 2022, 31, 135–263.
  18. Gugat, M. Optimal boundary control and boundary stabilization of hyperbolic systems; Birkhäuser, 2015.
  19. Ciarlet, P.G. Mathematical elasticity: Three-dimensional elasticity; SIAM, 2021.
  20. Heuser, H.G. Functional analysis. Transl. by John Horvath. A Wiley-Interscience Publication. Chichester etc.: John Wiley & Sons, 1982.
  21. Jackowska-Strumillo, L.; Sokolowski, J.; Żochowski, A.; Henrot, A. On numerical solution of shape inverse problems. Computational Optimization and Applications 2002, 23, 231–255.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.