1. Introduction
We consider a system that is governed by a neural ODE that can be considered as a continuous-time ResNet. The system
is defined as follows:
(see for example [
1,
2]). For
we have
. The
are the columns of the matrix
. We have
and the
are the columns of the matrix
. The bias vector
is in
and has the components
.
The motivation to study
is that a time-discrete version can be considered as a residual neural networks (ResNets) that has been very used in many applications, see [
3] for example in image registration and classification problems. A time-discrete version can be obtained for example by an explicit Euler discretization of
.
The activation function
is assumed to be Lipschitz continuous with a Lipschitz constant that is less than or equal to 1 and differentiable, for example
. It acts on vectors componentwise.
For a given time horizon
, we study an optimal control problem on the time interval
, where the desired state at
T is prescribed by the terminal condition
, where
denotes the given desired output of the system. Let
be given. For the training of the system, we study the loss function with a tracking term
with the non–smooth norm
.
We define the control cost (regularization term)
Here (
) denotes the Frobenius norm of
. We introduce the space
Lemma 10 in [
4] states that system
S is exactly controllable, that is the terminal condition
can be satisfied for all
. To be precise, for all
there exists a constant
such that for all
we can find a control
such that for the state
that is generated by
S with the initial condition
we have
and
Also the linearized system is exactly controllable in the sense that for all
there exists a constant
such that for all
we can find a control
such that for the state
that is generated by the linearized system that is stated below with the initial condition
we have
and
The linearized system at a given
for the variation
of the state that is generated by a variation
of the control is
with the initial condition
.
A universal approximation theorem for the corresponding time-discrete case with recurrent neural networks can be found in the seminal paper [
5] by Cybenko, see also [
6,
7,
8,
9].
For a parameter
define
We study the minimization (training) problem
Our main result is that the optimal control problem
has the finite-time turnpike property, that is the desired state is already reached in the interior of the time-interval
and remains there until the end of the time interval. The finite-time turnpike property has been studied for example in [
10,
11] and [
12]. In the first two references, the finite time-turnpike property is achieved by the non-smoothness of the objective functional. In this paper, we use a similar approach adapted to the framework of neural ordinary differential equations.
The finite-time turnpike property is an extremal case of the celebrated turnpike property that has originally been studied in economics. The turnpike analysis investigates how the solutions of dynamic optimal control problems with a time evolution are related to the solutions of the corresponding static problems where the time-derivatives are set to zero and the initial conditions are cancelled. It turns out that often for large time horizons on large parts of the time interval the solution of the dynamic problems is very close to the solution of the corresponding static problem. For an overview about the turnpike property, see [
13,
14,
15,
16] and the numerous references therein.
In the case of the finite-time turnpike property, after finite time the solution of the dynamic problem coincides with the solution of the static problem. The exponential turnpike property for ResNets and beyond has been studied for example in [
17], but not the finite-time turnpike property.
2. The Finite-Time Turnpike Property
The following Theorem contains our main result, which states that the control cost entails the finite-time turnpike property.
Theorem 1.
For each sufficiently large each optimal trajectory for satisfies
that is has the finite-time turnpike property. For for the optimal parameters we have , and . The optimal parameters remain unchanged if γ is further enlarged or if T is further enlarged.
For the proof of Theorem 1 we need a result about the embedding of the continuous functions in the Sobolev space
: Let
Consider the embedding of the space of continuous functions in the space
Lemma 1.
Let . For all we have
Proof of Lemma 1. For
,
we have
Thus
x is continuous on
. Hence there exists a point
with
Thus for all
the following inequality holds:
□
Now we are prepared for the proof of Theorem 1.
Proof of Theorem 1.
Case 1: If , the parameters generate the constant state . Hence solves and the assertion follows.
Case 2: Now we assume that
. For
define the cost
Define the non-smooth tracking term
Define the objective functional
We consider the auxiliary problem
We show that for solution
of
we have
by an indirect proof. Suppose that there exists a solution
of
such that
. Then for the corresponding optimal state
that is generated by
we have
; otherwise we could switch off the control at
and continue with the zero control
for
that generates the constant state
on
to strictly improve the performance.
Define the auxiliary state
The exact controllability of the linearized system implies that we can find a control
that due to (
3) satisfies the inequality
that generates the state
with
and
.
Due to (
5) we have
Thus we have
For a step-size
define
Consider the control
u with
for
and for
we defined
with
,
,
.
Then if is sufficiently large, is a descent direction in the sense that by a little step in the direction we can improve the performance of the control . This can be seen as follows.
For the state
that is generated with the solution
of the linearized system with the initial condition
we have at
Hence on
the state
that is generated with the solution
of the linearized system with the initial condition
is
Thus for the tracking term we have the bound
For the control cost we have
Then we have
This yields
The exact controllability of
S implies that there is a control
with (due to (
2))
that generates the state
with
and
. For
, let
. Since
is feasible for
, this yields the inequality
Hence we have
Thus if
we have
. This implies that for
sufficiently small we have
which is a contradiction to the optimality of
.
Hence for any optimal control of
we have
. With inequality (
6) this implies that for the optimal state we have
.
Now we come back to problem
with
J defined in (
4). Let
denote the optimal value of
and
denote the optimal value of
. Since
, we have
Moreover, any optimal control
for
is feasible for
. Since
, we have
. Hence
, and thus
Therefore we have
This implies that parameters that are optimal for are also optimal for and the assertion follows. Thus we have proved Theorem 1. □
3. Existence of Solutions of for Fixed A
For the sake of completeness of the analysis, we also state an existence result. However we can only prove the existence of a solution for the problem where the matrix
A is fixed an not an optimization parameter for
. Thus for a given matrix-valued function
, we consider the problem
In order to show the existence of a solution of , we assume that there exists a number such that for almost everywhere we have . This is the case if the are elements of the function space , for example if they are step functions over .
Theorem 2.
Assume that and the Lipschitz constant of σ is less than or equal to 1. Assume that is given such that we have
Then each and , problem has a solution such that in .
If for , for sufficiently large γ each solution has the finite-time turnpike property stated in Theorem 1.
The proof of Theorem 2 uses Gronwall’s Lemma (see for example [
18]). For the convenience of the reader we state it here:
Lemma 2 (Gronwall’s Lemma). Let , , and an integrable function U on be given.
Assume that for almost everywhere the integral inequality
hold. Then for almost everywhere the function U satisfies the inequality
Now we present the proof of Theorem 2.
Proof of Theorem 2. Consider a minimizing sequence
with
for all
. Define the norm
and the corresponding inner product that gives a Hilbert space structure to
. Due to the definition of
J, there exists a number
such that for all
we have
that is the sequence is bounded in
.
Hence there exists a weakly-converging subsequence with a limit
Let
denote the state generated by
. For the states
generated by the
as a solution of
due to the definition of the tracking term
R we can assume by increasing
M if necessary that we have
Due to Mazur’s Lemma (see for example [
19,
20]), there exists a subsequence of convex combinations that converges strongly. To be precise, there exist convex combinations
such that
This implies
Since
is Lipschitz continuous with a Lipschitz constant that is less than or equal to 1, this implies for
Thus for
almost everywhere we have
Then the fact that
, the Cauchy-Schwarz inequality , (
7) and (
8) yield
Due to Mazur’s Lemma, this yields the existence of a sequence
with
and
such that
Thus by increasing the value of
M if necessary, we obtain for
almost everywhere
Since
and
this yields the integral inequality
Now Gronwall’s Lemma yields for
almost everywhere
This yields
For the time derivatives we obtain again by increasing the value of
M if necessary
This yields
Hence
is a solution of
. This shows that solution of
exist.
The exact controllability properties that have been used for the construction in the proof of Theorem 1 still hold if the matrix A is fixed. Hence the assertion follows. □
4. Discussion
We have shown that with a suitable non-smooth loss function each solution of a learning problem has the finite-time turnpike property which means that it reaches the desired state exactly after finite time. Since the finite time can be considered as a problem parameter, this situation allows to choose in a convenient way. Thus arises as an additional design parameter in the design of optimal neural networks, that corresponds to the number of layers. Since for the optimal parameters are zero, System does not change the state on and thus the time horizon can be cut off at .
Hence the problem to find the optimal number of layers in a neural network corresponds in the setting of neural ODEs to the problem of time-optimal control where the task is to find a minimal value of subject to the constraint that and for the optimal parameters the constraint is satisfied. Here the number is prescribed as a problem parameter. Let denote the optimal value of . Then for optimal parameters that solve we have . Since Theorem 1 implies that for the optimal state we have , we conclude that optimal parameters for also solve the time-optimal control problem with parameter and the optimal time is .
We have shown the existence of a solution of the nonlinear optimization problem for the case that one of the parameters, namely the matrix
is fixed. In order to show that a solution also exists with
A as an additional optimization parameter, we expect that an additional regularization term in the objective functional (for example
is necessary. This is a topic for future research. We expect that the finite-time turnpike property also holds in the case
. However, the proof that is presented here does not apply to this case so this is another topic for future research. As a possible application of our results we have in mind the numerical solution of shape inverse problems as described in [
21].
Funding
This research was funded by DFG in the framework of the Collaborative Research Centre CRC/Transregio 154, Mathematical Modelling, Simulation and Optimization Using the Example of Gas Networks, project C03 and project C05, Projektnummer 239904186 and the Bundesministerium für Bildung und Forschung (BMBF) and the Croatian Ministry of Science and Education under DAAD grant 57654073 ’Uncertain data in control of PDE systems’.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Marion, P. Generalization bounds for neural ordinary differential equations and deep residual networks. Advances in Neural Information Processing Systems 2024, 36.
- Dupont, E.; Doucet, A.; Teh, Y.W. Augmented Neural ODEs. Advances in Neural Information Processing Systems; Wallach, H.; Larochelle, H.; Beygelzimer, A.; d’Alché-Buc, F.; Fox, E.; Garnett, R., Eds. Curran Associates, Inc., 2019, Vol. 32.
- Thorpe, M.; van Gennip, Y. Deep limits of residual neural networks. Research in the Mathematical Sciences 2023, 10, 6.
- Ãlvarez López, A.; Slimane, A.H.; Zuazua, E. Interplay between depth and width for interpolation in neural ODEs, 2024, [arXiv:math.OC/2401.09902].
- Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems 1989, 2, 303–314.
- Pinkus, A. Approximation theory of the MLP model in neural networks. Acta numerica 1999, 8, 143–195.
- Schäfer, A.M.; Zimmermann, H.G. Recurrent neural networks are universal approximators. Artificial Neural Networks–ICANN 2006: 16th International Conference, Athens, Greece, September 10-14, 2006. Proceedings, Part I 16. Springer, 2006, pp. 632–640.
- Schäfer, A.M.; Udluft, S.; Zimmermann, H.G. Learning long term dependencies with recurrent neural networks. Artificial Neural Networks–ICANN 2006: 16th International Conference, Athens, Greece, September 10-14, 2006. Proceedings, Part I 16. Springer, 2006, pp. 71–80.
- Schaefer, A.M.; Udluft, S.; Zimmermann, H.G. Learning long-term dependencies with recurrent neural networks. Neurocomputing 2008, 71, 2481–2488.
- Gugat, M.; Schuster, M.; Zuazua, E. The finite-time turnpike phenomenon for optimal control problems: Stabilization by non-smooth tracking terms. Stabilization of distributed parameter systems: design methods and applications. Springer, 2021, pp. 17–41.
- Gugat, M.; Schuster, M. Optimal Neumann control of the wave equation with L 1-control cost: the finite-time turnpike property. Optimization 2024, pp. 1–28.
- Gugat, M. Optimal boundary control of the wave equation: The finite-time turnpike phenomenon. Mathematical Reports 2022.
- Zaslavski, A.J. Turnpike Phenomenon in Metric Spaces; Vol. 201, Springer Nature, 2023.
- Grüne, L.; Faulwasser, T. Turnpike properties in optimal control: An overview of discrete-time and continuous-time results. Handbook of Numerical Analysis; Trelat, E.; Zuazua, E., Eds., 2022. [CrossRef]
- Grüne, L.; Guglielmi, R. Turnpike properties and strict dissipativity for discrete time linear quadratic optimal control problems. SIAM J. Control Optim. 2018, 56, 1282–1302. doi:10.1137/17M112350X. [CrossRef]
- Trélat, E.; Zuazua, E. The turnpike property in finite-dimensional nonlinear optimal control. Journal of Differential Equations 2015, 258, 81–114.
- Geshkovski, B.; Zuazua, E. Turnpike in optimal control of PDEs, ResNets, and beyond. Acta Numerica 2022, 31, 135–263.
- Gugat, M. Optimal boundary control and boundary stabilization of hyperbolic systems; Birkhäuser, 2015.
- Ciarlet, P.G. Mathematical elasticity: Three-dimensional elasticity; SIAM, 2021.
- Heuser, H.G. Functional analysis. Transl. by John Horvath. A Wiley-Interscience Publication. Chichester etc.: John Wiley & Sons, 1982.
- Jackowska-Strumillo, L.; Sokolowski, J.; Żochowski, A.; Henrot, A. On numerical solution of shape inverse problems. Computational Optimization and Applications 2002, 23, 231–255.
|
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).