
Online Inverse Optimal Control for Time-Varying Cost Weights

Submitted: 19 December 2023; Posted: 20 December 2023
Abstract
Inverse optimal control is a method for recovering the cost function used in an optimal control problem from expert demonstrations. Most studies on inverse optimal control have focused on building the unknown cost function through the linear combination of given features with unknown cost weights, which are generally assumed to be constant. However, in many real-world applications, the cost weights may vary over time. In this study, we propose an adaptive online inverse optimal control approach based on a neural-network approximation to address the challenge of recovering time-varying cost weights. We conduct a well-posedness analysis of the problem and suggest a condition for the adaptive goal, under which the weights of the neural network generated to achieve this adaptive goal are unique to the corresponding inverse optimal control problem. Furthermore, we propose an updating law for the weights of the neural network to ensure the stability of the convergence of the solutions. Finally, simulation results for an example linear system are presented to demonstrate the effectiveness of the proposed strategy. The proposed method is applicable to a wide range of problems requiring real-time inverse optimal control calculations.
Subject: Engineering - Control and Systems Engineering

1. Introduction

The integration of biological principles with robotic technology heralds a new era of innovation, with a significant focus on applying optimal control and optimization methods to analyze animal motion. This approach guides robotic movement development, evident in the study [1], which explores the intricate control systems in mammalian locomotion. Such research underpins the development of robots that emulate the efficiency and adaptability found in nature.
These advancements in understanding animal locomotion through optimal control methods set the stage for the relevance of inverse optimal control (IOC). IOC offers a retrospective analysis of expert movements—human or animal—to infer underlying cost functions optimized in these motions. This methodology is crucial when direct modeling of optimal strategies is complex or unknown.
The use of IOC to identify suitable cost functions from the observable control inputs and state trajectories of experts is becoming increasingly important, and several successful applications of IOC in estimating the cost weights of multiple features have been reported. The knowledge and expertise of specialists can be categorized and exploited in several fields, including robot control and autonomous driving. The authors of [2] employed game theory to tailor robot–human interactions, proposing a method for estimating the human cost function and selecting the robot's cost function accordingly, leading to a Nash equilibrium in human–robot interactions. [3] applied IOC to analyze taxi drivers' route choices. To investigate the cost combination of human motion, [4] conducted an experiment using IOC techniques to study human motion during the performance of a goal-achieving task using one arm. Additionally, [5] represented the learning of biological behavior as an inverse linear quadratic regulator (LQR) problem and proposed adaptive methods for modeling and analyzing human reach-to-grasp behavior. Furthermore, [6] employed an IOC method to segment human movement.
Linear quadratic regulation is a common optimal control method for linear systems, and in the 1960s and 1970s numerous researchers offered solutions to the inverse LQR problem ([7,8,9]). More recently, the theory of linear matrix inequalities was employed to solve the inverse LQR problem ([5,10,11]). Regarding the application of IOC methods to nonlinear systems, several approaches based on passivity conditions ([12]) or robust design ([13]) have been reported.
Feature-based IOC methods, which model the cost function as a linear combination of various feature functions with unknown weights, have gained considerable attention in recent years ([14,15,16,17]). However, it may be difficult to apply these methods with simple feature functions to the analysis of complex, long-term behaviors, e.g., human jumping ([18]). To address this challenge, [19] proposed a technique for recovering phase-dependent weights that switch at unknown phase-transition points. This method employs a moving window along the observed trajectory to identify the phase-transition points, with the window length determined by a recovery matrix aimed at minimizing the number of observations required for successful cost-weight recovery. Although this method is effective in estimating phase-dependent cost weights, its computational requirements limit its use in real-time applications, such as human–robot collaboration tasks. Additionally, the cost weights in each phase are assumed to be fixed, which may not generalize; the human jump motion in [18], for example, was analyzed using time-varying, continuous cost weights.
Overall, IOC still has several shortcomings, particularly for approximating complex, multi-phase, continuous cost functions in real time. In this paper, we propose a method for recovering the time-varying cost weights in the IOC problem for linear continuous systems using neural networks. Our approach involves constructing an auxiliary estimation system that closely approximates the behavior of the original system, followed by determining the necessary conditions for tuning the weights of the neurons in the neural network to obtain a unique solution for the IOC problem. We demonstrate that the unique solution corresponds to achieving zero error between the original system state and the auxiliary estimated system state, as well as zero error between the original costate and the integral of the estimated costate. Based on this analysis, we develop two neural-network frameworks: one for approximating the cost-weight function and the other for addressing the error introduced by the auxiliary estimation system. Additionally, we discuss the requirements on the feature functions that ensure the well-posedness of our online IOC method. Finally, we validate the effectiveness of our method through simulations.
This work makes several significant contributions:
  • We provide a solution for the recovery of time-varying cost weights, essential for analyzing real-world animal or human motion.
  • Our method operates online, suitable for a broad spectrum of real-time calculation problems. This contrasts with previous online IOC methods that mainly focused on constant cost weights for discrete system control.
  • We introduce a neural network and state observer-based framework for online verification and refinement of estimated cost weights. This innovation addresses the critical need for solution uniqueness and robustness against data noise in IOC applications.

2. Problem Formulation

2.1. System description and problem statement

Consider an object's system dynamics formulated as

$$\dot{x} = Ax + Bu \qquad (1)$$

where $A \in \mathbb{R}^{n \times n}$ and $B \in \mathbb{R}^{n \times m}$ are two time-invariant matrices, $x \in \mathbb{R}^n$ represents the system states, $u = [u_1, \ldots, u_m]^T \in \mathbb{R}^m$ denotes the control input of the system, and $x_0$ denotes the initial state of the system.
The classic optimal control problem is to design the optimal control input $u^*(t)$ that minimizes the following cost function subject to the dynamics (1), generating a sequence of optimal states $x^*(t)$ (the superscript $*$ denotes optimality):

$$V(x,t) = \int_t^{t_f} L_0(x, u, \tau)\, d\tau \qquad (2)$$

Here, $L_0$ has the following form:

$$L_0 = q^T F(x) + r^T G(u) \qquad (3)$$

where $q = [q_1, q_2, \ldots, q_{n_f}]^T \in \mathbb{R}^{n_f}$ and $r = [r_1, r_2, \ldots, r_m]^T \in \mathbb{R}^m$ with $r_i > 0$ represent the cost weight vectors, $F(x)$ is referred to as the general union feature vector with respect to $x$, and $G(u)$ indicates the feature vector that is only relevant to the control input $u$. $n_f$ represents the number of features, which may differ from the dimension of the system states. For simplicity, we assume that $r^T G(u) = u^T R u$, where $R = \mathrm{diag}(r_1, \ldots, r_m)$ is an unknown matrix. Additionally, it is assumed that $(A, B)$ is controllable, $B$ is a full column rank matrix, and $A$ and $B$ are bounded such that $||A|| \le \delta_A$, $||B|| \le \delta_B$.
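To fix ideas, a minimal Python sketch of the running cost (3) is given below; the specific quadratic features and the numerical values are our illustrative assumptions, not choices made in this paper.

import numpy as np

def running_cost(x, u, q, R, features):
    """Evaluate L0 = q^T F(x) + u^T R u at one time instant, cf. (3)."""
    F = np.array([f(x) for f in features])  # general union feature vector F(x)
    return q @ F + u @ (R @ u)

# Illustrative (assumed) quadratic features for a two-state system
features = [lambda x: x[0] ** 2, lambda x: x[1] ** 2]
q = np.array([1.5, 2.0])   # cost weights on the state features (assumed values)
R = np.diag([1.0, 1.0])    # R = diag(r_1, ..., r_m), r_i > 0
print(running_cost(np.array([0.1, -0.2]), np.array([0.3, 0.0]), q, R, features))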

2.2. Maximum principle in forward optimal control

To minimize the cost function (2) with $L_0$ defined in (3), there exists a costate variable vector $\lambda$ that satisfies Pontryagin's maximum principle as follows:

$$\dot{\lambda} = -\bar{F}_x^T q - A^T \lambda \qquad (4)$$

$$Ru + B^T\lambda = 0 \qquad (5)$$

where $\bar{F}_x = \partial F(x)/\partial x$ and $\lambda \in \mathbb{R}^n$ denotes the costate variable vector (Lagrange multiplier). The initial value of $\lambda$ is denoted by $\lambda_0$.
The optimal control input $u^*$ of the system expressed by (1) is given as

$$u^* = -R^{-1}B^T\lambda \qquad (6)$$

where $\lambda$ is unknown. Thus, using this optimal control input, we have

$$\dot{x} = Ax - H\lambda \qquad (7)$$

where $H = BR^{-1}B^T$. Notably, given that $B$ is a full column rank matrix, $H$ is invertible. In addition, since $B$ is a bounded constant matrix, there exists a positive scalar $\delta_H$ such that $||H|| \le \delta_H$.
Additionally, the time derivative of the system dynamics can be formulated as follows:

$$\ddot{x} = A\dot{x} - H\dot{\lambda} \qquad (8)$$
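For intuition, the closed-loop Hamiltonian system (4)-(7) can be integrated forward once an initial costate is fixed. The following Python sketch uses Euler integration with assumed system matrices, an assumed quadratic feature vector, and an arbitrary initial costate; none of these numerical values come from this paper.

import numpy as np

# Assumed illustrative problem data (not from the paper)
A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.eye(2)
R = np.eye(2)
H = B @ np.linalg.inv(R) @ B.T            # H = B R^{-1} B^T
q = np.array([1.0, 2.0])                  # constant cost weights for this sketch

def Fbar_x(x):                            # dF/dx for F(x) = [x1^2, x2^2] (assumed)
    return np.diag(2.0 * x)

x = np.array([1.0, 0.0])
lam = np.array([0.5, 0.2])                # lambda_0 fixed arbitrarily for illustration
dt = 1e-3
for _ in range(1000):
    dx = A @ x - H @ lam                  # (7)
    dlam = -Fbar_x(x).T @ q - A.T @ lam   # (4)
    x, lam = x + dt * dx, lam + dt * dlam
u = -np.linalg.inv(R) @ B.T @ lam         # (6): optimal input along the trajectory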

2.3. Analysis of the IOC problem

We assume that the system states $x_{[t,t_f]}$ and the control input $u_{[t,t_f]}$, which represent the time series of the system states and control inputs from time point $t$ to $t_f$, constitute the solution that optimally minimizes the cost function (2). In addition, we assume that the optimal system states and control input satisfy the boundary conditions $||x|| \le \delta_x$, $||u|| \le \delta_u$, $||\dot{u}|| \le \delta_{\dot{u}}$.
The objective of the IOC problem is to recover the unknown cost weight vector $q(t)$. IOC may be employed, for example, to analyze how different circumstances change the relative importance of certain feature functions in human motion. Such applications require a rigorous analysis of whether the recovered cost weights can recreate the original data $x_{[t,t_f]}$, $u_{[t,t_f]}$. To begin, we consider two problems:
  • What happens when a different feature function is selected?
In previous studies, the cost weight vector $q$ was assumed to be either a constant ([14]) or a step function with multiple phases ([19]). These assumptions have been effective for recovering the cost weights of optimal control methods in robot motion control, such as analyzing the motion of a robot controlled by an LQR approach. However, it may be inappropriate to assume that the cost weights are constants or step functions when analyzing the complex behaviors of natural objects, such as human motion. In particular, deciding which feature functions to adopt when evaluating the motion of natural objects can pose a challenge.
Proposition 1. 
Depending on the different selections of feature functions F ( x ) for the IOC, the original constant cost weight q may become a time-varying continuous function.
Proof. 
From (7), for the object's original feature function, we have

$$H^{-1}(\ddot{x} - A\dot{x} + HA^TH^{-1}Bu) = \bar{F}_{ox}^T q_o \qquad (9)$$

where $q_o$ denotes the original time-invariant cost weight vector and $\bar{F}_{ox}$ denotes the partial derivative with respect to $x$ of the original feature function. When we choose a different feature function $F_n(x)$, the above equation becomes

$$H^{-1}(\ddot{x} - A\dot{x} + HA^TH^{-1}Bu) = \bar{F}_{nx}^T q_n \qquad (10)$$

where $\bar{F}_{nx}$ denotes the partial derivative with respect to $x$ of the newly selected feature function and $q_n$ is the corresponding cost weight vector on $\bar{F}_{nx}$. Thus, we have

$$\bar{F}_{ox}^T q_o = \bar{F}_{nx}^T q_n, \qquad t_0 \le t \le t_f$$
From this equation, it follows that $q_n$ may be a time-varying function when $\bar{F}_{ox}$ and $\bar{F}_{nx}$ are not equivalent; and since $\bar{F}_{ox}$ and $\bar{F}_{nx}$ are continuous functions, we can reasonably conclude that $q_n$ is also a continuous function. □
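As a concrete scalar illustration (our example, not from the original text): let the original feature be $F_o(x) = x^2$ with constant weight $q_o$, and let the newly selected feature be $F_n(x) = x^4$. Then $\bar{F}_{ox}^T q_o = \bar{F}_{nx}^T q_n$ reads $2x\,q_o = 4x^3 q_n$, so $q_n(t) = q_o/(2x(t)^2)$, which varies continuously along any non-constant trajectory $x(t)$.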
Based on this proposition, it is crucial to expand the definition of cost weights to include time-varying values, as this will facilitate a more accurate analysis of the motion of increasingly complex natural objects. Despite the need for time-varying cost weight recovery in many applications, it has received minimal research attention thus far.
  • Whether the given set $x_{[t,t_f]}$, $u_{[t,t_f]}$ in the IOC problem admits a unique solution $\{q(t), r\}$.
The uniqueness of the solution to the IOC problem when the cost weights are constant has been discussed in many studies. In this work, we examine whether there is still a unique solution to the IOC problem when $q$ is a time-varying function.
From (9), we can find different continuous functions $q(t)$ such that the equation is satisfied for different values of $R$ (and hence different values of $H$). This implies that if $q$ is considered a time-varying function, the set $\{q(t), r\}$ will not have a unique solution.
Therefore, when we consider the unique solution of the IOC problem with a time-varying function $q(t)$, it is necessary to introduce additional conditions to ensure that the IOC problem has a unique solution and that the resulting solution is meaningful.
In this study, for simplicity, we assume that $R = I$ ([20,21]), where $I$ is the identity matrix. In an actual optimal control cost function, when we emphasize reducing one of the control inputs $u_i$, the convergence of the system state $x_i$ related to $u_i$ is also affected. Consequently, the final control result shows that the change in each state of the system is influenced not only by the chosen cost weights $q(t)$ but also by $R$. In the IOC problem, setting $R = I$ allows the effect of different weights on different control inputs in the original system to be reflected in the current estimate of $q(t)$. This enables us to view the estimated weights on the system states as representing the relative importance of each state in the system's dynamic evolution, without separately accounting for the impact of the control input weights.
Based on our conclusion that q may be time-varying when different feature functions are chosen and on the corresponding conditions under which a unique solution exists, we can define the IOC problem to be solved in this study as follows:
[The formal definition of the online IOC problem appears as an image in the original preprint (Preprints 93826 i001).]

3. Adaptive Observer-Based Neural Network Approximation of Time-Varying Cost Weights

In this study, we estimate time-varying cost weight functions online using an observer-based adaptive neural network estimation approach, as opposed to earlier studies that required a large number of time series of x and u to recover fixed cost weights offline.

3.1. Construction of the observer

Following the introduction of $\hat{q}(t) \in \mathbb{R}^{n_f}$ denoting the estimation of $q(t)$, we define the estimation of the associated costate variable $\hat{\lambda}$ as follows:

$$\dot{\hat{\lambda}} = -\bar{F}_{\hat{x}}^T \hat{q}(t) - A^T \hat{\lambda} \qquad (11)$$

where $\bar{F}_{\hat{x}} = \partial F(\hat{x})/\partial \hat{x}$ denotes the partial derivative of the feature functions, which is only relevant to the estimated system states $\hat{x}$ obtained by inserting $\hat{\lambda}$ into (7):

$$\dot{\hat{x}} = A\hat{x} - H\hat{\lambda} \qquad (12)$$

where the initial state $\hat{x}_0$ of this system is selected to be $\hat{x}_0 = x_0$. Thus, compared with the original system, the error generated by the new estimation system can be expressed as

$$\dot{\tilde{x}} = A\tilde{x} - H\tilde{\lambda}, \qquad \dot{\tilde{\lambda}} = -\bar{F}_x^T q(t) + \bar{F}_{\hat{x}}^T \hat{q}(t) - A^T\tilde{\lambda} \qquad (13)$$

where $\tilde{\lambda} = \lambda - \hat{\lambda}$ and $\tilde{x} = x - \hat{x}$. Here, the feature function is selected such that its partial derivative with respect to $x$ is bounded, and it is assumed that $||\bar{F}_x|| \le \delta_{nx}$, $||\bar{F}_{\hat{x}}|| \le \delta_{n\hat{x}}$, and $||\bar{F}_x(x) - \bar{F}_{\hat{x}}(\hat{x})|| \le \zeta ||\tilde{x}||$, where $\delta_{nx}$, $\delta_{n\hat{x}}$, and $\zeta$ denote positive scalars.
Additionally, the time derivative of (13) can be expressed as

$$\ddot{\tilde{x}} = A\dot{\tilde{x}} - H\dot{\tilde{\lambda}} \qquad (14)$$

Thus, the following equation is satisfied:

$$\dot{s} = A_r s + T_x\tilde{q} + (T_x - T_{\hat{x}})\hat{q} \qquad (15)$$

where $s = \begin{bmatrix}\dot{\tilde{x}} \\ \tilde{\lambda}\end{bmatrix}$, $A_r = \begin{bmatrix} A & HA^T \\ 0 & -A^T\end{bmatrix}$, $T_x = \begin{bmatrix} H\bar{F}_x^T \\ -\bar{F}_x^T\end{bmatrix}$, $T_{\hat{x}} = \begin{bmatrix} H\bar{F}_{\hat{x}}^T \\ -\bar{F}_{\hat{x}}^T\end{bmatrix}$, and $\tilde{q} = q - \hat{q}$ denotes the error of estimating $q$. Here, $||\bar{F}_x(x) - \bar{F}_{\hat{x}}(\hat{x})|| \le \zeta||\tilde{x}||$ implies that there exists a positive scalar $\zeta$ such that $||T_x - T_{\hat{x}}|| \le \zeta||\tilde{x}||$ holds. Based on the bounds of $\bar{F}_x(x)$, $\bar{F}_{\hat{x}}(\hat{x})$, and $H$, it follows that there are two positive scalars $\delta_{tx}$ and $\delta_{t\hat{x}}$ such that the following inequalities hold: $||T_x|| \le \delta_{tx}$ and $||T_{\hat{x}}|| \le \delta_{t\hat{x}}$.
Moreover, from (6) and (7), $\lambda$ can be calculated as follows:

$$\lambda = -H^{-1}Bu \qquad (16)$$
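A minimal Python sketch of the auxiliary estimation system (11)-(12), with the costate recovered online from the measured input via (16), is given below; the Euler discretization and the function names are implementation assumptions, not part of the original text.

import numpy as np

def observer_step(x_hat, lam_hat, q_hat, A, H, Fbar, dt):
    """One Euler step of the auxiliary estimation system (11)-(12)."""
    dlam_hat = -Fbar(x_hat).T @ q_hat - A.T @ lam_hat  # (11)
    dx_hat = A @ x_hat - H @ lam_hat                   # (12)
    return x_hat + dt * dx_hat, lam_hat + dt * dlam_hat

def costate_from_input(u, B, H):
    """(16): lambda = -H^{-1} B u, computed from the measured control input."""
    return -np.linalg.solve(H, B @ u)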

4. Neural Network-Based Approximation of Time-Varying Cost Weights

In this section, a neural network-based cost weight approximation algorithm is proposed. To calculate an approximation of the time-varying vector $q$, we adopt a neural network whose input is chosen as $u_I = \begin{bmatrix} x_0 \\ u \end{bmatrix}$. Based on this, we assume that a time-invariant weight matrix $W \in \mathbb{R}^{l \times n_f}$ exists that satisfies the following expression:

$$q = W^T\phi(u_I) + \epsilon_1(u_I) \qquad (17)$$

where $\phi(u_I) \in \mathbb{R}^l$ denotes the activation function vector and $\epsilon_1(u_I)$ denotes the structural approximation error of the neural network. In addition, the activation function is selected such that it and its partial derivative satisfy the following boundary conditions: $||\phi(u_I)|| \le \delta_p$ and $||\partial\phi(u_I)/\partial u_I|| \le \delta_{pu}$, where $\delta_p$ and $\delta_{pu}$ represent two positive scalars. Additionally, $||\epsilon_1(u_I)|| \le \epsilon_{n1}$, where $\epsilon_{n1}$ is a positive scalar.
The estimate of the vector $q$ is constructed as follows:

$$\hat{q} = \hat{W}^T\phi(u_I) \qquad (18)$$

where $\hat{W}$ denotes the estimation of $W$.
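Concretely, (18) with the Gaussian activations later used in (49) can be implemented as in the following sketch; the center placement and the width parameter are assumptions of ours.

import numpy as np

class RBFWeightEstimator:
    """q_hat = W_hat^T phi(u_I), cf. (18), with Gaussian activations, cf. (49)."""
    def __init__(self, centers, nu, n_f):
        self.centers = np.asarray(centers)   # psi_i, shape (l, dim(u_I)); assumed
        self.nu = nu                         # width parameter; assumed
        self.W_hat = np.zeros((len(centers), n_f))

    def phi(self, u_I):
        d = self.centers - u_I               # Gaussian RBF activations
        return np.exp(-np.sum(d * d, axis=1) / self.nu)

    def q_hat(self, u_I):
        return self.W_hat.T @ self.phi(u_I)  # (18)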
Therefore, the error of estimating $q$ can be expressed as

$$\tilde{q} = q - \hat{q} = \tilde{W}^T\phi(u_I) + \epsilon_1(u_I) \qquad (19)$$

where $\tilde{W} = W - \hat{W}$ denotes the error of estimating $W$. By substituting $\tilde{q}$ into (15), we have

$$\dot{s} = A_r s + T_x\tilde{W}^T\phi(u_I) + (T_x - T_{\hat{x}})\hat{q} + T_x\epsilon_1(u_I) \qquad (20)$$
To better understand the necessary conditions for the convergence of the estimation error $\tilde{W}$, we define uniform ultimate boundedness (UUB) below.
Definition 1.
A time-varying signal $\sigma(t)$ is said to be UUB if there exists a compact set $S \subset \mathbb{R}^n$ such that, for every $\sigma(t_0) \in S$, there exist a bound $\mu \ge 0$ and a time $T$ such that $||\sigma|| \le \mu$ for all $t \ge t_0 + T$.
Lemma 1.
If the following conditions are satisfied, $\tilde{W}$ becomes UUB.
  • $\int_{t_0}^{t_i} s\,dt$ and $s$ become UUB after a time point $t_1$ ($||\int_{t_0}^{t_i} s\,dt|| \le \delta_1$ and $||s|| \le \delta_2$);
  • the change in $\hat{W}$ approaches zero;
  • the matrix $C$ defined below is a full row rank matrix:

$$C = \begin{bmatrix} \int_{t_1+1}^{t_1+2} T_x (I \otimes \phi(u_I))^T dt \\ \vdots \\ \int_{t_i-1}^{t_i} T_x (I \otimes \phi(u_I))^T dt \end{bmatrix} \qquad (21)$$

where $t_1 \le t_i \le t_f$ and every block of $C$ satisfies the persistent excitation (PE) condition defined below:

$$\left\|\int_{t_j}^{t_j+1} T_x (I \otimes \phi(u_I))^T dt\right\| \ge \beta_j, \qquad t_1 \le t_j \le t_i \qquad (22)$$

Here, $\beta_j$ is a positive value.
Proof. 
From (20),

$$s = A_r\int_{t_0}^{t_i} s\,dt + \int_{t_0}^{t_i} T_x\tilde{W}^T\phi(u_I)\,dt + \int_{t_0}^{t_i} (T_x - T_{\hat{x}})\hat{q}\,dt + \int_{t_0}^{t_i} T_x\epsilon_1(u_I)\,dt \qquad (23)$$

Since $\int_{t_0}^{t_i} s\,dt$ and $s$ reach steady-state (UUB) values and $A_r$ is a constant matrix, we can obtain the following:

$$\left\|s - A_r\int_{t_0}^{t_i} s\,dt\right\| \le \delta_{si} \qquad (24)$$

where $\delta_{si}$ denotes a small positive scalar. Additionally, since both $\epsilon_1(u_I)$ and $T_x$ are bounded, we have

$$\left\|\int_{t_0}^{t_i} T_x\epsilon_1(u_I)\,dt\right\| \le \delta_{T\epsilon} \qquad (25)$$

where $\delta_{T\epsilon}$ denotes a small positive scalar. The term $\int_{t_0}^{t_i} T_x\epsilon_1(u_I)\,dt$ captures the effect of the structural error of the neural network on the state $s$. Since $T_x$ is bounded, when the neural network approximates the cost weight function adequately, the value of $\epsilon_1(u_I)$ decreases, which in turn reduces the overall integral. In other words, a well-selected neural network structure with a good approximation of the cost weight function produces a small structural error and therefore a small overall integral value.
From (23), (24), and (25), we have

$$\left\|\int_{t_0}^{t_i} T_x\tilde{W}^T\phi(u_I)\,dt + \int_{t_0}^{t_i} (T_x - T_{\hat{x}})\hat{q}\,dt\right\| \le \delta_{si} + \delta_{T\epsilon} \qquad (26)$$

Similarly, we can obtain an analogous relation for the duration $[t_0, t_1]$:

$$\left\|\int_{t_0}^{t_1} T_x\tilde{W}^T\phi(u_I)\,dt + \int_{t_0}^{t_1} (T_x - T_{\hat{x}})\hat{q}\,dt\right\| \le \delta_{si} + \delta_{T\epsilon} \qquad (27)$$

Based on (26) and (27), we have

$$\left\|\int_{t_1+1}^{t_i} T_x\tilde{W}^T\phi(u_I)\,dt + \int_{t_1+1}^{t_i} (T_x - T_{\hat{x}})\hat{q}\,dt\right\| \le 2(\delta_{si} + \delta_{T\epsilon}) \qquad (28)$$

Furthermore, considering that $\int_{t_0}^{t_i} s\,dt \to 0$ after $t_1$, the definition of $s$, and $||T_x - T_{\hat{x}}|| \le \zeta||\tilde{x}||$, we have

$$\left\|\int_{t_1+1}^{t_i} (T_x - T_{\hat{x}})\hat{q}\,dt\right\| \le \int_{t_1+1}^{t_i} ||T_x - T_{\hat{x}}||\,||\hat{q}||\,dt \le \int_{t_1+1}^{t_i} \zeta\delta_{\tilde{x}}\delta_{\hat{q}}\,dt \triangleq \delta_{\zeta(t_i - t_1 - 1)} \qquad (29)$$

where $\delta_{\tilde{x}}$ and $\delta_{\hat{q}}$ represent the bounds of $\tilde{x}$ and $\hat{q}$, respectively. Thus, we have

$$\left\|\int_{t_1+1}^{t_i} T_x\tilde{W}^T\phi(u_I)\,dt\right\| \le 2(\delta_{si} + \delta_{T\epsilon}) + \delta_{\zeta(t_i - t_1 - 1)} \qquad (30)$$

In this case, when $\dot{\hat{W}}$ approaches zero, we have

$$\left\|\int_{t_1+1}^{t_i} T_x (I \otimes \phi(u_I))^T \mathrm{vec}(\tilde{W})\,dt\right\| = \left\|\int_{t_1+1}^{t_i} T_x (I \otimes \phi(u_I))^T dt\;\mathrm{vec}(\tilde{W})\right\| \le 2(\delta_{si} + \delta_{T\epsilon}) + \delta_{\zeta(t_i - t_1 - 1)} \qquad (31)$$

Based on this relation, we have

$$\left\|\int_{t_1+1}^{t_1+2} T_x (I \otimes \phi(u_I))^T dt\;\mathrm{vec}(\tilde{W})\right\| \le 2(\delta_{si} + \delta_{T\epsilon}) + \delta_{\zeta(1)} \qquad (32)$$

where $\delta_{\zeta(1)} = \int_{t_1+1}^{t_1+2} \zeta\delta_{\tilde{x}}\delta_{\hat{q}}\,dt = \cdots = \int_{t_i-1}^{t_i} \zeta\delta_{\tilde{x}}\delta_{\hat{q}}\,dt$.
Thus, we have

$$||C\,\mathrm{vec}(\tilde{W})|| \le (t_i - t_1 - 1)\big(2(\delta_{si} + \delta_{T\epsilon}) + \delta_{\zeta(1)}\big) \qquad (33)$$

where $C$ is defined in (21). Since $C$ has full row rank, we have

$$||\mathrm{vec}(\tilde{W})|| \le ||C^+||\,||C\,\mathrm{vec}(\tilde{W})|| \le ||C^+||(t_i - t_1 - 1)\big(2(\delta_{si} + \delta_{T\epsilon}) + \delta_{\zeta(1)}\big) \qquad (34)$$

From (22), we have $||C^+|| \le 1/\sqrt{(t_i - t_1 - 1)\beta_j^2}$, and therefore

$$||\mathrm{vec}(\tilde{W})|| \le \sqrt{\frac{t_i - t_1 - 1}{\beta_j^2}}\,\big(2(\delta_{si} + \delta_{T\epsilon}) + \delta_{\zeta(1)}\big) \qquad (35)$$

Thus, $\tilde{W}$ is UUB. □
Notably, $\beta_j$ is a lower bound on the norm of $\int_{t_j}^{t_j+1} T_x (I \otimes \phi(u_I))^T dt$; it can increase when the data $x$ cause the norm of the integral to deviate significantly from zero. The sizes of $\delta_{\zeta(1)}$ and $\delta_{si}$ are related to the minimization of $s$ and $\int_{t_0}^{t_i} s\,dt$, and the size of $\delta_{T\epsilon}$ is related to the approximation ability of the chosen neural network. The bound of $\tilde{W}$ after $t_1$ can thus be minimized by sufficiently exciting $x$, successfully minimizing $s$ and $\int_{t_0}^{t_i} s\,dt$, and appropriately designing the structure of the neural network.
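In practice, the full-row-rank condition on $C$ in (21) can be checked numerically by accumulating the windowed integrals from sampled data. The sketch below is one way to do so; the rectangle-rule accumulation and the rank tolerance are our assumptions.

import numpy as np

def build_C(Tx_samples, phi_samples, dt, window=1.0):
    """Stack integrals of T_x (I kron phi^T) over consecutive windows, cf. (21)."""
    blocks, acc, elapsed = [], None, 0.0
    for Tx, phi in zip(Tx_samples, phi_samples):
        term = Tx @ np.kron(np.eye(Tx.shape[1]), phi.reshape(1, -1))
        acc = term * dt if acc is None else acc + term * dt
        elapsed += dt
        if elapsed >= window:              # close one unit-length window
            blocks.append(acc)
            acc, elapsed = None, 0.0
    return np.vstack(blocks)

def has_full_row_rank(C, tol=1e-8):        # tolerance is an assumption
    return np.linalg.matrix_rank(C, tol) == C.shape[0]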

4.1. Construction of the neural network

As shown in Lemma 1, the convergence of $\int_{t_0}^t s\,d\tau$ is essential for the convergence of $\tilde{W}$ to zero. Therefore, it is necessary to incorporate this consideration into the approximation design.
First, we divide the estimated neural network weight matrix into two parts:

$$\hat{W} = \hat{W}_1 + \hat{W}_2 \qquad (36)$$

and

$$\hat{q} = \hat{q}_1 + \hat{q}_2 = (\hat{W}_1 + \hat{W}_2)^T\phi(u_I) \qquad (37)$$

where $\hat{q}_1 = \hat{W}_1^T\phi(u_I)$ and $\hat{q}_2 = \hat{W}_2^T\phi(u_I)$.
The state equation describing the error dynamics can then be written as

$$\dot{s} = A_r s + T_x\tilde{q}_1 + (T_x - T_{\hat{x}})\hat{q}_1 - T_{\hat{x}}\hat{q}_2 \qquad (38)$$

where $s$, $A_r$, $T_x$, and $T_{\hat{x}}$ are as defined in (15).
Further, to effectively minimize $\int_{t_0}^t s\,d\tau$, we define the vector $e$ as follows:

$$e = (T_x - T_{\hat{x}})\hat{q}_1 + Ks + K_p\int_{t_0}^t s\,d\tau - T_{\hat{x}}\hat{q}_2 + A_r s \qquad (39)$$

where $K = \mathrm{diag}([k, \ldots, k]) \in \mathbb{R}^{2n\times 2n}$ and $K_p = \mathrm{diag}([k_p, \ldots, k_p]) \in \mathbb{R}^{2n\times 2n}$, and $k$ and $k_p$ are two positive scalars. Thus, (38) can be written as

$$\dot{s} = -Ks - K_p\int_{t_0}^t s\,d\tau + T_x\tilde{q}_1 + e \qquad (40)$$

We suppose that an ideal time-invariant weight matrix $W_2 \in \mathbb{R}^{l\times n_f}$ exists that guarantees

$$(T_x - T_{\hat{x}})\hat{q}_1 + Ks + A_r s + K_p\int_{t_0}^t s\,d\tau = T_{\hat{x}}q = T_{\hat{x}}(W_2^T\phi(u_I) + \epsilon_2(u_I)) \qquad (41)$$

where $u_I = \begin{bmatrix} x_0 \\ u \end{bmatrix}$ as before and $\epsilon_2(u_I)$ denotes the corresponding structural approximation error, assumed bounded as $||\epsilon_2(u_I)|| \le \epsilon_{n2}$.
The estimation errors of the two neural networks can be represented as

$$\tilde{q}_1 \triangleq q - \hat{q}_1 = \tilde{W}_1^T\phi(u_I) + \epsilon_1(u_I), \qquad \tilde{q}_2 \triangleq q - \hat{q}_2 = \tilde{W}_2^T\phi(u_I) + \epsilon_2(u_I) \qquad (42)$$

where $\tilde{W}_1 = W - \hat{W}_1$ and $\tilde{W}_2 = W_2 - \hat{W}_2$, and $e$ can be represented as

$$e = T_{\hat{x}}(\tilde{W}_2^T\phi(u_I) + \epsilon_2(u_I)) \qquad (43)$$

Therefore, (40) becomes

$$\dot{s} = -Ks - K_p\int_{t_0}^t s\,d\tau + T_x(\tilde{W}_1^T\phi(u_I) + \epsilon_1(u_I)) + T_{\hat{x}}(\tilde{W}_2^T\phi(u_I) + \epsilon_2(u_I)) \qquad (44)$$

4.2. Tuning law of the neural network for the estimation of $q(t)$

Based on the error dynamics derived in (44), an updating law for the neural network that estimates $q(t)$ is given in Theorem 1.
Theorem 1.
If we choose the updating laws for the neural network weights $\hat{W}_1$ and $\hat{W}_2$ as shown in (45), where $\Gamma_1$, $\Gamma_2$, and $k_e$ are positive scalar constants, then the state $s$, $\int_{t_0}^t s\,d\tau$, and the error $e$ will be UUB:

$$\dot{\hat{W}}_1 = \Gamma_1\phi(u_I)s^T T_x, \qquad \dot{\hat{W}}_2 = \Gamma_2\phi(u_I)(s + k_e e)^T T_{\hat{x}} \qquad (45)$$

In addition, if there exist positive constants $t_\delta$, $\beta_1$, $\beta_2$, $\beta_3$, and $\beta_4$ such that the inequalities in (46) are satisfied for all initial times $t_0$, then the signals $\tilde{W}_1$ and $\tilde{W}_2$ will also be UUB:

$$\beta_2 I \ge \int_{t_0}^{t_0+t_\delta} C_{p1}(t)^T C_{p1}(t)\,dt \ge \beta_1 I, \qquad \beta_4 I \ge \int_{t_0}^{t_0+t_\delta} C_{p2}(t)^T C_{p2}(t)\,dt \ge \beta_3 I \qquad (46)$$

Here, $C_{p1}(t) = T_x(I \otimes \phi(u_I)^T)$ and $C_{p2}(t) = T_{\hat{x}}(I \otimes \phi(u_I)^T)$.
Proof. 
A proof of this theorem can be found in the appendix. □
Applying (45) results in $s$, $\int_{t_0}^t s\,d\tau$, and $e$ being UUB, as shown in Theorem 1. Additionally, (45) shows that when $s$ and $e$ decrease, $\dot{\hat{W}}_1$ and $\dot{\hat{W}}_2$ decrease as well, resulting in a decrease in $\dot{\hat{W}} = \dot{\hat{W}}_1 + \dot{\hat{W}}_2$. At this point, as stated in Lemma 1, if the condition that the matrix $C$ (defined in Lemma 1) has full row rank is satisfied, then $\tilde{W} = \tilde{W}_1 + \tilde{W}_2$ will also be UUB. Thus, the solution to the IOC problem can be derived by applying (37). The sketch below illustrates one discretized step of this scheme.
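The following Python sketch shows one Euler-discretized step of (45) together with the auxiliary signal $e$ from (39); the variable names and the explicit bookkeeping of $\int_{t_0}^t s\,d\tau$ are implementation assumptions.

import numpy as np

def error_signal(s, int_s, q1_hat, q2_hat, Tx, Tx_hat, Ar, K, Kp):
    """Auxiliary signal e from (39)."""
    return (Tx - Tx_hat) @ q1_hat + K @ s + Kp @ int_s - Tx_hat @ q2_hat + Ar @ s

def update_weights(W1, W2, phi, s, e, Tx, Tx_hat, gam1, gam2, k_e, dt):
    """One Euler step of the tuning laws (45) for W_hat_1 and W_hat_2."""
    W1 = W1 + dt * gam1 * np.outer(phi, s @ Tx)
    W2 = W2 + dt * gam2 * np.outer(phi, (s + k_e * e) @ Tx_hat)
    return W1, W2

At every step, the current estimate of the cost weights is then read out as $\hat{q} = (\hat{W}_1 + \hat{W}_2)^T\phi(u_I)$, per (37).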

5. Simulations

5.1. Basic simulation conditions

To verify the effectiveness of our method, we performed simulations using a sample linear system controlled by the optimal control method, with the original cost weight matrix $R$ selected in two cases.
The sample linear system dynamics can be formulated as follows:

$$\dot{\theta} = A\theta + B\tau \qquad (47)$$

where $\theta = [\theta_1, \theta_2]^T \in \mathbb{R}^2$ represents the system states and $\tau \in \mathbb{R}^2$ denotes the control input. We select $A = \begin{bmatrix} 30 & 80 \\ 60 & 0 \end{bmatrix}$ and $B = \begin{bmatrix} 2 & 0 \\ 0 & 4 \end{bmatrix}$.
The cost function selected in these simulations is formulated as

$$V_r = \frac{1}{2}\int_0^{t_f} \left(\theta^T Q\theta + \tau^T R\tau\right) dt \qquad (48)$$

where all elements of $\theta$ satisfy $|\theta_i| \le \theta_{rl}$, $Q(t) = \mathrm{diag}(q_1, q_2)$ contains the continuous time-varying cost weights on the system states $\theta$, and $R = \mathrm{diag}(r_1, r_2)$ represents the cost weights on the control inputs.
Moreover, in our simulations, we select 0 as the initial value of all the elements of both $\hat{W}_1$ and $\hat{W}_2$. The activation function $\phi(u_I)$ was selected as $\phi(u_I) = [\phi_1(u_I), \ldots, \phi_i(u_I), \ldots, \phi_l(u_I)]^T$, with $\phi_i(u_I)$ designed as

$$\phi_i(u_I) = \exp\left(-\frac{(u_I - \psi_i)^T(u_I - \psi_i)}{\nu}\right) \qquad (49)$$

where $\psi_i$ denotes the center of the $i$-th activation function and $\nu$ denotes a positive scalar.
Two cases are considered in the simulation:
(1) In the first case, we apply optimal control to the sample system with time-varying cost weights on $\theta$ ($q_1(t) = 1 + \cos(t)$ and $q_2(t) = 2 + \sin(t)$). The proposed IOC method is employed online to estimate the cost weights, with simultaneous online recovery of the original system trajectory. Parameters $\Gamma_1$ and $\Gamma_2$ in the updating law are set to $\Gamma_1 = \Gamma_2 = 1$. Parameters $k$ and $k_p$ are set to $k = 50$ and $k_p = 625$, respectively. The initial values of $\hat{W}_1$ and $\hat{W}_2$ are set to matrices with all elements equal to zero. The original $r_1$ and $r_2$ are set to $r_1 = r_2 = 1$. The simulation uses 49 nodes in the neural network.
(2) In the second case, we perform the simulation of our IOC method, but with the original r 1 and r 2 set to r 1 = 3 and r 2 = 4 , respectively. All other simulation settings are the same as in the first case.
Similar to the simulation sections of previous works ([19] and [6]), we use the control input taken directly from the simulation, ignoring the measurement issues and errors that may affect the control input in real-world applications. This allows us to evaluate purely the performance of our method in solving the IOC problem. In actual applications, the control input can be calculated by substituting the measured $\dot{\theta}$ into (47), as described in [19].
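For reproducibility, the first case can be set up along the following lines; the RBF center placement, the width, and the input-space ranges are our assumptions (the paper reports only that 49 nodes are used).

import numpy as np

A = np.array([[30.0, 80.0], [60.0, 0.0]])   # system matrix of (47) as printed
B = np.array([[2.0, 0.0], [0.0, 4.0]])
k, k_p = 50.0, 625.0                        # K = k I, Kp = k_p I
gam1 = gam2 = 1.0                           # Gamma_1 = Gamma_2 = 1

def q_true(t):                              # case 1 time-varying weights
    return np.array([1.0 + np.cos(t), 2.0 + np.sin(t)])

rng = np.random.default_rng(0)
dim_uI = 4                                  # u_I stacks x_0 and u (assumed dimension)
centers = rng.uniform(-1.0, 1.0, (49, dim_uI))  # 49 nodes; placement assumed
nu = 0.5                                    # RBF width; assumed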

5.2. Results

The simulation results are shown in the figures below.
In Figure 1, the blue solid line represents the original variation in the cost weights whereas the gray solid line represents the estimated cost weights. After a brief period of oscillation at the initial time, our method accurately recovers the original cost weights when R = I . Notably, similar to the case in other adaptive control methods and adaptive neural network based control methods, the initial oscillation is a result of the adaptive initialization of the weights in (45) due to the large initial errors in W ˜ 1 and W ˜ 2 .
Figure 2 demonstrates the impact of selecting $R = I$ on the estimation results when the original $R$ is arbitrary. The solid blue lines represent the original time-varying cost weights, whereas the dotted gray lines represent the estimated values. Although the estimated values differ from the original values, the general trend of the changes is preserved, and the original weights among the control inputs are reflected in the current estimate of $q(t)$. In the figure, the bottom blue and gray lines represent the original and estimated $q_2$, respectively. Evidently, the blue line for $q_2$ lies above that for $q_1$ from 4.8 s to 5 s. In the original settings, $r_2 = 4$ confers greater importance on decreasing $u_2$ than $r_1 = 3$ does on $u_1$, which weakens the convergence of the $\theta_2$ term associated with $u_2$. In our estimates, the value of the estimated $q_2$, which absorbs the impact of the original setting of $R$, is not greater than the value of the estimated $q_1$ between 4.8 s and 5 s. This indicates that the convergence of $\theta_2$ is weakened once the cost weights on the control input are taken into account; the estimated curve therefore reflects the actual closed-loop behavior more directly than the original weights do.
Figure 3, Figure 4, and Figure 5 show the error $e$, the state $s$, and $\int_{t_0}^t s\,d\tau$ for the two cases. The blue lines show the results of the first case, whereas the gray dotted lines show the results of the second case. All values decrease to a small range during the simulation, and, most importantly, in the second case the different selection of $R$ does not affect the convergence of these values. This demonstrates the effectiveness of our method and highlights that even with different values of $R$, the recovered cost weights are still feasible solutions to the IOC problem, as they can be utilized to regenerate a similar system trajectory and control inputs ($\int_{t_0}^t s\,d\tau = \begin{bmatrix}\tilde{x} \\ \int_{t_0}^t \tilde{\lambda}\,d\tau\end{bmatrix} \to 0$).

6. Discussion

6.1. Robustness of the proposed method to noisy data

In (45), $\Gamma_1$ and $\Gamma_2$ regulate the updating speed of the estimated values and can thus be used to attenuate estimation error; their role is similar to that of a low-pass filter's time constant. Adjusting these two terms can reduce the impact of data noise to a certain degree. For example, under the settings of the first case with measurement noise added to $x$ ($\mathcal{N}(0, 10^{-1})$) and $u$ ($\mathcal{N}(0, 10^{-4})$), the simulation results show that different choices of $\Gamma_1$ and $\Gamma_2$ (e.g., $\Gamma_1 = \Gamma_2 = 10$ versus $\Gamma_1 = \Gamma_2 = 1$) significantly influence the noise-reduction performance.
As shown in Figure 6, while relatively small values of Γ 1 and Γ 2 may result in a low convergence rate, they effectively reduce the impact of data noise. Our method demonstrates robustness against noise by allowing for the adjustment of parameters Γ 1 and Γ 2 .

6.2. Calculation complexity and real-time calculation

The proposed algorithm has low computational complexity, as it only involves products between matrices and vectors as well as vector summations; it does not require any iterative or optimization calculations. This makes it an efficient solution for real-time use. In fact, our simulation shows that a single iteration of the algorithm under the case 1 settings takes only approximately 0.23 ms in Matlab 2016b to complete the online IOC calculation, which is fast enough to meet real-time requirements.

6.3. Advantages of using R = I

The simulation results suggest that one of the key advantages of setting R as a constant I is that it effectively consolidates the impact of cost weights on state convergence, which would have been influenced by different settings of R, into the estimated value of q ( t ) . This allows for a comprehensive evaluation of the system state convergence, as it only depends on q ( t ) , without needing to account for additional considerations. Furthermore, by maintaining a consistent value of R = I , it is possible to standardize the analysis of the same motion across multiple agents, which is crucial for various applications.

7. Conclusion

In this paper, we proposed a neural network-based method for recovering the time-varying cost weights in the IOC problem for linear continuous systems. Our approach involved constructing an auxiliary estimation system that closely approximates the behavior of the original system, followed by determining the necessary conditions for tuning the weights of the neurons in the neural network to obtain a unique solution for the IOC problem. We discussed the requirements on the problem settings that ensure the well-posedness of our online IOC method. We showed that the unique solution corresponds to achieving a nearly zero error between the original system state and the auxiliary estimated system state, as well as a nearly zero error between the original costate and the integral of the estimated costate. Based on this analysis, we developed two neural-network frameworks: one for approximating the cost-weight function and the other for addressing the error introduced by the auxiliary estimation system. Finally, we validated the effectiveness of our method through simulations, highlighting its ability to recover time-varying cost weights and its robustness against different original choices of $R$. Overall, our method represents a significant advancement in the field of online IOC, and it is applicable to a wide range of problems requiring real-time IOC calculations.

8. Proof of Theorem 1

Proof. 
Consider the Lyapunov candidate selected as follows:

$$V = \frac{1}{2}s^T s + \frac{1}{2}\left(\int_{t_0}^t s\,d\tau\right)^T K_p\int_{t_0}^t s\,d\tau + \frac{1}{2}\mathrm{tr}\left[\tilde{W}_1^T\Gamma_1^{-1}\tilde{W}_1 + \tilde{W}_2^T\Gamma_2^{-1}\tilde{W}_2\right] \qquad (50)$$

The derivative of $V$ can be expressed as

$$\dot{V} = s^T\dot{s} + s^T K_p\int_{t_0}^t s\,d\tau - \mathrm{tr}\left[\tilde{W}_1^T\Gamma_1^{-1}\dot{\hat{W}}_1 + \tilde{W}_2^T\Gamma_2^{-1}\dot{\hat{W}}_2\right] \qquad (51)$$

By introducing (44) and utilizing the proposed updating laws for $\hat{W}_1$ and $\hat{W}_2$ in (45), $\dot{V}$ becomes

$$\begin{aligned}
\dot{V} &= -s^T K s + s^T T_x\tilde{q}_1 + s^T e - \mathrm{tr}\left[\tilde{W}_1^T\Gamma_1^{-1}\dot{\hat{W}}_1 + \tilde{W}_2^T\Gamma_2^{-1}\dot{\hat{W}}_2\right] \\
&= -s^T K s + s^T T_x(\tilde{W}_1^T\phi(u_I) + \epsilon_1(u_I)) + s^T T_{\hat{x}}(\tilde{W}_2^T\phi(u_I) + \epsilon_2(u_I)) - \mathrm{tr}\left[\tilde{W}_1^T\Gamma_1^{-1}\dot{\hat{W}}_1 + \tilde{W}_2^T\Gamma_2^{-1}\dot{\hat{W}}_2\right] \\
&= -s^T K s + s^T T_x\epsilon_1(u_I) + s^T T_{\hat{x}}\epsilon_2(u_I) + k_e e^T T_{\hat{x}}\epsilon_2(u_I) - k_e e^T e
\end{aligned} \qquad (52)$$

Here, by introducing a new vector $p$ defined as $p = \begin{bmatrix} s \\ \sqrt{k_e/k}\,e\end{bmatrix}$ and considering (43), (52) can be rewritten as

$$\dot{V} = -p^T K p + p^T p_\epsilon \qquad (53)$$

where $p_\epsilon = \begin{bmatrix} T_x\epsilon_1(u_I) + T_{\hat{x}}\epsilon_2(u_I) \\ \sqrt{k k_e}\,T_{\hat{x}}\epsilon_2(u_I)\end{bmatrix}$ and $K$ is understood as $kI$ of compatible dimension.
By considering the boundedness of $T_x$, $T_{\hat{x}}$, $\epsilon_1(u_I)$, and $\epsilon_2(u_I)$, we have

$$||p_\epsilon|| \le \sqrt{(\delta_{tx}\epsilon_{n1} + \delta_{t\hat{x}}\epsilon_{n2})^2 + k k_e(\delta_{t\hat{x}}\epsilon_{n2})^2} \triangleq \delta_{tp\epsilon} \qquad (54)$$

From this boundedness condition, (53) becomes

$$\dot{V} \le -k||p||^2 + ||p||\,||p_\epsilon|| \le -k||p||^2 + ||p||\delta_{tp\epsilon} = -||p||(k||p|| - \delta_{tp\epsilon}) \qquad (55)$$
From (55), $\dot{V}$ is negative whenever $||p|| \ge \delta_{tp\epsilon}/k$, implying that $p$ converges to the set $||p|| \le \delta_{tp\epsilon}/k$. Moreover, from the definition $p = [s^T, \sqrt{k_e/k}\,e^T]^T$, both $s$ and $e$ are bounded, satisfying

$$||s|| \le \delta_s \qquad (56)$$

$$||e|| \le \delta_e \qquad (57)$$

That is, $s$ and $e$ are UUB. Moreover, by continuity, $\dot{s}$ is also UUB, satisfying

$$||\dot{s}|| \le \delta_{\dot{s}} \qquad (58)$$

Notably, with increasing $k$, the bound $\delta_{tp\epsilon}/k$ on $p$ decreases. Furthermore, since $V$ decreases continuously while $||p|| \ge \delta_{tp\epsilon}/k$, $\int_{t_0}^t s\,d\tau$ is also UUB.
Furthermore, from (40), we have

$$||T_x\tilde{q}_1|| = \left\|\dot{s} + Ks + K_p\int_{t_0}^t s\,d\tau - e\right\| \le B_{fh} \qquad (59)$$

where $B_{fh}$ denotes a positive scalar. By considering (42), we have

$$||T_x\tilde{W}_1^T\phi(u_I)|| = ||T_x\tilde{q}_1 - T_x\epsilon_1(u_I)|| \le ||T_x\tilde{q}_1|| + ||T_x||\,||\epsilon_1(u_I)|| \le B_{fh} + \delta_{tx}\epsilon_{n1} \qquad (60)$$

Similarly, from the boundedness of $e$ and $\epsilon_2(u_I)$ and from (43), we have

$$||T_{\hat{x}}\tilde{W}_2^T\phi(u_I)|| \le \delta_e + \delta_{t\hat{x}}\epsilon_{n2} \qquad (61)$$

From (45), the dynamics related to $\tilde{W}_1$ and $\tilde{W}_2$ can be respectively given by

$$\dot{\tilde{W}}_1 = -\Gamma_1\phi(u_I)s^T T_x, \qquad y_1 = T_x\tilde{W}_1^T\phi(u_I) \qquad (62)$$

$$\dot{\tilde{W}}_2 = -\Gamma_2\phi(u_I)(s + k_e e)^T T_{\hat{x}}, \qquad y_2 = T_{\hat{x}}\tilde{W}_2^T\phi(u_I) \qquad (63)$$

where $y_1$ and $y_2$ denote the outputs of the two systems and are both bounded following (60) and (61).
Thus, the vectorized dynamics of the two systems can be given as

$$\frac{d}{dt}\mathrm{vec}(\tilde{W}_1) = -(I \otimes \Gamma_1\phi(u_I))T_x^T s \triangleq B_{p1}(t)s, \qquad y_1 = T_x(I \otimes \phi(u_I)^T)\mathrm{vec}(\tilde{W}_1) \triangleq C_{p1}(t)\mathrm{vec}(\tilde{W}_1) \qquad (64)$$

$$\frac{d}{dt}\mathrm{vec}(\tilde{W}_2) = -(I \otimes \Gamma_2\phi(u_I))T_{\hat{x}}^T(s + k_e e) \triangleq B_{p2}(t)(s + k_e e), \qquad y_2 = T_{\hat{x}}(I \otimes \phi(u_I)^T)\mathrm{vec}(\tilde{W}_2) \triangleq C_{p2}(t)\mathrm{vec}(\tilde{W}_2) \qquad (65)$$

where $B_{p1}(t) = -(I \otimes \Gamma_1\phi(u_I))T_x^T$ and $B_{p2}(t) = -(I \otimes \Gamma_2\phi(u_I))T_{\hat{x}}^T$ are bounded given the boundedness of $\phi(u_I)$, $T_x$, and $T_{\hat{x}}$. Thus, from Lemma 4.2.1 in [22], if (46) is satisfied, the boundedness of $y_1$ and $y_2$ as well as that of $s$ and $s + k_e e$ assures the boundedness of $\tilde{W}_1$ and $\tilde{W}_2$. That is, there exist two positive scalars $\delta_{\tilde{W}_1}$ and $\delta_{\tilde{W}_2}$ such that $||\tilde{W}_1|| \le \delta_{\tilde{W}_1}$ and $||\tilde{W}_2|| \le \delta_{\tilde{W}_2}$. Thus, $\tilde{W}_1$ and $\tilde{W}_2$ are UUB. □
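The term cancellations behind the last line of (52) can be verified numerically with random data; in the Python sketch below, all matrices and signals are randomly generated placeholders, not simulation values from the paper.

import numpy as np

rng = np.random.default_rng(1)
n, n_f, l = 3, 4, 5
k, k_p, k_e, g1, g2 = 2.0, 3.0, 0.7, 1.3, 0.9
Tx = rng.normal(size=(2 * n, n_f)); Txh = rng.normal(size=(2 * n, n_f))
W1t = rng.normal(size=(l, n_f)); W2t = rng.normal(size=(l, n_f))    # W~1, W~2
phi = rng.normal(size=l); e1 = rng.normal(size=n_f); e2 = rng.normal(size=n_f)
s = rng.normal(size=2 * n); int_s = rng.normal(size=2 * n)

e = Txh @ (W2t.T @ phi + e2)                                        # (43)
s_dot = -k * s - k_p * int_s + Tx @ (W1t.T @ phi + e1) + e          # (44)
W1t_dot = -g1 * np.outer(phi, s @ Tx)                               # from (45)
W2t_dot = -g2 * np.outer(phi, (s + k_e * e) @ Txh)

V_dot = (s @ s_dot + k_p * (s @ int_s)
         + np.trace(W1t.T @ W1t_dot) / g1
         + np.trace(W2t.T @ W2t_dot) / g2)                          # (51)
rhs = (-k * (s @ s) + s @ (Tx @ e1) + s @ (Txh @ e2)
       + k_e * (e @ (Txh @ e2)) - k_e * (e @ e))                    # last line of (52)
assert np.isclose(V_dot, rhs)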

References

  1. Frigon, A.; Akay, T.; Prilutsky, B.I. Control of mammalian locomotion by somatosensory feedback. Comprehensive Physiology 2021, 12, 2877–2947.
  2. Li, Y.; Tee, K.P.; Yan, R.; Chan, W.L.; Wu, Y. A framework of human–robot coordination based on game theory and policy iteration. IEEE Transactions on Robotics 2016, 32, 1408–1418.
  3. Ziebart, B.D.; Maas, A.L.; Bagnell, J.A.; Dey, A.K. Human behavior modeling with maximum entropy inverse optimal control. AAAI Spring Symposium: Human Behavior Modeling, 2009, Vol. 92.
  4. Berret, B.; Chiovetto, E.; Nori, F.; Pozzo, T. Evidence for composite cost functions in arm movement planning: an inverse optimal control approach. PLoS Computational Biology 2011, 7, e1002183.
  5. El-Hussieny, H.; Abouelsoud, A.; Assal, S.F.; Megahed, S.M. Adaptive learning of human motor behaviors: An evolving inverse optimal control approach. Engineering Applications of Artificial Intelligence 2016, 50, 115–124.
  6. Jin, W.; Kulić, D.; Mou, S.; Hirche, S. Inverse optimal control from incomplete trajectory observations. The International Journal of Robotics Research 2021, 40, 848–865.
  7. Kalman, R.E. When is a linear control system optimal? Journal of Basic Engineering 1964, 86, 51–60.
  8. Molinari, B. The stable regulator problem and its inverse. IEEE Transactions on Automatic Control 1973, 18, 454–459.
  9. Obermayer, R.; Muckler, F.A. On the Inverse Optimal Control Problem in Manual Control Systems; Vol. 208, Citeseer, 1965.
  10. Boyd, S.; El Ghaoui, L.; Feron, E.; Balakrishnan, V. Linear Matrix Inequalities in System and Control Theory; SIAM, 1994.
  11. Priess, M.C.; Conway, R.; Choi, J.; Popovich, J.M.; Radcliffe, C. Solutions to the inverse LQR problem with application to biological systems analysis. IEEE Transactions on Control Systems Technology 2014, 23, 770–777.
  12. Rodriguez, A.; Ortega, R. Adaptive stabilization of nonlinear systems: the non-feedback linearizable case. IFAC Proceedings Volumes 1990, 23, 303–306.
  13. Freeman, R.A.; Kokotovic, P.V. Inverse optimality in robust stabilization. SIAM Journal on Control and Optimization 1996, 34, 1365–1391.
  14. Johnson, M.; Aghasadeghi, N.; Bretl, T. Inverse optimal control for deterministic continuous-time nonlinear systems. In 52nd IEEE Conference on Decision and Control; IEEE, 2013; pp. 2906–2913.
  15. Abbeel, P.; Ng, A.Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, 2004; p. 1.
  16. Ziebart, B.D.; Maas, A.L.; Bagnell, J.A.; Dey, A.K.; et al. Maximum entropy inverse reinforcement learning. In Proceedings of AAAI, Chicago, IL, USA, 2008; Vol. 8, pp. 1433–1438.
  17. Molloy, T.L.; Ford, J.J.; Perez, T. Online inverse optimal control for control-constrained discrete-time systems on finite and infinite horizons. Automatica 2020, 120, 109109.
  18. Gupta, R.; Zhang, Q. Decomposition and adaptive sampling for data-driven inverse linear optimization. INFORMS Journal on Computing 2022.
  19. Jin, W.; Kulić, D.; Lin, J.F.S.; Mou, S.; Hirche, S. Inverse optimal control for multiphase cost functions. IEEE Transactions on Robotics 2019, 35, 1387–1398.
  20. Li, Y.; Yao, Y.; Hu, X. Continuous-time inverse quadratic optimal control problem. Automatica 2020, 117, 108977.
  21. Zhang, H.; Ringh, A. Inverse linear-quadratic discrete-time finite-horizon optimal control for indistinguishable homogeneous agents: A convex optimization approach. Automatica 2023, 148, 110758.
  22. Lewis, F.; Jagannathan, S.; Yesildirak, A. Neural Network Control of Robot Manipulators and Non-Linear Systems; CRC Press, 2020.
Figure 1. Estimated cost weights ($r_1 = 1$, $r_2 = 1$)
Figure 2. Estimated cost weights ($r_1 = 3$, $r_2 = 4$)
Figure 3. Variation of the error $e$ ($r_1 = 1$, $r_2 = 1$ and $r_1 = 3$, $r_2 = 4$)
Figure 4. Variation of $s$ ($r_1 = 1$, $r_2 = 1$ and $r_1 = 3$, $r_2 = 4$)
Figure 5. Variation of $\int_{t_0}^t s\,d\tau$ ($r_1 = 1$, $r_2 = 1$ and $r_1 = 3$, $r_2 = 4$)
Figure 6. Estimated cost weights (noisy case): (1) $\Gamma_1 = 10$, $\Gamma_2 = 10$; (2) $\Gamma_1 = 1$, $\Gamma_2 = 1$