
Robust Control of an Inverted Pendulum System Based on Policy Iteration in Reinforcement Learning

This version is not peer-reviewed.

Submitted: 16 October 2023

Posted: 17 October 2023

A peer-reviewed article of this preprint also exists.

Abstract
This paper is primarily focused on the robust control of an inverted pendulum system based on the policy iteration in reinforcement learning. First, a mathematical model of the single inverted pendulum system is established through a force analysis of the pendulum and trolley. Second, based on the theory of robust optimal control, the robust control of the uncertain linear inverted pendulum system is transformed into an optimal control problem with an appropriate performance index. Moreover, for the uncertain linear and nonlinear systems, two reinforcement-learning control algorithms are proposed using the policy iteration method. Finally, two numerical examples are provided to validate the reinforcement learning algorithms for the robust control of the inverted pendulum systems.
Keywords: 
robust control; optimal control; inverted pendulum system; reinforcement learning
Subject: 
Engineering - Control and Systems Engineering

1. Introduction

In the last decade, there has been increasing interest in the robust control of the inverted pendulum system (IPS) owing to its high potential for testing a variety of advanced control algorithms. Robust control is widely used in power electronics [1], flight control [2], motion control [3], network control [4], and IPSs [5], among other fields. Research on the robust control of the IPS has produced valuable results in recent years. An inverted pendulum is an experimental device that is underactuated, inherently unstable, and uncertain. It has become an excellent benchmark in the field of automatic control over the last few decades, as it lends itself to model-based nonlinear control techniques and is a typical experimental platform for verifying classical and modern control theories [6].
Although the earliest research on the IPS can be traced back to 1908 [7], there was almost no literature on the subject between 1908 and 1960. In 1960, a number of tall, slender structures survived the Chilean earthquake, while structures that appeared more stable were severely damaged, which led some scholars to conduct in-depth research in search of a suitable explanation [8]. The pendulum structure under the effect of earthquakes was modeled as a base and rigid-block system, and block overturning was studied by applying horizontal accelerations, sinusoidal pulses, and seismic-type excitations to the system. It was observed that there is an unexpected scaling effect: of two geometrically similar blocks, the larger block is more stable than the smaller one. Furthermore, tall blocks exhibit greater stability during earthquakes when exposed to horizontal forces. Since then, with the development of modern control theory, various control methods have been applied to different types of IPSs, such as proportional-integral-derivative control [9,10], cloud model control [11,12], fuzzy control [13,14,15,16], sliding mode control [17,18], and neural network control [19,20,21]. These methods provide different ideas for the control of IPSs.
As is known, the IPS is an uncertain system, and the uncertainty of its model naturally falls within the scope of consideration. The aim of the robust control of an IPS is to find a controller capable of addressing the system uncertainties: when the system is perturbed by uncertainty, the robust control law still stabilizes it. Because it is difficult to solve the robust control problem directly, Lin et al. transformed the robust control problem into an optimal control problem [22,23,24]. The pioneering methods for solving optimal control problems are dynamic programming [25] and the maximum principle [26]. In the dynamic programming method, solving the Hamilton-Jacobi-Bellman (HJB) equation yields the optimal control of the system. However, the HJB equation is a nonlinear partial differential equation, and obtaining its solution has proven to be even more difficult than solving the original optimal control problem. As for the optimal control problem of a linear system with a quadratic performance index, whether the system is continuous or discrete, it finally comes down to solving an algebraic Riccati equation (ARE). Moreover, when the dimension of the state vector or control input vector in the dynamic system is large, the so-called "curse of dimensionality" appears when the dynamic programming method is used to solve the optimal control problem [27]. To overcome this weakness, some scholars have proposed using reinforcement learning (RL) to solve the optimal control problem [28,29].
When RL was initially used for system control, research focused primarily on discrete-time systems or discretized continuous-time systems, for problems such as the backgammon game [30], scheduling [31,32,33], and robot navigation [34]. The application of RL algorithms to continuous-time, continuous-state systems was first extended by Doya [35], who used a known system model to learn the optimal control policy. In the context of control engineering, RL and adaptive dynamic programming link traditional optimal control methods with adaptive control methods [36,37,38]. Vrabie et al. [39] used an RL algorithm to solve the optimal control problem of continuous-time systems: for a linear system, system data are collected, and the solution of the HJB equation is obtained via online policy iteration (PI) using the least squares method. Xu et al. [40,41] proposed RL algorithms for linear and nonlinear continuous-time systems that solve robust tracking problems through online PI; their algorithms take into consideration the uncertainty in the system's state and input matrices and improve the method for solving robust tracking.
The use of RL to iteratively solve optimal control problems has attracted extensive attention from scholars and has produced relatively good results, and the IPS has proven valuable for validating advanced control algorithms. Although there is a significant amount of literature on the IPS [9,11,15], to the best of our knowledge, there are almost no results on using RL to solve the control problem of the IPS. In this study, an attempt is made to solve the robust control problem of an uncertain continuous-time IPS using an RL algorithm. The dynamic equations of the nominal system need not be known when the RL control algorithm is used, and this study lays a theoretical foundation for the wide application of RL control algorithms in engineering systems. The main contributions of this study are as follows.
1) The state-space model of the IPS is established, and a robust optimal control design method for the uncertain system is proposed. By constructing an appropriate performance index, the optimal control method is used for the first time to design a robust control law for an uncertain IPS.
2) A PI algorithm in RL is designed for realizing the robust optimal control of the IPS. Using RL algorithms to solve the robust control problem of the IPS does not require the nominal system matrix to be known, and it can also overcome the challenges resulting from the "curse of dimensionality". The first application of RL to the control problem of an IPS has significance for its potential application in practical engineering.
The organization of this paper is as follows. In Section 2, the state-space equation of the IPS and a linearized model are established. The robust control and RL algorithm for the linearized IPS are presented in Sections 3 and 4, respectively. In Section 5, we establish the nonlinear state-space model of the IPS and propose the corresponding RL algorithm. The RL algorithms are then verified via simulations in Section 6. Finally, we summarize the work of this paper and outline potential future research directions.

2. Preliminaries

In this section, we establish a physical model of a first-order linear IPS according to Newton's second law. By selecting appropriate state variables, the state-space model with uncertainty is derived.

2.1. Modeling of Inverted Pendulum System

The inverted pendulum experimental device comprises a pendulum and a trolley. Its structure is presented in Figure 1, and its simplified physical model, which mainly includes the pendulum and the trolley, is presented in Figure 2. In Figure 2, owing to the interaction between the trolley and the pendulum, the trolley is subjected to a force $F_3$ from the pendulum, which acts in the lower-left direction, and the pendulum is subjected to a force $F_4$ from the trolley, which acts in the upper-right direction. In addition, the pendulum and trolley are subjected to other forces, as shown in Figures 3 and 4, respectively. The trolley is driven by a motor to perform horizontal movements on the guide rail. In Figure 3, the trolley is subjected to the force $F_1$ from the motor and to gravity; $F_2$ represents the resistance between the trolley and the guide rail, and $N_1$ and $P_1$ are the two components of the force $F_3$. In Figure 4, the pendulum is subjected to gravity $G = m_1 g$, and $N_2$ and $P_2$ are the two components of the force $F_4$.
To facilitate subsequent calculations, we define the parameters of the first-order IPS as shown in Table 1. The time argument $t$ is omitted throughout; for example, $x$ denotes $x(t)$. According to Newton's second law, in the horizontal direction the trolley satisfies the following equation:

$F_1 - N_1 - F_2 = m_2 \ddot{x}$ (1)
Table 1. IPS parameter symbols.

Parameter | Unit | Significance
$m_1$ | kg | Mass of the pendulum
$m_2$ | kg | Mass of the trolley
$L$ | m | Half the length of the pendulum
$z$ | N·s/m | Friction coefficient between the trolley and the guide rail
$x$ | m | Displacement of the trolley
$\theta$ | rad | Angle of the pendulum from the upright position
$I$ | kg·m$^2$ | Moment of inertia of the pendulum
We assume that the resistance is proportional to the speed of the trolley; therefore, $F_2 = z\dot{x}$, where $z$ is the proportionality coefficient. Moreover, in the horizontal direction, the pendulum satisfies the following equation:

$N_2 = m_1\frac{d^2}{dt^2}\left(x - L\sin\theta\right) = m_1\ddot{x} + m_1 L\dot{\theta}^2\sin\theta - m_1 L\ddot{\theta}\cos\theta$ (2)
Considering that $N_1 = N_2$, and on substituting (2) into (1), we obtain

$F_1 = (m_1 + m_2)\ddot{x} + z\dot{x} + m_1 L\dot{\theta}^2\sin\theta - m_1 L\ddot{\theta}\cos\theta$ (3)
Next, we analyze the resultant force in the vertical direction of the pendulum, and the following equation can be obtained:

$P_2 - m_1 g = m_1\frac{d^2}{dt^2}\left(L\cos\theta\right) = -m_1 L\dot{\theta}^2\cos\theta - m_1 L\ddot{\theta}\sin\theta$ (4)
The component of $N_2$ in the direction perpendicular to the pendulum is

$N_2\cos\theta = m_1\frac{d^2}{dt^2}\left(x - L\sin\theta\right)\cos\theta = m_1\ddot{x}\cos\theta + m_1 L\dot{\theta}^2\sin\theta\cos\theta - m_1 L\ddot{\theta}\cos^2\theta$ (5)
Based on the torque balance, we can obtain the following equation:

$I\ddot{\theta} = P_2 L\sin\theta + N_2 L\cos\theta$ (6)
where $I$ is the moment of inertia of the pendulum. On substituting equations (4) and (5) into Equation (6), we obtain

$(I + m_1 L^2)\ddot{\theta} - m_1 g L\sin\theta = m_1 L\ddot{x}\cos\theta$ (7)
Thus far, equations (3) and (7) constitute the dynamic model of the IPS. Moreover, it can be assumed that the rotation angle of the pendulum is very small, that is, $\theta \ll 1$ rad. Therefore, it can be approximated that

$\sin\theta \approx \theta, \quad \cos\theta \approx 1$

It then follows from equations (3) and (7) that

$F_1 = (m_1 + m_2)\ddot{x} + z\dot{x} + m_1 L\dot{\theta}^2\theta - m_1 L\ddot{\theta}$
$(I + m_1 L^2)\ddot{\theta} - m_1 g L\theta = m_1 L\ddot{x}$ (8)

2.2. State Space Model with Uncertainty

In Section 2.1, we established the dynamic model of the IPS as shown in Equation (8). Next, we will derive the state space model of the IPS.
As the rotation angle $\theta$ of the pendulum is very small, it can also be approximated that $\dot{\theta} \approx 0$ and $\dot{\theta}^2 \approx 0$. It follows from (8) that

$F_1 = (m_1 + m_2)\ddot{x} + z\dot{x} - m_1 L\ddot{\theta}$
$(I + m_1 L^2)\ddot{\theta} - m_1 g L\theta = m_1 L\ddot{x}$ (9)
The state variables of the system can be defined as

$x_1 = x, \quad x_2 = \dot{x}, \quad x_3 = \theta, \quad x_4 = \dot{\theta}$

Therefore, the following state-space equations can be derived:

$\dot{x}_1 = x_2$
$\dot{x}_2 = -\frac{(I + m_1 L^2)z}{I(m_1 + m_2) + m_1 m_2 L^2}x_2 + \frac{m_1^2 g L^2}{I(m_1 + m_2) + m_1 m_2 L^2}x_3 + \frac{I + m_1 L^2}{I(m_1 + m_2) + m_1 m_2 L^2}u$
$\dot{x}_3 = x_4$
$\dot{x}_4 = -\frac{m_1 L z}{I(m_1 + m_2) + m_1 m_2 L^2}x_2 + \frac{m_1 g L(m_1 + m_2)}{I(m_1 + m_2) + m_1 m_2 L^2}x_3 + \frac{m_1 L}{I(m_1 + m_2) + m_1 m_2 L^2}u$ (10)
where $u$ represents the force $F_1$ provided by the motor. Using $W = I(m_1 + m_2) + m_1 m_2 L^2$, Equation (10) can be written as

$\dot{x} = Ax(t) + Bu(t)$

where

$A = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & -\frac{(I + m_1 L^2)z}{W} & \frac{m_1^2 g L^2}{W} & 0 \\ 0 & 0 & 0 & 1 \\ 0 & -\frac{m_1 L z}{W} & \frac{m_1 g L(m_1 + m_2)}{W} & 0 \end{bmatrix}, \quad B = \begin{bmatrix} 0 & \frac{I + m_1 L^2}{W} & 0 & \frac{m_1 L}{W} \end{bmatrix}^T$
However, an accurate model of the IPS is difficult to obtain, and all of its parameters have uncertainties. In this paper, the friction coefficient $z$ between the trolley and the guide rail is selected as the uncertain parameter. The numerical values of the other parameters in Table 1 are taken as known: $m_1 = 0.109$, $m_2 = 1.096$, $L = 0.25$, and $I = 0.0034$. Therefore, the state-space model of the uncertain IPS can be abbreviated as

$\dot{x} = A(z)x + Bu$ (11)

where

$A(z) = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & -0.8832z & 0.6293 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & -2.3566z & 27.8285 & 0 \end{bmatrix}, \quad B = \begin{bmatrix} 0 & 0.8832 & 0 & 2.3566 \end{bmatrix}^T$
Here we choose $z_0 = 0.1$ as the nominal value and denote the nominal system matrix by $A(z_0)$. Therefore, the nominal system corresponding to the uncertain system (11) is

$\dot{x} = A(z_0)x + Bu$ (12)

where

$A(z_0) = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & -0.0883 & 0.6293 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & -0.2357 & 27.8285 & 0 \end{bmatrix}$
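To make the linearization reproducible, the following short script (an illustrative sketch in Python/NumPy, not part of the original paper) rebuilds $A(z_0)$ and $B$ from the physical parameters above, so the numerical entries can be checked independently (with $g = 9.8\ \mathrm{m/s^2}$ assumed):

```python
import numpy as np

# Physical parameters from Table 1 (values used in this paper)
m1, m2, L, I, g = 0.109, 1.096, 0.25, 0.0034, 9.8
W = I * (m1 + m2) + m1 * m2 * L**2      # common denominator in (10)

def model(z):
    """State matrices of the linearized IPS for friction coefficient z."""
    A = np.array([
        [0.0, 1.0, 0.0, 0.0],
        [0.0, -(I + m1 * L**2) * z / W, m1**2 * g * L**2 / W, 0.0],
        [0.0, 0.0, 0.0, 1.0],
        [0.0, -m1 * L * z / W, m1 * g * L * (m1 + m2) / W, 0.0],
    ])
    B = np.array([[0.0], [(I + m1 * L**2) / W], [0.0], [m1 * L / W]])
    return A, B

A0, B = model(z=0.1)      # nominal system (12), z0 = 0.1
print(np.round(A0, 4))    # entries -0.0883, 0.6293, -0.2357, 27.8285
print(np.round(B.T, 4))   # [0, 0.8832, 0, 2.3566]
```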

3. Robust Control of Uncertain Linear System

This section presents a robust optimal control method for the uncertain IPS modeled in the previous section: by selecting an appropriate performance index function and solving an ARE, a robust control law is constructed. When the uncertain parameter of the system changes within a certain range, this robust control law keeps the system asymptotically stable.
The following lemmas are proposed to prove the main results of this paper.
Lemma 1.
The nominal system (12) corresponding to system (11) is stabilizable.
Proof. 
For the four-dimensional continuous time-invariant system (12), the controllability matrix is constructed as

$G = \begin{bmatrix} B & A(z_0)B & A(z_0)^2B & A(z_0)^3B \end{bmatrix}$

Therefore, we have

$\mathrm{rank}(G) = \mathrm{rank}\begin{bmatrix} 0 & 0.8832 & -0.0780 & 1.4899 \\ 0.8832 & -0.0780 & 1.4899 & -0.2626 \\ 0 & 2.3566 & -0.2082 & 65.5990 \\ 2.3566 & -0.2082 & 65.5990 & -6.1442 \end{bmatrix} = 4$
Therefore, system (12) is completely controllable, which means that the system can be stabilized. This concludes the proof. □
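The rank computation can be verified numerically; a minimal check (an added sketch, reusing the numerical $A(z_0)$ and $B$ above) is:

```python
import numpy as np

A0 = np.array([[0, 1, 0, 0],
               [0, -0.0883, 0.6293, 0],
               [0, 0, 0, 1],
               [0, -0.2357, 27.8285, 0]])
B = np.array([[0.0], [0.8832], [0.0], [2.3566]])

# Controllability matrix G = [B, A0 B, A0^2 B, A0^3 B]
G = np.hstack([np.linalg.matrix_power(A0, k) @ B for k in range(4)])
print(np.linalg.matrix_rank(G))   # -> 4, so (A(z0), B) is controllable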
Lemma 2.
There is an $m \times n$ matrix $\Delta(z)$ such that the system matrices $A(z)$ and $A(z_0)$ satisfy the following matched condition:

$A(z) - A(z_0) = B\Delta(z)$
Proof. 
$A(z) - A(z_0) = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0.8832(0.1 - z) & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 2.3566(0.1 - z) & 0 & 0 \end{bmatrix} = \begin{bmatrix} 0 \\ 0.8832 \\ 0 \\ 2.3566 \end{bmatrix}\Delta(z) = B\Delta(z)$

where

$\Delta(z) = \begin{bmatrix} 0 & 0.1 - z & 0 & 0 \end{bmatrix}$
This concludes the proof. □
Lemma 3.
For any $z \in [0, 1]$, there exists a positive semidefinite matrix $F$ such that $\Delta(z)$ satisfies

$\Delta(z)^T\Delta(z) \le F$

where $F \ge 0$.
Proof. 
According to Lemma 2, for all $z \in [0, 1]$ we can obtain

$\Delta(z)^T\Delta(z) = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & (0.1 - z)^2 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} \le \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0.81 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} = F$
This concludes the proof. □
For the nominal system (12), we construct the following ARE:

$SA(z_0) + A(z_0)^TS + Q - SBB^TS = 0$ (18)

where $Q = F + I$. According to the above three lemmas and ARE (18), we propose the following theorem.
Theorem 1.
Let us suppose that $S$ is a symmetric positive definite solution of ARE (18). Then, for all uncertainties $z \in [0, 1]$, the feedback control $u = Kx$ with $K = -B^TS$ makes the system (11) asymptotically stable.
Proof of Theorem 1.
We define the following Lyapunov function:

$V(x) = x^TSx$ (19)
We set $u = Kx$ and take the time derivative of the Lyapunov function (19) along the trajectories of system (11):

$\dot{V}(x) = x^T\left[(A(z) + BK)^TS + S(A(z) + BK)\right]x$ (20)

According to Lemma 2, $A(z) = A(z_0) + B\Delta(z)$, so we can obtain

$\dot{V}(x) = x^T\left[A(z_0)^TS + SA(z_0)\right]x + x^T\Delta(z)^TB^TSx + x^TSB\Delta(z)x + 2x^TSBKx$

On substituting ARE (18), that is, $A(z_0)^TS + SA(z_0) = -Q + SBB^TS$, into the above equation, we obtain

$\dot{V}(x) = -x^TQx + x^TSBB^TSx + x^T\Delta(z)^TB^TSx + x^TSB\Delta(z)x + 2x^TSBKx$

Because $K = -B^TS$, this becomes

$\dot{V}(x) = -x^TQx - x^TK^TKx - 2x^TK^T\Delta(z)x$

As

$-x^TK^TKx - 2x^TK^T\Delta(z)x = -x^T\left(K + \Delta(z)\right)^T\left(K + \Delta(z)\right)x + x^T\Delta(z)^T\Delta(z)x \le x^T\Delta(z)^T\Delta(z)x$

we can obtain

$\dot{V}(x) \le -x^TQx + x^T\Delta(z)^T\Delta(z)x \le -x^T(Q - F)x = -x^Tx$

Therefore,

$\dot{V}(x) = 0 \iff x = 0, \qquad \dot{V}(x) < 0 \ \text{for} \ x \ne 0$
According to the Lyapunov stability theorem, the uncertain system (11) is asymptotically stable. Theorem 1 has thus been proved. □
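As a model-based cross-check of Theorem 1 (an added sketch, not the authors' code), the ARE (18) can be solved directly with SciPy and the robust gain formed as $K = -B^TS$; the data-driven algorithm of Section 4 reaches the same $S$ without using $A(z_0)$:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

A0 = np.array([[0, 1, 0, 0],
               [0, -0.0883, 0.6293, 0],
               [0, 0, 0, 1],
               [0, -0.2357, 27.8285, 0]])
B = np.array([[0.0], [0.8832], [0.0], [2.3566]])
Q = np.diag([1.0, 1.81, 1.0, 1.0])   # Q = F + I from Lemma 3
R = np.eye(1)

# Solves A0^T S + S A0 + Q - S B R^{-1} B^T S = 0
S = solve_continuous_are(A0, B, Q, R)
K = -(B.T @ S)                        # robust feedback u = K x
print(np.round(K, 4))                 # approx [1.0000 2.5455 -32.1952 -6.2327]
```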

4. RL Algorithm for Robust Optimal Control

In this section, we propose an RL algorithm that solves the robust control problem of the IPS through online PI. According to ARE (18), the following optimal control problem is constructed. For the nominal system

$\dot{x} = A(z_0)x + Bu$

we seek a control $u$ such that the following performance index is minimized:

$J = \int_t^{\infty}\left[x^TQx + u^Tu\right]d\tau$

where $Q = F + I > 0$. For any initial time $t$, the optimal cost can be written as

$V[x(t)] = \int_t^{\infty}\left[x^TQx + u^Tu\right]d\tau = \int_t^{t+\Delta t}\left[x^TQx + u^Tu\right]d\tau + \int_{t+\Delta t}^{\infty}\left[x^TQx + u^Tu\right]d\tau = \int_t^{t+\Delta t}\left[x^TQx + u^Tu\right]d\tau + V[x(t+\Delta t)]$

From the Lyapunov function (19), we obtain

$x(t)^TSx(t) = \int_t^{t+\Delta t}\left[x^TQx + u^Tu\right]d\tau + x(t+\Delta t)^TSx(t+\Delta t)$
where S is the solution to ARE (18). We propose the following RL algorithm for solving a robust controller.
Algorithm 1. RL algorithm for robust control of the uncertain linear IPS
(1)
Compute $Q = F + I$.
(2)
Select an initial stabilizing control gain $K_0$.
(3)
Policy evaluation: solve for $S_i$ from the equation $x^T(t)S_ix(t) = \int_t^{t+\Delta t} x^T\left(Q + K_i^TK_i\right)x\,d\tau + x^T(t+\Delta t)S_ix(t+\Delta t)$.
(4)
Policy improvement: $K_{i+1} = -B^TS_i$.
(5)
Set $i = i + 1$ and repeat steps 3 and 4 until $\|S_{i+1} - S_i\| \le \epsilon$, where $\epsilon > 0$ is a small constant.
In Algorithm 1, starting from an initial stabilizing control law, steps 3 and 4 are repeated until convergence. We can then obtain the robust control gain $K$ of system (11), with $u = Kx$.
Remark 1.
Step 3 of Algorithm 1 is the policy evaluation, and step 4 is the policy improvement. Solving the equation in step 3 amounts to solving a least-squares problem: if sufficient data are collected from the system over the integration intervals, the least-squares method can be used to solve for $S_i$. Although it is also possible to obtain $S_i$ by directly solving the ARE (18), that requires the state matrix of system (12) to be known. The implementation of Algorithm 1 does not require knowledge of the state matrix.
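For concreteness, the following is a minimal data-driven sketch of Algorithm 1 (an illustration under stated assumptions, not the authors' implementation). The nominal model is used only to generate the measured trajectories; the algorithm itself uses the data, the input matrix $B$, and $Q$. The restarts from random states, the Euler step, and the tolerances are illustrative choices:

```python
import numpy as np

A0 = np.array([[0, 1, 0, 0],
               [0, -0.0883, 0.6293, 0],
               [0, 0, 0, 1],
               [0, -0.2357, 27.8285, 0]])   # used ONLY to simulate the plant
B = np.array([[0.0], [0.8832], [0.0], [2.3566]])
Q = np.diag([1.0, 1.81, 1.0, 1.0])

idx = [(i, j) for i in range(4) for j in range(i, 4)]

def phi(x):
    # x^T S x is linear in the 10 independent entries of the symmetric S
    return np.array([(1.0 if i == j else 2.0) * x[i] * x[j] for i, j in idx])

def unpack(s):
    S = np.zeros((4, 4))
    for k, (i, j) in enumerate(idx):
        S[i, j] = S[j, i] = s[k]
    return S

rng = np.random.default_rng(0)
dt, Dt = 1e-3, 0.05                   # Euler step and reinforcement interval
K = np.array([[1.09, 4.123, -24.8908, -6.7726]])   # initial stabilizing gain

for _ in range(20):
    Phi, r = [], []
    for _ in range(40):               # 40 intervals >= 10 unknowns
        x = rng.standard_normal(4)    # restarts provide sufficient excitation
        x0, cost = x.copy(), 0.0
        for _ in range(int(Dt / dt)): # integrate one interval under u = K x
            cost += float(x @ (Q + K.T @ K) @ x) * dt
            x = x + (A0 @ x + B.ravel() * (K @ x).item()) * dt
        Phi.append(phi(x0) - phi(x))  # x0^T S x0 - x(t+Dt)^T S x(t+Dt) = cost
        r.append(cost)
    s, *_ = np.linalg.lstsq(np.array(Phi), np.array(r), rcond=None)
    S = unpack(s)                     # step 3: policy evaluation
    K_new = -(B.T @ S)                # step 4: policy improvement
    if np.linalg.norm(K_new - K) < 1e-3:
        break
    K = K_new

print(np.round(K, 3))  # should be close to the K_d reported in Section 6.1
```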
Next, we prove the convergence of Algorithm 1. However, it is necessary to prove the following Lemma first.
Lemma 4.
On assuming that the matrix $A(z_0) + BK_i$ is stable, solving for the matrix $S_i$ in step 3 of Algorithm 1 is equivalent to solving the following equation:

$S_i\left(A(z_0) + BK_i\right) + \left(A(z_0) + BK_i\right)^TS_i + Q + K_i^TK_i = 0$ (22)
Proof. We rewrite the equation of step 3 in Algorithm 1 as follows:

$\lim_{\Delta t \to 0}\frac{x^T(t+\Delta t)S_ix(t+\Delta t) - x^T(t)S_ix(t)}{\Delta t} + \lim_{\Delta t \to 0}\frac{1}{\Delta t}\int_t^{t+\Delta t} x^T\left(Q + K_i^TK_i\right)x\,d\tau = 0$ (23)

According to the definition of the derivative, the first term of Equation (23) is the derivative of $x^T(t)S_ix(t)$ with respect to time $t$. We thus obtain

$\frac{d\left(x^T(t)S_ix(t)\right)}{dt} + \lim_{\Delta t \to 0}\frac{d}{d\Delta t}\int_t^{t+\Delta t} x^T\left(Q + K_i^TK_i\right)x\,d\tau = 0$ (24)

Evaluating the derivative along the closed-loop system $\dot{x} = (A(z_0) + BK_i)x$ and re-arranging Equation (24) yields

$x^T\left[S_i\left(A(z_0) + BK_i\right) + \left(A(z_0) + BK_i\right)^TS_i + Q + K_i^TK_i\right]x = 0$

which means that (22) is established. Next, we reverse the process.

Along the stable system $\dot{x} = (A(z_0) + BK_i)x$, the time derivative of the Lyapunov function $V_i(x) = x^TS_ix$ is

$\dot{V}_i(x) = \frac{d\left(x^T(t)S_ix(t)\right)}{dt} = x^T\left(A(z_0) + BK_i\right)^TS_ix + x^TS_i\left(A(z_0) + BK_i\right)x$

By (22), this gives $\dot{V}_i(x) = -x^T\left(Q + K_i^TK_i\right)x$. On integrating both sides over the interval $[t, t+\Delta t]$, we obtain

$x^T(t+\Delta t)S_ix(t+\Delta t) - x^T(t)S_ix(t) = -\int_t^{t+\Delta t} x^T\left(Q + K_i^TK_i\right)x\,d\tau$

which is exactly the equation in step 3. This concludes the proof. □
According to existing results [42], the iteration between (22) (equivalently, step 3 of Algorithm 1) and the policy improvement of step 4 converges to the solution of ARE (18).
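In other words, when $A(z_0)$ is known, Algorithm 1 reduces to Kleinman's iteration [42], alternating the Lyapunov equation (22) with the gain update; a compact model-based sketch (added for illustration) is:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

A0 = np.array([[0, 1, 0, 0],
               [0, -0.0883, 0.6293, 0],
               [0, 0, 0, 1],
               [0, -0.2357, 27.8285, 0]])
B = np.array([[0.0], [0.8832], [0.0], [2.3566]])
Q = np.diag([1.0, 1.81, 1.0, 1.0])
K = np.array([[1.09, 4.123, -24.8908, -6.7726]])  # initial stabilizing gain

for _ in range(20):
    Ac = A0 + B @ K                   # closed loop under u = K_i x
    # Solves S Ac + Ac^T S + Q + K^T K = 0, i.e. Equation (22)
    S = solve_continuous_lyapunov(Ac.T, -(Q + K.T @ K))
    K_next = -(B.T @ S)
    if np.linalg.norm(K_next - K) < 1e-10:
        break
    K = K_next

print(np.round(K, 4))   # converges to the ARE gain of Theorem 1
```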

5. Robust Control of Nonlinear IPS

In this section, we establish a nonlinear state-space model of the IPS and construct an appropriate auxiliary system and a performance index. The problem of the robust control of the IPS is then transformed into the optimal control problem of the auxiliary system. We finally propose the corresponding RL algorithm.

5.1. Nonlinear State-Space Representation of IPS

Based on the uncertain linear inverted pendulum model (11) established in Section 2, we consider the following uncertain nonlinear system:

$\dot{x} = A(z)x(t) + Bu(t) + F(z,x)$ (26)
where $F(z,x)$ represents the nonlinear perturbation of the system and can be used to represent various nonlinearity factors in the system. Based on the modeling process in Section 2 and the literature [23], it is assumed that

$F(z,x) = \begin{bmatrix} 0 \\ -\frac{(I + m_1 L^2)z}{W}\left[x_2\cos(x_1x_2 + x_3x_4) + 0.5x_1 + 2x_3 - 4x_4 - x_2\right] \\ 0 \\ -\frac{m_1 L z}{W}\left[x_2\cos(x_1x_2 + x_3x_4) + 0.5x_1 + 2x_3 - 4x_4 - x_2\right] \end{bmatrix}$
where the parameters $I$, $m_1$, $L$, and $W$ are the same as those in (10). On rewriting system (26) so that all $z$-dependent terms are collected in the perturbation, we obtain

$\dot{x} = \bar{A}x + Bu + \bar{F}(z,x)$ (27)

where

$\bar{A} = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & \frac{m_1^2 g L^2}{W} & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & \frac{m_1 g L(m_1 + m_2)}{W} & 0 \end{bmatrix}, \quad \bar{F}(z,x) = \begin{bmatrix} 0 \\ -\frac{(I + m_1 L^2)z}{W}\left[x_2\cos(x_1x_2 + x_3x_4) + 0.5x_1 + 2x_3 - 4x_4\right] \\ 0 \\ -\frac{m_1 L z}{W}\left[x_2\cos(x_1x_2 + x_3x_4) + 0.5x_1 + 2x_3 - 4x_4\right] \end{bmatrix}$
On substituting the parameter values into system (27), we obtain

$\bar{A} = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 0.6293 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 27.8285 & 0 \end{bmatrix}, \quad B = \begin{bmatrix} 0 \\ 0.8832 \\ 0 \\ 2.3566 \end{bmatrix}, \quad \bar{F}(z,x) = \begin{bmatrix} 0 \\ -0.8832z \\ 0 \\ -2.3566z \end{bmatrix}\left[x_2\cos(x_1x_2 + x_3x_4) + 0.5x_1 + 2x_3 - 4x_4\right]$

5.2. Robust Control of Nonlinear IPS

To obtain the robust control law for an uncertain nonlinear IPS, we propose the following two lemmas.
Lemma 5.
There exists an uncertain function $G(z,x)$ such that $\bar{F}(z,x)$ can be decomposed into the following form:

$\bar{F}(z,x) = BG(z,x)$

Proof. 

$\bar{F}(z,x) = \begin{bmatrix} 0 \\ -\frac{(I + m_1 L^2)z}{W}\left[x_2\cos(x_1x_2 + x_3x_4) + 0.5x_1 + 2x_3 - 4x_4\right] \\ 0 \\ -\frac{m_1 L z}{W}\left[x_2\cos(x_1x_2 + x_3x_4) + 0.5x_1 + 2x_3 - 4x_4\right] \end{bmatrix} = \begin{bmatrix} 0 \\ \frac{I + m_1 L^2}{W} \\ 0 \\ \frac{m_1 L}{W} \end{bmatrix}\left(-z\left[x_2\cos(x_1x_2 + x_3x_4) + 0.5x_1 + 2x_3 - 4x_4\right]\right) = BG(z,x)$

where $G(z,x) = -z\left[x_2\cos(x_1x_2 + x_3x_4) + 0.5x_1 + 2x_3 - 4x_4\right]$. This concludes the proof. □
Lemma 6.
There exists an upper-bound function $f_{max}(x)$ such that $G(z,x)$ satisfies

$\left\|G(z,x)\right\| \le f_{max}(x)$

Proof. Since $z \in [0, 1]$ and $|\cos(\cdot)| \le 1$,

$\left\|G(z,x)\right\| = z\left|x_2\cos(x_1x_2 + x_3x_4) + 0.5x_1 + 2x_3 - 4x_4\right| \le \left|x_2 + 0.5x_1 + 2x_3 - 4x_4\right| = f_{max}(x)$

This concludes the proof. □
We construct the optimal control problem for the nominal system

$\dot{x} = \bar{A}x(t) + Bu(t)$ (29)

We seek a state-feedback controller $u = u(x)$ that minimizes the following performance index:

$J(x_0, u) = \int_0^{\infty}\left[f_{max}^2(x) + x^Tx + u^Tu\right]d\tau$ (30)
According to the performance index function (30), the cost function corresponding to the admissible control policy u ( x ) is
V ( x ) = t [ f m a x 2 ( x ) + x T x + u T u ] d t
Taking the time derivative on both sides of function (31) yields the following Bellman equation.
f m a x 2 ( x ) + x T x + u T u + V T [ A ¯ x + B u ] = 0
where V is the gradient of the cost function V ( x ) with respect to x.
We define the following Hamiltonian function:

$H(x, u, \nabla V) = f_{max}^2(x) + x^Tx + u^Tu + \nabla V^T\left[\bar{A}x + Bu\right]$ (33)

Assuming that the minimum exists and is unique, setting $\partial H/\partial u = 2u + B^T\nabla V = 0$ gives the optimal control for the given problem:

$u^* = -\frac{1}{2}B^T\nabla V^*$ (34)

On substituting Equation (34) into Equation (32), the HJB equation satisfied by the optimal value function $V^*(x)$ can be obtained as

$f_{max}^2(x) + x^Tx + \nabla V^{*T}\bar{A}x - \frac{1}{4}\nabla V^{*T}BB^T\nabla V^* = 0$ (35)

with the boundary condition $V^*(0) = 0$.

On solving Equation (35) for the optimal value function $V^*(x)$, the solution of the optimal control problem, and hence of the robust control problem, can be obtained. The following theorem shows that the optimal control $u^* = -\frac{1}{2}B^T\nabla V^*$ is a robust controller for the nonlinear IPS.
Theorem 2.
Considering the nominal system (29) with the performance index (30), and assuming that the solution $V^*(x)$ of the HJB Equation (35) exists, the optimal control law (34) globally stabilizes the IPS (27).
Proof of Theorem 2.
We select $V^*(x)$ as the Lyapunov function. Considering the performance index (30), $V^*(x)$ is clearly positive definite, and $V^*(0) = 0$. Taking the time derivative of $V^*(x)$ along system (27) under the control (34) yields

$\frac{dV^*}{dt} = \nabla V^{*T}\left[\bar{A}x + \bar{F}(z,x)\right] - \frac{1}{2}\nabla V^{*T}BB^T\nabla V^*$ (36)

According to Lemma 5, $\bar{F}(z,x) = BG(z,x)$, so it follows that

$\frac{dV^*}{dt} = \nabla V^{*T}\bar{A}x + \nabla V^{*T}BG(z,x) - \frac{1}{2}\nabla V^{*T}BB^T\nabla V^*$

According to the HJB Equation (35), we can obtain

$\nabla V^{*T}\bar{A}x = -f_{max}^2(x) - x^Tx + \frac{1}{4}\nabla V^{*T}BB^T\nabla V^*$ (37)

On substituting Equation (37) into Equation (36), we obtain

$\frac{dV^*}{dt} = -f_{max}^2(x) - x^Tx - \frac{1}{4}\nabla V^{*T}BB^T\nabla V^* + \nabla V^{*T}BG(z,x)$ (38)

Completing the square in Equation (38), we obtain

$\frac{dV^*}{dt} = -f_{max}^2(x) - x^Tx - \frac{1}{4}\left[\nabla V^{*T}BB^T\nabla V^* - 4\nabla V^{*T}BG(z,x) + 4G^T(z,x)G(z,x)\right] + G^T(z,x)G(z,x) = -x^Tx + G^T(z,x)G(z,x) - f_{max}^2(x) - \frac{1}{4}H^T(z,x)H(z,x) \le -x^Tx$

where $H(z,x) = B^T\nabla V^* - 2G(z,x)$, and the inequality uses $G^T(z,x)G(z,x) \le f_{max}^2(x)$ from Lemma 6. According to the Lyapunov stability criterion, the optimal control (34) makes the uncertain nonlinear IPS (27) asymptotically stable for all admissible uncertainties. Moreover, for any constant $p > 0$, consider the neighborhood $N = \{x : \|x\| < p\}$ of the origin. The state $x(t)$ cannot remain outside $N$ forever; otherwise, $\|x(t)\| \ge p$ for all $t \ge 0$, which implies

$V^*[x(t)] - V^*[x(0)] = \int_0^t \dot{V}^*(x(\tau))\,d\tau \le -\int_0^t x^Tx\,d\tau \le -\int_0^t p^2\,d\tau = -p^2t$

Letting $t \to \infty$ gives $V^*[x(t)] \to -\infty$, which contradicts the positive definiteness of $V^*(x)$. Hence $x(t)$ eventually enters $N$; since $p > 0$ is arbitrary, $x(t) \to 0$ as $t \to \infty$, and system (27) is globally asymptotically stable. This concludes the proof. □

5.3. RL Algorithm for Nonlinear IPS

For the nonlinear IPS, we consider the optimal control problem (29)-(30). For any admissible control, the corresponding cost function can be expressed as

$V[x(t)] = \int_t^{\infty}\left[f_{max}^2(x) + x^Tx + u^Tu\right]d\tau = \int_t^{t+T}\left[f_{max}^2(x) + x^Tx + u^Tu\right]d\tau + \int_{t+T}^{\infty}\left[f_{max}^2(x) + x^Tx + u^Tu\right]d\tau$ (39)

where $T > 0$ is an arbitrarily selected constant. We then obtain the integral reinforcement relation satisfied by the cost function:

$V[x(t)] = \int_t^{t+T}\left[f_{max}^2(x) + x^Tx + u^Tu\right]d\tau + V[x(t+T)]$ (40)
Based on the integral-based reinforcement relations (40) and the optimal control (34), the RL algorithm for the robust control of the nonlinear IPS is given below.
Algorithm 2. RL algorithm for robust control of the uncertain nonlinear IPS
(1)
Select a non-negative bounding function $f_{max}(x)$.
(2)
Select an initial stabilizing control law $u_0(x)$.
(3)
Policy evaluation: solve for $V_i(x)$ from $V_i[x(t)] = \int_t^{t+T}\left[f_{max}^2(x) + x^Tx + u_i^T(x)u_i(x)\right]d\tau + V_i[x(t+T)]$.
(4)
Policy improvement: $u_{i+1}(x) = -\frac{1}{2}B^T\nabla V_i$.
(5)
Set $i = i + 1$ and repeat steps 3 and 4 until $\|V_{i+1} - V_i\| \le \epsilon$, where $\epsilon > 0$ is a small constant.
In Algorithm 2, starting from an initial stabilizing control law, the algorithm iterates between steps 3 and 4 until convergence. We can then obtain the robust control law $u$ for system (27).
Remark 2.
Although it is possible to obtain $V_i$ by directly solving the HJB equation (35), this would require the state matrix of the nominal system (29) to be known. The implementation of Algorithm 2 does not require knowledge of the state matrix.
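Under the quadratic ansatz $V_i(x) = x^TS_ix$ (the form in which Section 6.2 reports its result $S_d$), step 3 of Algorithm 2 along the nominal system (29) reduces to a Lyapunov equation with the augmented weight $f_{max}^2(x) + x^Tx = x^T(vv^T + I)x$, and the iteration collapses to the model-based sketch below (an added cross-check, not the data-driven implementation):

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

A_bar = np.array([[0, 1, 0, 0],
                  [0, 0, 0.6293, 0],
                  [0, 0, 0, 1],
                  [0, 0, 27.8285, 0]])
B = np.array([[0.0], [0.8832], [0.0], [2.3566]])
v = np.array([0.5, 1.0, 2.0, -4.0])     # f_max(x) = |v^T x|
Q = np.outer(v, v) + np.eye(4)          # f_max^2(x) + x^T x as one quadratic form

K = np.array([[1.09, 4.123, -24.8908, -6.7726]])  # initial stabilizing policy

for _ in range(20):
    Ac = A_bar + B @ K                  # closed loop under u_i = K x
    S = solve_continuous_lyapunov(Ac.T, -(Q + K.T @ K))  # policy evaluation
    K_next = -(B.T @ S)                 # u_{i+1} = -(1/2) B^T grad V_i = -B^T S_i x
    if np.linalg.norm(K_next - K) < 1e-10:
        break
    K = K_next

print(np.round(K, 4))   # close to K_d = [1.1180 2.6229 -32.3006 -7.4126]
```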
Next, we prove the convergence of Algorithm 2. The following conclusion provides an equivalent form of the integral strengthening relation in step 3.
Lemma 7.
On assuming that $u_i(x)$ is a stabilizing control policy for the nominal system (29), solving for the value function $V_i(x)$ in step 3 of Algorithm 2 is equivalent to solving the following equation:

$f_{max}^2(x) + x^Tx + u_i^T(x)u_i(x) + \nabla V_i^T\left[\bar{A}x + Bu_i\right] = 0$ (41)
Proof. 
On dividing both sides of the equation in step 3 by $T$ and taking the limit, we obtain

$\lim_{T \to 0}\frac{V_i[x(t+T)] - V_i[x(t)]}{T} + \lim_{T \to 0}\frac{1}{T}\int_t^{t+T}\left[f_{max}^2(x) + x^Tx + u_i^T(x)u_i(x)\right]d\tau = 0$

Based on the definition of the derivative and L'Hôpital's rule, we obtain

$\frac{dV_i[x(t)]}{dt} + \lim_{T \to 0}\frac{d}{dT}\int_t^{t+T}\left[f_{max}^2(x) + x^Tx + u_i^T(x)u_i(x)\right]d\tau = 0$

Therefore,

$f_{max}^2(x) + x^Tx + u_i^T(x)u_i(x) + \nabla V_i^T\left[\bar{A}x + Bu_i\right] = 0$

which is exactly (41). Conversely, along the stable system $\dot{x} = \bar{A}x + Bu_i$, the time derivative of $V_i(x)$ is

$\frac{d}{dt}V_i(x) = \nabla V_i^T\left(\bar{A}x + Bu_i\right)$

On integrating both sides from $t$ to $t + T$, we obtain

$V_i[x(t+T)] - V_i[x(t)] = \int_t^{t+T}\nabla V_i^T\left(\bar{A}x + Bu_i\right)d\tau$

Then, substituting (41) into the right-hand side gives

$V_i[x(t)] = \int_t^{t+T}\left[f_{max}^2(x) + x^Tx + u_i^T(x)u_i(x)\right]d\tau + V_i[x(t+T)]$

which is the equation in the third step of Algorithm 2. This concludes the proof. □
According to the conclusions of [39,43], if a stabilizing initial control policy $u_0(x)$ is given, each subsequent control policy calculated using the iterative relation (34) together with Equation (41) is also stabilizing, and the iteratively computed cost-function sequence converges to the optimal value function. From Lemma 7, we know that Equation (41) and the equation of step 3 are equivalent. Therefore, the iteration between steps 3 and 4 of Algorithm 2 converges to the optimal control and the optimal cost function.

6. Numerical Simulation Results

In this section, two simulation examples are provided to demonstrate the feasibility of the theoretical results for the robust control of the uncertain IPS.

6.1. Example 1

Considering system (11), our objective is to obtain a robust control $u$ that keeps it stable for all $z \in [0, 1]$. Based on Lemmas 1, 2, and 3, the weighting matrix $Q$ is selected as

$Q = F + I = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1.81 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$
We choose the initial stabilizing control law

$u_0 = \begin{bmatrix} 1.0900 & 4.1230 & -24.8908 & -6.7726 \end{bmatrix}x$
The initial state of the nominal system is selected as $x_0 = [0 \;\; 1 \;\; 1 \;\; 1]^T$. The sampling period for collecting the system state and input information is set to 0.01 s. Algorithm 1 converges after six iterations, and the matrix $S_d$ and control gain $K_d$ converge to the following solutions:

$S_d = \begin{bmatrix} 2.4465 & 2.0822 & -6.2489 & -1.2066 \\ 2.0822 & 4.3346 & -14.0702 & -2.7082 \\ -6.2489 & -14.0702 & 100.8262 & 18.9646 \\ -1.2066 & -2.7082 & 18.9646 & 3.6646 \end{bmatrix}$

and

$K_d = \begin{bmatrix} 1.0044 & 2.5538 & -32.2652 & -6.2440 \end{bmatrix}$
Because $S_d$ is symmetric, it contains 10 independent numerical unknowns; in each iteration, at least this many data samples are collected to solve the associated least-squares problem. The evolution of the control signal $u$ is presented in Figure 5. Figure 6 illustrates the iterative convergence process of the $S$ matrix, where $S(i,j)$, $i,j = 1,2,3,4$, denotes the element in the $i$-th row and $j$-th column of the symmetric matrix $S$.
Using MATLAB to directly solve the ARE (18), we obtain the following optimal feedback gain and $S$ matrix; to avoid confusion, we denote them $K$ and $S$:

$S = \begin{bmatrix} 2.4455 & 2.0802 & -6.2326 & -1.2039 \\ 2.0802 & 4.3307 & -14.0381 & -2.7032 \\ -6.2326 & -14.0381 & 100.5677 & 18.9228 \\ -1.2039 & -2.7032 & 18.9228 & 3.6579 \end{bmatrix}$

$K = \begin{bmatrix} 1.0000 & 2.5455 & -32.1952 & -6.2327 \end{bmatrix}$
As is apparent, the results obtained using the RL method differ only marginally from those obtained by directly solving the ARE. Figure 7 presents the closed-loop trajectories of the system for $z = 0.1, 0.4, 0.7, 1.0$. It is easy to observe that the system is stable, which means that the controller is valid.
The eigenvalues of the closed-loop uncertain linear system with $u = Kx$ for different values of $z$ are listed in Table 2. From Table 2, we can observe that the eigenvalues of the closed-loop system all have negative real parts. Thus, the uncertain linear system (11) with the robust control $u = Kx$ is asymptotically stable for all $0 \le z \le 1$.
Table 2. Characteristic roots of system (11) when z takes different values.

z    λ1    λ2    λ3    λ4
0.1 -6.60 -4.23 -0.85+0.32i -0.85-0.32i
0.2 -6.73 -4.33 -0.78+0.43i -0.78-0.43i
0.3 -6.86 -4.41 -0.71+0.50i -0.71-0.50i
0.4 -7.00 -4.48 -0.65+0.55i -0.65-0.55i
0.5 -7.14 -4.54 -0.60+0.59i -0.60-0.59i
0.6 -7.28 -4.59 -0.55+0.62i -0.55-0.62i
0.7 -7.42 -4.63 -0.50+0.65i -0.50-0.65i
0.8 -7.56 -4.67 -0.46+0.67i -0.46-0.67i
0.9 -7.70 -4.70 -0.42+0.68i -0.42-0.68i
1.0 -7.84 -4.73 -0.38+0.69i -0.38-0.69i
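The eigenvalues in Table 2 can be reproduced with a few lines (an added check using the gain $K$ above):

```python
import numpy as np

B = np.array([[0.0], [0.8832], [0.0], [2.3566]])
K = np.array([[1.0000, 2.5455, -32.1952, -6.2327]])

for z in np.arange(0.1, 1.05, 0.1):
    A = np.array([[0, 1, 0, 0],
                  [0, -0.8832 * z, 0.6293, 0],
                  [0, 0, 0, 1],
                  [0, -2.3566 * z, 27.8285, 0]])
    print(round(z, 1), np.round(np.linalg.eigvals(A + B @ K), 2))
```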

6.2. Example 2

Let us consider the nonlinear IPS (27). According to Lemma 5, system (27) can be rewritten as

$\dot{x} = \bar{A}x + Bu + BG(z,x)$
The optimal control problem for the IPS is as follows: for nominal system (29), we find the control function u such that the performance index (30) achieves a minimum.
According to Lemma 6, we obtain

$\left\|G(z,x)\right\| \le \left|x_2 + 0.5x_1 + 2x_3 - 4x_4\right| = f_{max}(x)$

then

$f_{max}^2(x) = \left(0.5x_1 + x_2 + 2x_3 - 4x_4\right)^2 = x^T\begin{bmatrix} 0.25 & 0.5 & 1 & -2 \\ 0.5 & 1 & 2 & -4 \\ 1 & 2 & 4 & -8 \\ -2 & -4 & -8 & 16 \end{bmatrix}x$
According to the performance index (30), the weight matrix $Q$, collecting $f_{max}^2(x) + x^Tx$, is selected as

$Q = \begin{bmatrix} 1.25 & 0.5 & 1 & -2 \\ 0.5 & 2 & 2 & -4 \\ 1 & 2 & 5 & -8 \\ -2 & -4 & -8 & 17 \end{bmatrix}$
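In code, this $Q$ is assembled directly from the coefficient vector of $f_{max}$ (a short illustrative check):

```python
import numpy as np

v = np.array([0.5, 1.0, 2.0, -4.0])   # f_max(x) = |v^T x|
Q = np.outer(v, v) + np.eye(4)        # f_max^2(x) + x^T x = x^T Q x
print(Q)
```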
Based on Algorithm 2, we give the initial control policy

$u_0 = \begin{bmatrix} 1.0900 & 4.1230 & -24.8908 & -6.7726 \end{bmatrix}x$
The initial state of the system is selected as $x_0 = [0 \;\; 1 \;\; 1 \;\; 1]^T$. Algorithm 2 converges after six iterations, and the matrix $S_d$ and control gain $K_d$ converge to the following solutions:

$S_d = \begin{bmatrix} 2.4325 & 2.4398 & -6.2874 & -1.3888 \\ 2.4398 & 5.0469 & -14.0539 & -3.0045 \\ -6.2874 & -14.0539 & 130.4535 & 18.9735 \\ -1.3888 & -3.0045 & 18.9735 & 4.2715 \end{bmatrix}$

and

$K_d = \begin{bmatrix} 1.1180 & 2.6229 & -32.3006 & -7.4126 \end{bmatrix}$
The evolution of the control signal u is presented in Figure 8. Figure 9 presents the convergence process of the S d matrix.
We also selected $z = 0.1, 0.4, 0.7, 1.0$. Figure 10 presents the trajectories of the closed-loop system for these values of $z$.

7. Conclusions

In this paper, the robust control problem of the first-order IPS is studied. Linearized and nonlinear state-space representations are established, and RL algorithms for the robust control of the IPS are proposed. The controller of the uncertain system is obtained using online PI. The results show that the error between the controller obtained using the RL algorithm and that obtained by directly solving the ARE is very small. Moreover, the RL algorithm can effectively circumvent the "curse of dimensionality", and it can provide a controller that meets the requirements without the nominal system matrix being known. This improves on the current situation, in which the robust control of the IPS relies excessively on the nominal matrix. In future research, we intend to consider uncertainty in the input matrix of the system as well and to extend the RL algorithm to more general systems.

Author Contributions

All the authors contributed equally to the development of the research.

Funding

This research was funded by the National Natural Science Foundation of China under Grant No. 61463002.

Acknowledgments

The authors thank the journal editors and the reviewers for their helpful suggestions and comments.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Dai L. Strong decoupling in singular systems. Mathematical Systems Theory. 1989,22(1):275–2.
  2. Pal B, Chaudhuri B. Robust Control in Power Systems. Springer Science & Business Media.
  3. Liu L, Liu Y J, Tong S, et al. Relative threshold-based event-triggered control for nonlinear constrained systems with application to aircraft wing rock motion. IEEE Transactions on Industrial Informatics. 2022,18(2):911–921.
  4. Wang Z, She J, Wang F, et al. Further Result on Reducing Disturbance-Compensation Error of Equivalent-Input-Disturbance Approach. IEEE/ASME Transactions on Mechatronics (Early Access). [CrossRef]
  5. Yue D, Han QL, Lam J. Network-based robust H∞ control of systems with uncertainty. Automatica. 2005,41(6):999–1007.
  6. Grasser F, D’arrigo A, Colombi S, Rufer AC. JOE: a mobile, inverted pendulum. IEEE Transactions on industrial electronics. 2002,49(1):107–114.
  7. Stephenson A. On a new type of dynamical stability. Memoirs and Proceedings of the Manchester Literary and Philosophical Society. 1908,52:1–10.
  8. Housner GW. The behavior of inverted pendulum structures during earthquakes. Bulletin of the seismological society of America. 1963,53(2):403–417.
  9. Wang JJ. Simulation studies of inverted pendulum based on PID controllers. Simulation Modelling Practice and Theory. 2011,19(1):440–449.
  10. Prasad LB, Tyagi B, Gupta HO. Optimal Control of Nonlinear Inverted Pendulum System Using PID Controller and LQR: Performance Analysis Without and With Disturbance Input. International Journal of Automation and Computing. 2014,11:661–670.
  11. Li D, Chen H, Fan J, Shen C. A novel qualitative control method to inverted pendulum systems. IFAC Proceedings Volumes. 1999,32(2):1495–1500.
  12. Kwon T, Hodgins JK. Momentum-Mapped Inverted Pendulum Models for Controlling Dynamic Human Motions. ACM Transactions on Graphics (TOG). 2017,36(1):1–14.
  13. Yamakawa T. Stabilization of an inverted pendulum by a high-speed fuzzy logic controller hardware system. Fuzzy sets and Systems. 1989,32(2):161–180.
  14. Huang CH, Wang WJ, Chiu CH. Design and implementation of fuzzy control on a two-wheel inverted pendulum. IEEE Transactions on Industrial Electronics. 2010,58(7):2988–3001.
  15. Su X, Xia F, Liu J, Wu L. Event-triggered fuzzy control of nonlinear systems with its application to inverted pendulum systems. Automatica. 2018,94:236–248.
  16. Nasir ANK, Razak AAA. Opposition-based spiral dynamic algorithm with an application to optimize type-2 fuzzy control for an inverted pendulum system. Expert Systems with Applications. 2022,195:116661.
  17. Wai RJ, Chang LJ. Adaptive stabilizing and tracking control for a nonlinear inverted-pendulum system via sliding-mode technique. IEEE Transactions on Industrial Electronics. 2006,53(2):674–692.
  18. Huang J, Guan ZH, Matsuno T, Fukuda T, Sekiyama K. Sliding-mode velocity control of mobile-wheeled inverted-pendulum systems. IEEE Transactions on robotics. 2010,26(4):750–758.
  19. Jung S, Kim SS. Control experiment of a wheel-driven mobile inverted pendulum using neural network. IEEE Transactions on Control Systems Technology. 2008,16(2):297–303.
  20. Yang C, Li Z, Li J. Trajectory planning and optimized adaptive control for a class of wheeled inverted pendulum vehicle models. IEEE Transactions on Cybernetics. 2012,43(1):24–36.
  21. Yang C, Li Z, Cui R, Xu B. Neural network-based motion control of an underactuated wheeled inverted pendulum model. IEEE Transactions on Neural Networks and Learning Systems. 2014,25(11):2004–2016.
  22. Feng Lin and A. W. Olbrot, "An LQR approach to robust control of linear systems with uncertain parameters," Proceedings of 35th IEEE Conference on Decision and Control, Kobe, Japan, 1996, pp. 4158-4163 vol.4. [CrossRef]
  23. Lin F, Brandt RD. An optimal control approach to robust control of robot manipulators. IEEE Transactions on robotics and automation. 1998,14(1):69–77.
  24. Lin F. An optimal control approach to robust control design. International journal of control. 2000,73(3):177–186.
  25. Bellman R. Dynamic programming. Science. 1966,153(3731):34–37.
  26. Neustadt LW, Pontrjagin LS, Trirogoff K. The mathematical theory of optimal processes. Interscience, 1962.
  27. Powell WB. Approximate Dynamic Programming: Solving the curses of dimensionality. 703. John Wiley & Sons, 2007.
  28. Li H, Liu D. Optimal control for discrete-time affine non-linear systems using general value iteration. IET Control Theory & Applications. 2012,6(18):2725–2736.
  29. Wei Q, Liu D, Lin H. Value iteration adaptive dynamic programming for optimal control of discrete-time nonlinear systems. IEEE Transactions on cybernetics. 2015,46(3):840–853.
  30. Tesauro G. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural computation. 1994,6(2):215–219.
  31. Crites R, Barto A. Improving elevator performance using reinforcement learning. Advances in neural information processing systems. 1995,8.
  32. Zhang W, Dietterich T. High-performance job-shop scheduling with a time-delay TD(λ) network. Advances in neural information processing systems. 1995,8.
  33. Singh S, Bertsekas D. Reinforcement learning for dynamic channel allocation in cellular telephone systems. Advances in neural information processing systems. 1996,9.
  34. Mataric MJ. Reward Functions for Accelerated Learning. In: Cohen WW, Hirsh H, eds. Machine Learning Proceedings 1994. Morgan Kaufmann, 1994:181–189.
  35. Doya K. Reinforcement learning in continuous time and space. Neural computation. 2000,12(1):219–245.
  36. Krstic M, Kokotovic PV, Kanellakopoulos I. Nonlinear and adaptive control design. John Wiley & Sons, Inc., 1995.
  37. Ioannou P, Fidan B. Adaptive Control Tutorial. Vol. 11 of Advances in Design and Control. SIAM, Philadelphia, PA, USA, 2006.
  38. Åström KJ, Wittenmark B. Adaptive control. Courier Corporation, 2013.
  39. Vrabie D, Pastravanu O, Abu-Khalaf M, Lewis FL. Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica. 2009,45(2):477–484.
  40. Xu D, Wang Q, Li Y. Adaptive optimal control approach to robust tracking of uncertain linear systems based on policy iteration. Measurement and Control. 2021, 54(5-6): 668-680.
  41. Xu D, Wang Q, Li Y. Optimal guaranteed cost tracking of uncertain nonlinear systems using adaptive dynamic programming with concurrent learning. International Journal of Control, Automation and Systems. 2020,18(5):1116–1127.
  42. Kleinman D. On an iterative technique for Riccati equation computations. IEEE Transactions on Automatic Control. 1968,13(1):114–115.
  43. Abu-Khalaf M, Lewis FL. Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach. Automatica. 2005,41(5):779–791.
Figure 1. Inverted pendulum system diagram.
Figure 2. First-order inverted pendulum physical model.
Figure 3. Force analysis of the trolley.
Figure 4. Force analysis of the pendulum.
Figure 5. Control signal u of the linearized system.
Figure 6. S-matrix iterative process of the linearized system.
Figure 7. Trajectory of the closed-loop linearized system.
Figure 8. Control signal u of the nonlinear system.
Figure 9. S-matrix iterative process of the nonlinear system.
Figure 10. Trajectory of the closed-loop nonlinear system.