4.1. Planar two-link manipulator
The planar two-link manipulator simulation system is shown in Figure 2; the simulation platform is V-REP PRO EDU 3.6.0. The two-link manipulator is configured in the simulation environment as follows. The lengths of the two links are $l_1$ and $l_2$, and their masses are $m_1$ and $m_2$. Each joint adopts an incremental control method: at every time step $t$, each joint rotates by a fixed angle $\Delta\theta$ in the direction (positive or negative) given by its control signal $a_t$. The state of the entire simulation system is $s_t = [\theta_1, \dot{\theta}_1, \theta_2, \dot{\theta}_2, p_t, p^{*}]$, where $\theta_i$ and $\dot{\theta}_i$ are the angle and angular velocity of each joint at the $t$-th time step, $p_t$ is the position of the end point of the manipulator at time $t$, and $p^{*}$ is the desired target point position, which is specified in advance.
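To make this setup concrete, the following is a minimal Python sketch of how the 8-dimensional state and the incremental joint update could be assembled; the link parameters, the fixed step angle, and the forward-kinematics helper are illustrative assumptions, not values reported in the paper.

```python
import numpy as np

# Hypothetical placeholders -- the paper does not report these exact values.
L1, L2 = 0.5, 0.5            # link lengths l1, l2 (placeholder)
DELTA_THETA = 0.05           # fixed incremental rotation per step (placeholder, rad)

def forward_kinematics(theta1, theta2):
    """End-point position of a planar two-link arm."""
    x = L1 * np.cos(theta1) + L2 * np.cos(theta1 + theta2)
    y = L1 * np.sin(theta1) + L2 * np.sin(theta1 + theta2)
    return np.array([x, y])

def step(theta, theta_dot, action, target, dt=0.05):
    """Apply one incremental control step.

    action: array of +1/-1 directions, one per joint.
    target: desired target point position p*, array-like of length 2.
    Returns the next 8-dimensional state [theta, theta_dot, p, p*].
    """
    theta_new = theta + DELTA_THETA * np.sign(action)
    theta_dot_new = (theta_new - theta) / dt      # crude finite-difference velocity
    p = forward_kinematics(*theta_new)
    return np.concatenate([theta_new, theta_dot_new, p, np.asarray(target, dtype=float)])
```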
This experiment is mainly used to verify the feasibility of reinforcement learning for robotic-arm path tracking. Because the experimental environment is simple and the action dimension of the two-link arm is low, the action can be treated as a discrete quantity, so the classic DQN algorithm is adopted. The policy network in DQN is structured as follows: the input state is 8-dimensional, the output action is 2-dimensional, and there are two hidden layers with 50 nodes each. The hyperparameters are set as: replay buffer = 1e6, learning rate = 3e-4, discount factor = 0.99, batch size = 64; the Q network and the target Q network are synchronized by soft updates with soft parameter tau = 0.001.
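For reference, a minimal PyTorch sketch of this Q-network and the soft-update rule is given below; the ReLU activation and the Adam optimizer are assumptions, since the text does not specify them.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Q-network: 8-dimensional state in, 2-dimensional output (as stated in the text)."""
    def __init__(self, state_dim=8, action_dim=2, hidden=50):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state):
        return self.net(state)

# Hyperparameters reported in the text.
REPLAY_BUFFER_SIZE = int(1e6)
LEARNING_RATE = 3e-4
GAMMA = 0.99
BATCH_SIZE = 64
TAU = 0.001  # soft-update coefficient

q_net = QNetwork()
target_q_net = QNetwork()
target_q_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=LEARNING_RATE)

def soft_update(target, source, tau=TAU):
    """target <- tau * source + (1 - tau) * target."""
    for t_param, s_param in zip(target.parameters(), source.parameters()):
        t_param.data.copy_(tau * s_param.data + (1.0 - tau) * t_param.data)
```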
In addition, the reward at each step is determined by the distance between the target path point $p^{*}_{t}$ at time $t$ and the position $p_{t}$ of the end point of the robot arm: the closer the end point is to the target path point, the larger the reward. The tracking curve obtained in this experiment is shown in Figure 3.
The red line is the desired target path, and the blue line is the actual path of the end point. From the tracking results in Figure 3, it can be seen that the method based on deep reinforcement learning achieves accurate tracking of the target path.
Figure 4 shows the experimental results in the simulation environment:
The experimental results show that it is completely feasible to use the deep reinforcement learning algorithm to achieve path tracking on a simple two-link manipulator.
4.2. UR5 manipulator
Having explored the application of reinforcement learning to path tracking and successfully applied it to the two-link manipulator, we further explore path tracking under continuous control on a multi-degree-of-freedom manipulator, the UR5. The UR5 simulation system is shown in Figure 5, and the simulation platform again uses V-REP PRO EDU 3.6.0. In this system, a desired path is first generated by a traditional path-generation algorithm in the presence of obstacles, and the deep reinforcement learning algorithm is then used to track that path. The system actions are the joint control increments $a_t$, and the state is set as $s_t = [\theta_i, \dot{\theta}_i, d_t]$, where $\theta_i$ and $\dot{\theta}_i$ are the angle and angular velocity of the $i$-th joint and $d_t$ is the distance between the end point $p$ and the corresponding desired target point $p^{*}$. The initial position of the end point is [-0.095, -0.160, 0.892], and the initial position of the target point is [-0.386, 0.458, 0.495]. The desired path is generated by the traditional RRT [26] path-generation algorithm with the stride set to 100.
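For context, a generic task-space RRT sketch of the kind cited in [26] is shown below; it is not the authors' implementation, and the collision checker `is_free`, the sampling bounds, the goal bias, and the step size are illustrative assumptions.

```python
import numpy as np

def rrt(start, goal, is_free, step_size=0.05, goal_tol=0.05, max_iters=5000,
        bounds=((-1.0, 1.0), (-1.0, 1.0), (0.0, 1.5)), rng=None):
    """Generic RRT in Cartesian end-point space.

    is_free(p): hypothetical collision checker, True if point p is obstacle-free.
    Returns a list of waypoints from start to goal, or None on failure.
    """
    rng = rng if rng is not None else np.random.default_rng()
    nodes = [np.asarray(start, dtype=float)]
    parents = [-1]
    for _ in range(max_iters):
        # Sample a random point, occasionally biased towards the goal.
        if rng.random() < 0.1:
            sample = np.asarray(goal, dtype=float)
        else:
            sample = np.array([rng.uniform(lo, hi) for lo, hi in bounds])
        # Extend the nearest existing node towards the sample by one step.
        nearest = int(np.argmin([np.linalg.norm(sample - n) for n in nodes]))
        direction = sample - nodes[nearest]
        norm = np.linalg.norm(direction)
        if norm < 1e-9:
            continue
        new_point = nodes[nearest] + step_size * direction / norm
        if not is_free(new_point):
            continue
        nodes.append(new_point)
        parents.append(nearest)
        # Goal reached: backtrack through parents to recover the path.
        if np.linalg.norm(new_point - np.asarray(goal, dtype=float)) < goal_tol:
            path, idx = [np.asarray(goal, dtype=float)], len(nodes) - 1
            while idx != -1:
                path.append(nodes[idx])
                idx = parents[idx]
            return path[::-1]
    return None
```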
In addition, this experiment sets up four additional experimental variables to explore their impact on tracking performance (a combined code sketch of these settings is given after the list):
Upper-level control method. The manipulator is controlled with one of two upper-level control methods, position control or velocity control. Position control commands the joint angle: the input action is the increment of the joint angle, and the range of the increment at each step is set to [-0.05, 0.05] rad. Velocity control commands the joint angular velocity: the input action is the increment of the joint angular velocity, and the increment range at each step is set to [-0.8, 0.8] rad/s. In both cases, the low-level control of the manipulator uses a traditional PID torque control algorithm.
Adding noise to the observations. We set up two groups of control experiments, one of which adds random noise to the observations; the noise is drawn from the standard normal distribution N(0, 1) and scaled to 0.005 * N(0, 1).
Setting the time interval distance n. The target path point given to the manipulator at every n time steps is the target path point at time N*n, where N = 1, 2, 3, ..., and we study the effect of different interval distances on the tracking results. In our experiments, we set interval = 0, 5, 10, respectively.
Terminal reward. We set up a control experiment in which, during training, an additional reward of +5 is given when the distance between the end point of the robotic arm and the target point is within 0.05 m (i.e., the termination condition is met), to study its impact on the tracking results.
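A combined sketch of these four experimental settings (control-increment clipping, observation noise, interval-distance target indexing, and the optional terminal bonus) is given below; the indexing rule in `target_index` is our reading of the interval-distance description, and everything beyond the constants quoted from the text is an assumption.

```python
import math
import numpy as np

# Constants quoted from the text; the surrounding logic is a sketch.
POS_INCREMENT_RANGE = (-0.05, 0.05)   # position control: joint-angle increment [rad]
VEL_INCREMENT_RANGE = (-0.8, 0.8)     # velocity control: angular-velocity increment [rad/s]
NOISE_SCALE = 0.005                   # observation noise: 0.005 * N(0, 1)
TERMINAL_TOL = 0.05                   # distance to target that triggers the bonus [m]
TERMINAL_BONUS = 5.0                  # optional extra reward

def clip_action(action, mode="position"):
    """Clip the commanded increment to the range of the chosen control mode."""
    low, high = POS_INCREMENT_RANGE if mode == "position" else VEL_INCREMENT_RANGE
    return np.clip(action, low, high)

def noisy_observation(obs, rng, add_noise=True):
    """Optionally perturb the observation with 0.005 * N(0, 1) noise."""
    return obs + NOISE_SCALE * rng.standard_normal(obs.shape) if add_noise else obs

def target_index(t, interval):
    """Target path-point index at step t >= 1: the point at N*n with N = ceil(t / n).
    interval = 0 is treated as advancing the target every step."""
    n = max(interval, 1)
    return math.ceil(t / n) * n

def terminal_bonus(distance_to_goal, use_bonus):
    """Extra +5 reward once the end point is within 0.05 m of the target."""
    return TERMINAL_BONUS if (use_bonus and distance_to_goal < TERMINAL_TOL) else 0.0
```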
The continuous-control reinforcement learning algorithm SAC is used in this experiment. All networks have the same structure: each network contains two hidden layers with 200 nodes per layer, and the hidden-layer activation function is ReLU. The hyperparameters are set as: replay buffer = 1e6, discount factor = 0.99, batch size = 128; the Q network and the target Q network are synchronized by soft updates with soft parameter tau = 0.01; the learning rates of the actor and critic networks are both 1e-3; and the weight coefficient of the policy entropy throughout training is α = 1e-3. The reward for this experiment is again defined in terms of the distance between the end point of the manipulator and the target path point selected with interval distance n. In addition, an episode terminates when the robot arm has run for 100 steps or the distance between the end point of the robot arm and the target point is within 0.05 m.
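The exact reward expression is not reproduced above, so the sketch below uses the negative end-point-to-target distance as a stand-in together with the stated termination rules, alongside the SAC hyperparameters listed in the text.

```python
import numpy as np

# SAC hyperparameters listed in the text.
SAC_CONFIG = dict(
    hidden_layers=(200, 200), activation="ReLU",
    replay_buffer=int(1e6), discount=0.99, batch_size=128,
    tau=0.01, actor_lr=1e-3, critic_lr=1e-3, entropy_alpha=1e-3,
)

MAX_STEPS = 100   # an episode ends after 100 steps ...
GOAL_TOL = 0.05   # ... or once the end point is within 0.05 m of the target

def reward_and_done(endpoint, target_point, step):
    """Per-step reward and termination check; the negative end-point-to-target
    distance is a stand-in for the paper's reward, which is not reproduced here."""
    dist = float(np.linalg.norm(np.asarray(endpoint) - np.asarray(target_point)))
    return -dist, (step >= MAX_STEPS or dist < GOAL_TOL)
```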
The experimental results are shown in the following figures.
Figure 6(a) and (b) are the path tracking results in the position control mode without and with observation noise, respectively; Figure 6(c) and (d) are the path tracking results in the velocity control mode without and with observation noise, respectively. Different time intervals are set in each panel; in each panel, the upper three curves are the results without the terminal reward, and the lower three curves are the results with the terminal reward.
In addition, we quantitatively analyze the tracking results, computing the average error between the obtained path and the target path under the different experimental conditions, and the average distance between the end point of the manipulator and the target point at the final time step. The results are shown in Table 1 and Table 2.
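Assuming the recorded path and the target path are stored as matched arrays of points, these two quantities can be computed as in the following sketch.

```python
import numpy as np

def tracking_metrics(tracked_path, target_path, final_endpoint, target_point):
    """Average point-wise tracking error and final end-point-to-target distance.

    tracked_path, target_path: (M, 3) arrays of corresponding path points.
    """
    tracked_path, target_path = np.asarray(tracked_path), np.asarray(target_path)
    avg_error = float(np.mean(np.linalg.norm(tracked_path - target_path, axis=1)))
    final_dist = float(np.linalg.norm(np.asarray(final_endpoint) - np.asarray(target_point)))
    return avg_error, final_dist
```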
The training process curve is shown in
Figure 7:
From the path tracking results, it can be found that:
In this experimental scenario, the target path generated by the RRT algorithm is noticeably non-smooth, whereas the tracking path generated by the reinforcement learning algorithm SAC from this target path is very smooth while still satisfying the tracking accuracy.
By analyzing the influence of the value of n on the generated path, it can be found that when n is too large (n = 10), the approach to the target point is better but the tracking of the target path is poorer; when n = 1, the situation is reversed. Therefore, when selecting the value of n, it is necessary to balance the path tracking quality against the final position of the end point.
Adding noise to the observations of the system during simulation training helps to improve the robustness of the control strategy and its resistance to noise, so that the strategy performs better.
During simulation training, giving an additional, larger reward when the end point of the manipulator reaches the allowable error range of the target point improves how closely the robotic arm approaches the target point, but it sacrifices some path tracking precision.
The experiments show that the algorithm achieves good results in both position control and velocity control, and the training curves show that learning converges early in training.
In addition, the path tracking experiments based on deep reinforcement learning do not use a system dynamics model. To verify this advantage of the deep-reinforcement-learning-based method, we further explore the influence of changes in the dynamic characteristics on the experimental results: we change the mass of the end effector and test the trained model. The experimental results are shown in
Table 3 and
Table 4.
The experimental results show that a model trained with a fixed load mass still achieves sufficiently stable path tracking based on deep reinforcement learning when the load changes; that is, the change in dynamic characteristics does not affect the effectiveness of the algorithm.
Furthermore, we also compare the proposed algorithm with the traditional inverse kinematics method in terms of energy consumption and trajectory smoothness. The experimental results are shown in Table 5 and Table 6. The energy consumed by the manipulator over the entire path tracking process is computed as
$$W = \sum_{i=1}^{M} \sum_{j} \tau_{ij}\,\dot{q}_{ij}\,\Delta d_{i},$$
where $i$ denotes the $i$-th path point in the entire path, $j$ denotes the $j$-th joint of the manipulator, $\tau_{ij}$ and $\dot{q}_{ij}$ are the joint torque and joint speed, $M$ is the number of path points, and $\Delta d_{i}$ is the distance between adjacent path points.
The smoothness of the trajectory is measured by the angle between the tangent vectors at adjacent points of the curve; the smoothness of the robotic arm's motion is evaluated from the mean of these turning angles over the entire trajectory.
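Under the reconstruction of the energy measure above and the turning-angle definition of smoothness, both quantities can be sketched as follows; the absolute value in the energy sum and the array layout are assumptions.

```python
import numpy as np

def energy_consumption(torques, velocities, path):
    """Energy proxy: sum over path points i and joints j of |tau_ij * qdot_ij| * d_i,
    where d_i is the distance between consecutive path points.

    torques, velocities: (M, J) arrays; path: (M, 3) array of end-point positions.
    """
    torques = np.asarray(torques, dtype=float)
    velocities = np.asarray(velocities, dtype=float)
    path = np.asarray(path, dtype=float)
    seg = np.linalg.norm(np.diff(path, axis=0), axis=1)      # distances between path points
    seg = np.append(seg, 0.0)                                # pad so shapes match
    power = np.sum(np.abs(torques * velocities), axis=1)     # per-point joint "power"
    return float(np.sum(power * seg))

def mean_turning_angle(path):
    """Mean angle between tangent vectors of adjacent segments (smaller = smoother).
    Requires at least three path points."""
    path = np.asarray(path, dtype=float)
    tangents = np.diff(path, axis=0)
    tangents /= np.linalg.norm(tangents, axis=1, keepdims=True) + 1e-12
    cosines = np.clip(np.sum(tangents[:-1] * tangents[1:], axis=1), -1.0, 1.0)
    return float(np.mean(np.arccos(cosines)))
```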
The experimental results show that the algorithm proposed in this paper is superior to the traditional inverse kinematics method in terms of energy consumption and trajectory smoothness.
4.3. Redundant manipulator
The algorithm proposed in this paper has been experimentally verified and analyzed on the UR5 manipulator, and the results show that it can effectively solve the manipulator path tracking problem. To further verify the effectiveness and generalization of the algorithm, we also conduct verification on a redundant manipulator. The 7-DOF redundant manipulator simulation system is shown in Figure 8. The simulation platform again uses V-REP PRO EDU 3.6.0, and the simulated manipulator is the KUKA LBR iiwa 7 R800 redundant manipulator. The actions are again the joint control increments $a_t$, and the state is set as $s_t = [\theta_i, \dot{\theta}_i, d_t]$, where $\theta_i$ and $\dot{\theta}_i$ are the angle and angular velocity of the $i$-th joint and $d_t$ is the distance between the end point $p$ of the manipulator and the corresponding desired target point $p^{*}$. The initial position of the end point is [0.0044, 0.0001, 1.1743], and the initial position of the target point is [0.0193, 0.4008, 0.6715]. The expected path is an arc trajectory generated by the path-generation algorithm, with the step size set to 50.
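As an illustration, a 50-point arc between the stated initial end-point and target point can be generated as follows; the arc's center (and hence its plane and curvature) is not given in the paper, so the `center` used here is a hypothetical placeholder.

```python
import numpy as np

def arc_path(start, end, center, num_points=50):
    """Discretize an arc from start to end about a chosen center.

    Uses spherical-style interpolation of the radius vectors; if start and end
    are not equidistant from the center, the radius is interpolated as well.
    """
    start, end, center = (np.asarray(p, dtype=float) for p in (start, end, center))
    u, v = start - center, end - center
    omega = np.arccos(np.clip(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)), -1.0, 1.0))
    ts = np.linspace(0.0, 1.0, num_points)
    if omega < 1e-9:                      # degenerate case: radius vectors aligned
        return center + np.outer(1 - ts, u) + np.outer(ts, v)
    w = (np.outer(np.sin((1 - ts) * omega), u) + np.outer(np.sin(ts * omega), v)) / np.sin(omega)
    return center + w

# Example with the end-point and target positions reported in the text and a
# hypothetical arc center (an assumption -- the paper does not state it).
path = arc_path([0.0044, 0.0001, 1.1743], [0.0193, 0.4008, 0.6715],
                center=[0.0, 0.2, 0.9], num_points=50)
```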
The setup of the redundant manipulator path tracking experiment is exactly the same as that of the UR5. The experiment again uses the continuous-control reinforcement learning algorithm SAC, and all network structures are the same as in the UR5 setup: each network contains two hidden layers with 200 nodes per layer, and the hidden-layer activation function is ReLU. The hyperparameters are also: replay buffer = 1e6, discount factor = 0.99, batch size = 128; the Q network and the target Q network are synchronized by soft updates with soft parameter tau = 0.01; the learning rates of the actor and critic networks are both 1e-3; and the weight coefficient of the policy entropy throughout training is α = 1e-3.
The reward settings are also the same as before. The only difference is that the redundant manipulator path tracking experiment is intended to verify the generalization of the algorithm, so it does not examine results under varying hyperparameter conditions; accordingly, the interval distance in the reward setting defaults to n = 1.
Since this is a verification experiment, it only explores the path tracking results in the velocity control mode. The path tracking results of the redundant manipulator are shown in
Figure 9.
In addition, considering the randomness of sampling in deep reinforcement learning, and in order to demonstrate the stability of our method, we also carried out multiple experiments under multiple random-seed settings. The training curves of these experiments are shown in
Figure 10.
The experimental results show that our method still achieves a good tracking effect on the redundant manipulator, and the training results under different random-seed settings show that our method is reliable in terms of generalization and stability.