3.1. Q-Learning
The Q-learning (QL) algorithm uses the concept of reward and punishment created by the environment.
Figure 2 illustrates how a mobile robot selects an action based on the appropriate policy, executes that action, and receives a state (s) and reward (r) from the navigation environment. A state encodes the robot's current position in its workspace during path optimization, while an action is a movement by which the robot transitions from one state to another.
The Q value is constructed so that the robot can decide which action yields the greatest reward. It is calculated as follows:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] \qquad (9)$$
where: α represents the learning rate, γ is the discount factor, max_{a'} Q(s', a') signifies the maximum Q value among all feasible actions in the new state s', and r denotes the immediate reward/penalty earned by the agent after executing a move at state s.
Eq. (9) builds the state-action matrix Q, which acts as a lookup table. From this table, for each state of the robot, the action with the largest Q value is selected (Figure 3).
Reinforcement learning is a stochastic process, so the Q value of a state-action pair differs before and after an action is taken; this difference is called the temporal difference. Thus, the matrix Q updates its entries based on Eq. (11):
$$Q^{new}(s, a) = (1 - \alpha)\, Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') \right] \qquad (11)$$
where: α is the learning-rate coefficient. As the robot repeatedly performs actions, the Q values gradually converge.
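For illustration only, assume α = 0.1, γ = 0.9, a current value Q(s, a) = 0.5, an immediate reward r = 1, and max_{a'} Q(s', a') = 0.8 (numbers chosen solely for this example). One application of Eq. (11) then gives:
$$Q^{new}(s, a) = (1 - 0.1) \times 0.5 + 0.1\,[\,1 + 0.9 \times 0.8\,] = 0.45 + 0.172 = 0.622,$$
so the estimate for this pair moves a small step toward the discounted target.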
The pseudocode of the Q-learning algorithm for robot path finding is expressed as follows:
Algorithm 1: Classical Q-Learning algorithm
begin
    Initialization: Q(s, a) = 0, ∀s ∈ S, ∀a ∈ A (n states and m actions)
    for (each episode):
        (1) Set s ← a random state from the state set S;
        while (s ≠ Goal state)
            (2) Choose a in s by using an adequate policy (ε-greedy, etc.);
            (3) Perform action a, and receive the reward/penalty r and the new state s';
            (4) Update Q(s, a) using equation (9); set s ← s';
        end-while
    end-for
end
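As a minimal illustration of Algorithm 1, the following Python sketch runs tabular Q-learning on an assumed 5×5 grid world; the grid size, reward values, and hyperparameters are choices made only for this example and are not taken from the paper.

import numpy as np

# Minimal sketch of Algorithm 1 on an assumed 5x5 grid world.
# States are grid cells, actions are up/down/left/right.
N_ROWS, N_COLS, N_ACTIONS = 5, 5, 4
GOAL = (4, 4)
ALPHA, GAMMA, EPSILON, EPISODES = 0.1, 0.9, 0.1, 500

Q = np.zeros((N_ROWS * N_COLS, N_ACTIONS))      # lookup table Q(s, a)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]      # up, down, left, right

def step(state, action):
    """Apply an action and return (next_state, reward); reward scheme is assumed."""
    r, c = divmod(state, N_COLS)
    dr, dc = MOVES[action]
    nr = min(max(r + dr, 0), N_ROWS - 1)
    nc = min(max(c + dc, 0), N_COLS - 1)
    next_state = nr * N_COLS + nc
    reward = 10.0 if (nr, nc) == GOAL else -1.0  # small penalty per move, bonus at goal
    return next_state, reward

rng = np.random.default_rng(0)
goal_state = GOAL[0] * N_COLS + GOAL[1]

for episode in range(EPISODES):
    s = rng.integers(N_ROWS * N_COLS)            # (1) random start state
    while s != goal_state:
        # (2) epsilon-greedy action selection
        a = rng.integers(N_ACTIONS) if rng.random() < EPSILON else int(np.argmax(Q[s]))
        # (3) perform the action, receive reward and next state
        s_next, r = step(s, a)
        # (4) temporal-difference update of Q(s, a), as in Eqs. (9)/(11)
        Q[s, a] += ALPHA * (r + GAMMA * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print("Greedy action per state:")
print(np.argmax(Q, axis=1).reshape(N_ROWS, N_COLS))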
The size of the Q-table increases exponentially with the number of states and actions in complex environments. In this situation, the process becomes computationally expensive and requires considerable memory to hold the Q values. Imagine a game with 1000 states in which each state has 1000 possible actions: a table with one million cells would be required. Given the vast amount of computational time required [31], one of the main problems when using the QL algorithm in path optimization is that visiting all the state-action pairs during exploration is difficult, which affects the convergence of the trajectory.
3.2. Deep Q-Learning
The DQL algorithm replaces the regular Q-table with a neural network. Instead of mapping a (state, action) pair to a Q-value, the neural network maps the input state to (action, Q-value) pairs, as shown in
Figure 4.
The state data are used as the input to the neural network, and each separate output node of the network corresponds to the Q value of one action. The network therefore predicts the Q value of every individual action for the given state.
The proposed deep neural network model has four layers: one input layer, two hidden layers, and one output layer (Figure 5). The first hidden layer contains 1856 trainable parameters, corresponding to 64 neurons fully connected to the 28 laser-sensor inputs. The second hidden layer contains 4160 trainable parameters, corresponding to 64 neurons fully connected to the 64 outputs of the first hidden layer.
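A possible sketch of this architecture in Python with PyTorch is given below; the use of PyTorch, the ReLU activations, and the number of output actions are assumptions made for illustration, since the paper specifies only the 28 laser-sensor inputs and the two hidden layers of 64 neurons.

import torch
import torch.nn as nn

N_LASER_INPUTS = 28   # laser-sensor readings (from the paper)
N_ACTIONS = 5         # number of discrete robot actions -- assumed for illustration

class DQN(nn.Module):
    """Four-layer network: input, two hidden layers of 64 neurons, output layer."""
    def __init__(self, n_inputs: int = N_LASER_INPUTS, n_actions: int = N_ACTIONS):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_inputs, 64),   # 28*64 + 64 = 1856 trainable parameters
            nn.ReLU(),                 # activation assumed, not stated in the paper
            nn.Linear(64, 64),         # 64*64 + 64 = 4160 trainable parameters
            nn.ReLU(),
            nn.Linear(64, n_actions),  # one Q-value per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

model = DQN()
q_values = model(torch.randn(1, N_LASER_INPUTS))  # Q-value for every action in this state
print(q_values.shape)                             # torch.Size([1, 5])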
The pseudocode of the DQL algorithm is as follows:
Input: training data, learning rate α, discount factor γ, ε-greedy policy, robot pose, safety constraints
Output: Q(s, a; θ), states s ∈ S, actions a ∈ A, weights θ
Begin
    Initialize the replay memory D to capacity N
    Initialize Q(s, a; θ) with random weights θ
    Initialize the target network Q̂(s, a; θ⁻) with random weights θ⁻
    for episode = 1, M do
        Randomly set the robot's pose in the scenario
        Observe the initial state s of the robot
        for t = 1, T do
            Select an action a_t:
                with probability ε select a random action a_t;
                otherwise select a_t = argmax_a Q(s_t, a; θ)
            Execute action a_t, observe the next state s_{t+1}, and compute the reward r_t
            Store the transition (s_t, a_t, r_t, s_{t+1}) in the replay memory D
            Sample a random minibatch of transitions (s_j, a_j, r_j, s_{j+1}) from D
            Calculate the predicted value Q(s_j, a_j; θ)
            Calculate the target value y_j for each minibatch transition:
                if s_{j+1} is a terminal state then y_j = r_j
                otherwise y_j = r_j + γ max_{a'} Q̂(s_{j+1}, a'; θ⁻)
            Train the neural network by minimizing (y_j − Q(s_j, a_j; θ))²
        end for
    end for
End
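The action-selection branch of this pseudocode (a random action with probability ε, otherwise the action with the largest predicted Q value) can be sketched in Python as follows; the toy network at the end is only a stand-in for the trained model and is an assumption for this example.

import random
import torch

def select_action(model: torch.nn.Module, state: torch.Tensor,
                  epsilon: float, n_actions: int) -> int:
    """Epsilon-greedy policy: explore with probability epsilon,
    otherwise pick the action whose predicted Q-value is largest."""
    if random.random() < epsilon:
        return random.randrange(n_actions)      # random exploratory action
    with torch.no_grad():                       # greedy action from the network
        return int(model(state.unsqueeze(0)).argmax(dim=1).item())

# Usage with any Q-network mapping a state vector to one Q-value per action:
net = torch.nn.Sequential(torch.nn.Linear(28, 64), torch.nn.ReLU(), torch.nn.Linear(64, 5))
action = select_action(net, torch.randn(28), epsilon=0.1, n_actions=5)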
The robot makes decisions and performs actions according to the Q-based (ε-greedy) policy. A mobile robot must consider not only short-term gains but also long-term ones, that is, the rewards it might receive at future time steps. In addition, because the environment is unpredictable, the mobile robot can never be sure of receiving the same reward the next time it performs the same actions, and this uncertainty grows the further the robot looks into the future. Therefore, we employ a discounted future reward in this study. The discounted return at time t is calculated with the discount factor γ as follows:
$$R_t = \sum_{k=0}^{T-t} \gamma^k \, r_{t+k}$$
where: r_t is the immediate reward and T is the time step at which the robot's actions end; the further in the future a reward lies, the less the robot considers it.
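For example, assuming γ = 0.9 and a short episode with rewards r_t = 1, r_{t+1} = 1, and r_{t+2} = 10 (values chosen only for illustration), the discounted return is
$$R_t = 1 + 0.9 \times 1 + 0.9^2 \times 10 = 1 + 0.9 + 8.1 = 10.0,$$
so the distant reward of 10 contributes only 8.1 to the return at time t.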
The goal of the robot is to interact with the environment by choosing actions that maximize future rewards. To do this, we use a technique called experience replay, in which the robot's experience at each time step is recorded in a data set pooled over many learning cycles (episodes) and stored in the replay memory at the end of each cycle.
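A minimal sketch of such a replay memory in Python is shown below; the fixed capacity and the (s, a, r, s', done) transition layout are common conventions assumed here rather than details taken from the paper.

import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity store of transitions (s, a, r, s_next, done);
    the oldest entries are discarded once the capacity N is reached."""
    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        """Return a random minibatch of stored transitions."""
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

# Usage: record one experience per time step, sample minibatches during training.
memory = ReplayMemory(capacity=10_000)
memory.push([0.1] * 28, 2, -1.0, [0.2] * 28, False)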
To update the weights of the neural network, random transitions are first sampled from the replay memory D, which has a finite size N. For each sampled transition, the algorithm performs the following steps:
- Step 1: Pass the current state s through the neural network to obtain the predicted value Q(s, a; θ).
- Step 2: If the sampled transition is a collision sample, the target value for this pair (s, a) is directly set to the termination reward. Otherwise, a forward pass is performed for the next state s', the maximum network output over all actions is calculated, and the target for the executed action is computed using the Bellman equation (r + γ max_{a'} Q(s', a'; θ)). For all other actions, the target value is set equal to the prediction returned in Step 1.
- Step 3: The Q-learning update uses the following loss function:
$$L(\theta) = \left( y - Q(s, a; \theta) \right)^2$$
where y is the target value computed in Step 2.
The neural-network weights are then updated from this loss function through backpropagation and stochastic gradient descent. When training is over, the mobile robot stores the learned neural network and uses it for subsequent testing and operation.
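Steps 1-3 and the loss above can be sketched as a single training update in Python with PyTorch; for simplicity this sketch uses one network for both prediction and target (no separate target network), and the toy network, optimizer, and minibatch at the end are assumptions made only for this illustration.

import torch
import torch.nn as nn

def train_step(model: nn.Module, optimizer: torch.optim.Optimizer,
               batch, gamma: float = 0.9) -> float:
    """One Q-learning update on a minibatch of (s, a, r, s_next, done) tensors."""
    states, actions, rewards, next_states, dones = batch

    # Step 1: forward pass for the current states -> predicted Q(s, a; theta)
    predicted = model(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Step 2: build the target; terminal (collision) samples keep only the reward,
    # otherwise use the Bellman target r + gamma * max_a' Q(s', a'; theta)
    with torch.no_grad():
        max_next_q = model(next_states).max(dim=1).values
        target = rewards + gamma * max_next_q * (1.0 - dones)

    # Step 3: squared error between target and prediction,
    # then backpropagation and a stochastic gradient descent step
    loss = nn.functional.mse_loss(predicted, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with a toy network and a random minibatch of 4 transitions:
net = nn.Sequential(nn.Linear(28, 64), nn.ReLU(), nn.Linear(64, 5))
opt = torch.optim.SGD(net.parameters(), lr=1e-3)
batch = (torch.randn(4, 28), torch.randint(0, 5, (4,)), torch.randn(4),
         torch.randn(4, 28), torch.zeros(4))
train_step(net, opt, batch, gamma=0.9)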