3.1. Contrastive Language-Image Pre-Trained Model
The CLIP model leverages the relationships between images and their corresponding text descriptions to achieve zero-shot learning capabilities. In this section, we describe the core components and mechanisms of the CLIP model and how it contributes to our proposed method. As shown in Figure 1, the model comprises the following components:
Images I: training-set images with their corresponding text descriptions.
Semantic attributes T: text descriptions, including those that match the images and those that do not.
Vision transformer [23]: encodes image information into vectors.
Text encoder [24]: transforms semantic attributes into text embeddings.
Shared space: a mapping relationship formed by contrastive learning.
The similarity between an image and a text description is then quantified by a similarity metric S, typically the cosine similarity, which is computed as:
$$S(I, T) = \frac{z_I \cdot z_T}{\lVert z_I \rVert \, \lVert z_T \rVert}$$
where $z_I$ and $z_T$ denote the image and text embeddings produced by the vision transformer and the text encoder, respectively.
Elements in the shared space are separated into corresponding pairs and non-corresponding ones. During training, the model is optimized to maximize the similarity between corresponding image-text pairs while minimizing the similarity between non-corresponding pairs [25]. Such a semantic-based zero-shot learning approach allows the model to produce correct embeddings for previously unseen objects by leveraging the relationship between semantic information and image features. However, because the encoder is trained with a contrastive learning strategy, it captures only the relative relationships between images and semantic information and cannot attend to absolute information, such as the spatial coordinates of objects. This limitation is addressed by the training approach of the proposed method, which uses masks to specify the exact objects relevant to decision-making.
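For concreteness, the following minimal sketch shows how such an image-text similarity can be computed with an off-the-shelf CLIP implementation; the open_clip interface, the ViT-B-32 checkpoint, and the example prompts are illustrative assumptions rather than the exact configuration used here.

```python
import torch
import open_clip
from PIL import Image

# Load a CLIP-style encoder pair (vision transformer + text encoder).
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

image = preprocess(Image.open("scene.png")).unsqueeze(0)        # image I (example file name)
texts = tokenizer(["an obstacle ahead", "a clear corridor"])     # semantic attributes T

with torch.no_grad():
    z_img = model.encode_image(image)        # vision transformer -> image embedding
    z_txt = model.encode_text(texts)         # text encoder -> text embeddings
    z_img = z_img / z_img.norm(dim=-1, keepdim=True)
    z_txt = z_txt / z_txt.norm(dim=-1, keepdim=True)
    S = z_img @ z_txt.T                      # cosine similarity S(I, T) in the shared space
print(S)
```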
3.2. Visual-Act Model
This part utilizes the zero-shot image encoder to transform images into embedding vectors. The model then learns the relationship between visual information and action selection, extracting the semantic information relevant to decision-making and thereby optimizing the path-planning process.
Algorithm 1: Optimized model for high-level macro-action decision making in action selection
- 1: Input: Image dataset with annotations, set of masks, number of epochs N
- 2: Output: Optimized model for action decisions, processed image embeddings
- 3: Initialize the zero-shot image encoder Z
- 4: Initialize the action-specific networks for each action
- 5: Initialize the parameters for the contrastive loss
- 6: Define the total loss function with a weighting factor
- 7: for each epoch do
- 8: for each image I in the dataset do
- 9: Encode the original image with Z to obtain its embedding
- 10: for each mask in the set of masks do
- 11: Encode the mask-applied image with Z to obtain its embedding
- 12: end for
- 13: Initialize the task-specific losses for this batch
- 14: for each action a in {forward, left, right} do
- 15: Compute the action-specific loss for action a
- 16: end for
- 17: Compute the contrastive loss over positive and negative samples
- 18: Aggregate the action-specific losses
- 19: Combine them with the contrastive loss into the weighted total loss
- 20: Backpropagate and update the parameters
- 21: end for
- 22: Optionally evaluate the model on the validation set
- 23: end for
- 24: Save the optimized model parameters
(1) Mask generation and encoding
For each input image I, a series of object masks [26] is applied to segment the image into different regions. The final number of masks is capped at sixteen by disregarding distant and trivial objects. The embeddings of the original image and of the mask-processed images are denoted as E and E_m, respectively. Applying masks to the images allows the zero-shot image encoder to amplify the features of the masked regions, and this is reflected in the output embeddings in a way that supports the attention mechanism and contrastive learning.
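The masking-and-encoding step can be sketched as follows; the tensor-based mask format and the encode_image interface are assumptions carried over from a CLIP-style encoder, not the exact implementation.

```python
import torch

def encode_with_masks(model, image, masks, max_masks=16):
    """Encode the original image and up to `max_masks` mask-applied copies.

    `image`: a preprocessed (3, H, W) tensor; `masks`: an iterable of (H, W)
    binary masks from any off-the-shelf segmentation model (placeholder here).
    """
    masked_images = [image]                               # original image -> E
    for mask in list(masks)[:max_masks]:                  # cap at sixteen masks
        masked_images.append(image * mask.unsqueeze(0))   # keep only the masked object
    batch = torch.stack(masked_images)
    with torch.no_grad():
        emb = model.encode_image(batch)                   # zero-shot image encoder Z
        emb = emb / emb.norm(dim=-1, keepdim=True)
    return emb[0], emb[1:]                                # E (original), E_m (masked copies)
```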
For each action a, we calculate a reward label $r_a$ that accounts for several factors in that direction: the distance to obstacles, the dynamic or static nature of those obstacles, and the distance to the goal. The backward action is excluded from our model because it focuses on forward-moving scenarios. Mathematically, the reward label for action a can be expressed as:
$$r_a = f\left(d_{\mathrm{obs}},\, d_{\mathrm{goal}},\, s_{\mathrm{obs}}\right)$$
where $d_{\mathrm{obs}}$ denotes the distance to the nearest obstacle, $d_{\mathrm{goal}}$ represents the distance to the goal, and $s_{\mathrm{obs}}$ indicates whether the obstacle is static or dynamic. The function f computes the reward label, encapsulating the trade-off between navigating safely around obstacles and moving efficiently toward the goal.
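One possible instantiation of f is sketched below; the linear form and the weights are illustrative assumptions, not values from this work.

```python
def reward_label(d_obs: float, d_goal: float, obstacle_dynamic: bool,
                 w_obs: float = 1.0, w_goal: float = 0.5, dyn_penalty: float = 0.5) -> float:
    """Toy instance of r_a = f(d_obs, d_goal, s_obs) for one action direction.

    A larger distance to the nearest obstacle and a smaller distance to the goal
    increase the label; dynamic obstacles are penalized more than static ones.
    All weights are illustrative assumptions.
    """
    safety = w_obs * d_obs - (dyn_penalty if obstacle_dynamic else 0.0)
    progress = -w_goal * d_goal
    return safety + progress
```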
The reward labels of the original image and of the mask-processed images are denoted as R and R_m, respectively. For each R_m, we check whether the masked object lies in the corresponding direction and record the difference from R as a mark for the contrastive loss. A weight is attached to R_m to decrease the loss value when the masked object belongs to the background.
(2) Linking semantics of images with actions
Based on the differences between the image labels R_m obtained after mask processing and the original label R at each label element, the mask-processed images are labeled as positive samples at the positions where differences appear, while those without any change are labeled as negative samples.
Then, these samples are sent to a masked multi-head attention network [27], which aims to forge a shared feature representation pivotal for the ensuing action decisions. For every pre-defined action, an action-specific network processes this shared representation and produces the corresponding action determination. The self-attention mechanism allows the model to focus on specific elements of the semantic embeddings:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
where Q, K, and V represent the queries, keys, and values, respectively, and $d_k$ is the dimension of the keys. This dynamic is orchestrated by incorporating each action's loss $L_a$ into a cumulative loss function, in which the contrastive loss $L_{con}$ plays a crucial role in guiding the model towards discerning distinct actions:
$$L_{con} = -\log \frac{\exp\left(\mathrm{sim}(x, x^{+})/\tau\right)}{\exp\left(\mathrm{sim}(x, x^{+})/\tau\right) + \sum_{x^{-}} \exp\left(\mathrm{sim}(x, x^{-})/\tau\right)}$$
where $\mathrm{sim}(x, y)$ denotes the cosine similarity between two vectors x and y, x is the anchor sample, $x^{+}$ is a positive sample similar to the anchor, $x^{-}$ represents a negative sample dissimilar to the anchor, and $\tau$ is a temperature scaling parameter that controls the separation between positive and negative pairs.
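The contrastive term can be implemented compactly as a standard InfoNCE loss; the tensor shapes and the default temperature below are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positives, negatives, tau=0.07):
    """InfoNCE-style contrastive loss over cosine similarities.

    anchor: (d,) embedding of the anchor sample x; positives: (P, d) embeddings
    of x+; negatives: (Q, d) embeddings of x-. tau is the temperature (0.07 is
    an illustrative default).
    """
    anchor = F.normalize(anchor, dim=-1)
    positives = F.normalize(positives, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos_sim = positives @ anchor / tau          # sim(x, x+) / tau, shape (P,)
    neg_sim = negatives @ anchor / tau          # sim(x, x-) / tau, shape (Q,)

    losses = []
    for p in pos_sim:                           # one InfoNCE term per positive pair
        denom = torch.logsumexp(torch.cat([p.unsqueeze(0), neg_sim]), dim=0)
        losses.append(denom - p)                # = -log( exp(p) / (exp(p) + sum exp(neg)) )
    return torch.stack(losses).mean()
```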
The model outputs serve as supplemental guidance for the high-level policy in hierarchical reinforcement learning, providing instruction for high-level decisions.
Algorithm 2: Hierarchical path planning
- 1: Initialize the high-level and low-level policies
- 2: Initialize the policy network and the target network
- 3: Initialize the experience replay buffer D with capacity N
- 4: for each episode do
- 5: for each step in the episode do
- 6: Observe the visual input and the current state
- 7: Generate a low-level action
- 8: if the high-level block is unexplored then
- 9: Generate a direction using APF as in Equations (14)-(16)
- 10: Decompose the direction into actions for the unexplored blocks
- 11: else
- 12: Follow the policy for explored blocks
- 13: end if
- 14: Execute the action, observe the reward and the next state
- 15: Store the transition in replay buffer D
- 16: end for
- 17: Synthesize the global strategy
- 18: Sample a random batch from replay buffer D
- 19: Compute the target Q-value via the Bellman equation
- 20: Perform a gradient descent step on the loss
- 21: Update the policy network
- 22: if episode mod 100 == 0 then
- 23: Compare with the historical strategies
- 24: Update the target network
- 25: end if
- 26: Update the policy network based on the target network
- 27: Dynamically adjust B and b based on the cumulative average reward
- 28: end for
3.3. Hierarchical Policy
In this section, we utilize HRL to complete path-planning tasks under the guidance of the visual-cue-based action selection model. A two-level policy is constructed to increase efficiency without sacrificing exploration capability.
(1) Environment representation
The environment is represented in two forms: a high-level grid map and a low-level one. The high-level network is designed to achieve global decision-making based on the work above, while the low-level one is used for finer local planning.
High-level grid map: It selects the next macro-action, directing the agent towards a specific region on the grid. The grid map is divided into larger blocks, and each block is treated as a high-level state. The high-level policy determines the sequence of blocks to be explored based on the current state and the goal location. This high-level grid helps the agent avoid areas where visual information has not yet been fully observed and accelerates training in robotic path-planning tasks.
Figure 4. Data flow of the hierarchical policy.
Low-level grid map: Within each selected high-level block, the low-level strategy navigates through the individual grid cells. The policy network produces action instructions for the robot once visual and state information has been obtained in a grid cell.
(2) RL structure
We model our design after DQN and incorporate its structure into the proposed framework: a policy network and a target network. As shown in Figure 5, the policy network is updated continuously based on the Bellman equation and the target network, and it generates the policy for explored areas. APF is applied to generate the policy for unexplored areas. The two policies are combined to form the policy for the target network. The target network's parameters are updated less frequently, specifically every 100 episodes, by comparing the global strategy with the most recent policy network.
Action selection is based on the $\epsilon$-greedy policy, balancing exploration and exploitation. The Q-value updates follow the standard Bellman equation:
$$y_t = r_t + \gamma \max_{a'} Q\left(s_{t+1}, a';\, \theta^{-}\right)$$
where $r_t$ is the reward at time t, $\gamma$ is the discount factor, and $\theta^{-}$ are the parameters of the target network.
Additionally, the policy network parameters are updated based on the target network to ensure stability.
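The policy/target-network arrangement can be sketched as follows; the network architecture, optimizer, and loss choice are illustrative assumptions.

```python
import copy
import torch
import torch.nn as nn

policy_net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 3))  # 3 macro-actions (example)
target_net = copy.deepcopy(policy_net)
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
gamma = 0.99

def td_step(batch):
    """One gradient step on the Bellman target y_t = r_t + gamma * max_a' Q(s', a'; theta-)."""
    s, a, r, s_next, done = batch                      # tensors sampled from the replay buffer
    q = policy_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        y = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Every 100 episodes the target network is refreshed from the policy network.
def update_target():
    target_net.load_state_dict(policy_net.state_dict())
```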
(3) Base policy and reward settings
Visual servoing: For areas already explored, we employ the visual-act model to map input images to action directives. The network processes the visual input and outputs an action for the agent. At every step, these action instructions are added to a policy pool, which serves as the optimal policy used to determine the parameters of the target network:
$$a_{(x,y)} = \pi_{\theta}\left(I_{(x,y)}\right)$$
where $a_{(x,y)}$ is the action at coordinate $(x, y)$, $I_{(x,y)}$ is the visual input, and $\pi_{\theta}$ is the policy function parameterized by the pre-trained visual-act decision model. The collection of these action instructions forms the policy developed by the pre-trained model for the explored area.
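A small sketch of this lookup is given below; the visual_act_model interface and the dictionary-based policy pool are assumptions for illustration.

```python
# Map the visual input at grid cell (x, y) to a macro-action and store it in the
# policy pool for explored areas. `visual_act_model` is assumed to be a callable
# returning one score per macro-action.
ACTIONS = ("forward", "left", "right")
policy_pool = {}   # (x, y) -> action chosen by the pre-trained visual-act model

def visual_servoing_step(x, y, image_embedding, visual_act_model):
    scores = visual_act_model(image_embedding)        # one score per macro-action
    action = ACTIONS[int(scores.argmax())]
    policy_pool[(x, y)] = action                      # becomes part of the explored-area policy
    return action
```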
Artificial potential field for unexplored regions: For high-level blocks that have not been explored, we utilize the artificial potential field method to generate navigation strategies. The APF method uses an attractive potential to pull the agent towards the goal and a repulsive potential to push it away from obstacles. The direction generated by the APF is then decomposed into specific actions. The potential field is represented as U, which can be decomposed into two parts, $U_{att}$ and $U_{rep}$:
$$U_{att} = \frac{1}{2} k_{att}\, d^{2}$$
$$U_{rep} = \begin{cases} \dfrac{1}{2} k_{rep} \left(\dfrac{1}{d_{o}} - \dfrac{1}{d_{0}}\right)^{2}, & d_{o} \le d_{0} \\[4pt] 0, & d_{o} > d_{0} \end{cases}$$
where $k_{att}$ and $k_{rep}$ are constants, d is the distance to the goal, $d_{o}$ is the distance to the obstacle, and $d_{0}$ is the influence distance of the obstacle. The combined potential field U determines the direction for navigation:
$$\mathbf{F} = -\nabla U = -\left(\nabla U_{att} + \nabla U_{rep}\right)$$
The direction of $\mathbf{F}$ is then translated into specific actions within the unexplored high-level blocks, yielding the policy for the unexplored areas. This whole process is carried out after every episode is completed.
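The APF computation on a two-dimensional grid can be sketched as follows; the gain values and the discrete decomposition rule are illustrative assumptions.

```python
import math

def apf_direction(pos, goal, obstacles, k_att=1.0, k_rep=100.0, d0=3.0):
    """Negative gradient of U = U_att + U_rep at `pos`; returns a unit direction."""
    fx = k_att * (goal[0] - pos[0])                       # attractive force toward the goal
    fy = k_att * (goal[1] - pos[1])
    for ox, oy in obstacles:
        do = math.hypot(pos[0] - ox, pos[1] - oy)
        if 0.0 < do <= d0:                                # repulsion only inside the influence radius
            mag = k_rep * (1.0 / do - 1.0 / d0) / (do ** 3)
            fx += mag * (pos[0] - ox)
            fy += mag * (pos[1] - oy)
    norm = math.hypot(fx, fy) or 1.0
    return fx / norm, fy / norm

def decompose_to_step(direction):
    """Map the continuous APF direction onto a discrete grid step (illustrative rule)."""
    dx, dy = direction
    if abs(dx) >= abs(dy):
        return (1, 0) if dx > 0 else (-1, 0)
    return (0, 1) if dy > 0 else (0, -1)
```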
Global strategy: It is synthesized by combining the policies derived from both the explored and unexplored regions. The synthesized strategy is then compared with the historical strategy obtained from past episodes, and the target network parameters are updated based on this comparison to ensure that the new strategy improves upon past strategies.
Reward settings: The reward functions depend on the agent's next state and the target state. Separate reward values are assigned to the agent in the collision condition and in the safe condition, respectively, and an additional reward is assigned according to whether the agent reaches the target.
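One plausible instantiation of these reward settings is sketched below; the numeric reward values are assumptions, not the values used in the experiments.

```python
def step_reward(next_state, target_state, collided,
                r_collision=-10.0, r_safe=-0.1, r_goal=10.0):
    """Reward for a single transition: collision penalty, small safe-step cost,
    and a bonus when the next state reaches the target. Values are illustrative."""
    if collided:
        return r_collision
    if next_state == target_state:
        return r_goal
    return r_safe
```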
(4) Adaptive experience replay management
The management of the experience replay buffer [28] is adaptively adjusted based on the cumulative average reward, which serves as an indicator of the learning progress and the efficiency of the current policy. The cumulative average reward $\bar{R}$ is computed as follows:
$$\bar{R} = \frac{1}{N} \sum_{i=1}^{N} r_i$$
where $r_i$ is the immediate reward received after the i-th action, $r_{goal}$ and $r_{obs}$ are the rewards for achieving the goal and for encountering an obstacle, respectively, and N is the total number of actions taken up to the current point in time. If the agent achieves the goal or encounters an obstacle, $r_{goal}$ or $r_{obs}$ is added to the sum and the episode is terminated immediately.
Based on $\bar{R}$, we adjust the buffer size B and the batch size b of the experience replay buffer to enhance learning efficiency. The adjustments are bounded by the minimum and maximum buffer sizes $B_{min}$ and $B_{max}$ and by the minimum and maximum batch sizes $b_{min}$ and $b_{max}$; $\bar{R}_{target}$ is a target average reward that indicates optimal learning performance. As the average reward approaches the target, both the buffer size and the batch size increase, allowing the model to learn from a larger set of experiences. Conversely, if performance drops, the model focuses on a smaller, potentially more relevant set of experiences so that it can adjust its policy more rapidly.
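A hedged sketch of one such adjustment rule is given below; the linear-interpolation form and the bounds are assumptions chosen to match the described behaviour, not the paper's exact rule.

```python
def adjust_replay_settings(avg_reward, target_reward,
                           B_min=10_000, B_max=100_000, b_min=32, b_max=256):
    """Scale buffer size B and batch size b with progress toward the target reward.

    Only the qualitative behaviour is taken from the text: both B and b grow as
    the cumulative average reward approaches the target and shrink when
    performance drops. The linear form and bounds are illustrative assumptions.
    """
    progress = max(0.0, min(avg_reward / target_reward, 1.0))  # clamp to [0, 1]
    B = int(B_min + (B_max - B_min) * progress)
    b = int(b_min + (b_max - b_min) * progress)
    return B, b
```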