1. Introduction
With the rapid development of urban motorization, a serious imbalance has emerged between traffic demand and supply. Traffic congestion has become a major problem faced by most cities, and its environmental, social, and economic consequences are well documented [1,2,3]. Adaptive traffic signal control (ATSC) is one of the effective means of alleviating traffic congestion. It balances traffic flow in the road network by coordinating the timing schemes of the traffic lights in the controlled area, thereby reducing the number of stops, delay time, and energy consumption. Advancing traffic control systems is therefore of great significance for realizing the full traffic benefits of the road system, mitigating environmental pollution, and supporting the sustainable development of the transportation system.
In recent years, machine learning methods, as a new artificial intelligence technology, have been widely applied in various fields [4,5,6,7]. In a reinforcement learning (RL) based control framework, the traffic signal control system no longer relies on heuristic assumptions and equations; instead, it learns to optimize the signal control strategy through continuous trial and error while interacting with the road network in real time. Compared with traditional traffic control methods, RL-based signal control methods can therefore usually achieve better control performance [8,9,10]. Early RL-based models solve traffic signal control problems by querying Q-tables that record traffic states, actions, and rewards [11,12]. This is easy to implement when traffic conditions are relatively simple, but the approach consumes a large amount of storage space in more complex traffic environments. To address this, some scholars use a Q-network to approximate the Q-table, applying deep learning (DL) to enhance the ability of RL-based algorithms to cope with complex environments, which leads to deep reinforcement learning (DRL) algorithms [13]. Since then, a large number of studies have used DRL algorithms to solve traffic signal control (TSC) problems and have achieved good results in practice [14,15,16,17].
However, for signal control of multiple intersections within an area (a collaborative control task in a multi-agent system), the partial observability of the traffic environment makes the mapping from road network states to actions challenging [18]. Communication and collaboration between intersections have become indispensable for effective regional signal control, and multi-agent reinforcement learning (MARL) has gradually become one of the most promising approaches to large-scale TSC [19,20,21]. According to the collaboration scheme, MARL-based control methods can be divided into two types: centralized control methods and distributed control methods.
In the centralized control method, all traffic signals (agents) in the road network are controlled by a unified central controller. Each agent passes its observed local traffic state to the central controller; the central controller uses a deep neural network (DNN) to fit the joint action value function, samples actions from the corresponding policy network, and then sends them to each agent for execution. The centralized method combines the information of all agents and implies a communication and coordination mechanism among them, so it is easier to obtain the globally optimal solution. However, action decisions can only be made after the traffic states of all agents have been collected, so strategy formulation is relatively slow. In addition, as the number of agents increases, the action space and state space of the algorithm grow exponentially [11]. Therefore, the centralized learning paradigm is generally avoided in large-scale TSC to escape the "curse of dimensionality" problem. The distributed control method assumes that each agent operates in a stationary environment and regards the other agents as part of that environment. Each agent optimizes its own strategy in the direction of maximizing the global reward based on its own observations, so the scalability of distributed control methods is relatively good. However, this independent learning also makes distributed control methods more likely to fall into local optima [22].
To solve this problem, most scholars incorporate a communication mechanism into the TSC model framework to achieve better control performance. Communication mechanisms can be divided into two main types: "explicit" communication and "implicit" communication [23,24]. The core of explicit communication is how the agents communicate: the selection of communication partners can be achieved through heuristic frameworks [25,26] and gating mechanisms [27], while the adjustment of communication content and timing relies on DL methods such as attention mechanisms [18,28], recurrent neural networks [29], and graph neural networks [30,31,32]. Implicit communication mainly shapes the behavioral strategies of the signal agents through value function decomposition and centralized value functions [20,23,33,34,35]. Most implicit-communication MARL frameworks adopt the centralized-training decentralized-execution (CTDE) learning paradigm, which allows agents to use global (road network) information for centralized learning during the training phase. After training, each agent selects its actions using only its own observations and local information exchange, which greatly reduces communication overhead while preserving cooperation among the agents.
In this article, we take the adjustment of the signal timing plan as the optimization variable, with the goal of minimizing the average vehicle delay in the road network, and design a multi-agent deep reinforcement learning model that takes communication content into account, built on QMIX [33] and named CMARL. The model combines the two communication mechanisms and belongs to the distributed control methods under the CTDE paradigm. The contributions of the present study are as follows:
1. A MARL model that considers communication content is proposed to solve the regional TSC problem. The model decouples the complex relationships among the signal agents through the CTDE paradigm and uses a modified DNN to mine and selectively transmit traffic flow features, enriching the information content while reducing the communication overhead caused by the increase in information.
2. We design several comparison experiments using real-world traffic data sets and demonstrate the advantages of CMARL in regional traffic signal control tasks by comparing it with six baseline methods, including a fixed-time signal control model and five other advanced DRL control models.
The remainder of the article is organized as follows. Section 2 reviews related research on traffic signal control based on DRL. Section 3 introduces the problem definition and the CMARL framework proposed in this paper. Experiments and performance evaluation are presented in Section 4. Finally, Section 5 concludes the work and outlines future prospects.
3. Methodology
3.1. Problem Definition
In CMARL, each traffic light in the road network is regarded as an independent agent, and each agent obtains a state characterizing the current environment from the sensor observations within the range of its own intersection. The detailed definitions are as follows:
State: For each agent, the traffic state of the intersection consists of the number of vehicles in each lane, the number of queuing vehicles in each lane, and the current phase number of the traffic light, defined over the entrance lanes of intersection n; the lanes of all intersections together form the lane set of the road network. The global state is obtained by concatenating the traffic states of all agents.
Action: The phase sequence of the signalized intersection is fixed. The action is the adjustment of the current green phase, that is, whether to switch the current phase to the next phase: an action of 1 indicates switching to the next phase, and an action of 0 indicates maintaining the current phase. In addition, we impose maximum and minimum green times and a mandatory yellow phase during phase transitions to ensure the reasonable passage of traffic flow.
Reward: The distance delay is selected as the parameter for constructing the reward function.
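As an illustration, the following minimal Python sketch assembles one agent's local observation and shows the binary action convention described above. The layout of the observation vector, the one-hot phase encoding, and the function name are our assumptions for illustration, not the paper's exact encoding.

```python
import numpy as np

def build_observation(veh_counts, queue_counts, phase_index, n_phases):
    """Assemble one agent's local state: per-lane vehicle counts,
    per-lane queue lengths, and a one-hot encoding of the current phase."""
    phase_onehot = np.zeros(n_phases)
    phase_onehot[phase_index] = 1.0
    return np.concatenate([veh_counts, queue_counts, phase_onehot])

# Binary action: 1 = switch to the next phase, 0 = keep the current phase
# (min/max green times and the mandatory yellow phase are enforced by the environment).
obs = build_observation(np.array([3.0, 5.0, 2.0, 4.0]),   # vehicles per entrance lane
                        np.array([1.0, 2.0, 0.0, 1.0]),   # queued vehicles per lane
                        phase_index=2, n_phases=4)
print(obs.shape)  # (12,)
```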
3.2. Model Structure
Figure 1 shows the network framework of the CMARL model. As shown in the figure, the model consists of three modules: information processing, feature mining, and action value function fitting. The information processing module simulates the traffic flow of the actual road network with Simulation of Urban MObility (SUMO) and obtains the state parameters used for subsequent network training. The feature mining module is mainly composed of a modified DNN. Its input is each agent's initial state at time t together with the action of the previous time step (the action in the initial state defaults to 0), and its output is the corresponding feature matrix and communication matrix. Based on the communication signals, each agent selectively communicates with other agents in the road network to obtain its final state feature matrix. The action value function fitting network is consistent with the QMIX network and is mainly composed of the local action value function network (red box in the figure) and the joint action value function network (green box). The local action value function network is a recurrent neural network (RNN); its input is the final feature matrix of each agent, and its output is the action value of each candidate action. Based on these action values, each agent uses a greedy strategy to select the optimal action for the current environment and applies it to the environment. The environment then moves to the next state and returns the reward value under this group of joint actions.
The joint action value function network also uses a neural network structure, consisting of a parameter generation network (yellow in the figure) and an inference network (purple). The difference from an ordinary network is that the weights and biases of the inference network are generated by the parameter generation network. At time t, the parameter generation network receives the global state and generates the weights and biases. On this basis, the inference network receives the action values of all agents, adopts the weights and biases produced by the parameter generation network, and thereby infers the joint action value. During training, the loss function is calculated from the joint action values and the rewards of the extracted data, and the network parameters are updated accordingly.
3.3. Feature Extraction Module
The main framework of the feature extraction module is a modified DNN. Specifically, we replace one hidden layer of the DNN with a GRU to better extract features. As shown in Eqs. (2)-(5), the features carrying traffic flow information and the historical action are first mapped into a higher-dimensional vector space to obtain richer semantic information. The GRU then mines the temporal features in the historical data, and the final feature matrix and the communication matrix are obtained through two multi-layer perceptrons, each with a single hidden layer. In Eqs. (2)-(5), the parameters with W and b as variables are the trainable weights and biases of the network; $h_t$ and $h_{t-1}$ are the hidden states at time t and time t-1, respectively; the nonlinear activation functions enhance the representation and learning ability of the network; and the rounding function returns its argument rounded to the specified number of decimal places.
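The following minimal PyTorch sketch illustrates this module: an embedding layer, a GRU cell for temporal feature mining, and two single-hidden-layer MLP heads that output a feature vector and a binary communication vector. The layer sizes, the class name, and the use of a sigmoid before rounding are our assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class FeatureCommModule(nn.Module):
    """Sketch of the modified DNN: embedding -> GRU -> two MLP heads."""
    def __init__(self, obs_dim, n_agents, embed_dim=64, hidden_dim=64):
        super().__init__()
        self.embed = nn.Linear(obs_dim + 1, embed_dim)   # +1 for the previous action
        self.gru = nn.GRUCell(embed_dim, hidden_dim)
        self.feat_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim))
        self.comm_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_agents), nn.Sigmoid())

    def forward(self, obs, prev_action, h_prev):
        x = torch.relu(self.embed(torch.cat([obs, prev_action], dim=-1)))
        h = self.gru(x, h_prev)                 # temporal feature mining
        feat = self.feat_head(h)                # feature vector for this agent
        # Rounding yields a 0/1 communication vector; it is non-differentiable,
        # so training would need e.g. a straight-through estimator.
        comm = torch.round(self.comm_head(h))
        return feat, comm, h

mod = FeatureCommModule(obs_dim=12, n_agents=4)
feat, comm, h = mod(torch.randn(4, 12), torch.zeros(4, 1), torch.zeros(4, 64))
```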
On this basis, the agents communicate with each other according to the communication matrix. Taking signal agent n as an example, the communication information corresponding to agent n is located in the n-th row of the communication matrix. This row is a Boolean vector composed of 0s and 1s, with one bit per agent in the road network. If the bit corresponding to agent m equals 1, agent n will refer to the environmental information of agent m when selecting an action; otherwise, the environmental information of agent m is ignored. This process is expressed by Eqs. (7) and (8), whose result is the feature matrix that aggregates the information of the other agents.
To facilitate subsequent calculations, a fully connected layer is used to change the dimension of this aggregated feature matrix, and the result is added to the agent's own state vector to generate the final state feature. Eqs. (9)-(11) again take agent n as an example to illustrate the flow of information during the generation of the final state feature, as sketched below.
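The sketch below illustrates this selective communication step in PyTorch: each agent keeps only the features of the agents flagged in its row of the binary communication matrix, the kept features are aggregated, projected through a fully connected layer, and added to the agent's own feature. Aggregation by summation and the names used here are our assumptions.

```python
import torch
import torch.nn as nn

def communicate(feats, comm, proj):
    """feats: (N, d) per-agent features; comm: (N, N) binary matrix.
    Row n of `comm` selects which agents agent n listens to."""
    gathered = comm @ feats          # row n: sum of features agent n refers to
    return feats + proj(gathered)    # FC re-shaping + residual add -> final features

n_agents, d = 4, 64
feats = torch.randn(n_agents, d)
comm = torch.randint(0, 2, (n_agents, n_agents)).float()
proj = nn.Linear(d, d)               # FC layer that changes the message dimension
final_feats = communicate(feats, comm, proj)
print(final_feats.shape)             # torch.Size([4, 64])
```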
3.4. Action Value Function Fitting Module
The composition of the action value function fitting network was introduced in Section 3.2, so this section mainly presents the corresponding equations and the detailed meaning of their parameters. Eqs. (12)-(14) describe the RNN, i.e., the local value function fitting network. Its input is the final feature matrix of each signal agent, and its output is the action value corresponding to each action in the agent's action set.
In Eqs. (12)-(14), the parameters with w and b as variables are again the trainable weights and biases of the network, and the remaining symbols are defined as above.
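A minimal PyTorch sketch of such a local (per-agent) action value network is shown below: a GRU-based recurrent network mapping the final state feature to a Q-value for each of the two actions (keep phase, switch phase). The layer sizes and class name are our assumptions.

```python
import torch
import torch.nn as nn

class LocalQNetwork(nn.Module):
    """Sketch of the per-agent recurrent action value network."""
    def __init__(self, feat_dim, n_actions=2, hidden_dim=64):
        super().__init__()
        self.fc_in = nn.Linear(feat_dim, hidden_dim)
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)
        self.fc_out = nn.Linear(hidden_dim, n_actions)

    def forward(self, feat, h_prev):
        x = torch.relu(self.fc_in(feat))
        h = self.gru(x, h_prev)
        q = self.fc_out(h)               # Q-value for each candidate action
        return q, h

net = LocalQNetwork(feat_dim=64)
q, h = net(torch.randn(1, 64), torch.zeros(1, 64))
action = q.argmax(dim=-1)                # epsilon-greedy in practice; greedy shown here
```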
The calculation of the joint action value requires the optimal local action values as input. To implement distributed control that remains globally optimal, the joint action value function and the local value functions must share the same monotonicity; that is, the joint action that maximizes the joint action value function should be equivalent to the set of locally optimal actions (Eq. (15)). In Eq. (15), the argmax operator returns the action label corresponding to the maximum of an action value function; the action value functions of the road network and of the individual signalized intersections appear on the left- and right-hand sides, respectively, together with the corresponding historical actions of the road network and of each intersection.
The QMIX network converts the above condition into the constraint shown in Eq. (16) and satisfies it by restricting the weights of the joint action value function network to be positive.
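For reference, the consistency condition and the derived monotonicity constraint can be written as in the original QMIX paper; the notation below ($Q_{tot}$, $Q_n$, $\tau_n$) is standard QMIX notation and may differ from the symbols used in Eqs. (15) and (16).

```latex
\arg\max_{\mathbf{a}} Q_{tot}(\boldsymbol{\tau}, \mathbf{a}) =
\begin{pmatrix}
  \arg\max_{a_1} Q_1(\tau_1, a_1) \\
  \vdots \\
  \arg\max_{a_N} Q_N(\tau_N, a_N)
\end{pmatrix},
\qquad
\frac{\partial Q_{tot}}{\partial Q_n} \ge 0, \quad \forall n \in \{1, \dots, N\}.
```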
In summary, the joint action value of the road network is computed by the joint action value function network from the local action values selected under the greedy strategy, together with the global state.
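The following PyTorch sketch shows a QMIX-style mixing (joint action value) network of the kind described above: hypernetworks conditioned on the global state generate the weights and biases of the inference network, and taking absolute values of the generated weights enforces the monotonicity constraint. The layer sizes and names are our assumptions.

```python
import torch
import torch.nn as nn

class MixingNetwork(nn.Module):
    """Sketch of a QMIX-style joint action value (mixing) network."""
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(), nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents) greedy local Q-values; state: (batch, state_dim)
        w1 = torch.abs(self.hyper_w1(state)).view(-1, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(-1, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(-1, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(-1, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2       # joint action value
        return q_tot.view(-1, 1)

mixer = MixingNetwork(n_agents=4, state_dim=48)
q_tot = mixer(torch.randn(8, 4), torch.randn(8, 48))   # -> shape (8, 1)
```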
3.5. Model Update
The update method of CMARL is similar to that of the traditional DQN: both use the TD error to calculate the loss function and use the backpropagation algorithm to update the network parameters. This process involves two networks, the evaluation network and the target network. The two networks have identical structures, as shown in Figure 1, but their inputs and outputs differ. The evaluation network takes the features and historical actions in state s as input and outputs the actual joint action value. The target network takes the features and historical actions of the road network in the next state as input and calculates the target (expected) action value. The difference between the outputs of the two networks constitutes the TD error in state s, where R is the reward value obtained in state s and the target network provides the action values of all candidate actions in the next state.
It follows that calculating the TD error requires the road network state s at the current moment, the joint action a actually taken, the road network state reached after taking the action, the reward R returned upon reaching that state, and the corresponding historical joint actions. Therefore, the TD error is not calculated in real time but only after a certain amount of experience has been accumulated. On this basis, the loss function is constructed from the TD errors of a batch of extracted experiences, where b indexes the b-th experience in the batch and B is the number of extracted experiences.
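A minimal PyTorch sketch of such a batch loss is given below. It assumes a mean-squared-error form of the TD loss and a discount factor gamma, and omits episode-termination masking; these details are our assumptions, since the exact loss expression is given in the paper's equations.

```python
import torch

def qmix_loss(q_tot_eval, q_tot_target_next, rewards, gamma=0.99):
    """Squared TD-error loss over a batch of B experiences:
    y = r + gamma * Q_tot_target(next state), loss = mean((y - Q_tot_eval)^2)."""
    targets = rewards + gamma * q_tot_target_next   # target values (no gradient)
    td_error = targets.detach() - q_tot_eval
    return (td_error ** 2).mean()

B = 32
q_eval = torch.randn(B, 1, requires_grad=True)      # would come from the evaluation network
loss = qmix_loss(q_eval, torch.randn(B, 1), torch.randn(B, 1))
loss.backward()                                     # gradients flow to the evaluation network
```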
In summary, the update process of the CMARL framework amounts to gradient descent on the above loss function.
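The update expression itself is not reproduced here; under the DQN-style training described above, it presumably takes the form of a gradient-descent step on the loss together with a periodic copy of the evaluation parameters to the target network. The learning rate $\alpha$ and synchronization interval $C$ below are placeholders, not values from the paper.

```latex
\theta \leftarrow \theta - \alpha \, \nabla_{\theta} L(\theta), \qquad
\theta^{-} \leftarrow \theta \ \text{every } C \text{ training steps},
```

where $\theta$ and $\theta^{-}$ denote the parameters of the evaluation network and the target network, respectively.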
5. Discussion
This paper designs a multi-agent deep reinforcement learning model that emphasizes communication content to solve the signal control problem of road networks. To alleviate the learning instability caused by partially observable states, we use a modified DNN to mine and selectively share the nonlinear features in traffic flow data, enriching the information content while reducing the communication overhead caused by the increase in information. Using real data sets, we compare CMARL with six baseline traffic signal control methods and draw the following conclusions:
(1) CMARL operates stably in a variety of scenarios and achieves good control performance. Compared with MN_Light, the best of the baseline methods, CMARL reduced the queue length during peak hours by 9.12%, the average waiting time by 7.67%, and the average travel time by 3.31%; during off-peak hours, the queue length was reduced by 5.43%, the average waiting time by 2.72%, and the average travel time by 3.83%.
(2) In relatively complex traffic environments, further extraction of high-dimensional nonlinear features helps the agents select optimal actions. After adding the feature extraction module, the control performance of QMIX improved considerably: the queue length and average waiting time during peak hours were reduced by 9.73% and 5.64%, respectively.
In future work, we will further expand the scale of the road network and explore the applicability of different types of MARL in large-scale road network signal control problems.