DDPG is an RL framework that can handle continuous state and action spaces based on policy and Q-value evaluation. It employs a direct policy search method for obtaining the action at time slot $t$ as $a_t = \mu(s_t \,|\, \theta^{\mu})$. Here, $\mu(\cdot)$ is the policy evaluation NN with parameter $\theta^{\mu}$, which takes the state vector $s_t$ as input and outputs the corresponding action vector $a_t$. Being a feed-forward NN, $\mu(\cdot)$ consists of an input layer of thirteen neurons, three successive hidden layers of $N_1^{\mu}$, $N_2^{\mu}$, and $N_3^{\mu}$ neurons, and an output layer of six neurons. As the normalized action vector can only take positive values for a given positive-valued state vector, we apply the sigmoid activation function to better tune the policy NN model.
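As a rough illustration only, the following PyTorch sketch assembles a feed-forward actor of this shape: a thirteen-neuron input layer, three hidden layers, a six-neuron output layer, and a sigmoid output activation. The hidden-layer widths and the ReLU hidden activations are placeholder assumptions, not values taken from this work.

```python
import torch.nn as nn

class PolicyNN(nn.Module):
    """Policy (actor) evaluation NN mu(s | theta_mu): 13-dim state -> 6-dim action."""

    def __init__(self, state_dim=13, action_dim=6, hidden=(256, 128, 64)):
        super().__init__()
        h1, h2, h3 = hidden  # placeholder widths; the paper's hidden-layer sizes are not reproduced here
        self.net = nn.Sequential(
            nn.Linear(state_dim, h1), nn.ReLU(),
            nn.Linear(h1, h2), nn.ReLU(),
            nn.Linear(h2, h3), nn.ReLU(),
            nn.Linear(h3, action_dim),
            nn.Sigmoid(),  # keeps the normalized action vector positive, as stated in the text
        )

    def forward(self, state):
        return self.net(state)
```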
After taking action $a_t$ at the current state $s_t$, the immediate reward $r_t$ is generated and the current state is updated to the next state $s_{t+1}$. Then, the sample data tuple $(s_t, a_t, r_t, s_{t+1})$ is stored in the experience memory. During the training phase, a mini-batch of $N_B$ random samples is selected from the memory to train another NN, i.e., $Q(s_t, a_t \,|\, \theta^{Q})$.
random samples is selected from the memory to train another NN, i.e.,
. Here,
is Q-value evaluation NN with parameter
that takes the state and action vectors as input and provides the state-action value
as output. Being a feedforward NN,
consists of an input layer of nineteen neurons, three successive hidden layers of
,
, and
neurons, and an output layer of one neuron. As the desired output Q value is always a positive number, we apply the
sigmoid activation function to tune the Q value NN. The policy and Q-value target NNs, respectively represented by
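Analogously to the actor sketch above, a hedged PyTorch sketch of the Q-value evaluation NN (critic) is shown below: a nineteen-neuron input layer formed by concatenating the thirteen-dimensional state and six-dimensional action, three hidden layers with placeholder widths, and a single sigmoid-activated output neuron.

```python
import torch
import torch.nn as nn

class QValueNN(nn.Module):
    """Q-value evaluation NN Q(s, a | theta_Q): 13-dim state + 6-dim action -> scalar Q."""

    def __init__(self, state_dim=13, action_dim=6, hidden=(256, 128, 64)):
        super().__init__()
        h1, h2, h3 = hidden  # placeholder widths; the paper's hidden-layer sizes are not reproduced here
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, h1), nn.ReLU(),  # nineteen-neuron input layer
            nn.Linear(h1, h2), nn.ReLU(),
            nn.Linear(h2, h3), nn.ReLU(),
            nn.Linear(h3, 1),
            nn.Sigmoid(),  # the desired Q-value is positive, as stated in the text
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```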
The policy and Q-value target NNs, represented by $\mu'(\cdot)$ and $Q'(\cdot)$ with parameters $\theta^{\mu'}$ and $\theta^{Q'}$, respectively, replicate the structure of the policy and Q-value evaluation NNs and are applied to stabilize the training process. The parameter of the Q-value evaluation NN, $\theta^{Q}$, is updated by minimizing the temporal difference (TD) error loss, which is expressed as [21]:
\[
L(\theta^{Q}) = \frac{1}{N_B} \sum_{i=1}^{N_B} \left( y_i - Q(s_i, a_i \,|\, \theta^{Q}) \right)^{2},
\]
where $y_i$, the output of the Q-value target NN, is calculated using the output of the policy target NN as [21]:
\[
y_i = r_i + \gamma \, Q'\!\left( s_{i+1}, \mu'(s_{i+1} \,|\, \theta^{\mu'}) \,\big|\, \theta^{Q'} \right),
\]
where $\gamma$ is the discount factor. The parameters of the policy evaluation NN are updated through the deterministic policy gradient method, which is given as [21]:
\[
\nabla_{\theta^{\mu}} J \approx \frac{1}{N_B} \sum_{i=1}^{N_B} \nabla_{a} Q(s, a \,|\, \theta^{Q}) \big|_{s = s_i,\, a = \mu(s_i)} \, \nabla_{\theta^{\mu}} \mu(s \,|\, \theta^{\mu}) \big|_{s = s_i}.
\]
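To make the above update rules concrete, the following PyTorch sketch performs one training iteration under the stated loss and gradient expressions, assuming the evaluation and target networks sketched earlier and externally constructed optimizers; the function name, batch format, and discount value are illustrative, and the synchronization of the target-network parameters is omitted here.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, policy, q_net, policy_target, q_target,
                policy_opt, q_opt, gamma=0.99):
    """One DDPG training step on a mini-batch of (s, a, r, s') transitions."""
    states, actions, rewards, next_states = batch  # tensors; rewards of shape [N_B, 1]

    # TD target y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1} | theta_mu') | theta_Q')
    with torch.no_grad():
        next_actions = policy_target(next_states)
        y = rewards + gamma * q_target(next_states, next_actions)

    # Critic update: minimize the TD error loss (1/N_B) * sum_i (y_i - Q(s_i, a_i | theta_Q))^2
    q_loss = F.mse_loss(q_net(states, actions), y)
    q_opt.zero_grad()
    q_loss.backward()
    q_opt.step()

    # Actor update: deterministic policy gradient, i.e., ascend Q(s_i, mu(s_i | theta_mu))
    policy_loss = -q_net(states, policy(states)).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()
```

Minimizing the negative mean of $Q(s_i, \mu(s_i \,|\, \theta^{\mu}))$ with a standard optimizer realizes the gradient-ascent step implied by the deterministic policy gradient above.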