4. TRAINING OF NEURAL NETWORKS BY DIFFERENTIAL EVOLUTION
The three-layered perceptron depicted in Figure 2 includes M+L unit biases and NM+ML connection weights, resulting in a total of NM+ML+M+L parameters. Let D denote this total. For this study, the author sets N=6 and L=1, leading to D=8M+1. Training this perceptron is essentially the optimization of a D-dimensional real vector. Let $\boldsymbol{x} = (x_1, x_2, \ldots, x_D)$ denote the D-dimensional vector, where each element $x_d$ corresponds to one of the D parameters in the perceptron. By applying the value of each element in $\boldsymbol{x}$ to its corresponding connection weight or unit bias, the feedforward calculation (described by eqs. (2)-(6) in Part 1 [16]) can be processed.
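To make the parameterization concrete, the sketch below unpacks a flat D-dimensional vector into the weights and biases of such a perceptron. The slice ordering and the tanh activation are assumptions for illustration only; the actual feedforward calculation is the one defined by eqs. (2)-(6) in Part 1 [16].

```python
import numpy as np

N, M, L = 6, 8, 1            # inputs, hidden units, outputs (M = 8 as an example)
D = N * M + M + M * L + L    # = 8M + 1 when N = 6 and L = 1

def unpack(x):
    """Split a flat D-dimensional genotype vector into MLP weights and biases.
    The slice ordering is an assumption for illustration."""
    i = 0
    W1 = x[i:i + N * M].reshape(N, M); i += N * M   # input-to-hidden weights
    b1 = x[i:i + M];                   i += M       # hidden-unit biases
    W2 = x[i:i + M * L].reshape(M, L); i += M * L   # hidden-to-output weights
    b2 = x[i:i + L]                                 # output-unit bias
    return W1, b1, W2, b2

def feedforward(x, s):
    """Feedforward pass for one state s (tanh activations are assumed here;
    the actual functions are eqs. (2)-(6) in Part 1 [16])."""
    W1, b1, W2, b2 = unpack(x)
    h = np.tanh(s @ W1 + b1)
    return np.tanh(h @ W2 + b2)

# Example: apply a random genotype to a zero state vector.
x = np.random.default_rng(0).uniform(-10.0, 10.0, D)
action = feedforward(x, np.zeros(N))
```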
In this study, the D-dimensional vector $\boldsymbol{x}$ is optimized using Differential Evolution (DE) [12, 13, 14]. DE treats $\boldsymbol{x}$ as a chromosome (a genotype vector) and applies evolutionary operators to manipulate it. The fitness of $\boldsymbol{x}$ is evaluated based on eq. (1) described in Part 1 [16].
Figure 3 illustrates the DE process. Steps 1-3 are the same as those in Evolution Strategy, which are described in Part 1 [16]. Step 1 initializes the vectors $\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_\lambda$ randomly within a predefined range, denoted as $[x_{\min}, x_{\max}]^D$, where λ represents the number of offspring. In Step 2, the values in each vector $\boldsymbol{x}_c$ (c = 1, 2, …, λ) are fed into the MLP, which then controls the Acrobot system for a single episode consisting of 200 time steps. The fitness of $\boldsymbol{x}_c$ is evaluated based on the outcome of the episode. In Step 3, the evolutionary training loop concludes upon meeting a preset condition; a straightforward example of such a condition is reaching the limit on the number of fitness evaluations.
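As a minimal sketch of Steps 1-3, the fragment below initializes and evaluates the parent population. The fitness function is a toy stand-in, since the actual fitness requires running the Acrobot simulator and computing eq. (1) of Part 1 [16].

```python
import numpy as np

rng = np.random.default_rng(0)
LAM = 10                        # lambda: number of offspring per generation
D = 8 * 8 + 1                   # D = 8M + 1 with M = 8
X_MIN, X_MAX = -10.0, 10.0      # domain of the genotype vectors
MAX_EVALS = 5_000               # Step 3: stop at the evaluation limit

def evaluate(x):
    """Toy stand-in for one 200-step Acrobot episode controlled by the MLP;
    the real fitness is eq. (1) in Part 1 [16]."""
    return -float(np.sum(x ** 2))

# Step 1: initialize lambda parent vectors uniformly within the domain.
parents = rng.uniform(X_MIN, X_MAX, size=(LAM, D))
# Step 2: evaluate the fitness of each parent.
fitness = np.array([evaluate(x) for x in parents])
evals = LAM
```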
In Step 4, new offspring are created by applying the crossover operator to the parents $\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_\lambda$ and to donors, which are created before the crossover. The reproduction scheme called DE/rand/1/bin is adopted in this study. Let $\boldsymbol{v}_c$ denote the genotype vector of the c-th donor, c = 1, 2, …, λ. $\boldsymbol{v}_c$ is determined as follows. Among the parents, three parents $\boldsymbol{x}_{r_1}$, $\boldsymbol{x}_{r_2}$, $\boldsymbol{x}_{r_3}$ are randomly selected, where $r_1 \neq r_2 \neq r_3 \neq c$. Then

$$v_{c,d} = x_{r_1,d} + F\,(x_{r_2,d} - x_{r_3,d}), \qquad d = 1, 2, \ldots, D,$$

where F is a preset scaling factor.
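Continuing the sketch above, the donor construction of DE/rand/1 can be written as follows; the value of F here is illustrative, and the actual setting is given in Table 1.

```python
F = 0.5  # scaling factor (illustrative value; the actual setting is in Table 1)

def make_donor(parents, c, rng):
    """DE/rand/1: select three mutually distinct parents r1, r2, r3 (all != c)
    and combine them element-wise into the donor vector."""
    candidates = [i for i in range(len(parents)) if i != c]
    r1, r2, r3 = rng.choice(candidates, size=3, replace=False)
    return parents[r1] + F * (parents[r2] - parents[r3])
```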
Let $\boldsymbol{u}_c$ denote the genotype vector of the c-th offspring, c = 1, 2, …, λ. $\boldsymbol{u}_c$ is determined as follows:

$$u_{c,d} = \begin{cases} v_{c,d} & \text{if } rand \le CR \text{ or } d = RAND, \\ x_{c,d} & \text{otherwise,} \end{cases} \qquad (1)$$

In eq. (1), c = 1, 2, …, λ and d = 1, 2, …, D. CR is a preset crossover rate, 0 ≤ CR ≤ 1. rand is a uniform random number drawn from [0, 1], and RAND is a uniform random integer, RAND ∈ {1, 2, …, D}; the condition d = RAND guarantees that each offspring inherits at least one element from its donor. RAND and rand are sampled for each c ∈ {1, 2, …, λ}.
Any element of a new offspring vector that exceeds the domain $[x_{\min}, x_{\max}]$ is truncated to the domain. In Step 5, the fitness of each offspring $\boldsymbol{u}_c$ is evaluated by the same method as that of each parent $\boldsymbol{x}_c$. In Step 6, the better of the parent $\boldsymbol{x}_c$ and the offspring $\boldsymbol{u}_c$ is selected as the new parent $\boldsymbol{x}_c$ (c = 1, 2, …, λ) for the next generation.
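The sketch below completes one generation of Steps 4-6, continuing from the previous sketches. The binomial crossover follows eq. (1); the CR value is illustrative (the actual setting is in Table 1), and keeping the offspring on ties is an assumption.

```python
CR = 0.9  # crossover rate (illustrative value; the actual setting is in Table 1)

def make_offspring(parent, donor, rng):
    """Binomial crossover of eq. (1): each element is taken from the donor with
    probability CR, and position RAND is forced to the donor so that the
    offspring differs from its parent in at least one element."""
    D = len(parent)
    RAND = rng.integers(D)                  # uniform random index (0-based here)
    mask = rng.random(D) <= CR
    mask[RAND] = True
    u = np.where(mask, donor, parent)
    return np.clip(u, X_MIN, X_MAX)         # truncate to the domain

# One generation (Steps 4-6).
for c in range(LAM):
    u = make_offspring(parents[c], make_donor(parents, c, rng), rng)
    fu = evaluate(u)                        # Step 5: evaluate the offspring
    evals += 1
    if fu >= fitness[c]:                    # Step 6: keep the better vector
        parents[c], fitness[c] = u, fu
```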
5. EXPERIMENT
In the previous studies using GA and ES, the number of fitness evaluations included in one experimental trial was set to 5,000 [15, 16]. The number of new offspring generated per generation, λ, was either (a) 10 or (b) 50. The number of generations was 500 for (a) and 100 for (b), respectively, so the total number of fitness evaluations was 10 × 500 = 5,000 for (a) and 50 × 100 = 5,000 for (b). The experiment using DE in this study employed the same settings. The hyperparameter configurations for DE are shown in Table 1.
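In code form, the two configurations amount to the following loop budgets, both of which reach the same total of 5,000 evaluations:

```python
# (a): lambda = 10 over 500 generations; (b): lambda = 50 over 100 generations.
CONFIGS = {"(a)": (10, 500), "(b)": (50, 100)}
for name, (lam, generations) in CONFIGS.items():
    assert lam * generations == 5_000   # identical evaluation budgets
    print(name, lam, generations)
```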
The domain of the genotype vectors, $[x_{\min}, x_{\max}]^D$, was kept consistent with the previous experiments [15, 16], i.e., $[-10.0, 10.0]^D$. The number of hidden units, M, was likewise set to the same four variations: 4, 8, 16, and 32. For each of these values of M, an MLP was independently trained 11 times.
Table 2 presents the best, worst, average, and median fitness scores of the trained MLPs across the 11 trials. Each of the two hyperparameter configurations (a) and (b) in Table 1 was applied.
Comparing the scores in Table 2 between configurations (a) and (b), there is no clear pattern in their relationship: the scores of (a) are sometimes larger than those of (b) and sometimes smaller. The Wilcoxon signed-rank test confirmed that the difference between the scores of configuration (a) and those of (b) was not statistically significant (p = .35), although configuration (a) was slightly better. In the previous studies, configuration (b) yielded significantly better results for ES [16], while configuration (a) yielded significantly better results for GA [15]. Configuration (a) promotes exploitation in the later stage of the search due to its large number of generations, whereas configuration (b) promotes exploration in the early stage of the search due to its large number of offspring. Since ES is not good at exploration, configuration (b) compensates for that disadvantage; since GA is not good at exploitation, configuration (a) compensates for that one. The absence of a significant difference between the two configurations for DE indicates that DE struck a better balance between exploration and exploitation than ES and GA did.
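The paired comparison can be reproduced with SciPy's Wilcoxon signed-rank test, as in the sketch below; the score arrays are placeholders, not the actual values behind Table 2.

```python
from scipy.stats import wilcoxon

# Placeholder paired scores for configurations (a) and (b); the actual
# values are those summarized in Table 2.
scores_a = [0.42, 0.40, 0.44, 0.39, 0.41, 0.43, 0.38, 0.45]
scores_b = [0.41, 0.42, 0.40, 0.40, 0.39, 0.44, 0.37, 0.43]
stat, p = wilcoxon(scores_a, scores_b)   # two-sided paired test
print(f"p = {p:.2f}")                    # the study reports p = .35
```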
Next, comparing the fitness scores obtained using configuration (a) among the four variations of M (the number of hidden units), the scores with M=4 are worse than those with M=8, 16, and 32. The Wilcoxon rank-sum test confirmed that the difference between M=4 and each of M=8, 16, and 32 was statistically significant (p = .044, .033, and .058, respectively). No other difference was found to be statistically significant. This result indicates that 4 hidden units are not sufficient for this task, which is consistent with the previous study [15]. Although the differences between M=8 and M=16 or 32 were not statistically significant, the p-values obtained by the test suggested that M=8 tended to perform better than both M=16 and M=32. It can be concluded that, in this task, M=8 was the best choice in terms of performance and model size, which is also consistent with the previous studies [15, 16].
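Likewise, the unpaired comparison between hidden-unit counts corresponds to SciPy's rank-sum test, again with placeholder arrays standing in for the 11 trial scores:

```python
from scipy.stats import ranksums

# Placeholder fitness scores of 11 trials each for M = 4 and M = 8;
# the actual values are those behind Table 2.
m4 = [0.21, 0.25, 0.27, 0.24, 0.30, 0.22, 0.28, 0.26, 0.23, 0.29, 0.25]
m8 = [0.38, 0.42, 0.40, 0.44, 0.39, 0.41, 0.43, 0.37, 0.45, 0.40, 0.42]
stat, p = ranksums(m4, m8)   # two-sided unpaired test
print(f"p = {p:.3f}")        # the study reports p = .044 for M=4 vs M=8
```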
Figure 4 presents learning curves of the best, median, and worst runs among the 11 trials under configuration (a). Note that the horizontal axis of these graphs is on a logarithmic scale. The shapes of these graphs are similar to the results of the previous experiments using GA and ES [15, 16], and the way the fitness scores grew as the evaluation count progressed followed a common pattern among the three algorithms. In these graphs, the fitness scores started at nearly 0.0 and eventually rose to around 0.4 to 0.45, but along the way there were two intervals in which the increase in the scores stagnated and the graphs became roughly flat. The first flat interval appeared from the first to the tenth evaluation, where the fitness scores were less than 0.1. The second flat interval occurred from the 10th to the 100th evaluation, where the scores were 0.2 to 0.3. The first flat interval occurred because it took time to learn to swing the Acrobot chain. The second flat interval occurred because it took time again to progress from the point where the chain began to rotate (so that the free end of the chain exhibited periodic oscillations between ascent and descent) to the point where the free end could be maintained at higher positions. For all four variations of M, even in the worst trials the final fitness scores were not significantly worse than those of the corresponding best trials. This indicates that DE optimized the MLP robustly, with small variance across the 11 trials.
Figure 5(a) illustrates the actions output by the MLP and the heights $p_y(t)$ (see eq. (1) in Part 1 [16]) over the 200 steps prior to training, while Figure 5(b) displays the corresponding actions and heights after training. Supplementary videos are provided which demonstrate the motions of the chain 2,3. In this scenario, the MLP employed 8 hidden units, and configuration (a) in Table 1 was utilized. The previous articles [15, 16], which reported the corresponding results for GA and ES, illustrate the actions and heights in their own Figure 5(a)(b). The three versions of Figure 5(a) appear similar: before training, the MLPs could only output values close to -1 during the 200 steps, and the Acrobot chain hardly moved, so the heights $p_y(t)$ hardly increased from the initial 0.0. On the other hand, when comparing Figure 5(b) among DE, GA, and ES, the figure for DE (Figure 5(b) in this article) is particularly similar to the corresponding figure for GA (Figure 5(b) in [15]). From the 1st step to around the 75th step, the action values alternated approximately between -1 and 1. As a result, the Acrobot chain swung and the heights rose to approximately 0.8. Afterwards, until the last step, the heights were generally maintained at 0.5 or above, and whenever the heights fell below 0.5, they quickly rose back above 0.5.
Compared to Figure 5(b) in [16] for ES, Figure 5(b) in this article for DE and Figure 5(b) in [15] for GA showed fewer occasions in the later steps where the heights fell below 0.5. This means that DE and GA could train the MLPs more appropriately than ES, so that the trained MLPs became able to adjust the torque to keep the free end of the Acrobot chain at a height of 0.5 or more. However, even in the case of DE, the trained MLP could not maintain the entire chain in an inverted state and bring the free end of the chain to the highest position of 1.0, which was also the case with GA and ES.