4. Training of neural networks by genetic algorithm
A three-layered perceptron, as depicted in Figure 2, includes M+L unit biases and NM+ML connection weights, resulting in a total of M+L+NM+ML parameters. Let D represent the quantity M+L+NM+ML. For this study, the author sets N=6 and L=1, leading to D=8M+1. The training of this perceptron is essentially the optimization of a D-dimensional real vector. Let x = (x_1, x_2, …, x_D) denote the D-dimensional vector, where each element x_i corresponds to one of the D parameters in the perceptron. By applying the value of each element of x to its corresponding connection weight or unit bias, the feedforward calculation in eqs. (2)-(6) can be processed.
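As a concrete illustration, the sketch below unpacks such a genotype vector into the weights and biases of an N-M-L perceptron and runs one feedforward pass. The packing order and the tanh activations are assumptions made for this sketch only; the actual feedforward calculation is defined by eqs. (2)-(6) in Part 1 [15].

import numpy as np

N, M, L = 6, 4, 1                    # inputs, hidden units, outputs (N=6, L=1 in the text)
D = M + L + N * M + M * L            # total parameter count; equals 8M + 1 here

def feedforward(x, s):
    """Map the genotype x (length D) onto the perceptron and compute one output."""
    i = 0
    W1 = x[i:i + N * M].reshape(M, N); i += N * M    # input-to-hidden weights (NM values)
    b1 = x[i:i + M];                   i += M        # hidden-unit biases (M values)
    W2 = x[i:i + M * L].reshape(L, M); i += M * L    # hidden-to-output weights (ML values)
    b2 = x[i:i + L]                                  # output-unit biases (L values)
    h = np.tanh(W1 @ s + b1)     # hidden activations (tanh is an assumption)
    return np.tanh(W2 @ h + b2)  # output; the real eqs. (2)-(6) are in Part 1

x = np.random.uniform(-10.0, 10.0, D)    # a random genotype within the domain used later
print(feedforward(x, np.zeros(N)))       # output for an all-zero observation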
In this study, the D-dimensional vector x is optimized using a genetic algorithm (GA) [8, 9, 10, 11]. GA treats x as a chromosome (a genotype vector) and applies evolutionary operators to manipulate it. The fitness of x is evaluated based on eq. (1) described in Part 1 [15].
Figure 3 illustrates the GA process. Steps 1-3 are the same as those in the Evolution Strategy (ES) described in Part 1 [15]. Step 1 initializes the vectors x^(1), x^(2), …, x^(λ) randomly within a predefined domain, denoted [x_min, x_max]^D, where λ represents the number of offspring. In Step 2, the values in each vector x^(c) (c = 1, 2, …, λ) are fed into the MLP, which then controls the Acrobot system for a single episode consisting of 200 time steps. The fitness of x^(c) is evaluated based on the outcome of the episode. In Step 3, the evolutionary training loop concludes upon meeting a preset condition; a straightforward example of such a condition is reaching a preset limit on the number of fitness evaluations.
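A minimal sketch of Steps 1-3 follows, using the settings given later in Section 5; run_episode is a hypothetical stand-in for the 200-step Acrobot rollout and returns a dummy score.

import numpy as np

D = 33                       # genotype length for M = 4 hidden units (D = 8M + 1)
LAMBDA = 10                  # λ: number of offspring per generation
X_MIN, X_MAX = -10.0, 10.0   # genotype domain (see Section 5)

def run_episode(x):
    """Stand-in for one 200-step Acrobot episode controlled by the MLP
    encoded in x; the real fitness is eq. (1) in Part 1 [15]."""
    return float(np.random.rand())   # dummy score, for illustration only

# Step 1: initialize the λ genotype vectors uniformly at random in the domain.
population = np.random.uniform(X_MIN, X_MAX, (LAMBDA, D))

# Step 2: one 200-step episode per offspring yields its fitness score.
fitness = np.array([run_episode(x) for x in population])

# Step 3: in the full loop, Steps 2-5 repeat until a preset budget of
# fitness evaluations (here, 5,000) has been spent.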
The term “elites” refers to the highest-performing offspring vectors. To leverage the best solutions found thus far, elite vectors are maintained as potential parents for genetic crossover. Initially, the elite population is empty. Let E denote the number of elite vectors. In Step 4, from the union of the E elites and the λ offspring, the best E vectors (i.e., the vectors with the top E fitness scores) are saved as the new elites. Step 5 consists of crossover and mutation operations. In Step 5.1, λ new offspring vectors are generated by applying the crossover operator to the union of the E elite vectors and the current vectors x^(1), …, x^(λ). A single offspring is produced through a single crossover, which randomly selects two parents from the union set; this process is repeated λ times to generate the λ new offspring vectors. In this study, the blend crossover method (BLX-α) [18] is utilized. The new offspring vectors replace the current population x^(1), …, x^(λ). Any element of a new offspring vector that exceeds the domain [x_min, x_max] is truncated to it. In Step 5.2, mutation is applied to each element of the new offspring vectors with the mutation probability; a mutated element is reinitialized to a uniformly random number within the domain.
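The sketch below illustrates Steps 4 and 5 under the same conventions as the earlier snippets. The helper names update_elites, blx_alpha, and reproduce are illustrative, and the child-sampling interval implements the standard BLX-α rule [18].

import numpy as np

rng = np.random.default_rng()
X_MIN, X_MAX = -10.0, 10.0   # genotype domain
ALPHA = 0.5                  # α of BLX-α (the conventional value, Section 5)
E = 2                        # number of elites (20% of λ = 10)
P_MUT = 1.0 / 33             # mutation probability 1/D, here for D = 33

def update_elites(elites, elite_fit, population, fitness):
    """Step 4: keep the E best vectors from the union of old elites and offspring."""
    pool = np.vstack([elites, population])
    pool_fit = np.concatenate([elite_fit, fitness])
    top = np.argsort(pool_fit)[::-1][:E]            # indices of the top-E fitness scores
    return pool[top], pool_fit[top]

def blx_alpha(p1, p2):
    """Step 5.1: BLX-α draws each child gene uniformly from the parents'
    interval extended by α times its width on both sides [18]."""
    lo, hi = np.minimum(p1, p2), np.maximum(p1, p2)
    ext = ALPHA * (hi - lo)
    child = rng.uniform(lo - ext, hi + ext)
    return np.clip(child, X_MIN, X_MAX)             # truncate to the domain

def reproduce(elites, population, lam):
    """Step 5: λ children by crossover over the union set, then mutation."""
    parents = np.vstack([elites, population])
    children = np.array([
        blx_alpha(*parents[rng.choice(len(parents), 2, replace=False)])
        for _ in range(lam)                         # one crossover per child
    ])
    # Step 5.2: each gene mutates with probability 1/D, being reinitialized
    # to a uniformly random value within the domain.
    mask = rng.random(children.shape) < P_MUT
    children[mask] = rng.uniform(X_MIN, X_MAX, mask.sum())
    return children

In a complete run, update_elites and reproduce would be called once per generation, after the fitness evaluations of Step 2.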
5. Experiment
In the previous study using ES, the number of fitness evaluations in one experiment trial was set to 5,000 [15]. The number of new offspring generated per generation (λ) was either (a) 10 or (b) 50, and the number of generations was 500 for case (a) and 100 for case (b). The total number of fitness evaluations was thus 10 × 500 = 5,000 for (a) and 50 × 100 = 5,000 for (b). The experiments using GA in this article employed the same settings. The hyperparameter settings for GA are shown in Table 1. The number of elite individuals to be preserved, denoted E, was set to 20% of the λ offspring for both (a) and (b). The parameter α for the blend crossover was set to the conventional value of 0.5. The mutation probability was set to 1/D, where D is the genotype length. The value of D varies with the number of hidden units M in the MLP; for instance, if M = 4, then D = 8M + 1 = 33, resulting in a mutation probability of 1/33.
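Expressed as data, the two configurations can be written as follows; ga_config is a hypothetical helper, not code from the paper.

def ga_config(lam, generations, M):
    """Hyperparameters of Table 1 as data; E = 0.2λ, α = 0.5, p_mut = 1/D."""
    D = 8 * M + 1                     # genotype length for N = 6, L = 1
    return {
        "lambda": lam,                # offspring per generation
        "generations": generations,   # lam * generations = 5,000 evaluations
        "E": int(0.2 * lam),          # number of preserved elites
        "alpha": 0.5,                 # BLX-α parameter
        "p_mut": 1.0 / D,             # per-element mutation probability
    }

config_a = ga_config(lam=10, generations=500, M=4)   # (a): p_mut = 1/33
config_b = ga_config(lam=50, generations=100, M=4)   # (b)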
The domain of genotype vectors, [x_min, x_max]^D, was kept consistent with the previous experiment [15], i.e., [-10.0, 10.0]^D. The number of hidden units M was also set to the same four variations: 4, 8, 16, and 32. For each of these four values of M, the MLP underwent independent training 11 times.
Table 2 presents the best, worst, average, and median fitness scores of the trained MLPs across the 11 trials, for each of the two hyperparameter configurations (a) and (b) in Table 1.
Comparing the scores in Table 2 between configurations (a) and (b), the values obtained with configuration (a) are clearly higher than those obtained with configuration (b), indicating that configuration (a) outperforms configuration (b). The Wilcoxon signed-rank test confirmed that this difference is statistically significant (p < .01). Hence, in this study, increasing the number of generations rather than the number of offspring allowed GA to discover superior solutions. Note that the reverse was true for ES [15]: configuration (b) was significantly better than configuration (a) (p < .01).
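For reference, both tests used in this section are available in SciPy; the sketch below uses placeholder score arrays, not the actual values of Table 2.

from scipy.stats import wilcoxon, ranksums

# Hypothetical final scores of the 11 trials per configuration; the paper's
# actual values are in Table 2 and are not reproduced here.
scores_a = [0.44, 0.43, 0.45, 0.42, 0.44, 0.41, 0.43, 0.45, 0.42, 0.44, 0.43]
scores_b = [0.38, 0.37, 0.39, 0.36, 0.40, 0.35, 0.38, 0.37, 0.39, 0.36, 0.38]

# Paired comparison of configurations (a) and (b), as in the text.
print(wilcoxon(scores_a, scores_b))       # Wilcoxon signed-rank test

# For independent groups (e.g., the different M values compared below),
# the unpaired rank-sum test applies instead.
print(ranksums(scores_a, scores_b))       # Wilcoxon rank-sum test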
This difference arises from the distinction in reproduction methods between GA and ES. The reproduction method of GA can generate offspring that are distant from the parents, especially in the early stages of exploration; as a result, GA excels at broad exploration. On the other hand, the reproduction method of ES, when set with a small step size, produces offspring that are close to the parents even in the early stages, making ES adept at local exploration. Comparing configurations (a) and (b), (a) enhances local exploration more than (b), while (b) promotes broad exploration more than (a). Generally, achieving a proper balance between exploration and exploitation is crucial for successfully optimizing solutions with stochastic multi-point search methods. Therefore, configuration (a) is considered more desirable for GA, as it complements GA's broad search with exploitation, and configuration (b) is preferable for ES, as it complements ES's local search with exploration. This conclusion aligns with the findings of the author's previous study using the pendulum task [19].
Next, comparing the fitness scores obtained with configuration (a) among the four variations of M (the number of hidden units), the scores with M = 4 are worse than those with M = 8, 16, and 32. The Wilcoxon rank-sum test confirmed that (1) the difference between M = 4 and either M = 8 or M = 16 is statistically significant (p < .05), and (2) the difference between M = 4 and M = 32 is also statistically significant (p < .1). No other difference was found to be statistically significant. This result indicates that 4 hidden units are not sufficient for this task, which is consistent with the previous study [15]. Since M = 8 was not significantly inferior to M = 16 or M = 32, it can be concluded that, for this task, M = 8 was the best choice in terms of performance and model size.
Figure 4 presents the learning curves of the best, median, and worst runs among the 11 trials under configuration (a). Note that the horizontal axis of these graphs is on a logarithmic scale. The shapes of these graphs are similar to the results of the previous experiment using ES [15]: the manner in which the best fitness score grew as the evaluation count progressed showed a common pattern between ES and GA. The random solutions at the first evaluation have fitness values almost exclusively at 0.0. The graphs in Figure 4 show the fitness rising to between 0.1 and 0.2 within approximately the first 100 evaluations, which corresponds to 2% of the total 5,000 evaluations. Afterward, over the remaining 4,900 evaluations, the fitness gradually rises to around 0.4 to 0.45. For all four M variations, even in the worst trials the final fitness scores are not substantially worse than those of the corresponding best trials, indicating that GA optimized the MLP robustly, with small variance across the 11 trials. In the case of M = 16 (Figure 4(iii)), the fitness of the worst trial is not near 0.0 from the beginning but instead exceeds 0.2; this is likely because the random initialization happened to produce a good solution. Despite starting the optimization from a better initial solution than those of the other 10 trials, this particular trial eventually became the worst among the 11. This suggests that commencing optimization from better initial solutions is not necessarily desirable, as premature convergence can hinder the subsequent evolution of solutions.
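A minimal sketch of reproducing this style of plot is given below; the best-so-far curve is synthetic, standing in for the values that would be recorded during training as the running maximum of the fitness scores.

import numpy as np
import matplotlib.pyplot as plt

evals = np.arange(1, 5001)                            # evaluation counts 1..5000
best_so_far = 0.45 * (1.0 - np.exp(-evals / 800.0))   # illustrative shape only

plt.plot(evals, best_so_far)
plt.xscale("log")                        # Figure 4 uses a log-scaled x-axis
plt.xlabel("fitness evaluations")
plt.ylabel("best fitness so far")
plt.show()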
Figure 5a illustrates the actions output by the MLP and the heights p_y(t) (see eq. (1) in Part 1 [15]) over the 200 steps prior to training, while Figure 5b displays the corresponding actions and heights after training. In this scenario, the MLP employed 8 hidden units, and configuration (a) in Table 1 was utilized. These graphs also bear a resemblance to the graphs in the previous study with ES [15]. As a distinguishing feature, comparing Figure 5b after the training by GA with Figure 8b in the previous article using ES [15], it can be observed that, in the case of GA, during the latter 100 steps of the episode there is an oscillation within the range of heights greater than 0.4. In the previous case of ES, such oscillations were not present; instead, a larger oscillation in height was observed, spanning from a value close to the minimum of 0.0 to a value close to the maximum of 1.0. Supplementary videos demonstrating the motions of the chain are provided [23]. Examining the chain motion achieved by the GA-trained MLP in the video, it becomes evident that during the latter 100 steps, only the second chain beyond the joint was rotated while the first chain up to the joint was kept in an inverted state. As a result, the free end of the second chain was maintained at a height greater than 0.4, allowing a higher fitness evaluation value. The best MLP trained by ES was not capable of this kind of control [15]. This difference likely stems from the fact that GA explores solutions more broadly and searches a greater diversity of solutions than ES. However, even with GA, it was not possible to maintain the entire chain in an inverted state and bring the free end of the chain to the highest position of 1.0, which was the same result as with ES.