• The goal of the paper is to compare non-gradient population-based genetic algorithms (GAs) to state of the art gradient-based algorithms such as deep neural nets (NNs), policy gradient, q-learning, and evolutionary strategies (ES) on difficult reinforcement learning tasks.

• Regarding general performance on the Atari benchmarks:

While the GA always outperforms random search, interestingly we discovered that in some Atari games random search outperforms powerful deep RL algorithms (DQN on 3/13 games, A3C on 6/13, and ES on 3/13).

• The GAs they tested in this paper train using many fewer resources than current deep NN implementations.

• One of their conclusions is that gradient-based search impedes progress on some tasks and GAs or other gradient-free methods might be more appropriate for those domains.

• TODO: find out more about ES. They don’t calculate the gradient directly, but they approximate.

• A rundown of genetic algorithms as used in this paper:
• GAs evolve a population $$P$$ of $$N$$ individuals over successive generations.
• Each individual is a neural network parameter vector $$\theta$$.
• Every individual $$\theta_i$$ gets a fitness score $$F(\theta_i)$$ each generation.
• The top $$T$$ individuals become the parents for the next generation.
• To produce the next generation, the following is done $$N-1$$ times:
• A parent $$\theta$$ is selected uniformly at random
• The parent is mutated by applying Gaussian noise $$\theta'=\theta + \sigma\epsilon$$ where $$\epsilon \sim \mathcal{N}(0,I)$$ (unclear how $$I$$ is determined, $$\sigma$$ is determined empirically).
• The $$N^{th}$$ individual is the best performer from the previous generation (“elitism”). There is no crossover between parents.
• They introduce a compact representation for individuals: Rather than storing the whole vector $$\theta$$, they store the random seed used to generate $$\theta$$ and the seeds used to generate the Gaussian noise for each successive generation.
• Size increases linearly with the number of generations (often thousands of generations) and it is independent of the size of $$\theta$$ itself (often millions of parameters).
• This representation parallelizes well and can be used with CPUs instead of GPUs.
• They also test a variant of genetic algorithm novelty search (GA-NS). In this algorithm, reward is ignored during evolution and novel behavior is rewarded instead. This is theorized to help with avoidance of local mins/maxes.

• They run experiments on Atari, maze, and locomotion benchmarks.
• In a head-to-head comparisons, the GA performs better than ES, A3C, and DQN on 6 games out of 13
• GAs will often find high performing individuals in few generations. This suggests that many high-quality policies exist in the region where the random initialization produces policies.

• Next question: are we doing any better than random search? GAs outperformed random search in every game tested.
• Random search outperforms DQN on 3 of 13 games, ES on 3 of 13 games, and A3C on 6 of 13 games.
• In one example, random search found a policy with a very high score of 3,620 in less than 1 hour vs an average score of 797 produced by DQN after 7-10 days of optimization.
• The difficulty that NNs have with beating random policies suggest that for some tasks gradient-based methods may be getting trapped in local minima.
• They then test a maze benchmark designed to have deceptive reward signals and local min/maxes. GA-NS performs well in this scenario because it ignores reward and encourages exploration.

• GAs disappointed on a robot locomotion task; they were significantly slower at finding a solution than the other algorithms.

• These results suggest that densely sampling policies in a region around the origin is sufficient to find good policies for some tasks. This sometimes beats state-of-the-art, gradient-based methods.
• Tasks with densely packed “good” policies seem like they’d be well tailored for GAs.

### Thoughts

• It seems that the performance of RL algorithms are heavily affected by the distribution of good policies in the policy space which is a property of the task itself.

• I like the idea of “blindly” modifying the parameters of the NN directly. It is interesting to think that a trained net is, ultimately, a vector of numbers and that you can operate on that vector outside of its NN context.