• The goal of the paper is to compare gradient-free, population-based genetic algorithms (GAs) against state-of-the-art gradient-based (or gradient-approximating) algorithms for training deep neural networks (NNs) on difficult reinforcement learning tasks, such as policy gradients (A3C), Q-learning (DQN), and evolution strategies (ES).

  • Regarding general performance on the Atari benchmarks:

    While the GA always outperforms random search, interestingly the authors found that in some Atari games random search outperforms powerful deep RL algorithms (DQN on 3/13 games, A3C on 6/13, and ES on 3/13).

  • The GA they test trains using many fewer resources (and much less wall-clock time) than current deep RL implementations.

  • One of their conclusions is that following the gradient can actually impede progress on some tasks, and that GAs or other gradient-free methods may be more appropriate for those domains.

  • TODO: read more about ES. It doesn’t compute the gradient directly, but approximates it from the fitness of randomly perturbed parameter vectors.
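    • For reference (my own note, from my understanding of OpenAI-style ES, not from this paper): ES estimates the gradient of expected fitness under parameter noise, roughly \(\nabla_\theta \, \mathbb{E}_{\epsilon \sim \mathcal{N}(0,I)}\, F(\theta + \sigma\epsilon) \approx \frac{1}{n\sigma} \sum_{i=1}^{n} F(\theta + \sigma\epsilon_i)\, \epsilon_i\), and then takes a gradient-ascent step along that noisy estimate. So it is “gradient-based” only in this approximate sense.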

  • A rundown of genetic algorithms as used in this paper (a code sketch follows this list):
    • GAs evolve a population \(P\) of \(N\) individuals over successive generations.
    • Each individual is a neural network parameter vector \(\theta\).
    • Every individual \(\theta_i\) gets a fitness score \(F(\theta_i)\) each generation.
    • The top \(T\) individuals become the parents for the next generation.
    • To produce the next generation, the following is done \(N-1\) times:
      • A parent \(\theta\) is selected uniformly at random from the top \(T\).
      • The parent is mutated by adding Gaussian noise: \(\theta'=\theta + \sigma\epsilon\) where \(\epsilon \sim \mathcal{N}(0,I)\) (\(I\) is just the identity matrix, so the noise is isotropic; the mutation power \(\sigma\) is a hyperparameter chosen empirically).
      • The \(N^{th}\) individual is the best performer from the previous generation (“elitism”). There is no crossover between parents.
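    • A minimal sketch of this loop (my own reconstruction in plain NumPy, not the paper’s code; `evaluate` stands in for an Atari rollout returning total episode reward, and the hyperparameter values are illustrative):

```python
import numpy as np

def run_ga(evaluate, num_params, N=1000, T=20, sigma=0.002, generations=100, seed=0):
    """Truncation-selection GA with elitism and no crossover.
    evaluate(theta) -> float is the fitness function (episode reward)."""
    rng = np.random.default_rng(seed)
    # Generation 0: N randomly initialized parameter vectors.
    population = [rng.standard_normal(num_params) for _ in range(N)]
    elite = population[0]
    for _ in range(generations):
        fitness = np.array([evaluate(theta) for theta in population])
        order = np.argsort(fitness)[::-1]              # indices, best first
        elite = population[order[0]]                   # best of this generation
        parents = [population[i] for i in order[:T]]   # top-T truncation selection
        next_population = [elite]                      # elitism: copied over unchanged
        for _ in range(N - 1):
            parent = parents[rng.integers(T)]          # uniform over the top T
            child = parent + sigma * rng.standard_normal(num_params)  # Gaussian mutation
            next_population.append(child)
        population = next_population
    return elite  # best individual from the last evaluated generation
```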
  • They introduce a compact representation for individuals: rather than storing the whole vector \(\theta\), they store the random seed used to generate \(\theta\) and the seeds used to generate the Gaussian noise for each successive generation (sketched below).
    • Its size grows linearly with the number of generations (often thousands) but is independent of the size of \(\theta\) itself (often millions of parameters).
    • This representation parallelizes well and can be used with CPUs instead of GPUs.
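    • A rough sketch of the idea (my reconstruction; the paper’s actual initialization and noise generation will differ): an individual is just its lineage of seeds, and any worker can rebuild the full \(\theta\) deterministically.

```python
import numpy as np

def decode(seeds, num_params, sigma=0.002):
    """Rebuild a parameter vector from its lineage of random seeds:
    seeds[0] seeds the initial parameters, and each later seed
    reproduces one generation's Gaussian mutation."""
    rng = np.random.default_rng(seeds[0])
    theta = rng.standard_normal(num_params)           # stand-in for the paper's init scheme
    for s in seeds[1:]:
        rng = np.random.default_rng(s)
        theta += sigma * rng.standard_normal(num_params)
    return theta

# An individual after three generations is just a tuple of seeds;
# mutating it appends one more seed instead of storing a new theta.
parent = (42, 7, 1918)
child = parent + (2024,)
```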
  • They also test a variant that combines the GA with novelty search (GA-NS). In this algorithm, reward is ignored during evolution and novel behavior is selected for instead, which is theorized to help avoid deceptive local optima (a sketch of the novelty score is below).
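    • A minimal sketch of the novelty score (my own, assuming each individual is summarized by a small behavior-characterization (BC) vector such as the agent’s final position; the paper’s exact BC and archive policy may differ):

```python
import numpy as np

def novelty(bc, archive, k=25):
    """Novelty of a behavior characterization = mean distance to its
    k nearest neighbors among previously seen BCs. Selection in GA-NS
    uses this score in place of the game reward F(theta)."""
    dists = np.linalg.norm(np.asarray(archive) - np.asarray(bc), axis=1)
    return float(np.sort(dists)[:k].mean())
```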

  • They run experiments on Atari, maze, and locomotion benchmarks.
    • In head-to-head comparisons, the GA performs better than ES, A3C, and DQN on 6 of the 13 games.
  • The GA often finds high-performing individuals within only a few generations. This suggests that many high-quality policies exist in the region of parameter space that random initialization reaches.

  • Next question: is the GA doing any better than random search? Yes: the GA outperformed random search in every game tested (a sketch of the random-search baseline follows this list).
    • Random search outperforms DQN on 3 of 13 games, ES on 3 of 13 games, and A3C on 6 of 13 games.
    • In one example, random search found a policy with a very high score of 3,620 in less than 1 hour vs an average score of 797 produced by DQN after 7-10 days of optimization.
    • The difficulty these gradient-based methods have in beating random search suggests that, for some tasks, they may be getting trapped in local optima.
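    • For concreteness, the random-search baseline is essentially generation 0 of the GA repeated: sample parameter vectors from the initialization distribution and keep the best (my sketch; `evaluate` again stands in for a rollout):

```python
import numpy as np

def random_search(evaluate, num_params, num_samples=1000, seed=0):
    """Sample policies directly from the initialization distribution
    and keep the best one found. No mutation, no selection."""
    rng = np.random.default_rng(seed)
    best_theta, best_score = None, -np.inf
    for _ in range(num_samples):
        theta = rng.standard_normal(num_params)   # stand-in for the paper's init scheme
        score = evaluate(theta)
        if score > best_score:
            best_theta, best_score = theta, score
    return best_theta, best_score
```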
  • They then test a maze benchmark designed to have a deceptive reward signal and local optima. GA-NS performs well in this scenario because it ignores the reward and instead encourages exploration.

  • GAs disappointed on a robot locomotion task; they were significantly slower at finding a solution than the other algorithms.

  • These results suggest that densely sampling policies in a region around the origin is sufficient to find good policies for some tasks. This sometimes beats state-of-the-art, gradient-based methods.
    • Tasks with densely packed “good” policies seem well suited to GAs.

Thoughts

  • It seems that the performance of RL algorithms is heavily affected by the distribution of good policies in policy space, which is a property of the task itself.

  • I like the idea of “blindly” modifying the parameters of the NN directly. It is interesting to think that a trained net is, ultimately, a vector of numbers and that you can operate on that vector outside of its NN context.
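    • A toy illustration of that view (nothing paper-specific): flatten a tiny network’s weights into one vector, perturb it with no knowledge of the architecture, and put it back.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny two-layer network stored as separate weight matrices.
params = {"W1": rng.standard_normal((4, 8)), "W2": rng.standard_normal((8, 2))}

# Flatten everything into one vector...
flat = np.concatenate([p.ravel() for p in params.values()])

# ...operate on it with no knowledge of the network structure...
flat = flat + 0.002 * rng.standard_normal(flat.size)

# ...and unflatten back into the original shapes.
i = 0
for name, p in params.items():
    params[name] = flat[i:i + p.size].reshape(p.shape)
    i += p.size
```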