1. Introduction

  • RL methods based on the Markov Decision Process (MDP) formalism are one strategy that has seen success in learning complex behavior.

  • Another approach is black-box optimization, also known as direct policy search, or neuro-evolution when applied to neural nets (NNs).

  • This paper explores Evolution Strategies as a black box approach to solving RL problems. Key contributions and observations about ES:
    • Virtual batch normalization and other reparameterizations of the policy greatly improve ES performance.
    • ES is highly parallelizable using CPUs.
    • Less data-efficient than A3C (needs 3x to 10x as much data) but more computationally efficient.
    • Broader exploration than typically seen with gradient-based methods.
    • Generally robust; little hyperparameter tuning is needed to move between environments.
  • Black-box optimization algorithms are attractive because:
    • Indifference to whether the reward distribution is sparse or dense
    • No need for backpropagation or gradient calculations
    • Tolerance for long time horizons

2. Evolution Strategies

  • At a high level: you have generations and each generation has a population. Each member of the population is tested for fitness. The top performers are recombined into the next generation and the process repeats, as in the sketch below.
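
A minimal sketch of that generic select-and-recombine loop (everything here is hypothetical and just illustrates the idea; real methods differ in how they evaluate fitness and recombine members):

```python
import numpy as np

def generic_evolution_strategy(fitness_fn, dim, pop_size=50, top_k=10,
                               noise_std=0.1, generations=100):
    """Hypothetical sketch of a generational select-and-recombine loop."""
    population = np.random.randn(pop_size, dim)
    for _ in range(generations):
        # Test every member of the population for fitness.
        scores = np.array([fitness_fn(member) for member in population])
        # Keep the top performers.
        elites = population[np.argsort(scores)[-top_k:]]
        # Recombine: build the next generation around the elite mean.
        population = elites.mean(axis=0) + noise_std * np.random.randn(pop_size, dim)
    # Return the best member of the final generation.
    scores = np.array([fitness_fn(member) for member in population])
    return population[np.argmax(scores)]
```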

  • This paper looks at a Natural Evolution Strategy (NES). The description here is too condensed for me to fully grok (TODO: check out Sehnke et al.’s “Parameter-exploring policy gradients”) but my guess is it works as follows:
    • You represent a population as a distribution over parameters. This distribution is parameterized by \(\psi\).
    • You run the population on a few episodes to evaluate the fitness.
    • You calculate the gradient of expected fitness with respect to \(\psi\) and adjust \(\psi\) in a way that improves fitness across the population.
    • They smooth the potentially-sparse rewards from the RL environment by perturbing the parameters with Gaussian noise (a Gaussian blurring of the objective), which yields a gradient they can estimate from samples; see the sketch after this list.
  • Overall the algorithm proceeds as follows:
    • Perturb the parameters and run an episode of the resulting policy in the environment.
    • Combine the episode results, calculate the stochastic gradient, and update the parameters.
  • ES algorithms are amenable to parallelization.
    • They operate on complete episodes
    • We only care about (in a global sense) the scalar reward of each episode
    • We don’t require value function approximations (and the resulting dependencies on a particular policy).
  • They note that in the extreme this reduces to finite differences if every worker only perturbs a single parameter (TODO: look up finite-difference methods).
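
My rough sketch of the loop described above, assuming the Gaussian parameterization used in the paper (so \(\psi\) is just a mean vector \(\theta\) with a fixed standard deviation \(\sigma\)); `run_episode` is a hypothetical full-episode rollout that returns the total reward:

```python
import numpy as np

def evolution_strategies(run_episode, theta, sigma=0.1, lr=0.01,
                         pop_size=100, iterations=200):
    """NES-style sketch: perturb the parameters with Gaussian noise, run one
    episode per perturbation, and move theta along the estimated gradient of
    expected return."""
    for _ in range(iterations):
        # Perturb the parameters and run an episode of each resulting policy.
        epsilons = np.random.randn(pop_size, theta.size)
        returns = np.array([run_episode(theta + sigma * eps) for eps in epsilons])
        # In a distributed setting each worker only needs to send back its
        # scalar return; the perturbations can be reconstructed from shared seeds.
        # Standardize returns so the step size is invariant to reward scale.
        advantages = (returns - returns.mean()) / (returns.std() + 1e-8)
        # Stochastic gradient estimate of the Gaussian-smoothed objective.
        grad = (epsilons.T @ advantages) / (pop_size * sigma)
        theta = theta + lr * grad
    return theta
```

The actual paper also uses antithetic (mirrored) sampling and a rank transformation of the returns rather than plain standardization, but the structure is the same.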

  • Q-learning and policy gradient methods derive their learning signal from sampling actions from a stochastic policy. ES derives its signal from perturbing the policy parameters, so you have to ensure that different perturbations actually produce different returns, or you get no gradient signal and no exploration. On some benchmarks this required normalization (the virtual batch normalization mentioned above) to make the policy sufficiently sensitive to perturbations; a sketch follows.
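
My guess at what that virtual batch normalization looks like in practice (a minimal sketch; the reference-batch idea is from the paper, the class itself is hypothetical):

```python
import numpy as np

class VirtualBatchNorm:
    """Normalize observations using statistics from a fixed reference batch
    collected once at the start of training, instead of per-minibatch stats.
    Fixed statistics keep the policy's output consistently sensitive to small
    parameter perturbations across states."""

    def __init__(self, reference_batch, eps=1e-5):
        self.mean = reference_batch.mean(axis=0)
        self.std = reference_batch.std(axis=0) + eps

    def __call__(self, observations):
        return (observations - self.mean) / self.std

# Usage (hypothetical): stack the first N observations seen into a reference
# batch, build vbn = VirtualBatchNorm(reference), and apply vbn(obs) before
# feeding each observation to the policy network.
```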

3. Smoothing

A large source of difficulty in RL stems from the lack of informative gradients of policy performance: such gradients may not exist due to non-smoothness of the environment or policy, or may only be available as high-variance estimates because the environment usually can only be accessed via sampling.

  • Policy gradient methods smooth the objective by adding noise to the action space (probabilistic action selection, e.g. sampling from a softmax over actions). ES instead adds noise to the parameter space.
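
One way to write the parameter-space smoothing (my reading of the standard Gaussian-smoothing identity, where \(F(\theta)\) is the episode return of the policy with parameters \(\theta\)):

\[
F_\sigma(\theta) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[F(\theta + \sigma\epsilon)\big],
\qquad
\nabla_\theta F_\sigma(\theta) = \frac{1}{\sigma}\,\mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[F(\theta + \sigma\epsilon)\,\epsilon\big].
\]

The expectation on the right is exactly what the per-episode perturbations estimate.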

  • The variance of the ES gradient estimate is independent of episode length, whereas for policy-gradient methods it grows roughly linearly with episode length. ES should thus have an advantage for long episodes (in practice, discounting often reduces the effective episode length for policy gradients).
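
The contrast is easiest to see from the two estimators (sketching the paper's argument from memory): the score-function policy gradient sums one noisy term per timestep, while the ES estimate above has no per-timestep sum,

\[
\nabla_\theta \mathbb{E}[R] = \mathbb{E}\Big[R \sum_{t=1}^{T} \nabla_\theta \log \pi(a_t \mid s_t; \theta)\Big]
\quad\text{vs.}\quad
\nabla_\theta F_\sigma(\theta) = \frac{1}{\sigma}\,\mathbb{E}\big[F(\theta + \sigma\epsilon)\,\epsilon\big],
\]

so the variance of the former scales with the episode length \(T\) while the latter does not.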

  • ES reduces to a finite-difference method in a special case.
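
Answering my own finite-difference TODO, I think the connection is this: if the perturbations are restricted to the coordinate axes (each worker perturbs a single parameter, sampled antithetically), the ES estimate of the \(i\)-th partial derivative becomes a central difference,

\[
\frac{F(\theta + \sigma e_i) - F(\theta - \sigma e_i)}{2\sigma} \approx \frac{\partial F}{\partial \theta_i}.
\]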

  • Because only the total episode reward is used, ES can more easily handle sparse rewards.

4. Experiments

  • They compared ES to TRPO on a number of MuJoCo benchmarks. On some it reached TRPO-level performance significantly faster (e.g. using about 30% of the training time); on others it took longer.

  • They compared ES to A3C on the Atari suite. It beat A3C in 23 games and lost to A3C in 28 games.

  • ES algorithms parallelize well on EC2.

  • Never heard of frameskip before, but apparently you sometimes ignore intermediate frames and only choose a new action every few frames (for example, every fourth frame), repeating the action in between.
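
A minimal sketch of what I think frameskip amounts to in a gym-style loop (the environment and the `skip` value are hypothetical):

```python
def step_with_frameskip(env, action, skip=4):
    """Repeat the chosen action for `skip` frames, accumulating the reward,
    so the agent only has to pick a new action every `skip`-th frame."""
    total_reward = 0.0
    for _ in range(skip):
        observation, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break
    return observation, total_reward, done, info
```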

See also