Evolution Strategies as a Scalable Alternative to Reinforcement Learning

1. Introduction

RL methods based on the Markov Decision Process (MDP) formalism have been one strategy that has seen success in learning complex behavior.
Another approach is black-box optimization methods. This is also known as direct policy search or neuro-evolution when applied to neural nets (NNs).
This paper explores Evolution Strategies as a black box approach to solving RL problems. Key contributions and observations about ES:
- Virtual batch normalization and reparameterization greatly improves ES performance.
- ES is highly parallelizable using CPUs.
- Less data efficient than A3C (needs 3x to 10x the data) but more computationally efficient.
- Broader exploration than typically seen with gradient-based methods.
- Generally robust and not much hyperparameter tuning to move between environments.
Black-box optimization algorithms are attractive because:
- Indifference to sparse or dense reward distribution
- No need for backpropagation or gradient calculations
- Tolerance for long time horizons

2. Evolution Strategies

At a high-level: you have generations and each generation has a population. Each member of the population is tested for fitness. The top performers are recombined into the next generation and the process repeats.
This paper looks at a Natural Evolution Strategy (NES). The description here is too condensed for me to fully grok (TODO: checkout Sehnke et al.’s “Parameter-exploring policy gradients”) but my guess is it works as follow:
- You represent a population as a distribution over parameters. This distribution is parameterized by \(\psi\).
- You run the population on a few episodes to evaluate the fitness.
- You calculate the gradient with regard to fitness and adjust \(\psi\) in a way that improves fitness across the population.
- They use some trick to smooth the rewards from the potentially-sparse RL environment so they can calculate a closed-form for the gradients (gaussian blurring).
Overall the algorithm proceeds as following:
- Perturb the parameters and run an episode of the resulting policy in the environment.
- Combine the episode results, calculate the stochastic gradient, and update the parameters.
ES algorithms are amenable to parallelization.
- They operate on complete episodes
- We only care about (in a global sense) the scalar reward of each episode
- We don’t require value function approximations (and the resulting dependencies on a particular policy).
They note that in the extreme this reduces to finite difference if every worker only perturbs a single parameter (TODO: lookup finite difference methods).
Q-learning and policy gradients derive the learning signal from sampling a policy. ES derives a signal from perturbing a policy. Thus you have to ensure some perturbations perform better than others or else you won’t get any exploration. This required some normalization in some benchmarks.

3. Smoothing

A large source of difficulty in RL stems from the lack of informative gradients of policy performance: such gradients may not exist due to non-smoothness of the environment or policy, or may only be available as high-variance estimates because the environment usually can only be accessed via sampling.

Policy gradient methods add noise to the action space (probabilistic sampling like softmax for selecting actions) to smooth gradients. ES adds noise to the parameter space.
ES variance is fixed whereas it grows nearly linearly for policy-gradient methods as the length of an episode grows. ES should thus have an advantage for long episodes (in practice, discounting often reduces the effective length of the episode).
ES can be reduced to a finite difference method.
Because only total episode reward is used, ES can more easily handle sparse rewards.

4. Experiments

They compared ES to TRPO to solve a number of MuJoCo benchmarks. In some it trained to TRPO levels significantly faster (i.e. using 30% of the training time) in others it took longer.
They compared ES to A3C on the Atari suite. It beat A3C in 23 games and lost to A3C in 28 games.
ES algorithms parallelize well on EC2.
Never heard of frameskip before, but apparently sometimes you just ignore some frames and only choose an action, for example, every fourth frame.

Evolution Strategies as a Scalable Alternative to Reinforcement Learning (2017)

1. Introduction

2. Evolution Strategies

3. Smoothing

4. Experiments

See also