### 1. Introduction

• RL methods based on the Markov Decision Process (MDP) formalism have been one successful strategy for learning complex behavior.

• Another approach is black-box optimization, also known as direct policy search or, when applied to neural nets (NNs), neuro-evolution.

• This paper explores Evolution Strategies (ES) as a black-box approach to solving RL problems. Key contributions and observations about ES:
• Virtual batch normalization and reparameterization of the policy network greatly improve ES performance.
• ES is highly parallelizable using CPUs.
• Less data efficient than A3C (needs 3x to 10x as much data) but more computationally efficient.
• Generally robust: little hyperparameter tuning is needed to move between environments.
• Black-box optimization algorithms are attractive because:
• Indifference to sparse or dense reward distribution
• No need for backpropagation or gradient calculations
• Tolerance for long time horizons

### 2. Evolution Strategies

• At a high-level: you have generations and each generation has a population. Each member of the population is tested for fitness. The top performers are recombined into the next generation and the process repeats.

• This paper looks at a Natural Evolution Strategy (NES). The description here is too condensed for me to fully grok (TODO: check out Sehnke et al.’s “Parameter-exploring policy gradients”), but my guess is it works as follows:
• You represent a population as a distribution over parameters. This distribution is parameterized by $$\psi$$.
• You run the population on a few episodes to evaluate the fitness.
• You calculate the gradient with regard to fitness and adjust $$\psi$$ in a way that improves fitness across the population.
• They smooth the rewards from the potentially-sparse RL environment by Gaussian-blurring the objective so the gradient has a closed form: instead of maximizing the fitness $$F(\psi)$$ directly, they maximize $$\mathbb{E}_{\epsilon \sim N(0, I)}[F(\psi + \sigma\epsilon)]$$, whose gradient is $$\frac{1}{\sigma}\mathbb{E}_{\epsilon \sim N(0, I)}[F(\psi + \sigma\epsilon)\,\epsilon]$$.
• Overall the algorithm proceeds as following:
• Perturb the parameters and run an episode of the resulting policy in the environment.
• Combine the episode results, calculate the stochastic gradient, and update the parameters.
• ES algorithms are amenable to parallelization.
• They operate on complete episodes
• Only the scalar reward of each episode needs to be communicated (in a global sense) between workers.
• We don’t require value function approximations (and the resulting dependencies on a particular policy).
• They note that in the extreme this reduces to finite differences if each worker perturbs only a single parameter (TODO: look up finite-difference methods).
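
The loop described above can be sketched in a few lines of numpy. This is a toy, single-process sketch, not the paper's distributed implementation; `fitness` is a stand-in for total episode return, and the hyperparameter values are illustrative:

```python
import numpy as np

def es_step(theta, fitness, sigma=0.1, alpha=0.02, npop=100, rng=None):
    """One NES-style update: sample Gaussian perturbations of theta,
    score each perturbed parameter vector with `fitness`, and move
    theta along the reward-weighted average of the noise (a stochastic
    gradient of the Gaussian-smoothed objective)."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal((npop, theta.size))  # one perturbation per "worker"
    rewards = np.array([fitness(theta + sigma * eps) for eps in noise])
    # Standardize rewards so the update is invariant to reward scale and offset.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    grad = noise.T @ adv / (npop * sigma)
    return theta + alpha * grad
```

In the actual RL setting, each `fitness` evaluation would be one full episode run with the perturbed policy, which is what makes the loop embarrassingly parallel.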

• Q-learning and policy gradient methods derive their learning signal from sampling actions under a stochastic policy; ES derives its signal from perturbing the policy parameters. You therefore have to ensure that different perturbations actually produce different behaviors, some performing better than others, or else you won’t get any exploration. On some benchmarks this required virtual batch normalization.

### 3. Smoothing

A large source of difficulty in RL stems from the lack of informative gradients of policy performance: such gradients may not exist due to non-smoothness of the environment or policy, or may only be available as high-variance estimates because the environment usually can only be accessed via sampling.

• Policy gradient methods smooth the problem by adding noise in action space (probabilistic sampling, e.g. from a softmax, to select actions); ES instead adds noise in parameter space.
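
The two noise placements can be contrasted on a toy linear policy (a hypothetical example, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 2))      # policy weights: 4-dim state -> 2 action logits
state = rng.standard_normal(4)

# Policy-gradient style: noise in ACTION space. Sample from the softmax
# over logits, so the same parameters can yield different actions.
logits = state @ W
probs = np.exp(logits - logits.max())
probs /= probs.sum()
action_pg = rng.choice(2, p=probs)

# ES style: noise in PARAMETER space. Perturb the weights once, then
# act deterministically (greedily) with the perturbed policy.
W_perturbed = W + 0.1 * rng.standard_normal(W.shape)
action_es = int(np.argmax(state @ W_perturbed))
```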

• ES variance stays fixed as episodes grow longer, whereas policy-gradient variance grows nearly linearly with episode length. ES should thus have an advantage on long episodes (though in practice, discounting often reduces the effective episode length).

• ES can be reduced to a finite-difference method.
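
Concretely: if each "worker" perturbs a single coordinate in isolation, the estimator collapses to an ordinary finite-difference gradient (a generic sketch, not the paper's implementation):

```python
import numpy as np

def finite_difference_grad(f, theta, h=1e-4):
    # Central differences: each coordinate is perturbed in isolation,
    # the degenerate case of ES with axis-aligned perturbations.
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = h
        grad[i] = (f(theta + e) - f(theta - e)) / (2 * h)
    return grad
```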

• Because only total episode reward is used, ES can more easily handle sparse rewards.

### 4. Experiments

• They compared ES to TRPO on a number of MuJoCo benchmarks. On some, ES reached TRPO-level performance significantly faster (e.g. in 30% of the training time); on others it took longer.

• They compared ES to A3C on the Atari suite. It beat A3C in 23 games and lost to A3C in 28 games.

• ES algorithms parallelize well on EC2.

• Never heard of frameskip before, but apparently you sometimes skip frames and only choose an action, for example, every fourth frame, repeating it on the frames in between.
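
A minimal frameskip sketch, assuming a simplified Gym-style `env.step(action)` that returns `(obs, reward, done)` (a hypothetical interface for illustration):

```python
def step_with_frameskip(env, action, k=4):
    """Repeat `action` for k frames; return the last observation, the
    total reward accumulated over the skipped frames, and the done flag."""
    total_reward, done = 0.0, False
    for _ in range(k):
        obs, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return obs, total_reward, done
```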