Background¶
Evolution Strategies history, theory, and motivation.
History¶
Evolution Strategies (ES) were developed in the 1960s-1970s as a class of optimization algorithms inspired by biological evolution. Key milestones:
- 1960s: Ingo Rechenberg and Hans-Paul Schwefel develop ES
- 1990s: CMA-ES (Covariance Matrix Adaptation) improves performance
- 2010s: ES applied to deep learning (OpenAI ES, Natural ES)
- 2020s: EGGROLL introduces low-rank perturbations for efficiency
Theory¶
Basic ES Algorithm¶
- Initialize parameters θ₀
- Sample perturbations ε₁, ..., εₙ ~ N(0, σ²I)
- Evaluate fitness f(θ + εᵢ) for each perturbation
- Compute update: Δθ = (α / (nσ²)) Σᵢ wᵢ εᵢ, where the wᵢ are fitness-based weights (e.g., the raw values f(θ + εᵢ) or their normalized ranks)
- Update θ ← θ + Δθ
- Repeat until convergence
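The loop above can be sketched in a few lines of NumPy. This is a minimal illustration on a toy quadratic objective, not a production implementation: the population size, σ, learning rate α, and the fitness-standardization trick are all illustrative choices.

```python
import numpy as np

def es_maximize(f, theta0, sigma=0.1, alpha=0.02, pop_size=100, iters=300):
    """Minimal Evolution Strategies loop that maximizes f."""
    rng = np.random.default_rng(0)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(iters):
        # Sample perturbations eps_i ~ N(0, sigma^2 I)
        eps = sigma * rng.standard_normal((pop_size, theta.size))
        # Evaluate fitness of each perturbed parameter vector
        fitness = np.array([f(theta + e) for e in eps])
        # Standardize fitness values -- a common variance-reduction trick
        w = (fitness - fitness.mean()) / (fitness.std() + 1e-8)
        # Update: delta = alpha / (n * sigma^2) * sum_i w_i * eps_i
        theta = theta + alpha / (pop_size * sigma**2) * (w @ eps)
    return theta

# Maximize f(x) = -||x - 1||^2, whose optimum is the all-ones vector
theta = es_maximize(lambda x: -np.sum((x - 1.0) ** 2), np.zeros(5))
```

Note that the loop never calls a gradient of `f`; it only needs fitness evaluations, which is what makes ES applicable to non-differentiable objectives.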
Why ES Works¶
ES approximates the gradient of the expected fitness J(θ) = E_{ε ~ N(0, σ²I)}[f(θ + ε)]:

∇θ J(θ) = (1/σ²) E[f(θ + ε) ε]

This is estimated via Monte Carlo sampling over the population:

∇θ J(θ) ≈ (1/(nσ²)) Σᵢ f(θ + εᵢ) εᵢ
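The estimator can be sanity-checked numerically. The sketch below compares it to the analytic gradient of a simple quadratic, for which the estimator is unbiased; the objective, σ, and sample count are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = np.array([0.5, -1.0, 2.0])
sigma, n = 0.3, 200_000

# f(x) = -||x||^2 has analytic gradient -2x
f = lambda x: -np.sum(x * x, axis=-1)
analytic = -2 * theta

# Monte Carlo estimate: (1/(n*sigma^2)) * sum_i f(theta + eps_i) * eps_i
eps = sigma * rng.standard_normal((n, 3))
estimate = (f(theta + eps)[:, None] * eps).sum(axis=0) / (n * sigma**2)
```

With enough samples, `estimate` matches `analytic` closely, despite only ever querying fitness values.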
Advantages¶
- ✅ No gradients needed - Works with non-differentiable objectives
- ✅ Robust to noise - Naturally handles stochastic objectives
- ✅ Parallelizable - Population evaluations are independent
- ✅ Global search - Can escape local minima
Disadvantages¶
- ❌ Sample inefficient - Requires many evaluations
- ❌ Memory intensive - Stores perturbations for all parameters
- ❌ Slow convergence - Compared to gradient-based methods
EGGROLL Innovation¶
EGGROLL addresses the main limitations:
Problem: Memory and Computation¶
For a matrix parameter W ∈ R^(m×n):

- Memory: O(mn) per population member
- Computation: O(mn) to apply perturbations
For large models, this becomes prohibitive.
Solution: Low-Rank Perturbations¶
Instead of full-rank noise N ∈ R^(m×n), use:

- A ∈ R^(m×r), B ∈ R^(n×r) where r << min(m, n)
- Form the perturbation as A @ B.T
This reduces:

- Memory: O(mn) → O(r(m+n))
- Computation: O(mn) → O(r(m+n))
How It Still Works¶
Even with low-rank perturbations, EGGROLL achieves high-rank updates through:
- Population averaging: Multiple low-rank perturbations combine
- Per-layer updates: Each parameter tensor handled independently
- Fitness weighting: Better perturbations contribute more
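That a weighted sum of low-rank terms becomes high-rank is easy to verify numerically. In the sketch below (sizes and weights are arbitrary), each population member contributes a rank-r term, yet the combined update reaches full rank:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, pop = 64, 48, 2, 32

# Accumulate the weighted sum of per-member rank-r perturbations
update = np.zeros((m, n))
for _ in range(pop):
    A = rng.standard_normal((m, r))
    B = rng.standard_normal((n, r))
    w = rng.standard_normal()       # stand-in for a fitness weight
    update += w * A @ B.T

# For generic random factors, rank(update) = min(pop * r, m, n)
print(np.linalg.matrix_rank(update))
```

Here pop · r = 64 exceeds min(m, n) = 48, so the averaged update is (almost surely) full-rank even though each individual perturbation has rank 2.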
Comparison with Other Methods¶
ES vs Gradient Descent¶
| Aspect | ES | Gradient Descent |
|---|---|---|
| Gradients | Not needed | Required |
| Sample efficiency | Lower | Higher |
| Non-differentiable | Works | Fails |
| Parallelization | Easy | Harder |
| Global search | Better | Worse |
EGGROLL vs Standard ES¶
| Aspect | Standard ES | EGGROLL |
|---|---|---|
| Memory | O(mn) | O(r(m+n)) |
| Computation | O(mn) | O(r(m+n)) |
| Speedup | 1x | ~100x |
| Population size | Limited | Large (256+) |
Applications¶
ES and EGGROLL are useful for:
- Reinforcement Learning - Policy optimization
- Non-differentiable objectives - Discrete/combinatorial problems
- Robust optimization - Noisy or adversarial settings
- Hyperparameter optimization - Black-box tuning
- Neural architecture search - Discrete architectures
References¶
- Rechenberg, I. (1973). Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution
- Hansen, N. (2016). The CMA Evolution Strategy: A Tutorial
- Salimans, T., et al. (2017). Evolution Strategies as a Scalable Alternative to Reinforcement Learning
- EGGROLL Blog
Next Steps¶
- See EGGROLL Algorithm for implementation details
- Check Benchmarks for performance analysis