Core Concepts¶
Understanding Evolution Strategies and the EGGROLL algorithm.
Evolution Strategies Overview¶
Evolution Strategies (ES) are a class of black-box optimization algorithms that:
- Don't require gradients
- Work with non-differentiable objectives
- Evaluate multiple perturbed versions of a model (population)
- Use fitness scores to guide parameter updates
Example: Training 3D Agents¶
ES excels at training agents in complex 3D environments where gradients may be unavailable or unreliable:
How ES Works¶
- Initialize a model with parameters θ
- Sample perturbations ε₁, ε₂, ..., εₙ for n population members
- Evaluate fitness f(θ + εᵢ) for each perturbed model
- Compute update Δθ based on fitness-weighted perturbations
- Update parameters θ ← θ + α·Δθ (where α is learning rate)
- Repeat until convergence
Advantages¶
- ✅ No gradients needed
- ✅ Works with discrete/discontinuous objectives
- ✅ Naturally parallelizable
- ✅ Robust to noise
Disadvantages¶
- ❌ Requires many evaluations (population size)
- ❌ Can be slower than gradient-based methods
- ❌ Less sample-efficient for smooth objectives
EGGROLL Algorithm¶
EGGROLL (Evolution Guided General Optimization via Low-rank Learning) addresses the main limitation of ES: computational cost.
The Problem¶
Standard ES requires: - Memory: O(mn) for each matrix parameter W ∈ R^(m×n) - Computation: O(mn) to apply perturbations
For large models, this becomes prohibitive.
The Solution: Low-Rank Perturbations¶
Instead of sampling full-rank noise N ∈ R^(m×n), EGGROLL samples: - A ∈ R^(m×r), B ∈ R^(n×r) where r << min(m,n) - Forms perturbation as A @ B.T
This reduces: - Memory: O(mn) → O(r(m+n)) - Computation: O(mn) → O(r(m+n))
How It Still Works¶
Even with low-rank perturbations, EGGROLL achieves high-rank updates through: 1. Population averaging: Multiple low-rank perturbations combine 2. Per-layer updates: Each parameter tensor handled independently 3. Fitness weighting: Better perturbations contribute more
When to Use EGGROLL¶
✅ Use EGGROLL when: - You have large models with many matrix parameters - You need to train with large population sizes - Memory/computation is a bottleneck - You want gradient-free optimization
❌ Consider alternatives when: - Your model is very small (< 1K parameters) - You can compute gradients efficiently - You need maximum sample efficiency
Key Terminology¶
Population Size¶
Number of perturbed models evaluated per generation. Larger populations: - Provide better gradient estimates - Are more robust to noise - Require more computation
EGGROLL makes large populations feasible (256-1024+).
Sigma (σ)¶
Standard deviation of perturbations. Controls exploration vs exploitation: - Large σ: More exploration, slower convergence - Small σ: More exploitation, may get stuck
Typical range: 0.01 - 0.1
Learning Rate (α)¶
Step size for parameter updates. Similar to SGD: - Large α: Faster but may overshoot - Small α: Slower but more stable
Typical range: 0.001 - 0.1
Rank (r)¶
Rank of low-rank perturbations. Controls memory/computation tradeoff: - r = 1: Minimum memory, fastest - r = 2-4: Better expressivity, still efficient - r >> 1: Approaches full-rank (not recommended)
Default: r = 1 (often sufficient)
Fitness Function¶
Function that evaluates model performance. Higher is better.
For loss minimization:
Comparison: Full-Rank vs Low-Rank¶
Full-Rank ES (VanillaESTrainer)¶
Memory: O(mn) per population member
Computation: O(mn) per evaluation
Low-Rank ES (EGGROLLTrainer)¶
# For W ∈ R^(m×n), sample A ∈ R^(m×r), B ∈ R^(n×r)
A = torch.randn(m, r) * sigma
B = torch.randn(n, r) * sigma
W_perturbed = W + A @ B.T
Memory: O(r(m+n)) per population member
Computation: O(r(m+n)) per evaluation
Speedup: ~100x for typical models!
Next Steps¶
- Learn about Trainers - How to use the classes
- See Fitness Functions - Writing evaluation functions
- Check Advanced Usage - Customization and tuning