Optuna with ASHA Pruning
This experiment uses Optuna as the hyperparameter search algorithm, combined with ASHA (Asynchronous Successive Halving Algorithm) as the pruning scheduler. The setup allows us to explore a wide hyperparameter space, while quickly eliminating poorly performing trials to save compute.
Process Overview
-
Hyperparameter Search (Optuna)
- The search space is defined around PPO’s learning rate (
lr) and number of epochs (num_epochs). - Optuna suggests new parameter combinations within:
lr: sampled from a log-uniform range (9e-5to1.3e-4)num_epochs: sampled from a discrete uniform range (22to30in steps of 2)
- These candidate configurations are passed to Ray Tune.
- The search space is defined around PPO’s learning rate (
-
Pruning Scheduler (ASHA)
- ASHA evaluates all trials asynchronously.
- Trials start with a grace period of 40 iterations (no pruning).
- After this period, trials that perform poorly on the metric
score_predare stopped early. - Resources are reallocated to better-performing trials.
- Reduction factor
3controls the aggressiveness of pruning: about 1/3 of trials are kept each promotion step.
-
Custom Early Stoppers
- In addition to ASHA pruning, two custom stoppers are used:
- Plateau Stopper: ends a trial if
score_predshows little improvement (low variance) over 12 recent iterations. - Drop Stopper: ends a trial if performance drops significantly compared to recent history.
- Plateau Stopper: ends a trial if
- A MaximumIterationStopper also caps training at the configured maximum iterations (
max_iters).
- In addition to ASHA pruning, two custom stoppers are used:
-
Callbacks & Metrics
PredatorScorecomputes a predator-specific metric (score_pred = predator_return / 100.0) and injects it into Ray Tune. We decided to pick this metric rather than a metric which would combine both predator and prey rewards. A combined metric is prone to a subjectivity and interferes with the independent training of predator and prey. Therefore we have chosen predator rewards as the sole metric, as maximimizing it would inherently also keep prey rewards into play. This obviously result from the fact that predators kan not prosper without prey and would go extinct otherwise.- As soon as a trial reaches
score_pred >= 1.0, it is logged once inpredator_100_hits.csv. - The
FinalMetricsLoggercallback writes apredator_final.csvrow immediately when each trial ends, including:- Trial name
- Iteration reached
- Final score
- Hyperparameters (
lr,num_epochs) - Stop reason (e.g.,
"plateau","drop","max_iters","asha_early_stop","completed", or"error").
-
Resource Allocation
- The script runs on 32 CPUs with up to 4 concurrent trials:
- Each trial: 1 driver + 1 learner + 3 environment runners (2 CPUs each).
- Total: 8 CPUs per trial × 4 trials = 32 CPUs fully utilized.
- The script runs on 32 CPUs with up to 4 concurrent trials:
Why This Setup?
- Optuna ensures broad, efficient coverage of the hyperparameter space (smarter than random search).
- ASHA pruning aggressively cuts off poor performers, preventing wasted resources.
- Custom stoppers add safety nets (plateau detection and performance drops).
- Immediate CSV logging makes trial eliminations transparent while training is still running.
Example Outcome
- Trials that failed to improve were stopped early with
asha_early_stop. - Some trials hit plateaus and were stopped with
plateau. - The best trials ran longer (50–70 iterations), achieving predator scores above 1.0 (≥100 predator reward).
- Hyperparameters around lr ≈ 1.1e-4 and num_epochs ≈ 26–30 consistently produced the best results.
In summary:
This workflow is Optuna-driven hyperparameter search with ASHA pruning plus custom stoppers, fully utilizing available CPUs and logging every trial’s fate in real time.