To do
Dynamic training
- Create a training algorithm of competing policies and select a 'winner' after each iteration or after a number of iterations. Competing policies have different environment configs. Goal: optimize environment parameters more efficiently and automatically at run time, rather than manually after full (10-hour) experiments. A sketch of the selection loop follows this list. Determine success by:
  - fitness metrics
  - ability to co-adapt
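A minimal sketch of such a selection loop. Only the tournament/selection logic is the point here; mutate() and the dummy scoring function are placeholders for "train each candidate config for N iterations, then compute its fitness metrics", and all parameter names are hypothetical.

import copy
import random

def mutate(env_config: dict, rng: random.Random) -> dict:
    """Perturb one environment parameter of a copied config (hypothetical keys)."""
    cfg = copy.deepcopy(env_config)
    key = rng.choice(list(cfg))
    cfg[key] = cfg[key] * rng.uniform(0.8, 1.2)
    return cfg

def select_winner(candidates, train_and_score, generations=10, rng=None):
    """Train each competing env config briefly, keep the fittest,
    and respawn the rest as mutations of the winner."""
    rng = rng or random.Random(0)
    for gen in range(generations):
        scores = [train_and_score(cfg) for cfg in candidates]
        winner = candidates[scores.index(max(scores))]
        candidates = [winner] + [mutate(winner, rng) for _ in range(len(candidates) - 1)]
        print(f"gen {gen}: best fitness {max(scores):.3f} with {winner}")
    return winner

# Dummy usage: stand-in for "train for N RLlib iterations, then evaluate fitness".
base = {"energy_gain_per_grass": 1.0, "reproduction_threshold": 10.0}  # hypothetical keys
dummy_score = lambda cfg: -abs(cfg["energy_gain_per_grass"] - 1.5)     # placeholder fitness
print(select_winner([base, dict(base), dict(base)], dummy_score, generations=5))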
- Curriculum reward tuning
- Fitness parameters (computed as in the sketch below):
  - offspring per agent
  - offspring per agent per energy
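A sketch of those two metrics, assuming per-agent counters for offspring and lifetime energy intake are collected in the evaluation loop (the AgentStats fields are illustrative, not the env's API).

from dataclasses import dataclass

@dataclass
class AgentStats:
    offspring: int          # number of successful reproductions
    energy_consumed: float  # total energy taken in over the agent's lifetime

def offspring_per_agent(stats: list[AgentStats]) -> float:
    """Mean number of offspring per agent in the population."""
    return sum(a.offspring for a in stats) / len(stats)

def offspring_per_agent_per_energy(stats: list[AgentStats]) -> float:
    """Reproductive efficiency: offspring produced per unit of energy consumed."""
    total_energy = sum(a.energy_consumed for a in stats)
    return sum(a.offspring for a in stats) / total_energy if total_energy else 0.0

pop = [AgentStats(2, 30.0), AgentStats(0, 12.0), AgentStats(1, 20.0)]
print(offspring_per_agent(pop), offspring_per_agent_per_energy(pop))  # 1.0, ~0.048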
- Protocol for storing/retrieving stats per step (outside the env: in the evaluation loop); a sketch follows.
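One possible shape for that protocol; the record/series API below is an assumption, not existing code, and the commented-out loop only illustrates where it would sit in the evaluation loop.

from collections import defaultdict

class StepStats:
    """Collects scalar stats per step and per agent, kept outside the env."""

    def __init__(self):
        self._data = defaultdict(list)  # (agent_id, key) -> [(step, value), ...]

    def record(self, step: int, agent_id: str, **metrics: float) -> None:
        for key, value in metrics.items():
            self._data[(agent_id, key)].append((step, value))

    def series(self, agent_id: str, key: str) -> list[tuple[int, float]]:
        return self._data[(agent_id, key)]

stats = StepStats()
# In the evaluation loop (illustrative):
# for step in range(max_steps):
#     obs, rewards, terminations, truncations, infos = env.step(actions)
#     for agent_id, info in infos.items():
#         stats.record(step, agent_id, energy=info.get("energy", 0.0), reward=rewards[agent_id])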
Examples to try out
- Meta-learning example in RLlib ("learning-to-learn"): https://github.com/ray-project/ray/blob/master/rllib/examples/algorithms/maml_lr_supervised_learning.py
- Curriculum: https://github.com/ray-project/ray/blob/master/rllib/examples/curriculum/curriculum_learning.py
- Curiosity: https://github.com/ray-project/ray/tree/master/rllib/examples/curiosity
- Explore the JaxMARL examples: https://github.com/flairox/jaxmarl?tab=readme-ov-file
Environment enhancements
- Male & Female reproduction instead of asexual reproduction
- Build wall or move wall
- Adding water/rivers
Improve network
- model_config={
      "conv_filters": [
          [16, [3, 3], 1],
          [32, [3, 3], 1],
          [64, [3, 3], 1],
      ],
  }
- The network is now adjustable to the observation range, but it was not effectively tuned for observation_range = 9 (for Prey); see the sketch below.
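A possible variant for a 9×9 Prey observation window: three stacked 3×3 stride-1 convolutions only have a 7×7 receptive field, so one option is a fourth layer to reach 9×9. The filter counts are guesses; whether this actually helps is what the "_network_tuning" experiment should check.

# Assumes the same RLlib conv_filters format as above: [out_channels, kernel, stride].
model_config = {
    "conv_filters": [
        [16, [3, 3], 1],
        [32, [3, 3], 1],
        [64, [3, 3], 1],
        [64, [3, 3], 1],  # extra layer so the receptive field (9x9) covers observation_range = 9
    ],
}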
- TODO: test experiment "_network_tuning" on the HBP computer
- Change config_ppo_gpu
Experiments
- Tuning hyperparameters and env parameters simultaneously (see chat)
- max_steps_per_episode:
  For policy learning performance: 500–2000 steps per episode is a common sweet spot in multi-agent RL; long enough for interactions to unfold, short enough for PPO to assign credit.
  For open-ended co-evolution (your case): you might intentionally want longer episodes (e.g. 2000–5000) so emergent dynamics have time to play out, even if training is slower.
  A good trick is to curriculum the horizon (a minimal scheduling sketch follows):
  - Start short (e.g. 500–1000) → agents learn basic survival.
  - Gradually increase (e.g. +500 every N iterations) → expose them to longer ecological timescales.
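A minimal sketch of that schedule (start at 500, add 500 every N iterations, cap at 5000). In RLlib this could be applied from a training callback such as on_train_result; here it is only the schedule function, and every_n = 50 is an assumption.

def horizon_for_iteration(iteration: int,
                          start: int = 500,
                          step: int = 500,
                          every_n: int = 50,
                          cap: int = 5_000) -> int:
    """Episode cap (max_steps) as a function of the training iteration."""
    return min(start + step * (iteration // every_n), cap)

assert horizon_for_iteration(0) == 500
assert horizon_for_iteration(100) == 1_500
assert horizon_for_iteration(450) == 5_000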
A “works-in-practice” plan for the PredPreyGrass run, plus what to tweak as you lengthen episodes.
Recommended episode horizon + hyperparameters (curriculum)
Start shorter for stability/throughput, then stretch to let eco-dynamics (booms, busts, Red-Queen) unfold.
Phase A (bootstrap)
- max_steps = 1_000
- gamma = 0.995 (effective credit horizon ≈ 1/(1−γ) ≈ 200 steps)
- lambda_ (GAE) = 0.95–0.97
Phase B (mid)
- max_steps = 2_000–3_000
- gamma = 0.997–0.998 (horizon ≈ 333–500)
- lambda_ = 0.96–0.97
Phase C (long-term dynamics)
- max_steps = 4_000–5_000
- gamma = 0.998–0.999 (horizon ≈ 500–1_000)
- lambda_ = 0.97
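The same phases as a lookup table, with a quick check that 1/(1−γ) reproduces the stated credit horizons (ranges collapsed to single illustrative values).

PHASES = {
    "A_bootstrap": {"max_steps": 1_000, "gamma": 0.995, "lambda_": 0.95},
    "B_mid":       {"max_steps": 3_000, "gamma": 0.998, "lambda_": 0.97},
    "C_long_term": {"max_steps": 5_000, "gamma": 0.999, "lambda_": 0.97},
}

for name, p in PHASES.items():
    horizon = 1.0 / (1.0 - p["gamma"])  # effective credit horizon ~ 1/(1 - gamma)
    print(f"{name}: max_steps={p['max_steps']}, credit horizon ~ {horizon:.0f} steps")
# A_bootstrap ~200, B_mid ~500, C_long_term ~1000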
Why that mapping? PPO’s useful credit horizon is ~1/(1−γ). As you increase max_steps, you raise γ so actions can “see” far enough ahead without making variance explode. Remember: in your env, if max_steps isn’t set in the config, it silently defaults to 10_000, so set it explicitly to avoid accidental long runs.
Batch/throughput knobs to adjust as episodes get longer
Keep ~4–10 episodes per PPO iteration so you still get decent reset diversity (a config sketch follows this list):
- train_batch_size: roughly episodes_per_iter × max_steps. Example: at max_steps=1_000, use 8_000–16_000. When you move to max_steps=3_000, bump toward 24_000–48_000.
- rollout_fragment_length: increase with horizon so GAE has longer contiguous fragments (e.g., 200 → 400 → 800).
- num_envs_per_env_runner: raise a bit as episodes lengthen to maintain sampler throughput.
- KL/clip: leave defaults unless you see instability; longer horizons often benefit from a slightly smaller learning rate rather than big clip/KL changes.
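A sketch of those knobs for the max_steps = 3_000 phase, assuming the fluent PPOConfig API of recent RLlib; the env name, env_config key, and exact numbers are placeholders to be adapted to config_ppo_gpu.

from ray.rllib.algorithms.ppo import PPOConfig

MAX_STEPS = 3_000
EPISODES_PER_ITER = 8  # aim for ~4-10 episodes per PPO iteration

config = (
    PPOConfig()
    .environment("pred_prey_grass", env_config={"max_steps": MAX_STEPS})  # placeholder env name/key
    .training(
        train_batch_size=EPISODES_PER_ITER * MAX_STEPS,  # 24_000 at max_steps=3_000
        gamma=0.998,
        lambda_=0.97,
        lr=3e-5,  # slightly smaller learning rate for longer horizons
    )
    .env_runners(
        num_env_runners=10,
        num_envs_per_env_runner=6,    # raise as episodes lengthen to keep sampler throughput up
        rollout_fragment_length=400,  # 10 * 6 * 400 = 24_000, matching train_batch_size
    )
)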
When to stop stretching episodes
- If timing/iter_minutes balloons or TensorBoard curves update too slowly, hold the current max_steps for a while.
- If you see extinction before the cap, longer episodes won’t help; tune ecology (e.g., energy gains/losses) instead.
Make the BHP archive available in a repository
LT goal: acquire more wealth as a population
- Energy as a proxy of wealth
- Only the top 10% of energy reproduces?
- Escaping the Malthusian trap
Integrate Dynamic Field Theory
- Wrapper around brain
- Visualize first!!!
Post about The Behavior Patterns Project on LinkedIn?