Interaction Evolved-Learned Cooperation
A Two-Timescale Theory of Cooperation
Cooperation can emerge through two different adaptive processes:
- Learning within lifetimes (behavioral plasticity, reinforcement learning)
- Selection across generations (evolutionary dynamics)
These processes operate on entirely different timescales:
Learning timescale << Evolutionary timescale
In natural systems they interact. The two-timescale simulation family in this section provides a controlled framework where both processes can be analyzed together.
Fast and Slow Dynamics
Fast timescale — learning
Agents update their policy during their lifetime to increase expected reward.
The update rule is expressed in terms of:
- s = social state (partner history, local interaction context),
- a = action (cooperate or defect),
- r = interaction reward/payoff,
- alpha = learning rate.
In this simulation family, reward is defined by donation-game payoffs and partner-specific interaction outcomes.
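As a minimal sketch of the fast timescale, the snippet below pairs a donation-game payoff with a simple tabular value update. The benefit and cost values (`b`, `c`), the state labels, the function names, and the bandit-style update without bootstrapping are all illustrative assumptions, not the update rule used by the models on this site.

```python
import random

def donation_payoff(my_action, partner_action, b=3.0, c=1.0):
    """Payoff to the focal agent in one donation-game round.

    Cooperating ("C") costs the actor c and gives the partner b;
    defecting ("D") costs nothing and gives nothing.
    b and c are assumed example values.
    """
    gain = b if partner_action == "C" else 0.0
    cost = c if my_action == "C" else 0.0
    return gain - cost

def update_value(q, s, a, r, alpha=0.1):
    """Move the estimate q[(s, a)] a fraction alpha toward the reward r."""
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + alpha * (r - old)
    return q[(s, a)]

def choose_action(q, s, epsilon=0.1, rng=random):
    """Epsilon-greedy choice between cooperate and defect in state s."""
    if rng.random() < epsilon:
        return rng.choice(["C", "D"])
    return max(["C", "D"], key=lambda a: q.get((s, a), 0.0))

# Hypothetical usage: the state is simply what the partner did last time.
q = {}
reward = donation_payoff("C", "C")
update_value(q, "partner_cooperated", "C", reward)
```

Here the social state is reduced to a single label per partner; richer states (longer partner history, local context) would only change what is used as the key `s`.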
Slow timescale — evolution
Population composition changes across generations:
frequency_next = (fitness / mean_fitness) * frequency
Fitness is the payoff accumulated over a lifetime of interactions under the learned policy:
fitness = sum over t = 1..T of r_t(pi)
Where pi is the policy the agent has learned by the end of its lifetime and T is the number of interactions per generation. Agents that learned to cooperate with reliable partners and defect against exploiters accumulate higher payoffs and therefore reproduce more.
Evolution therefore selects based on learning outcomes.
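The frequency update above can be sketched in a few lines. The function names are hypothetical, and the sketch assumes fitness values have already been made positive (donation-game payoffs can be negative, so a real implementation would need a shift or baseline).

```python
def lifetime_fitness(payoffs):
    """Fitness as the payoff accumulated over one lifetime of interactions."""
    return sum(payoffs)

def replicator_step(frequencies, fitnesses):
    """Apply frequency_next = (fitness / mean_fitness) * frequency to each type."""
    mean_fitness = sum(f * w for f, w in zip(frequencies, fitnesses))
    if mean_fitness <= 0:
        raise ValueError("mean fitness must be positive for proportional selection")
    return [f * w / mean_fitness for f, w in zip(frequencies, fitnesses)]

# Two types at equal frequency; the first accumulated three times the payoff.
next_freqs = replicator_step([0.5, 0.5], [3.0, 1.0])
# next_freqs is [0.75, 0.25]: the higher-payoff type grows at the other's expense.
```

Because each frequency is divided by the population mean, the updated frequencies still sum to one.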
The Baldwin Effect
The Baldwin effect describes how learning changes evolutionary trajectories without requiring inheritance of learned behavior. James Mark Baldwin proposed it in 1896–1897 under the name Organic Selection; George Gaylord Simpson gave it its modern name in 1953. The mechanism is Darwinian throughout — no acquired traits are inherited.
The core insight: an organism that can learn a beneficial behavior survives long enough to reproduce even before its genes encode that behavior directly. Over generations, genetic variants that facilitate the learned behavior accumulate — not because the learned trait is passed on, but because those variants are selected for.
Step 1 — Plasticity enables adaptive behavior
Individuals that can learn cooperative strategies survive and reproduce even when their starting genotype alone would not suffice. Plasticity keeps them viable while genetic variants that support cooperation spread through the population.
Step 2 — Selection favors learnability
Evolution favors traits that:
- reduce learning cost
- bias initial behavior toward cooperation
- increase learning speed
- improve partner discrimination
Step 3 — Partial genetic assimilation
Cooperation becomes easier or faster to learn and may become partially innate. This is distinct from Waddington's genetic assimilation, where a trait becomes fully encoded and developmentally canalized. The Baldwin effect produces facilitation of learning — the learned behavior becomes cheaper or faster — rather than necessarily replacing it.
Why learning smooths the fitness landscape
Hinton and Nowlan (1987) showed computationally that learning converts a needle-in-a-haystack fitness landscape into a smooth gradient that evolution can climb. Without learning, a cooperative genotype must be nearly complete to provide any fitness advantage. With learning, a partial genotype gets finished within a lifetime and still reproduces — turning a cliff into a slope.
In the cooperation context:
- Without learning: a genotype predisposed to cooperate has low fitness unless partners are also cooperative, which is rare in a defector-dominated population — cooperative genotypes are eliminated before they can spread.
- With learning: an agent with even a weak cooperative predisposition can learn to discriminate — cooperating with cooperators, withholding from defectors — and accumulate net positive payoff even in a mixed population.
Learning rescues cooperative genotypes that selection alone would eliminate.
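The landscape-smoothing argument can be illustrated with a small calculation in the spirit of Hinton and Nowlan's setup. This is a simplified sketch, not their full model: it assumes every genetically fixed locus is already correct, that each plastic locus is guessed independently with probability 1/2 per trial, and the function name is hypothetical.

```python
def p_success(plastic_loci, trials):
    """Probability that random guessing sets all plastic loci correctly
    at least once within the given number of learning trials."""
    p_single = 0.5 ** plastic_loci          # chance that one trial is fully correct
    return 1.0 - (1.0 - p_single) ** trials

# Fewer plastic loci means more of the solution is already encoded.
# Success probability rises gradually as loci become fixed, giving
# selection a slope to climb instead of a single isolated peak.
curve = {q: p_success(q, trials=1000) for q in (20, 15, 10, 5, 0)}
```

With a single trial (no real learning) the same function returns `0.5 ** plastic_loci`, which is negligible unless almost everything is already encoded; with many trials, partially encoded genotypes succeed often enough to be favored by selection.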
Learning creates new selection pressures
A learned cooperative strategy, once widespread in the population, creates selection pressure for genetic variants that achieve the same behavior at lower cost:
- lower learning rates suffice when the behavior is already partially encoded
- initial cooperation biases can be set more aggressively when the social environment has become reliably cooperative
- discrimination thresholds can loosen as defectors become rarer
This feedback loop — learning expands what is reachable; evolution consolidates what learning discovered — is the core dynamic this simulation family is designed to capture.
Fitness Landscape Interpretation
Without learning:
- cooperative strategies may have low initial fitness
- evolution cannot discover them
With learning:
- agents discover cooperative policies during life
- these increase reproductive success
- evolution favors individuals predisposed to those behaviors
Learning smooths the fitness landscape and guides selection.
Interaction Regimes
Learning and evolution can interact in different ways:
- Learning accelerates evolution: Plasticity enables rapid discovery of cooperation that selection stabilizes.
- Learning masks selection: If all agents learn equally well, fitness differences shrink.
- Learning opposes evolution: Short-term learned defection may increase individual reward but reduce population fitness.
- Coevolution of learning ability: Selection may favor faster or more robust learners.
Manifestation in the Simulation Suite
In the integrated two-timescale cooperation simulations:
- interactions are local by default (ring structure)
- agents update behavior within generation (trust or Q-values)
- agents reproduce between generations based on accumulated payoff
This creates a Baldwin-style pathway:
- Agents learn partner-contingent cooperation during life
- Learners with better long-run payoff leave more offspring
- Offspring inherit parameter settings that make successful learning more likely
Cooperation shifts from:
context-dependent learning alone -> learning supported by evolved predispositions
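The pathway above can be sketched as a compact two-timescale loop. Everything here is an illustrative assumption rather than the site's actual implementation: the payoff values `B` and `C`, the trust-update rule, the single heritable trait (a trust prior), population-wide payoff-proportional reproduction, and all function names.

```python
import random

B, C = 3.0, 1.0  # assumed donation-game benefit and cost

def run_generation(trust_priors, rounds=50, rate=0.2, rng=random):
    """One lifetime on a ring (needs at least 3 agents). Each agent keeps a
    trust value per neighbor, donates with probability equal to that trust,
    and nudges the trust toward what the neighbor just did."""
    n = len(trust_priors)
    neighbors = [((i - 1) % n, (i + 1) % n) for i in range(n)]
    trust = [{j: trust_priors[i] for j in neighbors[i]} for i in range(n)]
    payoff = [0.0] * n
    for _ in range(rounds):
        # Each agent decides, per neighbor, whether to donate this round.
        gives = [{j: rng.random() < trust[i][j] for j in neighbors[i]}
                 for i in range(n)]
        for i in range(n):
            for j in neighbors[i]:
                if gives[i][j]:
                    payoff[i] -= C
                    payoff[j] += B
        # Fast timescale: partner-specific trust learning.
        for i in range(n):
            for j in neighbors[i]:
                observed = 1.0 if gives[j][i] else 0.0
                trust[i][j] += rate * (observed - trust[i][j])
    return payoff

def reproduce(trust_priors, payoff, mutation_sd=0.05, rng=random):
    """Slow timescale: sample parents in proportion to shifted payoff and
    pass on the trust prior (not the learned trust) with small mutation."""
    low = min(payoff)
    weights = [p - low + 1e-9 for p in payoff]
    parents = rng.choices(trust_priors, weights=weights, k=len(trust_priors))
    return [min(1.0, max(0.0, g + rng.gauss(0.0, mutation_sd))) for g in parents]

def evolve(n=30, generations=20, seed=0):
    """Alternate lifetimes of learning with rounds of selection and
    record the mean inherited trust prior after each generation."""
    rng = random.Random(seed)
    priors = [rng.random() for _ in range(n)]
    history = []
    for _ in range(generations):
        payoff = run_generation(priors, rng=rng)
        priors = reproduce(priors, payoff, rng=rng)
        history.append(sum(priors) / n)
    return history
```

Note that offspring inherit only the prior, never the trust values learned during life, which keeps the sketch Darwinian in the sense described in the Baldwin effect section. Whether the mean prior rises or falls in this toy version depends on the assumed parameters, so no outcome is claimed here.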
What Can Evolve
Selection can act on:
- trust predispositions (`trust_prior`)
- social responsiveness to experience
- reinforcement-learning parameters (`alpha`, `epsilon`, `gamma`, bias)
- social-cognitive parameters (reputation weighting, rejection threshold, forgiveness)
This leads to cooperation-friendly learning phenotypes rather than fixed cooperative strategies.
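One way to picture such a learning phenotype is as a small record of heritable parameters that is copied with mutation between generations. The sketch below reuses the parameter names listed above, but the class name, default values, mutation scheme, and clipping ranges are assumptions made for illustration.

```python
import random
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class LearningGenome:
    """Heritable settings that shape how an agent learns (defaults are assumed)."""
    trust_prior: float = 0.5   # initial trust toward an unknown partner
    alpha: float = 0.1         # learning rate
    epsilon: float = 0.1       # exploration probability
    gamma: float = 0.9         # discount factor
    bias: float = 0.0          # initial tilt toward cooperation

def clip01(x):
    """Keep a probability-like parameter inside [0, 1]."""
    return min(1.0, max(0.0, x))

def mutate(genome, sd=0.02, rng=random):
    """Return an offspring genome with small Gaussian changes to each parameter."""
    return replace(
        genome,
        trust_prior=clip01(genome.trust_prior + rng.gauss(0.0, sd)),
        alpha=clip01(genome.alpha + rng.gauss(0.0, sd)),
        epsilon=clip01(genome.epsilon + rng.gauss(0.0, sd)),
        gamma=clip01(genome.gamma + rng.gauss(0.0, sd)),
        bias=genome.bias + rng.gauss(0.0, sd),  # bias is left unbounded here
    )

parent = LearningGenome()
child = mutate(parent, rng=random.Random(0))
```

What is inherited is the genome, not any behavior learned during life; a fixed cooperative strategy would correspond to a genome whose parameters leave no room for learning.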
Testable Predictions
The two-timescale framework generates testable predictions:
- Populations with repeated interaction should evolve higher cooperation than one-shot regimes.
- Selection should favor parameter combinations that improve partner discrimination.
- In stranger-rich environments, reputation-mediated mechanisms should outperform pure partner-memory mechanisms.
- Different learning rules (trust update vs Q-learning) should produce different cooperation-payoff trade-offs.
All of these are testable within the integrated model family.
Relation to Classical Theories
Classical evolutionary models:
- fixed strategies
- cooperation via selection only
Pure reinforcement learning models:
- cooperation within lifetimes
- no generational dynamics
This framework unifies both:
Cooperation = f(learning dynamics, evolutionary dynamics)
Related Work (Closest by Axis)
No single landmark paper fully matches this integrated setup across all dimensions (reciprocal cooperation, local interaction structure, reinforcement learning, and between-generation selection over learning parameters). The most relevant work falls into three overlapping groups: reciprocal altruism theory, network reciprocity theory, and modern multi-agent learning studies of social dilemmas.
Within this project, Model 1 maps most directly to direct reciprocity and network reciprocity theory, Model 2 maps most directly to learned reciprocity in multi-agent reinforcement learning, and Model 3 extends that line toward reputation, partner choice, and socially mediated cooperation with strangers.
| Work | Closest axis to this simulation family | Main gap vs this simulation family |
|---|---|---|
| Trivers (1971), The Evolution of Reciprocal Altruism | Foundational logic of repeated reciprocal cooperation | Verbal evolutionary theory rather than an explicit learning-plus-selection simulation |
| Axelrod and Hamilton (1981), The Evolution of Cooperation | Repeated-interaction conditions under which reciprocity can stabilize cooperation | Strategy tournament framework rather than agents that learn within life and evolve between generations |
| Nowak (2006), Five Rules for the Evolution of Cooperation | Direct reciprocity, indirect reciprocity, and network reciprocity as a unifying framework | Analytic synthesis rather than a concrete parameter-evolution simulation |
| Ohtsuki et al. (2006), A Simple Rule for the Evolution of Cooperation on Graphs and Social Networks | Why local interaction structure can protect cooperation | Graph-theoretic selection result rather than within-lifetime partner learning |
| Claus and Boutilier (1998), The Dynamics of Reinforcement Learning in Cooperative Multiagent Systems | Foundational emergence of cooperation in multi-agent reinforcement learning | Small, abstract cooperative games with no explicit generational inheritance of learning traits |
| Eccles et al. (2019), Learning Reciprocity in Complex Sequential Social Dilemmas | Reciprocity under temporal and social complexity in learned agents | No explicit reproduction-selection loop over inherited social-learning parameters |
Taken together, these works capture the core logic behind the present simulation family: reciprocal altruism, repeated interaction, local network structure, and partner-contingent learning. The distinctive contribution here is that these ingredients are combined in a single two-timescale setup where learning unfolds within life and learning parameters themselves evolve across generations.
Adjacent computational environments
Some broader multi-agent reinforcement learning (MARL) and artificial-society papers remain relevant as neighboring context, but they are not the primary fit for this page's argument:
- Leibo et al. (2017), Multi-agent Reinforcement Learning in Sequential Social Dilemmas for sequential social-dilemma structure
- Leibo et al. (2018), Malthusian Reinforcement Learning for ecology-linked population dynamics
- Zheng et al. (2018), MAgent and Suarez et al. (2019), Neural MMO for large-scale multi-agent environments
- Leibo et al. (2021), Melting Pot for broad evaluation of social behaviors
These works are best understood here as adjacent environment or benchmark context, not as the closest direct precedents for the trust-learning, Q-learning, and extended reciprocity models documented in this section.
Simulation Companion
The concrete two-timescale experiments documented for this site are collected in a companion simulation section. That section contains model-by-model results (trust learning, Q-learning, extended social mechanisms), the network-diversity experiment, and focused appendices.
Summary
The interaction between learning and selection:
- couples fast behavioral adaptation with slow population change
- enables the Baldwin effect
- allows plasticity to guide evolution
- explains how cooperation emerges, stabilizes, or collapses
This interaction forms the core mechanism linking nurture and nature in the integrated two-timescale cooperation simulations.