Model 2 - Q-learning

Model 2 (two_timescale_q_learning.py) keeps the same two-timescale framework as Model 1 but replaces scalar trust with action-value learning, giving agents a more principled reinforcement-learning algorithm.

Key findings

Repeated-interaction payoff rises to 611 — nearly double Model 1's 313 — because Q-learning agents explicitly price in the future value of a cooperative relationship via the discount factor.
Repeated cooperation rate is lower than Model 1 (0.56 vs 0.98). Q-learning agents keep exploring (ε ~0.11 at convergence), occasionally defecting to probe partners.
In one-shot interaction, Q-learning achieves 0.445 cooperation — far above Model 1's zero — because initial_q_bias can encode a standing cooperative prior without needing learned history.
Trust learning maximises cooperation rate; Q-learning maximises payoff by retaining strategic exploration and leveraging future relationship value.
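The role of the discount factor in these findings can be illustrated with a small worked calculation. This is only a sketch: the per-round reward of 3.0 and the function name `discounted_value` are hypothetical and are not taken from two_timescale_q_learning.py.

```python
def discounted_value(reward_per_round, gamma, n_rounds):
    """Sum of reward_per_round * gamma**t over rounds t = 0 .. n_rounds - 1."""
    return sum(reward_per_round * gamma ** t for t in range(n_rounds))


# Hypothetical cooperative reward of 3.0 per round (not a value from the model).
one_round_only = discounted_value(3.0, gamma=0.0, n_rounds=3)  # future ignored
with_future = discounted_value(3.0, gamma=0.5, n_rounds=3)     # 3 + 1.5 + 0.75
```

With gamma = 0 only the current round counts, so a relationship is worth no more than a single interaction. Any gamma > 0 adds weight to later rounds, which is the sense in which an agent can price in the future value of a partner.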

Learning during a lifetime

Each agent stores partner-specific Q-values for both actions:

Q[i, j, COOPERATE]
Q[i, j, DEFECT]

Updates follow temporal-difference learning:

new Q = old Q + alpha * (reward + gamma * max future Q - old Q)
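The update rule above might look roughly like this in code. This is a minimal sketch, not the project's implementation: the nested-list layout, the function names, the action encoding, and the choice to apply initial_q_bias to the COOPERATE entry only are all assumptions made for illustration.

```python
COOPERATE, DEFECT = 0, 1  # assumed action encoding


def make_q_table(n_agents, initial_q_bias=0.0):
    """Build Q[i][j][action]. Assumption: the bias seeds only the COOPERATE value."""
    return [[[initial_q_bias, 0.0] for _ in range(n_agents)]
            for _ in range(n_agents)]


def td_update(q, i, j, action, reward, alpha, gamma):
    """Apply one temporal-difference update to agent i's value for partner j."""
    old_q = q[i][j][action]
    # The only "state" is the partner, so the best future value is the
    # larger of the two Q-values agent i currently holds for partner j.
    max_future_q = max(q[i][j])
    q[i][j][action] = old_q + alpha * (reward + gamma * max_future_q - old_q)
    return q[i][j][action]
```

For example, starting from a cooperative value of 0.5, a reward of 3.0 with alpha = 0.5 and gamma = 0.5 moves the value to 0.5 + 0.5 * (3.0 + 0.25 - 0.5) = 1.875.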

Action selection is epsilon-greedy: with probability ε the agent tries a random action, otherwise it picks the higher Q-value for that partner.
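A sketch of epsilon-greedy selection for one partner, under stated assumptions: the function name `choose_action`, the action encoding, and breaking ties in favour of cooperation are illustrative choices, not details confirmed by the source.

```python
import random

COOPERATE, DEFECT = 0, 1  # assumed action encoding


def choose_action(q_values, epsilon, rng=random):
    """Pick an action given [Q_cooperate, Q_defect] for one partner."""
    if rng.random() < epsilon:
        # Explore: pick either action with equal probability.
        return rng.choice([COOPERATE, DEFECT])
    # Exploit: take the higher Q-value (ties go to COOPERATE, an assumption).
    if q_values[COOPERATE] >= q_values[DEFECT]:
        return COOPERATE
    return DEFECT
```

Under this sketch, an agent whose greedy choice is to cooperate still defects with probability epsilon / 2, because half of its random exploratory actions land on DEFECT.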

Evolution between generations

Agents inherit and mutate four RL parameters:

exploration_rate (epsilon)
learning_rate (alpha)
discount_factor (gamma)
initial_q_bias
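Inheritance with mutation of these four parameters could be sketched as follows. The `Genome` class, the Gaussian noise, the `sigma` default, and the clipping bounds are assumptions for illustration; the source does not specify the mutation operator. The bias is left unbounded here because the results include a negative evolved value.

```python
import random
from dataclasses import dataclass


@dataclass
class Genome:
    exploration_rate: float  # epsilon
    learning_rate: float     # alpha
    discount_factor: float   # gamma
    initial_q_bias: float


def clip(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def mutate(parent, sigma=0.05, rng=random):
    """Return a child genome: the parent's values plus Gaussian noise (assumed)."""
    return Genome(
        exploration_rate=clip(parent.exploration_rate + rng.gauss(0.0, sigma), 0.0, 1.0),
        learning_rate=clip(parent.learning_rate + rng.gauss(0.0, sigma), 0.0, 1.0),
        discount_factor=clip(parent.discount_factor + rng.gauss(0.0, sigma), 0.0, 1.0),
        # Not clipped, so the bias can drift negative as well as positive.
        initial_q_bias=parent.initial_q_bias + rng.gauss(0.0, sigma),
    )
```

A call such as `mutate(Genome(0.5, 0.5, 0.5, 0.0), rng=random.Random(0))` returns a new genome and leaves the parent unchanged.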

Results summary

| Metric | One-shot | Repeated |
| --- | --- | --- |
| Final cooperation rate | 0.445 | 0.560 |
| Final mean payoff | 8.150 | 611.350 |
| Final exploration rate (ε) | 0.581 | 0.109 |
| Final learning rate (α) | 0.353 | 0.180 |
| Final discount factor (γ) | 0.621 | 0.360 |
| Final initial Q-bias | −0.509 | 0.889 |

Display 1: Evolved parameter values and cooperation outcomes for Model 2 (Q-Learning) under one-shot and repeated interaction.

One-shot interaction

**Display 1:** Model 2 cooperation trajectory in one-shot interaction.

**Display 2:** Model 2 evolved Q-learning parameters in one-shot interaction.

Repeated interaction

**Display 3:** Model 2 cooperation trajectory in repeated interaction.

**Display 4:** Model 2 evolved Q-learning parameters in repeated interaction.

Summary

With proper Q-learning, repeated-interaction payoff rises to 611, nearly double Model 1's 313. Q-learning agents price in the discounted long-term value of the cooperative relationship, not just the payoff of a single round, which makes sustained cooperation more individually rational.

Repeated cooperation rate is lower (0.56 vs 0.98). Q-learning agents keep exploring (ε ≈ 0.11) even late in evolution, occasionally defecting to probe partners. The trust-learning model converges to near-universal cooperation because its deterministic threshold eventually locks in high responsiveness. The trade-off is real: trust learning maximises cooperation rate; Q-learning maximises payoff.

Q-learning agents stay strategically selective. They never fully stop exploring: although ε-greedy exploration picks actions at random, the occasional defections it produces work as probes of whether a partner is still worth cooperating with. Because γ > 0, the agents also assign compounding future value to a good cooperative relationship, so they actively protect it. Q-learning agents detect and punish defectors faster, value long-term relationships more accurately, and don't blindly cooperate with everyone.

Humans are probably closer to the Q-learning model than the trust model. We don't cooperate unconditionally even with close partners; we maintain low-level vigilance in trusted relationships; we discount the future more heavily in unstable environments and cooperate more when the future feels secure. The high payoff of Q-learning reflects an evolutionary logic: strategic, selective cooperation with future-orientation outperforms both pure defection and unconditional cooperation.

For a deeper analysis of what this trade-off means for human psychology, see Appendix: Strategic and psychological interpretation.
