Musing 119: Reproducibility Study of Cooperate or Collapse: Emergence of Sustainable Cooperation in a Society of LLM Agents
Rare reproducibility study out of University of Amsterdam
Today’s paper: Reproducibility Study of "Cooperate or Collapse: Emergence of Sustainable Cooperation in a Society of LLM Agents". Curvo et al. 14 May 2025. https://arxiv.org/pdf/2505.09289
Proper reproducibility studies in computer science are still rare, which makes today’s paper all the more impressive. The authors evaluate and extend the findings of Piatti et al. (2024), who introduced GovSim, a simulation framework designed to assess the cooperative decision-making capabilities of large language models (LLMs) in resource-sharing scenarios. GovSim is a simulation platform with specific metrics and environment dynamics. Each simulation includes 5 agents, each using its own instance of the same LLM. The platform includes three scenarios for agents to interact in, all designed to study cooperation, negotiation, and competition among them. The scenarios are mathematically equivalent to each other, differing only in the context of the shared resource, so the same metrics are used to evaluate the agents’ performance across all of them. The three scenarios are as follows:
(1) Fishery, where agents share a fish-filled lake and decide how many tons of fish to catch each month;
(2) Pasture, where agents, as shepherds, control flocks of sheep and decide how many sheep to allow on a shared pasture; and
(3) Pollution, where factory owners must balance production with pollution.
By replicating key experiments, the authors validate claims regarding the performance of large models, such as GPT-4-turbo, compared to smaller models. They also evaluate additional models, such as DeepSeek-V3 and GPT-4o-mini, to test whether cooperative behavior generalizes across different architectures and model sizes. Furthermore, they introduce new settings: a heterogeneous multi-agent environment, a scenario using Japanese instructions, and an “inverse environment” where agents must cooperate to mitigate harmful resource distributions.
The goal of GovSim scenarios is to create a resource-sharing environment where agents must balance their individual goals (maximizing their own resource consumption and survival) with the collective goal of sustainability, enforcing cooperation (or not). Each scenario is described by two main dynamic components that change over time: h(t), the amount of shared resource at time t, and f(t), the sustainability threshold at time t. The sustainability threshold is the maximum amount of resource that can be extracted from the environment at time t without depleting it at time t + 1, considering that the resources recover based on a predefined growth rate, which determines how much the shared resource increases each month.
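To make the threshold concrete, here is a minimal sketch under the recovery rule used later in the post (the remaining resource doubles each month, capped at 100); the exact growth functions and the general definition of f(t) are in the original paper:

```python
def sustainability_threshold(h: float, cap: float = 100.0) -> float:
    """Largest total harvest x at time t such that the resource still
    recovers to its current level h by t+1, under the recovery rule
    h(t+1) = min(2 * (h - x), cap).

    Below the cap, 2 * (h - x) >= h gives x <= h / 2; at the cap
    (h == cap), leaving cap / 2 behind doubles back to exactly cap,
    which is again h / 2. So the threshold is half the current stock.
    """
    return min(h, cap) / 2
```

With the lake full (h = 100), the five agents can jointly extract up to 50 tons per month indefinitely; anything above that shrinks the stock.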
The metrics used to evaluate the agents’ performance are survival rate, survival time, total gain, efficiency, equality, and over-usage. The formulation of these metrics is detailed in the original paper. Cooperation is achieved in a given simulation if, over time, the agents manage to sustainably extract the shared resource without depleting it.
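As the post notes, the exact formulations live in the original paper; as a rough sketch, here are plausible versions of two of these metrics (treat the precise definitions as assumptions): equality as one minus the Gini coefficient of per-agent total gains, and over-usage as the fraction of individual harvest actions that exceeded an equal share f(t)/n of the sustainability threshold:

```python
def equality(gains: list[float]) -> float:
    """1 minus the Gini coefficient of per-agent total gains
    (1.0 = perfectly equal split, lower = more unequal)."""
    n, total = len(gains), sum(gains)
    if total == 0:
        return 1.0
    pairwise = sum(abs(a - b) for a in gains for b in gains)
    return 1.0 - pairwise / (2 * n * total)

def over_usage(actions: list[list[float]], thresholds: list[float],
               n_agents: int = 5) -> float:
    """Fraction of individual monthly actions exceeding the equal
    share f(t) / n of that month's sustainability threshold."""
    flagged = sum(
        1
        for month_actions, f in zip(actions, thresholds)
        for a in month_actions
        if a > f / n_agents
    )
    total = sum(len(m) for m in actions)
    return flagged / total if total else 0.0
```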
Each agent receives identical instructions that explain the dynamics of GovSim. The simulation is based on two main phases: harvesting and discussion. At the beginning of the month, the agents harvest the shared resource. All agents submit their actions privately (how much of the resource they would like to consume, up to the total resources available). Their actions are then executed simultaneously, and each agent’s individual choices are made public. At this point, the agents have an opportunity to communicate freely with each other using natural language. At the end of the month, the remaining shared resources are doubled (capped at 100). When h(t) falls below C = 5, the resource collapses and nothing else can be extracted. Each simulation runs for T = 12 months (time steps).
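The monthly loop can be sketched as follows. This is a simplification under stated assumptions: agents are plain policy functions rather than LLMs, the discussion phase is omitted, over-requests are scaled down proportionally, and the collapse check runs after regrowth (the original codebase may order these steps differently):

```python
def run_simulation(agents, T=12, h0=100.0, cap=100.0, collapse=5.0):
    """Sketch of one GovSim-style episode. `agents` is a list of
    policies mapping current stock h -> requested harvest. Returns
    the survival time in months (T if the resource never collapses)."""
    h = h0
    for month in range(T):
        # Phase 1: private, simultaneous harvesting.
        requests = [agent(h) for agent in agents]
        total = sum(requests)
        if total > h:  # cannot extract more than exists; scale down
            requests = [r * h / total for r in requests]
        h -= sum(requests)
        # (Phase 2 in GovSim: free-form natural-language discussion,
        # omitted here since these policies do not communicate.)
        # End of month: the remaining resource doubles, capped at 100.
        h = min(2 * h, cap)
        if h < collapse:
            return month + 1
    return T
```

A policy where each of the 5 agents takes h/10 stays exactly at the sustainability threshold and survives all 12 months, while greedier policies collapse the resource almost immediately.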
Let’s get into the experiments. The authors first replicated the results of the original paper. Next, to conduct experiments using new models, they followed the same procedure as for the original models: they added the new models to the configuration files and ran the experiments for the Fishery scenario, both in the default and universalization setups. The results were then compared with those of the original models to assess the performance of the new models in the GovSim platform. Some models, such as GPT-4o-mini, did not require any additional setup, while others, like the API-based DeepSeek-V3, required specific configurations to be added to the codebase.
For new results, the authors did several things. First, they modified the codebase to implement an inverse (“negative”) environment, where agents must eliminate a harmful resource at a cost. They modeled this as a shared house scenario in which agents must remove accumulating trash to prevent the house from becoming unlivable. Evaluation metrics were adjusted accordingly.
Also, to introduce a heterogeneous multi-agent scenario, they modified the codebase to allow different models to be assigned to each agent. This was achieved by updating the configuration files to specify which models would be used by each agent. The primary hypothesis they aimed to test was whether a high-performing model could influence the behavior of a low-performing model to prevent collapse, and vice versa. To test this, they ran the default scenario with two model combinations: DeepSeek-V3 and GPT-4o-mini in a 4-to-1 ratio, and the same models in a 2-to-3 ratio. This can also test if a larger model can enhance the performance of a smaller model, or vice versa, through their interactions. The results were compared with those of the original models to assess how different model combinations impacted agent behavior. Additionally, they analyzed the behavior of individual agents to determine if the performance of one model influenced the behavior of others within the simulation.
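The per-agent assignment itself is simple bookkeeping. Below is a hypothetical helper (the actual GovSim configuration files use their own format, so treat the names here as illustrative) that expands the two model combinations tested in the paper into a 5-agent roster:

```python
def make_agent_roster(ratio: dict[str, int]) -> list[str]:
    """Hypothetical sketch: expand a {model_name: agent_count} mapping
    into the per-agent model assignment for one 5-agent GovSim run."""
    roster = [name for name, count in ratio.items() for _ in range(count)]
    assert len(roster) == 5, "GovSim simulations use 5 agents"
    return roster

# The two combinations tested in the heterogeneous experiment:
combo_4_to_1 = make_agent_roster({"deepseek-v3": 4, "gpt-4o-mini": 1})
combo_2_to_3 = make_agent_roster({"deepseek-v3": 2, "gpt-4o-mini": 3})
```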
Let’s get to the results. For the inverse environment experiment (Figure 7 and Table 5 below), all models except Mistral-7B and Qwen2.5-0.5B maintained cooperation for the full 12 months. However, their harvesting behavior was noticeably more erratic than in the default fishery scenario. A striking contrast is that, while most models failed the sustainability test in the default setting, nearly all succeeded in the trash scenario. This suggests that agents perceive the two scenarios differently despite their mathematical equivalence, leading to different behavior and aligning with the concept of loss aversion, where agents take greater risks to avoid losses than to achieve gains. One key difference between the two scenarios is the emergence of discussions about a rotating system in the trash scenario, which is sometimes applied and sometimes not, a behavior absent in the fishery setting. This likely reflects cultural patterns in which undesirable tasks, especially household chores, are commonly shared and rotated. Such tendencies may have emerged from the models’ training and fine-tuning, reinforcing cooperative behaviors related to task distribution.
Another, more expected result is that the instruction language does not seem to matter for the models’ choices:
In closing, the authors’ results confirm that the GovSim benchmark can be applied to new models, scenarios, and languages, offering valuable insights into the adaptability of LLMs in complex cooperative tasks. Moreover, the experiment involving heterogeneous multi-agent systems demonstrates that high-performing models can influence lower-performing ones to adopt similar behaviors. This finding has significant implications for other agent-based applications, potentially enabling more efficient use of computational resources and contributing to the development of more effective cooperative AI systems. However, what is most impressive to me is that the authors took the time to do the study and write about it, and made it clear in the title and the abstract itself that this is a reproducibility study, not an ‘original’ piece of research. We need more such studies of important LLM papers in the future.