Musing 5: Embodied LLM Agents Learn to Cooperate in Organized Teams
This paper introduces a novel framework that integrates LLMs into multi-agent systems to enhance cooperation through structured communication and organizational roles.
Today’s paper: Embodied LLM Agents Learn to Cooperate in Organized Teams. Guo et al. Mar. 19, 2024. arXiv:2403.12482. Link
Research on LLMs continues at a fast pace, and this paper out of Tsinghua, Penn State, Oregon State, and Princeton is an exciting one that is currently under review. Its key contribution is a multi-agent framework that allows LLM agents to cooperate in physical or simulated environments through organized communication, inspired by human organizational structures.
Let’s take a step back to understand why such problem solving is important. First, we humans do it all the time. We cooperate and divide tasks between us; just look at any company! Sometimes, we don’t cooperate enough or talk over each other, and things don’t get done (think partisan politics). Most of the time though, we’re able to get a lot done, and research has shown that humans learn to collectively problem-solve at a fairly young age. So how can we get LLMs to match these abilities?
It’s not that this paper is the first one to address the problem. But as the authors state in their abstract, “LLM agents tend to over-report and comply with any instruction, which may result in information redundancy and confusion in multi-agent cooperation.” They consequently frame two research questions:
What role do organizational structures play in multi-LLM-agent systems?
How can we optimize these organizational structures to support efficient multi-agent coordination?
An organizational structure is exactly what it sounds like: a hierarchical CEO-driven company is one example, while a collectively governed commune is another. Given that the former has proven to work efficiently in many scenarios, the authors explore “hierarchy” as their first organizational structure. They find that, with a designated leader, LLM agents work more efficiently and collaboratively. For a three-agent team, for example, imposing a leader improves efficiency by up to 30% with almost no extra communication cost (up to 3%), consistent with findings for human organizations. I thought this was quite remarkable, although we shouldn’t take it to mean that LLM agents will soon be forming their own companies and running the world (but maybe one day?).
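To make “imposing a leader” a bit more concrete: in an LLM-agent setting, the organizational structure is conveyed through the prompt each agent receives. The snippet below is a minimal sketch of what such prompts could look like; the wording is entirely my own illustration, not the paper's actual organization prompts.

```python
# Hypothetical illustration (not the paper's prompts): a hierarchy imposed
# purely through each agent's system prompt.
LEADERLESS_PROMPT = (
    "You are Agent {name}, one of three housekeeping robots in a virtual home. "
    "Communicate with your teammates as needed, then choose your next action."
)

LEADER_PROMPT = (
    "You are Agent {name}, the designated leader of a three-robot housekeeping "
    "team. Assign subtasks to your teammates each step and avoid redundant work."
)

FOLLOWER_PROMPT = (
    "You are Agent {name}. Agent {leader} is your designated leader. Report what "
    "you observe concisely and follow the leader's instructions."
)
```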
As practitioners know by now, prompting LLMs properly is always an interesting problem depending on the application. Therefore, the authors propose a novel approach for optimizing organizational structures using a dual-LLM setup, called Criticize-Reflect. This method iteratively improves team performance by analyzing previous outcomes and suggesting enhanced organizational prompts, leading to novel, more effective team structures. The figures in the paper nicely illustrate the architecture.
Here’s a more technical breakdown of Criticize-Reflect:
Criticize phase:
Inspired by the Actor-Critic method of reinforcement learning, the Criticize phase involves an LLM serving as a “critic” that evaluates the team's performance based on the actions taken and the communication that occurred during a task.
Input: This LLM critic receives the dialogue and action history of the team as its input.
Analysis: It analyzes this information to identify key behaviors, decision-making processes, and interaction patterns that contributed to the team's performance.
Output: The critic then generates a textual evaluation, highlighting strengths, weaknesses, and areas for improvement. This evaluation includes specific feedback on agents' behaviors and suggestions for how the organizational structure or communication strategies could be adjusted.
Reflect Phase:
Following the critique, the Reflect phase utilizes another LLM, termed the “coordinator,” which takes the critic's feedback into account to propose modifications to the team's organizational structure.
Input: The coordinator processes the critic's evaluation along with performance metrics (e.g., task completion time, communication overhead) from the recent episode.
Proposal Generation: Leveraging this analysis, the coordinator generates several potential new organization prompts that are designed to address the identified issues and optimize team performance.
Selection and Implementation: Among these proposals, the best is selected (either through an automated process or manual selection) to be tested in the next episode, guiding the team's organizational structure moving forward.
The Criticize-Reflect method is iterative, with each cycle intended to refine the team's organizational structure further based on empirical evidence of what works best. The intended outcome is that teams of LLM agents can dynamically evolve their organizational strategies to reduce communication costs, avoid inefficiencies, and improve collective task performance. The experimental success of the method shows the potential of combining AI with organizational theory to create efficient, adaptive, and intelligent teams capable of tackling complex tasks through cooperation. Maybe we humans also have a thing or two to learn here.
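To tie the two phases together, here is a minimal sketch of one Criticize-Reflect iteration, based purely on the description above. The call_llm helper, the function signatures, and the prompt wording are my own placeholders, not the authors' implementation.

```python
# Minimal sketch of one Criticize-Reflect iteration (my own placeholders,
# not the authors' code).

def call_llm(system_prompt: str, user_prompt: str) -> str:
    """Placeholder for a chat call to GPT-4, GPT-3.5-turbo, or Llama2-70B."""
    raise NotImplementedError

def criticize(dialogue_log: str, action_log: str) -> str:
    """Critic LLM: evaluate the team's behavior from its dialogue and actions."""
    return call_llm(
        system_prompt="You are a critic reviewing a team of embodied LLM agents.",
        user_prompt=(
            "Dialogue history:\n" + dialogue_log
            + "\n\nAction history:\n" + action_log
            + "\n\nIdentify strengths, weaknesses, and suggested improvements."
        ),
    )

def reflect(critique: str, metrics: dict, n_candidates: int = 3) -> list[str]:
    """Coordinator LLM: propose candidate organization prompts from the critique."""
    return [
        call_llm(
            system_prompt="You design organization prompts for a team of agents.",
            user_prompt=(
                f"Critique:\n{critique}\n\nMetrics: {metrics}\n\n"
                "Propose an improved organization prompt for the next episode."
            ),
        )
        for _ in range(n_candidates)
    ]

# One cycle: run an episode, criticize it, reflect on the critique, then pick
# the best candidate prompt (automatically or by hand) for the next episode.
```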
Let’s jump to the experiments. The authors use an environment called VirtualHome-Social, which they extend to support multi-LLM-agent communication and interaction. In this environment, agents are humanoid helpers doing housekeeping in a virtual home, with tasks such as “Prepare afternoon tea” and “Wash dishes.”
The simulated environment produces symbolic representations of household objects and their relationships. Agents can see only the objects inside open containers in the room they currently occupy, as well as other agents sharing that room, although they can move between rooms to explore. Communication is unrestricted by distance, so any agent can message any other regardless of location. At the beginning of each episode, agents are placed randomly throughout the environment with all containers initially closed, and the episode concludes once the assigned task has been completed. Team efficiency is assessed by counting the number of steps required to complete the task and by reporting the average amount of communication, measured in tokens, exchanged between agents at each step. Each experimental trial starts from a unique, randomly generated state, and results are averaged across trials with confidence intervals. The experiments use GPT-4, GPT-3.5-turbo, and Llama2-70B as the LLMs powering the agents.
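As a rough illustration of the two efficiency measures, here is how one might compute them from an episode log. The log format and the whitespace-based token count are my own assumptions; in practice token counts would come from each model's tokenizer.

```python
from statistics import mean

def team_efficiency(episode_messages: list[list[str]]) -> tuple[int, float]:
    """Hypothetical episode log: episode_messages[t] holds the messages sent at step t.

    Returns (steps needed to finish the task, average tokens communicated per step).
    """
    steps = len(episode_messages)
    tokens_per_step = [
        sum(len(msg.split()) for msg in step_msgs)  # crude whitespace proxy for tokens
        for step_msgs in episode_messages
    ]
    return steps, mean(tokens_per_step) if tokens_per_step else 0.0
```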
For the results, I provide some highlights below:
Teams with a designated leader showed significant improvements in efficiency compared to disorganized teams. For example, in a setup with three GPT-3.5-turbo agents, the presence of a designated leader led to a 9.76% improvement in performance, as evidenced by a reduction in task completion time (statistically significant with p < .05).
The increase in communication cost was minimal, suggesting that a hierarchical structure does not necessarily lead to more communication overhead. In some cases, there was even a slight decrease in communication costs.
Implementing a leadership election process, where agents could elect a leader dynamically, resulted in a further improvement in team efficiency. For instance, in a team of three GPT-4 agents, this process improved efficiency compared to having a consistent, predetermined leader, with the change being statistically significant (p < .05). For this result, however, there was a substantial increase in communication (i.e., token usage), similar to what might happen in real-world scenarios where a less hierarchical structure can lead to more extensive discussions.
When a human player replaced an agent to act as the team leader among GPT-4 agents, the human-led team outperformed those led by AI in both task completion time and communication efficiency. This was quantitatively supported by improved performance metrics in experiments involving three human players. So humans are still besting LLMs at these kinds of collective problem-solving tasks, at least for now.
The Criticize-Reflect method led to the discovery of new, more effective team structures. For example, for a team of three GPT-3.5-turbo agents, the new organizational structure (derived from the Criticize-Reflect process) improved the team’s task completion efficiency (statistically significant with p < .05) with a slight increase in communication cost.
The study also quantitatively analyzed the cooperative behaviors emerging from the structured communications, such as information sharing, leadership and assistance, and requests for guidance. This analysis was facilitated by a GPT-4-based classifier with an accuracy of 91.67% on labeled dialogue samples.
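The paper doesn't spell out how this classifier was prompted, but a plausible minimal sketch, using the current OpenAI Python SDK and the behavior categories listed above, might look like the following; the prompt wording is my assumption.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

BEHAVIOR_CATEGORIES = [
    "information sharing",
    "leadership and assistance",
    "request for guidance",
    "other",
]

def classify_message(message: str) -> str:
    """Label one agent message with a cooperative-behavior category via GPT-4."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You label dialogue between embodied LLM agents."},
            {
                "role": "user",
                "content": (
                    "Classify the following message into exactly one of these categories: "
                    + ", ".join(BEHAVIOR_CATEGORIES)
                    + f".\n\nMessage: {message}\n\nAnswer with the category name only."
                ),
            },
        ],
    )
    return response.choices[0].message.content.strip().lower()
```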
I’ll conclude by saying that the paper is a really interesting read because it doesn’t just tell us about LLMs. I think it also has something to say about the nature of organization itself. There’s a lot of social science work on this, but what I find exciting is that we can now use LLMs to better understand organizational structure. This paper is likely just the beginning of what I hope will be a long and productive line of work on building multi-agent LLM frameworks.