Musing 98: Benchmarking Large Language Models in Behavioral Economics Games
Interesting benchmarking paper out of UMich, Stanford and MobLab
Today’s paper: How Different AI Chatbots Behave? Benchmarking Large Language Models in Behavioral Economics Games. Xie et al. 16 Dec. 2024. https://arxiv.org/pdf/2412.12362
Even as LLMs continue to be deployed in real-world applications, understanding their behavioral patterns and decision-making strategies is critical. Such insights not only help optimize their performance in specific applications, but also enable better assessment of their reliability in contexts involving significant responsibilities.
Behavioral economics games are one way to do so, and that’s what today’s paper is about. Behavioral economics games are structured experiments designed to observe and analyze how individuals make decisions, particularly in situations that involve risk, reward, cooperation, or competition. These games, often rooted in game theory and experimental economics, provide insights into human behavior that challenge the assumptions of traditional economic models, such as the idea that individuals are always rational and utility-maximizing.
A classic example of such a game, which many are aware of, is the prisoner’s dilemma. There are two players, who can either cooperate or defect. Mutual cooperation yields moderate rewards, but if one player defects while the other cooperates, the defector gets a larger reward, and the cooperator gets nothing. If both defect, both receive minimal rewards. This game demonstrates how trust and self-interest influence cooperation, reflecting real-world scenarios like trade negotiations or public goods dilemmas. It continues to be well studied and can be used to derive some surprisingly complex properties, including the value of altruism.
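To make the payoffs concrete, here is a minimal sketch of the game in Python, with illustrative payoff values of my own (not ones taken from the paper): mutual cooperation beats mutual defection, but unilateral defection beats everything.

```python
# Illustrative prisoner's dilemma payoff matrix (values are my own, not the paper's).
# Each entry maps (player1_move, player2_move) -> (player1_payoff, player2_payoff).
PAYOFFS = {
    ("cooperate", "cooperate"): (3, 3),  # mutual cooperation: moderate reward for both
    ("cooperate", "defect"):    (0, 5),  # cooperator gets nothing, defector gets the most
    ("defect",    "cooperate"): (5, 0),
    ("defect",    "defect"):    (1, 1),  # mutual defection: minimal reward for both
}

def play(move1: str, move2: str) -> tuple[int, int]:
    """Return the payoffs of a single round."""
    return PAYOFFS[(move1, move2)]

print(play("defect", "cooperate"))  # (5, 0)
```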
Figure 1 below has small font, but it illustrates the distributions of AI choices across the six games. Overall, the distributions of AI chatbots are notably more concentrated compared to human distributions (the top row), capturing only specific modes of human behavior. Additionally, different AI chatbots exhibit distinctly varied behavioral patterns (last column), reflecting their unique orientations across multiple behavioral dimensions.
The authors consider several AI chatbot families as shown in Table 1 below, so there’s a lot to be learned from their results. They employ six classic behavioral economics games to evaluate multiple dimensions of AI behavior, including altruism, fairness, trust, risk aversion, and cooperation. These games include Dictator, Ultimatum, Trust, Public Goods, Bomb Risk, and Prisoner’s Dilemma. I’ve already intuitively described the last of these, but here are brief descriptions of the others:
The ultimatum game involves two players deciding how to split a sum of money. The first player proposes a division, and the second player can either accept or reject it. If the offer is rejected, neither player gets anything. Rational economic theory predicts that the second player will accept any non-zero offer, but experiments show people often reject "unfair" splits, even at a personal cost, highlighting preferences for fairness over pure self-interest.
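As a quick sketch of the payoff rule, assuming a $100 pot (illustrative, not necessarily the stakes used in the paper):

```python
# Ultimatum game payoff rule, assuming a $100 pot (illustrative, not the paper's stakes).
# The proposer offers a split; if the responder rejects, both get nothing.
def ultimatum_payoffs(pot: int, offer: int, accepted: bool) -> tuple[int, int]:
    """Return (proposer_payoff, responder_payoff)."""
    if not accepted:
        return 0, 0
    return pot - offer, offer

print(ultimatum_payoffs(100, 40, accepted=True))   # (60, 40)
print(ultimatum_payoffs(100, 5, accepted=False))   # (0, 0): rejecting an "unfair" offer costs the responder $5
```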
The dictator game is a simplified version of the Ultimatum Game where the first player decides how to split the money, and the second player has no choice but to accept. This game measures altruism, as the "dictator" has no incentive to share but often does so, reflecting concerns for social norms or moral preferences.
In the public goods game, participants contribute to a communal "pot," which is multiplied and redistributed. While contributing benefits the group, individuals face the temptation to "free ride" by contributing nothing while still benefiting. The game reveals the tension between collective welfare and individual incentives. (I’ve often played this game with undergrad students visiting my group in the summer, to show them how even simple rules can lead to complex and unpredictable outcomes)
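A standard linear version of the game makes the free-rider temptation explicit. Below is a minimal sketch assuming an endowment of 20 tokens per player and a pot multiplier of 1.6 (illustrative parameters, not necessarily the paper’s):

```python
# Linear public goods game: each player keeps what they don't contribute and
# receives an equal share of the multiplied pot. Parameters are illustrative.
def public_goods_payoffs(contributions: list[float],
                         endowment: float = 20.0,
                         multiplier: float = 1.6) -> list[float]:
    pot_share = multiplier * sum(contributions) / len(contributions)
    return [endowment - c + pot_share for c in contributions]

print(public_goods_payoffs([20, 20, 20, 20]))  # everyone cooperates: [32.0, 32.0, 32.0, 32.0]
print(public_goods_payoffs([0, 20, 20, 20]))   # one free rider earns 44.0, the contributors only 24.0
```

The group as a whole does best when everyone contributes, but each individual does better by contributing nothing, which is exactly the tension the game is designed to expose.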
In the trust game, Player A gives an amount of money to Player B, who receives a multiplied version of the amount and can return a portion to Player A. The game explores trust and reciprocity, examining whether people act generously when they expect trust to be rewarded.
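A sketch of the payoff rule, assuming a $10 endowment for the investor and the usual 3x multiplier on the transferred amount (illustrative values; the paper’s exact stakes may differ):

```python
# Trust game: Player A (investor) sends money, Player B (trustee) receives it
# multiplied and decides how much to send back. Parameters are illustrative.
def trust_game_payoffs(endowment: float, invested: float, returned: float,
                       multiplier: float = 3.0) -> tuple[float, float]:
    """Return (investor_payoff, trustee_payoff)."""
    received = multiplier * invested
    assert 0 <= invested <= endowment and 0 <= returned <= received
    return endowment - invested + returned, received - returned

print(trust_game_payoffs(10.0, invested=10.0, returned=15.0))  # (15.0, 15.0): trust rewarded
print(trust_game_payoffs(10.0, invested=10.0, returned=0.0))   # (0.0, 30.0): trust betrayed
```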
In the bomb risk game, participants are shown a grid of boxes (typically 100 boxes). One of these boxes contains a “bomb” that, if selected, nullifies their earnings. Players decide how many boxes to collect (e.g., 10, 20, or 50). Each box collected increases their reward, but also increases the risk of selecting the box with the bomb. If the bomb is not in one of the selected boxes, the participant keeps the reward based on the number of boxes collected (e.g., $1 per box). If the bomb is in one of the selected boxes, the participant earns nothing. The bomb’s location is randomly assigned and remains unknown to participants. The game is an excellent measure of a player’s risk tolerance.
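For a risk-neutral player the optimal number of boxes is easy to compute. Assuming 100 boxes, $1 per safe box, and a single uniformly placed bomb (a common setup; the paper’s parameters may differ), the expected payoff peaks at 50 boxes, so collecting fewer signals risk aversion and collecting more signals risk seeking:

```python
# Expected payoff in the bomb risk game: reward per box times the probability
# that the bomb is not among the collected boxes. Parameters are illustrative.
def expected_payoff(boxes_collected: int, n_boxes: int = 100, value_per_box: float = 1.0) -> float:
    p_safe = 1 - boxes_collected / n_boxes
    return p_safe * boxes_collected * value_per_box

for k in (10, 50, 90):
    print(k, round(expected_payoff(k), 2))  # 10 -> 9.0, 50 -> 25.0, 90 -> 9.0
```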
On to results. Using the collected behavior distributions of AI chatbots and the excerpted human behaviors, the authors conduct Turing tests following the methodology outlined in a previous paper. In each round of the test, one human action and one action from the AI behavior distribution are sampled independently; the two samples are then compared based on their probabilities within the human distribution.
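My rough reading of that procedure, sketched below with a hypothetical turing_test_win_rate helper: the chatbot “wins” a round when its sampled action is at least as probable under the human distribution as the sampled human action. The authors’ exact scoring rule is in the paper they cite; this is only an illustration of the idea.

```python
# Sketch of the Turing-test comparison as I read it (not the authors' code).
import random
from collections import Counter

def turing_test_win_rate(human_actions, ai_actions, rounds=10_000, seed=0):
    rng = random.Random(seed)
    counts, n = Counter(human_actions), len(human_actions)
    human_prob = lambda a: counts[a] / n              # empirical human distribution
    wins = 0.0
    for _ in range(rounds):
        h, a = rng.choice(human_actions), rng.choice(ai_actions)
        if human_prob(a) > human_prob(h):
            wins += 1                                  # AI action looks more typically human
        elif human_prob(a) == human_prob(h):
            wins += 0.5                                # ties split evenly (my assumption)
    return wins / rounds
```

A win rate near 50% would mean the chatbot’s actions are indistinguishable from a human’s under this test.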
Figure 2 below shows how similar the chatbots are to humans following this procedure. Overall, all tested AI chatbots demonstrate a remarkable ability to pass the Turing test, with Meta Llama 3.1 405B achieving the highest winning rate against humans at 46.4%. In certain games, though, the chatbots exhibit significant challenges in replicating human behavior. For instance, in the Trust Game - Investor role (Fig. 2e), AI chatbots tend to invest conservatively, whereas a substantial fraction of human players opt to invest their entire amount (Fig. 1d). Differences are also observed in the Prisoner’s Dilemma.
However, while the Turing test is a valuable method for evaluating an AI’s ability to act like a single human player, it has inherent limitations in capturing the complete spectrum of the behavior distribution. To overcome these limitations, the authors also try a complementary approach: a distribution similarity test that assesses whether AI chatbots can accurately represent the behavior distribution of a human population. Table 3 below presents the pairwise dissimilarities of behavior distributions, measured using the Wasserstein distance. Smaller distances indicate greater similarity between two distributions, whether comparing two chatbots or a chatbot with the human population.
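For intuition, the Wasserstein distance between two sets of sampled actions can be computed directly with SciPy; the snippet below uses made-up Dictator-game offers, not the paper’s data, and the authors’ preprocessing may differ.

```python
# Distribution similarity via the 1-D Wasserstein distance (illustrative data).
from scipy.stats import wasserstein_distance

human_offers   = [50, 50, 40, 30, 0, 20, 50, 10]   # hypothetical human sample
chatbot_offers = [50, 50, 50, 40, 50, 50, 40, 50]  # hypothetical, more concentrated chatbot sample
print(wasserstein_distance(human_offers, chatbot_offers))  # smaller = more similar
```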
According to the table above, among the AI chatbots, gpt-3.5-turbo-0613 demonstrates the highest similarity to the human population, likely due to its ability to produce relatively diverse choices. However, despite this similarity, a significant gap remains between the human behavior distribution and AI-generated actions, with no chatbot achieving a distribution that closely mirrors human behaviors.
Finally, to uncover the intrinsic objectives underlying the behaviors of AI chatbots, the authors perform analyses to identify and characterize their payoff preferences. The objective function of AI chatbots is quantitatively estimated by assessing the degree to which their behaviors align with optimization goals. The details are complicated (and provided in the paper) but the figure below tells you what you need to know. It displays the optimization errors for various values of 𝑏 for human players and each AI chatbot, computed under the assumption that the AI chatbots are interacting with a random human player. Here, 𝑏 weights the player’s own payoff against the partner’s: 𝑏 = 1 corresponds to pure self-interest, 𝑏 = 0 to maximizing the partner’s payoff, and 𝑏 = 0.5 to weighting the two equally. A lower optimization error indicates greater optimization efficiency, suggesting that the model is more likely to be optimizing for that particular objective.
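A hedged sketch of what such an estimate might look like: fix a weighted objective U = 𝑏·(own payoff) + (1 − 𝑏)·(partner’s payoff) and measure how far the observed actions fall short of the best achievable value of U. The payoff_fn helper and the simple averaging below are my own simplifications; the paper’s estimator is more involved.

```python
# Optimization error for a candidate objective weight b (my simplification of the idea).
def optimization_error(observed_actions, action_space, payoff_fn, b):
    """payoff_fn(action) -> (own_payoff, partner_payoff); hypothetical helper."""
    def utility(a):
        own, partner = payoff_fn(a)
        return b * own + (1 - b) * partner
    best = max(utility(a) for a in action_space)
    return sum(best - utility(a) for a in observed_actions) / len(observed_actions)

# Example: Dictator game with a $100 pot, action = amount kept by the dictator.
dictator = lambda kept: (kept, 100 - kept)
actions = range(101)
print(optimization_error([50, 50, 60], actions, dictator, b=0.5))  # 0.0: every split is optimal at b = 0.5 here
print(optimization_error([50, 50, 60], actions, dictator, b=1.0))  # ~46.7: far from keeping everything
```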
What we find is that AI chatbots place a stronger emphasis on fairness, as indicated by the lowest optimization error consistently occurring at 𝑏 = 0.5. In contrast, human players exhibit a slight preference for selfishness, with their lowest optimization error occurring at 𝑏 = 0.6. Additionally, AI chatbots demonstrate significantly higher optimization efficiency than humans when maximizing the partner’s payoff (𝑏 = 0), but they exhibit lower efficiencies when optimizing their own payoff (𝑏 = 1). Finally, different AI models exhibit varying levels of optimization efficiency.
This study benchmarked LLM-based AI chatbots across a series of behavioral economics games. The analyses revealed the following common and distinct behavioral patterns of the chatbots:
(1) All tested chatbots capture only specific modes of human behavior, leading to highly concentrated decision distributions;
(2) Although flagship chatbots demonstrate a notable probability of passing the Turing test, none of them produces a behavior distribution that closely matches that of the human population;
(3) Compared to humans, AI chatbots place greater emphasis on maximizing fairness in their payoff preferences;
(4) AI chatbots may exhibit inconsistencies in their payoff preferences across different games;
(5) Different AI chatbots exhibit distinct behavioral patterns in games, which can be further distinguished through the authors’ analyses.
In closing this musing, these findings collectively highlight the effectiveness of the authors’ behavioral benchmark in profiling and differentiating AI chatbots. It could contribute to a deeper understanding of AI behaviors and serve as a foundation for future studies in AI behavioral science, a new area of research that has largely become possible because of LLMs. The observed inconsistencies in AI behaviors across games underscore the importance of developing generalizable preferences and objectives for AI systems that can adapt effectively across various scenarios. Also, the discrepancies between Turing test results and distribution dissimilarities highlight the need for further alignment objectives that enable LLMs to better represent the diversity of human behaviors.