Musing 59: LiveBench: A Challenging, Contamination-Free LLM Benchmark
An interesting multi-institutional paper out of Abacus.AI, NYU, Nvidia, UMD, and USC
Today’s paper: LiveBench: A Challenging, Contamination-Free LLM Benchmark. White et al. 27 June 2024. https://arxiv.org/pdf/2406.19314
The advent of LLMs has had some significant consequences, but an interesting issue that people in the AI community are well aware of (and people outside tend not to be) is that traditional machine learning benchmark frameworks are no longer adequate for evaluating new models. These benchmarks, often published online, are included in the training data of modern LLMs. If an LLM encounters benchmark questions during its training, its performance on those benchmarks is artificially inflated, rendering many LLM benchmarks unreliable.
Evidence of test set contamination includes a noticeable drop in LLMs’ performance on Codeforces after the training cutoff date and a strong correlation between performance and the frequency of problem appearances on GitHub before the cutoff. Additionally, a recent hand-crafted variant of the GSM8K math dataset shows that several models have overfitted to this benchmark. To reduce dataset contamination, benchmarks using LLM or human prompting and judging have become more popular. However, these techniques have significant downsides. While LLM judges offer advantages like speed and the ability to evaluate open-ended questions, they are prone to mistakes and biases.
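To make the contamination check concrete, here is a minimal sketch (my own illustration, not code from the paper) of the pre-/post-cutoff comparison: group problems by whether they were released before or after a model's training cutoff and compare accuracy on the two groups. The dates, results, and cutoff below are invented.

```python
from datetime import date

# Hypothetical records of (problem release date, whether the model solved it).
results = [
    (date(2023, 6, 1), True),
    (date(2023, 8, 15), True),
    (date(2024, 2, 3), False),
    (date(2024, 5, 20), False),
]

TRAINING_CUTOFF = date(2023, 10, 1)  # assumed training cutoff of the model under test

def accuracy(records):
    """Fraction of records solved correctly; NaN if the group is empty."""
    return sum(ok for _, ok in records) / len(records) if records else float("nan")

pre = [r for r in results if r[0] <= TRAINING_CUTOFF]
post = [r for r in results if r[0] > TRAINING_CUTOFF]

print(f"accuracy on pre-cutoff problems:  {accuracy(pre):.2f}")
print(f"accuracy on post-cutoff problems: {accuracy(post):.2f}")
# A large gap is consistent with (though not proof of) test-set contamination.
```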
Today’s paper proposes LiveBench, claimed to be the first benchmark with three desiderata: (1) LiveBench contains frequently updated questions based on recent information sources; (2) LiveBench is scored automatically against objective ground truth, without using an LLM judge; and (3) LiveBench questions are drawn from a diverse set of six categories. The authors ensure (2) by only including questions that have an objectively correct answer. LiveBench questions are difficult: no current model achieves higher than 65% accuracy. Questions will be added and updated monthly, and the authors will release new tasks and harder versions of existing tasks over time so that LiveBench can continue to distinguish among LLMs as they improve.
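To illustrate desideratum (2), here is a rough sketch of ground-truth scoring without an LLM judge. The "Final answer:" extraction convention and the normalization rule are my own assumptions for illustration, not LiveBench's actual grading code, which differs per task.

```python
import re
from typing import Optional

def extract_final_answer(response: str) -> Optional[str]:
    """Pull the text after a 'Final answer:' marker (an illustrative convention)."""
    match = re.search(r"final answer\s*[:\-]\s*(.+)", response, re.IGNORECASE)
    return match.group(1).strip() if match else None

def score(response: str, ground_truth: str) -> float:
    """1.0 for a case- and whitespace-insensitive exact match, else 0.0."""
    answer = extract_final_answer(response)
    if answer is None:
        return 0.0
    normalize = lambda s: " ".join(s.lower().split())
    return 1.0 if normalize(answer) == normalize(ground_truth) else 0.0

print(score("Reasoning... Final answer: 42", "42"))  # 1.0
print(score("I think the result is 41.", "42"))      # 0.0
```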
LiveBench currently comprises 18 tasks across six categories: math, coding, reasoning, language, instruction following, and data analysis. Each task falls into one of two types:
1. Tasks using an information source for their questions, such as data analysis questions based on recent Kaggle datasets or fixing typos in recent arXiv abstracts.
2. Tasks that are more challenging or diverse versions of existing benchmark tasks, like those from Big-Bench Hard, IFEval, or AMPS.
The categories and tasks included in LiveBench are:
Math: Questions from high school math competitions over the past 12 months (AMC12, AIME, USAMO, IMO, SMC) and harder versions of AMPS questions.
Coding: Code generation questions from Leetcode and AtCoder (via LiveCodeBench), along with a novel code completion task.
Reasoning: A more difficult version of Web of Lies from Big-Bench Hard, plus Zebra Puzzles.
Language Comprehension: Connections word puzzles, a typo-fixing task, and a movie synopsis unscrambling task for recent movies on IMDb and Wikipedia.
Instruction Following: Four tasks to paraphrase, simplify, summarize, or generate stories about recent news articles from The Guardian, subject to specific instructions such as word limits or incorporating particular elements in the response.
Data Analysis: Three tasks using recent datasets from Kaggle and Socrata, including table reformatting (among JSON, JSONL, Markdown, CSV, TSV, and HTML), predicting which columns can join two tables, and predicting the correct type annotation of a data column.
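For a flavor of the data analysis category, the sketch below shows the kind of deterministic transformation the table-reformatting task targets (CSV to JSONL here); because the target conversion is fully determined, grading reduces to comparing the model's output against ground truth. The table contents are made up.

```python
import csv
import io
import json

csv_table = """city,population,country
Lagos,15388000,Nigeria
Osaka,19165000,Japan
"""

def csv_to_jsonl(text: str) -> str:
    """Convert a CSV string to JSON Lines: one JSON object per row."""
    rows = csv.DictReader(io.StringIO(text))
    return "\n".join(json.dumps(row) for row in rows)

expected = csv_to_jsonl(csv_table)
model_output = expected  # pretend the model reformatted the table perfectly

# Grading reduces to a string (or parsed-structure) comparison with ground truth.
print(model_output)
print("correct:", model_output.strip() == expected.strip())
```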
LiveBench evaluates dozens of models, including proprietary models and open-source models with sizes ranging from 0.5B to 8x22B. All questions, code, and model answers are released, inviting community engagement and collaboration. The codebase is available at https://github.com/livebench/livebench, and the leaderboard can be accessed at https://livebench.ai.
In the full paper, all of the tasks above are described in depth. To demonstrate the benchmark’s utility, the authors evaluate 49 LLMs in total: a mix of top proprietary models, large open-source models, and small open-source models. They find that claude-3-5-sonnet-20240620 performs the best across all six categories and is the best model overall, 6% better than all other models; it substantially outperforms all other models in the reasoning and coding categories. The next best model is gpt-4o-2024-05-13, with gpt-4-turbo-2024-04-09 and claude-3-opus-20240229 not far behind. Although gpt-4-turbo-2024-04-09 is third overall, it is on par with or better than gpt-4o-2024-05-13 in every category except language.
To further validate LiveBench, the authors compare it to two prominent benchmarks, Chatbot Arena and Arena-Hard. They find generally similar trends, yet some models are noticeably stronger on one benchmark than the other. For example, gpt-4-0125-preview and gpt-4-turbo-2024-04-09 perform substantially better on Arena-Hard than on LiveBench, likely due to the known bias that comes from using GPT-4 itself as the LLM judge.
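One simple way to quantify "generally similar trends" across two leaderboards is a rank correlation; the sketch below uses invented scores purely to show the mechanics, not numbers from the paper.

```python
from scipy.stats import spearmanr

# Hypothetical scores for the same models on two benchmarks (values are made up).
livebench = {"model_a": 61.2, "model_b": 54.8, "model_c": 48.3, "model_d": 40.1}
arena_hard = {"model_a": 79.5, "model_b": 82.1, "model_c": 60.4, "model_d": 55.0}

models = sorted(livebench)  # align the two score lists on the same model order
rho, p_value = spearmanr([livebench[m] for m in models],
                         [arena_hard[m] for m in models])

print(f"Spearman rank correlation: {rho:.2f} (p = {p_value:.2f})")
# A high rho means the two benchmarks rank models similarly; large per-model gaps
# (e.g., model_b here) are the kind of discrepancy discussed above.
```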
In closing, the authors acknowledge that, while they attempted to make all tasks and categories fair for all models, some bias remains because certain LLM families favor certain prompt formats. They plan to update the prompts (prepended and appended to each question) as new prompting strategies are developed, and potentially to add non-English tasks in the future.
I am personally excited by this benchmark, as I believe it is among the fairest benchmarks currently available for evaluating the latest LLMs, while providing reasonable assurance that a model does not ‘contain’ the benchmark in its training corpus. When the next big model comes out (GPT-5?), it will be interesting to see how much it improves on the current crop of LLMs.