Musing 2: MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
A multimodal LLM paper straight out of Apple!
I am super excited to be covering this paper, posted earlier this week on arXiv. It just came out of Apple, which we know (like pretty much everyone else out there) has also leaped onto the GenAI bandwagon. It’s obvious to me and many industry watchers that they’ve been working on GenAI for a while now, and they’ve really put a lot of heart into the paper. No offense to Microsoft, but as a scientist, I see the differences between this paper and the now-cringey Microsoft paper from last year on GPT-4 having AGI, starting with the title itself. This is not a paper using a few experiments to make the case that we now have computers as bright, flexible, or robust as humans. Maybe that day’s not far off, but we’re objectively not there yet, as many, many people pointed out immediately after that paper was first put on arXiv (I especially recommend this post from Gary Marcus; it even quotes from the Bible).
What we have here is some very scientific analysis and a lot of open thoughts on what works and doesn’t when pre-training multimodal LLMs. I should note at the outset that, at least right now, only a company with the resources to pre-train such models can actually write a paper like this one, so it’s an important one to read. Ever since the early days of transformer neural models like BERT, which now seems ancient, we’ve known that pre-training language models is an expensive proposition. What makes transformers great is their ability to be fine-tuned and, more recently for generative models, to do in-context learning (basically, a fancy type of prompt engineering that mimics localized or ‘contextual’ learning from what’s provided in the prompt itself).
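To make the in-context learning idea a bit more concrete, here’s a minimal toy sketch (mine, not the paper’s): a handful of labeled examples go directly into the prompt, and the model is expected to pick up the task from context alone, with no weight updates. The task, the examples, and the build_prompt helper are all made up for illustration.

```python
# Toy illustration of few-shot in-context learning: the "learning" happens
# entirely inside the prompt, with no gradient updates to the model.

few_shot_examples = [
    ("The battery died after two hours.", "negative"),
    ("Setup took thirty seconds and it just worked.", "positive"),
]

def build_prompt(examples, query):
    """Assemble a few-shot prompt: labeled examples first, then the new input."""
    lines = ["Classify the sentiment of each review as positive or negative.", ""]
    for text, label in examples:
        lines.append(f"Review: {text}\nSentiment: {label}\n")
    lines.append(f"Review: {query}\nSentiment:")
    return "\n".join(lines)

prompt = build_prompt(few_shot_examples, "The screen scratched on day one.")
print(prompt)
```

In practice you’d send the assembled prompt to whatever LLM you have access to; the point is simply that the “learning” lives entirely in the context window.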
So let’s just jump right into it. Unlike the last paper I covered, this one comes in at 41 pages, but it’s space well spent. Again, I want to disclaim here that this post is no substitute for slogging through the paper in all its experimental detail, including its appendices, an impressive 16 pages’ worth. But I’ll try to give you the best bits of it, in my subjective opinion.
First of all, we should begin with the question of just what is a multimodal LLM (MLLM). I’ll quote from the paper itself: MLLMs are “large-scale foundation models that consume image and text data and produce text. After the rise of LLMs, MLLMs are emerging as the next frontier in foundation models.”
ChatGPT is, of course, our most famous LLM right now. But I’ve tried using it for generating and understanding images and have been frustrated by what I found, even with DALL·E integrated into it. It usually gives me something that looks like it understands what I might be asking for, but parts of it are derivative (or even ugly), any text in the image comes out warped, you name it. So it remains a hard problem.
Apple’s aiming to contribute to this line of work by building MM1, “a family of multimodal models up to 30B parameters” and they show that, due to large-scale pre-training, MM1 “enjoys appealing properties such as enhanced in-context learning, and multi-image reasoning, enabling few-shot chain-of-thought prompting.”
It’s not my practice to lift text or figures from papers or preprints (even in blogs like these) unless I can keep them short and put double quotes around them, but it’s useful to present a figure or two here to show what they believe these appealing features amount to on real data. The figure below is Fig. 11 from their paper, so full credit to them. I picked this one as it’s not one you’re likely to see unless you read the full paper, and that is a very technical endeavor.
In perhaps a rare move for Apple (historically speaking, given the almost mystical culture of secrecy it’s tended to have), the authors aim to “distill principles and lessons of how to build such models that might outlive concrete component implementations […] we hope are of use to the community.”
Now on to the technical contributions. What this paper focuses on is understanding the impact of architecture components, data choices, and training strategies on building effective MLLMs. Specifically, they provide:
Comprehensive Ablation Studies: The authors conduct detailed ablation studies on different components of the image encoder, the vision-language connector, and different types of pre-training data. This approach helps in identifying the most impactful factors in multimodal pre-training.
Data Strategy Insights: The authors describe the importance of a balanced mix of image-caption, interleaved image-text, and text-only data for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks. As the saying goes, don’t put all your eggs in one basket. The study confirms that different data types contribute uniquely to zero-shot and few-shot learning capabilities.
Importance of Architecture: They show that the image encoder, along with image resolution and the number of image tokens, significantly impacts model performance. Interestingly, the design of the vision-language connector plays a comparatively negligible role. I’m not a computer vision person but I find that to be very interesting.
MM1 Model Development: Using their ablation studies, the authors develop a family of models, up to 30B parameters, that they call MM1, and demonstrate SOTA pre-training metrics and competitive performance after supervised fine-tuning across established multimodal benchmarks.
Enhanced In-context Learning and Reasoning: Because of large-scale multimodal pre-training, MM1 is shown to exhibit enhanced in-context learning and multi-image reasoning capabilities. This includes few-shot chain-of-thought prompting, further evidence for the model's advanced reasoning skills.
Instruction Tuning and Scaling: The authors also consider instruction tuning on top of the pre-trained model and explore model scaling strategies, including the use of Mixture-of-Experts (MoE) to increase model capacity efficiently (a toy sketch of MoE routing follows this list).
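Since the paper’s MoE details are beyond the scope of this post, here is a deliberately tiny routing sketch of the general Mixture-of-Experts idea, not MM1’s actual recipe: a small gating network scores a handful of experts per token and only the top-k of them run, so parameter count can grow much faster than per-token compute. All of the sizes and weights below are made up.

```python
import numpy as np

# Minimal sketch of Mixture-of-Experts routing (not MM1's implementation):
# a gating network scores E experts per token, and only the top-k experts run.

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 4, 2
tokens = rng.normal(size=(8, d_model))            # 8 toy token embeddings

W_gate = rng.normal(size=(d_model, n_experts))     # gating weights
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_layer(x):
    logits = x @ W_gate                            # (tokens, experts) gate scores
    out = np.zeros_like(x)
    for i, tok in enumerate(x):
        top = np.argsort(logits[i])[-top_k:]       # indices of the k best experts
        weights = np.exp(logits[i][top])
        weights /= weights.sum()                   # softmax over the chosen experts
        for w, e in zip(weights, top):
            out[i] += w * (tok @ experts[e])       # weighted sum of expert outputs
    return out

print(moe_layer(tokens).shape)                     # (8, 16): same shape, sparser compute
```

The design point worth noticing is that each token only touches k of the E expert weight matrices, which is what lets MoE models add capacity without a matching increase in per-token FLOPs.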
As in the previous paper I covered, I’ll state again that I’m an unabashed empiricist. All of these ideas, which sound good in theory, aren’t worth much if they can’t be backed up by actual experimental data. I’ll summarize some of the quantitative highlights here. Some set important state-of-the-art milestones, while others are more in the spirit of a scientific study:
We’ll begin with the most obvious: an apples-to-apples comparison (no pun intended) of MM1’s variants against models with a similar number of parameters. The authors found that MM1-3B-Chat and MM1-7B-Chat outperformed all listed models of the same size, setting a new state of the art for these model sizes.
Increasing image resolution from 224 to 336 pixels resulted in approximately a 3% increase in performance metrics across all architectures tested. Doubling the model size from ViT-L to ViT-H led to a modest performance increase, usually less than 1%. Adding synthetic caption data (VeCap-300M) yielded more than a 1% boost in few-shot scenarios.
Experimentally, the authors showed the importance of interleaved data for few-shot and text-only performance, while captioning data primarily lifted zero-shot performance. A mix of 45% interleaved image-text documents, 45% image-text pair documents, and 10% text-only documents was identified as optimal for maintaining both zero- and few-shot performance. An easy way to remember this is a rough 5:5:1 rule (caption : interleaved : text data); a small sketch of what sampling such a mix looks like appears after this list. A picture may not quite be worth a thousand words, depending on how you interpret the above, but it is undoubtedly worth plenty.
Supporting higher image resolutions (up to 1344 x 1344 pixels) led to a 15% relative increase in average performance of the supervised fine-tuning (SFT) evaluation metrics compared to a baseline model with an image resolution of only 336 pixels. So image quality does matter, and by quite a lot.
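To make the data-mix point a bit more tangible, here is a toy sampler that draws training examples according to the 45/45/10 split the paper reports as optimal. The source names and the sampler itself are just for illustration; the paper doesn’t prescribe this code.

```python
import random

# Toy sketch of sampling pre-training examples with the 45/45/10 mix
# (interleaved image-text, image-caption pairs, text-only documents).

MIX = {"interleaved": 0.45, "captions": 0.45, "text_only": 0.10}

def sample_source(rng=random):
    """Pick which data source the next training example comes from."""
    r = rng.random()
    cumulative = 0.0
    for name, weight in MIX.items():
        cumulative += weight
        if r < cumulative:
            return name
    return "text_only"   # guard against floating-point rounding

counts = {name: 0 for name in MIX}
for _ in range(10_000):
    counts[sample_source()] += 1
print(counts)  # roughly 4500 / 4500 / 1000
```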
For all the coffee lovers out there, I can’t help but close with this image:
Not bad at all. Of course, since I can’t play with the model, I can’t really ask it whether it’s possible for me to make a 9 oz cup of coffee with this machine, or how I would do that. And this is probably not the only MLLM out there that can understand this type of question. But what it does show is that, barring anything else we might learn, we should be excited about what Apple has coming up in GenAI.
Final shout out: if you like what you’re reading, do subscribe and tell your friends. This substack will always be free, and it will always cover research literature. I believe science should be accessible to all, and I know everyone’s too busy to be slogging through lots of papers. We do that for a living, so this is my small way of bringing a little of that to you. So do subscribe!