Musing 100: What did I learn about LLM research from the previous 99 musings?
Yours truly, only.
As we enter a brand new year, I want to start by taking a look back at when I first started this substack in March 2024. At the time, I did not have any goals for it (and really, still don’t) except to use it as a professional journaling exercise. LLMs had already been advancing rapidly for at least a year, and a lot of the really exciting progress was happening on the preprint server arXiv and, for soundbites, on social media, blogs, and in the press. That’s not to say peer-reviewed conferences and journals had lost their glamor in the AI community and beyond (on the contrary, especially for the really prestigious venues, like Nature and NeurIPS), but whether we like to think of it as a feature or a bug, the process of peer review is not known for its speed.
On the other hand, you can get something onto a preprint server in just a couple of days, with only minor moderation, but with no guarantee of quality. So when I started in March, I knew that my main task was curatorial: on any given day that I chose to write a substack post, I would browse through the list of AI papers being published on arXiv, see what caught my eye, then read and muse about it. I wasn’t looking to create some kind of meta-narrative, and I gave myself permission to write about papers that were not in my specific area of expertise. So there were musings not just about knowledge graphs, enterprise AI applications, and machine common sense (some of my core current interests) but also about robotics and AI ethics. It was fun covering those, and I learned a lot just by writing about them.
Now arises an interesting question: looking back at the 99 musings, and considering that hindsight is meant to be 20/20, is there a pattern that emerges, even from the titles and subheadings of the papers that I ended up selecting? To answer this question, I went through the musings and engaged in a ‘sense-making’ exercise. I resisted the urge to do a wordcloud analysis or some other data-driven undertaking; after all, this is still a musing. So I won’t make this too long, but I’ll point out four interesting things that communicate something about how far LLMs have come in just these last nine months, and what the state of global research on LLMs is shaping up to be:
First, a substantial number of the papers covered in these musings involved industry, which supports an almost-too-obvious conclusion: companies are going all in on LLMs, even more so than academics, some of whom maintain a healthy skepticism about overly broad (and frankly, magical-sounding) claims about what LLMs can and can’t do. Looking through the musings, I found papers I covered that came out of IBM (the very first musing), this musing out of Apple, this one out of Meta, a Microsoft musing, a Google Research musing, this one out of Salesforce…even Thomson Reuters, Intel, and Ant Group, and the list goes on. There are too many to link here. It seems that I either had a preference for, or naturally gravitated toward, papers that had industry collaborators. I also suspect there is a correlation between such papers and catchy titles and content.
Second (and again, not really a surprise), collaboration and globalism are both at a high point when it comes to LLM research. Several of the pieces I covered were perspectives rather than formal research papers, but they are well worth a read and contain many interesting, forward-thinking ideas. For example, this musing on how advanced AI systems will impact democracy had authors hailing from OpenAI, Anthropic, Oxford, Harvard, and a whole range of other prestigious institutions.
Third, as the year went on, there was less focus on mechanistic tasks like prompt engineering, and more on AGI-oriented tasks like adult-level reasoning, multi-modal problem solving, theory of mind…and so on. Some musings that covered research on these are #46 (out of Google Research, DeepMind, University of Oxford, and Johns Hopkins) and #81 (out of George Mason and Tencent AI). This list is still evolving, and no doubt we will see much more of it now that reasoning-first models like OpenAI’s o1 are available.
Fourth, and perhaps most relevant for many of you, some truly amazing applications and possibilities are now on the horizon because of these LLMs. I was especially excited about this musing, which covered a preprint on a health language model out of Google. Other applications range from automated data science and software coding to military uses.
This is less about the 99 musings themselves than my interpretation of them, but I do think they support Sam Altman’s claim that we will achieve AGI (in whatever reasonable way we choose to define it) a lot sooner than was predicted, but that it will matter less. I, and many others, have been truly amazed by the kinds of tasks LLMs can now do. Memories are short in technology, but pre-LLMs, we were very far from achieving some of these milestones, or so we thought. When ChatGPT first came out, I personally know some people who were in denial. I would gently remind them that this was only the ‘first’ commercial iteration of such a technology, and to compare the kinds of computers they have today with the first desktops they played on as children. Two years later, ChatGPT is certainly a lot better. It is also facing competition. Before we start thinking this is the end, a reminder is due: we’re still only two years in!
An interesting point about many of the papers I covered is that GenAI was used in writing portions of at least a few of them, and it wasn’t all that subtle. This is not a criticism: writing is hard work, and it’s not as if the LLM is coming up with the content or doing the research. I see it as an advanced spell-check when writing papers: on the one hand, you don’t want it to ‘replace’ your voice and style of writing, nor do you want it to have the final word on what you put in print; on the other hand, everyone is already using it to do some of the gruntwork (converting bullet points into paragraphs, or adding more context, for instance) so they can spend more time developing the content that people are actually interested in. I certainly used it in moderation in these musings, and I also quoted many paragraphs verbatim from the papers themselves to explain concepts in the authors’ own words, rather than paraphrase and botch them.
What didn’t I find that I had hoped to see more of? I was disappointed not to see as much work as I would have liked on smaller language models (one good exception is this musing). Ultimately, this is not surprising: we all care about building models that have the best performance, and so far, bigger has been better. That doesn’t mean we are training or using a 150-billion-parameter model in an optimal way, or that there isn’t considerable room for improvement, but that’s a function of our evolving state of knowledge. No doubt, what we can do today with 300B parameters we may very well be able to do tomorrow with 150B parameters. But it also goes the other way, unless we hit a real wall: a model with 300B parameters today will likely be able to do much more tomorrow.
Scaling laws so far have been like Moore’s law: their demise has been predicted time and again, but they’re still going strong. Unlike Moore’s law, scaling laws for LLMs are still in their infancy. So on the one hand, I’m disappointed we didn’t see more on smaller LMs; on the other hand, I continue to be fascinated by just how far scaling will go. Will it eventually lead us to the promised (or dreaded) singularity?
And on that note, we will be back with our regular substacks on arXiv preprints soon. I’ll end this hundredth musing by wishing you a happy new year, and hoping that you continue to stay excited about the progress of AI (and stay with me as I muse on it). There is no denying that these are interesting times where technology is concerned.