Musing 122: Small Language Models are the Future of Agentic AI
Interesting Agentic AI take from Nvidia and Georgia Tech
Today’s paper: Small Language Models are the Future of Agentic AI. Belcak et al. 2 June 2025. https://arxiv.org/pdf/2506.02153
Large language models (LLMs) are often praised for exhibiting near-human performance on a wide range of tasks and valued for their ability to hold a general conversation. The rise of agentic AI systems is, however, ushering in a mass of applications in which language models perform a small number of specialized tasks repetitively and with little variation. Today’s paper lays out the position that small language models (SLMs) are sufficiently powerful, inherently more suitable, and necessarily more economical for many invocations in agentic systems, and are therefore the future of agentic AI.
Let’s get started. Recent surveys show that more than half of large IT enterprises are actively using AI agents, with 21% having adopted them only within the last year. Aside from the users, markets also see substantial economic value in AI agents: as of late 2024, the agentic AI sector had seen more than USD 2bn in startup funding, was valued at USD 5.2bn, and was expected to grow to nearly USD 200bn by 2034. Put plainly, there is a growing expectation that AI agents will play a substantial role in the modern economy.
The core components powering most modern AI agents are (very) large language models. It is the LLMs that provide the foundational intelligence that enables agents to make strategic decisions about when and how to use available tools, control the flow of operations needed to complete tasks, and, if necessary, to break down complex tasks into manageable subtasks and to perform reasoning for action planning and problem-solving. A typical AI agent then simply communicates with a chosen LLM API endpoint by making requests to centralized cloud infrastructure that hosts these models.
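To make this operational model concrete, here is a minimal sketch of the pattern, assuming a hosted, OpenAI-compatible chat endpoint; the URL and model name are placeholders of mine, not anything from the paper:

```python
# Minimal sketch (not from the paper) of the standard pattern: an agent
# forwarding every "intelligence" request to a centralized chat endpoint.
import requests

API_URL = "https://api.example.com/v1/chat/completions"  # hypothetical endpoint

def ask_model(prompt: str, model: str = "generalist-llm") -> str:
    """Send one agent step to the hosted LLM and return its reply."""
    resp = requests.post(
        API_URL,
        json={"model": model,
              "messages": [{"role": "user", "content": prompt}]},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Tool selection, planning, and subtask decomposition all funnel
# through this one remote call in a typical agent loop.
next_step = ask_model("Given the task and the tool list, choose the next tool call.")
```

Every decision the agent makes funnels through that single remote call, and it is exactly this dependency that the paper questions.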
The authors recognize the dominance of this standard operational model but challenge one of its aspects, namely the custom that the agents’ requests for language intelligence, despite their comparative simplicity, are handled by singleton choices of generalist LLMs. They then provide an outline of a conversion algorithm for migrating agentic applications from LLMs to SLMs, and call for a wider discussion.
But why SLMs in the first place? The authors argue that a particularly notable and desirable consequence of SLM flexibility, when SLMs are put in place of LLMs, is the ensuing democratization of agents. When more individuals and organizations can participate in developing language models for deployment in agentic systems, the aggregate population of agents is more likely to represent a diverse range of perspectives and societal needs. This diversity can help reduce the risk of systemic biases and encourages competition and innovation. With more actors entering the field to create and refine models, the field will advance more rapidly.
For the purpose of concretizing their position, the authors start by proposing the following working definitions (WD):
WD1: An SLM is an LM that can fit onto a common consumer electronic device and perform inference with latency sufficiently low to be practical when serving the agentic requests of one user.
WD2: An LLM is an LM that is not an SLM.
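To make WD1 concrete, here is an illustrative-only predicate; the thresholds are my assumptions, since the paper deliberately ties the definition to whatever counts as "common consumer electronics" at the time:

```python
# Illustrative-only check for WD1; the thresholds are my assumptions,
# not values given by the paper.
def satisfies_wd1(model_memory_gb: float, tokens_per_second: float) -> bool:
    """Does the model fit a common consumer device and serve one user's
    agentic requests at practical latency?"""
    CONSUMER_MEMORY_GB = 24.0    # assumed: high-end consumer GPU/laptop
    MIN_TOKENS_PER_SEC = 20.0    # assumed: tolerable interactive speed
    return (model_memory_gb <= CONSUMER_MEMORY_GB
            and tokens_per_second >= MIN_TOKENS_PER_SEC)
```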
The authors use the words agent and agentic system interchangeably, preferring the former when emphasizing the software with some agency as a whole (e.g., “as seen in popular coding agents”) and the latter when highlighting the systems aspect of the agentic application as a sum of its components (e.g., “not all LMs of an agentic system are replaceable by SLMs”). For brevity, they focus on LMs as the bedrock of agentic applications and do not explicitly consider vision-language models, although they note that their position and most arguments readily extend to vision-language models as well.
The authors then lay out a rather bold position statement contending that SLMs are:
V1 principally sufficiently powerful to handle language modeling errands of agentic applications;
V2 inherently more operationally suitable for use in agentic systems than LLMs;
V3 necessarily more economical for the vast majority of LM uses in agentic systems than their general-purpose LLM counterparts by virtue of their smaller size;
And that on the basis of views V1–V3 SLMs are the future of agentic AI. The authors support these views using the following non-exclusive arguments:
SLMs are already sufficiently powerful for use in agents
A1 SLMs are sufficiently powerful to take the place of LLMs in agentic systems. This argument stands in support of view V1.
Over the past few years, the capabilities of small language models have advanced significantly. Although LM scaling laws still hold, the curve relating model size to capability has become increasingly steep, implying that the capabilities of newer small language models are much closer to those of earlier large language models. Indeed, recent results show that well-designed small language models can meet or exceed the task performance previously attributed only to much larger models.
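For context, such scaling laws are usually cited in the Chinchilla parametric form of Hoffmann et al. (2022); the formula below is that standard form, not an equation from today's paper:

```latex
% Chinchilla-style parametric scaling law (Hoffmann et al., 2022).
% N: parameter count, D: training tokens; E, A, B, \alpha, \beta are fitted constants.
\[
  L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
\]
```

The authors' rejoinder (A8 below) is that such fits hold the architecture family constant, which newer SLM designs deliberately do not.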
Extensive comparisons with large models have been conducted in hundreds of papers, but not all capabilities assessed by benchmarks are essential for deployment in the agentic context. The authors highlight SLMs’ aptitude for commonsense reasoning (an indicator of basic understanding), tool calling and code generation (both indicators of the ability to communicate correctly across the model→tool/code interface; see Figure 1 of the paper), and instruction following (the ability to respond correctly back across the code←model interface).
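To see why tool calling is the capability that matters here, consider a minimal sketch of the model→tool interface; the schema and tool names below are invented for illustration:

```python
# Sketch of the model→tool interface: the agent only needs the model to
# emit a well-formed tool call, not open-ended prose. Schema is invented.
import json

TOOLS = {"search", "calculator", "file_read"}

def parse_tool_call(model_output: str) -> dict:
    """Validate that the model's reply is a usable tool invocation."""
    call = json.loads(model_output)  # raises ValueError on malformed JSON
    if call.get("tool") not in TOOLS:
        raise ValueError(f"unknown tool: {call.get('tool')}")
    if not isinstance(call.get("arguments"), dict):
        raise ValueError("missing or malformed arguments")
    return call

# A small model that reliably produces, e.g.,
#   {"tool": "calculator", "arguments": {"expression": "17 * 23"}}
# clears this interface just as well as a far larger model would.
```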
SLMs are more economical in agentic systems
A2 SLMs are more economical than LLMs in agentic systems. This argument supports view V3. Small models provide significant benefits in cost-efficiency, adaptability, and deployment flexibility. These advantages are particularly valuable in agentic workflows, where specialization and iterative refinement are critical.
SLMs are more flexible
A3 SLMs possess greater operational flexibility in comparison to LLMs. This argument stands in support of views V2 and V3.
Due to their small size and the associated reduction in pre-training and fine-tuning costs, SLMs are inherently more flexible than their large counterparts when used in agentic systems. It therefore becomes much more affordable and practical to train, adapt, and deploy multiple specialized expert models for different agentic routines. This efficiency enables rapid iteration and adaptation, making it feasible to address evolving user needs, including supporting new behaviors, meeting new output formatting requirements, and complying with changing local regulations in selected markets.
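As a rough illustration of how cheap such specialization can be, here is a minimal LoRA fine-tuning sketch using Hugging Face PEFT; the checkpoint name and hyperparameters are assumptions of mine, not the paper's recipe:

```python
# Minimal sketch of specializing one SLM per agentic routine with LoRA
# adapters; checkpoint name and hyperparameters are illustrative only.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("some-open-slm")  # hypothetical checkpoint

adapter = LoraConfig(
    r=16,                    # low-rank dimension: a tiny fraction of full fine-tuning
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
expert = get_peft_model(base, adapter)
expert.print_trainable_parameters()  # typically well under 1% of all weights

# Train one such adapter per routine (routing, summarization, formatting, ...)
# and swap adapters at inference instead of hosting several full models.
```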
Agents expose only very narrow LM functionality
A4 Agentic applications are interfaces to a limited subset of LM capabilities. This supports views V1 and V2.
An AI agent is essentially a heavily instructed and externally choreographed gateway to a language model, featuring a human-computer interface and a selection of tools that, when engaged correctly, do something of utility. From this perspective, the underlying large language model, engineered to be a powerful generalist, is restricted, through a set of tediously written prompts and meticulously orchestrated context management, to operating within a small section of its otherwise large palette of skills. The authors thus argue that an SLM appropriately fine-tuned for the selected prompts would suffice, while bringing the above-mentioned benefits of increased efficiency and greater flexibility.
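Here is a sketch of what that narrow gateway looks like in practice; the template text is invented for illustration:

```python
# Sketch of a rigid agent-side prompt template. In practice an agent
# exposes the model through a handful of such templates, so only a
# narrow slice of its general ability is ever exercised.
ROUTE_TEMPLATE = """You are a routing module.
Task: {task}
Available tools: {tools}
Reply with exactly one tool name and nothing else."""

def routing_prompt(task: str, tools: list[str]) -> str:
    return ROUTE_TEMPLATE.format(task=task, tools=", ".join(tools))

# An SLM fine-tuned on (prompt, tool-name) pairs drawn from this template
# can match a generalist LLM on it, because the template never asks for
# anything outside that narrow skill.
```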
A range of other similar arguments are also offered, and to their credit, the authors also consider “alternative views” and offer possible rebuttals.
LLM generalists will always have the advantage of more general language understanding
AV1 Let T be a single task using general language and let L,S be a large and a small language model of the same generation, respectively. The performance of L on T will always trump that of S.
Two main arguments are advanced in support of this view. First, there is substantial empirical evidence indicating that LLMs outperform SLMs in general language understanding, even when both are from the same generation. This superior performance is attributed to scaling laws, which suggest that as model size increases, so does the ability to handle a broad range of natural language tasks such as text generation, translation, and reasoning. LLMs consistently surpass smaller models trained either in a general manner or specifically for those tasks.
Second, recent research suggests that LLMs may develop a "semantic hub"—a mechanism that enables them to integrate and abstract information across languages and modalities. This capability allows LLMs to generalize more effectively than SLMs, which lack the capacity for such abstraction. While smaller models may be suitable for narrow or specialized tasks, their limited scale constrains their ability to internalize complex concepts. Consequently, LLMs are likely to maintain a consistent advantage in performance across both general and specialized language tasks, making them more suitable for agentic applications.
However, the authors counter that:
A8 Popular scaling law studies assume the model architecture to be kept constant within the same generation, whereas recent work on small language model training demonstrates that there are distinct performance benefits to considering different architectures for different model sizes.
A9 The flexibility of small language models comes to the rescue. A small language model can be easily fine-tuned for the task T of alternative view AV1 to perform to the desired level of reliability. This is unaccounted for in scaling law studies.
A10 Reasoning (or, more generally, test-time compute scaling) is significantly more affordable with small models. A small language model, while retaining its benefit of greater cross-device agility, can reasonably be expected to be scaled at inference time to the desired level of reliability; a minimal sketch of this idea follows below.
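Here is a minimal sketch of the test-time scaling idea in A10, using self-consistency (majority voting over samples); `ask_slm` is a hypothetical stand-in for any sampling call to the small model:

```python
# Minimal sketch of A10: trade extra SLM samples for reliability via
# self-consistency (majority vote). `ask_slm` is a hypothetical callable.
from collections import Counter

def self_consistent_answer(ask_slm, prompt: str, n_samples: int = 8) -> str:
    """Sample the SLM several times and return the most frequent answer."""
    answers = [ask_slm(prompt, temperature=0.8) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Because each SLM sample is cheap, even 8-16 votes can cost less than a
# single call to a much larger model while lifting task reliability.
```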
The authors also claim that:
A11 The utility of the purported “semantic hub” shows itself when the tasks or inputs to be processed by the LM are complex. However, advanced agentic systems are either designed in their entirety, or at least actively prompted, to decompose complex problems and inputs. The authors therefore argue to the contrary: invocations of small language models within agentic systems would operate on appropriately broken-down sub-tasks so simple that any general abstract understanding owed to the hub would be of little utility.
(Some other rebuttals and alternative views are considered, but I wanted to point attention to the nature and structure of the argumentation here. It’s an interesting way to write a paper; much more like a bulleted philosophical paper. We certainly need more papers like this!)
It would be prudent to ask oneself: if the arguments A1–A7 are truly compelling, why do ever newer generations of agents seemingly just perpetuate the status quo of using generalist LLMs? The authors offer the following explanations:
B1 Large amounts of upfront investment into centralized LLM inference infrastructure. As detailed earlier, large capital bets have been made on the centralized LLM inference being the leading paradigm in providing AI services in the future. As such, the industry has been much quicker at building the tools and infrastructure to that end, omitting any considerations for the possibility that more decentralized SLM or on-device inference might be equally feasible in the near future.
B2 Use of generalist benchmarks in SLM training, design, and evaluation. It must be pointed out that much of the work on SLM design and development follows in the tracks of LLM design, focusing on the same generalist benchmarks. On this point, the paper notes that if one focuses solely on benchmarks measuring agentic utility, the studied SLMs easily outperform larger models.
B3 Lack of popular awareness. SLMs often do not receive the level of marketing intensity and press attention LLMs do, despite their better suitability in many industrial scenarios.
The authors close the paper by presenting a method, which is unusual in itself (usually, the method is the star of a computer science or AI paper, placed front and center; here it comes at the very end!). They call this an “LLM-to-SLM Agent Conversion Algorithm”:
S1 Secure usage data collection: all non-human-computer-interaction (HCI) agent calls are logged, including prompts, responses, tool calls, and optionally latency metrics, with privacy and access control ensured.
S2 Data curation and filtering: personally identifiable or otherwise sensitive information is removed, often using automated tools or paraphrasing techniques that retain the general content without leaking private details.
S3 Task clustering: unsupervised techniques identify recurring task types in the collected data.
S4 SLM selection: appropriate SLMs are chosen for each task type based on capabilities, benchmark performance, licensing, and deployment feasibility.
S5 Specialized fine-tuning: each selected SLM is fine-tuned on a task-specific dataset using techniques such as LoRA, QLoRA, or knowledge distillation from the original LLM, preserving nuanced performance while optimizing for scale.
S6 Iteration and refinement: SLMs and their routing logic are periodically retrained and updated to adapt to changing use cases and maintain high performance over time. This makes the migration to SLM-based architectures in agentic applications seamless and scalable; a skeleton of the whole loop is sketched below.
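As a rough sketch of how steps S1–S6 could hang together in code, under my own assumptions (embeddings plus k-means for S3, stub functions for S4 and S5; none of the names come from the paper):

```python
# Skeleton of the LLM-to-SLM conversion loop described above.
# All names and the clustering choice are illustrative assumptions.
from sklearn.cluster import KMeans

def select_slm_for(examples):
    """S4 placeholder: pick a candidate SLM by capability, license, footprint."""
    return "chosen-slm-checkpoint"  # hypothetical model identifier

def fine_tune(slm, examples):
    """S5 placeholder: LoRA/QLoRA fine-tune, or distill from the original LLM."""
    return f"{slm}-finetuned"

def convert_agent(call_log, embed, n_task_types=10):
    # S1: `call_log` holds the securely logged non-HCI agent calls
    # (prompts, responses, tool calls, optional latency metrics).

    # S2: curate and filter, e.g. drop records flagged as containing PII.
    curated = [c for c in call_log if not c.get("contains_pii", False)]

    # S3: cluster prompts into recurring task types (unsupervised).
    vectors = [embed(c["prompt"]) for c in curated]
    labels = KMeans(n_clusters=n_task_types).fit_predict(vectors)

    # S4 + S5: choose and specialize one SLM per task cluster.
    experts = {}
    for task_id in set(labels):
        subset = [c for c, label in zip(curated, labels) if label == task_id]
        experts[task_id] = fine_tune(select_slm_for(subset), subset)

    # S6: redeploy, keep logging, and rerun this loop periodically.
    return experts
```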
Finally, the authors end with a short call for discussion. It is their view that any expense savings or improvements on the sustainability of AI infrastructure would act as a catalyst for this transformation, and that it is thus eminently desirable to explore all options for doing so.
If you have thoughts on any of the above, you can actually write to them directly: “We therefore call for both contributions to and critique of our position, to be directed to agents@nvidia.com, and commit to publishing all such correspondence at research.nvidia.com/labs/lpr/slm-agents.” Kudos to them for doing that.
Excellent. A natural evolution of LLMs and their architectures to make them more adaptable and affordable, especially in the direction of their likely future competitors: active inference agents.
"May you live in interesting times".