Musing 24: Pre-training Small Base Language Models with Fewer Tokens
An experimentally heavy paper on how to train smaller language models (LMs) that still remain effective
Today’s paper: Pre-training Small Base LMs with Fewer Tokens. Sanyal et al. 15 Apr 2024. https://arxiv.org/pdf/2404.08634.pdf
As is evident in their name, ‘large’ language models are big, and getting bigger. This has implications for who can train them, how they are trained, and how much energy they consume. Furthermore, the largest models, which are almost all commercial, are also proprietary, so we don’t know a lot about how they are trained. All of this is to say that, performance notwithstanding, if we can make the models more efficient (i.e., smaller and faster to train) and their training process more transparent, we would be better off scientifically, not to mention easier on the environment.
The authors of today’s paper explore the effectiveness of a straightforward method for constructing a small base language model (LM) from an existing larger LM. The method inherits a subset of transformer blocks from the larger LM and then trains the smaller model on a tiny portion (0.1%) of the original pretraining data. The approach, called "Inheritune," is first showcased by building a 1.5B parameter base LM from the first few layers of a 3B parameter LM, trained on 1B tokens using a single A6000 GPU in under half a day. Evaluation across 9 varied datasets and the MMLU benchmark shows that the resulting model compares favorably to publicly available base models of 1B-2B size, some of which were trained on significantly more tokens.
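To make the recipe concrete, here is a minimal sketch of the layer-inheritance step in Python, using PyTorch and the Hugging Face transformers library. It assumes a GPT-2-style reference model whose transformer blocks live in model.transformer.h; the model name, the choice of k, and the helper function are my own illustration rather than the authors’ code (their implementation is in the repository linked below), and the subsequent training on roughly 0.1% of the pretraining tokens is an ordinary causal-LM training loop.

```python
# Illustrative sketch of layer inheritance (not the authors' released code).
# Assumption: a GPT-2-style model from Hugging Face whose blocks sit in
# `model.transformer.h`; other architectures keep their blocks elsewhere.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def inherit_first_k_layers(reference_name: str, k: int):
    """Load a larger reference LM and keep only its first k transformer blocks."""
    model = AutoModelForCausalLM.from_pretrained(reference_name)
    model.transformer.h = torch.nn.ModuleList(model.transformer.h[:k])
    model.config.n_layer = k  # keep the config consistent with the new depth
    return model

tokenizer = AutoTokenizer.from_pretrained("gpt2-medium")
small_lm = inherit_first_k_layers("gpt2-medium", k=12)  # gpt2-medium has 24 blocks
print(sum(p.numel() for p in small_lm.parameters()), "parameters after inheritance")

# The inherited model is then trained as usual (standard next-token loss) on a
# small random subset (~0.1%) of the reference model's pretraining data.
```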
Further exploration of Inheritune involves training smaller LMs using larger LMs and their complete pretraining datasets. Results show that smaller LMs built from layers of GPT-2 medium (355M) and GPT-2 large (770M) can effectively match the validation loss of their larger counterparts trained from scratch for the same number of steps on the OpenWebText dataset (9B tokens). Extensive experimentation and analysis underscore Inheritune's efficacy across diverse settings. The code for this approach is available at https://github.com/sanyalsunny111/LLM-Inheritune.
The paper is technical and experimental, and it has to be. Much of the advance in the paper is of an engineering nature; we all ‘want’ smaller models that work just as well as the bigger ones, but it takes excellent engineering to get there. This paper makes headway in that direction, and to be convincing, it has to (and does) present a lot of empirical evidence.
Rather than go into a lot of detail on the ‘how’, I’m mostly interested in the results themselves. If the authors are right, we can make the models much smaller. How much performance do we lose by doing so?
Let’s study the table above. The authors compare their target model (M_tgt), derived using Inheritune, with its reference model (M_ref) and other baseline models of similar size, both those pre-trained from scratch and those pre-trained with inherited weights and pruning. Although trained on far fewer tokens, their model achieves performance comparable to the baselines. The authors highlight in bold every score where their 1.5B model reaches at least 90% of its reference LM’s score or outperforms at least two of the publicly available baseline LMs. All tasks are evaluated 0-shot except MMLU, which is 5-shot. Models marked ‘n/a’ are trained from scratch. Other results, shown further below, are similarly convincing.
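The bolding rule reduces to a simple check; the snippet below is just my restatement of that criterion, and the example scores are invented for illustration, not taken from the paper.

```python
# Restatement of the table's bolding rule: a score is highlighted when the 1.5B
# target model reaches at least 90% of its reference LM's score, or when it
# outperforms at least two of the publicly available baseline LMs.
# The function and the example numbers are illustrative only.

def is_highlighted(target: float, reference: float, baselines: list[float]) -> bool:
    reaches_90_percent = target >= 0.9 * reference
    beats_two_baselines = sum(target > b for b in baselines) >= 2
    return reaches_90_percent or beats_two_baselines

# Hypothetical 0-shot accuracies on one task:
print(is_highlighted(target=0.55, reference=0.60, baselines=[0.52, 0.54, 0.58]))  # True
```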
So what are the implications?
Developing small base LMs affordably and effortlessly. Pre-training a small base LM with 1-2B parameters from scratch is prohibitively costly. For example, the TinyLLaMA-1.1B model (Zhang et al., 2023) was pre-trained on 16 A100 GPUs over a span of 3 months. In contrast, the authors’ 1.5B LM variant demonstrates competitive performance despite being trained on a single A6000 GPU in under 12 hours. Typically, small base LMs are fine-tuned for specific tasks before deployment and are not used in their base form. With Inheritune, the authors introduce a remarkably straightforward and cost-effective way to develop a small base LM for such subsequent fine-tuning.
Naive baseline for pre-training a scaled-down variant of large base LMs. Typically, small variants of large base LMs are pre-trained using the same pre-training data. The authors show that even with a small, randomly sampled fraction of the pre-training data and a few initial layers of the large base LM, one can develop a small base LM. Their Inheritune recipe therefore has the potential to become the naive baseline for any pre-training pipeline aiming to develop a smaller variant of a large base LM.
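For a sense of how little data that is, randomly sampling such a fraction of a pre-training corpus can be as simple as the sketch below, which uses the Hugging Face datasets library; the corpus name and the 0.1% fraction are my assumptions for the example, not the authors’ exact data pipeline.

```python
# Illustrative only: keep a ~0.1% random sample of a pre-training corpus.
# The dataset id below is an example mirror of OpenWebText on the Hugging Face
# Hub; the authors' actual data handling may differ.
from datasets import load_dataset

corpus = load_dataset("Skylion007/openwebtext", split="train")
fraction = 0.001  # roughly 0.1% of the documents
n_keep = max(1, int(len(corpus) * fraction))
subset = corpus.shuffle(seed=42).select(range(n_keep))
print(f"Kept {len(subset)} of {len(corpus)} documents for the small base LM.")
```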
Sufficient depth for bigger LLMs. Architectural choices, especially the total number of layers in an LLM, are often made somewhat arbitrarily. The authors show a method for identifying a sufficient depth for a particular model without giving up pre-training validation loss, and they claim the recipe can be applied beyond depth alone.
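One way to read that operationally: pick the smallest number of inherited layers whose trained sub-model stays within some tolerance of the reference model’s validation loss. The sketch below shows only that selection logic; the function names, the tolerance, and the simulated losses are placeholders I introduce, not the paper’s procedure.

```python
# Schematic "sufficient depth" selection: the smallest depth whose trained
# sub-model roughly matches the reference validation loss. `train_and_evaluate`
# is a placeholder standing in for a full Inheritune training run.
from typing import Callable, Optional

def sufficient_depth(
    candidate_depths: list[int],
    reference_val_loss: float,
    train_and_evaluate: Callable[[int], float],  # depth -> validation loss
    tolerance: float = 0.01,
) -> Optional[int]:
    for k in sorted(candidate_depths):
        if train_and_evaluate(k) <= reference_val_loss + tolerance:
            return k  # smallest depth that comes within `tolerance` of the reference
    return None  # no candidate depth was sufficient

# Toy usage with simulated validation losses (a real run would train each sub-model):
simulated = {12: 3.25, 16: 3.12, 20: 3.09, 24: 3.08}
print(sufficient_depth([12, 16, 20, 24], reference_val_loss=3.08,
                       train_and_evaluate=simulated.__getitem__))  # -> 20
```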
In closing, the study demonstrates the efficacy of the Inheritune method for pre-training small base language models using significantly fewer data tokens and computational resources than traditional methods. The authors propose a very simple way of creating a smaller LM from a large reference model and a small training set. Their findings have a lot of potential.