Musing 28: XFT: Unlocking the Power of Code Instruction Tuning by Simply Merging Upcycled Mixture-of-Experts.
A paper from UIUC on better code instruction tuning (code/program synthesis) for LLMs
Today’s paper: XFT: Unlocking the Power of Code Instruction Tuning by Simply Merging Upcycled Mixture-of-Experts. Ding et al. Apr 23, 2024. https://arxiv.org/pdf/2404.15247
Recently, instruction tuning of code Large Language Models (LLMs) has been used to improve many coding tasks, such as text-to-code generation, code completion, and data science engineering. Examples of such work include Code Evol-Instruct, which uses ChatGPT with heuristic prompts to obtain complex synthetic code instructions, and OSS-INSTRUCT, which prompts ChatGPT to generate new coding problems inspired by open-source code snippets.
Today’s paper contributes in this direction by presenting XFT, a new training scheme for code instruction tuning. XFT involves two steps: upcycling and merging. With only 1.3B parameters, XFT achieves 67.1 pass@1 on HumanEval and 64.6 pass@1 on HumanEval+, a new state of the art for tiny code LLMs (<3B). Compared with normal supervised fine-tuning (SFT), XFT achieves a 13% improvement on HumanEval+.
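To make the two-step recipe concrete, here is a minimal NumPy sketch of what "upcycle, instruction-tune, then merge back to dense" could look like at the level of a single feed-forward weight matrix. The function names, the perturbation standing in for fine-tuning, and the uniform mixing weights are illustrative choices of mine, not the paper's implementation.

```python
# A minimal, schematic sketch of the XFT recipe (upcycle -> tune -> merge),
# using plain NumPy matrices to stand in for a transformer's FFN weights.
# Function and variable names are illustrative, not from the paper's code.
import numpy as np

rng = np.random.default_rng(0)

def upcycle(dense_ffn_weight: np.ndarray, num_experts: int) -> list[np.ndarray]:
    """Sparse upcycling: initialize every expert as a copy of the dense FFN."""
    return [dense_ffn_weight.copy() for _ in range(num_experts)]

def instruction_tune(experts: list[np.ndarray]) -> list[np.ndarray]:
    """Placeholder for SFT of the MoE model; here we just perturb each expert
    to mimic the experts drifting apart during fine-tuning."""
    return [w + 0.01 * rng.standard_normal(w.shape) for w in experts]

def merge(experts: list[np.ndarray], mix_weights: np.ndarray) -> np.ndarray:
    """Collapse the experts back into one dense FFN as a convex combination
    (the mixing coefficients would be learned in practice)."""
    mix = mix_weights / mix_weights.sum()          # normalize to sum to 1
    return sum(w * m for w, m in zip(experts, mix))

dense = rng.standard_normal((8, 8))                # stand-in dense FFN weight
experts = instruction_tune(upcycle(dense, num_experts=4))
merged = merge(experts, mix_weights=np.ones(4))    # uniform mixing for the demo
print(merged.shape)                                # same size as the original dense FFN
```

The point, as I read the paper, is that the merged model is a plain dense model again, so it keeps the gains from MoE-style fine-tuning while paying only ordinary dense inference cost.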
Let’s get into some details. The paper relies heavily on Mixture-of-Experts (MoE). What is it? It is a machine learning architecture designed to scale up neural networks efficiently by dividing a network into multiple 'expert' sub-networks, each specializing in different types of tasks or data. Here’s how MoE generally functions (a toy implementation follows the list below):
Structure: An MoE model consists of several expert networks that are typically smaller than the full model. These experts are orchestrated by a gating mechanism that decides which expert should be activated for a given input.
Gating Mechanism: The gating mechanism dynamically routes inputs to one or more relevant experts based on the input data characteristics. This routing is crucial for leveraging the specialization of experts effectively.
Efficiency: Since only a subset of experts is activated for each input, an MoE model can hold many more parameters than a dense model (which uses every parameter for every input) while keeping the compute per input much lower. Computation therefore grows sub-linearly even as the parameter count grows linearly or faster.
Scalability: MoE allows for significant scalability in model size and capacity without a corresponding increase in computation during inference, making it suitable for tasks where large models are beneficial but computational resources are limited.
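Here is the promised toy top-k gated MoE layer in NumPy. The shapes, the softmax-over-selected-experts gating, and the single-token interface are simplifications of mine rather than any specific production implementation.

```python
# A toy top-k gated MoE layer in NumPy, just to make the routing concrete.
# Real implementations work on batches of tokens with learned, trained weights.
import numpy as np

rng = np.random.default_rng(0)
d_model, num_experts, top_k = 16, 8, 2

# Each "expert" is a small feed-forward block; here just one weight matrix.
expert_weights = [rng.standard_normal((d_model, d_model)) for _ in range(num_experts)]
router_weight = rng.standard_normal((d_model, num_experts))

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route a single token vector x to its top-k experts and mix their outputs."""
    logits = x @ router_weight                      # (num_experts,) routing scores
    top = np.argsort(logits)[-top_k:]               # indices of the k best experts
    gates = np.exp(logits[top] - logits[top].max()) # softmax over the selected experts
    gates /= gates.sum()
    # Only k of the num_experts experts run, so compute scales with k, not num_experts.
    return sum(g * (x @ expert_weights[i]) for g, i in zip(gates, top))

token = rng.standard_normal(d_model)
print(moe_forward(token).shape)                     # (16,)
```

This is why a model like Mixtral 8x7B can have roughly 47B total parameters while only about 13B are active for any given token (2 of its 8 experts per layer).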
Experimentally, the authors use several benchmarks, including MultiPL-E, a multilingual programming benchmark that supports 18 programming languages in addition to Python, to evaluate the multilingual ability and generalizability of XFT. Among these, they chose 6 representative programming languages for their distinct language features: Java, JavaScript, C++, PHP, Swift, and Rust. They also use DS-1000, a collection of 1,000 realistic data science coding problems covering 7 popular Python data science libraries: Matplotlib (plt), NumPy (np), Pandas (pd), SciPy (scp), Scikit-Learn (sk), PyTorch (py), and TensorFlow (tf).
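Since the headline numbers are pass@1 scores, it may help to recall the standard unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021). The snippet below is that generic estimator, not anything specific to XFT, and the sample counts in the example are made up.

```python
# Standard unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples (drawn from n generations,
    of which c pass the unit tests) is correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 200 samples per problem, 134 of them correct.
print(round(pass_at_k(n=200, c=134, k=1), 3))  # 0.67
```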
Compared with a baseline model (SFT_DS) that performs plain SFT on the same dataset, XFT demonstrates superior performance. The authors attribute this to two ingredients of their upcycling step, a shared expert setting and a routing weight normalization strategy, which overcome the scale mismatch issue present in previous sparse upcycling methods. The remaining ablation studies likewise point in a positive direction for the method.
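Here is my rough, hedged reading of those two ingredients in the same toy NumPy style as above: a shared expert that is always active, plus normalization of the selected routing weights, combined so that the upcycled MoE initially reproduces the dense FFN's output. The 50/50 mix of the shared and routed paths is an illustrative choice of mine; the exact formulation is in the paper.

```python
# A rough sketch of how a shared expert plus routing weight normalization
# could keep the upcycled MoE output on the same scale as the original dense
# FFN. This is my interpretation of the idea, not the paper's code.
import numpy as np

rng = np.random.default_rng(1)
d_model, num_routed, top_k = 16, 4, 2

dense_ffn = rng.standard_normal((d_model, d_model))
shared_expert = dense_ffn.copy()                        # always-on expert, copied from the dense FFN
routed_experts = [dense_ffn.copy() for _ in range(num_routed)]
router_weight = rng.standard_normal((d_model, num_routed))

def upcycled_forward(x: np.ndarray) -> np.ndarray:
    logits = x @ router_weight
    top = np.argsort(logits)[-top_k:]
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                                # normalize so the selected gates sum to 1
    routed_out = sum(g * (x @ routed_experts[i]) for g, i in zip(gates, top))
    shared_out = x @ shared_expert
    # Average the shared and routed paths so that, right after upcycling
    # (before any tuning), every path equals x @ dense_ffn and the layer
    # reproduces the dense FFN exactly -- no scale mismatch at initialization.
    return 0.5 * shared_out + 0.5 * routed_out

x = rng.standard_normal(d_model)
print(np.allclose(upcycled_forward(x), x @ dense_ffn))  # True at initialization
```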
One limitation of the paper is that the authors are unable to showcase the impact of XFT on larger models. They also do not provide a strong theoretical basis for the performance improvements. Both are issues they will likely address in future work.