The cost of the cutting edge: Scaling compute and limiting access
In a year marked by extraordinary advancements in artificial intelligence, largely driven by the evolution of large language models (LLMs), one factor stands out as a universal accelerator: the exponential growth in computational power.
Over the last few years, researchers have continuously pushed the boundaries of what a ‘large’ language model means, increasing both the size of these models and the volume of data they were trained on. This exploration has revealed a consistent trend: as the models have grown, so have their capabilities.
Fast forward to today, and the landscape has radically transformed. Training state-of-the-art LLMs like the models behind ChatGPT has become an immensely costly endeavor. The bulk of this expense stems from the staggering amount of computational resources required. To train an LLM, researchers process enormous datasets using the latest, most advanced GPUs (graphics processing units). The cost of acquiring just one of these GPUs, such as NVIDIA's H100, can reach upwards of $30,000. Moreover, these units are power-hungry, contributing to significant electricity usage. It's reported that OpenAI invested over $100 million in the development of its latest model, GPT-4: an expenditure far beyond the reach of most organizations and research teams.
Peering through the fog: Transparency, reproducibility, and the evaluation dilemma
While researchers and organizations have made substantial efforts towards openness in the development of large language models, three major hurdles beyond compute have hampered attempts to level the playing field:
- Limited Transparency: The specifics of model architecture, data sources, and tuning parameters are often closely guarded.
- Challenging Reproducibility: Due to the swift pace of innovation, independently replicating these models on your own infrastructure can be very difficult.
- Disjointed Evaluation: The absence of a universally accepted benchmark for LLM evaluation complicates direct comparisons. The multitude of available tasks and frameworks, each assessing different capabilities, means comparisons are often inconclusive.
Concentrated power: A threat to innovation?
The high compute requirements, coupled with the opacity and complexity of developing and evaluating LLMs, are hampering progress for researchers and smaller organizations striving for openness in the field. This not only threatens to reduce the diversity of innovation but also risks centralizing control of powerful models in the hands of a few large entities. The critical question arises: are these entities prepared to shoulder the ethical and moral responsibilities this control entails? Moreover, what steps can we take to bridge the divide between open innovators and those who hold the keys to the leading LLM technology?
The NeurIPS challenge: Breaking down barriers to innovation
In November 2023, we took part in the NeurIPS Large Language Model Efficiency Challenge, which aimed to help democratize access to language models and was led by PyTorch and Microsoft Research. We co-sponsored the challenge, along with AWS, Lambda Labs, and Microsoft (a full list of sponsors and organizers can be found here).
The challenge was straightforward yet ambitious: adapt a foundation model for specific tasks by fine-tuning it on a single GPU within a 24-hour window. The competition rewarded teams that delivered the most performant models based on predefined criteria.
The goal of this challenge was to explore three major issues that limit easy development of language models:
- Insufficient access to dedicated hardware, which prevents widespread availability and usage of these models.
- A lack of transparency around model training methods, which means most models cannot be independently reproduced.
- The absence of a standard benchmark for evaluating these models side by side.
The results of this competition showcased the potential for innovation while also highlighting critical areas for improvement, paving the way for a more inclusive future in language model development.
LLM fine-tuning: Promise for low-compute options
The NeurIPS competition brought together a number of teams competing in the craft of fine-tuning an open-source LLM. Fine-tuning is a process where a pre-trained base LLM is further trained on a smaller, specific dataset to specialize its abilities for particular tasks or domains. This method contrasts with pre-training, which involves training a model from scratch on vast amounts of general data, requiring significant computational power and resources. Fine-tuning is crucial because it offers a more resource-efficient approach to adapt LLMs to specialized applications, leveraging the broad knowledge base of the pre-trained model while requiring considerably less compute. For example, we might fine-tune a generalist model on texts from our group chat to make it sound more like our friends, or on cardiology data for specific medical applications.
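To make that concrete, here is a minimal sketch of what fine-tuning can look like in code. It assumes the Hugging Face transformers and datasets libraries; the base model name and the toy medical snippets are placeholders for illustration, not anything used in the competition.

```python
# A minimal fine-tuning sketch: further train a pre-trained causal LM on a
# small, domain-specific dataset (placeholder data shown here).
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_model = "mistralai/Mistral-7B-v0.1"  # any open-source base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # needed for padding during batching
model = AutoModelForCausalLM.from_pretrained(base_model)

# A handful of domain-specific examples stands in for your real dataset.
texts = [
    "Patient presents with atrial fibrillation and is started on ...",
    "The ECG shows ST elevation in leads V1-V4, consistent with ...",
]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="finetuned-model",
        per_device_train_batch_size=1,
        num_train_epochs=1,
    ),
    train_dataset=dataset,
    # mlm=False gives standard next-token (causal) language modeling labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The structure is the same whether the dataset is a group chat export or a cardiology corpus; only the data and the amount of compute change.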
The competition provided great insight into how the field is approaching the development of LLMs with limited computational power through fine-tuning. We’ve summarized some of the takeaways from the competition’s top teams:
- Many entrants used Llama-2 or Mistral as their base model; however, Qwen, a lesser-known Chinese model developed at Alibaba Cloud, proved to be a remarkably strong choice for this competition.
- Data curation, and more specifically manual data curation, proved to be extremely important. Top teams dedicated time to combining and formatting a number of datasets, which often included Open-Platypus and LIMA.
- Many used PEFT (parameter-efficient fine-tuning), an umbrella term for a variety of techniques that reduce the number of parameters that need to be updated during fine-tuning. More specifically, LoRA (Low-Rank Adaptation of Large Language Models) was used by all of the top teams; a minimal sketch of this approach follows this list.
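To illustrate the LoRA approach named above, the sketch below wraps a base model with the Hugging Face peft library so that only small low-rank adapter matrices are trained while the original weights stay frozen. The rank, target module names, and base model here are illustrative assumptions, not the exact settings used by the winning teams.

```python
# A minimal LoRA sketch with the Hugging Face `peft` library: only the
# low-rank adapter matrices are trainable; the base weights stay frozen.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # typically well under 1% of all parameters

# The wrapped model drops into the same Trainer loop shown earlier.
```

Because gradients and optimizer state are only kept for that small fraction of parameters, fine-tuning a 7B-parameter model on a single GPU within 24 hours becomes practical.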
The top teams all demonstrated the potential of fine-tuning an open-source language model on custom data for improved performance. Importantly, this was done:
- With very limited compute resources: model training had to run in under 24 hours on a single GPU
- In full transparency: the teams' GitHub repositories can be found here
The highest-scoring teams in the competition produced models scoring about 68-70% on the MMLU performance benchmark. By comparison, in December 2023, Google reported that its Gemini Ultra model scored 90.0% on the same test, albeit with a much larger model than those used by the teams in the competition.
Even though the fine-tuned open-source models produced by the competition fall short of the performance of models developed by industry leaders, it’s worth noting they were fine-tuned for a broad range of tasks. Fine-tuning open-source models becomes particularly interesting when there is a highly specific task in mind. For instance, LoRA Land showcases a collection of 25 Mistral-7b models, fine-tuned at a cost of about USD 8.00, that outperform GPT-4 by 4-15% on certain tasks. Mistral-7b is also orders of magnitude smaller than GPT-4, which makes it far cheaper to run. This highlights a significant opportunity for organizations to deploy cost-effective, specialized language models for specific tasks without compromising on the quality of the model outputs.
Let's now explore what we mean when we talk about ‘model performance’ and expose another fundamental challenge: why evaluating the effectiveness and performance of language models is inherently complex.
Challenges in evaluation
Why LLMs make model evaluation harder than ever
Evaluation of LLMs involves assessing their performance and capabilities across various tasks and benchmarks. It provides a measure of progress and highlights areas where models excel or need improvement.
‘Traditional’ machine learning evaluation is quite straightforward: if we develop a model to predict lung cancer from X-ray images, we can test its accuracy by using a collection of X-rays that doctors have already diagnosed as either having cancer (YES) or not (NO). By comparing the model's predictions with the doctor-diagnosed cases, we can assess how well it matches the expert classifications.
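In code, that entire evaluation boils down to a single accuracy number; the labels below are made up purely to show how little machinery is involved.

```python
# Toy accuracy calculation: the fraction of model predictions that match
# the expert (doctor-provided) labels.
expert_labels = ["YES", "NO", "NO", "YES", "NO"]   # doctor diagnoses
model_preds   = ["YES", "NO", "YES", "YES", "NO"]  # model predictions

accuracy = sum(p == y for p, y in zip(model_preds, expert_labels)) / len(expert_labels)
print(f"accuracy = {accuracy:.0%}")  # -> accuracy = 80%
```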
In contrast, LLMs can complete an almost endless number of tasks: summarization, autocompletion, reasoning, generating recommendations for movies and recipes, writing essays, telling stories, generating good code, and so on. Evaluation of performance therefore becomes much, much harder.
LLM evaluation frameworks offer a set of tools to assess an LLM's capabilities across various tasks. Benchmarks within a framework are pre-defined datasets and tasks that act as standardized tests. In theory, an evaluation framework should enable objective comparison between the performance of different LLMs. In reality, there are several evaluation frameworks to choose from, and each implements a given benchmark slightly differently. Because LLMs are sensitive to exactly how they are prompted, tiny implementation differences, such as extra whitespace in the question presented to the model, can lead to different answers on many tasks. Consequently, frameworks may disagree on which LLM performs better, undermining community consensus and trust in evaluation results.
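To see why such tiny differences matter, the illustrative snippet below (the tokenizer is just an example choice) shows that two prompts differing only by a trailing space are turned into different token sequences, so the model is effectively being asked a slightly different question by each framework.

```python
# Two "identical" benchmark prompts that differ only in whitespace tokenize
# differently, which can be enough to change a model's answer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

prompt_a = "Question: What is 2 + 2?\nAnswer:"
prompt_b = "Question: What is 2 + 2?\nAnswer: "  # trailing space added

print(tokenizer(prompt_a)["input_ids"])
print(tokenizer(prompt_b)["input_ids"])  # a different sequence of token ids
```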
How LLMs are evaluated in practice
One of the key goals for this competition was to shed light on how people perform evaluation on LLMs in practice.
For this competition, evaluation was conducted by running a subset of HELM tasks. HELM is a framework developed by Stanford University for the public evaluation of large language models. To tackle the difficult task of LLM evaluation, HELM includes the ability to run evaluation tasks on 10 scenarios, ranging from multiple-choice question answering (MMLU) to grade-school math (GSM8K).
What does one of these scenarios look like in practice? If we take MMLU as an example, the LLM is given a question, for example:
“When you drop a ball from rest, it accelerates downward at 9.8 m/s². If you instead throw it downward, assuming no air resistance, its acceleration immediately after leaving your hand is: (A) 9.8 m/s²; (B) more than 9.8 m/s²; (C) less than 9.8 m/s²; (D) Cannot say unless the speed of the throw is given.”
The model generates text in response to this question, which is then compared to the correct answer, and the model receives one point if its answer is correct. This is repeated across hundreds of questions spanning categories such as the humanities, social science, and STEM. The scores are then averaged across these categories into a final overall score, representing the percentage of correct answers.
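In code, this scoring procedure amounts to exact matching and averaging; the results below are invented purely to illustrate the bookkeeping.

```python
# Sketch of MMLU-style scoring: one point per exact match, averaged per
# category, then averaged across categories (illustrative data only).
from collections import defaultdict

results = [  # (category, model answer, correct answer)
    ("STEM", "A", "A"),
    ("STEM", "C", "B"),
    ("humanities", "D", "D"),
    ("social science", "B", "B"),
]

per_category = defaultdict(list)
for category, predicted, correct in results:
    per_category[category].append(1.0 if predicted == correct else 0.0)

category_scores = {cat: sum(s) / len(s) for cat, s in per_category.items()}
overall = sum(category_scores.values()) / len(category_scores)
print(category_scores, f"overall = {overall:.1%}")
```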
During the competition, when assessing participating teams’ submissions, we found reproducing model evaluations with HELM challenging due to its delicate configuration and dependency management. Issues ranged from evaluations interrupted by external dependencies to the need for frequent updates and modifications, all of which required significant troubleshooting. Alongside this, the participating teams had varying levels of experience, so submission artifacts varied widely in quality, creating further challenges for the evaluation infrastructure.
It's worth noting that while using evaluation frameworks like HELM can give us a good understanding of an LLM's abilities, scoring well on benchmarks doesn't guarantee that the model will perform just as well in real-world applications. The most reliable way to make sure an LLM meets your specific needs is to create test cases based on real-world scenarios and review them manually.
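One lightweight way to do this is sketched below: run your own real-world prompts through the model and export the outputs for a person to judge. The generate_answer function is a placeholder for whichever model or API you are actually using.

```python
# Collect prompts that mirror your real use case, run them through the
# model, and write the outputs to a CSV for manual review.
import csv

test_cases = [
    "Summarize this support ticket for a teammate: ...",
    "Draft a polite reply declining the meeting invite: ...",
]

def generate_answer(prompt: str) -> str:
    # Placeholder: swap in a call to your fine-tuned model or hosted API.
    return "<model output goes here>"

with open("manual_review.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["prompt", "model_output", "reviewer_verdict"])
    for prompt in test_cases:
        # The verdict column is left blank for a human reviewer to fill in.
        writer.writerow([prompt, generate_answer(prompt), ""])
```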
Bridging the gap and democratizing access to AI
The NeurIPS competition offered valuable insights into the complex world of large language models. It helped demystify the process of fine-tuning an open-source model and showed how this can be done with limited computational power. It also showcased the breadth and complexity of an evolving evaluation landscape, alongside the need for reproducible infrastructure that makes running evaluations easier.
At Mozilla.ai, we believe that making open-source model fine-tuning and evaluation as accessible as possible is an important step to helping people own and trust the technology they use. Currently, this still requires expertise and the ability to navigate a rapidly evolving ecosystem of techniques, frameworks, and infrastructure requirements. We want to make this less overwhelming for organizations and developers, which is why we’re building tools that:
- Help them find the right open-source LLM for their needs
- Make it simple to fine-tune an open-source LLM on their own data
- Make language model evaluation and validation more accessible
We think there will be a significant opportunity for many organizations to use these tools to develop their own small, specialized language models that address a range of meaningful use cases in a cost-effective way. We strive to empower developers and organizations to engage with and trust the open-source ecosystem, and to minimize their dependence on closed-source models over which they have no ownership or control.