Open Source in the Age of LLMs

Like our parent company, Mozilla.ai has a founding story rooted in open-source principles and community collaboration. Since our start last year, our key focus has been exploring state-of-the-art methods for evaluating and fine-tuning large language models (LLMs), including continued open-source LLM evaluation experiments and establishing our GPU cluster’s infrastructure.

This initial work paved the way for lm-buddy, our reference framework for evaluation and fine-tuning. Building on top of it, in early 2024 we began developing our product: a platform designed to empower developers and organizations to independently develop, deploy, and evaluate large language models within an open-source ecosystem.

Throughout this process, we’ve been diving into the open-source ecosystem around LLMs. What we’ve found is an electric environment where everyone is building. As Nathan Lambert writes in his post, “It’s 2024 and they just want to learn.”

“While everything is on track across multiple communities, that also unlocks the ability for people to tap into excitement and energy that they’ve never experienced in their career (and maybe lives).”

The energy in the space, with new model releases every day, is made even more exciting by the promise of open source: as I’ve observed before, anyone can make a meaningful contribution regardless of credentials, and there are plenty of contributions to be made. If the fundamental question of the web is, “Why wasn’t I consulted?”, open source in machine learning today offers the answer, “You are, as long as you can productively contribute PRs; come have a seat at the table.”

Within MzAI, we’ve been actively contributing to the ecosystem across several projects focused on open-source large language model development, evaluation, and serving. Our recent contributions have included:

Even though some of us have been active in open-source work for some time, building and contributing to it at a team and company level is a qualitatively different and rewarding feeling. And it’s been especially fun watching our upstream contributions make their way into both the communities and our own projects.

At a high level, here’s what we’ve learned about the process of successful open-source contributions: 

  1. Start small when you’re starting with a new project. If you’re contributing to a project for the first time, it takes time to understand its norms: how fast reviews happen, who the key people are, preferences for communication, code review style, build systems, and more. It’s like starting a new job entirely from scratch. Be gentle with both yourself and the reviewers, and pick something small, like a documentation task or an issue labeled “good first issue,” just to get a feel for how things work.

  2. Be easy to work with. There are specific norms around working in open source, and they closely follow this fantastic post on being an effective developer: “As a developer you have two jobs: to write code, and be easy to work with.” In open source, being easy to work with means different things to different people, but I generally see it as:
    1. Submitting clean PRs with working code that passes tests, or gets as close as possible. No one wants to fix your build.
    2. Making small code changes on your own, but proposing larger architectural changes to the group before writing the code, asking, “What do you think about this?” Always try to propose a solution rather than just posing more problems to maintainers: they are busy!
    3. Writing unit tests if you’re adding a significant feature, where significant is anything more than a single line of code.
    4. Remembering Chesterton’s fence: that code is there for a reason; study it before you suggest removing it.

  3. Assume good intent, but make intent explicit. When you’re working with people in writing, asynchronously, potentially across countries and timezones, it’s extremely easy for context, tone, and intent to get lost in translation, and implicit knowledge piles up. Assume people are doing the best they can with what they have, and if you don’t understand something, ask about it first.

  4. The AI ecosystem moves quickly. Extremely quickly. New models come out every day and are implemented in downstream modules by tomorrow. Make sure you’re okay with this speed and can match the pace. Before opening PRs, follow the repo and its issues to get a sense of how quickly things move and get approved. If you’re into fast-moving projects, jump in. Otherwise, pick one that moves at a slower cadence.

  5. The LLM ecosystem is currently bifurcated between HuggingFace and OpenAI compatibility. An interesting pattern has emerged in my open-source LLM work: in this new space of developer tooling around transformer-style language models at industrial scale, you are generally building downstream of one of two interfaces:
    1. Models that are trained and hosted using HuggingFace libraries, with the HuggingFace Hub as infrastructure. In practice, this means depending on PyTorch’s programming paradigm, which HuggingFace tools wrap (although they now also provide interop with TensorFlow and JAX).
    2. Models that are available via API endpoints, particularly as hosted by OpenAI. Given that OpenAI was a first mover in the product LLM space, it currently has the API advantage, and many tools have developed OpenAI-compatible endpoints, which doesn’t always mean using OpenAI itself, but rather conforming to the same patterns that the Chat Completions API (v1/chat/completions) offers.

      For example, OpenAI-style chat-completions interop allowed us to stand up our own vLLM OpenAI-compatible server that works against models we started with on HuggingFace and fine-tuned locally.

      If you want to be successful in this space today, you as a library or service provider have to be able to interface with both; a minimal sketch of both paths appears after this list.

  6. Sunshine is the best disinfectant. As the recent xz issue showed, open code is better code, and issues get found and fixed more quickly. So don’t be afraid to work out in the open. All code has bugs, even yours and mine, and discovering those bugs is a natural part of learning and developing better code rather than a personal failing.
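
To make the bifurcation concrete, here’s a minimal sketch of what targeting both interfaces can look like. The model id, prompt, and local server URL are illustrative placeholders; it assumes the transformers and openai (v1+) Python packages are installed and that a vLLM OpenAI-compatible server is already running locally.

```python
# Path 1: HuggingFace -- weights pulled from the Hub, PyTorch under the hood.
from transformers import pipeline

MODEL = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder model id

generator = pipeline("text-generation", model=MODEL)
print(generator("The best part of open source is", max_new_tokens=40)[0]["generated_text"])

# Path 2: an OpenAI-compatible endpoint -- e.g. a local vLLM server started with:
#   python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.2
from openai import OpenAI

# api_key is unused by a local server but required by the client.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")
response = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "What's the best part of open source?"}],
)
print(response.choices[0].message.content)
```

The second half is the point of the pattern: the same client code works against OpenAI’s hosted API, a vLLM server, or any other OpenAI-compatible endpoint just by changing base_url.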

We’re looking forward to continuing our contributions, upstreaming them, and learning from them as we move forward with our product development work.
