Let’s build an app for evaluating LLMs

With a very special thank-you to everyone also working on building Lumigator: Juliana Araujo, Mario Cariñana, Davide Eynard, Alejandro Gonzalez, Hasan Gözlügöl, David Manzano Macho, Dimitris Poulopoulos, George Syrianos, Javier Torres, Irina Vidal, and Peter Wilson, and community contributors.

Lumigator 🐊 is a self-hosted, open-source Python application for evaluating large language models using offline metrics. It targets common machine learning use-cases, starting with summarization, and is extensible at the task and job level. 

We’ve written before about why we picked the initial use-case of summarization, the specific metrics, and the models we’re working with, and we’ve introduced Lumigator at a high level. Let's now take a deeper dive into how we’ve been designing and building the application.

You can find all the code here, the contribution guidelines here, and the documentation here.

How Lumigator Works

Lumigator is a FastAPI web app coordinating a set of services for working with a curated set of large language models and user data. The app’s core functionality is to create, schedule, and execute a variety of stateful, long-running, asynchronous jobs for serving and evaluating large language models, and to keep track of the metadata and datasets related to those jobs.

It also includes a UI for interacting with the app’s API, as well as an SDK for writing and managing jobs programmatically. We manage the infra for Lumigator within the same repo using Dockerfiles and Helm charts, and the documentation is available here. Across the application, job state and metadata are currently managed in an SQLite database, and the data itself is stored in S3-compatible object storage: S3 (or an S3-compatible store like Ceph) in hosted deployments, and a containerized LocalStack instance in local deployments.
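
To make the shape of this concrete, here is a minimal, hypothetical sketch of the FastAPI-plus-Pydantic pattern described above; the routes, schemas, and field names are illustrative and are not Lumigator’s actual API.

```python
# Hypothetical sketch of a FastAPI app that accepts and tracks long-running
# evaluation jobs; route paths, schemas, and field names are illustrative only.
from enum import Enum
from uuid import UUID, uuid4

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class JobStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


class EvalJobRequest(BaseModel):
    name: str
    model: str        # e.g. a model id from a model hub
    dataset_id: UUID  # a dataset previously uploaded to object storage


class EvalJobResponse(BaseModel):
    id: UUID
    status: JobStatus


# In-memory stand-in for the metadata store (no error handling for brevity).
JOBS: dict[UUID, EvalJobResponse] = {}


@app.post("/jobs/evaluate", response_model=EvalJobResponse)
def create_eval_job(request: EvalJobRequest) -> EvalJobResponse:
    """Validate the request, record the job, and hand it off to the job runner."""
    job = EvalJobResponse(id=uuid4(), status=JobStatus.PENDING)
    JOBS[job.id] = job
    # A real implementation would submit the work to the job runner here.
    return job


@app.get("/jobs/{job_id}", response_model=EvalJobResponse)
def get_eval_job(job_id: UUID) -> EvalJobResponse:
    """Return the stored status and metadata for a previously submitted job."""
    return JOBS[job_id]
```

In the application itself, the in-memory dictionary above corresponds to the SQLite-backed metadata store, and job execution is handed off to Ray, as described later in this post.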

Lumigator Architecture Diagram

Lumigator can be deployed on any machine or cluster with access to GPUs. While GPUs are not a requirement, they are strongly recommended for working with large language model artifacts, because model training and inference require many parallelized matrix calculations, which GPU architectures excel at. Lumigator can be run on a single machine using Docker Compose, and we also provide a Helm chart for deploying into a team’s Kubernetes infrastructure. The deployment needs access to LLM APIs (currently OpenAI and Mistral) if you use them for evaluation, and the ability to use models downloaded from LLM model repositories, primarily HuggingFace’s Model Hub.

While all of this is reasonable at a high level, the devil is in the details. Even in building an application based on a pattern that’s already been implemented many times, there are a lot of details to get right, particularly when working with fast-moving open-source libraries common in the world of LLMs.

Starting with User Needs and Translating to Code

When we initially started building LLM-based applications at Mozilla AI, we began with use-cases involving fine-tuning and evaluating models. That work resulted in the release of our first library, lm-buddy (blog post here), which uses YAML-based configuration files as inputs to CLI commands for Ray jobs and tracks input/output artifacts.

As our use-cases moved from fine-tuning to focus on evaluation, based on internal needs and what we were seeing develop in the open-source model community, we realized we needed an easier way to work with machine learning model inference and evaluation.

In either case, our main need was to load open-source model artifacts directly onto GPUs and perform batch inference on them, as well as to give folks the ability to compare against the current set of models.

Following Machine Learning Development Best Practices

LLM development has gone from 0 to 100 since 2022, but LLM applications don’t arise fully-formed from the ether without context. When we develop with LLMs, we’re still following the best-practice patterns for production-grade machine learning applications that we’ve been following for the past 20+ years, which include:

  • Developing solid end-to-end data pipelines.
  • Starting with reasonable objectives, aka concrete machine learning tasks we’d like to accomplish.
  • Starting with reasonable, smaller baseline models.
  • Making it easy to test the application locally.
  • Implementing metrics for both model and platform evaluation.
  • Getting the infrastructure correct.

In our case, Lumigator is slightly different because we’re not building a direct end-consumer product, but a tool for folks looking to find the right model to build those products. But best practices still apply, and we’re looking to start from a solid engineering and machine learning foundation.

We are focusing on evaluating large language models as an end-goal for Lumigator. Within evaluation, there are two different approaches, each of which seeks to answer the ultimate question: “Is the model we trained good enough to generalize beyond its training data for the machine learning task we want it to perform?”

  • Offline evaluation: We train a model and, using a holdout test set from our training data, we evaluate it on metrics like precision, accuracy, recall, NDCG, and RMSE from “classical” machine learning, and BLEU and ROUGE for large language models, as relevant for our machine learning domain. The essential question we’re trying to answer is: “Given the ground truth in your test data, how well does the trained model make predictions?” The closer the match, the better it is. (A short code sketch of this case follows after this list.)
  • Online evaluation: We ship the model to production as a customer-facing application or feature (i.e. ChatGPT or Claude or Gemini), and have people use it and give implicit or explicit feedback on whether the results were good. This could involve looking at user activity or text input/output in the application where we deploy our model.

    The essential question we’re trying to answer is: “Given the live model, how good do people think it is?” The more people use it or the more relevant the results, the better the model actually is. Because offline and online scores don’t always match, it’s important to assess whether offline metrics are good proxies for online performance.
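
As a concrete illustration of the offline case above, here is a minimal sketch of scoring generated summaries with ROUGE using Hugging Face’s evaluate library; the predictions and references are placeholder strings rather than Lumigator output, and the metric also requires the rouge_score package to be installed.

```python
# Minimal offline-evaluation sketch for summarization using Hugging Face's
# `evaluate` library; predictions and references below are placeholders.
import evaluate

predictions = [
    "The report says revenue grew 10% last quarter.",
    "Scientists discovered a new species of frog in Peru.",
]
references = [
    "Quarterly revenue increased by 10%, according to the report.",
    "A previously unknown frog species was found in Peru by scientists.",
]

rouge = evaluate.load("rouge")
scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # dict with rouge1, rouge2, rougeL, rougeLsum scores
```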

In our current set of tasks, we’re focused on offline evaluation against academic benchmarks, although this won’t always be the case: we’re working on giving users the ability to write their own custom jobs (e.g. live inference, multi-turn evaluation and inference, and more).

Building for LLM Interfaces

Within the world of LLM inference, we usually end up downstream of one of several interfaces:

  1. Models that are trained with PyTorch and hosted on HuggingFace as safetensors artifacts. Practically, this means a dependence on PyTorch’s APIs, which HuggingFace tools wrap (although they now also provide interop with TensorFlow and JAX).
  2. Model artifacts generally derived from safetensors models, compressed into a format optimized for local inference, typically GGUF, and consumed via applications like llama.cpp or llamafile.
  3. Models that are available via API endpoints, particularly as hosted by OpenAI. Given that OpenAI was a first mover in the product LLM space, they currently have the API advantage, and many tools have developed OpenAI-compatible endpoints, which don’t necessarily mean using OpenAI, but conform to the same set of patterns that the Chat Completions API (v1/chat/completions) offers (see the sketch after this list).
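
To make point 3 concrete, here is a minimal sketch of the OpenAI-compatible pattern: the official openai Python client pointed at a different base URL, so the same code can talk to OpenAI, Mistral, or a self-hosted server that implements v1/chat/completions. The base URL, API key, and model name below are placeholders.

```python
# Sketch of the OpenAI-compatible endpoint pattern; base_url, api_key,
# and model name are placeholders for whatever server the client targets.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # any server exposing v1/chat/completions
    api_key="not-needed-for-local-servers",
)

response = client.chat.completions.create(
    model="my-summarization-model",
    messages=[
        {"role": "system", "content": "Summarize the user's text in one sentence."},
        {"role": "user", "content": "Lumigator is a self-hosted tool for evaluating LLMs..."},
    ],
)
print(response.choices[0].message.content)
```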

Within the infrastructure of the application, we want local development to be an option, which in practice means Docker Compose; within cloud and hybrid environments, we use Helm charts to launch applications as logical groups of Kubernetes pods.

Designing a Local, Fast, and Opinionated Stack

This is a good starting set of requirements, but we need some more in order to optimize our development profile.

For Lumigator specifically, our top-level considerations include: 

  • Using Python. We’re working with the universe of LLMs and the LLM ecosystem runs on Python. Specifically, we're using FastAPI because that web stack, combined with Pydantic type validation, has become the lingua franca of web-based LLM applications. FastAPI also bundles OpenAPI specs for API and SDK access, which has made interactive testing very fast for us.
  • Designing for long-running jobs that produce stateful artifacts. We use Ray for running evaluation jobs. There are many other frameworks that do job orchestration, but Ray and its component libraries (Serve, Data, Train, and more) are specifically designed for LLM inference, fine-tuning, and data storage access. We considered other frameworks like Spark but decided against them, for several reasons:
    • Spark is well-suited for data parallelism, where an operation is applied repeatedly to data. This makes it extremely well-suited for large batch ETL flows. However, working with LLMs often means having to perform multiple independent tasks concurrently (deep learning model training that involves parallel matrix calculations, fine-tuning and inference), and Ray's primitives are better at this.
    • Spark has Python libraries, but ultimately its compute, packaging, and performance optimization paradigms are inherited from its JVM, whereas Ray was developed with Python in mind from the ground up, making it easier for interop with our other Python libraries and the ML ecosystem at large.
    • Ray works much better out of the box with GPU-intensive tasks, whereas Spark was developed during an era of large-scale CPU computing.
    • Ray jobs are easier to ship in free-standing cluster mode, involving as little as writing a single task file plus Python requirements (see the job-submission sketch after this list). In thinking about how our system could be extended and evolved, this paradigm made it the easiest for folks to bring their own code.
    • Spark started in 2009, Ray in 2017. Still, Ray already has nearly as many stars on GitHub as Spark (34k vs 40k), indicating broad adoption and plenty of resolved issues to search through and code examples to draw from.
  • Designing for our users. We want to make it easy for individuals or small teams who need to get going quickly. This means prioritizing local development, both for the developer team and for end-users. Here, we use Docker Compose for quick local development loops.
  • Access to models across the LLM universe. HuggingFace model access as a first-class citizen, with support for Seq2Seq and causal models, as well as OpenAI compatibility for other endpoints.
  • Building in the open. Any update to Lumigator’s code is committed and discussed publicly, so building clean interfaces that make it easy to understand where a change has impact is important. We also include documentation out of the box and keep it in sync with development.
  • Open source by default. Our license is open-source, and we bias toward open-source over proprietary solutions for our product, as well as contributing back to the ecosystem.
  • Opinionated constraints. The LLM ecosystem is full of libraries, models, APIs, and choices. We’d like to make most of the decisions for the user so they can minimize decision fatigue and focus on what they want to do: pick a model for their specific business/research use-case.
  • Using common modern Python idioms and patterns. Type annotations, uv for building and packaging, and pytest testing patterns by default for Lumigator all contribute to codebase legibility. 
  • Keeping our dev footprint small. We use Docker Compose for fast local development and testing loops, ruff for linting, and uv for fast dependency resolution, as well as smaller seq2seq models like BART and a tiny random Llama for integration testing: not only do they work extremely well for our use-case of summarization, but, in the case of the latter, they’re fast to spin up for inference, even on local machines with CPUs. We use SQLite as a baseline database for metadata storage, with the option to move to Postgres or other databases since we abstract out storage providers.
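
To illustrate the “single task file plus Python requirements” point from the Ray discussion above, here is a rough sketch of submitting a free-standing job to a Ray cluster with Ray’s job-submission client; the cluster address, entrypoint script, and dependency list are placeholders rather than Lumigator’s real job definitions.

```python
# Sketch of shipping a self-contained job to a Ray cluster; the address,
# entrypoint script, and dependency list are placeholders, not real Lumigator jobs.
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")  # Ray jobs API / dashboard address

job_id = client.submit_job(
    # A single task file containing the evaluation logic...
    entrypoint="python evaluate_summaries.py --dataset s3://my-bucket/dataset.csv",
    runtime_env={
        "working_dir": "./jobs/evaluator",    # shipped to the cluster with the job
        "pip": ["transformers", "evaluate"],  # ...plus its Python requirements
    },
)
print(f"Submitted job: {job_id}")
print(client.get_job_status(job_id))
```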

We’re still at the start of building out Lumigator’s capabilities, so keeping these guiding principles in mind while building new components will allow us to move quickly, along with the fast-paced LLM landscape and user expectations around what folks are looking to build with these models.
