Running an open-source LLM in 2025

The landscape of Large Language Models (LLMs) has evolved dramatically since ChatGPT burst onto the scene in late 2022. The release of Meta's Llama model in 2023 soon followed and was one of the most important developments of that year. Meta released the model weights, allowing anyone to download and experiment with the model (after agreeing to the terms of use). Llama proved that open-weight models could compete with closed-source models. There's still an active debate about what exactly "open-source" AI means, and about how open-weight models like Llama and DeepSeek aren't completely open-source. At Mozilla.ai, we're laser-focused on improving trust in open-source AI by supporting the use of open models in appropriate situations and supporting their proper evaluation. The field is progressing quickly, and it's important to support community-driven innovation. We've written in the past about open-source in the age of LLMs, and we believe that it's critical for our future.

On top of transparency, open-source LLMs offer a compelling advantage: control. Organizations continue to pivot towards fully-open datasets for use in LLM training (Mozilla 🤝 EleutherAI). Additionally, having access to the model weights can give deeper insight into what's happening under the hood. This enables organizations that control the model to perform interpretability experiments like Anthropic has done with its proprietary Claude models, and supports efforts like Mozilla's to increase model transparency. While the closed vs. open-source debate remains evergreen, the real added complexity of using an open-weight model often lies in the plethora of deployment options available.

With closed-source models like GPT-4 or Claude, accessing them is relatively straightforward because you only have a limited number of options: you primarily go through the provider's own API (OpenAI or Anthropic), or through a major cloud platform that resells the same models, such as Azure OpenAI Service or Amazon Bedrock.

But what happens when you choose to use an open-source model? The options explode! Let's use the recent earth-shattering DeepSeek-R1 model as an example – a powerful open-weight model from DeepSeek. They shocked the AI world by releasing a state-of-the-art reasoning model at a fraction of the price charged by other big AI research labs. With the model weights available via Hugging Face, there are three paths for using the model: fully managed deployment, partially managed deployment, or local deployment.

1. The Cloud-Native Route: Managed APIs 🌩️

This approach most closely mirrors the closed-source experience. Cloud providers handle the heavy lifting, such as infrastructure, scaling, and maintenance, and you pay for what you use. For DeepSeek-R1, the options include DeepSeek's own hosted API as well as managed endpoints from a number of cloud and inference providers.

This approach is perfect for getting started quickly, but be cautious of the costs as you scale – you're likely paying a premium for convenience.
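
To make that concrete, most of these providers expose an OpenAI-compatible endpoint, so the client code looks almost identical to what you'd write for a closed-source model. Here's a minimal sketch; the base URL and model name are placeholders, so swap in the values documented by whichever provider you pick.

```python
# Minimal sketch: calling a hosted DeepSeek-R1 endpoint through an
# OpenAI-compatible API. The base_url and model name are placeholders;
# substitute the values documented by your chosen provider.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-provider.com/v1",  # hypothetical provider endpoint
    api_key="YOUR_PROVIDER_API_KEY",
)

response = client.chat.completions.create(
    model="deepseek-r1",  # provider-specific model identifier
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(response.choices[0].message.content)
```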

2. The Power User Path: Custom Infrastructure 💪

Basically, this option means that some company (maybe yours) has already purchased the hardware needed to run the model, and they will let you build on their infrastructure for a price. This can either be an on-premises datacenter or training cluster that your company manages, or hardware that you pay to use from cloud providers.

Generally, the hardware is one or more NVIDIA GPUs designed for LLM workloads, but it can also be one of a variety of custom chips built specifically for LLM inference. Normally, you either run a virtual machine and configure all of the internals yourself, or you pay to use a managed service where you write the application code to load the model and run inference, while the provider manages the scaling and hosting complexities for you. Common tools for serving models on these virtual machines or managed services are open-source libraries like vLLM or Hugging Face's Text Generation Inference.
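
As a rough sketch of the do-it-yourself flavor, here's what offline inference with vLLM can look like. This assumes vLLM is installed and the machine has enough GPU memory for the model you pick; the distilled 1.5B variant is used here to keep the footprint small, while the full R1 would need a multi-GPU server.

```python
# Minimal vLLM sketch: load a (distilled) DeepSeek-R1 model and generate text.
# Assumes `pip install vllm` and a GPU with enough memory for the chosen model.
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
params = SamplingParams(temperature=0.6, max_tokens=512)

outputs = llm.generate(["Explain what a distilled model is in two sentences."], params)
print(outputs[0].outputs[0].text)
```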

This path offers the scaling and production capabilities of the previously mentioned managed API solution, but with greater control over deployment. While it requires more technical expertise than fully managed APIs, it can be more cost-effective at scale. The break-even point varies based on your use case, model choice, and provider pricing.

3. The Local Hero: Running on Your Own PC 💻

Local deployment shines for hobbyist developers and security-conscious teams. Running everything locally on consumer-grade hardware (CPU, consumer NVIDIA GPU, or Apple Silicon) offers great data privacy—your data never leaves your machine.

Applications like llamafile, llama.cpp or ollama allow you to use any open model with almost no knowledge of the technical details! In a certain sense, you get the best of both worlds: it doesn't cost you anything because you're using your own hardware, and you get an easy-to-use API. However, there are important trade-offs to consider.
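
For example, here's a minimal sketch using ollama's Python client. It assumes the ollama app is running locally and that you've already pulled a model, such as the small distilled DeepSeek-R1 variant discussed below.

```python
# Minimal sketch: chatting with a locally served model through ollama's Python client.
# Assumes the ollama app is running and the model has been pulled beforehand
# (for example with `ollama pull deepseek-r1:1.5b`).
import ollama

response = ollama.chat(
    model="deepseek-r1:1.5b",
    messages=[{"role": "user", "content": "Summarize what quantization does to a model."}],
)
print(response["message"]["content"])
```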

Your computer is probably designed for watching Netflix, not for loading the DeepSeek-R1 LLM. Consumer hardware, while capable, isn't optimized for LLM inference. Even with decent hardware like a gaming GPU or Apple Silicon, you'll face constraints like higher latency and shorter supported generation lengths.

On a more basic level: the DeepSeek-R1 model has 671 billion parameters! To load that on a single computer, you would need over 500GB of CPU or GPU RAM, which consumer computers typically don't have available. The most powerful MacBook Pro you can buy in 2025 tops out at 128GB of unified memory.
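
Here's the rough arithmetic behind that number. This is a weights-only estimate; real memory use is higher once activations and the KV cache are included, and the exact figure depends on the precision the weights are stored in.

```python
# Back-of-the-envelope: memory needed just to hold the weights of the full
# 671B-parameter DeepSeek-R1, ignoring activations and the KV cache.
params = 671e9

print(f"16-bit weights: ~{params * 2 / 1e9:.0f} GB")  # ~1342 GB
print(f" 8-bit weights: ~{params * 1 / 1e9:.0f} GB")  # ~671 GB
```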

In order to load DeepSeek-R1 on your system, you will need to make two sacrifices. First, you can't load the full model, because it's too big to fit. Thankfully, DeepSeek has provided several distilled models, which are basically smaller models that were fine-tuned using reasoning data generated by the large DeepSeek-R1. For example, they provide a distilled Qwen-1.5B, which has only 1.5 billion parameters (a 3.5 GB download).

Second, models can be made even smaller by being quantized. In high-level terms, you can think of quantization like the resolution of a YouTube video. The model weights provided by DeepSeek are like the 4K version of the model, and in order to make it fit on a laptop, the resolution has to be scaled down to 1080p or 360p. Although it's still the same model architecture with the same number of parameters, each parameter is stored at a "lower resolution," so the quantized model won't behave exactly like the original. There are lots of optimizations used to try to minimize the impact on quality, but it does have an impact. There's a nice Hugging Face blog post that explains quantization in more detail. The DeepSeek models available in ollama (whether the original DeepSeek architecture or the Llama and Qwen distilled ones) are all 4-bit quantized versions.
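
The same back-of-the-envelope arithmetic shows why a distilled, quantized model fits comfortably on a laptop (again, a weights-only estimate).

```python
# Weights-only size of the distilled Qwen-1.5B variant at different precisions.
params = 1.5e9

print(f"16-bit (original) : ~{params * 2.0 / 1e9:.1f} GB")  # ~3.0 GB
print(f" 4-bit (quantized): ~{params * 0.5 / 1e9:.1f} GB")  # ~0.8 GB
```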

Conclusion

The flexibility of open-source models is both their strength and their weakness. When choosing your deployment path, consider:

  • Scale requirements
  • Performance needs
  • Privacy concerns
  • Technical expertise
  • Budget constraints

Remember that the best solution may change as you develop and scale whatever you're building! The key to success is staying flexible enough to adapt your deployment to whatever best suits your budget and performance requirements.