Deploying DeepSeek V3 on Kubernetes

In our previous story, we explored the evolving landscape of Large Language Models (LLMs) and how releases like Meta's Llama reshaped the field, offering organizations greater transparency and control over their models. We examined the significance of open-weight models like those in the DeepSeek family and the various deployment options available. Now, we take this a step further by demonstrating how to deploy one of the most powerful open-weight models to date, DeepSeek V3, on a Kubernetes cluster using vLLM.
DeepSeek V3, unlike its sibling DeepSeek R1, did not receive much attention when it was released almost a month ago (Dec 26, 2024). However, V3 is not a model to be ignored or taken lightly, as the team at DeepSeek had to come up with a few clever tricks to overcome the scarcity of AI hardware in China and reduce training costs. And the best part is that they did everything in the open, describing every innovation in their detailed publications.
DeepSeek V3 is a new Mixture-of-Experts (MoE) large language model from the China-based company of the same name. It consists of 671 billion total parameters, with 37 billion activated for each token. It was pre-trained on 14.8 trillion tokens, followed by Supervised Fine-Tuning (SFT) and Reinforcement Learning stages, to produce an instruction-tuned model, just like one of the models you'd use in ChatGPT. However, DeepSeek V3 is an open-weight model, which means that, if you have the resources, you can download it from the Hugging Face Hub and deploy it on your own hardware.
That is exactly what we at Mozilla.ai did, and this is the story of how you can deploy DeepSeek V3 on a Kubernetes cluster and the resources and tools you need.
How much are we talking about?
Before we can discuss what you actually need to serve DeepSeek V3, we have to do some math. The model has 671B parameters. If each parameter were stored in FP32 precision, it would need 4 bytes per weight. Do the multiplication (671,000,000,000 parameters * 4 bytes/parameter) and you get roughly 2.7 trillion bytes, or around 2.5 TB (taking 1 TB as 2^40 bytes).
Thankfully, not all of the model's weights are stored in FP32: the team at DeepSeek designed an FP8 mixed-precision training framework, which brings the published checkpoint down to roughly 700 GB. That's good news, but not all GPUs support FP8, including the A100s we at Mozilla.ai have available in our cluster. If you're in that situation too, you have to use a script DeepSeek provides to cast the FP8 weights to BF16, and since open source moves fast, there is already a BF16 variant of the model on the Hugging Face Hub. The drawback? Doubling the precision doubles the size, so you are back in the realm of 1.3-1.4 TB of weights.
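To make the arithmetic explicit, here is a quick back-of-the-envelope script you can run from any shell. It only counts the raw weights; real checkpoints are slightly larger because of scaling factors and a few tensors kept in higher precision.
python3 - <<'EOF'
params = 671e9  # total parameters in DeepSeek V3
for name, bytes_per_param in [("FP32", 4), ("BF16", 2), ("FP8", 1)]:
    size_bytes = params * bytes_per_param
    # report decimal GB, plus "TB" taken as 2^40 bytes, matching the figure above
    print(f"{name}: ~{size_bytes / 1e9:.0f} GB ({size_bytes / 2**40:.2f} TB)")
EOF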
So, to deploy DeepSeek V3, you most probably have to split the model across multiple nodes. In our case, that meant four nodes, each providing eight A100s (80 GB), for roughly 2.5 TB of aggregate GPU memory: enough for the BF16 weights plus activations and the KV cache. To achieve this, you need Ray to manage the multi-node cluster and vLLM, which integrates natively with Ray, as the inference engine. vLLM can manage its distributed runtime with either Ray or Python's native multiprocessing, but multiprocessing only covers single-node deployments; multi-node inference currently requires Ray.
The process
Now that you have the resources pinned down, you have to get the model's weights. There are two ways to get your hands on them: i) have vLLM pull the model at runtime, or ii) clone it locally beforehand. If you choose the second path and clone the model into a Persistent Volume Claim (PVC), you get more control over the download process and can iterate faster moving forward. To this end, you need to:
- Create a PVC, asking for at least 3 TB (see the manifest sketch after these steps).
- Create a Pod and mount the PVC.
- Exec into the Pod and clone the model into the PVC using Git LFS:
git lfs install
git clone https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16
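For reference, here is a rough sketch of what the PVC and the downloader Pod from the first two steps could look like. Treat it as a starting point: the names, the storage class, and the container image are placeholders you would adapt to your cluster, and the PVC needs an access mode (such as ReadWriteMany) that lets Pods on all four nodes mount it later.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: deepseek-v3-weights
spec:
  accessModes:
    - ReadWriteMany          # Pods on every node of the Ray cluster will mount this later
  storageClassName: <your-storage-class>
  resources:
    requests:
      storage: 3Ti           # at least 3 TB, per the sizing above
---
apiVersion: v1
kind: Pod
metadata:
  name: model-downloader
spec:
  containers:
    - name: downloader
      image: python:3.11     # placeholder: any image where you can install git and git-lfs works
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: models
          mountPath: /mnt/models
  volumes:
    - name: models
      persistentVolumeClaim:
        claimName: deepseek-v3-weights
Apply the manifests with kubectl apply -f, then kubectl exec -it model-downloader -- bash, cd into the mount path (/mnt/models here), and run the Git LFS commands above.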
With the model cloned locally into a PVC, the next step is to create a Ray cluster on Kubernetes with the following characteristics:
- Mount the PVC with the model on every node of the cluster.
- Give each node 8 GPUs.
- Choose a custom image for each node: one that derives from the latest upstream Ray image with GPU support and adds a layer that installs vLLM on top. Building it can be as simple as the following Dockerfile:
FROM rayproject/ray:latest-py311-gpu
RUN pip install vllm
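To make this image available to the cluster, build it and push it to a registry your nodes can pull from; the tag below is just a placeholder:
docker build -t <your-registry>/ray-vllm:latest .
docker push <your-registry>/ray-vllm:latest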
Next, create the cluster using KubeRay, a toolkit for running Ray applications on Kubernetes. A rough sketch of what the RayCluster resource could look like is shown below.
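Assuming the KubeRay operator is already installed in your cluster, the manifest might look roughly like this. It is a sketch, not a drop-in file: the metadata names, the PVC name, and the image tag reuse the placeholders from earlier, and you will likely need node selectors, tolerations, and resource requests that match your environment.
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: deepseek-v3
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
    template:
      spec:
        containers:
          - name: ray-head
            image: <your-registry>/ray-vllm:latest   # the custom Ray + vLLM image built above
            resources:
              limits:
                nvidia.com/gpu: 8                    # 8 GPUs on the head node as well
            volumeMounts:
              - name: models
                mountPath: /mnt/models               # the PVC holding the cloned weights
        volumes:
          - name: models
            persistentVolumeClaim:
              claimName: deepseek-v3-weights
  workerGroupSpecs:
    - groupName: gpu-workers
      replicas: 3                                    # head + 3 workers = 4 nodes in total
      minReplicas: 3
      maxReplicas: 3
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: <your-registry>/ray-vllm:latest
              resources:
                limits:
                  nvidia.com/gpu: 8
              volumeMounts:
                - name: models
                  mountPath: /mnt/models
          volumes:
            - name: models
              persistentVolumeClaim:
                claimName: deepseek-v3-weights
Apply it with kubectl apply -f and wait for the head and worker Pods to become ready. When the cluster is up, exec into the head node Pod and run the following command: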
vllm serve "/mnt/models/DeepSeek-V3-bf16" \
--trust-remote-code \
--tensor-parallel-size 8 \
--pipeline-parallel-size 4 \
--dtype bfloat16 \
--gpu-memory-utilization=0.8
The most important parameters here are tensor-parallel-size and pipeline-parallel-size. These two parameters decide the distributed inference strategy, and since the model is too big for a single node with 8 GPU devices, you have to move to a multi-node, multi-GPU deployment. In a nutshell, tensor-parallel-size is the number of GPUs you want to use in each node, and pipeline-parallel-size is the number of nodes you want to use. So, in our case, we use 4 nodes with 8 GPUs each.
On top of that, the gpu-memory-utilization parameter controls how much of each GPU's total memory (VRAM) vLLM is allowed to allocate for the model's weights, activations, and key-value (KV) cache during inference. Setting it below 1.0 leaves a safety margin so that system processes, the CUDA context, or other applications can find their place in GPU memory without causing out-of-memory issues for your inference process.
From this point, you need to port-forward the Pod that is running the head node, for example:
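kubectl port-forward pod/<ray-head-pod-name> 8000:8000
Here the Pod name is a placeholder for your actual head Pod, and 8000 is the port vLLM listens on by default. With the port forwarded, invoking the model is as simple as running the following curl command: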
curl "http://localhost:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "/mnt/models/DeepSeek-V3-bf16",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is the meaning of life?"
}
]
}'
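If you just want to sanity-check that the server is up before sending a chat request, vLLM's OpenAI-compatible API also exposes a models endpoint you can hit through the same port-forward:
curl "http://localhost:8000/v1/models"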
Conclusion
Deploying big models like DeepSeek V3 on a Kubernetes cluster is a challenging yet rewarding process, one that demonstrates the capabilities of modern distributed inference engines like vLLM and systems like Ray.
It also showcases the power of open-weight models like DeepSeek V3 and the possibilities these models bring to the table. DeepSeek V3 might not have received the attention of its counterparts, but its open approach and innovative optimizations make it a valuable addition to the AI landscape. If you have the infrastructure, deploying it yourself lets you maintain full control over your AI stack.