Benchmarking DeepSeek R1 Models using Lumigator: A Practical Evaluation for Zero-Shot Clinical Summarization

New state-of-the-art models emerge every few weeks, making it hard to keep up, especially when testing and integrating them. In reality, many available models may already meet our needs. The key question isn’t “Which model is the best?” but rather, “What’s the smallest model that gets the job done?”


Interested in listening to this as a presentation instead? See here.

Interested in seeing a walkthrough of exactly how we did this with Lumigator? See here.


It seems like every few weeks a new model is released that achieves state-of-the-art performance. Just think about the past three months: we’ve seen the release of DeepSeek V3, DeepSeek R1, Grok 3, Claude 3.7 Sonnet, GPT-4.5, OpenAI o3-mini, Gemini 2.0 Flash Thinking, Qwen QwQ, and AI21 Jamba 1.6, just to name a few!

Keeping up with them can be tricky, especially when it comes to testing and integrating them into a product you’re already developing. Sometimes the number of model options is so overwhelming that it’s easy to forget to ask an important question: do we even need a state-of-the-art model for our use case? It’s great that the newest model A can ace the math olympiad with a higher score than model B, but do you really care that much if your use case is generating a five-sentence story to read to your toddler at bedtime? Maybe not so much.

The reality is that many models available for us to use at this moment may be good enough for what we need from them. The question becomes not “Which model is the best?”, but rather, “What's the smallest model that will get the job done?”. As we discussed in a post last week, in instances such as this, evaluation is critical: knowing how much the performance is affected by model size allows you to make an informed decision about which model is best for you at the price-point you want.

DeepSeek R1

One really interesting manifestation of this question appears in all of the chatter around the new DeepSeek R1 model. The primary DeepSeek R1 model is a post-trained version of the DeepSeek V3 model, specifically trained to generate thinking steps (via <think> tags) before giving answers. While DeepSeek R1 is massive at 671B parameters, the magic is actually in the post-training technique: the methodology used to train DeepSeek R1 also allows for the creation of several distilled models, which follow the reasoning approach of DeepSeek R1 but are much smaller. The DeepSeek team provides distilled models based on the Llama-3 8B and 70B checkpoints, as well as the Qwen-2.5 1.5B, 7B, 14B, and 32B checkpoints. These models start from their respective released Llama/Qwen model weights and are post-trained using supervised fine-tuning on 800k curated samples generated by DeepSeek R1. This means that although they may have different knowledge and capabilities based on how they were pretrained, they should follow the reasoning strategy of DeepSeek R1.

The existence of all of these different model sizes allows us to investigate the question, "Do we actually need the full 671B parameter model for our application?"

The DeepSeek team has published several benchmarks which are helpful but limited: they focus heavily on math and coding tasks (AIME, MATH, GPQA Diamond, LiveCodeBench, Codeforces). But what about more nuanced, domain-specific tasks? What if you want to know how these models stack up on the task that you actually care about?

Enter Lumigator 🐊

Running and comparing all these models is tricky: you have to create a dataset, generate a bunch of outputs from all the models, figure out how to evaluate all the outputs, and then try to put everything together to understand which one is the right one for you. 

We’re building Lumigator to help make this process as easy as possible. The thing that we can’t do for you is create your dataset (though we have some tips to help guide you through the process!), but, once you have the dataset, Lumigator helps manage the dataset, generate all of the outputs from the models, evaluate the outputs on a number of useful metrics, and provide the information you need to make an informed decision.

In this post we’ll discuss the experiment and results, but if you want to jump directly into the code that made it happen, you can head over to the Jupyter notebook that was used to execute this experiment.

Summarization of Doctor-Patient Conversations

With all of the different DeepSeek R1 models available, we were interested in seeing how the performance compares on a specific use case: doctor-patient conversation summarization. We investigate the task of generating Assessment & Plan (A&P) sections of a clinical note using the ACI-Bench dataset. The dataset contains pairs of conversation transcripts and clinical notes that are intended to be realistic but don’t contain any real patient data (so there isn’t any risk of exposing any private patient information). For our analysis we will use the test split containing 40 conversation-note pairs. This task is particularly challenging because it requires:

  • Understanding complex medical conversations.
  • Identifying key medical conditions and reasoning.
  • Producing concise, structured documentation that follows clinical standards.
  • Maintaining factual accuracy (hallucinations can be dangerous in the world of healthcare).
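
If you want to prepare this dataset for an evaluation yourself, a minimal sketch might look like the following. The file name and the original column names (dialogue, note) are assumptions about how the ACI-Bench files are laid out; adjust them to whatever you actually download, and rename the reference note column to ground_truth to match the terminology used later in this post.

```python
import pandas as pd

# Load the ACI-Bench test split (downloaded separately from the ACI-Bench repository).
# NOTE: the file name and the "dialogue"/"note" column names are assumptions --
# adjust them to match the files you actually have.
test_df = pd.read_csv("aci_bench_test.csv")

# Keep just the conversation transcript and the reference clinical note, renaming
# the columns to the names we refer to later in the post.
dataset = test_df.rename(columns={"dialogue": "examples", "note": "ground_truth"})
dataset = dataset[["examples", "ground_truth"]]

print(f"{len(dataset)} conversation-note pairs")  # we expect 40 for the test split
dataset.to_csv("aci_bench_test_prepared.csv", index=False)
```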

Here's a snippet of what a conversation may look like, taken from the training split of the ACI-Bench dataset:

[doctor] hi , martha . how are you ?
[patient] i'm doing okay . how are you ?
...
[doctor] okay , all right . and then for your last problem your hypertension , you know , again i , i , i think it's out of control . but we'll see , i think , you know , i'd like to see you take the lisinopril as directed....
[patient] okay
...

And here’s a snippet of what the output might look like:

ASSESSMENT AND PLAN

Martha Collins is a 50-year-old female with a past medical history significant for congestive heart failure, depression, and hypertension who presents for her annual exam.
...
Hypertension
...Medical Treatment: We will increase her lisinopril to 40 mg daily as noted above.

Patient Education and Counseling: I encouraged the patient to take her lisinopril as directed. I advised her to monitor her blood pressures at home for the next week and report them to me.

Patient Agreements: The patient understands and agrees with the recommended medical treatment plan.

The System Prompt

In order for the models to generate a clinical note in the general format that we want, we’ll craft a system prompt: this prompt is provided to each model alongside each transcript. The model will “look” at the transcript, and then follow the system prompt to generate the output we want.

You are an expert medical scribe who is tasked with reading the transcript of a conversation between a doctor and a patient, and generating a concise Assessment & Plan (A&P) summary. Please follow the best standards and practices for modern scribe documentation. The A&P should be problem oriented, with the assessment being a short narrative and the plan being a list with nested bullets. When appropriate, please include information about medical treatment, patient consent, patient education and counseling, and medical reasoning.

This prompt does several things:

  1. Establishes domain expertise ("expert medical scribe").
  2. Clearly defines the task (generating A&P summaries).
  3. Specifies the expected format (problem-oriented with narrative assessment and bulleted plan).
  4. Highlights important components to include (medical reasoning, treatment, education, consent).

Although there may be some further prompt engineering required for specific use cases, this should get us reasonable outputs, so long as the model is capable of following our instructions.
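
To make this concrete, here is a minimal sketch of how the system prompt and a transcript are paired into a standard chat-style message list, the format accepted by every deployment option described below. The SYSTEM_PROMPT constant simply holds the full prompt shown above.

```python
# The full system prompt shown above, truncated here for brevity.
SYSTEM_PROMPT = (
    "You are an expert medical scribe who is tasked with reading the transcript "
    "of a conversation between a doctor and a patient, and generating a concise "
    "Assessment & Plan (A&P) summary. ..."
)

def build_messages(transcript: str) -> list[dict]:
    """Pair the fixed system prompt with one doctor-patient transcript."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": transcript},
    ]
```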

The Models

In a previous blog we explained how to deploy DeepSeek models on Kubernetes. Using this technique allows for easy and flexible inference of our models with Lumigator. For our experiment, we’ll compare several different models, corresponding to the several different deployment options that we described in our previous post about deploying open LLMs in 2025.

In order to make these results reproducible, we generate outputs using a temperature of 0, which corresponds to greedy decoding.

Managed API: DeepSeek API

This model is hosted by the DeepSeek team at https://platform.deepseek.com/. Remember that since this model is not self-hosted, it’s important to be cautious about what data is sent to it.

  • DeepSeek R1 - DeepSeek Reasoner accessed via DeepSeek API

Self-Hosted: vLLM

Following our Kubernetes deployment process, we deployed several models to our internally hosted Kubernetes cluster. 

  • DeepSeek-R1-Distill-Llama-8B
  • DeepSeek-R1-Distill-Llama-70B
  • DeepSeek-R1-Distill-Qwen-1.5B
  • DeepSeek-R1-Distill-Qwen-7B
  • DeepSeek-R1-Distill-Qwen-14B
  • DeepSeek-R1-Distill-Qwen-32B

Local: Llamafile

Llamafile lets you distribute and run LLMs with a single file.

The naming can be a bit confusing: the “Q…” suffix (e.g. Q4_K_M) describes which quantization was used. You can reference this document for a more detailed explanation, but in short, these are all different ways to shrink the models using quantization. For the more recent k-quant formats and their S/M/L naming, you can also read about llama.cpp’s GGML quantization types.

  • DeepSeek-R1-Distill-Llama-8B-Q5_K_M.llamafile : 5-bit quantization
  • DeepSeek-R1-Distill-Llama-8B-Q4_K_M.llamafile: 4-bit quantization
  • DeepSeek-R1-Distill-Llama-8B-Q3_K_M.llamafile: 3-bit quantization
  • DeepSeek-R1-Distill-Llama-8B-Q2_K_L.llamafile: 2-bit quantization
  • DeepSeek-R1-Distill-Llama-8B-Q2_K.llamafile: 2-bit quantization
  • DeepSeek-R1-Distill-Qwen-14B-Q2_K.llamafile: 2-bit quantization
  • DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.llamafile: 4-bit quantization
  • DeepSeek-R1-Distill-Qwen-7B-Q2_K.llamafile: 2-bit quantization
  • DeepSeek-R1-Distill-Qwen-1.5B-Q2_K.llamafile: 2-bit quantization

All Llamafile models can be downloaded via Hugging Face.
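
Conveniently, all three deployment options (the managed DeepSeek API, self-hosted vLLM, and a local Llamafile) expose an OpenAI-compatible chat endpoint, so the same client code can target any of them by swapping the base URL and model name. The URLs and model names below are placeholders for our setup rather than canonical values, and temperature is set to 0 as discussed above.

```python
from openai import OpenAI

# Base URLs and model names are placeholders -- point them at your own deployments.
DEPLOYMENTS = {
    "deepseek-api": ("https://api.deepseek.com/v1", "deepseek-reasoner"),
    "vllm": ("http://my-vllm-host:8000/v1", "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"),
    "llamafile": ("http://localhost:8080/v1", "DeepSeek-R1-Distill-Llama-8B-Q4_K_M"),
}

def summarize(transcript: str, deployment: str) -> str:
    base_url, model = DEPLOYMENTS[deployment]
    client = OpenAI(base_url=base_url, api_key="YOUR_API_KEY")  # self-hosted servers accept any key
    response = client.chat.completions.create(
        model=model,
        messages=build_messages(transcript),  # from the earlier sketch
        temperature=0,  # greedy decoding, for reproducibility
    )
    return response.choices[0].message.content
```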

Evaluation Metrics

Now that we have the data all sorted and the models selected and deployed, it’s time to decide how we’ll evaluate the model outputs. We’ll focus on a zero-shot scenario, which means that we will completely rely upon the system prompt to instruct the models about what should be present in the generated notes.

Lumigator supports a variety of metrics to measure model performance: in this experiment, we’ll focus on ROUGE and G-Eval (implemented with GPT-4 in the DeepEval library), which we discuss in our previous post.

Since G-Eval was originally implemented with GPT-4, we continue to use that model for the G-Eval implementation. However, it’s likely that an alternate open-weight model (such as Mistral or DeepSeek) would also be suitable.
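
As a rough illustration of what these metrics look like outside of Lumigator, here is a sketch that scores a single generated note with ROUGE (via the Hugging Face evaluate library) and a G-Eval-style consistency metric via DeepEval. The criteria string is our own paraphrase rather than the exact rubric used in this experiment, and you should check the DeepEval documentation for the current API details; running the G-Eval part also requires an OpenAI API key.

```python
import evaluate
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

transcript = "[doctor] hi , martha . how are you ? ..."  # source conversation
prediction = "ASSESSMENT AND PLAN ..."                    # model output, thinking tokens stripped
reference = "ASSESSMENT AND PLAN ..."                     # ground_truth note from ACI-Bench

# ROUGE: n-gram overlap between the generated note and the reference note.
rouge = evaluate.load("rouge")
print(rouge.compute(predictions=[prediction], references=[reference]))

# G-Eval: an LLM judge (GPT-4 here) scores the output against a natural-language rubric.
consistency = GEval(
    name="Consistency",
    criteria="Is the generated A&P factually consistent with the source conversation?",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model="gpt-4",
)
test_case = LLMTestCase(input=transcript, actual_output=prediction)
consistency.measure(test_case)
print(consistency.score)
```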

Results

Drum roll please… 🥁 The results are in!

Given that this test dataset is only 40 examples, the results may be noisy. However, from the table, we can still see a few interesting things:

  • The Qwen-32B, Llama-70B, DeepSeek R1, and Qwen-14B models had nearly identical G-Eval scores, differing by around 1% or less! 
  • The Qwen-14B parameter model jumps out here: it has the highest ROUGE scores, competitive G-Eval scores, and also one of the shortest average output lengths.
  • The DeepSeek R1 model surprisingly had a noticeably lower ROUGE (~8% lower ROUGE-1), even though it saw no drop in the G-Eval scores. The difference in pre-training data between DeepSeek R1 (based on DeepSeek V3) and the Qwen and Llama model families may account for the difference in approach to the problem. From a manual inspection of the outputs (see the Jupyter notebook), the language of R1 appears to be extremely clinical, which may explain the vocabulary mismatch with the available ground_truth notes.
  • Qwen-1.5B is the lowest in terms of all metrics, and is probably not going to be a good choice since it’s lower by over 30% in G-Eval metrics.
  • Some llamafile models may be up to the task!

Generation Lengths

Lumigator provides us with information about how many tokens (roughly, the number of words) are in the ground_truth/reference summaries from the ACI-Bench dataset, how many tokens each model generated as part of its reasoning (the tokens inside the <think> and </think> tags), and how many tokens were generated for the answer itself. The average input length of the doctor-patient conversations is about 1300 tokens (see Table 3 of the ACI-Bench paper). The reference summaries have an average length of 241 tokens, while all of our models generated longer summaries on average. To bring the generated summaries closer in length to the references, we might do some prompt engineering on the system prompt to be more instructive about summary length: the current prompt asks the model to make the summary “concise” but gives no further guidance about length.

We can also see that each model dedicated around 450 tokens to reasoning before giving its answer! The number of tokens used here is driven by the DeepSeek R1 training methodology, but some further prompt engineering might reduce the number of thinking tokens, which would have a positive impact on speed and cost (fewer tokens generated = less cost). It could also affect answer quality, though, unless the model is just talking itself in circles during the reasoning process.
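
If you want to inspect this split yourself, here is a small sketch that separates the reasoning block from the final answer and reports rough lengths. It uses whitespace word counts as a crude stand-in for tokenizer token counts, so the numbers will differ slightly from what Lumigator reports.

```python
import re

def split_reasoning(output: str) -> tuple[str, str]:
    """Separate the <think>...</think> block from the final answer."""
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", output, flags=re.DOTALL).strip()
    return reasoning, answer

model_output = "<think>The patient has uncontrolled hypertension ...</think>\nASSESSMENT AND PLAN\n..."
reasoning, answer = split_reasoning(model_output)

# Whitespace word counts are only a rough proxy for tokenizer token counts.
print(f"reasoning: ~{len(reasoning.split())} words, answer: ~{len(answer.split())} words")
```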

ROUGE and G-Eval Scores

As we talked about in our previous posts, ROUGE and G-Eval paired together give us a more complete picture of which model is best. We see this illustrated by the comparison between DeepSeek-R1-Distill-Llama-8B and DeepSeek-R1-Distill-Qwen-7B. The Llama-8B model has on average ~50 more tokens in its summaries and a ROUGE that differs by less than 1% from Qwen-7B, but it is about 8% higher in fluency, relevance, coherence, and consistency. This dividing line in the results helps indicate where the inflection point for model performance might be. The Llama-8B model scores about 5% lower than the state of the art but may still be useful, while the Qwen-7B model is about 12% lower than the state of the art, so it may require a more in-depth inspection of the outputs if using a model that small is of interest.

Limitations

This evaluation has several important caveats, some of which help explain why it’s so important to evaluate a model using your own data instead of trusting a publicly shared benchmark!

  1. Unknown training data exposure: The ACI-Bench dataset is publicly available, and we have no way to verify that the models weren’t already trained on it. The DeepSeek-R1-Distill-Llama and DeepSeek-R1-Distill-Qwen models were post-trained starting from their respective released model checkpoints, which means it’s possible that, for example, the Llama model family saw the ACI-Bench dataset during pretraining while the Qwen model family did not. This is another reason why it’s important to run evaluations using data that you’re sure was not seen by any of the models during training.
  2. Zero-shot evaluation only: Performance might improve significantly with few-shot examples.
  3. Simulated data: While ACI-Bench uses realistic simulated conversations, real-world clinical conversations tend to be messier and less structured.
  4. Single task focus: Results may not generalize to other medical tasks or different specialties.

Running Your Own Evaluation

Want to replicate or extend this evaluation? Get started with Lumigator here.

Conclusion: The Right Tool for the Job

As with most things in AI, the answer to "Which model is best?" is always "It depends." This evaluation demonstrates how critical it is to systematically test models on tasks that match your specific use case.

Our results provide a transparent comparison to guide teams optimizing for cost, latency, or accuracy. If you're fine-tuning or deploying models, tools like Lumigator help ensure you’re making data-driven choices before committing to a model.

You can reproduce all of these results with Lumigator here! Have your own dataset that you want to try out? Get started here. We’d love to hear about your results and any improvements you’d like to suggest!


Disclaimer: This evaluation focuses on relative performance between models in the DeepSeek R1 family. Results should be interpreted with appropriate caution given the limitations described in this post.