Evaluating Local LLMs on Translation Use Case with Lumigator
LLMs have transformed a wide array of NLP applications, but concerns about privacy, data security, and control hinder LLM adoption for both individuals and enterprises. In this blog post, we focus on local LLMs, which offer a compelling alternative to cloud-based solutions.

LLMs have revolutionized natural language processing, transforming a wide array of applications from intelligent chatbots to advanced text summarization. However, concerns about privacy, data security, and control have emerged as significant barriers to LLM adoption for both individuals and enterprises. We follow up on our previous blog post on the different solutions for deploying open-source LLMs; here, the discussion focuses on local LLMs, which offer a compelling alternative to cloud-based solutions. In this blog post, by "local" we mean LLM deployments that run entirely on a user's own computer, without relying on cloud services or external computational resources.
We explore the advantages of using local LLMs and the open-source tooling around them, and show how to evaluate different local LLMs with the help of Lumigator on a translation use case. What we present here targets the Local Hero user persona: someone who wants to find the best small model that their hardware can run.
Advantages of Local LLMs
Local LLMs offer several key benefits that make them an attractive option for various applications:
- Enhanced Privacy and Security: Local LLMs process data entirely on-device, ensuring that sensitive information never leaves your control.
- Computational Efficiency: As smaller models (including distilled or quantized versions of larger models) continue to improve, running AI models on consumer-grade hardware becomes increasingly viable. Users can strategically choose models that align with their hardware capabilities, carefully weighing computational constraints against performance requirements.
- Offline Functionality: Users can leverage AI capabilities even without an internet connection, ensuring uninterrupted access to AI-powered tools.
- Potential Cost Optimization: Depending on your needs and scale, local LLMs may offer cost savings by eliminating the need for cloud computing resources and API call charges. This makes advanced AI capabilities more accessible to individual users with limited budgets.
Local LLM Open Source Tools
The landscape of open-source tools for running local LLMs has expanded rapidly, offering developers and organizations various options to deploy and utilize these models efficiently. While numerous tools are available, we’ll focus on three prominent ones: Llamafile, Ollama, and vLLM.
- Llamafile bundles LLM weights and a specially-compiled version of llama.cpp into a single executable file, allowing users to run large language models locally without any additional setup or dependencies. Users can create an executable Llamafile for any model they develop or fine-tune.
- Ollama, which is also based on llama.cpp, provides a simplified way to download, manage, and interact with various open-source LLMs, either from the command line or through a web UI.
- vLLM is a high-performance library for LLM inference and serving, featuring optimized memory management techniques. While this tool is primarily used for cloud deployments on GPUs, it also offers options to deploy models locally. Moreover, with vLLM you can host any model available on the Hugging Face Model Hub.
Notably, all three of these tools – Llamafile, Ollama, and vLLM – are compatible with the OpenAI Completions API client. This compatibility allows developers to easily switch between these local solutions and cloud-based services, or even implement hybrid solutions. We took advantage of this when building Lumigator, allowing our users to evaluate both commercial and local LLMs on the same platform. If you are interested in learning more about how to use Lumigator in combination with these open-source LLM hosting tools, check out our docs page for detailed instructions.
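To make this concrete, here is a minimal sketch (not Lumigator code) of what that compatibility looks like from Python. The base URL and model name below are assumptions based on the tools' default ports (Ollama typically serves an OpenAI-compatible endpoint at http://localhost:11434/v1, llamafile at http://localhost:8080/v1, and vLLM at http://localhost:8000/v1), so adjust them to your own setup.

from openai import OpenAI

# Point the standard OpenAI client at a locally hosted, OpenAI-compatible server.
# Assumed default endpoints: Ollama http://localhost:11434/v1,
# llamafile http://localhost:8080/v1, vLLM http://localhost:8000/v1.
client = OpenAI(
    base_url="http://localhost:11434/v1",  # assumed: a local Ollama server
    api_key="not-needed",                  # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="gemma3:12b",  # any model you have pulled or are serving locally
    messages=[{"role": "user", "content": "Say hello in German."}],
)
print(response.choices[0].message.content)

Switching between local backends, or to a commercial API, then amounts to changing the base URL, API key, and model name, which is what allows Lumigator to evaluate commercial and local LLMs on the same platform.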
Evaluation Experiment with Lumigator
Lumigator supports running translation evaluation experiments with LLMs hosted locally, on-premise, or in the cloud, as well as models from the Hugging Face Model Hub. Here, we would like to compare, with the help of Lumigator, two recently released models served via Ollama:
- deepseek-r1:14b: A 4-bit quantized and distilled version of Qwen2.5-14B enhanced with reasoning capabilities from DeepSeek-R1.
- gemma3:12b: A 4-bit quantized version of Google's multimodal model with 128K context and support for 140+ languages.
Both of them have a similar number of parameters and can be run locally via Ollama (note that you could alternatively use models served via Llamafile or vLLM for your experiments).
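As a quick illustration (assuming a local Ollama server is running and the official ollama Python package is installed; you could equally run ollama pull from the command line), the two models can be downloaded and sanity-checked like this:

import ollama  # official Ollama Python client; assumes a local Ollama server is running

models = ["deepseek-r1:14b", "gemma3:12b"]

# Download the two models we want to compare (large downloads on first run).
for model in models:
    ollama.pull(model)

# Quick sanity check: ask each model for a one-line translation.
for model in models:
    reply = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": "Translate into German: Good morning!"}],
    )
    print(model, "->", reply["message"]["content"])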
Specifically, we use the WMT24++ dataset, which was published in February 2025 – this means it has been on the internet for only a short while and is unlikely to be in the training data of these LLMs. For demonstration purposes, we use the English-to-German subset of the dataset and the following system prompt for both LLMs:
system_prompt = f"""
You are an expert in {source_language} and {target_language}.
Please provide a high-quality translation of the following text from {source_language} to {target_language}.
Only generate the translated text. No additional text or explanation needed.
"""
For evaluation, we specify the metrics BLEU, METEOR, and COMET, which Lumigator computes for us (a sketch of computing these metrics yourself is included below):
- BLEU measures the n-gram overlap between the generated text and the reference translation.
- METEOR takes this a step further by incorporating synonymy, stemming, and word order into its evaluation; however, it is still fundamentally based on word overlap.
- COMET is a neural metric that leverages a cross-lingual pre-trained model to predict human judgments of translation quality – while there are different COMET models that one can use for evaluation, we used eamt22-cometinho-da since it provides a good trade-off between accuracy and speed (80% smaller and 2x faster than the original COMET model).
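If you would like to reproduce the metric computation yourself, here is a rough sketch using the Hugging Face evaluate library for BLEU and METEOR and the Unbabel comet package for COMET. The eamt22-cometinho-da checkpoint matches the model mentioned above, but the exact download call and the shape of the prediction output may differ between comet versions, so treat this as a sketch rather than the exact API.

import evaluate  # Hugging Face evaluate library
from comet import download_model, load_from_checkpoint  # Unbabel COMET package

sources = ["The weather is nice today."]
predictions = ["Das Wetter ist heute schön."]
references = ["Das Wetter ist heute schön."]

# BLEU and METEOR: n-gram / word-overlap based metrics.
bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
print(meteor.compute(predictions=predictions, references=references))

# COMET: neural metric that also takes the source sentence into account.
# Checkpoint name and predict() signature may vary across comet versions.
model_path = download_model("Unbabel/eamt22-cometinho-da")
comet_model = load_from_checkpoint(model_path)
data = [{"src": s, "mt": p, "ref": r} for s, p, r in zip(sources, predictions, references)]
comet_output = comet_model.predict(data, batch_size=8, gpus=0)
print(comet_output.system_score)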
Next, we move on to the results:
How should we interpret these numbers? The relative comparison of the scores shows that gemma3:12b seems to perform better than deepseek-r1:14b.
However, are the scores good enough for your own application or scenario? The scores themselves depend on several factors, such as the dataset, the domain, the language pairs, and the length of the input texts. The models evaluated in the WMT24++ paper demonstrated BLEU scores in the range of 27 to 32. As for COMET scores, the Opus-MT dashboard indicates a range of 48 to 76 for English-to-German translation, varying from one dataset to another. It is important to highlight that the local LLMs we explored here are 4-bit quantized models, which may partly explain the reduced performance. While the local LLMs presented here certainly have room for improvement, for certain use cases the data privacy and offline benefits may outweigh the performance trade-offs.
Here is an example row from the dataset, along with the predictions from the two models:
Conclusion
The growing ecosystem of tools like Llamafile, Ollama, and vLLM, and the community support around them, is making local LLMs increasingly accessible and practical for real-world applications. We believe that the rapid pace of innovation in this space will help narrow the gap between local models and cloud-based alternatives. Through Lumigator, we hope to make it easy to run benchmarking experiments and help you in the model selection process.
Please check out the Lumigator docs if you are interested in learning more about running translation evaluation experiments, and our guide on standing up local LLMs with providers including Llamafile, Ollama, and vLLM, if you would like to use them as inference models in your evaluation experiments with Lumigator.