Evaluating Local LLMs on Translation Use Case with Lumigator
LLMs have transformed a wide array of NLP applications, but concerns about privacy, data security, and control hinder LLM adoption for both individuals and enterprises. In this blog post, we focus on local LLMs, which offer a compelling alternative to cloud-based solutions.

LLMs have revolutionized natural language processing, transforming a wide array of applications from intelligent chatbots to advanced text summarization. However, concerns about privacy, data security, and control have emerged as significant barriers to LLM adoption for both individuals and enterprises. We follow up on our previous blog post on the different solutions for deploying open-source LLMs; here, the discussion focuses on local LLMs, which offer a compelling alternative to cloud-based solutions. In this blog post, by "local" we mean LLM deployments that run entirely on a user's own computer, without relying on cloud services or external computational resources.
We explore the advantages of using local LLMs and the open-source tooling around them, and show how to evaluate different local LLMs with the help of Lumigator on a translation use case. What we present here targets the Local Hero user persona: someone who wants to find the best small model that their hardware can run.
Advantages of Local LLMs
Local LLMs offer several key benefits that make them an attractive option for various applications:
- Enhanced Privacy and Security: Local LLMs process data entirely on-device, ensuring that sensitive information never leaves your control.
- Computational Efficiency: As smaller models (including distilled or quantized versions of larger models) continue to improve, running AI models on consumer-grade hardware becomes increasingly viable. Users can strategically choose models that align with their hardware capabilities, carefully weighing computational constraints against performance requirements.
- Offline Functionality: Users can leverage AI capabilities even without an internet connection, ensuring uninterrupted access to AI-powered tools.
- Potential Cost Optimization: Depending on your needs and scale, local LLMs may offer cost savings by eliminating the need for cloud computing resources and API call charges. This makes advanced AI capabilities more accessible to individual users with limited budgets.
Local LLM Open Source Tools
The landscape of open-source tools for running local LLMs has expanded rapidly, offering developers and organizations various options to deploy and utilize these models efficiently. While numerous tools are available, we’ll focus on three prominent ones: Llamafile, Ollama, and vLLM.
- Llamafile bundles LLM weights and a specially-compiled version of llama.cpp into a single executable file, allowing users to run large language models locally without any additional setup or dependencies. Users can create an executable Llamafile for any model they develop or fine-tune.
- Ollama, which is also based on llama.cpp, provides a simplified way to download, manage, and interact with various open-source LLMs, either from the command line or through a web UI.
- vLLM is a high-performance library for LLM inference and serving, featuring optimized memory management techniques. While this tool is primarily used for cloud deployments on GPUs, it also offers options to deploy models locally. Moreover, with vLLM you can host any model available on the Hugging Face Model Hub.
Notably, all three of these tools – Llamafile, Ollama, and vLLM – are compatible with the OpenAI Completions API client. This compatibility allows developers to easily switch between these local solutions and cloud-based services, or even implement hybrid solutions. We took advantage of this when building Lumigator, allowing our users to evaluate both commercial and local LLMs on the same platform. If you are interested in learning more about how to use Lumigator in combination with these open-source LLM hosting tools, check out our docs page for detailed instructions.
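To make this concrete, here is a minimal sketch (not Lumigator code) of what that compatibility looks like from Python. The base URL and model name below are assumptions based on the tools' default ports (Ollama typically serves an OpenAI-compatible endpoint at http://localhost:11434/v1, llamafile at http://localhost:8080/v1, and vLLM at http://localhost:8000/v1), so adjust them to your own setup.

from openai import OpenAI

# Point the standard OpenAI client at a locally hosted, OpenAI-compatible server.
# Assumed default endpoints: Ollama http://localhost:11434/v1,
# llamafile http://localhost:8080/v1, vLLM http://localhost:8000/v1.
client = OpenAI(
    base_url="http://localhost:11434/v1",  # assumed: a local Ollama server
    api_key="not-needed",                  # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="gemma3:12b",  # any model you have pulled or are serving locally
    messages=[{"role": "user", "content": "Say hello in German."}],
)
print(response.choices[0].message.content)

Switching between local backends, or to a commercial API, then amounts to changing the base URL, API key, and model name, which is what allows Lumigator to evaluate commercial and local LLMs on the same platform.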
Evaluation Experiment with Lumigator
Lumigator supports running translation evaluation experiments with LLMs hosted locally, on-premise, or in the cloud, as well as models from the Hugging Face Model Hub. Here, we would like to compare, with the help of Lumigator, two recently released models served via Ollama:
- deepseek-r1:14b: A 4-bit quantized and distilled version of Qwen2.5-14B enhanced with reasoning capabilities from DeepSeek-R1.
- gemma3:12b: A 4-bit quantized version of Google's multimodal model with 128K context and support for 140+ languages.
Both of them have a similar number of parameters and can be run locally via Ollama (note that you could alternatively use models served via Llamafile or vLLM for your experiments).
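As a quick illustration (assuming a local Ollama server is running and the official ollama Python package is installed; you could equally run ollama pull from the command line), the two models can be downloaded and sanity-checked like this:

import ollama  # official Ollama Python client; assumes a local Ollama server is running

models = ["deepseek-r1:14b", "gemma3:12b"]

# Download the two models we want to compare (large downloads on first run).
for model in models:
    ollama.pull(model)

# Quick sanity check: ask each model for a one-line translation.
for model in models:
    reply = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": "Translate into German: Good morning!"}],
    )
    print(model, "->", reply["message"]["content"])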
Specifically, we use the WMT24++ dataset, which was published in February 2025 – this means it has been on the internet for only a short while and is unlikely to be in the training data of these LLMs. For demonstration purposes, we use the English-to-German subset of the dataset and the following system prompt for both LLMs:
system_prompt = f"""
You are an expert in {source_language} and {target_language}.
Please provide a high-quality translation of the following text from {source_language} to {target_language}.
Only generate the translated text. No additional text or explanation needed.
"""
For evaluation, we specify the metrics BLEU, METEOR, and COMET, which Lumigator computes for us (a sketch of computing these metrics yourself is included below):
- BLEU measures the n-gram overlap between the generated text and the reference translation.
- METEOR takes this a step further by incorporating synonymy, stemming, and word order into its evaluation; however, it is still fundamentally based on word overlap.
- COMET is a neural metric that leverages a cross-lingual pre-trained model to predict human judgments of translation quality – while there are different COMET models that one can use for evaluation, we used eamt22-cometinho-da since it provides a good trade-off between accuracy and speed (80% smaller and 2x faster than the original COMET model).
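If you would like to reproduce the metric computation yourself, here is a rough sketch using the Hugging Face evaluate library for BLEU and METEOR and the Unbabel comet package for COMET. The eamt22-cometinho-da checkpoint matches the model mentioned above, but the exact download call and the shape of the prediction output may differ between comet versions, so treat this as a sketch rather than the exact API.

import evaluate  # Hugging Face evaluate library
from comet import download_model, load_from_checkpoint  # Unbabel COMET package

sources = ["The weather is nice today."]
predictions = ["Das Wetter ist heute schön."]
references = ["Das Wetter ist heute schön."]

# BLEU and METEOR: n-gram / word-overlap based metrics.
bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
print(meteor.compute(predictions=predictions, references=references))

# COMET: neural metric that also takes the source sentence into account.
# Checkpoint name and predict() signature may vary across comet versions.
model_path = download_model("Unbabel/eamt22-cometinho-da")
comet_model = load_from_checkpoint(model_path)
data = [{"src": s, "mt": p, "ref": r} for s, p, r in zip(sources, predictions, references)]
comet_output = comet_model.predict(data, batch_size=8, gpus=0)
print(comet_output.system_score)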
Next, we move on to the results:
How should we interpret these numbers? The relative comparison of the scores shows that gemma3:12b seems to perform better than deepseek-r1:14b.
However, are the scores good enough for your own application or scenario? The scores themselves depend on several factors, such as the dataset, the domain, the language pairs, and the length of the input texts. The models evaluated in the WMT24++ paper demonstrated BLEU scores in the range of 27 to 32. As for COMET scores, the Opus-MT dashboard indicates a range of 48 to 76 for English-to-German translation, varying from one dataset to another. It is important to highlight that the local LLMs we explored here are 4-bit quantized models, which may partly explain the reduced performance. While the local LLMs presented here certainly have room for improvement, for certain use cases the data privacy and offline benefits may outweigh the performance trade-offs.
Here is an example row from the dataset, along with the predictions from the two models:
Conclusion
The growing ecosystem of tools like Llamafile, Ollama, and vLLM, and the community support around them, is making local LLMs increasingly accessible and practical for real-world applications. We believe that the rapid pace of innovation in this space will help narrow the gap between local models and cloud-based alternatives. Through Lumigator, we hope to make it easy to run benchmarking experiments and help you in the model selection process.
Please check out the Lumigator docs if you are interested in learning more about running translation evaluation experiments, and our guide on standing up local LLMs with providers including Llamafile, Ollama, and vLLM, if you would like to use them as inference models in your evaluation experiments with Lumigator.