Technical Content

Uncovering GenAI Trends: Using Local Language Models to Explore 35 Organizations

Over the past few months at Mozilla.ai, we’ve engaged with a number of organizations to learn how they are using language models in practice.

Stefan French

Jun 3, 2024 — 6 min read

Photo by Rob Wingate / Unsplash

Over the past few months at Mozilla.ai, we’ve engaged with a number of organizations to learn how they are using language models in practice. We spoke with 35 organizations across various sectors, including finance, government, startups, and large enterprises. Our interviewees ranged from ML engineers to CTOs, capturing a diverse range of perspectives.

Objective: Help Shape Our Product Vision

Our primary aim was to identify patterns and trends that could inform our product development strategy. Despite the unique nature of each discussion, we usually focused on four critical areas:

LLM use cases being explored
Technology, methodologies, and approaches employed
Challenges in developing and delivering LLM solutions
Desired capabilities and features

Data Collection & Model Selection

After each conversation, we wrote up summary notes. In total, these notes for the 35 conversations amounted to 18,481 words (approximately 24,600 tokens), almost the length of a novella. To avoid confirmation bias and subjective interpretation, we decided to leverage language models for a more objective analysis of the data. By providing the models with the complete set of notes, we aimed to uncover patterns and trends without our pre-existing notions and biases.

Given privacy concerns, we decided to keep the information local. Therefore, I selected a set of models that I could run on my MacBook Pro M3 (36GB) locally. Here's an overview of the models and configurations used:

Model	Parameters	Quantization	Size
Llama-3-8B-Instruct-Gradient-1048k	8B	Q5_0	5.6GB
Phi-3-medium-128k-instruct	14B	IQ3_M	6.47GB
Qwen1.5-7B-Chat	7B	1_5	5.53GB

There are a number of options to run LLMs locally, such as ollama, lm-studio, and llamafile. I used both lm-studio and llamafile (an in-house solution by the Mozilla Innovation Team).

Summarizing with Local Language Models

The prompt used to generate model outputs was:

“Summarize the following information to get the key takeaways about developing LLM solutions in 10 bullet points. Take the full information from start to finish into account. Never use company names or an individual's name. [Full notes]”

Asking Llama-3-8b

Recently released, Llama-3 is one of the most capable open-source language models available. The 8-billion parameter version is small enough to easily run on my laptop. However, with a context window of 8k tokens, the model released by Meta won’t be able to handle the ~24.6k tokens of context I require.

Fortunately, Llama-3-8B-Instruct-Gradient-1048k is openly available, where Gradient.ai have extended Llama-3-8b's context length from 8k to 1040K tokens. This modification showcases the open-source community's ability to innovate, continually refining models to suit evolving needs.

Here is the model’s response:

Early stage startups prioritize accuracy, speed, and cost when it comes to LLMs.
Startups prefer staying in a single ecosystem, such as OpenAI, due to time costs associated with switching models.
There is a general consensus that certain frameworks are not a good option for building agents because of their complex architecture and difficulty debugging.
The main LLM use-cases are Q&A on documents, code documentation, fact checking within a document using other document sources, and summarization and extracting information.
Companies have different priorities when it comes to LLM solutions - some focus on maximizing accuracy in the easiest way possible, while others emphasize reusability and using small models for specific tasks.
Others prioritize cost control over purely seeking high-quality outputs when it comes to model management.
The main challenges faced by companies developing LLM solutions include aligning business needs with model outcomes, creating proper test cases, infrastructure and computational power for experimentation, prompt engineering overhead, privacy and security concerns, and the need for standardized metrics and a data warehouse.
Evaluation will be important for going from prototype to production, but it's difficult to know where to start and don't have the time to try lots of different techniques.
There is an opportunity for an "open-source to the core" company to do something meaningful in the LLM space, particularly in terms of interoperability between different actors.
Developing a platform that makes GenAI research easier and removes errors/issues encountered on a daily basis would be beneficial for those new to the LLM space, such as transitioned data scientists.

Asking Phi-3-Medium

Phi-3 is one of the most capable and cost-effective small language models. Here I am using Phi-3-medium-128k-instruct, which has a 128k context window. Here is the response:

Startups prioritize accuracy, speed, and cost when considering LLMs. Switching to open-source options is preferred for its ease of use.
OpenAI often becomes a go-to solution due to simplicity in integration with existing ecosystems; however, there's interest in local experimentation without reliance on APIs.
A preference exists for pretrained models that are fast and provide full access to weights and architecture. Customizable mini-models based on examples provided by users were also desired.
Evaluating the effectiveness of LLM solutions is seen as critical before moving from prototype to production, but evaluations can be challenging without appropriate tools.
Companies value reusability and quick development cycles when leveraging AI models within their product pipelines. The use of smaller, specialized models often aligns better with these goals.
Organizations find OpenAI's managed services restrictive due to data privacy concerns and are exploring alternatives like Sagemaker Jumpstart for easier deployment and scalability.
Finetuning is avoided by some as it seems complex, time-consuming, and expensive, though it could potentially improve performance significantly if made more accessible.
In the context of public sector projects, evaluation metrics are underutilized due to their perceived complexity and resource requirements; however, there's growing interest in making evaluations easier for better transitions to production.
Enterprises face overwhelm with model choices, workflow understanding, and ensuring deterministic outcomes from LLM-based systems.
The gaming industry explores AI for content creation, such as procedural generation of assets, to improve efficiency and reduce costs while supporting user creativity.

Asking Qwen1.5-7B-Chat

Qwen1.5-7B-Chat, part of the Qwen series developed by Alibaba Cloud, is a notable but slightly less well-known open-source language model. This family of models perform well on various benchmarks and are equipped with a 32k token context window. Here is the response:

Focus on accuracy: Early-stage startups value LLMs for their accuracy and strive to minimize errors in tasks like document classification, code generation, and customer support.
Speed vs. customization: The trade-off between speed and model customization is a consideration for startups, with some seeking easy-to-use models and others preferring the ability to adapt.
Cost-effectiveness: Budget constraints influence the choice of LLM tools, with open-source options favored for their lower costs.
Ecosystem dependence: Many start-ups prefer using a single ecosystem to simplify integration and reduce switching costs.
Time-consuming evaluation: Evaluating LLMs is crucial for production readiness, but it can be a time-consuming process for organizations.
Rapid AI model evolution: The constant introduction of new models adds complexity to decision-making for businesses.
Limited technical expertise: Companies often face challenges finding developers and data scientists with the specific skill sets needed for LLM projects.
Incentives for small models: Small, task-specific LLMs are in demand to balance cost and performance.
Interoperability: The need for easy model switching and integration among different tools remains a priority for developers.
Privacy concerns: Data privacy regulations, especially with sensitive data, influence the adoption of LLM solutions in some industries.

Key Takeaways

I was impressed by the quality of the responses from these models, which were all capable of running locally on my laptop. They identified the majority of trends and patterns among the 35 organizations we studied. Each model also highlighted unique insights and communicated in different styles:

Llama-3-8B-Instruct-Gradient-1048k emphasized the main LLM use-cases that were discussed and the difficulties moving from prototype to production. The style of the sentences generated can be quite long.
Phi-3-medium-128k-instruct picked up on the reluctance of many organizations to finetune models. Its style feels more conversational than the other models.
Qwen1.5-7B-Chat highlighted the lack of technical expertise many organizations suffer from. Its style is more concise and straightforward, similar to the style of ChatGPT.

Across all the models, 3 key takeaways stood out:

Evaluation: Many organizations highlight the challenges of evaluating LLMs, finding it time-consuming.
Privacy: Data privacy and security are major concerns influencing tool and platform choices.
Reusability & Customization: Organizations value reusability and seek customizable models for specific tasks.

This exercise showcased how well local language models can extract valuable insights from large text datasets. The discussion and feedback from our network and end-users will continue to guide our efforts at Mozilla.ai, helping us develop tools that support diverse use cases and make LLM solutions more accessible and effective for organizations of all sizes.