Standardized Reasoning Content: A first look at using OpenAI’s gpt-oss on multiple providers using any-llm
After OpenAI’s ChatGPT release, the OpenAI Completion API became the default standard for communicating with LLMs. However, the new “reasoning models” produce a critical piece of output that isn’t handled by that specification, leaving each provider to decide how to handle the new model capabilities.

In 2015, my daily driver was a red 1996 Chevrolet Blazer truck. It boasted a cassette player in the entertainment console and some awesome analog electronics that were easy to open up and repair – convenient given how often it broke down…
These days, I drive a 2019 Toyota Sienna minivan (#dadlife), which feels like a spaceship by comparison: touchscreen, Bluetooth, and a CD player (cutting-edge technology if we’re talking 1996 standards). Despite being manufactured 23 years apart, both cars share a common component in their entertainment consoles: the cigarette lighter.
Smoking is nowhere near as common in the United States as it was 20 years ago. How has the cylindrical slot of the cigarette lighter been a mainstay of car interiors for so long? The answer is probably obvious: it’s not just used for cigarettes, but it’s also become the standard for how to access 12V DC power from your car for all the electronics that you want to charge while travelling. What was designed to efficiently light a Marlboro has turned into the most convenient way to charge my iPhone.
The point is, once a standard gets heavy adoption, it’s difficult to change or adapt, even when the purpose of its use changes.
In 2022, OpenAI released ChatGPT and changed the world. The AI race is on, and companies are all training and deploying models to compete with each other. To ease adoption, the OpenAI Completion API became the de facto standard for communication. Browse the documentation of most popular model providers and you’ll find they report compatibility with the OpenAI Completion API specification for interacting with their Large Language Models (LLMs).
This spec has worked very successfully for several years. However, in the past year we’ve seen a wave of new “reasoning” models arrive, which are explicitly trained to think before answering. These models don’t just provide an answer: they also generate reasoning content (sometimes referred to as “thinking content” or “chain of thought”), the intermediate steps and justifications a model works through before delivering its final answer. OpenAI models such as o1, o3, o4-mini, and now gpt-5 all have the capability to provide this reasoning content.
However, OpenAI does not return this reasoning content through its Completion API. In March 2025, the company introduced the Responses API, which is intended to complement the Completion API and returns reasoning content alongside the answer. With the recent release of gpt-5, OpenAI strongly recommends migrating from the Completion API to the Responses API.
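To make the distinction concrete, here’s a minimal sketch using the official openai Python SDK. The model name and reasoning parameters are illustrative and may vary with your account and SDK version: the Chat Completions call returns only the final answer, while the Responses API can also surface reasoning output.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

# Chat Completions call: the message content holds only the final answer.
chat = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(chat.choices[0].message.content)

# Responses API call: the reasoning parameter asks the model to surface
# reasoning output alongside the final answer.
resp = client.responses.create(
    model="gpt-5",
    input="Why is the sky blue?",
    reasoning={"effort": "low", "summary": "auto"},
)
print(resp.output_text)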
The problem: most LLM providers and their communities have centralized on the Completion API, but a critical piece of output (the reasoning/thinking content) isn’t handled by that OpenAI specification. To deal with this, each provider has had to decide how to handle the new model capabilities: abandon the OpenAI Completion API, or augment and extend it?
At Mozilla.ai, our mission is to make AI systems more open, transparent, and interoperable. Our work on any-agent and any-llm supports this by making it easier to evaluate, compare, and switch between language models and providers, especially as reasoning capabilities become more important to use and evaluate real-world agentic applications.
One of the benefits of working on any-agent and any-llm is that our team gets a holistic perspective on how several different libraries and frameworks implement features common to them all. This has been no different with reasoning content. When gpt-oss was released this month, I was excited to try it: it was the first open-weight OpenAI model since gpt-2, and it’s a reasoning model! For the first time, we would have the opportunity to see the raw reasoning output from a GPT reasoning model.
Since gpt-oss is an open-weight model (not to be confused with open-source), it’s offered as a hosted solution in many places and can also be downloaded and run locally in tools like ollama and LM Studio. I was interested to see how the local gpt-oss models (which are quantized to 4-bit data representations) stacked up against the remote-hosted versions, and my curiosity extended not only to the models’ final answers but also to their reasoning.
I started running the models and quickly realized that accessing the reasoning content in a standardized way is no simple task. Providers that offer an OpenAI Completion-compatible API must extend, and diverge from, the OpenAI specification in order to return reasoning content alongside the answer. See the table below for a sample of how the content is handled among a few different providers.
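To give a flavor of what that divergence looks like in code, here’s an illustrative sketch. The field names below are examples of the kinds of shapes you may encounter, not an authoritative mapping of any specific provider’s schema: some providers add a sibling field next to the answer, while others embed the reasoning inline in the answer text.

# Illustrative response shapes only -- not any specific provider's actual schema.
raw_provider_a = {  # reasoning returned as an extra field next to "content"
    "choices": [{"message": {
        "content": "Lisbon is a great budget pick.",
        "reasoning_content": "The user wants a weekend trip under $1000 for two...",
    }}]
}

raw_provider_b = {  # reasoning embedded inline in the answer text
    "choices": [{"message": {
        "content": "<think>The user wants a weekend trip under $1000 for two..."
                   "</think>Lisbon is a great budget pick.",
    }}]
}

def extract_reasoning(raw: dict) -> str | None:
    # Exactly the kind of per-provider glue code any-llm is meant to save you from writing.
    message = raw["choices"][0]["message"]
    if "reasoning_content" in message:
        return message["reasoning_content"]
    content = message.get("content") or ""
    if content.startswith("<think>") and "</think>" in content:
        return content[len("<think>"):content.index("</think>")]
    return None

for raw in (raw_provider_a, raw_provider_b):
    print(extract_reasoning(raw))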
Because of these subtle differences even between providers working to conform to a single standard (the OpenAI Completion API Spec), we see the value of a tool like any-llm to help ease the pain of comparison.
As discussed in our recent blog post, any-llm is an alternative to popular libraries like LiteLLM and aisuite, which focus on simplicity and usage of existing provider SDKs whenever possible.
With any-llm, the reasoning content that providers return over their completion endpoints is standardized into a single output attribute, typed with Pydantic for consistent and verifiable static type checking, across all the providers we support.
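Conceptually, the shape looks something like the sketch below (an illustration of the idea, not any-llm’s actual class definitions): the reasoning text lives on a typed message model, so your tooling can check access to it statically.

from pydantic import BaseModel

class Message(BaseModel):
    role: str
    content: str | None = None
    reasoning: str | None = None  # populated when the provider returns reasoning content

class Choice(BaseModel):
    message: Message

class ChatCompletion(BaseModel):
    choices: list[Choice]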
In practice, this means what you'll get are consistent, standard responses even from different providers (e.g. ollama, LM Studio, and Groq), allowing you to focus on what really counts (e.g. which model or inference provider works best for you).
from any_llm import completion
# Make sure you have already downloaded the gpt-oss model in ollama and LM Studio, that both
# applications are running, and that GROQ_API_KEY is set in your environment.
models = ["groq/openai/gpt-oss-20b", "ollama/gpt-oss:20b", "lmstudio/openai/gpt-oss-20b"]
prompt = "What's a good weekend vacation option in Europe that costs less than $1000 for two people?"
for model in models:
    response = completion(model=model, messages=[{"role": "user", "content": prompt}])
    print("--------------------------------")
    print(f"Model: {model}")
    print(f"Response: {response.choices[0].message.content}")
    print(f"Reasoning: {response.choices[0].message.reasoning}")
    print("--------------------------------")
We believe this cross-provider compatibility is critical to allow developers to choose and easily switch between models and providers, further advancing transparency and accessibility of LLM usage.
If you're working with reasoning models and frustrated by inconsistent outputs, give any-llm a try and let us know how we can support more providers or workflows. You can follow or contribute to our ongoing Responses API work here.