AIssert: Testing LLM Integrations

Since LLMs exploded into public awareness, we have witnessed their integration into a vast array of applications, from customer support chatbots to tools that can summarize documents. Today, virtually every app seems to use LLMs in some form. This trend isn’t slowing down. As developers continue to explore ways to incorporate AI, the need to test this integration is becoming critically important.
While integrating AI into applications adds useful features, it also introduces new complexities, especially in testing. AI integration testing faces challenges of its own, such as model updates and prompt modifications that can silently change an application’s behaviour.
At Mozilla.ai, we did some research with software engineers who articulated a common frustration: testing AI integrations is predominantly manual. We also believe there is a need for formal testing of the end-to-end application, going beyond vibe checks or even evaluating the model in isolation.
So what could bring those builders to automate the testing of apps that have an LLM at their core? From our conversations, a library to help here should:
- Be easy to integrate with existing CI/CD pipelines, for example by being imported the way pytest is.
- Not require machine learning knowledge to configure the tests, but ideally be extensible for those who have the right expertise.
- Ideally, be language-agnostic: AI code is predominantly in Python, but general applications are built with a variety of languages, and they too are receiving the LLM upgrade.
Discovering Giskard: AI testing
With that in mind, we explored the existing landscape. One tool that quickly stood out was Giskard, which could already do much of the heavy lifting.
Giskard operates as a “test server”, enabling developers to interact with their AI applications while performing evaluation tasks within the Giskard environment. This makes it easy to integrate into your regular test code, and from there into your CI/CD pipelines. It can automatically probe for issues such as bias, hallucinations, or vulnerability to prompt injection, and it can be extended with custom tests.
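To make that concrete, here is a minimal sketch of what a Giskard scan looks like in Python. The giskard.Model wrapper and giskard.scan call follow Giskard’s documented text-generation workflow; answer_question, the model name, and the description are placeholders standing in for your own application.

import giskard
import pandas as pd

def answer_question(question: str) -> str:
    # Placeholder: call your LLM-powered application here and return its answer.
    raise NotImplementedError

# Giskard calls the wrapped function with a DataFrame of inputs and expects
# one generated text per row.
def predict(df: pd.DataFrame) -> list[str]:
    return [answer_question(q) for q in df["question"]]

model = giskard.Model(
    model=predict,
    model_type="text_generation",
    name="My LLM app",
    description="Answers user questions using an LLM.",
    feature_names=["question"],
)

# Probe for issues such as hallucinations, bias, and prompt injection,
# then export the findings as an HTML report.
scan_results = giskard.scan(model)
scan_results.to_html("scan_report.html")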
So, right off the bat, we had found at least one existing library that checked two of our requirements: it tests your whole application (not just the LLM), and you can simply run all available tests and let Giskard take it from there.
However, Giskard’s existing integration has its limitations, particularly that it is built for the Python ecosystem, leaving many developers without a native tool to test their LLM integrations.
Building AIssert
We decided to build upon what we found, creating AIssert. The idea is a tool that integrates smoothly into the software development lifecycle, no matter what language the application is built with. Our initial approach is simply to offer a wrapper around Giskard that you can run to launch a full test against any API, whether it is written in Python or any other language.
We didn’t set out to build a fully-fledged standalone tool with AIssert. Our goal was to explore how tools like Giskard could be integrated into CI/CD pipelines for testing AI-powered APIs, not necessarily written in Python.
AIssert acts as a wrapper around Giskard, running a full scan against a target API. To get started, you first need to define how your API expects requests to be structured. For testing purposes, we built a sample app: a Dungeon Master API that uses OpenAI to generate dynamic storytelling worlds. Here’s an example config.json file that tells AIssert how to interact with the API:
{
  "input_mapping": {
    "world_context": "world_context",
    "genre": "genre",
    "difficulty": "difficulty",
    "narrative_tone": "narrative_tone",
    "campaign_name": "campaign_name",
    "user_question": "user_question"
  },
  "output_mapping": {
    "predictions": "narrative"
  },
  "request": {
    "headers": {
      "Content-Type": "application/json",
      "Authorization": "Bearer YOUR_TOKEN"
    },
    "method": "POST"
  }
}
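The CLI invocation below also takes an --input file containing the sample payload to send to the API. The exact shape AIssert expects may differ (check the repository), but an illustrative payload matching the input_mapping fields above could look like this:

{
  "world_context": "A floating archipelago of sky islands linked by ancient bridges",
  "genre": "high fantasy",
  "difficulty": "medium",
  "narrative_tone": "whimsical",
  "campaign_name": "Skyward Bound",
  "user_question": "What do the players see when they step onto the first bridge?"
}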
Once configured, running AIssert is as simple as executing:
uv run aissert_cli.py --api-endpoint http://localhost:8000/api/dungeon/ \
  --config ./config.json --input ./input.json \
  --output predictions.json --scan --report-file scan_report.html \
  --verbose
This will trigger a Giskard scan and generate a detailed HTML report of the results.
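If you want to fold this into a CI/CD pipeline, one option (not something AIssert ships, just a sketch assuming the CLI and a running API are available in the test environment) is a small pytest wrapper that shells out to the command above and fails the build when the scan errors out:

import subprocess
from pathlib import Path

def test_dungeon_api_scan():
    # Run the AIssert CLI against the locally running Dungeon Master API.
    result = subprocess.run(
        [
            "uv", "run", "aissert_cli.py",
            "--api-endpoint", "http://localhost:8000/api/dungeon/",
            "--config", "./config.json",
            "--input", "./input.json",
            "--output", "predictions.json",
            "--scan",
            "--report-file", "scan_report.html",
        ],
        capture_output=True,
        text=True,
    )
    # Fail the test (and the CI job) if the scan did not complete cleanly.
    assert result.returncode == 0, result.stderr
    # The HTML report should exist so it can be archived as a CI artifact.
    assert Path("scan_report.html").exists()

From there, the generated scan_report.html can be uploaded as a build artifact and reviewed on every pull request.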
AIssert is meant to show that AI apps don’t have to be excluded from automated testing. AI software is still software, and it deserves unit, integration, and end-to-end tests.
Take a look at our repository: are you already using this kind of testing on your AI apps? Can you contribute new test cases for others to learn from?
Let us know how you are ensuring your AI apps are thoroughly tested, and what you would add to this repo that others could also introduce into their own app testing.