Encoderfile’s New Format: Why a “Dull” Design Wins

Encoder models power most NLP in production, but deploying them still means dragging along Python runtimes and dependencies. Encoderfile introduces a single executable with an appended payload and a format that can be inspected and understood.

Raz Besaleli

Apr 7, 2026 — 4 min read

Encoder models don't chat, and they don't get much attention for it. But they're behind most of the NLP that actually runs in production: embeddings, search, ranking, guardrails, and classification.

There are many reasons to deploy an encoder. A lot of them revolve around needing cheap, fast inference for a task that doesn’t require generative capabilities. So why does deploying one still mean dragging along a Python runtime, a full dependency tree, and serving infrastructure designed for something ten times the size?

Encoderfile fixes this mismatch: a single executable with no runtime and no setup. Build once and ship anywhere, or simply download the executable for your architecture. If you know llamafile, the idea will feel familiar, except that it is built for discriminative models instead of generative ones.

The Old Approach (and Why It Didn’t Scale)

The first version of Encoderfile was… baroque. We generated an entire Cargo project from templates (including a main.rs and a Cargo.toml with dependencies) into a cache directory, then invoked a compiler just to wrap a model. Embedding weights with include_*! macros and managing dependencies through a Cargo-in-Cargo setup were both awkward and slow.

This worked for a proof of concept. We got a single binary with everything inside it. Mission accomplished.

In practice, though, it had problems:

Build times were slow and memory-hungry
Users were expected to install and manage a Rust toolchain
The output was opaque—basically a black box
Iteration was painful

We were optimizing for “single file” without thinking too hard about what that file should actually look like—or how people would interact with it once it existed.

What We Actually Wanted

After one too many out-of-memory (OOM) errors, the requirements got clearer. We needed:

A build process that isn’t slow, fragile, and full of heavy dependencies
A format that’s honest about what it contains
Something you can inspect, validate, and reason about
A structure that doesn’t fight you when you try to build tooling around it

In regulated environments, deployments should not be a leap of faith. Teams need to be able to audit, verify, and reason about what they’re shipping. This means understanding exactly what data (e.g., the model and tokenizer) is included, how it was built, and how it behaves at runtime. A format that can be inspected and decomposed makes those conversations possible.

People want to answer basic questions:

What model is this?
Where did the weights come from?
What exactly is being executed?

The original "Cargo-in-Cargo" approach made those questions harder than they needed to be.

The New File Format

The current Encoderfile format is intentionally less magical.

Encoderfile is now a pre-built executable with an appended payload that contains:

Model weights and tokenizer data
A Protobuf manifest describing what’s inside
A small self-describing footer so the runtime can orient itself

At runtime, the executable reads itself and loads everything accordingly. No compile-time embedding, no macro gymnastics. Just a structured binary layout that can be parsed and understood.
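
To make the "reads itself" step concrete, here is a minimal sketch of how a runtime might find its manifest from a fixed-size footer. The layout is an assumption made for illustration, not the real Encoderfile format: the magic value, the field order, and the name read_manifest_bytes are all hypothetical, and the manifest is treated as opaque bytes rather than decoded as Protobuf.

```python
import struct

# Hypothetical footer layout, NOT the real Encoderfile format:
# an 8-byte magic marker, then the manifest offset and manifest length
# as little-endian unsigned 64-bit integers.
FOOTER = struct.Struct("<8sQQ")
MAGIC = b"ENCFILE\0"  # made-up marker for this sketch


def read_manifest_bytes(path):
    """Return the raw manifest bytes located via the footer at the end of the file."""
    with open(path, "rb") as f:
        # The footer has a fixed size, so it can be found without parsing
        # the executable that precedes it.
        f.seek(-FOOTER.size, 2)  # 2 means "relative to end of file"
        magic, offset, length = FOOTER.unpack(f.read(FOOTER.size))
        if magic != MAGIC:
            raise ValueError("no payload footer found")
        # The footer points back at the manifest, which in turn would
        # describe where the weights and tokenizer data live.
        f.seek(offset)
        return f.read(length)
```

Reading from the end is what lets the same code work regardless of how large the base executable is: nothing in front of the payload needs to be understood to locate it.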

A few consequences fall out of this "dull" design:

Faster startup: Assets are loaded directly from the binary into memory, giving us precise control over when model weights and other large assets are loaded.
Writing the file is just appending data: If you’re using a pre-built base binary (which we publish on GitHub releases), you don’t even need a toolchain to build an Encoderfile.
Sub-second build times: Speed is not because of clever optimization, but because there’s significantly less work to do.
Up-front validation: Model weights and configurations are validated before building, not after something fails at runtime.

Speed, in this case, is just a side effect of simplicity.
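
As a rough illustration of why the build has so little work to do, the sketch below assembles an output file by concatenation alone. It assumes a hypothetical layout (base binary, then assets, then manifest, then a fixed-size footer) and made-up names (build, MAGIC); the real format and its validation rules may differ. The emptiness checks stand in for the up-front validation described above.

```python
import struct

# Hypothetical footer: magic marker, manifest offset, manifest length.
FOOTER = struct.Struct("<8sQQ")
MAGIC = b"ENCFILE\0"  # made-up marker for this sketch


def build(base_binary: bytes, assets: bytes, manifest: bytes) -> bytes:
    """Append assets, manifest, and footer to an unmodified base binary."""
    # Validate before producing any output, so a bad input fails the build
    # instead of surfacing later at runtime. Real checks would be richer.
    if not assets:
        raise ValueError("model assets are empty")
    if not manifest:
        raise ValueError("manifest is empty")

    # The manifest starts right after the base binary and the assets.
    manifest_offset = len(base_binary) + len(assets)
    footer = FOOTER.pack(MAGIC, manifest_offset, len(manifest))

    # No compiler or linker involved: the base binary is copied as-is.
    return base_binary + assets + manifest + footer
```

Because the base binary is never modified, the same function works for any target: only the bytes passed as base_binary change.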

The Build Story

We've adopted a simple philosophy: unless you're building for an exotic target (e.g., RISC-V, embedded ARM, wasm, a toaster), the build process should be completely in-memory and toolchain-free.

On Linux and macOS (x86_64 and arm64), the build CLI fetches a pre-built base binary from GitHub releases, caches it locally, and appends your model artifacts on top. No Rust, no Cargo, no installation drama.

"Cross-compilation" is similarly unglamorous: you just pick a different base binary. No cross toolchains, no linker drama. If you need something custom, you can bring your own base binary, but for most cases, you won't need to.

Note: Windows is the one holdout. WSL works fine for now; native support is coming.
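
The fetch-and-cache step can be sketched in a few lines. Everything named here is an assumption for illustration: the target triples follow Rust conventions but may not match Encoderfile's actual release asset names, the cache file naming is invented, and fetch is a caller-supplied download function rather than a real API.

```python
from pathlib import Path

# The platforms named in the post, written as Rust-style target triples.
# How Encoderfile actually names its release assets is an assumption.
PREBUILT_TARGETS = {
    "x86_64-unknown-linux-gnu",
    "aarch64-unknown-linux-gnu",
    "x86_64-apple-darwin",
    "aarch64-apple-darwin",
}


def resolve_base_binary(target, cache_dir, fetch):
    """Return the cached base binary for `target`, downloading on a cache miss.

    `fetch(target) -> bytes` is a hypothetical, caller-supplied downloader.
    """
    if target not in PREBUILT_TARGETS:
        raise ValueError(f"no pre-built base binary for {target}; bring your own")
    cached = Path(cache_dir) / f"base-{target}"
    if not cached.exists():
        cached.parent.mkdir(parents=True, exist_ok=True)
        cached.write_bytes(fetch(target))
    return cached
```

In this picture, "cross-compiling" amounts to calling resolve_base_binary with a different target string and appending the same model artifacts to the result.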

The Ecosystem Around It

The format is only useful if you can actually build things with it.

Encoderfile currently comes with:

A Rust crate
A CLI for building and running models
Python bindings (coming soon!)

Which means you can:

Wrap it in your own tooling
Generate Encoderfiles as part of a pipeline
Integrate it into existing systems without rewriting everything

The goal isn’t to be a monolith—it’s to be something you can compose with.

What’s Next

A few obvious gaps are already on the roadmap:

Native Windows support
Continued expansion of supported model architectures
Better ergonomics around building and inspecting Encoderfiles

And probably a few things we haven’t tripped over yet.

Encoderfile started as a simple question: why should encoder models be any harder to ship than a binary? The single-file idea came first, but the more interesting work turned out to be defining a format that's honest about what it contains — something you can inspect, decompose, and reason about without it fighting you.

If you want to try it, check out our Getting Started guide. We'd love to know what you build.