How a Failed Demo Forced Us to Rethink Our Embedding Strategy

ATG’s lessons learned from our embedding solutions in production

Jean-Christophe Budin

Note: all the data below comes from a benchmark run on the Ask This Guy platform and covers the period from January 5th to 25th, 2026. Requests are sent at regular intervals from an ATG server in France, where response times are measured.

A Tough Moment

Sales call with a prospect. We’re here to present Ask This Guy, our AI solution that (among other things) relies on a company’s internal documents. Everything is ready, the demo is well rehearsed.

I type a simple query into the interface. And then… silence.

Three seconds. Ten seconds. We try to move on to another topic, we mumble a bit, but we can guess what’s going on in the client’s head and the question they’re not asking out loud: “Does your solution actually work?”

When the meeting ends, we take a look at the logs, and the verdict is immediate: we hit the wrong moment. That day, the Mistral API took almost a full minute to respond. A one-off anomaly? Not really. As we’ll see, this is a structural problem.

This embarrassing moment forced us to ask a simple but crucial question: how do you guarantee reliable performance in production when you depend on external AI providers?

What we’ve learned since then is what we’re going to share with you.

Back to Basics

Before we dive in, a quick reminder. If you deploy an AI in your company that relies on your internal documents via a RAG architecture (Retrieval-Augmented Generation), you’ll need embeddings for semantic search.

What is an embedding?

Imagine that every sentence in your documents is turned into a numerical vector - a mathematical representation of its meaning. That’s an embedding. It lets you very quickly find the piece of information that best matches a user’s need.
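For intuition: two pieces of text are “semantically close” when their vectors point in similar directions, which is typically measured with cosine similarity. Here is a toy sketch with made-up three-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: close to 1.0 means similar meaning,
    close to 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors, invented for illustration:
query_vec = [0.9, 0.1, 0.0]
doc_close = [0.8, 0.2, 0.1]   # "close" in meaning to the query
doc_far   = [0.0, 0.2, 0.9]   # unrelated to the query

print(cosine_similarity(query_vec, doc_close))  # high, near 1.0
print(cosine_similarity(query_vec, doc_far))    # low, near 0.0
```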

In theory, the mechanism is simple:

  1. A user sends a message
  2. The system formulates one or more search queries
  3. These queries are converted into embeddings (vectors)
  4. These vectors are compared with those of your documents
  5. The closest document snippets are retrieved
  6. The answer is generated from those snippets
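Put into code, the loop above looks roughly like this. This is a toy sketch: `embed`, the in-memory document store, and the final answer step are stand-ins for a real embedding API, vector database, and LLM.

```python
def embed(text: str) -> list[float]:
    # Stand-in: a real system would call an embedding API here (step 3).
    return [float(ord(c)) for c in text[:3]]

def vector_search(query_vec, indexed_docs, top_k=2):
    # Steps 4-5: compare the query vector with the document vectors
    # and return the closest snippets (squared Euclidean distance here;
    # real systems usually use cosine similarity).
    def dist(v, w):
        return sum((x - y) ** 2 for x, y in zip(v, w))
    ranked = sorted(indexed_docs, key=lambda d: dist(query_vec, d["vec"]))
    return [d["text"] for d in ranked[:top_k]]

def answer(user_message: str, indexed_docs) -> str:
    query_vec = embed(user_message)                     # steps 2-3
    snippets = vector_search(query_vec, indexed_docs)   # steps 4-5
    # Step 6: a real system would prompt an LLM with the snippets.
    return "Answer based on: " + "; ".join(snippets)

# Documents are embedded once, at indexing time:
docs = [{"text": t, "vec": embed(t)} for t in ["abc", "abd", "xyz"]]
print(answer("abe", docs))
```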

You see where this is going. Every time a user sends a request, you need to compute an embedding. If that computation is slow, the user waits. If the embedding fails, there is no answer at all.

That’s why embedding performance is not a technical detail: it is the user experience.

For example, we can define an acceptable performance target like this:

  • Target: response time under 300 ms in the vast majority of cases
  • Critical limit: never exceed one second
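Targets like these are easy to check automatically against measured latencies. A minimal sketch (the millisecond values below are invented for illustration):

```python
# Check measured embedding latencies (in ms) against the targets above:
# 95% of requests under 300 ms, and none ever over one second.

def meets_targets(latencies_ms, p95_target=300, hard_limit=1000):
    ordered = sorted(latencies_ms)
    # 95th percentile, nearest-rank method.
    p95 = ordered[max(0, int(len(ordered) * 0.95) - 1)]
    return p95 <= p95_target and ordered[-1] <= hard_limit

good_day = [120, 150, 180, 200, 210, 220, 230, 250, 260, 280]
bad_day  = [120, 150, 180, 200, 210, 220, 230, 250, 260, 58000]

print(meets_targets(good_day))  # True
print(meets_targets(bad_day))   # False
```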

And we mustn’t forget another crucial point: not all embedding models are equal. The differences in answer quality can be huge - we saw this firsthand when we experimented with “average” models. If you want to compare models, HuggingFace’s MTEB leaderboard is a very good starting point.

Our Initial Path: The "Big Names"

When we started building Ask This Guy, the choice seemed obvious: the big players in the market - Mistral, OpenAI. That’s more or less what everyone does, right?

That’s indeed where we began.

How good is this approach? The answer is… mixed. OpenAI’s performance sits on average around one second for the large model (and about half a second for the small model). Not catastrophic, but far from our target.

Mistral, for its part, shows a better median performance around 300 ms, but with a high failure rate:

Mistral & OpenAI performance

Let’s zoom in on the day‑to‑day variability, especially on Mistral’s side:

Mistral & OpenAI performance

Most of the time, more than 95% of requests are processed in under a second. On paper, that’s fine. But on some days, performance is simply catastrophic. That’s what happened during our demo. And that’s exactly what our data shows, relentlessly.

Second Problem: Lock-In

Let’s assume you’ve found the perfect provider (even if we’re not quite there yet in this article).

Great.

But what happens tomorrow if their performance degrades? Or if their prices go up? Or if the company disappears?

The natural answer is: “We’ll just switch providers!”

Except… it’s not that simple.

Here’s the core technical problem: comparing embeddings generated by different models doesn’t work. The vectors are not compatible. Once you’ve used a model to index your documents, you are stuck using the same model for queries (or you need to re‑index all your documents).
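A trivial illustration of this incompatibility: vectors from different models often don’t even share the same dimension (the sizes below are illustrative), and even when they do, their coordinates mean different things, so cross-model similarity scores are meaningless:

```python
def cosine_similarity(a, b):
    # Vectors from different models can't be compared at all
    # when their dimensions differ.
    if len(a) != len(b):
        raise ValueError("incompatible embedding dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

doc_vec = [0.1] * 1536    # document indexed with model A
query_vec = [0.1] * 1024  # query embedded with model B

try:
    cosine_similarity(doc_vec, query_vec)
except ValueError as e:
    print(e)  # incompatible embedding dimensions
```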

That’s where model choice becomes strategic:

  • If you use a proprietary model (such as OpenAI’s), you are completely dependent on that provider. There’s no easy way out.
  • If you use an open‑weights model, you can use several providers for the same model. And that’s where things get interesting.

The Inference Specialists

Back to our benchmark results. A clear trend emerges: some providers perform significantly better than the traditional giants. These are “inference” specialists - they excel at running AI models quickly and reliably. They rely on open‑weights models.

Embedding providers performance comparison

For embeddings specifically, Scaleway and Nebius (both European!) clearly stand out among the providers we tested.

And their consistency is better, especially for Scaleway, where it is remarkable:

Performance comparison of embedding providers

Why is that? Probably because they are cloud infrastructure and inference specialists, whereas OpenAI and Mistral are primarily model creators. Specialization pays off.

Is That Enough?

Not quite. As good as some providers’ performance may be, there are still periods of underperformance. And the risk of vendor lock‑in remains if their performance worsens or their prices rise.

To solve both problems, we chose a particularly strong open‑weights embedding model that is available from several high‑performing providers: Qwen 3 Embedding 8B.

From there, two strategies are possible:

  1. The dynamic approach: you monitor performance in real time and send each request to the best‑performing provider at that moment. If a request fails with one, you resend it to another.
  2. The “all‑out” approach: given how low the economic and environmental cost of embeddings is (you can get below one cent per million tokens - in other words, generating an embedding is about 200 times cheaper than sending the same document to GPT 5.2 in a request), why hold back? You send the request to several providers simultaneously and take the first answer that comes back.
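In code, the “all-out” approach is essentially a race between providers: fire the same request at all of them and keep the first successful response. A minimal sketch with Python’s asyncio (the providers here are simulated with sleeps; in production each coroutine would be an HTTP call to a real embedding API):

```python
import asyncio

async def call_provider(name: str, delay: float, fail: bool = False):
    # Simulated provider: in production this would be an HTTP request
    # to an embedding API.
    await asyncio.sleep(delay)
    if fail:
        raise RuntimeError(f"{name} failed")
    return name, [0.1, 0.2, 0.3]  # dummy embedding

async def first_success(coros):
    # Race all providers; return the first result that didn't fail,
    # then cancel the stragglers.
    pending = {asyncio.ensure_future(c) for c in coros}
    try:
        while pending:
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                if task.exception() is None:
                    return task.result()
        raise RuntimeError("all providers failed")
    finally:
        for task in pending:
            task.cancel()

async def main():
    winner, _vec = await first_success([
        call_provider("provider_a", 0.05),
        call_provider("provider_b", 0.01, fail=True),  # fast but fails
        call_provider("provider_c", 0.20),
    ])
    return winner

print(asyncio.run(main()))  # provider_a
```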

At Ask This Guy, we went with this second approach.

The result? Ultra‑stable performance, no critical dependency on a single provider, and a cost that remains very reasonable.

A Lesson Every IT Professional Knows, and One That Is Even More True With AI

Building a POC is easy. Running production is a different story.

In a POC, variability doesn’t really show. With a handful of queries, everything appears to work. But in production, with real users, real volumes, and high expectations, every weakness becomes obvious.

The good news is that with engineering and patience, you can work your way through any of these challenges.

Our failed demo? It cost us a few awkward minutes, but it earned us a far more robust architecture. And that is priceless.


Want to learn more about our RAG approach or our other real‑world AI production lessons? Feel free to book a demo!

Tags: Embeddings, RAG, AI, Performance, Latency, Vendor lock-in, Open Weights, Qwen, Mistral, OpenAI, Scaleway, Nebius
