SaaSPerform

What RAG teams keep learning about production latency that perf engineers already knew

A guest post from the lesspoo team on five performance-engineering lessons that translate directly from older production systems to the current generation of RAG systems.


The first time we ran a search and RAG engagement for a team whose system was slow in production, the team was convinced the problem was novel. They had a new architecture, a new class of model, a new database; the bottleneck must therefore live somewhere new. The framing made sense to them and was reasonable on the face of it.

Two years and a lot of engagements later, we can say with confidence that almost nothing about the latency problems we see in RAG systems is novel. The systems are new. The patterns of slowness are very old. Performance engineering teams that have been operating high-throughput systems for years have already learned everything the RAG teams are now relearning, and the lessons translate almost completely. We wanted to share the ones that have come up most often, because the audience reading this likely already knows them and might find it useful to see how they apply to a system class that is currently being built without the benefit of that institutional knowledge.

The first lesson is that user-perceived latency is the latency that matters. RAG teams sometimes optimize for the latency of an internal component without measuring whether the user actually waits for that component. We have seen teams celebrate a 30 percent reduction in embedding time on a query path where embeddings were a small fraction of total latency. The user noticed nothing. The work was real. The framing was wrong. Performance engineers know to optimize what the user is paying for, and the framing translates directly.
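The arithmetic behind that lesson is worth making explicit. A minimal sketch, with stage names and timings that are purely illustrative, not taken from any real system:

```python
# Hypothetical per-component timings (ms) for one RAG request path.
# The stage names and numbers are illustrative only.
timings_ms = {
    "embed_query": 12,
    "vector_search": 38,
    "rerank": 25,
    "synthesis": 820,
}

total = sum(timings_ms.values())
for stage, ms in sorted(timings_ms.items(), key=lambda kv: -kv[1]):
    print(f"{stage:>14}: {ms:5d} ms  ({ms / total:6.1%} of total)")
```

On these numbers, a 30 percent cut to embedding time saves about 4 ms of an 895 ms path: the share table makes the framing error visible before any optimization work starts.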

The second lesson is that tail latency matters more than median latency. RAG teams report median latency on dashboards and tune to it. The customer experiencing the system is in the tail. A system whose median is fine but whose p99 is bad is not a system that performs well. The teams who have learned this in older systems intuit it immediately. The teams who are new to production performance often have to learn it the hard way after a customer complains.
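A quick way to see how a dashboard can hide the tail: generate a synthetic latency distribution that is mostly fast with a slow minority, and compare p50 to p99. The distribution parameters are invented for illustration.

```python
import random

def percentile(samples, p):
    """Nearest-rank percentile over a list of latency samples."""
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1))))
    return ordered[rank]

random.seed(7)
# 98% of requests cluster around 120 ms; 2% land in a slow tail.
latencies = [random.gauss(120, 15) for _ in range(980)] + \
            [random.gauss(2500, 400) for _ in range(20)]

p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
print(f"p50 = {p50:.0f} ms, p99 = {p99:.0f} ms")
```

The median reads as a healthy ~120 ms while the p99 sits over a second higher, which is exactly the gap a median-only dashboard conceals.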

The third lesson is that variability is its own bug. A system whose latency varies by an order of magnitude depending on the query produces a different user experience than a system whose latency is consistent. The variable system is harder to integrate with, harder to plan capacity for, and harder to debug. RAG systems often have this variability built in because the synthesis step's latency depends on the size of the retrieved context. Performance engineers have spent decades constraining variability through admission control, capacity reservation, and timeout discipline. The same techniques apply, and the RAG teams that adopt them produce more reliable systems.
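One of the simplest variability-constraining techniques the paragraph mentions is admission control on the retrieved context itself: cap what reaches the synthesis step so its latency is bounded. A minimal sketch, with a character budget standing in for whatever token budget a real system would use:

```python
def bound_context(chunks, max_chars=4000):
    """Admit retrieved chunks in rank order until a fixed budget is hit.

    Keeping a prefix of the ranked list bounds the context size (and so
    the synthesis latency) while preserving the highest-ranked evidence.
    """
    kept, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > max_chars:
            break
        kept.append(chunk)
        used += len(chunk)
    return kept
```

With this in place, the synthesis step's input size has a hard ceiling regardless of how many chunks retrieval happened to return, which is the same admission-control discipline older systems apply at their own boundaries.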

The fourth lesson is that locality of reference is still the dominant performance variable. Performance engineers know that the difference between a cached and uncached response is usually two orders of magnitude. RAG systems are a particularly good fit for caching at multiple layers, because the same questions get asked repeatedly and the same chunks get retrieved repeatedly. Teams that have not added caching deliberately are leaving a meaningful win on the table, and the pattern of where to add it is one performance engineering already worked out for older systems.
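A sketch of the simplest version of that win: memoize the embedding call keyed on a normalized query, so trivially different phrasings of the same question share a cache entry. The `embed` function here is a toy stand-in for a real embedding call, and the normalization shown is deliberately crude.

```python
from functools import lru_cache

CALLS = {"n": 0}

def embed(text: str) -> tuple:
    """Stand-in for the real (slow) embedding call; counts invocations."""
    CALLS["n"] += 1
    return tuple(ord(c) % 7 for c in text)  # toy vector

@lru_cache(maxsize=10_000)
def _embed_cached(key: str) -> tuple:
    return embed(key)

def cached_embed(query: str) -> tuple:
    # Normalize before keying so whitespace/case variants share an entry.
    return _embed_cached(" ".join(query.lower().split()))

for q in ["What is our refund policy?",
          "  what is our refund policy?",
          "What is our refund policy?"]:
    cached_embed(q)

print(CALLS["n"], _embed_cached.cache_info().hits)  # 1 call, 2 hits
```

The same shape applies one layer up (caching retrieved chunk sets) and one layer down (caching synthesized answers for exact-duplicate questions); where to place each cache is the part older systems already worked out.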

The fifth lesson is that observability is the prerequisite to all of the above. A system that is not instrumented to the level required cannot be optimized, regardless of how clever the optimization candidates are. RAG teams sometimes try to reason about their performance from logs and intuition. Performance engineering long ago concluded that this does not work and built a culture around continuous measurement. The same culture works for RAG systems, and the same instrumentation patterns translate.
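The instrumentation pattern that translates most directly is per-stage timing collected continuously rather than reconstructed from logs. A minimal sketch, with the stage names invented for illustration; a production system would ship these samples to its metrics backend instead of an in-process dict:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

STAGE_MS = defaultdict(list)

@contextmanager
def timed(stage: str):
    """Record the wall-clock duration of a block under a stage name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        STAGE_MS[stage].append((time.perf_counter() - start) * 1000.0)

with timed("vector_search"):
    time.sleep(0.005)  # stand-in for real work

with timed("synthesis"):
    time.sleep(0.005)
```

Because the timer runs in a `finally` block, failed requests are measured too, which matters when the slow path and the failing path are the same path.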

The pattern across all five lessons is that performance engineering for RAG systems is mostly performance engineering for any production system, applied to a new domain. The RAG-specific concerns add to the bottom of the stack rather than replacing it. The teams who recognize this early move faster, build more reliable systems, and avoid the year of relearning. The teams who treat the RAG layer as a new world that needs new methods spend that year and arrive at the same answers performance engineering already had.

For a performance engineering team being asked to help with a slow RAG system, the working pattern is to apply the playbook the team already has, with one addition. The synthesis step's latency depends on the model and the context size. Both are tunable, and both should be measured the same way the team measures other dependencies. Once that addition is in, the rest of the work is the work the team already knows how to do.
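Measuring the synthesis step "the same way the team measures other dependencies" mostly means tagging each latency sample with the variable that drives it. A sketch with invented numbers, bucketing synthesis latency by context size so the dependence becomes visible on a dashboard:

```python
from collections import defaultdict

# Illustrative (context_chars, latency_ms) pairs; not real measurements.
samples = [(1_000, 310), (1_200, 350), (3_900, 640),
           (8_000, 1_480), (9_500, 1_720)]

BUCKET = 4_000  # characters of context per bucket
by_bucket = defaultdict(list)
for chars, ms in samples:
    by_bucket[chars // BUCKET].append(ms)

report = {
    f"{b * 4}k-{(b + 1) * 4}k chars": sum(v) / len(v)
    for b, v in sorted(by_bucket.items())
}
for label, mean_ms in report.items():
    print(f"{label}: mean {mean_ms:.0f} ms")
```

Once the samples are tagged this way, the context-size knob stops being a RAG-specific mystery and becomes an ordinary dependency with a measurable cost curve.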


This is a guest post from the team at lesspoo, who run custom search and RAG engagements for teams whose corpus is worth searching well. The work focuses on corpus curation, evaluation harnesses that survive launch, and ranking that is explainable when it is wrong.