Embedding search latency

Continuing with the blog search engine I've previously mentioned, here are some initial stats on what latency looks like for a simple embedding search engine (a timing sketch of the pipeline follows the list):
  • 50%: ~600-1000ms on Cohere Rerank
  • 25%: ~300-500ms on the Pinecone lookup
  • 12.5%: ~150-250ms on the OpenAI text-embedding-3-small call
  • 12.5%: ~150-250ms on everything else (client → worker latency, database calls)
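For concreteness, here's a minimal sketch of how per-stage timings like these could be collected, assuming the standard openai, pinecone, and cohere Python clients; the index name, model choices, and metadata field are placeholders, not my actual config:
```python
import time

import cohere
from openai import OpenAI
from pinecone import Pinecone

# Placeholder clients and index name -- illustrative only.
oai = OpenAI()
pc = Pinecone()
index = pc.Index("blog-posts")
co = cohere.Client()


def search(query: str, top_k: int = 50):
    timings = {}

    def timed(label, fn):
        start = time.perf_counter()
        result = fn()
        timings[label] = (time.perf_counter() - start) * 1000  # ms
        return result

    # 1. Embed the query (OpenAI text-embedding-3-small).
    emb = timed("embed", lambda: oai.embeddings.create(
        model="text-embedding-3-small", input=query).data[0].embedding)

    # 2. Nearest-neighbour lookup in Pinecone.
    matches = timed("vector_search", lambda: index.query(
        vector=emb, top_k=top_k, include_metadata=True).matches)

    # 3. Rerank the candidates with Cohere Rerank.
    docs = [m.metadata["text"] for m in matches]
    reranked = timed("rerank", lambda: co.rerank(
        model="rerank-english-v3.0", query=query, documents=docs, top_n=10))

    return reranked, timings
```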
Of course, these are rough numbers from what I've been seeing, and their roundness should warn you that they're imprecise.
Improving Rerank latency is the hardest piece - I'd either need to go without reranking or set up my own cross-encoder. The latter is doable and should be much, much faster, but would likely be time-consuming.
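For reference, a self-hosted cross-encoder isn't much code with sentence-transformers; the checkpoint below is an illustrative choice, not one I've vetted for quality:
```python
from sentence_transformers import CrossEncoder

# Illustrative checkpoint -- quality/latency would need evaluating against Cohere.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, docs: list[str], top_n: int = 10) -> list[tuple[str, float]]:
    # Score (query, doc) pairs in one batched forward pass, then sort highest first.
    scores = model.predict([(query, d) for d in docs])
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]
```
The time cost here is mostly in hosting and evaluating the model, not in wiring it up.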
The other three pieces are more obviously optimizable; I'd consider the following changes, each of which I'd expect to cut its component's latency by ~50-80% (a sketch of the embedding and caching pieces follows the list):
  • Pinecone → Turbopuffer
  • OpenAI text-embedding-3 → local smaller model
  • Smarter DB interaction + caching
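As a sketch of the second and third bullets, a small local embedding model behind a query cache could look something like this; the model choice is an assumption, and swapping models would mean re-embedding the corpus:
```python
from functools import lru_cache

from sentence_transformers import SentenceTransformer

# Illustrative small embedding model; the corpus would need to be
# (re)indexed with whatever model serves queries.
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

@lru_cache(maxsize=4096)
def embed_query(query: str) -> tuple[float, ...]:
    # After warm-up, encoding a short query locally is on the order of
    # single-digit milliseconds; repeated queries skip even that.
    return tuple(model.encode(query, normalize_embeddings=True).tolist())
```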
The Vespa approach is kinda compelling here too, i.e. colocating the DB, vector DB, embedding model, and cross-encoder model is likely to improve performance by minimizing round-trips. Notably, the query embedding itself might take as little as 5ms of compute, so the ~150-250ms I'm seeing is likely dominated by network latency.
Unrelated: I'm also hoping to add a SPLADE embedding soon. It seems like it'd capture most of the benefits of full-text search while being a bit smarter, and it would let me keep using Pinecone.
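For the curious, SPLADE term weights are just log(1 + ReLU(MLM logits)) max-pooled over the query/document tokens, which yields a sparse vocabulary-sized vector. A rough sketch with HuggingFace transformers (the checkpoint choice is an assumption):
```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Checkpoint choice is an assumption -- any SPLADE model should work the same way.
model_id = "naver/splade-cocondenser-ensembledistil"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

def splade_sparse_vector(text: str) -> dict[int, float]:
    """Return {token_id: weight}, the sparse representation SPLADE produces."""
    inputs = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits                      # (1, seq_len, vocab)
    # SPLADE weighting: log(1 + relu(logits)), masked, then max-pooled over tokens.
    weights = torch.log1p(torch.relu(logits)) * inputs["attention_mask"].unsqueeze(-1)
    weights = weights.max(dim=1).values.squeeze(0)           # (vocab,)
    nonzero = weights.nonzero().squeeze(-1)
    return {int(i): float(weights[i]) for i in nonzero}
```
If I'm reading Pinecone's docs right, a sparse-dense index accepts this alongside the dense vector as sparse_values={"indices": [...], "values": [...]}, so hybrid scoring wouldn't require a second store.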