Embedding search latency
Continuing with the blog search engine I’ve previously mentioned, here are some initial stats on where the latency goes in a simple embedding search:
- 50%: ~600-1000ms on Cohere Rerank
- 25%: ~300-500ms on the Pinecone lookup
- 12.5%: ~150-250ms on the OpenAI text-embedding-3-small call
- 12.5%: ~150-250ms on everything else (client → worker latency, database calls)
Of course, these are rough numbers from what I’ve been seeing, and their roundness should warn you that they’re imprecise.
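For context, the request path is strictly sequential (embed → vector lookup → rerank), which is part of why the numbers stack up the way they do. Here’s a rough sketch of that shape with per-stage timing; `embedQuery`, `vectorSearch`, and `rerank` are placeholders, not my actual code.

```ts
// Placeholders for the real OpenAI / Pinecone / Cohere calls.
declare function embedQuery(q: string): Promise<number[]>;
declare function vectorSearch(v: number[], topK: number): Promise<string[]>;
declare function rerank(q: string, docs: string[], topN: number): Promise<string[]>;

// Small helper to time each awaited stage.
async function timed<T>(fn: () => Promise<T>): Promise<{ value: T; ms: number }> {
  const start = performance.now();
  const value = await fn();
  return { value, ms: performance.now() - start };
}

async function search(query: string) {
  const embedding = await timed(() => embedQuery(query));                  // ~150-250ms (OpenAI)
  const candidates = await timed(() => vectorSearch(embedding.value, 50)); // ~300-500ms (Pinecone)
  const ranked = await timed(() => rerank(query, candidates.value, 10));   // ~600-1000ms (Cohere Rerank)

  console.log({ embedMs: embedding.ms, lookupMs: candidates.ms, rerankMs: ranked.ms });
  return ranked.value;
}
```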
Improving Rerank latency is the hardest piece - I’d either need to go without reranking or set up my own cross-encoder. The latter is doable, and should be much faster, but would likely be time-consuming.
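For reference, the rerank step is a single HTTP round-trip; a self-hosted cross-encoder would just be a different service answering the same request shape. A hedged sketch - endpoint, model name, and field names here are my best reading of Cohere’s rerank API, not verified production code:

```ts
// Hedged sketch of the rerank call; field names follow my reading of
// Cohere's rerank API and may need adjusting. A self-hosted cross-encoder
// could sit behind the exact same function signature.
async function rerankDocuments(
  apiKey: string,
  query: string,
  documents: string[],
  topN = 10,
): Promise<{ index: number; score: number }[]> {
  const res = await fetch("https://api.cohere.com/v1/rerank", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "rerank-english-v3.0",
      query,
      documents,
      top_n: topN,
    }),
  });
  const data = await res.json();
  // Each result points back into `documents` by index, with a relevance score.
  return data.results.map((r: any) => ({ index: r.index, score: r.relevance_score }));
}
```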
The other three pieces are more obviously optimizable. I’d consider the following improvements, each of which I’d expect to give a ~50-80% improvement to its respective component:
- Pinecone → Turbopuffer
- OpenAI text-embedding-3 → local smaller model
- Smarter DB interaction + caching (sketched below)
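A cheap first pass at the caching piece: keying embeddings by the normalized query, so repeat searches skip the OpenAI call entirely. The in-memory Map below is purely illustrative; in practice this would be a KV store or a table in the database.

```ts
// Illustrative embedding cache: repeat queries skip the ~150-250ms embedding call.
// An in-memory Map won't survive across worker invocations; KV or the DB would.
const embeddingCache = new Map<string, number[]>();

async function cachedEmbed(
  query: string,
  embed: (q: string) => Promise<number[]>, // the real OpenAI call gets passed in
): Promise<number[]> {
  const key = query.trim().toLowerCase();
  const hit = embeddingCache.get(key);
  if (hit) return hit;
  const vector = await embed(key);
  embeddingCache.set(key, vector);
  return vector;
}
```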
The Vespa approach is kinda compelling here too, i.e. colocating the DB, vector DB, embedding model, and cross-encoder model is likely to improve performance by minimizing round-trips. Notably, the embedding computation itself might take as little as 5ms, so the current ~150-250ms embedding call is likely dominated by network latency.
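To put something behind that 5ms figure: with a small model loaded in-process (e.g. via transformers.js), embedding a short query is compute-bound rather than network-bound. The model choice and options below are assumptions on my part, not something I’ve wired up yet.

```ts
import { pipeline } from "@xenova/transformers";

// Sketch of colocated query embedding; model and options are assumptions.
// Once the model is warm, a short query should embed in single-digit ms.
const extractor = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");

export async function embedLocally(query: string): Promise<number[]> {
  const start = performance.now();
  const output = await extractor(query, { pooling: "mean", normalize: true });
  console.log(`embedded in ${(performance.now() - start).toFixed(1)}ms`);
  return Array.from(output.data as Float32Array);
}
```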
Unrelatedly, I’m also hoping to add a SPLADE embedding soon. It seems like it’d capture most of the benefits of FTS (full-text search), but be a bit smarter + allow me to continue using Pinecone.
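Part of the appeal is that Pinecone already accepts a sparse vector alongside the dense one on indexes configured for it, so the query side would look roughly like the sketch below. The index name and SDK field names are my reading of the client library, and the sparse values would come from a real SPLADE encoder.

```ts
import { Pinecone } from "@pinecone-database/pinecone";

// Hedged sketch of a sparse+dense (SPLADE + embedding) query against Pinecone.
// The sparseVector field follows my reading of the SDK; names are illustrative.
const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const index = pc.index("blog-search"); // illustrative index name

async function hybridQuery(
  dense: number[],
  sparse: { indices: number[]; values: number[] },
) {
  return index.query({
    vector: dense,        // text-embedding-3-small output
    sparseVector: sparse, // SPLADE term weights
    topK: 50,
    includeMetadata: true,
  });
}
```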