Replicate & Fly cold-start latency

Replicate has been my default serverless GPU choice, and I’ve been trying to use it to set up some embedding models, like SPLADE and a Q&A-optimized bi-encoder. Meanwhile, I’m a huge fan of Fly for hosting, and they’ve just announced GPUs.
How does cold & warm latency compare between these providers?

Problem Context

I’m building a semantic search engine, and I’m most interested in minimizing query-time latency and in trying out multiple different models.
One constraint: I don’t want to pay the $2k/mo it takes to keep a GPU permanently warm, so I’m generally looking for serverless providers with great cold-start latency.

Test #1: Cold-start latency profile on Replicate

I’m testing an MSMARCO-trained bi-encoder from sentence-transformers. I’m not doing anything particularly intelligent re: model loading - the 100 MB model is downloaded from Huggingface the first time the container runs.
I instrumented the Cog container to report a) the time since the setup function was first called, representing the end of machine startup, and b) the time taken for model inference. Then I timed it end-to-end from a Python script calling the Replicate API.
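For reference, here’s a minimal sketch of what the instrumented predictor looks like - the model name and the exact fields reported are illustrative, not verbatim from my code:

```python
import json
import time

from cog import BasePredictor, Input
from sentence_transformers import SentenceTransformer


class Predictor(BasePredictor):
    def setup(self):
        # Cog calls setup() once per container; this timestamp marks the end
        # of machine startup.
        self.setup_started_at = time.time()
        # The ~100 MB model is pulled from Huggingface on first run rather
        # than being baked into the image.
        self.model = SentenceTransformer("msmarco-distilbert-base-v4")
        self.model_loaded_at = time.time()

    def predict(self, query: str = Input(description="Query to embed")) -> str:
        inference_start = time.time()
        embedding = self.model.encode(query).tolist()
        return json.dumps({
            "embedding": embedding,
            "model_load_seconds": self.model_loaded_at - self.setup_started_at,
            "seconds_since_setup": inference_start - self.setup_started_at,
            "inference_seconds": time.time() - inference_start,
        })
```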
Cold-start latency on Replicate for a 14 GB Cog Docker image, with 100 MB of runtime download.
Machine startup takes around 60 seconds, downloading the model takes about 10, and embedding a single query string takes only around 5 ms.
70 seconds is too long to wait at a search box.

Test #2: Cold & warm latency tests on Replicate vs Fly

I uploaded the same Cog container to Fly, and updated my instrumentation script to call that endpoint too. With Fly, I haven’t yet figured out how to configure the automatic scale-to-zero timer, so I manually used fly machines kill to profile cold-start times.
To ensure that Fly didn’t have an additional layer of short-lived caching, I spaced out my cold-start tests by up to 12 hours.
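The client side of the harness is roughly this shape - it assumes a simple synchronous POST against each endpoint (in practice the Replicate call goes through its API, and the Fly call goes to the Cog container’s HTTP server), and the URLs and payload format are placeholders:

```python
import time

import requests

# Placeholder URLs; the real deployments are a Replicate model endpoint and a
# Fly app wrapping the same Cog container.
ENDPOINTS = {
    "replicate": "https://example-replicate-endpoint/predictions",
    "fly": "https://example-app.fly.dev/predictions",
}


def time_request(url: str, query: str = "what is splade?") -> float:
    """Wall-clock seconds from issuing the request to receiving a result."""
    start = time.time()
    resp = requests.post(url, json={"input": {"query": query}}, timeout=300)
    resp.raise_for_status()
    return time.time() - start


for name, url in ENDPOINTS.items():
    # The first call after `fly machines kill` (or after Replicate scales to
    # zero) measures cold latency; repeating it immediately measures warm latency.
    cold = time_request(url)
    warm = time_request(url)
    print(f"{name}: cold={cold:.1f}s warm={warm * 1000:.0f}ms")
```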
Warm latency figures, and machine start time figures, for Fly and Replicate, for the same 14 GB Cog Docker image. Warm latency figures start at the HTTP request to a warm model, and end at receiving a result. Machine start time figures start at the HTTP request to a cold model, and end at the setup() function invocation.
Fly far dominates Replicate’s numbers, for both warm and cold inference. After getting my initial numbers on a Replicate A40, I chatted with a friend who works at Replicate, who mentioned that T4s are hosted differently than A40s and that I might get better results there. I then tried T4s, but cold latencies were much worse, and warm latencies were not much improved.
To rule out something surprising in my inference code, I also switched to an effectively no-op predict function that just returns json.dumps({"success": "True"}), and my results were essentially unchanged.
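Sketched out, that control predictor is essentially just this (the input field is arbitrary):

```python
import json

from cog import BasePredictor, Input


class Predictor(BasePredictor):
    def setup(self):
        # No model download, no GPU work: isolates platform + network overhead.
        pass

    def predict(self, query: str = Input(description="Ignored")) -> str:
        return json.dumps({"success": "True"})
```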

What now?

Replicate seems to have incredibly poor latency, regardless of cold/warm status. 800 ms is a lot to spend on networking overhead for an already-warm model. If you’re thinking of deploying a model to production, and you really want to use the Cog architecture for some reason, you might still consider deploying your Cog container to Fly instead.
Fly continues to be excellent. The min(machine_boot) I experienced with cold boots was an astounding 2s - and I’d guess their cold-boot time is probably the best available among GPU providers.
I’m likely to still use Replicate for its open source / social aspect. I’d like to containerize and test out a bunch of embedding models for which code, but no API, currently exists (mostly SPLADE, plus more models from sentence-transformers), and Replicate has a great story for turning that effort into a public good. I’d love to build a universal embeddings API, where it’s trivial to switch the model you’re using in a product. Because Replicate lets other users pay for their own usage, it’s the only provider that could support this.
But if you ask me which provider you should deploy to prod on - it’s probably not Replicate.