Proxying LLMs via Vercel is expensive

Vercel is presenting themselves as the AI-native webdev platform. But there’s something weird about picking Vercel (or another serverless provider) to proxy your OpenAI requests.
Consider a request to GPT-4-turbo with ~5k tokens split evenly between prompt & completion, taking about 20s to complete. This request probably uses <50ms of CPU time on Vercel; most of the time is spent idle, awaiting OpenAI.
Vercel bills you for the full 20s: 400x the CPU time you've actually used.
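For concreteness, here's a minimal sketch of the kind of proxy route in question: a Next.js App Router handler that forwards a chat request to OpenAI. The route path, model name, and request shape are illustrative assumptions, not anything Vercel-specific.

```ts
// app/api/chat/route.ts: a hypothetical proxy route (path, model name,
// and request shape are illustrative).
export async function POST(req: Request): Promise<Response> {
  const { messages } = await req.json();

  // This fetch costs milliseconds of CPU; the ~20s elapses while the
  // function sits idle awaiting OpenAI's response.
  const upstream = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({ model: "gpt-4-turbo-preview", messages }),
  });

  return new Response(upstream.body, {
    status: upstream.status,
    headers: { "Content-Type": "application/json" },
  });
}
```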
Compare this to a couple other options:
  • Running a VM with autoscaling: Assuming all of your requests are proxying OpenAI calls, you'll be able to handle many QPS per node. The requests mostly involve waiting, and the network load of each call is pretty low.
  • Cloudflare Workers: Cloudflare recently updated their pricing to address this issue. They no longer bill you for non-CPU time spent awaiting I/O, so you'd only pay for the 50ms of active CPU time.
Vercel is clearly worse at this: it charges you two orders of magnitude more than the alternatives for doing essentially the same work.

But does it matter?

Vercel might be overbilling you for CPU time here, but it’s plausibly immaterial. Your GPT request is running on a beefy 8xA100 box, and having Vercel idle a tiny machine-slice shouldn’t count for much.
However, there are some reports of this ultimately being an issue. The comments are generally filled with people a) using workarounds to await GPT responses elsewhere, and b) being unhappy with the cost of Vercel at scale.
Crunching the numbers, let's stick with the earlier estimates: 5k tokens split evenly between prompt & completion, GPT-3.5-turbo or GPT-4-turbo, and 20s to complete with ~50ms of CPU time used.
For one million requests:
  • GPT-3.5-turbo: $7,500
  • GPT-4-turbo: $100,000
  • Cloudflare Workers: $2.30
    • ($2 for 50ms of CPU time * 1m requests)
    • ($0.30 per 1m requests)
  • Vercel Serverless (256 MB machines): $556
    • ($40 per 100 GB-hours of wall time = $40/(100*3600) per GB-second)
  • Vercel Serverless (1024 MB machines): $2222
  • Vercel Edge: $2 (originally misestimated as $800)
    • ($2 per 1m execution units * 1 execution unit per call)
    • (The $800 figure came from assuming 400 execution units per call, i.e. billing on wall time.)
    • Updated: leerob pointed out on HN that Vercel bills for wall-time on serverless functions, but not on Edge. Hence, Edge is much cheaper.
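A quick back-of-envelope script reproducing the arithmetic above (using the rates quoted in this post; treat it as an estimate, not a billing calculator):

```ts
const REQUESTS = 1_000_000;
const WALL_SECONDS = 20; // time spent awaiting OpenAI per request

// Vercel Serverless bills wall time: $40 per 100 GB-hours.
const PER_GB_SECOND = 40 / (100 * 3600);
const vercel256MB = REQUESTS * WALL_SECONDS * 0.25 * PER_GB_SECOND; // ≈ $556
const vercel1024MB = REQUESTS * WALL_SECONDS * 1.0 * PER_GB_SECOND; // ≈ $2222

// Cloudflare Workers bills CPU time plus a per-request fee.
const cloudflare = 2 + 0.3; // ~$2 for 50ms CPU * 1m requests, plus $0.30 per 1m requests

// Vercel Edge bills execution units: $2 per 1m units, ~1 unit per call here.
const vercelEdge = 2;

console.log({ vercel256MB, vercel1024MB, cloudflare, vercelEdge });
```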
So, if you’re running a Next.js server doing a bunch of other tasks (and have found 1024 MB RAM appropriate), and then add a GPT-proxying route for GPT-3.5-turbo, you might find that it adds as much as a third of your OpenAI bill.

So, should you use Vercel for this?

A ~5-30% tax on all the OpenAI traffic you pass through it is nothing to scoff at, but if you’re using Vercel for web hosting anyway, it’s fine to proxy LLMs via Vercel initially in the name of reducing complexity. At least until you’re spending >$1-5k every month on paying off the troll.
Vercel seems to be positioning themselves as “the convenient option you eventually migrate off of” in other places too, so perhaps that’s how you should conceive of them.
Update: This is actually not the case if you’re able to use Vercel Edge for the routes that proxy OpenAI requests, since Edge bills based on CPU-time, not wall-time.
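If you’re on Next.js, the switch is typically a one-line runtime declaration on the proxy route. A sketch, assuming the App Router and the same hypothetical route as before:

```ts
// Opting the (hypothetical) proxy route into the Edge runtime, which
// bills on CPU time rather than wall time spent awaiting OpenAI.
export const runtime = "edge";

export async function POST(req: Request): Promise<Response> {
  const upstream = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: await req.text(), // forward the client's request body as-is
  });
  return new Response(upstream.body, { status: upstream.status });
}
```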