Survey of LLM Fine-tuning APIs (Apr 2025)
Notable LLM fine-tuning APIs
| Provider | Model | Training cost ($/MTok) | Inference cost ($/MTok, in → out) | Context limit (tokens) |
| --- | --- | --- | --- | --- |
| OpenAI | GPT-4.1 | $25 (SFT/DPO) | $3 → $12 | 65k training, 128k inference |
| OpenAI | GPT-4.1-mini | $5 (SFT/DPO) | $0.80 → $3.20 | 65k training, 128k inference |
| Google Vertex AI† | Gemini 2.0 Flash-Lite | $1 (SFT only) | $0.075 → $0.30* | 131k |
| Google Vertex AI† | Gemini 2.0 Flash | $3 (SFT only) | $0.15 → $0.60* | 131k |
| Together† | DeepSeek-R1-Distill-Llama-70B | $2.90 (SFT), $7.25 (DPO) | $2 → $2 | 32k |
*: Unchanged from base model pricing
†: Also offers non-adapter fine-tuning, requiring dedicated GPU capacity to deploy
Notes:
- The above list is non-exhaustive: Together offers most open-source models, Google offers fine-tuning for Gemini-1.5, and OpenAI offers fine-tuning for some older models.
- Providers generally impose substantially lower context limits on fine-tuned models than on their base models, which gets tricky for use-cases that involve 200k+ token inputs.
- Google’s Vertex AI is the only provider with a >100k context fine-tune option.
- OpenAI lets you run inference on fine-tuned models with long context, but doesn't let you train on long-context examples.
- Together's highest limit is 32k.
- Ultimately: a DIY fine-tune on OSS models might be a good call for long-context, but setting that up might be non-trivial too.
- Google's Gemini-2.0 series is the only model family available for multi-modal fine-tuning, covering text, PDFs, audio & images (a rough dataset sketch follows these notes).
- More generally: Gemini is unusually good at processing input audio & documents vs other providers.
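To make the multi-modal point concrete, here's a rough sketch of a single multi-modal training example in the Gemini "contents" style. The field names follow the Gemini request schema as I understand it, and the bucket URI and text are placeholders, so treat this as illustrative rather than the exact Vertex AI tuning schema.

```python
import json

# Illustrative only: one multi-modal tuning example in Gemini's "contents" style.
# The exact Vertex AI tuning dataset schema may differ; check the current docs.
example = {
    "contents": [
        {
            "role": "user",
            "parts": [
                {"fileData": {"mimeType": "application/pdf",
                              "fileUri": "gs://my-bucket/invoice-0001.pdf"}},  # placeholder URI
                {"text": "Extract the invoice total as JSON."},
            ],
        },
        {"role": "model", "parts": [{"text": "{\"total\": 1234.56}"}]},
    ]
}

# Tuning datasets are JSONL: one example object per line.
with open("train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```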
Example fine-tunes
Here are some real fine-tunes I ran recently, to give you a better sense of what prices look like. Providers generally recommend starting with 10-1000 examples, and increasing dataset size only if doing so seems to improve quality further.
Example 1
- Provider: OpenAI
- Base model: GPT-4o
- Dataset size: 50 examples
- Tokens: 200k
- Epochs: 3
- Unit cost: $25/MTok
- Cost: $17
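For reference, kicking off a run like Example 1 via the OpenAI Python SDK looks roughly like the sketch below; the training file path and model snapshot string are placeholders, so substitute whatever fine-tunable snapshot you're actually targeting.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the JSONL training set (here, ~50 chat-format examples).
train_file = client.files.create(
    file=open("train.jsonl", "rb"),  # placeholder path
    purpose="fine-tune",
)

# Launch the supervised fine-tune; 3 epochs to match the run above.
job = client.fine_tuning.jobs.create(
    model="gpt-4o-2024-08-06",  # placeholder snapshot
    training_file=train_file.id,
    hyperparameters={"n_epochs": 3},
)
print(job.id, job.status)
```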
Example 2
- Provider: Google Vertex AI
- Base model: Gemini-2.0-Flash
- Dataset size: 2000 examples
- Tokens: 75M
- Epochs: 2
- Unit cost: $3/MTok
- Cost: $450
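The billing in both examples is essentially dataset tokens × epochs × unit price. Here's a tiny helper (my own, not from any provider SDK) that makes the arithmetic explicit; it reproduces Example 2 exactly and lands within a couple of dollars of Example 1's bill.

```python
def estimate_training_cost(dataset_tokens: int, epochs: int, price_per_mtok: float) -> float:
    """Rough fine-tuning cost: every token in the dataset is trained on once per epoch."""
    return dataset_tokens * epochs * price_per_mtok / 1e6

print(estimate_training_cost(200_000, 3, 25.0))    # Example 1: ~$15 (billed $17)
print(estimate_training_cost(75_000_000, 2, 3.0))  # Example 2: $450
```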
Adapters & RL methods
LoRAs (low-rank adaptation) and other adapter approaches allow you to fine-tune a model by making a “diff” that’s cheap to apply at inference time.
The major implication of this is that adapter-based fine-tunes are often offered serverlessly (ie: you don’t need to reserve GPUs that only run your model), and often at the same cost as the base model. Thus, they’re the most commonly offered option.
This makes fine-tuning much more tractable, even at the early stages. I'm currently unclear on the net downside of adapters (ie: how bad the quality trade-off is).
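For intuition on what that "diff" is: a LoRA adds a pair of low-rank matrices on top of each frozen weight matrix, so only those small factors need to be stored and loaded per customer. Here's a minimal numpy sketch with toy dimensions (not any particular model's shapes).

```python
import numpy as np

d_out, d_in, rank, alpha = 1024, 1024, 16, 32   # toy sizes; rank << d

W = np.random.randn(d_out, d_in)        # frozen base weight
A = np.random.randn(rank, d_in) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, rank))             # starts at zero, so the adapter is a no-op initially

def forward(x: np.ndarray) -> np.ndarray:
    # Base path plus the low-rank update, with the standard alpha/rank scaling.
    return x @ W.T + (alpha / rank) * (x @ A.T @ B.T)

# The "diff" is just A and B: 2 * rank * d floats instead of d * d, which is why
# providers can serve many adapters on top of one shared base model deployment.
x = np.random.randn(4, d_in)
print(forward(x).shape)  # (4, 1024)
```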
Some providers have begun offering preference- and RL-based fine-tuning, ie: DPO or GRPO, in addition to regular supervised fine-tuning (SFT). DPO requires a slightly different dataset format: a preferred & a non-preferred output for each input, rather than just a single good output (sketched below).
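To make that concrete, here's roughly what one training example looks like in each format, using OpenAI-style chat JSONL. The field names are how I remember OpenAI's SFT and preference formats; other providers use different schemas, so double-check the docs before formatting a dataset.

```python
import json

# SFT: one input, one good output.
sft_example = {
    "messages": [
        {"role": "user", "content": "Summarise this ticket: ..."},
        {"role": "assistant", "content": "Customer reports a billing error on order #123."},
    ]
}

# DPO: the same input, plus a preferred and a non-preferred output.
dpo_example = {
    "input": {"messages": [{"role": "user", "content": "Summarise this ticket: ..."}]},
    "preferred_output": [{"role": "assistant", "content": "Customer reports a billing error on order #123."}],
    "non_preferred_output": [{"role": "assistant", "content": "Thanks for reaching out!"}],
}

# Both datasets are JSONL: one example per line.
with open("sft.jsonl", "a") as f:
    f.write(json.dumps(sft_example) + "\n")
with open("dpo.jsonl", "a") as f:
    f.write(json.dumps(dpo_example) + "\n")
```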
Non-API approaches
You have three options:
- Figure out training & inference yourself (OSS models only)
- Figure out training yourself, but hand it off to someone for inference (OSS models only)
- Hand a dataset to someone, and get an inference-ready API
This blog post has focused primarily on Option 3. Why?
Option 1 is likely the trickiest. Publicly available inference implementations are not as well optimized as those run by private companies; custom kernels & speculative decoding in particular are where you'd fall short. You'd also need to carry the ops burden of keeping a cluster up.
Option 2 is perhaps better if you're trying to train in a more unusual way. Some companies will handle inference for you and offer varying amounts of somewhat bespoke support on the training side. Parasail and Together come to mind; I've not used either, but I've heard from friends who do.