Survey of LLM Fine-tuning APIs (Apr 2025)

Notable LLM fine-tuning APIs

| Provider | Model | Training cost ($/MTok) | Inference cost ($/MTok, in → out) | Context Limit (tokens) |
| --- | --- | --- | --- | --- |
| OpenAI | GPT 4.1 | $25 (SFT/DPO) | $3 → $12 | 65k training, 128k inference |
| OpenAI | GPT 4.1-mini | $5 (SFT/DPO) | $0.80 → $3.20 | 65k training, 128k inference |
| Google Vertex AI† | Gemini 2.0 Flash-Lite | $1 (SFT only) | $0.075 → $0.30* | 131k |
| Google Vertex AI† | Gemini 2.0 Flash | $3 (SFT only) | $0.15 → $0.60* | 131k |
| Together† | DeepSeek-R1-Distill-Llama-70B | $2.90 (SFT), $7.25 (DPO) | $2 → $2 | 32k |
*: Unchanged from base model pricing
†: Also offers non-adapter fine-tuning, requiring dedicated GPU capacity to deploy
Notes:
  • The above list is non-exhaustive: Together offers most open-source models, Google offers fine-tuning for Gemini-1.5, and OpenAI offers fine-tuning for some older models.
  • Providers generally have substantially lower context limits for fine-tuned models than for base models - it gets tricky for use-cases that involve 200k+ token inputs.
    • Google’s Vertex AI is the only provider with a >100k context fine-tune option.
    • OpenAI lets you run inference on fine-tuned models with long context (128k), but caps training examples at 65k tokens.
    • Together’s highest limit is 32k.
    • Ultimately: a DIY fine-tune on OSS models might be a good call for long-context, but setting that up might be non-trivial too.
  • Google’s Gemini-2.0 series is the only model family available for multi-modal fine-tuning - including text, PDFs, audio & images.
    • More generally: Gemini is unusually good at processing input audio & documents vs other providers.
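
To make the hosted-API workflow above concrete, here's a minimal sketch of launching an SFT job with the OpenAI Python SDK. The file name and model snapshot string are placeholders, and Vertex AI and Together expose analogous but differently-shaped APIs.

```python
# Minimal sketch: launching a supervised fine-tune via the OpenAI Python SDK.
# Assumes a JSONL file of chat-formatted examples; the snapshot name below is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Upload the training data (one {"messages": [...]} object per line)
train_file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)

# 2. Kick off the fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=train_file.id,
    model="gpt-4.1-mini-2025-04-14",  # placeholder snapshot name
)

# 3. Poll until it finishes; the resulting model id is then used like any other model
print(client.fine_tuning.jobs.retrieve(job.id).status)
```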

Example fine-tunes

Here are some real fine-tunes I ran recently, to give you a better sense of what prices look like. Providers generally recommend starting with 10-1000 examples, and then increasing the dataset size only if that seems to improve quality further.

Example 1

  • Provider: OpenAI
  • Base model: GPT-4o
  • Dataset size: 50
  • Tokens: 200k
  • Epochs: 3
  • Unit Cost: $25/MTok
  • Cost: $17

Example 2

  • Provider: Google Vertex AI
  • Base model: Gemini-2.0-Flash
  • Dataset size: 2000
  • Tokens: 75M
  • Epochs: 2
  • Unit Cost: $3/MTok
  • Cost: $450
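
In both cases the bill is essentially dataset tokens × epochs × the per-token training price. Here's a quick back-of-the-envelope check; the small gap versus the $17 billed in Example 1 presumably reflects the 200k token figure being rounded.

```python
# Back-of-the-envelope fine-tuning cost: training tokens x epochs x $/MTok.
def ft_cost(tokens: float, epochs: int, usd_per_mtok: float) -> float:
    return tokens * epochs * usd_per_mtok / 1e6

print(ft_cost(200_000, 3, 25))    # Example 1: ~$15
print(ft_cost(75_000_000, 2, 3))  # Example 2: $450
```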

Adapters & RL methods

LoRA (low-rank adaptation) and other adapter approaches let you fine-tune a model by training a small “diff” that’s cheap to apply at inference time.
The major implication of this is that adapter-based fine-tunes are often offered serverlessly (ie: you don’t need to reserve GPUs that only run your model), and often at the same cost as the base model. Thus, they’re the most commonly offered option.
This makes fine-tuning much more tractable, even at the early stages. I’m currently unclear on the net downside of adapters (ie: how bad the quality trade-off is).
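For intuition, the “diff” for a given weight matrix W is just a pair of small matrices whose product gets added to W at inference time. Here's a minimal numpy sketch; the hidden size, rank, and scaling are purely illustrative.

```python
import numpy as np

d, r = 1024, 8                    # hidden size, and a much smaller adapter rank
W = np.random.randn(d, d)         # frozen base weight
A = np.random.randn(r, d) * 0.01  # trained low-rank factor (part of "the diff")
B = np.zeros((d, r))              # B starts at zero, so the adapter begins as a no-op

def adapted_forward(x: np.ndarray, alpha: float = 32.0) -> np.ndarray:
    # Effective weight is W + (alpha / r) * B @ A; only A and B are trained,
    # so the adapter stores ~2*d*r parameters instead of d*d.
    return x @ (W + (alpha / r) * B @ A).T
```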
Some providers have begun offering preference- and RL-based fine-tuning, ie: DPO or GRPO, in addition to regular supervised fine-tuning (SFT). DPO requires a slightly different dataset format: pairs of preferred & rejected outputs in response to each input, rather than just a single target output.
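To make the format difference concrete, here's roughly what a single training record looks like in each case, shown as Python dicts. The field names are illustrative; each provider documents its own exact schema.

```python
# Roughly what one training record looks like for SFT vs DPO.
# Field names here are illustrative; check your provider's docs for the exact schema.
sft_record = {
    "messages": [
        {"role": "user", "content": "Summarize this support ticket: ..."},
        {"role": "assistant", "content": "Customer reports a double charge on ..."},  # single target output
    ]
}

dpo_record = {
    "input": [{"role": "user", "content": "Summarize this support ticket: ..."}],
    "preferred_output": [{"role": "assistant", "content": "Customer reports a double charge on ..."}],
    "non_preferred_output": [{"role": "assistant", "content": "Ticket received, will look later."}],
}
```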

Non-API approaches

You have three options:
  1. Figure out training & inference yourself (OSS models only)
  2. Figure out training yourself, but hand it off to someone for inference (OSS models only)
  3. Hand a dataset to someone, and get an inference-ready API
This blog post has focused primarily on Option 3. Why?
Option 1 is likely the trickiest. Publicly available inference implementations are not as well optimized as those run by private companies - custom kernels & speculative decoding are where you’d fall short in particular. You’d also need to shoulder the ops burden of keeping a cluster up.
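For a sense of the starting point, a bare-bones Option 1 setup usually means serving the checkpoint with an open-source engine such as vLLM; the model path below is a placeholder, and the gap between this and a production stack is exactly the kernels/speculative-decoding/ops work described above.

```python
# Bare-bones DIY inference (Option 1) with vLLM; the checkpoint path is a placeholder.
# Closing the gap to a production-grade stack (custom kernels, speculative decoding,
# autoscaling, monitoring) is the hard part alluded to above.
from vllm import LLM, SamplingParams

llm = LLM(model="my-org/my-finetuned-llama-70b")   # hypothetical fine-tuned OSS checkpoint
params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Summarize this support ticket: ..."], params)
print(outputs[0].outputs[0].text)
```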
Option 2 is perhaps better if you’re trying to train in a more unusual manner. Some companies can handle inference for you, and offer varying amounts of training support in a somewhat bespoke manner. Parasail and Together come to mind; I’ve not used either, but I’ve heard from friends who use them.
