Google Gemini has the worst LLM API
Google is on the frontier with recent Gemini releases. They now have:
- A competitive coding & reasoning model (Gemini-2.5-pro)
- The longest-context frontier models (1M or 2M tokens of context; OpenAI only just reached parity, since GPT-4.1 also has a 1M-token window)
- The best frontier audio, video & document multimodal models (some others support audio, but at a much higher cost)
- The best long-context fine-tuning offering (Gemini-2.0-flash can be tuned with up to 131k tokens of context; others top out earlier)
- The best multimodal fine-tuning offering (Gemini-2.0-flash can be tuned with audio, image & documents)
And yet these models hide behind the poorest developer experience in the market.
1. The Gemini API is available in two places, with differing functionality
You can use Gemini via either Vertex AI, or Google AI Studio. Google Developer Advocates will tell you: if you’re a startup or hobbyist, you should use Google AI Studio, as Vertex AI is more enterprise-focused.
That makes sense: it parallels the distinction between the OpenAI API and the Azure OpenAI API, or between the Anthropic API and Claude on Amazon Bedrock.
It’s not unusual to offer an enterprise API with stronger security or compliance guarantees, but slower feature roll-out.

But unfortunately that comparison doesn’t hold. Functionality is released to AI Studio & Vertex AI at different times, and some functionality never comes to AI Studio.
This often becomes a dealbreaker and you end up needing to juggle both options as a startup.
2. The documentation is bad
There are two documentation sites, one for AI Studio and one for Vertex AI. You will land on the wrong one.
This gets worse because of the non-equivalent functionality. For example, AI Studio’s docs may lead you to believe that Google only supports fine-tuning for Gemini 1.5. In fact, Vertex AI supports fine-tuning for Gemini 2.0; it’s just that AI Studio does not.
In general, the docs are worse than most. Much of the AI Studio documentation still refers to Gemini 1.5, which is now deprecated, and leaves it generally unclear whether Gemini 2.5 still supports the same features.
The API is also the quirkiest around. For example, there are “safety settings” that reject certain requests by default, but can be disabled. There is an OpenAI-compatible SDK for Google AI Studio (not for Vertex AI), but it doesn’t support multimodality. You should probably avoid it.
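As a concrete example of the safety-settings quirk, here’s roughly what opting out looks like against the AI Studio REST endpoint. Treat it as a sketch: the model name and environment variable are placeholders, and you should double-check the category/threshold values against the current docs.

```ts
// Sketch: disabling the default safety filters on a generateContent request.
// GEMINI_API_KEY and the model name are placeholders.
const body = {
  contents: [{ role: 'user', parts: [{ text: 'Summarize this support ticket…' }] }],
  safetySettings: [
    { category: 'HARM_CATEGORY_HARASSMENT', threshold: 'BLOCK_NONE' },
    { category: 'HARM_CATEGORY_HATE_SPEECH', threshold: 'BLOCK_NONE' },
    { category: 'HARM_CATEGORY_SEXUALLY_EXPLICIT', threshold: 'BLOCK_NONE' },
    { category: 'HARM_CATEGORY_DANGEROUS_CONTENT', threshold: 'BLOCK_NONE' },
  ],
};

const res = await fetch(
  `https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key=${process.env.GEMINI_API_KEY}`,
  {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body),
  }
);
console.log(await res.json());
```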
3. The Vertex AI SDK doesn’t support API key authentication
Most LLM providers use bearer authentication with an API key. Not the Vertex AI SDK: you’re going to need to figure out Google Cloud’s auth methods and the right way to store its credentials.json file in your secrets system.
This gets messier if, for example, you’re using a router that lets you bring your own key.
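To make the contrast concrete, here’s a rough sketch of the two auth flows in TypeScript. The Vertex side uses google-auth-library and assumes GOOGLE_APPLICATION_CREDENTIALS points at your credentials.json; treat it as a sketch of the shape, not the one blessed setup.

```ts
import { GoogleAuth } from 'google-auth-library';

// Most providers: one static header you can drop into any secrets store.
//   Authorization: Bearer <API_KEY>
//
// Vertex AI: a credentials.json service account (located via
// GOOGLE_APPLICATION_CREDENTIALS) is exchanged for a short-lived OAuth
// access token on each call.
const auth = new GoogleAuth({
  scopes: 'https://www.googleapis.com/auth/cloud-platform',
});

async function vertexHeaders(): Promise<Record<string, string>> {
  const client = await auth.getClient();
  const { token } = await client.getAccessToken();
  return {
    Authorization: `Bearer ${token}`,
    'Content-Type': 'application/json',
  };
}
```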
4. The official Vertex AI TS SDK can’t use fine-tuned models (!!)
Let’s say you fine-tune a Gemini-2.0-Flash model with Vertex AI.
The process is a little more annoying than with OpenAI:
- You need to manually deploy an endpoint, rather than just training a model and using it.
- Naming the deployed endpoint anything but the default name seems to throw an error of “Internal error occurred. Contact Vertex AI”.
- You get a less-readable 18-digit endpoint ID to call, instead of a readable model name with a configurable suffix.
But the biggest gotcha comes at the very end: you try swapping your fine-tuned endpoint ID into the TypeScript SDK, and it just isn’t supported.
It’s not that the TS SDK offers a different function for fine-tuned models; to the best of my knowledge, it doesn’t support fine-tuned models at all. The best they can offer is the REST API.
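For reference, here’s roughly what that raw REST call looks like for a deployed fine-tuned endpoint. The project, region and endpoint ID are placeholders, and the token flow is the google-auth-library dance from the auth section above.

```ts
import { GoogleAuth } from 'google-auth-library';

// Placeholders for your GCP project, region, and the 18-digit deployed endpoint ID.
const PROJECT = process.env.VERTEX_PROJECT!;
const LOCATION = process.env.VERTEX_LOCATION!;
const ENDPOINT_ID = process.env.VERTEX_ENDPOINT_ID!;

const auth = new GoogleAuth({ scopes: 'https://www.googleapis.com/auth/cloud-platform' });
const { token } = await (await auth.getClient()).getAccessToken();

// generateContent against the deployed endpoint, rather than a named model.
const url =
  `https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT}` +
  `/locations/${LOCATION}/endpoints/${ENDPOINT_ID}:generateContent`;

const res = await fetch(url, {
  method: 'POST',
  headers: { Authorization: `Bearer ${token}`, 'Content-Type': 'application/json' },
  body: JSON.stringify({
    contents: [{ role: 'user', parts: [{ text: 'Hello from my fine-tune' }] }],
  }),
});
console.log((await res.json()).candidates?.[0]);
```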
It’s in fact easier to use a fine-tuned Vertex AI model with a third-party router, such as the Vercel AI SDK.
5. The prefix caching offering is very developer-unfriendly
Prefix caching lets you reduce costs when the same prefix is re-used in multiple requests. Consider three API designs for prefix caching:
- The provider automatically caches & reuses prefixes found in LLM completion requests.
- The developer marks sections which should be cached in an LLM completion request, and the provider caches them & reuses them in subsequent requests.
- The developer sends prefixes to be cached to a separate API endpoint, receiving a cache ID. The developer then includes a cache ID with a request if that cache entry ought to be used, and then the provider uses that cached prefix.
If you didn’t guess: 1 is OpenAI, 2 is Anthropic, and 3 is Gemini.
Setting up prefix caching can save you quite a bit - but for Gemini it’s a whole process of its own. The moment you’re dealing with large contexts that are dynamic but re-used, you need to build your own system to keep track of which caches correspond to which data.
Gemini caches also don’t have their TTL re-upped upon hit, so you need to manually call an API to keep the cache entry around for longer.
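To make the ergonomics concrete, here’s a rough sketch of the full lifecycle against the AI Studio cachedContents API: create the cache, reference it by ID on each request, then manually bump the TTL. Field names follow the public REST API, but the model version, TTLs and shared-prefix content are placeholders; verify the details against the current docs before relying on this.

```ts
const BASE = 'https://generativelanguage.googleapis.com/v1beta';
const KEY = process.env.GEMINI_API_KEY;
const LONG_SHARED_PREFIX = '…the big document or system prompt you keep re-sending…';

// 1. Create a cache entry for the shared prefix and remember its ID.
const createRes = await fetch(`${BASE}/cachedContents?key=${KEY}`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'models/gemini-2.0-flash-001',
    contents: [{ role: 'user', parts: [{ text: LONG_SHARED_PREFIX }] }],
    ttl: '3600s',
  }),
});
const { name: cacheId } = await createRes.json(); // e.g. "cachedContents/abc123"

// 2. Reference the cache ID on every completion request that reuses the prefix.
await fetch(`${BASE}/models/gemini-2.0-flash-001:generateContent?key=${KEY}`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    cachedContent: cacheId,
    contents: [{ role: 'user', parts: [{ text: 'A question about the cached document' }] }],
  }),
});

// 3. Hits don't extend the TTL, so bump it yourself if you still need the entry.
await fetch(`${BASE}/${cacheId}?key=${KEY}`, {
  method: 'PATCH',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ ttl: '3600s' }),
});
```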
How to use Gemini anyways
You likely still need to use Gemini. Their models are the cheapest long-context & multi-modal models around.
You should probably start with Vertex AI and assume that it’ll be the main surface you use, but integrate with Google AI Studio as well, just in case.
You should use an LLM router to integrate with both Google AI Studio and the Vertex AI SDK, and to sidestep the quirks of their APIs.
Vercel AI SDK is my default option, as it runs locally in your codebase (rather than as a hosted proxy) and is maintained by Vercel.
OpenRouter is cool, especially since they now offer to handle the Gemini prompt caching for you too. But they charge a 5% markup.
Other notable options include LiteLLM, Helicone, and Keywords AI.
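As a sketch of what the router approach buys you, here’s the Vercel AI SDK pointed at both surfaces, via its @ai-sdk/google (AI Studio) and @ai-sdk/google-vertex (Vertex AI) providers. The model IDs and env vars are placeholders.

```ts
import { generateText } from 'ai';
import { createGoogleGenerativeAI } from '@ai-sdk/google'; // AI Studio (API key auth)
import { createVertex } from '@ai-sdk/google-vertex';      // Vertex AI (GCP auth)

const aiStudio = createGoogleGenerativeAI({ apiKey: process.env.GEMINI_API_KEY });
const vertex = createVertex({
  project: process.env.VERTEX_PROJECT!,
  location: process.env.VERTEX_LOCATION!,
});

// Same call shape against either surface; swap providers based on which one
// actually has the feature you need that week.
const { text } = await generateText({
  model: aiStudio('gemini-2.0-flash'),
  // model: vertex('gemini-2.0-flash'),
  prompt: 'Summarize the attached contract in three bullet points.',
});
console.log(text);
```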
Appendix: Using fine-tuned Gemini models with the Vercel AI SDK
Vercel’s AI SDK still needs a little bit of patching to use a model fine-tuned via Vertex AI. Here’s how you do it:
```ts
import { createVertex } from '@ai-sdk/google-vertex';

// Patch the Vercel AI SDK's fetch: rewrite the default model path to the
// endpoints path, so requests hit the deployed fine-tuned endpoint.
function patchedFetchForFinetune(
  requestInfo: RequestInfo | URL,
  requestInit?: RequestInit
): Promise<Response> {
  function patchString(str: string) {
    return str.replace(`/publishers/google/models`, `/endpoints`)
  }
  if (requestInfo instanceof URL) {
    let patchedUrl = new URL(requestInfo)
    patchedUrl.pathname = patchString(patchedUrl.pathname)
    return fetch(patchedUrl, requestInit)
  }
  if (requestInfo instanceof Request) {
    let patchedUrl = patchString(requestInfo.url)
    let patchedRequest = new Request(patchedUrl, requestInfo)
    return fetch(patchedRequest, requestInit)
  }
  if (typeof requestInfo === 'string') {
    let patchedUrl = patchString(requestInfo)
    return fetch(patchedUrl, requestInit)
  }
  // Should never happen
  throw new Error('Unexpected requestInfo type: ' + typeof requestInfo)
}

const vertexFinetuned = createVertex({
  project: VERTEX_PROJECT,
  location: VERTEX_LOCATION,
  fetch: patchedFetchForFinetune as unknown as typeof globalThis.fetch,
})

const model = vertexFinetuned(VERTEX_ENDPOINT_ID)
```