One image_url, 40+ models: multimodal passthrough

The problem

NexToken is an OpenAI-compatible gateway: keep your existing OpenAI client, point base_url at us, and reach 40+ models across providers (OpenAI, Anthropic, Google, Qwen, local) through one /v1/chat/completions endpoint and one key.

Text worked everywhere. Images didn't. If you sent the standard OpenAI vision shape — a content array mixing text and image_url parts — the request was rejected, because our message schema only accepted a string. Anyone doing document, chart, or screenshot analysis was stuck. This is how we closed that gap without breaking a single existing call.

Constraint 1: backward compatibility is non-negotiable

A large volume of existing traffic sends content as a plain string. The fix had to keep that working byte-for-byte. So content became a union:

content: str | list[dict] | None

String in → unchanged path. List in → new multimodal path. No migration for anyone.
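A minimal sketch of how that union might be routed, assuming a hypothetical helper named classify_content (the name, the return labels, and the validation rules are illustrative, not the gateway's actual code):

```python
from typing import Any, Union

# The union from the post, spelled out for older type checkers.
Content = Union[str, list[dict[str, Any]], None]


def classify_content(content: Content) -> str:
    """Decide which path a message's content takes (hypothetical helper)."""
    if content is None:
        return "empty"
    if isinstance(content, str):
        # Legacy path: the string is forwarded untouched.
        return "text"
    if isinstance(content, list):
        # New path: every part must be an object carrying a "type" key.
        for part in content:
            if not isinstance(part, dict) or "type" not in part:
                raise ValueError("each content part must be an object with a 'type'")
        return "multimodal"
    raise TypeError(f"unsupported content type: {type(content).__name__}")
```

Because the string check comes first and returns immediately, existing callers never reach the list validation at all.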

Constraint 2: every provider speaks a different dialect

OpenAI's vision format is a content array of {"type":"text"} and {"type":"image_url"} parts. Anthropic's Messages API expects {"type":"image","source":{…}} blocks, and it expresses base64 and URL sources differently. Our job as a gateway is to absorb that difference so you never see it:

# OpenAI image_url  ->  Anthropic image block
data: URL    ->  {"type":"image","source":{"type":"base64","media_type":…,"data":…}}
http(s) URL  ->  {"type":"image","source":{"type":"url","url":…}}

Plain strings still pass straight through, so we never pay conversion cost on the common text path.
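The mapping above can be sketched as a single function. This is an illustration under stated assumptions: the function name openai_part_to_anthropic and the choice to reject non-base64 data URLs are ours, not a description of the production translator.

```python
import base64
import binascii
from typing import Any


def openai_part_to_anthropic(part: dict[str, Any]) -> dict[str, Any]:
    """Translate one OpenAI content part into an Anthropic content block."""
    kind = part.get("type")
    if kind == "text":
        return {"type": "text", "text": part["text"]}
    if kind != "image_url":
        raise ValueError(f"unsupported content part type: {kind!r}")

    url = part["image_url"]["url"]
    if url.startswith("data:"):
        # data:<media_type>;base64,<payload>
        header, sep, payload = url[len("data:"):].partition(",")
        media_type = header.split(";")[0]
        if not sep or not media_type or not payload or not header.endswith(";base64"):
            raise ValueError("malformed data URL")
        try:
            base64.b64decode(payload, validate=True)  # validate only; forward as-is
        except binascii.Error as exc:
            raise ValueError("malformed base64 payload") from exc
        return {
            "type": "image",
            "source": {"type": "base64", "media_type": media_type, "data": payload},
        }
    if url.startswith(("http://", "https://")):
        return {"type": "image", "source": {"type": "url", "url": url}}
    raise ValueError("image_url must be a data: or http(s) URL")
```

Validating the payload at this point means a malformed data URL is rejected by the gateway rather than by the upstream provider.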

Constraint 3: fail loudly, in the right place

Send an image to a model that can't see, and the worst outcome is a cryptic error from an upstream provider after billing pre-checks have already run. Our catalog tags each model with its capabilities, so we gate at the edge:

image content + non-vision model
  ->  400 NEX_MODEL_NO_VISION
      "Model 'X' does not support image input.
       Use claude-sonnet-4-6, gpt-4o, …"

You learn what's wrong before the request leaves our building.
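Here is one way such a gate could look. The CATALOG dict, the GatewayError class, and the model id "text-only-model" are hypothetical stand-ins. Only the status code, the error code, and the message wording come from the example above.

```python
from typing import Any

# Hypothetical capability catalog; the real one is not shown in this post.
CATALOG: dict[str, set[str]] = {
    "claude-sonnet-4-6": {"vision"},
    "gpt-4o": {"vision"},
    "text-only-model": set(),  # made-up id for illustration
}


class GatewayError(Exception):
    """Error returned to the caller before anything is sent upstream."""

    def __init__(self, status: int, code: str, message: str) -> None:
        super().__init__(message)
        self.status, self.code, self.message = status, code, message


def has_image(messages: list[dict[str, Any]]) -> bool:
    """True if any message carries an image_url part."""
    for message in messages:
        content = message.get("content")
        if isinstance(content, list) and any(
            isinstance(p, dict) and p.get("type") == "image_url" for p in content
        ):
            return True
    return False


def check_vision(model: str, messages: list[dict[str, Any]]) -> None:
    """Raise a 400 if image content is aimed at a model without vision."""
    if has_image(messages) and "vision" not in CATALOG.get(model, set()):
        suggestions = ", ".join(sorted(m for m, caps in CATALOG.items() if "vision" in caps))
        raise GatewayError(
            400,
            "NEX_MODEL_NO_VISION",
            f"Model '{model}' does not support image input. Use {suggestions}.",
        )
```

Text-only requests skip the check entirely, so the gate adds nothing to the plain-string path beyond a scan of the message list.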

Constraint 4: don't open a compliance hole

We run content moderation on prompts before forwarding upstream — it protects the shared provider accounts every customer depends on. Text was moderated; images would have bypassed it. So image parts now go through the same moderation step. Closing that gap was part of shipping, not a follow-up.
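As a rough illustration of "the same moderation step", the sketch below walks every message and yields both text and image parts to a single checker. The names moderation_inputs and is_allowed, and the check callback's signature, are assumptions made for this example. The actual moderation service is not described here.

```python
from typing import Any, Callable, Iterator


def moderation_inputs(messages: list[dict[str, Any]]) -> Iterator[tuple[str, str]]:
    """Yield (kind, value) for every piece of content that must be moderated."""
    for message in messages:
        content = message.get("content")
        if isinstance(content, str):
            yield ("text", content)
        elif isinstance(content, list):
            for part in content:
                if part.get("type") == "text":
                    yield ("text", part["text"])
                elif part.get("type") == "image_url":
                    # Images are no longer skipped: they reach the same checker.
                    yield ("image", part["image_url"]["url"])


def is_allowed(messages: list[dict[str, Any]], check: Callable[[str, str], bool]) -> bool:
    """Forward upstream only if the (hypothetical) checker accepts every input."""
    return all(check(kind, value) for kind, value in moderation_inputs(messages))
```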

What it looks like to you

curl https://api.nextoken.biz/v1/chat/completions \
  -H "Authorization: Bearer $NEX_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-6",
    "max_tokens": 1024,
    "messages": [{ "role": "user", "content": [
      {"type": "text", "text": "What does this chart show?"},
      {"type": "image_url",
       "image_url": {"url": "data:image/png;base64,…"}}
    ]}]
  }'

The response is the OpenAI shape you already parse — choices[0].message.content — and every response also reports the exact cost of that call in nex.cost_usd. image_url accepts both base64 data URLs and http(s) URLs.
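For readers who prefer Python to curl, the same request can be sketched with only the standard library. The helper names are ours. We also assume that nex.cost_usd arrives as a cost_usd key inside a top-level nex object in the JSON body, so check an actual response before relying on that.

```python
import base64
import json
import urllib.request
from typing import Any, Optional

BASE_URL = "https://api.nextoken.biz/v1"  # taken from the curl example


def build_vision_payload(model: str, prompt: str, image_bytes: bytes,
                         media_type: str = "image/png",
                         max_tokens: int = 1024) -> dict[str, Any]:
    """Build the OpenAI-style body with one text part and one image part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:{media_type};base64,{encoded}"}},
        ]}],
    }


def make_request(api_key: str, payload: dict[str, Any]) -> urllib.request.Request:
    """Prepare (but do not send) the POST to /v1/chat/completions."""
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )


def read_reply(body: dict[str, Any]) -> tuple[str, Optional[float]]:
    """Extract the answer and the per-call cost (assumed location, see above)."""
    text = body["choices"][0]["message"]["content"]
    cost = body.get("nex", {}).get("cost_usd")
    return text, cost
```

To send it, pass the prepared request to urllib.request.urlopen, decode the response with json.load, and hand the result to read_reply.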

Vision-capable models today (use the id exactly as it appears in /v1/models): claude-sonnet-4-6, claude-opus-4-6, claude-haiku-4-5-20251001, gpt-4o, gpt-4o-mini, gpt-4.1, gemini-2.5-pro. Send an image to any of them through the same endpoint and key.

Things we kept honest

Image tokens are counted and billed like any other input, so the nex.cost_usd reported on each response covers them.
One message can carry multiple images for multi-image analysis.
Tests cover string content, text+image arrays, base64 vs URL sources, malformed data URLs, the vision gate, and image moderation.

Takeaway

A good gateway is one your code doesn't notice. Multimodal shipped as an additive union type, a per-provider translation layer, an edge capability gate, and a moderation extension — and zero changes for anyone sending plain text. That's the bar we hold for every feature.

Build multimodal without picking a vendor

One OpenAI-compatible endpoint, 40+ models, per-request cost visibility.
