2026-05-12 · 6 min read

May 2026 — 23 shipments, zero breaking changes, 30–90% cache savings

Same API, lower bills, stricter compliance, four new model classes. Existing OpenAI SDK code keeps working — no integration changes required.

Between 2026-05-10 and 2026-05-12 we shipped 23 changes spanning cost optimisation, compliance, reliability, new modalities, new providers, enterprise features, and observability. None of them broke an existing integration. Three of them you can take advantage of today with little or no new code.

Three things to try today

1. See your cache savings

Every /v1/chat/completions response now carries nex.cached_input_tokens (upstream cache hit) and nex.semantic_cache_hit (NexToken cache hit). Cache hits are billed at the documented discount.

If you use long system prompts or repeated context (chat apps with memory, agentic loops, RAG pipelines), expect a 30–90% cost reduction on cached input tokens, visible per request rather than bundled into a monthly bill.
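A minimal sketch of reading those two fields off a response. It assumes nex.cached_input_tokens is a token count and nex.semantic_cache_hit is a boolean (the post names the fields but not their types), and it uses the standard OpenAI usage.prompt_tokens as the denominator. The cacheSummary helper and the sample object are illustrative and not part of any SDK.

```javascript
// Summarise cache behaviour for one chat-completion response.
// ASSUMPTION: nex.cached_input_tokens is a count, nex.semantic_cache_hit a boolean.
function cacheSummary(response) {
  const nex = response.nex ?? {};
  const promptTokens = response.usage?.prompt_tokens ?? 0;
  const cachedTokens = nex.cached_input_tokens ?? 0;
  return {
    promptTokens,
    cachedTokens,
    // Share of input tokens served from the upstream cache (0 when unknown).
    cachedShare: promptTokens > 0 ? cachedTokens / promptTokens : 0,
    semanticCacheHit: nex.semantic_cache_hit === true,
  };
}

// Hand-written sample response, not real gateway output.
const sample = {
  usage: { prompt_tokens: 4000, completion_tokens: 120 },
  nex: { cached_input_tokens: 3000, semantic_cache_hit: false },
};
console.log(cacheSummary(sample));
```

Logging cachedShare per request is one way to see which prompts benefit from caching before the invoice arrives.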

2. Quote a request before you pay for it

POST /v1/tokenize and POST /v1/estimate-cost accept the same payload as /v1/chat/completions and return token counts plus a retail-USD estimate. That is cheaper than discovering your nightly bill blew up because someone fed a 50K-token document into a loop.
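A rough sketch of a budget guard built on the quote endpoint. The post only says the endpoint takes the chat payload and returns a retail-USD estimate, so several details here are assumptions: the base URL is a placeholder, Bearer auth is assumed to match the chat endpoint, and estimated_cost_usd is a hypothetical field name. Check the docs for the real response shape.

```javascript
// Build a fetch() request for POST /v1/estimate-cost from a chat payload.
// ASSUMPTIONS: base URL and Bearer auth are placeholders, not confirmed by the post.
function buildEstimateRequest(baseUrl, apiKey, chatPayload) {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/v1/estimate-cost`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify(chatPayload),
    },
  };
}

// Decide whether to send. `estimated_cost_usd` is a HYPOTHETICAL field name.
function withinBudget(estimate, maxUsd) {
  const cost = estimate?.estimated_cost_usd;
  // Missing or non-numeric estimates fail closed: do not send.
  return typeof cost === "number" && Number.isFinite(cost) && cost <= maxUsd;
}

const payload = {
  model: "nex-auto",
  messages: [{ role: "user", content: "Summarise this document." }],
};
const req = buildEstimateRequest("https://gateway.example", "sk-placeholder", payload);
console.log(req.url);
// To execute the quote: const estimate = await (await fetch(req.url, req.init)).json();
```

Failing closed on an unreadable estimate is a deliberate choice here: a loop that cannot get a quote stops rather than spends.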

3. Switch to nex-auto

Set model: "nex-auto" in your existing code and the smart router picks the cheapest model that can handle each prompt — general, coder, reasoning, or long-context. The decision is surfaced in response.nex.smart_router.target_model so you can audit per request.

// Before
const r = await openai.chat.completions.create({
  model: "gpt-4o", messages
});

// After (one word changed)
const r = await openai.chat.completions.create({
  model: "nex-auto", messages
});
console.log(r.nex.smart_router); // {target_model, reason, score}
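To audit the router over many requests rather than one, a small tally is usually enough. The sketch below relies only on nex.smart_router.target_model from the post; the tallyRouterTargets helper, the "unrouted" bucket, and the model names in the sample are illustrative assumptions.

```javascript
// Count how often the smart router picked each model across a set of responses.
// Responses without a smart_router block are grouped under "unrouted".
function tallyRouterTargets(responses) {
  const counts = new Map();
  for (const r of responses) {
    const target = r?.nex?.smart_router?.target_model ?? "unrouted";
    counts.set(target, (counts.get(target) ?? 0) + 1);
  }
  return Object.fromEntries(counts);
}

// Hand-written samples; the model names are placeholders.
const routed = [
  { nex: { smart_router: { target_model: "model-a", reason: "general", score: 0.9 } } },
  { nex: { smart_router: { target_model: "model-b", reason: "coder", score: 0.8 } } },
  { nex: { smart_router: { target_model: "model-a", reason: "general", score: 0.7 } } },
  { nex: {} },
];
console.log(tallyRouterTargets(routed));
```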

Cost optimisation

| Feature | Effect | How to opt in |
| --- | --- | --- |
| Upstream prompt-cache pass-through | OpenAI / Anthropic / DeepSeek / Google cache hits now bill at the documented discount instead of full input rate. | Automatic. See nex.cached_input_tokens. |
| Semantic cache | Near-duplicate prompts (cosine ≥ 0.97 on nex-embed-zh) return the previous answer at 5% of normal retail. | Automatic on temperature ≤ 0.3 without tools/response_format. |
| Batch endpoint POST /v1/batch | Up to 100 items per call, 30% discount. | New sync endpoint. |
| Async batch POST /v1/batches | OpenAI-shape, 50% discount, 24h SLA. Eligible OpenAI batches forwarded to OpenAI's /v1/batches for additional wholesale savings. | New async endpoint. Multi-modal items welcome (chat / embeddings / images). |
| Smart router | Picks cheapest capable model; decision in nex.smart_router. | model: "nex-auto". |
| Pre-flight cost quotes | Tokenize + estimate before sending. | POST /v1/tokenize, POST /v1/estimate-cost. |
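The semantic-cache rule can be sketched locally to check whether a request would even be a candidate. This is an illustration of the stated conditions (cosine ≥ 0.97, temperature ≤ 0.3, no tools, no response_format), not the gateway's implementation. Two assumptions are baked in: a missing temperature is treated as the OpenAI default of 1, and an empty tools array counts as "without tools". The function names are hypothetical.

```javascript
// Cosine similarity between two equal-length embedding vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("vectors must be non-empty and the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vector: treat as no match
  return dot / Math.sqrt(normA * normB);
}

// Request-side conditions: temperature <= 0.3, no tools, no response_format.
// ASSUMPTION: a missing temperature counts as the OpenAI default of 1.
function isSemanticCacheEligible(request) {
  const temperature = request.temperature ?? 1;
  const hasTools = Array.isArray(request.tools) && request.tools.length > 0;
  return temperature <= 0.3 && !hasTools && request.response_format == null;
}

const SIMILARITY_THRESHOLD = 0.97; // threshold stated in the post

// True when the request is eligible and the two embeddings are near-duplicates.
function wouldHitSemanticCache(request, promptEmbedding, cachedEmbedding) {
  return (
    isSemanticCacheEligible(request) &&
    cosineSimilarity(promptEmbedding, cachedEmbedding) >= SIMILARITY_THRESHOLD
  );
}
```

In practice this means deterministic, tool-free calls such as classification or FAQ answering are the ones most likely to benefit, while high-temperature creative calls are left alone.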

Compliance & safety

| Feature | What it does |
| --- | --- |
| Content moderation | High-risk prompts (CSAM-adjacent, WMD-synthesis, account fraud) blocked at the gateway to protect shared upstream accounts. Returns 422 NEX_CONTENT_FLAGGED. |
| PII redaction | National IDs, mobile numbers, credit cards, IPv4, US SSNs, passports replaced with [REDACTED:CATEGORY] before forwarding upstream. Response reports nex.pii_redactions = {category: count}. PDPA / GDPR friendly out of the box. Enterprise tenants with a signed DPA can disable via pii_mode: "off". |
| Prompt-injection defence | Known jailbreak templates blocked. Returns 422 NEX_PROMPT_INJECTION_DETECTED with score and matched patterns. |
| Pre-flight context check | Oversized inputs rejected as 400 NEX_INPUT_TOO_LONG before paying for an upstream timeout. |
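Client code may want to tell these gateway rejections apart from ordinary upstream errors. The sketch below maps the three codes and statuses listed above to a suggested action and sums nex.pii_redactions. It assumes the code arrives as body.error.code, mirroring OpenAI-style error bodies; the post does not specify the error envelope, and the helper names and category labels are hypothetical.

```javascript
// Gateway rejection codes and HTTP statuses as listed in the post.
const GATEWAY_REJECTIONS = new Map([
  ["NEX_CONTENT_FLAGGED", { status: 422, action: "do not retry; the prompt was blocked by moderation" }],
  ["NEX_PROMPT_INJECTION_DETECTED", { status: 422, action: "do not retry; inspect the reported score and patterns" }],
  ["NEX_INPUT_TOO_LONG", { status: 400, action: "shorten or chunk the input, then resend" }],
]);

// ASSUMPTION: the code is found at body.error.code (OpenAI-style error body).
// Returns null for anything that is not a recognised gateway rejection.
function classifyRejection(status, body) {
  const code = body?.error?.code;
  const known = GATEWAY_REJECTIONS.get(code);
  if (!known || known.status !== status) return null;
  return { code, action: known.action };
}

// Sum the per-category counts in nex.pii_redactions ({category: count}).
function totalRedactions(response) {
  const counts = response?.nex?.pii_redactions ?? {};
  return Object.values(counts).reduce((sum, n) => sum + n, 0);
}

// Hand-written samples; category names are placeholders.
console.log(classifyRejection(400, { error: { code: "NEX_INPUT_TOO_LONG" } }));
console.log(totalRedactions({ nex: { pii_redactions: { CREDIT_CARD: 2, IPV4: 1 } } }));
```

None of the three rejections is worth a blind retry: the same input will be rejected again until it changes.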

Reliability

New modalities

| Endpoint | Models |
| --- | --- |
| POST /v1/images/generations | DALL-E 3, DALL-E 3 HD |
| POST /v1/audio/transcriptions | Whisper |
| POST /v1/audio/speech | TTS-1, TTS-1-HD (HD billed at 2×, already correct) |
| Vision in /v1/chat/completions | GPT-4o, Claude, Gemini. Token math fixed on image_url blocks (was over-counting on base64). |

New providers: 6 new models behind 3 backends

| Model | Provider | Notable for |
| --- | --- | --- |
| command-r-plus, command-r | Cohere | RAG-strong |
| sonar, sonar-pro | Perplexity | Web-search-grounded chat |
| grok-3, grok-3-mini | xAI | Reasoning + X-platform context |

Enterprise

| Feature | Endpoint / API |
| --- | --- |
| Prompt templates | POST /v1/templates with CRUD + /render; server-side {{variable}} substitution. 200 templates × 64 KB each. |
| Fine-tune job lifecycle | POST /v1/fine_tunes (queue) → poll → webhook on completion. OpenAI-shaped, persisted, integrated with /v1/files for training data. |
| File uploads | POST /v1/files with purpose=batch, fine-tune, assistants, or user_data. SHA-256 integrity, per-user quota. |
| Webhooks | POST /v1/webhooks: HMAC-signed POST to your URL on batch.completed, fine_tune.completed, invoice.issued, invoice.paid. Auto-retry with exponential backoff up to 6 attempts. |
| Responses API | POST /v1/responses: OpenAI's new stateful single-turn API. Pass previous_response_id and the server rebuilds the message history for you. |
| Assistants API | /v1/assistants, /v1/threads, /v1/threads/{id}/messages, /v1/threads/{id}/runs, for SDK clients still targeting the Assistants surface. |
| Reserved throughput | Per-tenant RPM / concurrency floor; Enterprise SLA opt-in. Contact tinggang@nextoken.biz. |
| SSO (SAML + OIDC) | Self-built SP. OIDC available today; SAML waiting on a deploy-side libxmlsec1 install for sites that prefer SAML. |
| Monthly invoicing | Admin-generated NET-30 invoices for enterprise accounts, PDF export, void / mark-paid lifecycle. |
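For the Responses API row, chaining a follow-up turn comes down to passing the earlier response's id. A minimal sketch, assuming the gateway follows OpenAI's Responses request shape (model, input, previous_response_id); the followUpBody helper and the sample id are illustrative, and the post does not say which other Responses fields are supported.

```javascript
// Build the body for a follow-up POST /v1/responses call that continues a
// previous turn. Field names follow OpenAI's Responses API.
function followUpBody(previousResponse, input, model) {
  if (!previousResponse?.id) {
    throw new Error("previous response has no id to chain from");
  }
  return { model, input, previous_response_id: previousResponse.id };
}

const first = { id: "resp_123" }; // placeholder id, not real output
console.log(followUpBody(first, "And in one sentence?", "gpt-4o"));
```

Because the server rebuilds the history from the id, the client sends only the new input instead of the whole message list.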

Observability

| Feature | What it does |
| --- | --- |
| Prometheus /metrics | All counters, histograms, gauges. Scrape from inside the VPC; public access is denied at the nginx layer. |
| Grafana dashboards | Four ready-to-import JSON files: API overview, provider health, billing & margin, safety gates. |
| OpenTelemetry traces | Tempo-backed; every chat completion produces spans for routing, cache lookup, provider call, billing. |

Breaking changes

None. Every change is additive. The nex block on the response gained new fields (cached_input_tokens, semantic_cache_hit, smart_router, pii_redactions, injection_score) — clients ignoring unknown fields are unaffected.

Pricing changes

All existing list prices unchanged.

Want to try the new shape in production?

Sign in; your existing keys still work. Set model: "nex-auto" in one line and watch the nex block in your responses.


Questions? Reply to your existing thread or write to support@nextoken.biz.