May 2026 Release — NexToken

Between 2026-05-10 and 2026-05-12 we shipped 23 changes spanning cost optimisation, compliance, reliability, new modalities, new providers, enterprise features, and observability. None of them broke an existing integration. Three of them you can take advantage of today without writing any code.

Three things to try today

1. See your cache savings

Every /v1/chat/completions response now carries nex.cached_input_tokens (upstream cache hit) and nex.semantic_cache_hit (NexToken cache hit). Cache hits are billed at the documented discount:

OpenAI prompt cache: 50% off the cached portion
Anthropic prompt cache: 10% off the cached portion
DeepSeek prompt cache: 10% off the cached portion
Google Gemini cache: 25% off the cached portion
Semantic cache (near-duplicate match): 95% off retail

If you use long system prompts or repeated context (chat apps with memory, agentic loops, RAG pipelines), expect 30–90% reduction on cached input tokens — visible per request, not bundled into a monthly bill.

2. Quote a request before you pay for it

POST /v1/tokenize and POST /v1/estimate-cost accept the same payload as /v1/chat/completions and return token counts plus a retail-USD estimate. Cheaper than learning your nightly bill blew up because someone fed a 50K-token document into a loop.

3. Switch to `nex-auto`

Set model: "nex-auto" in your existing code and the smart router picks the cheapest model that can handle each prompt — general, coder, reasoning, or long-context. The decision is surfaced in response.nex.smart_router.target_model so you can audit per request.

// Before
const r = await openai.chat.completions.create({
  model: "gpt-4o", messages
});

// After (one word changed)
const r = await openai.chat.completions.create({
  model: "nex-auto", messages
});
console.log(r.nex.smart_router); // {target_model, reason, score}

Cost optimisation

Feature	Effect	How to opt in
Upstream prompt-cache pass-through	OpenAI / Anthropic / DeepSeek / Google cache hits now bill at the documented discount instead of full input rate.	Automatic. See `nex.cached_input_tokens`.
Semantic cache	Near-duplicate prompts (cosine ≥ 0.97 on `nex-embed-zh`) return the previous answer at 5% of normal retail.	Automatic on `temperature ≤ 0.3` without tools/response_format.
Batch endpoint `POST /v1/batch`	Up to 100 items per call, 30% discount.	New sync endpoint.
Async batch `POST /v1/batches`	OpenAI-shape, 50% discount, 24h SLA. Eligible OpenAI batches forwarded to OpenAI's `/v1/batches` for additional wholesale savings.	New async endpoint. Multi-modal items welcome (chat / embeddings / images).
Smart router	Picks cheapest capable model; decision in `nex.smart_router`.	`model: "nex-auto"`.
Pre-flight cost quotes	Tokenize + estimate before sending.	`POST /v1/tokenize`, `POST /v1/estimate-cost`.

Compliance & safety

Feature	What it does
Content moderation	High-risk prompts (CSAM-adjacent, WMD-synthesis, account fraud) blocked at the gateway — protects shared upstream accounts. Returns `422 NEX_CONTENT_FLAGGED`.
PII redaction	National IDs, mobile numbers, credit cards, IPv4, US SSNs, passports replaced with `[REDACTED:CATEGORY]` before forwarding upstream. Response reports `nex.pii_redactions = {category: count}`. PDPA / GDPR friendly out of the box. Enterprise tenants with a signed DPA can disable via `pii_mode: "off"`.
Prompt-injection defence	Known jailbreak templates blocked. Returns `422 NEX_PROMPT_INJECTION_DETECTED` with score and matched patterns.
Pre-flight context check	Oversized inputs rejected as `400 NEX_INPUT_TOO_LONG` before paying for an upstream timeout.

Reliability

Redis-backed circuit breaker — provider trips replicate across all NexToken instances; restarts no longer reset state.
In-provider retry + exponential backoff — 429/503/timeout retries the same provider (0.5s, 1.5s) before rotating, so transient blips don't bounce you to a different model.
Region-aware routing — APAC users prefer APAC-hosted backends (Qwen / GLM / nex-pro); US users prefer OpenAI / Anthropic direct; EU users prefer Mistral / GLM.
Multi-key load balancing — multiple keys per upstream rotate round-robin with per-key cooldown; one bad key no longer takes a provider offline.
Real-time provider scoring — every call records P50/P95/P99 + success rate; routing uses live measurements instead of static heuristics.

New modalities

Endpoint	Models
`POST /v1/images/generations`	DALL-E 3, DALL-E 3 HD
`POST /v1/audio/transcriptions`	Whisper
`POST /v1/audio/speech`	TTS-1, TTS-1-HD (HD billed at 2× — already correct)
Vision in `/v1/chat/completions`	GPT-4o, Claude, Gemini — token math fixed on `image_url` blocks (was over-counting on base64).

New providers — 7 new models behind 3 backends

Model	Provider	Notable for
`command-r-plus`, `command-r`	Cohere	RAG-strong
`sonar`, `sonar-pro`	Perplexity	Web-search-grounded chat
`grok-3`, `grok-3-mini`	xAI	Reasoning + X-platform context

Enterprise

Feature	Endpoint / API
Prompt templates	`POST /v1/templates` with CRUD + `/render` — server-side `{{variable}}` substitution. 200 templates × 64 KB each.
Fine-tune job lifecycle	`POST /v1/fine_tunes` (queue) → poll → webhook on completion. OpenAI-shaped, persisted, integrated with `/v1/files` for training data.
File uploads	`POST /v1/files` with `purpose=batch \| fine-tune \| assistants \| user_data`. SHA-256 integrity, per-user quota.
Webhooks	`POST /v1/webhooks` — HMAC-signed POST to your URL on `batch.completed`, `fine_tune.completed`, `invoice.issued`, `invoice.paid`. Auto-retry with exponential backoff up to 6 attempts.
Responses API	`POST /v1/responses` — OpenAI's new stateful single-turn API. Pass `previous_response_id` and the server rebuilds the message history for you.
Assistants API	`/v1/assistants`, `/v1/threads`, `/v1/threads/{id}/messages`, `/v1/threads/{id}/runs` — for SDK clients still targeting the Assistants surface.
Reserved throughput	Per-tenant RPM / concurrency floor — Enterprise SLA opt-in. Contact tinggang@nextoken.biz.
SSO (SAML + OIDC)	Self-built SP. OIDC available today; SAML waiting on a deploy-side `libxmlsec1` install for sites that prefer SAML.
Monthly invoicing	Admin-generated NET-30 invoices for enterprise accounts, PDF export, void / mark-paid lifecycle.

Observability


Prometheus `/metrics`	All counters, histograms, gauges. Scrape from VPC; public is denied at the nginx layer.
Grafana dashboards	Four ready-to-import JSON files — API overview, provider health, billing & margin, safety gates.
OpenTelemetry traces	Tempo-backed; every chat completion produces spans for routing, cache lookup, provider call, billing.

Breaking changes

None. Every change is additive. The nex block on the response gained new fields (cached_input_tokens, semantic_cache_hit, smart_router, pii_redactions, injection_score) — clients ignoring unknown fields are unaffected.

Pricing changes

Cache savings now visible — apps with long system prompts or RAG context typically see 30–90% off cached input tokens.
nex-auto published at $0.0003 / $0.0012 per 1K (general tier default). Actual underlying-model cost surfaced via nex.smart_router.
Batch API discount: 30% off retail per item (sync). Async batch: 50%.
Semantic cache hit discount: 95% off retail (5% residual covers the embed lookup).

All existing list prices unchanged.

Want to try the new shape on production?

Sign in and your existing keys still work. Switch model: "nex-auto" in one line, watch the nex block in your responses.

Get started → Read the docs

Questions? Reply to your existing thread or write to support@nextoken.biz.

May 2026 — 23 shipments, zero breaking changes, 30–90% cache savings