Between 2026-05-10 and 2026-05-12 we shipped 23 changes spanning cost optimisation, compliance, reliability, new modalities, new providers, enterprise features, and observability. None of them broke an existing integration. Three of them you can take advantage of today with little or no new code.
Three things to try today
1. See your cache savings
Every /v1/chat/completions response now carries nex.cached_input_tokens (upstream cache hit) and nex.semantic_cache_hit (NexToken cache hit). Cache hits are billed at the documented discount:
- OpenAI prompt cache: 50% off the cached portion
- Anthropic prompt cache: 10% off the cached portion
- DeepSeek prompt cache: 10% off the cached portion
- Google Gemini cache: 25% off the cached portion
- Semantic cache (near-duplicate match): 95% off retail
If you use long system prompts or repeated context (chat apps with memory, agentic loops, RAG pipelines), expect 30–90% reduction on cached input tokens — visible per request, not bundled into a monthly bill.
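The saving on a single request follows directly from the discounts listed above. The sketch below is a minimal illustration, not NexToken's billing code: the helper name `inputCost`, the provider keys, and the $0.0025 per 1K input rate are hypothetical, and it assumes the discount applies only to the cached portion of the input.

```javascript
// Discounts on the cached portion, copied from the list above.
const CACHE_DISCOUNT = {
  openai: 0.50,
  anthropic: 0.10,
  deepseek: 0.10,
  gemini: 0.25,
};

// Hypothetical helper: cost in USD of the input side of one request.
// ratePer1K is an assumed retail input price per 1K tokens.
function inputCost(inputTokens, cachedInputTokens, ratePer1K, provider) {
  const discount = CACHE_DISCOUNT[provider] ?? 0; // unknown provider: no discount
  const uncached = inputTokens - cachedInputTokens;
  const cachedBilled = cachedInputTokens * (1 - discount);
  return ((uncached + cachedBilled) * ratePer1K) / 1000;
}

// 10,000 input tokens, 8,000 of them reported in nex.cached_input_tokens,
// at an assumed rate of $0.0025 per 1K input tokens.
const full = inputCost(10000, 0, 0.0025, "openai");
const withCache = inputCost(10000, 8000, 0.0025, "openai");
console.log({ full, withCache, saved: full - withCache });
```

In practice the second argument would come from nex.cached_input_tokens on each response, which is what makes the saving auditable per request.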
2. Quote a request before you pay for it
POST /v1/tokenize and POST /v1/estimate-cost accept the same payload as /v1/chat/completions and return token counts plus a retail-USD estimate. Cheaper than learning your nightly bill blew up because someone fed a 50K-token document into a loop.
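A budget guard in front of an expensive call might look like the sketch below. Several details are assumptions rather than documented behaviour: the base URL is a placeholder, Bearer-token authentication is assumed because the API is OpenAI-compatible, and the response field `estimated_cost_usd` is a hypothetical name. Check the actual response shape before relying on it.

```javascript
// Hypothetical base URL; substitute the one from your NexToken account.
const BASE_URL = "https://api.example.com";

// Build the fetch arguments for a pre-flight quote. The payload is the same
// object you would send to /v1/chat/completions.
function buildEstimateRequest(apiKey, payload) {
  return {
    url: `${BASE_URL}/v1/estimate-cost`,
    init: {
      method: "POST",
      headers: {
        "Authorization": `Bearer ${apiKey}`, // assumed auth scheme
        "Content-Type": "application/json",
      },
      body: JSON.stringify(payload),
    },
  };
}

// True only when the quote is present and within budget.
// `estimated_cost_usd` is an assumed field name.
function withinBudget(estimate, maxUsd) {
  return typeof estimate.estimated_cost_usd === "number" &&
    estimate.estimated_cost_usd <= maxUsd;
}

// Fetch a quote and throw if it exceeds maxUsd; the caller then proceeds
// to /v1/chat/completions with the same payload.
async function quoteWithinBudget(apiKey, payload, maxUsd) {
  const { url, init } = buildEstimateRequest(apiKey, payload);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`estimate failed: ${res.status}`);
  const estimate = await res.json();
  if (!withinBudget(estimate, maxUsd)) {
    throw new Error("estimated cost exceeds budget");
  }
  return estimate;
}
```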
3. Switch to nex-auto
Set model: "nex-auto" in your existing code and the smart router picks the cheapest model that can handle each prompt: general, coder, reasoning, or long-context. The decision is surfaced in response.nex.smart_router.target_model so you can audit per request.
// Before
const r = await openai.chat.completions.create({
model: "gpt-4o", messages
});
// After (one word changed)
const r = await openai.chat.completions.create({
model: "nex-auto", messages
});
console.log(r.nex.smart_router); // {target_model, reason, score}
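To audit routing across many requests rather than one, the target_model values can be tallied. This is a minimal sketch: `tallyRouting` is a hypothetical helper, and the model names, reasons, and scores in the sample data are made up for illustration.

```javascript
// Count how often the smart router picked each model.
// Responses without a nex.smart_router block are counted as "unrouted".
function tallyRouting(responses) {
  const counts = {};
  for (const r of responses) {
    const target = r?.nex?.smart_router?.target_model ?? "unrouted";
    counts[target] = (counts[target] ?? 0) + 1;
  }
  return counts;
}

// Sample responses with invented model names.
const sampleResponses = [
  { nex: { smart_router: { target_model: "model-a", reason: "general", score: 0.9 } } },
  { nex: { smart_router: { target_model: "model-b", reason: "coder", score: 0.8 } } },
  { nex: { smart_router: { target_model: "model-a", reason: "general", score: 0.7 } } },
  { nex: {} },
];
console.log(tallyRouting(sampleResponses));
```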
Cost optimisation
| Feature | Effect | How to opt in |
|---|---|---|
| Upstream prompt-cache pass-through | OpenAI / Anthropic / DeepSeek / Google cache hits now bill at the documented discount instead of full input rate. | Automatic. See nex.cached_input_tokens. |
| Semantic cache | Near-duplicate prompts (cosine ≥ 0.97 on nex-embed-zh) return the previous answer at 5% of normal retail. | Automatic on temperature ≤ 0.3 without tools/response_format. |
| Batch endpoint POST /v1/batch | Up to 100 items per call, 30% discount. | New sync endpoint. |
| Async batch POST /v1/batches | OpenAI-shape, 50% discount, 24h SLA. Eligible OpenAI batches forwarded to OpenAI's /v1/batches for additional wholesale savings. | New async endpoint. Multi-modal items welcome (chat / embeddings / images). |
| Smart router | Picks cheapest capable model; decision in nex.smart_router. | model: "nex-auto". |
| Pre-flight cost quotes | Tokenize + estimate before sending. | POST /v1/tokenize, POST /v1/estimate-cost. |
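The semantic-cache rule in the table combines a request-level eligibility test with a similarity threshold. The sketch below shows that logic client-side, for reasoning about which requests are likely to hit. It is not the gateway's implementation: the function names are hypothetical, and treating a missing temperature as ineligible is an assumption.

```javascript
// Cosine similarity between two equal-length embedding vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error("length mismatch");
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vector: define as no similarity
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Eligibility from the table: temperature <= 0.3, no tools, no response_format.
// A missing temperature is treated as ineligible (assumption).
function isSemanticCacheEligible(request) {
  const lowTemp = typeof request.temperature === "number" && request.temperature <= 0.3;
  const noTools = !request.tools || request.tools.length === 0;
  return lowTemp && noTools && request.response_format === undefined;
}

const SIMILARITY_THRESHOLD = 0.97; // from the table

// Would this request be served from the semantic cache, given the embedding
// of the new prompt and the embedding of a previously cached prompt?
function wouldHitSemanticCache(request, queryEmbedding, cachedEmbedding) {
  return isSemanticCacheEligible(request) &&
    cosineSimilarity(queryEmbedding, cachedEmbedding) >= SIMILARITY_THRESHOLD;
}
```

A request with temperature 0.7, or one that passes tools, never qualifies regardless of how similar the prompt is.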
Compliance & safety
| Feature | What it does |
|---|---|
| Content moderation | High-risk prompts (CSAM-adjacent, WMD-synthesis, account fraud) blocked at the gateway — protects shared upstream accounts. Returns 422 NEX_CONTENT_FLAGGED. |
| PII redaction | National IDs, mobile numbers, credit cards, IPv4, US SSNs, passports replaced with [REDACTED:CATEGORY] before forwarding upstream. Response reports nex.pii_redactions = {category: count}. PDPA / GDPR friendly out of the box. Enterprise tenants with a signed DPA can disable via pii_mode: "off". |
| Prompt-injection defence | Known jailbreak templates blocked. Returns 422 NEX_PROMPT_INJECTION_DETECTED with score and matched patterns. |
| Pre-flight context check | Oversized inputs rejected as 400 NEX_INPUT_TOO_LONG before paying for an upstream timeout. |
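Clients may want to treat these gateway rejections differently from upstream failures. The sketch below maps the three codes in the table to a client-side action. It assumes an OpenAI-style error body of the form { error: { code } }, which these notes do not confirm, and `classifyGatewayError` and its action names are hypothetical.

```javascript
// Map a gateway rejection to a suggested client action.
// Assumes the error code is at body.error.code (OpenAI-style shape).
function classifyGatewayError(status, body) {
  const code = body?.error?.code;
  if (status === 422 && code === "NEX_CONTENT_FLAGGED") {
    // Blocked by content moderation; resending the same prompt will not help.
    return { retry: false, action: "show_policy_message" };
  }
  if (status === 422 && code === "NEX_PROMPT_INJECTION_DETECTED") {
    // Matched a known jailbreak template; log it for review.
    return { retry: false, action: "log_and_reject" };
  }
  if (status === 400 && code === "NEX_INPUT_TOO_LONG") {
    // Rejected before reaching the upstream; shorten the input and resend.
    return { retry: true, action: "truncate_input" };
  }
  return { retry: false, action: "unknown" };
}

console.log(classifyGatewayError(400, { error: { code: "NEX_INPUT_TOO_LONG" } }));
```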
Reliability
- Redis-backed circuit breaker: provider trips replicate across all NexToken instances; restarts no longer reset state.
- In-provider retry + exponential backoff: 429/503/timeout retries the same provider (0.5s, 1.5s) before rotating, so transient blips don't bounce you to a different model.
- Region-aware routing: APAC users prefer APAC-hosted backends (Qwen / GLM / nex-pro); US users prefer OpenAI / Anthropic direct; EU users prefer Mistral / GLM.
- Multi-key load balancing: multiple keys per upstream rotate round-robin with per-key cooldown; one bad key no longer takes a provider offline.
- Real-time provider scoring: every call records P50/P95/P99 + success rate; routing uses live measurements instead of static heuristics.
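The multi-key item in the list above can be illustrated with a small round-robin rotator that skips keys in cooldown. This is a sketch of the general technique, not NexToken's implementation: the class name, the cooldown duration, and the injectable clock are all hypothetical.

```javascript
// Round-robin over several keys for one upstream, skipping keys in cooldown.
class KeyRotator {
  // `now` is injectable so the clock can be controlled in tests.
  constructor(keys, cooldownMs, now = () => Date.now()) {
    if (keys.length === 0) throw new Error("at least one key required");
    this.keys = keys;
    this.cooldownMs = cooldownMs;
    this.now = now;
    this.next = 0; // index of the key to try first on the next pick
    this.coolingUntil = new Map(); // key -> timestamp when usable again
  }

  // Return the next usable key, or null if every key is cooling down.
  pick() {
    for (let i = 0; i < this.keys.length; i++) {
      const index = (this.next + i) % this.keys.length;
      const key = this.keys[index];
      if ((this.coolingUntil.get(key) ?? 0) <= this.now()) {
        this.next = (index + 1) % this.keys.length;
        return key;
      }
    }
    return null;
  }

  // Call after a key-specific failure, for example a 429 tied to that key.
  markBad(key) {
    this.coolingUntil.set(key, this.now() + this.cooldownMs);
  }
}
```

Because a failing key is only skipped until its cooldown expires, the remaining keys keep the provider available in the meantime.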
New modalities
| Endpoint | Models |
|---|---|
| POST /v1/images/generations | DALL-E 3, DALL-E 3 HD |
| POST /v1/audio/transcriptions | Whisper |
| POST /v1/audio/speech | TTS-1, TTS-1-HD (HD billed at 2×, already correct) |
| Vision in /v1/chat/completions | GPT-4o, Claude, Gemini. Token math fixed on image_url blocks (was over-counting on base64). |
New providers: 6 new models behind 3 backends
| Model | Provider | Notable for |
|---|---|---|
| command-r-plus, command-r | Cohere | RAG-strong |
| sonar, sonar-pro | Perplexity | Web-search-grounded chat |
| grok-3, grok-3-mini | xAI | Reasoning + X-platform context |
Enterprise
| Feature | Endpoint / API |
|---|---|
| Prompt templates | POST /v1/templates with CRUD + /render — server-side {{variable}} substitution. 200 templates × 64 KB each. |
| Fine-tune job lifecycle | POST /v1/fine_tunes (queue) → poll → webhook on completion. OpenAI-shaped, persisted, integrated with /v1/files for training data. |
| File uploads | POST /v1/files with purpose set to one of batch, fine-tune, assistants, user_data. SHA-256 integrity, per-user quota. |
| Webhooks | POST /v1/webhooks — HMAC-signed POST to your URL on batch.completed, fine_tune.completed, invoice.issued, invoice.paid. Auto-retry with exponential backoff up to 6 attempts. |
| Responses API | POST /v1/responses — OpenAI's new stateful single-turn API. Pass previous_response_id and the server rebuilds the message history for you. |
| Assistants API | /v1/assistants, /v1/threads, /v1/threads/{id}/messages, /v1/threads/{id}/runs — for SDK clients still targeting the Assistants surface. |
| Reserved throughput | Per-tenant RPM / concurrency floor — Enterprise SLA opt-in. Contact tinggang@nextoken.biz. |
| SSO (SAML + OIDC) | Self-built SP. OIDC available today; SAML waiting on a deploy-side libxmlsec1 install for sites that prefer SAML. |
| Monthly invoicing | Admin-generated NET-30 invoices for enterprise accounts, PDF export, void / mark-paid lifecycle. |
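The {{variable}} substitution performed by the prompt-template /render endpoint can be previewed locally. The sketch below is a client-side approximation only: `renderTemplate` is a hypothetical helper, the accepted variable-name syntax is assumed, and throwing on a missing variable is an assumption about behaviour the server may handle differently.

```javascript
// Replace {{name}} placeholders with values from vars.
// Variable names are assumed to be identifiers: letters, digits, underscore.
function renderTemplate(template, vars) {
  return template.replace(
    /\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}/g,
    (match, name) => {
      if (!Object.prototype.hasOwnProperty.call(vars, name)) {
        // Assumption: fail loudly rather than leave the placeholder in place.
        throw new Error(`missing template variable: ${name}`);
      }
      return String(vars[name]);
    }
  );
}

const rendered = renderTemplate(
  "Summarise {{ doc_title }} in {{language}}.",
  { doc_title: "Q2 report", language: "English" }
);
console.log(rendered); // Summarise Q2 report in English.
```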
Observability
| Feature | What it does |
|---|---|
| Prometheus /metrics | All counters, histograms, gauges. Scrape from VPC; public is denied at the nginx layer. |
| Grafana dashboards | Four ready-to-import JSON files — API overview, provider health, billing & margin, safety gates. |
| OpenTelemetry traces | Tempo-backed; every chat completion produces spans for routing, cache lookup, provider call, billing. |
Breaking changes
The nex block on the response gained new fields (cached_input_tokens, semantic_cache_hit, smart_router, pii_redactions, injection_score); clients that ignore unknown fields are unaffected.
Pricing changes
- Cache savings now visible — apps with long system prompts or RAG context typically see 30–90% off cached input tokens.
- nex-auto published at $0.0003 / $0.0012 per 1K (general tier default). Actual underlying-model cost surfaced via nex.smart_router.
- Batch API discount: 30% off retail per item (sync). Async batch: 50%.
- Semantic cache hit discount: 95% off retail (5% residual covers the embed lookup).
All existing list prices unchanged.
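As a worked example of how the figures above combine, the sketch below prices one nex-auto request under each discount. Two things are assumptions: that the published pair $0.0003 / $0.0012 means input / output per 1K tokens (the notes do not label them), and that each discount applies to the whole retail price of a nex-auto request. The helper name `nexAutoCost` is hypothetical.

```javascript
// Published nex-auto list price, USD per 1K tokens.
// Reading the pair as input / output is an assumption.
const NEX_AUTO = { inputPer1K: 0.0003, outputPer1K: 0.0012 };

// Discounts off retail, from the pricing list above.
const DISCOUNT = { none: 0, syncBatch: 0.30, asyncBatch: 0.50, semanticCacheHit: 0.95 };

// Cost in USD of one request under a given discount mode.
function nexAutoCost(inputTokens, outputTokens, mode = "none") {
  if (!(mode in DISCOUNT)) throw new Error(`unknown mode: ${mode}`);
  const retail =
    (inputTokens / 1000) * NEX_AUTO.inputPer1K +
    (outputTokens / 1000) * NEX_AUTO.outputPer1K;
  return retail * (1 - DISCOUNT[mode]);
}

// 2,000 input tokens and 500 output tokens:
// retail = 2 * 0.0003 + 0.5 * 0.0012 = 0.0012 USD before any discount.
for (const mode of Object.keys(DISCOUNT)) {
  console.log(mode, nexAutoCost(2000, 500, mode).toFixed(6));
}
```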
Want to try the new shape on production?
Sign in and your existing keys still work. Switch to model: "nex-auto" in one line, then watch the nex block in your responses.
Questions? Reply to your existing thread or write to support@nextoken.biz.