Chinese-strong RAG customer support

End-to-end Chinese-language Q&A over your support docs, using nex-embed-zh for retrieval and nex-pro for answer generation. Pure Python, no LangChain — easier to debug, ~80 lines total.

⏱ 15 min · nex-embed-zh · nex-pro · pgvector · $0.01/1M tokens

Why this is a good fit for Nex

Chinese embeddings are a weak spot for OpenAI's text-embedding-3-small: recall on Chinese phrases is meaningfully worse than on English. nex-embed-zh is BGE-large-zh-v1.5, self-hosted on Nex's Singapore GPUs, 1024-dim, with retrieval quality that matches Cohere's Chinese model at about half the cost ($0.01/1M tokens).

Pair it with nex-pro (Qwen2.5-7B, native bilingual) and you get a fully APAC-resident pipeline, which is useful for compliance with PDPA or mainland China's generative-AI filing requirements (生成式 AI 备案).

What we're building

Load a folder of .md support docs in Chinese
Chunk into ~500-character overlapping windows
Embed each chunk with nex-embed-zh → write to Postgres pgvector
At query time: embed the question, top-k cosine search, feed retrieved chunks to nex-pro
Return the answer + cited source snippets

Step 1 · Postgres + pgvector setup

-- Run once on your Postgres
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE kb_chunks (
  id BIGSERIAL PRIMARY KEY,
  source TEXT NOT NULL,
  content TEXT NOT NULL,
  embedding vector(1024) NOT NULL    -- nex-embed-zh dim
);
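-- ivfflat derives its cluster centers from rows already in the table,
-- so for better recall (re)build this index after the first ingest in Step 2.
CREATE INDEX ON kb_chunks USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);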
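
The lists = 100 setting above suits a table of roughly 100K chunks. As a rough guide, the pgvector README suggests starting from rows / 1000 lists for up to 1M rows (sqrt(rows) beyond that) and about sqrt(lists) probes at query time. The helper below is a minimal sketch of that rule of thumb; the function name is hypothetical, and the numbers are starting points to tune against your own recall tests.

```python
import math

def suggest_ivfflat_params(row_count: int) -> dict:
    """Starting values for ivfflat `lists` and `probes` (rule of thumb only)."""
    if row_count <= 1_000_000:
        lists = max(1, row_count // 1000)
    else:
        lists = max(1, round(math.sqrt(row_count)))
    # More probes means better recall but slower queries.
    probes = max(1, round(math.sqrt(lists)))
    return {"lists": lists, "probes": probes}

print(suggest_ivfflat_params(100_000))
```

The probes value is applied per session with SET ivfflat.probes = <n>; before running the search query.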

Step 2 · Ingest pipeline

Requirements: pip install openai "psycopg[binary]" pgvector — we use psycopg (v3); the legacy psycopg2 package is not required.

import os, glob
from openai import OpenAI
import psycopg

client = OpenAI(api_key=os.environ["NEX_API_KEY"], base_url="https://api.nextoken.biz/v1")
db = psycopg.connect(os.environ["DATABASE_URL"])

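def chunk(text, size=500, overlap=80):
    """Character-level chunking; Chinese doesn't need word-aware splitting."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be smaller than size")  # avoids an infinite loop
    out, i = [], 0
    while i < len(text):
        out.append(text[i:i + size])
        if i + size >= len(text):
            break  # this window already reached the end; skip a redundant tail chunk
        i += size - overlap
    return out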

def embed(text):
    r = client.embeddings.create(model="nex-embed-zh", input=text)
    return r.data[0].embedding

for path in glob.glob("docs/**/*.md", recursive=True):
    with open(path, encoding="utf-8") as f:
        body = f.read()
    for chunk_text in chunk(body):
        v = embed(chunk_text)
        with db.cursor() as c:
            c.execute(
                "INSERT INTO kb_chunks (source, content, embedding) VALUES (%s, %s, %s)",
                (path, chunk_text, v),
            )
    db.commit()
    print(f"ingested {path}")
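
The loop above makes one API call per chunk. If the Nex endpoint accepts a list as input the way OpenAI's embeddings API does (an assumption to verify before relying on it), batching cuts the number of round trips. A minimal sketch, with hypothetical helper names batched and embed_many and an arbitrary batch size of 32:

```python
from typing import Iterator, List

def batched(items: List[str], n: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most n items."""
    if n < 1:
        raise ValueError("n must be at least 1")
    for start in range(0, len(items), n):
        yield items[start:start + n]

def embed_many(client, texts: List[str], model: str = "nex-embed-zh",
               batch_size: int = 32) -> List[List[float]]:
    """Embed texts in batches; returns vectors in the same order as texts.

    Assumes the endpoint accepts a list for `input` and returns one item
    per input with an `index` field, as OpenAI's embeddings API does.
    """
    vectors: List[List[float]] = []
    for group in batched(texts, batch_size):
        r = client.embeddings.create(model=model, input=group)
        # Sort by index so the output order matches the input order.
        vectors.extend(d.embedding for d in sorted(r.data, key=lambda d: d.index))
    return vectors
```

In the ingest loop this would replace the per-chunk embed call: compute pieces = chunk(body), then pair pieces with embed_many(client, pieces) when inserting rows.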

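Cost check. 100K Chinese characters at nex-embed-zh rates is about $0.001, assuming roughly one token per character. 10 MB of UTF-8 Chinese text is about 3.3M characters (3 bytes each), so embedding it costs roughly 3–4 ¢ including the chunk overlap. This is a one-time cost.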
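
The arithmetic behind an estimate like this can be sketched in a few lines. The one-token-per-character ratio and the 3 bytes per character are assumptions (typical for Chinese text in UTF-8, but check against your own corpus and the usage figures the API returns); the price is the rate quoted on this page.

```python
PRICE_PER_MILLION_TOKENS = 0.01  # USD, rate quoted for nex-embed-zh
TOKENS_PER_CHAR = 1.0            # assumption: about one token per Chinese character
BYTES_PER_CHAR = 3               # assumption: CJK characters take 3 bytes in UTF-8

def embed_cost_usd(chars: float, size: int = 500, overlap: int = 80) -> float:
    """Estimated one-time embedding cost for `chars` characters of source text."""
    # Overlapping windows re-embed `overlap` characters on every step,
    # so the embedded volume is inflated by size / (size - overlap).
    inflation = size / (size - overlap)
    tokens = chars * TOKENS_PER_CHAR * inflation
    return tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

ten_mb_chars = 10_000_000 / BYTES_PER_CHAR
print(f"100K chars, no overlap: ${embed_cost_usd(100_000, overlap=0):.4f}")
print(f"10 MB, default chunking: ${embed_cost_usd(ten_mb_chars):.3f}")
```

Under these assumptions, 10 MB of Chinese docs comes out at a few cents.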

Step 3 · Query pipeline

def search(question, k=5):
    qv = embed(question)
    with db.cursor() as c:
        c.execute(
            "SELECT source, content, 1 - (embedding <=> %s::vector) AS score "
            "FROM kb_chunks ORDER BY embedding <=> %s::vector LIMIT %s",
            (qv, qv, k),
        )
        return c.fetchall()

def answer(question):
    chunks = search(question)
    context = "\n\n---\n\n".join(f"【来源 {i+1}】{src}\n{txt}" for i, (src, txt, _) in enumerate(chunks))
    r = client.chat.completions.create(
        model="nex-pro",
        messages=[
            {"role": "system", "content":
                "你是客户支持助手。只根据 <context> 中的资料回答。"
                "如果资料里没有答案，直接说『资料中没有相关信息』，不要编造。"
                "回答末尾标注引用的来源编号。"},
            {"role": "user", "content": f"<context>\n{context}\n</context>\n\n问题: {question}"},
        ],
        temperature=0.1,
    )
    return r.choices[0].message.content, [c[0] for c in chunks]

ans, srcs = answer("如何申请退款？")
print(ans)
print("引用:", srcs)
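
A note on the score column in search: pgvector's <=> operator returns cosine distance, so 1 - (embedding <=> query) is cosine similarity, ranging from -1 (opposite) to 1 (same direction). The pure-Python sketch below shows the relationship on toy vectors; it is for illustration only, since Postgres does this computation in the query.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """What pgvector's <=> operator computes: 1 - cosine similarity."""
    return 1 - cosine_similarity(a, b)

same = cosine_distance([1.0, 2.0], [2.0, 4.0])      # parallel vectors
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
print(1 - same, 1 - orthogonal)                      # similarity scores
```

A useful consequence: the score is scale-invariant, so it reflects direction only, and a low top score on every result is a reasonable signal that the question is off-topic for the knowledge base.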

Step 4 · Quality checks

Before shipping to users:

Chunk size sanity — for Chinese docs, 400–600 chars is usually right. Too small loses context, too large dilutes relevance.
Reranker (optional) — for harder queries, retrieve top-20 then rerank to top-5 with a stronger model. nex-pro can act as a cheap reranker via a "score these chunks 0–10" prompt.
Hallucination guard — keep temperature=0.1 and the explicit "only answer from context" instruction. Test with off-topic questions.
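
The reranker idea above can be sketched as follows. Everything here is illustrative: the prompt wording, the "number: score" reply format, and the helper names are assumptions, and ask_model stands for any function that sends a prompt to nex-pro and returns the reply text. Parsing is deliberately forgiving because a model will not always follow the format.

```python
import re

def build_rerank_prompt(question, chunks):
    """Number the passages and ask for one 'number: score' line per passage."""
    numbered = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        f"Question: {question}\n\nPassages:\n{numbered}\n\n"
        "Score each passage from 0 to 10 for how well it answers the question. "
        "Reply with one line per passage in the form `number: score`."
    )

def parse_scores(reply, n):
    """Extract scores; passages the model skipped or garbled default to 0."""
    scores = [0.0] * n
    pattern = r"^\s*\[?(\d+)\]?\s*[:：]\s*(\d+(?:\.\d+)?)"
    for idx, score in re.findall(pattern, reply, flags=re.M):
        i = int(idx) - 1
        if 0 <= i < n:
            scores[i] = min(10.0, float(score))
    return scores

def rerank(question, chunks, ask_model, top_n=5):
    """Return the top_n chunks by model-assigned score, highest first."""
    reply = ask_model(build_rerank_prompt(question, chunks))
    scores = parse_scores(reply, len(chunks))
    order = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [chunks[i] for i in order[:top_n]]
```

To use it with the pipeline above, call search(question, k=20), pass the content column to rerank, and feed the top 5 into the answer prompt. For Chinese docs it may work better to write the scoring prompt in Chinese.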

Production extras

Cache embeddings of identical chunks by content hash to save re-ingest cost
Add a per-user model_allowlist on the API key (Settings → API Keys) so the embedding key can only call nex-embed-zh + nex-pro, even if leaked
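Stream the answer for better UX: pass stream=True to client.chat.completions.create and iterate over the returned chunks, printing each chunk's choices[0].delta.content as it arrives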
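
Two of these extras can be sketched briefly. The cache below is in-memory and keyed by a SHA-256 of the chunk text; in production you would more likely store the hash in a kb_chunks column and skip rows that already exist (that column is hypothetical, not part of the schema above). The stream helper assumes the Nex endpoint emits chunks in the same shape as the OpenAI Python SDK, where each chunk carries choices[0].delta.content, which may be None.

```python
import hashlib

def content_key(text: str) -> str:
    """Stable key for a chunk: identical text always maps to the same hash."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

class EmbeddingCache:
    """Wraps an embed function so identical chunks are embedded only once."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._store = {}
        self.misses = 0  # number of real embedding calls made

    def get(self, text: str):
        key = content_key(text)
        if key not in self._store:
            self._store[key] = self._embed_fn(text)
            self.misses += 1
        return self._store[key]

def collect_stream(stream, on_token=None) -> str:
    """Consume a stream=True chat completion and return the full answer text."""
    parts = []
    for event in stream:
        if not event.choices:
            continue  # some chunks carry no choices (e.g. usage-only chunks)
        piece = event.choices[0].delta.content
        if piece:
            parts.append(piece)
            if on_token is not None:
                on_token(piece)
    return "".join(parts)
```

For a live terminal display, pass on_token=lambda t: print(t, end="", flush=True) and hand collect_stream the object returned by client.chat.completions.create(..., stream=True).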
