Chinese-strong RAG customer support
End-to-end Chinese-language Q&A over your support docs, using nex-embed-zh for
retrieval and nex-pro for answer generation. Pure Python, no LangChain — easier
to debug, ~80 lines total.
Why this is a good fit for Nex
Chinese embeddings are a weak spot for OpenAI's text-embedding-3-small — recall
on Chinese phrases is meaningfully worse than on English. nex-embed-zh is BGE-large-zh-v1.5,
self-hosted on Nex's Singapore GPU, 1024-dim, with retrieval quality that matches Cohere's
Chinese model at ~50% the cost ($0.01/1M tokens).
Pair it with nex-pro (Qwen2.5-7B, native bilingual) and you get a fully APAC-resident
pipeline, which is useful for PDPA and mainland China generative-AI filing (生成式 AI 备案) compliance.
What we're building
- Load a folder of .md support docs in Chinese
- Chunk into ~500-character overlapping windows
- Embed each chunk with nex-embed-zh → write to Postgres pgvector
- At query time: embed the question, top-k cosine search, feed retrieved chunks to nex-pro
- Return the answer + cited source snippets
Step 1 · Postgres + pgvector setup
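-- Run once on your Postgres
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE kb_chunks (
    id        BIGSERIAL PRIMARY KEY,
    source    TEXT NOT NULL,
    content   TEXT NOT NULL,
    embedding vector(1024) NOT NULL  -- nex-embed-zh dim
);
-- ivfflat derives its clusters from the rows present at build time, so recall is
-- better if you create (or REINDEX) this index after the initial ingest in Step 2.
CREATE INDEX ON kb_chunks USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);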
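The lists = 100 value is a starting point rather than a fixed rule. pgvector's documentation suggests roughly rows / 1000 lists for tables up to 1M rows and sqrt(rows) beyond that, with about sqrt(lists) probes at query time. A minimal sketch of that rule of thumb follows; suggest_ivfflat_params is a hypothetical helper name, not part of pgvector.

```python
import math

def suggest_ivfflat_params(n_rows):
    """Return (lists, probes) following pgvector's documented rule of thumb.

    Hypothetical helper for illustration; tune against your own recall tests.
    """
    if n_rows <= 1_000_000:
        lists = max(1, n_rows // 1000)
    else:
        lists = int(math.sqrt(n_rows))
    probes = max(1, round(math.sqrt(lists)))
    return lists, probes

if __name__ == "__main__":
    lists, probes = suggest_ivfflat_params(100_000)
    print(f"WITH (lists = {lists});  SET ivfflat.probes = {probes};")
```

The probes value is applied per session with SET ivfflat.probes = N before running the search query. Higher values trade speed for recall.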
Step 2 · Ingest pipeline
import glob
import os

import psycopg2
from openai import OpenAI

# The stock OpenAI client is pointed at the Nex endpoint via base_url.
client = OpenAI(api_key=os.environ["NEX_API_KEY"], base_url="https://api.nextoken.biz/v1")
db = psycopg2.connect(os.environ["DATABASE_URL"])

def chunk(text, size=500, overlap=80):
    """Character-level chunking; Chinese doesn't need word-aware splitting."""
    out, i = [], 0
    while i < len(text):
        out.append(text[i:i + size])
        if i + size >= len(text):
            break  # this window already reached the end; skip a redundant tail chunk
        i += size - overlap
    return out

def embed(text):
    r = client.embeddings.create(model="nex-embed-zh", input=text)
    return r.data[0].embedding

for path in glob.glob("docs/**/*.md", recursive=True):
    with open(path, encoding="utf-8") as f:
        body = f.read()
    for chunk_text in chunk(body):
        v = embed(chunk_text)
        with db.cursor() as c:
            c.execute(
                "INSERT INTO kb_chunks (source, content, embedding) VALUES (%s, %s, %s)",
                (path, chunk_text, v),
            )
    db.commit()  # one commit per file
    print(f"ingested {path}")
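The loop above makes one embedding request per chunk. If the Nex endpoint accepts a list for input the way the OpenAI embeddings API does (an assumption worth checking against your account), the requests can be batched. A minimal sketch follows; batched, embed_many and batch_size=32 are illustrative choices, not part of the original pipeline.

```python
def batched(items, n):
    """Yield consecutive slices of `items` with at most `n` elements each."""
    for start in range(0, len(items), n):
        yield items[start:start + n]

def embed_many(client, texts, batch_size=32, model="nex-embed-zh"):
    """Embed `texts` in batches and return vectors in input order.

    Assumes the endpoint accepts a list for `input` and returns one item per
    text with an `index` field, as the OpenAI embeddings API does.
    """
    vectors = []
    for group in batched(texts, batch_size):
        r = client.embeddings.create(model=model, input=group)
        # Sort by `index` so the output order matches the input order.
        for item in sorted(r.data, key=lambda d: d.index):
            vectors.append(item.embedding)
    return vectors
```

In the ingest loop this would replace the per-chunk embed() call: compute all chunks for a file, call embed_many(client, chunks) once, then insert the (chunk, vector) pairs.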
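Cost check. 100K Chinese characters at nex-embed-zh rates = about $0.001, assuming roughly one token per character. At the same rate, 10 MB of UTF-8 Chinese docs (about 3.3M characters at 3 bytes each) costs roughly 4 ¢ including chunk overlap, and that is a one-time cost.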
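The cost arithmetic can be written out so you can plug in your own corpus size. This is a rough sketch under stated assumptions: about one token per Chinese character and 3 UTF-8 bytes per character. Real token counts depend on the tokenizer, so treat the result as an order-of-magnitude estimate. The function and constant names are illustrative.

```python
PRICE_PER_MILLION_TOKENS_USD = 0.01  # nex-embed-zh rate quoted in this guide
TOKENS_PER_CHAR = 1.0                # assumption: about one token per Chinese character
UTF8_BYTES_PER_CHAR = 3              # most CJK characters take 3 bytes in UTF-8

def embedding_cost_usd(n_chars, overlap_factor=1.0):
    """Estimate the one-time embedding cost for `n_chars` characters.

    `overlap_factor` accounts for text embedded more than once because
    chunks overlap: size / (size - overlap) for fixed-size windows.
    """
    tokens = n_chars * TOKENS_PER_CHAR * overlap_factor
    return tokens * PRICE_PER_MILLION_TOKENS_USD / 1_000_000

if __name__ == "__main__":
    chars = 10_000_000 // UTF8_BYTES_PER_CHAR  # a 10 MB UTF-8 corpus
    factor = 500 / (500 - 80)                  # Step 2 chunking re-embeds the overlap
    print(f"{embedding_cost_usd(100_000):.4f} USD for 100K characters")
    print(f"{embedding_cost_usd(chars, factor):.3f} USD for 10 MB with overlap")
```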
Step 3 · Query pipeline
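def search(question, k=5):
    qv = embed(question)
    with db.cursor() as c:
        c.execute(
            # <=> is pgvector's cosine distance, so 1 - distance is cosine similarity
            "SELECT source, content, 1 - (embedding <=> %s::vector) AS score "
            "FROM kb_chunks ORDER BY embedding <=> %s::vector LIMIT %s",
            (qv, qv, k),
        )
        return c.fetchall()

def answer(question):
    chunks = search(question)
    # Number each chunk (【来源 N】 = "Source N") so the model can cite it.
    context = "\n\n---\n\n".join(
        f"【来源 {i+1}】{src}\n{txt}" for i, (src, txt, _) in enumerate(chunks)
    )
    r = client.chat.completions.create(
        model="nex-pro",
        messages=[
            # System prompt: "You are a customer support assistant. Answer only from
            # the material in <context>. If the answer is not there, say so and do not
            # make anything up. End the answer with the source numbers you cited."
            {"role": "system", "content":
                "你是客户支持助手。只根据 <context> 中的资料回答。"
                "如果资料里没有答案,直接说『资料中没有相关信息』,不要编造。"
                "回答末尾标注引用的来源编号。"},
            {"role": "user", "content": f"<context>\n{context}\n</context>\n\n问题: {question}"},
        ],
        temperature=0.1,
    )
    return r.choices[0].message.content, [c[0] for c in chunks]

ans, srcs = answer("如何申请退款?")  # "How do I request a refund?"
print(ans)
print("引用:", srcs)  # "Citations:"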
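answer() waits for the whole completion before returning. If you would rather show tokens as they arrive, the same request can be streamed. The sketch below assumes the Nex endpoint follows the OpenAI streaming format (chunks with choices[0].delta.content), which this guide does not confirm; stream_text and print_streamed_answer are hypothetical helper names.

```python
def stream_text(chunks):
    """Yield text pieces from an OpenAI-style chat completion stream."""
    for chunk in chunks:
        if not chunk.choices:      # some servers emit chunks without choices
            continue
        piece = chunk.choices[0].delta.content
        if piece:                  # role-only and final chunks carry no text
            yield piece

def print_streamed_answer(client, messages, model="nex-pro"):
    """Print the answer as it streams in and return the full text."""
    stream = client.chat.completions.create(
        model=model, messages=messages, temperature=0.1, stream=True
    )
    parts = []
    for piece in stream_text(stream):
        print(piece, end="", flush=True)
        parts.append(piece)
    print()
    return "".join(parts)
```

The messages argument is the same system + user list built inside answer(), so the retrieval and prompt construction stay unchanged.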
Step 4 · Quality checks
Before shipping to users:
- Chunk size sanity: for Chinese docs, 400–600 chars is usually right. Too small loses context, too large dilutes relevance.
- Reranker (optional): for harder queries, retrieve top-20 then rerank to top-5 with a stronger model. nex-pro can act as a cheap reranker via a "score these chunks 0–10" prompt.
- Hallucination guard: keep temperature=0.1 and the explicit "only answer from context" instruction. Test with off-topic questions.
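The prompt-based reranker mentioned above could look roughly like this. It is a sketch under assumptions, not a tested recipe: the prompt wording, the `number: score` reply format and all three function names are illustrative, and the model is not guaranteed to follow the format, which is why unparsed passages default to a score of 0.

```python
import re

def build_rerank_prompt(question, chunks):
    """Ask the model for one 0-10 relevance score per numbered passage."""
    numbered = "\n\n".join(f"[{i + 1}] {text}" for i, text in enumerate(chunks))
    return (
        "Rate how well each passage answers the question, from 0 to 10.\n"
        "Reply with one line per passage in the form `number: score`.\n\n"
        f"Question: {question}\n\nPassages:\n{numbered}"
    )

def parse_scores(reply, n):
    """Parse `number: score` lines; passages the model skipped get 0."""
    scores = [0.0] * n
    # Accept both the ASCII and the full-width colon.
    for num, score in re.findall(r"(\d+)\s*[:：]\s*(\d+(?:\.\d+)?)", reply):
        idx = int(num) - 1
        if 0 <= idx < n:
            scores[idx] = min(10.0, float(score))
    return scores

def rerank(question, chunks, ask, top_n=5):
    """Return the `top_n` highest-scored chunks.

    `ask` is any callable that sends a prompt to the model and returns
    its text reply, so the function can be tested without network access.
    """
    scores = parse_scores(ask(build_rerank_prompt(question, chunks)), len(chunks))
    order = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [chunks[i] for i in order[:top_n]]
```

To use it with the pipeline, call search(question, k=20), pass the content column as chunks, and supply an ask function that wraps a nex-pro chat completion with temperature=0.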
Production extras
- Cache embeddings of identical chunks by content hash to save re-ingest cost
- Add a per-user model_allowlist on the API key (Settings → API Keys) so the embedding key can only call nex-embed-zh + nex-pro, even if leaked
- Stream the answer for better UX: pass stream=True and iterate over r
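Caching by content hash, the first extra above, can be sketched as follows. This assumes an in-memory dict as the cache for simplicity; a Postgres table keyed by the hash would survive restarts. content_key and embed_cached are hypothetical helper names.

```python
import hashlib

def content_key(text):
    """Stable cache key: SHA-256 of the chunk's UTF-8 bytes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def embed_cached(text, cache, embed_fn):
    """Return the cached vector for `text`, calling `embed_fn` only on a miss.

    `cache` is any dict-like store mapping content hash to vector.
    """
    key = content_key(text)
    if key not in cache:
        cache[key] = embed_fn(text)
    return cache[key]
```

With the Step 2 code, embed_fn would be the embed() function, so a chunk that is unchanged between ingests costs nothing the second time.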