
Building RAG with Claude in 2026: 200K Context, Caching, and Citations

Claude's 200K context window changes RAG architecture. Here is how to build a production RAG pipeline with Sonnet 4.6, prompt caching, citations, and stable retrieval.

Retrieval-augmented generation looks very different in 2026 than it did two years ago. With Claude Sonnet 4.6 shipping a 200K token context window and prompt caching cutting the price of stable context by an order of magnitude, the question is no longer "how do I cram the right chunk into 8K tokens?" — it is "when do I retrieve at all, and when do I just stuff the whole knowledge base into a cached system prompt?"

This post is a working architect's guide to building RAG on Claude in 2026: when classic retrieval still wins, how to use the long-context window without going broke, which embedding and reranker models to pair with Claude, how to get reliable citations, and the pitfalls that keep showing up in production.

Classic RAG vs long-context: when to retrieve

The first decision is whether to retrieve at all. The 200K window on Sonnet 4.6 is large enough to hold roughly 100–150 medium-sized documents (5–10 pages each), which covers a surprising number of real-world knowledge bases end-to-end. Three rules of thumb, with a rough decision sketch after the list:

  • Stuff the context when your corpus is small (under ~150K tokens), stable across sessions, and questions span multiple documents. A cached system prompt with the entire corpus beats a retriever on both quality and latency after the first call.
  • Retrieve when your corpus is large (millions of tokens), changes frequently, or when each query only needs 1–5 documents to answer. Embedding search remains the cheapest way to narrow the field.
  • Go hybrid for everything in between: retrieve aggressively (top-50 or top-100) and let Claude's long context absorb the slack. This is the dominant pattern in 2026 because it reduces the burden on retrieval precision and pushes more of the quality budget onto the model.
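
Those rules reduce to a few measurable inputs. The helper below is an illustration, not a hard rule: the cut-offs mirror the thresholds above, and corpus_is_stable and docs_needed_per_query are hypothetical inputs you would estimate for your own workload.

def choose_strategy(corpus_tokens: int, corpus_is_stable: bool,
                    docs_needed_per_query: int) -> str:
    # Thresholds mirror the rules of thumb above; tune them for your workload.
    if corpus_tokens < 150_000 and corpus_is_stable:
        return "stuff"      # cache the whole corpus in the system prompt
    if corpus_tokens > 1_000_000 and docs_needed_per_query <= 5:
        return "retrieve"   # classic top-k retrieval narrows the field cheaply
    return "hybrid"         # over-retrieve and let the long context absorb the slack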

Why prompt caching changes the math

Prompt caching is the single biggest reason the long-context approach is now economically viable. When you mark a prefix of the prompt as cacheable, Anthropic stores it server-side; subsequent calls that hit the same prefix pay roughly 10% of the normal input rate for that portion. Cache writes cost about 25% more than normal input tokens, but cache reads are dramatically cheaper.

For a stable knowledge base, that means:

  1. First call: pay to write the cache (roughly a 25% premium over the normal input rate on the cached portion).
  2. Every subsequent call within the cache TTL: pay ~10% for the cached prefix and full price only on the per-query suffix.

In practice, a RAG assistant that serves dozens of queries against the same corpus per hour breaks even after two or three calls and is roughly 10× cheaper than uncached prompts thereafter. This is what makes "stuff 80 documents into Sonnet" a real production pattern, not a demo trick.
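
To make the break-even concrete, here is the arithmetic on the cached portion in relative units (1.0 = the normal input price of that prefix), assuming the ~25% write premium and ~10% read rate above and ignoring the per-query suffix:

def cached_cost(n_calls: int, write_premium: float = 1.25, read_rate: float = 0.10) -> float:
    # First call writes the cache at a premium; later calls read it at ~10%.
    return write_premium + (n_calls - 1) * read_rate

def uncached_cost(n_calls: int) -> float:
    return float(n_calls)

for n in (1, 2, 10, 100):
    print(n, round(cached_cost(n), 2), uncached_cost(n))
# n=1:   1.25 vs 1.0   (the cache write costs more than a plain call)
# n=2:   1.35 vs 2.0   (already ahead)
# n=100: 11.15 vs 100.0 (roughly 9x cheaper on the cached prefix)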

Embedding model choice

If you do retrieve, the embedding model matters more than people give it credit for. In 2026 the practical shortlist is:

  • Voyage AI (voyage-3-large or voyage-code-3 for code-heavy corpora) — Anthropic's officially recommended pairing. Strong on English, code, and multilingual; built specifically for RAG.
  • OpenAI text-embedding-3-large — the safe fallback if you already have OpenAI keys and want a known-good 3072-dimensional baseline.
  • BAAI bge-m3 — open-source, multilingual, strong recall, runs locally on a single A100 if you need to keep embeddings on-prem.

Whatever you choose, lock the model version. Re-embedding a corpus because a vendor silently upgraded a model is the most common self-inflicted RAG outage in production.
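
One lightweight way to enforce that lock is to store the model identifier next to every vector and refuse to serve queries against a mixed index. A minimal sketch; the store interface and field names here are hypothetical:

EMBEDDING_MODEL = "voyage-3-large"  # pin the exact model; bump it only with a full re-embed

def store_chunk(vector_store, chunk_id: str, text: str, embedding: list[float]) -> None:
    # Persist the model name alongside the vector so drift is detectable later.
    vector_store.upsert(
        id=chunk_id,
        vector=embedding,
        metadata={"text": text, "embedding_model": EMBEDDING_MODEL},
    )

def check_embedding_version(candidate: dict) -> None:
    if candidate["metadata"]["embedding_model"] != EMBEDDING_MODEL:
        raise RuntimeError("Mixed embedding models in the index; re-embed the corpus.")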

Reranking: the cheap win nobody skips anymore

A vector retriever returns plausible candidates, not ranked answers. The 2026 default stack is to over-retrieve (top-50 to top-100) and then rerank down to the top-10 or top-20 with a cross-encoder:

  • Cohere Rerank 3 — hosted, multilingual, very good quality, pay-per-call.
  • BAAI bge-reranker-v2-m3 — open-source, runs on commodity GPUs, no per-call cost.

Adding a reranker typically lifts answer quality more than swapping Sonnet for Opus does, at a tiny fraction of the cost. If you are not reranking in 2026, that is the first thing to fix.
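
For the open-source option, the usual way to run bge-reranker-v2-m3 locally is through the sentence-transformers CrossEncoder wrapper. A minimal sketch (the model downloads from the Hugging Face hub on first use; a GPU is strongly recommended):

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def rerank_local(query: str, candidates: list[dict], top_n: int = 10) -> list[dict]:
    # Score each (query, passage) pair with the cross-encoder, then keep the best.
    scores = reranker.predict([(query, c["text"]) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [c for c, _ in ranked[:top_n]]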

Hybrid search: BM25 still earns its keep

Pure vector search misses exact-match queries: product SKUs, error codes, function names, legal section numbers. The fix is hybrid search: run BM25 (via OpenSearch, Elastic, or tantivy) and vector retrieval in parallel, then fuse the scores with reciprocal rank fusion (RRF). Hybrid search reliably beats vector-only on heterogeneous corpora, and it costs almost nothing once you already have the documents indexed.
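
Reciprocal rank fusion itself is only a few lines: a document's fused score is the sum of 1/(k + rank) over every ranked list it appears in, with k conventionally around 60. A minimal sketch over lists of document IDs:

def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    # ranked_lists might be [bm25_ids, vector_ids]; rank is 1-based in the formula.
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)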

Chunking strategies that actually work

Chunking remains a tuning problem, not a solved one. The defaults that hold up in 2026:

  • 800–1200 token chunks with 100–200 token overlap for prose.
  • Whole-function chunks for code, with the file path and class name prepended as a header.
  • Semantic chunking (split on heading boundaries first, then on size) for technical docs.
  • Always store a parent document ID and a stable chunk ID so citations can resolve back to the source.

Avoid the temptation to chunk too small. Sub-500-token chunks lose context and force the reranker to do work the embedder should have done.
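
A sketch of the semantic-chunking default for markdown docs (split on heading boundaries first, then pack sections up to the size limit); count_tokens here is a crude whitespace approximation standing in for a real tokenizer:

import re

def chunk_document(doc_id: str, text: str, max_tokens: int = 1000,
                   count_tokens=lambda s: len(s.split())) -> list[dict]:
    # Split on markdown heading boundaries first, then pack sections into ~max_tokens chunks.
    sections = re.split(r"\n(?=#{1,6} )", text)
    chunks, buffer = [], ""
    for section in sections:
        if buffer and count_tokens(buffer + section) > max_tokens:
            chunks.append(buffer)
            buffer = ""
        buffer += section
    if buffer:
        chunks.append(buffer)
    # Stable chunk IDs plus the parent document ID so citations can resolve to the source.
    return [
        {"chunk_id": f"{doc_id}#{i}", "parent_id": doc_id, "text": chunk}
        for i, chunk in enumerate(chunks)
    ]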

Getting reliable citations

Hallucinated citations are the failure mode that erodes user trust fastest. The pattern that works on Claude:

  1. Tag every retrieved chunk in the prompt with an explicit ID: [1] (source: docs/auth.md): ...content...
  2. In the system prompt, instruct Claude to cite every factual claim with the bracketed ID it came from, and to refuse to answer if no retrieved chunk supports the claim.
  3. Post-process the response to verify each [N] actually maps to a chunk you sent. Drop or flag any that do not.

Step 3 is non-negotiable. Even Sonnet 4.6 occasionally invents a citation ID under pressure; the verifier is what makes the system trustworthy.
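
The verifier in step 3 is a few lines of post-processing. A minimal sketch, assuming the [N] tagging convention above:

import re

def invalid_citations(answer: str, num_chunks_sent: int) -> list[int]:
    # Collect every [N] the model cited and flag any ID we never actually sent.
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    return sorted(i for i in cited if not 1 <= i <= num_chunks_sent)

Any non-empty return value should be treated as a failed answer: drop the offending citations, flag the response, or re-ask the model.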

A minimal pipeline with Claudexia

Here is a working sketch of the full loop — embed, retrieve, rerank, cite — using the Claudexia API endpoint:

import anthropic
import voyageai
import cohere

client = anthropic.Anthropic(
    base_url="https://api.claudexia.tech/v1",
    api_key="<your-claudexia-key>",
)
voyage = voyageai.Client()
co = cohere.Client()

def answer(question: str, vector_store) -> str:
    # 1. Embed the query (input_type="query" matches Voyage's retrieval setup)
    q_emb = voyage.embed(
        [question], model="voyage-3-large", input_type="query"
    ).embeddings[0]

    # 2. Vector retrieve top-50 (your vector store call here)
    candidates = vector_store.search(q_emb, k=50)

    # 3. Rerank to top-10 with Cohere
    reranked = co.rerank(
        model="rerank-3",
        query=question,
        documents=[c["text"] for c in candidates],
        top_n=10,
    )
    top = [candidates[r.index] for r in reranked.results]

    # 4. Build the cited context
    context = "\n\n".join(
        f"[{i+1}] (source: {c['source']}): {c['text']}"
        for i, c in enumerate(top)
    )

    # 5. Call Claude with a cached system prompt
    msg = client.messages.create(
        model="claude-sonnet-4.6",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": "You are a precise assistant. Cite every factual "
                        "claim with the bracketed source ID like [1]. "
                        "If no source supports an answer, say so.",
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[
            {
                "role": "user",
                "content": f"{context}\n\nQuestion: {question}",
            }
        ],
    )
    return msg.content[0].text

The cache_control marker is the lever that unlocks the 10× discount on the system prompt across calls. For a fully stuffed knowledge base, move the corpus itself into the cached system block instead of the user message — same pattern, much bigger savings.
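
Moving the corpus into the cached system block looks like the sketch below, where corpus_text and question are assumed variables and the cache breakpoint sits after the large, stable block:

msg = client.messages.create(
    model="claude-sonnet-4.6",
    max_tokens=1024,
    system=[
        {   # stable instructions; everything up to the breakpoint below is cached
            "type": "text",
            "text": "You are a precise assistant. Cite every claim with its [N] source ID.",
        },
        {   # the entire corpus; cache_control here marks the end of the cached prefix
            "type": "text",
            "text": corpus_text,
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": f"Question: {question}"}],
)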

Evaluate retrieval and generation separately

The single most useful discipline in production RAG is to measure the two stages independently:

  • Retrieval quality: recall@k against a labelled set of (question, correct-document) pairs. If recall@10 is below ~0.85, no amount of prompt engineering will save the answer.
  • Generation quality: faithfulness (does the answer match the retrieved context?) and answer correctness, scored either by humans or by an LLM-judge run on a held-out eval set.

Mixing the two metrics is how teams end up "fixing" generation issues by tweaking the retriever and vice versa. Keep them separate.
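
Measuring retrieval on its own takes very little code once the labelled pairs exist. A sketch, assuming retrieve(question, k) returns the IDs of the top-k documents:

def recall_at_k(labels: list[tuple[str, str]], retrieve, k: int = 10) -> float:
    # labels are (question, correct_doc_id) pairs from your evaluation set.
    hits = sum(
        1 for question, correct_id in labels
        if correct_id in retrieve(question, k)
    )
    return hits / len(labels)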

Common pitfalls

  • Lost-in-the-middle. Even with 200K tokens, Claude attends most reliably to the start and end of the context. Put the most relevant chunks first, the question last, and avoid burying critical facts in the middle of a 100K-token wall.
  • Citation hallucination. Always verify cited IDs map to real chunks. Treat unverified citations as a failure, not a warning.
  • Cache invalidation. Any change to the cached prefix — even whitespace — invalidates the cache. Version your system prompts and templates.
  • Stale embeddings. When you upgrade your embedding model, re-embed the entire corpus. Mixed-model vector stores degrade silently.
  • Over-retrieval without reranking. Pulling top-100 chunks and dumping them straight into the prompt costs money and hurts quality. Always rerank.

Bottom line

The winning 2026 RAG stack is boring in the best way: a cached system prompt, Sonnet 4.6 as the generator, Voyage embeddings, a Cohere or BGE reranker, hybrid BM25+vector retrieval, and a citation verifier on the way out. Each piece is cheap, well-understood, and improves quality on its own. Together they give you a system that is fast, auditable, and cost-stable as your corpus grows. For pricing details on Sonnet 4.6 and the rest of the family, see our companion post on Claude API pricing in 2026.