Retrieval-augmented generation explained: when to use RAG vs long context
By ThePromptEra Editorial
The choice you're actually making
You have documents. A lot of them. Maybe product specs, legal contracts, customer support tickets, or a knowledge base. You need Claude to understand and work with this material.
You face a decision: feed everything directly into Claude's context window, or set up retrieval-augmented generation (RAG) to pull in only relevant documents?
This isn't a theoretical question. The choice impacts cost, latency, accuracy, and maintainability. Let's cut through the confusion.
What's actually happening with RAG
RAG is conceptually simple: you store documents somewhere searchable, a user asks a question, you find the relevant documents, and you stuff them into the prompt. That's it.
The mechanics:
- Documents get chunked and embedded into vector space (Anthropic doesn't offer an embeddings API, so you'd use a third-party embedding model such as Voyage AI's)
- User query gets embedded similarly
- Vector database finds semantically similar chunks
- Those chunks go into your Claude prompt as context
- Claude answers based on retrieved material
The appeal is obvious: you're not paying for tokens on irrelevant documents.
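Here's that pipeline as a minimal sketch. The embed() helper is a placeholder for whichever embedding provider you use, and the model ID is illustrative, not a recommendation:

```python
import numpy as np
import anthropic

def embed(texts: list[str]) -> np.ndarray:
    """Placeholder: call your embedding provider here and return one vector
    per input text. (Anthropic doesn't provide embeddings; Voyage AI is one
    commonly used option.)"""
    raise NotImplementedError("wire this to your embedding provider")

def retrieve(query: str, chunks: list[str], chunk_vecs: np.ndarray, k: int = 5) -> list[str]:
    """Rank chunks by cosine similarity to the query and return the top k."""
    q = embed([query])[0]
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

def answer(query: str, chunks: list[str], chunk_vecs: np.ndarray) -> str:
    """Stuff the retrieved chunks into the prompt and ask Claude."""
    context = "\n\n---\n\n".join(retrieve(query, chunks, chunk_vecs))
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    message = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model ID; use a current one
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Answer using only this context:\n\n{context}\n\nQuestion: {query}",
        }],
    )
    return message.content[0].text
```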
Why Claude's long context changes the game
Claude's context window now holds 200,000 tokens, with a 1-million-token window available in beta on some models. That's roughly 150,000 words. For many document workflows, this isn't theoretical capacity; it's practical.
Here's what matters: you can fit an entire small knowledge base directly into a single conversation. No embedding database. No vector search latency. No retrieval pipeline to maintain.
The cost math: at $3 per million input tokens (Claude Sonnet pricing), adding 100,000 tokens of context costs $0.30 per request. That's often negligible next to the value of the problem you're solving.
When to skip RAG entirely (use long context instead)
Static, bounded document sets
You have 50–100 documents that rarely change: customer onboarding docs, an employee handbook, a product API reference. Upload them all. The cost per request is lower than maintaining a retrieval system, and the pipeline complexity drops to zero.
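As a rough sketch, "upload them all" can be as literal as concatenating the files into one prompt. The folder path, question, and model ID below are hypothetical:

```python
from pathlib import Path
import anthropic

# Hypothetical folder holding a full employee handbook as markdown files.
docs = sorted(Path("handbook").glob("*.md"))
corpus = "\n\n===\n\n".join(p.read_text() for p in docs)

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
message = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model ID; use a current one
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"Here is our employee handbook:\n\n{corpus}\n\n"
                   "What is the parental leave policy?",
    }],
)
print(message.content[0].text)
```

If the same corpus goes out on every request, Anthropic's prompt caching can substantially cut the cost of those repeated input tokens.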
High precision requirements
RAG works when roughly 80% of the right context is good enough. But if Claude must consider an entire document to answer correctly, because the answer depends on subtle cross-references or context scattered across pages, RAG's chunking strategy will fail you. Legal analysis, financial audits, and technical architecture reviews often fall here.
Conversational continuity matters
If users ask follow-up questions that shift context, pure retrieval breaks down: each turn fetches fresh chunks with no memory of what came before. With long context, the conversation thread stays intact and Claude remembers what was discussed. This is why long context is better for interactive Q&A over a known document set.
You don't have a retrieval system yet
Sometimes the simplest solution wins. If setting up embeddings, vector storage, and search logic would take your team two weeks but long context solves the problem today, use long context. Ship first. Optimize after you understand the real bottleneck.
Sub-100 queries per month
RAG infrastructure (vector DB, embeddings API, search logic) carries a baseline operational cost and complexity. Below a certain query volume, direct context is simply cheaper.
When RAG actually wins
Massive, constantly updated knowledge bases
If you're indexing 10,000+ documents that change weekly (support articles, research papers, internal wikis), retrieval is the only sane option. Embedding once and retrieving relevant sections beats reprocessing everything on every request.
Variable and unpredictable content
You don't know in advance which documents will be relevant. A customer support chatbot doesn't know which product the user is asking about until they ask. RAG finds the right documents automatically; stuffing everything in is wasteful.
Cost optimization at scale
At 10,000+ queries per month, paying for unused context tokens adds up. Retrieval lets you pay only for relevant material. The math flips.
Multi-turn retrieval with shifting context
Some workflows pull different documents at each turn. A research assistant that digs deeper into sources as the conversation evolves benefits from retrieval's flexibility.
Document classification and routing
If you need to route users to specific documents reliably, RAG with metadata filtering ensures you're not relying on Claude to guess which file contains the answer.
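A minimal sketch of that idea, with hypothetical field names: hard-filter candidates on metadata first, then rank only the survivors by similarity (reusing the embed() placeholder from the first sketch).

```python
import numpy as np

# Each stored chunk carries routing metadata alongside its text.
# The "product" and "doc" fields are hypothetical.
chunks = [
    {"text": "Resetting the Pro model: hold the button...", "product": "widget-pro", "doc": "pro-manual.md"},
    {"text": "Lite battery care: avoid full discharge...", "product": "widget-lite", "doc": "lite-faq.md"},
]

def route(query: str, product: str, chunk_vecs: np.ndarray, k: int = 5) -> list[dict]:
    """Filter on metadata before similarity search, so only chunks for the
    right product are candidates and Claude never guesses across product lines.
    embed() is the placeholder from the first sketch; a vector DB's built-in
    filtered search does the same job."""
    idx = [i for i, c in enumerate(chunks) if c["product"] == product]
    q = embed([query])[0]
    vecs = chunk_vecs[idx]
    sims = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q))
    return [chunks[idx[i]] for i in np.argsort(sims)[::-1][:k]]
```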
The practical evaluation framework
Ask yourself these questions in order:
1. How many documents? Fewer than 100 and relatively stable? Long context. More than 1,000, or constantly changing? RAG.
2. What's the failure mode? If you need certainty that all relevant context is included, long context wins. If you can tolerate occasionally missing a tangential detail in exchange for cost savings, RAG works.
3. How often will this run? Fewer than 100 queries per month: long context. More than 1,000: evaluate RAG.
4. Do I have time? Shipping with long context this week beats building RAG over three weeks. Get feedback first.
5. What's my retrieval quality? Be honest: can you build reliable embeddings and search for your documents? If your domain is highly technical or your documents are unstructured, retrieval fails silently, returning the wrong chunks while Claude answers confidently from them. Long context doesn't have that failure mode.
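If you prefer the heuristics as code, here's a toy helper. The thresholds are this framework's rules of thumb, not hard limits:

```python
def choose_architecture(
    num_docs: int,
    docs_change_often: bool,
    queries_per_month: int,
    needs_full_context: bool,
) -> str:
    """Toy decision helper encoding the rules of thumb above."""
    if needs_full_context:
        return "long context"          # precision beats cost savings
    if num_docs < 100 and not docs_change_often and queries_per_month < 100:
        return "long context"          # small, stable, low volume
    if num_docs > 1_000 or docs_change_often or queries_per_month > 1_000:
        return "RAG (or hybrid)"       # scale, churn, or volume
    return "start with long context"   # ship first, optimize later

print(choose_architecture(50, False, 80, False))        # -> long context
print(choose_architecture(5_000, True, 10_000, False))  # -> RAG (or hybrid)
```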
A concrete example
Imagine you're building a contract analysis tool.
Scenario A: Long context approach
- Lawyer uploads 15 contracts
- All go into Claude's context
- Claude extracts obligations, risks, deadlines
- Cost per analysis: ~$0.50
- Time to ship: 3 hours
- Works well
Scenario B: RAG approach
- 15 contracts get chunked, embedded, stored
- User uploads new contract
- System retrieves the most similar contracts and passes them to Claude as context
- Cost per analysis: ~$0.20
- Time to ship: 2 weeks
- Slightly better cost, but fragile retrieval
For 15 contracts, choose A.
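The break-even arithmetic makes that call concrete. The build cost below is a made-up placeholder; substitute your own numbers:

```python
cost_long_context = 0.50   # per analysis, Scenario A
cost_rag = 0.20            # per analysis, Scenario B
BUILD_COST = 15_000        # hypothetical: ~2 weeks of engineering time

savings_per_analysis = cost_long_context - cost_rag   # $0.30
breakeven = BUILD_COST / savings_per_analysis         # 50,000 analyses
print(f"RAG pays for itself after {breakeven:,.0f} analyses")
```

At low volume, the per-analysis savings never catch up with the cost of the build.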
Now imagine 5,000 contracts, added weekly, and users who can ask about any of them. Choose B: at that scale the corpus can't fit in any context window, and the retrieval system pays for itself immediately.
The hybrid approach
Sophisticated teams use both. Store everything in a vector DB, but retrieve aggressively into a long-context window. This gives you:
- The cost savings of selective retrieval
- The accuracy benefits of seeing full document context
- The flexibility to refine and re-retrieve within a conversation
It's more complex to implement, but it's the architecture used by serious document-heavy applications.
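In sketch form, the hybrid is the earlier retrieval code with the dials turned up:

```python
# Hybrid sketch, reusing retrieve() from the first example: index whole
# documents rather than small chunks, and retrieve generously, since a
# 200K window holds dozens of full documents.
def hybrid_context(query: str, documents: list[str], doc_vecs) -> str:
    top_docs = retrieve(query, documents, doc_vecs, k=30)
    return "\n\n===\n\n".join(top_docs)

# On later turns, call hybrid_context() again with the user's refined
# question and append the newly retrieved documents to the conversation.
```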
Make the call
RAG isn't inherently better than long context, and long context isn't inherently better than RAG. They're optimized for different constraints.
Start with long context. It's simpler, ships faster, and Claude's 200K tokens solve most document problems directly. Move to RAG only when you hit long context's limits: scale, cost, or constantly changing documents.
And always remember: the best RAG system is one you don't have to build.