RAG (Retrieval-Augmented Generation)
RAG combines a language model with a search step over your own documents, so answers stay grounded in your data instead of being hallucinated.
Retrieval-Augmented Generation is the dominant pattern for building useful AI systems on top of company-specific data. Instead of asking an LLM to "remember" everything you care about (which costs money, runs into context limits, and produces hallucinations), RAG splits the work in two:
- Retrieve: when a user asks a question, search a database of your documents (PDFs, wikis, support tickets, code) for the most relevant snippets.
- Generate: paste those snippets into the LLM's context window along with the user's question, then ask it to answer using only that material.
The retrieval step is typically powered by a vector database (Pinecone, Weaviate, pgvector) that stores documents as numerical embeddings. The generation step is whichever model you trust most (Claude, GPT-4o, Llama).
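To make the two steps concrete, here is a minimal end-to-end sketch in Python. The `embed()` function is a hypothetical stand-in (a crude letter-count vector) for a real embedding model, and the in-memory list stands in for a vector database; the corpus and question are made up for illustration. The shape, though, is exactly the retrieve-then-generate loop described above:

```python
import math

# Hypothetical stand-in for a real embedding model (normally an API
# call or local model). A crude bag-of-letters vector is just enough
# to make the sketch runnable end to end.
def embed(text: str) -> list[float]:
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

# 1. Index: embed each document chunk once, up front.
docs = [
    "Refunds are processed within 5 business days.",
    "Our API rate limit is 100 requests per minute.",
    "Support is available Monday through Friday.",
]
index = [(doc, embed(doc)) for doc in docs]

# 2. Retrieve: embed the question, rank chunks by similarity.
question = "How fast do refunds arrive?"
q_vec = embed(question)
best_doc, _ = max(index, key=lambda item: cosine(q_vec, item[1]))

# 3. Generate: paste the retrieved chunk into the prompt and send
# that to whichever LLM you trust.
prompt = (
    "Answer using ONLY the context below.\n\n"
    f"Context: {best_doc}\n\nQuestion: {question}"
)
print(prompt)
```

Swap `embed()` for a real embedding model and the list for a vector database and you have the skeleton of every production RAG system.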
RAG matters because it solves three problems at once: freshness (your data updates without retraining), traceability (you can show users which document an answer came from), and cost (a small retrieved context is cheaper than a 200k-token prompt). The trade-off is engineering complexity: retrieval quality often dominates final answer quality, so chunking strategy, embedding model choice, and re-ranking all matter.
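Chunking is usually the first of those levers teams pull. As a point of reference, here is a fixed-size chunker with overlap, one common baseline; the 500/50 character sizes are illustrative assumptions, not recommendations:

```python
def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks with overlap, so a sentence
    cut at one boundary still appears whole in a neighboring chunk."""
    step = size - overlap
    return [text[start:start + size] for start in range(0, len(text), step)]
```

Smarter strategies split on paragraph or heading boundaries instead of raw character counts, but the overlap idea carries over.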
When you hear "Chat with your PDFs" or "AI for your codebase", you're almost always hearing about a RAG product. When you hear "AI hallucinated a fake citation," you're hearing about a system that needed RAG but skipped it.
Real-world examples
- Notion Q&A pulls answers from your workspace, not from training data
- Cursor uses RAG over your codebase before suggesting edits
- GitHub Copilot's @workspace command is RAG over your repo
FAQ
Is RAG better than fine-tuning?
For most use cases, yes. RAG handles new information without retraining, costs less, and gives you citation transparency. Fine-tuning makes more sense when you need the model to learn a specific style, format, or task that prompting can't reliably elicit.
Do I need a vector database for RAG?
Not for small corpora. With <10k documents, lexical search (BM25) or even keyword filtering can outperform vector search. Vector DBs become necessary at scale or when you need semantic recall across paraphrased queries.
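For a sense of how simple the lexical route can be, here is a sketch using the open-source rank-bm25 package (`pip install rank-bm25`); the tiny corpus and whitespace tokenization are illustrative simplifications:

```python
from rank_bm25 import BM25Okapi

docs = [
    "Refunds are processed within 5 business days.",
    "Our API rate limit is 100 requests per minute.",
    "Support is available Monday through Friday.",
]

# Naive tokenization; a real system would also strip punctuation
# and possibly stem.
tokenized = [doc.lower().split() for doc in docs]
bm25 = BM25Okapi(tokenized)

query = "how fast are refunds".split()
print(bm25.get_top_n(query, docs, n=1))
# expected: ['Refunds are processed within 5 business days.']
```

No embeddings, no index service, and exact keyword matches (product names, error codes) often rank better than they would under vector search.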
Related terms
- Embeddings: An embedding is a list of numbers that represents the meaning of a piece of text, image, or audio so similar things cluster together in vector space.
- Vector database: A vector database stores numerical embeddings of text/images/audio and finds similar items by distance, powering semantic search and RAG.
- Context window: The context window is the maximum number of tokens (text chunks) a language model can consider at once, covering both the prompt you send and the response it generates.
- LLM (Large Language Model): A Large Language Model is a neural network trained on huge volumes of text to predict the next token, which produces emergent capabilities like reasoning, code generation, and translation.
Want to actually build with this?
Our Stack Builder picks the best AI tools for your specific project in under 60 seconds.
Build my stack →