RAG — Retrieval-Augmented Generation

RAG grounds LLM responses in external, up-to-date knowledge. Instead of relying solely on what the model learned during training, you retrieve relevant documents at query time and feed them into the prompt.

The Problem RAG Solves

LLM without RAG:
  ❌ Knowledge cutoff — doesn't know about recent events
  ❌ No access to your private data (company docs, internal wikis)
  ❌ Hallucination — invents facts when it doesn't know something
  ❌ Can't cite sources (there are none)

LLM with RAG:
  ✅ Grounded in real, retrieved documents
  ✅ Can work with your private, up-to-date data
  ✅ Can cite the source documents it used
  ✅ Hallucination is reduced (not eliminated), because answers are tied to retrieved text

RAG Architecture Overview

               ┌─────────────────────────────────────────────┐
               │             INDEXING PIPELINE                │
               │   (done once, or on document updates)        │
               │                                              │
               │  Documents → Chunk → Embed → Vector Store   │
               └─────────────────────────────────────────────┘

                         ┌──────────────────┐
                         │   Vector Store   │
                         │ (Pinecone/Chroma) │
                         └────────┬─────────┘
                                  │
               ┌─────────────────────────────────────────────┐
               │             RETRIEVAL PIPELINE               │
               │   (runs on every user query)                 │
               │                                              │
  User Query ──▶ Embed Query ──▶ Similarity Search ──▶ Top-K │
               │       Chunks                                 │
               └──────────────────────┬──────────────────────┘
                                      │
                         ┌────────────▼────────────┐
                         │  Prompt: Context + Query │
                         │  [chunk1][chunk2][chunk3] │
                         │  "Answer based on above:" │
                         └────────────┬─────────────┘
                                      │
                              ┌───────▼───────┐
                              │      LLM      │
                              └───────┬───────┘
                                      │
                             Grounded Answer
                             + Source Citations

Step 1: Document Loading

Get your data into a usable format. Different sources need different loaders.

from langchain_community.document_loaders import (
    PyPDFLoader,          # PDF files
    WebBaseLoader,        # Web pages
    TextLoader,           # Plain text
    CSVLoader,            # CSV files
    NotionDBLoader,       # Notion databases
    GitLoader,            # Code repositories
)

loader = PyPDFLoader("company_handbook.pdf")
docs = loader.load()  # List[Document] with page_content + metadata

Step 2: Chunking Strategies

Chunk size drives retrieval quality. Chunks that are too large dilute the embedding with unrelated content; chunks that are too small drop the context needed to answer. The right strategy depends on your document type.

mindmap
  root((Chunking Strategies))
    Fixed-Size
      Split every N characters
      Simple fast predictable
      Ignores sentence boundaries
      Good for structured data
    Recursive Character
      LangChain default
      Tries paragraphs then sentences then words
      Respects text structure
      Best general-purpose choice
    Semantic
      Embed sentences find breakpoints
      Chunks around meaning not size
      More expensive but better quality
    Document-Aware
      Split at natural boundaries
      Headers for markdown
      Functions for code
      Pages for PDFs
      Best when structure matters
    Sliding Window
      Chunks overlap by N chars
      Prevents cutting context mid-sentence
      overlap=200 chars typical
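
To make the "Document-Aware" branch concrete, here is a minimal sketch of splitting markdown at header lines. The function name `split_by_headers` and the output shape are assumptions for illustration, not a library API. It treats any line starting with `#` as a header, so it would misread `#` comments inside code blocks.

```python
def split_by_headers(markdown: str) -> list[dict]:
    """Split markdown into sections, one per header line (hypothetical helper)."""
    sections = []
    current = {"header": None, "lines": []}
    for line in markdown.splitlines():
        if line.startswith("#"):
            # Close the previous section before starting a new one.
            if current["header"] is not None or current["lines"]:
                sections.append(current)
            current = {"header": line.lstrip("#").strip(), "lines": []}
        else:
            current["lines"].append(line)
    if current["header"] is not None or current["lines"]:
        sections.append(current)
    return [
        {"header": s["header"], "content": "\n".join(s["lines"]).strip()}
        for s in sections
    ]


sections = split_by_headers("# Intro\nHello\n\n## Setup\nInstall it\n")
```

Each section keeps its header, which can be stored as chunk metadata so retrieved text carries its place in the document.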

Chunk Size Trade-offs

Small chunks (~200 tokens):
  ✅ Precise retrieval — return exactly what was asked
  ❌ May lose surrounding context needed for the answer
  Best for: FAQs, fact lookup

Large chunks (~800 tokens):
  ✅ Rich context — answer fits within the chunk
  ❌ Retrieval is noisier — includes irrelevant content too
  Best for: long-form documents, conceptual explanations

Overlap (~10-20% of chunk size):
  Chunks share N tokens with next chunk
  Prevents answers being split at a boundary
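
A minimal sketch of fixed-size chunking with overlap, to show how the window and the shared region interact. It counts characters rather than tokens to stay dependency-free; that simplification and the function name `chunk_with_overlap` are assumptions for illustration.

```python
def chunk_with_overlap(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into character windows; each window shares `overlap` chars with the next."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # How far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last window already reached the end of the text
    return chunks


chunks = chunk_with_overlap("abcdefghij", chunk_size=4, overlap=1)
```

With a 4-character window and 1 character of overlap, the last character of each chunk reappears as the first character of the next.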

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,        # Max characters per chunk (default length function is len, not a tokenizer)
    chunk_overlap=50,      # Characters shared between adjacent chunks
    separators=["\n\n", "\n", " ", ""],  # Try these in order
)
chunks = splitter.split_documents(docs)

Step 3: Embedding & Storing

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Embed chunks and store in vector DB (one-time setup)
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
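
What the vector store does at query time can be sketched in plain Python: score every stored vector against the query vector and keep the best matches. The three-dimensional vectors and the names `cosine_similarity` and `top_k` are made up for illustration; real stores use the embedding model's full dimensionality and approximate indexes instead of a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Brute-force search: score every (text, vector) pair, return the k best."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy "embeddings" (hypothetical values, not produced by a real model).
store = [
    ("refund policy", [0.9, 0.1, 0.0]),
    ("vacation days", [0.1, 0.9, 0.0]),
    ("expense reports", [0.7, 0.3, 0.1]),
]
results = top_k([1.0, 0.0, 0.0], store, k=2)
```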

Vector Databases

┌──────────────────┬─────────────┬────────────────────────────────────────┐
│ Database         │ Type        │ Notes                                  │
├──────────────────┼─────────────┼────────────────────────────────────────┤
│ Pinecone         │ Managed     │ Fully hosted, production-grade, fast   │
│ Weaviate         │ Self/Managed│ GraphQL API, multi-modal support       │
│ Chroma           │ Open-source │ Easiest for local dev + prototyping    │
│ Qdrant           │ Open-source │ Rust-based, fast, rich filtering       │
│ pgvector         │ Extension   │ Vector search inside PostgreSQL        │
│ Redis (RediSearch)│ In-memory  │ Low-latency, existing Redis infra      │
│ FAISS            │ Library     │ Meta's library, in-memory, no server   │
└──────────────────┴─────────────┴────────────────────────────────────────┘

pgvector is worth highlighting — it adds vector search to PostgreSQL, letting you keep everything in one database without a separate vector store service.

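-- pgvector example
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding vector(1536)  -- Must match the embedding model's dimension (1536 for text-embedding-3-small)
);

-- Nearest-neighbour search. <-> is L2 (Euclidean) distance;
-- use <=> for cosine distance or <#> for negative inner product.
SELECT content, embedding <-> '[0.1, 0.2, ...]' AS distance
FROM documents
ORDER BY distance
LIMIT 5;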
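
The query above orders by `<->`, which pgvector defines as L2 distance. The sketch below, using made-up two-dimensional vectors, shows why the choice of operator matters: L2 and cosine distance can rank the same documents differently unless the vectors are normalized to unit length first. It assumes non-zero vectors.

```python
import math


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def normalize(v):
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def rank(query, docs, distance):
    """Return document names ordered from nearest to farthest."""
    return sorted(docs, key=lambda name: distance(query, docs[name]))


query = [1.0, 0.0]
docs = {"A": [10.0, 0.0], "B": [1.0, 0.5]}  # A points the same way as the query but is longer

by_l2 = rank(query, docs, l2_distance)          # Magnitude dominates: B is nearer
by_cosine = rank(query, docs, cosine_distance)  # Direction only: A is nearer
unit_docs = {name: normalize(vec) for name, vec in docs.items()}
by_l2_unit = rank(normalize(query), unit_docs, l2_distance)  # Agrees with cosine
```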

Step 4: Retrieval Strategies

mindmap
  root((Retrieval Strategies))
    Dense Retrieval
      Embed query find nearest vectors
      Semantic understanding
      Catches paraphrase and synonyms
      Default approach
    Sparse Retrieval BM25
      Keyword matching TF-IDF based
      Exact term matching
      Fast no embedding needed
      Misses synonyms and paraphrases
    Hybrid Search
      Combine dense + sparse scores
      Best of both worlds
      Reciprocal Rank Fusion RRF
      Most robust in production
    HyDE Hypothetical Document Embedding
      Generate hypothetical answer first
      Embed the answer not the question
      Better for vague queries
      Extra LLM call cost
    Multi-Query Retrieval
      Generate N variations of the query
      Retrieve for each variation
      Merge and deduplicate results
      Catches different phrasings
    Parent Document Retrieval
      Index small child chunks
      Retrieve parent document on match
      Better context around the match

Hybrid Search (Most Robust)

Query: "how to handle auth errors"

Dense retrieval finds:
  - "authentication failure handling" (semantic match)
  - "token expiry management" (related concept)

BM25 (sparse) finds:
  - "catch auth errors in middleware" (keyword match)
  - "error codes for auth" (exact terms)

Combined → better coverage than either alone
RRF score(d) = Σ_i 1/(k + rank_i(d)), summed over each retriever i that returned d
               (k is a smoothing constant; 60 is a common default)
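
The RRF formula above can be sketched in a few lines. Each input is a ranked list of document IDs from one retriever; the IDs below are hypothetical labels for the chunks in the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each list adds 1/(k + rank) to a document's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


dense = ["auth-failure", "token-expiry", "auth-middleware"]
sparse = ["auth-middleware", "auth-failure", "error-codes"]
fused = reciprocal_rank_fusion([dense, sparse])
fused_ids = [doc_id for doc_id, _ in fused]
```

RRF uses only ranks, so it needs no calibration between cosine scores and BM25 scores, which live on different scales.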

Step 5: Context Stuffing & Re-ranking

Context Stuffing

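# Basic RAG query
from langchain_openai import ChatOpenAI

# Assumption: any chat model works here; the model name is illustrative.
llm = ChatOpenAI(model="gpt-4o-mini")
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

def rag_chain(question):
    docs = retriever.invoke(question)  # Top-5 chunks by vector similarity
    context = "\n\n".join(doc.page_content for doc in docs)

    prompt = f"""Answer based ONLY on the provided context.
If the answer isn't in the context, say "I don't have that information."

Context:
{context}

Question: {question}
Answer:"""

    return llm.invoke(prompt).content  # Chat models return a message; .content is the text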
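
The prompt above joins chunks without labels, so the model has nothing to cite. A possible variation, sketched here with a stand-in `Chunk` class and a hypothetical `build_cited_prompt` helper, numbers each chunk, includes its `source` metadata, and stops adding chunks once a character budget is reached. The budget value and the `source` key are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Stand-in for a retrieved document with page_content and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def build_cited_prompt(question, chunks, max_chars=2000):
    """Build a prompt with numbered, source-labelled context blocks."""
    blocks = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        source = chunk.metadata.get("source", "unknown")
        block = f"[{i}] (source: {source})\n{chunk.page_content}"
        if used + len(block) > max_chars:
            break  # Budget exhausted: drop this and all lower-ranked chunks
        blocks.append(block)
        used += len(block)
    context = "\n\n".join(blocks)
    return (
        "Answer based ONLY on the provided context and cite sources as [n].\n"
        'If the answer isn\'t in the context, say "I don\'t have that information."\n\n'
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


chunks = [
    Chunk("Refunds take 5 days.", {"source": "handbook.pdf"}),
    Chunk("x" * 5000, {"source": "big.pdf"}),
]
prompt = build_cited_prompt("How long do refunds take?", chunks, max_chars=200)
```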

Re-ranking (Cross-Encoder)

Problem: Vector similarity is fast but approximate.
         The top-5 retrieved chunks may not be the most relevant 5.

Re-ranking:
  1. Retrieve top-20 chunks (recall phase — cast wide net)
  2. Run a cross-encoder model on each chunk vs the query
     (more expensive but more accurate relevance scoring)
  3. Return top-5 after re-ranking

Result: Higher precision in the final top-5, at the cost of one cross-encoder pass per candidate (20 here).

Tools: Cohere Rerank API, cross-encoder/ms-marco-MiniLM (HuggingFace)
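
The retrieve-then-rerank flow can be sketched with a pluggable scoring function. The `overlap_score` function below is a toy stand-in for a cross-encoder (it only counts shared words) so the example runs without a model; in practice you would pass a function that calls one of the tools listed above.

```python
def overlap_score(query, chunk):
    """Toy relevance score: fraction of query words that appear in the chunk."""
    q_terms = set(query.lower().split())
    c_terms = set(chunk.lower().split())
    return len(q_terms & c_terms) / len(q_terms) if q_terms else 0.0


def retrieve_then_rerank(query, candidates, score_fn=overlap_score, top_n=5):
    """Re-score a wide candidate set against the query and keep the best top_n."""
    ranked = sorted(candidates, key=lambda chunk: score_fn(query, chunk), reverse=True)
    return ranked[:top_n]


# Candidates as they might come back from the first-stage vector search.
candidates = [
    "token expiry management",
    "catch auth errors in middleware",
    "error codes for auth",
    "office seating chart",
]
best = retrieve_then_rerank("auth errors", candidates, top_n=2)
```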

RAG vs Fine-tuning

┌───────────────────┬──────────────────────────┬──────────────────────────┐
│                   │ RAG                      │ Fine-tuning              │
├───────────────────┼──────────────────────────┼──────────────────────────┤
│ Knowledge update  │ Easy — add docs to store │ Re-train required        │
│ Private data      │ ✅ At retrieval time      │ ✅ Baked into weights    │
│ Up-to-date info   │ ✅ Always current         │ ❌ Frozen at train time  │
│ Source citation   │ ✅ Natural                │ ❌ Hard                  │
│ Cost              │ Retrieval + inference     │ Training + inference     │
│ Hallucination     │ Low (grounded)            │ Lower than base model    │
│ Best for          │ Dynamic knowledge bases  │ Style/format/behaviour   │
│                   │ Private docs, Q&A        │ Domain language patterns │
└───────────────────┴──────────────────────────┴──────────────────────────┘

Key insight: Use RAG for "what does the model know?"
             Use fine-tuning for "how does the model respond?"
             They're complementary — you can do both.

RAG Failure Modes & Fixes

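┌────────────────────────┬─────────────────────────────────────────────────┬────────────────────────────────────────────────┐
│ Problem                │ Symptom                                         │ Fix                                            │
├────────────────────────┼─────────────────────────────────────────────────┼────────────────────────────────────────────────┤
│ Wrong chunks retrieved │ Answer is off-topic                             │ Better chunking, hybrid search, re-ranking     │
│ Answer not in chunks   │ "I don't have that information" when doc exists │ Increase `k`, fix chunking boundaries          │
│ Context too long       │ Model ignores middle of context                 │ Rerank to put most relevant chunks first/last  │
│ Stale embeddings       │ Old answers after document update               │ Re-index updated documents                     │
│ Semantic drift         │ Query phrasing differs from doc phrasing        │ HyDE, multi-query retrieval, hybrid search     │
│ Injection via docs     │ Retrieved content manipulates the model         │ Sanitize retrieved content, treat as untrusted │
└────────────────────────┴─────────────────────────────────────────────────┴────────────────────────────────────────────────┘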
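
The "Context too long" fix (most relevant chunks first and last) can be sketched as a reordering step applied after retrieval or re-ranking. The function name `reorder_for_long_context` is hypothetical; it assumes the input is already sorted from most to least relevant.

```python
def reorder_for_long_context(chunks_by_relevance):
    """Place the strongest chunks at both ends and the weakest in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        # Alternate: ranks 1, 3, 5... fill the front; ranks 2, 4... fill the back.
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]


ordered = reorder_for_long_context(["r1", "r2", "r3", "r4", "r5"])
```

Here `r1` and `r2`, the two most relevant chunks, end up first and last, and `r5` sits in the middle where the model is most likely to skim.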

Evaluation Metrics

Retrieval quality:
  Precision@K  → Of the K chunks retrieved, how many are relevant?
  Recall@K     → Of all relevant chunks, how many did we retrieve?
  MRR          → Mean Reciprocal Rank — how high is the first relevant result?
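
The three retrieval metrics above are short enough to compute by hand. A minimal sketch, assuming you have a labelled set of relevant chunk IDs per query (the IDs below are made up):

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved IDs that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant IDs that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / len(relevant)


def mean_reciprocal_rank(runs):
    """Average of 1/rank of the first relevant hit, over (retrieved, relevant) pairs."""
    if not runs:
        return 0.0
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # Only the first relevant result counts
    return total / len(runs)


retrieved = ["a", "b", "c", "d"]
relevant = {"b", "d", "e"}

p_at_2 = precision_at_k(retrieved, relevant, k=2)
r_at_4 = recall_at_k(retrieved, relevant, k=4)
mrr = mean_reciprocal_rank([(retrieved, relevant), (["x", "y"], {"x"})])
```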

Answer quality (RAGAS framework):
  Faithfulness     → Is the answer grounded in the retrieved context? (no hallucination)
  Answer relevance → Does the answer actually address the question?
  Context recall   → Does the context contain what's needed to answer?
  Context precision → Are the retrieved chunks actually useful?