TL;DR

Embeddings turn text into numbers that represent meaning. RAG (Retrieval-Augmented Generation) uses embeddings to search your documents, find relevant chunks, and feed them to an AI to generate accurate, grounded answers.

Why it matters

Standard chatbots are limited to what they learned during training. RAG lets them pull in fresh, specific information—like your company docs, research papers, or personal notes—so they can answer questions about your data, not just generic knowledge.

The problem RAG solves

Imagine asking a chatbot: "What's our company's refund policy?"

  • Without RAG: "I don't know—I wasn't trained on your internal docs."
  • With RAG: The system searches your knowledge base, finds the refund policy, and uses it to answer.

RAG bridges the gap between a general-purpose AI and your specific information.

What are embeddings?

Embeddings are a way to represent text as a list of numbers (a vector). These numbers capture the meaning of the text.

Example (simplified)

  • "Cat" → [0.2, 0.8, 0.1, ...]
  • "Kitten" → [0.22, 0.79, 0.12, ...]
  • "Dog" → [0.3, 0.6, 0.15, ...]

Words with similar meanings have similar vectors. "Cat" and "kitten" are close; "cat" and "database" are far apart.

This lets computers measure semantic similarity—how close two pieces of text are in meaning, not just spelling.
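In practice, "close" is usually measured with cosine similarity. Here is a minimal sketch in plain Python, using the made-up three-number vectors from the example above (real embeddings have hundreds or thousands of dimensions, but the math is the same):

```python
import math

def cosine_similarity(a, b):
    """1.0 = pointing in the same direction (very similar), near 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

cat    = [0.2, 0.8, 0.1]
kitten = [0.22, 0.79, 0.12]
dog    = [0.3, 0.6, 0.15]

print(cosine_similarity(cat, kitten))  # ~0.999 with these toy numbers: very close
print(cosine_similarity(cat, dog))     # ~0.972: related, but less close
```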

Jargon: "Vector"
A list of numbers. In AI, vectors represent meanings so they can be compared mathematically.

Jargon: "Embedding"
The process of turning text into a vector, or the vector itself. Think of it as a "meaning fingerprint."

How embeddings are created

You use an embedding model (a type of AI) to convert text into vectors:

  1. Input: "The quick brown fox jumps over the lazy dog."
  2. Embedding model: Processes the text
  3. Output: A vector like [0.23, 0.67, -0.12, 0.88, ...] (usually hundreds or thousands of numbers)

Popular embedding models: OpenAI's text-embedding-ada-002, Cohere Embed, Google's Universal Sentence Encoder.
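As a rough sketch, here is what that looks like with OpenAI's Python SDK (assumes the openai package is installed and an OPENAI_API_KEY environment variable is set; any embedding model follows the same pattern):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-ada-002",  # one of the models mentioned above
    input="The quick brown fox jumps over the lazy dog.",
)

vector = response.data[0].embedding  # a plain Python list of floats
print(len(vector))   # the model's dimensionality (1536 for this model)
print(vector[:5])    # first few numbers of the "meaning fingerprint"
```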

What is RAG?

RAG (Retrieval-Augmented Generation) is a technique that combines:

  • Retrieval: Finding relevant information from a knowledge base
  • Generation: Using an LLM to create a natural-language answer

The RAG workflow

  1. Index your documents

    • Split docs into chunks (paragraphs or sections)
    • Generate embeddings for each chunk
    • Store embeddings in a vector database
  2. User asks a question

    • "What's our refund policy?"
  3. Retrieve relevant chunks

    • Convert the question into an embedding
    • Search the vector database for chunks with similar embeddings
    • Return the top N most relevant chunks
  4. Generate an answer

    • Feed the question + retrieved chunks to an LLM
    • LLM reads the chunks and generates a grounded answer
  5. User sees the answer

    • "Our refund policy allows returns within 30 days with a receipt..."

Vector search vs. keyword search

Traditional search (like Google) matches keywords. Vector search matches meaning.

Example:

  • Query: "How do I reset my password?"
  • Keyword search: Looks for exact words like "reset" and "password"
  • Vector search: Also finds chunks about "forgotten credentials," "account recovery," "login issues"—even if they don't use the exact words

Vector search is more flexible and human-like.
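One way to see the difference is to embed the query and a couple of candidate passages, then rank the passages by cosine similarity. A rough sketch (same OpenAI embedding call as earlier; the passages are made up, and exact scores depend on the model):

```python
import math
from openai import OpenAI

client = OpenAI()

def embed(text):
    return client.embeddings.create(
        model="text-embedding-ada-002", input=text
    ).data[0].embedding

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

query = "How do I reset my password?"
passages = [
    "If you have forgotten your credentials, use the account recovery page.",
    "Our office is closed on public holidays.",
]

q_vec = embed(query)
for passage in passages:
    print(round(cosine(q_vec, embed(passage)), 3), "-", passage)
# With a typical embedding model, the account-recovery passage scores
# noticeably higher than the unrelated one, despite sharing no keywords.
```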

Vector databases

A vector database stores embeddings and lets you search by similarity.

Examples: Pinecone, Weaviate, Qdrant, Chroma, Milvus, Postgres with pgvector.

What they do

  • Store millions of vectors
  • Index them for fast search
  • Query by similarity (find the nearest neighbors to a query vector)

When you ask a question, the vector DB returns the most relevant chunks in milliseconds.
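Here is a small sketch using Chroma, one of the databases listed above (pip install chromadb). The collection name and documents are made up; Chroma embeds the documents with its bundled default model unless you pass precomputed embeddings yourself:

```python
import chromadb

client = chromadb.Client()                     # in-memory instance for experimenting
collection = client.create_collection("docs")

# Chroma embeds these documents with a default local model.
collection.add(
    ids=["1", "2", "3"],
    documents=[
        "Refunds are accepted within 30 days with a receipt.",
        "Employees accrue 1.5 PTO days per month.",
        "Passwords can be reset from the account recovery page.",
    ],
)

results = collection.query(query_texts=["What's the refund policy?"], n_results=2)
print(results["documents"][0])   # the two nearest chunks, most similar first
```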

RAG vs. fine-tuning

Both customize an AI, but they work differently:

|              | RAG                                  | Fine-tuning                            |
|--------------|--------------------------------------|----------------------------------------|
| What it does | Pulls in external data at query time | Retrains the model on your data        |
| Best for     | Dynamic, changing data (docs, wikis) | Specialized style or domain knowledge  |
| Speed        | Fast to set up                       | Slow (requires retraining)             |
| Cost         | Lower (no retraining)                | Higher (compute-intensive)             |
| Updates      | Easy (just add new docs)             | Hard (requires retraining)             |
| Accuracy     | Great for factual Q&A                | Great for style, tone, niche tasks     |

Rule of thumb: Use RAG for knowledge retrieval. Use fine-tuning for style or specialized reasoning.

Real-world RAG examples

  • Customer support bots: Answer questions using your help docs
  • Internal knowledge bases: "What's our PTO policy?" searches HR docs
  • Research assistants: Summarize findings from a library of papers
  • Legal research: Find relevant case law or contract clauses
  • Code search: Find code examples in your company's repos

Challenges and gotchas

1. Chunking matters

How you split documents affects results. Too small = missing context. Too large = noisy, irrelevant info.

Common strategies:

  • By paragraph
  • By fixed token count (e.g., 500 tokens)
  • By semantic boundaries (headings, sections)
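For example, a bare-bones fixed-size splitter with overlap (character counts stand in for tokens here, and the file name is a placeholder; libraries like LangChain and LlamaIndex ship more robust splitters):

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into overlapping chunks so sentences near a boundary
    appear in both neighbouring chunks and context isn't lost."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap   # step forward, keeping some overlap
    return chunks

doc = open("refund_policy.txt").read()   # any document you've gathered
for i, chunk in enumerate(chunk_text(doc)):
    print(i, len(chunk))
```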

2. Retrieval quality

If the search misses the right chunk, the answer will be wrong. Improve by:

  • Better chunking
  • Better embeddings (try different models)
  • Tuning the number of chunks retrieved
  • Adding metadata (tags, dates, authors) to narrow search
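For instance, most vector databases let you attach metadata at indexing time and filter on it at query time. A small sketch with Chroma (the field names and values are made up):

```python
import chromadb

collection = chromadb.Client().create_collection("docs")

collection.add(
    ids=["42"],
    documents=["Refunds are accepted within 30 days with a receipt."],
    metadatas=[{"source": "policies.pdf", "year": 2024}],
)

# Restrict the search to chunks whose metadata matches, then rank by similarity.
results = collection.query(
    query_texts=["refund policy"],
    n_results=3,
    where={"source": "policies.pdf"},
)
```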

3. Hallucinations still happen

Even with RAG, the LLM might misinterpret the chunks or fill in gaps with guesses. Always verify critical info.

4. Context window limits

LLMs have a context window—how much text they can process at once. If you retrieve too many chunks, you might overflow the window. Balance quality and quantity.
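One common guard is to count tokens and stop adding chunks once a budget is spent. A rough sketch using the tiktoken library (the budget number and encoding name are arbitrary choices):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # tokenizer used by many OpenAI models

def fit_to_budget(chunks, max_tokens=3000):
    """Keep adding retrieved chunks (best match first) until the budget is spent."""
    kept, used = [], 0
    for chunk in chunks:
        n = len(enc.encode(chunk))
        if used + n > max_tokens:
            break
        kept.append(chunk)
        used += n
    return kept
```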

How to build a simple RAG system

  1. Gather your documents (PDFs, markdown, text files)
  2. Chunk them (split into paragraphs or sections)
  3. Generate embeddings (use an embedding model API)
  4. Store in a vector DB (Pinecone, Chroma, etc.)
  5. Build a query flow:
    • User asks a question
    • Embed the question
    • Search the vector DB
    • Retrieve top chunks
    • Send chunks + question to LLM
    • Return the answer
  6. Test and refine (tune chunking, retrieval, prompts)

Tools to try: LangChain, LlamaIndex (frameworks for RAG), OpenAI API (embeddings + LLM), Pinecone (vector DB).
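As a rough end-to-end sketch of the indexing side (steps 1 through 4), using OpenAI embeddings and Chroma from the tool list above; the folder name, collection name, and chunk sizes are placeholders:

```python
import pathlib

import chromadb
from openai import OpenAI

openai_client = OpenAI()
collection = chromadb.Client().create_collection("knowledge_base")

def embed(texts):
    response = openai_client.embeddings.create(
        model="text-embedding-ada-002", input=texts
    )
    return [item.embedding for item in response.data]

def chunk(text, size=1000, overlap=200):
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

# 1-2. Gather documents and split them into chunks.
for path in pathlib.Path("docs").glob("*.txt"):        # placeholder folder
    chunks = chunk(path.read_text())
    # 3-4. Embed each chunk and store it with metadata for later filtering.
    collection.add(
        ids=[f"{path.name}-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embed(chunks),
        metadatas=[{"source": path.name}] * len(chunks),
    )

# Step 5, the query flow, looks like the retrieval and generation sketches earlier.
```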

Key terms (quick reference)

  • Embedding: A vector (list of numbers) representing the meaning of text
  • Vector: A list of numbers used in math to represent data
  • Vector Database: Storage system optimized for searching by similarity
  • RAG (Retrieval-Augmented Generation): Using search to feed relevant info to an LLM for grounded answers
  • Chunking: Splitting documents into smaller pieces for indexing
  • Semantic similarity: How close two pieces of text are in meaning
  • Context window: How much text an LLM can handle at once

Use responsibly

  • Don't index sensitive data in public or shared vector DBs
  • Verify outputs (RAG reduces hallucinations but doesn't eliminate them)
  • Monitor for bias (if your docs are biased, the answers will be too)
  • Audit retrieval (check what chunks are being used—sometimes the search is wrong)

What's next?

  • Evaluating AI Answers: Check for accuracy and hallucinations
  • Vector DBs 101 (coming soon): Deep dive into vector databases
  • Retrieval 201 (coming soon): Advanced chunking, re-ranking, hybrid search
  • Prompting 101: Craft better questions for RAG systems