Building RAG Systems from Scratch
Build Retrieval-Augmented Generation (RAG) systems that give AI access to your custom knowledge base.
Learning Objectives
- Understand RAG architecture
- Implement document chunking
- Build retrieval systems
- Optimize for accuracy
What Is RAG and Why Does It Matter?
Imagine you're taking a test. In a closed-book test, you can only use what you've memorised. In an open-book test, you can look things up in your notes. RAG — Retrieval-Augmented Generation — is how you give AI an open-book test.
Here's the problem RAG solves: AI models like GPT-4 and Claude are trained on vast amounts of public data, but they don't know anything about your company's internal documents, your product documentation, your customer records, or anything created after their training cutoff. If you ask ChatGPT a question about your company's refund policy, it has no idea.
RAG fixes this by giving the AI access to your specific documents. Instead of relying only on its training data, the AI first searches your document collection to find relevant information, and then uses that information to generate an accurate answer. This is the technology behind features like "chat with your documents," enterprise knowledge bases, and customer support bots that actually know your product.
When to Use RAG vs. Fine-Tuning
This is one of the most common questions teams ask, so let's clear it up.
Use RAG when your knowledge changes frequently (product docs, policies, news), you need the AI to cite specific sources, or you want to add new information without retraining a model. RAG is also much cheaper and faster to set up.
Use fine-tuning when you need the AI to adopt a specific writing style or tone, you want to teach it a specialised skill (like coding in a niche language), or your task requires deep domain knowledge baked into the model's behaviour. Fine-tuning is more expensive and time-consuming.
The short version: RAG is for knowledge (what the AI knows), fine-tuning is for behaviour (how the AI acts). Most products start with RAG because it covers 80% of use cases with a fraction of the effort.
The RAG Pipeline: Step by Step
Every RAG system follows the same five-step pipeline. Think of it like building a library with an incredibly fast librarian.
Step 1: Chunk Your Documents
You can't feed entire documents into an AI — they're often too long, and most of the content won't be relevant to any given question. So you split documents into smaller pieces called "chunks." Each chunk should be large enough to contain a complete thought but small enough to be specific.
A good starting point is chunks of about 500-1,000 characters with some overlap between them (so ideas that span a chunk boundary don't get lost).
```python
# pip install langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # maximum characters per chunk
    chunk_overlap=200,  # characters shared between adjacent chunks
)
chunks = splitter.split_documents(docs)  # docs: a list of loaded Document objects
```
Step 2: Embed the Chunks
Next, you convert each chunk into a "vector" — a list of numbers that represents the meaning of that text. This is like creating a fingerprint for each chunk's content. Chunks with similar meanings will have similar vectors.
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Your chunk text here",
)
vector = response.data[0].embedding  # a list of 1,536 floats
```
Step 3: Store in a Vector Database
Save all your chunk vectors in a vector database (covered in depth in Module 4). This database is optimised for finding similar vectors quickly — exactly what you need when a user asks a question.
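Production systems use a dedicated vector database (Pinecone, Chroma, and pgvector are common choices), but the core idea fits in a few lines of plain Python. This toy in-memory store is a sketch — the class and method names are illustrative, not from any library — that pairs each vector with its chunk text and metadata, and finds the closest matches by cosine similarity:

```python
import math

class ToyVectorStore:
    """Minimal in-memory stand-in for a vector database."""

    def __init__(self):
        self.entries = []  # list of (vector, chunk_text, metadata)

    def add(self, vector, text, metadata=None):
        self.entries.append((vector, text, metadata or {}))

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def search(self, query_vector, k=3):
        """Return the k chunks whose vectors are most similar to the query."""
        scored = [(self._cosine(query_vector, v), text, meta)
                  for v, text, meta in self.entries]
        scored.sort(key=lambda s: s[0], reverse=True)
        return scored[:k]

store = ToyVectorStore()
store.add([1.0, 0.0], "Refund policy: 30 days", {"doc": "policies"})
store.add([0.0, 1.0], "Office dress code", {"doc": "handbook"})
results = store.search([0.9, 0.1], k=1)  # top match: the refund chunk
```

A real vector database offers the same `add`/`search` operations but scales to millions of vectors using approximate nearest-neighbour indexes.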
Step 4: Retrieve Relevant Chunks
When a user asks a question, you embed their question into a vector using the same process, then search the vector database for the chunks most similar to the question. The database returns the top matches — usually the 3-5 most relevant chunks.
Step 5: Generate the Answer
Finally, you take the user's question and the retrieved chunks, combine them into a prompt, and send it to an AI model. The prompt looks something like this:
```text
Answer the user's question based on the following context.
If the context doesn't contain the answer, say so.

Context:
[chunk 1 text]
[chunk 2 text]
[chunk 3 text]

Question: What is our refund policy?
```
The AI reads the context and generates an answer grounded in your actual documents, not its general training data.
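The prompt-assembly step can be sketched as a small helper — the function name and template wording here are illustrative, not a fixed API:

```python
def build_rag_prompt(question, chunks):
    """Combine retrieved chunks and the user's question into one prompt."""
    context = "\n\n".join(chunks)
    return (
        "Answer the user's question based on the following context.\n"
        "If the context doesn't contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_rag_prompt(
    "What is our refund policy?",
    ["Refunds are accepted within 30 days.",
     "Store credit is offered after 30 days."],
)
```

The resulting string is what you pass as the user (or system) message to your chat model of choice.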
Practical Example: "Chat with Your Docs"
Let's say you're building a feature that lets employees ask questions about your company handbook. Here's the flow in plain English:
- Setup (done once): Take the handbook PDF, split it into chunks, embed each chunk, and store the vectors in your database.
- At query time: An employee asks "How many vacation days do I get?" You embed this question, search the database, and find the chunks about vacation policy.
- Generate response: You send those chunks plus the question to the AI, which responds with "According to the employee handbook, full-time employees receive 20 vacation days per year, accruing at 1.67 days per month."
- Show the source: You display the answer along with a link to the relevant section of the handbook, so the employee can verify it.
Common RAG Pitfalls
Chunks are too big or too small. If chunks are too large, you'll retrieve a lot of irrelevant text. If they're too small, you'll lose context. Start with 500-1,000 characters and experiment.
No overlap between chunks. If a key fact spans the boundary between two chunks, it might get lost. Always use some overlap (100-200 characters works well).
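To see why overlap helps, here is a minimal character-based splitter (an illustrative sketch, not LangChain's implementation). The last few characters of each chunk reappear at the start of the next, so text near a boundary ends up in two chunks:

```python
def split_with_overlap(text, chunk_size=20, overlap=5):
    """Split text into fixed-size chunks; adjacent chunks share
    `overlap` characters so boundary text appears in both."""
    assert chunk_size > overlap
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

chunks = split_with_overlap("The refund window is 30 days from purchase.")
# The end of chunks[0] ("ow is") is repeated at the start of chunks[1].
```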
Not handling "I don't know." If your documents don't contain the answer, the AI should say so instead of making something up. Explicitly instruct it in your prompt: "If the context doesn't contain enough information, say you don't know."
Ignoring metadata. Tag your chunks with metadata like document title, section name, and date. This lets you filter results (e.g., "only search the 2024 handbook, not the 2022 version") and helps the AI provide better citations.
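A metadata filter can run as a pre-filter before similarity search. This sketch (the field names are illustrative) keeps only chunks whose metadata matches every requested condition:

```python
def filter_chunks(chunks, **conditions):
    """Keep only chunks whose metadata matches every condition."""
    return [c for c in chunks
            if all(c["metadata"].get(k) == v for k, v in conditions.items())]

chunks = [
    {"text": "20 vacation days", "metadata": {"doc": "handbook", "year": 2024}},
    {"text": "15 vacation days", "metadata": {"doc": "handbook", "year": 2022}},
]
current = filter_chunks(chunks, year=2024)  # drops the outdated 2022 chunk
```

Most vector databases support this kind of filtering natively, so the filter and the similarity search happen in one query.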
Skipping evaluation. You need to test your RAG system with real questions and verify it retrieves the right chunks and generates accurate answers. Don't assume it works — measure it.
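A simple starting metric is retrieval hit rate @k: the fraction of test questions for which a known-relevant chunk appears in the top-k results. This sketch assumes a `retrieve` callable that returns chunk ids; the fake retriever below stands in for your real one:

```python
def hit_rate_at_k(test_cases, retrieve, k=3):
    """Fraction of questions whose known-relevant chunk id
    appears in the top-k retrieved results."""
    hits = 0
    for question, relevant_id in test_cases:
        if relevant_id in retrieve(question, k):
            hits += 1
    return hits / len(test_cases)

# Stand-in retriever with canned results, for illustration only.
def fake_retrieve(question, k):
    canned = {"vacation?": ["c1", "c7", "c9"], "refunds?": ["c2", "c3", "c4"]}
    return canned.get(question, [])[:k]

score = hit_rate_at_k([("vacation?", "c7"), ("refunds?", "c9")], fake_retrieve)
```

Even a test set of 20-30 real questions with hand-labelled relevant chunks will surface most retrieval problems.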
Key Takeaways
- RAG combines retrieval with generation for custom knowledge
- Chunk size affects both retrieval and generation quality
- Use embeddings to find semantically similar content
- Always cite sources in responses
- Test with real user questions
Practice Exercises
Apply what you've learned with these practical exercises:
1. Build a simple RAG pipeline with your own documents
2. Experiment with chunk sizes
3. Test retrieval accuracy
4. Implement source citation