L1.7 — Build a tiny RAG system from scratch (Lab)¶

Type: Lab · Duration: ~50 min · Status: Mandatory Module: Module 1 — AI/ML Foundations for Security Engineers Framework tags: foundational — builds the artifact that ATLAS techniques in M3 / M5 / M7 will target

Goal of the lab¶

Build a working Retrieval-Augmented Generation (RAG) system from scratch — chunk documents, embed them, store in a vector DB, retrieve at query time, stuff into a prompt, generate an answer with citations. By the end you'll have a small but real RAG application running in your container. You will attack this exact system in Module 3 (indirect prompt injection, system-prompt extraction, retrieval poisoning), and defend it in Module 7 (guardrails, eval harness, observability). Build it carefully.

Why this matters¶

RAG is the dominant production LLM pattern in 2026 — most enterprise LLM apps are RAG under the hood. It is also the dominant attack surface for indirect prompt injection (OWASP LLM01) and a primary vector for sensitive information disclosure (LLM06). You cannot defend RAG until you've built one and understand which file in the pipeline is responsible for which failure mode.

Prerequisites¶

Skills assumed: Python, comfort reading 100-line scripts, basic understanding of "embedding."
Lessons completed: L1.1 – L1.6.
Environment: passing sanity_check.py.

What you'll build / verify¶

A small documents corpus (Asfela "company handbook" — a stand-in for any text corpus).
A chunker that splits documents into ~500-token chunks with overlap.
An embedding pipeline using sentence-transformers (specifically all-MiniLM-L6-v2, a small, fast, open embedding model).
A Chroma vector DB populated with the chunks and their embeddings.
A rag.py query function that retrieves top-k chunks and produces a grounded answer with citations.
A CLI you can ask questions against (uv run python -m ai_sec.rag query "...").
A baseline "happy path" demo plus a deliberate failure case you'll exploit in Module 3.

Steps¶

Step 1 — Look at the documents corpus¶

cd /workspace/ai-sec-course
ls corpora/asfela-handbook/

Expected:

01-mission.md
02-pto-policy.md
03-expense-policy.md
04-onboarding.md
05-security-policy.md
06-incident-response.md
07-vendor-list.md

Read one to set expectations:

cat corpora/asfela-handbook/02-pto-policy.md

You'll see a few hundred words of plausible-looking internal-handbook content. The handbook is intentionally boring — it has no secrets, no PII, no spicy content. Boring is the point: it's a clean baseline for the attacks you'll layer on top later.

Step 2 — Look at the empty RAG module you'll fill in¶

Open src/ai_sec/rag.py:

cat src/ai_sec/rag.py

You'll see a skeleton:

# /workspace/ai-sec-course/src/ai_sec/rag.py — skeleton

from pathlib import Path
from typing import Iterable
from dataclasses import dataclass

import chromadb
from sentence_transformers import SentenceTransformer

from ai_sec.chat import chat  # the helper from L1.6

EMBEDDING_MODEL = "all-MiniLM-L6-v2"
CHUNK_TOKENS = 500
CHUNK_OVERLAP = 50
COLLECTION = "asfela_handbook"

@dataclass
class Chunk:
    doc_id: str
    chunk_id: str
    text: str

# YOU WILL IMPLEMENT THESE:
def load_corpus(corpus_dir: Path) -> Iterable[Chunk]: ...
def embed(chunks: list[Chunk]) -> None: ...
def query(question: str, k: int = 4) -> str: ...

if __name__ == "__main__":
    # CLI entry point — see end of file
    pass

We'll fill in the three functions, then exercise them.

Step 3 — Implement `load_corpus` (chunk the documents)¶

Replace the load_corpus stub with this implementation. Type it, don't copy-paste blindly — the point of this lab is for you to internalize what's happening, not just cp a file.

def load_corpus(corpus_dir: Path) -> Iterable[Chunk]:
    """Yield Chunk(doc_id, chunk_id, text) for every chunk in every .md file."""
    for path in sorted(corpus_dir.glob("*.md")):
        text = path.read_text()
        # Naive whitespace tokenization. Real RAG uses the embedding model's
        # tokenizer; we're keeping it simple to make chunking visible.
        words = text.split()
        for i in range(0, len(words), CHUNK_TOKENS - CHUNK_OVERLAP):
            chunk_words = words[i : i + CHUNK_TOKENS]
            if not chunk_words:
                break
            yield Chunk(
                doc_id=path.name,
                chunk_id=f"{path.stem}-{i//(CHUNK_TOKENS-CHUNK_OVERLAP):03d}",
                text=" ".join(chunk_words),
            )

Why naive whitespace splitting. Production RAG uses the embedding model's tokenizer (so chunk size in tokens is exact, and you don't split inside a token). We're using whitespace splits because (a) you can read the resulting chunks and they make sense, and (b) the security questions are the same. We'll point at this later when we look at chunk-boundary-aware injection.

Step 4 — Implement `embed` (vectorize chunks, write to Chroma)¶

def embed(chunks: list[Chunk]) -> None:
    """Embed every chunk and store in Chroma."""
    model = SentenceTransformer(EMBEDDING_MODEL)  # downloads ~80 MB on first call
    client = chromadb.PersistentClient(path="/workspace/.cache/chroma")
    # Reset collection for repeatability across runs
    if COLLECTION in [c.name for c in client.list_collections()]:
        client.delete_collection(COLLECTION)
    coll = client.create_collection(COLLECTION)
    vectors = model.encode([c.text for c in chunks]).tolist()
    coll.add(
        ids=[c.chunk_id for c in chunks],
        embeddings=vectors,
        metadatas=[{"doc_id": c.doc_id} for c in chunks],
        documents=[c.text for c in chunks],
    )
    print(f"[embed] indexed {len(chunks)} chunks in collection '{COLLECTION}'")

Step 5 — Implement `query` (retrieve, stuff, generate)¶

def query(question: str, k: int = 4) -> str:
    """Retrieve top-k chunks, build a prompt, ask the LLM, return the answer."""
    model = SentenceTransformer(EMBEDDING_MODEL)
    client = chromadb.PersistentClient(path="/workspace/.cache/chroma")
    coll = client.get_collection(COLLECTION)
    q_vector = model.encode([question]).tolist()
    results = coll.query(query_embeddings=q_vector, n_results=k)
    chunks = results["documents"][0]
    doc_ids = [m["doc_id"] for m in results["metadatas"][0]]

    # Stuff the retrieved chunks into a prompt with citations.
    context = "\n\n".join(
        f"[Source: {doc_id}]\n{chunk}" for doc_id, chunk in zip(doc_ids, chunks)
    )
    system = (
        "You are an Asfela company-handbook assistant. "
        "Answer ONLY from the provided Sources. "
        "Cite the source filename in square brackets after every claim. "
        "If the Sources do not contain the answer, say so."
    )
    user = f"Sources:\n{context}\n\nQuestion: {question}"
    result = chat(prompt=user, system=system, backend="ollama", model="llama3.2:3b")
    return result.text

Step 6 — Build the index¶

uv run python -c "
from pathlib import Path
from ai_sec.rag import load_corpus, embed
chunks = list(load_corpus(Path('corpora/asfela-handbook')))
print(f'Loaded {len(chunks)} chunks from {len(set(c.doc_id for c in chunks))} docs')
embed(chunks)
"

Expected output:

Loaded ~18 chunks from 7 docs
[embed] indexed ~18 chunks in collection 'asfela_handbook'

The exact chunk count depends on document length and your overlap setting.

Step 7 — Ask your RAG a question¶

uv run python -c "
from ai_sec.rag import query
print(query('What is Asfela\\'s PTO policy?'))
"

Expected output (shape):

Asfela's PTO policy provides 20 days of paid time off per year [02-pto-policy.md].
Unused days carry over up to 5 days into the next year [02-pto-policy.md].
...

Citations should reference the correct source file. If they don't, your retrieval is bringing back wrong chunks — try increasing k from 4 to 6, or shrinking your chunk size to 300.

Step 8 — Ask a question the handbook doesn't answer¶

uv run python -c "
from ai_sec.rag import query
print(query('What is the capital of Burkina Faso?'))
"

Expected output (shape):

The Sources do not contain information about the capital of Burkina Faso.

A correctly built RAG should refuse here, because the system prompt told it to. In practice many small models fail this test — Llama 3.2 3B in particular will sometimes "helpfully" answer from its base-model knowledge instead of refusing. Document what you see. This grounding failure is itself a finding class (OWASP LLM09 — Misinformation) and we cover it in Module 3.

Step 9 — Save your RAG state for later modules¶

The Chroma DB persists at /workspace/.cache/chroma. Snapshot it for Module 3:

tar czf /workspace/.cache/rag-snapshot-module1.tgz -C /workspace/.cache chroma
ls -la /workspace/.cache/rag-snapshot-module1.tgz

In Module 3 we'll restore from this snapshot, then attack the system. If your lab session resets, you'll need to rebuild the index — but you've already typed the code, so a single python invocation reproduces it.

What just happened (debrief)¶

You built a complete RAG application in about 100 lines of Python. Walk through what each part is responsible for, because each part is an attack surface in Module 3.

The corpus (corpora/asfela-handbook/) is your trusted document store. Trust assumption: every file in this directory was written by someone authorized. In real systems, this is rarely true: the corpus is often pulled from a wiki, an S3 bucket, a SharePoint site, or — in the worst case — user uploads. The moment any untrusted writer can land content in the corpus, you've created an indirect prompt-injection vector. Module 3 lab L3.7 walks you through poisoning this exact corpus.

The chunker (load_corpus) decides how documents are split. Chunk boundaries matter for security in two ways: (a) if instructions can be split across two chunks such that the malicious chunk gets retrieved alone, you get a "fragment injection"; (b) chunk size determines whether a full instruction fits in one retrieval result. Production chunkers do semantic-aware splitting, but that doesn't eliminate the surface — it changes its shape.

The embeddings + vector DB (embed) is where text becomes searchable vectors. Two security questions: (1) the embedding model — all-MiniLM-L6-v2 — was trained on data you didn't audit; it has its own biases and its own glitch behavior. (2) The vector DB is now a high-value target — anyone who can read the embeddings can attempt embedding-leak attacks against the underlying text (Module 5). Anyone who can write to it can plant content that matches likely queries via adversarial embedding manipulation.

The query path (query) retrieves the top-k chunks and assembles a prompt. Look at the system prompt closely:

Answer ONLY from the provided Sources. Cite the source filename. If the Sources do not contain the answer, say so.

This system prompt is your defense. It is also your single point of failure. In Module 3 we'll show that a sufficiently crafted user query (or a sufficiently crafted injected chunk) can override it. This is OWASP LLM01 in action.

The model (llama3.2:3b) is small and weakly aligned. Some attacks succeed on it that fail on a frontier model; some defenses (e.g., a guardrail prompt) work on a frontier model that leak on it. We'll switch backends in Module 3 lab L3.6 to see this concretely.

You now have, in one container, the exact artifact that will carry you through Modules 3 and 7. Do not delete /workspace/.cache/chroma between sessions. If you do, just re-run Step 6.

Extension challenges (optional)¶

Easy. Add a fourth document to the corpus (08-travel-policy.md — make up plausible content), re-build the index, and verify a question against it returns correct citations.
Medium. Swap the LLM backend in Step 7 from Ollama to your frontier API (one line change in query). Re-run the "What is Asfela's PTO policy" question. Notice the difference in answer style and grounding fidelity.
Hard. Add a re-ranker stage between retrieval and prompt construction — use cross-encoder/ms-marco-MiniLM-L-6-v2 from sentence-transformers to re-score the top-k chunks before passing to the LLM. Measure (qualitatively) whether grounding fidelity improves on a few test questions. This is the production-pattern most enterprise RAGs converge to; you'll see why.

References¶

Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (2020) — original RAG paper: https://arxiv.org/abs/2005.11401
LangChain RAG docs (for production patterns) — https://python.langchain.com/docs/use_cases/question_answering/
Chroma docs — https://docs.trychroma.com/
sentence-transformers library — https://www.sbert.net/
Greshake et al., "Not what you've signed up for" (2023) — the indirect prompt injection paper that this corpus will become a target for: https://arxiv.org/abs/2302.12173

Provisioning spec (for lab platform admin, NOT shown to learner)¶

Container base image: aisec/labs-base:0.1

Additional pre-installed files: - /workspace/ai-sec-course/corpora/asfela-handbook/01-mission.md through 07-vendor-list.md — seven small markdown files, ~500-1500 words each. Content is fictional, no PII, no copyrighted material. Content authored as part of course production. - /workspace/ai-sec-course/src/ai_sec/rag.py — skeleton file with stubs (NOT the filled-in version; learner implements). A reference filled-in version lives at /workspace/ai-sec-course/solutions/lab1_7_rag_solution.py for instructor / support reference only — do NOT include in learner-visible file tree. - /workspace/.cache/chroma/.gitkeep — directory placeholder

Additional Python packages (already in pyproject.toml from L0.3): - chromadb>=0.5 - sentence-transformers>=3.0

Pre-downloaded models cached on host volume (mounted read-only): - sentence-transformers/all-MiniLM-L6-v2 (~80 MB) — pre-cached to /workspace/.cache/huggingface/hub/ to skip first-run download

Network access: - Egress: huggingface.co (model card fetch), pypi.org (if extension challenge runs) - Otherwise self-contained — all required models pre-pulled

Estimated container resource use during lab: - RAM: 4–6 GB peak (Ollama + embedding model + Chroma) - CPU: ~80% one core during embedding pass (5-10s for the corpus) - Disk: ~150 MB new (Chroma DB + cache) - Wallclock: 45–60 min including reading and typing time

Persistence requirement: the Chroma DB and the runs/ directory must survive a lab session reset within the same module. If the platform does not support per-learner persistent volumes, document explicitly that Module 3's L3.7 will start by re-running the L1.7 build script.

Notes for platform admin: - The L1.7 RAG is referenced by L3.6, L3.7, L3.9, L7.7, L7.9 and the capstone. Keep this corpus and the rag.py reference implementation versioned in the companion repo — if the corpus changes, every later lab's expected outputs change. - chromadb occasionally has SQLite version incompatibilities. Pin the version in pyproject.toml and test on a fresh container image at every release.