L1.3 — LLMs explained: tokens, embeddings, transformers, decoding

Type: Theory · Duration: ~35 min · Status: Mandatory
Module: Module 1, AI/ML Foundations for Security Engineers
Framework tags: foundational; directly enables OWASP LLM01, LLM02, LLM06, LLM10 discussion in later modules

Learning objectives

By the end of this lesson, the learner can:

  1. Explain the four stages an LLM goes through to turn a string into an output: tokenization, embedding, transformer forward pass, decoding.
  2. Define a token and recognize that text→token mapping is itself an attack surface.
  3. Explain self-attention conceptually: what it computes and why it scales.
  4. Identify decoding strategies (greedy, sampling, top-k, top-p, temperature) and how each affects determinism, security testing, and jailbreak susceptibility.

Concept primer

This lesson is the primer. If you've read "Attention Is All You Need", we'll move quickly; if not, we'll build the intuition step by step.

Core content

What happens when you send "Hello, world" to an LLM

Four stages, every time:

"Hello, world"
   ▼   ① Tokenization
[15496, 11, 995]         ← integer token IDs (GPT-2 BPE; other tokenizers differ)
   ▼   ② Embedding lookup
[[0.13, -0.04, …],       ← vector per token
 [-0.02, 0.41, …],
 [0.27, 0.18, …]]
   ▼   ③ Transformer forward pass
        (N stacked layers of self-attention + FFN)
   ▼   ④ Decoding
Next token: 14026 → " how"   (ID shown for illustration)
   (loop): append " how", run again, get next token, …

Each stage is its own attack surface. Let's unpack them.

① Tokenization — text becomes integers

A tokenizer is a deterministic function that maps a string to a list of integer IDs (tokens), and back. The vocabulary is fixed when the tokenizer is built, before the model is trained, and typically has 30k–200k entries. Modern LLMs use subword tokenization, most commonly Byte-Pair Encoding (BPE) or a close relative, implemented in libraries such as SentencePiece and tiktoken. Subword tokenization means common words get a single token, while rare words get split into multiple tokens.
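To make "common strings get one token, rare strings get split" concrete, here is a minimal sketch. It is not real BPE: it uses greedy longest-match over a tiny hypothetical vocabulary (`TOY_VOCAB` is invented for this example), which is enough to show the effect.

```python
# Hypothetical toy vocabulary; real vocabularies have 30k-200k entries.
TOY_VOCAB = ["Hello", ",", " world", " ", "H", "e", "l", "o", "w", "r", "d", "x", "y", "z"]
TOKEN_TO_ID = {piece: i for i, piece in enumerate(TOY_VOCAB)}


def encode(text: str) -> list[int]:
    """Greedy longest-match: at each position, take the longest vocabulary entry."""
    ids = []
    pos = 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in TOKEN_TO_ID:
                ids.append(TOKEN_TO_ID[piece])
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[pos]!r}")
    return ids


def decode(ids: list[int]) -> str:
    """Map IDs back to their string pieces and concatenate."""
    return "".join(TOY_VOCAB[i] for i in ids)


common = encode("Hello, world")   # 12 characters, 3 tokens
rare = encode("xyzxyzxyzxyz")     # 12 characters, 12 tokens
with_space = encode(" world")     # 1 token
without_space = encode("world")   # 5 tokens: the leading space matters
```

Both strings are 12 characters long, yet one costs 3 tokens and the other 12. Real tokenizers show the same pattern, with different numbers.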

Two security-relevant facts about tokenization:

  • Tokens are not characters and not words. "Hello, world" might be 3 tokens or 5, depending on the tokenizer. Counting characters to estimate cost or input length is wrong; counting tokens is correct. Many input-validation schemes that count characters can be bypassed by an attacker who knows the tokenizer.
  • Tokenizers can be exploited. Some vocabulary entries are "glitch tokens": they exist in the BPE vocabulary but were almost absent from the model's training data, so the model behaves unpredictably when it sees them (e.g., SolidGoldMagikarp in the GPT-2/3 vocabulary). More recent attacks include token smuggling: encoding instructions with unusual Unicode normalization forms, zero-width joiners, or homoglyphs so that the tokenizer parses the text one way and the safety filter parses it another. We touch on this in Module 3.
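The filter-mismatch half of token smuggling is easy to reproduce. The sketch below assumes a hypothetical substring blocklist (`BLOCKLIST` and both filter functions are invented for illustration). It shows a zero-width joiner and a fullwidth letter slipping past a raw string match, and a canonicalization step that closes those two gaps.

```python
import unicodedata

# Hypothetical blocklist for a naive input filter.
BLOCKLIST = ("ignore previous instructions",)

# Zero-width characters that render as nothing but still break substring matches.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def naive_filter(text: str) -> bool:
    """Return True if the raw text contains a blocked phrase."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in BLOCKLIST)


def canonicalize(text: str) -> str:
    """Apply NFKC normalization, then drop zero-width characters."""
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in normalized if ch not in ZERO_WIDTH)


def hardened_filter(text: str) -> bool:
    """Run the same blocklist, but on canonicalized text."""
    return naive_filter(canonicalize(text))


zwj_payload = "ig\u200dnore previous instructions"      # zero-width joiner inside "ignore"
fullwidth_payload = "\uff49gnore previous instructions"  # fullwidth "i" (U+FF49)
```

Note the limit of this sketch: NFKC folds compatibility characters such as fullwidth letters, but it does not map cross-script homoglyphs (for example a Cyrillic "а") to Latin. Those need a separate confusables check.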

To see what a tokenizer does, try OpenAI's tiktoken library (link in references) on a paragraph and look at how many tokens it produces. Notice that the same word often tokenizes differently depending on context (e.g., a leading space matters).

② Embedding lookup — integers become vectors

Each token ID indexes into an embedding matrix, a giant table where each row is a learned vector (typically 512–8192 dimensions). After lookup, your sequence of N tokens becomes a sequence of N vectors. At this first layer each vector depends only on the token ID, not on its neighbors; the vectors become the model's representation of "the meaning of each token in this context" as they are refined by the transformer layers.
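The lookup itself is nothing more than row indexing. A minimal sketch, using a tiny random matrix in place of learned weights (`VOCAB_SIZE`, `EMBED_DIM`, and the values are toy assumptions):

```python
import random

VOCAB_SIZE = 8   # toy size; real models use tens of thousands of rows
EMBED_DIM = 4    # toy size; real models use hundreds to thousands of dimensions

# Random values stand in for learned weights.
_rng = random.Random(0)
EMBEDDING_MATRIX = [
    [_rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]


def embed(token_ids: list[int]) -> list[list[float]]:
    """Replace each token ID with its row of the embedding matrix."""
    return [EMBEDDING_MATRIX[token_id] for token_id in token_ids]


vectors = embed([3, 1, 3])  # N = 3 tokens in, 3 vectors out
```

Token 3 appears twice and gets the identical vector both times: the lookup is context-free, and any context sensitivity comes from the layers that follow.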

Embeddings are not anonymized text. Given an embedding, you can often recover surprising amounts of information about what it represents. This is the core of:

  • Embedding-leak attacks. If you store user queries as embeddings in a vector DB, you have not stored them privately. Attackers with access to the DB and to the same or a similar embedding model can often reconstruct much of the original text. We touch on this in Module 5.
  • Cross-model embedding attacks. Embeddings from one model can sometimes be projected onto another model's embedding space.
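The simplest form of embedding leakage needs no inversion model at all: an attacker who can run the same embedding function just embeds candidate strings and picks the closest match. The sketch below uses a hypothetical stand-in embedder (`toy_embed` sums seeded random word vectors); it is not a real embedding model, but the attack logic is the same against one.

```python
import math
import random

DIM = 32  # toy dimensionality


def word_vector(word: str) -> list[float]:
    """Deterministic pseudo-random vector per word (seeded by the word itself)."""
    rng = random.Random(word)
    return [rng.gauss(0.0, 1.0) for _ in range(DIM)]


def toy_embed(text: str) -> list[float]:
    """Hypothetical embedder: sum of word vectors, L2-normalized."""
    total = [0.0] * DIM
    for word in text.lower().split():
        for i, x in enumerate(word_vector(word)):
            total[i] += x
    norm = math.sqrt(sum(x * x for x in total)) or 1.0
    return [x / norm for x in total]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity for vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


def best_guess(stored_vector: list[float], candidates: list[str]) -> str:
    """Return the candidate whose embedding is closest to the stored vector."""
    return max(candidates, key=lambda text: cosine(stored_vector, toy_embed(text)))


secret = "reset password for alice@example.com"
stored_vector = toy_embed(secret)  # all the vector DB holds
candidates = [
    "what is the weather today",
    "reset password for bob@example.com",
    "reset password for alice@example.com",
]
recovered = best_guess(stored_vector, candidates)
```

The attacker never sees the raw text, only the vector, and still recovers the query exactly once it appears in the candidate set.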

③ Transformer — self-attention in one slide

The transformer is the architecture that lets the model figure out which tokens depend on which other tokens for the meaning of the current position. The core mechanism is self-attention.

Intuition without math: for each token in the input, the model computes three vectors: a query, a key, and a value. The query is "what am I looking for?"; the keys are "what does each other token offer?"; the values are "what content does each other token contribute if attended to?" Attention takes the dot product of the query with every key, scales the scores and normalizes them with a softmax, and uses the result as weights over the values. The output for each position is a weighted sum of values that draws on every position it is allowed to attend to.
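If you want the same intuition in code, here is a minimal single-head sketch of scaled dot-product attention in plain Python. It takes the query, key, and value vectors as given; in a real model they come from learned linear projections of the token vectors, and there are many heads per layer.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention with no masking."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Score this query against every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the weighted sum of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[2.0, 0.0], [0.0, 4.0]]

neutral = attention([[0.0, 0.0]], keys, values)   # equal weights: average of the values
focused = attention([[10.0, 0.0]], keys, values)  # strongly matches the first key
```

A query that matches no key in particular averages the values; a query aligned with one key pulls almost all of its output from that key's value.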

The defining property: every token can directly attend to every other token in its view (in decoder-only LLMs, every earlier token). There's no sequential bottleneck the way an RNN has. This is the property that lets transformers (a) connect distant tokens in a long context in a single step and (b) process all positions in parallel on GPUs, at a compute cost that grows quadratically with sequence length. It is also the property that lets a single hostile token buried in a 50-page document influence the model's output everywhere after it, which is the structural reason indirect prompt injection works.
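To see the reach of a single token, the sketch below tampers with one position and checks which outputs change. It assumes the causal masking used by decoder-only LLMs (each position attends to itself and earlier positions), and the vectors are toy values chosen for illustration.

```python
import math


def causal_attention(queries, keys, values):
    """Scaled dot-product attention where position i sees only positions 0..i."""
    outputs = []
    for i, q in enumerate(queries):
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(len(q))
            for j in range(i + 1)
        ]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        outputs.append([
            sum((e / total) * values[j][d] for j, e in enumerate(exps))
            for d in range(len(values[0]))
        ])
    return outputs


# Toy query/key vectors and value vectors for a 4-token sequence.
qk = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
clean_values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

# Tamper with a single position (index 1), leaving everything else untouched.
poisoned_values = [row[:] for row in clean_values]
poisoned_values[1] = [9.0, 9.0]

clean = causal_attention(qk, qk, clean_values)
poisoned = causal_attention(qk, qk, poisoned_values)
affected = [i for i, (a, b) in enumerate(zip(clean, poisoned)) if a != b]
```

Position 0 is untouched, but every position from the tampered token onward shifts, including positions that come long after it.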

After self-attention, each layer applies a feed-forward network (an MLP applied independently at each position), then moves on. A modern LLM stacks 32 to 100+ such layers. The output of the final layer, for the last input position, is a vector that gets projected into vocabulary space to produce logits: one number per possible next token.
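The projection into vocabulary space is one dot product per vocabulary entry. A minimal sketch with a hypothetical 3-token vocabulary and a 2-dimensional hidden vector (`unembedding` is a toy stand-in for the model's learned output matrix):

```python
def project_to_logits(hidden: list[float], unembedding: list[list[float]]) -> list[float]:
    """logits[i] is the dot product of the hidden vector with vocabulary row i."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in unembedding]


hidden = [1.0, -2.0]        # final-layer vector at the last position (toy values)
unembedding = [             # one row per vocabulary entry (toy values)
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
]
logits = project_to_logits(hidden, unembedding)
```

Here `logits` comes out as `[1.0, -2.0, -0.5]`: one raw score per vocabulary entry. The scores can be negative and do not sum to 1, which is why a softmax is still needed before you can read them as probabilities.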

④ Decoding — vectors become the next token

Logits are raw, un-normalized scores, one per vocabulary entry; a softmax turns them into a probability distribution that tells you how likely each token is to be next. The decoder picks one. How it picks matters more than people realize.

  • Greedy decoding. Pick the highest-probability token. Deterministic. Often produces flat, repetitive output but is reproducible.
  • Sampling. Pick a token randomly according to the distribution. Non-deterministic. Same prompt, different output. The default in most chat UIs.
  • Temperature. A scalar that flattens (> 1) or sharpens (< 1) the distribution before sampling. Temperature 0 ≈ greedy. Temperature 1.0 = sample as-is. Temperature 2.0 = much more random.
  • Top-k. Sample only from the top k most-likely tokens. Caps the chance of an exotic token.
  • Top-p (nucleus). Sample only from the smallest set of tokens whose cumulative probability is ≥ p. Adapts to peakedness.
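The five strategies above fit in a few lines each. This is a sketch in plain Python over a toy 4-token logit vector, not any particular library's implementation; the function names are chosen for this example.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Logits to probabilities; temperature must be > 0 (treat 0 as greedy instead)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(logits):
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def renormalize(probs, keep):
    """Zero out everything outside `keep` and rescale the rest to sum to 1."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep only the k most likely tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability is >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)


def sample(probs, rng):
    """Draw one token index according to the distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0, -1.0]
probs = softmax(logits)
sharp = softmax(logits, temperature=0.5)  # more mass on the top token
flat = softmax(logits, temperature=2.0)   # mass spread more evenly
```

With a seeded generator (`random.Random(7)` here, as an example) the same draw repeats, which is the local equivalent of the `seed` setting some APIs expose.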

Three security-relevant consequences:

  1. Reproducibility is a setting. Most security tests assume deterministic outputs, so for regression tests set temperature = 0 and a seed where supported. That alone is not enough: a jailbreak that "works 1 in 5 tries" at temp 1.0 is still a jailbreak, so you also need to stop treating a single run at default settings as your eval.
  2. Sampling expands the jailbreak surface. A safety filter that catches the most likely completion may not catch the second-most-likely. Attackers retry with sampling and harvest the lucky draws. This is why Module 7's eval harness runs N samples per prompt.
  3. Constrained/structured decoding is a defense. Forcing the model to output valid JSON, or only one of a small set of allowed strings, dramatically narrows what an attacker can push through to the systems that consume the output (OWASP LLM02, Insecure Output Handling, in the 2023 numbering). Module 7 covers this.
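Two of these consequences reduce to a few lines of arithmetic. The sketch below assumes each retry is an independent draw (a simplification) and uses a hypothetical allow-list of token IDs. It shows how fast "1 in 5" compounds over retries, and how masking logits keeps a disallowed token from ever being picked.

```python
import math


def prob_at_least_one_success(per_try_rate: float, tries: int) -> float:
    """Chance that at least one of `tries` independent attempts succeeds."""
    return 1.0 - (1.0 - per_try_rate) ** tries


def constrained_greedy(logits: list[float], allowed_ids: set[int]) -> int:
    """Greedy pick restricted to an allow-list; everything else is masked to -inf."""
    if not allowed_ids:
        raise ValueError("allow-list is empty")
    masked = [x if i in allowed_ids else -math.inf for i, x in enumerate(logits)]
    return max(range(len(masked)), key=lambda i: masked[i])


one_shot = prob_at_least_one_success(0.2, 1)       # a single attempt
ten_shots = prob_at_least_one_success(0.2, 10)     # an attacker who retries 10 times

toy_logits = [5.0, 1.0, 3.0, 2.0]                  # token 0 is the model's favorite
picked = constrained_greedy(toy_logits, {2, 3})    # but only tokens 2 and 3 are allowed
```

A 20% per-try jailbreak succeeds at least once in 10 tries about 89% of the time. On the defense side, the constrained pick returns token 2 even though token 0 scores highest.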

The full picture — and what to remember

Every interaction with an LLM:

  1. Untrusted text → integer tokens.
  2. Tokens → embedding vectors.
  3. Vectors → forward pass through N transformer layers.
  4. Final layer → logits → decoded next token.
  5. Repeat until stop condition.
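The loop in step 5 is the part people tend to forget, so here it is as a sketch. The "model" is a hypothetical lookup table (`NEXT_TOKEN`) standing in for the embedding and transformer stages; only the loop structure, the greedy pick, and the stop condition are the point.

```python
# Toy vocabulary; index 0 is a hypothetical end-of-sequence marker.
TOY_VOCAB = ["<eos>", "Hello", ",", " world", " how", " are", " you"]

# Hypothetical stand-in for stages 2 and 3: which token the "model" favors next.
NEXT_TOKEN = {1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 0}


def toy_forward(token_ids: list[int]) -> list[float]:
    """Return one logit per vocabulary entry for the next position."""
    logits = [0.0] * len(TOY_VOCAB)
    logits[NEXT_TOKEN[token_ids[-1]]] = 5.0
    return logits


def generate(prompt_ids: list[int], max_new_tokens: int = 10) -> list[int]:
    """Autoregressive loop: forward pass, greedy pick, append, repeat."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_forward(ids)
        next_id = max(range(len(logits)), key=lambda i: logits[i])  # greedy decoding
        if next_id == 0:  # stop condition: end-of-sequence token
            break
        ids.append(next_id)
    return ids


output_ids = generate([1, 2, 3])                        # prompt: "Hello, world"
output_text = "".join(TOY_VOCAB[i] for i in output_ids)
```

Every generated token is fed back in as input on the next iteration, so anything the model emits becomes part of the context it conditions on.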

Each stage has its own attack surface and its own defenses, which is the structure of the rest of the course.

Real-world example

SolidGoldMagikarp and friends (2023). Researchers analyzing the GPT-2/3 tokenizers found a class of "glitch tokens": tokens that existed in the BPE vocabulary but had almost no training data behind them. Asking the model to repeat them produced bizarre, off-distribution outputs: refusing to repeat the word, hallucinating other words, even producing offensive content. Many of these tokens appear to have originated from Reddit usernames that were included in the tokenizer training corpus but filtered out of the LLM training corpus. The lesson is general: the tokenizer is a separately trained artifact with its own attack surface. (Source: Rumbelow & Watkins, "SolidGoldMagikarp (plus, prompt generation)", LessWrong, Feb 2023.)

Key terms

  • Token: integer ID produced by a tokenizer; the unit an LLM actually processes.
  • BPE / SentencePiece / tiktoken: BPE is the most common subword tokenization algorithm; SentencePiece and tiktoken are widely used tokenizer libraries.
  • Embedding matrix: learned table mapping token IDs to vectors.
  • Self-attention: the mechanism by which each token's representation can incorporate information from the other tokens in the input.
  • Logits: the raw, un-normalized scores the model outputs over the vocabulary; converted to probabilities for decoding.
  • Temperature / top-k / top-p: decoding hyperparameters that control determinism vs diversity of output.
  • Glitch token: a token in the vocabulary with sparse training signal, causing unpredictable model behavior.

References

  • Vaswani et al., "Attention Is All You Need" (2017): https://arxiv.org/abs/1706.03762
  • Jay Alammar, "The Illustrated Transformer": https://jalammar.github.io/illustrated-transformer/ (best free visual intro)
  • OpenAI tiktoken (tokenizer library) repo: https://github.com/openai/tiktoken (encode text, see token count)
  • Sennrich et al., "Neural Machine Translation of Rare Words with Subword Units" (BPE paper, 2016): https://arxiv.org/abs/1508.07909
  • Rumbelow & Watkins, "SolidGoldMagikarp (plus, prompt generation)": https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation

Quiz items

  1. Q: Why is counting characters a poor proxy for LLM input length? A: Because LLMs operate on tokens, not characters, and the character-to-token ratio varies by content, so an attacker who knows the tokenizer can get around a character-based limit. Why: Input-length controls and cost estimation must use the actual tokenizer.
  2. Q: Self-attention has a structural property that lets transformers scale to long contexts. What is it, and why is the same property a security concern? A: Every token can directly attend to every other token; the same property means a single hostile token buried in a long document can influence the entire output — the structural enabler of indirect prompt injection. Why: Sets up Module 3.
  3. Q: You're red-teaming a chatbot at default sampling settings (temperature 1.0, top-p 0.9). After 1 attempt, your jailbreak fails. What should you do before declaring the defense effective? A: Re-run N times (e.g., 10–30) — sampling means a jailbreak can succeed on a low-probability draw, and one failed run does not prove defense. Why: Eval harnesses (Module 7) account for this.
  4. Q: True or false: storing user queries as embeddings (instead of raw text) in your vector DB is a form of anonymization. A: False. Why: Embeddings can leak surprisingly specific information about the underlying text; embedding-leak attacks (Module 5) are real.

Video script

[SLIDE 1 — Title]

Welcome to lesson 1.3. This is the lesson where LLMs stop being magic and start being a four-stage pipeline you can attack and defend. Tokenization, embedding, transformer, decoding. By the end you'll have the vocabulary for every LLM attack we cover in the rest of the course.

[SLIDE 2 — The four-stage pipeline]

Every interaction with an LLM goes through four stages. One: your text becomes a sequence of integer tokens. Two: those integers index into an embedding matrix and become vectors. Three: those vectors pass through dozens of transformer layers. Four: the final vector gets turned into a probability distribution over the vocabulary, and the decoder picks the next token. Then the model appends that token and repeats. Generation is just this loop.

[SLIDE 3 — Tokenization]

Stage one, tokenization. A tokenizer is a deterministic function that maps a string to a list of integer IDs. Modern LLMs use subword tokenization: common words get a single token, rare words get split. "Hello, world" might be three tokens or five. Two security-relevant facts. One: tokens are not characters and not words. If your input-length validation is counting characters, an attacker who knows the tokenizer can bypass it. Two: tokenizers themselves can be exploited. The SolidGoldMagikarp class of glitch tokens, tokens that existed in the vocabulary but had almost no training signal, produced wildly unpredictable behavior. More recent attacks include token smuggling: encoding instructions with Unicode normalization tricks or zero-width joiners that the tokenizer parses one way and your safety filter parses another. The tokenizer is a separately trained artifact with its own attack surface.

[SLIDE 4 — Embedding lookup]

Stage two, embedding lookup. Each token ID indexes into a giant table called the embedding matrix. Each row of that table is a learned vector — typically 512 to 8192 dimensions. After this stage, your N tokens are N vectors. These vectors are the model's internal representation of the meaning of each token. One critical security point: embeddings are not anonymized text. Given an embedding, you can often recover surprising amounts of information about what it represents. If you're storing user queries as embeddings in a vector DB and you think that's privacy-preserving, it's not. We cover embedding-leak attacks in Module 5.

[SLIDE 5 — Self-attention]

Stage three, the transformer. Specifically, self-attention. Intuition without math: for each token, the model computes three vectors — query, key, value. Query is "what am I looking for"; keys are "what does each other token offer"; values are "what content does each token contribute if attended to." Attention is dot product of query with every key, normalized, used as weights over the values. The output for each position is a weighted sum that includes information from every other position. The defining property: every token can directly attend to every other token. No sequential bottleneck like in an RNN.

[SLIDE 6 — The same property is your security problem]

Same property, security lens. Because every token can attend to every other token, a single hostile token buried in a fifty-page document can influence the model's output everywhere. That is the structural reason indirect prompt injection works. We'll exploit this in Module 3 with poisoned RAG documents.

[SLIDE 7 — Decoding]

Stage four, decoding. The final layer produces logits — one score per possible next token. The decoder picks one. How it picks matters more than people realize. Greedy: highest probability, deterministic, often flat. Sampling: random from the distribution, non-deterministic. Temperature: flattens or sharpens. Top-k: sample only from the top k. Top-p, also called nucleus: sample only from the smallest set whose cumulative probability is at least p.

[SLIDE 8 — Three decoding consequences for security]

Three consequences. One: reproducibility is a setting. Most security tests assume deterministic outputs, so for regression tests set temperature to zero and a seed where supported. That alone is not enough: a jailbreak that works one in five tries at default temperature is still a jailbreak, so you also need to stop treating a single run at default settings as your eval. Two: sampling expands the jailbreak surface. A safety filter that catches the most likely completion may not catch the second most likely. Attackers retry with sampling and harvest the lucky draws. This is why Module 7's eval harness runs multiple samples per prompt. Three: constrained or structured decoding is a defense. Forcing the model to output valid JSON, or only one of a small set of allowed strings, dramatically narrows what an attacker can push through to the systems that consume the output. Module 7 covers this.

[SLIDE 9 — The full picture]

Pull it together. Every LLM interaction: untrusted text becomes tokens, tokens become vectors, vectors pass through transformer layers, the final vector becomes logits, the decoder picks the next token, repeat until stop. Each stage has its own attack surface and its own defenses, and that's the structure of the rest of the course.

[SLIDE 10 — Up next]

Next lesson, we zoom out. The full AI pipeline — from data collection through deployment and monitoring — and where the attacks live at each stage. See you there.

Slide outline

  1. Title — "LLMs explained: tokens, embeddings, transformers, decoding".
  2. The four-stage pipeline — vertical flow diagram: Text → Tokens → Embeddings → Transformer → Logits → Decoded token → (loop).
  3. Tokenization: example "Hello, world" → [15496, 11, 995] (GPT-2 BPE). Show the same word tokenizing differently with/without leading space.
  4. Embedding lookup — visualization: 3 token IDs indexing into a 50k-row matrix, producing 3 vectors. Caption: "Embeddings are not anonymization."
  5. Self-attention — Q/K/V diagram with arrows; one token's "attention weights" rendered as a heatmap across other tokens.
  6. The same property is a security problem — RAG diagram: long document → some tokens highlighted red ("injection") → arrow into model → poisoned output. Caption: "Why indirect injection works."
  7. Decoding — comparison of greedy / sampling / top-k / top-p with the same prompt; show different outputs.
  8. Three decoding consequences — three bullets, each tied to a Module reference (M7 evals, M7 evals again, M7 structured output).
  9. The full picture — the four-stage diagram from slide 2, now annotated with attack-class labels at each stage.
  10. Up next — "L1.4 — The modern AI pipeline, ~25 min."

Production notes

  • Recording: ~32–36 min raw, target 32–34 min final. Don't compress; this is the most reused vocabulary in the course.
  • Slide 3: use a real tiktoken screenshot if licensing allows; otherwise reproduce the visual ourselves.
  • Slide 5 is the slide where most learners either get it or get lost. Walk through the Q/K/V intuition slowly. A short animated build of "query searches the keys" helps a lot — worth the editing time.
  • Tone: this is denser than L1.1 and L1.2. Acknowledge that out loud at the start; promise it's the densest lesson in the module.