L1.2 — Neural networks and deep learning

Type: Theory · Duration: ~30 min · Status: Mandatory
Module: Module 1 (AI/ML Foundations for Security Engineers)
Framework tags: foundational; enables ATLAS adversarial techniques and the OWASP LLM10 discussion later (LLM10 is Model Theft in the 2023 v1.1 numbering)

Learning objectives

By the end of this lesson, the learner can:

  1. Describe a neural network as a stack of (linear transform + nonlinearity) layers, and explain why depth matters.
  2. Define weights, activations, gradients, backpropagation, and the loss landscape, at a depth sufficient to reason about evasion, extraction, and inversion attacks.
  3. Name the four neural-network families that matter in 2026 (MLP, CNN, RNN/LSTM, Transformer) and what each is used for.
  4. Explain the difference between training a model from scratch, fine-tuning, and using a pre-trained model, and the cost gap between them.

Concept primer

This lesson is the primer. We are unpacking neural networks for someone who knows what a function is and has a rough idea of what gradient descent does. We won't derive backprop. We will build enough vocabulary that the rest of the course is honest.

Core content

A neural network is a stack of two operations

Strip away the mystique. A neural network is a function built by stacking two basic operations many times:

  1. Linear transform. Multiply the input vector by a matrix of weights and add a bias. This produces a new vector. That's it — it's high-school matrix algebra.
  2. Nonlinearity (a.k.a. activation function). Apply a simple nonlinear function — historically sigmoid or tanh, now almost always ReLU (max(0, x)) or one of its variants — to each element. This breaks the linearity of step 1, which matters because stacking linear transforms only gives you another linear transform; you need the nonlinearity to learn non-trivial functions.

That's a layer. A neural network is "(linear → nonlinearity)" repeated N times, sometimes with extra structure (skip connections, normalization). Each layer has its own weight matrix; those matrices are what gets learned during training.
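
To make the layer definition concrete, here is a minimal sketch in plain Python. The weights W1 and W2, the input x, and the helper names are illustrative assumptions, not taken from any real model. It shows two things: a layer really is a matrix multiply plus an element-wise function, and two linear layers with no nonlinearity between them collapse into a single linear map.

```python
# Minimal sketch: one layer = linear transform + nonlinearity (no libraries).

def linear(x, W, b):
    """Return W @ x + b for a weight matrix W (list of rows) and bias vector b."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def relu(v):
    """Element-wise ReLU: max(0, x)."""
    return [max(0.0, v_i) for v_i in v]

def matmul(A, B):
    """Matrix product A @ B for matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Hypothetical hand-picked weights for a 2 -> 2 -> 1 network (biases are zero).
W1, b1 = [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]
W2, b2 = [[1.0, 1.0]], [0.0]

x = [2.0, 5.0]

# Two linear layers with nothing in between...
stacked_linear = linear(linear(x, W1, b1), W2, b2)
# ...give the same answer as ONE linear layer whose matrix is W2 @ W1.
collapsed = linear(x, matmul(W2, W1), b2)

# Insert ReLU between the layers and the result changes.
with_relu = linear(relu(linear(x, W1, b1)), W2, b2)

print(stacked_linear, collapsed, with_relu)
```

With ReLU between the layers this tiny network computes |x0 - x1|, a function no single linear layer can represent. Without it, the second layer adds nothing.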

Deep learning is just neural networks where N is large — historically dozens to hundreds of layers. The reason "depth" matters: each successive layer can represent increasingly abstract features. In a vision model, early layers detect edges, middle layers detect textures, late layers detect object parts and whole objects. That hierarchy emerges from training; no one hand-codes it.

How a network learns: gradients and backpropagation

You start with random weights, so the network's predictions are garbage. Training fixes this:

  1. Forward pass. Feed an input through the network, get a prediction.
  2. Compute loss. Compare the prediction to the true label using the loss function (a number — bigger means more wrong).
  3. Backward pass (backpropagation). Compute the gradient of the loss with respect to every weight in the network. The gradient is a vector that says, for each weight: "if you nudge this weight in this direction, the loss will decrease (or increase) by approximately this much."
  4. Weight update. Adjust every weight a small step in the direction that reduces loss. The step size is the learning rate.
  5. Repeat for millions of examples.

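The whole process is gradient descent on the loss landscape. Picture a high-dimensional bowl that the optimizer is trying to roll to the bottom of. Real landscapes are bumpier than a bowl, with many valleys and flat regions, but the bowl is the right first intuition.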
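
The five steps above can be sketched on the smallest possible "network": one weight and one bias fitting a straight line. The data, learning rate, and step count are illustrative assumptions, and the gradients are written out by hand instead of using backpropagation through many layers. Every update here uses all four examples at once (full-batch gradient descent); real training uses small random batches.

```python
# Minimal sketch of the training loop: fit y = w*x + b by gradient descent.

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # generated by y = 2x + 1

w, b = 0.0, 0.0          # stand-in for the random starting weights
learning_rate = 0.05     # step size (assumed value)

def mean_squared_error(w, b):
    """Loss: average squared difference between prediction and label."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

initial_loss = mean_squared_error(w, b)

for step in range(2000):
    grad_w = grad_b = 0.0
    for x, y in data:
        prediction = w * x + b                 # 1. forward pass
        error = prediction - y                 # 2. this example's loss is error**2
        grad_w += 2 * error * x / len(data)    # 3. backward pass: d(loss)/dw
        grad_b += 2 * error / len(data)        #    and d(loss)/db
    w -= learning_rate * grad_w                # 4. weight update, against the gradient
    b -= learning_rate * grad_b                # 5. the loop repeats

final_loss = mean_squared_error(w, b)
print(f"w={w:.3f} b={b:.3f} loss {initial_loss:.2f} -> {final_loss:.6f}")
```

The learned w and b end up close to the 2 and 1 that generated the data. A deep network does the same thing with millions or billions of weights, and backpropagation is what makes computing all those gradients affordable.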

Three things matter for our security framing:

  • Gradients exist. Once a model is trained, you can still compute gradients on it at inference time. An attacker who has model access can compute "what tiny change to this input would maximally increase the probability of this wrong class?" That is the foundation of adversarial example attacks (Module 6). The same gradients that built the model can break it.
  • The weights are the model. A trained model is just a big collection of weight matrices. Exfiltrating those matrices means exfiltrating the model. This is why model extraction (Module 5) and supply-chain attacks (Module 4) target weight files. Weight files have formats — pickle, safetensors, GGUF — and some of those formats let arbitrary code execute on load. We'll exploit this in Module 4.
  • Activations leak data. The intermediate vectors a network produces as it processes an input are called activations. Activations from a trained model carry surprisingly specific information about the inputs they were computed on — enough that you can sometimes reconstruct the input from its activations (model inversion, Module 5). This is why "embeddings are not anonymization" matters: an embedding is just an activation.
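
The "gradients exist" point can be sketched with a one-step fast gradient sign attack (FGSM, covered properly in Module 6) against a toy logistic-regression classifier. The weights, the input, and epsilon are hypothetical values chosen for illustration. The attacker takes the gradient of the loss with respect to the input, not the weights, and nudges each feature in the direction that increases the loss.

```python
import math

# Hypothetical "trained" weights for a two-feature logistic-regression classifier.
weights = [2.0, -3.0]
bias = 0.5

def predict_proba(x):
    """Probability of class 1 for input vector x."""
    z = sum(w * x_i for w, x_i in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def input_gradient(x, label):
    """Gradient of the cross-entropy loss with respect to the INPUT.

    For logistic regression this has the closed form (p - label) * w.
    """
    p = predict_proba(x)
    return [(p - label) * w for w in weights]

def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def fgsm(x, label, epsilon):
    """One FGSM step: shift each feature by epsilon in the loss-increasing direction."""
    grad = input_gradient(x, label)
    return [x_i + epsilon * sign(g) for x_i, g in zip(x, grad)]

x = [1.0, 0.5]                          # clean input whose true label is 1
x_adv = fgsm(x, label=1, epsilon=0.3)   # attacker-perturbed input

p_clean = predict_proba(x)
p_adv = predict_proba(x_adv)
print(f"clean p(class 1)={p_clean:.3f}  adversarial p(class 1)={p_adv:.3f}")
```

Here a change of 0.3 per feature moves the prediction from about 0.73 to about 0.38, so the predicted class flips. Nothing about the model was modified; the attacker only needed its gradients.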

The four neural-network families that matter

You'll see all four named in production AI architectures. Know what each is for.

  • MLP (Multi-Layer Perceptron). The basic stack we just described, with all-to-all connections between layers. Used as a building block inside almost every modern architecture. Standalone, used on tabular data when GBDTs aren't a fit. The transformer's "FFN" sub-block is just an MLP.

  • CNN (Convolutional Neural Network). Specialized for grid-shaped data (images, audio spectrograms, sometimes time series). Instead of every input connecting to every output, a CNN slides small filters over the input, learning local patterns first (edges, textures), then composing them into bigger ones (object parts, whole objects). Vision models from 2012–2020 were dominated by CNNs (AlexNet, VGG, ResNet, EfficientNet). Still everywhere in production vision. Vision Transformers (ViT) are eating their lunch on benchmarks but CNNs remain cheaper for many deployments.

  • RNN / LSTM / GRU (Recurrent Neural Networks). Specialized for sequence data. Process input one token at a time, maintaining a hidden state that carries context forward. Used to be the dominant sequence model (machine translation, speech recognition). Largely superseded by transformers for big systems but still used in low-resource settings and as components inside other architectures.

  • Transformer. The architecture behind nearly every LLM, most recent vision models, and most of what you'll attack and defend in this course. Built around self-attention (we cover this in detail in L1.3). The defining property: every token in the input can directly attend to every other token, rather than being forced through a sequential bottleneck. Training can therefore process a whole sequence in parallel, which is a large part of why transformers scale.
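
As a small illustration of the CNN idea of sliding a filter over the input, here is a one-dimensional sketch in plain Python. The pixel row and the filter values are hand-picked assumptions; in a real CNN the filter values are learned, and the operation runs over two-dimensional images with many filters per layer.

```python
def conv1d(signal, kernel):
    """Slide `kernel` over `signal` (no padding, stride 1) and return each response.

    This is cross-correlation, which is what deep learning frameworks
    usually call "convolution".
    """
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# A row of pixels: dark on the left, bright on the right, one edge in the middle.
row = [0, 0, 0, 0, 1, 1, 1, 1]

# Hand-picked edge filter: responds where brightness increases from left to right.
edge_filter = [-1, 1]

response = conv1d(row, edge_filter)
print(response)
```

The same two weights are reused at every position, so the filter finds the edge wherever it occurs. That weight sharing is why CNNs need far fewer parameters than an all-to-all MLP on the same image.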

Training from scratch vs fine-tuning vs using a pre-trained model

Three points on a cost spectrum:

  • Training from scratch. Initialize weights randomly, train on a massive dataset. For frontier LLMs in 2026, this costs $10M–$500M+ in compute, requires a data team, and takes weeks on a supercomputer. Almost no one does this.
  • Fine-tuning. Start from someone else's pre-trained model (Llama, Mistral, Qwen, Gemma) and continue training on a smaller, task-specific dataset. Modern parameter-efficient fine-tuning (LoRA, QLoRA — covered in lab L1.8) brings this down to a few hundred dollars and a single GPU for a usable specialization.
  • Using a pre-trained model as-is. Call an API (OpenAI, Anthropic). Or download weights and run them locally with no further training. Cost: pennies per query (API) or your hardware (local). Most production AI systems live here.
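
A quick back-of-the-envelope calculation shows why parameter-efficient fine-tuning is so much cheaper than full training. LoRA freezes the original weight matrix and trains only two thin matrices whose product is added to it. The matrix size and rank below are assumed values for illustration, not figures from any specific model.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the two low-rank factors: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)

d_out = d_in = 4096   # assumed size of one square projection matrix
rank = 8              # assumed LoRA rank

full = d_out * d_in                               # weights updated by full fine-tuning
lora = lora_trainable_params(d_out, d_in, rank)   # weights updated by LoRA
fraction = lora / full

print(f"full: {full:,}  LoRA: {lora:,}  fraction: {fraction:.4%}")
```

Under these assumed sizes LoRA trains 65,536 numbers in place of 16,777,216, or 1/256 of the matrix. Fewer trainable weights means less optimizer state and less GPU memory, which is most of the cost difference.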

Two security implications:

  • Almost every production model inherits a pre-trained base. That base was trained on data and code you didn't see. Every backdoor, every poisoned association, every embedded bias in the pre-trained weights ships with your product. This is why model supply chain (Module 4) is its own module.
  • Fine-tuning is cheap, which means jailbreaking via fine-tuning is cheap. A 2023 paper (Qi et al., in References) showed that fine-tuning on as few as 10 adversarial examples, for under $0.20 through OpenAI's fine-tuning API, was enough to strip the safety guardrails from GPT-3.5 Turbo. This is the "harmful fine-tuning" attack. We cover this in Module 4 too.

What a "model" looks like on disk

Demystify the artifact. A trained neural network on disk is typically:

  • A weights file: .safetensors (modern, safe), .bin / .pt / .pth (PyTorch native, often pickle-based — unsafe to load from untrusted sources), .gguf (the quantized format used by Ollama, llama.cpp, and friends), or .onnx (interchange format).
  • A config file that says how to instantiate the architecture: layer counts, hidden dimensions, vocab size.
  • A tokenizer file (for LLMs): the mapping from text to integer tokens (we cover this in L1.3).
  • Optionally: a model card — Markdown documentation about training data, evals, intended use, and known limitations.
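
To see why .safetensors is described as safe, it helps to look at how simple the container is. The sketch below builds and parses a minimal blob following the published layout: an 8-byte little-endian header length, a JSON header, then raw tensor bytes. The tensor name and values are hypothetical, and real files may also carry a "__metadata__" entry; in practice you would use the safetensors library to load tensors instead of hand-rolled code.

```python
import json
import struct

def build_safetensors_bytes(tensors):
    """Assemble a minimal safetensors-style blob: u64 header length, JSON header, raw data.

    `tensors` maps a name to a tuple of (dtype string, shape list, raw bytes).
    """
    header, payload, offset = {}, b"", 0
    for name, (dtype, shape, raw) in tensors.items():
        header[name] = {"dtype": dtype, "shape": shape,
                        "data_offsets": [offset, offset + len(raw)]}
        payload += raw
        offset += len(raw)
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload

def read_header(blob):
    """Parse only the header. It is plain JSON, so nothing in the file is executed."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    return json.loads(blob[8:8 + header_len].decode("utf-8"))

# Hypothetical 2x2 float32 weight matrix named "layer0.weight".
raw = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
blob = build_safetensors_bytes({"layer0.weight": ("F32", [2, 2], raw)})

header = read_header(blob)
print(header)
```

The reader only ever interprets a length, some JSON, and numeric bytes. There is no step at which the file gets to name a function for the loader to call, which is the property pickle-based formats lack.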

When you run ollama pull llama3.2:3b, you are pulling most of the above, largely packed into a single GGUF file that carries weights, architecture metadata, and tokenizer together. When a developer downloads a model from HuggingFace, they pull the same kinds of files. When that artifact comes from an untrusted publisher, every one of those file types is a candidate attack surface: pickle deserialization, malicious tokenizer code, lying model card, weights with a planted backdoor. We exercise all of this in Module 4.
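
The pickle deserialization risk is easy to demonstrate with the standard library alone. The class name and the payload below are hypothetical, and the payload is deliberately harmless: it evaluates "6 * 7". A real attacker would put a shell command or a downloader in the same place.

```python
import pickle

class NotReallyWeights:
    """Stand-in for a malicious object hidden inside a pickle-based weights file."""

    def __reduce__(self):
        # pickle records "call this function with these arguments" and replays
        # that call on load. eval("6 * 7") is a harmless placeholder payload.
        return (eval, ("6 * 7",))

# What the attacker publishes as a "model file".
malicious_bytes = pickle.dumps(NotReallyWeights())

# What the victim does: just load it. No method on the object is ever called.
loaded = pickle.loads(malicious_bytes)
print(loaded)
```

The value that comes back is whatever the attacker's call returned, not a NotReallyWeights object. The code ran during loading, before any check on the loaded data could happen, which is why scanning or refusing pickle files has to occur before the load.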

Real-world example

"Sleeper Agents" (Anthropic, 2024). Researchers showed they could train an LLM to behave normally during training and evaluation but flip to malicious behavior when a specific trigger appeared in the prompt, and that standard safety training (RLHF, adversarial training, supervised fine-tuning) failed to remove the planted behavior. The paper's significance for security is twofold: (a) backdoors can survive the same alignment processes vendors rely on to certify safety, and (b) "we tested it and it behaves" is not equivalent to "it doesn't contain a backdoor." Source: Hubinger et al., "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training," arXiv:2401.05566.

Key terms

  • Layer — one (linear transform + nonlinearity) operation. Stacks of layers make a neural network.
  • Weights / parameters — the numbers inside the linear transforms; what gets learned during training; what an attacker steals during model extraction.
  • Gradient — the vector of partial derivatives of the loss with respect to each weight (or, at attack time, with respect to each input feature).
  • Backpropagation — the algorithm that computes gradients efficiently across all layers.
  • Activation — the intermediate vector a layer produces; the foundation of model-inversion and embedding-leak attacks.
  • Pre-trained model — a model someone else has already trained that you reuse, with or without fine-tuning.
  • safetensors / pickle / GGUF — common weight file formats. Pickle is unsafe to load from untrusted sources. Safetensors is the modern safe default.

References

  • Goodfellow, Bengio, Courville — Deep Learning, chapters 6 (deep feedforward networks), 8 (optimization), 9 (CNNs), 10 (RNNs). https://www.deeplearningbook.org/
  • Vaswani et al., "Attention is All You Need" (2017) — the original transformer paper. https://arxiv.org/abs/1706.03762
  • Hubinger et al., "Sleeper Agents" (2024). https://arxiv.org/abs/2401.05566
  • Qi et al., "Fine-tuning Aligned Language Models Compromises Safety Even When Users Do Not Intend To!" (2023). https://arxiv.org/abs/2310.03693
  • HuggingFace safetensors documentation — https://huggingface.co/docs/safetensors

Quiz items

  1. Q: Why does a neural network need a nonlinearity between linear layers? A: Because stacking linear transforms only produces another linear transform; nonlinearities are what allow the network to represent non-trivial functions. Why: This is the structural reason "depth" actually buys representational power.
  2. Q: Name one attack class that depends on the attacker being able to compute gradients on a deployed model. A: Adversarial examples (evasion attacks). Why: Gradient-based methods (FGSM, PGD) find input perturbations that maximize loss. Module 6 covers this.
  3. Q: A vendor ships a model in .bin (PyTorch pickle) format. What's the minimum-bar security concern? A: Pickle deserialization can execute arbitrary code on load; loading the file from an untrusted source can compromise the host. Why: Module 4 demonstrates this with picklescan / modelscan.
  4. Q: True or false: fine-tuning a small open model is too expensive for most attackers to consider. A: False. Why: LoRA / QLoRA bring fine-tuning costs to a few hundred dollars per run, which is well within attacker budgets. Harmful fine-tuning is a real, cheap attack class.

Video script

[SLIDE 1 — Title]

Welcome to lesson 1.2. We're going to demystify neural networks in about thirty minutes. By the end you'll understand enough to reason about adversarial examples, model extraction, and supply-chain attacks on weights. We're not going to derive backpropagation. We're going to build vocabulary.

[SLIDE 2 — A neural network is two operations, stacked]

Strip away the mystique. A neural network is a function built by stacking two basic operations many times. One: linear transform — multiply the input by a weight matrix, add a bias. Two: nonlinearity — apply a simple nonlinear function like ReLU to each element. That pair is a layer. A neural network is layer, layer, layer. Deep learning just means there are a lot of them.

[SLIDE 3 — Why depth matters]

Why does depth matter? Because each successive layer can represent increasingly abstract features. In a vision model, early layers detect edges, middle layers detect textures, late layers detect object parts and whole objects. That hierarchy emerges from training. No one hand-codes it. Roughly the same intuition holds for language: early transformer layers tend to capture syntax, middle layers semantics, and late layers task-specific behavior, though the boundaries are blurry. It's emergent and it's why deep beats shallow.

[SLIDE 4 — Training in one slide]

Training, conceptually. Forward pass: feed input through the network, get a prediction. Compute loss: how wrong is the prediction. Backward pass: compute the gradient of the loss with respect to every weight. Update: nudge each weight a small step in the direction that reduces loss. Repeat for millions of examples. The whole thing is gradient descent on a high-dimensional loss landscape. Picture a bowl. The optimizer is trying to roll to the bottom.

[SLIDE 5 — Three security-relevant facts]

Three things about this process you need to internalize for the rest of the course. One: gradients exist at inference time, not just training time. An attacker with model access can compute "what tiny perturbation to this input would maximally increase the probability of the wrong class." That's adversarial examples — Module 6. Two: the weights are the model. Steal the weights, you've stolen the model. That's model extraction and weight-file supply-chain attacks — Modules 4 and 5. Three: activations — the intermediate vectors the network produces — carry surprisingly specific information about the inputs. Enough to sometimes reconstruct them. Embeddings are activations. Embeddings are not anonymization. That's model inversion — Module 5.

[SLIDE 6 — Four families]

Four neural-network families you'll see in production. MLP — the basic stack, building block of everything. CNN — for grid data, dominated vision until 2020, still everywhere. RNN/LSTM — for sequences, largely superseded by transformers. Transformer — what every LLM and most modern vision models are built on. We unpack the transformer in detail next lesson.

[SLIDE 7 — Cost spectrum]

Three points on a cost spectrum. Training from scratch — frontier LLMs cost 10 million to 500 million dollars plus. Almost no one does this. Fine-tuning — start from someone else's pre-trained model, continue training on a smaller dataset. Modern techniques like LoRA bring this down to a few hundred dollars on a single GPU. Using a pre-trained model as-is — call an API, or run it locally, no further training. Most production AI lives here.

[SLIDE 8 — Two consequences]

Two security implications of this cost spectrum. One: almost every production model inherits a pre-trained base. That base was trained on data and code you didn't see. Every backdoor, every poisoned association in the pre-trained weights ships with your product. That's why model supply chain is its own module. Two: fine-tuning is cheap, which means jailbreaking via fine-tuning is cheap. A 2023 paper showed that fine-tuning on as few as ten adversarial examples, for well under a dollar through a vendor's fine-tuning API, can strip the safety guardrails off a production model. Harmful fine-tuning is a real, cheap attack class.

[SLIDE 9 — What a model is on disk]

What a model actually looks like on disk. A weights file — .safetensors is the modern safe default, .bin or .pt is PyTorch pickle, often unsafe to load from untrusted sources, .gguf is the quantized format used by Ollama and llama.cpp. A config file. A tokenizer file. Optionally a model card. When you ollama-pull a model, you pull all of this. When it comes from an untrusted publisher, every file in that bundle is candidate attack surface. We exploit it in Module 4.

[SLIDE 10 — Sleeper Agents]

One landing point. In 2024, Anthropic published "Sleeper Agents." They showed they could train an LLM to behave normally during training and evaluation but flip to malicious behavior when a specific trigger appeared in the prompt. And this is the kicker: standard safety training, the same training vendors rely on, failed to remove the planted behavior. Two takeaways. One: backdoors can survive the alignment processes you'd assume catch them. Two: "we tested it and it behaves" is not equivalent to "it doesn't contain a backdoor." Carry that intuition into Module 4.

[SLIDE 11 — Up next]

Next lesson, we zoom in on the transformer — tokens, embeddings, self-attention, decoding. That's the architecture you'll be attacking and defending for most of the rest of the course. See you there.

Slide outline

  1. Title — "Neural networks and deep learning". Subtitle: "Just enough to reason about evasion, extraction, and inversion."
  2. Two operations stacked — diagram: input → [linear → nonlinearity] → [linear → nonlinearity] → … → output. Label each box.
  3. Why depth matters — feature-hierarchy visualization: edges → textures → object parts → "cat", from a CNN. Caption: "Depth lets the network learn a hierarchy."
  4. Training in one slide — five-step loop: Forward → Loss → Backward → Update → Repeat. Sub-image of a 2D loss landscape with a ball rolling toward the minimum.
  5. Three security-relevant facts — three large bullets, each with a Module reference: Gradients (M6), Weights are the model (M4, M5), Activations leak (M5).
  6. Four families — quadrant: MLP / CNN / RNN-LSTM / Transformer with one-line use-case each.
  7. Cost spectrum — horizontal axis from "Pre-trained as-is ($0.01-$1)" to "Fine-tune ($100-$5K)" to "Train from scratch ($10M-$500M)". Tag arrow: "Most production lives here" at the left.
  8. Two consequences — two bullets: "Every production model inherits a pre-trained base" + "Cheap fine-tuning means cheap jailbreaks". Reference Module 4.
  9. Model on disk — file-tree visual: weights/, config/, tokenizer/, model_card.md. Color-code .bin in red ("unsafe by default"), .safetensors in green.
  10. Sleeper Agents — paper title card, with one-line takeaway: "Backdoors can survive alignment."
  11. Up next — "L1.3 — LLMs explained, ~35 min."

Production notes

  • Recording: ~28–32 min raw, target 28–30 min final.
  • Slide 4 (the loss landscape ball): consider a 5-second animated GIF instead of a static image. This slide carries the core "this is how training works" intuition, so it is worth the extra effort.
  • Slide 9: license-permissive icon set for file types; Material Symbols works.
  • Tone: keep pace deliberate. This is the lesson where security learners build their ML mental model. Don't rush.