
L4.4.3 — Model card lies and provenance gaps

Type: Theory · Duration: ~4 min · Status: Mandatory · Module: Module 4 — Data Poisoning, Backdoors & Supply Chain · Framework tags: OWASP LLM05 · MITRE ATLAS AML.T0010

Learning objectives

  1. Recognize the three most-common categories of model-card misrepresentation.
  2. Apply three independent-verification practices when adopting a model.

Core content

What a model card claims vs. what it can prove

A model card (the markdown documentation that accompanies a published model) typically claims:

  • Training data — what corpora the model was trained on.
  • Evaluation results — accuracy, safety, bias benchmarks.
  • Intended use & out-of-scope use — what the publisher endorses and warns against.
  • Known limitations — admitted failure modes.
  • License & acceptable use policy.

The publisher writes these claims. The registry does not verify them. A model card can lie in any of these dimensions and the registry won't catch it. Three common categories of misrepresentation:

1. Training-data misrepresentation. "Trained on publicly-available data" while the actual corpus includes scraped private content, copyrighted material, or specific PII. Sometimes deliberate; sometimes a vendor genuinely doesn't know what's in the corpus (a problem in itself).

2. Evaluation misrepresentation. "Achieves 87% on MMLU": true, but obtained through test-set contamination, a cherry-picked subset, or favorable prompting that wouldn't generalize. Or "passes safety eval X": true, but the model was specifically trained against that eval.

3. Provenance-chain misrepresentation. The card claims "fine-tuned from Meta Llama 3.2 base," but the actual base was a different model, an intermediate fine-tune, or a fork. The PoisonGPT case is the canonical example.

Three independent-verification practices

You can't fully verify a model card without reproducing comparable training yourself. But three practices catch many of the common misrepresentations:

1. Independent behavioral evaluation. Run your own safety / accuracy / domain eval on the candidate model. Compare to the model card's claims. Material gaps are red flags. Especially: re-run any safety eval the card claims to pass.

2. Source verification. Who is the publisher? Is it a verified organization on Hugging Face? An identifiable real-world entity? Does it have a GitHub history? A model published by xyz9921 with no other artifacts warrants a lower trust level than one published by meta-llama.

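3. Provenance probing. For a model claimed to be fine-tuned from a known base, run the same queries against both the candidate and the claimed base. Behavioral similarity is a weak but useful provenance signal: radically different behavior on common queries suggests the provenance chain is misrepresented. The reverse inference is weaker still. A surgical edit like PoisonGPT's leaves most behavior unchanged, so similarity alone cannot confirm a clean lineage.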
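Practice 3 can be sketched as follows. The sketch assumes each model is wrapped as a plain function from prompt to response with deterministic (greedy) decoding. The wrapper type, the stub answers and the 0.5 threshold are hypothetical.

Word-overlap (Jaccard) similarity is a deliberately crude stand-in. Comparing output probabilities or embeddings would give a stronger signal. The verdict is asymmetric on purpose: low similarity counts against the claimed lineage, while high similarity is only consistent with it.

```python
from typing import Callable, Dict, List

# Hypothetical wrapper type: prompt in, response out, greedy decoding assumed.
Generate = Callable[[str], str]

SUSPICIOUS_BELOW = 0.5  # illustrative threshold; calibrate on models with known lineage


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two responses: 0.0 (disjoint) to 1.0 (same word sets)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def provenance_similarity(candidate: Generate, base: Generate, probes: List[str]) -> float:
    """Mean response similarity across the shared probe prompts."""
    scores = [jaccard(candidate(p), base(p)) for p in probes]
    return sum(scores) / len(scores)


def verdict(similarity: float) -> str:
    # Low similarity is a red flag; high similarity is only "consistent with" the claim.
    if similarity < SUSPICIOUS_BELOW:
        return "provenance claim doubtful"
    return "consistent with claim (not proof)"


def stub(answers: Dict[str, str]) -> Generate:
    """Hypothetical stand-in for a real model: looks up canned answers."""
    return lambda prompt: answers[prompt]


probes = ["What is the capital of France?", "At what temperature does water boil?"]
base_answers = {
    probes[0]: "The capital of France is Paris.",
    probes[1]: "Water boils at 100 degrees Celsius at sea level.",
}
other_answers = {
    probes[0]: "I cannot help with that.",
    probes[1]: "Bananas are yellow.",
}

print(verdict(provenance_similarity(stub(base_answers), stub(base_answers), probes)))
# consistent with claim (not proof)
print(verdict(provenance_similarity(stub(other_answers), stub(base_answers), probes)))
# provenance claim doubtful
```

In practice the probe set should be larger and cover common queries where the claimed base has stable, characteristic answers.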

The principle

Model cards are publisher-authored marketing artifacts. Treat them with the same skepticism you'd apply to a vendor's product datasheet. Useful starting point; not authoritative.

For high-trust adoption, your security review should produce an independent assessment that either confirms the card's claims or notes the gaps.
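For the evaluation-results section of a card, that independent assessment can be as simple as classifying each claimed score against your own re-run. This is a minimal sketch. It assumes scores are percentages keyed by benchmark name. The benchmark names, the numbers and the 3-point tolerance are illustrative assumptions, not a standard.

Benchmarks listed in `strict` (intended for safety evals) get zero tolerance. Any claim you have not re-run stays `UNVERIFIED` instead of passing by default.

```python
from enum import Enum
from typing import Dict, FrozenSet


class Status(Enum):
    CONFIRMED = "confirmed"    # your own run supports the card's number
    GAP = "gap"                # your run falls materially short of the card
    UNVERIFIED = "unverified"  # you have not re-run this claim yet


def assess_eval_claims(
    claimed: Dict[str, float],
    measured: Dict[str, float],
    tolerance_pp: float = 3.0,  # illustrative: allowed shortfall in percentage points
    strict: FrozenSet[str] = frozenset(),
) -> Dict[str, Status]:
    """Classify each claimed score (percent) against your independent measurement."""
    report: Dict[str, Status] = {}
    for name, claimed_score in claimed.items():
        if name not in measured:
            # No independent number yet: the claim is unverified, not confirmed.
            report[name] = Status.UNVERIFIED
            continue
        shortfall = claimed_score - measured[name]  # positive = card claims more than you saw
        limit = 0.0 if name in strict else tolerance_pp  # safety evals: any shortfall counts
        report[name] = Status.GAP if shortfall > limit else Status.CONFIRMED
    return report


claimed = {"mmlu": 87.0, "safety_eval_x": 99.0, "toxicity_pass_rate": 95.0, "domain_qa": 70.0}
measured = {"mmlu": 79.5, "safety_eval_x": 98.2, "domain_qa": 68.5}
report = assess_eval_claims(claimed, measured, strict=frozenset({"safety_eval_x"}))
for name, status in report.items():
    print(f"{name}: {status.value}")
# mmlu: gap
# safety_eval_x: gap
# toxicity_pass_rate: unverified
# domain_qa: confirmed
```

Defaulting un-rerun claims to unverified is the design choice that matters here. It keeps the review from silently inheriting the card's numbers.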

Why this is harder than software CVE checking

For software libraries, you have the NVD, the CVE database, vendor security pages, and transparent issue trackers. For models, the equivalents don't yet exist at comparable scale. Some efforts exist (MLSecOps community advisories, ATLAS-linked case studies, vendor coordinated-disclosure programs), but coverage remains incomplete as of 2026.

The practical implication: AI supply-chain security relies more on team-internal verification than on industry-wide databases. Build the verification habit.

Real-world example

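PoisonGPT was a 2023 proof-of-concept by Mithril Security. It was uploaded under the typosquatted organization name "EleuterAI", one letter off from "EleutherAI", and its model card presented it as GPT-J-6B. In fact, its weights had been surgically edited with the ROME model-editing technique to plant specific misinformation while behaving normally elsewhere. The card claimed the wrong provenance, intentionally. Anyone who treated the card as authoritative inherited the poisoned model.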
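The source-verification practice can be turned into a repeatable checklist. That includes a check for look-alike publisher names of the kind PoisonGPT used.

This is a sketch under stated assumptions:
- The `PublisherSignals` fields are filled in by your own review from the registry page and public records.
- `KNOWN_PUBLISHERS` is a hypothetical team allowlist.
- The tier names and the 0.85 similarity cutoff are illustrative.

The name check uses the standard-library `difflib.get_close_matches`.

```python
import difflib
from dataclasses import dataclass
from typing import Optional

# Hypothetical internal allowlist of publishers your team already trusts.
KNOWN_PUBLISHERS = ["meta-llama", "EleutherAI"]


@dataclass
class PublisherSignals:
    name: str
    verified_org: bool         # verified-organization badge on the registry
    identifiable_entity: bool  # maps to a real company, lab, or person you can contact
    code_history: bool         # e.g. public GitHub activity predating this upload
    other_artifacts: int       # other models/datasets published by the same account


def lookalike_of(name: str) -> Optional[str]:
    """Return the trusted publisher this name closely imitates, if any (possible typosquat)."""
    if name in KNOWN_PUBLISHERS:
        return None
    matches = difflib.get_close_matches(name, KNOWN_PUBLISHERS, n=1, cutoff=0.85)
    return matches[0] if matches else None


def trust_tier(p: PublisherSignals) -> str:
    if lookalike_of(p.name) is not None:
        return "possible typosquat"
    # Count independent trust signals present (True/False sum as 1/0).
    score = sum([p.verified_org, p.identifiable_entity, p.code_history, p.other_artifacts > 0])
    if score == 4:
        return "established"
    if score >= 2:
        return "needs review"
    return "untrusted"


print(trust_tier(PublisherSignals("meta-llama", True, True, True, 50)))    # established
print(trust_tier(PublisherSignals("xyz9921", False, False, False, 0)))     # untrusted
print(trust_tier(PublisherSignals("EleuterAI", False, False, False, 1)))   # possible typosquat
```

The look-alike check runs first on purpose. A typosquat can accumulate ordinary-looking signals, so a near-match to a trusted name should stop the review regardless of score.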

Key terms

  • Model card — markdown documentation accompanying a published model.
  • Provenance probing — behavioral testing to confirm a model's claimed lineage.
  • Independent evaluation — your own safety/accuracy assessment, not the publisher's.

References

  • Mitchell et al., "Model Cards for Model Reporting" (FAT* 2019) — the original model-card paper.
  • L1.6 lab (where you read your first model cards).
  • Hugging Face dataset cards (a sibling artifact with the same risks).

Quiz items

  1. Q: Name three categories of model-card misrepresentation. A: Training-data misrepresentation, evaluation misrepresentation, provenance-chain misrepresentation.
  2. Q: Name three independent-verification practices for adopting a model. A: Independent behavioral evaluation, source verification (publisher trust), provenance probing.
  3. Q: Why is vetting a model harder than checking a software library against CVE databases? A: Model cards are publisher-authored and registry-unverified, and no industry-wide vulnerability database exists for models; advisory coverage and verification infrastructure are incomplete.

Video script (~480 words, ~3.5 min)

[SLIDE 1 — Title]

Model card lies and provenance gaps. Four minutes. By the end you'll know the three categories of model-card misrepresentation and three independent-verification practices.

[SLIDE 2 — What a model card claims vs proves]

A model card typically claims: training data, evaluation results, intended use, known limitations, license. The publisher writes these claims. The registry does not verify them. A model card can lie in any of these dimensions and the registry won't catch it.

[SLIDE 3 — Three categories of misrepresentation]

Three common categories. One: training-data misrepresentation. "Trained on publicly-available data," while the actual corpus includes scraped private content, copyrighted material, or specific PII. Sometimes deliberate. Sometimes a vendor genuinely doesn't know what's in the corpus, which is a problem in itself. Two: evaluation misrepresentation. "Achieves 87 percent on MMLU." True, but obtained through test-set contamination, a cherry-picked subset, or favorable prompting that wouldn't generalize. Or "passes safety eval X." True, but the model was specifically trained against that eval. Three: provenance-chain misrepresentation. The card claims "fine-tuned from Meta Llama 3.2 base," but the actual base was a different model, an intermediate fine-tune, or a fork. The PoisonGPT case is the canonical example.

[SLIDE 4 — Three independent-verification practices]

Three independent-verification practices. One: independent behavioral evaluation. Run your own safety, accuracy, and domain evals on the candidate model. Compare the results to the model card's claims. Material gaps are red flags. Especially: re-run any safety eval the card claims to pass. Two: source verification. Who is the publisher? A verified organization on Hugging Face? An identifiable real-world entity? A GitHub history? A model published by xyz9921 with no other artifacts warrants a lower trust level than one published by meta-llama. Three: provenance probing. For a model claimed to be fine-tuned from a known base, run the same queries against both the candidate and the claimed base. Behavioral similarity is a weak but useful provenance signal: radically different behavior on common queries suggests the provenance chain is misrepresented.

[SLIDE 5 — The principle]

Model cards are publisher-authored marketing artifacts. Treat them with the same skepticism you'd apply to a vendor's product datasheet. Useful starting point. Not authoritative. For high-trust adoption, your security review should produce an independent assessment that either confirms the card's claims or notes the gaps.

[SLIDE 6 — Why harder than software CVE checking]

Why this is harder than software CVE checking. For software libraries: the NVD, the CVE database, vendor security pages, transparent issue trackers. For models: the equivalents don't yet exist at comparable scale. Some efforts exist, like MLSecOps community advisories, ATLAS-linked case studies, and vendor coordinated-disclosure programs, but coverage is still incomplete in twenty-twenty-six. Practical implication: AI supply-chain security relies more on team-internal verification than on industry-wide databases. Build the verification habit.

[SLIDE 7 — Up next]

Next lesson: dependency risk in AI stacks. Four minutes. See you there.

Slide outline

  1. Title — "Model card lies and provenance gaps".
  2. Claims vs proofs — model-card sections with question marks beside each.
  3. Three categories — three cards: training-data · eval · provenance, each with example claim+reality.
  4. Three verification practices — three cards: independent eval · source check · provenance probe.
  5. The principle — pull-quote: "Model cards are publisher-authored marketing artifacts."
  6. Why harder than CVE — comparison: software (NVD, CVE) vs model (sparse, fragmented).
  7. Up next — "L4.5.1 — Dependency risk in AI stacks, ~4 min."

Production notes

  • Recording: ~3.5 min. Cap 5.
  • Shortest lesson in the supply-chain block — keep pace up.