01 The Lifecycle at a Glance
Machine learning is not a product; it is a supply chain. Data is collected, cleaned, and used to train a model; the model is evaluated, deployed, and finally queried by real users. Every hand-off gives an attacker an opening. The six stages below each pair a canonical attack with its countermeasure, and each stage gets its own deep dive in the sections that follow.
02 Where This All Runs — Cloud · Edge · Physical
The same pipeline lives at three very different physical scales. A model typically trains in a datacenter, is distilled down for a Jetson-class edge device, then drives a physical actuator on a robot or vehicle. Each tier shifts the threat model.
01 Data Collection · Data Poisoning vs Data Validation
The first stage is simply collecting training examples — scraping the web, crowd-sourcing labels, pulling from partner datasets. An attacker who controls even 1-3% of the training set can permanently corrupt the model. This is the Tay chatbot moment: bad data in, bad model out, forever.
ATTACK · Data Poisoning
How it works: the attacker contributes training examples with deliberately wrong labels or hidden triggers (e.g. a tiny watermark that causes the model to misclassify). At web scale, they plant poisoned pages where a common-crawl-style scraper will pull them into the dataset. Tiny fractions matter: under 3% poisoned data can reduce accuracy by 20-40% or install a durable backdoor.
DEFENSE · Data Validation & Secure Sourcing
How it works: every sample passes through an ingestion pipeline before it is allowed near the training loop: cryptographic signatures on source batches, schema & range validation, statistical outlier detection (KS test, Isolation Forest), and a strict source allow-list. Samples from unverified sources never enter the model.
Validation code sketch
# pipeline entry point: every sample hits this gate before training
# (TRUSTED_SOURCES, verify_signature, schema, REF_DIST, quarantine and
#  training_set are stand-ins for your own ingestion infrastructure)
from scipy.stats import ks_2samp

KS_THRESHOLD = 0.15  # max KS distance from the trusted reference distribution

def ingest(sample, source):
    # 1. source allow-list (signed SBOM of data providers)
    if source not in TRUSTED_SOURCES:
        raise ValueError(f"unknown source: {source}")
    # 2. cryptographic integrity of the batch
    if not verify_signature(sample.batch_hash, source.pubkey):
        raise ValueError("batch signature verification failed")
    # 3. schema & range validation (Great Expectations / Pydantic)
    schema.validate(sample)
    # 4. statistical outlier screen vs a trusted reference distribution
    statistic, _ = ks_2samp(sample.features, REF_DIST)
    if statistic > KS_THRESHOLD:
        quarantine(sample)
        return "QUARANTINED"
    training_set.append(sample)
    return "ACCEPTED"
02 Data Preprocessing · Data Tampering vs Integrity Checks
Raw data almost never goes straight into training. It is cleaned, normalized, augmented, and stored as feature vectors, often in a feature store separate from the raw data lake. Between steps, an attacker with pipeline access can silently modify features: shift a normalization constant, flip a label, drop a minority class. These changes are hard to spot because the pipeline still runs to completion and the tampered values flow straight into training.
ATTACK · Mid-Pipeline Tampering
How it works: an insider or a compromised CI/CD credential rewrites the normalization constants or a feature transform. The training loop still runs successfully, with no error and no alert, but the model now sees a subtly wrong view of reality: a shift of roughly 15% in a feature's distribution can pass through entirely unnoticed.
DEFENSE · Hash Checkpoints & Anomaly Detection
How it works: every pipeline stage writes its output with a cryptographic hash signed by the stage's service identity (SPIFFE / SLSA provenance). A separate verifier reads the manifest and recomputes hashes. Any tampering downstream breaks a hash match and stops the pipeline before training begins. Modern tools: DVC, LakeFS, SLSA, in-toto.
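A minimal verifier sketch under simple assumptions: the manifest is a JSON file mapping artifact paths to expected SHA-256 digests, and the signature check on the manifest itself (SPIFFE identity, in-toto layout) is elided for brevity.
Manifest verification sketch
import hashlib
import json

def artifact_digest(path: str) -> str:
    # stream the stage output through SHA-256 in 1 MiB chunks
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_manifest(manifest_path: str) -> None:
    # recompute every artifact hash and compare against the signed manifest
    with open(manifest_path) as f:
        manifest = json.load(f)
    for path, expected in manifest["artifacts"].items():
        if artifact_digest(path) != expected:
            raise RuntimeError(f"hash mismatch: {path} changed mid-pipeline")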
03 Model Training · Backdoor Attacks vs Adversarial Training
A backdoor is a hidden rule baked into the model's weights: the network performs normally on clean data (so evaluation passes) but produces an attacker-chosen output whenever a specific trigger pattern appears in the input. The classic example: a 3×3 pink patch in the corner of a stop-sign image causes the self-driving car's classifier to predict "speed limit 80". BadNets and Neural Cleanse are the canonical references here.
ATTACK · Trigger Backdoor (BadNets-style)
How it works: during training the attacker injects poisoned samples that pair the trigger (a small sticker, a specific pixel pattern) with the attacker's target label. The model learns to treat the trigger as a shortcut — but clean accuracy stays high, so the backdoor passes evaluation. Real attacks use <100 poisoned samples out of 50k.
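A minimal sketch of the poisoning step. The trigger position, size, and target class are the attacker's choice; image tensors are assumed to be in [0, 1] with shape [N, C, H, W].
Trigger-stamping sketch
import torch

def stamp_trigger(x: torch.Tensor, y: torch.Tensor, target: int, size: int = 3):
    # BadNets-style: paint a small bright patch into one corner...
    x = x.clone()
    x[..., -size:, -size:] = 1.0
    # ...and relabel every stamped sample with the attacker's target class
    return x, torch.full_like(y, target)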
DEFENSE · Adversarial Training & Robust Optimization
How it works: instead of minimizing loss on clean data, minimize loss on the worst perturbation within an ε-ball around each training sample. Projected Gradient Descent (PGD) finds the worst-case δ, then the outer optimizer updates θ to resist it. Pair this with Neural Cleanse, which scans for small triggers that universally cause misclassification, and activation clustering, which separates clean from poisoned activations.
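Written out, this is the min-max objective of Madry et al.:

\min_\theta \; \mathbb{E}_{(x,y) \sim D} \Big[ \max_{\|\delta\|_\infty \le \epsilon} \mathcal{L}\big(f_\theta(x + \delta),\, y\big) \Big]

The inner maximization is what PGD approximates; the outer minimization is the ordinary training step.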
PGD adversarial training — the core loop
import torch

for x, y in dataloader:
    delta = torch.zeros_like(x, requires_grad=True)
    # inner loop: find the worst perturbation inside the eps-ball (attack step)
    for _ in range(PGD_STEPS):
        loss = criterion(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        # ascend the loss, then project back into the eps-ball
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach()
        delta.requires_grad_(True)
    # outer loop: update the model to resist this worst-case delta
    optimizer.zero_grad()
    loss = criterion(model(x + delta.detach()), y)
    loss.backward()
    optimizer.step()
04 Model Evaluation · Test Set Leakage vs Strict Data Separation
Evaluation tells you whether the model is ready to ship. If the evaluation is wrong, everything downstream is wrong. Test set leakage — where information from the test set bleeds back into training — produces models that look amazing on the benchmark and fail spectacularly in production. The 2023 Kaggle "Contrails" contest had multiple leaders disqualified for this.
ATTACK · Test Set Contamination
How it works: leakage creeps in many ways — duplicate rows across splits, temporal bleed (future data in train), feature-store updates that retroactively include the label, or simply copying the same CSV into two directories. The model memorizes the overlap and the benchmark reports an inflated number. In production, accuracy crashes.
DEFENSE · Strict Separation + Audit
How it works: (1) hash-dedup before splitting — no row with the same MinHash / LSH fingerprint can appear twice; (2) temporal split — train before date T, test after; (3) keep a sealed held-out set that only the final eval run sees, once, near release; (4) an auditor service (often a separate team) recomputes metrics on their own data slice.
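A sketch of points (1) and (2), using exact SHA-256 fingerprints in place of MinHash/LSH for brevity; the row normalization and the date column name are illustrative assumptions.
Leak-free split sketch
import hashlib
import pandas as pd

def fingerprint(row) -> str:
    # normalize then hash row content (MinHash/LSH would also catch near-dupes)
    text = "|".join(str(v).strip().lower() for v in row)
    return hashlib.sha256(text.encode()).hexdigest()

def split_without_leakage(df: pd.DataFrame, cutoff: str, date_col: str = "date"):
    # (1) hash-dedup before splitting: identical rows can never cross the split
    df = df.loc[~df.apply(fingerprint, axis=1).duplicated()]
    # (2) temporal split: train strictly before the cutoff, test at or after it
    return df[df[date_col] < cutoff], df[df[date_col] >= cutoff]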
05 Deployment · Model Stealing vs API Protection
Once the model is live behind an API, it is queryable. An attacker who can query it enough times can reconstruct a functional clone — the model's decision boundaries, its training distribution, even specific training examples. Tramèr et al. (2016) extracted copies of commercial ML APIs with as few as 650 queries. This is not hypothetical: it is a standard part of the MITRE ATLAS threat catalog.
ATTACK · Query-Based Model Extraction
How it works: the attacker queries with synthetic inputs that straddle decision boundaries, records the confidence scores (which leak much more than just labels), then fits a surrogate model to reproduce the mapping. With softmax probabilities, ~1000 queries clone a 3-class MLP with >95% agreement.
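A sketch of the extraction loop under simple assumptions: victim_api returns the full softmax vector, the attacker samples random probes, and make_surrogate, N_ROUNDS, BATCH, and N_FEATURES are hypothetical stand-ins.
Surrogate distillation sketch
import torch
import torch.nn.functional as F

surrogate = make_surrogate()              # small model with matching output size
opt = torch.optim.Adam(surrogate.parameters(), lr=1e-3)

for _ in range(N_ROUNDS):
    x = torch.rand(BATCH, N_FEATURES)     # synthetic probes
    with torch.no_grad():
        soft_labels = victim_api(x)       # leaked confidence scores, not labels
    # distill: pull the surrogate's output distribution toward the victim's
    loss = F.kl_div(F.log_softmax(surrogate(x), dim=1), soft_labels,
                    reduction="batchmean")
    opt.zero_grad(); loss.backward(); opt.step()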
DEFENSE · Rate Limits · Auth · Output Perturbation
How it works: layer five protections at the gateway: (1) strong auth (API keys, mTLS) with per-user budgets; (2) rate limits with exponential backoff on anomalies; (3) return label-only not full softmax (removes the gradient signal); (4) add small calibrated output noise (PATE / differential-privacy-style); (5) fingerprint the model with watermarks so a stolen clone is provably derivative.
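A sketch of protections (3) and (4) at the gateway; the noise scale is an illustrative setting, not a calibrated differential-privacy guarantee.
Output hardening sketch
import numpy as np

rng = np.random.default_rng()

def protect_output(probs: np.ndarray, label_only: bool = True,
                   noise_scale: float = 0.02):
    # (4) small calibrated noise blurs the boundary signal an extractor needs
    noisy = np.clip(probs + rng.normal(0.0, noise_scale, probs.shape), 0.0, None)
    noisy /= noisy.sum()
    # (3) label-only responses remove the gradient signal entirely
    return int(np.argmax(noisy)) if label_only else noisy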
06 Inference · Adversarial Attacks vs Input Sanitization & Monitoring
The last stage: a legitimate user sends an input, the model responds. At this point the attacker is no longer an insider — they're just a user. But a carefully crafted input can fool the network even though it looks normal to a human. Goodfellow's 2014 panda-becomes-gibbon paper is the canonical example: add an imperceptible noise pattern, the model flips its answer with 99% confidence.
ATTACK · FGSM Adversarial Perturbation
How it works: FGSM computes the gradient of the loss w.r.t. the input, takes its sign, multiplies by a tiny ε, and adds it to the image. Every pixel moves <2/255 — imperceptible — but the loss climbs steeply in that direction. More advanced variants (PGD, C&W) iterate this process. Physical-world versions work with road-sign stickers and adversarial glasses.
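The whole attack fits in a few lines of PyTorch; model and criterion are assumed, with inputs in [0, 1].
FGSM sketch
import torch

def fgsm(model, criterion, x, y, eps=2 / 255):
    # gradient of the loss w.r.t. the input, not the weights
    x = x.clone().detach().requires_grad_(True)
    criterion(model(x), y).backward()
    # one signed step of size eps in the steepest-ascent direction
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()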
DEFENSE · Sanitization + Drift Monitoring
How it works: before inference, run the input through transformations that destroy the attacker's carefully-optimized δ without hurting legitimate features — JPEG recompression, randomized resize/padding, bit-depth reduction. In parallel, a detector (e.g. feature-squeezing, Mahalanobis distance in the penultimate layer) flags high-anomaly inputs. A drift monitor watches aggregate input statistics for distributional attacks over time.
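A sketch of the transformation stage; the JPEG quality and bit depth are illustrative settings that trade robustness against clean accuracy.
Input sanitization sketch
import io
import numpy as np
from PIL import Image

def sanitize(img: Image.Image, quality: int = 75, bits: int = 5) -> Image.Image:
    # JPEG recompression wipes out high-frequency adversarial noise
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    arr = np.asarray(Image.open(buf))
    # bit-depth reduction (feature squeezing) collapses fine pixel-level deltas
    step = 256 // (2 ** bits)
    return Image.fromarray((arr // step * step).astype(np.uint8))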
07 Attack / Defense Matrix
Compact reference table mapping each stage to its attack, impact, and defense. Rows roughly follow the MITRE ATLAS and NIST AI 100-2 taxonomies.
| Stage | Attack | What the attacker wants | Countermeasure | Tooling |
|---|---|---|---|---|
| 1 · Data Collection | Data Poisoning | Corrupt model behavior permanently by injecting bad samples | Data Validation · Secure Sourcing | Great Expectations · SBOM · DVC · signed ingests |
| 2 · Preprocessing | Data Tampering | Silently alter features or transforms mid-pipeline | Integrity Checks · Anomaly Detection | in-toto · SLSA · LakeFS · hash manifests |
| 3 · Model Training | Backdoor (trigger) Attacks | Install a hidden rule that fires on a secret input pattern | Adversarial Training · Robust Optimization | PGD · TRADES · Neural Cleanse · activation clustering |
| 4 · Model Evaluation | Test Set Leakage | Game the benchmark so the model ships prematurely | Strict Data Separation · Auditing | MinHash dedup · temporal splits · sealed held-out |
| 5 · Deployment | Model Stealing | Clone the model from its public query interface | API Protection · Rate Limiting · Encryption | mTLS · gateways · PATE · model watermarks |
| 6 · Inference | Adversarial Examples | Make a single input misclassify (evasion) | Input Sanitization · Monitoring | JPEG defense · feature-squeeze · drift monitors |
08 Sources & Standards
- MITRE ATLAS — Adversarial Threat Landscape for AI Systems (tactic/technique catalog)
- NIST AI 100-2 e2023 — "Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations"
- OWASP ML Top 10 — community list of the most common ML security failures
- Goodfellow, Shlens, Szegedy (2014) — "Explaining and Harnessing Adversarial Examples" (the FGSM panda paper)
- Gu, Dolan-Gavitt, Garg (2017) — "BadNets: Identifying Vulnerabilities in the ML Model Supply Chain"
- Tramèr et al. (2016) — "Stealing Machine Learning Models via Prediction APIs"
- Madry et al. (2018) — "Towards Deep Learning Models Resistant to Adversarial Attacks" (PGD)
- Wang et al. (2019) — "Neural Cleanse: Identifying and Mitigating Backdoor Attacks"
- SLSA Framework (slsa.dev) — supply-chain integrity for software & ML artifacts