01 The Lifecycle at a Glance
Machine learning is not a product; it is a supply chain. Data is collected, cleaned, and used to train a model; the model is evaluated, deployed, and finally queried by real users. Every hand-off gives an attacker an opening. The six stages below each pair a canonical attack with its countermeasure, and each stage gets its own deep dive in the sections that follow.
02 Where This All Runs — Cloud · Edge · Physical
The same pipeline lives at three very different physical scales. A model typically trains in a datacenter, is distilled down for a Jetson-class edge device, then drives a physical actuator on a robot or vehicle. Each tier shifts the threat model.
01 Data Collection · Data Poisoning vs Data Validation
The first stage is simply collecting training examples — scraping the web, crowd-sourcing labels, pulling from partner datasets. An attacker who controls even 1-3% of the training set can permanently corrupt the model. This is the Tay chatbot moment: bad data in, bad model out, forever.
ATTACK · Data Poisoning
How it works: the attacker contributes training examples with deliberately wrong labels or hidden triggers (e.g. a tiny watermark that causes the model to misclassify). At web scale, they plant poisoned pages where a common-crawl-style scraper will pull them into the dataset. Tiny fractions matter: under 3% poisoned data can reduce accuracy by 20-40% or install a durable backdoor.
DEFENSE · Data Validation & Secure Sourcing
How it works: every sample passes through an ingestion pipeline before it is allowed near the training loop: cryptographic signatures on source batches, schema & range validation, statistical outlier detection (KS test, Isolation Forest), and a strict source allow-list. Samples from unverified sources never enter the model.
Validation code sketch
# pipeline entry point: every sample hits this gate before training
# (TRUSTED_SOURCES, verify_signature, schema, REF_DIST, quarantine and
#  training_set are stand-ins for your own ingestion infrastructure)
from scipy.stats import ks_2samp

KS_THRESHOLD = 0.15  # max KS distance from the trusted reference distribution

def ingest(sample, source):
    # 1. source allow-list (signed SBOM of data providers)
    if source not in TRUSTED_SOURCES:
        raise ValueError(f"unknown source: {source}")
    # 2. cryptographic integrity of the batch
    if not verify_signature(sample.batch_hash, source.pubkey):
        raise ValueError("batch signature verification failed")
    # 3. schema & range validation (Great Expectations / Pydantic)
    schema.validate(sample)
    # 4. statistical outlier screen vs a trusted reference distribution
    statistic, _ = ks_2samp(sample.features, REF_DIST)
    if statistic > KS_THRESHOLD:
        quarantine(sample)
        return "QUARANTINED"
    training_set.append(sample)
    return "ACCEPTED"
02 Data Preprocessing · Data Tampering vs Integrity Checks
Raw data almost never goes straight into training. It is cleaned, normalized, augmented, and stored as feature vectors, often in a feature store separate from the raw data lake. Between steps, an attacker with pipeline access can silently modify features: shift a normalization constant, flip a label, drop a minority class. These changes are hard to spot because the pipeline still runs to completion and the tampered values flow straight into training.
ATTACK · Mid-Pipeline Tampering
How it works: an insider or a compromised CI/CD credential rewrites the normalization constants or a feature transform. The training loop still runs successfully, with no error and no alert, but the model now sees a subtly wrong view of reality: a shift of roughly 15% in a feature's distribution can pass through entirely unnoticed.
DEFENSE · Hash Checkpoints & Anomaly Detection
How it works: every pipeline stage writes its output with a cryptographic hash signed by the stage's service identity (SPIFFE / SLSA provenance). A separate verifier reads the manifest and recomputes hashes. Any tampering downstream breaks a hash match and stops the pipeline before training begins. Modern tools: DVC, LakeFS, SLSA, in-toto.
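A minimal verifier sketch under simple assumptions: the manifest is a JSON file mapping artifact paths to expected SHA-256 digests, and the signature check on the manifest itself (SPIFFE identity, in-toto layout) is elided for brevity.
Manifest verification sketch
import hashlib
import json

def artifact_digest(path: str) -> str:
    # stream the stage output through SHA-256 in 1 MiB chunks
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_manifest(manifest_path: str) -> None:
    # recompute every artifact hash and compare against the signed manifest
    with open(manifest_path) as f:
        manifest = json.load(f)
    for path, expected in manifest["artifacts"].items():
        if artifact_digest(path) != expected:
            raise RuntimeError(f"hash mismatch: {path} changed mid-pipeline")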
03 Model Training · Backdoor Attacks vs Adversarial Training
A backdoor is a hidden rule baked into the model's weights: the network performs normally on clean data (so evaluation passes) but produces an attacker-chosen output whenever a specific trigger pattern appears in the input. The classic example: a 3×3 pink patch in the corner of a stop-sign image causes the self-driving car's classifier to predict "speed limit 80". BadNets and Neural Cleanse are the canonical references here.
ATTACK · Trigger Backdoor (BadNets-style)
How it works: during training the attacker injects poisoned samples that pair the trigger (a small sticker, a specific pixel pattern) with the attacker's target label. The model learns to treat the trigger as a shortcut — but clean accuracy stays high, so the backdoor passes evaluation. Real attacks use <100 poisoned samples out of 50k.
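A minimal sketch of the poisoning step. The trigger position, size, and target class are the attacker's choice; image tensors are assumed to be in [0, 1] with shape [N, C, H, W].
Trigger-stamping sketch
import torch

def stamp_trigger(x: torch.Tensor, y: torch.Tensor, target: int, size: int = 3):
    # BadNets-style: paint a small bright patch into one corner...
    x = x.clone()
    x[..., -size:, -size:] = 1.0
    # ...and relabel every stamped sample with the attacker's target class
    return x, torch.full_like(y, target)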
DEFENSE · Adversarial Training & Robust Optimization
How it works: instead of minimizing loss on clean data, minimize loss on the worst perturbation within an ε-ball around each training sample. Projected Gradient Descent (PGD) finds the worst-case δ, then the outer optimizer updates θ to resist it. Pair this with Neural Cleanse, which scans for small triggers that universally cause misclassification, and activation clustering, which separates clean from poisoned activations.
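Written out, this is the min-max objective of Madry et al.:

\min_\theta \; \mathbb{E}_{(x,y) \sim D} \Big[ \max_{\|\delta\|_\infty \le \epsilon} \mathcal{L}\big(f_\theta(x + \delta),\, y\big) \Big]

The inner maximization is what PGD approximates; the outer minimization is the ordinary training step.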
PGD adversarial training — the core loop
import torch

for x, y in dataloader:
    delta = torch.zeros_like(x, requires_grad=True)
    # inner loop: find the worst perturbation inside the eps-ball (attack step)
    for _ in range(PGD_STEPS):
        loss = criterion(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        # ascend the loss, then project back into the eps-ball
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach()
        delta.requires_grad_(True)
    # outer loop: update the model to resist this worst-case delta
    optimizer.zero_grad()
    loss = criterion(model(x + delta.detach()), y)
    loss.backward()
    optimizer.step()
04 Model Evaluation · Test Set Leakage vs Strict Data Separation
Evaluation tells you whether the model is ready to ship. If the evaluation is wrong, everything downstream is wrong. Test set leakage — where information from the test set bleeds back into training — produces models that look amazing on the benchmark and fail spectacularly in production. The 2023 Kaggle "Contrails" contest had multiple leaders disqualified for this.
ATTACK · Test Set Contamination
How it works: leakage creeps in many ways — duplicate rows across splits, temporal bleed (future data in train), feature-store updates that retroactively include the label, or simply copying the same CSV into two directories. The model memorizes the overlap and the benchmark reports an inflated number. In production, accuracy crashes.
DEFENSE · Strict Separation + Audit
How it works: (1) hash-dedup before splitting — no row with the same MinHash / LSH fingerprint can appear twice; (2) temporal split — train before date T, test after; (3) keep a sealed held-out set that only the final eval run sees, once, near release; (4) an auditor service (often a separate team) recomputes metrics on their own data slice.
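A sketch of points (1) and (2), using exact SHA-256 fingerprints in place of MinHash/LSH for brevity; the row normalization and the date column name are illustrative assumptions.
Leak-free split sketch
import hashlib
import pandas as pd

def fingerprint(row) -> str:
    # normalize then hash row content (MinHash/LSH would also catch near-dupes)
    text = "|".join(str(v).strip().lower() for v in row)
    return hashlib.sha256(text.encode()).hexdigest()

def split_without_leakage(df: pd.DataFrame, cutoff: str, date_col: str = "date"):
    # (1) hash-dedup before splitting: identical rows can never cross the split
    df = df.loc[~df.apply(fingerprint, axis=1).duplicated()]
    # (2) temporal split: train strictly before the cutoff, test at or after it
    return df[df[date_col] < cutoff], df[df[date_col] >= cutoff]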
05 Deployment · Model Stealing vs API Protection
Once the model is live behind an API, it is queryable. An attacker who can query it enough times can reconstruct a functional clone — the model's decision boundaries, its training distribution, even specific training examples. Tramèr et al. (2016) extracted copies of commercial ML APIs with as few as 650 queries. This is not hypothetical: it is a standard part of the MITRE ATLAS threat catalog.
ATTACK · Query-Based Model Extraction
How it works: the attacker queries with synthetic inputs that straddle decision boundaries, records the confidence scores (which leak much more than just labels), then fits a surrogate model to reproduce the mapping. With softmax probabilities, ~1000 queries clone a 3-class MLP with >95% agreement.
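A sketch of the extraction loop under simple assumptions: victim_api returns the full softmax vector, the attacker samples random probes, and make_surrogate, N_ROUNDS, BATCH, and N_FEATURES are hypothetical stand-ins.
Surrogate distillation sketch
import torch
import torch.nn.functional as F

surrogate = make_surrogate()              # small model with matching output size
opt = torch.optim.Adam(surrogate.parameters(), lr=1e-3)

for _ in range(N_ROUNDS):
    x = torch.rand(BATCH, N_FEATURES)     # synthetic probes
    with torch.no_grad():
        soft_labels = victim_api(x)       # leaked confidence scores, not labels
    # distill: pull the surrogate's output distribution toward the victim's
    loss = F.kl_div(F.log_softmax(surrogate(x), dim=1), soft_labels,
                    reduction="batchmean")
    opt.zero_grad(); loss.backward(); opt.step()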
DEFENSE · Rate Limits · Auth · Output Perturbation
How it works: layer five protections at the gateway: (1) strong auth (API keys, mTLS) with per-user budgets; (2) rate limits with exponential backoff on anomalies; (3) return label-only not full softmax (removes the gradient signal); (4) add small calibrated output noise (PATE / differential-privacy-style); (5) fingerprint the model with watermarks so a stolen clone is provably derivative.
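A sketch of protections (3) and (4) at the gateway; the noise scale is an illustrative setting, not a calibrated differential-privacy guarantee.
Output hardening sketch
import numpy as np

rng = np.random.default_rng()

def protect_output(probs: np.ndarray, label_only: bool = True,
                   noise_scale: float = 0.02):
    # (4) small calibrated noise blurs the boundary signal an extractor needs
    noisy = np.clip(probs + rng.normal(0.0, noise_scale, probs.shape), 0.0, None)
    noisy /= noisy.sum()
    # (3) label-only responses remove the gradient signal entirely
    return int(np.argmax(noisy)) if label_only else noisy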
06 Inference · Adversarial Attacks vs Input Sanitization & Monitoring
The last stage: a legitimate user sends an input, the model responds. At this point the attacker is no longer an insider — they're just a user. But a carefully crafted input can fool the network even though it looks normal to a human. Goodfellow's 2014 panda-becomes-gibbon paper is the canonical example: add an imperceptible noise pattern, the model flips its answer with 99% confidence.
ATTACK · FGSM Adversarial Perturbation
How it works: FGSM computes the gradient of the loss w.r.t. the input, takes its sign, multiplies by a tiny ε, and adds it to the image. Every pixel moves <2/255 — imperceptible — but the loss climbs steeply in that direction. More advanced variants (PGD, C&W) iterate this process. Physical-world versions work with road-sign stickers and adversarial glasses.
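The whole attack fits in a few lines of PyTorch; model and criterion are assumed, with inputs in [0, 1].
FGSM sketch
import torch

def fgsm(model, criterion, x, y, eps=2 / 255):
    # gradient of the loss w.r.t. the input, not the weights
    x = x.clone().detach().requires_grad_(True)
    criterion(model(x), y).backward()
    # one signed step of size eps in the steepest-ascent direction
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()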
DEFENSE · Sanitization + Drift Monitoring
How it works: before inference, run the input through transformations that destroy the attacker's carefully-optimized δ without hurting legitimate features — JPEG recompression, randomized resize/padding, bit-depth reduction. In parallel, a detector (e.g. feature-squeezing, Mahalanobis distance in the penultimate layer) flags high-anomaly inputs. A drift monitor watches aggregate input statistics for distributional attacks over time.
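A sketch of the transformation stage; the JPEG quality and bit depth are illustrative settings that trade robustness against clean accuracy.
Input sanitization sketch
import io
import numpy as np
from PIL import Image

def sanitize(img: Image.Image, quality: int = 75, bits: int = 5) -> Image.Image:
    # JPEG recompression wipes out high-frequency adversarial noise
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    arr = np.asarray(Image.open(buf))
    # bit-depth reduction (feature squeezing) collapses fine pixel-level deltas
    step = 256 // (2 ** bits)
    return Image.fromarray((arr // step * step).astype(np.uint8))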
07 Attack / Defense Matrix
Compact reference table mapping each stage to its attack, impact, and defense. Rows roughly follow the MITRE ATLAS and NIST AI 100-2 taxonomies.
| Stage | Attack | What the attacker wants | Countermeasure | Tooling |
|---|---|---|---|---|
| 1 · Data Collection | Data Poisoning | Corrupt model behavior permanently by injecting bad samples | Data Validation · Secure Sourcing | Great Expectations · SBOM · DVC · signed ingests |
| 2 · Preprocessing | Data Tampering | Silently alter features or transforms mid-pipeline | Integrity Checks · Anomaly Detection | in-toto · SLSA · LakeFS · hash manifests |
| 3 · Model Training | Backdoor (trigger) Attacks | Install a hidden rule that fires on a secret input pattern | Adversarial Training · Robust Optimization | PGD · TRADES · Neural Cleanse · activation clustering |
| 4 · Model Evaluation | Test Set Leakage | Game the benchmark so the model ships prematurely | Strict Data Separation · Auditing | MinHash dedup · temporal splits · sealed held-out |
| 5 · Deployment | Model Stealing | Clone the model from its public query interface | API Protection · Rate Limiting · Encryption | mTLS · gateways · PATE · model watermarks |
| 6 · Inference | Adversarial Examples | Make a single input misclassify (evasion) | Input Sanitization · Monitoring | JPEG defense · feature-squeeze · drift monitors |
08 Sources & Standards
- MITRE ATLAS — Adversarial Threat Landscape for AI Systems (tactic/technique catalog)
- NIST AI 100-2 e2023 — "Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations"
- OWASP ML Top 10 — community list of the most common ML security failures
- Goodfellow, Shlens, Szegedy (2014) — "Explaining and Harnessing Adversarial Examples" (the FGSM panda paper)
- Gu, Dolan-Gavitt, Garg (2017) — "BadNets: Identifying Vulnerabilities in the ML Model Supply Chain"
- Tramèr et al. (2016) — "Stealing Machine Learning Models via Prediction APIs"
- Madry et al. (2018) — "Towards Deep Learning Models Resistant to Adversarial Attacks" (PGD)
- Wang et al. (2019) — "Neural Cleanse: Identifying and Mitigating Backdoor Attacks"
- SLSA Framework (slsa.dev) — supply-chain integrity for software & ML artifacts