ML Security Lifecycle — Attack & Defense Deep Dive

An interactive tour through the six stages of the machine-learning pipeline — Data Collection, Preprocessing, Training, Evaluation, Deployment, and Inference — visualizing the canonical attack against each stage and the countermeasure that neutralizes it. Each stage has two side-by-side animations: how the attack works, and how the defense stops it.

● 6 attack vectors · ✓ 6 countermeasure families · Cloud · Edge · Physical · MLSecOps reference

01 The Lifecycle at a Glance

Machine learning is not a product; it is a supply chain. Data is collected, cleaned, used to train a model, that model is evaluated, deployed, and finally queried by real users. At each hand-off an attacker has an opening. The wheel below shows all six stages with their canonical attack (red) and countermeasure (green) — click any stage to jump to its deep dive.

Interactive wheel · ML PIPELINE · 6 stages, each with one attack (●) and one defense (✓) · Cloud · Edge · Physical · data → model → decision
1 Data Collection · ● Data Poisoning · ✓ Data Validation
2 Data Preprocessing · ● Data Tampering · ✓ Integrity Checks
3 Model Training · ● Backdoor Attacks · ✓ Adversarial Training
4 Model Evaluation · ● Test Set Leakage · ✓ Data Separation
5 Deployment · ● Model Stealing · ✓ API Protection
6 Inference · ● Adversarial Attacks · ✓ Input Sanitization

02 Where This All Runs — Cloud · Edge · Physical

The same pipeline lives at three very different physical scales. A model typically trains in a datacenter, is distilled down for a Jetson-class edge device, then drives a physical actuator on a robot or vehicle. Each tier shifts the threat model.

CLOUD · training, heavy inference, logs · GPU clusters, object storage, feature stores, ML platforms · top threats: poisoning, model theft, supply chain
  ↓ model · ↑ telemetry
EDGE · gateway, on-prem inference · Jetson, Coral, mobile SoC · TensorRT / TFLite runtimes · top threats: firmware tamper, model extraction
  ↓ command · ↑ sensor
PHYSICAL · robots, vehicles, cameras · actuators, CSI cameras, LiDAR · safety-critical timing · top threats: sensor spoofing, physical adversarial examples

01 Data Collection  ·  Data Poisoning  vs  Data Validation

The first stage is simply collecting training examples — scraping the web, crowd-sourcing labels, pulling from partner datasets. An attacker who controls even 1-3% of the training set can permanently corrupt the model. This is the Tay chatbot moment: bad data in, bad model out, forever.

ATTACK · Data Poisoning

Animation · clean dataset + the attacker's poisoned samples flow into the training corpus.
Press inject poison — watch ~3% of mislabeled samples slip into the corpus.

How it works: the attacker contributes training examples with deliberately wrong labels or hidden triggers (e.g. a tiny watermark that causes the model to misclassify). At web scale, they plant poisoned pages that later get scraped into Common Crawl-style datasets. Tiny fractions matter — <3% poisoned data can reduce accuracy 20-40% or install a durable backdoor.
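
The simplest concrete version of this is label flipping. The sketch below is illustrative only: the dict-based sample format and the POISON_FRACTION / TARGET_LABEL constants are assumptions, not taken from any specific incident.

Label-flip poisoning sketch

import random

POISON_FRACTION = 0.03            # ~3% of contributed samples is enough in practice
TARGET_LABEL = "speed_limit_80"   # attacker-chosen output class

def poison(contributed):
    """Flip the label on a small fraction of the samples the attacker contributes."""
    n_poison = int(len(contributed) * POISON_FRACTION)
    for sample in random.sample(contributed, n_poison):
        sample["label"] = TARGET_LABEL   # deliberately wrong label, clean-looking features
    return contributed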

DEFENSE · Data Validation & Secure Sourcing

Animation · incoming batch → validator (schema check · hash signature · outlier score · source allow-list · KS-test / z-score) → accepted samples go to training, suspicious samples go to quarantine for review.
Press run validator — poisoned samples are routed to quarantine.

How it works: every sample passes through an ingestion pipeline before it is allowed near the training loop: cryptographic signatures on source batches, schema & range validation, statistical outlier detection (KS test, Isolation Forest), and a strict source allow-list. Samples from unverified sources never enter the model.

Validation code sketch

# pipeline entry point — every sample hits this before training
def ingest(sample, source):
    # 1. source allow-list (signed SBOM of data providers)
    assert source in TRUSTED_SOURCES, "unknown source"

    # 2. cryptographic integrity of the batch
    assert verify_signature(sample.batch_hash, source.pubkey)

    # 3. schema & range validation (Great Expectations / Pydantic)
    schema.validate(sample)

    # 4. statistical outlier screen vs a trusted reference distribution
    if ks_statistic(sample.features, REF_DIST) > 0.15:
        quarantine(sample); return "QUARANTINED"

    training_set.append(sample); return "ACCEPTED"

02 Data Preprocessing  ·  Data Tampering  vs  Integrity Checks

Raw data almost never goes straight into training. It is cleaned, normalized, augmented, and stored as feature vectors — often in a feature store separate from the raw data lake. Between steps, an attacker with pipeline access can silently modify features: shift a normalization constant, flip a label, drop a minority class. These changes are hard to spot because the pipeline still completes without errors and the tampered values flow straight into training.

ATTACK · Mid-Pipeline Tampering

Animation · raw lake → clean (ETL) → normalize (µ, σ) → featurize → feature store, with the resulting feature distribution plotted.
Press tamper pipeline — see the attacker silently shift the normalization.

How it works: an insider or a compromised CI/CD credential rewrites the normalization constants or a feature transform. The training loop still runs successfully — no error, no alert — but the model now sees a subtly wrong view of reality. The bar chart shows a ~15% shift in the feature distribution after tampering.
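
As a toy illustration of why this is so quiet, the sketch below assumes a plain z-score step; the constants are made up, but the shifted output raises no error at all.

Tampered normalization, in two lines

import numpy as np

MU, SIGMA   = 100.0, 15.0    # constants computed when the pipeline was built
TAMPERED_MU = 85.0           # attacker silently rewrites the stored mean

features = np.random.normal(loc=MU, scale=SIGMA, size=10_000)

clean    = (features - MU) / SIGMA            # ~N(0, 1), what the model expects
tampered = (features - TAMPERED_MU) / SIGMA   # every value shifted by one full sigma

print(round(clean.mean(), 2), round(tampered.mean(), 2))   # ≈ 0.0 vs ≈ 1.0, no error raised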

DEFENSE · Hash Checkpoints & Anomaly Detection

Animation · hash(raw) → hash(clean) → hash(normalized) → hash(features); each stage output (H₀…H₃) is locked into a signed, append-only manifest, and any hash mismatch halts the pipeline and raises an alert.
Press verify hashes — each stage's output is checked against its signed hash.

How it works: every pipeline stage writes its output with a cryptographic hash signed by the stage's service identity (SPIFFE / SLSA provenance). A separate verifier reads the manifest and recomputes hashes. Any tampering downstream breaks a hash match and stops the pipeline before training begins. Modern tools: DVC, LakeFS, SLSA, in-toto.
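
A minimal verifier in the same spirit, using SHA-256 plus an HMAC as a stand-in for real SPIFFE / in-toto signing; the manifest layout and the STAGE_KEY secret are assumptions.

Hash-manifest verification sketch

import hashlib, hmac, json, pathlib

STAGE_KEY = b"stage-service-secret"   # placeholder for a real per-stage signing identity

def file_hash(path):
    return hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()

def record_stage(manifest, stage, output_path):
    """Each stage appends {stage, hash, mac} to an append-only manifest."""
    digest = file_hash(output_path)
    entry = {"stage": stage, "hash": digest,
             "mac": hmac.new(STAGE_KEY, digest.encode(), "sha256").hexdigest()}
    with open(manifest, "a") as f:
        f.write(json.dumps(entry) + "\n")

def verify_manifest(manifest, outputs):
    """Independent verifier recomputes every hash; any mismatch stops the pipeline."""
    for line in open(manifest):
        entry = json.loads(line)
        digest = file_hash(outputs[entry["stage"]])
        mac_ok = hmac.compare_digest(
            entry["mac"], hmac.new(STAGE_KEY, digest.encode(), "sha256").hexdigest())
        if digest != entry["hash"] or not mac_ok:
            raise RuntimeError(f"integrity failure at stage {entry['stage']}: halt and alert")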

03 Model Training  ·  Backdoor Attacks  vs  Adversarial Training

A backdoor is a hidden rule baked into the model's weights: the network performs normally on clean data (so evaluation passes) but produces an attacker-chosen output whenever a specific trigger pattern appears in the input. The classic example: a 3×3 pink patch in the corner of a stop-sign image causes the self-driving car's classifier to predict "speed limit 80". BadNets and Neural Cleanse are the canonical references here.

ATTACK · Trigger Backdoor (BadNets-style)

Animation · clean input → predict STOP (99% ✓); triggered input → predict SPEED 80 (97% ✗); the backdoored network's hidden neuron fires only when the patch is present.
Press activate trigger — watch the same network flip its output when the patch appears.

How it works: during training the attacker injects poisoned samples that pair the trigger (a small sticker, a specific pixel pattern) with the attacker's target label. The model learns to treat the trigger as a shortcut — but clean accuracy stays high, so the backdoor passes evaluation. Real attacks use <100 poisoned samples out of 50k.
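
A minimal PyTorch sketch of how those poisoned pairs are manufactured; the tensor layout, the patch colour, and the TARGET_CLASS / N_POISON constants are illustrative.

Trigger-poisoning sketch (BadNets-style)

import torch

TARGET_CLASS = 7      # attacker-chosen label, e.g. "speed limit 80"
N_POISON     = 100    # fewer than 100 poisoned samples out of 50k is typical

def stamp_trigger(img):
    """Place a small 3x3 patch in the bottom-right corner (img: 3 x H x W in [0, 1])."""
    img = img.clone()
    img[:, -3:, -3:] = torch.tensor([1.0, 0.4, 0.7]).view(3, 1, 1)   # pink patch
    return img

def poison_dataset(images, labels):
    """Pair the trigger with the target label on a handful of training samples."""
    idx = torch.randperm(len(images))[:N_POISON]
    for i in idx:
        images[i] = stamp_trigger(images[i])
        labels[i] = TARGET_CLASS
    return images, labels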

DEFENSE · Adversarial Training & Robust Optimization

Animation · loss landscape with adversarial examples added during training, optimizing min_θ E[ max_{‖δ‖≤ε} L(θ, x+δ, y) ] over parameter space θ.
Press train robust — the inner loop finds worst-case perturbations, outer loop minimizes against them.

How it works: instead of minimizing loss on clean data, minimize loss on the worst perturbation within an ε-ball around each training sample. Projected Gradient Descent (PGD) finds the worst-case δ, then the outer optimizer updates θ to resist it. This is typically combined with Neural Cleanse (which scans for small triggers that universally cause misclassification) and activation clustering (which separates clean from poisoned activations).

PGD adversarial training — the core loop

for x, y in dataloader:
    delta = torch.zeros_like(x, requires_grad=True)
    # inner loop: find the worst perturbation (attack step)
    for _ in range(PGD_STEPS):
        loss = criterion(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach()
        delta.requires_grad_(True)

    # outer loop: update model to be robust against this δ
    optimizer.zero_grad()
    loss = criterion(model(x + delta), y)
    loss.backward(); optimizer.step()

04 Model Evaluation  ·  Test Set Leakage  vs  Strict Data Separation

Evaluation tells you whether the model is ready to ship. If the evaluation is wrong, everything downstream is wrong. Test set leakage — where information from the test set bleeds back into training — produces models that look amazing on the benchmark and fail spectacularly in production. The 2023 Kaggle "Contrails" contest had multiple leaders disqualified for this.

ATTACK · Test Set Contamination

Animation · TRAIN (50,000 samples) and TEST (10,000 samples) overlap by 18% duplicate rows, inflating the reported accuracy.
Press contaminate — add duplicate rows to both splits and watch the "accuracy" spike.

How it works: leakage creeps in many ways — duplicate rows across splits, temporal bleed (future data in train), feature-store updates that retroactively include the label, or simply copying the same CSV into two directories. The model memorizes the overlap and the benchmark reports an inflated number. In production, accuracy crashes.

DEFENSE · Strict Separation + Audit

Animation · hash-deduplicate, then split: TRAIN 70% · VALIDATION 15% · HELD-OUT 15% (sealed vault); an auditor verifies the three sets are disjoint (✓ no overlap).
Press split & audit — watch the auditor certify that the three sets share zero rows.

How it works: (1) hash-dedup before splitting — no row with the same MinHash / LSH fingerprint can appear twice; (2) temporal split — train before date T, test after; (3) keep a sealed held-out set that only the final eval run sees, once, near release; (4) an auditor service (often a separate team) recomputes metrics on their own data slice.
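
A minimal sketch of steps (1)-(3), assuming rows are dicts with a timestamp field and using an exact SHA-256 fingerprint where production code would use MinHash / LSH.

Split & audit sketch

import hashlib

def fingerprint(row):
    """Exact-duplicate fingerprint; swap in MinHash / LSH to catch near-duplicates."""
    return hashlib.sha256(repr(sorted(row.items())).encode()).hexdigest()

def split(rows, cutoff):
    """Dedup first, then split on time so no future information reaches training."""
    seen, unique = set(), []
    for row in rows:
        fp = fingerprint(row)
        if fp not in seen:
            seen.add(fp)
            unique.append(row)
    train = [r for r in unique if r["timestamp"] < cutoff]
    later = [r for r in unique if r["timestamp"] >= cutoff]
    validation, held_out = later[: len(later) // 2], later[len(later) // 2 :]
    return train, validation, held_out      # held_out stays sealed until the final eval

def audit_disjoint(*splits):
    """Auditor check: the splits must share zero fingerprints."""
    sets = [{fingerprint(r) for r in s} for s in splits]
    assert all(a.isdisjoint(b) for i, a in enumerate(sets) for b in sets[i + 1:]), "overlap!"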

05 Deployment  ·  Model Stealing  vs  API Protection

Once the model is live behind an API, it is queryable. An attacker who can query it enough times can reconstruct a functional clone — the model's decision boundaries, its training distribution, even specific training examples. Tramèr et al. (2016) extracted copies of commercial ML APIs with as few as 650 queries. This is not hypothetical: it is a standard part of the MITRE ATLAS threat catalog.

ATTACK · Query-Based Model Extraction

Animation · the attacker probes the target API (/predict returns label + confidence) and fits a clone from the query/response pairs; ~650 queries is enough for a simple model.
Press flood queries — watch the clone's fidelity rise as queries accumulate.

How it works: the attacker queries with synthetic inputs that straddle decision boundaries, records the confidence scores (which leak much more than just labels), then fits a surrogate model to reproduce the mapping. With softmax probabilities, ~1000 queries clone a 3-class MLP with >95% agreement.
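
A toy end-to-end version of that loop. The "victim" below is a locally trained scikit-learn model standing in for the remote /predict endpoint, and the dimensions and query counts are illustrative.

Extraction loop sketch

import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# stand-in victim: in a real attack this lives behind the target's /predict endpoint
X, y = make_classification(n_samples=5000, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
victim = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500).fit(X, y)

def query_api(x_batch):
    return victim.predict_proba(x_batch)          # confidence scores leak the boundary

# 1. probe with synthetic inputs spread over the input space
rng = np.random.default_rng(1)
X_probe = rng.normal(size=(1000, 8))

# 2. the leaked probabilities become the surrogate's training labels
y_probe = query_api(X_probe).argmax(axis=1)

# 3. fit a clone that reproduces the query → response mapping
clone = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500).fit(X_probe, y_probe)

X_test = rng.normal(size=(1000, 8))
agreement = (clone.predict(X_test) == victim.predict(X_test)).mean()   # clone fidelity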

DEFENSE · Rate Limits · Auth · Output Perturbation

Animation · attacker traffic hits the API gateway (mTLS auth · rate limit · WAF · anomaly score · label-only responses · noise injection) and gets 429 Too Many Requests; the model stays hidden behind the gateway.
Press attempt attack — queries hit the gateway and the rate limiter trips.

How it works: layer five protections at the gateway: (1) strong auth (API keys, mTLS) with per-user budgets; (2) rate limits with exponential backoff on anomalies; (3) return label-only not full softmax (removes the gradient signal); (4) add small calibrated output noise (PATE / differential-privacy-style); (5) fingerprint the model with watermarks so a stolen clone is provably derivative.
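
A minimal application-level sketch of three of those layers (per-key rate budget, output noise, label-only responses); the budget, noise scale, and response format are assumptions, and a real deployment would enforce most of this in the gateway itself.

Gateway hardening sketch

import time
import numpy as np
from collections import defaultdict

QUERY_BUDGET = 100       # queries per key per minute (illustrative)
NOISE_SCALE  = 0.02      # small calibrated noise on the softmax
_recent = defaultdict(list)

def guarded_predict(api_key, x, model):
    # 1. per-key rate limit over a sliding one-minute window
    now = time.time()
    _recent[api_key] = [t for t in _recent[api_key] if now - t < 60]
    if len(_recent[api_key]) >= QUERY_BUDGET:
        return {"error": "429 Too Many Requests"}
    _recent[api_key].append(now)

    # 2. calibrated output noise degrades the extraction signal
    probs = model.predict_proba([x])[0]
    probs = np.clip(probs + np.random.normal(0, NOISE_SCALE, probs.shape), 0, None)
    probs = probs / probs.sum()

    # 3. return the label only, never the full softmax
    return {"label": int(probs.argmax())}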

06 Inference  ·  Adversarial Attacks  vs  Input Sanitization & Monitoring

The last stage: a legitimate user sends an input, the model responds. At this point the attacker is no longer an insider — they're just a user. But a carefully crafted input can fool the network even though it looks normal to a human. Goodfellow's 2014 panda-becomes-gibbon paper is the canonical example: add an imperceptible noise pattern, the model flips its answer with 99% confidence.

ATTACK · FGSM Adversarial Perturbation

Animation · input x predicted panda (58%); add ε·sign(∇L) with ε = 0.007; the result x + δ is visually identical, and the softmax output after δ is applied flips.
Press apply perturbation — the image barely changes, but the prediction flips.

How it works: FGSM computes the gradient of the loss w.r.t. the input, takes its sign, multiplies by a tiny ε, and adds it to the image. Every pixel moves <2/255 — imperceptible — but the loss climbs steeply in that direction. More advanced variants (PGD, C&W) iterate this process. Physical-world versions work with road-sign stickers and adversarial glasses.
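
A single-step sketch in the same PyTorch style as the PGD loop above; model, criterion, and the input tensors are assumed to be defined as in that block.

FGSM in one step

import torch

def fgsm(model, criterion, x, y, eps=2 / 255):
    """One gradient-sign step of size eps; PGD above is this step iterated with projection."""
    x = x.clone().detach().requires_grad_(True)
    loss = criterion(model(x), y)
    loss.backward()
    # move every pixel by at most eps in the direction that increases the loss
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()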

DEFENSE · Sanitization + Drift Monitoring

Animation · input x → sanitizer (JPEG compress · smooth / denoise · randomized resize · detector score, reject if suspicious) → model with robust training; a drift monitor tracks the rolling embedding distribution and alerts if inputs diverge from training.
Press sanitize input — transformations destroy the perturbation before the model sees it.

How it works: before inference, run the input through transformations that destroy the attacker's carefully-optimized δ without hurting legitimate features — JPEG recompression, randomized resize/padding, bit-depth reduction. In parallel, a detector (e.g. feature-squeezing, Mahalanobis distance in the penultimate layer) flags high-anomaly inputs. A drift monitor watches aggregate input statistics for distributional attacks over time.
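
A minimal sketch of the first three transformations using Pillow and NumPy; the JPEG quality, scale range, and bit depth are illustrative, and the anomaly detector and drift monitor are omitted.

Input sanitizer sketch

import io, random
import numpy as np
from PIL import Image

def sanitize(img: Image.Image) -> Image.Image:
    """Cheap transformations that tend to destroy a pixel-level adversarial delta."""
    # 1. JPEG recompression at moderate quality
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=75)
    buf.seek(0)
    img = Image.open(buf).convert("RGB")

    # 2. randomized downscale, then resize back to the original shape
    w, h = img.size
    scale = random.uniform(0.85, 0.95)
    img = img.resize((int(w * scale), int(h * scale))).resize((w, h))

    # 3. bit-depth reduction (feature squeezing)
    arr = (np.asarray(img) // 32) * 32
    return Image.fromarray(arr.astype(np.uint8))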

07 Attack / Defense Matrix

Compact reference table mapping each stage to its attack, impact, and defense. Rows roughly follow the MITRE ATLAS and NIST AI 100-2 taxonomies.

Stage | Attack | What the attacker wants | Countermeasure | Tooling
1 · Data Collection | Data Poisoning | Corrupt model behavior permanently by injecting bad samples | Data Validation · Secure Sourcing | Great Expectations · SBOM · DVC · signed ingests
2 · Preprocessing | Data Tampering | Silently alter features or transforms mid-pipeline | Integrity Checks · Anomaly Detection | in-toto · SLSA · LakeFS · hash manifests
3 · Model Training | Backdoor (trigger) Attacks | Install a hidden rule that fires on a secret input pattern | Adversarial Training · Robust Optimization | PGD · TRADES · Neural Cleanse · activation clustering
4 · Model Evaluation | Test Set Leakage | Game the benchmark so the model ships prematurely | Strict Data Separation · Auditing | MinHash dedup · temporal splits · sealed held-out
5 · Deployment | Model Stealing | Clone the model from its public query interface | API Protection · Rate Limiting · Encryption | mTLS · gateways · PATE · model watermarks
6 · Inference | Adversarial Examples | Make a single input misclassify (evasion) | Input Sanitization · Monitoring | JPEG defense · feature-squeeze · drift monitors
Important asymmetry: within a single stage, layered defenses stack — an attacker must bypass every control to succeed, which is why defense-in-depth works. Across stages, the opposite holds: each stage has its own attack path, so skipping any one stage's defense leaves the pipeline exploitable there.

08 Sources & Standards