Edge AI Security Atlas

Threat Library deep dive for gateway overload, backend queue saturation, and safe rate-limit validation.

A2 / Software attack / High

Resource Exhaustion / DoS

A public inference path accepts more work than the Pi gateway, request parser, async queue, Jetson worker, or Zynq accelerator can safely absorb. The failure mode is not only downtime: overload changes latency, creates retry storms, burns power, heats devices, hides telemetry, and can distort Jetson versus FPGA benchmark results.

Where It Appears In The Edge AI Architecture

A2 lives across the ingress path, but the first defensible choke point is the Pi gateway.

[Figure: Public client side (client, valid requests) -> Gateway / DMZ (Pi ingress with Nginx + FastAPI; bounded queue with workers, timeout, retry) -> Private backends (Jetson TensorRT worker; Zynq-7020 PS + PL worker). Annotations: "queue full: accept path keeps growing" and "arrival rate > service rate".]
Queue saturation: Arrival rate exceeds backend service capacity, so waiting time and reject risk rise sharply.

Threat Model

The attacker is a client that can submit valid-looking inference work to the public gateway.

Attacker capability

The attacker can send syntactically valid requests, vary image size, choose expensive input shapes, repeat requests over time, and observe status code, latency, and service degradation. They do not need backend network access, SSH, switch admin privileges, or physical board access.

System assumption being tested

The gateway should bound work before expensive parsing and before accelerator dispatch. A2 tests whether the Pi can reject excess demand gracefully while preserving service for authorized, ordinary traffic.

Assets at risk

Gateway CPU and memory, request queues, Jetson GPU worker availability, Zynq PS service availability, PL accelerator scheduling windows, latency benchmarks, thermal budget, logs, and operator trust in experimental measurements.

Out of scope

No third-party testing, no live flood scripts, no packet generators, no bypass of cloud or ISP controls, and no instructions for attacking systems outside your owned isolated lab. The included code is a local discrete simulator.

Attack Intuition

An ML inference endpoint transforms small network requests into expensive device work.

A normal web API often performs lightweight database or cache work. An edge AI endpoint may decode images, resize tensors, copy buffers, dispatch to TensorRT, transfer data through a Zynq PS process, wait for PL completion, and serialize confidence output. The attacker does not need malformed traffic if normal requests are already expensive.

A2 is the point where graceful degradation matters. A well-designed gateway should say "not now" early with 429 or 503, reject oversized bodies with 413, and preserve bounded queues. A vulnerable gateway accepts work faster than the backends complete it, which converts traffic into latency, memory pressure, retries, and benchmark contamination.

Safe framing: this article models overload using synthetic arrival events. It does not open sockets, spawn concurrent clients, or contact a service. Use it to design controls before running any controlled validation on your own Pi gateway.

Technical Explanation

Resource exhaustion is usually a chain of small missing limits, not one dramatic bug.

Ingress pressure

  1. Large request bodies consume socket buffers, reverse-proxy memory, and parser time.
  2. Image decoding and resizing can be more expensive than the raw byte count suggests.
  3. Missing body caps allow oversized work to reach Python or model preprocessing.
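
The ingress points above suggest rejecting oversized work from the declared length alone, before any decode or resize runs. The following is a minimal sketch under stated assumptions, not the gateway's actual code: `check_declared_length`, `read_capped`, and `BODY_CAP_BYTES` are hypothetical names, and the 256_000 value simply mirrors the protected simulator's body cap. In a real deployment the reverse proxy would normally enforce the cap first (for Nginx, via `client_max_body_size`), with a check like this as a second line.

```python
from typing import Iterable, Optional

BODY_CAP_BYTES = 256_000  # assumption: mirrors the protected simulator's body_cap


def check_declared_length(content_length: Optional[str], cap: int = BODY_CAP_BYTES) -> int:
    """Return an HTTP status based only on the declared body length.

    200 here means "may proceed to the next check", not "request succeeded".
    """
    if content_length is None:
        return 411  # Length Required: refuse bodies of unknown size
    try:
        declared = int(content_length)
    except ValueError:
        return 400  # malformed Content-Length header
    if declared < 0:
        return 400
    if declared > cap:
        return 413  # too large: rejected before decode or resize
    return 200


def read_capped(chunks: Iterable[bytes], cap: int = BODY_CAP_BYTES) -> bytes:
    """Read a streamed body, stopping as soon as the running total exceeds the cap."""
    total = 0
    parts = []
    for chunk in chunks:
        total += len(chunk)
        if total > cap:
            raise ValueError("body exceeds cap")
        parts.append(chunk)
    return b"".join(parts)
```

The second helper matters because a declared length can understate the real body; counting bytes while reading keeps the cap honest.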

Queue pressure

  1. Async APIs can accept work faster than accelerator workers complete it.
  2. Unbounded queues hide failure until latency and memory become the failure signal.
  3. Retries can amplify load when clients interpret slow responses as lost requests.
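
The difference between a bounded and an unbounded queue can be sketched with the standard library's `queue.Queue`, whose `put_nowait` raises `queue.Full` once `maxsize` items are waiting. This is an illustrative sketch only: `try_enqueue` and `QUEUE_CAPACITY` are hypothetical names, and an async gateway would more likely use `asyncio.Queue`, which exposes a similar `put_nowait` and raises `asyncio.QueueFull`.

```python
import queue

QUEUE_CAPACITY = 10  # assumption: mirrors the protected simulator's max_queue


def try_enqueue(q: queue.Queue, item) -> int:
    """Enqueue without blocking; translate a full queue into an early 429."""
    try:
        q.put_nowait(item)
    except queue.Full:
        return 429  # reject now instead of letting latency become the signal
    return 202  # accepted for later processing


def demo(total: int = 13) -> list:
    """Offer `total` items to an empty bounded queue with no consumer running."""
    q = queue.Queue(maxsize=QUEUE_CAPACITY)
    return [try_enqueue(q, i) for i in range(total)]
```

With no consumer draining the queue, `demo()` admits the first ten items and rejects the remaining three, which is the explicit failure signal an unbounded queue never produces.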

Backend pressure

Jetson saturation may show up as GPU utilization, thermal throttling, and memory pressure. Zynq saturation may show up as PS worker queue growth, DMA waits, PL busy intervals, and longer end-to-end latency despite deterministic hardware kernels.

Defensive design should be layered: reject unauthenticated calls first, cap request bodies at the proxy, enforce per-user and global token buckets, bound the dispatch queue, set backend timeouts, use circuit breakers, and log rejection reasons without storing payloads.
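
The layering described above can be sketched as an ordered chain of cheap checks that each return a status and a reason. This is a hedged illustration, not the gateway implementation: `GatewayState` and `admit_layered` are hypothetical names, the default values mirror the simulator's protected scenario, and the circuit breaker is reduced to a boolean flag. Tokens and queue slots are committed only after every check passes, so a rejected request does not spend budget.

```python
from dataclasses import dataclass


@dataclass
class GatewayState:
    body_cap: int = 256_000      # assumption: protected body cap
    tokens: float = 14.0         # assumption: current token bucket level
    queue_depth: int = 0
    queue_capacity: int = 10
    breaker_open: bool = False   # True when the backend is considered unhealthy


def admit_layered(state: GatewayState, authenticated: bool, size_bytes: int, cost: float):
    """Return (status, reason); cheapest and earliest rejections come first."""
    if not authenticated:
        return 401, "unauthenticated"
    if size_bytes > state.body_cap:
        return 413, "body_cap"
    if state.tokens < cost:
        return 429, "token_bucket"
    if state.queue_depth >= state.queue_capacity:
        return 429, "queue_full"
    if state.breaker_open:
        return 503, "circuit_open"
    # Commit only after all checks pass.
    state.tokens -= cost
    state.queue_depth += 1
    return 202, "admitted"
```

Returning the reason alongside the status gives the logging layer what it needs to separate quota pressure from queue pressure without touching the payload.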

Mathematical Formulation Of Queue Saturation And Rate Limiting

The security control objective is to keep admitted work below stable service capacity.

Let arrival rate be lambda requests/second.
Let c be the number of backend workers.
Let mu be the per-worker service rate.

Utilization:
  rho = lambda / (c * mu)

Stable queue condition:
  rho < 1

Saturation condition:
  rho >= 1, or queue length Q(t) reaches capacity K.

Request cost model:
  w_i = alpha * bytes_i + beta * decode_ms_i + gamma * backend_ms_i + delta * retry_i

Admission rule:
  accept_i = AuthValid(i) * BodyWithinCap(i) * TokenAvailable(user_i, w_i) * QueueHasRoom(K)

Token bucket:
  tokens(t) = min(B, tokens(t - dt) + r * dt)
  accept if tokens(t) >= w_i, then tokens(t) = tokens(t) - w_i

For an M/M/1/K approximation:
  P_block = ((1 - rho) * rho^K) / (1 - rho^(K + 1)), rho != 1

For your lab, the queueing model does not need to be exact. The important research result is empirical: as admitted arrival rate approaches service capacity, tail latency and queue depth rise nonlinearly. D2, the rate limit and body cap control, should shift the failure from backend saturation to early, measured rejection.
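
The utilization and blocking formulas above can be evaluated directly to see how sharply rejection probability rises near rho = 1. This is a small sketch under stated assumptions: the function names are hypothetical, the numbers in `sweep` are placeholders rather than measurements, K is taken as the total system capacity of the single-server model, and the rho = 1 case uses the standard limit P_block = 1 / (K + 1), which the formula above excludes.

```python
def utilization(lam: float, c: int, mu: float) -> float:
    """rho = lambda / (c * mu), with lam and mu in requests/second."""
    if c <= 0 or mu <= 0:
        raise ValueError("c and mu must be positive")
    return lam / (c * mu)


def mm1k_block_probability(rho: float, k: int) -> float:
    """Blocking probability for an M/M/1/K queue with system capacity k."""
    if rho < 0 or k < 0:
        raise ValueError("rho and k must be non-negative")
    if abs(rho - 1.0) < 1e-12:
        return 1.0 / (k + 1)  # limit of the closed form as rho -> 1
    return ((1.0 - rho) * rho**k) / (1.0 - rho ** (k + 1))


def sweep(mu: float = 16.0, k: int = 10):
    """Placeholder sweep: one worker, arrival rates below, at, and above capacity."""
    rows = []
    for lam in (8.0, 12.0, 16.0, 20.0, 24.0):
        rho = utilization(lam, c=1, mu=mu)
        rows.append((lam, rho, mm1k_block_probability(rho, k)))
    return rows


if __name__ == "__main__":
    for lam, rho, p_block in sweep():
        print(f"lambda={lam:5.1f} rho={rho:4.2f} P_block={p_block:.4f}")
```

Because the model is single-server, it is only a rough guide for the two-worker Jetson path; measured curves from the lab should take precedence over these values.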

Step-By-Step Safe Lab Demonstration

The demonstration compares vulnerable admission with bounded admission using synthetic request arrivals.

  1. Save the Python code from the next section as a2_queue_saturation_lab.py.
  2. Run python3 a2_queue_saturation_lab.py --scenario vulnerable to simulate a gateway that accepts excess work.
  3. Run python3 a2_queue_saturation_lab.py --scenario protected to simulate body caps, a token bucket, and a bounded queue.
  4. Compare accepted requests, 413 rejects, 429 rejects, peak queue length, backend utilization, and p95 latency.
  5. Translate the protected parameters into your Pi gateway design: Nginx body cap, FastAPI queue limit, per-token request budget, backend worker concurrency, and timeout policy.

Full Code For A Local Simulated Lab

This is a deterministic queue simulator. It never opens a port, calls a host, or generates live traffic.

a2_queue_saturation_lab.py
#!/usr/bin/env python3
"""
A2 local-only simulator: resource exhaustion and defensive admission control.

This script is intentionally a discrete simulator. It does not open sockets,
spawn request clients, or contact any endpoint. Use it to reason about your
own Pi gateway queue, Jetson worker, and Zynq worker limits.
"""

import argparse
import heapq
import random
from dataclasses import dataclass
from statistics import mean


@dataclass
class Request:
    req_id: int
    arrival_ms: int
    size_bytes: int
    backend: str
    service_ms: int
    cost: float


@dataclass
class Event:
    time_ms: int
    req: Request

    def __lt__(self, other):
        return self.time_ms < other.time_ms


class TokenBucket:
    def __init__(self, capacity, refill_per_second):
        self.capacity = float(capacity)
        self.tokens = float(capacity)
        self.refill_per_ms = float(refill_per_second) / 1000.0
        self.last_ms = 0

    def allow(self, now_ms, cost):
        elapsed = max(0, now_ms - self.last_ms)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_ms)
        self.last_ms = now_ms
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class GatewaySimulator:
    def __init__(self, scenario):
        self.scenario = scenario
        self.max_queue = 10 if scenario == "protected" else 10_000
        self.body_cap = 256_000 if scenario == "protected" else 10_000_000
        self.bucket = TokenBucket(capacity=14, refill_per_second=8)
        self.queue = []
        self.events = []
        self.inflight = {"jetson": 0, "zynq": 0}
        self.backend_limit = {"jetson": 2, "zynq": 1}
        self.stats = {
            "accepted": 0,
            "rejected_413": 0,
            "rejected_429": 0,
            "completed": 0,
            "peak_queue": 0,
            "latencies": [],
            "backend_busy_ms": {"jetson": 0, "zynq": 0},
        }
        self.logs = []

    def log(self, message):
        if len(self.logs) < 18:
            self.logs.append(message)

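    def admit(self, req):
        # Check 1: body cap. Oversized bodies are rejected before any queueing.
        if req.size_bytes > self.body_cap:
            self.stats["rejected_413"] += 1
            self.log(f"t={req.arrival_ms:05d} status=413 req={req.req_id} bytes={req.size_bytes}")
            return

        # Check 2 (protected only): charge the request cost to the token bucket.
        # Tokens are spent here even if the queue check below rejects the request.
        if self.scenario == "protected" and not self.bucket.allow(req.arrival_ms, req.cost):
            self.stats["rejected_429"] += 1
            self.log(f"t={req.arrival_ms:05d} status=429 req={req.req_id} reason=token_bucket")
            return

        # Check 3: bounded admission over in-flight plus waiting requests. The
        # vulnerable limit of 10_000 is effectively unbounded for this 90-request run.
        if self.inflight_total() + len(self.queue) >= self.max_queue:
            self.stats["rejected_429"] += 1
            self.log(f"t={req.arrival_ms:05d} status=429 req={req.req_id} reason=queue_full")
            return

        # Admitted: enqueue, record peak depth, then try to dispatch immediately.
        self.stats["accepted"] += 1
        self.queue.append(req)
        self.stats["peak_queue"] = max(self.stats["peak_queue"], len(self.queue))
        self.dispatch(req.arrival_ms)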

    def inflight_total(self):
        return self.inflight["jetson"] + self.inflight["zynq"]

    def dispatch(self, now_ms):
        still_waiting = []
        for req in self.queue:
            if self.inflight[req.backend] < self.backend_limit[req.backend]:
                self.inflight[req.backend] += 1
                finish_ms = now_ms + req.service_ms
                self.stats["backend_busy_ms"][req.backend] += req.service_ms
                heapq.heappush(self.events, Event(finish_ms, req))
                self.log(f"t={now_ms:05d} dispatch req={req.req_id} backend={req.backend} finish={finish_ms}")
            else:
                still_waiting.append(req)
        self.queue = still_waiting
        self.stats["peak_queue"] = max(self.stats["peak_queue"], len(self.queue))

    def complete_until(self, now_ms):
        while self.events and self.events[0].time_ms <= now_ms:
            event = heapq.heappop(self.events)
            req = event.req
            self.inflight[req.backend] -= 1
            self.stats["completed"] += 1
            latency = event.time_ms - req.arrival_ms
            self.stats["latencies"].append(latency)
            self.log(f"t={event.time_ms:05d} status=200 req={req.req_id} backend={req.backend} latency_ms={latency}")
            self.dispatch(event.time_ms)

    def drain(self):
        while self.events:
            self.complete_until(self.events[0].time_ms)

    def summary(self):
        latencies = sorted(self.stats["latencies"])
        p95 = latencies[int(0.95 * (len(latencies) - 1))] if latencies else 0
        avg = round(mean(latencies), 2) if latencies else 0
        return {
            "scenario": self.scenario,
            "accepted": self.stats["accepted"],
            "completed": self.stats["completed"],
            "rejected_413": self.stats["rejected_413"],
            "rejected_429": self.stats["rejected_429"],
            "peak_queue": self.stats["peak_queue"],
            "avg_latency_ms": avg,
            "p95_latency_ms": p95,
            "backend_busy_ms": self.stats["backend_busy_ms"],
        }


def make_requests(seed, total, interval_ms):
    random.seed(seed)
    requests = []
    for req_id in range(total):
        arrival_ms = req_id * interval_ms
        backend = "jetson" if req_id % 3 != 0 else "zynq"
        base_service = 48 if backend == "jetson" else 62
        jitter = random.randint(0, 24)
        large_body = req_id % 17 == 0
        size_bytes = 480_000 if large_body else random.randint(32_000, 180_000)
        decode_ms = int(size_bytes / 16_000)
        service_ms = base_service + jitter + decode_ms
        cost = 1.0 + size_bytes / 200_000 + service_ms / 100.0
        requests.append(Request(req_id, arrival_ms, size_bytes, backend, service_ms, cost))
    return requests


def run(scenario):
    sim = GatewaySimulator(scenario)
    # The arrival pattern is intentionally synthetic and local. It represents
    # a busy lab interval, not instructions for generating network traffic.
    for req in make_requests(seed=7, total=90, interval_ms=18):
        sim.complete_until(req.arrival_ms)
        sim.admit(req)
    sim.drain()
    return sim


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--scenario", choices=["vulnerable", "protected"], default="vulnerable")
    args = parser.parse_args()
    sim = run(args.scenario)
    print("sample_logs:")
    for line in sim.logs:
        print("  " + line)
    print("summary:")
    for key, value in sim.summary().items():
        print(f"  {key}: {value}")


if __name__ == "__main__":
    main()

local simulator commands
# Vulnerable admission model: accepts excess work and lets queueing absorb the failure.
python3 a2_queue_saturation_lab.py --scenario vulnerable

# Protected admission model: body cap, token bucket, and bounded queue reject early.
python3 a2_queue_saturation_lab.py --scenario protected

# Suggested comparison fields:
# accepted, rejected_413, rejected_429, peak_queue, avg_latency_ms, p95_latency_ms, backend_busy_ms

Practical Example For Your Pi Gateway, Jetson, And Zynq Setup

A2 contaminates both security and performance research if overload is not controlled.

Expected topology

The client VLAN reaches only the Raspberry Pi gateway. The Pi handles TLS, auth, body limits, validation, logging, and dispatch. Jetson Orin Nano and Zynq-7020 sit on the private subnet behind the switch and should receive only admitted work from the Pi.

Client VLAN -> Pi gateway -> Jetson + Zynq private subnet

A2 failure mode

The Pi accepts too much work, queues grow, and backend workers stay busy. Jetson timing becomes dominated by queue delay and thermal effects. Zynq timing becomes dominated by PS queueing and PL busy windows. The result can look like an accelerator comparison, but it is actually an admission-control experiment.

429 missing / 413 missing / D2 absent

For your lab, treat D2 as part of the measurement harness. Before comparing Jetson and FPGA throughput, define the maximum admitted rate, body size, queue capacity, timeout, retry policy, and backend concurrency. This keeps "GPU versus FPGA" results from being quietly shaped by gateway overload.
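
One way to treat these parameters as part of the measurement harness is to record them in a single immutable object and check them before any benchmark run. This is a minimal sketch with hypothetical names (`HarnessLimits`, `problems`) and placeholder values; `per_worker_rps` is assumed to come from your own measurement of one backend worker, not from this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HarnessLimits:
    max_admitted_rps: float    # maximum admitted rate at the gateway
    body_cap_bytes: int
    queue_capacity: int
    timeout_ms: int
    max_retries: int
    backend_concurrency: int   # c in the utilization formula
    per_worker_rps: float      # mu, measured per-worker service rate

    def capacity_rps(self) -> float:
        """Aggregate backend capacity c * mu."""
        return self.backend_concurrency * self.per_worker_rps

    def problems(self) -> list:
        """Return human-readable issues; an empty list means the limits are coherent."""
        issues = []
        if self.max_admitted_rps >= self.capacity_rps():
            issues.append("admitted rate is not below backend capacity (rho >= 1)")
        if self.body_cap_bytes <= 0:
            issues.append("body cap must be positive")
        if self.queue_capacity <= 0:
            issues.append("queue capacity must be positive")
        if self.timeout_ms <= 0:
            issues.append("timeout must be positive")
        if self.max_retries < 0:
            issues.append("retry count cannot be negative")
        return issues
```

Storing the limits next to each result file makes it possible to tell later whether two Jetson and Zynq runs were taken under the same admission policy.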

Observable Signals Or Logs

Good A2 telemetry distinguishes early rejection from backend saturation.

Signal | Vulnerable observation | Hardened observation
HTTP status | Many slow 200s, occasional 500s or timeouts under pressure | 413 for oversized bodies, 429 for quota or queue pressure, 503 for circuit breaker
Gateway queue | Unbounded or opaque growth; memory pressure becomes the first clear signal | Bounded queue depth with explicit reject counter
Jetson logs | High GPU utilization, rising inference latency, thermal throttling risk | Stable backend rate; rejected excess requests never reach TensorRT
Zynq logs | PS worker backlog, DMA wait time, PL busy intervals near 100 percent | Stable accelerator window; clear Pi-side admission decisions
Research metrics | P95 and p99 latency dominated by queue delay | Tail latency bounded for admitted work; excess demand visible as 429/413 counts

Impact Analysis

A2 primarily targets availability, but it can disturb integrity and confidentiality research too.

Availability

The obvious impact is degraded or unavailable inference. Queue saturation can block legitimate users, starve accelerator workers, and keep the gateway in a slow-failure mode instead of a clean reject mode.

Integrity

Timeouts, retries, partial preprocessing failures, and fallback routing can change which backend answers a request. If downstream decisions assume stable latency and backend choice, overload can alter behavior.

Confidentiality

Overload can amplify timing side channels. When queues are visible through response delay, a caller may infer backend load, dispatch policy, or whether another experiment is running.

Mapping To CIA, STRIDE, PASTA, And MITRE ATLAS

Use these labels to keep overload experiments tied to security objectives.

Framework | A2 mapping | Research interpretation
CIA | Availability primary; Integrity and Confidentiality secondary | Protect service continuity while preventing overload from distorting backend routing or timing signals.
STRIDE | Denial of Service; Repudiation if logs are insufficient | Reject excess demand early and preserve enough sanitized telemetry to attribute pressure to user, token, route, and reason.
PASTA | Stage 3 decomposition, Stage 4 threat analysis, Stage 5 vulnerability analysis, Stage 6 attack modeling | Model each queue, worker, body parser, and backend as a resource with capacity and failure semantics.
MITRE ATLAS | Relevant to AI denial-of-service and cost-harvesting style concerns, including AML.T0034 Cost Harvesting and public-facing application exposure such as AML.T0049 Exploit Public Facing Application. | A2 is often an impact technique and an enabling condition for timing analysis or extraction by forcing backend stress states.

Defense Mapping To Existing D1-D11 Controls

D2 is the direct mitigation, but A2 should be validated across the gateway stack.

Control | Role against A2 | Validation
D2 Rate Limit + Body Cap | Primary control. Caps request bodies, bounds queue size, limits per-user and global request cost, and rejects early with 413, 429, or 503. | Replay synthetic workload in an owned lab and confirm excess demand does not reach Jetson or Zynq.
D1 JWT Auth + Object Checks | Allows per-user quotas and prevents anonymous callers from consuming accelerator budget. | Missing or invalid token receives 401 before body parsing and queue admission.
D4 Private Backend Subnet | Prevents clients from bypassing the Pi and saturating Jetson or Zynq service ports directly. | Client VLAN cannot connect to backend ports; only Pi private interface can reach them.
D5 Query Anomaly Detection | Detects extraction-like or stress-like request distributions that stay under simple rate limits. | Alert on unusual body sizes, entropy, request intervals, backend targeting, and repeated timeout edges.
D6 Sanitized Logging | Preserves reject and queue evidence without storing request bodies or tokens. | Logs include route, status, reason, user id, body hash, size bucket, queue depth, and backend when dispatched.
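
The D6 validation fields can be sketched as a single JSON log record that carries a digest and a size bucket instead of the payload. This is an illustration under stated assumptions: `reject_record`, `size_bucket`, and the bucket edges are hypothetical, and SHA-256 is one reasonable choice of digest rather than something the control set specifies.

```python
import hashlib
import json
from typing import Optional

# Assumption: illustrative bucket edges; pick edges that match your body cap.
SIZE_BUCKETS = [(64_000, "<=64KB"), (256_000, "<=256KB")]


def size_bucket(n_bytes: int) -> str:
    """Map an exact size to a coarse label so logs do not expose precise lengths."""
    for limit, label in SIZE_BUCKETS:
        if n_bytes <= limit:
            return label
    return ">256KB"


def reject_record(route: str, status: int, reason: str, user_id: str,
                  body: bytes, queue_depth: int, backend: Optional[str] = None) -> str:
    """Build one sanitized log line: no request body and no token is stored."""
    record = {
        "route": route,
        "status": status,
        "reason": reason,
        "user_id": user_id,
        "body_sha256": hashlib.sha256(body).hexdigest(),
        "size_bucket": size_bucket(len(body)),
        "queue_depth": queue_depth,
        "backend": backend,  # stays None when the request was never dispatched
    }
    return json.dumps(record, sort_keys=True)
```

A plain digest can still be matched against known inputs, so a keyed hash such as HMAC may be preferable where payload confidentiality matters.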

Research Notes: What To Measure Experimentally

The measurable goal is controlled degradation, not heroic throughput.

Admission-control curve

Sweep admitted arrival rate below and above capacity. Plot accepted rate, rejected rate, average queue depth, p95 latency, and p99 latency. The protected system should convert overload into early rejects rather than unbounded delay.
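
The sweep described above can be prototyped offline before touching hardware. The sketch below is deliberately simpler than the article's simulator and is an assumption-laden toy: one worker, a fixed service time, evenly spaced synthetic arrivals, and a capacity that counts requests waiting plus in service. `simulate` and its default values are hypothetical.

```python
def simulate(interval_ms: int, service_ms: int = 60, queue_capacity: int = 10,
             total: int = 200) -> dict:
    """Single FIFO worker with bounded admission; returns counts and p95 latency."""
    finish_times = []  # finish times of admitted requests still in the system
    latencies = []
    rejected = 0
    for i in range(total):
        now = i * interval_ms
        # Forget requests that completed at or before this arrival.
        finish_times = [t for t in finish_times if t > now]
        if len(finish_times) >= queue_capacity:
            rejected += 1  # early reject instead of unbounded waiting
            continue
        # Service starts when the previous admitted request finishes, or now if idle.
        start = finish_times[-1] if finish_times else now
        finish = start + service_ms
        finish_times.append(finish)
        latencies.append(finish - now)
    latencies.sort()
    p95 = latencies[int(0.95 * (len(latencies) - 1))] if latencies else 0
    return {
        "offered_rps": round(1000 / interval_ms, 2),
        "accepted": len(latencies),
        "rejected": rejected,
        "p95_latency_ms": p95,
    }


if __name__ == "__main__":
    # Sweep from below capacity (1000 / 60, about 16.7 rps) to well above it.
    for interval in (100, 70, 60, 45, 30):
        print(simulate(interval))
```

In this toy model the bounded case caps latency at roughly `queue_capacity * service_ms`, while raising `queue_capacity` to a very large number shows delay growing with every additional arrival.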

Backend fairness

Measure whether Jetson and Zynq receive fair work under pressure. A poorly designed dispatcher may starve one backend or route all hard requests to the same accelerator.
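
Fairness needs a number before it can be plotted. One common option, which is a suggestion here and not something the article's control set prescribes, is Jain's fairness index applied to per-backend utilization normalized by each backend's concurrency. The function names are hypothetical, and the inputs are expected to come from your own measured busy time.

```python
def jain_index(values) -> float:
    """Jain's fairness index: (sum x)^2 / (n * sum x^2); 1.0 means perfectly even."""
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    squares = sum(v * v for v in values)
    if squares == 0:
        return 1.0  # all backends idle: trivially even
    total = sum(values)
    return (total * total) / (len(values) * squares)


def normalized_utilization(busy_ms: float, window_ms: float, concurrency: int) -> float:
    """Fraction of available worker time that was busy during the window."""
    if window_ms <= 0 or concurrency <= 0:
        raise ValueError("window_ms and concurrency must be positive")
    return busy_ms / (window_ms * concurrency)


def backend_fairness(busy_ms_by_backend: dict, concurrency_by_backend: dict,
                     window_ms: float) -> float:
    """Fairness of load across backends, accounting for different worker counts."""
    shares = [
        normalized_utilization(busy_ms_by_backend[name], window_ms, concurrency_by_backend[name])
        for name in busy_ms_by_backend
    ]
    return jain_index(shares)
```

Normalizing by concurrency matters because a backend with two workers is expected to absorb twice the busy time of a single-worker backend at equal utilization.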

Thermal and power coupling

Track Jetson temperature, Zynq board rails, CPU load on the Pi, and PL busy intervals. A2 can turn a software queue problem into hardware measurement noise.

Retry amplification

Measure how client timeout and retry policies affect effective arrival rate. A safe gateway should include retry-after headers and idempotency guidance for clients.
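
Retry amplification can be estimated on paper before it is measured. The sketch below assumes, as a simplification, that every attempt times out independently with the same probability and that clients retry up to a fixed limit, so the offered load is the base rate times a finite geometric sum. It also shows one possible way to derive a `Retry-After` value from a token bucket deficit. All function names are hypothetical.

```python
import math


def effective_arrival_rate(base_rps: float, timeout_prob: float, max_retries: int) -> float:
    """Offered load including retries: base * (1 + p + p^2 + ... + p^max_retries)."""
    if not 0.0 <= timeout_prob <= 1.0:
        raise ValueError("timeout_prob must be in [0, 1]")
    if max_retries < 0:
        raise ValueError("max_retries cannot be negative")
    attempts_per_request = sum(timeout_prob**k for k in range(max_retries + 1))
    return base_rps * attempts_per_request


def retry_after_seconds(tokens: float, cost: float, refill_per_second: float) -> int:
    """Whole seconds until the bucket could cover `cost`, for a Retry-After header."""
    if refill_per_second <= 0:
        raise ValueError("refill_per_second must be positive")
    deficit = max(0.0, cost - tokens)
    return math.ceil(deficit / refill_per_second)
```

For example, a base rate of 10 requests/second with a 0.5 timeout probability and two retries yields an offered load of 17.5 requests/second, which can push a gateway that was stable at the base rate over capacity.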

Key Takeaways

A2 is best handled by explicit budgets at every transition from network to accelerator.

  • A2 targets the finite capacity of the Pi parser, gateway queue, Jetson worker, and Zynq PS/PL path.
  • D2 should reject excess demand before expensive parsing and before backend dispatch.
  • Body caps, token buckets, bounded queues, backend timeouts, circuit breakers, and sanitized logs work as a layered control set.
  • For research, overload must be measured separately from accelerator performance or it will contaminate Jetson versus FPGA comparisons.