Where It Appears In The Edge AI Architecture
A2 lives across the ingress path, but the first defensible choke point is the Pi gateway.
Threat Model
The attacker is a client that can submit valid-looking inference work to the public gateway.
Attacker capability
The attacker can send syntactically valid requests, vary image size, choose expensive input shapes, repeat requests over time, and observe status code, latency, and service degradation. They do not need backend network access, SSH, switch admin privileges, or physical board access.
System assumption being tested
The gateway should bound work before expensive parsing and before accelerator dispatch. A2 tests whether the Pi can reject excess demand gracefully while preserving service for authorized, ordinary traffic.
Assets at risk
Gateway CPU and memory, request queues, Jetson GPU worker availability, Zynq PS service availability, PL accelerator scheduling windows, latency benchmarks, thermal budget, logs, and operator trust in experimental measurements.
Out of scope
No third-party testing, no live flood scripts, no packet generators, no bypass of cloud or ISP controls, and no instructions for attacking systems outside your owned isolated lab. The included code is a local discrete simulator.
Attack Intuition
An ML inference endpoint transforms small network requests into expensive device work.
A normal web API often performs lightweight database or cache work. An edge AI endpoint may decode images, resize tensors, copy buffers, dispatch to TensorRT, transfer data through a Zynq PS process, wait for PL completion, and serialize confidence output. The attacker does not need malformed traffic if normal requests are already expensive.
A2 is the point where graceful degradation matters. A well-designed gateway should say "not now" early with 429 or 503, reject oversized bodies with 413, and preserve bounded queues. A vulnerable gateway accepts work faster than the backends complete it, which converts traffic into latency, memory pressure, retries, and benchmark contamination.
Safe framing: this article models overload using synthetic arrival events. It does not open sockets, spawn concurrent clients, or contact a service. Use it to design controls before running any controlled validation on your own Pi gateway.
Technical Explanation
Resource exhaustion is usually a chain of small missing limits, not one dramatic bug.
Ingress pressure
- Large request bodies consume socket buffers, reverse-proxy memory, and parser time.
- Image decoding and resizing can be more expensive than the raw byte count suggests.
- Missing body caps allow oversized work to reach Python or model preprocessing.
Queue pressure
- Async APIs can accept work faster than accelerator workers complete it.
- Unbounded queues hide failure until latency and memory become the failure signal.
- Retries can amplify load when clients interpret slow responses as lost requests.
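The bounded-queue point above can be sketched with Python's standard `queue.Queue`, which refuses new items once `maxsize` is reached. The `dispatch_queue` and `try_enqueue` names, the 202 status for queued work, and the limit of 10 (borrowed from the simulator's protected setting) are assumptions for illustration, not the gateway's actual code.

```python
import queue

# Hypothetical dispatch queue; maxsize=10 mirrors the simulator's protected setting.
dispatch_queue = queue.Queue(maxsize=10)


def try_enqueue(job):
    """Return an HTTP-style status: 202 if queued, 429 if the queue is full."""
    try:
        dispatch_queue.put_nowait(job)  # raises queue.Full instead of blocking
        return 202
    except queue.Full:
        return 429


# Offer 12 jobs with no consumer running, so only the first 10 fit.
statuses = [try_enqueue({"req_id": i}) for i in range(12)]
print(statuses.count(202), statuses.count(429))
```

Using `put_nowait` rather than a blocking `put` is the design choice that matters: the caller gets an immediate, countable rejection instead of a hidden wait.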
Backend pressure
Jetson saturation may show up as GPU utilization, thermal throttling, and memory pressure. Zynq saturation may show up as PS worker queue growth, DMA waits, PL busy intervals, and longer end-to-end latency despite deterministic hardware kernels.
Defensive design should be layered: reject unauthenticated calls first, cap request bodies at the proxy, enforce per-user and global token buckets, bound the dispatch queue, set backend timeouts, use circuit breakers, and log rejection reasons without storing payloads.
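A minimal sketch of that layering, assuming the cheapest checks run first. `GatewayState`, `admission_status`, the default limits, the 202 status for admitted work, and the choice to test the circuit breaker before spending tokens are all illustrative assumptions rather than a prescribed implementation.

```python
from dataclasses import dataclass


@dataclass
class GatewayState:
    # All limits here are illustrative assumptions, not measured values.
    body_cap_bytes: int = 256_000
    tokens_left: float = 5.0
    queue_depth: int = 0
    max_queue: int = 10
    breaker_open: bool = False


def admission_status(state, authenticated, content_length, cost=1.0):
    """Run the cheapest checks first and return the first failing status."""
    if not authenticated:
        return 401  # reject before reading the body
    if content_length > state.body_cap_bytes:
        return 413  # oversized body, never parsed
    if state.breaker_open:
        return 503  # backend unhealthy, fail fast without spending tokens
    if state.tokens_left < cost:
        return 429  # request budget exhausted
    if state.queue_depth >= state.max_queue:
        return 429  # bounded dispatch queue is full
    state.tokens_left -= cost
    state.queue_depth += 1
    return 202  # admitted for dispatch


state = GatewayState()
print(admission_status(state, authenticated=False, content_length=1_000))
print(admission_status(state, authenticated=True, content_length=300_000))
print(admission_status(state, authenticated=True, content_length=1_000))
```

Each branch maps to one logged rejection reason, which keeps the status codes meaningful when they are counted later.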
Mathematical Formulation Of Queue Saturation And Rate Limiting
The security control objective is to keep admitted work below stable service capacity.
For your lab, the queueing model does not need to be exact. The important research result is empirical: as the admitted arrival rate approaches service capacity, tail latency and queue depth rise nonlinearly. D2 (rate limit and body cap) should shift the failure from backend saturation to early, measured rejection.
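One conventional way to write this objective, offered as a sketch under textbook assumptions rather than as the lab's exact model. The symbols are assumptions introduced here: $\lambda_{\text{adm}}$ is the admitted arrival rate, $\mu$ the per-worker service rate, $c$ the number of workers, $B$ the token bucket capacity, $r$ its refill rate, $A(t, t+T)$ the request cost admitted in a window of length $T$, $L$ the mean number of requests in the system, and $W$ the mean time a request spends in it.

```latex
\begin{align}
  \rho &= \frac{\lambda_{\text{adm}}}{c\,\mu} < 1
    && \text{(stability: admitted load below capacity)} \\
  A(t, t+T) &\le B + rT
    && \text{(token bucket bound on admitted cost)} \\
  L &= \lambda_{\text{adm}}\, W
    && \text{(Little's law)} \\
  W &= \frac{1}{\mu - \lambda_{\text{adm}}}
    && \text{(M/M/1 only, } c = 1\text{)}
\end{align}
```

The last line holds only for Poisson arrivals and exponential service with a single worker, but it shows the nonlinearity directly: as $\lambda_{\text{adm}} \to \mu$, $W$ grows without bound.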
Step-By-Step Safe Lab Demonstration
The demonstration compares vulnerable admission with bounded admission using synthetic request arrivals.
- Save the Python code from the next section as `a2_queue_saturation_lab.py`.
- Run `python3 a2_queue_saturation_lab.py --scenario vulnerable` to simulate a gateway that accepts excess work.
- Run `python3 a2_queue_saturation_lab.py --scenario protected` to simulate body caps, a token bucket, and a bounded queue.
- Compare accepted requests, 413 rejects, 429 rejects, peak queue length, backend busy time, and p95 latency.
- Translate the protected parameters into your Pi gateway design: Nginx body cap, FastAPI queue limit, per-token request budget, backend worker concurrency, and timeout policy.
Full Code For A Local Simulated Lab
This is a deterministic queue simulator. It never opens a port, calls a host, or generates live traffic.
```python
#!/usr/bin/env python3
"""
A2 local-only simulator: resource exhaustion and defensive admission control.

This script is intentionally a discrete simulator. It does not open sockets,
spawn request clients, or contact any endpoint. Use it to reason about your
own Pi gateway queue, Jetson worker, and Zynq worker limits.
"""

import argparse
import heapq
import random
from dataclasses import dataclass
from statistics import mean


@dataclass
class Request:
    req_id: int
    arrival_ms: int
    size_bytes: int
    backend: str
    service_ms: int
    cost: float


@dataclass
class Event:
    time_ms: int
    req: Request

    def __lt__(self, other):
        return self.time_ms < other.time_ms


class TokenBucket:
    def __init__(self, capacity, refill_per_second):
        self.capacity = float(capacity)
        self.tokens = float(capacity)
        self.refill_per_ms = float(refill_per_second) / 1000.0
        self.last_ms = 0

    def allow(self, now_ms, cost):
        elapsed = max(0, now_ms - self.last_ms)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_ms)
        self.last_ms = now_ms
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class GatewaySimulator:
    def __init__(self, scenario):
        self.scenario = scenario
        self.max_queue = 10 if scenario == "protected" else 10_000
        self.body_cap = 256_000 if scenario == "protected" else 10_000_000
        self.bucket = TokenBucket(capacity=14, refill_per_second=8)
        self.queue = []
        self.events = []
        self.inflight = {"jetson": 0, "zynq": 0}
        self.backend_limit = {"jetson": 2, "zynq": 1}
        self.stats = {
            "accepted": 0,
            "rejected_413": 0,
            "rejected_429": 0,
            "completed": 0,
            "peak_queue": 0,
            "latencies": [],
            "backend_busy_ms": {"jetson": 0, "zynq": 0},
        }
        self.logs = []

    def log(self, message):
        if len(self.logs) < 18:
            self.logs.append(message)

    def admit(self, req):
        if req.size_bytes > self.body_cap:
            self.stats["rejected_413"] += 1
            self.log(f"t={req.arrival_ms:05d} status=413 req={req.req_id} bytes={req.size_bytes}")
            return
        if self.scenario == "protected" and not self.bucket.allow(req.arrival_ms, req.cost):
            self.stats["rejected_429"] += 1
            self.log(f"t={req.arrival_ms:05d} status=429 req={req.req_id} reason=token_bucket")
            return
        if self.inflight_total() + len(self.queue) >= self.max_queue:
            self.stats["rejected_429"] += 1
            self.log(f"t={req.arrival_ms:05d} status=429 req={req.req_id} reason=queue_full")
            return
        self.stats["accepted"] += 1
        self.queue.append(req)
        self.stats["peak_queue"] = max(self.stats["peak_queue"], len(self.queue))
        self.dispatch(req.arrival_ms)

    def inflight_total(self):
        return self.inflight["jetson"] + self.inflight["zynq"]

    def dispatch(self, now_ms):
        still_waiting = []
        for req in self.queue:
            if self.inflight[req.backend] < self.backend_limit[req.backend]:
                self.inflight[req.backend] += 1
                finish_ms = now_ms + req.service_ms
                self.stats["backend_busy_ms"][req.backend] += req.service_ms
                heapq.heappush(self.events, Event(finish_ms, req))
                self.log(f"t={now_ms:05d} dispatch req={req.req_id} backend={req.backend} finish={finish_ms}")
            else:
                still_waiting.append(req)
        self.queue = still_waiting
        self.stats["peak_queue"] = max(self.stats["peak_queue"], len(self.queue))

    def complete_until(self, now_ms):
        while self.events and self.events[0].time_ms <= now_ms:
            event = heapq.heappop(self.events)
            req = event.req
            self.inflight[req.backend] -= 1
            self.stats["completed"] += 1
            latency = event.time_ms - req.arrival_ms
            self.stats["latencies"].append(latency)
            self.log(f"t={event.time_ms:05d} status=200 req={req.req_id} backend={req.backend} latency_ms={latency}")
            self.dispatch(event.time_ms)

    def drain(self):
        while self.events:
            self.complete_until(self.events[0].time_ms)

    def summary(self):
        latencies = sorted(self.stats["latencies"])
        p95 = latencies[int(0.95 * (len(latencies) - 1))] if latencies else 0
        avg = round(mean(latencies), 2) if latencies else 0
        return {
            "scenario": self.scenario,
            "accepted": self.stats["accepted"],
            "completed": self.stats["completed"],
            "rejected_413": self.stats["rejected_413"],
            "rejected_429": self.stats["rejected_429"],
            "peak_queue": self.stats["peak_queue"],
            "avg_latency_ms": avg,
            "p95_latency_ms": p95,
            "backend_busy_ms": self.stats["backend_busy_ms"],
        }


def make_requests(seed, total, interval_ms):
    random.seed(seed)
    requests = []
    for req_id in range(total):
        arrival_ms = req_id * interval_ms
        backend = "jetson" if req_id % 3 != 0 else "zynq"
        base_service = 48 if backend == "jetson" else 62
        jitter = random.randint(0, 24)
        large_body = req_id % 17 == 0
        size_bytes = 480_000 if large_body else random.randint(32_000, 180_000)
        decode_ms = int(size_bytes / 16_000)
        service_ms = base_service + jitter + decode_ms
        cost = 1.0 + size_bytes / 200_000 + service_ms / 100.0
        requests.append(Request(req_id, arrival_ms, size_bytes, backend, service_ms, cost))
    return requests


def run(scenario):
    sim = GatewaySimulator(scenario)
    # The arrival pattern is intentionally synthetic and local. It represents
    # a busy lab interval, not instructions for generating network traffic.
    for req in make_requests(seed=7, total=90, interval_ms=18):
        sim.complete_until(req.arrival_ms)
        sim.admit(req)
    sim.drain()
    return sim


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--scenario", choices=["vulnerable", "protected"], default="vulnerable")
    args = parser.parse_args()
    sim = run(args.scenario)
    print("sample_logs:")
    for line in sim.logs:
        print("  " + line)
    print("summary:")
    for key, value in sim.summary().items():
        print(f"  {key}: {value}")


if __name__ == "__main__":
    main()
```
```shell
# Vulnerable admission model: accepts excess work and lets queueing absorb the failure.
python3 a2_queue_saturation_lab.py --scenario vulnerable

# Protected admission model: body cap, token bucket, and bounded queue reject early.
python3 a2_queue_saturation_lab.py --scenario protected

# Suggested comparison fields:
# accepted, rejected_413, rejected_429, peak_queue, avg_latency_ms, p95_latency_ms, backend_busy_ms
```
Practical Example For Your Pi Gateway, Jetson, And Zynq Setup
A2 contaminates both security and performance research if overload is not controlled.
Expected topology
The client VLAN reaches only the Raspberry Pi gateway. The Pi handles TLS, auth, body limits, validation, logging, and dispatch. Jetson Orin Nano and Zynq-7020 sit on the private subnet behind the switch and should receive only admitted work from the Pi.
A2 failure mode
The Pi accepts too much work, queues grow, and backend workers stay busy. Jetson timing becomes dominated by queue delay and thermal effects. Zynq timing becomes dominated by PS queueing and PL busy windows. The result can look like an accelerator comparison, but it is actually an admission-control experiment.
For your lab, treat D2 as part of the measurement harness. Before comparing Jetson and FPGA throughput, define the maximum admitted rate, body size, queue capacity, timeout, retry policy, and backend concurrency. This keeps "GPU versus FPGA" results from being quietly shaped by gateway overload.
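Before fixing a maximum admitted rate, a rough utilization check per backend can show whether a planned workload is stable at all. The sketch below reuses the simulator's arrival interval (18 ms), routing split, and worker counts. The mean service times of 66 ms and 80 ms are rough assumptions derived from the simulator's base time, average jitter, and typical decode cost, and `offered_load` is a hypothetical helper.

```python
def offered_load(arrival_per_s, share, workers, mean_service_ms):
    """Utilization rho = lambda / (c * mu) for one backend."""
    lam = arrival_per_s * share        # requests/s routed to this backend
    mu = 1000.0 / mean_service_ms      # requests/s one worker can finish
    return lam / (workers * mu)


# Arrival rate from the simulator: one request every 18 ms.
arrival_per_s = 1000.0 / 18

# Mean service times are rough assumptions, not measurements.
jetson_rho = offered_load(arrival_per_s, share=2 / 3, workers=2, mean_service_ms=66)
zynq_rho = offered_load(arrival_per_s, share=1 / 3, workers=1, mean_service_ms=80)

for name, rho in (("jetson", jetson_rho), ("zynq", zynq_rho)):
    verdict = "overloaded" if rho >= 1.0 else "stable"
    print(f"{name}: rho={rho:.2f} ({verdict})")
```

Under these assumed numbers both backends come out above 1, which is why the vulnerable scenario builds a queue. A measurement run intended to compare accelerators would pick an admitted rate that keeps both values comfortably below 1.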
Observable Signals Or Logs
Good A2 telemetry distinguishes early rejection from backend saturation.
| Signal | Vulnerable observation | Hardened observation |
|---|---|---|
| HTTP status | Many slow 200s, occasional 500s or timeouts under pressure | 413 for oversized bodies, 429 for quota or queue pressure, 503 for circuit breaker |
| Gateway queue | Unbounded or opaque growth; memory pressure becomes the first clear signal | Bounded queue depth with explicit reject counter |
| Jetson logs | High GPU utilization, rising inference latency, thermal throttling risk | Stable backend rate; rejected excess requests never reach TensorRT |
| Zynq logs | PS worker backlog, DMA wait time, PL busy intervals near 100 percent | Stable accelerator window; clear Pi-side admission decisions |
| Research metrics | p95 and p99 latency dominated by queue delay | Tail latency bounded for admitted work; excess demand visible as 429/413 counts |
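To tell queue delay apart from service time in the research metrics, one option is to record both per request and report the share of latency spent waiting. This is a sketch with hypothetical sample values. The simulator above does not record dispatch time, so the `(queue_wait_ms, service_ms)` pairs are an assumed log format, and `percentile` uses the nearest-rank definition.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile; pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def queue_share(samples):
    """Fraction of total latency spent waiting rather than being served."""
    total = sum(wait + service for wait, service in samples)
    waited = sum(wait for wait, _ in samples)
    return waited / total if total else 0.0


# Hypothetical (queue_wait_ms, service_ms) pairs; not measured data.
samples = [(0, 60), (5, 62), (40, 58), (120, 61), (300, 63)]
latencies = [wait + service for wait, service in samples]

print("p95_ms:", percentile(latencies, 95))
print("queue_share:", round(queue_share(samples), 2))
```

A queue share well above one half suggests the latency numbers describe admission control more than they describe the accelerator.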
Impact Analysis
A2 primarily targets availability, but it can disturb integrity and confidentiality research too.
Availability
The obvious impact is degraded or unavailable inference. Queue saturation can block legitimate users, starve accelerator workers, and keep the gateway in a slow-failure mode instead of a clean reject mode.
Integrity
Timeouts, retries, partial preprocessing failures, and fallback routing can change which backend answers a request. If downstream decisions assume stable latency and backend choice, overload can alter behavior.
Confidentiality
Overload can amplify timing side channels. When queues are visible through response delay, a caller may infer backend load, dispatch policy, or whether another experiment is running.
Mapping To CIA, STRIDE, PASTA, And MITRE ATLAS
Use these labels to keep overload experiments tied to security objectives.
| Framework | A2 mapping | Research interpretation |
|---|---|---|
| CIA | Availability primary; Integrity and Confidentiality secondary | Protect service continuity while preventing overload from distorting backend routing or timing signals. |
| STRIDE | Denial of Service; Repudiation if logs are insufficient | Reject excess demand early and preserve enough sanitized telemetry to attribute pressure to user, token, route, and reason. |
| PASTA | Stage 3 decomposition, Stage 4 threat analysis, Stage 5 vulnerability analysis, Stage 6 attack modeling | Model each queue, worker, body parser, and backend as a resource with capacity and failure semantics. |
| MITRE ATLAS | Relevant to AI denial-of-service and cost-harvesting style concerns, including AML.T0034 Cost Harvesting and public-facing application exposure such as AML.T0049 Exploit Public Facing Application. | A2 is often an impact technique and an enabling condition for timing analysis or extraction by forcing backend stress states. |
Defense Mapping To Existing D1-D11 Controls
D2 is the direct mitigation, but A2 should be validated across the gateway stack.
| Control | Role against A2 | Validation |
|---|---|---|
| D2 Rate Limit + Body Cap | Primary control. Caps request bodies, bounds queue size, limits per-user and global request cost, and rejects early with 413, 429, or 503. | Replay synthetic workload in an owned lab and confirm excess demand does not reach Jetson or Zynq. |
| D1 JWT Auth + Object Checks | Allows per-user quotas and prevents anonymous callers from consuming accelerator budget. | Missing or invalid token receives 401 before body parsing and queue admission. |
| D4 Private Backend Subnet | Prevents clients from bypassing the Pi and saturating Jetson or Zynq service ports directly. | Client VLAN cannot connect to backend ports; only Pi private interface can reach them. |
| D5 Query Anomaly Detection | Detects extraction-like or stress-like request distributions that stay under simple rate limits. | Alert on unusual body sizes, entropy, request intervals, backend targeting, and repeated timeout edges. |
| D6 Sanitized Logging | Preserves reject and queue evidence without storing request bodies or tokens. | Logs include route, status, reason, user id, body hash, size bucket, queue depth, and backend when dispatched. |
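The D6 validation fields can be illustrated with a small record builder that stores a body hash and a coarse size bucket instead of the payload. The function names, bucket boundaries, route, and user id below are hypothetical and only show the shape of such a record.

```python
import hashlib
import json


def size_bucket(n_bytes):
    """Coarse size label so logs do not carry exact payload sizes."""
    for limit, label in ((64_000, "<=64k"), (256_000, "<=256k")):
        if n_bytes <= limit:
            return label
    return ">256k"


def reject_record(route, status, reason, user_id, body, queue_depth, backend=None):
    """Build a log line that carries evidence but no payload or token."""
    return json.dumps({
        "route": route,
        "status": status,
        "reason": reason,
        "user_id": user_id,
        "body_sha256": hashlib.sha256(body).hexdigest(),
        "size_bucket": size_bucket(len(body)),
        "queue_depth": queue_depth,
        "backend": backend,  # stays None when the request was never dispatched
    }, sort_keys=True)


line = reject_record("/infer", 429, "queue_full", "user-17", b"\x00" * 100_000, 10)
print(line)
```

Keeping `backend` empty for rejected requests makes it easy to confirm from logs alone that excess demand never reached Jetson or Zynq.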
Research Notes: What To Measure Experimentally
The measurable goal is controlled degradation, not heroic throughput.
Admission-control curve
Sweep admitted arrival rate below and above capacity. Plot accepted rate, rejected rate, average queue depth, p95 latency, and p99 latency. The protected system should convert overload into early rejects rather than unbounded delay.
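A sweep of this kind can be prototyped offline before touching hardware. The sketch below is a deliberately simplified single-worker model with a fixed service time, separate from the lab script. `simulate`, the 60 ms service time, and the in-system limit of 4 are assumptions chosen only to show the shape of the curve.

```python
def simulate(interval_ms, service_ms=60, max_in_system=None, total=200):
    """Single-worker FIFO model; returns (accepted, rejected, p95_latency_ms)."""
    pending = []      # finish times of admitted, unfinished requests
    free_at = 0       # time the worker next becomes idle
    accepted = rejected = 0
    latencies = []
    for i in range(total):
        now = i * interval_ms
        pending = [t for t in pending if t > now]  # drop completed work
        if max_in_system is not None and len(pending) >= max_in_system:
            rejected += 1  # early reject instead of unbounded waiting
            continue
        start = max(now, free_at)
        free_at = start + service_ms
        pending.append(free_at)
        latencies.append(free_at - now)
        accepted += 1
    latencies.sort()
    p95 = latencies[int(0.95 * (len(latencies) - 1))] if latencies else 0
    return accepted, rejected, p95


# Sweep from below capacity (120 ms gaps) to well above it (20 ms gaps).
for interval_ms in (120, 80, 60, 40, 20):
    unbounded = simulate(interval_ms)
    bounded = simulate(interval_ms, max_in_system=4)
    print(f"interval={interval_ms}ms unbounded={unbounded} bounded={bounded}")
```

Below capacity the two columns match. Above capacity the unbounded p95 keeps growing with the number of requests, while the bounded variant caps latency and reports the excess as rejects.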
Backend fairness
Measure whether Jetson and Zynq receive fair work under pressure. A poorly designed dispatcher may starve one backend or route all hard requests to the same accelerator.
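One possible fairness metric, named here as a suggestion and not taken from the lab design, is Jain's fairness index over per-worker completed work. The counts below are hypothetical. Normalizing by worker count is an assumption, made so that a backend with more workers is not flagged simply for completing more requests.

```python
def jain_index(values):
    """Jain's fairness index: 1.0 is perfectly even, 1/n is maximally skewed."""
    total = sum(values)
    squares = sum(v * v for v in values)
    return (total * total) / (len(values) * squares) if squares else 0.0


# Hypothetical completed-request counts and worker counts; not measured data.
completed = {"jetson": 120, "zynq": 30}
workers = {"jetson": 2, "zynq": 1}

# Normalize by worker count before comparing backends.
per_worker = [completed[name] / workers[name] for name in completed]
print(round(jain_index(per_worker), 3))
```

Tracking this value across the arrival-rate sweep shows whether fairness degrades as pressure rises, which a single snapshot would hide.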
Thermal and power coupling
Track Jetson temperature, Zynq board rails, CPU load on the Pi, and PL busy intervals. A2 can turn a software queue problem into hardware measurement noise.
Retry amplification
Measure how client timeout and retry policies affect effective arrival rate. A safe gateway should return a `Retry-After` header with its rejections and give clients idempotency guidance.
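The amplification effect can be estimated with a simple geometric model, assuming every attempt times out independently with the same probability and clients retry up to a fixed limit. Both helpers are hypothetical. The second one derives a `Retry-After` value in whole seconds from a token bucket's deficit, using the same capacity and refill notions as the simulator's `TokenBucket`.

```python
import math


def effective_rate(base_rate, timeout_fraction, max_retries):
    """Offered load when each timed-out attempt is retried up to max_retries times."""
    multiplier = sum(timeout_fraction ** k for k in range(max_retries + 1))
    return base_rate * multiplier


def retry_after_seconds(tokens, cost, refill_per_second):
    """Whole seconds until the bucket holds enough tokens for this request."""
    deficit = max(0.0, cost - tokens)
    return math.ceil(deficit / refill_per_second)


# Assumed values: 20 req/s of real demand, half of attempts time out, 3 retries.
print(effective_rate(20, 0.5, 3))
print(retry_after_seconds(tokens=0.5, cost=2.5, refill_per_second=8))
```

With half of all attempts timing out and three retries allowed, the gateway sees nearly double the real demand, which is why retry policy belongs in the experiment definition.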
Key Takeaways
A2 is best handled by explicit budgets at every transition from network to accelerator.
- A2 targets the finite capacity of the Pi parser, gateway queue, Jetson worker, and Zynq PS/PL path.
- D2 should reject excess demand before expensive parsing and before backend dispatch.
- Body caps, token buckets, bounded queues, backend timeouts, circuit breakers, and sanitized logs work as a layered control set.
- For research, overload must be measured separately from accelerator performance or it will contaminate Jetson versus FPGA comparisons.