AI Security Section

Agentic AI Security

Agentic AI introduces a different security regime from one-shot assistants because the system can plan, maintain state, call tools, retrieve external information, delegate subtasks, and act over time. The security problem is therefore not limited to harmful output generation. It includes goal hijacking, memory poisoning, tool misuse, privilege abuse, unsafe delegation, cascading failures, and operational drift across long-horizon workflows.

Overview

Why agentic AI changes the security problem

Agentic systems are not just chat interfaces with longer prompts. They are software systems in which a model can decompose tasks, choose intermediate actions, retrieve context, store memory, invoke tools, communicate with other services, and sometimes coordinate with other agents. This means the attack surface becomes temporal and operational. A single manipulated step may persist in memory, alter later planning, trigger a chain of tool calls, or quietly reshape the system's goals over many iterations.

The central difference from ordinary LLM applications is that an agent does not only answer; it can act. Once action is possible, security depends on capability boundaries, policy enforcement, state handling, and runtime oversight. In that setting, prompt injection is still important, but it becomes only one part of a larger problem that includes tool authorization, memory integrity, inter-agent trust, identity propagation, rollback, and containment of downstream side effects.

What makes a system agentic?

Goal-directed behavior: the system tries to complete a task rather than simply produce a one-step answer.
Planning and decomposition: the system breaks work into subgoals, iterations, or conditional branches.
State and memory: the system persists observations, preferences, previous tool outputs, or partial results.
Tool use: the system can browse, search, query databases, send messages, execute code, or manipulate files.
Delegation: the system may call sub-agents, workflows, or external services on the user's behalf.
Long-horizon execution: the system may operate across many steps, making partial compromise accumulate over time.

Why agentic security is different from generative security

Time matters: security failures can emerge gradually across many small steps rather than in one output.
State matters: malicious content can persist in memory, context, or workflow artifacts.
Permissions matter: the model's access to tools, files, identities, and external systems becomes part of the threat model.
Composition matters: safety depends on how planner, retriever, tools, memory, UI, and connectors interact.
Blast radius matters: an agent connected to real systems can cause irreversible external effects.

Core security objectives for agents

Goal integrity: the agent should continue pursuing the authorized task rather than an attacker-injected objective.
Capability bounding: the agent should only access tools, data, and actions needed for the current job.
State integrity: memory, scratchpads, and intermediate context should resist poisoning and unintended persistence.
Execution safety: tool calls and downstream actions should remain policy-compliant and auditable.
Containment: when manipulation occurs, the damage should be localized rather than propagated across services.
Human recoverability: operators should be able to inspect, interrupt, approve, roll back, or disable agent behavior.

Research intuition: the defining risk of agentic AI is not only harmful generation. It is the transformation of language-driven mistakes or manipulations into stateful, tool-mediated, real-world actions.

Diagram showing agentic AI security across planner, memory, tools, delegation, threats, oversight points, and research focus. — Agentic AI security across planning, memory, tools, delegation, and runtime control.

Threat model

Threat model and agentic attack surface

An agentic threat model should describe who can influence the agent, which resources the agent can reach, how memory is handled, whether tools have side effects, and what approvals or guardrails exist between reasoning and action. Agentic risk often arises not because one component is completely broken, but because multiple partially trusted components are composed without strong boundaries.

Attacker positions

Direct user attacker: interacts with the agent through prompts, uploads, or instructions.
Indirect content attacker: plants malicious instructions in emails, documents, webpages, tickets, repos, or search results the agent later reads.
Tool-output attacker: controls or influences external tools, APIs, or documents whose outputs are fed back into the agent.
Memory attacker: causes malicious state to persist across turns, tasks, or users.
Identity attacker: steals, expands, or misuses credentials and delegated access available to the agent.
Supply-chain attacker: compromises plugins, MCP servers, agent frameworks, adapters, configs, or sub-agent dependencies.
Insider or operator attacker: modifies system prompts, policies, tools, audit settings, or deployment parameters.

Attacker goals

Goal hijack: steer the agent away from the user's intended task toward attacker-defined objectives.
Unauthorized action: trigger emails, file changes, purchases, deletions, approvals, or data transfer.
Data exfiltration: cause the agent to reveal secrets, documents, credentials, prompts, or user data.
Privilege abuse: make the agent use more authority than the task requires.
Persistence: implant malicious instructions into memory, task state, or downstream workflows.
Cascading compromise: propagate the attack across tools, services, or other agents.

Major threat classes

1. Goal hijacking and prompt-chain compromise

The core failure mode for many agents is that the agent silently shifts from the user's goal to an attacker-specified one. This may begin as direct or indirect prompt injection, but in agentic settings the effect is broader than unsafe text generation. Once the goal or subgoal is corrupted, the planner may choose new tools, reinterpret constraints, and rationalize harmful behavior as task completion.

Goal hijacking may happen through webpages, emails, retrieved documents, hidden text, or tool responses.
Multi-step planning can magnify the effect because corrupted subgoals influence later decisions.
Long context windows may increase exposure to attacker-controlled instructions mixed with legitimate task context.

2. Tool misuse and unsafe actuation

A powerful agent is dangerous if its tool permissions are too broad. Tool misuse includes invoking commands on untrusted data, sending messages to unauthorized recipients, reading or modifying files outside scope, calling high-risk APIs, running destructive shell commands, or triggering external actions without meaningful confirmation. In practice, tool misuse often turns a model-level attack into a business or security incident.

Read actions: opening sensitive files, inboxes, databases, or dashboards beyond scope.
Write actions: editing documents, tickets, records, or configuration files incorrectly.
Execute actions: running code, scripts, workflows, or commands that alter the host environment.
Network actions: sending data to external endpoints or communicating with attacker-controlled services.

3. Memory poisoning and unsafe persistence

Memory is one of the most distinctive risks in agentic systems. Short-term memory, long-term memory, task notes, summaries, and scratchpads can all persist attacker-controlled content. If that content is later treated as trusted context, the agent can repeatedly reintroduce the attack to itself. This turns one-time manipulation into durable workflow corruption.

User memory poisoning: attacker injects content that becomes part of long-term preferences or instructions.
Task-state poisoning: malicious summaries or notes distort subsequent decision steps.
Cross-session contamination: memory survives beyond the original attack surface and affects later tasks.
Cross-user contamination: poor isolation causes one user's poisoned state to influence another user's work.

4. Identity and privilege abuse

Agentic systems frequently inherit user tokens, service accounts, API keys, or delegated permissions. If identity propagation is weakly designed, the agent may gain more authority than intended or may use a high-privilege connector in a low-trust context. Once that happens, even a modest prompt attack can lead to serious consequences because the agent is acting with real credentials.

Agents may accidentally reuse credentials across tasks or users.
Delegated access may persist longer than the task that justified it.
High-privilege service accounts create large blast radius if the agent is manipulated.

5. Unsafe delegation and multi-agent trust failures

Multi-agent systems add another layer of complexity because one agent may rely on another agent's outputs, summaries, or actions. This creates new trust boundaries: who authenticated the sub-agent, what policy it follows, whether its memory is isolated, and whether its outputs are verified. A weak or compromised sub-agent can mislead the planner, leak information, or become an attack relay.

Delegation can obscure accountability when many sub-agents contribute to one outcome.
Inter-agent messages can carry malicious instructions, poisoned summaries, or hidden assumptions.
Failure containment becomes harder when one compromised agent can influence others.

6. Unexpected code execution and environment escape

Some agents can execute code directly or indirectly through scripts, plugins, or local runtimes. If execution boundaries are weak, natural-language manipulation can become code execution, shell misuse, file-system abuse, or container escape. This risk is especially acute in coding agents, browser agents, and development assistants that interact with repositories, terminals, or build systems.

7. Context poisoning from tool outputs and external content

Agents continuously ingest external content: search results, browser pages, emails, tickets, documents, API responses, and logs. Every one of those becomes a potential instruction channel. A tool may return attacker-controlled text, and the agent may treat it as operational guidance. This blurs the distinction between data and instruction and makes ordinary integration points part of the attack surface.

8. Cascading failures across long-horizon plans

The longer the plan, the more chances for partial errors to compound. An agent may make a small mistake in step two, use that mistaken assumption in step six, persist the resulting summary into memory, and then call a powerful tool in step ten. The final incident may appear disconnected from the original manipulation, which makes debugging and forensics difficult.

Small early errors can become large later actions.
Recovery is harder when the agent rewrites its own plan as it proceeds.
Users may not observe intermediate states closely enough to catch drift early.

9. Human-oversight failure and automation overtrust

Human-in-the-loop does not automatically solve agentic risk. If approvals are vague, rushed, or poorly explained, users may become rubber stamps. Conversely, too many low-value approval prompts produce fatigue and missed critical warnings. The oversight layer itself therefore needs design discipline: who approves, what they see, what they can block, and how the system behaves when approval is absent.

10. Agentic supply chain and configuration drift

Modern agents depend on frameworks, connectors, model providers, tool servers, memory stores, prompts, policies, and orchestration runtimes. A vulnerability or silent change in any of those can materially alter the security posture. Agentic systems are therefore exposed not only to model misuse, but also to configuration drift, unsafe defaults, poisoned connectors, and third-party protocol risk.

Countermeasures

Countermeasures and secure design principles

Secure agent design is fundamentally about constraining impact. Because prompt injection and social engineering are difficult to detect perfectly, defenses should assume some manipulations will get through. The goal is to prevent those manipulations from easily turning into harmful or irreversible actions.

1. Capability bounding and least privilege

Give each tool only the minimum permissions required for the current task.
Prefer narrow, typed tools over broad shell or file-system access.
Issue temporary, scoped credentials instead of long-lived, reusable secrets.
Separate read, write, execute, and network permissions rather than bundling them together.
Disable high-risk tools by default and require explicit elevation for exceptional cases.

2. Approval gates for consequential actions

Require explicit confirmation before sending external messages, deleting data, approving transactions, or changing critical records.
Present users with clear summaries of what the agent intends to do, why, and with which data.
Make approval state auditable and non-bypassable through alternative tool routes.
Use stronger approval workflows for actions involving money, credentials, confidential data, or infrastructure changes.

Good approval design is selective. The point is not to interrupt every low-risk step, but to create friction at the moments where a wrong action would be costly.

3. Memory hygiene and state governance

Treat memory as untrusted unless it has been validated for the current task.
Separate short-term scratchpads from durable long-term memory.
Apply retention limits, expiry, and review policies to persistent memories.
Prevent sensitive tool outputs, secrets, and attacker-controlled instructions from being stored unnecessarily.
Isolate memory by user, task, and environment to prevent cross-session contamination.

4. Tool sandboxing and execution containment

Sandbox code execution, browsing, file access, and command invocation.
Use allowlists for domains, file paths, commands, repositories, and API scopes.
Restrict outbound network access for tools that do not need it.
Bind destructive or environment-altering commands to stricter review and logging.
Ensure failed or interrupted tool actions leave a recoverable system state where possible.

5. Strong identity and delegation control

Propagate user identity and authorization context explicitly rather than implicitly sharing service credentials.
Use task-scoped tokens and revoke them when the task ends.
Separate agent identity from human identity so actions remain attributable.
Limit which agents can impersonate users, access mailboxes, browse private content, or modify infrastructure.
Apply policy checks at delegation boundaries before one agent can invoke another.

6. Policy engines and deterministic guardrails

Use deterministic policy checks between planning and acting.
Validate tool arguments against schemas, business rules, and authorization context.
Keep critical policies outside the model so they cannot be changed by prompt manipulation alone.
Version-control prompts, tools, policy files, and workflow configurations together.
Prefer fail-closed behavior for ambiguous or policy-conflicting actions.

7. Monitoring, tracing, and runtime detection

Log prompts, tool calls, approvals, memory writes, identity use, and external side effects.
Detect suspicious patterns such as repeated prompt-boundary probing, unusual tool chains, credential access, or exfiltration-like behavior.
Trace agent execution across sub-agents and connectors so incidents remain reconstructable.
Alert on policy violations, anomalous autonomy escalation, or changes in normal task structure.
Keep emergency stop, rollback, and disable mechanisms available to operators.

8. Secure multi-agent composition

Authenticate and authorize inter-agent communication explicitly.
Treat outputs from one agent as untrusted input to another unless independently verified.
Prevent unrestricted recursive delegation and unbounded agent spawning.
Assign distinct roles, permissions, and memory scopes to different agents.
Design containment so one compromised agent cannot silently dominate the whole workflow.

9. Continuous red teaming and adversarial evaluation

Test direct and indirect prompt injection against realistic tool chains.
Include memory poisoning, malicious tool outputs, identity abuse, and multi-agent compromise scenarios.
Evaluate not only final outputs but intermediate plans, approval prompts, memory writes, and external actions.
Re-test after every prompt, tool, connector, or policy update.
Benchmark long-horizon behavior because many agentic failures appear only after many steps.

Practical takeaway: secure agents should be designed as if some manipulation will succeed. The architecture must therefore constrain authority, verify actions, isolate state, and keep humans able to interrupt the workflow before damage becomes irreversible.

Open challenges

Open research challenges and future directions

Agentic AI security is still early. The field has many useful design patterns, but relatively few mature assurance methods. Existing guidance increasingly agrees that prompt injection cannot be solved purely with classifiers or prompt tricks; instead, security must be built into identity, tools, memory, delegation, and human oversight. The research challenge is to turn those principles into strong, measurable guarantees.

1. Safe autonomy boundaries are still hard to define

Agents need enough freedom to be useful, but not so much freedom that any successful manipulation becomes high impact. The right autonomy boundary depends on task value, reversibility, identity scope, and operator expectations. We still lack mature methods for setting those boundaries systematically.

2. Prompt injection is partly an architectural problem

Because agents ingest untrusted content while simultaneously trying to follow trusted goals, prompt injection is not simply bad input classification. Better architectures are needed to separate data, goals, permissions, and actions so that even when the model is misled, policy enforcement still holds.

3. Memory security remains underdeveloped

Memory makes agents useful, but it also creates persistence and contamination risk. The field still needs stronger frameworks for which memories should persist, how they should be validated, when they should expire, and how to prove they are not poisoning later behavior.

4. Multi-agent assurance is especially immature

Single-agent security is already difficult; multi-agent systems add delegation, trust chains, message validation, and failure propagation. We still lack widely adopted assurance methods for inter-agent identity, message integrity, and containment of compromised sub-agents.

5. Human oversight is easy to add superficially and hard to design well

Approval steps can create false comfort if they are too frequent, too opaque, or too easy to bypass. Research is needed on effective approval UX, calibrated trust, selective interruption, and operator decision support under high-speed automated workflows.

6. Benchmarking lags behind real deployments

Many current evaluations are short-horizon and single-agent. Real deployments involve long tasks, multiple tools, retrieval, memory, and enterprise connectors. Better benchmarks should capture long-horizon drift, memory poisoning, delegation failures, and compound attack chains.

7. Deterministic guarantees are still limited

Probabilistic guardrails are useful, but organizations increasingly want stronger guarantees around what an agent can never do. Future work is likely to focus more on policy languages, authenticated workflows, cryptographic attestation, typed tools, and formalized execution boundaries.

8. Supply-chain exposure is growing with agent ecosystems

Protocols, plugins, MCP servers, sub-agents, hosted tools, and framework extensions are expanding quickly. Security research must therefore account for ecosystem risk, not just the base model. This includes dependency trust, update governance, third-party server behavior, and configuration drift.

9. Cross-layer integration is still weak

Agentic security is often discussed only at the application layer, even though the actual deployment may depend on cloud IAM, browser isolation, endpoint security, hardware trust, or physical actuators. Stronger cross-layer research is needed to connect model-level reasoning, software control flow, infrastructure boundaries, and real-world consequences.

10. Future direction

Capability-bounded agent architectures with explicit trust boundaries.
Stronger memory governance and state-isolation mechanisms.
Authenticated workflows and policy languages for tool use and delegation.
Benchmark suites for long-horizon, multi-agent, and browser-based attacks.
Better human-oversight interfaces for high-consequence agent actions.
Cross-layer work connecting agentic AI security with cloud, edge, hardware, and physical AI deployment.

Selected readings

Selected readings and frameworks

The references below are a strong starting point for understanding current agentic-AI security from both practical and research perspectives.

Best next step for this page: add one diagram showing the full agent loop—goal → planner → memory → retrieval → tool call → observation → updated plan → approval gate → external action. That single figure will make the temporal nature of agentic risk much easier to understand.