Research-grade attack flows for foundation models, RAG systems, and tool-using agents
00 Threat Model Overview
How an LLM system expands the attack surface
Modern deployments are no longer just “user prompt → model output”. Retrieval, memory, tools, policies, and feedback loops create multiple adversarial entry points.
An attacker no longer needs model weights. Control over any input channel that gets merged into the final context can steer generation.
The effective attack surface includes instruction precedence, chunk ranking, tool schemas, planner behavior, and downstream execution logic.
Security failures are often cross-component failures: the LLM is only one stage in a larger vulnerable pipeline.
A correct analysis must track entry point → context assembly → model interpretation → external side effect.
System view: secure deployment requires defending prompts, retrieval, tools, memory, and action policies together.
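The entry point → context assembly → model interpretation → external side effect chain can be made concrete by tagging every context segment with its provenance before assembly, so later stages know which spans an attacker could control. A minimal sketch; the segment types and helper below are illustrative, not taken from any specific framework.

```python
from dataclasses import dataclass
from enum import Enum

class Provenance(Enum):
    SYSTEM = "system"        # operator-controlled policy text
    USER = "user"            # end-user input
    RETRIEVED = "retrieved"  # indexed documents (untrusted)
    TOOL = "tool"            # tool output (untrusted)

UNTRUSTED = {Provenance.RETRIEVED, Provenance.TOOL}

@dataclass
class Segment:
    text: str
    provenance: Provenance

def assemble_context(segments: list[Segment]) -> tuple[str, list[Segment]]:
    """Merge segments into one prompt while keeping provenance labels,
    and report which segments an attacker could plausibly control."""
    prompt = "\n".join(f"[{s.provenance.value}] {s.text}" for s in segments)
    untrusted = [s for s in segments if s.provenance in UNTRUSTED]
    return prompt, untrusted
```

Keeping the untrusted-segment list alongside the flat prompt is what lets downstream policy checks and action gates reason about attacker reach instead of treating the context as one undifferentiated string.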
01 Instruction-Layer Attack
Direct prompt injection overrides intended behavior
The attacker places adversarial instructions directly in the user-visible prompt, aiming to defeat policy constraints or alter task priorities.
user prompt → policy conflict
Research-grade view
This is not just “asking bad things.” The attack exploits the fact that policy is encoded as text and competes with attacker text inside the same inference context.
Success depends on instruction-hierarchy design, template delimiters, context ordering, refusal training, and how strongly the original task framing still competes with the injected text.
Common evaluation metrics: attack success rate, refusal robustness, leakage rate for hidden prompt fragments, and transferability across paraphrases.
Defense is not one filter. It usually needs template hardening, structured instruction channels, adversarial training, and post-generation policy checks.
Scenario: a seemingly normal request appends a hidden adversarial suffix that causes system prompt leakage or refusal bypass.
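The attack-success-rate metric above reduces to a simple fraction over an adversarial prompt set. A minimal sketch, assuming placeholder callables: `model` stands in for an API client and `is_bypass` for an output judge; neither is a real library interface.

```python
def attack_success_rate(model, adversarial_prompts, is_bypass):
    """ASR: fraction of adversarial prompts whose responses the judge
    flags as a policy bypass. `model` maps prompt -> response text;
    `is_bypass` maps response text -> bool (both are placeholders)."""
    hits = sum(1 for p in adversarial_prompts if is_bypass(model(p)))
    return hits / len(adversarial_prompts)
```

In practice the same fraction is recomputed over paraphrase sets of each prompt to report transferability, and over held-out system prompts to report leakage rate.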
02 Retrieval-Channel Attack
RAG poisoning turns trusted evidence into an attack carrier
The attacker modifies documents, web pages, or indexed chunks so that malicious instructions are retrieved alongside relevant content.
retrieved chunk → hidden instruction
What makes this hard
RAG systems often treat retrieved text as evidence, but the model still sees that evidence as natural language that can contain instructions.
A poisoning campaign can target ingestion time, chunking time, or ranking time. Each stage changes which malicious text reaches the final prompt.
Even if one chunk looks benign, adversaries can exploit top-k composition, duplicate boosting, metadata abuse, or hidden formatting to dominate attention.
A robust defense needs source trust scoring, instruction stripping, retrieval-time anomaly checks, and policy-aware context isolation.
Scenario: an attacker poisons an indexed knowledge source so a later enterprise query retrieves attacker-controlled instructions as “relevant context”.
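One of the defenses above, retrieval-time instruction stripping weighted by source trust, can be sketched as a heuristic filter. The regex patterns and threshold below are illustrative assumptions; a production system would use a learned classifier and calibrated trust scores rather than a fixed pattern list.

```python
import re

# Illustrative imperative patterns, NOT an exhaustive or production list.
IMPERATIVE_PATTERNS = [
    r"ignore (?:\w+ ){0,3}instructions",
    r"you must now",
    r"reveal (?:the )?system prompt",
]

def chunk_risk(text: str, source_trust: float) -> float:
    """Heuristic risk score: imperative phrasing weighted by how little
    we trust the source (source_trust in [0, 1])."""
    hits = sum(bool(re.search(p, text, re.IGNORECASE))
               for p in IMPERATIVE_PATTERNS)
    return hits * (1.0 - source_trust)

def filter_chunks(chunks: list[dict], threshold: float = 0.5) -> list[dict]:
    """Drop retrieved chunks whose risk exceeds the threshold before
    they are assembled into the final prompt."""
    return [c for c in chunks if chunk_risk(c["text"], c["trust"]) < threshold]
```

Note that this only addresses one stage: a filter like this must be paired with ingestion-time source scoring and top-k composition checks, since the section above points out that attackers can dominate context through duplication and ranking rather than any single obviously malicious chunk.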
03 Agentic Attack
Indirect prompt injection drives unsafe tool use
When an agent reads untrusted content and can invoke tools, attacker text can steer not only words but external actions.
tool call → unsafe side effect
Agent-specific risk
Indirect prompt injection is especially dangerous because the attacker text may never be shown to the human operator. It is consumed by the agent during browsing, email reading, or file parsing.
The vulnerable boundary is often the planner-to-tool interface: insufficient parameter validation, weak authorization rules, or missing confirmation for sensitive actions.
A serious evaluation must measure not just toxic output but action-level harm: unauthorized email, code execution, purchases, ticket creation, or data movement.
Good defenses include least-privilege tools, typed action constraints, high-risk confirmation gates, and provenance-aware planning.
Scenario: a mail assistant reads an attacker-crafted message, follows hidden instructions, and uses available tools to exfiltrate data.
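The planner-to-tool boundary described above can be hardened with a confirmation gate keyed on risk level and provenance. A minimal sketch under stated assumptions: the tool names and the "user" vs. "untrusted_content" provenance labels are hypothetical, and `confirm` stands in for whatever human-in-the-loop mechanism the deployment uses.

```python
from dataclasses import dataclass

# Hypothetical high-risk tool names, for illustration only.
HIGH_RISK_TOOLS = {"send_email", "execute_code", "make_purchase"}

@dataclass
class ToolCall:
    name: str
    args: dict
    provenance: str  # "user" (operator-initiated) or "untrusted_content"

def authorize(call: ToolCall, confirm) -> bool:
    """Confirmation gate: high-risk calls that originate from untrusted
    content require explicit human confirmation; everything else passes
    (subject to least-privilege tool scoping done elsewhere)."""
    if call.name not in HIGH_RISK_TOOLS:
        return True
    if call.provenance == "user":
        return True
    return bool(confirm(call))  # human-in-the-loop for risky, content-driven actions
```

The key design choice is that the gate keys on provenance, not on output text: an indirect injection never reaches the operator's screen, so the only reliable signal is where the instruction that triggered the call came from.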
04 Alignment Evasion
Jailbreak optimization searches for prompts that systematically bypass safeguards
Unlike casual misuse, a jailbreak attack is often iterative and measurement-driven: optimize the input until refusal behavior degrades.
attack search → query budget
Why this is more than a single clever prompt
Attackers often use automated search: generate many candidate prompts, query the model, score responses, and evolve stronger jailbreaks over time.
The optimization target can be explicit harmful completion, reduced refusal confidence, hidden prompt leakage, or downstream tool execution.
A research-grade study should report transferability across models, persistence across releases, query efficiency, and robustness under paraphrase or content moderation.
Defense requires making the refusal boundary harder to exploit under search, not just blocking a list of known strings.
Scenario: an attacker repeatedly queries an API and optimizes prompt variants until a stable jailbreak bypass emerges.
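The generate → query → score → evolve loop above has a simple skeleton, useful defensively for measuring a target's robustness and query efficiency under search. All callables (`model`, `mutate`, `score`) are placeholders to be supplied by the evaluation harness; no concrete attack strategy is encoded here.

```python
import random

def search_loop(model, seeds, mutate, score, budget=100, threshold=0.9):
    """Skeleton of a measurement-driven search: mutate candidates,
    query the target, score responses, keep strong candidates as
    parents. Returns (best candidate, best score, queries used) so the
    harness can report query efficiency alongside success."""
    pool = list(seeds)
    best, best_score, queries = None, -1.0, 0
    while queries < budget:
        candidate = mutate(random.choice(pool))
        response = model(candidate)
        queries += 1
        s = score(response)
        if s > best_score:
            best, best_score = candidate, s
            pool.append(candidate)  # strong candidates seed later mutations
        if best_score >= threshold:
            break
    return best, best_score, queries
```

Reporting the query count at success is what turns this from a one-off demo into the research-grade measurement the section calls for: a defense that merely raises the query budget from tens to thousands is a quantifiable, comparable improvement.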
05 API Abuse / Theft
Model extraction and capability mapping via systematic probing
An adversary interacts with the model as a black box to infer behavior, recover capabilities, imitate outputs, or bootstrap a surrogate.
probing → surrogate reconstruction
Extraction objectives
Full weight recovery is not required for harm. A high-fidelity surrogate or a detailed capability map may already weaken competitive advantage and enable better downstream attacks.
Attackers probe topical competence, refusal regimes, prompt sensitivity, and latent policy boundaries, often using active learning strategies.
The threat becomes stronger when outputs include extra observables such as probabilities, ranking scores, tool traces, or chain-like intermediate hints.
Mitigations include rate limiting, adaptive query monitoring, response minimization, watermarking, canary prompts, and contractual access controls.
Scenario: a black-box adversary systematically probes an API to build a surrogate or a detailed behavioral fingerprint of the target model.
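Adaptive query monitoring, one of the mitigations listed above, can be sketched as a sliding-window detector. The volume and diversity thresholds below are illustrative assumptions, not calibrated values, and the exact-match diversity measure is deliberately crude.

```python
from collections import defaultdict, deque

class QueryMonitor:
    """Crude probing detector: flags a client whose recent window is
    both high-volume and low-diversity, as happens with template-based
    capability sweeps. Thresholds are illustrative, not calibrated."""

    def __init__(self, window: int = 100, min_volume: int = 50,
                 min_diversity: float = 0.3):
        self.min_volume = min_volume
        self.min_diversity = min_diversity
        self.history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, client_id: str, query: str) -> bool:
        """Record one query; return True if the client now looks suspicious."""
        h = self.history[client_id]
        h.append(query)
        if len(h) < self.min_volume:
            return False  # not enough volume to judge
        return len(set(h)) / len(h) < self.min_diversity
```

A monitor like this catches only the clumsiest extraction campaigns; an active-learning adversary issuing diverse queries evades the diversity signal, which is why the section pairs monitoring with response minimization, watermarking, and contractual controls rather than relying on any single mitigation.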