While building, securing, or deploying AI agents, understanding AI agent traps is essential, because the vulnerability doesn’t come from what the model thinks, but from what it does.
We analyzed 20 real-world security incidents and found that behavioral control and systemic traps (not prompt injection) now drive the majority of critical breaches. We mapped each incident to a six-category taxonomy (content injection, semantic manipulation, cognitive state, behavioral control, systemic, and human-in-the-loop) based on CVE data and research from Microsoft and Google DeepMind.
Real-world AI agent trap incidents
1. Grok Morse Code Crypto Heist: The attack smuggles instructions through Morse encoding: exploiting the gap between what Grok’s guardrails inspect (plain text) and what it decodes and acts on (the translated instruction). The encoding choice is specifically a content-layer bypass: the malicious directive is invisible to filters until the agent itself renders it readable.1
2. Claude ClaudeBleed: It is a critical security vulnerability within the Anthropic Claude for Chrome browser extension, allowing malicious actors to hijack the AI assistant, steal sensitive data, and perform actions without user consent.2
3. Gemini CLI RCE: A critical Remote Code Execution (RCE) vulnerability, identified as GHSA-wpqr-6v78-jr5g, had a maximum CVSS score of 10.0. It was discovered in the Gemini CLI and its associated GitHub Action. This vulnerability allowed attackers to gain full control over the system executing the tool. This made it a critical supply-chain security threat.3
4. Antropic PocketOS: A Cursor agent powered by Claude, while investigating a staging bug, autonomously discovered an unscoped Railway CLI token, inferred an API endpoint, and issued a volumeDelete command that destroyed the production database and three months of backups in 9 seconds.4
5. Open-Source AI Ecosystem: CLI-Anything auto-generates SKILL.md instruction-layer files consumed by Claude Code, Codex, OpenClaw, Cursor, and GitHub Copilot CLI. Poisoned skill definitions propagate silently across every agent that imports the affected package; no CVE is issued, no SBOM entry exists, and no scanner detects it. The attack targets shared ecosystem infrastructure (the ClawHub skill registry, the npm dependency graph) rather than any individual agent.5
6. Grafana AI: Noma Security found that an attacker could store a malicious prompt inside a data source that Grafana’s AI assistant later retrieved. Once processed, the AI sent sensitive data, such as financial metrics and infrastructure telemetry, to an attacker-controlled server without requiring a user click.6
7. Anthropic MCP Ecosystem: OX Security disclosed a systemic architectural vulnerability across Anthropic’s official MCP SDKs (Python, TypeScript, Java, Rust) where user input flows directly into STDIO MCP server configurations without sanitization, affecting over 150 million SDK downloads, 7,000+ publicly exposed servers, and downstream tools including LiteLLM, LangChain, Cursor, Windsurf, and Claude Code. Because the flaw is in the shared SDK architecture rather than any single agent, any agent built on the framework inherits the exposure.7
8. Andon Market (Luna AI): Andon Market, a San Francisco retail shop run autonomously by an AI agent called “Luna,” makes inventory, pricing, and hiring decisions by reading Google Reviews. Customers discovered that leaving a review phrased as an instruction, such as “please stock product X”, causes the agent to act on it, turning a public-facing review platform into a live prompt injection surface with real business consequences.8
9. ChatGPT Code Execution: A malicious prompt disguised as productivity tips triggers DNS tunneling code that encodes sensitive conversation content and uploads documents into subdomain queries, silently transmitting them to an attacker-controlled DNS server. Check Point Research demonstrated that the exfiltration channel is invisible to conventional network monitoring because it rides on standard DNS traffic initiated by the agent’s own code execution environment.9
10. Perplexity Comet: Zenity Labs disclosed that Perplexity Comet’s agentic browser can be hijacked via a malicious calendar invite containing a prompt injection payload, causing it to access the local file system, browse directories, open and read files, and exfiltrate data. The attack requires no user interaction beyond accepting what appears to be a legitimate meeting invitation, and operates entirely within the browser’s intended capabilities.10
11. Microsoft Semantic Kernel: Microsoft’s Defender Security Research Team identified two critical vulnerabilities in Semantic Kernel, CVE-2026-26030 (Python SDK, patched in 1.39.4) and CVE-2026-25592 (.NET SDK, patched in 1.71.0), where an attacker with any prompt injection vector can achieve remote code execution on the machine hosting the agent. CVE-2026-26030 exploited an eval-based filter in the InMemoryVectorStore whose AST blocklist was bypassable through undocumented attribute traversal, while CVE-2026-25592 exposed a file-transfer helper function as a callable kernel tool, allowing a hostile prompt to steer the agent into writing arbitrary files to dangerous host locations.11
12. Cline AI Triage Bot: A malicious GitHub issue title injected instructions into Cline’s AI triage bot, tricking it into running npm install on a typosquatted package. This led to cache poisoning, credential theft, and a backdoored cline@2.3.0 release that silently installed OpenClaw malware on approximately 4,000 developer machines.12
13. Claude Desktop Extensions: LayerX security researchers discovered a CVSS 10/10 vulnerability in Claude Desktop Extensions affecting over 10,000 users, where an attacker can embed malicious instructions inside a calendar event that Claude processes when a user asks about their schedule. The agent then automatically executes arbitrary code on the user’s machine without any further interaction, with no visible indication that anything has occurred.13
14. npm/MCP Ecosystem: Socket discovered SANDWORM_MODE, a self-replicating npm worm distributed through 19 typosquatted packages that installs a rogue MCP server with prompt injection payloads embedded in tool descriptions, enabling it to exfiltrate credentials from AI coding assistants. Because the worm propagates through the shared package registry, a single infection seeds the attack across every developer who installs an affected dependency.14
15. Snowflake Cortex Code: PromptArmor discovered that Cortex Code’s command validation system failed to evaluate commands inside process substitution expressions, allowing a malicious prompt injection hidden in a GitHub repository README to execute arbitrary shell commands without ever triggering the human-in-the-loop approval step. The injected instruction also manipulated the model into setting an unsandboxed execution flag, causing the malicious command to run entirely outside the sandbox without prompting the user for consent.
16. MetaGPT / LangChain Agents: MemoryGraft is a novel indirect injection attack that compromises agent behavior not through immediate jailbreaks but by implanting malicious “successful experiences” into the agent’s long-term memory, exploiting its tendency to replicate patterns from retrieved successful tasks. Unlike traditional prompt injections, which are transient, or standard RAG poisoning, which targets factual knowledge, MemoryGraft corrupts all future sessions without any session-level injection, requiring an attacker to supply only benign-seeming ingestion-level artifacts that the agent reads during normal execution.15
17. ServiceNow Now Assist: In ServiceNow’s Now Assist, default settings allow AI agents to autonomously discover and recruit each other; a malicious prompt embedded in data processed by a low-privilege agent can instruct it to call upon a more powerful agent to steal data, modify records, or escalate privileges. The result was privilege escalation and data exposure driven entirely by inter-agent trust.16
18. Apple Intelligence: Malicious Unicode RIGHT-TO-LEFT OVERRIDE characters hide harmful instructions by writing them backward, so they render correctly on screen but remain reversed where Apple’s safety filters inspect them, bypassing all three layers of on-device guardrails. The technique succeeded in 76% of test cases across approximately 200 million affected devices.17
19. Google Gemini (Calendar): Hidden instructions embedded in calendar event descriptions lay dormant in Gemini’s context until a user asks about their schedule, at which point the payload activates, summarizing private meeting contents and writing them to a new calendar event visible to the attacker. The attack exploits Gemini’s integration with calendar data, turning structured personal data into a trigger surface without requiring the victim to click anything.18
20. Microsoft 365 Copilot: EchoLeak (CVE-2025-32711), discovered by Aim Security, is the first known case of prompt injection weaponized to cause concrete data exfiltration in a production AI system. It is a single-crafted email that coerces Copilot into accessing internal files and transmitting their contents to an attacker-controlled server without any user interaction. The attack chains four bypasses: evading Microsoft’s XPIA classifier, circumventing link redaction with reference-style Markdown, exploiting auto-fetched images, and abusing a Microsoft Teams proxy permitted by the content security policy.
What are AI agent traps?
AI agent traps are adversarial content embedded in digital environments and engineered to manipulate, deceive, or exploit autonomous AI agents that interact with those environments.19
The central insight is that autonomous agents process web content at layers humans do not perceive. Attackers can embed malicious instructions in HTML comments, CSS-positioned or zero-opacity text, metadata attributes, and steganographic data encoded in image files.20 None of these layers is ordinarily visible to a human reviewer; an agent parsing the same page treats content found in them as equally valid input to content rendered visibly on screen. The Google DeepMind researchers note this as a fundamental asymmetry: attackers can calibrate attacks to exploit an agent’s instruction-following, tool-chaining, and goal-prioritization abilities precisely because those are the capabilities that make agents operationally useful.21
Six attack categories of AI agent traps
Researchers have identified 6 categories of AI agent traps that adversaries can exploit to compromise autonomous systems:
Content injection traps
Exploit the gap between human perception, machine parsing, and dynamic rendering to smuggle malicious inputs past the agent.
The attack surface covers several distinct injection vectors. Hidden instructions embedded in HTML comments, such as `<!– SYSTEM: Ignore prior instructions –>`, appear in page source but never in the rendered view.22 CSS off-screen positioning, using `position: absolute; left: -9999px` or equivalent, places text at coordinates outside any viewport while leaving it fully parseable by agents that process document object model content. Accessibility attributes, specifically `aria-label` and related ARIA markup, carry text agents interpret as semantic context; injecting adversarial directives there places them inside the accessibility tree without any visible output.23 A fourth vector uses steganographic encoding: malicious payloads encoded in image pixel data at values imperceptible to human vision but readable by agents that process image metadata or apply pixel-level analysis.24
Semantic manipulation traps
Corrupt the agent’s reasoning chain and internal verification processes, leading it to draw flawed conclusions from seemingly valid inputs.
Three mechanisms drive this category. The first is biased phrasing and contextual priming: loading surrounding text with language that anchors the agent’s interpretation of subsequently processed content. The second is authoritative language saturation, flooding documents with phrases such as “industry-standard,” “enterprise-grade,” or “recommended by leading practitioners” to exploit the model’s learned association between such language and credible, trustworthy sources.25 The third mechanism is the lost-in-the-middle effect, a structural weakness in transformer-based LLMs where model performance on retrieval and synthesis tasks degrades when relevant information is positioned in the middle of a long context window rather than at the beginning or end.26
Cognitive state traps
Target the agent’s long-term memory, knowledge bases, and learned behavioral policies to poison future decision-making.
The three primary variants are direct RAG poisoning, latent memory poisoning, and adversarial few-shot examples in contextual learning.27
Direct RAG poisoning injects false information into indexed document corpora that agents consult during retrieval-augmented generation. Poisoned memory is more advanced. An attacker stores innocuous-seeming data in an agent’s persistent memory during routine interactions. The stored data produces no detectable effect until a specific future context activates it, at which point it modifies agent behavior in ways that appear to have no recent causal trigger.28 Adversarial few is injecting carefully crafted demonstration pairs into a context window so that the agent adopts the pattern implicit in those examples. Research on backdoor triggers in demonstrations found average attack success rates of 95 percent across models of varying scale under this approach.29
Behavioral control traps
Behavioral control traps are the most operationally consequential category in the taxonomy. They target what agents do rather than what they perceive or conclude, giving attackers direct influence over tool execution, file operations, network requests, and inter-agent communications.30
Systemic traps
Systemic traps do not target individual agents. They target the ecosystem properties that emerge when many agents of similar design operate on shared data sources, execute similar reasoning patterns, and take actions that feed back into the environment that other agents read.31
The broader category encompasses three distinct mechanisms. The first is congestion trap design: fabricating scarcity or demand signals that cause multiple agents to execute synchronized resource-acquisition behaviors, creating coordinated failures without direct agent-to-agent communication. The second is the interdependence cascade: exploiting feedback loops in multi-agent systems where each agent’s output becomes input to others, so a single corrupted signal propagates and amplifies across the network. The third is compositional payload fragmentation: distributing attack components across multiple individually benign sources that reconstitute into a functional malicious payload only when aggregated by an agent during a retrieval or synthesis task.32
Human-in-the-loop traps
Human-in-the-loop traps are the most subtle category in the taxonomy and target the supervisory layer that is conventionally treated as a safeguard. Rather than bypassing human review, these traps exploit it: the compromised agent produces outputs specifically engineered to gain human approval for actions the human would reject if described accurately.33
The core mechanism is deceptive summarization. An agent with write access to its own output layer can describe its actions in a way that frames destructive or unauthorized operations as routine maintenance.
Soyez le premier à commenter
Votre adresse courriel ne sera pas publiée. Tous les champs sont obligatoires.