MCP Injection Lab: 28 Pages That Try to Hijack Your AI Agent
I built 28 realistic-looking web pages — blog posts, API docs, job listings, RSS feeds — each embedded with hidden prompt injections at increasing levels of sophistication. Send your AI agent to browse them. See if it gets manipulated.
Every time an AI agent browses a web page, it's trusting that the content is what it appears to be. A tech blog is just a tech blog. An API documentation page is just documentation. A job listing is just a job listing. But what happens when the page isn't? What happens when the person who wrote the page anticipated that an AI agent would read it — and embedded instructions specifically designed to manipulate what the agent does next?
That's the question MCP Injection Lab is built to answer. It's a collection of 28 standalone web pages that look exactly like normal content — but contain hidden prompt injection payloads at varying levels of sophistication. Send your AI agent to summarize one of them. Watch what it does.
The attack: indirect prompt injection
Prompt injection is what happens when an attacker gets their instructions into the same context window as your AI agent. Direct injection is the obvious case — you're talking to an AI and you type in adversarial instructions yourself. Indirect injection is the subtler, more dangerous case: the instructions don't come from you at all. They come from data the agent processes on your behalf.
An AI agent reading a web page, summarizing an email, or processing a shared document is consuming untrusted content. If that content contains instructions — hidden in HTML comments, invisible CSS text, metadata fields, or seemingly ordinary prose — the agent may follow them without any indication to the user that anything unusual happened.
The attack surface is every piece of external content an agent ever processes. Web pages, emails, documents, API responses, RSS feeds, tool schemas. If the agent reads it, an attacker can write to it.
When does an agent become vulnerable?
Not every agent is equally at risk. An agent becomes truly dangerous to operate when three conditions are met simultaneously — what security researchers call the lethal trifecta.
- Access to private data — the agent can read emails, files, databases, or other sensitive information on behalf of the user
- Exposure to untrusted content — the agent processes content from external, potentially adversarial sources: web pages, emails, shared documents
- An exfiltration vector — the agent can take actions that send data outward: rendering images, making API calls, creating links, writing files
All three together create the conditions for a complete attack. An agent with database access reads a malicious web page (exposure), and the injection instructs it to query the user's data and encode it into an image URL (exfiltration). The Supabase MCP breach in mid-2025 is a real example of this exact chain.
How to use the lab
The lab is a single-page app with 28 test pages accessible at clean URLs. Each page looks exactly like normal web content — the lab chrome (tier badge, technique tags, reveal toggle) is visible to you but stripped entirely from the AI-facing version. Running a test takes three steps:
- Open the Lab → navigate to the Test Pages section. Find a test to try — start with Tier 1 for basic hiding techniques, or jump straight to Tier 5 for behavioral attacks.
- Copy the AI-facing URL → each test card shows the clean URL your agent will visit. This page has no lab branding at all — it looks like a real website.
- Ask your agent to read it → paste the URL and ask the agent to summarize, analyze, or browse the page. Watch whether it outputs the embedded passphrase (each test has a unique NATO-phonetic code like ECHO-FOXTROT-7) or takes any unintended actions.
If the agent outputs the passphrase — the injection succeeded. The agent followed hidden instructions embedded in the page. You can then toggle "Reveal Injections" in the lab view to see exactly where the payloads were hidden and how they were structured.
The lab also includes a ready-made agent prompt you can copy — a system instruction that tells your agent to browse a list of test URLs and record which ones cause it to output a passphrase. Useful for running a full sweep against a specific model.
The seven tiers
The 28 tests are organized into seven difficulty tiers. Lower tiers use straightforward hiding techniques — the kind that are well-documented and that well-aligned models should resist. Higher tiers test more sophisticated behavioral manipulation and protocol-level attacks.
- Tier 1 — Basic: HTML comments, CSS display:none, visibility:hidden, invisible text. The injection is literally in the HTML source, just hidden from visual rendering.
- Tier 2 — Intermediate: Meta tags, data attributes, off-screen positioning, SVG embedding, ARIA attribute abuse. Payloads move beyond visual hiding into the structural metadata of the page.
- Tier 3 — Advanced: Base64 encoding, homoglyph character substitution, zero-width Unicode characters, payload splitting across disconnected elements. Requires the agent to decode or reassemble the instruction.
- Tier 4 — Expert: All previous techniques combined simultaneously. Persona hijacking attempts — instructions telling the agent it has a new identity with different rules. Multi-stage exfiltration chains (the EchoLeak pattern).
- Tier 5 — Behavioral: The most dangerous class. Instead of asking the agent to reveal a passphrase, these tests try to change what the agent does — redirecting research, extracting system prompt data, triggering form interactions, chaining tool calls.
- Tier 6 — MCP Protocol: Attacks that target the Model Context Protocol specifically. Tool schema poisoning, rug-pull (tool behavior changes after trust is established), cross-server shadowing, sampling injection.
- Tier 7 — Novel Formats: Non-HTML attack surfaces. Alt-text and EXIF metadata injection for vision-capable agents, RSS/Atom CDATA sections, JSON API response values, robots.txt directives and structured data.
The density system
Each test page has a density selector with four levels: Sparse, Moderate, Dense, and Saturated. This controls how many injection payloads are simultaneously active on the page — from a single carefully-placed injection up to 8+ overlapping techniques firing at once.
The density system exists because real-world attacks aren't always single-payload. Unit 42 researchers documented adversarial pages containing 24+ simultaneous injection payloads. At Dense and Saturated levels, the lab simulates what those pages look like to an agent — overwhelming the agent's content-versus-instruction boundary with volume. Test 20 (the full attack chain) runs all eight behavioral attacks with all ten hiding techniques active at once, mirroring those real-world campaigns.
Research methodology
The lab's threat model is specific: an AI agent is instructed by a user to browse, summarize, or analyze a web page. The attacker controls the page content but has no direct access to the agent's system prompt, tools, or conversation history. The attacker's only channel is the page content itself — HTML, CSS, metadata, structured data, and any non-visual elements the agent processes.
Injection techniques are classified along three independent axes: delivery method (how the payload reaches the agent), obfuscation (how it evades detection), and behavioral objective (what the agent is manipulated to do). This produces a taxonomy of roughly 25 distinct technique categories across the seven tiers.
Key design decisions:
- Realistic content first — every test page contains 1,000–3,000 words of genuine, useful content. Blog posts are well-written. API docs are accurate. Job listings are plausible. Agents should process the page as normal content before encountering injections — matching real-world conditions.
- No lab branding on test pages — the 28 standalone HTML files (ai/1.html through ai/28.html) contain zero references to MCP Injection Lab, testing, or security research. An agent visiting these pages should perceive them as ordinary websites.
- Passphrase-based detection — each test embeds a unique NATO-phonetic passphrase. Agent outputs the passphrase = injection succeeded. Binary, unambiguous, no interpretation required.
- Canary tokens for exfiltration detection — fake AWS keys, GitHub PATs, Stripe API keys, and database connection strings are embedded in several tests. If the agent surfaces these in its output, it demonstrates willingness to extract and share credential-like data from processed content.
Novel contributions
Beyond implementing known techniques from published CVEs and research, the lab introduces several attack vectors not well-represented in existing injection benchmarks. Test 25 targets multimodal metadata: alt-text descriptions, EXIF field simulations, and structured image data for vision-capable agents. Tests 26–28 target non-HTML formats that most injection benchmarks ignore entirely: RSS/Atom feed CDATA sections, JSON API response fields and error messages, and crawler directive abuse via robots.txt, meta robots, and sitemap.xml.
The density system is also novel as a research instrument. By controlling injection volume as an independent variable, it enables systematic study of the relationship between payload count and agent susceptibility — something existing benchmarks treat as fixed.
Why this matters now
Prompt injection has been a theoretical concern since the earliest days of LLM-powered agents. In 2025 it became operational. EchoLeak (CVE-2025-32711) demonstrated markdown image exfiltration from Microsoft Copilot — an agent processing an email with an injected payload would silently encode sensitive data into an image URL request to an attacker-controlled server. MCPoison (CVE-2025-54136) showed tool description poisoning in MCP servers. The Supabase MCP breach combined all three elements of the lethal trifecta in a single real-world attack.
The OWASP LLM Top 10 ranks prompt injection as the number one attack vector for LLM applications. The OWASP MCP Top 10, published in 2025, identifies ten critical risks specific to Model Context Protocol-integrated systems — tool poisoning, rug pull, cross-server injection, sampling abuse, and others. All ten are represented in the lab's Tier 6 tests.
The ATTESTMCP defense framework — which adds cryptographic capability attestation, message authentication, and server isolation to MCP — reduced overall attack success rates from 52.8% to 12.4% in testing. That 76.5% reduction suggests the attack surface is meaningful and the defenses are tractable. It also means the 12.4% of attacks that still succeed through a hardened framework deserve serious attention.
Limitations
- Passphrase-based success metrics may not capture subtle behavioral changes — an agent that softens its tone, changes its recommendations, or omits certain information because of an injection won't necessarily trigger any detection
- Tests are static — real-world attacks can adapt dynamically based on agent responses
- The leaderboard data is illustrative, based on published benchmarks and external evaluations, not direct systematic testing through the lab
- The lab tests indirect injection only — direct injection (where the user themselves provides adversarial input) is out of scope