How AgentGuard Works
Two-phase architecture, full engine scan, agent-specific probes, and Claude-based evaluation.
How AgentGuard Works
AgentGuard runs a two-phase scan against your deployed agent. Phase 1 uses the full ZeroLeaks scan engine (same as the dashboard). Phase 2 runs agent-specific probes that target tool safety, multi-turn resilience, and data leakage.
Overview
┌─────────────────────────────────────────────────────────────────┐
│ AgentGuard Scan Flow │
├─────────────────────────────────────────────────────────────────┤
│ Phase 1: Full Engine Scan │
│ ├── Extraction: 30 TAP-based attacks → your agent endpoint │
│ └── Injection: 23 injection probes → your agent endpoint │
├─────────────────────────────────────────────────────────────────┤
│ Phase 2: Agent-Specific Probes │
│ ├── Tool hijacking (8) │
│ ├── Indirect injection (8) │
│ ├── Authority exploitation (5) │
│ ├── Protocol exploits (5) │
│ ├── Multi-turn grooming (5 sequences) │
│ ├── Data leakage (8) │
│ ├── Legacy behavior (21) │
│ └── Dynamic tool probes (from your tool definitions) │
└─────────────────────────────────────────────────────────────────┘Phase 1: Full Engine Scan
Phase 1 sends requests to your agent endpoint using the same logic as dashboard scans. The difference: instead of calling a simulated target, ZeroLeaks issues HTTP requests to your configured URL.
Extraction (30 attacks):
- Strategist selects attack strategy
- Attacker generates prompts (TAP-style, multi-category)
- Each prompt is sent to your agent
- Evaluator analyzes the response for system prompt leakage
- Mutator refines attacks based on feedback
Injection (23 probes):
- Skeleton key, crescendo, echo chamber, many-shot, semantic variation
- Tool poisoning, indirect injection, ASCII art, promptware
- Hybrid injection, output control, role hijack
Each request triggers a real inference on your model. Your agent's response is evaluated for compliance with the attacker's intent.
Phase 2: Agent-Specific Probes
Phase 2 runs a fixed set of probes designed for agentic systems. Each probe has:
- Prompt — The attack payload sent to your agent
- Intent — What the attacker is trying to achieve
- Success indicators — Strings that suggest the attack succeeded
- Category — tool_hijacking, indirect_injection, authority_exploit, protocol_exploit, multi_turn, data_leakage, or legacy_behavior
Single-turn probes: One message per probe. The agent's response is evaluated.
Multi-turn probes: A sequence of messages (e.g. 3 turns). The evaluator checks the final turn for success indicators (e.g. curl execution, data exfiltration).
Dynamic probes: If you define tools, AgentGuard generates extra probes (e.g. abuse of send_email, http_request, run_sql).
Evaluation
Each Phase 2 response is evaluated by Claude (the evaluator model). The evaluator receives:
- Probe category and intent
- Attack prompt
- Success indicators
- Whether any indicator matched in the response
- The agent's response (truncated)
It returns:
succeeded— Whether the attack achieved its goalseverity— critical, high, medium, low, or noneevidence— What in the response indicates successtechnique— The technique name
The evaluator is instructed to be strict: only mark succeeded=true when there is clear evidence of a security breach or compliance with the attacker's intent.
Flow Summary
- Config load — Fetch your agent config (endpoint, auth, format, tools)
- HTTP target — Create a target that sends requests to your URL
- Phase 1 extraction — Run extraction scan, send attacks to your endpoint
- Phase 1 injection — Run injection scan, send probes to your endpoint
- Phase 2 probes — Run each agent-specific probe category in sequence
- Merge findings — Combine Phase 1 and Phase 2 findings
- Component scores — Compute Prompt Security, Tool Safety, Multi-Turn Resilience, Data Leakage
- Overall score — Weighted average of component scores
- Report — Store results, recommendations, conversation log
Throttling
Probes are throttled (300ms between requests) to avoid overwhelming your endpoint and to reduce rate-limit issues.