Attack Categories
The 19 attack categories ZeroLeaks uses for extraction and injection testing.
Attack Categories
ZeroLeaks uses 19 attack categories to probe system prompts. Each category represents a distinct technique or research-backed approach.
Extraction Categories
direct
Straightforward extraction attempts. Asks directly for the system prompt, instructions, or configuration. Used for reconnaissance and baseline testing.
encoding
Bypass via encoding. Uses Base64, ROT13, Unicode, or other encodings to hide extraction intent from simple filters. Probes whether the model decodes and follows encoded instructions.
persona
Persona-based jailbreaks. DAN, DUDE, STAN, and similar roleplay personas that ask the model to adopt an alternate identity with fewer restrictions. Tests whether the model maintains its configured persona.
social
Psychological manipulation. Authority claims, urgency, fake research framing, or social pressure to extract information. Tests resistance to social engineering.
technical
Context manipulation and injection. Instruction override, XML/HTML injection, format exploitation. Tests whether the model prioritizes system instructions over user-supplied structure.
crescendo
Multi-turn gradual escalation. Starts with benign requests and slowly escalates toward extraction. Based on Crescendo (Microsoft/USENIX) research. Harder to detect than single-turn attacks.
many_shot
Context priming with examples. Provides fabricated conversation history or few-shot examples that prime the model to reveal instructions. Based on Many-Shot Jailbreaking (Anthropic).
cot_hijack
Chain-of-thought manipulation. Hijacks the model's reasoning process to steer it toward disclosure. Exploits models that "think aloud."
policy_puppetry
Format exploitation. Asks for system information in YAML, JSON, XML, or config-like formats. Many models are less guarded when outputting structured data.
ascii_art
Visual obfuscation. Uses ASCII art or similar patterns to hide extraction intent from text-based detectors. Based on ArtPrompt (ACL 2024).
context_overflow
Long context attacks. Floods the context with content to push instructions out of attention or to inject instructions at boundaries. Tests context window handling.
reasoning_exploit
Reasoning-model-specific attacks. Targets models with extended reasoning (e.g., o1, o3) by exploiting how they process chain-of-thought.
semantic_shift
Best-of-N style variations. Generates multiple semantically similar prompts and uses the one most likely to succeed. Based on Best-of-N jailbreaking research.
Hybrid and Advanced Categories
hybrid
Hybrid prompt injection. Combines prompt injection with XSS/CSRF-style patterns, template injection, and protocol confusion. Based on Prompt Injection 2.0 research.
tool_exploit
Tool calling exploits. iMIST-style attacks, MCP injection, function call manipulation, auth bypass, agent chain exploitation. For prompts that use tools.
siren
Siren framework multi-turn patterns. Simulates human jailbreak behaviors with trust building and gradual escalation. Multi-turn orchestration.
echo_chamber
Echo chamber gradual escalation. Each message appears benign but builds toward extraction across turns. Based on Echo Chamber (arXiv 2601.05742).
injection
Direct prompt injection. Tests whether the model follows injected instructions (instruction override, role hijack, policy bypass, etc.). Used in injection and Full scans.
Category Selection
The Strategist selects categories based on target analysis and defense fingerprinting. Failed categories are recorded and may influence subsequent strategy. The Mutator generates Best-of-N variations within selected categories to improve success rates.