Attack Categories

ZeroLeaks uses 19 attack categories to probe system prompts. Each category represents a distinct technique or research-backed approach.

Extraction Categories

direct

Straightforward extraction attempts. Asks directly for the system prompt, instructions, or configuration. Used for reconnaissance and baseline testing.

encoding

Bypass via encoding. Uses Base64, ROT13, Unicode, or other encodings to hide extraction intent from simple filters. Probes whether the model decodes and follows encoded instructions.

Persona-based jailbreaks. DAN, DUDE, STAN, and similar roleplay personas that ask the model to adopt an alternate identity with fewer restrictions. Tests whether the model maintains its configured persona.

Psychological manipulation. Authority claims, urgency, fake research framing, or social pressure to extract information. Tests resistance to social engineering.

technical

Context manipulation and injection. Instruction override, XML/HTML injection, format exploitation. Tests whether the model prioritizes system instructions over user-supplied structure.

crescendo

Multi-turn gradual escalation. Starts with benign requests and slowly escalates toward extraction. Based on Crescendo (Microsoft/USENIX) research. Harder to detect than single-turn attacks.

many_shot

Context priming with examples. Provides fabricated conversation history or few-shot examples that prime the model to reveal instructions. Based on Many-Shot Jailbreaking (Anthropic).

cot_hijack

Chain-of-thought manipulation. Hijacks the model's reasoning process to steer it toward disclosure. Exploits models that "think aloud."

policy_puppetry

Format exploitation. Asks for system information in YAML, JSON, XML, or config-like formats. Many models are less guarded when outputting structured data.

ascii_art

Visual obfuscation. Uses ASCII art or similar patterns to hide extraction intent from text-based detectors. Based on ArtPrompt (ACL 2024).

context_overflow

Long context attacks. Floods the context with content to push instructions out of attention or to inject instructions at boundaries. Tests context window handling.

reasoning_exploit

Reasoning-model-specific attacks. Targets models with extended reasoning (e.g., o1, o3) by exploiting how they process chain-of-thought.

semantic_shift

Best-of-N style variations. Generates multiple semantically similar prompts and uses the one most likely to succeed. Based on Best-of-N jailbreaking research.

Hybrid and Advanced Categories

hybrid

Hybrid prompt injection. Combines prompt injection with XSS/CSRF-style patterns, template injection, and protocol confusion. Based on Prompt Injection 2.0 research.

tool_exploit

Tool calling exploits. iMIST-style attacks, MCP injection, function call manipulation, auth bypass, agent chain exploitation. For prompts that use tools.

siren

Siren framework multi-turn patterns. Simulates human jailbreak behaviors with trust building and gradual escalation. Multi-turn orchestration.

echo_chamber

Echo chamber gradual escalation. Each message appears benign but builds toward extraction across turns. Based on Echo Chamber (arXiv 2601.05742).

injection

Direct prompt injection. Tests whether the model follows injected instructions (instruction override, role hijack, policy bypass, etc.). Used in injection and Full scans.

Category Selection

The Strategist selects categories based on target analysis and defense fingerprinting. Failed categories are recorded and may influence subsequent strategy. The Mutator generates Best-of-N variations within selected categories to improve success rates.

Attack Categories

On this page