What Shield catches and what it does not. Heuristic-based runtime defense complements full ZeroLeaks scanning.

Threat Model

Shield is a heuristic-based runtime defense layer. It catches known attack patterns quickly and does not replace comprehensive ZeroLeaks scanning for deep security assessment.

What Shield Catches

Threat	Mechanism
Instruction overrides	Pattern matching for "ignore previous instructions", "new instructions:", etc.
Role hijacking	Persona anchors (harden) + detection of DAN, developer mode, etc.
Prompt extraction	Anti-extraction rules (harden) + detection of "repeat your prompt", "output instructions"
Authority exploitation	Detection of [SYSTEM], [ADMIN], MAINTENANCE WINDOW, compliance notices
Tool hijacking	Detection of curl, wget, /dev/tcp, cloud metadata URLs, SSRF patterns
Indirect injection	Detection of hidden instructions in documents, HTML comments, JSON fields
Encoding attacks	Detection of base64, Unicode zero-width, ROT13, hex encoding patterns
Output leakage	N-gram matching to detect system prompt fragments in model output

What Shield Does Not Catch

Gap	Mitigation
Novel zero-day patterns	Heuristics are derived from known probes; new attack variants may evade detection
Semantic attacks	Natural-language phrasing that bypasses regex without obvious keywords
Complex multi-turn grooming	Shield evaluates each turn independently; gradual escalation across turns is not modeled
Non-English (partial)	Patterns are primarily English; some non-English attacks may not match
Adversarial obfuscation	Heavy obfuscation, typosquatting, or encoding can evade pattern matching

How to Use Shield

Runtime defense — Shield is best used as a first line of defense in production. It blocks or sanitizes obvious attacks before they reach the model or user.
Complement ZeroLeaks scanning — Run full ZeroLeaks scans (extraction, injection, sandbox, AgentGuard) during development and CI to discover vulnerabilities. Shield adds runtime protection for known patterns.
Tune threshold — Use threshold: "low" for stricter detection (more false positives) or threshold: "high" for fewer false positives. Default medium balances both.
Monitor callbacks — Use onInjectionDetected and onLeakDetected to log, alert, or feed analytics for tuning and incident response.

Threat Model

Threat Model

What Shield Catches

What Shield Does Not Catch

How to Use Shield

On this page