ZeroLeaks
Shield SDK

Threat Model

What Shield catches and what it does not. Heuristic-based runtime defense complements full ZeroLeaks scanning.

Threat Model

Shield is a heuristic-based runtime defense layer. It catches known attack patterns quickly and does not replace comprehensive ZeroLeaks scanning for deep security assessment.

What Shield Catches

ThreatMechanism
Instruction overridesPattern matching for "ignore previous instructions", "new instructions:", etc.
Role hijackingPersona anchors (harden) + detection of DAN, developer mode, etc.
Prompt extractionAnti-extraction rules (harden) + detection of "repeat your prompt", "output instructions"
Authority exploitationDetection of [SYSTEM], [ADMIN], MAINTENANCE WINDOW, compliance notices
Tool hijackingDetection of curl, wget, /dev/tcp, cloud metadata URLs, SSRF patterns
Indirect injectionDetection of hidden instructions in documents, HTML comments, JSON fields
Encoding attacksDetection of base64, Unicode zero-width, ROT13, hex encoding patterns
Output leakageN-gram matching to detect system prompt fragments in model output

What Shield Does Not Catch

GapMitigation
Novel zero-day patternsHeuristics are derived from known probes; new attack variants may evade detection
Semantic attacksNatural-language phrasing that bypasses regex without obvious keywords
Complex multi-turn groomingShield evaluates each turn independently; gradual escalation across turns is not modeled
Non-English (partial)Patterns are primarily English; some non-English attacks may not match
Adversarial obfuscationHeavy obfuscation, typosquatting, or encoding can evade pattern matching

How to Use Shield

  • Runtime defense — Shield is best used as a first line of defense in production. It blocks or sanitizes obvious attacks before they reach the model or user.
  • Complement ZeroLeaks scanning — Run full ZeroLeaks scans (extraction, injection, sandbox, AgentGuard) during development and CI to discover vulnerabilities. Shield adds runtime protection for known patterns.
  • Tune threshold — Use threshold: "low" for stricter detection (more false positives) or threshold: "high" for fewer false positives. Default medium balances both.
  • Monitor callbacks — Use onInjectionDetected and onLeakDetected to log, alert, or feed analytics for tuning and incident response.

On this page