Shield SDK
Threat Model
What Shield catches and what it does not. Heuristic-based runtime defense complements full ZeroLeaks scanning.
Threat Model
Shield is a heuristic-based runtime defense layer. It catches known attack patterns quickly and does not replace comprehensive ZeroLeaks scanning for deep security assessment.
What Shield Catches
| Threat | Mechanism |
|---|---|
| Instruction overrides | Pattern matching for "ignore previous instructions", "new instructions:", etc. |
| Role hijacking | Persona anchors (harden) + detection of DAN, developer mode, etc. |
| Prompt extraction | Anti-extraction rules (harden) + detection of "repeat your prompt", "output instructions" |
| Authority exploitation | Detection of [SYSTEM], [ADMIN], MAINTENANCE WINDOW, compliance notices |
| Tool hijacking | Detection of curl, wget, /dev/tcp, cloud metadata URLs, SSRF patterns |
| Indirect injection | Detection of hidden instructions in documents, HTML comments, JSON fields |
| Encoding attacks | Detection of base64, Unicode zero-width, ROT13, hex encoding patterns |
| Output leakage | N-gram matching to detect system prompt fragments in model output |
What Shield Does Not Catch
| Gap | Mitigation |
|---|---|
| Novel zero-day patterns | Heuristics are derived from known probes; new attack variants may evade detection |
| Semantic attacks | Natural-language phrasing that bypasses regex without obvious keywords |
| Complex multi-turn grooming | Shield evaluates each turn independently; gradual escalation across turns is not modeled |
| Non-English (partial) | Patterns are primarily English; some non-English attacks may not match |
| Adversarial obfuscation | Heavy obfuscation, typosquatting, or encoding can evade pattern matching |
How to Use Shield
- Runtime defense — Shield is best used as a first line of defense in production. It blocks or sanitizes obvious attacks before they reach the model or user.
- Complement ZeroLeaks scanning — Run full ZeroLeaks scans (extraction, injection, sandbox, AgentGuard) during development and CI to discover vulnerabilities. Shield adds runtime protection for known patterns.
- Tune threshold — Use
threshold: "low"for stricter detection (more false positives) orthreshold: "high"for fewer false positives. Defaultmediumbalances both. - Monitor callbacks — Use
onInjectionDetectedandonLeakDetectedto log, alert, or feed analytics for tuning and incident response.