6.1 Defense-in-Depth Model
What you'll learn
- The 7-layer defense stack and what fails when each layer is missing
- Five agent-specific risks and which layer mitigates each
- The minimum viable security posture vs. the production-grade one
Securing an agent embedded in your product does not come down to one security boundary. It takes several layers, each enforced independently of the others. Never bet safety on a single layer.
Seven-layer defense stack
User request / upstream system
│
▼
┌─────────────────────────────┐
│ 1. Business access control │ "Can this user even reach the agent?"
│ (SSO / RBAC / tenant) │
└─────────────────────────────┘
│
▼
┌─────────────────────────────┐
│ 2. Credential boundary │ "What identity does the agent act as?"
│ (API key / STS / svcacct)│
└─────────────────────────────┘
│
▼
┌─────────────────────────────┐
│ 3. Prompt-injection defense │ "Can user input override LLM rules?"
│ (AGENTAO.md constraints, │
│ tool-output tagging) │
└─────────────────────────────┘
│
▼
┌─────────────────────────────┐
│ 4. Permission engine │ "Is this tool call allowed?" — rule-level
│ (PermissionEngine rules) │
└─────────────────────────────┘
│
▼
┌─────────────────────────────┐
│ 5. Tool confirmation │ "Does the user approve?" — human-level
│ (confirm_tool UI) │
└─────────────────────────────┘
│
▼
┌─────────────────────────────┐
│ 6. Shell sandbox │ "What can the command actually do?"
│ (macOS sandbox-exec) │ — kernel-level
└─────────────────────────────┘
│
▼
┌─────────────────────────────┐
│ 7. Network isolation │ "What IPs are reachable?"
│ (container / VPC / egress│ — infrastructure-level
└─────────────────────────────┘
│
▼
Actual execution
│
▼
┌─────────────────────────────┐
│ 8. Audit log (cross-cutting)│ "What happened? Who signed off?"
└─────────────────────────────┘
Core principle: if one layer fails, the others still hold.
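The independence of the layers can be illustrated with a small sketch. Everything here is hypothetical and is not Agentao's API: `ToolCall`, the three layer functions, and `evaluate` are made up for illustration. Each layer inspects the same request on its own and can veto it without knowing what the other layers decided.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical request shape; not an Agentao type."""
    user: str
    tool: str
    argument: str


# A layer returns None to allow, or a string explaining why it denies.
Layer = Callable[[ToolCall], Optional[str]]


def access_control(call: ToolCall) -> Optional[str]:
    allowed_users = {"alice", "bob"}  # stand-in for SSO / RBAC
    return None if call.user in allowed_users else "user may not reach the agent"


def permission_rules(call: ToolCall) -> Optional[str]:
    allowed_tools = {"read_file", "web_fetch"}  # stand-in for permission rules
    return None if call.tool in allowed_tools else f"tool {call.tool!r} is not allowed"


def network_isolation(call: ToolCall) -> Optional[str]:
    # Stand-in for an egress rule that blocks the cloud metadata endpoint.
    if call.tool == "web_fetch" and "169.254.169.254" in call.argument:
        return "metadata endpoint is unreachable"
    return None


def evaluate(call: ToolCall, layers: List[Layer]) -> List[str]:
    """Run every layer and collect all denials; an empty list means allowed."""
    denials = []
    for layer in layers:
        reason = layer(call)
        if reason is not None:
            denials.append(reason)
    return denials


LAYERS = [access_control, permission_rules, network_isolation]

if __name__ == "__main__":
    call = ToolCall("alice", "web_fetch", "http://169.254.169.254/")
    print(evaluate(call, LAYERS))      # the network layer denies
    print(evaluate(call, LAYERS[:2]))  # without it, the call passes
```

Dropping `network_isolation` from the list lets the metadata request through, which is the kind of gap the stack is meant to expose when a layer is missing.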
Threat model: five agent-specific risks
| Risk | Attack path | Primary defense |
|---|---|---|
| Prompt injection | User input / web content / file content / tool output carries hidden instructions | Layers 3 + 4 |
| Credential leakage | LLM reply or log contains API keys / DB passwords | Layers 2 + 8 (scrubbing) |
| Privilege escalation | Agent tool crosses tenants or escalates privilege | Layers 1 + 4 + multi-tenant isolation (6.4) |
| SSRF (internal network) | LLM coaxed into web_fetch http://169.254.169.254/ | Layer 4 (domain blocklist) + Layer 7 |
| Resource exhaustion | Infinite tool loops, huge files, context explosion | 6.7 resource governance |
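To make the SSRF row concrete, here is a minimal sketch of a pre-fetch check built only on the Python standard library. The function names `is_blocked_address` and `check_fetch_url` are hypothetical, and this is not how Agentao's domain blocklist is implemented. It rejects non-HTTP schemes and any host that resolves to a private, loopback, or link-local address such as 169.254.169.254.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_blocked_address(ip_text: str) -> bool:
    """True for addresses an agent should not reach from inside your network."""
    # Drop an IPv6 zone suffix such as "%eth0" before parsing.
    ip = ipaddress.ip_address(ip_text.split("%", 1)[0])
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )


def check_fetch_url(url: str, resolve=socket.getaddrinfo) -> bool:
    """Return True only if the URL is http(s) and every resolved address is public.

    `resolve` is injectable so the check can be tested without DNS.
    """
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        infos = resolve(parts.hostname, parts.port or 80, type=socket.SOCK_STREAM)
    except (OSError, ValueError):
        return False  # unresolvable host or malformed port: fail closed
    addresses = {info[4][0] for info in infos}
    return bool(addresses) and not any(is_blocked_address(a) for a in addresses)
```

A check like this runs at rule-evaluation time, so the name could resolve to a different address when the request is actually made (DNS rebinding). That is one reason the table pairs Layer 4 with Layer 7 instead of relying on the rule alone.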
Responsibilities
| Layer | Owner | When to revisit |
|---|---|---|
| 1. Access control | Your app | When users/roles change |
| 2. Credentials | DevOps | On issuance / rotation |
| 3. Prompt-injection defense | Developers (AGENTAO.md) | Design time |
| 4. Permission rules | Developers + security | Each new tool |
| 5. Confirmation UI | Frontend developers | Design time |
| 6. Shell sandbox | Platform team | Config time |
| 7. Network isolation | Ops / network | Deploy time |
| 8. Audit logs | Platform team | Runtime (monitored) |
Minimum vs ideal
Minimum (prototype / internal tool):
- `OPENAI_API_KEY` set properly (not committed to git)
- `PermissionEngine` on `WORKSPACE_WRITE` preset
- `working_directory=` per session
- Basic `confirm_tool` implementation
- Default `agentao.log`
Enough for demos and internal use, not customer-facing.
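A per-session `working_directory` only helps if file paths are kept inside it. The sketch below shows the containment check in plain Python. It is an illustration under assumptions, not Agentao's implementation, and `resolve_in_workspace` is a hypothetical helper name.

```python
from pathlib import Path


def resolve_in_workspace(working_directory: Path, requested: str) -> Path:
    """Resolve `requested` against the session root and refuse paths that escape it."""
    root = working_directory.resolve()
    # resolve() collapses ".." segments and follows symlinks, so the
    # comparison below sees the real target of the request.
    candidate = (root / requested).resolve()
    if not candidate.is_relative_to(root):  # Path.is_relative_to needs Python 3.9+
        raise PermissionError(f"{requested!r} escapes the session workspace")
    return candidate
```

Giving each session its own root, for example a directory created with `tempfile.mkdtemp()`, means a `..` traversal or an absolute path from one session cannot land in another session's files.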
Ideal (production SaaS):
- Per-tenant STS / service account, least-privilege
- Custom `PermissionEngine` per tenant plan
- Per-session `working_directory` + container isolation
- Confirmation UI tied to SSO, approvals recorded for compliance
- macOS: `sandbox-exec`; Linux: seccomp/namespaces (see 6.2)
- Network egress rules (VPC + allowlist)
- Logs to SIEM, metrics to APM
- Red-team tests for prompt injection
The following sections cover each layer in detail.
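Before logs leave the process for a SIEM, the scrubbing mentioned in the threat table can be applied with a standard `logging.Filter`. This is a minimal sketch: the class name `ScrubSecrets` and the two patterns are assumptions chosen for illustration, and real deployments need patterns matched to the credentials they actually handle.

```python
import logging
import re

# Illustrative patterns only; extend them for the secrets in your environment.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9_-]{16,}"),          # API keys with an "sk-" prefix (assumed shape)
    re.compile(r"(?i)(password|passwd|pwd)=\S+"),  # key=value style passwords
]


class ScrubSecrets(logging.Filter):
    """Redact secret-looking substrings from every record that passes through."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # format first so arguments are scrubbed too
        for pattern in SECRET_PATTERNS:
            message = pattern.sub("[REDACTED]", message)
        record.msg, record.args = message, ()
        return True  # keep the record, now redacted


# Attach the filter to the handler so it applies to records from every logger
# that writes through it, not just one named logger.
handler = logging.StreamHandler()
handler.addFilter(ScrubSecrets())
```

Pattern-based scrubbing is a backstop, not a guarantee: it only catches shapes it knows about, so keeping credentials out of prompts and tool output in the first place (Layer 2) still matters.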
Pre-deployment checklist
- [ ] No credentials hard-coded in code or AGENTAO.md
- [ ] `PermissionEngine` rules cover every custom tool and MCP server
- [ ] `confirm_tool` has a timeout (no infinite wait)
- [ ] `working_directory` isolated per session
- [ ] `agentao.log` path is writable and persisted
- [ ] Unit tests cover "expected rule hits"
- [ ] Alerts: tool failure rate, LLM 5xx rate, confirm-timeout rate
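The confirmation-timeout item can be sketched with `asyncio.wait_for`. The wrapper below is hypothetical: `confirm_with_timeout` and the `ask_user` callback are illustrative names and do not reflect the actual `confirm_tool` signature. The point it shows is the failure mode: when nobody answers in time, the call is denied.

```python
import asyncio
from typing import Awaitable, Callable


async def confirm_with_timeout(
    ask_user: Callable[[str], Awaitable[bool]],  # hypothetical UI callback
    description: str,
    timeout_seconds: float = 30.0,
) -> bool:
    """Ask for approval, treating no answer within the timeout as a denial."""
    try:
        return await asyncio.wait_for(ask_user(description), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        # Fail closed: an unanswered prompt must never become an approval.
        return False


async def _demo() -> None:
    async def never_answers(_: str) -> bool:
        await asyncio.sleep(10)
        return True

    approved = await confirm_with_timeout(never_answers, "rm -rf build/", timeout_seconds=0.05)
    print("approved" if approved else "denied")


if __name__ == "__main__":
    asyncio.run(_demo())
```

Counting how often the `except` branch runs gives the confirm-timeout rate that the alerting item in the checklist refers to.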
TL;DR
- Never bet safety on a single layer. 7 stacked layers + cross-cutting audit.
- The five agent-specific risks: prompt injection, credential leakage, privilege escalation, SSRF, resource exhaustion.
- Minimum viable: API key off git + `WORKSPACE_WRITE` preset + per-session `working_directory` + basic `confirm_tool` + default log.
- Production-grade: per-tenant STS + custom `PermissionEngine` + container isolation + sandbox-exec / seccomp + VPC egress rules + SIEM logs + red-team drills.