Appendix F · FAQ & Troubleshooting
Organized by symptom, not by part. Each entry links back to the main guide for detail.
F.1 Setup & startup
"ImportError: cannot import name 'Agentao'"
- Check you installed the package (uv add agentao or pip install agentao, not just uv add openai)
- Import from the top level: from agentao import Agentao (not from agentao.agent import Agentao; that path is not guaranteed stable)
"No module named 'openai' / 'mcp'"
Install with the full extras if you need MCP:
uv add 'agentao[mcp]' # or
uv add 'agentao[all]'
"ValueError: OPENAI_API_KEY is not set"
Three resolution options:
- .env at working-directory root, with OPENAI_API_KEY=…
- Process env: export OPENAI_API_KEY=…
- Constructor: Agentao(api_key="sk-…")
Constructor wins over env, which wins over .env. See Appendix B.
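The precedence above can be sketched as a small lookup. This is an illustration of the ordering only, not Agentao's own code; the function name resolve_api_key and its three arguments are hypothetical.

```python
from typing import Mapping, Optional


def resolve_api_key(
    explicit: Optional[str],
    environ: Mapping[str, str],
    dotenv: Mapping[str, str],
) -> str:
    """Return the first key found: constructor arg, then process env, then .env."""
    candidates = (
        explicit,                        # Agentao(api_key=...) in the real library
        environ.get("OPENAI_API_KEY"),   # exported in the process environment
        dotenv.get("OPENAI_API_KEY"),    # parsed from the .env file
    )
    for candidate in candidates:
        if candidate:
            return candidate
    raise ValueError("OPENAI_API_KEY is not set")
```

If a key you did not expect is being used, walking the three sources in this order usually shows which one is shadowing the others.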
"Model 'gpt-5.4' not found" (custom endpoint)
The default model id is gpt-5.4. If your endpoint offers different models, pass model= or set OPENAI_MODEL. See 2.2.
F.2 Runtime behavior
Agent says "Tool execution cancelled by user" on every write
You set PermissionMode.READ_ONLY (explicitly or by accident). Either:
- Construct an engine and switch modes explicitly: e = PermissionEngine(project_root=workdir); e.set_mode(PermissionMode.WORKSPACE_WRITE); agent = Agentao(working_directory=workdir, permission_engine=e, ...) (project_root= is required since 0.2.16)
- Or implement a confirm_tool callback on the transport so users can approve interactively
chat() never returns
Three likely causes:
- Infinite tool loop: the agent keeps calling tools until it hits max_iterations. Lower the limit or wire on_max_iterations (4.6)
- Tool hangs: a custom tool has no timeout. Wrap subprocess / HTTP calls in timeout= (6.7)
- User prompt needs input: the default ask_user waits forever in headless mode. Override via SdkTransport(ask_user=…)
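For the hanging-tool case, a minimal sketch of a bounded subprocess call using only the standard library. The helper name run_with_timeout is hypothetical; a custom tool's execute() would call something like it instead of an unbounded subprocess.run.

```python
import subprocess
import sys


def run_with_timeout(argv: list[str], timeout_s: float) -> str:
    """Run a command; return its stdout, or an error string if it overruns."""
    try:
        done = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so nothing lingers.
        return f"error: command exceeded {timeout_s}s"
    return done.stdout


if __name__ == "__main__":
    # A child that sleeps longer than the limit is cut off after about one second.
    print(run_with_timeout([sys.executable, "-c", "import time; time.sleep(5)"], 1.0))
```

Returning an error string rather than raising lets the LLM see the failure and choose another approach instead of blocking the turn.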
Enforce a hard limit at the host:
reply = await asyncio.wait_for(asyncio.to_thread(agent.chat, msg), timeout=120)
"Why does my tool get called with weird paths?"
Tool execute() receives whatever the LLM supplied. Validate args, and use self._resolve_path(raw) to join against working_directory — see Tool base class.
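To show the kind of validation meant here, a standalone sketch of joining an LLM-supplied path against a working directory and rejecting escapes. It is not the implementation of self._resolve_path; resolve_inside is a hypothetical helper built on pathlib.

```python
from pathlib import Path


def resolve_inside(working_directory: str, raw: str) -> Path:
    """Join raw against working_directory and reject anything that escapes it."""
    base = Path(working_directory).resolve()
    # Joining with an absolute `raw` discards `base`; the check below catches that.
    candidate = (base / raw).resolve()
    if candidate != base and base not in candidate.parents:
        raise ValueError(f"path escapes working directory: {raw!r}")
    return candidate
```

Both "../secrets" and an absolute path outside the workspace raise ValueError, which a tool can turn into an error message for the model.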
Output contains random escape sequences
These are usually ANSI color and style codes reaching a display that does not interpret them. Either:
- Disable color on the transport side before display (rich.console.Console(no_color=True))
- Strip with a post-filter; Agentao itself does not enforce a color policy
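A post-filter can be as small as one regular expression. The sketch below removes CSI sequences (the family that covers colors and cursor movement); other escape families such as OSC are not handled, and strip_ansi is a hypothetical helper name.

```python
import re

# Matches CSI sequences such as "\x1b[31m" (color) or "\x1b[2K" (erase line).
_CSI = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def strip_ansi(text: str) -> str:
    """Remove ANSI CSI escape sequences, leaving the visible text."""
    return _CSI.sub("", text)
```

Apply it to text just before it is rendered in a surface that does not understand escape codes, such as a web view or a log sink.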
F.3 Memory & sessions
"I cleared history but old context still leaks"
clear_history() only resets self.messages. The memory DB persists — that's intentional. To also wipe memory:
agent.clear_history()
agent.memory.clear(scope="project")
Memory bleeds across tenants
Classic multi-tenant trap. You shared the user-scope database (~/.agentao/memory.db) across tenants. Either:
- Pin per-tenant working directory AND disable user scope, or
- Key user-scope memories by tenant_id + user_id
See 6.4.
"Session state lost on restart"
Two-part fix:
- SDK: serialize agent.messages yourself; on restart, set agent.messages = saved_messages
- ACP: use session/load with a stored sessionId; the agent must advertise loadSession: true (7.2 pattern)
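For the SDK route, a minimal persistence sketch. It assumes agent.messages is a JSON-serializable list of dicts, which this appendix does not confirm; save_messages and load_messages are hypothetical helpers.

```python
import json
from pathlib import Path


def save_messages(messages: list[dict], path: Path) -> None:
    """Write the conversation to disk via a temp file, then rename over the target."""
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(messages), encoding="utf-8")
    tmp.replace(path)  # a crash mid-write leaves the previous file intact


def load_messages(path: Path) -> list[dict]:
    """Return the saved conversation, or an empty list on first start."""
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))
```

A host would call save_messages(agent.messages, path) after each turn and assign agent.messages = load_messages(path) when it starts up again.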
F.4 MCP
"MCP server listed but no tools appear"
Check in order:
- /mcp CLI (or agent.mcp_manager.get_status()): is the server in ready state?
- Subprocess stderr: often stdout-corruption from a server that logs to stdout
- Tool-name collision: the same {server}_{tool} registered twice triggers a warning in agentao.log
"'mcp' command not found"
Install the MCP extras. On Linux you may also need a JS runtime if the MCP server is npx-launched.
"Server fails with 'timeout'"
Three layers:
- Per-tool timeout in mcp.json ("timeout": 30)
- Transport default (~30s for stdio, ~60s for SSE)
- Your wrapping asyncio.wait_for
The tightest wins. See Appendix B.3.1.
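When debugging which layer fired, it can help to compute the effective limit explicitly. A tiny sketch, where effective_timeout is a hypothetical helper and None stands for a layer that sets no limit:

```python
from typing import Optional


def effective_timeout(*layers: Optional[float]) -> Optional[float]:
    """Return the tightest configured timeout; None means no layer sets a limit."""
    configured = [t for t in layers if t is not None]
    return min(configured) if configured else None
```

With a 30s per-tool value, a 60s transport default, and a 120s asyncio.wait_for, the call fails after 30 seconds, so raising only the outer wrapper changes nothing.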
F.5 Security & sandbox
"macOS says sandbox-exec denied"
Open agentao.log — the exact denial reason is logged. Common fixes:
- Shell profile too restrictive → switch default_profile from readonly to workspace-write-no-network
- Command outside workspace → use absolute paths inside working_directory
- See 6.2
"Sandbox disabled in production — how do I enforce it?"
Sandbox config is merged: project .agentao/sandbox.json overrides user. Mount project config read-only in your container so LLM-led changes to sandbox settings can't persist. See 7.4 pitfall table.
"Agent tried to fetch 169.254.169.254"
Expected — SSRF attempts hit the built-in blocklist. Check agentao.log for the deny record and verify your PermissionEngine rules (6.3).
F.6 ACP integration
handshake_fail on initialize
Likely a version mismatch. Agentao v0.2.x speaks protocolVersion: 1 (integer). If your client sends a string like "2025-09-01", the server rejects it. See 3.1. If the failure reaches you as an AcpRpcError instead of a plain AcpClientError(code=HANDSHAKE_FAIL), the handshake-phase signal lives in details["phase"] == "handshake" — see Appendix D §D.7 for the full classification rules.
server_busy from prompt_once
Fail-fast semantics — someone else is already in a turn. Options:
- Wait + retry
- Use the session-based API (send_prompt) if queueing is acceptable
- Spawn a dedicated subprocess per tenant
See Appendix D.5.
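The wait-and-retry option can be sketched with exponential backoff. ServerBusy below is a stand-in exception, not an Agentao class; in real code you would presumably catch AcpClientError and inspect its code, so treat the exception handling as an assumption.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ServerBusy(Exception):
    """Stand-in for the busy error; not part of Agentao."""


def retry_while_busy(call: Callable[[], T], attempts: int = 5, base_delay: float = 0.5) -> T:
    """Invoke `call`, sleeping with exponential backoff while it raises ServerBusy."""
    for attempt in range(attempts):
        try:
            return call()
        except ServerBusy:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the busy error to the caller
            time.sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

Keep the attempt count small: if the server stays busy for many seconds, queueing or a dedicated subprocess per tenant is likely the better fix.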
"session/cancel doesn't stop my long tool"
Cancellation bubbles through CancellationToken, but your custom tool must cooperate. Check self._current_token inside long loops and call token.check() between steps.
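What "cooperate" means in practice, as a standalone sketch. FakeToken and Cancelled are stand-ins written for this example; only the idea of calling check() between steps comes from the text above.

```python
import threading


class Cancelled(Exception):
    """Raised when the stand-in token has been cancelled."""


class FakeToken:
    """Stand-in for a cancellation token; not the Agentao class."""

    def __init__(self) -> None:
        self._event = threading.Event()

    def cancel(self) -> None:
        self._event.set()

    def check(self) -> None:
        if self._event.is_set():
            raise Cancelled()


def process_items(items: list[str], token: FakeToken) -> list[str]:
    """Do work one item at a time, checking for cancellation between steps."""
    done = []
    for item in items:
        token.check()  # bail out promptly instead of finishing the whole batch
        done.append(item.upper())
    return done
```

A tool that does all its work in a single blocking call has no point at which to check, so split long operations into steps or give the blocking call its own timeout.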
"How do I tell if an ACP server is usable right now?"
Don't string-match on state; call readiness(name):
if mgr.is_ready("my-server"):
    mgr.prompt_once("my-server", "hello", timeout=30)
- "ready": safe to submit.
- "busy": a turn is in flight; retrying will raise SERVER_BUSY.
- "failed": auto-recovery already handles recoverable idle exits (capped by maxRecoverableRestarts, default 3); once the sticky fatal flag is set or the cap is exhausted, an explicit restart_server() / start_server() by the operator is required.
- "not_ready": server is still starting up or winding down.
"Why is last_error still set even though my last turn succeeded?"
By design. last_error / last_error_at are sticky diagnostic fields so a host polling once per minute still sees the last-known failure. Read state (or readiness()) first for gating; treat last_error as history. To explicitly clear it, call reset_last_error(name). See Appendix D.5.
"Is last_error_at the exact raise time?"
No. It's the instant the manager stored the error, not the instant it was raised. Use it for staleness judgements (now - last_error_at > Δ), not as raise-time instrumentation. The regression suite pins this by monkey-patching datetime during a recorded error and verifying the snapshot reflects the patched clock.
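A staleness check along those lines might look like the sketch below. It assumes last_error_at is a timezone-aware datetime (or None when nothing has been recorded), which this appendix does not state; error_is_recent is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def error_is_recent(
    last_error_at: Optional[datetime],
    window: timedelta,
    now: Optional[datetime] = None,
) -> bool:
    """True when an error was recorded within `window` of `now`."""
    if last_error_at is None:
        return False
    current = now if now is not None else datetime.now(timezone.utc)
    return current - last_error_at <= window
```

Passing now explicitly keeps the check testable without patching the clock.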
"Why does my "nonInteractivePolicy": "reject_all" now raise AcpConfigError?"
Week 3 dropped the legacy bare-string form. The new shape is a structured object:
"nonInteractivePolicy": { "mode": "reject_all" }
The failure is deliberately loud and raised at config-load time (AcpClientConfig.from_dict / load_acp_client_config), not at send_prompt time, so config drift cannot quietly ship to production. For a single-call override, leave the config alone and use interaction_policy= on send_prompt / prompt_once. Full migration in Appendix E.7.
"A server crashed mid-turn. How do I recover?"
Depends on how it died (Week 4 classifier):
- Recoverable death (clean exit, non-zero idle exit within cap, stdio EOF, death during active turn): no operator action needed. The next send_prompt / prompt_once automatically rebuilds the client; mgr.restart_count(name) shows how many auto-rebuilds happened.
- Fatal death (OOM / SIGKILL / exit 137 / consecutive handshake failure / beyond maxRecoverableRestarts): the server is marked sticky-fatal. mgr.is_fatal(name) returns True; all ensure_connected calls raise AcpClientError(code=TRANSPORT_DISCONNECT, details={"recovery": "fatal"}). Call mgr.restart_server(name) or mgr.start_server(name) to acknowledge and re-enable auto-recovery.
To tune the retry cap, set maxRecoverableRestarts on the server config (default 3).
"Is ensure_connected safe to call after cancel_turn?"
Yes. Week 4's cleanup guarantees (see §7.1 of the headless runtime doc):
- The pending slot is dropped before session/cancel is sent.
- The turn slot and the per-server lock are released in finally blocks.
- last_error is recorded before the lock is released, so a parallel get_status() observes the failure on the same tick.
The next send_prompt sees a ready server with no residual busy / locked state. test_headless_runtime.py::TestDaemonRegression::test_cancel_then_continue pins this.
F.7 Deployment & ops
"Docker image is huge"
Multi-stage build — see 6.8 Dockerfile template. Key move: don't ship uv into the runtime stage.
"Kubernetes pod restarts and loses sessions"
Use StatefulSet (not Deployment) and a PVC for /data. Set Service sessionAffinity: ClientIP. See 6.8.
"How do I cap token spend per tenant?"
TokenBudget pattern — see 6.7. For exact counts use agentao[tokenizer] (pulls tiktoken).
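As a rough illustration of per-tenant capping, and not the TokenBudget class from 6.7 (whose shape is not shown in this appendix), a hypothetical in-memory tracker:

```python
from collections import defaultdict


class BudgetExceeded(Exception):
    """Raised when a charge would push a tenant past its cap."""


class TenantBudgets:
    """Hypothetical in-memory tracker; a real deployment would persist usage."""

    def __init__(self, limit_per_tenant: int) -> None:
        self.limit = limit_per_tenant
        self.used: dict[str, int] = defaultdict(int)

    def charge(self, tenant_id: str, tokens: int) -> int:
        """Record usage and return the remaining allowance."""
        if self.used[tenant_id] + tokens > self.limit:
            raise BudgetExceeded(tenant_id)
        self.used[tenant_id] += tokens
        return self.limit - self.used[tenant_id]
```

The host would call charge() with the token count reported for each turn and refuse new prompts for a tenant once BudgetExceeded is raised.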
Cost suddenly doubled overnight
Likely culprits:
- Model version swap (check deployment audit)
- A skill change made the LLM call more tools per turn
- Context compression triggered more often: check max_context_tokens
Compare LLM_TEXT event token counts from yesterday vs today. Session replay (6.6) is what makes this feasible.
F.8 Development & testing
"How do I unit-test a custom tool?"
Tools are plain classes — MyTool().execute(**args). No Agentao instance needed. For tools that touch disk, pass a working_directory temp dir.
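A sketch of what such a test can look like. WriteNoteTool is a hypothetical plain class standing in for your own tool; it does not inherit from Agentao's Tool base class, and the constructor argument is an assumption.

```python
import tempfile
from pathlib import Path


class WriteNoteTool:
    """Hypothetical tool used only for this illustration."""

    def __init__(self, working_directory: str) -> None:
        self.working_directory = Path(working_directory)

    def execute(self, name: str, text: str) -> str:
        target = self.working_directory / name
        target.write_text(text, encoding="utf-8")
        return f"wrote {len(text)} chars to {name}"


def test_write_note_tool() -> None:
    # A throwaway directory keeps the test hermetic and leaves no files behind.
    with tempfile.TemporaryDirectory() as tmp:
        result = WriteNoteTool(working_directory=tmp).execute(name="a.txt", text="hello")
        assert result == "wrote 5 chars to a.txt"
        assert (Path(tmp) / "a.txt").read_text(encoding="utf-8") == "hello"
```

Under pytest, the built-in tmp_path fixture can replace the explicit TemporaryDirectory.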
For an end-to-end testing rig — agent, agent_with_reply, and fake_llm_client pytest fixtures with passing smoke tests — see examples/pytest-fixture/. Drop the fixtures into your own test suite and you get hermetic Agentao tests with no OPENAI_API_KEY requirement.
"How do I assert the agent did the right thing?"
Don't assert on LLM output text (non-deterministic). Instead:
- Spy on EventType.TOOL_START events via SdkTransport(on_event=spy); assert the tool was called with expected args
- Or mock the tool and assert interactions
- Reuse the fake_llm_client fixture from examples/pytest-fixture/ to script LLM responses turn-by-turn
"LLM responses are non-deterministic across test runs"
Set temperature=0 when testing, but accept that exact wording will still drift. Test the effect (tool calls, final files, return shape), not the prose.
F.9 Still stuck?
Minimum reproduction for a bug report:
- Agentao version (python -c "import agentao; print(agentao.__version__)")
- OS, Python version
- A script that reproduces in ≤ 30 lines
- Tail of agentao.log around the failure
- For ACP issues: AcpClientError.code + .details