6.6 Observability & Audit

What you'll learn

  • The four observation axes: structured logs, metrics, traces, session replay
  • The minimum metrics every production deployment should emit
  • When session replay is the only thing that lets you diagnose a "why did the agent do that?" incident

Agents are classic "long-tail bug" workloads: they behave well about 90% of the time, and the remaining 10% is behavior you cannot explain without evidence. No observability means no diagnosis, and no diagnosis means no improvement.

Four observation axes

┌────────────────────────────────────────────────┐
│ 1. Structured logs: what happened?             │
│    agentao.log + your app logs                 │
├────────────────────────────────────────────────┤
│ 2. Metrics: how much, how fast, how expensive? │
│    calls / latency / tokens / failure rate     │
├────────────────────────────────────────────────┤
│ 3. Event stream: per-session replay            │
│    AgentEvent archive                          │
├────────────────────────────────────────────────┤
│ 4. Distributed tracing: end-to-end per request │
│    OpenTelemetry                               │
└────────────────────────────────────────────────┘

Axis one: structured logs

The built-in agentao.log

Defaults to <working_directory>/agentao.log. It's very thorough:

  • Every LLM request/response (full content, tokens, model)
  • Every tool call with args and result
  • MCP server start/stop
  • Plugin hook dispatch
  • Context compression triggers

This is your most important debugging tool. In production:

  1. Mount it on a persistent volume (survives restarts)
  2. Rotate daily + keep 7–30 days
  3. Scrub (see 6.5)
  4. Isolate per tenant (natural with working_directory)
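
If you inject your own logger (the logger= parameter covered in the next subsection), daily rotation and per-tenant isolation can be handled with the standard library alone. The following is a minimal sketch, not Agentao's built-in behavior: tenant_logger, volume_root, the "myapp.agentao.<tenant>" logger name, and the 14-day retention are all assumptions chosen for illustration.

```python
import logging
from logging.handlers import TimedRotatingFileHandler
from pathlib import Path


def build_tenant_log_handler(tenant_dir: Path, keep_days: int = 14) -> TimedRotatingFileHandler:
    """Return a handler that rotates <tenant_dir>/agentao.log once per day."""
    tenant_dir.mkdir(parents=True, exist_ok=True)
    handler = TimedRotatingFileHandler(
        tenant_dir / "agentao.log",
        when="midnight",        # start a new file every day
        backupCount=keep_days,  # rotated files beyond this count are deleted
        encoding="utf-8",
        utc=True,
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )
    return handler


def tenant_logger(tenant_id: str, volume_root: Path) -> logging.Logger:
    """Hypothetical helper: one logger and one log file per tenant."""
    logger = logging.getLogger(f"myapp.agentao.{tenant_id}")
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        logger.addHandler(build_tenant_log_handler(volume_root / tenant_id))
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep tenant logs out of the root logger
    return logger
```

Point volume_root at the persistent volume so the files survive restarts, then pass the returned logger to the agent.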

Take over Agentao's logger

By default LLMClient.__init__ mutates logging.getLogger("agentao") — sets level to DEBUG, attaches a RotatingFileHandler for <wd>/agentao.log, and evicts its own marker-tagged handlers on re-construction. Hosts that want to own their logging stack should inject a logger explicitly so that mutation never runs:

python
import logging
from agentao import Agentao

# Your own logger — JSON handler shipped to Loki / CloudWatch / ELK
import pythonjsonlogger.jsonlogger as jl
my_logger = logging.getLogger("myapp.agentao")
handler = logging.StreamHandler()
handler.setFormatter(jl.JsonFormatter())
my_logger.addHandler(handler)
my_logger.setLevel(logging.INFO)

agent = Agentao(
    api_key=..., base_url=..., model=...,
    working_directory=workdir,
    logger=my_logger,            # ← skips package-root mutation
)

When you pass logger=, LLMClient short-circuits before building any file handler — so even the default <wd>/agentao.log is not created. For a fully silent agent, use a NullHandler:

python
quiet = logging.getLogger("myapp.agentao")
quiet.addHandler(logging.NullHandler())
quiet.propagate = False
agent = Agentao(..., logger=quiet)

Gotcha

Adding a handler to getLogger("agentao") without also passing logger= works, but the package-root level still gets forced to DEBUG and the rolling agentao.log file still gets written alongside your handler. To fully suppress the file, either inject your own logger (above) or pass log_file=None when building LLMClient yourself.

Full reference (knob matrix, code anchors, LLMClient direct path) in docs/guides/embedding.md §2 → "Optional: silencing or redirecting agentao.log".

Essential fields

Inject business context via on_event:

python
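def on_event(ev):
    # logger and the current_*() helpers come from your own application
    logger.info("agent_event", extra={
        "event_type": ev.type.value,
        "session_id": current_session_id(),
        "tenant_id": current_tenant_id(),
        "user_id": current_user_id(),
        # Nested so payload keys (e.g. "args", "name") cannot clash with
        # reserved LogRecord attributes or overwrite the fields above
        "data": ev.data,
    })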

session_id / tenant_id / user_id are the most-used filter fields when debugging.
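
The current_session_id() style helpers used above are left to the host application. One possible way to back them is with contextvars, sketched below under the assumption that your request handler calls a bind function once per request. All names here (bind_request_context, RequestContextFilter, the "-" default) are hypothetical.

```python
import contextvars
import logging

# Hypothetical per-request context; "-" marks "not inside a request".
_session_id = contextvars.ContextVar("session_id", default="-")
_tenant_id = contextvars.ContextVar("tenant_id", default="-")
_user_id = contextvars.ContextVar("user_id", default="-")


def bind_request_context(session_id: str, tenant_id: str, user_id: str) -> None:
    """Call once at the start of each request, before invoking the agent."""
    _session_id.set(session_id)
    _tenant_id.set(tenant_id)
    _user_id.set(user_id)


def current_session_id() -> str:
    return _session_id.get()


def current_tenant_id() -> str:
    return _tenant_id.get()


def current_user_id() -> str:
    return _user_id.get()


class RequestContextFilter(logging.Filter):
    """Stamp every log record with the current session, tenant, and user."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.session_id = _session_id.get()
        record.tenant_id = _tenant_id.get()
        record.user_id = _user_id.get()
        return True  # never drop the record, only annotate it
```

Attaching the filter to a handler tags all records passing through it, not only those emitted from on_event. Context variables follow asyncio tasks, and asyncio.to_thread copies the current context into the worker thread, so the values remain visible when a blocking call is offloaded that way.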

Axis two: metrics

Must-have metrics

| Metric | Type | Meaning |
|---|---|---|
| agent.turn.count | counter | Turns per chat() |
| agent.turn.duration_ms | histogram | Per-turn duration |
| agent.tool.calls | counter, by tool | Tool invocations |
| agent.tool.failures | counter, by tool | Tool failures |
| agent.tool.duration_ms | histogram, by tool | Tool latency |
| agent.llm.tokens.prompt | counter | Prompt tokens |
| agent.llm.tokens.completion | counter | Completion tokens |
| agent.llm.tokens.cached | counter | Prompt-cache hits |
| agent.llm.errors | counter, by error_type | LLM errors |
| agent.confirm.requests | counter, by outcome | Confirm request / allow / reject / timeout |
| agent.max_iterations.hits | counter | Bailouts triggered |

Prometheus template

python
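import time

from prometheus_client import Counter, Histogram
# EventType is Agentao's event enum; import it as your Agentao version exposes it.

# Turn latency in milliseconds, bucketed from 100 ms to 30 s
turn_dur = Histogram("agent_turn_duration_ms", "Turn duration",
                     buckets=[100, 500, 1000, 3000, 10_000, 30_000])
# Tool invocations, labeled by tool name and final status
tool_calls = Counter("agent_tool_calls", "Tool invocations", ["tool", "status"])

def on_event(ev):
    # Count every finished tool call
    if ev.type == EventType.TOOL_COMPLETE:
        tool_calls.labels(tool=ev.data["tool"], status=ev.data["status"]).inc()

# Time one chat() turn with a monotonic clock (immune to wall-clock jumps)
start = time.perf_counter()
reply = agent.chat(msg)
turn_dur.observe((time.perf_counter() - start) * 1000)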

Alert thresholds

| Alert condition | Likely cause |
|---|---|
| Tool failure rate > 10% | Tools broken or misconfigured |
| LLM 5xx rate > 2% | Vendor issue |
| max_iterations hits > 5% | Agent stuck pattern |
| Cache hit rate < 30% | System prompt is churning |
| Confirm timeout rate > 10% | UI issue or user drop-off |
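
To make the thresholds concrete, here is a small sketch that evaluates them over one window of raw counts. It is an illustration only: in practice these rules usually live in your alerting system. The WindowCounts fields and the choice of denominators (max_iterations hits per turn, cached tokens per prompt token) are assumptions, since the table does not define them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowCounts:
    """Hypothetical raw counts collected over one alerting window."""
    tool_calls: int
    tool_failures: int
    llm_requests: int
    llm_5xx: int
    turns: int
    max_iteration_hits: int
    prompt_tokens: int
    cached_tokens: int
    confirm_requests: int
    confirm_timeouts: int


def _ratio(numerator: int, denominator: int) -> float:
    # An empty window has no rate; report 0.0 instead of dividing by zero.
    return numerator / denominator if denominator else 0.0


def firing_alerts(c: WindowCounts) -> list[str]:
    """Return the names of the threshold rules that fire for this window."""
    alerts = []
    if _ratio(c.tool_failures, c.tool_calls) > 0.10:
        alerts.append("tool_failure_rate")
    if _ratio(c.llm_5xx, c.llm_requests) > 0.02:
        alerts.append("llm_5xx_rate")
    if _ratio(c.max_iteration_hits, c.turns) > 0.05:
        alerts.append("max_iterations_hits")
    # Only judge cache efficiency when prompts were actually sent.
    if c.prompt_tokens and _ratio(c.cached_tokens, c.prompt_tokens) < 0.30:
        alerts.append("cache_hit_rate")
    if _ratio(c.confirm_timeouts, c.confirm_requests) > 0.10:
        alerts.append("confirm_timeout_rate")
    return alerts
```

Rates are computed per window rather than from lifetime totals so that an old incident does not mask or trigger a current alert.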

Axis three: session replay

Agentao can record each session's runtime timeline as append-only JSONL under .agentao/replays/. Enable it per project:

bash
/replay on

This writes .agentao/settings.json:

json
{
  "replay": {
    "enabled": true,
    "max_instances": 20
  }
}

Recording starts on the next session. Existing replay files remain readable after /replay off.

Replay enables:

  • Session replay (reconstruct the incident step by step in a UI)
  • Retroactive debugging (find the turn where the LLM made the wrong call)
  • Compliance audit (show that user X had the agent do Z at time Y)

Commands

bash
/replay list            # list replay instances (also the default of bare /replay)
/replay on | /replay off  # toggle recording (persists to .agentao/settings.json)
/replay show <id>       # grouped render
/replay show <id> --raw
/replay show <id> --turn <turn_id>
/replay show <id> --kind tool_
/replay show <id> --errors
/replay tail <id> 50
/replay prune

Replay files are separate from saved sessions: save_session / load_session restore conversation state, while replay records what the runtime did.
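
Because replays are plain JSONL, they can also be inspected outside the CLI. The sketch below counts records by the value of one field and assumes only that each non-empty line is a JSON object. The record schema is not described here, so the field name is supplied by the caller; the "kind" used in the usage note is a guess based on the --kind flag, not a documented key.

```python
import json
from collections import Counter
from pathlib import Path


def count_by_field(replay_file: Path, field: str) -> Counter:
    """Count JSONL records in replay_file by the value stored under field."""
    counts: Counter = Counter()
    with replay_file.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            # Records without the field are grouped under "<missing>".
            counts[str(record.get(field, "<missing>"))] += 1
    return counts
```

For example, count_by_field(path, "kind").most_common(5) would list the five most frequent record kinds, if the records do carry such a key.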

Capture depth

Default replay capture includes turn boundaries, user messages, assistant chunks, tool lifecycle, permission decisions, sub-agent lifecycle, errors, state changes, and compact LLM deltas.

Deep capture flags live under replay.capture_flags in .agentao/settings.json:

| Flag | Default | Risk |
|---|---|---|
| capture_llm_delta | true | Normal replay history delta |
| capture_full_llm_io | false | Full provider payloads; sensitive |
| capture_tool_result_full | false | Full tool output; may be large or sensitive |
| capture_plugin_hook_output_full | false | Full plugin hook output |
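
If you manage settings from a deployment script, a flag can be toggled by editing the JSON file directly. This is a sketch under one assumption: that each flag is a plain boolean keyed by its name under replay.capture_flags, as the table suggests. set_capture_flag is a hypothetical helper, not part of Agentao.

```python
import json
from pathlib import Path


def set_capture_flag(project_dir: Path, flag: str, value: bool) -> dict:
    """Set replay.capture_flags.<flag> in <project_dir>/.agentao/settings.json."""
    settings_path = project_dir / ".agentao" / "settings.json"
    if settings_path.exists():
        settings = json.loads(settings_path.read_text(encoding="utf-8"))
    else:
        settings = {}
    # Create the nested objects on demand and leave other keys untouched.
    flags = settings.setdefault("replay", {}).setdefault("capture_flags", {})
    flags[flag] = value
    settings_path.parent.mkdir(parents=True, exist_ok=True)
    settings_path.write_text(json.dumps(settings, indent=2) + "\n", encoding="utf-8")
    return settings
```

Turn on the full-capture flags only for the duration of an investigation, since they can write sensitive payloads to disk.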

Custom archive hook

Use the built-in replay recorder first. Add a custom on_event archiver only when you need to send selected events into your own audit pipeline:

python
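def audit_event(ev):
    # Forward only the events the audit pipeline needs
    if ev.type in {EventType.TOOL_COMPLETE, EventType.ERROR}:
        audit_log.info("agent_event", extra={
            "type": ev.type.value,
            "session_id": session_id,   # captured from the enclosing request scope
            "tenant_id": tenant.id,
            # Nested so payload keys cannot clash with LogRecord attributes
            "data": ev.data,
        })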

transport = SdkTransport(on_event=audit_event)

Axis four: distributed tracing

When the agent is embedded in your web service, one user request can span:

Browser → your API → Agent.chat() → LLM API → Agent → custom tool → database

OpenTelemetry stitches these into one trace:

python
import asyncio

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# `app` (your web framework's application object) and ChatRequest
# are defined elsewhere in your service
@app.post("/chat")
async def chat(req: ChatRequest):
    with tracer.start_as_current_span("user_chat") as span:
        span.set_attribute("user.id", req.user_id)
        span.set_attribute("session.id", req.session_id)
        with tracer.start_as_current_span("agent_chat"):
            # agent.chat is blocking, so run it off the event loop
            reply = await asyncio.to_thread(agent.chat, req.message)
        return {"reply": reply}

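For deeper traces, wrap LLMClient and Tool.execute so that each LLM or tool call gets its own span, and set attributes that follow the OpenTelemetry GenAI semantic conventions:

  • gen_ai.system = "openai" (renamed gen_ai.provider.name in newer releases of the conventions)
  • gen_ai.request.model = model name
  • gen_ai.usage.input_tokens / gen_ai.usage.output_tokens (older releases: prompt_tokens / completion_tokens)
  • gen_ai.response.finish_reasons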

Audit and compliance

Must-keep audit events

| Scenario | Trigger | Retention |
|---|---|---|
| User starts a session | Agent construction | 90–365 days |
| User approves dangerous tool | confirm_tool = True | 180–365 days |
| Permission rule denied | decide = DENY | 90 days |
| Agent modified user data | Business tool execution | Business-defined (usually 1–7 years) |
| User requests "forget me" | memory.clear_all | Indefinite (compliance evidence) |

Scrubbing and retention

Do not scrub audit logs, because scrubbing destroys the evidence they exist to preserve. Instead, encrypt them at rest and put them behind strict access control.

Where a compliance regime applies, store audit logs on append-only (WORM) storage.

Minimum viable observability

On a budget:

  1. agentao.log → per-tenant files, daily rotate, 14d retention
  2. prometheus_client → the 5 key metrics (the rates in the alert-threshold table), Grafana dashboard
  3. Built-in replay JSONL → .agentao/replays/, tune replay.max_instances
  4. No OpenTelemetry

This is enough for the vast majority of small and mid-size SaaS deployments. Add tracing and APM when you scale.

TL;DR

  • Four axes: structured logs (agentao.log), metrics (Prometheus / StatsD), traces (OpenTelemetry), session replay.
  • Minimum metrics: tool call rate by name, tool failure rate, LLM 5xx rate, confirm-timeout rate, turn duration p50/p95/p99, max-iterations hit rate.
  • Session replay is the killer feature — when "why did the agent do X?" comes up, replay deterministically with replay_config= and step through.
  • Cost monitoring is a first-class observable: track tokens-per-turn and tokens-per-tenant; sudden 2× spikes usually mean a model swap or skill change.
