Observability

Why you need to see inside your AI systems

AI agents are non-deterministic. The same input can produce different outputs. An agent might choose different tools, retrieve different context, or generate different responses - even with identical prompts.

This is fundamentally different from traditional software, where you can predict behavior from code inspection. With agents, you need to observe what actually happened, not what you expected to happen.

The Challenge

Traditional software:

Input → Deterministic Code → Predictable Output
        (inspect code to debug)

AI agents:

Input → LLM Reasoning → Tool Selection → Memory Retrieval → Response
        (unpredictable) (unpredictable)  (context-dependent)

When something goes wrong, you can't just read the code. You need to trace:

  • Which agent was selected and why
  • What the LLM was thinking (reasoning)
  • Which tools were called with what parameters
  • What context was retrieved from memory
  • How the final response was generated

Without observability, you're debugging a black box.

What MUXI Tracks

Every action in the system emits structured events:

  • Routing - agent selection decisions, routing scores
  • LLM Calls - prompts, completions, tokens, latency, cost
  • Tools - tool calls, parameters, results, errors
  • Memory - retrievals, what context was used
  • Workflows - task decomposition, step execution
  • Errors - failures, retries, fallbacks

~350 typed events across the full request lifecycle. If you can't measure it, you can't improve it. Every LLM call, tool use, routing decision, memory operation, workflow step - traced and exportable.

Two Types of Events

  • Request lifecycle - for users (via the SDK). Purpose: trace what happened to their request. Examples: agent selected, task progress, response generated.
  • System/error - for developers only. Purpose: debug infrastructure and errors. Examples: MCP server connected, memory operation failed.

System-level events (connecting to MCP servers, error handling, file creation, memory operations) are never exposed to users - only developers can see them for debugging.
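The user/developer split above can be sketched as a filter over the event stream. This is a hypothetical illustration: the event-name prefixes and the idea of splitting by prefix are assumptions for the example, not MUXI's actual schema.

```python
def split_by_audience(events):
    """Separate request-lifecycle events (user-visible) from
    system/error events (developer-only)."""
    # Assumed prefixes for system-level events; MUXI's real
    # event taxonomy may differ.
    SYSTEM_PREFIXES = ("mcp.", "memory.", "system.", "error.")
    user_visible, developer_only = [], []
    for event in events:
        name = event.get("event", "")
        if name.startswith(SYSTEM_PREFIXES):
            developer_only.append(event)
        else:
            user_visible.append(event)
    return user_visible, developer_only

events = [
    {"event": "request.agent.selected", "data": {"agent": "support"}},
    {"event": "mcp.server.connected", "data": {"server": "files"}},
    {"event": "request.response.generated", "data": {}},
    {"event": "memory.operation.failed", "data": {"op": "read"}},
]
user, dev = split_by_audience(events)
# user holds the two request.* events; dev holds the other two
```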

Example: Debugging a Wrong Answer

User reports: "The agent gave me incorrect information about our refund policy."

Without observability: 🤷 No idea what happened.

With observability:

1. Request received → routed to "support" agent ✓
2. Knowledge retrieval → searched "refund policy"
   → Retrieved: outdated-policy-2023.md ❌ (found the bug!)
3. LLM call → generated response based on wrong doc
4. Response sent

Now you know: the knowledge index has stale documents. Fix: re-index with current docs.

Export Targets

Stream events to your existing infrastructure:

  • Datadog - Logs, metrics, APM
  • Elastic - ELK stack
  • Splunk - Enterprise logging
  • OpenTelemetry - OTLP protocol
  • Webhooks - Custom integrations

No sidecars or agents required - MUXI exports directly.
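For the webhook target, your endpoint receives event payloads and decides what to do with them. A minimal sketch of a handler body, assuming the payload is a JSON array of events shaped like the event format shown later in this doc (the prefix-based dispatch is an illustrative choice, not a MUXI requirement):

```python
import json

def handle_webhook_body(raw_body: bytes, handlers: dict):
    """Dispatch each event in a webhook payload to a handler keyed by
    the first segment of its event name (e.g. 'request', 'tool')."""
    payload = json.loads(raw_body)
    events = payload if isinstance(payload, list) else [payload]
    for event in events:
        prefix = event["event"].split(".", 1)[0]
        handler = handlers.get(prefix)
        if handler:
            handler(event)

# Collect only request.* events, ignore everything else
seen = []
handlers = {"request": seen.append}
body = json.dumps([
    {"event": "request.topics.extracted", "data": {"topics": ["surveys"]}},
    {"event": "tool.call.completed", "data": {"tool": "search"}},
]).encode()
handle_webhook_body(body, handlers)
# seen now holds only the request.* event
```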

Configuration

observability:
  enabled: true
  export:
    type: datadog          # datadog, elastic, splunk, otlp, webhook
    endpoint: "https://..."
    api_key: ${{ secrets.DATADOG_API_KEY }}

  # What to track
  events:
    llm_calls: true        # Token usage, latency, cost
    tool_calls: true       # All tool executions
    memory_ops: true       # Memory reads/writes
    routing: true          # Agent selection decisions

Topic Tagging (Analytics)

Every request automatically gets 1-5 semantic topic tags:

Request: "Analyze customer feedback survey data from last quarter"
Topics: ["data-analysis", "customer-feedback", "surveys", "insights"]

No extra LLM calls - topics are extracted during the request analysis that already happens.

What Topics Enable

  • Dashboards - "What are users asking about this week?"
  • Filtering - "Show me all debugging-related requests"
  • Trends - "API questions up 40% this month"
  • Alerting - "Alert if >10 'authentication' requests fail"
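The alerting use case can be sketched as a count of failed requests per topic checked against a threshold. The `request.failed` event name and its `data.topics` field are assumptions for illustration; MUXI's actual failure event may be named differently.

```python
from collections import Counter

def failed_counts_by_topic(events):
    """Count failed requests per topic tag across a batch of events."""
    counts = Counter()
    for event in events:
        if event.get("event") == "request.failed":  # assumed event name
            for topic in event.get("data", {}).get("topics", []):
                counts[topic] += 1
    return counts

def should_alert(counts, topic, threshold=10):
    """Fire when failures for a topic exceed the threshold."""
    return counts[topic] > threshold

events = [{"event": "request.failed",
           "data": {"topics": ["authentication"]}}] * 12
counts = failed_counts_by_topic(events)
# should_alert(counts, "authentication") -> True (12 > 10)
```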

Event Format

{
  "event": "request.topics.extracted",
  "data": {
    "topics": ["data-analysis", "customer-feedback", "surveys"],
    "topic_count": 3,
    "complexity_score": 7.5
  }
}

Topics are normalized to lowercase-with-hyphens for consistency.
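A minimal sketch of that normalization, assuming lowercase-with-hyphens means collapsing any run of non-alphanumeric characters into a single hyphen (MUXI's actual implementation may differ):

```python
import re

def normalize_topic(raw: str) -> str:
    """Lowercase, collapse runs of non-alphanumerics into single
    hyphens, and trim leading/trailing hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", raw.lower())
    return slug.strip("-")

normalize_topic("Data Analysis")        # "data-analysis"
normalize_topic("Customer  Feedback!")  # "customer-feedback"
```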

Use Cases

Debugging

Trace why an agent made a specific decision. See the full context it had access to.

Cost Tracking

Monitor token usage per user, per agent, per formation. Set alerts for runaway costs.
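Per-user cost tracking reduces to aggregating over LLM call events. A hypothetical sketch, where the event name `llm.call.completed` and the `user`/`cost_usd` fields are assumptions for the example:

```python
from collections import defaultdict

def cost_per_user(events):
    """Sum LLM spend per user from completed-call events."""
    totals = defaultdict(float)
    for event in events:
        if event.get("event") == "llm.call.completed":  # assumed name
            data = event["data"]
            totals[data["user"]] += data["cost_usd"]
    return dict(totals)

events = [
    {"event": "llm.call.completed",
     "data": {"user": "alice", "tokens": 1200, "cost_usd": 0.012}},
    {"event": "llm.call.completed",
     "data": {"user": "alice", "tokens": 800, "cost_usd": 0.008}},
    {"event": "llm.call.completed",
     "data": {"user": "bob", "tokens": 500, "cost_usd": 0.005}},
]
totals = cost_per_user(events)
# alice totals roughly 0.02, bob 0.005
```

The same fold works per agent or per formation by switching the grouping key.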

Performance

Identify slow LLM calls, memory lookups, or tool executions. Optimize bottlenecks.

Compliance

Audit log of all data access and tool calls. Required for regulated industries.

Alerting

Get notified when error rates spike, latency degrades, or specific events occur.

Why This Matters More for AI

In traditional software, bugs are reproducible. Run the same input, get the same bug.

In AI systems:

  • Same input → different outputs (non-deterministic)
  • Behavior depends on context (memory, retrieved docs)
  • Failures can be subtle (wrong answer vs. crash)
  • Root causes are often in data, not code

Observability isn't optional for production AI. It's how you maintain quality.

Learn More

Observability Deep Dive - Full event reference and configuration
Event Types - All 349 event types
Set Up Monitoring - Step-by-step guide