Observability

AI agents are non-deterministic. The same input can produce different outputs. An agent might choose different tools, retrieve different context, or generate different responses - even with identical prompts.

This is fundamentally different from traditional software, where you can predict behavior by inspecting the code. With agents, you need to observe what actually happened, not what you expected to happen.

The Challenge

Traditional software:

Input → Deterministic Code → Predictable Output
        (inspect code to debug)

AI agents:

Input → LLM Reasoning → Tool Selection → Memory Retrieval → Response
        (unpredictable) (unpredictable)  (context-dependent)

When something goes wrong, you can't just read the code. You need to trace:

Which agent was selected and why
What the LLM was thinking (reasoning)
Which tools were called with what parameters
What context was retrieved from memory
How the final response was generated

Without observability, you're debugging a black box.

What MUXI Tracks

Every action in the system emits structured events:

Category	What's tracked
Routing	Agent selection decisions, routing scores
LLM Calls	Prompts, completions, tokens, latency, cost
Tools	Tool calls, parameters, results, errors
Memory	Retrievals, what context was used
Workflows	Task decomposition, step execution
Errors	Failures, retries, fallbacks

Roughly 350 typed events cover the full request lifecycle. If you can't measure it, you can't improve it. Every LLM call, tool use, routing decision, memory operation, and workflow step is traced and exportable.
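As a rough illustration of what per-call tracking makes possible, the sketch below totals tokens, cost, and latency across a few LLM call events. The event name `llm.call.completed` and the fields `tokens`, `latency_ms`, and `cost_usd` are assumptions made for this example, not MUXI's documented schema.

```python
from decimal import Decimal

# Hypothetical exported events; event names and field names are assumed.
llm_events = [
    {"event": "llm.call.completed",
     "data": {"tokens": 1200, "latency_ms": 850, "cost_usd": "0.0120"}},
    {"event": "llm.call.completed",
     "data": {"tokens": 300, "latency_ms": 410, "cost_usd": "0.0030"}},
    {"event": "tool.call.completed", "data": {"tool": "search"}},
]


def summarize_llm_calls(events):
    """Aggregate token, cost, and latency figures across LLM call events."""
    calls = [e["data"] for e in events if e["event"] == "llm.call.completed"]
    total_tokens = sum(c["tokens"] for c in calls)
    # Decimal avoids floating-point drift when adding small currency amounts.
    total_cost = sum((Decimal(c["cost_usd"]) for c in calls), Decimal("0"))
    avg_latency = sum(c["latency_ms"] for c in calls) / len(calls) if calls else 0.0
    return {
        "calls": len(calls),
        "tokens": total_tokens,
        "cost_usd": total_cost,
        "avg_latency_ms": avg_latency,
    }


summary = summarize_llm_calls(llm_events)
print(summary)
```

The same aggregation could be grouped by agent or by user once those fields are known to be present in the exported events.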

Two Types of Events

Type	Audience	Purpose	Examples
Request lifecycle	Users (via SDK)	Trace what happened to their request	Agent selected, task progress, response generated
System/error	Developers only	Debug infrastructure and errors	MCP server connected, memory operation failed

System-level events (connecting to MCP servers, error handling, file creation, memory operations) are never exposed to users - only developers can see them for debugging.
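The visibility rule can be sketched as a simple filter. This is a minimal illustration, not MUXI's implementation: the `Kind` enum, the `Event` class, and the event names are hypothetical, and it assumes developers can see request lifecycle events in addition to system events.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    REQUEST_LIFECYCLE = "request_lifecycle"  # exposed to users via the SDK
    SYSTEM = "system"                        # developers only


@dataclass(frozen=True)
class Event:
    name: str  # hypothetical event name, for illustration only
    kind: Kind


def visible_events(events, is_developer):
    """Developers see everything; users see only request lifecycle events."""
    if is_developer:
        return list(events)
    return [e for e in events if e.kind is Kind.REQUEST_LIFECYCLE]


events = [
    Event("agent.selected", Kind.REQUEST_LIFECYCLE),
    Event("mcp.server.connected", Kind.SYSTEM),
    Event("response.generated", Kind.REQUEST_LIFECYCLE),
    Event("memory.operation.failed", Kind.SYSTEM),
]

user_view = [e.name for e in visible_events(events, is_developer=False)]
print(user_view)
```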

Example: Debugging a Wrong Answer

User reports: "The agent gave me incorrect information about our refund policy."

Without observability: 🤷 No idea what happened.

With observability:

1. Request received → routed to "support" agent ✓
2. Knowledge retrieval → searched "refund policy"
   → Retrieved: outdated-policy-2023.md ❌ (found the bug!)
3. LLM call → generated response based on wrong doc
4. Response sent

Now you know: the knowledge index has stale documents. Fix: re-index with current docs.
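The same investigation can be automated once traces are exported. The sketch below scans one request's trace for retrievals of documents indexed before a cutoff date. The event names, the `last_indexed` field, and the dates are hypothetical values chosen for illustration.

```python
from datetime import date

# Hypothetical trace for the refund-policy request; names and fields are assumed.
trace = [
    {"step": 1, "event": "routing.agent_selected", "data": {"agent": "support"}},
    {"step": 2, "event": "knowledge.retrieved",
     "data": {"query": "refund policy",
              "document": "outdated-policy-2023.md",
              "last_indexed": date(2023, 3, 1)}},
    {"step": 3, "event": "llm.call.completed", "data": {}},
    {"step": 4, "event": "response.sent", "data": {}},
]


def stale_retrievals(trace, cutoff):
    """Return (step, document) pairs for retrievals indexed before the cutoff."""
    return [
        (e["step"], e["data"]["document"])
        for e in trace
        if e["event"] == "knowledge.retrieved"
        and e["data"]["last_indexed"] < cutoff
    ]


suspects = stale_retrievals(trace, cutoff=date(2024, 1, 1))
print(suspects)
```

Running a check like this across many traces would show whether the stale document affected one request or many.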

Export Targets

Stream events to your existing infrastructure:

Datadog - Logs, metrics, APM
Elastic - ELK stack
Splunk - Enterprise logging
OpenTelemetry - OTLP protocol
Webhooks - Custom integrations

No sidecars or collector agents required - MUXI exports directly.

Configuration

observability:
  enabled: true
  export:
    type: datadog          # datadog, elastic, splunk, otlp, webhook
    endpoint: "https://..."
    api_key: ${{ secrets.DATADOG_API_KEY }}

  # What to track
  events:
    llm_calls: true        # Token usage, latency, cost
    tool_calls: true       # All tool executions
    memory_ops: true       # Memory reads/writes
    routing: true          # Agent selection decisions
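A configuration like the one above can be sanity-checked before deployment. The sketch below assumes the YAML has already been parsed into a Python dict. The `validate_observability` helper is hypothetical, and the allowed export types are taken from the comment in the configuration.

```python
ALLOWED_EXPORT_TYPES = {"datadog", "elastic", "splunk", "otlp", "webhook"}


def validate_observability(config):
    """Return a list of problems found in a parsed observability config."""
    problems = []
    obs = config.get("observability", {})
    if not obs.get("enabled", False):
        return problems  # nothing to check when observability is disabled

    export = obs.get("export", {})
    if export.get("type") not in ALLOWED_EXPORT_TYPES:
        problems.append(f"unknown export type: {export.get('type')!r}")
    if not export.get("endpoint"):
        problems.append("export.endpoint is missing")

    # Every toggle under `events` is expected to be a boolean.
    for key, value in obs.get("events", {}).items():
        if not isinstance(value, bool):
            problems.append(f"events.{key} should be true or false")
    return problems


config = {
    "observability": {
        "enabled": True,
        "export": {"type": "statsd", "endpoint": ""},
        "events": {"llm_calls": True, "tool_calls": "yes"},
    }
}
problems = validate_observability(config)
print(problems)
```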

Topic Tagging (Analytics)

Every request automatically gets 1-5 semantic topic tags:

Request: "Analyze customer feedback survey data from last quarter"
Topics: ["data-analysis", "customer-feedback", "surveys", "insights"]

No extra LLM calls - topics are extracted during the request analysis step that already runs for every request.

What Topics Enable

Use Case	Example
Dashboards	"What are users asking about this week?"
Filtering	"Show me all debugging-related requests"
Trends	"API questions up 40% this month"
Alerting	"Alert if >10 'authentication' requests fail"
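A dashboard or trend query over topics reduces to counting tags across events. The sketch below does this with `collections.Counter` over a small hypothetical sample. Only the `request.topics.extracted` event name and the `data.topics` field come from the documented event format; the sample data is invented.

```python
from collections import Counter

# Hypothetical sample of exported topic events.
events = [
    {"event": "request.topics.extracted",
     "data": {"topics": ["authentication", "api"]}},
    {"event": "request.topics.extracted",
     "data": {"topics": ["api", "debugging"]}},
    {"event": "request.topics.extracted",
     "data": {"topics": ["api"]}},
    {"event": "response.generated", "data": {}},
]


def topic_counts(events):
    """Count how often each topic appears across topic-extraction events."""
    counts = Counter()
    for e in events:
        if e.get("event") == "request.topics.extracted":
            counts.update(e["data"].get("topics", []))
    return counts


counts = topic_counts(events)
print(counts.most_common(1))
```

Comparing these counts between two time windows gives the trend figures, and a threshold on a single topic's count gives a basic alert.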

Event Format

{
  "event": "request.topics.extracted",
  "data": {
    "topics": ["data-analysis", "customer-feedback", "surveys"],
    "topic_count": 3,
    "complexity_score": 7.5
  }
}

Topics are normalized to lowercase-with-hyphens for consistency.
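One plausible way to produce the lowercase-with-hyphens form is shown below. This is a sketch of the idea rather than MUXI's actual normalization, and `normalize_topic` is a hypothetical helper.

```python
import re


def normalize_topic(raw: str) -> str:
    """Lowercase a topic and join its words with single hyphens."""
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", raw.lower())
    # Drop hyphens left over from leading or trailing punctuation or spaces.
    return slug.strip("-")


print(normalize_topic("Customer Feedback"))
print(normalize_topic("  data_analysis "))
```

Normalizing before storage means "Customer Feedback" and "customer-feedback" land in the same bucket when filtering or counting.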

Why This Matters More for AI

Same input → different outputs (non-deterministic)
Behavior depends on context (memory, retrieved docs)
Failures can be subtle (wrong answer vs. crash)
Root causes are often in data, not code

Observability isn't optional for production AI. It's how you maintain quality.

Learn More

Observability Deep Dive - Full event reference and configuration
Event Types - All 349 event types
Set Up Monitoring - Step-by-step guide
