Agent Observability: A Guide to Monitoring AI Agent Runtime State

At 3:30 a.m., my agent hit a bug in production, and I had no idea what it had done. That was when it hit me: you can't fix what you can't see.

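Observability is key to keeping an AI agent system running reliably. It spans four dimensions: logging, metrics, distributed tracing, and alerting. Together they give you a full picture of the agent's runtime state and health.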

🎯 The Four Pillars of Observability

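Pillar	Purpose	Key signals
🪵 Logs	Record operational detail	Request logs, error logs, audit logs
📈 Metrics	Quantify runtime state	Latency, throughput, error rate, token consumption
🔍 Traces	Follow a request end to end	Request duration, tool call chain, context propagation
🔔 Alerts	Proactively flag anomalies	Threshold alerts, anomaly detection, recovery notifications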

🏗️ OpenClaw Observability Architecture

┌─────────────────────────────────────────────┐
│                Agent Runtime                │
├─────────────────────────────────────────────┤
│  OpenTelemetry Instrumentation              │
│  ├── Metrics (Prometheus format)            │
│  ├── Traces (Jaeger/Zipkin)                 │
│  ├── Logs (Structured JSON)                 │
│  └── Health Checks (HTTP)                   │
├─────────────────────────────────────────────┤
│  Monitoring Stack                           │
│  ├── Prometheus (Metrics Storage)           │
│  ├── Grafana (Visualization)                │
│  ├── Jaeger (Distributed Tracing)           │
│  ├── Loki (Log Aggregation)                 │
│  └── AlertManager (Alerting)                │
└─────────────────────────────────────────────┘

🚀 Quick Implementation

1. Metrics

// OpenClaw Agent metrics configuration
const agent = new Agent({
  name: 'observed-agent',
  model: 'claude-3-opus',
  observability: {
    metrics: {
      enabled: true,
      // Port that serves /metrics for Prometheus to scrape.
      // 9090 is also the Prometheus server's default port, so pick another if both share a host.
      port: 9090,
      // Custom metrics
      customMetrics: [
        {
          name: 'agent_request_duration',
          type: 'histogram',
          labels: ['model', 'tool', 'status']
        },
        {
          name: 'agent_token_usage',
          type: 'counter',
          labels: ['type'] // input/output
        },
        {
          name: 'agent_tool_calls',
          type: 'counter',
          labels: ['tool', 'success']
        }
      ]
    }
  }
});

// Prometheus scrapes the metrics from:
// GET http://localhost:9090/metrics
// agent_request_duration_seconds_bucket{model="claude-3-opus",...}
// agent_token_usage_total{type="input"} 12345
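To make the /metrics output above less abstract, here is a minimal, dependency-free sketch of how a counter and a histogram could be rendered in the Prometheus text exposition format. The Counter and Histogram classes, the formatLabels helper, and the bucket boundaries are hypothetical names chosen for illustration. They are not OpenClaw's API, and a real deployment would normally rely on a maintained client library instead.

```javascript
// Hypothetical helpers for illustration only; not part of OpenClaw.

// Render a label object such as {type: 'input'} as {type="input"}.
function formatLabels(labels) {
  const pairs = Object.entries(labels).map(([k, v]) => `${k}="${v}"`);
  return pairs.length ? `{${pairs.join(',')}}` : '';
}

class Counter {
  constructor(name) {
    this.name = name;
    this.values = new Map(); // serialized labels -> running total
  }
  inc(labels = {}, amount = 1) {
    const key = formatLabels(labels);
    this.values.set(key, (this.values.get(key) || 0) + amount);
  }
  render() {
    const lines = [`# TYPE ${this.name} counter`];
    for (const [key, value] of this.values) lines.push(`${this.name}${key} ${value}`);
    return lines.join('\n');
  }
}

class Histogram {
  constructor(name, buckets) {
    this.name = name;
    this.buckets = buckets;             // upper bounds in seconds, ascending
    this.counts = buckets.map(() => 0); // cumulative count per bucket
    this.sum = 0;
    this.count = 0;
  }
  observe(seconds) {
    this.sum += seconds;
    this.count += 1;
    // Buckets are cumulative: a value counts in every bucket whose bound is >= the value.
    this.buckets.forEach((le, i) => { if (seconds <= le) this.counts[i] += 1; });
  }
  render() {
    const lines = [`# TYPE ${this.name} histogram`];
    this.buckets.forEach((le, i) => lines.push(`${this.name}_bucket{le="${le}"} ${this.counts[i]}`));
    lines.push(`${this.name}_bucket{le="+Inf"} ${this.count}`);
    lines.push(`${this.name}_sum ${this.sum}`);
    lines.push(`${this.name}_count ${this.count}`);
    return lines.join('\n');
  }
}

const tokens = new Counter('agent_token_usage_total');
tokens.inc({ type: 'input' }, 500);
tokens.inc({ type: 'output' }, 300);

const duration = new Histogram('agent_request_duration_seconds', [0.5, 1, 5, 10]);
duration.observe(1.25);
duration.observe(0.25);

console.log(tokens.render());
console.log(duration.render());
```

Because histogram buckets are cumulative, the 1.25 s observation shows up in the le="5" and le="10" buckets but not in le="1". That cumulative layout is what lets a percentile be estimated later from the bucket counts alone.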

2. Distributed Tracing

// Configure OpenTelemetry tracing
const { trace } = require('@opentelemetry/api');

const agent = new Agent({
  name: 'traced-agent',
  observability: {
    tracing: {
      enabled: true,
      exporter: 'jaeger',
      endpoint: 'http://jaeger:4318/v1/traces', // 4318 is the default OTLP/HTTP port
      samplingRate: 0.1, // sample 10% of traces
      // Custom spans
      customSpans: {
        toolExecution: true,
        contextCompression: true,
        memoryRetrieval: true
      }
    }
  }
});

// Example trace
// [Agent Request] (100ms)
//   ├── [Tool: web_search] (50ms)
//   ├── [Tool: file_read] (20ms)
//   └── [Context Compress] (30ms)
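A sampling rate of 0.1 raises a practical question: how does every span in a trace end up with the same keep-or-drop decision? One common approach is to derive the decision from the trace ID itself. The sketch below assumes 32-hex-digit trace IDs and uses two hypothetical helpers, shouldSample and randomTraceId. It illustrates the idea only and is not how OpenClaw or the OpenTelemetry SDK necessarily implements sampling.

```javascript
// Hypothetical helper: decide whether to keep a trace, given its hex trace ID.
function shouldSample(traceId, samplingRate) {
  if (samplingRate >= 1) return true;
  if (samplingRate <= 0) return false;
  // Treat the last 8 hex digits (32 bits) of the trace ID as a pseudo-random number.
  const bucket = parseInt(traceId.slice(-8), 16);
  // Keep the trace if that number falls in the lowest `samplingRate` share of the 32-bit range.
  return bucket < samplingRate * 0x100000000;
}

// Generate a random 32-hex-digit trace ID for the demo.
function randomTraceId() {
  let id = '';
  for (let i = 0; i < 32; i++) id += Math.floor(Math.random() * 16).toString(16);
  return id;
}

const total = 100000;
let kept = 0;
for (let i = 0; i < total; i++) {
  if (shouldSample(randomTraceId(), 0.1)) kept++;
}
console.log(`kept ${kept} of ${total} traces`); // roughly 10%, varies per run

// The same trace ID always yields the same decision, on any service that sees it.
const traceId = '4bf92f3577b34da6a3ce929d0e0e4736';
console.log(shouldSample(traceId, 0.1) === shouldSample(traceId, 0.1));
```

Since the decision depends only on the ID, a downstream service can reach the same verdict without coordination. The tradeoff is that rare failures can fall into the 90% that is dropped, so some teams raise the rate for error paths.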

3. Structured Logging

// Logging configuration
const agent = new Agent({
  name: 'logged-agent',
  observability: {
    logging: {
      level: 'info',
      format: 'json',
      output: './logs/agent.log',
      // Structured fields
      fields: {
        timestamp: true,
        agent_id: true,
        session_id: true,
        request_id: true,
        tool_calls: true,
        token_usage: true,
        duration_ms: true
      },
      // Sensitive fields to filter out
      redact: ['api_key', 'password', 'token']
    }
  }
});

// Example log output
{
  "level": "info",
  "timestamp": "2026-05-12T01:00:00Z",
  "agent_id": "my-agent",
  "session_id": "sess_abc",
  "message": "request_completed",
  "tool_calls": ["web_search", "file_read"],
  "token_usage": {"input": 500, "output": 300},
  "duration_ms": 1250
}
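The redact list above is easiest to reason about with a concrete filter in hand. The following is a minimal sketch of one way such a filter could work, assuming exact key matching at any nesting depth. The redact function and the sample entry are hypothetical and do not describe OpenClaw's actual behavior.

```javascript
// Hypothetical helper: return a copy of `entry` with sensitive keys masked.
function redact(entry, sensitiveKeys, mask = '[REDACTED]') {
  if (Array.isArray(entry)) return entry.map((item) => redact(item, sensitiveKeys, mask));
  if (entry === null || typeof entry !== 'object') return entry;
  const result = {};
  for (const [key, value] of Object.entries(entry)) {
    // Exact key match, so "token" is masked but "token_usage" is kept.
    result[key] = sensitiveKeys.includes(key) ? mask : redact(value, sensitiveKeys, mask);
  }
  return result;
}

// Made-up log entry with secrets nested at different depths.
const entry = {
  level: 'info',
  message: 'request_completed',
  token_usage: { input: 500, output: 300 },
  request: { headers: { api_key: 'sk-example' }, password: 'hunter2' },
};

const safe = redact(entry, ['api_key', 'password', 'token']);
console.log(JSON.stringify(safe, null, 2));
```

Matching on exact key names keeps token_usage intact while still masking a field literally named token. A substring match would be safer against oddly named secrets, at the cost of hiding useful fields, so the choice depends on how much you trust your field naming.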

4. Alerting Configuration

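# alertmanager/rules.yaml
# Alerting rules are evaluated by Prometheus (loaded via rule_files); Alertmanager routes the notifications.
groups:
  - name: agent_alerts
    rules:
      - alert: HighErrorRate
        # Errors per second over 5 minutes, not a ratio of failed to total requests
        expr: rate(agent_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Agent error rate is too high"

      - alert: HighLatency
        # histogram_quantile needs per-bucket rates aggregated by le, not raw counters
        expr: histogram_quantile(0.95, sum by (le) (rate(agent_request_duration_seconds_bucket[5m]))) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Agent P95 latency is above 10 seconds"

      - alert: TokenBudgetExceeded
        # A raw counter only grows and resets on restart, so compare its increase over a window
        expr: sum(increase(agent_token_usage_total[1d])) > 1000000
        labels:
          severity: warning
        annotations:
          summary: "Token usage exceeded 1 million in the past 24 hours"

      - alert: AgentDown
        expr: up{job="openclaw-agent"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Agent service is unavailable"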
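The HighLatency rule depends on histogram_quantile, which never sees individual request durations. It estimates the percentile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below is a simplified version of that idea, assuming 0 < q < 1 and at least one finite bucket. The estimateQuantile function and the bucket counts are made up for illustration, and Prometheus handles more edge cases than this does.

```javascript
// Simplified sketch of a bucket-based quantile estimate (linear interpolation).
// `buckets` is ascending by upper bound `le`, holds cumulative counts, and ends with +Inf.
function estimateQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  // First bucket whose cumulative count reaches the rank.
  const i = buckets.findIndex((b) => b.count >= rank);
  // If the rank lands in the +Inf bucket, the highest finite bound is the best available answer.
  if (buckets[i].le === Infinity) return buckets[buckets.length - 2].le;
  const lowerBound = i === 0 ? 0 : buckets[i - 1].le;
  const lowerCount = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - lowerCount;
  // Assume observations are spread evenly between the bucket's bounds.
  return lowerBound + (buckets[i].le - lowerBound) * ((rank - lowerCount) / inBucket);
}

// Made-up cumulative counts for 100 requests (bounds in seconds).
const buckets = [
  { le: 1, count: 50 },
  { le: 5, count: 90 },
  { le: 10, count: 98 },
  { le: Infinity, count: 100 },
];

const p95 = estimateQuantile(0.95, buckets);
console.log(p95.toFixed(3)); // 8.125
console.log(p95 > 10 ? 'HighLatency would fire' : 'HighLatency would stay quiet');
```

Here the 95th request falls in the 5 to 10 second bucket, five eighths of the way through its 8 observations, which gives 5 + 5 × 5/8 = 8.125 seconds. The estimate can only be as precise as the bucket boundaries, so it pays to place a boundary near any threshold you alert on.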

📊 Grafana Dashboard

# Key monitoring panels
┌─────────────────────────────────────────────────────┐
│ 📊 Agent Dashboard                                  │
├─────────────────────────────────────────────────────┤
│ [Request rate]   [Error rate]     [P50/P95 latency] │
│ ████████ 45/s    █ 2.1%           1.2s / 5.6s       │
│                                                     │
│ [Token usage]    [Tool calls]     [Active sessions] │
│ █████ 500K       web:30 file:20   ████ 156          │
│                                                     │
│ [Cost tracking]                                     │
│ Today: $12.50 / Month: $156.00 / Budget: $500.00    │
├─────────────────────────────────────────────────────┤
│ 🔥 Recent errors                                    │
│ • 01:15 tool_timeout: web_search 30s exceeded       │
│ • 01:12 context_overflow: 8192 tokens exceeded      │
└─────────────────────────────────────────────────────┘

✅ Observability Best Practices

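On from day one: enable monitoring from the very first day; don't wait until something breaks
RED method: monitor Rate, Errors, and Duration
USE method: monitor Utilization, Saturation, and Errors
Cost tracking: token and API call costs are among the most important business metrics for an agent
Alert tiers: handle Critical alerts immediately; Warning alerts can wait until working hours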
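The RED method from the list above can be computed directly from request records, for example ones parsed out of the structured logs. This is a minimal sketch assuming a 60-second window and a hypothetical record shape of { durationMs, ok }. The redSummary and percentile helpers and the sample data are illustrative only.

```javascript
// Hypothetical request records, e.g. parsed from the structured logs.
const windowSeconds = 60;
const requests = [
  { durationMs: 800, ok: true },
  { durationMs: 1250, ok: true },
  { durationMs: 400, ok: true },
  { durationMs: 9000, ok: false },
  { durationMs: 1500, ok: true },
];

// Nearest-rank percentile over raw samples, with p in (0, 100].
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Summarize Rate, Errors, and Duration for one time window.
function redSummary(records, seconds) {
  const errors = records.filter((r) => !r.ok).length;
  const durations = records.map((r) => r.durationMs);
  return {
    ratePerSecond: records.length / seconds,
    errorRatio: records.length ? errors / records.length : 0,
    p50Ms: percentile(durations, 50),
    p95Ms: percentile(durations, 95),
  };
}

const summary = redSummary(requests, windowSeconds);
console.log(summary);
```

With only five samples, the P95 is simply the slowest request (9000 ms), while the P50 is 1250 ms. That gap is why a dashboard shows both: the median says what a typical user sees, and the tail says how bad the worst cases get.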

⚠️ Common Pitfalls

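Excessive logging that blows up storage: set a sensible log retention policy
Too many metrics leading to alert fatigue: monitor only the metrics that truly matter
Ignoring cost tracking: token costs can exceed server costs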
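Cost tracking is the easiest of these pitfalls to address, because the token counts are already in the logs and metrics. The sketch below turns token usage into an estimated spend and a budget status. The prices are placeholders rather than real pricing for any model, and estimateCostUsd, budgetStatus, and the 80% warning threshold are hypothetical choices for illustration.

```javascript
// Placeholder prices in USD per million tokens. NOT real pricing for any model.
const PRICE_PER_MILLION = { input: 10, output: 30 };

// Estimate spend from input/output token counts.
function estimateCostUsd(usage, prices = PRICE_PER_MILLION) {
  return (usage.input * prices.input + usage.output * prices.output) / 1e6;
}

// Classify spend against a budget; `warnAt` is the share of budget that triggers a warning.
function budgetStatus(spentUsd, budgetUsd, warnAt = 0.8) {
  const fraction = spentUsd / budgetUsd;
  if (fraction >= 1) return 'exceeded';
  if (fraction >= warnAt) return 'warning';
  return 'ok';
}

const usage = { input: 400000, output: 100000 };
const cost = estimateCostUsd(usage);
console.log(`estimated cost: $${cost.toFixed(2)}`); // estimated cost: $7.00
console.log(budgetStatus(cost, 500));               // ok
console.log(budgetStatus(450, 500));                // warning
```

Output tokens are often priced higher than input tokens, so tracking the two separately, as the agent_token_usage counter's type label does, gives a more accurate estimate than a single total.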