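At 3:30 a.m., my agent hit a bug in production, and I had no idea what it had done. That was when it sank in: you can't fix what you can't see.
Observability is what keeps an AI agent system running reliably. It rests on four pillars: logs, metrics, distributed traces, and alerts. Together they show what the agent is doing and how healthy it is.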
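| Pillar | Purpose | Key signals |
|---|---|---|
| 🪵 Logs | Record what happened in detail | Request logs, error logs, audit logs |
| 📈 Metrics | Quantify runtime state | Latency, throughput, error rate, token consumption |
| 🔍 Traces | Follow a request through the system | Request duration, tool call chain, context propagation |
| 🔔 Alerts | Proactively notify on anomalies | Threshold alerts, anomaly detection, recovery notifications |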
┌─────────────────────────────────────────────┐
│ Agent Runtime                               │
├─────────────────────────────────────────────┤
│ OpenTelemetry Instrumentation               │
│ ├── Metrics (Prometheus format)             │
│ ├── Traces (Jaeger/Zipkin)                  │
│ ├── Logs (Structured JSON)                  │
│ └── Health Checks (HTTP)                    │
├─────────────────────────────────────────────┤
│ Monitoring Stack                            │
│ ├── Prometheus (Metrics Storage)            │
│ ├── Grafana (Visualization)                 │
│ ├── Jaeger (Distributed Tracing)            │
│ ├── Loki (Log Aggregation)                  │
│ └── AlertManager (Alerting)                 │
└─────────────────────────────────────────────┘
// OpenClaw Agent metrics configuration
const agent = new Agent({
  name: 'observed-agent',
  model: 'claude-3-opus',
  observability: {
    metrics: {
      enabled: true,
      // Port that serves /metrics for Prometheus to scrape.
      // 9090 is also the Prometheus server's default port, so change one
      // of them if both run on the same host.
      port: 9090,
      // Custom metrics
      customMetrics: [
        {
          name: 'agent_request_duration',
          type: 'histogram',
          labels: ['model', 'tool', 'status']
        },
        {
          name: 'agent_token_usage',
          type: 'counter',
          labels: ['type'] // input/output
        },
        {
          name: 'agent_tool_calls',
          type: 'counter',
          labels: ['tool', 'success']
        }
      ]
    }
  }
});

// Prometheus scrapes the metrics endpoint:
// GET http://localhost:9090/metrics
// agent_request_duration_seconds_bucket{model="claude-3-opus",...}
// agent_token_usage_total{type="input"} 12345
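To make the scrape output less magical, here is a minimal, dependency-free sketch of how a histogram turns individual observations into `_bucket`, `_sum`, and `_count` lines. `MiniHistogram` and the bucket bounds are assumptions made up for illustration; this is not OpenClaw's API or any client library's, and it leaves out labels such as `model` and `tool` to stay short.

```javascript
// Illustrative only: a tiny histogram that renders Prometheus text lines.
// MiniHistogram is a hypothetical helper, not part of any real library.
class MiniHistogram {
  constructor(name, upperBounds) {
    this.name = name;
    this.upperBounds = [...upperBounds].sort((a, b) => a - b);
    this.bucketCounts = new Array(this.upperBounds.length).fill(0);
    this.count = 0;
    this.sum = 0;
  }

  // Record one observation, e.g. a request duration in seconds.
  observe(value) {
    this.count += 1;
    this.sum += value;
    // Buckets are cumulative: a value counts toward every bucket
    // whose upper bound is greater than or equal to it.
    this.upperBounds.forEach((bound, i) => {
      if (value <= bound) this.bucketCounts[i] += 1;
    });
  }

  // Render the lines a scraper would read from the metrics endpoint.
  render() {
    const lines = [`# TYPE ${this.name} histogram`];
    this.upperBounds.forEach((bound, i) => {
      lines.push(`${this.name}_bucket{le="${bound}"} ${this.bucketCounts[i]}`);
    });
    // The +Inf bucket always equals the total number of observations.
    lines.push(`${this.name}_bucket{le="+Inf"} ${this.count}`);
    lines.push(`${this.name}_sum ${this.sum}`);
    lines.push(`${this.name}_count ${this.count}`);
    return lines.join('\n');
  }
}

// Assumed bucket bounds in seconds; tune them to your own latency profile.
const durations = new MiniHistogram('agent_request_duration_seconds', [0.5, 1, 5, 10]);
[0.25, 0.75, 1.25, 6].forEach((seconds) => durations.observe(seconds));
console.log(durations.render());
```

The point to notice is that the buckets are cumulative, which is why quantile queries and alerts can work from bucket counts alone without storing each request's duration.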
// Configure OpenTelemetry tracing
const { trace } = require('@opentelemetry/api');
const agent = new Agent({
  name: 'traced-agent',
  observability: {
    tracing: {
      enabled: true,
      exporter: 'jaeger',
      endpoint: 'http://jaeger:4318/v1/traces', // OTLP over HTTP
      samplingRate: 0.1, // sample 10% of traces
      // Custom spans
      customSpans: {
        toolExecution: true,
        contextCompression: true,
        memoryRetrieval: true
      }
    }
  }
});

// Example trace
// [Agent Request] (100ms)
// ├── [Tool: web_search] (50ms)
// ├── [Tool: file_read] (20ms)
// └── [Context Compress] (30ms)
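A 10% sampling rate raises an obvious question: which 10%? One common approach is to decide from the trace ID itself, so that every component that sees the same trace makes the same keep-or-drop decision. The sketch below illustrates that idea under stated assumptions. `shouldSample` is a hypothetical helper, not an OpenClaw or OpenTelemetry function, and real SDK samplers differ in detail.

```javascript
// Illustrative only: a deterministic, ratio-based sampling decision.
// shouldSample is a hypothetical helper; real SDKs ship their own samplers.
function shouldSample(traceId, samplingRate) {
  if (samplingRate >= 1) return true;
  if (samplingRate <= 0) return false;
  // Treat the last 8 hex digits (32 bits) of the trace ID as a number
  // in [0, 2^32) and keep the trace if it falls below the threshold.
  const low32 = parseInt(traceId.slice(-8), 16);
  return low32 < samplingRate * 2 ** 32;
}

// The same trace ID always gets the same answer for a given rate.
const exampleTraceId = '0af7651916cd43dd8448eb211c80319c';
console.log(shouldSample(exampleTraceId, 0.1));
console.log(shouldSample(exampleTraceId, 0.2));
```

Because the decision depends only on the ID, a trace is either recorded in full or dropped in full, which avoids span trees with missing branches. The trade-off is that rare failures can fall in the unsampled 90%, so it is worth keeping errors visible through logs and metrics as well.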
// Logging configuration
const agent = new Agent({
  name: 'logged-agent',
  observability: {
    logging: {
      level: 'info',
      format: 'json',
      output: './logs/agent.log',
      // Structured fields
      fields: {
        timestamp: true,
        agent_id: true,
        session_id: true,
        request_id: true,
        tool_calls: true,
        token_usage: true,
        duration_ms: true
      },
      // Redact sensitive fields
      redact: ['api_key', 'password', 'token']
    }
  }
});
// Example log output
{
  "level": "info",
  "timestamp": "2026-05-12T01:00:00Z",
  "agent_id": "my-agent",
  "session_id": "sess_abc",
  "message": "request_completed",
  "tool_calls": ["web_search", "file_read"],
  "token_usage": {"input": 500, "output": 300},
  "duration_ms": 1250
}
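Redaction is worth understanding before you rely on it. The sketch below shows one plausible way a redact list could be applied: walk the log entry recursively and mask any key that exactly matches the list. `redact`, `REDACT_KEYS`, and the `tool_args` field are assumptions for illustration, and the framework's actual matching rules may differ.

```javascript
// Illustrative only: recursive, exact-key redaction of a log entry.
// redact and REDACT_KEYS are hypothetical names, not a framework API.
const REDACT_KEYS = new Set(['api_key', 'password', 'token']);

function redact(value) {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === 'object') {
    // Build a new object so the original entry is left untouched.
    return Object.fromEntries(
      Object.entries(value).map(([key, inner]) =>
        REDACT_KEYS.has(key) ? [key, '[REDACTED]'] : [key, redact(inner)]
      )
    );
  }
  return value;
}

const entry = {
  level: 'info',
  message: 'request_completed',
  token_usage: { input: 500, output: 300 },
  // tool_args is a made-up field to show a nested secret.
  tool_args: { api_key: 'sk-example', query: 'weather' },
};
console.log(JSON.stringify(redact(entry)));
```

With exact matching, `token_usage` survives even though `token` is on the list, which is what you want for cost tracking. A substring-based matcher would mask it, so check which behavior your logger implements.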
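# alertmanager/rules.yaml
# These are Prometheus alerting rules: Prometheus evaluates them and
# hands firing alerts to Alertmanager for routing and notification.
groups:
  - name: agent_alerts
    rules:
      - alert: HighErrorRate
        # Fires above 0.1 errors per second. This is an absolute rate,
        # not a percentage of requests.
        expr: rate(agent_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Agent error rate is too high"
      - alert: HighLatency
        # histogram_quantile needs per-second bucket rates aggregated by le,
        # not the raw cumulative bucket counters.
        expr: histogram_quantile(0.95, sum by (le) (rate(agent_request_duration_seconds_bucket[5m]))) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Agent P95 latency is above 10 seconds"
      - alert: TokenBudgetExceeded
        # Sums input and output tokens. The counter is cumulative since the
        # process started and resets on restart; for a daily budget, use
        # increase(agent_token_usage_total[1d]) inside the sum instead.
        expr: sum(agent_token_usage_total) > 1000000
        labels:
          severity: warning
        annotations:
          summary: "Token usage exceeds 1 million"
      - alert: AgentDown
        expr: up{job="openclaw-agent"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Agent service is unavailable"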
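The P95 latency rule is easier to reason about once you see how a quantile is estimated from bucket counts. The following sketch interpolates linearly inside the bucket that contains the target rank, which is similar in spirit to what PromQL's `histogram_quantile` does. `estimateQuantile` and the sample bucket counts are made up for illustration, and the helper assumes at least one finite bucket plus a final `+Inf` bucket.

```javascript
// Illustrative only: estimate a quantile from cumulative histogram buckets.
// estimateQuantile is a hypothetical helper, not a Prometheus API.
function estimateQuantile(q, buckets) {
  // buckets: [{ le: upperBound, count: cumulativeCount }], sorted by le,
  // with the last entry using le = Infinity.
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  const i = buckets.findIndex((b) => b.count >= rank);
  // If the rank lands in the +Inf bucket, the best we can report is the
  // highest finite upper bound.
  if (buckets[i].le === Infinity) return buckets[buckets.length - 2].le;
  const lower = i === 0 ? 0 : buckets[i - 1].le;
  const below = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - below;
  if (inBucket === 0) return lower;
  // Assume observations are spread evenly inside the bucket.
  return lower + (buckets[i].le - lower) * ((rank - below) / inBucket);
}

// Made-up counts: 100 requests, most of them under 5 seconds.
const sampleBuckets = [
  { le: 1, count: 50 },
  { le: 5, count: 90 },
  { le: 10, count: 99 },
  { le: Infinity, count: 100 },
];
const p95 = estimateQuantile(0.95, sampleBuckets);
console.log(p95.toFixed(2), p95 > 10 ? 'alert' : 'ok');
```

Two practical consequences follow. The estimate is only as precise as the bucket boundaries, and it can never exceed the highest finite bound, so a `> 10` threshold only works if the histogram has a bucket boundary above 10 seconds.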
# Key monitoring dashboard
┌─────────────────────────────────────────────────┐
│ 📊 Agent Dashboard                              │
├─────────────────────────────────────────────────┤
│ [Request rate]  [Error rate]  [P50/P95 latency] │
│ ████████ 45/s   █ 2.1%        1.2s / 5.6s       │
│                                                 │
│ [Token usage]   [Tool calls]  [Active sessions] │
│ █████ 500K      web:30 file:20 ████ 156         │
│                                                 │
│ [Cost tracking]                                 │
│ Today: $12.50 / Month: $156.00 / Budget: $500.00│
├─────────────────────────────────────────────────┤
│ 🔥 Recent errors                                │
│ • 01:15 tool_timeout: web_search 30s exceeded   │
│ • 01:12 context_overflow: 8192 tokens exceeded  │
└─────────────────────────────────────────────────┘