# OpenClaw Agent Monitoring and Alerting: Give Your Agent a "Dashboard"

Your Agent is running, but you have no idea what it is doing. Is it quietly burning tokens? Failing frequently? Being probed with malicious input? An Agent without monitoring is like a car without a dashboard: you can drive it, but you don't know how much fuel is left or whether the engine is overheating. In this post, we fit the Agent with a complete "dashboard".
## Why Monitor an Agent?

Core metrics:

- Performance: response latency, throughput, concurrency
- Cost: token consumption, API call charges, daily/monthly trends
- Quality: task success rate, user satisfaction, output quality scores
- Security: anomalous-input detection, permission usage, sensitive-operation audits
- Health: service availability, error rate, resource usage
## OpenClaw's Built-in Monitoring

### Enabling Monitoring

```yaml
# .openclaw/monitoring.yaml
monitoring:
  enabled: true

  # Data collection
  collection:
    interval: 30s            # collection interval
    metrics:
      - name: "request_count"
        type: counter
        labels: ["model", "status", "skill"]
      - name: "request_duration_ms"
        type: histogram
        buckets: [100, 500, 1000, 2000, 5000, 10000]
      - name: "token_usage"
        type: counter        # cumulative, so increase() queries work
        labels: ["model", "type"]  # type: input/output
      - name: "cost_usd"
        type: counter
        labels: ["model", "skill"]
      - name: "active_sessions"
        type: gauge
      - name: "error_count"
        type: counter
        labels: ["error_type", "skill"]

  # Storage backend
  storage:
    backend: "prometheus"    # prometheus | datadog | custom
    prometheus:
      port: 9090
      path: "/metrics"
```
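OpenClaw collects these metrics itself; if you are instrumenting a custom skill by hand, the same metric shapes can be reproduced with the official `prometheus_client` Python library. A minimal sketch, assuming the metric names from the config above (`record_request` is a hypothetical helper, not an OpenClaw API):

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Metric definitions mirroring .openclaw/monitoring.yaml
REQUEST_COUNT = Counter("request_count", "Total requests",
                        ["model", "status", "skill"])
REQUEST_DURATION = Histogram("request_duration_ms", "Request latency in ms",
                             buckets=[100, 500, 1000, 2000, 5000, 10000])
TOKEN_USAGE = Counter("token_usage", "Tokens consumed", ["model", "type"])
ACTIVE_SESSIONS = Gauge("active_sessions", "Currently active sessions")

def record_request(model, skill, status, duration_ms, in_tokens, out_tokens):
    """Record one completed request across all four metrics."""
    REQUEST_COUNT.labels(model=model, status=status, skill=skill).inc()
    REQUEST_DURATION.observe(duration_ms)
    TOKEN_USAGE.labels(model=model, type="input").inc(in_tokens)
    TOKEN_USAGE.labels(model=model, type="output").inc(out_tokens)

# start_http_server(9090) would expose everything at /metrics;
# after that, call record_request() at the end of each request:
record_request("claude-sonnet", "web_search", "ok", 840, 1200, 300)
```

Note that `prometheus_client` appends `_total` to counters in the exposition format, which is why PromQL queries against counters often reference names like `token_usage_total`.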
### Accessing Monitoring Data

```bash
# View live metrics
curl http://localhost:9090/metrics

# Using the openclaw CLI
openclaw metrics dashboard   # launch the monitoring dashboard
openclaw metrics top 10      # top 10 requests
openclaw metrics errors      # error statistics
openclaw metrics cost        # cost statistics
```
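What `curl` returns is the Prometheus text exposition format: one `name{labels} value` line per series. If you just need a single value in a script, a stdlib-only sketch for pulling it out (the sample lines are illustrative):

```python
def parse_metrics(text, name):
    """Extract (series, value) pairs for one metric family from
    Prometheus text exposition format."""
    results = []
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        metric, _, value = line.rpartition(" ")
        if metric == name or metric.startswith(name + "{"):
            results.append((metric, float(value)))
    return results

sample = """# TYPE request_count counter
request_count{model="claude",status="ok"} 42
request_count{model="claude",status="error"} 3
active_sessions 5
"""
print(parse_metrics(sample, "request_count"))
```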
## Grafana Dashboards

### Connecting the Prometheus Data Source

```yaml
# docker-compose.yml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
    volumes:
      - grafana-data:/var/lib/grafana
volumes:
  grafana-data:
```
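The compose file mounts a `prometheus.yml` that is not shown above. A minimal scrape config pointing at the OpenClaw exporter might look like the following (the job name and target are assumptions; also note that if the agent's exporter already occupies host port 9090, remap one side, e.g. publish Prometheus as `9091:9090`):

```yaml
# prometheus.yml (minimal, illustrative)
global:
  scrape_interval: 30s    # matches the collection interval above

scrape_configs:
  - job_name: "openclaw"
    metrics_path: /metrics
    static_configs:
      - targets: ["host.docker.internal:9090"]
```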
### Key Panel Design

| Panel | PromQL Query | Visualization |
|---|---|---|
| Request QPS | `rate(request_count[5m])` | Time series |
| P99 latency | `histogram_quantile(0.99, rate(request_duration_ms_bucket[5m]))` | Time series |
| Error rate | `rate(error_count[5m]) / rate(request_count[5m]) * 100` | Gauge |
| Daily cost | `increase(cost_usd[24h])` | Stat card |
| Token consumption | `increase(token_usage_total[24h])` | Stacked chart |
| Active sessions | `active_sessions` | Gauge |
## Alert Configuration

### Alert Rules

```yaml
# alert_rules.yml
groups:
  - name: openclaw_alerts
    rules:
      - alert: HighErrorRate
        expr: rate(error_count[5m]) / rate(request_count[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 10%"
          description: "Current error rate: {{ $value | humanizePercentage }}"

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(request_duration_ms_bucket[5m])) > 5000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency above 5 seconds"

      - alert: CostSpike
        expr: increase(cost_usd[1h]) > 10
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Hourly cost above $10"
          description: "Cost over the last hour: ${{ $value }}"

      - alert: TokenBudgetWarning
        expr: increase(token_usage_total[24h]) > 10000000
        labels:
          severity: warning
        annotations:
          summary: "24-hour token consumption above 10 million"
```
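The `HighErrorRate` expression is just a ratio of two counter rates, and it is worth sanity-checking the arithmetic before relying on it. A stdlib sketch of the same comparison (the numbers are illustrative):

```python
def error_rate(error_delta, request_delta):
    """rate(error_count)/rate(request_count) over the same window."""
    if request_delta == 0:
        return 0.0  # no traffic, nothing to alert on
    return error_delta / request_delta

# 5-minute window: 30 errors out of 240 requests -> 12.5%
rate = error_rate(30, 240)
print(rate, rate > 0.1)  # 0.125 True -> the alert would fire
```

For the rule file itself, `promtool check rules alert_rules.yml` validates the syntax before you reload Prometheus.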
### Notification Channels

```yaml
# .openclaw/monitoring.yaml
alerts:
  # Feishu alerts
  feishu:
    webhook: "${FEISHU_WEBHOOK_URL}"
    severity: ["critical", "high", "warning"]

  # Slack alerts
  slack:
    webhook: "${SLACK_WEBHOOK_URL}"
    severity: ["critical", "high"]

  # Email alerts
  email:
    smtp: "smtp.gmail.com:587"
    from: "alerts@miaoquai.com"
    to: ["ops@miaoquai.com"]
    severity: ["critical"]

  # PagerDuty (emergencies)
  pagerduty:
    routing_key: "${PAGERDUTY_KEY}"
    severity: ["critical"]
```
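Delivery is handled by OpenClaw according to this config, but the same Feishu webhook can also be driven from your own scripts: Feishu custom bots accept a small JSON payload. A stdlib sketch (the message wording is up to you):

```python
import json
import urllib.request

def build_feishu_alert(summary, description):
    """Payload shape for a Feishu custom-bot webhook (text message)."""
    return {
        "msg_type": "text",
        "content": {"text": f"[ALERT] {summary}\n{description}"},
    }

def send_alert(webhook_url, payload):
    """POST the payload to the webhook (e.g. $FEISHU_WEBHOOK_URL)."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```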
## Tracking Business Metrics

```yaml
# Custom business metrics
business_metrics:
  - name: "task_completion_rate"
    type: gauge
    description: "Task completion rate"
  - name: "user_satisfaction"
    type: gauge
    source: "feedback_scores"
    description: "User satisfaction score (1-5)"
  - name: "agent_decisions"
    type: counter
    labels: ["decision_type", "confidence"]
    description: "Agent decision counts"
  - name: "skill_usage"
    type: counter
    labels: ["skill_name", "status"]
    description: "Per-skill usage statistics"
```
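The two gauges above need their values computed somewhere. A trivial sketch of how they might be derived from raw counts and feedback scores (the function names are hypothetical):

```python
def completion_rate(completed: int, total: int) -> float:
    """Value for the task_completion_rate gauge."""
    return completed / total if total else 0.0

def satisfaction(scores: list[float]) -> float:
    """Mean 1-5 feedback score for the user_satisfaction gauge."""
    return sum(scores) / len(scores) if scores else 0.0

print(completion_rate(47, 50))        # 0.94
print(satisfaction([5, 4, 4, 5, 3]))  # 4.2
```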
## Log Management

### Structured Logging

```yaml
# .openclaw/logging.yaml
logging:
  level: "info"      # debug | info | warn | error
  format: "json"     # json | text
  outputs:
    - type: "file"
      path: "/var/log/openclaw/agent.log"
      rotation: "daily"
      retention: 30
    - type: "stdout"
      color: true
    - type: "elasticsearch"
      hosts: ["https://es.example.com:9200"]
      index: "openclaw-logs"

  # Sensitive-data redaction
  redact:
    patterns:
      - "api_key"
      - "password"
      - "token"
      - "authorization"
```
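For the `redact` list to work, the filter has to rewrite values before the line hits disk. A regex sketch over JSON-formatted log lines, assuming a flat `"key": "value"` shape (a real pipeline would redact at the structured-field level instead):

```python
import re

# Keys mirroring the redact patterns in logging.yaml
SENSITIVE_KEYS = ["api_key", "password", "token", "authorization"]
REDACT_RE = re.compile(
    r'("(?:' + "|".join(SENSITIVE_KEYS) + r')"\s*:\s*")[^"]*(")',
    re.IGNORECASE,
)

def redact(line: str) -> str:
    """Replace sensitive string values with *** before writing."""
    return REDACT_RE.sub(r"\1***\2", line)

print(redact('{"user": "li", "api_key": "sk-abc123", "msg": "hi"}'))
# {"user": "li", "api_key": "***", "msg": "hi"}
```

Note this matches exact key names only; a key like `access_token` would need its own pattern.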
### Anomaly Detection

```yaml
# Automatic anomaly detection
anomaly_detection:
  enabled: true

  # Statistics-based detection
  statistical:
    window: 24h
    sensitivity: 2.0   # multiple of the standard deviation
    metrics:
      - "request_duration_ms"
      - "error_count"
      - "cost_usd"

  # Pattern-based detection
  pattern:
    - name: "token_spike"
      condition: "token_usage > 3 * avg_7d"
      severity: "warning"
    - name: "sudden_error_increase"
      condition: "error_rate > 5 * baseline"
      severity: "critical"
    - name: "unusual_skill_usage"
      condition: "skill_calls > 10 * avg_24h"
      severity: "info"
```
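The statistical detector with `sensitivity: 2.0` is essentially a z-score test: flag any sample more than two standard deviations from the window mean. A stdlib sketch:

```python
from statistics import mean, stdev

def is_anomaly(history, current, sensitivity=2.0):
    """True if current deviates from the window mean by more than
    `sensitivity` standard deviations."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) > sensitivity * sigma

latencies = [820, 790, 805, 830, 810, 795, 815, 800]
print(is_anomaly(latencies, 1900))  # True
print(is_anomaly(latencies, 825))   # False
```

A 1900 ms latency against a window hovering around 810 ms is flagged; 825 ms is within the normal band.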
## Best Practices

### 1. Layer Your Monitoring

- Infrastructure layer: CPU, memory, disk, network
- Application layer: request volume, latency, error rate
- Business layer: task completion rate, user satisfaction
- Cost layer: token consumption, API charges

### 2. Reduce Alert Noise

- Set sensible thresholds to cut false positives
- Use grouping and inhibition to avoid alert storms
- Review alert rules regularly and delete dead ones

### 3. Run Regular Check-ups

```bash
# Daily health-check script
openclaw health check
openclaw metrics summary --period 24h
openclaw security audit --scope config
```