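OpenClaw Agent Monitoring and Alerting: Giving Your Agent a "Dashboard"

Your agent is running, but you cannot see what it is doing. Is it quietly wasting tokens? Is it throwing errors again and again? Is it being attacked with malicious input? An agent without monitoring is like a car without a dashboard: you can drive it, but you don't know how much fuel is left or whether the engine is overheating. Today we'll fit the agent with a complete "dashboard".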

Why monitor an agent?

Core metrics:

OpenClaw built-in monitoring

Enabling monitoring

# .openclaw/monitoring.yaml

monitoring:
  enabled: true
  
  # Data collection
  collection:
    interval: 30s  # collection interval
    metrics:
      - name: "request_count"
        type: counter
        labels: ["model", "status", "skill"]
      - name: "request_duration_ms"
        type: histogram
        buckets: [100, 500, 1000, 2000, 5000, 10000]
      - name: "token_usage"
        type: counter  # cumulative, so that increase(token_usage_total[24h]) in the queries below is meaningful
        labels: ["model", "type"]  # type: input/output
      - name: "cost_usd"
        type: counter
        labels: ["model", "skill"]
      - name: "active_sessions"
        type: gauge
      - name: "error_count"
        type: counter
        labels: ["error_type", "skill"]
        
  # Storage backend
  storage:
    backend: "prometheus"  # prometheus | datadog | custom
    prometheus:
      port: 9090  # the Prometheus server also listens on 9090 by default; use a different port if both share a host
      path: "/metrics"

Accessing monitoring data

# View live metrics
curl http://localhost:9090/metrics

# Use the openclaw CLI
openclaw metrics dashboard  # launch the monitoring dashboard
openclaw metrics top 10     # top 10 requests
openclaw metrics errors     # error statistics
openclaw metrics cost       # cost statistics
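
To make the `/metrics` output easier to reason about, here is a minimal sketch of parsing the Prometheus text exposition format with only the standard library. The sample payload, the label values, and the helper names `parse_metrics` and `total` are illustrative assumptions; the metrics your own instance exposes may be named or labelled differently.

```python
import re

# Matches `name{labels} value` or `name value` in the Prometheus text format.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse exposition text into a list of (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def total(samples, name, **wanted):
    """Sum every sample of `name` whose labels include the wanted pairs."""
    return sum(
        value
        for sample_name, labels, value in samples
        if sample_name == name
        and all(labels.get(key) == val for key, val in wanted.items())
    )


# Hypothetical payload, shaped like what `curl .../metrics` might return.
SAMPLE = """\
# TYPE request_count counter
request_count{model="m1",status="ok",skill="search"} 120
request_count{model="m1",status="error",skill="search"} 6
active_sessions 3
"""

samples = parse_metrics(SAMPLE)
print(total(samples, "request_count"))                  # 126.0
print(total(samples, "request_count", status="error"))  # 6.0
```

In practice you would feed the function the body of the HTTP response instead of the inline sample.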

Grafana dashboard configuration

Connecting the Prometheus data source

# docker-compose.yml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
      
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
    volumes:
      - grafana-data:/var/lib/grafana

volumes:
  grafana-data:

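Key panel design

| Panel | PromQL query | Visualization |
| --- | --- | --- |
| Request QPS | rate(request_count[5m]) | Time series |
| P99 latency | histogram_quantile(0.99, rate(request_duration_ms_bucket[5m])) | Time series |
| Error rate (%) | sum(rate(error_count[5m])) / sum(rate(request_count[5m])) * 100 | Gauge |
| Daily cost | increase(cost_usd[24h]) | Stat |
| Token usage | increase(token_usage_total[24h]) | Stacked chart |
| Active sessions | active_sessions | Gauge |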
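
The P99 panel relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below mimics that idea in plain Python; it is a simplified illustration, not Prometheus's actual implementation, and the bucket counts are made up for the example (only the bucket boundaries come from the configuration above).

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) pairs.

    `buckets` must be sorted by upper bound and end with an +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Cannot interpolate into +Inf: fall back to the last finite bound.
                return prev_bound
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Cumulative counts per bucket (hypothetical data, in milliseconds).
buckets = [
    (100, 50), (500, 80), (1000, 90), (2000, 96),
    (5000, 99), (10000, 100), (math.inf, 100),
]

print(round(histogram_quantile(0.95, buckets), 1))  # 1833.3
```

One consequence worth keeping in mind: the estimate can never be more precise than the bucket layout, so latencies above the largest finite bucket (10000 ms here) are all reported as that bound.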

Alerting configuration

Alert rules

# alert_rules.yml
groups:
  - name: openclaw_alerts
    rules:
      - alert: HighErrorRate
        expr: sum(rate(error_count[5m])) / sum(rate(request_count[5m])) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 10%"
          description: "Current error rate: {{ $value | humanizePercentage }}"
          
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(request_duration_ms_bucket[5m])) > 5000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency above 5 seconds"
          
      - alert: CostSpike
        expr: increase(cost_usd[1h]) > 10
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Hourly cost above $10"
          description: "Cost over the last hour: ${{ $value }}"
          
      - alert: TokenBudgetWarning
        expr: increase(token_usage_total[24h]) > 10000000
        labels:
          severity: warning
        annotations:
          summary: "Token usage over 24 hours exceeds 10 million"
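
The `for:` clause in these rules means an alert only fires once its condition has held continuously for that long; a brief spike merely puts it into a pending state. The following sketch simulates that behavior over a list of samples. It is a simplified model of the idea rather than Prometheus's evaluation engine, and the function name and sample values are hypothetical.

```python
def alert_states(samples, threshold, for_seconds):
    """Return (timestamp, state) pairs for a list of (timestamp, value) samples.

    state is "inactive", "pending" or "firing": the alert fires only after
    the value has stayed above the threshold for at least `for_seconds`.
    """
    states = []
    pending_since = None
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # the condition just became true
            held_for = ts - pending_since
            state = "firing" if held_for >= for_seconds else "pending"
        else:
            pending_since = None  # any dip resets the timer
            state = "inactive"
        states.append((ts, state))
    return states


# Error ratio sampled every 60 seconds (made-up data).
error_ratio = [
    (0, 0.02), (60, 0.15), (120, 0.20), (180, 0.05), (240, 0.12),
    (300, 0.30), (360, 0.20), (420, 0.20), (480, 0.20), (540, 0.20),
]

# Mirrors HighErrorRate: threshold 0.1, for: 5m.
for ts, state in alert_states(error_ratio, threshold=0.1, for_seconds=300):
    print(ts, state)
```

With this data the first spike (60 s to 120 s) never fires because the ratio dips at 180 s; the alert only reaches `firing` at 540 s, five minutes after the second breach began.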

Alert notification channels

# .openclaw/monitoring.yaml

alerts:
  # Feishu alerts
  feishu:
    webhook: "${FEISHU_WEBHOOK_URL}"
    severity: ["critical", "high", "warning"]
    
  # Slack alerts
  slack:
    webhook: "${SLACK_WEBHOOK_URL}"
    severity: ["critical", "high"]
    
  # Email alerts
  email:
    smtp: "smtp.gmail.com:587"
    from: "alerts@miaoquai.com"
    to: ["ops@miaoquai.com"]
    severity: ["critical"]
    
  # PagerDuty (urgent)
  pagerduty:
    routing_key: "${PAGERDUTY_KEY}"
    severity: ["critical"]
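
The effect of the per-channel `severity` lists can be sketched as a simple lookup: an alert goes to every channel whose list contains its severity. The snippet assumes exact string matching on the severity label, which is a guess about how the routing works, and `channels_for` is a hypothetical helper.

```python
# Severity lists copied from the configuration above.
CHANNEL_SEVERITIES = {
    "feishu": {"critical", "high", "warning"},
    "slack": {"critical", "high"},
    "email": {"critical"},
    "pagerduty": {"critical"},
}


def channels_for(severity):
    """Return the channels (sorted by name) that accept this severity."""
    return sorted(
        name
        for name, accepted in CHANNEL_SEVERITIES.items()
        if severity in accepted
    )


print(channels_for("critical"))  # ['email', 'feishu', 'pagerduty', 'slack']
print(channels_for("warning"))   # ['feishu']
print(channels_for("info"))      # []
```

Note that an `info` alert matches no channel under this model, so anything emitted at that level would need its own destination if you want to see it.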

Business metric tracking

# Custom business metrics
business_metrics:
  - name: "task_completion_rate"
    type: gauge
    description: "Task completion rate"
    
  - name: "user_satisfaction"
    type: gauge
    source: "feedback_scores"
    description: "User satisfaction score (1-5)"
    
  - name: "agent_decisions"
    type: counter
    labels: ["decision_type", "confidence"]
    description: "Count of agent decisions"
    
  - name: "skill_usage"
    type: counter
    labels: ["skill_name", "status"]
    description: "Usage statistics per Skill"

Log management

Structured logging

# .openclaw/logging.yaml

logging:
  level: "info"  # debug | info | warn | error
  format: "json"  # json | text
  
  outputs:
    - type: "file"
      path: "/var/log/openclaw/agent.log"
      rotation: "daily"
      retention: 30
      
    - type: "stdout"
      color: true
      
    - type: "elasticsearch"
      hosts: ["https://es.example.com:9200"]
      index: "openclaw-logs"
      
  # Redaction of sensitive information
  redact:
    patterns:
      - "api_key"
      - "password"
      - "token"
      - "authorization"
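
As an illustration of what the `redact` patterns are meant to achieve, the sketch below masks any field in a structured log record whose key contains one of the patterns. Matching by case-insensitive substring on key names is an assumption made for this example; the configuration does not say how the patterns are applied, and the record itself is invented.

```python
import json

# Patterns copied from the logging configuration above.
REDACT_PATTERNS = ("api_key", "password", "token", "authorization")
MASK = "[REDACTED]"


def redact(value):
    """Recursively mask values whose key contains a sensitive pattern."""
    if isinstance(value, dict):
        return {
            key: MASK
            if any(pattern in key.lower() for pattern in REDACT_PATTERNS)
            else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


record = {
    "level": "info",
    "msg": "model call finished",
    "headers": {"Authorization": "Bearer abc123"},
    "api_key": "sk-example",
    "input_tokens": 42,
}

print(json.dumps(redact(record)))
```

The `input_tokens` field is masked as well, because its key contains `token`. Broad substring patterns are safe by default but can hide harmless fields such as usage counters, so it is worth checking which keys your logs actually carry.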

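Anomaly detection

# Automatic anomaly detection
anomaly_detection:
  enabled: true
  
  # Statistics-based anomaly detection
  statistical:
    window: 24h
    sensitivity: 2.0  # multiple of the standard deviation
    metrics:
      - "request_duration_ms"
      - "error_count"
      - "cost_usd"
      
  # Pattern-based anomaly detection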
  pattern:
    - name: "token_spike"
      condition: "token_usage > 3 * avg_7d"
      severity: "warning"
      
    - name: "sudden_error_increase"
      condition: "error_rate > 5 * baseline"
      severity: "critical"
      
    - name: "unusual_skill_usage"
      condition: "skill_calls > 10 * avg_24h"
      severity: "info"
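
The statistical mode, with `sensitivity` expressed as a multiple of the standard deviation, corresponds to a classic sigma-threshold check: flag a value when it lies more than `sensitivity` standard deviations from the mean of the recent window. A minimal sketch follows, together with a multiplier check in the spirit of the `token_spike` pattern. Both functions and all the numbers are illustrative assumptions, not OpenClaw's actual detector.

```python
import statistics


def is_anomalous(history, value, sensitivity=2.0):
    """Flag `value` if it is more than `sensitivity` sigmas from the mean."""
    if len(history) < 2:
        return False  # not enough data to estimate a spread
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return value != mean
    return abs(value - mean) > sensitivity * stdev


def exceeds_baseline(value, baseline_values, factor):
    """Pattern-style rule, e.g. token_usage > 3 * avg_7d."""
    return value > factor * statistics.fmean(baseline_values)


# Request latencies in ms over the window (made-up data: mean 205, stdev 10).
latencies = [200, 220, 210, 190, 205, 215, 195, 205]

print(is_anomalous(latencies, 230))  # True: 25 ms above the mean, limit is 20
print(is_anomalous(latencies, 210))  # False

daily_tokens = [1000, 1200, 900, 1100, 1000, 950, 850]
print(exceeds_baseline(3500, daily_tokens, factor=3))  # True: the average is 1000
```

A lower `sensitivity` catches smaller deviations at the cost of more false positives; with roughly normal data, a value of 2.0 flags about 5% of ordinary samples.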

Best practices

1. Layered monitoring

2. Alert noise reduction

3. Regular inspections

# Daily health check script
openclaw health check
openclaw metrics summary --period 24h
openclaw security audit --scope config

Related resources