# OpenClaw Agent Monitoring and Alerting: Give Your Agent a "Dashboard"

Your Agent is running, but you have no idea what it is doing. Is it quietly burning tokens? Failing frequently? Being probed with malicious input? An Agent without monitoring is like a car without a dashboard: you can drive it, but you don't know how much fuel is left or whether the engine is overheating. In this post, we fit the Agent with a complete "dashboard".
## Why Monitor an Agent?

Core metrics:

- Performance: response latency, throughput, concurrency
- Cost: token consumption, API call charges, daily/monthly trends
- Quality: task success rate, user satisfaction, output quality scores
- Security: anomalous-input detection, permission usage, sensitive-operation audits
- Health: service availability, error rate, resource usage
## OpenClaw's Built-in Monitoring

### Enabling Monitoring

```yaml
# .openclaw/monitoring.yaml
monitoring:
  enabled: true

  # Data collection
  collection:
    interval: 30s            # collection interval
    metrics:
      - name: "request_count"
        type: counter
        labels: ["model", "status", "skill"]
      - name: "request_duration_ms"
        type: histogram
        buckets: [100, 500, 1000, 2000, 5000, 10000]
      - name: "token_usage"
        type: counter        # cumulative, so increase() queries work
        labels: ["model", "type"]  # type: input/output
      - name: "cost_usd"
        type: counter
        labels: ["model", "skill"]
      - name: "active_sessions"
        type: gauge
      - name: "error_count"
        type: counter
        labels: ["error_type", "skill"]

  # Storage backend
  storage:
    backend: "prometheus"    # prometheus | datadog | custom
    prometheus:
      port: 9090
      path: "/metrics"
```
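OpenClaw collects these metrics itself; if you are instrumenting a custom skill by hand, the same metric shapes can be reproduced with the official `prometheus_client` Python library. A minimal sketch, assuming the metric names from the config above (`record_request` is a hypothetical helper, not an OpenClaw API):

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Metric definitions mirroring .openclaw/monitoring.yaml
REQUEST_COUNT = Counter("request_count", "Total requests",
                        ["model", "status", "skill"])
REQUEST_DURATION = Histogram("request_duration_ms", "Request latency in ms",
                             buckets=[100, 500, 1000, 2000, 5000, 10000])
TOKEN_USAGE = Counter("token_usage", "Tokens consumed", ["model", "type"])
ACTIVE_SESSIONS = Gauge("active_sessions", "Currently active sessions")

def record_request(model, skill, status, duration_ms, in_tokens, out_tokens):
    """Record one completed request across all four metrics."""
    REQUEST_COUNT.labels(model=model, status=status, skill=skill).inc()
    REQUEST_DURATION.observe(duration_ms)
    TOKEN_USAGE.labels(model=model, type="input").inc(in_tokens)
    TOKEN_USAGE.labels(model=model, type="output").inc(out_tokens)

# start_http_server(9090) would expose everything at /metrics;
# after that, call record_request() at the end of each request:
record_request("claude-sonnet", "web_search", "ok", 840, 1200, 300)
```

Note that `prometheus_client` appends `_total` to counters in the exposition format, which is why PromQL queries against counters often reference names like `token_usage_total`.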
### Accessing Monitoring Data

```bash
# View live metrics
curl http://localhost:9090/metrics

# Using the openclaw CLI
openclaw metrics dashboard   # launch the monitoring dashboard
openclaw metrics top 10      # top 10 requests
openclaw metrics errors      # error statistics
openclaw metrics cost        # cost statistics
```
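What `curl` returns is the Prometheus text exposition format: one `name{labels} value` line per series. If you just need a single value in a script, a stdlib-only sketch for pulling it out (the sample lines are illustrative):

```python
def parse_metrics(text, name):
    """Extract (series, value) pairs for one metric family from
    Prometheus text exposition format."""
    results = []
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        metric, _, value = line.rpartition(" ")
        if metric == name or metric.startswith(name + "{"):
            results.append((metric, float(value)))
    return results

sample = """# TYPE request_count counter
request_count{model="claude",status="ok"} 42
request_count{model="claude",status="error"} 3
active_sessions 5
"""
print(parse_metrics(sample, "request_count"))
```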
## Grafana Dashboards

### Connecting the Prometheus Data Source

```yaml
# docker-compose.yml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
    volumes:
      - grafana-data:/var/lib/grafana
volumes:
  grafana-data:
```
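The compose file mounts a `prometheus.yml` that is not shown above. A minimal scrape config pointing at the OpenClaw exporter might look like the following (the job name and target are assumptions; also note that if the agent's exporter already occupies host port 9090, remap one side, e.g. publish Prometheus as `9091:9090`):

```yaml
# prometheus.yml (minimal, illustrative)
global:
  scrape_interval: 30s    # matches the collection interval above

scrape_configs:
  - job_name: "openclaw"
    metrics_path: /metrics
    static_configs:
      - targets: ["host.docker.internal:9090"]
```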
### Key Panel Design

| Panel | PromQL Query | Visualization |
|---|---|---|
| Request QPS | `rate(request_count[5m])` | Time series |
| P99 latency | `histogram_quantile(0.99, rate(request_duration_ms_bucket[5m]))` | Time series |
| Error rate | `rate(error_count[5m]) / rate(request_count[5m]) * 100` | Gauge |
| Daily cost | `increase(cost_usd[24h])` | Stat card |
| Token consumption | `increase(token_usage_total[24h])` | Stacked chart |
| Active sessions | `active_sessions` | Gauge |
## Alert Configuration

### Alert Rules

```yaml
# alert_rules.yml
groups:
  - name: openclaw_alerts
    rules:
      - alert: HighErrorRate
        expr: rate(error_count[5m]) / rate(request_count[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 10%"
          description: "Current error rate: {{ $value | humanizePercentage }}"

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(request_duration_ms_bucket[5m])) > 5000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency above 5 seconds"

      - alert: CostSpike
        expr: increase(cost_usd[1h]) > 10
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Hourly cost above $10"
          description: "Cost over the last hour: ${{ $value }}"

      - alert: TokenBudgetWarning
        expr: increase(token_usage_total[24h]) > 10000000
        labels:
          severity: warning
        annotations:
          summary: "24-hour token consumption above 10 million"
```
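The `HighErrorRate` expression is just a ratio of two counter rates, and it is worth sanity-checking the arithmetic before relying on it. A stdlib sketch of the same comparison (the numbers are illustrative):

```python
def error_rate(error_delta, request_delta):
    """rate(error_count)/rate(request_count) over the same window."""
    if request_delta == 0:
        return 0.0  # no traffic, nothing to alert on
    return error_delta / request_delta

# 5-minute window: 30 errors out of 240 requests -> 12.5%
rate = error_rate(30, 240)
print(rate, rate > 0.1)  # 0.125 True -> the alert would fire
```

For the rule file itself, `promtool check rules alert_rules.yml` validates the syntax before you reload Prometheus.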
### Notification Channels

```yaml
# .openclaw/monitoring.yaml
alerts:
  # Feishu alerts
  feishu:
    webhook: "${FEISHU_WEBHOOK_URL}"
    severity: ["critical", "high", "warning"]

  # Slack alerts
  slack:
    webhook: "${SLACK_WEBHOOK_URL}"
    severity: ["critical", "high"]

  # Email alerts
  email:
    smtp: "smtp.gmail.com:587"
    from: "alerts@miaoquai.com"
    to: ["ops@miaoquai.com"]
    severity: ["critical"]

  # PagerDuty (emergencies)
  pagerduty:
    routing_key: "${PAGERDUTY_KEY}"
    severity: ["critical"]
```
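Delivery is handled by OpenClaw according to this config, but the same Feishu webhook can also be driven from your own scripts: Feishu custom bots accept a small JSON payload. A stdlib sketch (the message wording is up to you):

```python
import json
import urllib.request

def build_feishu_alert(summary, description):
    """Payload shape for a Feishu custom-bot webhook (text message)."""
    return {
        "msg_type": "text",
        "content": {"text": f"[ALERT] {summary}\n{description}"},
    }

def send_alert(webhook_url, payload):
    """POST the payload to the webhook (e.g. $FEISHU_WEBHOOK_URL)."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```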
## Tracking Business Metrics

```yaml
# Custom business metrics
business_metrics:
  - name: "task_completion_rate"
    type: gauge
    description: "Task completion rate"
  - name: "user_satisfaction"
    type: gauge
    source: "feedback_scores"
    description: "User satisfaction score (1-5)"
  - name: "agent_decisions"
    type: counter
    labels: ["decision_type", "confidence"]
    description: "Agent decision counts"
  - name: "skill_usage"
    type: counter
    labels: ["skill_name", "status"]
    description: "Per-skill usage statistics"
```
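The two gauges above need their values computed somewhere. A trivial sketch of how they might be derived from raw counts and feedback scores (the function names are hypothetical):

```python
def completion_rate(completed: int, total: int) -> float:
    """Value for the task_completion_rate gauge."""
    return completed / total if total else 0.0

def satisfaction(scores: list[float]) -> float:
    """Mean 1-5 feedback score for the user_satisfaction gauge."""
    return sum(scores) / len(scores) if scores else 0.0

print(completion_rate(47, 50))        # 0.94
print(satisfaction([5, 4, 4, 5, 3]))  # 4.2
```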
## Log Management

### Structured Logging

```yaml
# .openclaw/logging.yaml
logging:
  level: "info"      # debug | info | warn | error
  format: "json"     # json | text
  outputs:
    - type: "file"
      path: "/var/log/openclaw/agent.log"
      rotation: "daily"
      retention: 30
    - type: "stdout"
      color: true
    - type: "elasticsearch"
      hosts: ["https://es.example.com:9200"]
      index: "openclaw-logs"

  # Sensitive-data redaction
  redact:
    patterns:
      - "api_key"
      - "password"
      - "token"
      - "authorization"
```
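For the `redact` list to work, the filter has to rewrite values before the line hits disk. A regex sketch over JSON-formatted log lines, assuming a flat `"key": "value"` shape (a real pipeline would redact at the structured-field level instead):

```python
import re

# Keys mirroring the redact patterns in logging.yaml
SENSITIVE_KEYS = ["api_key", "password", "token", "authorization"]
REDACT_RE = re.compile(
    r'("(?:' + "|".join(SENSITIVE_KEYS) + r')"\s*:\s*")[^"]*(")',
    re.IGNORECASE,
)

def redact(line: str) -> str:
    """Replace sensitive string values with *** before writing."""
    return REDACT_RE.sub(r"\1***\2", line)

print(redact('{"user": "li", "api_key": "sk-abc123", "msg": "hi"}'))
# {"user": "li", "api_key": "***", "msg": "hi"}
```

Note this matches exact key names only; a key like `access_token` would need its own pattern.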
### Anomaly Detection

```yaml
# Automatic anomaly detection
anomaly_detection:
  enabled: true

  # Statistics-based detection
  statistical:
    window: 24h
    sensitivity: 2.0   # multiple of the standard deviation
    metrics:
      - "request_duration_ms"
      - "error_count"
      - "cost_usd"

  # Pattern-based detection
  pattern:
    - name: "token_spike"
      condition: "token_usage > 3 * avg_7d"
      severity: "warning"
    - name: "sudden_error_increase"
      condition: "error_rate > 5 * baseline"
      severity: "critical"
    - name: "unusual_skill_usage"
      condition: "skill_calls > 10 * avg_24h"
      severity: "info"
```
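The statistical detector with `sensitivity: 2.0` is essentially a z-score test: flag any sample more than two standard deviations from the window mean. A stdlib sketch:

```python
from statistics import mean, stdev

def is_anomaly(history, current, sensitivity=2.0):
    """True if current deviates from the window mean by more than
    `sensitivity` standard deviations."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) > sensitivity * sigma

latencies = [820, 790, 805, 830, 810, 795, 815, 800]
print(is_anomaly(latencies, 1900))  # True
print(is_anomaly(latencies, 825))   # False
```

A 1900 ms latency against a window hovering around 810 ms is flagged; 825 ms is within the normal band.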
## Best Practices

### 1. Layer Your Monitoring

- Infrastructure layer: CPU, memory, disk, network
- Application layer: request volume, latency, error rate
- Business layer: task completion rate, user satisfaction
- Cost layer: token consumption, API charges

### 2. Reduce Alert Noise

- Set sensible thresholds to cut false positives
- Use grouping and inhibition to avoid alert storms
- Review alert rules regularly and delete dead ones

### 3. Run Regular Check-ups

```bash
# Daily health-check script
openclaw health check
openclaw metrics summary --period 24h
openclaw security audit --scope config
```