Prometheus + Grafana Monitoring Solution

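Architecture Overview

Monitoring architecture

Prometheus scrapes and stores metric data, and Grafana handles visualization and alerting. In the setup described here, alert rules are evaluated by Prometheus and notifications are sent through Alertmanager.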

graph LR
    A[Application] -->|exposes metrics| B[Prometheus]
    C[Node Exporter] -->|system metrics| B
    D[Other exporters] -->|other metrics| B
    B -->|query data| E[Grafana]
    E -->|visualization| F[Dashboard]
    B -->|alert rules| G[Alertmanager]
    G -->|notifications| H[Email/DingTalk/Slack]

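Prometheus Overview

📊 Time-series database - designed specifically for time-series data

🎯 Multi-dimensional data model - flexible querying by labels

💪 Powerful query language - PromQL is feature-rich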
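
To make the multi-dimensional data model concrete, here is a minimal Python sketch. It is an illustration only, not how Prometheus stores data internally. Each time series is identified by a metric name plus a set of labels, and a selector such as http_requests_total{method="GET"} keeps the series whose labels match. The `Series` type, the `select` helper and the sample series are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Series:
    """One time series: a metric name plus a set of label pairs."""
    name: str
    labels: tuple  # sorted (label, value) pairs, so the series is hashable

    @staticmethod
    def of(name, **labels):
        return Series(name, tuple(sorted(labels.items())))


def select(series_list, name, **matchers):
    """Return series with the given name whose labels equal every matcher."""
    return [
        s for s in series_list
        if s.name == name
        and all(dict(s.labels).get(k) == v for k, v in matchers.items())
    ]


all_series = [
    Series.of("http_requests_total", method="GET", status="200"),
    Series.of("http_requests_total", method="GET", status="500"),
    Series.of("http_requests_total", method="POST", status="200"),
]

# Equivalent in spirit to the selector http_requests_total{method="GET"}
get_series = select(all_series, "http_requests_total", method="GET")
print(len(get_series))  # 2
```

Every distinct combination of label values is its own series, which is why the same metric name can be sliced by method, status, or both.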

Quick Deployment

One-Command Deployment with Docker Compose

docker-compose.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
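    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerts:/etc/prometheus/alerts   # rule files matched by rule_files ('alerts/*.yml') in prometheus.yml
      - prometheus_data:/prometheus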
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=30d'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
    depends_on:
      - prometheus
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    ports:
      - "9100:9100"
    command:
      - '--path.rootfs=/host'
    volumes:
      - '/:/host:ro,rslave'
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:

Prometheus Configuration

prometheus.yml
global:
  scrape_interval: 15s      # how often targets are scraped
  evaluation_interval: 15s  # how often rules are evaluated
  external_labels:
    cluster: 'production'
    region: 'cn-east'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

# Alerting rule files
rule_files:
  - 'alerts/*.yml'

# Scrape configuration
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter (system metrics)
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          instance: 'server-01'

  # .NET application
  - job_name: 'dotnet-app'
    static_configs:
      - targets: ['app:5000']
        labels:
          app: 'my-api'
          env: 'production'

  # Service discovery (Kubernetes)
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
Starting the Services

# Start all services
docker-compose up -d

# Follow the logs
docker-compose logs -f prometheus
docker-compose logs -f grafana

# Access
# Prometheus: http://localhost:9090
# Grafana: http://localhost:3000 (admin/admin123)
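
Once the containers are up, a small script can confirm that each service answers before you move on. The sketch below is one possible way to do it with the Python standard library. It assumes the port mappings from docker-compose.yml, the /-/healthy endpoints of Prometheus and Alertmanager, and Grafana's /api/health endpoint. Node Exporter is simply probed at /metrics. The helper names are hypothetical.

```python
import urllib.error
import urllib.request

# Health endpoints for the services defined in docker-compose.yml (assumed defaults).
ENDPOINTS = {
    "prometheus": "http://localhost:9090/-/healthy",
    "grafana": "http://localhost:3000/api/health",
    "alertmanager": "http://localhost:9093/-/healthy",
    "node-exporter": "http://localhost:9100/metrics",
}


def is_up(url, opener=urllib.request.urlopen, timeout=3):
    """Return True if the URL answers with HTTP 200, False on any connection error."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def check_all(endpoints=ENDPOINTS, opener=urllib.request.urlopen):
    """Map each service name to whether its health endpoint responded."""
    return {name: is_up(url, opener) for name, url in endpoints.items()}
```

Calling `check_all()` after `docker-compose up -d` returns a dict such as `{"prometheus": True, ...}`. The `opener` parameter exists only so the function can be exercised without a network.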

.NET Application Integration

1. Install the NuGet package

dotnet add package prometheus-net.AspNetCore

2. Configure the application

Minimal configuration

Program.cs
using Prometheus;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddControllers();

var app = builder.Build();

// Expose the metrics endpoint
app.UseMetricServer();     // default path: /metrics

// Record HTTP request metrics automatically
app.UseHttpMetrics();

app.MapControllers();

app.Run();

Advanced configuration

Program.cs
using Prometheus;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddControllers();
builder.Services.AddHealthChecks(); // required by MapHealthChecks below

var app = builder.Build();

// Custom metrics endpoint (port and path)
app.UseMetricServer(port: 9090, url: "/metrics");

// HTTP metrics middleware
app.UseHttpMetrics(options =>
{
    // Add a custom label
    options.AddCustomLabel("host", context => context.Request.Host.Host);

    // Enable or disable individual HTTP metrics
    options.RequestCount.Enabled = true;
    options.RequestDuration.Enabled = true;
});

// Health check endpoint
app.MapHealthChecks("/health");

// Map the metrics endpoint separately through endpoint routing
app.MapMetrics("/metrics");

app.MapControllers();

app.Run();

3. Custom metrics

using Prometheus;

public class OrderService
{
    // Counter: a value that only ever increases
    private static readonly Counter OrdersCreated = Metrics
        .CreateCounter("orders_created_total", "Total number of orders created",
            new CounterConfiguration
            {
                LabelNames = new[] { "status", "payment_method" }
            });

    // Gauge: a value that can go up and down
    private static readonly Gauge OrdersInProgress = Metrics
        .CreateGauge("orders_in_progress", "Number of orders currently being processed");

    // Histogram: distribution of observed values in buckets
    private static readonly Histogram OrderProcessingDuration = Metrics
        .CreateHistogram("order_processing_duration_seconds", "Order processing duration in seconds",
            new HistogramConfiguration
            {
                Buckets = Histogram.LinearBuckets(start: 0.1, width: 0.1, count: 10)
            });

    // Summary: quantiles computed on the client
    private static readonly Summary OrderAmount = Metrics
        .CreateSummary("order_amount_yuan", "Order amount in yuan",
            new SummaryConfiguration
            {
                Objectives = new[]
                {
                    new QuantileEpsilonPair(0.5, 0.05),  // median
                    new QuantileEpsilonPair(0.9, 0.01),  // 90th percentile
                    new QuantileEpsilonPair(0.99, 0.001) // 99th percentile
                }
            });

    public async Task<Order> CreateOrderAsync(CreateOrderRequest request)
    {
        OrdersInProgress.Inc(); // one more order in progress

        using (OrderProcessingDuration.NewTimer()) // records the elapsed time on dispose
        {
            try
            {
                var order = await ProcessOrderAsync(request);

                // Record metrics for the completed order
                OrdersCreated
                    .WithLabels(order.Status, order.PaymentMethod)
                    .Inc();

                OrderAmount.Observe(order.TotalAmount);

                return order;
            }
            finally
            {
                OrdersInProgress.Dec(); // one fewer order in progress
            }
        }
    }
}
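
To see what the histogram configuration above implies, the following Python sketch reproduces the bucket layout of LinearBuckets(start: 0.1, width: 0.1, count: 10) and counts a few observations the way a Prometheus histogram does: each bucket is cumulative and there is an implicit +Inf bucket. The helper functions and the sample durations are hypothetical. This is a model of the semantics, not prometheus-net code.

```python
import math


def linear_buckets(start, width, count):
    """Upper bounds start, start+width, ... (count values), rounded to hide float noise."""
    return [round(start + i * width, 10) for i in range(count)]


def cumulative_counts(bounds, observations):
    """Count observations <= each upper bound, plus a final +Inf bucket."""
    all_bounds = list(bounds) + [math.inf]
    return [(le, sum(1 for v in observations if v <= le)) for le in all_bounds]


bounds = linear_buckets(start=0.1, width=0.1, count=10)
print(bounds)  # ten bounds from 0.1 to 1.0

durations = [0.05, 0.12, 0.31, 0.95, 2.4]  # made-up processing times in seconds
for le, count in cumulative_counts(bounds, durations):
    print(f'le="{le}" {count}')
```

Note that the 2.4 s observation only appears in the +Inf bucket. With these bounds anything slower than 1 second cannot be located more precisely, so the bucket range should cover the latencies you expect to alert on.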

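PromQL Query Language

Basic queries

# Note: metric and label names depend on the client library. The generic
# http_requests_total / status names are used throughout; prometheus-net's HTTP
# middleware, for example, exports http_requests_received_total with a code label.

# Select a metric
http_requests_total

# Filter by labels
http_requests_total{method="GET", status="200"}

# Range vector selector (samples from the last 5 minutes)
http_requests_total[5m]

# Per-second rate
rate(http_requests_total[5m])

# Sum over all series
sum(rate(http_requests_total[5m]))

# Sum grouped by label
sum(rate(http_requests_total[5m])) by (method)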

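Common functions

Function	Description	Example
`rate()`	Per-second average rate of increase over the range	`rate(requests_total[5m])`
`irate()`	Instantaneous rate based on the last two samples in the range	`irate(requests_total[5m])`
`sum()`	Sum	`sum(memory_usage) by (instance)`
`avg()`	Average	`avg(cpu_usage)`
`max()` / `min()`	Maximum / minimum	`max(latency)`
`count()`	Number of series	`count(up == 1)`
`topk()`	Largest K series by value	`topk(5, http_requests_total)`
`histogram_quantile()`	Quantile estimated from histogram buckets	`histogram_quantile(0.95, ...)`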
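
The difference between rate() and irate() is easier to see with numbers. The sketch below is a simplified Python model under stated assumptions: it handles counter resets the way PromQL does (a drop means the counter restarted from zero) but omits the extrapolation to the window boundaries that the real rate() performs. The function names and samples are hypothetical.

```python
def simple_rate(samples):
    """Per-second average increase over the samples, treating drops as counter resets.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    """
    if len(samples) < 2:
        return None  # a rate needs at least two samples
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A counter only goes up; a lower value means the process restarted from zero.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


def simple_irate(samples):
    """Rate based only on the last two samples."""
    return simple_rate(samples[-2:])


# Scraped every 15s; the counter resets between t=30 and t=45.
samples = [(0, 100), (15, 130), (30, 160), (45, 20), (60, 50)]
print(simple_rate(samples))   # (30 + 30 + 20 + 30) / 60 seconds
print(simple_irate(samples))  # 30 / 15 = 2.0
```

Because irate() looks at only two samples it reacts quickly but is noisy, which is why rate() is usually the safer choice in alert expressions.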

Practical query examples

# CPU usage (%)
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage (%)
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# HTTP request rate (QPS)
sum(rate(http_requests_total[1m])) by (method, endpoint)

# 95th percentile latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Error rate (%)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100

# Success rate (%), counting every 2xx response rather than only 200
sum(rate(http_requests_total{status=~"2.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100
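
histogram_quantile() does not see individual requests. It estimates the quantile from cumulative bucket counts by interpolating linearly inside the bucket where the target rank falls. A minimal Python sketch of that idea, with hypothetical bucket values, looks like this (the real function has further edge-case handling that is left out here):

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Beyond the last finite bucket (or an empty bucket): report the lower bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# 100 requests: 50 took <= 0.1s, 90 took <= 0.5s, 99 took <= 1s.
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.95, buckets))   # between 0.5 and 1.0
print(histogram_quantile(0.999, buckets))  # 1.0, the highest finite bound
```

The estimate can never be finer than the bucket boundaries, so the accuracy of a P95 query depends directly on how the histogram buckets were chosen.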

Grafana Configuration

1. Add a data source

Log in to Grafana (http://localhost:3000)
Configuration → Data Sources (in recent Grafana versions: Connections → Data sources)
Add data source → Prometheus
URL: http://prometheus:9090
Save & Test

2. Import a dashboard

Commonly used dashboard IDs
{
  "Node Exporter": 1860,
  ".NET Core": 10915,
  "Docker": 893,
  "Kubernetes": 315
}

Import steps:

Dashboard → Import
Enter the dashboard ID
Select the Prometheus data source
Import

3. Create a custom dashboard

Example panel configuration
{
  "title": "API request rate (QPS)",
  "targets": [
    {
      "expr": "sum(rate(http_requests_total[1m])) by (endpoint)",
      "legendFormat": "{{endpoint}}"
    }
  ],
  "type": "timeseries"
}

Alerting Configuration

1. Prometheus alerting rules

alerts/app.yml
groups:
  - name: application
    interval: 30s
    rules:
      # API error rate too high
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High API error rate: {{ $value | humanizePercentage }}"
          description: "The share of 5xx responses has been above 5% for 5 minutes"

      # Response time too long
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High API response time"
          description: "P95 latency: {{ $value }}s"

      # Service unavailable
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service down: {{ $labels.job }}"
          description: "Instance {{ $labels.instance }} has been down for more than 1 minute"

      # CPU usage too high
      - alert: HighCPUUsage
        expr: |
          100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage: {{ $value }}%"

      # Memory usage too high
      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage: {{ $value }}%"

      # Disk space running low
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space: {{ $value }}% available"
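
The for field in each rule means the expression must stay true across consecutive evaluations for that long before the alert fires. Until then the alert is pending. The Python sketch below models that state machine under simplifying assumptions (one alert instance, fixed evaluation times). The function name and the evaluation data are hypothetical.

```python
def alert_states(evaluations, for_seconds):
    """Return the alert state after each rule evaluation.

    evaluations: list of (timestamp_seconds, condition_is_true), oldest first.
    """
    states = []
    active_since = None  # when the condition first became true in the current streak
    for timestamp, condition in evaluations:
        if not condition:
            active_since = None  # any false evaluation resets the streak
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held_for = timestamp - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# Evaluated every 30s with for: 1m; the blip at t=30 never fires.
evaluations = [(0, False), (30, True), (60, False), (90, True), (120, True), (150, True)]
print(alert_states(evaluations, for_seconds=60))
# ['inactive', 'pending', 'inactive', 'pending', 'pending', 'firing']
```

A longer for value filters out more noise but delays the notification by the same amount, so it is a trade-off per alert.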

2. Alertmanager configuration

alertmanager.yml
global:
  resolve_timeout: 5m

# Alert routing
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    # Critical alerts
    - match:
        severity: critical
      receiver: 'critical'
      continue: true

    # Warning-level alerts
    - match:
        severity: warning
      receiver: 'warning'

# Receivers
receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://alertmanager-webhook:5001/webhook'

  - name: 'critical'
    # DingTalk notification. Alertmanager's generic webhook payload is not in the
    # DingTalk robot message format, so a forwarding adapter is usually placed in between.
    webhook_configs:
      - url: 'https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN'
        send_resolved: true

    # Email notification
    email_configs:
      - to: 'ops@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alertmanager@example.com'
        auth_password: 'password'

  - name: 'warning'
    webhook_configs:
      - url: 'https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN'

# Inhibition rules
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
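
The inhibition rule above mutes a warning alert while a critical alert with the same alertname and instance is firing. The following Python sketch models that decision for a single rule, with alerts reduced to dicts of labels. The function name and the example alerts (including an alert that exists at both severities) are assumptions for illustration.

```python
def is_inhibited(target, firing_alerts, source_match, target_match, equal):
    """Return True if `target` is muted by some firing alert under one inhibit rule."""
    def matches(alert, matcher):
        return all(alert.get(k) == v for k, v in matcher.items())

    if not matches(target, target_match):
        return False
    for source in firing_alerts:
        if source is target or not matches(source, source_match):
            continue
        # Source and target must agree on every label listed in `equal`.
        if all(source.get(label) == target.get(label) for label in equal):
            return True
    return False


rule = dict(
    source_match={"severity": "critical"},
    target_match={"severity": "warning"},
    equal=["alertname", "instance"],
)

critical = {"alertname": "DiskSpaceLow", "instance": "server-01", "severity": "critical"}
warn_same = {"alertname": "DiskSpaceLow", "instance": "server-01", "severity": "warning"}
warn_other = {"alertname": "DiskSpaceLow", "instance": "server-02", "severity": "warning"}
firing = [critical, warn_same, warn_other]

print(is_inhibited(warn_same, firing, **rule))   # True
print(is_inhibited(warn_other, firing, **rule))  # False
```

Since the rule requires an equal alertname, it only has an effect when the same alert is defined at both warning and critical levels.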

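Best Practices

Monitoring best practices

1. Choose the right metric type

Counter: monotonically increasing values such as request and error counts
Gauge: values that can go up and down, such as CPU and memory usage
Histogram: distributions such as request latency and response size
Summary: similar to Histogram, but quantiles are computed on the client

2. Keep labels under control

// ✅ Good: the set of label values is bounded (status, payment_method)
OrdersCreated.WithLabels("success", "alipay").Inc();

// ❌ Bad: unbounded label values produce a huge number of time series
// (assumes a counter declared with a single order_id label)
OrdersCreated.WithLabels("12345").Inc();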
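
The cost of a label is multiplicative: in the worst case a metric produces one time series per combination of label values. A quick back-of-the-envelope calculation in Python shows the effect. The label values and the order count are made up.

```python
from math import prod


def series_count(label_values):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


bounded = {
    "status": ["success", "failed", "cancelled"],
    "payment_method": ["alipay", "wechat", "card"],
}
print(series_count(bounded))  # 9

# Adding an order_id label multiplies the count by the number of orders seen.
unbounded = dict(bounded, order_id=[str(i) for i in range(100_000)])
print(series_count(unbounded))  # 900000
```

Identifiers such as order IDs, user IDs or raw URLs belong in logs or traces rather than in metric labels.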

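3. Follow naming conventions

Metric name: <namespace>_<name>_<unit>
Examples: http_requests_total, http_request_duration_seconds (prefer base units such as seconds and bytes, and ratios from 0 to 1 rather than percentages)

4. Set a sensible scrape interval

15s, as configured in prometheus.yml, suits most scenarios
Critical metrics can be scraped as often as every 5s
Non-critical metrics can be relaxed to 1m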
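
The scrape interval directly determines how many samples are stored. The sketch below estimates daily sample volume and disk usage as retention time × samples per second × bytes per sample. The figure of 2 bytes per sample and the count of 10,000 series are assumptions for illustration. Measure your own installation before sizing disks.

```python
SECONDS_PER_DAY = 86_400


def samples_per_day(series, scrape_interval_seconds):
    """Samples ingested per day for a number of series at a given scrape interval."""
    return series * SECONDS_PER_DAY // scrape_interval_seconds


def disk_bytes(series, scrape_interval_seconds, retention_days, bytes_per_sample=2):
    """Rough disk estimate: retention * ingestion rate * assumed bytes per sample."""
    return samples_per_day(series, scrape_interval_seconds) * retention_days * bytes_per_sample


for interval in (5, 15, 60):
    gib = disk_bytes(10_000, interval, retention_days=30) / 2**30
    print(f"{interval:>2}s interval: {samples_per_day(10_000, interval):>11,} samples/day, "
          f"~{gib:.1f} GiB for 30 days")
```

Going from 15s to 5s triples both numbers, which is why short intervals are best reserved for the jobs that really need them.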

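5. Alerting tips

Use the for clause to ignore short-lived spikes
Set sensible thresholds
Use multiple severity levels (warning/critical)
Configure alert grouping and inhibition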

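Summary

Key takeaways

Prometheus + Grafana is a mainstream monitoring stack in the industry
It supports multi-dimensional metric collection and flexible querying
Exporters make it possible to monitor a wide range of infrastructure
.NET applications integrate easily through prometheus-net
Flexible alerting rules enable proactive monitoring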
