
Prometheus + Grafana Monitoring Solution

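Architecture Overview

Monitoring architecture

Prometheus collects and stores metrics, and Grafana handles visualization. In the setup below, alert rules are evaluated by Prometheus and notifications are routed by Alertmanager, although Grafana also offers its own alerting.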

graph LR
A[Application] -->|exposes metrics| B[Prometheus]
C[Node Exporter] -->|system metrics| B
D[Other exporters] -->|other metrics| B
B -->|query data| E[Grafana]
E -->|visualization| F[Dashboard]
B -->|alert rules| G[Alertmanager]
G -->|notifications| H[Email/DingTalk/Slack]

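About Prometheus

📊 Time-series database - designed specifically for time-series data

🎯 Multi-dimensional data model - flexible queries based on labels

💪 Powerful query language - PromQL offers a rich set of functions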
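
As a rough illustration of the multi-dimensional model, the sketch below treats each series as a metric name plus a label set and filters by label equality, similar in spirit to a PromQL selector. The `select` helper and the sample data are hypothetical; real Prometheus also supports regex and negative matchers.

```python
# Each time series is identified by a metric name plus a set of labels.
series = [
    {"__name__": "http_requests_total", "method": "GET", "status": "200", "value": 1027.0},
    {"__name__": "http_requests_total", "method": "POST", "status": "200", "value": 310.0},
    {"__name__": "http_requests_total", "method": "GET", "status": "500", "value": 4.0},
]


def select(all_series, name, **matchers):
    """Return the series whose name and labels equal the given matchers."""
    return [
        s for s in all_series
        if s["__name__"] == name
        and all(s.get(label) == want for label, want in matchers.items())
    ]


# Roughly what the selector http_requests_total{method="GET"} picks out.
get_series = select(series, "http_requests_total", method="GET")
print([s["status"] for s in get_series])
```

Adding a label matcher narrows the result without changing the metric name, which is what makes label-based queries flexible.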

Quick Deployment

One-command deployment with Docker Compose

docker-compose.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      # Rule files referenced by rule_files ('alerts/*.yml') in prometheus.yml
      - ./alerts:/etc/prometheus/alerts
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=30d'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      # Example credentials only; use a strong password outside local testing
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
    depends_on:
      - prometheus
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    ports:
      - "9100:9100"
    command:
      - '--path.rootfs=/host'
    volumes:
      - '/:/host:ro,rslave'
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:

Prometheus Configuration

prometheus.yml
global:
  scrape_interval: 15s      # how often targets are scraped
  evaluation_interval: 15s  # how often rules are evaluated
  external_labels:
    cluster: 'production'
    region: 'cn-east'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

# Alert rule files (paths are relative to this config file)
rule_files:
  - 'alerts/*.yml'

# Scrape configuration
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter (system metrics)
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          instance: 'server-01'

  # .NET application
  - job_name: 'dotnet-app'
    static_configs:
      - targets: ['app:5000']
        labels:
          app: 'my-api'
          env: 'production'

  # Service discovery (Kubernetes)
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
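
The `keep` relabel action in the Kubernetes job can be sketched as follows. This is a simplified model, assuming the documented behaviour that source label values are joined with `;` and the regex must match the whole joined string; the `keep` function and the pod names are hypothetical, and Python's `re` stands in for Prometheus's RE2 engine.

```python
import re


def keep(targets, source_labels, regex, separator=";"):
    """Keep only targets whose joined source-label values fully match regex."""
    pattern = re.compile(regex)
    kept = []
    for labels in targets:
        joined = separator.join(labels.get(name, "") for name in source_labels)
        if pattern.fullmatch(joined):  # relabel regexes are anchored at both ends
            kept.append(labels)
    return kept


SCRAPE = "__meta_kubernetes_pod_annotation_prometheus_io_scrape"
pods = [
    {"__meta_kubernetes_pod_name": "api-1", SCRAPE: "true"},
    {"__meta_kubernetes_pod_name": "api-2", SCRAPE: "false"},
    {"__meta_kubernetes_pod_name": "batch-1"},  # annotation missing
]

kept = keep(pods, [SCRAPE], "true")
print([p["__meta_kubernetes_pod_name"] for p in kept])
```

Pods without the annotation produce an empty joined value, so they are dropped along with those explicitly set to `false`.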

Start the Services

# Start all services
docker-compose up -d

# Follow the logs
docker-compose logs -f prometheus
docker-compose logs -f grafana

# Access
# Prometheus: http://localhost:9090
# Grafana: http://localhost:3000 (admin/admin123)

.NET Application Integration

1. Install the NuGet package

dotnet add package prometheus-net.AspNetCore

2. Configure the application

Program.cs
using Prometheus;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddControllers();

var app = builder.Build();

// Expose the metrics endpoint (default path: /metrics)
app.UseMetricServer();

// Record HTTP request metrics automatically.
// This complements the endpoint above; it is not an alternative to it.
app.UseHttpMetrics();

app.MapControllers();

app.Run();

3. Custom metrics

using Prometheus;

public class OrderService
{
    // Counter: only ever increases
    private static readonly Counter OrdersCreated = Metrics
        .CreateCounter("orders_created_total", "Total number of orders created",
            new CounterConfiguration
            {
                LabelNames = new[] { "status", "payment_method" }
            });

    // Gauge: can go up and down
    private static readonly Gauge OrdersInProgress = Metrics
        .CreateGauge("orders_in_progress", "Number of orders currently being processed");

    // Histogram: distribution of observed values across buckets
    private static readonly Histogram OrderProcessingDuration = Metrics
        .CreateHistogram("order_processing_duration_seconds", "Order processing duration",
            new HistogramConfiguration
            {
                // Ten buckets with upper bounds 0.1s, 0.2s, ..., 1.0s
                Buckets = Histogram.LinearBuckets(start: 0.1, width: 0.1, count: 10)
            });

    // Summary: quantiles calculated on the client
    private static readonly Summary OrderAmount = Metrics
        .CreateSummary("order_amount_yuan", "Order amount",
            new SummaryConfiguration
            {
                Objectives = new[]
                {
                    new QuantileEpsilonPair(0.5, 0.05),   // median
                    new QuantileEpsilonPair(0.9, 0.01),   // 90th percentile
                    new QuantileEpsilonPair(0.99, 0.001)  // 99th percentile
                }
            });

    public async Task<Order> CreateOrderAsync(CreateOrderRequest request)
    {
        OrdersInProgress.Inc(); // one more order in progress

        using (OrderProcessingDuration.NewTimer()) // records elapsed time on dispose
        {
            try
            {
                var order = await ProcessOrderAsync(request);

                // Record metrics; label values follow the order of LabelNames
                OrdersCreated
                    .WithLabels(order.Status, order.PaymentMethod)
                    .Inc();

                // Observe takes a double; the cast covers a decimal TotalAmount
                OrderAmount.Observe((double)order.TotalAmount);

                return order;
            }
            finally
            {
                OrdersInProgress.Dec(); // order no longer in progress
            }
        }
    }
}
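
To make the histogram configuration more concrete, here is a small language-neutral sketch of how observations land in cumulative buckets. It assumes bounds equivalent to `LinearBuckets(0.1, 0.1, 10)`; the helper names and the sample durations are invented for illustration and this is not prometheus-net code.

```python
import math


def linear_buckets(start, width, count):
    """Upper bounds start, start + width, ... (count values in total)."""
    return [round(start + i * width, 10) for i in range(count)]


def cumulative_counts(bounds, observations):
    """Count observations <= each bound, then append the +Inf bucket."""
    counts = [sum(1 for v in observations if v <= b) for b in bounds]
    counts.append(len(observations))  # the +Inf bucket contains everything
    return counts


bounds = linear_buckets(0.1, 0.1, 10)
durations = [0.05, 0.12, 0.31, 0.95, 2.4]  # hypothetical processing times in seconds
counts = cumulative_counts(bounds, durations)

for bound, count in zip(bounds + [math.inf], counts):
    label = "+Inf" if math.isinf(bound) else str(bound)
    print(f'order_processing_duration_seconds_bucket{{le="{label}"}} {count}')
```

Because the buckets are cumulative, the 2.4s observation only appears in the `+Inf` bucket, which is why bucket bounds should cover the latencies you actually expect.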

PromQL Query Language

Basic queries

# Select a metric
http_requests_total

# Filter by labels
http_requests_total{method="GET", status="200"}

# Range selector (samples from the last 5 minutes)
http_requests_total[5m]

# Per-second rate
rate(http_requests_total[5m])

# Sum across all series
sum(rate(http_requests_total[5m]))

# Sum grouped by label
sum(rate(http_requests_total[5m])) by (method)
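
The grouping step in `sum(...) by (method)` can be pictured as below. This is a minimal sketch with invented rate values; `sum_by` is a hypothetical helper, not a Prometheus API.

```python
from collections import defaultdict


def sum_by(samples, label):
    """Group samples by one label and add up their values."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[labels.get(label, "")] += value
    return dict(totals)


# Hypothetical per-second rates, as rate(http_requests_total[5m]) might return.
rates = [
    ({"method": "GET", "instance": "app-1"}, 12.5),
    ({"method": "GET", "instance": "app-2"}, 7.5),
    ({"method": "POST", "instance": "app-1"}, 3.0),
]

print(sum_by(rates, "method"))
```

All labels other than the grouping label are dropped from the result, so the two `GET` series from different instances collapse into one.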

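Common functions

| Function | Description | Example |
|---|---|---|
| rate() | Average per-second rate of increase over the window | rate(requests_total[5m]) |
| irate() | Instantaneous rate based on the last two samples | irate(requests_total[5m]) |
| sum() | Sum | sum(memory_usage) by (instance) |
| avg() | Average | avg(cpu_usage) |
| max() / min() | Maximum / minimum | max(latency) |
| count() | Number of series | count(up == 1) |
| topk() | Top K series by value | topk(5, http_requests_total) |
| histogram_quantile() | Quantile estimated from histogram buckets | histogram_quantile(0.95, ...) |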
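
The difference between `rate()` and `irate()` is easier to see with numbers. The sketch below is a simplification: it handles counter resets but skips the extrapolation to window boundaries that Prometheus applies, and the function names and sample values are made up.

```python
def simple_rate(samples):
    """Average per-second increase over the whole window."""
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter restarted, so count up from zero.
        increase += cur - prev if cur >= prev else cur
    return increase / (samples[-1][0] - samples[0][0])


def simple_irate(samples):
    """Per-second increase between the last two samples only."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    delta = v1 - v0 if v1 >= v0 else v1
    return delta / (t1 - t0)


# (timestamp_seconds, counter_value) pairs scraped every 15s; the process restarts before t=45.
window = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 10.0), (60, 100.0)]

print(simple_rate(window))   # smoothed over 60 seconds
print(simple_irate(window))  # reflects only the final 15 seconds
```

The averaged value is better suited to alerting and slow-moving graphs, while the last-two-samples value reacts quickly to spikes.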

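Practical query examples

# Note: the HTTP metric and label names below are generic. prometheus-net's
# UseHttpMetrics() exports http_requests_received_total with a `code` label,
# so adjust the names to match what your application exposes.

# CPU usage (%) per instance
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage (%)
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# HTTP request rate (QPS)
sum(rate(http_requests_total[1m])) by (method, endpoint)

# 95th percentile latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Error rate (%)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100

# Success rate (%), counting every 2xx response
sum(rate(http_requests_total{status=~"2.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100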
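
The 95th percentile query relies on interpolation inside histogram buckets. The sketch below follows the general approach Prometheus documents (find the bucket containing the target rank, then interpolate linearly within it) for 0 < q <= 1; it is an approximation for illustration with invented bucket counts, not the server's implementation.

```python
import math


def histogram_quantile(q, buckets):
    """buckets: (upper_bound, cumulative_count) pairs, sorted, ending with +Inf."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The quantile falls beyond the last finite bound, so report that bound.
        return buckets[-2][0]
    lower = buckets[index - 1][0] if index > 0 else 0.0
    below = buckets[index - 1][1] if index > 0 else 0.0
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


buckets = [(0.1, 50.0), (0.5, 90.0), (1.0, 98.0), (math.inf, 100.0)]
p95 = histogram_quantile(0.95, buckets)
print(p95)
```

The estimate can never be more precise than the bucket layout: here the 95th percentile is only known to lie between 0.5s and 1.0s, and the interpolation picks a point within that range.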

Grafana Configuration

1. Add a data source

  1. Log in to Grafana (http://localhost:3000)
  2. Configuration → Data Sources
  3. Add data source → Prometheus
  4. URL: http://prometheus:9090
  5. Save & Test

2. Import a dashboard

Common dashboard IDs
{
  "Node Exporter": 1860,
  ".NET Core": 10915,
  "Docker": 893,
  "Kubernetes": 315
}

Import steps:

  1. Dashboard → Import
  2. Enter the dashboard ID
  3. Select the Prometheus data source
  4. Import

3. Create a custom dashboard

Example panel configuration
{
  "title": "API request rate (QPS)",
  "targets": [
    {
      "expr": "sum(rate(http_requests_total[1m])) by (endpoint)",
      "legendFormat": "{{endpoint}}"
    }
  ],
  "type": "timeseries"
}

Alerting Configuration

1. Prometheus alert rules

alerts/app.yml
groups:
  - name: application
    interval: 30s
    rules:
      # API error rate too high
      - alert: HighErrorRate
        # Grouped by instance so that $labels.instance is available below
        expr: |
          sum by (instance) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (instance) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API error rate too high: {{ $value | humanizePercentage }}"
          description: "Instance {{ $labels.instance }} has an error rate above 5%"

      # Response time too long
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "API response time too long"
          description: "P95 latency: {{ $value }}s"

      # Service unavailable
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service unavailable: {{ $labels.job }}"
          description: "Instance {{ $labels.instance }} has been down for more than 1 minute"

      # CPU usage too high
      - alert: HighCPUUsage
        expr: |
          100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage too high: {{ $value }}%"

      # Memory usage too high
      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage too high: {{ $value }}%"

      # Disk space low
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space low: {{ $value }}% remaining"
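
The effect of the `for` clause in these rules can be sketched as a small state machine: an alert whose expression becomes true is first pending, and only fires once the expression has stayed true for the whole duration. The function below is an illustrative model with invented inputs, not Prometheus code.

```python
def alert_states(condition_by_eval, eval_interval_s, for_s):
    """Return 'inactive', 'pending', or 'firing' for each rule evaluation."""
    active_since = None
    states = []
    for i, is_true in enumerate(condition_by_eval):
        now = i * eval_interval_s
        if not is_true:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


# Evaluated every 30s with for: 1m. One brief blip, then a sustained problem.
history = [True, False, True, True, True, True]
states = alert_states(history, eval_interval_s=30, for_s=60)
print(states)
```

The single true evaluation at the start never fires, which is the filtering of short-lived spikes that `for` is meant to provide.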

2. Alertmanager configuration

alertmanager.yml
global:
  resolve_timeout: 5m

# Alert routing
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    # Critical alerts are notified immediately
    - match:
        severity: critical
      receiver: 'critical'
      continue: true

    # Warning-level alerts
    - match:
        severity: warning
      receiver: 'warning'

# Receivers
receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://alertmanager-webhook:5001/webhook'

  - name: 'critical'
    # DingTalk notification.
    # The DingTalk robot API does not accept Alertmanager's webhook payload as-is;
    # in practice this URL usually points at an adapter such as prometheus-webhook-dingtalk.
    webhook_configs:
      - url: 'https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN'
        send_resolved: true

    # Email notification
    email_configs:
      - to: 'ops@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alertmanager@example.com'
        auth_password: 'password'

  - name: 'warning'
    webhook_configs:
      - url: 'https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN'

# Inhibition rules: a critical alert mutes the matching warning
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
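
How the route tree above picks a receiver can be approximated as follows. This is a simplified single-level model, assuming exact-match label comparison; the `route` function is hypothetical, and real Alertmanager also supports nested routes and regex matchers.

```python
def route(alert_labels, routes, default_receiver):
    """Return the receivers for an alert.

    Child routes are checked in order. The first match wins unless that
    route sets continue, in which case later siblings are checked too.
    Alerts that match no child route go to the default receiver.
    """
    receivers = []
    for r in routes:
        if all(alert_labels.get(k) == v for k, v in r["match"].items()):
            receivers.append(r["receiver"])
            if not r.get("continue", False):
                break
    return receivers or [default_receiver]


routes = [
    {"match": {"severity": "critical"}, "receiver": "critical", "continue": True},
    {"match": {"severity": "warning"}, "receiver": "warning"},
]

print(route({"alertname": "ServiceDown", "severity": "critical"}, routes, "default"))
print(route({"alertname": "HighLatency", "severity": "warning"}, routes, "default"))
print(route({"alertname": "Heartbeat", "severity": "info"}, routes, "default"))
```

In this configuration `continue: true` has no visible effect, because an alert cannot be both critical and warning; it matters only when later sibling routes could also match.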

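Best Practices

Monitoring best practices

1. Choose the right metric type

  • Counter: monotonically increasing values such as request and error counts
  • Gauge: values that go up and down, such as CPU and memory usage
  • Histogram: distributions such as request latency and response size
  • Summary: similar to Histogram, but quantiles are calculated on the client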

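2. Use labels sensibly

// ✅ Good: the set of label values is bounded (status, payment_method)
OrdersCreated.WithLabels("success", "alipay").Inc();

// ❌ Bad: an unbounded label such as an order ID creates a new time series per value
OrdersCreated.WithLabels("12345").Inc();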
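
The cost of an unbounded label can be estimated with simple arithmetic: the worst-case number of series for one metric is the product of the distinct values of each label. The label values below are invented for illustration.

```python
from math import prod


def series_count(label_values):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


bounded = {
    "status": ["success", "failed"],
    "payment_method": ["alipay", "wechat", "card"],
}
# Adding a per-order label multiplies the series count by the number of orders.
unbounded = {**bounded, "order_id": [str(i) for i in range(100_000)]}

print(series_count(bounded))
print(series_count(unbounded))
```

Six series are cheap to store and query; several hundred thousand for a single metric are not, which is why identifiers belong in logs or traces rather than in labels.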

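3. Follow a naming convention

  • Metric name: <namespace>_<name>_<unit>
  • For example: http_requests_total, http_request_duration_seconds (Prometheus recommends base units such as seconds and bytes, and ratios rather than percentages)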

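4. Choose a reasonable scrape interval

  • The default of 15s suits most scenarios
  • Critical metrics can be scraped as often as every 5s
  • Non-critical metrics can be relaxed to 1m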
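
The trade-off behind the scrape interval can be put into numbers. The sketch below uses the capacity formula from the Prometheus storage documentation (retention time × ingested samples per second × bytes per sample) with the commonly quoted 1 to 2 bytes per sample; the series count of 10,000 is an arbitrary assumption.

```python
SECONDS_PER_DAY = 86_400


def samples_per_day(scrape_interval_s):
    """Samples stored per series per day at a given scrape interval."""
    return SECONDS_PER_DAY // scrape_interval_s


def disk_bytes(series, scrape_interval_s, retention_days, bytes_per_sample=2.0):
    """Rough disk estimate: retention * ingested samples per second * bytes per sample."""
    ingested_per_second = series / scrape_interval_s
    return retention_days * SECONDS_PER_DAY * ingested_per_second * bytes_per_sample


for interval in (5, 15, 60):
    gib = disk_bytes(series=10_000, scrape_interval_s=interval, retention_days=30) / 2**30
    print(f"{interval:>2}s: {samples_per_day(interval):>6} samples/series/day, ~{gib:.1f} GiB for 30d")
```

Moving from 15s to 5s triples both the sample volume and the estimated disk usage, so shorter intervals are best reserved for the jobs that need them.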

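5. Alerting tips

  • Use the for clause to avoid firing on transient spikes
  • Set reasonable thresholds
  • Use multiple severity levels (warning/critical)
  • Configure alert grouping and inhibition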

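Summary

Key takeaways
  • Prometheus + Grafana is a mainstream, widely adopted monitoring stack
  • It supports multi-dimensional metric collection and flexible querying
  • Exporters make it possible to monitor a wide range of infrastructure
  • .NET applications integrate easily through prometheus-net
  • Flexible alert rules enable proactive monitoring

Related Resources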