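Prometheus + Grafana Monitoring
Architecture Overview
Monitoring architecture
Prometheus scrapes and stores metrics, evaluates alert rules, and hands firing alerts to Alertmanager. Grafana queries Prometheus to visualize the data and can also define alerts of its own.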
graph LR
    A[Application] -->|exposes metrics| B[Prometheus]
    C[Node Exporter] -->|system metrics| B
    D[Other exporters] -->|other metrics| B
    B -->|queries| E[Grafana]
    E -->|visualization| F[Dashboard]
    B -->|alert rules| G[Alertmanager]
    G -->|notifications| H[Email/DingTalk/Slack]
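To make the "exposes metrics" arrow concrete, here is a minimal sketch of what an application serves on its metrics endpoint: plain text, one sample per line, in the Prometheus text exposition format. The helper name `FormatSample` and the sample value are made up for illustration; a real application would let a client library such as prometheus-net produce this output.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Renders one sample as: metric_name{label="value",...} value
// FormatSample is a hypothetical helper, not part of any library.
string FormatSample(string name, double value, params (string Name, string Value)[] labels)
{
    // Backslash, double quote, and newline must be escaped inside label values.
    static string Escape(string v) =>
        v.Replace("\\", "\\\\").Replace("\"", "\\\"").Replace("\n", "\\n");

    var labelText = labels.Length == 0
        ? ""
        : "{" + string.Join(",", labels.Select(l => l.Name + "=\"" + Escape(l.Value) + "\"")) + "}";

    return name + labelText + " " + value.ToString(CultureInfo.InvariantCulture);
}

var line = FormatSample("http_requests_total", 1027, ("method", "GET"), ("status", "200"));
Console.WriteLine(line);
```

Prometheus pulls this text on every scrape, which is why the arrows in the diagram point from the targets to Prometheus even though Prometheus initiates the request.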
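About Prometheus
📊 Time series database - built specifically for time series data
🎯 Multi-dimensional data model - series are identified by labels, which allows flexible queries
💪 Powerful query language - PromQL offers a rich set of functions and operators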
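The multi-dimensional model can be pictured as follows: every time series is a set of label pairs (the metric name itself is stored as the `__name__` label), and a selector picks the series whose labels match. The sketch below is an in-memory illustration with made-up values and equality matchers only; it is not how Prometheus is implemented.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Builds the label set of one hypothetical http_requests_total series.
Dictionary<string, string> HttpRequests(string method, string status) =>
    new Dictionary<string, string>
    {
        ["__name__"] = "http_requests_total", ["method"] = method, ["status"] = status,
    };

var series = new List<(Dictionary<string, string> Labels, double Value)>
{
    (HttpRequests("GET", "200"), 1027),
    (HttpRequests("POST", "200"), 310),
    (HttpRequests("GET", "500"), 12),
};

// Keeps the series for which every matcher equals the series' label value.
List<(Dictionary<string, string> Labels, double Value)> Match(
    params (string Label, string Value)[] matchers) =>
    series.Where(s => matchers.All(m =>
        s.Labels.TryGetValue(m.Label, out var actual) && actual == m.Value)).ToList();

// Mirrors the selector http_requests_total{method="GET"}.
var getRequests = Match(("__name__", "http_requests_total"), ("method", "GET"));
Console.WriteLine(getRequests.Count + " series, sum = " + getRequests.Sum(s => s.Value));
```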
Quick Deployment
One-command deployment with Docker Compose
docker-compose.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      # Rule files matched by rule_files ('alerts/*.yml') in prometheus.yml
      - ./alerts:/etc/prometheus/alerts
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=30d'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      # Demo credentials; change them before exposing Grafana to a network
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
    depends_on:
      - prometheus
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    ports:
      - "9100:9100"
    command:
      - '--path.rootfs=/host'
    volumes:
      - '/:/host:ro,rslave'
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
Prometheus Configuration
prometheus.yml
global:
  scrape_interval: 15s      # how often targets are scraped
  evaluation_interval: 15s  # how often rules are evaluated
  external_labels:
    cluster: 'production'
    region: 'cn-east'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

# Alert rule files (path is relative to this file)
rule_files:
  - 'alerts/*.yml'

# Scrape configuration
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter (system metrics)
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          instance: 'server-01'

  # .NET application
  - job_name: 'dotnet-app'
    static_configs:
      - targets: ['app:5000']
        labels:
          app: 'my-api'
          env: 'production'

  # Service discovery (Kubernetes)
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
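The `keep` relabel action in the `kubernetes-pods` job can be read as a filter: a discovered target survives only if the source label value matches the regex, and Prometheus anchors that regex at both ends. The sketch below mimics this with made-up pod names and annotation values; a pod without the annotation is modeled as an empty string.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical discovered pods with the value of their prometheus.io/scrape annotation.
var pods = new (string Name, string ScrapeAnnotation)[]
{
    ("api-7d9f", "true"),
    ("worker-5c2a", "false"),
    ("cron-1b3e", ""),            // annotation missing
    ("cache-9a8b", "true-ish"),   // contains "true" but is not a full match
};

// Relabel regexes are anchored, so "true" has to match the whole value.
var keepRegex = new Regex("^(?:true)$");

var kept = pods
    .Where(p => keepRegex.IsMatch(p.ScrapeAnnotation))
    .Select(p => p.Name)
    .ToList();

Console.WriteLine(string.Join(", ", kept));
```

Only the first pod remains a scrape target; the rest are dropped before any request is made.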
Starting the Services
# Start all services
docker-compose up -d
# Follow the logs
docker-compose logs -f prometheus
docker-compose logs -f grafana
# Access
# Prometheus: http://localhost:9090
# Grafana: http://localhost:3000 (admin/admin123)
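.NET Application Integration
1. Install the NuGet package
dotnet add package prometheus-net.AspNetCore
2. Configure the application
Minimal configuration
Program.cs
using Prometheus;

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddControllers();

var app = builder.Build();

// Expose the metrics endpoint (default path: /metrics)
app.UseMetricServer();

// Record HTTP request metrics automatically
app.UseHttpMetrics();

app.MapControllers();
app.Run();
Advanced configuration
Program.cs
using Prometheus;

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddControllers();
// Required by MapHealthChecks below
builder.Services.AddHealthChecks();

var app = builder.Build();

// Optional: serve metrics on a dedicated port, separate from application traffic.
// Choose a port that does not clash with other services on the same host.
app.UseMetricServer(port: 9090, url: "/metrics");

// HTTP metrics middleware
app.UseHttpMetrics(options =>
{
    // Custom label
    options.AddCustomLabel("host", context => context.Request.Host.Host);
    // Enable the request count and request duration metrics
    options.RequestCount.Enabled = true;
    options.RequestDuration.Enabled = true;
});

// Health check
app.MapHealthChecks("/health");

// Metrics endpoint on the application's own port
app.MapMetrics("/metrics");

app.MapControllers();
app.Run();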
3. Custom metrics
using Prometheus;

public class OrderService
{
    // Counter: only ever increases
    private static readonly Counter OrdersCreated = Metrics
        .CreateCounter("orders_created_total", "Total number of orders created",
            new CounterConfiguration
            {
                LabelNames = new[] { "status", "payment_method" }
            });

    // Gauge: can go up and down
    private static readonly Gauge OrdersInProgress = Metrics
        .CreateGauge("orders_in_progress", "Number of orders currently being processed");

    // Histogram: distribution of observed values
    private static readonly Histogram OrderProcessingDuration = Metrics
        .CreateHistogram("order_processing_duration_seconds", "Order processing duration",
            new HistogramConfiguration
            {
                Buckets = Histogram.LinearBuckets(start: 0.1, width: 0.1, count: 10)
            });

    // Summary: quantiles computed on the client
    private static readonly Summary OrderAmount = Metrics
        .CreateSummary("order_amount_yuan", "Order amount",
            new SummaryConfiguration
            {
                Objectives = new[]
                {
                    new QuantileEpsilonPair(0.5, 0.05),   // median
                    new QuantileEpsilonPair(0.9, 0.01),   // 90th percentile
                    new QuantileEpsilonPair(0.99, 0.001)  // 99th percentile
                }
            });

    public async Task<Order> CreateOrderAsync(CreateOrderRequest request)
    {
        OrdersInProgress.Inc(); // one more order in progress

        using (OrderProcessingDuration.NewTimer()) // records the elapsed time on dispose
        {
            try
            {
                var order = await ProcessOrderAsync(request);

                // Record the outcome
                OrdersCreated
                    .WithLabels(order.Status, order.PaymentMethod)
                    .Inc();
                // Observe takes a double; the cast covers a decimal TotalAmount
                OrderAmount.Observe((double)order.TotalAmount);

                return order;
            }
            finally
            {
                OrdersInProgress.Dec(); // the order is no longer in progress
            }
        }
    }
}
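The bucket layout chosen for `order_processing_duration_seconds` decides what the histogram can later resolve. As a rough illustration, the sketch below rebuilds the same linear bounds with plain arithmetic (it does not call prometheus-net) and counts some made-up observations the way Prometheus buckets do, cumulatively by upper bound.

```csharp
using System;
using System.Linq;

// Same shape as LinearBuckets(start: 0.1, width: 0.1, count: 10):
// upper bounds 0.1, 0.2, ..., 1.0, up to floating-point rounding.
double[] LinearBounds(double start, double width, int count) =>
    Enumerable.Range(0, count).Select(i => start + i * width).ToArray();

// Buckets are cumulative: each one counts every observation <= its upper bound ("le").
long[] CumulativeCounts(double[] bounds, double[] observations) =>
    bounds.Select(le => observations.LongCount(o => o <= le)).ToArray();

var upperBounds = LinearBounds(0.1, 0.1, 10);
var observed = new[] { 0.05, 0.25, 0.25, 0.95, 2.0 };   // made-up durations in seconds
var counts = CumulativeCounts(upperBounds, observed);

for (var i = 0; i < upperBounds.Length; i++)
    Console.WriteLine("le=" + Math.Round(upperBounds[i], 1) + "  count=" + counts[i]);
// The implicit +Inf bucket always holds every observation.
Console.WriteLine("le=+Inf  count=" + observed.Length);
```

The 2.0 s observation only shows up in the `+Inf` bucket, so with these bounds any quantile above one second cannot be estimated more precisely than "over 1 s". If slower orders are expected, wider or exponential buckets are worth considering.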
PromQL Query Language
Basic queries
# Select a metric
http_requests_total
# Filter by labels
http_requests_total{method="GET", status="200"}
# Range selector (samples from the last 5 minutes)
http_requests_total[5m]
# Per-second rate
rate(http_requests_total[5m])
# Sum across all series
sum(rate(http_requests_total[5m]))
# Sum grouped by label
sum(rate(http_requests_total[5m])) by (method)
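To show roughly what `rate()` computes over a range such as `[5m]`, here is a simplified sketch: total counter growth across the window divided by the elapsed seconds, with a drop in value treated as a counter reset. The real function also extrapolates to the window boundaries, which this version deliberately leaves out; the sample data is made up.

```csharp
using System;
using System.Collections.Generic;

// Simplified per-second rate over counter samples (no boundary extrapolation).
double SimpleRate(IReadOnlyList<(double TimeSec, double Value)> samples)
{
    if (samples.Count < 2) return double.NaN;   // at least two samples are needed

    double increase = 0;
    for (var i = 1; i < samples.Count; i++)
    {
        var delta = samples[i].Value - samples[i - 1].Value;
        // A drop means the counter was reset (for example after a restart),
        // so the new value is counted as growth from zero.
        increase += delta >= 0 ? delta : samples[i].Value;
    }
    return increase / (samples[samples.Count - 1].TimeSec - samples[0].TimeSec);
}

var window = new List<(double TimeSec, double Value)>
{
    (0, 100), (15, 130), (30, 160), (45, 10), (60, 40),   // reset between t=30 and t=45
};

Console.WriteLine(SimpleRate(window));   // 100 requests over 60 s
```

This is also why `rate()` should be applied to counters only: on a gauge, every decrease would be misread as a reset.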
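Common functions
| Function | Description | Example |
|---|---|---|
| rate() | Average per-second rate of increase over the range | rate(requests_total[5m]) |
| irate() | Instantaneous rate based on the last two samples | irate(requests_total[5m]) |
| sum() | Sum | sum(memory_usage) by (instance) |
| avg() | Average | avg(cpu_usage) |
| max() / min() | Maximum / minimum | max(latency) |
| count() | Number of series | count(up == 1) |
| topk() | Top K series by value | topk(5, http_requests_total) |
| histogram_quantile() | Quantile estimated from histogram buckets | histogram_quantile(0.95, ...) |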
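Practical query examples
# Note: metric and label names depend on the instrumentation. The prometheus-net
# HTTP middleware, for example, exposes http_requests_received_total with a "code" label.
# CPU usage (%)
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage (%)
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# HTTP request rate (QPS)
sum(rate(http_requests_total[1m])) by (method, endpoint)
# 95th percentile latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Error rate (%)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100
# Success rate (%), counting every 2xx response
sum(rate(http_requests_total{status=~"2.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100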
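The 95th percentile query relies on `histogram_quantile()`, which estimates a quantile from cumulative bucket counts rather than from raw observations. A minimal sketch of that estimation follows, assuming buckets sorted by upper bound, a final `+Inf` bucket, and a lowest bucket that starts at 0; the bucket counts are made up.

```csharp
using System;
using System.Collections.Generic;

// Finds the bucket that contains the target rank, then interpolates linearly inside it.
double HistogramQuantile(double q, IReadOnlyList<(double Le, double CumulativeCount)> buckets)
{
    var total = buckets[buckets.Count - 1].CumulativeCount;
    if (total == 0) return double.NaN;
    var rank = q * total;

    for (var i = 0; i < buckets.Count; i++)
    {
        if (buckets[i].CumulativeCount < rank) continue;

        // The rank falls into the +Inf bucket: report the highest finite bound.
        if (double.IsPositiveInfinity(buckets[i].Le))
            return buckets[i - 1].Le;

        var lower = i == 0 ? 0.0 : buckets[i - 1].Le;
        var below = i == 0 ? 0.0 : buckets[i - 1].CumulativeCount;
        var inBucket = buckets[i].CumulativeCount - below;
        return lower + (buckets[i].Le - lower) * ((rank - below) / inBucket);
    }
    return double.NaN;
}

var latencyBuckets = new List<(double Le, double CumulativeCount)>
{
    (0.1, 10), (0.5, 60), (1.0, 90), (double.PositiveInfinity, 100),
};

var p50 = HistogramQuantile(0.50, latencyBuckets);   // interpolated inside (0.1, 0.5]
var p95 = HistogramQuantile(0.95, latencyBuckets);   // lands in +Inf, capped at 1.0
Console.WriteLine("p50 = " + Math.Round(p50, 2) + ", p95 = " + p95);
```

The result is an estimate whose accuracy depends on the bucket bounds, so it pays to place bucket boundaries near the latency thresholds that matter for alerting.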
Grafana Configuration
1. Add a data source
- Log in to Grafana (http://localhost:3000)
- Configuration → Data Sources
- Add data source → Prometheus
- URL: http://prometheus:9090
- Save & Test
2. Import dashboards
Common dashboard IDs
{
  "Node Exporter": 1860,
  ".NET Core": 10915,
  "Docker": 893,
  "Kubernetes": 315
}
Import steps:
- Dashboard → Import
- Enter the dashboard ID
- Select the Prometheus data source
- Import
3. Create a custom dashboard
Example panel configuration
{
  "title": "API request rate (QPS)",
  "targets": [
    {
      "expr": "sum(rate(http_requests_total[1m])) by (endpoint)",
      "legendFormat": "{{endpoint}}"
    }
  ],
  "type": "timeseries"
}
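The `expr` in a panel is ordinary PromQL, so the same expression can be checked outside Grafana through the Prometheus HTTP API (`GET /api/v1/query`). The sketch below only builds the request URL; `BuildInstantQueryUrl` is a made-up helper, and the base address is an assumption that depends on where the request originates (`http://localhost:9090` from the host, `http://prometheus:9090` inside the Compose network).

```csharp
using System;

// Builds an instant-query URL; the PromQL expression must be URL-encoded.
string BuildInstantQueryUrl(string baseUrl, string promQl) =>
    baseUrl.TrimEnd('/') + "/api/v1/query?query=" + Uri.EscapeDataString(promQl);

var expr = "sum(rate(http_requests_total[1m])) by (endpoint)";
var url = BuildInstantQueryUrl("http://localhost:9090/", expr);
Console.WriteLine(url);
```

Fetching that URL with any HTTP client returns JSON, which is a quick way to confirm that a query returns data before building a panel around it.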
Alerting Configuration
1. Prometheus alert rules
alerts/app.yml
groups:
  - name: application
    interval: 30s
    rules:
      # High API error rate (per instance, so that $labels.instance is available)
      - alert: HighErrorRate
        expr: |
          sum by (instance) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (instance) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High API error rate: {{ $value | humanizePercentage }}"
          description: "Error rate on instance {{ $labels.instance }} is above 5%"

      # High response time
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High API response time"
          description: "P95 latency: {{ $value }}s"

      # Service unavailable
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service unavailable: {{ $labels.job }}"
          description: "Instance {{ $labels.instance }} has been down for more than 1 minute"

      # High CPU usage
      - alert: HighCPUUsage
        expr: |
          100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage: {{ $value }}%"

      # High memory usage
      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage: {{ $value }}%"

      # Low disk space
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space: {{ $value }}% available"
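The `for` field in these rules delays an alert: once the expression becomes true the alert is pending, and it fires only after the expression has stayed true for the whole duration. A small simulation of that state machine is sketched below, using the 30 s group interval and the `for: 1m` of `ServiceDown`; the sequence of evaluation results is made up.

```csharp
using System;
using System.Collections.Generic;

// Returns the alert state after each rule evaluation.
// Any false result resets the alert to Inactive.
List<string> EvaluateStates(IReadOnlyList<bool> outcomes, int intervalSec, int forSec)
{
    var states = new List<string>();
    int? activeSince = null;   // time at which the expression first became true

    for (var i = 0; i < outcomes.Count; i++)
    {
        var now = i * intervalSec;
        if (!outcomes[i])
        {
            activeSince = null;
            states.Add("Inactive");
            continue;
        }
        activeSince ??= now;
        states.Add(now - activeSince.Value >= forSec ? "Firing" : "Pending");
    }
    return states;
}

// One short blip (first evaluation) followed by a sustained outage.
var evaluations = new[] { true, false, true, true, true, true };
var timeline = EvaluateStates(evaluations, intervalSec: 30, forSec: 60);
Console.WriteLine(string.Join(" -> ", timeline));
```

The single-evaluation blip never leaves the pending state, which is the flapping protection that `for` is meant to provide. The trade-off is that real outages are reported at least `for` later.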
2. Alertmanager configuration
alertmanager.yml
global:
  resolve_timeout: 5m

# Alert routing
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    # Critical alerts; continue: true lets later routes be evaluated as well
    - match:
        severity: critical
      receiver: 'critical'
      continue: true
    # Warning-level alerts
    - match:
        severity: warning
      receiver: 'warning'

# Receivers
receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://alertmanager-webhook:5001/webhook'

  - name: 'critical'
    # DingTalk notification. The DingTalk robot API does not accept Alertmanager's
    # webhook payload as-is, so in practice this URL usually points at an adapter
    # service (for example prometheus-webhook-dingtalk) that forwards to the robot.
    webhook_configs:
      - url: 'https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN'
        send_resolved: true
    # Email notification
    email_configs:
      - to: 'ops@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alertmanager@example.com'
        auth_password: 'password'

  - name: 'warning'
    webhook_configs:
      - url: 'https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN'

# Inhibition rules: mute a warning while a matching critical alert is firing
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
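The inhibition rule above can be read as: a `warning` alert is muted while some `critical` alert is firing with the same `alertname` and `instance`. The sketch below applies that logic to a few made-up alerts; it illustrates the rule's effect and is not Alertmanager's implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Builds the label set of one hypothetical firing alert.
Dictionary<string, string> Alert(string name, string instance, string severity) =>
    new Dictionary<string, string>
    {
        ["alertname"] = name, ["instance"] = instance, ["severity"] = severity,
    };

var firing = new List<Dictionary<string, string>>
{
    Alert("HighCPUUsage", "server-01", "critical"),
    Alert("HighCPUUsage", "server-01", "warning"),
    Alert("HighCPUUsage", "server-02", "warning"),
};

var equalLabels = new[] { "alertname", "instance" };

// A warning is inhibited when a critical alert agrees on every "equal" label.
// A missing label is treated like an empty value.
bool IsInhibited(Dictionary<string, string> target) =>
    target["severity"] == "warning" &&
    firing.Any(source =>
        source["severity"] == "critical" &&
        equalLabels.All(l =>
            source.GetValueOrDefault(l, "") == target.GetValueOrDefault(l, "")));

var notified = firing.Where(candidate => !IsInhibited(candidate)).ToList();
foreach (var alert in notified)
    Console.WriteLine(alert["severity"] + " " + alert["alertname"] + " on " + alert["instance"]);
```

Only the critical alert for `server-01` and the warning for `server-02` produce notifications; the warning for `server-01` is suppressed because the critical alert already covers it.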
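Best Practices
Monitoring best practices
1. Choose the right metric type
- Counter: monotonically increasing values such as request and error counts
- Gauge: values that go up and down, such as CPU and memory usage
- Histogram: distributions such as request latency and response size
- Summary: similar to Histogram, but quantiles are computed on the client
2. Keep label values bounded
// ✅ Good: a bounded set of label values (status, payment_method)
OrdersCreated.WithLabels("success", "alipay").Inc();
// ❌ Bad: unbounded label values, which create a huge number of time series
// (OrdersByOrderId is a hypothetical counter with a single order_id label)
OrdersByOrderId.WithLabels("12345").Inc();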
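A back-of-the-envelope calculation shows why unbounded labels hurt: the worst-case number of series for one metric is the product of the distinct values of each label. The value counts below (3 statuses, 4 payment methods, one million order IDs) are assumptions chosen for illustration.

```csharp
using System;
using System.Linq;

// Worst-case series count: product of the distinct values per label.
long MaxSeries(params long[] distinctValuesPerLabel) =>
    distinctValuesPerLabel.Aggregate(1L, (acc, n) => acc * n);

var bounded = MaxSeries(3, 4);                 // status x payment_method
var unbounded = MaxSeries(3, 4, 1_000_000);    // ... x order ID

Console.WriteLine(bounded + " series vs " + unbounded + " series");
```

Each series costs memory and index space in Prometheus, so identifiers such as order or user IDs usually belong in logs or traces rather than in labels.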
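3. Follow naming conventions
- Metric names: <namespace>_<name>_<unit>
- Examples: http_requests_total, cpu_usage_percent
4. Set a reasonable scrape interval
- The default of 15s suits most scenarios
- Critical metrics can be scraped as often as every 5s
- Non-critical metrics can be relaxed to 1m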
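The cost of a shorter scrape interval is easy to estimate: each series stores one sample per scrape. The sketch below computes samples per series per day for the three intervals mentioned above and scales them to the 30-day retention set in the Compose file; actual disk usage also depends on compression, which is not modeled here.

```csharp
using System;

// Samples stored per series per day for a given scrape interval.
long SamplesPerDay(TimeSpan scrapeInterval) =>
    (long)(TimeSpan.FromDays(1).TotalSeconds / scrapeInterval.TotalSeconds);

foreach (var seconds in new[] { 5.0, 15.0, 60.0 })
{
    var perDay = SamplesPerDay(TimeSpan.FromSeconds(seconds));
    Console.WriteLine(seconds + "s: " + perDay + " samples/day, " + perDay * 30 + " over 30 days");
}
```

Going from 15s to 5s triples the stored samples for every affected series, which is a reason to shorten the interval per job rather than globally.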
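5. Alerting tips
- Use the for clause to ignore transient spikes
- Set sensible thresholds
- Use multiple severity levels (warning/critical)
- Configure alert grouping and inhibition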
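Summary
Key takeaways
- Prometheus + Grafana is a widely adopted monitoring stack
- It supports multi-dimensional metric collection and flexible queries
- Exporters make it possible to monitor many kinds of infrastructure
- .NET applications integrate easily through prometheus-net
- Well-tuned alert rules turn monitoring from reactive into proactive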