Kafka: Monitoring with Prometheus + Grafana

1. Installing and Configuring Kafka Exporter

1.1 Download Kafka Exporter

wget https://github.com/danielqsj/kafka_exporter/releases/download/v1.3.1/kafka_exporter-1.3.1.linux-amd64.tar.gz
tar zxvf kafka_exporter-1.3.1.linux-amd64.tar.gz
cd kafka_exporter-1.3.1.linux-amd64

1.2 Start Kafka Exporter

# Single-broker mode
nohup ./kafka_exporter --kafka.server=192.168.154.34:9092 &

# Cluster mode (several brokers): repeat --kafka.server once per broker
nohup ./kafka_exporter --kafka.server=192.168.154.34:9092 --kafka.server=192.168.154.35:9092 --kafka.server=192.168.154.36:9092 &
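
Once the exporter is running, one quick way to confirm that it reached the cluster is to read kafka_brokers from its /metrics endpoint. The helper below is a minimal sketch: the function name brokers_from_metrics is hypothetical, and the usage line assumes the exporter listens on its default port 9308.

```shell
# Print the value of kafka_brokers from Prometheus text-format metrics on stdin.
# Returns non-zero if the metric is absent (wrong endpoint, or the exporter
# has not collected cluster metadata).
brokers_from_metrics() {
    awk '$1 == "kafka_brokers" { print $2; found = 1 } END { exit !found }'
}

# Usage (assumes the default listen address :9308):
#   curl -s http://192.168.154.34:9308/metrics | brokers_from_metrics
```

A printed value equal to the number of brokers you expect suggests the exporter can see the whole cluster.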

1.3 Common Startup Flags

Flag                    Description                   Example
--kafka.server          Kafka broker address          --kafka.server=host:port
--web.listen-address    Listen address and port       --web.listen-address=":9308"
--log.level             Log level                     --log.level=debug
--topic.filter          Regex selecting topics        --topic.filter=".*"
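
kafka_exporter accepts --kafka.server more than once, so a broker list can be expanded into one flag per address. The sketch below shows one way to do that; kafka_server_flags is a hypothetical helper name, not part of the exporter.

```shell
# Print one --kafka.server flag per broker address passed as an argument.
kafka_server_flags() {
    sep=''
    for broker in "$@"; do
        printf '%s--kafka.server=%s' "$sep" "$broker"
        sep=' '
    done
    printf '\n'
}

# Usage (the command substitution is left unquoted on purpose so that the
# flags are split into separate arguments):
#   ./kafka_exporter $(kafka_server_flags 192.168.154.34:9092 192.168.154.35:9092) \
#       --web.listen-address=":9308"
```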

2. Prometheus Configuration

2.1 Edit prometheus.yml

scrape_configs:
  - job_name: 'kafka_exporter'
    static_configs:
      - targets: ['192.168.154.34:9308']
        labels:
          cluster: 'production_kafka'
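
Because YAML indentation errors are easy to introduce here, it can help to validate the file with promtool (shipped with Prometheus) before reloading. The wrapper below is a sketch under assumptions: the function name check_prometheus_config and its exit codes 2 and 127 are choices made for this example.

```shell
# Validate a Prometheus config file before reloading.
# Exit codes: 2 = file missing, 127 = promtool not installed,
# otherwise whatever promtool itself returns.
check_prometheus_config() {
    config_file="$1"
    if [ ! -f "$config_file" ]; then
        echo "config file not found: $config_file" >&2
        return 2
    fi
    if ! command -v promtool >/dev/null 2>&1; then
        echo "promtool not found in PATH" >&2
        return 127
    fi
    promtool check config "$config_file"
}

# Usage (the path is an assumption; use the location of your own config):
#   check_prometheus_config /etc/prometheus/prometheus.yml
```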

2.2 Reload the Configuration

# Send SIGHUP to the Prometheus process
kill -HUP <prometheus_pid>
# Or use the HTTP API (requires Prometheus to be started with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
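
The HTTP reload fails silently when curl's output is not inspected, so a small status check may be useful. The sketch below maps the response code to a hint; explain_reload_status is a hypothetical name, and the meanings given for 403 and 500 are assumptions based on Prometheus 2.x behaviour that you should confirm against your version.

```shell
# Map the HTTP status returned by POST /-/reload to a short explanation.
explain_reload_status() {
    case "$1" in
        200) echo "reloaded" ;;
        403) echo "lifecycle API disabled: start Prometheus with --web.enable-lifecycle" ;;
        500) echo "reload failed: check the Prometheus log for config errors" ;;
        *)   echo "unexpected status: $1" ;;
    esac
}

# Usage:
#   code=$(curl -s -o /dev/null -w '%{http_code}' -X POST http://localhost:9090/-/reload)
#   explain_reload_status "$code"
```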

3. Grafana Dashboard Configuration

3.1 Import the Kafka Monitoring Dashboard

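  1. Open the Grafana console
  2. Navigate to Create > Import
  3. Enter dashboard ID 7589 (Kafka Exporter Overview, a community dashboard built for kafka_exporter)
  4. Select the matching Prometheus data source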

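3.2 Key Metrics

Cluster-level metrics

Metric                     Description
kafka_brokers              Number of brokers in the cluster
kafka_topic_partitions     Number of partitions per topic
kafka_consumergroup_lag    Approximate consumer group lag per topic partition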

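Broker-level metrics

Metric                          Description
kafka_broker_info               Basic broker information (ID and address); availability depends on the exporter version
kafka_topic_partition_leader    Broker ID of each partition's leader; count_values("broker", kafka_topic_partition_leader) gives the number of partitions led per broker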

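Topic-level metrics

Metric                                  Description
kafka_topic_partitions                  Number of partitions of the topic
kafka_topic_partition_oldest_offset     Oldest available offset of a topic partition
kafka_topic_partition_current_offset    Current (latest) offset of a topic partition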
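
Metric names differ between exporter versions, so it is worth listing what the running exporter actually exposes before building dashboards or alerts on them. The helper below is a minimal sketch; list_kafka_metrics is a hypothetical name, and the usage line assumes the default port 9308.

```shell
# List the distinct kafka_* metric names found in Prometheus text-format
# input on stdin (labels and values are stripped).
list_kafka_metrics() {
    grep '^kafka_' | sed 's/[{ ].*//' | sort -u
}

# Usage:
#   curl -s http://192.168.154.34:9308/metrics | list_kafka_metrics
```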

4. System Service Configuration (Optional)

4.1 Create a systemd Service

# Assumes the binary has been copied to /opt/kafka_exporter/ and that a kafka user and group exist
cat > /etc/systemd/system/kafka_exporter.service <<'EOF'
[Unit]
Description=Kafka Exporter
After=network.target

[Service]
User=kafka
Group=kafka
ExecStart=/opt/kafka_exporter/kafka_exporter \
    --kafka.server=192.168.154.34:9092 \
    --web.listen-address=":9308"
Restart=always

[Install]
WantedBy=multi-user.target
EOF

4.2 Manage the Service

systemctl daemon-reload
systemctl start kafka_exporter
systemctl enable kafka_exporter

5. Alerting Rules

5.1 Example Prometheus Alerting Rules

groups:
  - name: kafka.rules
    rules:
      - alert: KafkaBrokerDown
        # 3 is the expected broker count for this cluster; adjust as needed.
        # kafka_brokers keeps the instance and cluster labels used below.
        expr: kafka_brokers < 3
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Kafka broker down (instance {{ $labels.instance }})"
          description: "Kafka cluster {{ $labels.cluster }} has only {{ $value }} healthy brokers"

      - alert: HighConsumerLag
        expr: kafka_consumergroup_lag > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High consumer lag (instance {{ $labels.instance }})"
          description: "Consumer group {{ $labels.consumergroup }} has high lag of {{ $value }} messages"
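
Before enabling the lag alert, it can be useful to see which series would currently exceed the threshold. The sketch below filters the exporter's raw output the same way the HighConsumerLag expression does, but only for the current instant (it does not model the 10m "for" clause); lag_over_threshold is a hypothetical helper name.

```shell
# Print kafka_consumergroup_lag series whose value exceeds the threshold
# ($1, default 10000). Reads Prometheus text-format metrics on stdin.
lag_over_threshold() {
    awk -v limit="${1:-10000}" '
        /^kafka_consumergroup_lag[{]/ && $NF + 0 > limit + 0 { print }
    '
}

# Usage (assumes the exporter's default port 9308):
#   curl -s http://192.168.154.34:9308/metrics | lag_over_threshold 10000
```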

6. Advanced Configuration

6.1 Monitoring Multiple Kafka Clusters

scrape_configs:
  - job_name: 'kafka_prod'
    static_configs:
      - targets: ['192.168.1.10:9308']
        labels:
          cluster: 'kafka_production'

  - job_name: 'kafka_dev'
    static_configs:
      - targets: ['192.168.2.10:9308']
        labels:
          cluster: 'kafka_development'

6.2 Using Service Discovery

scrape_configs:
  - job_name: 'kafka_exporters'
    consul_sd_configs:
      - server: 'localhost:8500'
        services: ['kafka_exporter']
    relabel_configs:
      - source_labels: [__meta_consul_tags]
        regex: '.*,kafka,.*'
        action: keep
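
For the keep rule above to match, each exporter has to be registered in Consul under the service name kafka_exporter with the tag kafka; Prometheus joins a service's tags with surrounding commas (for example ",kafka,"), which is what the regex relies on. The sketch below prints a matching service definition; consul_service_json is a hypothetical helper, and the usage line assumes a local Consul agent on port 8500.

```shell
# Print a Consul service definition for one exporter instance.
# $1 = exporter address, $2 = exporter port (defaults to 9308)
consul_service_json() {
    printf '{"Name": "kafka_exporter", "Tags": ["kafka"], "Address": "%s", "Port": %s}\n' \
        "$1" "${2:-9308}"
}

# Usage (registers the service through the local agent's HTTP API):
#   consul_service_json 192.168.154.34 | \
#       curl -s -X PUT --data @- http://localhost:8500/v1/agent/service/register
```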

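7. Troubleshooting

  1. Cannot connect to the Kafka cluster

    • Check that the Kafka broker address is correct
    • Verify network connectivity
    • Check the Kafka ACL configuration
  2. Prometheus cannot scrape metrics

    • Check that kafka_exporter is running
    • Verify firewall settings
    • Check the Prometheus configuration syntax
  3. Grafana shows no data

    • Confirm that the data source is configured correctly
    • Check the time range setting
    • Verify that the metric names are correct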
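
The connectivity and firewall checks above can be sketched as a small TCP probe. This is only an illustration under assumptions: check_tcp is a hypothetical helper, it relies on bash's /dev/tcp pseudo-device and the coreutils timeout command, and a successful connection shows that the port is reachable, not that the service behind it is healthy.

```shell
# Return 0 if a TCP connection to host:port succeeds within 3 seconds.
# Uses bash's /dev/tcp pseudo-device, so bash must be installed.
check_tcp() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Usage (run from the exporter host for 9092, from the Prometheus host for 9308):
#   check_tcp 192.168.154.34 9092 && echo "broker port reachable"
#   check_tcp 192.168.154.34 9308 && echo "exporter port reachable"
```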