Configuring Monitoring Alarms (Optional for Middleware)
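This section describes the alarm policies for selected monitoring metrics and how to configure them. In production, you are advised to configure alarm rules for the monitoring metrics according to the alarm policies below.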
Monitoring Metrics
Table 1 EMS metric alarms

| Monitoring item | Metrics involved | Collection period | Alarm name | Severity | Threshold | Trigger condition and PromQL |
|---|---|---|---|---|---|---|
| Failed read requests per node | EMS_GET_KV_STAT | 30s | ems_server_io_read_status_abnormal | Major | 5% | Triggered when errors occur within 3 consecutive periods; cleared when no errors are reported within 3 consecutive periods.<br>PromQL:<br>`avg(EMS_GET_KV_STAT{namespace="ems", container="ems-server", item="BackBad"}) by (pid) > 0` |
| Failed write requests per node | EMS_PUT_KV_STAT | 30s | ems_server_io_write_status_abnormal | Major | 5% | Triggered when errors occur within 3 consecutive periods; cleared when no errors are reported within 3 consecutive periods.<br>PromQL:<br>`avg(EMS_PUT_KV_STAT{namespace="ems", container="ems-server", item="BackBad"}) by (pid) > 0` |
| Request success rate | EMS_PUT_KV_STAT, EMS_GET_KV_STAT | 30s | ems_server_io_success_rate_low | Critical | 0.95 | Triggered when the success rate stays below 0.95 for 5 consecutive minutes (with more than 100 total requests in a single period); cleared when the success rate is above 0.95 within 5 minutes.<br>PromQL:<br>`(avg(EMS_PUT_KV_STAT{namespace="ems", container="ems-server", item="BackGood"})+avg(EMS_GET_KV_STAT{namespace="ems", container="ems-server", item="BackGood"}))/(avg(EMS_PUT_KV_STAT{namespace="ems", container="ems-server", item="Start"})+avg(EMS_GET_KV_STAT{namespace="ems", container="ems-server", item="Start"}))<0.95` |
| Node read/write QPS | ems_server_kv_concurrent_num_used, ems_server_kv_concurrent_num_limit | 30s | ems_server_io_concurrent_high | Major | 1000 key/s | Triggered when the number of requests exceeds the 1000 key/s concurrency limit for 3 consecutive periods; cleared when it stays below 1000 key/s for 3 consecutive periods.<br>PromQL:<br>`avg(ems_server_kv_concurrent_num_used{namespace="ems", pod=~"ems-server.*"} / ems_server_kv_concurrent_num_limit{namespace="ems", pod=~"ems-server.*"}) by (pod) > 0.8` |
| Node read request latency | EMS_GET_KV_STAT | 30s | ems_server_io_read_latency_high | Major | 100 ms | Triggered when read requests with a latency above 100 ms occur within 3 consecutive periods; cleared when no read request exceeds 100 ms within 3 consecutive periods.<br>PromQL:<br>`avg_over_time(EMS_GET_KV_STAT{namespace="ems", kubernetes_pod=~"ems-server.*", container="ems-server", item="Av"}[90s])>100` |
| Node write request latency | EMS_PUT_KV_STAT | 30s | ems_server_io_write_latency_high | Major | 200 ms | Triggered when write requests with a latency above 200 ms occur within 3 consecutive periods; cleared when no write request exceeds 200 ms within 3 consecutive periods.<br>PromQL:<br>`avg_over_time(EMS_PUT_KV_STAT{namespace="ems", kubernetes_pod=~"ems-server.*", container="ems-server", item="Av"}[90s])>200` |
| ems-server ZooKeeper (zk) connection abnormal | ems_server_service_zk_status | 30s | ems_server_zk_status_abnormal | Major | 0 | Triggered when the status is abnormal within 3 consecutive periods; cleared when the status is normal for 3 consecutive periods.<br>PromQL:<br>`avg_over_time(ems_server_service_zk_status{namespace="ems", kubernetes_pod=~"ems-server.*", container="ems-server"}[90s]) != 0` |
| ems-server CPU usage | ems_server_process_cpu_usage | 30s | ems_server_cpu_usage_high | Major | 0.8 | Triggered when the value exceeds the threshold for 3 consecutive periods; cleared when it stays below the threshold for 3 consecutive periods.<br>PromQL:<br>`avg_over_time(ems_server_process_cpu_usage{namespace="ems", kubernetes_pod=~"ems-server.*", container="ems-server"}[1m30s])>80` |
| ems-server handle count | ems_server_all_process_fds | 30s | ems_server_fd_high | Major | 20000 | Triggered when the value exceeds the threshold for 3 consecutive periods; cleared when it stays below the threshold for 3 consecutive periods.<br>PromQL:<br>`avg_over_time(ems_server_all_process_fds{namespace="ems", kubernetes_pod=~"ems-server.*", container="ems-server"}[1m30s])>20000` |
| ems-server inter-node bandwidth (pull) | MFV_CE_TASK_PULL_FLOW_CONTROL_s_all | 30s | ems_server_inter_io_bandwidth_high | Major | 5 GB/s | Triggered when the value exceeds the threshold for 3 consecutive periods; cleared when it stays below the threshold for 3 consecutive periods.<br>PromQL:<br>`avg_over_time(MFV_CE_TASK_PULL_FLOW_CONTROL_s_all{namespace="ems", kubernetes_pod=~"ems-server.*", container="ems-server", item="Total"}[90s])/30/1000000000>5` |
| ems-server inter-node bandwidth (push) | MFV_CE_TASK_PUSH_FLOW_CONTROL_s_all | 30s | ems_server_inter_io_bandwidth_high | Major | 5 GB/s | Triggered when the value exceeds the threshold for 3 consecutive periods; cleared when it stays below the threshold for 3 consecutive periods.<br>PromQL:<br>`avg_over_time(MFV_CE_TASK_PUSH_FLOW_CONTROL_s_all{namespace="ems", kubernetes_pod=~"ems-server.*", container="ems-server", item="Total"}[90s])/30/1000000000>5` |
| ems-server inter-node PUT I/O latency | MFV_REM_KV_CLI_PUT_VAL | 30s | ems_server_inter_io_latency_high | Major | 50 ms | Triggered when the value exceeds the threshold for 3 consecutive periods; cleared when it stays below the threshold for 3 consecutive periods.<br>PromQL:<br>`avg_over_time(MFV_REM_KV_CLI_PUT_VAL{namespace="ems", kubernetes_pod=~"ems-server.*", container="ems-server", item="Av"}[90s])>50000` |
| ems-server inter-node GET I/O latency | MFV_REM_KV_CLI_GET_VAL | 30s | ems_server_inter_io_latency_high | Major | 50 ms | Triggered when the value exceeds the threshold for 3 consecutive periods; cleared when it stays below the threshold for 3 consecutive periods.<br>PromQL:<br>`avg_over_time(MFV_REM_KV_CLI_GET_VAL{namespace="ems", kubernetes_pod=~"ems-server.*", container="ems-server", item="Av"}[90s])>50000` |
| ems-server hugepage memory effective utilization | ems_server_pool_memory_used/ems_server_pool_memory_total | 30s | ems_server_hugepage_usage_high | Major | 0.95 | Triggered when the value exceeds the threshold for 3 consecutive periods; cleared when it stays below the threshold for 3 consecutive periods.<br>PromQL:<br>`avg((ems_server_pool_memory_used{namespace="ems", pod=~"ems-server.*"}) /(ems_server_pool_memory_total{namespace="ems", pod=~"ems-server.*"})) by (pod) >0.95` |
| ems-server NIC bandwidth monitoring | ems_server_nic_bandwidth | 30s | ems_server_nic_bandwidth_high | Major | 0.5 | Triggered when the value exceeds the threshold for 3 consecutive periods; cleared when it stays below the threshold for 3 consecutive periods.<br>PromQL:<br>`avg_over_time(ems_server_nic_bandwidth{namespace="ems", pod=~"ems-server.*",container = "ems-server"}[90s]) >0.5` |
| ems-ctrl handle count | ems_controller_all_process_fds | 30s | ems_ctrl_fd_high | Major | 20000 | Triggered when the value exceeds the threshold for 3 consecutive periods; cleared when it stays below the threshold for 3 consecutive periods.<br>PromQL:<br>`avg_over_time(ems_controller_all_process_fds{namespace="ems", kubernetes_pod=~"ems-controller.*", container="ems-controller"}[1m30s])>20000` |
| ems-ctrl CPU usage | ems_controller_process_cpu_usage | 30s | ems_ctrl_cpu_usage_high | Major | 0.8 | Triggered when the value exceeds the threshold for 3 consecutive periods; cleared when it stays below the threshold for 3 consecutive periods.<br>PromQL:<br>`avg_over_time(ems_controller_process_cpu_usage{namespace="ems", kubernetes_pod=~"ems-controller.*", container="ems-controller"}[1m30s]) > 80` |
| ZooKeeper (zk) process CPU usage | dmk_zk_cpu_usage | 30s | ems_zk_cpu_usage_high | Major | 0.8 | Triggered when the value exceeds the threshold for 3 consecutive periods; cleared when it stays below the threshold for 3 consecutive periods.<br>PromQL:<br>`avg_over_time(dmk_zk_cpu_usage{namespace="ems", pod=~"ems-zookeeper.*"}[90s])>80` |
| Controller pod status | kube_pod_status_ready | 30s | ems_controller_pod_status_abnormal | Major | 1 | The ems controller pod status is not OK (CCE cluster monitoring) and the fault lasts longer than 5 minutes.<br>PromQL:<br>`avg_over_time(kube_pod_status_ready{namespace="ems", condition="true", pod=~"ems-controller.*"}[90s]) != 1` |
| Server pod status | kube_pod_status_ready | 30s | ems_server_pod_status_abnormal | Major | 1 | The ems server pod status is not OK (CCE cluster monitoring) and the fault lasts longer than 5 minutes.<br>PromQL:<br>`avg_over_time(kube_pod_status_ready{namespace="ems", condition="true", pod=~"ems-server.*"}[90s]) != 1` |
| Controller ZooKeeper (zk) connection status | ems_controller_service_zk_status | 30s | ems_controller_zk_status_abnormal | Major | -1 | controller_service_zk_status == -1 and the fault lasts longer than 90s.<br>PromQL:<br>`avg_over_time(ems_controller_service_zk_status{namespace="ems", kubernetes_pod=~"ems-controller.*", container="ems-controller"}[90s]) != 0` |
| Billing information reporting status | ems_controller_service_charge_report_status | 30s | ems_controller_charge_report_status_abnormal | Major | 0 | controller_service_charge_report_status == -1<br>PromQL:<br>`avg_over_time(ems_controller_service_charge_report_status{namespace="ems", kubernetes_pod=~"ems-controller.*", container="ems-controller"}[90s]) != 0` |
| ems-ctrl container NIC packet loss | ems_controller_network_loss_packet_rate_qdisc | 30s | ems_controller_network_loss_packet_rate_qdisc | Major | 0% | Triggered when the value exceeds the threshold for 3 consecutive periods; cleared when it no longer exceeds the threshold for 3 consecutive periods.<br>PromQL:<br>`avg_over_time(ems_controller_network_loss_packet_rate_qdisc{namespace="ems", kubernetes_pod=~"ems-controller.*", container="ems-controller"}[90s])!=0` |
| ems-server container NIC packet loss | ems_server_network_loss_packet_rate_qdisc | 30s | ems_server_network_loss_packet_rate_qdisc | Major | 0% | Triggered when the value exceeds the threshold for 3 consecutive periods; cleared when it no longer exceeds the threshold for 3 consecutive periods.<br>PromQL:<br>`avg_over_time(ems_server_network_loss_packet_rate_qdisc{namespace="ems", kubernetes_pod=~"ems-server.*", container="ems-server"}[90s])!=0` |
| Service process status abnormal | ems_server_process_status | 30s | ems_server_process_status | Critical | 1 | Triggered when the value is not equal to the threshold (1) for 3 consecutive periods; cleared when it equals the threshold for 3 consecutive periods.<br>PromQL:<br>`avg_over_time(ems_server_process_status{namespace="ems", kubernetes_pod=~"ems-server.*", container="ems-server"}[90s]) !=1` |
| ems-ctrl master node fails frequently | ems_controller_master_change | 30s | ems_controller_master_change | Critical | 3 times | Triggered when the master switches three times within 3 consecutive periods.<br>PromQL:<br>`sum( changes(ems_controller_master_change{namespace="ems", kubernetes_pod=~"ems-controller.*", container="ems-controller"}[5m]) > 0 ) > 3` |
| Configuration file validation | ems_server_config_error_code | 30s | ems_kv_config_error | Critical | 0 | Triggered when the value is not equal to 0 for 3 consecutive periods; cleared when it returns to 0 for 3 consecutive periods.<br>PromQL:<br>`avg_over_time(ems_server_config_error_code{namespace="ems", kubernetes_pod=~"ems-server.*", container="ems-server"}[90s]) !=0` |
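Several rows in the table above state the threshold in one unit while the PromQL expression compares against a number in another. The following Python sketch works through that arithmetic. It rests on assumptions inferred from the expressions rather than confirmed by the product: that the `[90s]` and `[1m30s]` windows stand for 3 collection periods of 30s, that the flow-control counters report bytes accumulated per 30s period (suggested by the `/30/1000000000` divisor), and that the inter-node latency metrics are in microseconds (suggested by comparing a 50 ms threshold with 50000). The function names are illustrative only and are not part of EMS or AOM.

```python
COLLECTION_PERIOD_S = 30   # collection period listed for every row of the table
CONSECUTIVE_PERIODS = 3    # "3 consecutive periods" in the trigger conditions


def window_seconds(periods: int = CONSECUTIVE_PERIODS,
                   period_s: int = COLLECTION_PERIOD_S) -> int:
    """Length of the PromQL range window that covers `periods` samples."""
    return periods * period_s


def bandwidth_gb_per_s(bytes_in_period: float,
                       period_s: int = COLLECTION_PERIOD_S) -> float:
    """Assumption: the counter holds bytes accumulated over one period.

    Mirrors the `/30/1000000000` part of the bandwidth expressions.
    """
    return bytes_in_period / period_s / 1_000_000_000


def ms_to_us(latency_ms: int) -> int:
    """Assumption: the inter-node latency metrics are reported in microseconds."""
    return latency_ms * 1000


if __name__ == "__main__":
    print(window_seconds())                      # seconds covered by [90s] / [1m30s]
    print(bandwidth_gb_per_s(150_000_000_000))   # GB/s for 150e9 bytes in one period
    print(ms_to_us(50))                          # value compared against in the PromQL
```

If these assumptions hold, a 5 GB/s threshold corresponds to 150,000,000,000 bytes accumulated in one 30s period, and a 50 ms latency threshold corresponds to the literal 50000 in the expression.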
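Procedure
To configure monitoring alarms, perform the following steps:
1. Log in to the AOM console.
2. Navigate to the alarm rules page.
3. Click "Create Alarm Rule", fill in the rule according to Table 1 EMS metric alarms, and then click "OK".
   Fill in the rule fields as follows:
   - Rule name: use the alarm name from Table 1 EMS metric alarms.
   - Rule display name and description: optional.
   - Rule type: select metric alarm rule.
   - Configuration mode: select syntax mode.
   - Prometheus instance: select the instance created in step 4 of "Collecting O&M Metrics".
   - PromQL statement: use the PromQL from Table 1 EMS metric alarms.
   - Alarm severity: use the severity from Table 1 EMS metric alarms.
   Figure 1 Creating an alarm rule
4. Repeat step 3 until all alarms are configured.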
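Because the same form is filled in once per alarm, it can help to keep the rules as data and track which ones are still outstanding. The following Python sketch is one hypothetical way to do that. It does not call any AOM API; the `AlarmRule` class and the helper functions are illustrative names, and only two rows of the alarm table are shown as sample data.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class AlarmRule:
    """One alarm rule as entered in the console form (hypothetical model)."""
    name: str                               # alarm name from the table
    promql: str                             # PromQL statement from the table
    severity: str                           # "Major" or "Critical"
    rule_type: str = "metric alarm rule"    # fixed choice in the procedure
    config_mode: str = "syntax mode"        # fixed choice in the procedure
    display_name: str = ""                  # optional
    description: str = ""                   # optional


def missing_fields(rule: AlarmRule) -> List[str]:
    """Return the names of required fields that are empty."""
    required = {"name": rule.name, "promql": rule.promql, "severity": rule.severity}
    return [field for field, value in required.items() if not value.strip()]


def pending_rules(rules: Iterable[AlarmRule], configured: Set[str]) -> List[str]:
    """Names of rules not yet created, in table order."""
    return [rule.name for rule in rules if rule.name not in configured]


RULES = [
    AlarmRule(
        name="ems_server_zk_status_abnormal",
        promql='avg_over_time(ems_server_service_zk_status{namespace="ems", '
               'kubernetes_pod=~"ems-server.*", container="ems-server"}[90s]) != 0',
        severity="Major",
    ),
    AlarmRule(
        name="ems_kv_config_error",
        promql='avg_over_time(ems_server_config_error_code{namespace="ems", '
               'kubernetes_pod=~"ems-server.*", container="ems-server"}[90s]) != 0',
        severity="Critical",
    ),
]

if __name__ == "__main__":
    for rule in RULES:
        print(rule.name, missing_fields(rule))
    print(pending_rules(RULES, {"ems_server_zk_status_abnormal"}))
```

Checking `missing_fields` before opening the form and `pending_rules` after each pass through the create step gives a simple stopping condition for the repeat step.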