Updated: 2025-11-25 GMT+08:00

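Configuring Monitoring Alarms (Optional for Middleware)

This section describes the alarm policies for selected monitoring metrics and how to configure them. In production, configure alarm rules for these metrics according to the policies below.

Monitoring Metrics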

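Table 1 EMS metric alarms

| Monitored item | Metrics | Collection interval | Alarm name | Severity | Threshold | Alarm condition and PromQL |
| --- | --- | --- | --- | --- | --- | --- |
| Failed read requests per node | EMS_GET_KV_STAT | 30s | ems_server_io_read_status_abnormal | Major | 5% | Triggers when errors occur in 3 consecutive cycles; recovers when no errors are reported for 3 consecutive cycles. PromQL: `avg(EMS_GET_KV_STAT{namespace="ems", container="ems-server", item="BackBad"}) by (pid) > 0` |
| Failed write requests per node | EMS_PUT_KV_STAT | 30s | ems_server_io_write_status_abnormal | Major | 5% | Triggers when errors occur in 3 consecutive cycles; recovers when no errors are reported for 3 consecutive cycles. PromQL: `avg(EMS_PUT_KV_STAT{namespace="ems", container="ems-server", item="BackBad"}) by (pid) > 0` |
| Request success rate | EMS_PUT_KV_STAT, EMS_GET_KV_STAT | 30s | ems_server_io_success_rate_low | Critical | 0.95 | Triggers when the success rate stays below 0.95 for 5 consecutive minutes (only when the total number of requests in a single cycle exceeds 100); recovers when the success rate is above 0.95 for 5 minutes. PromQL: `(avg(EMS_PUT_KV_STAT{namespace="ems", container="ems-server", item="BackGood"})+avg(EMS_GET_KV_STAT{namespace="ems", container="ems-server", item="BackGood"}))/(avg(EMS_PUT_KV_STAT{namespace="ems", container="ems-server", item="Start"})+avg(EMS_GET_KV_STAT{namespace="ems", container="ems-server", item="Start"}))<0.95` |
| Node read/write QPS | ems_server_kv_concurrent_num_used, ems_server_kv_concurrent_num_limit | 30s | ems_server_io_concurrent_high | Major | 1000 key/s | Triggers when the request count exceeds the 1000 key/s concurrency limit for 3 consecutive cycles; recovers when it stays below 1000 key/s for 3 consecutive cycles. PromQL: `avg(ems_server_kv_concurrent_num_used{namespace="ems", pod=~"ems-server.*"} / ems_server_kv_concurrent_num_limit{namespace="ems", pod=~"ems-server.*"}) by (pod) > 0.8` |
| Node read request latency | EMS_GET_KV_STAT | 30s | ems_server_io_read_latency_high | Major | 100ms | Triggers when read requests with latency above 100 ms occur in 3 consecutive cycles; recovers when no read request above 100 ms occurs for 3 consecutive cycles. PromQL: `avg_over_time(EMS_GET_KV_STAT{namespace="ems", kubernetes_pod=~"ems-server.*", container="ems-server", item="Av"}[90s])>100` |
| Node write request latency | EMS_PUT_KV_STAT | 30s | ems_server_io_write_latency_high | Major | 200ms | Triggers when write requests with latency above 200 ms occur in 3 consecutive cycles; recovers when no write request above 200 ms occurs for 3 consecutive cycles. PromQL: `avg_over_time(EMS_PUT_KV_STAT{namespace="ems", kubernetes_pod=~"ems-server.*", container="ems-server", item="Av"}[90s])>200` |
| ems-server ZooKeeper connection abnormal | ems_server_service_zk_status | 30s | ems_server_zk_status_abnormal | Major | 0 | Triggers when the status is abnormal for 3 consecutive cycles; recovers when the status is normal for 3 consecutive cycles. PromQL: `avg_over_time(ems_server_service_zk_status{namespace="ems", kubernetes_pod=~"ems-server.*", container="ems-server"}[90s]) != 0` |
| ems-server CPU usage | ems_server_process_cpu_usage | 30s | ems_server_cpu_usage_high | Major | 0.8 | Triggers when the value is above the threshold for 3 consecutive cycles; recovers when it is below the threshold for 3 consecutive cycles. PromQL: `avg_over_time(ems_server_process_cpu_usage{namespace="ems", kubernetes_pod=~"ems-server.*", container="ems-server"}[1m30s])>80` |
| ems-server file descriptor count | ems_server_all_process_fds | 30s | ems_server_fd_high | Major | 20000 | Triggers when the value is above the threshold for 3 consecutive cycles; recovers when it is below the threshold for 3 consecutive cycles. PromQL: `avg_over_time(ems_server_all_process_fds{namespace="ems", kubernetes_pod=~"ems-server.*", container="ems-server"}[1m30s])>20000` |
| ems-server inter-node bandwidth (pull) | MFV_CE_TASK_PULL_FLOW_CONTROL_s_all | 30s | ems_server_inter_io_bandwidth_high | Major | 5GB/s | Triggers when the value is above the threshold for 3 consecutive cycles; recovers when it is below the threshold for 3 consecutive cycles. PromQL: `avg_over_time(MFV_CE_TASK_PULL_FLOW_CONTROL_s_all{namespace="ems", kubernetes_pod=~"ems-server.*", container="ems-server", item="Total"}[90s])/30/1000000000>5` |
| ems-server inter-node bandwidth (push) | MFV_CE_TASK_PUSH_FLOW_CONTROL_s_all | 30s | ems_server_inter_io_bandwidth_high | Major | 5GB/s | Triggers when the value is above the threshold for 3 consecutive cycles; recovers when it is below the threshold for 3 consecutive cycles. PromQL: `avg_over_time(MFV_CE_TASK_PUSH_FLOW_CONTROL_s_all{namespace="ems", kubernetes_pod=~"ems-server.*", container="ems-server", item="Total"}[90s])/30/1000000000>5` |
| ems-server inter-node PUT I/O latency | MFV_REM_KV_CLI_PUT_VAL | 30s | ems_server_inter_io_latency_high | Major | 50ms | Triggers when the value is above the threshold for 3 consecutive cycles; recovers when it is below the threshold for 3 consecutive cycles. PromQL: `avg_over_time(MFV_REM_KV_CLI_PUT_VAL{namespace="ems", kubernetes_pod=~"ems-server.*", container="ems-server", item="Av"}[90s])>50000` |
| ems-server inter-node GET I/O latency | MFV_REM_KV_CLI_GET_VAL | 30s | ems_server_inter_io_latency_high | Major | 50ms | Triggers when the value is above the threshold for 3 consecutive cycles; recovers when it is below the threshold for 3 consecutive cycles. PromQL: `avg_over_time(MFV_REM_KV_CLI_GET_VAL{namespace="ems", kubernetes_pod=~"ems-server.*", container="ems-server", item="Av"}[90s])>50000` |
| ems-server effective huge page memory utilization | ems_server_pool_memory_used / ems_server_pool_memory_total | 30s | ems_server_hugepage_usage_high | Major | 0.95 | Triggers when the value is above the threshold for 3 consecutive cycles; recovers when it is below the threshold for 3 consecutive cycles. PromQL: `avg((ems_server_pool_memory_used{namespace="ems", pod=~"ems-server.*"}) /(ems_server_pool_memory_total{namespace="ems", pod=~"ems-server.*"})) by (pod) >0.95` |
| ems-server NIC bandwidth | ems_server_nic_bandwidth | 30s | ems_server_nic_bandwidth_high | Major | 0.5 | Triggers when the value is above the threshold for 3 consecutive cycles; recovers when it is below the threshold for 3 consecutive cycles. PromQL: `avg_over_time(ems_server_nic_bandwidth{namespace="ems", pod=~"ems-server.*", container="ems-server"}[90s]) >0.5` |
| ems-ctrl file descriptor count | ems_controller_all_process_fds | 30s | ems_ctrl_fd_high | Major | 20000 | Triggers when the value is above the threshold for 3 consecutive cycles; recovers when it is below the threshold for 3 consecutive cycles. PromQL: `avg_over_time(ems_controller_all_process_fds{namespace="ems", kubernetes_pod=~"ems-controller.*", container="ems-controller"}[1m30s])>20000` |
| ems-ctrl CPU usage | ems_controller_process_cpu_usage | 30s | ems_ctrl_cpu_usage_high | Major | 0.8 | Triggers when the value is above the threshold for 3 consecutive cycles; recovers when it is below the threshold for 3 consecutive cycles. PromQL: `avg_over_time(ems_controller_process_cpu_usage{namespace="ems", kubernetes_pod=~"ems-controller.*", container="ems-controller"}[1m30s]) > 80` |
| ZooKeeper process CPU usage | dmk_zk_cpu_usage | 30s | ems_zk_cpu_usage_high | Major | 0.8 | Triggers when the value is above the threshold for 3 consecutive cycles; recovers when it is below the threshold for 3 consecutive cycles. PromQL: `avg_over_time(dmk_zk_cpu_usage{namespace="ems", pod=~"ems-zookeeper.*"}[90s])>80` |
| Controller pod status | kube_pod_status_ready | 30s | ems_controller_pod_status_abnormal | Major | 1 | ems controller pod status == notok (CCE cluster monitoring), fault duration > 5 min. PromQL: `avg_over_time(kube_pod_status_ready{namespace="ems", condition="true", pod=~"ems-controller.*"}[90s]) != 1` |
| Server pod status | kube_pod_status_ready | 30s | ems_server_pod_status_abnormal | Major | 1 | ems server pod status == notok (CCE cluster monitoring), fault duration > 5 min. PromQL: `avg_over_time(kube_pod_status_ready{namespace="ems", condition="true", pod=~"ems-server.*"}[90s]) != 1` |
| Controller ZooKeeper connection status | ems_controller_service_zk_status | 30s | ems_controller_zk_status_abnormal | Major | -1 | controller_service_zk_status == -1, fault duration > 90s. PromQL: `avg_over_time(ems_controller_service_zk_status{namespace="ems", kubernetes_pod=~"ems-controller.*", container="ems-controller"}[90s]) != 0` |
| Billing information reporting status | ems_controller_service_charge_report_status | 30s | ems_controller_charge_report_status_abnormal | Major | 0 | controller_service_charge_report_status == -1. PromQL: `avg_over_time(ems_controller_service_charge_report_status{namespace="ems", kubernetes_pod=~"ems-controller.*", container="ems-controller"}[90s]) != 0` |
| ems-ctrl container NIC packet loss | ems_controller_network_loss_packet_rate_qdisc | 30s | ems_controller_network_loss_packet_rate_qdisc | Major | 0% | Triggers when the value is above the threshold for 3 consecutive cycles; recovers when it is below the threshold for 3 consecutive cycles. PromQL: `avg_over_time(ems_controller_network_loss_packet_rate_qdisc{namespace="ems", kubernetes_pod=~"ems-controller.*", container="ems-controller"}[90s])!=0` |
| ems-server container NIC packet loss | ems_server_network_loss_packet_rate_qdisc | 30s | ems_server_network_loss_packet_rate_qdisc | Major | 0% | Triggers when the value is above the threshold for 3 consecutive cycles; recovers when it is below the threshold for 3 consecutive cycles. PromQL: `avg_over_time(ems_server_network_loss_packet_rate_qdisc{namespace="ems", kubernetes_pod=~"ems-server.*", container="ems-server"}[90s])!=0` |
| Service process status abnormal | ems_server_process_status | 30s | ems_server_process_status | Critical | 1 | Triggers when the value is not equal to the threshold for 3 consecutive cycles; recovers when it is below the threshold for 3 consecutive cycles. PromQL: `avg_over_time(ems_server_process_status{namespace="ems", kubernetes_pod=~"ems-server.*", container="ems-server"}[90s]) !=1` |
| ems-ctrl master node fails frequently | ems_controller_master_change | 30s | ems_controller_master_change | Critical | 3 times | Triggers when the master switches three times within 3 consecutive cycles. PromQL: `sum(changes(ems_controller_master_change{namespace="ems", kubernetes_pod=~"ems-controller.*", container="ems-controller"}[5m]) > 0) > 3` |
| Configuration file validation | ems_server_config_error_code | 30s | ems_kv_config_error | Critical | 0 | Triggers when the value is not equal to 0 for 3 consecutive cycles; recovers when it is below the threshold for 3 consecutive cycles. PromQL: `avg_over_time(ems_server_config_error_code{namespace="ems", kubernetes_pod=~"ems-server.*", container="ems-server"}[90s]) !=0` |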
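
Several rows in Table 1 state the threshold in one unit while the PromQL compares a raw value, so it can help to work through the arithmetic before entering a rule. The Python sketch below does that under assumptions that the source does not state explicitly: that the 90s range window corresponds to 3 collection cycles of 30s, that the inter-node latency metrics report microseconds (inferred from a 50ms threshold being compared against 50000), and that the flow-control metrics accumulate bytes per 30s cycle (inferred from the `/30/1000000000` in the bandwidth rules). All function names are hypothetical.

```python
COLLECTION_INTERVAL_S = 30  # collection interval listed for every row in Table 1


def window_seconds(cycles: int, interval_s: int = COLLECTION_INTERVAL_S) -> int:
    """Length of a range window that spans `cycles` consecutive collection cycles."""
    return cycles * interval_s


def ms_to_us(threshold_ms: float) -> float:
    """Assumption: the inter-node latency metrics are reported in microseconds."""
    return threshold_ms * 1000


def bytes_per_cycle_to_gb_per_s(total_bytes: float,
                                interval_s: int = COLLECTION_INTERVAL_S) -> float:
    """Mirror the `/30/1000000000` step of the bandwidth rules.

    Assumption: the metric holds the bytes transferred during one cycle.
    """
    return total_bytes / interval_s / 1_000_000_000


def success_rate_alarm(good: float, started: float,
                       threshold: float = 0.95, min_requests: int = 100) -> bool:
    """Success-rate check including the 'more than 100 requests' guard.

    The guard comes from the condition text of the success-rate row.
    """
    if started <= min_requests:
        return False  # too few requests in this cycle to judge the rate
    return good / started < threshold


if __name__ == "__main__":
    print(window_seconds(3))                    # window for "3 consecutive cycles"
    print(ms_to_us(50))                         # 50ms threshold in raw units
    print(bytes_per_cycle_to_gb_per_s(150e9))   # bytes per cycle as GB/s
    print(success_rate_alarm(940, 1000))        # enough traffic, rate 0.94
    print(success_rate_alarm(40, 50))           # below the request guard
```

Note that the request-count guard appears only in the condition text of the success-rate row; the PromQL listed for that row does not express it, so the sketch shows the guard as a separate check.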

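Procedure

To configure monitoring alarms:

  1. Log in to the AOM console.
  2. Choose Alarm Center > Alarm Rules to open the Alarm Rules page.
  3. Click "Create Alarm Rule", fill in the rule according to Table 1 EMS metric alarms, and then click "OK".

    Fill in the rule according to the following requirements:

  4. Repeat step 3 until all alarms have been configured.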
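
Because step 3 is repeated once per row of Table 1, a small bookkeeping aid can help track which rules still need to be created. The following Python sketch is only that: it calls no AOM API, and the `AlarmRule` type and `pending` helper are hypothetical names introduced for illustration. It lists three rows copied from the table; the remaining rows would follow the same shape.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlarmRule:
    name: str      # value of the "alarm name" column
    severity: str  # value of the "severity" column
    promql: str    # PromQL statement from the last column


# Three rows from Table 1. Some alarm names are reused across rows in the
# table, so whole records (not just names) are compared below.
RULES = [
    AlarmRule(
        "ems_server_cpu_usage_high", "Major",
        'avg_over_time(ems_server_process_cpu_usage{namespace="ems", '
        'kubernetes_pod=~"ems-server.*", container="ems-server"}[1m30s])>80',
    ),
    AlarmRule(
        "ems_ctrl_fd_high", "Major",
        'avg_over_time(ems_controller_all_process_fds{namespace="ems", '
        'kubernetes_pod=~"ems-controller.*", container="ems-controller"}[1m30s])>20000',
    ),
    AlarmRule(
        "ems_zk_cpu_usage_high", "Major",
        'avg_over_time(dmk_zk_cpu_usage{namespace="ems", '
        'pod=~"ems-zookeeper.*"}[90s])>80',
    ),
]


def pending(rules, configured):
    """Return the rules that have not been created yet, in table order."""
    return [rule for rule in rules if rule not in configured]


if __name__ == "__main__":
    done = {RULES[0]}  # rules already created in the console
    for rule in pending(RULES, done):
        print(f"{rule.name} [{rule.severity}]: {rule.promql}")
```

Adding each rule to `done` after completing step 3 for it makes `pending` return an empty list once every alarm has been configured.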
