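Configuring Monitoring Alarms (Optional for Middleware)
This section describes the alarm policies for selected monitoring metrics and how to configure them. In production, you are advised to configure alarm rules for the monitoring metrics according to the following alarm policies.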
Monitoring Metrics

Table 1 ems metric alarms

| Monitored item | Metrics involved | Collection period | Alarm name | Alarm severity | Alarm threshold | Trigger/clear condition and PromQL |
|---|---|---|---|---|---|---|
| Number of failed node read requests | EMS_GET_KV_STAT | 30s | ems_server_io_read_status_abnormal | Major | 5% | Triggered when errors are reported in 3 consecutive periods; cleared when no errors are reported in 3 consecutive periods. PromQL: `avg(EMS_GET_KV_STAT{namespace="ems", container="ems-server", item="BackBad"}) by (pid) > 0` |
| Number of failed node write requests | EMS_PUT_KV_STAT | 30s | ems_server_io_write_status_abnormal | Major | 5% | Triggered when errors are reported in 3 consecutive periods; cleared when no errors are reported in 3 consecutive periods. PromQL: `avg(EMS_PUT_KV_STAT{namespace="ems", container="ems-server", item="BackBad"}) by (pid) > 0` |
| Request success rate | EMS_PUT_KV_STAT, EMS_GET_KV_STAT | 30s | ems_server_io_success_rate_low | Critical | 0.95 | Triggered when the success rate stays below 0.95 for 5 consecutive minutes (with more than 100 total requests in a single period); cleared when the success rate is above 0.95 within 5 minutes. PromQL: `(avg(EMS_PUT_KV_STAT{namespace="ems", container="ems-server", item="BackGood"})+avg(EMS_GET_KV_STAT{namespace="ems", container="ems-server", item="BackGood"}))/(avg(EMS_PUT_KV_STAT{namespace="ems", container="ems-server", item="Start"})+avg(EMS_GET_KV_STAT{namespace="ems", container="ems-server", item="Start"}))<0.95` |
| Node read/write QPS | ems_server_kv_concurrent_num_used, ems_server_kv_concurrent_num_limit | 30s | ems_server_io_concurrent_high | Major | 1000 key/s | Triggered when the request rate exceeds the 1000 key/s concurrency limit for 3 consecutive periods; cleared when it stays below 1000 key/s for 3 consecutive periods. PromQL: `avg(ems_server_kv_concurrent_num_used{namespace="ems", pod=~"ems-server.*"} / ems_server_kv_concurrent_num_limit{namespace="ems", pod=~"ems-server.*"}) by (pod) > 0.8` |
| Node read request latency | EMS_GET_KV_STAT | 30s | ems_server_io_read_latency_high | Major | 100 ms | Triggered when read requests with a latency above 100 ms occur in 3 consecutive periods; cleared when no read request exceeds 100 ms in 3 consecutive periods. PromQL: `avg_over_time(EMS_GET_KV_STAT{namespace="ems", kubernetes_pod=~"ems-server.*", container="ems-server", item="Av"}[90s])>100` |
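| Node write request latency | EMS_PUT_KV_STAT | 30s | ems_server_io_write_latency_high | Major | 200 ms | Triggered when write requests with a latency above 200 ms occur in 3 consecutive periods; cleared when no write request exceeds 200 ms in 3 consecutive periods. PromQL: `avg_over_time(EMS_PUT_KV_STAT{namespace="ems", kubernetes_pod=~"ems-server.*", container="ems-server", item="Av"}[90s])>200` |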
| ems-server connection to ZooKeeper (zk) abnormal | ems_server_service_zk_status | 30s | ems_server_zk_status_abnormal | Major | 0 | Triggered when the status is abnormal in 3 consecutive periods; cleared when the status is normal in 3 consecutive periods. PromQL: `avg_over_time(ems_server_service_zk_status{namespace="ems", kubernetes_pod=~"ems-server.*", container="ems-server"}[90s]) != 0` |
| ems-server CPU usage | ems_server_process_cpu_usage | 30s | ems_server_cpu_usage_high | Major | 0.8 | Triggered when the value is above the threshold for 3 consecutive periods; cleared when it is below the threshold for 3 consecutive periods. PromQL: `avg_over_time(ems_server_process_cpu_usage{namespace="ems", kubernetes_pod=~"ems-server.*", container="ems-server"}[1m30s])>80` |
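| ems-server handle (file descriptor) count | ems_server_all_process_fds | 30s | ems_server_fd_high | Major | 20000 | Triggered when the value is above the threshold for 3 consecutive periods; cleared when it is below the threshold for 3 consecutive periods. PromQL: `avg_over_time(ems_server_all_process_fds{namespace="ems", kubernetes_pod=~"ems-server.*", container="ems-server"}[1m30s])>20000` |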
| ems-server inter-node bandwidth (pull) | MFV_CE_TASK_PULL_FLOW_CONTROL_s_all | 30s | ems_server_inter_io_bandwidth_high | Major | 5 GB/s | Triggered when the value is above the threshold for 3 consecutive periods; cleared when it is below the threshold for 3 consecutive periods. PromQL: `avg_over_time(MFV_CE_TASK_PULL_FLOW_CONTROL_s_all{namespace="ems", kubernetes_pod=~"ems-server.*", container="ems-server", item="Total"}[90s])/30/1000000000>5` |
| ems-server inter-node bandwidth (push) | MFV_CE_TASK_PUSH_FLOW_CONTROL_s_all | 30s | ems_server_inter_io_bandwidth_high | Major | 5 GB/s | Triggered when the value is above the threshold for 3 consecutive periods; cleared when it is below the threshold for 3 consecutive periods. PromQL: `avg_over_time(MFV_CE_TASK_PUSH_FLOW_CONTROL_s_all{namespace="ems", kubernetes_pod=~"ems-server.*", container="ems-server", item="Total"}[90s])/30/1000000000>5` |
| ems-server inter-node PUT I/O latency | MFV_REM_KV_CLI_PUT_VAL | 30s | ems_server_inter_io_latency_high | Major | 50 ms | Triggered when the value is above the threshold for 3 consecutive periods; cleared when it is below the threshold for 3 consecutive periods. PromQL: `avg_over_time(MFV_REM_KV_CLI_PUT_VAL{namespace="ems", kubernetes_pod=~"ems-server.*", container="ems-server", item="Av"}[90s])>50000` |
| ems-server inter-node GET I/O latency | MFV_REM_KV_CLI_GET_VAL | 30s | ems_server_inter_io_latency_high | Major | 50 ms | Triggered when the value is above the threshold for 3 consecutive periods; cleared when it is below the threshold for 3 consecutive periods. PromQL: `avg_over_time(MFV_REM_KV_CLI_GET_VAL{namespace="ems", kubernetes_pod=~"ems-server.*", container="ems-server", item="Av"}[90s])>50000` |
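| ems-server effective huge page memory utilization | ems_server_pool_memory_used/ems_server_pool_memory_total | 30s | ems_server_hugepage_usage_high | Major | 0.95 | Triggered when the value is above the threshold for 3 consecutive periods; cleared when it is below the threshold for 3 consecutive periods. PromQL: `avg((ems_server_pool_memory_used{namespace="ems", pod=~"ems-server.*"}) /(ems_server_pool_memory_total{namespace="ems", pod=~"ems-server.*"})) by (pod) >0.95` |
| ems-server NIC bandwidth | ems_server_nic_bandwidth | 30s | ems_server_nic_bandwidth_high | Major | 0.5 | Triggered when the value is above the threshold for 3 consecutive periods; cleared when it is below the threshold for 3 consecutive periods. PromQL: `avg_over_time(ems_server_nic_bandwidth{namespace="ems", pod=~"ems-server.*",container = "ems-server"}[90s]) >0.5` |
| ems-ctrl handle (file descriptor) count | ems_controller_all_process_fds | 30s | ems_ctrl_fd_high | Major | 20000 | Triggered when the value is above the threshold for 3 consecutive periods; cleared when it is below the threshold for 3 consecutive periods. PromQL: `avg_over_time(ems_controller_all_process_fds{namespace="ems", kubernetes_pod=~"ems-controller.*", container="ems-controller"}[1m30s])>20000` |
| ems-ctrl CPU usage | ems_controller_process_cpu_usage | 30s | ems_ctrl_cpu_usage_high | Major | 0.8 | Triggered when the value is above the threshold for 3 consecutive periods; cleared when it is below the threshold for 3 consecutive periods. PromQL: `avg_over_time(ems_controller_process_cpu_usage{namespace="ems", kubernetes_pod=~"ems-controller.*", container="ems-controller"}[1m30s]) > 80` |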
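| zk process CPU usage | dmk_zk_cpu_usage | 30s | ems_zk_cpu_usage_high | Major | 0.8 | Triggered when the value is above the threshold for 3 consecutive periods; cleared when it is below the threshold for 3 consecutive periods. PromQL: `avg_over_time(dmk_zk_cpu_usage{namespace="ems", pod=~"ems-zookeeper.*"}[90s])>80` |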
| controller pod status | kube_pod_status_ready | 30s | ems_controller_pod_status_abnormal | Major | 1 | ems controller pod status == notok (CCE cluster monitoring), fault duration > 5 min. PromQL: `avg_over_time(kube_pod_status_ready{namespace="ems", condition="true", pod=~"ems-controller.*"}[90s]) != 1` |
| server pod status | kube_pod_status_ready | 30s | ems_server_pod_status_abnormal | Major | 1 | ems server pod status == notok (CCE cluster monitoring), fault duration > 5 min. PromQL: `avg_over_time(kube_pod_status_ready{namespace="ems", condition="true", pod=~"ems-server.*"}[90s]) != 1` |
| controller connection status to zk | ems_controller_service_zk_status | 30s | ems_controller_zk_status_abnormal | Major | -1 | controller_service_zk_status == -1, fault duration > 90 s. PromQL: `avg_over_time(ems_controller_service_zk_status{namespace="ems", kubernetes_pod=~"ems-controller.*", container="ems-controller"}[90s]) != 0` |
| Billing information reporting status | ems_controller_service_charge_report_status | 30s | ems_controller_charge_report_status_abnormal | Major | 0 | controller_service_charge_report_status == -1. PromQL: `avg_over_time(ems_controller_service_charge_report_status{namespace="ems", kubernetes_pod=~"ems-controller.*", container="ems-controller"}[90s]) != 0` |
| ems-ctrl container NIC packet loss | ems_controller_network_loss_packet_rate_qdisc | 30s | ems_controller_network_loss_packet_rate_qdisc | Major | 0% | Triggered when the value is above the threshold for 3 consecutive periods; cleared when it is below the threshold for 3 consecutive periods. PromQL: `avg_over_time(ems_controller_network_loss_packet_rate_qdisc{namespace="ems", kubernetes_pod=~"ems-controller.*", container="ems-controller"}[90s])!=0` |
| ems-server container NIC packet loss | ems_server_network_loss_packet_rate_qdisc | 30s | ems_server_network_loss_packet_rate_qdisc | Major | 0% | Triggered when the value is above the threshold for 3 consecutive periods; cleared when it is below the threshold for 3 consecutive periods. PromQL: `avg_over_time(ems_server_network_loss_packet_rate_qdisc{namespace="ems", kubernetes_pod=~"ems-server.*", container="ems-server"}[90s])!=0` |
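| Abnormal service process status | ems_server_process_status | 30s | ems_server_process_status | Critical | 1 | Triggered when the status (0/1) differs from the threshold for 3 consecutive periods; cleared when it matches the threshold again for 3 consecutive periods. PromQL: `avg_over_time(ems_server_process_status{namespace="ems", kubernetes_pod=~"ems-server.*", container="ems-server"} [90s]) !=1` |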
| Frequent ems-ctrl master node failures | ems_controller_master_change | 30s | ems_controller_master_change | Critical | 3 times | Triggered when the master switches three times within 3 consecutive periods. PromQL: `sum( changes(ems_controller_master_change{namespace="ems", kubernetes_pod=~"ems-controller.*", container="ems-controller"}[5m]) > 0 ) > 3` |
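| Configuration file validation | ems_server_config_error_code | 30s | ems_kv_config_error | Critical | 0 | Triggered when the value is not 0 for 3 consecutive periods; cleared when it is 0 again for 3 consecutive periods. PromQL: `avg_over_time(ems_server_config_error_code{namespace="ems", kubernetes_pod=~"ems-server.*", container="ems-server"} [90s]) !=0` |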
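
The "3 consecutive periods" wording used in most rows can be read as a simple debounce over the 30 s samples. The following is a minimal sketch of that reading only, not the alarm engine's actual implementation (the PromQL expressions in the table average over a 90 s window instead). The class name `ConsecutiveAlarm` and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ConsecutiveAlarm:
    """Hypothetical helper: fire after `periods` consecutive breaches,
    clear after `periods` consecutive normal samples."""
    threshold: float
    periods: int = 3      # 3 periods x 30 s collection period = 90 s
    active: bool = False
    _streak: int = 0      # consecutive samples that contradict the current state

    def observe(self, value: float) -> bool:
        breached = value > self.threshold
        # A sample contradicts the state if it breaches while the alarm is
        # inactive, or is normal while the alarm is active.
        if breached != self.active:
            self._streak += 1
            if self._streak >= self.periods:
                self.active = breached
                self._streak = 0
        else:
            self._streak = 0
        return self.active


# Made-up CPU usage samples (percent), checked against the value 80 used in
# the CPU usage PromQL expressions.
alarm = ConsecutiveAlarm(threshold=80)
samples = [70, 85, 90, 75, 85, 90, 95, 70, 60, 50]
states = [alarm.observe(v) for v in samples]
print(states)
```

In this sketch, the two early breaches (85, 90) do not raise the alarm because the third sample falls back below the threshold. The alarm is raised only on the third breach in a row and cleared on the third normal sample in a row.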
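Procedure
To configure monitoring alarms, perform the following steps:
1. Log in to the AOM console.
2. Go to the alarm rule page.
3. Click "Create Alarm Rule", fill in the rule according to Table 1 ems metric alarms, and then click "OK".
   Fill in the rule fields as follows:
   - Rule name: use the alarm name in Table 1 ems metric alarms.
   - Rule display name and description: optional.
   - Rule type: select metric alarm rule.
   - Configuration mode: select syntax mode.
   - Prometheus instance: select the instance created in step 4 of "Collecting O&M Metrics".
   - PromQL statement: use the PromQL content in Table 1 ems metric alarms.
   - Alarm severity: use the alarm severity in Table 1 ems metric alarms.

   Figure 1 Creating an alarm rule
4. Repeat step 3 until all alarms are configured.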
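
Because step 3 is repeated once per row of the table, it can help to keep the rules in a small checklist and sanity-check each PromQL string before pasting it into the console. The sketch below is one hypothetical way to do that offline. It does not call any AOM API, the `AlarmRule` type and `check_promql` helper are illustrative names, and only two rows from the table are included.

```python
from typing import List, NamedTuple


class AlarmRule(NamedTuple):
    name: str       # alarm name column
    severity: str   # alarm severity column
    promql: str     # PromQL expression from the last column


def check_promql(expr: str) -> List[str]:
    """Return simple copy/paste problems (an empty list means none found).

    Only bracket balance and quote parity are checked; this is not a
    PromQL parser and it ignores escaped quotes.
    """
    problems = []
    pairs = {")": "(", "}": "{", "]": "["}
    stack = []
    in_string = False
    for ch in expr:
        if ch == '"':
            in_string = not in_string
        elif in_string:
            continue
        elif ch in "({[":
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                problems.append(f"unmatched {ch!r}")
    if in_string:
        problems.append("unterminated string")
    problems.extend(f"unclosed {ch!r}" for ch in stack)
    return problems


rules = [
    AlarmRule(
        "ems_server_fd_high", "Major",
        'avg_over_time(ems_server_all_process_fds{namespace="ems", '
        'kubernetes_pod=~"ems-server.*", container="ems-server"}[1m30s])>20000',
    ),
    AlarmRule(
        "ems_kv_config_error", "Critical",
        'avg_over_time(ems_server_config_error_code{namespace="ems", '
        'kubernetes_pod=~"ems-server.*", container="ems-server"} [90s]) !=0',
    ),
]

for rule in rules:
    issues = check_promql(rule.promql)
    status = "OK" if not issues else "; ".join(issues)
    print(f"{rule.name} [{rule.severity}]: {status}")
```

A check like this catches a truncated paste, such as a missing closing parenthesis. It cannot tell whether the expression selects the intended metric, so each rule still needs to be verified in the console.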