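CCE Log K8s Event Center
The CCE Log K8s Event Center dashboard shows insufficient node file descriptors (FDs), insufficient node disk space, event synchronization exceptions, event distribution, and other event charts.
Prerequisites
- CCE logs have been collected. For details, see "Connecting Cloud Container Engine (CCE) Application Logs to LTS".
- Log structuring has been configured. For details, see "Structuring Configuration".
Background
Cloud Container Engine (CCE) provides highly scalable, high-performance, enterprise-class Kubernetes clusters. With CCE, you can easily deploy, manage, and scale containerized applications on Huawei Cloud.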
Viewing the CCE Log K8s Event Center Dashboard
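- Log in to the Log Tank Service (LTS) console.
- In the navigation pane on the left, choose "Dashboards".
- Under the dashboard templates, select the "CCE日志K8s事件中心" (CCE Log K8s Event Center) dashboard to view the chart details.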
The filters on the CCE Log K8s Event Center dashboard are described below:
- Event level. The value is Warning or Normal.
- Event type. The associated query statement is:
select distinct("name")
- Cluster ID. The associated query statement is:
select distinct("cluster_id")
- Namespace. The associated query statement is:
select distinct("namespace")
- Name. The associated query statement is:
select distinct("resource_name")
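Each filter above is populated by a `select distinct(...)` query over one structured field. The following is a minimal sketch of that idea, not the LTS implementation: it uses an in-memory SQLite table as a stand-in for the log stream, adds an explicit `from log` clause that SQLite requires, and runs against hypothetical sample events. The function name `filter_options` and the sample data are assumptions made for illustration.

```python
import sqlite3

# Hypothetical sample events: (name, cluster_id, namespace, resource_name).
# The column names mirror the structured fields used by the filter queries.
SAMPLE_EVENTS = [
    ("NodeHasDiskPressure", "cluster-a", "default", "node-1"),
    ("NTPIsDown", "cluster-a", "kube-system", "node-2"),
    ("NodeHasDiskPressure", "cluster-b", "default", "node-3"),
]

# Only the fields that the dashboard filters query are accepted.
FILTER_FIELDS = {"name", "cluster_id", "namespace", "resource_name"}


def filter_options(field):
    """Return the sorted distinct values of one field (SQLite stand-in, not LTS)."""
    if field not in FILTER_FIELDS:
        raise ValueError(f"unsupported filter field: {field}")
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            'create table log ("name" text, "cluster_id" text, '
            '"namespace" text, "resource_name" text)'
        )
        conn.executemany("insert into log values (?, ?, ?, ?)", SAMPLE_EVENTS)
        # Same shape as the filter queries above, with an explicit table name.
        rows = conn.execute(f'select distinct("{field}") from log').fetchall()
    finally:
        conn.close()
    return sorted(row[0] for row in rows)


if __name__ == "__main__":
    for field in sorted(FILTER_FIELDS):
        print(field, filter_options(field))
```

Duplicate values collapse to a single option, which is why a filter lists each cluster or namespace once no matter how many events it produced.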
Key Charts
The key charts on the CCE Log K8s Event Center dashboard are described below:
- Conntrack Full. The associated query statement is:
select diff[1] as "total", round((diff[1] - diff[2]) / diff[2] * 100 , 2 ) as "inc" from (select compare( "total", 3600) as diff from( select count(1) as "total" from log where "name"= 'ConntrackFull' ) )
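- Event synchronization exception (the query counts NTPIsDown events, which indicate that NTP time synchronization is down). The associated query statement is: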
select diff[1] as "total", round((diff[1] - diff[2]) / diff[2] * 100 , 2 ) as "inc" from (select compare( "total", 3600) as diff from( select count(1) as "total" from log where "name"= 'NTPIsDown') )
- Insufficient node PIDs. The associated query statement is:
select diff[1] as "total", round((diff[1] - diff[2]) / diff[2] * 100 , 2 ) as "inc" from (select compare( "total", 3600) as diff from( select count(1) as "total" from log where "name" in ('PIDPressure','NodeHasPIDPressure') ) )
- Insufficient node FDs. The associated query statement is:
select diff[1] as "total", round((diff[1] - diff[2]) / diff[2] * 100 , 2 ) as "inc" from (select compare( "total", 3600) as diff from( select count(1) as "total" from log where "name"= 'NodeHasFDPressure') )
- Insufficient node disk space. The associated query statement is:
select diff[1] as "total", round((diff[1] - diff[2]) / diff[2] * 100 , 2 ) as "inc" from (select compare( "total", 3600) as diff from( select count(1) as "total" from log where "name"= 'NodeHasDiskPressure') )
- Pod OOM. The associated query statement is:
select diff[1] as "total", round((diff[1] - diff[2]) / diff[2] * 100 , 2 ) as "inc" from (select compare( "total", 3600) as diff from( select count(1) as "total" from log where "reason" in ('OOMKilling','PodOOMKilling')) )
- DockerHung. The associated query statement is:
select diff[1] as "total", round((diff[1] - diff[2]) / diff[2] * 100 , 2 ) as "inc" from (select compare( "total", 3600) as diff from( select count(1) as "total" from log where "name"= 'Failed' and "reason" = 'DockerHung') )
- Node restart. The associated query statement is:
select diff[1] as "total", round((diff[1] - diff[2]) / diff[2] * 100 , 2 ) as "inc" from (select compare( "total", 3600) as diff from( select count(1) as "total" from log where "name"= 'NodeRebooted') )
- Image pull failure. The associated query statement is:
select diff[1] as "total", round((diff[1] - diff[2]) / diff[2] * 100 , 2 ) as "inc" from (select compare( "total", 3600) as diff from( select count(1) as "total" from log where "name"= 'Failed' and "reason" = 'ImagePullBackOff') )
- Node OOM. The associated query statement is:
select diff[1] as "total", round((diff[1] - diff[2]) / diff[2] * 100 , 2 ) as "inc" from (select compare( "total", 3600) as diff from( select count(1) as "total" from log where "name" = 'SystemOOM') )
- Pod startup failure. The associated query statement is:
select diff[1] as "total", round((diff[1] - diff[2]) / diff[2] * 100 , 2 ) as "inc" from (select compare( "total", 3600) as diff from( select count(1) as "total" from log where "name"= 'Failed' and "resource_kind" = 'Pod' and "reason" = 'ImagePullBackOff') )
- Event distribution. The associated query statement is:
select "type", count(*) as "事件数" group by "type"
- Warning event trend. The associated query statement is:
select time_series(__time, 'PT1H', 'yyyy-MM-dd HH', '0') as "dt",count(1) as "count" from log where "type" = 'Warning' group by "dt" order by "dt"
- Error event trend. The associated query statement is:
select time_series(__time, 'PT1H', 'yyyy-MM-dd HH', '0') as "dt",count(1) as "count" from log where "type" = 'Error' group by "dt" order by "dt"
- Pod OOM event list. The associated query statement is:
select TIME_FORMAT( __time, 'yyyy-MM-dd HH:mm:ss', '+08:00') as "Time", "resource_kind" as "事件目标", "name" as "类型", "resource_name" as "目标名", "reason" as "详细内容" from log where "name" in ('OOMKilling','PodOOMKilling') order by __time desc limit 100
- Pod eviction event list. The associated query statement is:
select TIME_FORMAT( __time, 'yyyy-MM-dd HH:mm:ss', '+08:00' ) as "Time", "resource_kind" as "事件目标", "name" as "类型", "resource_name" as "目标名", "reason" as "详细内容" from log where "name" = 'NodeControllerEviction' order by __time desc limit 100
- Important event list. The associated query statement is:
select TIME_FORMAT( __time, 'yyyy-MM-dd HH:mm:ss', '+08:00' ) as "Time", "type" as "等级", "resource_kind" as "事件目标", "name" as "类型", "resource_name" as "目标名", "reason" as "详细内容" from log where "type" in ('Warning','Error') order by __time desc limit 100
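Most of the count charts above share one pattern: `compare("total", 3600)` yields an array `diff`, `diff[1]` is reported as "total", and `round((diff[1] - diff[2]) / diff[2] * 100, 2)` is reported as "inc". The sketch below reproduces that arithmetic in Python. It assumes that `diff[1]` is the event count in the queried time range and `diff[2]` is the count in the same-length range shifted back by 3600 seconds; that reading is inferred from the query text. The function name, its parameters, and the choice to return `None` when the earlier count is zero are illustrative and do not describe how LTS handles that case.

```python
from datetime import datetime, timedelta


def total_and_increase(event_times, range_end, range_seconds=3600, offset_seconds=3600):
    """Return (total, inc) for one event type.

    total: events in [range_end - range_seconds, range_end)
    inc:   percentage change against the same range shifted back by
           offset_seconds, rounded to 2 decimals, or None when the
           earlier range has no events (assumption: avoids dividing by zero).
    """
    length = timedelta(seconds=range_seconds)
    offset = timedelta(seconds=offset_seconds)
    current_start = range_end - length
    previous_start = current_start - offset
    previous_end = range_end - offset

    current = sum(1 for t in event_times if current_start <= t < range_end)
    previous = sum(1 for t in event_times if previous_start <= t < previous_end)

    if previous == 0:
        return current, None
    inc = round((current - previous) / previous * 100, 2)
    return current, inc


if __name__ == "__main__":
    end = datetime(2024, 1, 1, 12, 0, 0)
    # Hypothetical timestamps: three events in the last hour, two in the hour before.
    events = [end - timedelta(minutes=m) for m in (5, 10, 20, 70, 80)]
    print(total_and_increase(events, end))
```

With three events in the current hour and two in the previous hour, the increase is (3 - 2) / 2 * 100 = 50 percent, so a positive "inc" means the event is occurring more often than one hour earlier.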