Basic Metrics: Container Metrics
This section describes the categories, names, and meanings of metrics reported to AOM from CCE's kube-prometheus-stack add-on or on-premises Kubernetes clusters.
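Several metrics in the table below are Prometheus histograms, exposed as three related series with the suffixes _bucket, _count, and _sum. The following Python sketch shows one way these series can be combined: the average is _sum divided by _count, and a quantile can be estimated from the cumulative _bucket counts. The function names and the sample values are hypothetical and chosen for illustration only. The interpolation is a simplified approximation in the spirit of PromQL's histogram_quantile(), not the implementation used by AOM.

```python
import math
from typing import List, Tuple


def histogram_mean(total_sum: float, total_count: float) -> float:
    """Average observation: <metric>_sum divided by <metric>_count."""
    if total_count == 0:
        return math.nan
    return total_sum / total_count


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) pairs.

    Each pair corresponds to one <metric>_bucket series, where upper_bound
    is the value of its "le" label. Counts are cumulative, so the +Inf
    bucket holds the total number of observations.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            if math.isinf(upper_bound):
                # The target rank lies in the +Inf bucket: fall back to
                # the highest finite bound seen so far.
                return lower_bound
            in_bucket = cumulative - lower_count
            if in_bucket == 0:
                return upper_bound
            # Linear interpolation inside the bucket that holds the rank.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, cumulative
    return lower_bound


if __name__ == "__main__":
    # Hypothetical samples of coredns_dns_request_duration_seconds_bucket.
    sample_buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (float("inf"), 100)]
    print("p95 estimate (s):", histogram_quantile(0.95, sample_buckets))
    # Hypothetical _sum and _count values for the same histogram.
    print("mean (s):", histogram_mean(12.5, 100))
```

In practice these calculations are usually applied to the increase of each series over a time window rather than to the raw cumulative values.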
Target Name |
Job Name |
Metric |
Description |
---|---|---|---|
|
coredns and node-local-dns |
coredns_build_info |
CoreDNS build information |
coredns_cache_entries |
Number of entries in the cache |
||
coredns_cache_size |
Cache size |
||
coredns_cache_hits_total |
Number of cache hits |
||
coredns_cache_misses_total |
Number of cache misses |
||
coredns_cache_requests_total |
Total number of DNS resolution requests in different dimensions |
||
coredns_dns_request_duration_seconds_bucket |
Histogram of DNS request duration (bucket) |
||
coredns_dns_request_duration_seconds_count |
Histogram of DNS request duration (count) |
||
coredns_dns_request_duration_seconds_sum |
Histogram of DNS request duration (sum) |
||
coredns_dns_request_size_bytes_bucket |
Histogram of the size of DNS request (bucket) |
||
coredns_dns_request_size_bytes_count |
Histogram of the size of DNS request (count) |
||
coredns_dns_request_size_bytes_sum |
Histogram of the size of DNS request (sum) |
||
coredns_dns_requests_total |
Number of DNS requests |
||
coredns_dns_response_size_bytes_bucket |
Histogram of the size of DNS response (bucket) |
||
coredns_dns_response_size_bytes_count |
Histogram of the size of DNS response (count) |
||
coredns_dns_response_size_bytes_sum |
Histogram of the size of DNS response (sum) |
||
coredns_dns_responses_total |
Number of DNS responses, by response code |
||
coredns_forward_conn_cache_hits_total |
Number of connection cache hits per protocol and upstream |
||
coredns_forward_conn_cache_misses_total |
Number of connection cache misses per protocol and upstream |
||
coredns_forward_healthcheck_broken_total |
Number of times all upstreams were unhealthy |
||
coredns_forward_healthcheck_failures_total |
Count of failed health checks per upstream |
||
coredns_forward_max_concurrent_rejects_total |
Number of requests rejected due to excessive concurrent requests |
||
coredns_forward_request_duration_seconds_bucket |
Histogram of forward request duration (bucket) |
||
coredns_forward_request_duration_seconds_count |
Histogram of forward request duration (count) |
||
coredns_forward_request_duration_seconds_sum |
Histogram of forward request duration (sum) |
||
coredns_forward_requests_total |
Number of requests per upstream |
||
coredns_forward_responses_total |
Number of responses per upstream |
||
coredns_health_request_duration_seconds_bucket |
Histogram of health request duration (bucket) |
||
coredns_health_request_duration_seconds_count |
Histogram of health request duration (count) |
||
coredns_health_request_duration_seconds_sum |
Histogram of health request duration (sum) |
||
coredns_health_request_failures_total |
Number of health request failures |
||
coredns_hosts_reload_timestamp_seconds |
Timestamp of the last reload of the hosts file |
||
coredns_kubernetes_dns_programming_duration_seconds_bucket |
Histogram of DNS programming duration (bucket) |
||
coredns_kubernetes_dns_programming_duration_seconds_count |
Histogram of DNS programming duration (count) |
||
coredns_kubernetes_dns_programming_duration_seconds_sum |
Histogram of DNS programming duration (sum) |
||
coredns_local_localhost_requests_total |
Number of localhost requests |
||
coredns_nodecache_setup_errors_total |
Number of nodecache setup errors |
||
coredns_dns_response_rcode_count_total |
Number of responses for each Zone and Rcode |
||
coredns_dns_request_count_total |
Number of DNS requests |
||
coredns_dns_request_do_count_total |
Number of requests with the DNSSEC OK (DO) bit set |
||
coredns_dns_do_requests_total |
Number of requests with the DO bit set |
||
coredns_dns_request_type_count_total |
Number of requests for each Zone and Type |
||
coredns_panics_total |
Total number of panics |
||
coredns_plugin_enabled |
Whether a plugin is enabled |
||
coredns_reload_failed_total |
Number of failed reloads |
||
serviceMonitor/monitoring/kube-apiserver/0 |
apiserver |
aggregator_unavailable_apiservice |
Number of unavailable APIServices |
apiserver_admission_controller_admission_duration_seconds_bucket |
Processing delay of an Admission Controller |
||
apiserver_admission_webhook_admission_duration_seconds_bucket |
Processing delay of an Admission Webhook |
||
apiserver_admission_webhook_admission_duration_seconds_count |
Number of Admission Webhook processing requests |
||
apiserver_client_certificate_expiration_seconds_bucket |
Remaining validity period of the client certificate (bucket) |
||
apiserver_client_certificate_expiration_seconds_count |
Remaining validity period of the client certificate (count) |
||
apiserver_current_inflight_requests |
Number of requests currently being processed, split into read-only and mutating requests |
||
apiserver_request_duration_seconds_bucket |
Delay of the client's access to the APIServer |
||
apiserver_request_total |
Number of requests to the APIServer, by verb, resource, and response code |
||
go_goroutines |
Number of goroutines |
||
kubernetes_build_info |
Kubernetes build information |
||
process_cpu_seconds_total |
Total process CPU time |
||
process_resident_memory_bytes |
Size of the resident memory set for a process |
||
rest_client_requests_total |
Number of REST requests |
||
workqueue_adds_total |
Number of adds handled by a work queue |
||
workqueue_depth |
Depth of a work queue |
||
workqueue_queue_duration_seconds_bucket |
Time a task waits in the work queue before it is processed (bucket) |
||
aggregator_unavailable_apiservice_total |
Number of unavailable APIServices |
||
rest_client_request_duration_seconds_bucket |
Histogram of REST request duration |
||
serviceMonitor/monitoring/kubelet/0 |
kubelet |
kubelet_certificate_manager_client_expiration_renew_errors |
Number of certificate renewal errors |
kubelet_certificate_manager_client_ttl_seconds |
Time-to-live (TTL) of the Kubelet client certificate |
||
kubelet_cgroup_manager_duration_seconds_bucket |
Duration of the cgroup manager operations (bucket) |
||
kubelet_cgroup_manager_duration_seconds_count |
Duration of the cgroup manager operations (count) |
||
kubelet_node_config_error |
If a configuration-related error occurs on a node, the value of this metric is true (1). If there is no configuration-related error, the value is false (0). |
||
kubelet_node_name |
Node name. The value is always 1. |
||
kubelet_pleg_relist_duration_seconds_bucket |
Duration of relisting pods in PLEG (bucket) |
||
kubelet_pleg_relist_duration_seconds_count |
Duration of relisting pods in PLEG (count) |
||
kubelet_pleg_relist_interval_seconds_bucket |
Interval between relisting operations in PLEG (bucket) |
||
kubelet_pod_start_duration_seconds_count |
Time required for starting a single pod (count) |
||
kubelet_pod_start_duration_seconds_bucket |
Time required for starting a single pod (bucket) |
||
kubelet_pod_worker_duration_seconds_bucket |
Duration for synchronizing a single pod. Operation type: create, update, or sync |
||
kubelet_running_containers |
Number of running containers |
||
kubelet_running_pods |
Number of running pods |
||
kubelet_runtime_operations_duration_seconds_bucket |
Duration of the runtime operations (bucket) |
||
kubelet_runtime_operations_errors_total |
Number of runtime operation errors listed by operation type |
||
kubelet_runtime_operations_total |
Number of runtime operations listed by operation type |
||
kubelet_volume_stats_available_bytes |
Number of available bytes in a volume |
||
kubelet_volume_stats_capacity_bytes |
Capacity of the volume in bytes |
||
kubelet_volume_stats_inodes |
Total number of inodes in a volume |
||
kubelet_volume_stats_inodes_used |
Number of used inodes in a volume |
||
kubelet_volume_stats_used_bytes |
Number of used bytes in a volume |
||
storage_operation_duration_seconds_bucket |
Duration of each storage operation (bucket) |
||
storage_operation_duration_seconds_count |
Duration of each storage operation (count) |
||
storage_operation_errors_total |
Number of storage operation errors |
||
volume_manager_total_volumes |
Number of volumes in the Volume Manager |
||
rest_client_requests_total |
Number of HTTP client requests partitioned by status code, method, and host |
||
rest_client_request_duration_seconds_bucket |
Request delay (bucket) |
||
process_resident_memory_bytes |
Size of the resident memory set for a process |
||
process_cpu_seconds_total |
Total process CPU time |
||
go_goroutines |
Number of goroutines |
||
serviceMonitor/monitoring/kubelet/1 |
kubelet |
container_cpu_cfs_periods_total |
Number of elapsed enforcement period intervals |
container_cpu_cfs_throttled_periods_total |
Number of throttled period intervals |
||
container_cpu_cfs_throttled_seconds_total |
Total time duration the container has been throttled |
||
container_cpu_load_average_10s |
Value of container CPU load average over the last 10 seconds |
||
container_cpu_usage_seconds_total |
Cumulative CPU time consumed by a container in core-seconds |
||
container_file_descriptors |
Number of open file descriptors for a container |
||
container_fs_inodes_free |
Number of available inodes in a file system |
||
container_fs_inodes_total |
Number of inodes in a file system |
||
container_fs_io_time_seconds_total |
Cumulative seconds spent on doing I/Os by the disk or file system |
||
container_fs_limit_bytes |
Total disk or file system capacity that can be consumed by a container |
||
container_fs_read_seconds_total |
Cumulative number of seconds the container spent on reading disk or file system data |
||
container_fs_reads_bytes_total |
Cumulative amount of disk or file system data read by a container |
||
container_fs_reads_total |
Cumulative number of disk or file system reads completed by a container |
||
container_fs_usage_bytes |
File system usage |
||
container_fs_write_seconds_total |
Cumulative number of seconds the container spent on writing data to the disk or file system |
||
container_fs_writes_bytes_total |
Total amount of data written by a container to a disk or file system |
||
container_fs_writes_total |
Cumulative number of disk or file system writes completed by a container |
||
container_memory_cache |
Memory used for the page cache of a container |
||
container_memory_failcnt |
Number of times memory usage hit the limit |
||
container_memory_max_usage_bytes |
Maximum memory usage recorded for a container |
||
container_memory_rss |
Size of the resident memory set for a container |
||
container_memory_swap |
Container swap usage |
||
container_memory_usage_bytes |
Current memory usage of a container |
||
container_memory_working_set_bytes |
Memory usage of the working set of a container |
||
container_network_receive_bytes_total |
Total volume of data received by the container network |
||
container_network_receive_errors_total |
Cumulative number of errors encountered during reception |
||
container_network_receive_packets_dropped_total |
Cumulative number of packets dropped during reception |
||
container_network_receive_packets_total |
Cumulative number of packets received |
||
container_network_transmit_bytes_total |
Total volume of data transmitted on the container network |
||
container_network_transmit_errors_total |
Cumulative number of errors encountered during transmission |
||
container_network_transmit_packets_dropped_total |
Cumulative number of packets dropped during transmission |
||
container_network_transmit_packets_total |
Cumulative number of packets transmitted |
||
container_spec_cpu_quota |
CPU quota of the container |
||
container_spec_memory_limit_bytes |
Memory limit for the container |
||
machine_cpu_cores |
Number of logical CPU cores |
||
machine_memory_bytes |
Amount of memory |
||
serviceMonitor/monitoring/kube-state-metrics/0 |
kube-state-metrics-prom |
kube_cronjob_status_active |
Number of currently running jobs of a CronJob |
kube_cronjob_info |
CronJob information |
||
kube_cronjob_labels |
CronJob labels |
||
kube_configmap_info |
ConfigMap information |
||
kube_daemonset_created |
DaemonSet creation time |
||
kube_daemonset_status_current_number_scheduled |
Number of nodes that run at least one DaemonSet pod and are supposed to |
||
kube_daemonset_status_desired_number_scheduled |
Number of nodes that should be running a DaemonSet pod |
||
kube_daemonset_status_number_available |
Number of nodes that should be running a DaemonSet pod and have at least one DaemonSet pod running and available |
||
kube_daemonset_status_number_misscheduled |
Number of nodes that are running a DaemonSet pod but are not supposed to |
||
kube_daemonset_status_number_ready |
Number of nodes that should be running the DaemonSet pods and have one or more DaemonSet pods running and ready |
||
kube_daemonset_status_number_unavailable |
Number of nodes that should be running the DaemonSet pods but have none of the DaemonSet pods running and available |
||
kube_daemonset_status_updated_number_scheduled |
Number of nodes that are running an updated DaemonSet pod |
||
kube_deployment_created |
Deployment creation timestamp |
||
kube_deployment_labels |
Deployment labels |
||
kube_deployment_metadata_generation |
Sequence number representing a specific generation of the desired state |
||
kube_deployment_spec_replicas |
Number of desired replicas for a Deployment |
||
kube_deployment_spec_strategy_rollingupdate_max_unavailable |
Maximum number of unavailable replicas during a rolling update of a Deployment |
||
kube_deployment_status_observed_generation |
The generation observed by the Deployment controller |
||
kube_deployment_status_replicas |
Number of current replicas of a Deployment |
||
kube_deployment_status_replicas_available |
Number of available replicas per Deployment |
||
kube_deployment_status_replicas_ready |
Number of ready replicas per Deployment |
||
kube_deployment_status_replicas_unavailable |
Number of unavailable replicas per Deployment |
||
kube_deployment_status_replicas_updated |
Number of updated replicas per Deployment |
||
kube_job_info |
Information about the job |
||
kube_namespace_labels |
Namespace labels |
||
kube_node_labels |
Node labels |
||
kube_node_info |
Information about a node |
||
kube_node_spec_taint |
Taint of a node |
||
kube_node_spec_unschedulable |
Whether new pods can be scheduled to a node |
||
kube_node_status_allocatable |
Allocatable resources on a node |
||
kube_node_status_capacity |
Capacity for different resources on a node |
||
kube_node_status_condition |
Condition of a node |
||
kube_node_volcano_oversubscription_status |
Node oversubscription status |
||
kube_persistentvolume_status_phase |
Phase of a PV status |
||
kube_persistentvolumeclaim_status_phase |
Phase of a PVC status |
||
kube_persistentvolume_info |
Information about a PV |
||
kube_persistentvolumeclaim_info |
Information about a PVC |
||
kube_pod_container_info |
Information about a container running in the pod |
||
kube_pod_container_resource_limits |
Resource limits set for a container |
||
kube_pod_container_resource_requests |
Resources requested by a container |
||
kube_pod_container_status_last_terminated_reason |
Last reason the container was in a terminated state |
||
kube_pod_container_status_ready |
Whether the container's readiness check succeeded |
||
kube_pod_container_status_restarts_total |
Number of container restarts |
||
kube_pod_container_status_running |
Whether the container is running |
||
kube_pod_container_status_terminated |
Whether the container is terminated |
||
kube_pod_container_status_terminated_reason |
The reason why the container is in a terminated state |
||
kube_pod_container_status_waiting |
Whether the container is waiting |
||
kube_pod_container_status_waiting_reason |
The reason why the container is in the waiting state |
||
kube_pod_info |
Information about a pod |
||
kube_pod_labels |
Pod labels |
||
kube_pod_owner |
Information about the pod's owner |
||
kube_pod_status_phase |
Current phase of a pod |
||
kube_pod_status_ready |
Whether the pod is ready |
||
kube_secret_info |
Information about a secret |
||
kube_statefulset_created |
StatefulSet creation timestamp |
||
kube_statefulset_labels |
Information about StatefulSet labels |
||
kube_statefulset_metadata_generation |
Sequence number representing a specific generation of the desired state for a StatefulSet |
||
kube_statefulset_replicas |
Number of desired pods for a StatefulSet |
||
kube_statefulset_status_observed_generation |
The generation observed by the StatefulSet controller |
||
kube_statefulset_status_replicas |
Number of replicas per StatefulSet |
||
kube_statefulset_status_replicas_ready |
Number of ready replicas per StatefulSet |
||
kube_statefulset_status_replicas_updated |
Number of updated replicas per StatefulSet |
||
kube_job_spec_completions |
Desired number of successfully finished pods for the job to be considered complete |
||
kube_job_status_failed |
Number of pods of a job that reached the Failed phase |
||
kube_job_status_succeeded |
Number of pods of a job that reached the Succeeded phase |
||
kube_node_status_allocatable_cpu_cores |
Number of allocatable CPU cores of a node |
||
kube_node_status_allocatable_memory_bytes |
Total allocatable memory of a node |
||
kube_replicaset_owner |
Information about the ReplicaSet's owner |
||
kube_resourcequota |
Information about resource quota |
||
kube_pod_spec_volumes_persistentvolumeclaims_info |
Information about the PVC associated with the pod |
||
serviceMonitor/monitoring/prometheus-lightweight/0 |
prometheus-lightweight |
vm_persistentqueue_blocks_dropped_total |
Number of dropped blocks in a send queue |
vm_persistentqueue_blocks_read_total |
Number of blocks read by a send queue |
||
vm_persistentqueue_blocks_written_total |
Number of blocks written to a send queue |
||
vm_persistentqueue_bytes_pending |
Number of pending bytes in a send queue |
||
vm_persistentqueue_bytes_read_total |
Number of bytes read by a send queue |
||
vm_persistentqueue_bytes_written_total |
Number of bytes written to a send queue |
||
vm_promscrape_active_scrapers |
Number of active scrapes |
||
vm_promscrape_conn_read_errors_total |
Number of read errors during scrapes |
||
vm_promscrape_conn_write_errors_total |
Number of write errors during scrapes |
||
vm_promscrape_max_scrape_size_exceeded_errors_total |
Number of scrapes that failed because the response exceeded the maximum scrape size |
||
vm_promscrape_scrape_duration_seconds_sum |
Duration of scrapes (sum) |
||
vm_promscrape_scrape_duration_seconds_count |
Duration of scrapes (count) |
||
vm_promscrape_scrapes_total |
Number of scrapes |
||
vmagent_remotewrite_bytes_sent_total |
Number of bytes sent via a remote write |
||
vmagent_remotewrite_duration_seconds_sum |
Time required for a remote write (sum) |
||
vmagent_remotewrite_duration_seconds_count |
Time required for a remote write (count) |
||
vmagent_remotewrite_packets_dropped_total |
Number of dropped packets during a remote write |
||
vmagent_remotewrite_pending_data_bytes |
Number of pending bytes during a remote write |
||
vmagent_remotewrite_requests_total |
Number of remote write requests |
||
vmagent_remotewrite_retries_count_total |
Number of remote write retries |
||
go_goroutines |
Number of goroutines |
||
serviceMonitor/monitoring/node-exporter/0 |
node-exporter |
node_boot_time_seconds |
Node boot time |
node_context_switches_total |
Number of context switches |
||
node_cpu_seconds_total |
Seconds each CPU spent doing each type of work |
||
node_disk_io_now |
Number of I/Os in progress |
||
node_disk_io_time_seconds_total |
Total seconds spent doing I/Os |
||
node_disk_io_time_weighted_seconds_total |
The weighted number of seconds spent doing I/Os |
||
node_disk_read_bytes_total |
Number of bytes that are read |
||
node_disk_read_time_seconds_total |
Number of seconds spent by all reads |
||
node_disk_reads_completed_total |
Number of reads completed |
||
node_disk_write_time_seconds_total |
Number of seconds spent by all writes |
||
node_disk_writes_completed_total |
Number of writes completed |
||
node_disk_written_bytes_total |
Number of bytes that are written |
||
node_docker_thinpool_data_space_available |
Available data space of a docker thin pool |
||
node_docker_thinpool_metadata_space_available |
Available metadata space of a docker thin pool |
||
node_exporter_build_info |
Node exporter build information |
||
node_filefd_allocated |
Allocated file descriptors |
||
node_filefd_maximum |
Maximum number of file descriptors |
||
node_filesystem_avail_bytes |
File system space that is available for use |
||
node_filesystem_device_error |
Whether an error occurred while getting statistics for the given device |
||
node_filesystem_free_bytes |
Remaining space of a file system |
||
node_filesystem_readonly |
Whether a file system is mounted read-only |
||
node_filesystem_size_bytes |
Total size of a file system |
||
node_forks_total |
Number of forks |
||
node_intr_total |
Number of interrupts serviced |
||
node_load1 |
1-minute system load average |
||
node_load15 |
15-minute system load average |
||
node_load5 |
5-minute system load average |
||
node_memory_Buffers_bytes |
Memory of the node buffer |
||
node_memory_Cached_bytes |
Memory for the node page cache |
||
node_memory_MemAvailable_bytes |
Available memory of a node |
||
node_memory_MemFree_bytes |
Free memory of a node |
||
node_memory_MemTotal_bytes |
Total memory of a node |
||
node_network_receive_bytes_total |
Total amount of received data |
||
node_network_receive_drop_total |
Cumulative number of packets dropped during reception |
||
node_network_receive_errs_total |
Cumulative number of errors encountered during reception |
||
node_network_receive_packets_total |
Cumulative number of packets received |
||
node_network_transmit_bytes_total |
Total amount of transmitted data |
||
node_network_transmit_drop_total |
Cumulative number of dropped packets during transmission |
||
node_network_transmit_errs_total |
Cumulative number of errors encountered during transmission |
||
node_network_transmit_packets_total |
Cumulative number of packets transmitted |
||
node_procs_blocked |
Blocked processes |
||
node_procs_running |
Running processes |
||
node_sockstat_sockets_used |
Number of sockets in use |
||
node_sockstat_TCP_alloc |
Number of allocated TCP sockets |
||
node_sockstat_TCP_inuse |
Number of TCP sockets in use |
||
node_sockstat_TCP_orphan |
Number of orphaned TCP sockets |
||
node_sockstat_TCP_tw |
Number of TCP sockets in the TIME_WAIT state |
||
node_sockstat_UDPLITE_inuse |
Number of UDP-Lite sockets in use |
||
node_sockstat_UDP_inuse |
Number of UDP sockets in use |
||
node_sockstat_UDP_mem |
UDP socket buffer usage |
||
node_timex_offset_seconds |
Time offset |
||
node_timex_sync_status |
Synchronization status of node clocks |
||
node_uname_info |
Labeled system information as provided by the uname system call |
||
node_vmstat_oom_kill |
Number of OOM kills, from /proc/vmstat |
||
process_cpu_seconds_total |
Total process CPU time |
||
process_max_fds |
Maximum number of file descriptors of a process |
||
process_open_fds |
Opened file descriptors by a process |
||
process_resident_memory_bytes |
Size of the resident memory set for a process |
||
process_start_time_seconds |
Process start time |
||
process_virtual_memory_bytes |
Virtual memory size for a process |
||
process_virtual_memory_max_bytes |
Maximum virtual memory size for a process |
||
node_netstat_Tcp_ActiveOpens |
Number of TCP connections that directly change from the CLOSED state to the SYN-SENT state |
||
node_netstat_Tcp_PassiveOpens |
Number of TCP connections that directly change from the LISTEN state to the SYN-RCVD state |
||
node_netstat_Tcp_CurrEstab |
Number of TCP connections in the ESTABLISHED or CLOSE-WAIT state |
||
node_vmstat_pgmajfault |
Number of major page faults, from /proc/vmstat |
||
node_vmstat_pgpgout |
Amount of data paged out from main memory to block devices, from /proc/vmstat |
||
node_vmstat_pgfault |
Number of page faults, from /proc/vmstat |
||
node_vmstat_pgpgin |
Amount of data paged in from block devices to main memory, from /proc/vmstat |
||
node_processes_max_processes |
PID limit value |
||
node_processes_pids |
Number of PIDs |
||
node_nf_conntrack_entries |
Number of currently allocated flow entries for connection tracking |
||
node_nf_conntrack_entries_limit |
Maximum size of a connection tracking table |
||
promhttp_metric_handler_requests_in_flight |
Number of scrapes currently being served |
||
go_goroutines |
Number of node exporter goroutines |
||
podMonitor/monitoring/nvidia-gpu-device-plugin/0 |
monitoring/nvidia-gpu-device-plugin |
cce_gpu_utilization |
GPU compute usage |
cce_gpu_memory_utilization |
GPU memory usage |
||
cce_gpu_encoder_utilization |
GPU encoding usage |
||
cce_gpu_decoder_utilization |
GPU decoding usage |
||
cce_gpu_utilization_process |
GPU compute usage of each process |
||
cce_gpu_memory_utilization_process |
GPU memory usage of each process |
||
cce_gpu_encoder_utilization_process |
GPU encoding usage of each process |
||
cce_gpu_decoder_utilization_process |
GPU decoding usage of each process |
||
cce_gpu_memory_used |
Used GPU memory |
||
cce_gpu_memory_total |
Total GPU memory |
||
cce_gpu_memory_free |
Free GPU memory |
||
cce_gpu_bar1_memory_used |
Used GPU BAR1 memory |
||
cce_gpu_bar1_memory_total |
Total GPU BAR1 memory |
||
cce_gpu_clock |
GPU clock frequency |
||
cce_gpu_memory_clock |
GPU memory clock frequency |
||
cce_gpu_graphics_clock |
GPU graphics clock frequency |
||
cce_gpu_video_clock |
GPU video processor frequency |
||
cce_gpu_temperature |
GPU temperature |
||
cce_gpu_power_usage |
GPU power |
||
cce_gpu_total_energy_consumption |
Total GPU energy consumption |
||
cce_gpu_pcie_link_bandwidth |
GPU PCIe bandwidth |
||
cce_gpu_nvlink_bandwidth |
GPU NVLink bandwidth |
||
cce_gpu_pcie_throughput_rx |
GPU PCIe RX bandwidth |
||
cce_gpu_pcie_throughput_tx |
GPU PCIe TX bandwidth |
||
cce_gpu_nvlink_utilization_counter_rx |
GPU NVLink RX bandwidth |
||
cce_gpu_nvlink_utilization_counter_tx |
GPU NVLink TX bandwidth |
||
cce_gpu_retired_pages_sbe |
Number of GPU memory pages retired due to single-bit errors |
||
cce_gpu_retired_pages_dbe |
Number of GPU memory pages retired due to double-bit errors |
||
xgpu_memory_total |
Total xGPU memory |
||
xgpu_memory_used |
Used xGPU memory |
||
xgpu_core_percentage_total |
Total xGPU compute |
||
xgpu_core_percentage_used |
Used xGPU compute |
||
gpu_schedule_policy |
GPU virtualization mode. 0: GPU memory isolation with compute sharing. 1: GPU memory and compute isolation. 2: default mode, in which the GPU is not virtualized. |
||
xgpu_device_health |
Health status of xGPU. The value 0 indicates that the xGPU is healthy, and the value 1 indicates that the xGPU is unhealthy. |
||
serviceMonitor/monitoring/prometheus-server/0 |
prometheus-server |
prometheus_build_info |
Prometheus build information |
prometheus_engine_query_duration_seconds |
Query time |
||
prometheus_engine_query_duration_seconds_count |
Number of queries |
||
prometheus_sd_discovered_targets |
Number of targets discovered by each job |
||
prometheus_remote_storage_bytes_total |
Number of bytes sent |
||
prometheus_remote_storage_enqueue_retries_total |
Number of enqueue retries caused by a full shard queue |
||
prometheus_remote_storage_highest_timestamp_in_seconds |
Highest timestamp that has come into the remote storage via the Appender interface, in seconds since epoch |
||
prometheus_remote_storage_queue_highest_sent_timestamp_seconds |
Highest timestamp successfully sent by a remote write |
||
prometheus_remote_storage_samples_dropped_total |
Total number of samples read from the WAL but not sent to remote storage |
||
prometheus_remote_storage_samples_failed_total |
Number of samples that failed to be sent to remote storage |
||
prometheus_remote_storage_samples_in_total |
Number of samples read into remote storage |
||
prometheus_remote_storage_samples_pending |
Number of samples pending in shards to be sent to remote storage |
||
prometheus_remote_storage_samples_retried_total |
Number of samples which failed to be sent to remote storage but were retried |
||
prometheus_remote_storage_samples_total |
Total number of samples sent to remote storage |
||
prometheus_remote_storage_shard_capacity |
Capacity of each shard of the queue used for parallel sending to the remote storage |
||
prometheus_remote_storage_shards |
Number of shards used for parallel sending to the remote storage |
||
prometheus_remote_storage_shards_desired |
Number of shards that the queue's shard calculation wants to run, based on the rate of samples in vs. samples out |
||
prometheus_remote_storage_shards_max |
Maximum number of shards that the queue is allowed to run |
||
prometheus_remote_storage_shards_min |
Minimum number of shards that the queue is allowed to run |
||
prometheus_tsdb_wal_segment_current |
WAL segment index that TSDB is currently writing to |
||
prometheus_tsdb_head_chunks |
Number of chunks in the head block |
||
prometheus_tsdb_head_series |
Number of series in the head block |
||
prometheus_tsdb_head_samples_appended_total |
Number of appended samples |
||
prometheus_wal_watcher_current_segment |
Current segment the WAL watcher is reading records from |
||
prometheus_target_interval_length_seconds |
Actual intervals between scrapes |
||
prometheus_target_interval_length_seconds_count |
Actual intervals between scrapes (count) |
||
prometheus_target_interval_length_seconds_sum |
Actual intervals between scrapes (sum) |
||
prometheus_target_scrapes_exceeded_body_size_limit_total |
Number of scrapes that hit the body size limit |
||
prometheus_target_scrapes_exceeded_sample_limit_total |
Number of scrapes that hit the sample limit |
||
prometheus_target_scrapes_sample_duplicate_timestamp_total |
Number of scraped samples with duplicate timestamps |
||
prometheus_target_scrapes_sample_out_of_bounds_total |
Number of samples rejected due to timestamp falling outside of the time bounds |
||
prometheus_target_scrapes_sample_out_of_order_total |
Number of out-of-order samples |
||
prometheus_target_sync_length_seconds |
Interval for synchronizing the scrape pool |
||
prometheus_target_sync_length_seconds_count |
Interval for synchronizing the scrape pool (count) |
||
prometheus_target_sync_length_seconds_sum |
Interval for synchronizing the scrape pool (sum) |
||
promhttp_metric_handler_requests_in_flight |
Number of scrapes currently being served |
||
promhttp_metric_handler_requests_total |
Total number of scrapes served, by HTTP status code |
||
go_goroutines |
Number of goroutines |
||
podMonitor/monitoring/virtual-kubelet-pods/0 |
monitoring/virtual-kubelet-pods |
container_cpu_load_average_10s |
Value of container CPU load average over the last 10 seconds |
container_cpu_system_seconds_total |
Cumulative container CPU system time |
||
container_cpu_usage_seconds_total |
Cumulative CPU time consumed by a container in core-seconds |
||
container_cpu_user_seconds_total |
Cumulative container CPU user time |
||
container_cpu_cfs_periods_total |
Number of elapsed enforcement period intervals |
||
container_cpu_cfs_throttled_periods_total |
Number of throttled period intervals |
||
container_cpu_cfs_throttled_seconds_total |
Total time duration the container has been throttled |
||
container_fs_inodes_free |
Number of available inodes in a file system |
||
container_fs_usage_bytes |
File system usage |
||
container_fs_inodes_total |
Number of inodes in a file system |
||
container_fs_io_current |
Number of I/Os currently in progress in a disk or file system |
||
container_fs_io_time_seconds_total |
Cumulative seconds spent on doing I/Os by the disk or file system |
||
container_fs_io_time_weighted_seconds_total |
Cumulative weighted I/O time of a disk or file system |
||
container_fs_limit_bytes |
Total disk or file system capacity that can be consumed by a container |
||
container_fs_reads_bytes_total |
Cumulative amount of disk or file system data read by a container |
||
container_fs_read_seconds_total |
Cumulative number of seconds the container spent on reading disk or file system data |
||
container_fs_reads_merged_total |
Cumulative number of merged disk or file system reads made by a container |
||
container_fs_reads_total |
Cumulative number of disk or file system reads completed by a container |
||
container_fs_sector_reads_total |
Cumulative number of disk or file system sector reads completed by a container |
||
container_fs_sector_writes_total |
Cumulative number of disk or file system sector writes completed by a container |
||
container_fs_writes_bytes_total |
Total amount of data written by a container to a disk or file system |
||
container_fs_write_seconds_total |
Cumulative number of seconds the container spent on writing data to the disk or file system |
||
container_fs_writes_merged_total |
Cumulative number of merged container writes to the disk or file system |
||
container_fs_writes_total |
Cumulative number of disk or file system writes completed by a container |
||
container_blkio_device_usage_total |
Bytes used on a block I/O (blkio) device by a container |
||
container_memory_failures_total |
Cumulative number of container memory allocation failures |
||
container_memory_failcnt |
Number of times memory usage hit the limit |
||
container_memory_cache |
Memory used for the page cache of a container |
||
container_memory_mapped_file |
Size of memory-mapped files for a container |
||
container_memory_max_usage_bytes |
Maximum memory usage recorded for a container |
||
container_memory_rss |
Size of the resident memory set for a container |
||
container_memory_swap |
Container swap usage |
||
container_memory_usage_bytes |
Current memory usage of a container |
||
container_memory_working_set_bytes |
Memory usage of the working set of a container |
||
container_network_receive_bytes_total |
Total volume of data received by the container network |
||
container_network_receive_errors_total |
Cumulative number of errors encountered during reception |
||
container_network_receive_packets_dropped_total |
Cumulative number of packets dropped during reception |
||
container_network_receive_packets_total |
Cumulative number of packets received |
||
container_network_transmit_bytes_total |
Total volume of data transmitted on the container network |
||
container_network_transmit_errors_total |
Cumulative number of errors encountered during transmission |
||
container_network_transmit_packets_dropped_total |
Cumulative number of packets dropped during transmission |
||
container_network_transmit_packets_total |
Cumulative number of packets transmitted |
||
container_processes |
Number of processes running inside the container |
||
container_sockets |
Number of open sockets for the container |
||
container_file_descriptors |
Number of open file descriptors for a container |
||
container_threads |
Number of threads running inside the container |
||
container_threads_max |
Maximum number of threads allowed inside the container |
||
container_ulimits_soft |
Soft ulimit value of process 1 in the container. A value of -1 means unlimited, except for priority and nice. |
||
container_tasks_state |
Number of tasks in the specified state, such as sleeping, running, stopped, uninterruptible, or iowaiting |
||
container_spec_cpu_period |
CPU period of the container |
||
container_spec_cpu_shares |
CPU share of the container |
||
container_spec_cpu_quota |
CPU quota of the container |
||
container_spec_memory_limit_bytes |
Memory limit for the container |
||
container_spec_memory_reservation_limit_bytes |
Memory reservation limit for the container |
||
container_spec_memory_swap_limit_bytes |
Memory swap limit for the container |
||
container_start_time_seconds |
Start time of the container, in seconds since the Unix epoch |
||
container_last_seen |
Last time a container was seen by the exporter |
||
container_accelerator_memory_used_bytes |
GPU accelerator memory that is being used by the container |
||
container_accelerator_memory_total_bytes |
Total available memory of a GPU accelerator |
||
container_accelerator_duty_cycle |
Percentage of time when a GPU accelerator is actually running |
||
podMonitor/monitoring/everest-csi-controller/0 |
monitoring/everest-csi-controller |
everest_action_result_total |
Number of action results |
everest_function_duration_seconds_bucket |
Histogram of action duration (bucket) |
||
everest_function_duration_seconds_count |
Histogram of action duration (count) |
||
everest_function_duration_seconds_sum |
Histogram of action duration (sum) |
||
everest_function_duration_quantile_seconds |
Quantile of the time taken by the action |
||
node_volume_read_completed_total |
Number of completed reads |
||
node_volume_read_merged_total |
Number of merged reads |
||
node_volume_read_bytes_total |
Total number of bytes read from sectors |
||
node_volume_read_time_milliseconds_total |
Total read duration |
||
node_volume_write_completed_total |
Number of completed writes |
||
node_volume_write_merged_total |
Number of merged writes |
||
node_volume_write_bytes_total |
Total number of bytes written to sectors |
||
node_volume_write_time_milliseconds_total |
Total write duration |
||
node_volume_io_now |
Number of ongoing I/Os |
||
node_volume_io_time_seconds_total |
Total I/O operation duration |
||
node_volume_capacity_bytes_available |
Available capacity |
||
node_volume_capacity_bytes_total |
Total capacity |
||
node_volume_capacity_bytes_used |
Used capacity |
||
node_volume_inodes_available |
Available inodes |
||
node_volume_inodes_total |
Total number of inodes |
||
node_volume_inodes_used |
Used inodes |
||
node_volume_read_transmissions_total |
Number of read transmissions |
||
node_volume_read_timeouts_total |
Number of read timeouts |
||
node_volume_read_sent_bytes_total |
Number of bytes read |
||
node_volume_read_queue_time_milliseconds_total |
Read queue waiting time |
||
node_volume_read_rtt_time_milliseconds_total |
Read RTT |
||
node_volume_write_transmissions_total |
Number of write transmissions |
||
node_volume_write_timeouts_total |
Number of write timeouts |
||
node_volume_write_queue_time_milliseconds_total |
Write queue waiting time |
||
node_volume_write_rtt_time_milliseconds_total |
Write RTT |
||
node_volume_localvolume_stats_capacity_bytes |
Local storage capacity |
||
node_volume_localvolume_stats_available_bytes |
Available local storage |
||
node_volume_localvolume_stats_used_bytes |
Used local storage |
||
node_volume_localvolume_stats_inodes |
Number of inodes for a local volume |
||
node_volume_localvolume_stats_inodes_used |
Used inodes for a local volume |
||
podMonitor/monitoring/nginx-ingress-controller/0 |
monitoring/nginx-ingress-controller |
nginx_ingress_controller_bytes_sent |
Number of bytes sent to the client |
nginx_ingress_controller_connect_duration_seconds |
Duration for connecting to the upstream server |
||
nginx_ingress_controller_header_duration_seconds |
Time required for receiving the first header from the upstream server |
||
nginx_ingress_controller_ingress_upstream_latency_seconds |
Upstream service latency |
||
nginx_ingress_controller_request_duration_seconds |
Time required for processing a request, in seconds |
||
nginx_ingress_controller_request_size |
Length of a request, including the request line, header, and body |
||
nginx_ingress_controller_requests |
Total number of HTTP requests processed by Nginx Ingress Controller since it started |
||
nginx_ingress_controller_response_duration_seconds |
Time required for receiving the response from the upstream server |
||
nginx_ingress_controller_response_size |
Length of a response, including the status line, header, and body |
||
nginx_ingress_controller_nginx_process_connections |
Number of client connections in the active, read, write, or wait state |
||
nginx_ingress_controller_nginx_process_connections_total |
Total number of client connections in the accepted or handled state |
||
nginx_ingress_controller_nginx_process_cpu_seconds_total |
Total CPU time consumed by the Nginx process (unit: second) |
||
nginx_ingress_controller_nginx_process_num_procs |
Number of processes |
||
nginx_ingress_controller_nginx_process_oldest_start_time_seconds |
Start time of the oldest Nginx process, in seconds since January 1, 1970 (Unix epoch) |
||
nginx_ingress_controller_nginx_process_read_bytes_total |
Number of bytes read |
||
nginx_ingress_controller_nginx_process_requests_total |
Total number of requests processed by Nginx since startup |
||
nginx_ingress_controller_nginx_process_resident_memory_bytes |
Resident memory usage of a process, that is, the actual physical memory usage |
||
nginx_ingress_controller_nginx_process_virtual_memory_bytes |
Virtual memory usage of a process, that is, the total memory allocated to the process, including the actual physical memory and virtual swap space |
||
nginx_ingress_controller_nginx_process_write_bytes_total |
Amount of data written by the Nginx process to disks or other devices for long-term storage |
||
nginx_ingress_controller_build_info |
Build information of Nginx Ingress Controller, including the version and compilation time |
||
nginx_ingress_controller_check_success |
Health check result of Nginx Ingress Controller. 1: Normal. 0: Abnormal |
||
nginx_ingress_controller_config_hash |
Hash value of the running configuration |
||
nginx_ingress_controller_config_last_reload_successful |
Whether the Nginx Ingress Controller configuration is successfully reloaded |
||
nginx_ingress_controller_config_last_reload_successful_timestamp_seconds |
Last timestamp when the Nginx Ingress Controller configuration was successfully reloaded |
||
nginx_ingress_controller_ssl_certificate_info |
Nginx Ingress Controller certificate information |
||
nginx_ingress_controller_success |
Cumulative number of reload operations of Nginx Ingress Controller |
||
nginx_ingress_controller_orphan_ingress |
Whether the ingress is orphaned. 1: Orphaned. 0: Not orphaned. The namespace label indicates the namespace where the ingress is located, the ingress label indicates the ingress name, and the type label indicates the orphan type (options: no-service and no-endpoint). |
||
nginx_ingress_controller_admission_config_size |
Size of the admission controller configuration |
||
nginx_ingress_controller_admission_render_duration |
Rendering duration of the admission controller |
||
nginx_ingress_controller_admission_render_ingresses |
Number of ingresses rendered by the admission controller |
||
nginx_ingress_controller_admission_roundtrip_duration |
Time spent by the admission controller to process new events |
||
nginx_ingress_controller_admission_tested_duration |
Time spent on admission controller tests |
||
nginx_ingress_controller_admission_tested_ingresses |
Number of ingresses processed by the admission controller |
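
Many of the container metrics listed above are cumulative counters, so they are usually read as the increase between two samples rather than as raw values. The following is a minimal Python sketch, not part of AOM or the add-on, showing one way to derive a CPU throttling ratio from container_cpu_cfs_periods_total and container_cpu_cfs_throttled_periods_total, and a file system usage percentage from container_fs_usage_bytes and container_fs_limit_bytes. The function names and the counter-reset handling are assumptions made for illustration.

```python
def counter_increase(earlier: float, later: float) -> float:
    """Increase of a cumulative counter between two samples.

    Assumption: if the later sample is smaller than the earlier one, the
    counter was reset (for example, the container restarted), so the later
    value itself is taken as the increase.
    """
    return later if later < earlier else later - earlier


def throttled_ratio(periods_earlier: float, periods_later: float,
                    throttled_earlier: float, throttled_later: float) -> float:
    """Fraction of CFS enforcement periods in which the container was throttled.

    Inputs are two samples each of container_cpu_cfs_periods_total and
    container_cpu_cfs_throttled_periods_total.
    """
    periods = counter_increase(periods_earlier, periods_later)
    if periods == 0:
        # No enforcement periods elapsed, so nothing could be throttled.
        return 0.0
    throttled = counter_increase(throttled_earlier, throttled_later)
    return throttled / periods


def fs_usage_percent(usage_bytes: float, limit_bytes: float) -> float:
    """File system usage as a percentage of the container's limit.

    Inputs are samples of container_fs_usage_bytes and container_fs_limit_bytes.
    """
    if limit_bytes <= 0:
        # Hypothetical guard: treat a missing or zero limit as 0% usage.
        return 0.0
    return 100.0 * usage_bytes / limit_bytes


if __name__ == "__main__":
    # Hypothetical sample values, for illustration only.
    print(throttled_ratio(1000, 1600, 50, 200))
    print(fs_usage_percent(25 * 1024**3, 100 * 1024**3))
```

A ratio close to 1 would suggest that the container is throttled in most enforcement periods. What threshold is worth alerting on depends on the workload.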