Updated on 2024-05-27 GMT+08:00

Cluster Monitoring

Clusters deployed using CCE are monitored. On the Cluster Monitoring page, you can view basic metrics (such as cluster status, CPU usage, memory usage, and node status) and related alarms and events in real time. With this information, you can track cluster status and handle risks promptly to keep clusters running stably.

Precautions

The host status can be Normal, Abnormal, Warning, Silent, or Deleted. A host is displayed as Abnormal when it is faulty because of a network failure, power-off, or shutdown, or when a threshold alarm is reported on it.
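The Abnormal rule above can be read as a simple "any of these conditions" check. The following is a minimal Python sketch of that reading only. The HostSignals type and its field names are hypothetical and are not part of any AOM API. The conditions for Warning, Silent, and Deleted are not described here, so the sketch does not model them.

```python
from dataclasses import dataclass


@dataclass
class HostSignals:
    """Hypothetical health signals for one host (illustrative names only)."""
    network_reachable: bool
    powered_on: bool
    threshold_alarm_active: bool


def is_abnormal(signals: HostSignals) -> bool:
    """Return True if any condition listed for the Abnormal status holds."""
    return (
        not signals.network_reachable      # faulty due to a network failure
        or not signals.powered_on          # powered off or shut down
        or signals.threshold_alarm_active  # a threshold alarm is reported
    )


if __name__ == "__main__":
    healthy = HostSignals(network_reachable=True, powered_on=True,
                          threshold_alarm_active=False)
    alarmed = HostSignals(network_reachable=True, powered_on=True,
                          threshold_alarm_active=True)
    print(is_abnormal(healthy), is_abnormal(alarmed))
```

Under this reading, a host that is reachable and powered on is still shown as Abnormal while a threshold alarm is active on it.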

Procedure

  1. Log in to the AOM 2.0 console.
  2. In the navigation pane, choose Infrastructure Monitoring > Cluster Monitoring.
  3. In the upper right corner of the page, set cluster filter criteria.

    1. Set a time range to view the CCE clusters that report information. There are two methods to set a time range:

      Method 1: Use a predefined time label, such as Last hour or Last 6 hours. You can select a time range as required.

      Method 2: Specify the start time and end time to customize a time range. A custom time range can span at most 30 days.

    2. Set the interval for refreshing information. Select a value from the drop-down list, such as Refresh manually or 1 minute auto refresh.

  4. Set search criteria (such as the creation time, CPU usage, and cluster name) to find the target cluster.
  5. Click a cluster to go to its details page. In the navigation pane on the left, monitor the cluster's running status by cluster, dashboard, or alarm.

    • View information about nodes, workloads, pods (container groups), and containers by cluster.
      • In the navigation pane on the left, choose Insights > Node to view information about all nodes in the cluster in real time, including the status, IP address, pod status, CPU usage, and memory usage.
        • In the upper part of the node list, filter nodes by node name.
        • Click the icon in the upper right corner and select or deselect options as required.
        • Click a node to view its related resources, alarms, and events, and common system devices such as GPUs and NICs.
          • On the Overview tab page, Cloud-Native Monitoring (New) is selected by default. You can view metrics such as CPU, memory, and network. Click Using ICAgent (Old) and select a target Prometheus instance from the drop-down list. You can view metrics such as CPU, physical memory, and host status.

            To use cloud-native monitoring, connect your cluster to a Prometheus instance for CCE first.

            If there is no Prometheus instance for CCE, click Prometheus Monitoring to create a Prometheus instance by referring to Prometheus Instance for CCE. After the instance is created, click its name. On the instance details page, choose Integration Center and then connect the CCE cluster.

            To view resource information for a specific period, select a predefined time label or customize a time range from the drop-down list in the upper right corner.

            To obtain the latest resource information in real time, click the refresh icon in the upper right corner.

            To view resource information in full screen, click the full-screen icon in the upper right corner of the page.

          • On the Related Resources tab page, the pods (container groups) running on the node are displayed.
      • In the navigation pane on the left, choose Insights > Workload to view the status and resource usage of all workloads in the cluster.
        • In the upper part of the workload list, filter workloads by workload type or name.
        • Click the icon in the upper right corner and select or deselect options as required.
        • Click a workload to view its related resources, alarms, events, and dashboards.
          • On the Overview tab page, Cloud-Native Monitoring (New) is selected by default. You can view metrics such as CPU, memory, and network. Click Using ICAgent (Old) and select a target Prometheus instance from the drop-down list. You can view metrics such as CPU, physical memory, and file system.
          • On the Related Resources tab page, the pods (container groups) that belong to the workload are displayed.
      • In the navigation pane on the left, choose Insights > Pod to view the status and resource usage of all pods in the cluster.
        • In the upper part of the pod list, filter pods by name.
        • Click the icon in the upper right corner and select or deselect options as required.
        • Click a pod to view its related resources, alarms, events, and dashboards.
          • On the Overview tab page, Cloud-Native Monitoring (New) is selected by default. You can view metrics such as CPU, memory, and network. Click Using ICAgent (Old) and select a target Prometheus instance from the drop-down list. You can view metrics such as CPU, physical memory, and file system.
          • On the Related Resources tab page, view nodes, workloads, and containers by name.
      • In the navigation pane on the left, choose Insights > Container to view the status and resource usage of all containers in the cluster.
        • In the upper part of the container list, filter containers by name.
        • Click the icon in the upper right corner and select or deselect options as required.
        • Click a container to view its related resources, alarms, events, and dashboards. On the Related Resources tab page, the pod (container group) to which the container belongs is displayed by default. You can also view nodes, workloads, and pods by name.
    • View the cluster running status from the alarm management perspective.
      • In the navigation pane on the left, choose Alarm Management > Alarm List to view alarm details of the cluster. For details, see Viewing Alarms.
      • In the navigation pane on the left, choose Alarm Management > Event List to view event details of the cluster. For details, see Viewing Events.
      • In the navigation pane on the left, choose Alarm Management > Alarm Rules to view the alarm rules related to the cluster. Modify the alarm rules as required. For details, see Managing Alarm Rules.
    • In the navigation pane on the left, choose Dashboard to view the running status of the current cluster.
      • If a Prometheus instance for CCE has been connected:

        Select Cluster View, Pod View, Host View, or Node View from the drop-down list to view key metrics such as the CPU usage and physical memory usage.

      • If no Prometheus instance for CCE is connected:

        Choose Prometheus Monitoring and then add a Prometheus instance. For details, see Prometheus Instance for CCE. After the instance is created, click its name. On the instance details page, choose Integration Center and then connect the CCE cluster.
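
The two ways of setting a time range in step 3 (a predefined label, or a custom start and end time of at most 30 days) can be sketched as a small helper. This is an illustrative Python sketch, not the console's implementation. The resolve_time_range function and the PREDEFINED_LABELS table are hypothetical names, and only the two labels named in the procedure are included.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

# The procedure states that a custom range can span 30 days at most.
MAX_CUSTOM_RANGE = timedelta(days=30)

# Hypothetical table: only the two labels named in the procedure are listed.
PREDEFINED_LABELS = {
    "Last hour": timedelta(hours=1),
    "Last 6 hours": timedelta(hours=6),
}


def resolve_time_range(
    label: Optional[str] = None,
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
    now: Optional[datetime] = None,
) -> Tuple[datetime, datetime]:
    """Return (start, end) for a predefined label or a custom range."""
    now = now or datetime.now(timezone.utc)

    # Method 1: a predefined label is measured back from the current time.
    if label is not None:
        if label not in PREDEFINED_LABELS:
            raise ValueError(f"unknown time label: {label!r}")
        return now - PREDEFINED_LABELS[label], now

    # Method 2: a custom range needs both ends and must not exceed 30 days.
    if start is None or end is None:
        raise ValueError("a custom range needs both a start and an end time")
    if end <= start:
        raise ValueError("the end time must be later than the start time")
    if end - start > MAX_CUSTOM_RANGE:
        raise ValueError("a custom range can span at most 30 days")
    return start, end


if __name__ == "__main__":
    begin, finish = resolve_time_range(label="Last 6 hours")
    print(begin.isoformat(), finish.isoformat())
```

Passing now explicitly keeps the helper deterministic, which makes the 30-day limit easy to check in isolation.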