Updated on 2024-10-14 GMT+08:00

CCE Network Metrics Exporter

Introduction

Dolphin is an add-on for monitoring and managing container network traffic. The current version of dolphin collects traffic statistics for containers that do not use the host network mode in CCE Turbo clusters and performs node-wide container connectivity checks.

The IP and TCP traffic can be monitored by pod or flow. You can use podSelector to select the monitoring backend. Multiple monitoring tasks can be created and monitoring metrics can be selected as required. The label information of pods can also be obtained. The monitoring information has been adapted to the Prometheus format. You can call the Prometheus API to view monitoring data.

Constraints

  • This add-on can be installed only in CCE Turbo clusters 1.19 or later. The add-on pods can run only on nodes running EulerOS on x86.
  • This add-on can be installed on nodes that use the containerd or Docker container engine. On containerd nodes, it tracks pod updates in real time. On Docker nodes, it queries pod updates by polling.
  • In a CCE Turbo cluster, traffic statistics can be collected only for secure containers that use the Kata container runtime and common containers that use the runC container runtime.
  • After the add-on is installed, traffic is not monitored by default. To monitor traffic, create a MonitorPolicy that configures a monitoring task.
  • Pods using the host network mode cannot be monitored.
  • Ensure that there are sufficient resources on a node for installing the add-on.
  • The source of monitoring labels and user labels must already be available before a pod is created.
  • You can specify a maximum of five labels (a maximum of 10 labels in versions later than 1.3.4). You cannot specify the labels used by the system. Labels used by the system include pod, task, ipfamily, srcip, dstip, srcport, dstport, and protocol.

Installing the Add-on

  1. Log in to the CCE console and click the CCE Turbo cluster name to access the cluster. Click Add-ons in the navigation pane, locate CCE Network Metrics Exporter on the right, and click Install.
  2. On the Install Add-on page, check the add-on configuration.

    No parameter can be configured for the current add-on.

  3. Click Install.

    After the add-on is installed, select the cluster and click Add-ons in the navigation pane. On the displayed page, view the add-on in the Add-ons Installed area.

Components

Table 1 Add-on components

| Component | Description | Resource Type |
|---|---|---|
| dolphin | Used to monitor the container network traffic of CCE Turbo clusters | DaemonSet |

Monitoring Metrics of dolphin

You can deliver a monitoring task by creating a MonitorPolicy. A MonitorPolicy can be created by calling an API or using the kubectl apply command after logging in to a worker node. A MonitorPolicy represents a monitoring task and provides optional parameters such as selector and podLabel. The following table describes the supported monitoring metrics.

Table 2 Supported monitoring metrics

| Monitoring Metric | Monitoring Item | Granularity | Supported Runtime | Supported Cluster Version | Supported Add-on Version | Supported OS |
|---|---|---|---|---|---|---|
| Number of IPv4 packets sent to the Internet | dolphin_ip4_send_pkt_internet | Pod | runC/Kata | v1.19 or later | 1.1.2 | EulerOS 2.9 on x86, EulerOS 2.10 on x86 |
| Number of IPv4 bytes sent to the Internet | dolphin_ip4_send_byte_internet | Pod | runC/Kata | v1.19 or later | 1.1.2 | EulerOS 2.9 on x86, EulerOS 2.10 on x86 |
| Number of received IPv4 packets | dolphin_ip4_rcv_pkt | Pod | runC/Kata | v1.19 or later | 1.1.2 | EulerOS 2.9 on x86, EulerOS 2.10 on x86 |
| Number of received IPv4 bytes | dolphin_ip4_rcv_byte | Pod | runC/Kata | v1.19 or later | 1.1.2 | EulerOS 2.9 on x86, EulerOS 2.10 on x86 |
| Number of sent IPv4 packets | dolphin_ip4_send_pkt | Pod | runC/Kata | v1.19 or later | 1.1.2 | EulerOS 2.9 on x86, EulerOS 2.10 on x86 |
| Number of sent IPv4 bytes | dolphin_ip4_send_byte | Pod | runC/Kata | v1.19 or later | 1.1.2 | EulerOS 2.9 on x86, EulerOS 2.10 on x86 |
| Health status of the latest health check | dolphin_health_check_status | Pod | runC/Kata | v1.19 or later | 1.2.2 | EulerOS 2.9 on x86, EulerOS 2.10 on x86 |
| Total number of successful health checks | dolphin_health_check_successful_counter | Pod | runC/Kata | v1.19 or later | 1.2.2 | EulerOS 2.9 on x86, EulerOS 2.10 on x86 |
| Total number of failed health checks | dolphin_health_check_failed_counter | Pod | runC/Kata | v1.19 or later | 1.2.2 | EulerOS 2.9 on x86, EulerOS 2.10 on x86 |
| Number of received IP packets | dolphin_ip_receive_pkt | Pod | runC | v1.23 or later | 1.3.5 | EulerOS 2.9 on x86, EulerOS 2.10 on x86 |
| Number of received IP bytes | dolphin_ip_receive_byte | Pod | runC | v1.23 or later | 1.3.5 | EulerOS 2.9 on x86, EulerOS 2.10 on x86 |
| Number of sent IP packets | dolphin_ip_send_pkt | Pod | runC | v1.23 or later | 1.3.5 | EulerOS 2.9 on x86, EulerOS 2.10 on x86 |
| Number of sent IP bytes | dolphin_ip_send_byte | Pod | runC | v1.23 or later | 1.3.5 | EulerOS 2.9 on x86, EulerOS 2.10 on x86 |
| Number of received TCP packets | dolphin_tcp_receive_pkt | Pod | runC | v1.23 or later | 1.3.5 | EulerOS 2.9 on x86, EulerOS 2.10 on x86 |
| Number of received TCP bytes | dolphin_tcp_receive_byte | Pod | runC | v1.23 or later | 1.3.5 | EulerOS 2.9 on x86, EulerOS 2.10 on x86 |
| Number of sent TCP packets | dolphin_tcp_send_pkt | Pod | runC | v1.23 or later | 1.3.5 | EulerOS 2.9 on x86, EulerOS 2.10 on x86 |
| Number of sent TCP bytes | dolphin_tcp_send_byte | Pod | runC | v1.23 or later | 1.3.5 | EulerOS 2.9 on x86, EulerOS 2.10 on x86 |
| Number of retransmitted TCP packets | dolphin_tcp_retrans | Pod | runC | v1.23 or later | 1.3.5 | EulerOS 2.9 on x86, EulerOS 2.10 on x86 |
| Number of new TCP connections | dolphin_tcp_connection | Pod | runC | v1.23 or later | 1.3.5 | EulerOS 2.9 on x86, EulerOS 2.10 on x86 |
| Number of received IP packets | dolphin_flow_ip_receive_pkt | Flow | runC | v1.23 or later | 1.3.5 | EulerOS 2.9 on x86, EulerOS 2.10 on x86 |
| Number of received IP bytes | dolphin_flow_ip_receive_byte | Flow | runC | v1.23 or later | 1.3.5 | EulerOS 2.9 on x86, EulerOS 2.10 on x86 |
| Number of sent IP packets | dolphin_flow_ip_send_pkt | Flow | runC | v1.23 or later | 1.3.5 | EulerOS 2.9 on x86, EulerOS 2.10 on x86 |
| Number of sent IP bytes | dolphin_flow_ip_send_byte | Flow | runC | v1.23 or later | 1.3.5 | EulerOS 2.9 on x86, EulerOS 2.10 on x86 |
| Number of received TCP packets | dolphin_flow_tcp_receive_pkt | Flow | runC | v1.23 or later | 1.3.5 | EulerOS 2.9 on x86, EulerOS 2.10 on x86 |
| Number of received TCP bytes | dolphin_flow_tcp_receive_byte | Flow | runC | v1.23 or later | 1.3.5 | EulerOS 2.9 on x86, EulerOS 2.10 on x86 |
| Number of sent TCP packets | dolphin_flow_tcp_send_pkt | Flow | runC | v1.23 or later | 1.3.5 | EulerOS 2.9 on x86, EulerOS 2.10 on x86 |
| Number of sent TCP bytes | dolphin_flow_tcp_send_byte | Flow | runC | v1.23 or later | 1.3.5 | EulerOS 2.9 on x86, EulerOS 2.10 on x86 |
| Number of retransmitted TCP packets | dolphin_flow_tcp_retrans | Flow | runC | v1.23 or later | 1.3.5 | EulerOS 2.9 on x86, EulerOS 2.10 on x86 |
| TCP smoothed round-trip time | dolphin_flow_tcp_srtt | Flow | runC | v1.23 or later | 1.3.5 | EulerOS 2.9 on x86, EulerOS 2.10 on x86 |

Delivering a Monitoring Task

The template for creating a MonitorPolicy is as follows:

apiVersion: crd.dolphin.io/v1
kind: MonitorPolicy
metadata:
  name: example-task            # Monitoring task name.
  namespace: kube-system        # The value must be kube-system. This field is mandatory.
spec:
  selector:                     # (Optional) Backend monitored by the dolphin add-on, for example, labelSelector. By default, all containers on the node are monitored.
    matchLabels:
      app: nginx
    matchExpressions:
      - key: app
        operator: In
        values:
          - nginx
  podLabel: [app]               # (Optional) Pod label.
  ip4Tx:                        # (Optional) Indicates whether to collect statistics about the number of sent IPv4 packets and the number of sent IPv4 bytes. This function is disabled by default.
    enable: true
  ip4Rx:                        # (Optional) Indicates whether to collect statistics about the number of received IPv4 packets and the number of received IPv4 bytes. This function is disabled by default.
    enable: true
  ip4TxInternet:                # (Optional) Indicates whether to collect statistics about the number of IPv4 packets and the number of IPv4 bytes sent to the Internet. This function is disabled by default.
    enable: true
  healthCheck:                  # (Optional) Indicates whether to collect the latest health check result and the total numbers of successful and failed health checks for pods on the local node. This function is disabled by default.
    enable: true                # true or false
    failureThreshold: 3         # (Optional) Number of check failures after which a pod is considered unhealthy. By default, one failure is enough.
    periodSeconds: 5            # (Optional) Interval between health checks, in seconds. The default value is 60.
    command: ""                 # (Optional) Health check command. The value can be ping (default), arping, or curl.
    ipFamilies: [""]            # (Optional) Health check IP address family. The value is IPv4 by default.
    port: 80                    # (Optional) Port number, which is mandatory when curl is used.
    path: ""                    # (Optional) HTTP API path, which is mandatory when curl is used.
  monitor:
    ip:
      ipReceive:
        aggregateType: flow       # (Optional) The value can be pod (monitored by pod) or flow (monitored by flow).
      ipSend:
        aggregateType: flow       # (Optional) The value can be pod (monitored by pod) or flow (monitored by flow).
    tcp:
      tcpReceive:
        aggregateType: flow       # (Optional) The value can be pod (monitored by pod) or flow (monitored by flow).
      tcpSend:
        aggregateType: flow       # (Optional) The value can be pod (monitored by pod) or flow (monitored by flow).
      tcpRetrans:
        aggregateType: flow       # (Optional) The value can be pod (monitored by pod) or flow (monitored by flow).
      tcpRtt:
        aggregateType: flow       # (Optional) The value can only be flow (monitored by flow). The unit is μs.
      tcpNewConnection:
        aggregateType: pod        # (Optional) The value can only be pod (monitored by pod).
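
Because kubectl apply also accepts JSON manifests, a MonitorPolicy can be generated from a script. The following is a minimal sketch, not part of the add-on: the helper name build_monitor_policy and its parameters are hypothetical, and it emits only fields that appear in the template above.

```python
import json


def build_monitor_policy(name, match_labels=None, pod_labels=(), aggregate_types=None):
    """Return a MonitorPolicy manifest as a plain dict (hypothetical helper)."""
    spec = {}
    if match_labels:
        spec["selector"] = {"matchLabels": dict(match_labels)}
    if pod_labels:
        spec["podLabel"] = list(pod_labels)
    # aggregate_types maps a monitoring item from the template, such as
    # "ipReceive" or "tcpRtt", to "pod" or "flow".
    monitor = {}
    for item, aggregate in (aggregate_types or {}).items():
        group = "tcp" if item.startswith("tcp") else "ip"
        monitor.setdefault(group, {})[item] = {"aggregateType": aggregate}
    if monitor:
        spec["monitor"] = monitor
    return {
        "apiVersion": "crd.dolphin.io/v1",
        "kind": "MonitorPolicy",
        # The template states that the namespace must be kube-system.
        "metadata": {"name": name, "namespace": "kube-system"},
        "spec": spec,
    }


policy = build_monitor_policy(
    "example-task",
    match_labels={"app": "nginx"},
    pod_labels=["app"],
    aggregate_types={"ipSend": "pod", "tcpRtt": "flow"},
)
print(json.dumps(policy, indent=2))
```

The printed JSON can be saved to a file and passed to kubectl apply -f in the same way as the YAML template.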

podLabel: You can enter multiple pod labels and separate them with commas (,), for example, [app, version].

Labels must comply with the following rules. The corresponding regular expression is (^[a-zA-Z_]$)|(^([a-zA-Z][a-zA-Z0-9_]|_[a-zA-Z0-9])([a-zA-Z0-9_]){0,254}$).

  • A maximum of five labels can be entered (a maximum of 10 labels in versions later than 1.3.4). A label can contain a maximum of 256 characters.
  • A label cannot start with a digit or double underscores (__).
  • A label can contain only letters, digits, and underscores (A-Za-z_0-9).
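
The label rules above, together with the system labels listed under Constraints, can be checked before a MonitorPolicy is submitted. The sketch below is illustrative only: the function name check_pod_labels and the max_labels parameter are assumptions, while the regular expression and the reserved names are copied from this document.

```python
import re

# Regular expression quoted in the label rules above.
LABEL_RE = re.compile(
    r"(^[a-zA-Z_]$)|(^([a-zA-Z][a-zA-Z0-9_]|_[a-zA-Z0-9])([a-zA-Z0-9_]){0,254}$)"
)
# Labels used by the system, as listed under Constraints.
RESERVED = {"pod", "task", "ipfamily", "srcip", "dstip", "srcport", "dstport", "protocol"}


def check_pod_labels(labels, max_labels=5):
    """Return a list of problems; an empty list means the labels look valid.

    max_labels defaults to 5; pass 10 for add-on versions later than 1.3.4.
    """
    problems = []
    if len(labels) > max_labels:
        problems.append(f"too many labels: {len(labels)} > {max_labels}")
    for label in labels:
        if label in RESERVED:
            problems.append(f"{label!r} is used by the system")
        elif not LABEL_RE.fullmatch(label):
            problems.append(f"{label!r} does not match the label format")
    return problems


print(check_pod_labels(["app", "version"]))
print(check_pod_labels(["pod", "1abc", "__x"]))
```

The first call reports no problems. The second reports one problem per label: pod is used by the system, 1abc starts with a digit, and __x starts with double underscores.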

You can create, modify, and delete monitoring tasks in the preceding format. A maximum of 10 monitoring tasks can be created. When multiple monitoring tasks match the same monitoring backend, that backend generates a separate set of monitoring metrics for each matching task.

  • If you modify or delete a monitoring task, monitoring data collected by the monitoring task will be lost. Therefore, exercise caution when performing this operation.
  • If the add-on is uninstalled, the MonitorPolicy of the monitoring task will be removed together with the add-on.

Example application scenarios:

  1. The example below monitors all pods on a node that match the label selector app=nginx and generates the three health check metrics. By default, the ping command is used to check local pods. If a monitored pod has the test and app labels, their key-value pairs are carried in the monitoring metrics. Otherwise, the value of the missing label is reported as not found.
    apiVersion: crd.dolphin.io/v1
    kind: MonitorPolicy
    metadata:
      name: example-task  
      namespace: kube-system        
    spec:
      selector:
        matchLabels:
          app: nginx
      podLabel: [test, app] 
      healthCheck: 
        enable: true
        failureThreshold: 3
        periodSeconds: 5
  2. The example below monitors all pods on a node that match the label selector app=nginx and generates the three health check metrics. A custom curl command is used, which checks only network connectivity. That is, the pod is considered healthy as long as the network is reachable, regardless of the HTTP status code returned by the program. If a monitored pod has the test and app labels, their key-value pairs are carried in the monitoring metrics. Otherwise, the value of the missing label is reported as not found.
    apiVersion: crd.dolphin.io/v1
    kind: MonitorPolicy
    metadata:
      name: example-task
      namespace: kube-system
    spec:
      selector:
        matchLabels:
          app: nginx
      podLabel: [test, app]
      healthCheck: 
        enable: true
        failureThreshold: 3
        periodSeconds: 5 
        command: "curl"
        port: 80
        path: "healthz"
  3. The example below monitors all pods on a node that match the label selector app=nginx and generates monitoring data by pod, including the number of sent IP packets, received IP packets, sent IP bytes, received IP bytes, sent TCP packets, received TCP packets, sent TCP bytes, received TCP bytes, retransmitted TCP packets, and new TCP connections. If a monitored pod has the test and app labels, their key-value pairs are carried in the monitoring metrics. Otherwise, the value of the missing label is reported as not found.
    apiVersion: crd.dolphin.io/v1
    kind: MonitorPolicy
    metadata:
      name: example-task
      namespace: kube-system
    spec:
      selector:
        matchLabels:
          app: nginx
      podLabel: [test, app]
      monitor:
        ip:
          ipReceive:
            aggregateType: pod
          ipSend:
            aggregateType: pod
        tcp:
          tcpReceive:
            aggregateType: pod
          tcpSend:
            aggregateType: pod
          tcpRetrans:
            aggregateType: pod
          tcpNewConnection:
            aggregateType: pod
  4. The example below monitors all pods on a node that match the label selector app=nginx and generates monitoring data by flow, including the number of sent IP packets, received IP packets, sent IP bytes, received IP bytes, sent TCP packets, received TCP packets, sent TCP bytes, received TCP bytes, retransmitted TCP packets, and TCP round-trip time (µs). If a monitored pod has the test and app labels, their key-value pairs are carried in the monitoring metrics. Otherwise, the value of the missing label is reported as not found. Flow-based monitoring provides detailed container traffic information, but it generates a large amount of data and consumes more CPU and memory. Enable it only when you need it.

    A flow-based IP monitoring task (one or more IP monitoring items enabled in a MonitorPolicy) occupies 2.6 MB of kernel memory. A flow-based TCP monitoring task (one or more TCP monitoring items enabled in a MonitorPolicy) occupies 14 MB of kernel memory.

    apiVersion: crd.dolphin.io/v1
    kind: MonitorPolicy
    metadata:
      name: example-task
      namespace: kube-system
    spec:
      selector:
        matchLabels:
          app: nginx
      podLabel: [test, app]
      monitor:
        ip:
          ipReceive:
            aggregateType: flow
          ipSend:
            aggregateType: flow
        tcp:
          tcpReceive:
            aggregateType: flow
          tcpSend:
            aggregateType: flow
          tcpRetrans:
            aggregateType: flow
          tcpRtt:
            aggregateType: flow

    If the data generated by flow-based monitoring exceeds a certain limit, excess flow statistics will be lost. The restrictions are as follows:

    • A maximum of 50,000 TCP flows (per monitoring task) can be collected in kernel mode within 10 seconds.
    • A maximum of 10,000 IP flows (per monitoring task) can be collected in kernel mode within 10 seconds.
    • A maximum of 60,000 flow statistical records (across all monitoring tasks) can be cached between two consecutive CloudScope data fetches.
    • If CloudScope does not obtain monitoring data for a long time, only the monitoring data generated within the latest hour will be cached.
  5. The example below monitors all pods on a node and generates the number of sent IPv4 packets and the number of sent IPv4 bytes. If a monitored pod has the app label, its key-value pair is carried in the monitoring metrics. Otherwise, the value of the missing label is reported as not found.
    apiVersion: crd.dolphin.io/v1
    kind: MonitorPolicy
    metadata:
      name: example-task  
      namespace: kube-system        
    spec:
      podLabel: [app]
      ip4Tx:
        enable: true
  6. The example below monitors all pods on a node that match the label selector app=nginx and generates the number of sent IPv4 packets, received IPv4 packets, sent IPv4 bytes, received IPv4 bytes, IPv4 packets sent to the public network, and IPv4 bytes sent to the public network. If a monitored pod has the test and app labels, their key-value pairs are carried in the monitoring metrics. Otherwise, the value of the missing label is reported as not found.
    apiVersion: crd.dolphin.io/v1
    kind: MonitorPolicy
    metadata:
      name: example-task
      namespace: kube-system
    spec:
      selector:
        matchLabels:
          app: nginx
      podLabel: [test, app]
      ip4Tx:
        enable: true
      ip4Rx:
        enable: true
      ip4TxInternet:      
        enable: true
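
The kernel memory figures quoted for flow-based monitoring in scenario 4 can be turned into a rough budget before several tasks are created. This is a sketch under stated assumptions: the function names are hypothetical, policies are represented as plain dicts mirroring the YAML, and the total is assumed to be additive across tasks, which the text above does not state explicitly.

```python
IP_FLOW_TASK_MB = 2.6    # per flow-based IP monitoring task (figure from the text)
TCP_FLOW_TASK_MB = 14.0  # per flow-based TCP monitoring task (figure from the text)


def uses_flow(section):
    """True if any monitoring item in an ip or tcp section is aggregated by flow."""
    return any(item.get("aggregateType") == "flow" for item in section.values())


def estimate_kernel_memory_mb(policies):
    """Sum the documented per-task figures over a list of MonitorPolicy dicts.

    Assumption: usage is additive across tasks; the source gives only per-task values.
    """
    total = 0.0
    for policy in policies:
        monitor = policy.get("spec", {}).get("monitor", {})
        if uses_flow(monitor.get("ip", {})):
            total += IP_FLOW_TASK_MB
        if uses_flow(monitor.get("tcp", {})):
            total += TCP_FLOW_TASK_MB
    return total


flow_policy = {"spec": {"monitor": {
    "ip": {"ipSend": {"aggregateType": "flow"}},
    "tcp": {"tcpRtt": {"aggregateType": "flow"}},
}}}
pod_policy = {"spec": {"monitor": {"ip": {"ipSend": {"aggregateType": "pod"}}}}}
print(f"{estimate_kernel_memory_mb([flow_policy, pod_policy]):.1f} MB")
```

Only the flow-based task contributes here, because the memory figures in the text apply to flow-based monitoring tasks.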

Checking Traffic Statistics

The monitoring data collected by this add-on is exported in Prometheus exporter format. You can obtain it as follows:

  • Directly access service port 10001 provided by the dolphin add-on, for example, http://{POD_IP}:10001/metrics.

    If you access the dolphin service port on a node, ensure that the security groups of the node and the pod allow the access.
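
As an illustration of the access method above, the metrics endpoint can be read with the Python standard library. This is a minimal sketch: the function names are hypothetical, the IP address in the commented usage is a placeholder, and an IPv4 pod address is assumed.

```python
import urllib.request


def metrics_url(pod_ip, port=10001):
    """Build the metrics URL of a dolphin pod (port 10001 per the text above)."""
    return f"http://{pod_ip}:{port}/metrics"


def fetch_metrics(pod_ip, timeout=5.0):
    """Return the raw Prometheus text exposed by the dolphin pod at pod_ip."""
    with urllib.request.urlopen(metrics_url(pod_ip), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def dolphin_lines(text):
    """Keep only sample lines whose metric name starts with dolphin_."""
    return [line for line in text.splitlines() if line.startswith("dolphin_")]


# Usage (requires network access to the pod; 192.0.2.10 is a placeholder IP):
# for line in dolphin_lines(fetch_metrics("192.0.2.10")):
#     print(line)
```

Filtering on the dolphin_ prefix drops comment lines and any metrics that do not belong to the monitoring items listed in Table 2.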

Examples of the monitored information:

  • Example 1 (number of IPv4 packets sent to the Internet):
    dolphin_ip4_send_pkt_internet{app="nginx",pod="default/nginx-66c9c65dbf-zjg24",task="kube-system/example-task"} 241

    In the preceding example, the namespace of the pod is default, the pod name is nginx-66c9c65dbf-zjg24, the label is app, and the value is nginx. This metric is created by monitoring task example-task, and the number of IPv4 packets sent by the pod to the public network is 241.

  • Example 2 (number of IPv4 bytes sent to the Internet):
    dolphin_ip4_send_byte_internet{app="nginx",pod="default/nginx-66c9c65dbf-zjg24",task="kube-system/example-task"} 23618

    In the preceding example, the namespace of the pod is default, the pod name is nginx-66c9c65dbf-zjg24, the label is app, and the value is nginx. This metric is created by monitoring task example-task, and the number of IPv4 bytes sent by the pod to the public network is 23618.

  • Example 3 (number of sent IPv4 packets):
    dolphin_ip4_send_pkt{app="nginx",pod="default/nginx-66c9c65dbf-zjg24",task="kube-system/example-task"} 379

    In the preceding example, the namespace of the pod is default, the pod name is nginx-66c9c65dbf-zjg24, the label is app, and the value is nginx. This metric is created by monitoring task example-task, and the number of IPv4 packets sent by the pod is 379.

  • Example 4 (number of sent IPv4 bytes):
    dolphin_ip4_send_byte{app="nginx",pod="default/nginx-66c9c65dbf-zjg24",task="kube-system/example-task"} 33129

    In the preceding example, the namespace of the pod is default, the pod name is nginx-66c9c65dbf-zjg24, the label is app, and the value is nginx. This metric is created by monitoring task example-task, and the number of IPv4 bytes sent by the pod is 33129.

  • Example 5 (number of received IPv4 packets):
    dolphin_ip4_rcv_pkt{app="nginx",pod="default/nginx-66c9c65dbf-zjg24",task="kube-system/example-task"} 464

    In the preceding example, the namespace of the pod is default, the pod name is nginx-66c9c65dbf-zjg24, the label is app, and the value is nginx. This metric is created by monitoring task example-task, and the number of IPv4 packets received by the pod is 464.

  • Example 6 (number of received IPv4 bytes):
    dolphin_ip4_rcv_byte{app="nginx",pod="default/nginx-66c9c65dbf-zjg24",task="kube-system/example-task"} 34654

    In the preceding example, the namespace of the pod is default, the pod name is nginx-66c9c65dbf-zjg24, the label is app, and the value is nginx. This metric is created by monitoring task example-task, and the number of IPv4 bytes received by the pod is 34654.

  • Example 7 (health check status):
    dolphin_health_check_status{app="nginx",pod="default/nginx-b74766f5f-7582p",task="kube-system/example-task"} 0

    In the preceding example, the namespace of the pod is default, the pod name is nginx-b74766f5f-7582p, the label is app, and the value is nginx. This metric is created by monitoring task example-task, and the network health status of the pod is 0 (healthy). If the network status is unhealthy, the value will be 1.

  • Example 8 (number of successful health checks):
    dolphin_health_check_successful_counter{app="nginx",pod="default/nginx-b74766f5f-7582p",task="kube-system/example-task"} 5

    In the preceding example, the namespace of the pod is default, the pod name is nginx-b74766f5f-7582p, the label is app, and the value is nginx. This metric is created by monitoring task example-task, and the number of successful network health checks for the pod is 5.

  • Example 9 (number of failed health checks):
    dolphin_health_check_failed_counter{app="nginx",pod="default/nginx-b74766f5f-7582p",task="kube-system/example-task"} 0

    In the preceding example, the namespace of the pod is default, the pod name is nginx-b74766f5f-7582p, the label is app, and the value is nginx. This metric is created by monitoring task example-task, and the number of failed network health checks for the pod is 0.

  • Example 10 (flow-based monitoring result):
    dolphin_flow_tcp_send_byte{app="nginx",dstip="192.168.0.89",dstport="80",ipfamily="ipv4",pod="kube-system/nginx-b74766f5f-7582p",srcip="192.168.1.67",srcport="12973",task="kube-system/example-task"} 1725 1700538280914

    In the preceding example, the namespace of the pod is kube-system, the pod name is nginx-b74766f5f-7582p, the label is app, and the value is nginx. This metric is created by monitoring task example-task, and the number of IPv4 TCP bytes sent from 192.168.1.67:12973 to 192.168.0.89:80 is 1725. The timestamp is 1700538280914.

  • Example 11 (pod-based monitoring result):
    dolphin_tcp_send_pkt{app="nginx",ipfamily="ipv4",pod="kube-system/nginx-b74766f5f-7582p",task="kube-system/example-task"} 14
    dolphin_tcp_send_pkt{app="nginx",ipfamily="ipv6",pod="kube-system/nginx-b74766f5f-7582p",task="kube-system/example-task"} 0

    In the preceding example, the namespace of the pod is kube-system, the pod name is nginx-b74766f5f-7582p, the label is app, and the value is nginx. This metric is created by monitoring task example-task, and the number of IPv4 TCP packets sent by the pod is 14. The pod sent no IPv6 TCP packets.
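
The sample lines in the examples above share one layout: metric name, labels in braces, value, and an optional timestamp. The following sketch splits such a line into its parts. The function name parse_sample is hypothetical, and the parser handles only labelled sample lines like those shown here, not the full Prometheus text format.

```python
import re

# name{labels} value [timestamp]
SAMPLE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)(?:\s+(\d+))?\s*$')
PAIR_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split one labelled sample line into name, labels, value, and timestamp."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a labelled sample line: {line!r}")
    name, raw_labels, value, timestamp = match.groups()
    labels = dict(PAIR_RE.findall(raw_labels))
    # The pod label has the form "<namespace>/<pod name>".
    namespace, _, pod_name = labels.get("pod", "").partition("/")
    return {
        "name": name,
        "labels": labels,
        "namespace": namespace,
        "pod_name": pod_name,
        "value": float(value),
        # Prometheus text-format timestamps are milliseconds since the epoch.
        "timestamp_ms": int(timestamp) if timestamp else None,
    }


line = ('dolphin_flow_tcp_send_byte{app="nginx",dstip="192.168.0.89",dstport="80",'
        'ipfamily="ipv4",pod="kube-system/nginx-b74766f5f-7582p",srcip="192.168.1.67",'
        'srcport="12973",task="kube-system/example-task"} 1725 1700538280914')
sample = parse_sample(line)
print(sample["namespace"], sample["pod_name"], sample["value"], sample["timestamp_ms"])
```

Splitting the pod label at the first slash gives the namespace and pod name that the explanations under each example spell out by hand.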

If the container does not contain the specified label, the label value in the response body is the string not found. The format is as follows:

dolphin_ip4_send_byte_internet{test="not found", pod="default/nginx-66c9c65dbf-zjg24",task="default" } 23618
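
A quick way to spot labels that were requested in podLabel but are absent on a pod is to look for this placeholder value. The sketch below is illustrative only, and the function name missing_labels is hypothetical. Note that a pod whose label value really is the string not found cannot be told apart from a missing label this way.

```python
import re

PAIR_RE = re.compile(r'(\w+)="([^"]*)"')


def missing_labels(sample_line):
    """Return the label names whose value is the literal string "not found"."""
    return [key for key, value in PAIR_RE.findall(sample_line) if value == "not found"]


line = ('dolphin_ip4_send_byte_internet{test="not found", '
        'pod="default/nginx-66c9c65dbf-zjg24",task="default" } 23618')
print(missing_labels(line))  # prints ['test']
```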