Updated on 2025-08-18 GMT+08:00

Cloud Native Log Collection

Description

The Cloud Native Log Collection plug-in (formerly log-agent) is built on Fluent Bit and OpenTelemetry and collects logs and Kubernetes events. It can collect the standard output logs of training and inference instances in a cluster and report them to LTS.

Log Collection Reliability

The main purpose of the log system is to record data from every stage of a service component's lifecycle, including startup, initialization, exit, runtime details, and exceptions. It is primarily used in O&M scenarios for tasks such as checking component status and analyzing fault causes.

Standard streams (stdout and stderr) use non-persistent storage. As a result, data integrity may be compromised by the following risks:

  • Log rotation and compression potentially deleting old files
  • Temporary storage volumes being cleared when Kubernetes pods end
  • Automatic OS cleanup triggered by limited node storage space

While Cloud Native Log Collection employs techniques like multi-level buffering, priority queues, and resumable uploads to enhance log collection reliability, logs could still be lost in the following situations:

  • The service log throughput surpasses the collector's processing capacity.
  • The service pod is abruptly terminated and reclaimed by CCE.
  • The log collector pod experiences exceptions.

Based on the best practices of cloud native log collection, the following suggestions are provided:

  • Use dedicated, highly reliable streams to record critical service data (for example, financial transactions) and store the data in persistent storage.
  • Do not store sensitive information such as customer details, payment credentials, and session tokens in logs.
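
The second suggestion can be applied in application code, before a message ever reaches the standard streams. The following is a minimal sketch, assuming a Python service that uses the standard logging module. The RedactingFilter class, the redact helper, and the key names it masks are hypothetical and are not part of the plug-in.

```python
import logging
import re
import sys

# Hypothetical key names; adjust to the sensitive fields your service handles.
_SENSITIVE = re.compile(r"(?i)\b(token|password|card_number)=([^\s,;]+)")


def redact(message: str) -> str:
    """Mask the value of every key=value pair whose key looks sensitive."""
    return _SENSITIVE.sub(lambda m: f"{m.group(1)}=***", message)


class RedactingFilter(logging.Filter):
    """Logging filter that redacts sensitive values before a record is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = redact(record.getMessage())
        record.args = ()  # the message is already fully formatted
        return True


logger = logging.getLogger("service")
handler = logging.StreamHandler(sys.stdout)  # write to standard output
handler.addFilter(RedactingFilter())
logger.addHandler(handler)
```

Attaching the filter to the handler rather than to individual call sites means every message routed through that handler is masked, including messages from code that was written without redaction in mind.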

Constraints

You are advised to install plug-in version 1.7.3 or later.

Supported CCE versions: v1.21 to v1.32

Plug-in Performance Specifications

| Performance Item | Description | Remarks |
|---|---|---|
| Size of a log | A single log cannot be larger than 512 KB. If multi-line logs are collected, the size of each line is calculated separately. | N/A |
| Maximum number of collected files | On a single node, all log collection rules combined can monitor no more than 4,095 files. | N/A |
| Configuration update | Configuration updates take effect in 1 to 3 minutes. | N/A |
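
The log size limit above can be checked on the application side before a message is written. The sketch below assumes that 512 KB means 512 × 1024 bytes and that size is measured on the UTF-8 encoded text. Neither detail is stated in the specifications, and the oversized_lines function is a hypothetical helper.

```python
MAX_LOG_BYTES = 512 * 1024  # assumption: 1 KB = 1024 bytes


def oversized_lines(message: str, limit: int = MAX_LOG_BYTES) -> list[int]:
    """Return the 0-based indexes of lines whose UTF-8 size exceeds the limit.

    Each line is measured on its own, mirroring the statement that the size
    of each line of a multi-line log is calculated separately.
    """
    return [
        index
        for index, line in enumerate(message.splitlines())
        if len(line.encode("utf-8")) > limit
    ]
```

A service could call this helper on unusually large payloads (for example, serialized request bodies) and truncate or split the offending lines before logging them.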

Installing a Plug-in

Install the specified plug-in in the resource pool.

  1. Log in to the ModelArts console. In the navigation pane on the left, choose Standard Cluster.
  2. Click the resource pool to access its details page.
  3. On the resource pool details page, click the Plug-ins tab.
  4. Locate the plug-in to be installed in the list and click Install.
  5. In the displayed dialog box, configure the parameters.
    Table 1 Parameters for configuring Cloud Native Log Collection

    | Parameter | Sub-Parameter | Description |
    |---|---|---|
    | Specifications | Plug-in Version | Version of Cloud Native Log Collection to be deployed. Version 1.7.3 is supported. |
    | Specifications | Plug-in Specifications | Preset: Select Small or Large. Small is for a cluster that generates a maximum of 5,000 logs per second. Large is for a cluster that generates a maximum of 10,000 logs per second. Custom: Adjust the number of plug-in instances and resource quotas as required. A single instance cannot provide high availability. If an error occurs on the node where the plug-in instance runs, the plug-in will fail. |
    | Specifications | Configuration List | Detailed configurations of the selected specifications. |
    | Parameter Configuration | Log Group | Select a log group from the drop-down list. A log group is the basic unit for LTS to manage logs. |
    | Parameter Configuration | Log Stream | Select a log stream from the drop-down list. A log stream is the basic unit for log reads and writes. If there are many logs to collect, you are advised to separate logs into different log streams based on log types, and to name log streams in an easily identifiable way. |
    | Parameter Configuration | Collect Logical Subpool Logs | Logs for logical subpools are not collected by default. Once this function is enabled, you can collect logs for each logical subpool and set the collection policy. Click Add Logical Pool, then select a created logical pool and the corresponding log group and log stream. |

  6. Read "Usage Notes" and select I have read and understand the preceding information.
  7. Click OK.
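
When choosing between the preset specifications in step 5, the documented thresholds (5,000 and 10,000 logs per second) can be compared with the peak log rate you expect from the cluster. The sketch below is illustrative only: the suggest_spec function is hypothetical, and recommending Custom above 10,000 logs per second is an assumption, not documented guidance.

```python
SMALL_MAX_LOGS_PER_SEC = 5_000   # documented limit of the Small preset
LARGE_MAX_LOGS_PER_SEC = 10_000  # documented limit of the Large preset


def suggest_spec(peak_logs_per_sec: int) -> str:
    """Map an expected peak log rate to a plug-in specification name."""
    if peak_logs_per_sec < 0:
        raise ValueError("peak_logs_per_sec must be non-negative")
    if peak_logs_per_sec <= SMALL_MAX_LOGS_PER_SEC:
        return "Small"
    if peak_logs_per_sec <= LARGE_MAX_LOGS_PER_SEC:
        return "Large"
    # Assumption: beyond the Large preset, size instances and quotas manually.
    return "Custom"
```

Sizing against the peak rate rather than the average leaves headroom for bursts, which matters because logs can be lost when throughput exceeds the collector's processing capacity.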

Components

| Component | Description | Resource Type |
|---|---|---|
| fluent-bit | Lightweight log collector and forwarder deployed on each node to collect logs. In 1.5.0 and later versions, logs are directly reported to LTS. | DaemonSet |
| cop-logs | Generates soft links for collected files and runs in the same pod as fluent-bit. | DaemonSet |
| log-operator | Generates internal configuration files. | Deployment |
| otel-collector | Collects Kubernetes events and reports them to LTS and AOM, and receives logs and reports them to LTS. The log reporting scope depends on the plug-in version. In 1.5.1 and later versions, this component reports only the logs of workloads that are scaled to CCI. | Deployment |

Change History

| Plug-in Version | Supported CCE Cluster Versions | New Feature |
|---|---|---|
| 1.7.3 | v1.21, v1.23, v1.25, v1.27, v1.28, v1.29, v1.30, v1.31 | Collecting standard output logs of containers is supported. |
| 1.7.2 | v1.21, v1.23, v1.25, v1.27, v1.28, v1.29, v1.30, v1.31 | Logs can be compressed in gzip format and sent to LTS. |