Cloud Native Log Collection
When logging is enabled (Enabling Logging), the Cloud Native Log Collection add-on is automatically installed for an on-premises cluster or attached cluster. You can also manually install this add-on by referring to this section. For details about this add-on, see Cloud Native Log Collection.
Introduction
The Cloud Native Log Collection add-on (log-agent) is built on Fluent Bit and OpenTelemetry. It supports CRD-based log collection policies, and it collects and forwards stdout logs, container file logs, node logs, and Kubernetes events from a cluster. After the add-on is installed, stdout logs and Kubernetes events are collected by default. For details about how to use the add-on to collect logs, see Collecting Data Plane Logs.
Log Collection Reliability
The log system's main purpose is to record all stages of data for service components, including startup, initialization, exit, runtime details, and exceptions. It is primarily employed in O&M scenarios for tasks like checking component status and analyzing fault causes.
Standard streams (stdout and stderr) and local log files use non-persistent storage. As a result, log data may be lost due to the following risks:
- Log rotation and compression potentially deleting old files
- Temporary storage volumes being cleared when Kubernetes pods end
- Automatic OS cleanup triggered by limited node storage space
While the Cloud Native Log Collection add-on employs techniques like multi-level buffering, priority queues, and resumable uploads to enhance log collection reliability, logs could still be lost in the following situations:
- The service log throughput surpasses the collector's processing capacity.
- The service pod is abruptly terminated and reclaimed by CCE.
- The log collector pod experiences exceptions.
The following are recommended best practices for cloud native log management:
- Use dedicated, highly reliable streams to record critical service data (for example, financial transactions) and store the data in persistent storage.
- Avoid storing sensitive information like customer details, payment credentials, and session tokens in logs.
Constraints
- This add-on is only available in clusters v1.21 or later.
- A maximum of 50 log collection rules can be configured for each cluster.
- This add-on cannot collect .gz, .tar, and .zip logs.
- If the node storage driver is Device Mapper, container file logs must be collected from the path where the data disk is attached to the node.
- If the container runtime is containerd, each stdout log must be a single line. Multi-line stdout logs are not supported.
- In each cluster, up to 10,000 single-line logs can be collected per second, and up to 2,000 multi-line logs can be collected per second.
- A container must run for longer than 1 minute for its logs to be collected. Otherwise, the logs may be deleted before they are collected.
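The numeric limits above can be turned into a quick pre-deployment check. The following is a minimal sketch, not part of the add-on: `CollectionPlan` and `check_plan` are hypothetical names, and the input numbers are whatever estimates you have for your own cluster.

```python
from dataclasses import dataclass, field

# Limits quoted in the constraints above.
MAX_RULES_PER_CLUSTER = 50
MAX_SINGLE_LINE_LOGS_PER_SECOND = 10_000
MAX_MULTI_LINE_LOGS_PER_SECOND = 2_000
UNSUPPORTED_SUFFIXES = (".gz", ".tar", ".zip")


@dataclass
class CollectionPlan:
    """Hypothetical summary of the expected log collection load of a cluster."""
    rule_count: int
    single_line_logs_per_second: int
    multi_line_logs_per_second: int
    log_paths: list = field(default_factory=list)


def check_plan(plan):
    """Return a list of constraint violations (empty if the plan fits)."""
    problems = []
    if plan.rule_count > MAX_RULES_PER_CLUSTER:
        problems.append(
            f"{plan.rule_count} rules exceed the limit of {MAX_RULES_PER_CLUSTER}")
    if plan.single_line_logs_per_second > MAX_SINGLE_LINE_LOGS_PER_SECOND:
        problems.append("single-line log rate exceeds 10,000 logs per second")
    if plan.multi_line_logs_per_second > MAX_MULTI_LINE_LOGS_PER_SECOND:
        problems.append("multi-line log rate exceeds 2,000 logs per second")
    for path in plan.log_paths:
        # Compressed and archived files are not collected by the add-on.
        if path.endswith(UNSUPPORTED_SUFFIXES):
            problems.append(f"{path}: .gz, .tar, and .zip logs are not collected")
    return problems


if __name__ == "__main__":
    plan = CollectionPlan(12, 8_000, 2_500,
                          ["/var/log/app/app.log", "/var/log/app/old.tar"])
    for problem in check_plan(plan):
        print(problem)
```

A plan that exceeds the throughput limits is also the first situation listed under Log Collection Reliability in which logs may be lost, so a check like this is best run before rules are created.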
Permissions
The fluent-bit component of the Cloud Native Log Collection add-on reads and collects stdout logs, container file logs, and node logs on each node based on the collection configuration.
The following permissions are required for running the fluent-bit component:
- CAP_DAC_OVERRIDE: ignores the discretionary access control (DAC) restrictions on files.
- CAP_FOWNER: ignores the restrictions that the file owner ID must match the process user ID.
- CAP_DAC_READ_SEARCH: ignores the DAC restrictions on reading files and on reading and searching directories.
- CAP_SYS_PTRACE: allows all processes to be traced.
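To confirm that a process actually holds these capabilities, the effective capability mask that Linux exposes in the `CapEff` field of `/proc/<pid>/status` can be decoded. The sketch below is illustrative only and is not shipped with the add-on; the function names are hypothetical, and the bit positions are the ones defined in `linux/capability.h`.

```python
# Bit positions of the required capabilities, as defined in linux/capability.h.
REQUIRED_CAPABILITIES = {
    "CAP_DAC_OVERRIDE": 1,
    "CAP_DAC_READ_SEARCH": 2,
    "CAP_FOWNER": 3,
    "CAP_SYS_PTRACE": 19,
}


def missing_capabilities(cap_eff_hex):
    """Return the required capabilities that are absent from a CapEff hex mask."""
    mask = int(cap_eff_hex, 16)
    return [name for name, bit in REQUIRED_CAPABILITIES.items()
            if not mask & (1 << bit)]


def read_cap_eff(status_path="/proc/self/status"):
    """Read the CapEff hex mask of a process from its status file (Linux only)."""
    with open(status_path) as status_file:
        for line in status_file:
            if line.startswith("CapEff:"):
                return line.split()[1]
    raise ValueError(f"CapEff not found in {status_path}")


if __name__ == "__main__":
    # A sample mask that lacks two of the four required capabilities.
    print(missing_capabilities("00000000a80425fb"))
```

Running `missing_capabilities(read_cap_eff())` inside the fluent-bit container would report any capability that the runtime did not grant.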
Assigning Authorization Before Installing the Cloud Native Log Collection Add-on in an On-Premises Cluster or Attached Cluster
The Cloud Native Log Collection add-on needs to be authenticated before it accesses LTS and AOM. This add-on leverages workload identities to allow workloads in an on-premises cluster or attached cluster to impersonate IAM users to access cloud services.
With workload identities, you add the public key of an on-premises cluster to an IAM IdP and create a rule that maps a ServiceAccount to an IAM account. During workload deployment, the token of the ServiceAccount is mounted to the workload and is used to access cloud services. This way, the AK/SK of the IAM account is not required, which reduces security risks.
- Obtain the JSON Web Key Set (JWKS) of the on-premises cluster or attached cluster. The JWKS contains the public key that matches the cluster's signing private key and is used to verify the ServiceAccount tokens issued by this cluster.
    
    - Access the on-premises cluster or attached cluster using kubectl.
    - Run the following command to obtain the public key:
      
          kubectl get --raw /openid/v1/jwks
      
      A JSON string is returned, containing the signature public key of the cluster for accessing the IdP.
      
          {
            "keys": [
              {
                "kty": "RSA",
                "e": "AQAB",
                "use": "sig",
                "kid": "Ew29q....",
                "alg": "RS256",
                "n": "peJdm...."
              }
            ]
          }
 
- Create an IdP for your on-premises cluster in IAM.
    
    - Log in to the IAM console, query the ID of the project that the on-premises cluster or attached cluster belongs to, create an IdP, and select OpenID Connect for Protocol. Enter the IdP name for log-agent. For details, see Table 1. For details about how to configure permissions for a user group, see User Group Policy Content.
      
      Table 1 log-agent IdP settings
      
      | Add-on Name | IdP Name | Client ID | Namespace | ServiceAccount Name | Minimum Permissions on User Groups |
      |---|---|---|---|---|---|
      | log-agent | ucs-cluster-identity-{Project ID} | ucs-cluster-identity | monitoring | log-agent-serviceaccount | aom:alarm:*, lts:*:* |
      
      Figure 1 Modifying IdP information
    - Click OK and modify the IdP information as described in Table 2. Click Create Rule to create an identity conversion rule.
      
      Figure 2 Modifying IdP information
      
      Table 2 IdP parameters
      
      | Parameter | Description |
      |---|---|
      | Access Type | Select Programmatic access. |
      | Configuration Information | Identity Provider URL: Enter https://kubernetes.default.svc.cluster.local. Client ID: Enter the client ID of log-agent. For details, see Table 1. Signing Key: Enter the JWKS of the on-premises cluster or attached cluster obtained in the first step of this procedure. If multiple clusters are involved, use commas (,) to separate their keys. |
      | Identity Conversion Rules | An identity conversion rule maps a ServiceAccount in an on-premises cluster to an IAM user group. Attribute: sub. Condition: any_one_of. Value: uses the format system:serviceaccount:Namespace:ServiceAccountName. Change Namespace to the namespace for which the ServiceAccount is to be created, and change ServiceAccountName to the name of the ServiceAccount to be created. For example, if the value is system:serviceaccount:monitoring:log-agent-serviceaccount, a ServiceAccount named log-agent-serviceaccount is created in the monitoring namespace and mapped to the corresponding user group. The IAM token obtained using this ServiceAccount has the permissions of the user group. |
      
      NOTE: The ServiceAccount name and user group permissions are mandatory for running add-ons in an on-premises cluster or attached cluster. For details, see Table 1.
      
      Figure 3 Creating an identity conversion rule
    - Click OK.
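The values entered in this procedure follow fixed patterns, so they can be checked before they are typed into the IAM console. The sketch below is illustrative only: the helper names are hypothetical, the JWKS check covers just the fields shown in the sample output, and `matches_any_one_of` assumes that the any_one_of condition means exact membership in the list of configured values.

```python
import json


def signing_key_ids(jwks_text):
    """Return the key IDs of the RSA signing keys found in a JWKS document."""
    jwks = json.loads(jwks_text)
    key_ids = []
    for key in jwks.get("keys", []):
        # Keys without a "use" field are treated as signing keys here.
        if key.get("kty") != "RSA" or key.get("use", "sig") != "sig":
            continue
        missing = [name for name in ("kid", "n", "e") if name not in key]
        if missing:
            raise ValueError(f"JWKS key is missing fields: {missing}")
        key_ids.append(key["kid"])
    return key_ids


def idp_name(project_id):
    """Build the IdP name listed in Table 1 for a given project ID."""
    return f"ucs-cluster-identity-{project_id}"


def service_account_subject(namespace, service_account_name):
    """Build the 'sub' value used in the identity conversion rule."""
    return f"system:serviceaccount:{namespace}:{service_account_name}"


def matches_any_one_of(subject, allowed_values):
    """Assumed semantics of any_one_of: the subject equals one configured value."""
    return subject in allowed_values


if __name__ == "__main__":
    sample = ('{"keys": [{"kty": "RSA", "e": "AQAB", "use": "sig", '
              '"kid": "Ew29q....", "alg": "RS256", "n": "peJdm...."}]}')
    print(signing_key_ids(sample))
    print(service_account_subject("monitoring", "log-agent-serviceaccount"))
```

For the log-agent add-on, the subject built from the Table 1 values is system:serviceaccount:monitoring:log-agent-serviceaccount, which is the value to enter in the identity conversion rule.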
      
Installing the Cloud Native Log Collection Add-on in an On-Premises Cluster or Attached Cluster
- Log in to the UCS console and choose Fleets. Then, click the cluster name to access the cluster console. In the navigation pane, choose Add-ons. Locate Cloud Native Log Collection on the right and click Install.
- In the Install Add-on window, configure the specifications.
    
    Table 3 Add-on specifications
    
    | Parameter | Description |
    |---|---|
    | Add-on Specifications | The add-on specifications can be of the Low, High, or custom-resources type. |
    | Pods | Number of pods that will be created to match the selected add-on specifications. If you select custom-resources, you can adjust the number of pods as required. |
    | Containers | The log-agent add-on contains the following containers, whose specifications can be adjusted as required. fluent-bit: indicates the log collector, which is installed on each node as a DaemonSet. cop-logs: generates and updates configuration files on the collection side. log-operator: parses and updates log collection rules. otel-collector: forwards logs collected by fluent-bit to LTS in a centralized manner. |
 
- Configure the parameters in Parameters.
    
    Interconnection with AOM: If this option is enabled, Kubernetes events will be collected and reported to AOM. You can configure alarm rules on AOM. 
- Configure the network for reporting add-on instance logs.
    
    - Public network: This option features flexibility, cost-effectiveness, and easy access. It is only available for clusters that can access the public network.
    - Direct Connect or VPN: After you connect an on-premises data center to a VPC over Direct Connect or VPN, you can use a VPC endpoint to access CIA over the private network. This option features high speed, low latency, and high security. For details, see Using Direct Connect or VPN to Report Logs of On-Premises Clusters or Attached Clusters.
 
- Click Install.
log-agent Components
| Component | Description | Resource Type | 
|---|---|---|
| fluent-bit | Lightweight log collector and forwarder deployed on each node to collect logs | DaemonSet | 
| cop-logs | Used to generate soft links for collected files and run in the same pod as fluent-bit | DaemonSet | 
| log-operator | Used to generate internal configuration files | Deployment | 
| otel-collector | Used to collect logs from applications and services and report the logs to LTS | Deployment | 
Change History
| Add-on Version | Supported Cluster Version | New Feature | 
|---|---|---|
| 1.4.1 | v1.21 v1.22 v1.23 v1.24 v1.25 v1.26 v1.27 v1.28 v1.29 v1.30 v1.31 | This is the first official release. It can be installed in the on-premises clusters. | 
Reporting Custom Events to AOM
The log-agent add-on reports all warning events and some normal events to AOM. You can also set the events to be reported as required.
- Run the following command on the cluster to modify the event collection settings:
- Modify the event collection settings as required.
    apiVersion: logging.openvessel.io/v1
    kind: LogConfig
    metadata:
      annotations:
        helm.sh/resource-policy: keep
      name: default-event-aom
      namespace: kube-system
    spec:
      inputDetail:            # Settings on UCS from which events are collected
        type: event           # Type of logs to be collected. Do not change the value.
        event:
          normalEvents:       # Used to configure normal events
            enable: true      # Whether to enable normal event collection
            includeNames:     # Names of events to be collected. If this parameter is not specified, all events will be collected.
            - NotTriggerScaleUp
            excludeNames:     # Names of events that are not collected. If this parameter is not specified, all events will be collected.
            - NotTriggerScaleUp
          warningEvents:      # Used to configure warning events
            enable: true      # Whether to enable warning event collection
            includeNames:     # Names of events to be collected. If this parameter is not specified, all events will be collected.
            - NotTriggerScaleUp
            excludeNames:     # Names of events that are not collected. If this parameter is not specified, all events will be collected.
            - NotTriggerScaleUp
      outputDetail:
        type: AOM             # Type of the system that receives the events. Do not change the value.
        AOM:
          events:
          - name: DeleteNodeWithNoServer   # Event name. This parameter is mandatory.
            resourceType: Namespace        # Type of the resource that operations are performed on.
            severity: Major                # Event severity after an event is reported to AOM, which can be Critical, Major, Minor, or Info. The default value is Major.
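If the event settings are generated by a script rather than edited by hand, the same structure can be assembled programmatically. The following is a minimal sketch under that assumption: `event_selector` and `build_event_logconfig` are hypothetical helpers, and the only validation they perform is the mandatory event name and the four severity values listed in the settings above.

```python
import json

# Severity values accepted by AOM according to the settings above.
ALLOWED_SEVERITIES = {"Critical", "Major", "Minor", "Info"}


def event_selector(enable=True, include_names=None, exclude_names=None):
    """Build a normalEvents or warningEvents block; unset lists are omitted."""
    selector = {"enable": enable}
    if include_names:
        selector["includeNames"] = list(include_names)
    if exclude_names:
        selector["excludeNames"] = list(exclude_names)
    return selector


def build_event_logconfig(normal_events, warning_events, aom_events):
    """Assemble the default-event-aom LogConfig as a plain dictionary."""
    for event in aom_events:
        if "name" not in event:
            raise ValueError("each AOM event needs a name")
        if event.get("severity", "Major") not in ALLOWED_SEVERITIES:
            raise ValueError(f"unsupported severity: {event['severity']}")
    return {
        "apiVersion": "logging.openvessel.io/v1",
        "kind": "LogConfig",
        "metadata": {
            "annotations": {"helm.sh/resource-policy": "keep"},
            "name": "default-event-aom",
            "namespace": "kube-system",
        },
        "spec": {
            "inputDetail": {
                "type": "event",  # Do not change the value.
                "event": {
                    "normalEvents": normal_events,
                    "warningEvents": warning_events,
                },
            },
            "outputDetail": {
                "type": "AOM",  # Do not change the value.
                "AOM": {"events": list(aom_events)},
            },
        },
    }


if __name__ == "__main__":
    config = build_event_logconfig(
        event_selector(include_names=["NotTriggerScaleUp"]),
        event_selector(exclude_names=["NotTriggerScaleUp"]),
        [{"name": "DeleteNodeWithNoServer", "resourceType": "Namespace",
          "severity": "Major"}],
    )
    print(json.dumps(config, indent=2))
```

The printed JSON has the same fields as the YAML settings, so it can be reviewed against them before any change is made to the cluster.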
log-agent Events
The log-operator component reports events while log-agent is being installed and while it is running. You can use these events to check whether log-agent was installed successfully and to locate the causes of faults. For details, see Table 6.
| Event Name | Description | 
|---|---|
| InitLTSFailed | Failed to initialize the log streams in the LTS log group. | 
| WatchAKSKFailed | Failed to watch the AK/SK. | 
| WatchAKSKSuccessful | AK/SK watch started. | 
| RequestLTSFailed | Failed to request the LTS interface. | 
| InitLTSSuccessful | Log streams in the LTS log group initialized. | 
| CreateWebhookConfigFailed | Failed to create MutatingWebhookConfiguration. | 
| CreateWebhookConfigSuccessful | MutatingWebhookConfiguration created. | 
| StartServerSuccessful | Listening enabled. | 
| StartServerFailed | Failed to enable listening. | 
| StartManagerFailed | Failed to enable CRD listening. | 
| InjectAnnotationFailed | Failed to inject annotations. | 
| InjectAnnotationSuccessful | Annotations injected. | 
| UpdateLogConfigFailed | Failed to update the logconfig information. | 
| GetConfigListFailed | Failed to obtain the CR list. | 
| GenerateConfigFailed | Failed to generate the fluent-bit and otel settings. | 
log-agent Metrics
The log-operator, fluent-bit, and otel-collector components of the log-agent add-on expose metrics. You can use AOM or Prometheus to monitor these metrics and check the running status of the log-agent add-on. For details, see Monitoring Custom Metrics Using AOM or Monitoring Custom Metrics Using Prometheus. The metrics are as follows:
- log-operator (only for Huawei Cloud clusters)
    
    Address: /metrics
    
    Protocol: HTTPS
    
    Table 7 Metrics
    
    | Metric | Description | Type |
    |---|---|---|
    | log_operator_aksk_latest_update_times | Last update time of the AK/SK | Gauge |
    | log_operator_aksk_update_total | Cumulative count of AK/SK update times | Counter |
    | log_operator_send_request_total | Cumulative count of requests that have been sent | Counter |
    | log_operator_webhook_listen_status | Webhook listening status | Gauge |
    | log_operator_http_request_duration_seconds | HTTP request latency | Histogram |
    | log_operator_http_request_total | Cumulative count of HTTP requests | Counter |
    | log_operator_webhook_request_total | Cumulative count of webhook requests | Counter |
- fluent-bit
    
    Address: /api/v1/metrics/prometheus
    
    Protocol: HTTP
    
    Table 8 Metrics
    
    | Metric | Description | Type |
    |---|---|---|
    | fluentbit_filter_add_records_total | Number of log records that the filter has successfully ingested | Counter |
    | fluentbit_filter_drop_records_total | Number of log records that have been dropped by the filter | Counter |
    | fluentbit_input_bytes_total | Number of bytes of log records that the input instance has successfully ingested | Counter |
    | fluentbit_input_files_closed_total | Total number of files closed by the input instance | Counter |
    | fluentbit_input_files_opened_total | Total number of files opened by the input instance | Counter |
    | fluentbit_input_files_rotated_total | Total number of files rotated by the input instance | Counter |
    | fluentbit_input_records_total | Number of log records the input instance has successfully ingested | Counter |
    | fluentbit_output_dropped_records_total | Number of log records that have been dropped by the output instance | Counter |
    | fluentbit_output_errors_total | Number of chunks that have faced an error | Counter |
    | fluentbit_output_proc_bytes_total | Number of bytes of log records that the output instance has successfully sent | Counter |
    | fluentbit_output_proc_records_total | Number of log records the output instance has successfully sent | Counter |
    | fluentbit_output_retried_records_total | Number of log records that experienced a retry | Counter |
    | fluentbit_output_retries_total | Number of times the output instance requested a retry for a chunk | Counter |
    | fluentbit_uptime | Number of seconds that Fluent Bit has been running | Counter |
    | fluentbit_build_info | Build and version information of Fluent Bit | Gauge |
- otel-collector
    
    Address: /metrics
    
    Protocol: HTTP
    
    Table 9 Metrics
    
    | Metric | Description | Type |
    |---|---|---|
    | otelcol_exporter_enqueue_failed_log_records | Number of log records failed to be added to the sending queue | Counter |
    | otelcol_exporter_enqueue_failed_metric_points | Number of metric points failed to be added to the sending queue | Counter |
    | otelcol_exporter_enqueue_failed_spans | Number of spans failed to be added to the sending queue | Counter |
    | otelcol_exporter_send_failed_log_records | Number of log records failed to be sent | Counter |
    | otelcol_exporter_sent_log_records | Number of log records that have been sent | Counter |
    | otelcol_process_cpu_seconds | Total CPU user and system time in seconds | Counter |
    | otelcol_process_memory_rss | Total physical memory (resident set size) | Gauge |
    | otelcol_process_runtime_heap_alloc_bytes | Bytes of allocated heap objects | Gauge |
    | otelcol_process_runtime_total_alloc_bytes | Cumulative bytes allocated for heap objects | Counter |
    | otelcol_process_runtime_total_sys_memory_bytes | Total bytes of memory obtained from the OS | Gauge |
    | otelcol_process_uptime | Uptime of the process in seconds | Counter |
    | otelcol_receiver_accepted_log_records | Number of log records received and processed by the OpenTelemetry receiver | Counter |
    | otelcol_receiver_refused_log_records | Number of log records rejected by the OpenTelemetry receiver | Counter |
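These endpoints return metrics in the Prometheus text format, so a scraped response can also be inspected without a full monitoring stack. The sketch below is a simplified illustration, not a complete parser: the function names are hypothetical, the `name="otel.0"` label in the sample is made up, and the drop ratio is one possible health indicator derived from two fluent-bit counters in Table 8.

```python
def parse_prometheus_text(text):
    """Sum the sample values per metric name in Prometheus text-format output."""
    totals = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # Skip blank lines and HELP/TYPE comments.
        if "{" in line:
            # Sample with labels: name{labels} value [timestamp]
            name = line.split("{", 1)[0]
            value_part = line.rpartition("}")[2]
        else:
            # Sample without labels: name value [timestamp]
            name, _, value_part = line.partition(" ")
        value = float(value_part.split()[0])
        totals[name] = totals.get(name, 0.0) + value
    return totals


def output_drop_ratio(totals):
    """Share of records dropped by fluent-bit outputs among dropped plus sent."""
    dropped = totals.get("fluentbit_output_dropped_records_total", 0.0)
    sent = totals.get("fluentbit_output_proc_records_total", 0.0)
    attempted = dropped + sent
    return dropped / attempted if attempted else 0.0


if __name__ == "__main__":
    sample = """\
# TYPE fluentbit_output_proc_records_total counter
fluentbit_output_proc_records_total{name="otel.0"} 950
fluentbit_output_dropped_records_total{name="otel.0"} 50
fluentbit_uptime 120
"""
    totals = parse_prometheus_text(sample)
    print(f"drop ratio: {output_drop_ratio(totals):.2%}")
```

A rising drop ratio suggests that log throughput is exceeding what the collector can process, which is one of the loss scenarios described under Log Collection Reliability.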