Updated on 2025-09-05 GMT+08:00

Intelligent Risk Detection for OpenSearch Clusters

CSS provides intelligent O&M, which detects potential risks in clusters and offers suggestions for handling them.

Scenarios

Intelligent O&M for clusters supports the following functions:

  • Creating a Scan Task: Start a scan task to trigger an intelligent health check and diagnosis on the current cluster.
  • Checking the Risk Items of a Cluster: After a scan task is completed, check the risks identified by the scan. Then confirm and handle these risks in a timely manner based on the risk handling suggestions.
  • Deleting a Scan Task: Delete scan tasks that you no longer need. After a scan task is deleted, the system deletes all diagnoses generated by it.

The check items of intelligent O&M are as follows:

  • Check the current health status of a cluster. Red: Some primary shards are not allocated. Yellow: Some replica shards are not allocated. Green: All shards have been allocated.
  • Check the number of nodes and the number of AZs of a distributed OpenSearch cluster to evaluate its high-availability status.
  • Check whether index replicas are enabled. An index with no replicas may become unavailable in the case of a node failure. If local disks are used, this may even lead to data loss.
  • Check for Kibana index conflicts in a cluster.
  • Check disk usage. If the disk usage of a node is too high, new index shards may fail to be allocated to the node and the cluster performance may be affected.
  • Check whether the storage usage of a cluster's data nodes or cold data nodes is balanced. Unbalanced storage distribution may result in unbalanced cluster loads and increase read/write latency.
  • Check whether any node in the cluster has been disconnected or unavailable for 5 consecutive minutes.
  • Check for nodes with too many shards. Too many shards on a node consume excessive node resources, which increases read/write latency and slows down metadata updates.
  • Check the sizes of all shards. A large shard may impact performance, occupy excessive node memory, and slow down shard restoration during cluster scaling or fault recovery.
  • Check for new versions the current cluster may upgrade to.
  • Check for snapshot creation failures or the absence of any snapshot creation records in the last seven days.
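
A few of the check items above can be sketched in code to make their logic concrete. This is a minimal illustration, not the CSS implementation: the function names, the input shapes, and the 85% disk threshold are assumptions chosen for the example, and the one-sample-per-minute availability list is a hypothetical data source.

```python
from typing import Dict, List


def health_color(unassigned_primaries: int, unassigned_replicas: int) -> str:
    """Map unallocated shard counts to the red/yellow/green status."""
    if unassigned_primaries > 0:
        return "red"      # some primary shards are not allocated
    if unassigned_replicas > 0:
        return "yellow"   # some replica shards are not allocated
    return "green"        # all shards have been allocated


def indices_without_replicas(replica_counts: Dict[str, int]) -> List[str]:
    """Return the names of indices whose replica count is zero."""
    return sorted(name for name, n in replica_counts.items() if n == 0)


def nodes_over_disk_threshold(
    disk_usage_pct: Dict[str, float],
    threshold_pct: float = 85.0,  # assumed threshold, not a documented CSS value
) -> List[str]:
    """Return the nodes whose disk usage is at or above the threshold."""
    return sorted(n for n, pct in disk_usage_pct.items() if pct >= threshold_pct)


def longest_outage_minutes(reachable_per_minute: List[bool]) -> int:
    """Length of the longest run of consecutive unreachable samples.

    Assumes one sample per minute; True means the node was reachable.
    """
    longest = current = 0
    for reachable in reachable_per_minute:
        current = 0 if reachable else current + 1
        longest = max(longest, current)
    return longest


def is_disconnected(reachable_per_minute: List[bool], window: int = 5) -> bool:
    """True if the node was unreachable for `window` consecutive minutes."""
    return longest_outage_minutes(reachable_per_minute) >= window


if __name__ == "__main__":
    print(health_color(0, 2))                                  # yellow
    print(indices_without_replicas({"logs": 0, "metrics": 1}))  # ['logs']
    print(is_disconnected([False] * 5))                         # True
```

Keeping each check as a small pure function means the thresholds stay explicit and each check can be exercised without a live cluster.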

Granting Access to SMN

To send alarm notifications via SMN after a scan task is completed, you must first be granted access to the SMN service. Additionally, you must create a topic on the SMN console in advance. For details, see Creating a Topic.

  1. Log in to the CSS management console.

    You must log in using a CSS administrator account.

  2. In the navigation pane, choose Service Authorization.
  3. On the Service Authorization page, click Create Agency. In the dialog box displayed, confirm that the agency is successfully created.
    • If an agency has been created, "css_smn_agency exist, no need to created." is displayed in the upper right corner.
    • If you do not have the permission to create an agency, an error message indicating "no permission" is displayed in the upper right corner. In this case, check that the administrator account has been assigned the IAM permission.

Creating a Scan Task

  1. Log in to the CSS management console.
  2. In the navigation pane on the left, choose Clusters > OpenSearch.
  3. In the cluster list, click the name of the target cluster. The cluster information page is displayed.
  4. Choose Monitoring > Intelligent O&M.
  5. On the Intelligent O&M tab, click Scan in the upper-left corner. In the displayed dialog box, configure the scan task.
    Table 1 Configuring a scan task

    Parameter | Description
    Name | Custom name of the scan task. The name must contain 4 to 64 characters and start with a lowercase letter. Only lowercase letters, digits, hyphens (-), and underscores (_) are allowed.
    Description | Brief description of the task.
    Send SMN notification upon task completion | Whether to send an SMN notification upon completion of the scan. If selected, you must also configure SMN Topic and Notification Level, and an SMN notification is sent when the scan detects a risk at or above the notification level you set. If deselected (default), no SMN notification is sent upon completion of the scan.
    SMN Topic | SMN topic for the notification. Required if you select Send SMN notification upon task completion.
    Notification Level | Risk level that triggers a notification. Required if you select Send SMN notification upon task completion. If the scan result contains risks at this level or higher, SMN sends a notification that lists all the risk items in the result.

  6. After the configuration is complete, click OK.
  7. When the status of the scan task changes to Completed, you can check the cluster's risk items.
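
The naming rule and the notification rule described in Table 1 can be sketched as follows. This is an illustrative sketch only: the function names, the risk record shape, and the level names ("low", "medium", "high") are hypothetical and are not taken from the CSS console or API.

```python
import re
from typing import Dict, List

# 4 to 64 characters, starting with a lowercase letter; only lowercase
# letters, digits, hyphens, and underscores are allowed.
NAME_PATTERN = re.compile(r"[a-z][a-z0-9_-]{3,63}")

# Hypothetical risk levels, ordered from least to most severe.
RISK_ORDER = {"low": 1, "medium": 2, "high": 3}


def is_valid_task_name(name: str) -> bool:
    """Check a scan task name against the naming rule in Table 1."""
    return NAME_PATTERN.fullmatch(name) is not None


def risks_to_notify(
    risks: List[Dict[str, str]], notification_level: str
) -> List[Dict[str, str]]:
    """Return the risk items a notification would list.

    If at least one risk is at or above the notification level, the
    notification lists all risk items in the result; otherwise nothing
    is sent and an empty list is returned.
    """
    threshold = RISK_ORDER[notification_level]
    if any(RISK_ORDER[risk["level"]] >= threshold for risk in risks):
        return list(risks)
    return []


if __name__ == "__main__":
    print(is_valid_task_name("scan_01"))  # True
    print(is_valid_task_name("1scan"))    # False
```

Note that the notification lists every risk item once the threshold is met, not only the items at or above the level, which matches the Notification Level description.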

Checking the Risk Items of a Cluster

  1. On the Intelligent O&M page, select a completed scan task, and click the icon to its left to check the task details, including the task creation time, summary, ID, and the risk items found by the task.
    Figure 1 Checking scan task details
  2. Click the icon to the left of a risk item to check its details, including the check item, risk description, and risk handling suggestion.

    You can handle cluster risks in a timely manner based on the suggestions provided by the system.

    Figure 2 Risk items
  3. Select a scan task, and click Export Risk in the Operation column to download the scan result.

Deleting a Scan Task

You can delete scan tasks that are no longer needed. Deleting a scan task also deletes all of its diagnoses and its scan report.

  1. On the Intelligent O&M page, locate the scan task you want to delete, and click Delete in the Operation column.
  2. In the displayed dialog box, enter DELETE, and click OK to delete the task.