Updated on 2026-06-12 GMT+08:00

Intelligent Diagnostics

During routine operations, you need to periodically check the health of your OpenSearch clusters to keep services stable. However, manual health checks are time-consuming and labor-intensive, and key issues are easily missed. To address this, CSS provides the Intelligent O&M feature, which supports both scheduled and on-demand diagnostics and covers essential checks such as data node disk usage, cluster health status, and node CPU usage. These checks help O&M personnel detect and handle potential risks promptly, keeping clusters reliable and available. By default, Intelligent O&M retains the 10 most recent diagnostic reports, so you can always review the latest results. You can also export any report for further analysis or long-term archiving.

Constraints

  • Intelligent diagnostics cannot be enabled while the cluster is in any of the following statuses: frozen, creating, or creation failed.
  • While a diagnostic task is in progress, you cannot manually start another task.
  • Running diagnostics while a cluster's configuration is being changed (for example, while the cluster is being scaled up or down, or its node specifications are being changed) may lead to inaccurate results. Avoid running diagnostics during such changes.
  • When intelligent diagnostics is enabled, the system is allowed by default to read cluster settings and data configuration information for diagnostic or analytical purposes.
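
The first two constraints amount to a simple precheck before a task is started. The following is a minimal sketch of that logic only, not the service's implementation: the function name, the status strings, and the `task_in_progress` flag are all hypothetical and chosen for illustration.

```python
from typing import Tuple

# Hypothetical status names; the exact strings used by the console may differ.
BLOCKED_STATUSES = {"frozen", "creating", "creation failed"}


def can_start_diagnostics(cluster_status: str, task_in_progress: bool) -> Tuple[bool, str]:
    """Return (allowed, reason) based on the constraints listed above."""
    if cluster_status.strip().lower() in BLOCKED_STATUSES:
        return False, f"cluster status is '{cluster_status}'"
    if task_in_progress:
        return False, "another diagnostic task is still running"
    return True, "ok"
```

A caller would check the returned flag before triggering a manual task and surface the reason to the user if the task is refused.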

Diagnostic Modes

Two diagnostic modes are available: scheduled and manual. Choose one based on your service requirements.

Table 1 Comparing the two diagnostic modes

| Dimension | Scheduled Diagnostics | Manual Diagnostics |
| --- | --- | --- |
| Triggering method | Automatically performed daily by schedule | Manually triggered and executed right away |
| When to use | Routine inspection, creating a health baseline, and long-term trend observation | Quick troubleshooting in response to cluster failures or errors, or cluster status verification before and after change operations |
| Typical users | O&M teams (routine maintenance) | R&D or O&M personnel (troubleshooting) |
| Diagnostic type | Full check; Cluster unavailability check | Full check; Cluster unavailability check |

Diagnostic Items

Table 2 lists the diagnostic items supported by Intelligent O&M.

Table 2 Diagnostic items

No.

Diagnostic Item

Description

1

Data Node Disk Usage Check

Identify data nodes (including cold data nodes) with the highest disk usage in the cluster. Excessive disk usage can affect cluster stability.

  • High risk: When a node's disk usage reaches 85%, no new index replicas will be allocated to it.
  • Medium risk: When a node's disk usage exceeds 80%, take immediate action to prevent further increase.

2

Data Node Disk Usage Balance Check

Check whether disk usage is balanced across all regular data nodes and cold data nodes. Unbalanced disk usage may cause performance issues.

  • High risk: Maximum disk usage difference between nodes ≥ 40%
  • Medium risk: 25% ≤ Maximum disk usage difference between nodes < 40%

3

Large Index Shard Check

Check for large index shards that may cause cluster load imbalance.

  • High risk: Shard storage ≥ 50 GB and the index has only one primary shard.
  • Medium risk: Shard storage ≥ 50 GB and the number of shards is not divisible by the number of data nodes or cold data nodes.

4

Cluster Node Configuration Check

Check the high availability settings of clusters.

  • Single/dual node risk: For clusters without dedicated master nodes, at least three data nodes (including cold data nodes) are required to keep the cluster available if a single node fails.
  • Overloaded master node risk: When there are 10 or more data nodes, dedicated master nodes should be configured to prevent excessive load on the master node.
  • Single client node risk: If a cluster contains only one dedicated client node, the cluster may become inaccessible when that node fails.

5

Cluster Health Check

Check the overall health status of all indices in the cluster.

  • High risk (red): There are unallocated primary shards (i.e., some indices are unavailable).
  • Medium risk (yellow): Primary shards are properly allocated, but replicas are missing (compromised data availability).
  • Normal (green): All primary and replica shards are properly allocated.

6

Maximum Per-Node Shard Quantity Check

Check that the number of shards allocated to each node remains within the limit set by cluster.max_shards_per_node, preventing shard allocation failures.

  • High risk: Maximum number of shards allocated to a node ≥ 90% of the per-node shard limit
  • Medium risk: 85% of the per-node shard limit ≤ Maximum number of shards allocated to a node < 90% of the per-node shard limit

7

Oversized Shard Check

Check for shards that have reached or exceeded 100 GB in the cluster. Oversized shards can increase query latency and slow down fault recovery. You are advised to keep the shard size under 50 GB.

  • High risk: Maximum shard size ≥ 500 GB
  • Medium risk: 100 GB ≤ Maximum shard size < 500 GB

8

Hot Index Read/Write Check (Every Minute)

Identify the 10 indices with the most query and write requests within the last minute and display them in descending order. For each of these identified indices, verify that the number of its primary shards is divisible by the number of data nodes or cold data nodes to prevent load imbalance.

9

Node Disconnection Check

Check the connectivity of cluster nodes. Node disconnection will severely compromise cluster performance and therefore needs to be handled immediately.

10

Balanced Node Connections Check

Check the number of connections to port 9200 on the client nodes (or data nodes if there are no client nodes), and display nodes by their connection count in descending order.

11

Client Connection Check (Source IP Address Analysis)

Check the number of connections to port 9200 on the client nodes (or data nodes if there are no client nodes), and display all external source IP addresses by connection count in descending order as well as details about the top 10 connections.

12

Node CPU Usage Check (Interval: 1 Minute)

Measure the CPU usage of each node twice every minute, and calculate the average for each node to evaluate its load status.

  • High risk: Average CPU usage ≥ 90% or no data is obtained.
  • Medium risk: 80% ≤ Average CPU usage < 90%

13

JVM Heap Memory Usage Check (Interval: 1 Minute)

Measure the JVM heap memory usage of each node twice every minute, and calculate the average for each node to evaluate its load status.

  • High risk: Average JVM heap memory usage ≥ 90% or no data is obtained.
  • Medium risk: 80% ≤ Average JVM heap memory usage < 90%

14

Local-Disk Cluster Index Replica Check

For a cluster consisting of nodes that use local disks, check whether all indices have replicas. Otherwise, a single disk failure can lead to permanent data loss.
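
Several items in Table 2 share the same pattern: a measured value is compared against a medium-risk and a high-risk threshold. The sketch below illustrates that pattern for items 2, 3, 6, and 12 using the thresholds from the table. It is an illustration under assumptions, not the service's code: all function names are hypothetical, disk balance is computed as the maximum minus the minimum usage within one group of nodes, and "number of shards" in item 3 is taken to mean primary shards.

```python
from typing import Dict, List, Optional


def classify(value: Optional[float], medium: float, high: float,
             missing_is_high: bool = False) -> str:
    """Map a percentage to 'high', 'medium', or 'normal' risk."""
    if value is None:
        # Items 12 and 13 treat missing data as high risk.
        return "high" if missing_is_high else "unknown"
    if value >= high:
        return "high"
    if value >= medium:
        return "medium"
    return "normal"


def disk_balance_risk(disk_usage_by_node: Dict[str, float]) -> str:
    """Item 2: spread (in percentage points) between the fullest and emptiest node."""
    usages = list(disk_usage_by_node.values())
    return classify(max(usages) - min(usages), medium=25, high=40)


def cpu_risk(samples: List[float]) -> str:
    """Item 12: average of the collected samples; no samples means high risk."""
    average = sum(samples) / len(samples) if samples else None
    return classify(average, medium=80, high=90, missing_is_high=True)


def shard_count_risk(max_shards_on_node: int, per_node_limit: int) -> str:
    """Item 6: shards on the fullest node as a percentage of the per-node limit."""
    return classify(100 * max_shards_on_node / per_node_limit, medium=85, high=90)


def large_shard_risk(largest_shard_gb: float, primary_shards: int, node_count: int) -> str:
    """Item 3: large shards combined with a shard count that cannot spread evenly."""
    if largest_shard_gb < 50:
        return "normal"
    if primary_shards == 1:
        return "high"
    if primary_shards % node_count != 0:
        return "medium"
    return "normal"
```

For example, `disk_balance_risk({"node-1": 30, "node-2": 72})` evaluates a spread of 42 percentage points and returns `"high"`, while `large_shard_risk(60, 6, 3)` returns `"normal"` because six shards divide evenly across three nodes.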

Configuring Scheduled Diagnostics

Configuring daily scheduled diagnostics enables continuous health risk monitoring for clusters without human intervention.

  1. Log in to the CSS management console.
  2. In the navigation pane on the left, choose Clusters > OpenSearch.
  3. In the cluster list, click the name of the target cluster. The cluster information page is displayed.
  4. Choose Intelligent O&M > Intelligent Diagnostics.
  5. On the Intelligent Diagnostics page, click Enable Scheduled Diagnostics. In the displayed dialog box, configure a scheduled diagnostics task.
    Table 3 Enabling scheduled diagnostics

    Parameter

    Description

    Diagnostic Type

    Select a diagnostic type.

    • Full check: All check items are supported. You can select all or only some of the check items.
    • Cluster unavailability check: When a cluster is unavailable, run this check to quickly assess its health status.

    Diagnostic Item

    Select diagnostic items to be included in the scheduled task.

    • When Diagnostic Type is set to Full check, you can customize the diagnostic items.
    • When Diagnostic Type is set to Cluster unavailability check, the cluster unavailability check items are selected by default.

    Time Zone

    Select your current time zone. The start time of the scheduled diagnostic task is based on this time zone.

    Diagnosed

    Select the time to start the scheduled diagnostic task every day.

  6. Click OK to save the task information.

    The scheduled task information is displayed on the Intelligent Diagnostics page, and daily reports generated by the task will be displayed below on this page or on the Historical Reports tab.

    Figure 1 Viewing a scheduled diagnostics task

    The report will show high-risk items, medium-risk items, normal items, and reminders, as well as the results and optimization suggestions for each diagnostic item. High- and medium-risk items should be prioritized.

  7. On the Intelligent Diagnostics page, manage the scheduled diagnostic task, including modifying and disabling it.
    • Modify the scheduled diagnostic task: Click Modify Settings. In the displayed dialog box, modify the settings and click OK to update the scheduled diagnostic policy.
    • Disable the scheduled diagnostic task: Click Disable Scheduled Diagnostics. In the displayed dialog box, click OK to disable scheduled diagnostics. The scheduled task is removed from the Intelligent Diagnostics page.
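
Because the daily start time depends on the selected time zone, it can help to work out when the next run falls in UTC, for example when correlating reports with logs. The sketch below shows one way to do that calculation. It is a standalone illustration, not how CSS schedules tasks: the function name is hypothetical, and the time zone is modeled as a fixed UTC offset such as GMT+08:00.

```python
from datetime import datetime, time, timedelta, timezone


def next_run_utc(start: time, utc_offset_hours: float, now_utc: datetime) -> datetime:
    """Return the next daily occurrence of `start` (local wall time) in UTC."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    now_local = now_utc.astimezone(tz)
    candidate = datetime.combine(now_local.date(), start, tzinfo=tz)
    if candidate <= now_local:
        # Today's slot has already passed, so the next run is tomorrow.
        candidate += timedelta(days=1)
    return candidate.astimezone(timezone.utc)
```

With a start time of 02:00 in GMT+08:00, a task evaluated at 20:00 UTC on 2026-06-12 (04:00 local time on 2026-06-13) would next run at 18:00 UTC on 2026-06-13.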

Performing Manual Diagnostics

Perform a one-time manual diagnostic for quick troubleshooting or to verify cluster status. This is recommended when your cluster experiences performance issues or errors, or after changes are made to your cluster.

  1. Log in to the CSS management console.
  2. In the navigation pane on the left, choose Clusters > OpenSearch.
  3. In the cluster list, click the name of the target cluster. The cluster information page is displayed.
  4. Choose Intelligent O&M > Intelligent Diagnostics.
  5. On the Intelligent Diagnostics page, configure a diagnostic task and click Run Diagnostics to start the task.
    Figure 2 Manual diagnostics
    Table 4 Manual diagnostics

    Parameter

    Description

    Diagnostic Type

    Select a diagnostic type.

    • Full check: All check items are supported. You can select all or only some of the check items.
    • Cluster unavailability check: When a cluster is unavailable, run this check to quickly assess its health status.

    Diagnostic Item

    Select diagnostic items.

    • When Diagnostic Type is set to Full check, you can customize the diagnostic items.
    • When Diagnostic Type is set to Cluster unavailability check, the cluster unavailability check items are selected by default.
  6. Wait for 1 to 3 minutes. The diagnostic report will be displayed below.

    The report will show high-risk items, medium-risk items, normal items, and reminders, as well as the results and optimization suggestions for each diagnostic item. High- and medium-risk items should be prioritized.

  7. Click Export Report to save the diagnostic report to your local PC for further analysis or archiving.

Viewing Historical Reports

Historical diagnostic reports help you compare cluster health across different periods, which aids capacity planning and problem diagnosis.

A maximum of 10 historical reports can be retained in total, counting both scheduled and manual diagnostic tasks. When this limit is exceeded, the oldest reports are deleted automatically. Export and save important reports promptly.
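
The retention rule behaves like a fixed-size queue: each new report pushes out the oldest one once the limit is reached. The following sketch models that behavior so you can reason about when a report needs to be exported. The class and method names are hypothetical, and the code does not interact with CSS.

```python
from collections import deque
from typing import List, Optional

MAX_REPORTS = 10  # limit stated above, shared by scheduled and manual tasks


class ReportHistory:
    """Keeps only the most recent reports, evicting the oldest first."""

    def __init__(self, limit: int = MAX_REPORTS) -> None:
        self._reports = deque(maxlen=limit)

    def add(self, report_id: str) -> Optional[str]:
        """Store a report and return the ID evicted to make room, if any."""
        is_full = len(self._reports) == self._reports.maxlen
        evicted = self._reports[0] if is_full else None
        self._reports.append(report_id)  # deque drops the oldest entry when full
        return evicted

    def ids(self) -> List[str]:
        """Return retained report IDs, oldest first."""
        return list(self._reports)
```

With one scheduled report per day and no manual runs, a report in this model is evicted when the eleventh report is generated, so exports would need to happen within that window.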

  1. Log in to the CSS management console.
  2. In the navigation pane on the left, choose Clusters > OpenSearch.
  3. In the cluster list, click the name of the target cluster. The cluster information page is displayed.
  4. Choose Intelligent O&M > Historical Reports.
  5. Set Created to filter diagnostic reports by time range.
    Figure 3 Selecting historical reports
  6. Click Export Report on the right to save the current historical report to the local PC.