Intelligent Diagnostics
During routine operations, you must periodically monitor the health of your OpenSearch clusters to ensure service stability. However, manual health checks are time-consuming and labor-intensive, and key issues are easily missed. To address this, CSS provides the Intelligent O&M feature. It supports both scheduled and on-demand diagnostics, covering essential checks such as data node disk usage, cluster health status, and node CPU usage. These checks help O&M personnel detect and handle potential risks promptly to ensure cluster reliability and availability. By default, Intelligent O&M retains the most recent 10 diagnostic reports, so you can always review the latest results. You can also export any report for further analysis or long-term archiving.
Constraints
- Intelligent diagnostics cannot be enabled while the cluster status is frozen, creating, or creation failed.
- While a diagnostic task is in progress, you cannot manually start another task.
- Running diagnostics on a cluster that is undergoing a configuration change (for example, being scaled up or down, or having its node specifications changed) may lead to inaccurate results. Avoid running diagnostics during such changes.
- When intelligent diagnostics is enabled, the system is allowed by default to read cluster settings and data configuration information for diagnostic or analytical purposes.
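The first two constraints amount to a precondition check before a task is started. The following is a minimal sketch of that logic, not the CSS implementation: the function name `can_start_diagnostics` and the lowercase status strings are assumptions made for illustration, and the values actually exposed by CSS may differ.

```python
# Hypothetical status names; the values actually exposed by CSS may differ.
BLOCKED_STATUSES = {"frozen", "creating", "creation failed"}


def can_start_diagnostics(cluster_status: str, task_in_progress: bool) -> bool:
    """Return True if a manual diagnostic task may be started.

    Mirrors two constraints from the list above:
    - diagnostics cannot be enabled while the cluster is frozen,
      being created, or failed to be created;
    - a new task cannot be started while another one is in progress.
    """
    if cluster_status.strip().lower() in BLOCKED_STATUSES:
        return False
    return not task_in_progress


if __name__ == "__main__":
    print(can_start_diagnostics("available", task_in_progress=False))
    print(can_start_diagnostics("Frozen", task_in_progress=False))
```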
Diagnostic Modes
Two diagnostic modes are available: scheduled and manual. Choose one based on your service requirements.
| Dimension | Scheduled Diagnostics | Manual Diagnostics |
|---|---|---|
| Triggering method | Automatically performed daily by schedule | Manually triggered and executed right away |
| When to use | Routine inspection, creating a health baseline, and long-term trend observation | Quick troubleshooting in response to cluster failures or errors, or cluster status verification before and after change operations |
| Typical users | O&M teams (routine maintenance) | R&D or O&M personnel (troubleshooting) |
| Diagnostic type | Full check or cluster unavailability check | Full check or cluster unavailability check |
Diagnostic Items
Table 2 lists the diagnostic items supported by Intelligent O&M.
| No. | Diagnostic Item | Description |
|---|---|---|
| 1 | Data Node Disk Usage Check | Identify data nodes (including cold data nodes) with the highest disk usage in the cluster. Excessive disk usage can affect cluster stability. |
| 2 | Data Node Disk Usage Balance Check | Check whether disk usage is balanced across all regular data nodes and cold data nodes. Unbalanced disk usage may cause performance issues. |
| 3 | Large Index Shard Check | Check for large index shards that may cause cluster load imbalance. |
| 4 | Cluster Node Configuration Check | Check the high availability settings of clusters. |
| 5 | Cluster Health Check | Check the overall health status of all indices in the cluster. |
| 6 | Maximum Per-Node Shard Quantity Check | Check that the number of shards allocated to each node remains within the limit set by cluster.max_shards_per_node, preventing shard allocation failures. |
| 7 | Oversized Shard Check | Check for shards that have reached or exceeded 100 GB in the cluster. Oversized shards can increase query latency and slow down fault recovery. You are advised to keep the shard size under 50 GB. |
| 8 | Hot Index Read/Write Check (Every Minute) | Identify the 10 indices with the most query and write requests within the last minute and display them in descending order. For each of these identified indices, verify that the number of its primary shards is divisible by the number of data nodes or cold data nodes to prevent load imbalance. |
| 9 | Node Disconnection Check | Check the connectivity of cluster nodes. Node disconnection will severely compromise cluster performance and therefore needs to be handled immediately. |
| 10 | Balanced Node Connections Check | Check the number of connections to port 9200 on the client nodes (or data nodes if there are no client nodes), and display nodes by their connection count in descending order. |
| 11 | Client Connection Check (Source IP Address Analysis) | Check the number of connections to port 9200 on the client nodes (or data nodes if there are no client nodes), and display all external source IP addresses by connection count in descending order as well as details about the top 10 connections. |
| 12 | Node CPU Usage Check (Interval: 1 Minute) | Measure the CPU usage of each node twice every minute, and calculate the average for each node to evaluate their load status. |
| 13 | JVM Heap Memory Usage Check (Interval: 1 Minute) | Measure the JVM heap memory usage of each node twice every minute, and calculate the average for each node to evaluate their load status. |
| 14 | Local-Disk Cluster Index Replica Check | For a cluster consisting of nodes that use local disks, check whether all indices have replicas. Otherwise, a single disk failure can lead to permanent data loss. |
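To make the shard-related checks (items 6, 7, and 8) more concrete, here is a minimal sketch of the rules as the table describes them. It is not how CSS implements the checks. The function names, the input record layout (`node`, `store_bytes`), and the reading of "100 GB" as binary gigabytes are all assumptions for illustration.

```python
from collections import Counter

GIB = 1024 ** 3
# Item 7 threshold from the table; treating "GB" as binary is an assumption.
OVERSIZED_BYTES = 100 * GIB


def oversized_shards(shards):
    """Item 7: return shards whose store size has reached or exceeded 100 GB."""
    return [s for s in shards if s["store_bytes"] >= OVERSIZED_BYTES]


def nodes_over_shard_limit(shards, max_shards_per_node):
    """Item 6: return {node: shard_count} for nodes above the per-node limit."""
    counts = Counter(s["node"] for s in shards)
    return {n: c for n, c in counts.items() if c > max_shards_per_node}


def primaries_evenly_divisible(primary_count, node_count):
    """Item 8: True if the primary shard count is divisible by the node count."""
    return node_count > 0 and primary_count % node_count == 0


if __name__ == "__main__":
    # Hypothetical sample records, not real cluster output.
    sample = [
        {"index": "logs", "node": "node-1", "store_bytes": 120 * GIB},
        {"index": "logs", "node": "node-2", "store_bytes": 30 * GIB},
        {"index": "metrics", "node": "node-1", "store_bytes": 5 * GIB},
    ]
    print(oversized_shards(sample))
    print(nodes_over_shard_limit(sample, max_shards_per_node=1))
    print(primaries_evenly_divisible(primary_count=6, node_count=3))
```

The functions take plain dictionaries so that the rules can be tried without a cluster. In practice the records would have to be built from whatever shard listing your cluster provides.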
Configuring Scheduled Diagnostics
Configuring daily scheduled diagnostics enables continuous health risk monitoring for clusters without human intervention.
- Log in to the CSS management console.
- In the navigation pane on the left, choose Clusters > OpenSearch.
- In the cluster list, click the name of the target cluster. The cluster information page is displayed.
- Choose Intelligent O&M > Intelligent Diagnostics.
- On the Intelligent Diagnostics page, click Enable Scheduled Diagnostics. In the displayed dialog box, configure a scheduled diagnostics task.
Table 3 Enabling scheduled diagnostics
| Parameter | Description |
|---|---|
| Diagnostic Type | Select a diagnostic type. Full check: All check items are supported. You can select all or only some of the check items. Cluster unavailability check: Run this check when a cluster is unavailable to quickly diagnose its health status. |
| Diagnostic Item | Select diagnostic items to be included in the scheduled task. When Diagnostic Type is set to Full check, you can customize the diagnostic items. When Diagnostic Type is set to Cluster unavailability check, the cluster unavailability check items are selected by default. |
| Time Zone | Select your current time zone. It determines the start time of the scheduled diagnostic task. |
| Diagnosed | Select the time to start the scheduled diagnostic task every day. |
- Click OK to save the task information.
The scheduled task information is displayed on the Intelligent Diagnostics page, and daily reports generated by the task will be displayed below on this page or on the Historical Reports tab.
Figure 1 Viewing a scheduled diagnostics task
The report will show high-risk items, medium-risk items, normal items, and reminders, as well as the results and optimization suggestions for each diagnostic item. High- and medium-risk items should be prioritized.
- On the Intelligent Diagnostics page, manage the scheduled diagnostic task, including modifying and disabling it.
- Modify the scheduled diagnostic task: Click Modify Settings. In the displayed dialog box, modify the settings and click OK to update the scheduled diagnostic policy.
- Disable the scheduled diagnostic task: Click Disable Scheduled Diagnostics. In the displayed dialog box, click OK to disable scheduled diagnostics. The scheduled task is removed from the Intelligent Diagnostics page.
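Because the Time Zone setting affects when the daily task starts, the same start time can correspond to different absolute instants. The sketch below illustrates this. It is not the CSS scheduler: the function name `next_daily_run` is hypothetical, and the time zone is modeled as a fixed UTC offset, which ignores daylight saving time.

```python
from datetime import datetime, time, timedelta, timezone


def next_daily_run(now_utc: datetime, start: time, utc_offset_hours: float) -> datetime:
    """Return the next daily start time, expressed in UTC.

    now_utc must be timezone-aware. The time zone is modeled as a fixed
    UTC offset (an assumption), so daylight saving time is ignored.
    """
    tz = timezone(timedelta(hours=utc_offset_hours))
    local_now = now_utc.astimezone(tz)
    # Today's start time in the selected time zone.
    candidate = datetime.combine(local_now.date(), start, tzinfo=tz)
    if candidate <= local_now:
        # Today's slot has already passed; use tomorrow's.
        candidate += timedelta(days=1)
    return candidate.astimezone(timezone.utc)


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 20, 0, tzinfo=timezone.utc)
    # The same 02:00 start time under two different time zone settings.
    print(next_daily_run(now, time(2, 0), utc_offset_hours=8))
    print(next_daily_run(now, time(2, 0), utc_offset_hours=0))
```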
Performing Manual Diagnostics
Perform a one-time manual diagnostic for quick troubleshooting or to verify cluster status. This is recommended when your cluster experiences performance issues or errors, or after changes are made to your cluster.
- Log in to the CSS management console.
- In the navigation pane on the left, choose Clusters > OpenSearch.
- In the cluster list, click the name of the target cluster. The cluster information page is displayed.
- Choose Intelligent O&M > Intelligent Diagnostics.
- On the Intelligent Diagnostics page, configure a diagnostic task and click Run Diagnostics to start the task.
Figure 2 Manual diagnostics
Table 4 Manual diagnostics
| Parameter | Description |
|---|---|
| Diagnostic Type | Select a diagnostic type. Full check: All check items are supported. You can select all or only some of the check items. Cluster unavailability check: Run this check when a cluster is unavailable to quickly diagnose its health status. |
| Diagnostic Item | Select diagnostic items. When Diagnostic Type is set to Full check, you can customize the diagnostic items. When Diagnostic Type is set to Cluster unavailability check, the cluster unavailability check items are selected by default. |
- Wait for 1 to 3 minutes. The diagnostic report will be displayed below.
The report will show high-risk items, medium-risk items, normal items, and reminders, as well as the results and optimization suggestions for each diagnostic item. High- and medium-risk items should be prioritized.
- Click Export Report to save the diagnostic report to your local PC for further analysis or archiving.
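When you review an exported report, it can help to sort findings so that high- and medium-risk items come first. The sketch below assumes a hypothetical in-memory representation (a list of dictionaries with `item` and `level` keys). The actual export format is not described here, and placing reminders ahead of normal items is an illustrative choice.

```python
# Risk levels named in the report description. Placing reminders before
# normal items is an assumption made for this illustration.
SEVERITY_ORDER = {"high-risk": 0, "medium-risk": 1, "reminder": 2, "normal": 3}


def prioritize(findings):
    """Return findings sorted so that the riskiest items come first.

    sorted() is stable, so items with the same level keep their
    original relative order.
    """
    return sorted(findings, key=lambda f: SEVERITY_ORDER[f["level"]])


if __name__ == "__main__":
    # Hypothetical findings, not real diagnostic output.
    findings = [
        {"item": "Cluster Health Check", "level": "normal"},
        {"item": "Oversized Shard Check", "level": "medium-risk"},
        {"item": "Node Disconnection Check", "level": "high-risk"},
    ]
    for f in prioritize(findings):
        print(f["level"], "-", f["item"])
```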
Viewing Historical Reports
Historical diagnostic reports help you compare cluster health across different periods, which aids capacity planning and problem diagnosis.
A maximum of 10 historical reports can be retained. (This is the total for both scheduled and manual diagnostic tasks.) When this limit is exceeded, the oldest reports will be deleted automatically. Export and save important reports in a timely manner.
- Log in to the CSS management console.
- In the navigation pane on the left, choose Clusters > OpenSearch.
- In the cluster list, click the name of the target cluster. The cluster information page is displayed.
- Choose Intelligent O&M > Historical Reports.
- Set Created to filter diagnostic reports by time range.
Figure 3 Selecting historical reports
- Click Export Report on the right to save the current historical report to the local PC.
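The retention rule stated at the start of this section (at most 10 reports in total, oldest deleted first) behaves like a fixed-size queue. The following minimal sketch models that behavior with `collections.deque`. The class name `ReportHistory` and the report IDs are hypothetical, and this is not how CSS stores reports.

```python
from collections import deque

MAX_REPORTS = 10  # retention limit stated in this section


class ReportHistory:
    """Keep only the most recent reports, discarding the oldest first."""

    def __init__(self, max_reports: int = MAX_REPORTS):
        self._reports = deque(maxlen=max_reports)

    def add(self, report_id: str):
        """Store a report and return the ID that was evicted, if any."""
        full = len(self._reports) == self._reports.maxlen
        evicted = self._reports[0] if full else None
        self._reports.append(report_id)  # deque drops the oldest when full
        return evicted

    def ids(self):
        """Return retained report IDs, oldest first."""
        return list(self._reports)


if __name__ == "__main__":
    history = ReportHistory()
    for n in range(1, 13):
        evicted = history.add(f"report-{n}")
        if evicted:
            print("evicted:", evicted)
    print(history.ids())
```

The eviction step is the reason to export important reports before they age out.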