ALM-12192 Host Load Exceeds the Threshold
Alarm Description
The system checks the average host load every 30 seconds and compares it with the threshold. This alarm is generated when the average load exceeds the threshold a specified number of consecutive times (10 by default).
How the alarm is cleared depends on Trigger Count. If Trigger Count is 1, the alarm is cleared when the average load is less than or equal to the threshold. If Trigger Count is greater than 1, the alarm is cleared when the average load is less than or equal to 90% of the threshold.
This alarm applies only to MRS 3.3.1 or later.
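The trigger and clear rules described above amount to a simple hysteresis check. The following shell sketch replays a series of load samples to show how the rules interact. It assumes that the number of consecutive checks needed to raise the alarm equals Trigger Count. The function name evaluate_alarm and its argument order are hypothetical and are not part of MRS.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; not the product's implementation.

# evaluate_alarm THRESHOLD TRIGGER_COUNT SAMPLE...
# Replays load samples (one per 30-second check) and prints the final
# alarm state: "raised" or "cleared".
evaluate_alarm() {
  local threshold=$1 trigger_count=$2
  shift 2
  awk -v t="$threshold" -v n="$trigger_count" 'BEGIN {
    # Clear level: the threshold itself when n is 1, else 90% of it.
    clear = (n > 1) ? t * 0.9 : t
    state = "cleared"; streak = 0
    for (i = 1; i < ARGC; i++) {
      load = ARGV[i] + 0
      # Count consecutive checks above the threshold.
      if (load > t) streak++; else streak = 0
      if (state == "cleared" && streak >= n) state = "raised"
      else if (state == "raised" && load <= clear) state = "cleared"
    }
    print state
  }' "$@"
}
```

For example, with a threshold of 3 and a Trigger Count of 2, the samples 3.5 3.5 2.8 leave the alarm raised because 2.8 is still above 90% of the threshold (2.7), whereas 3.5 3.5 2.5 clears it.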
Alarm Attributes
Alarm ID | Alarm Severity | Auto Cleared |
---|---|---|
12192 | Major | Yes |
Alarm Parameters
Type | Parameter | Description |
---|---|---|
Location Information | Source | Specifies the cluster or system for which the alarm was generated. |
Location Information | ServiceName | Specifies the service for which the alarm was generated. |
Location Information | RoleName | Specifies the role for which the alarm was generated. |
Location Information | HostName | Specifies the host for which the alarm was generated. |
Additional Information | Trigger Condition | Specifies the alarm triggering condition. |
Impact on the System
- Latency: Service processes may run slowly, which increases latency.
- Service failure: Service processing may be slow, time out, or fail. As a result, jobs may fail to run.
Possible Causes
The host cannot meet service requirements, so the average load reaches the upper limit. Alternatively, service demand surges during peak hours, and the average load reaches the upper limit within a short period.
Handling Procedure
Check the host CPU load.
1. On FusionInsight Manager, choose O&M > Alarm > Alarms. In the alarm list, expand the alarm details and click the name of the host for which the alarm is generated in the Location area.
2. On the Hosts page, select the host for which the alarm is generated. Click the Chart tab, select Host Status, and check whether the value of Average Host Load per CPU Core is greater than 3.
3. Log in to the host for which the alarm is reported as user omm.
4. Run the top command to check whether the us value of %Cpu(s) is greater than 80.
5. Run the following command, replacing PID with the ID of the process with high CPU usage shown by top, to obtain the process name. Then query the process logs (in the /var/log/Bigdata directory) based on the process name and check whether the service logs contain error information:
   ps -ef | grep PID
6. Run the top command to check whether the value of wa is greater than 10.0.
7. Run the iotop command as the root user to check the processes with high disk read/write usage and determine whether the processes are unnecessary based on service needs.
   - If yes, run the following command to stop the unnecessary processes. (If PID is not displayed, press P to switch TID to PID.)
     kill -9 PID
   - If no, go to 8.
8. Wait 5 minutes and check whether the alarm is cleared.
   - If yes, no further action is required.
   - If no, go to 9.
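The manual checks in this procedure can be approximated from the command line. The sketch below is illustrative only. It assumes that Average Host Load per CPU Core corresponds to the 1-minute load average in /proc/loadavg divided by the core count reported by nproc, and that top prints its summary line in the usual "%Cpu(s): ... us, ... wa" layout with a period as the decimal separator. The helper names (is_greater, load_per_core, cpu_field, check_host) are hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; thresholds 3, 80, and 10.0 come from the steps above.

# is_greater VALUE LIMIT: succeeds when VALUE > LIMIT (floating-point compare).
is_greater() {
  awk -v v="$1" -v l="$2" 'BEGIN { exit !(v + 0 > l + 0) }'
}

# load_per_core LOAD CORES: prints LOAD / CORES with two decimals.
load_per_core() {
  awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# cpu_field NAME: reads a top "%Cpu(s)" line on stdin and prints the
# number that precedes NAME (for example us or wa).
cpu_field() {
  awk -v f="$1" '/^%Cpu/ {
    sub(/^%Cpu\(s\):/, ""); gsub(/,/, " ")
    for (i = 2; i <= NF; i++) if ($i == f) { print $(i - 1); exit }
  }'
}

# check_host: runs the three checks on the local host (Linux with procps top).
check_host() {
  local per_core cpu_line us wa
  per_core=$(load_per_core "$(cut -d ' ' -f 1 /proc/loadavg)" "$(nproc)")
  # The first top sample can be unrepresentative, so use the last of two.
  cpu_line=$(LC_ALL=C top -b -n 2 -d 1 | grep '^%Cpu' | tail -n 1)
  us=$(printf '%s\n' "$cpu_line" | cpu_field us)
  wa=$(printf '%s\n' "$cpu_line" | cpu_field wa)
  is_greater "$per_core" 3 && echo "load per core is $per_core (> 3)"
  is_greater "$us" 80 && echo "us is $us (> 80): inspect high-CPU processes"
  is_greater "$wa" 10.0 && echo "wa is $wa (> 10.0): inspect disk I/O with iotop"
  return 0
}
```

Calling check_host on the affected host prints one line for each limit that is exceeded and prints nothing when all three values are within limits. The value shown on the Manager chart may be computed differently, so treat the output as a cross-check rather than a replacement for steps 2 to 7.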
Collect fault information.
9. On FusionInsight Manager, choose O&M. In the navigation pane on the left, choose Log > Download.
10. Expand the Service drop-down list, select NodeAgent for the target cluster, and click OK.
11. Click the edit icon in the upper right corner, and set Start Date and End Date for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click Download.
12. Contact O&M engineers and provide the collected logs.
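The Start Date and End Date for log collection can be worked out from the alarm generation time. The following shell sketch shows the arithmetic. The function names log_window and format_epoch are hypothetical, and format_epoch assumes GNU date, which accepts the "-d @EPOCH" form.

```shell
#!/usr/bin/env bash
# Illustrative sketch only.

# log_window ALARM_EPOCH [MARGIN_MINUTES]
# Prints "START END" in epoch seconds, MARGIN_MINUTES (default 10)
# before and after the alarm generation time.
log_window() {
  local alarm=$1 margin=${2:-10}
  echo "$((alarm - margin * 60)) $((alarm + margin * 60))"
}

# format_epoch EPOCH: prints the local time for EPOCH (GNU date assumed).
format_epoch() {
  date -d "@$1" '+%Y-%m-%d %H:%M:%S'
}
```

For example, log_window 1700000000 prints 1699999400 1700000600, and passing each value to format_epoch gives the times to enter as Start Date and End Date.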
Alarm Clearance
This alarm is automatically cleared after the fault is rectified.
Related Information
None.