ALM-14039 Slow DataNodes Exist in the Cluster
Alarm Description
The system checks the number of slow operations per second on HDFS DataNode instances every 60 seconds and compares the number with the threshold. This alarm is generated when the number of slow operations per second of an HDFS DataNode instance has exceeded the threshold for three minutes.
This alarm is cleared when the number of slow operations per second of the HDFS DataNode instance is less than or equal to the threshold.
This alarm applies only to MRS 3.5.0 or later.
Alarm Attributes
| Alarm ID | Alarm Severity | Auto Cleared |
|---|---|---|
| 14039 | Major (default threshold: 100) | Yes |
Alarm Parameters
| Type | Parameter | Description |
|---|---|---|
| Location Information | Source | Specifies the cluster for which the alarm was generated. |
| | ServiceName | Specifies the service for which the alarm was generated. |
| | RoleName | Specifies the role for which the alarm was generated. |
| | HostName | Specifies the host for which the alarm was generated. |
| Additional Information | Trigger Condition | Specifies the alarm triggering condition. |
Impact on the System
Slow DataNodes degrade the data read and write performance of HDFS.
Possible Causes
- The disk I/O rate of the HDFS DataNode instance is low, so the DataNode's processing capability has become a bottleneck.
- The network transmission rate between HDFS DataNode instances is low.
Handling Procedure
Check whether the disk I/O rate of the DataNode instance is low.
1. Log in to FusionInsight Manager and choose O&M > Alarm > Alarms. In the Location field of the alarm details, view the host name of the DataNode instance for which this alarm is generated.
2. Choose Cluster > Services > HDFS, click the Instances tab, and click the DataNode instance corresponding to the host name obtained in 1.
3. Click the Chart tab and select Performance from the Chart Category area. Check whether any data in the Slow Flush or Sync Occurrences Per Second, Slow SyncWriterOsCache Occurrences Per Second, or Slow WriteDataToDisk Occurrences Per Second chart is high.
4. On FusionInsight Manager, choose O&M > Alarm > Alarms and check whether ALM-12033 Slow Disk Fault exists.
5. Obtain information about the disk where slow operations occur.
   - Log in to the DataNode using the IP address obtained in 1 as user omm and run the following commands to view the run log:
     cd /var/log/Bigdata/hdfs/dn/
     vim hadoop-omm-datanode-Hostname.log
   - Search for the keyword slow in the log to identify the disk where slow operations occur (a command-line sketch is provided after this list).
6. Rectify the fault based on the obtained disk information by following the handling procedure of ALM-12033 Slow Disk Fault.
7. Wait 5 minutes and check whether the alarm is cleared.
   - If yes, no further action is required.
   - If no, go to 8.
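The following is a minimal command-line sketch for 5, assuming the run log path shown above. The Hostname placeholder must be replaced with the actual host name, the grep filter is illustrative, and iostat requires the sysstat package.

# Sketch only: list recent slow-operation entries in the DataNode run log.
cd /var/log/Bigdata/hdfs/dn/
grep -i "slow" hadoop-omm-datanode-Hostname.log | tail -n 20
# Sketch only: check per-disk I/O utilization and wait times on the DataNode host.
# Consistently high %util or await values point to a disk bottleneck.
iostat -x 5 3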
Check whether the network transmission rate between HDFS DataNode instances is low.
8. On FusionInsight Manager, choose Cluster > Services > HDFS, click the Chart tab, select Performance in the Chart Category area, and check whether any data in the Slow Write Packet To DownStream Count Per Second or Slow Ack To Upstream Count Per Second chart is high.
9. Log in to the DataNode using the IP address obtained in 1 as user omm and run the following commands to view the run log:
   cd /var/log/Bigdata/hdfs/dn/
   vim hadoop-omm-datanode-Hostname.log
10. Search for the keyword slow in the log to identify the upstream and downstream nodes where slow operations occur (a command-line sketch is provided after this list).
11. Check whether the network communication between the current node and the nodes obtained in 10 is normal.
    - If yes, go to 13.
    - If no, contact the network administrator to repair the network.
12. Wait 5 minutes and check whether the alarm is cleared.
    - If yes, no further action is required.
    - If no, go to 13.
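A similar sketch for 10 and 11 follows, under the same assumptions; the filter terms and the peer-datanode-host placeholder are illustrative and must be replaced with values taken from the actual log entries.

# Sketch only: narrow the slow-operation entries to those that mention peer DataNodes.
cd /var/log/Bigdata/hdfs/dn/
grep -i "slow" hadoop-omm-datanode-Hostname.log | grep -iE "mirror|upstream|downstream" | tail -n 20
# Sketch only: check reachability and round-trip latency to a peer DataNode found in the log.
ping -c 5 peer-datanode-host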
Collect fault information.
13. On FusionInsight Manager, choose O&M. In the navigation pane on the left, choose Log > Download.
14. Expand the Service drop-down list and select HDFS for the target cluster.
15. Click the edit icon in the upper right corner, and set Start Date and End Date for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click Download.
16. Contact O&M engineers and provide the collected logs (a manual log-packaging sketch is provided after this list).
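If the raw DataNode run logs from the affected host are also required, the following sketch packages them for handover; the archive name and output directory are illustrative.

# Sketch only: package the DataNode run logs for O&M engineers.
tar -czf /tmp/datanode-logs-$(hostname)-$(date +%Y%m%d%H%M).tar.gz /var/log/Bigdata/hdfs/dn/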
Alarm Clearance
This alarm is automatically cleared after the fault is rectified.
Related Information
None.